Topic

Vision-Language Models

Vision-language models that connect text and perception.

3 papers · latest 2026-04-23

Most active fields for this topic

Robotics · 2, Multimodal · 1

PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

Yupeng Zheng, Xiang Li, Songen Gu et al.

significant · 🔴 Advanced · Robotics · Embodied Agents · Vision-Language Models

cs.RO

Presents a lightweight VLA model with world knowledge integration for efficient robot manipulation, enhancing spatial reasoning and task execution in compact robotic systems.

Do VLMs Truly "Read" Candlesticks? A Multi-Scale Benchmark for Visual Stock Price Forecasting

Kaiqi Hu, Linda Xiao, Shiyue Xu et al.

breakthrough · 🟡 Intermediate · Multimodal · Vision-Language Models

cs.LG · cs.CL

Introduces the first rigorous multi-scale benchmark for testing whether VLMs truly understand candlestick patterns rather than merely correlating with them, which matters for financial AI builders who rely on visual interpretation of market signals.

E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

Jiajun Zhai, Hao Shi, Shangwei Guo et al.

breakthrough · 🔴 Advanced · Robotics · Embodied Agents · Vision-Language Models

cs.CV · cs.MM · cs.RO

E-VLA augments a vision-language-action model with event cameras, sensors already used in robotics, so that robots can see and act in near-total darkness or under motion blur, where conventional cameras fail. This lets real-world robotic systems operate reliably in challenging environments such as smoke-filled rooms or fast-moving scenes.
