Topic
Multimodal Understanding
Cross-modal understanding across text, image, video, and audio.
1 paper · latest 2026-04-10
Most active fields for this topic
Shilin Yan, Jintao Tong, Hongwei Xue et al.
cs.CV · cs.AI
Act Wisely separates task accuracy from tool-efficiency rewards so multimodal agents learn when not to call tools, cutting unnecessary invocations by orders of magnitude while improving accuracy, latency, and cost.