Field
Computer Vision
Image, video, and 3D perception plus visual generation.
14 papers · latest 2026-04-23
Yuxuan Xue, Ruofan Liang, Egor Zakharov et al.
Presents GeoRelight, a unified framework for joint geometrical relighting and 3D reconstruction using diffusion transformers, improving physical consistency and reducing error accumulation in single-image relighting.
Chaojie Mao, Chen-Wei Xie, Chongyang Zhong et al.
Wan-Image shifts image generation from aesthetic synthesis toward professional-grade control, offering precise typography, identity preservation, and workflow integration for designers and product builders who need pixel-accurate outputs.
Tanjim Rahaman Fardin, S M Zunaid Alam, Mahadi Hasan Fahim et al.
MetaCloak-JPEG delivers JPEG-robust adversarial perturbations that block unauthorized DreamBooth deepfakes even after compression—essential for real-world privacy protection where images are routinely shared in degraded formats.
Geunyoung Jung, Soohong Kim, Inseok Kong et al.
APC introduces a lightweight, transferable counterattack module that boosts 3D point cloud robustness without sacrificing accuracy—critical for real-time systems facing adversarial inputs in robotics or autonomous driving.
Zonghai Yao, Zhipeng Tang, Chengtao Lin et al.
Reframes patient education as dynamic multi-modal interaction, not static QA. Enables systems to guide users through images and respond to distress—critical for real-world medical AI interfaces.
Haoyu Zheng, Tianwei Lin, Wei Wang et al.
IAD-Unify combines defect segmentation, explanation, and generation in one model, enabling end-to-end industrial inspection with real-time interpretability for AI-driven manufacturing quality control.
Rong Wang, Ruyi Zha, Ziang Cheng et al.
Uses 3D foundation priors to generate geometrically consistent orbital videos from single images, addressing long-range view synthesis for AR/VR and robotics perception systems.
AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
Ziwei Zhou, Zeyuan Lai, Rui Wang et al.
AVGen-Bench finds that today's flashy text-to-audio-video systems are still semantically unreliable, especially for speech, text rendering, physical reasoning, and musical pitch control.
InSpatio Team, Donghui Shen, Guofeng Zhang et al.
A real-time 4D world simulator from a single video that emphasizes spatial consistency and controllable interaction, pointing toward more usable interactive environments for embodied training and evaluation.
Hiba Dahmani, Nathan Piasco, Moussab Bennehar et al.
SEM-ROVER generates scalable, geometrically coherent 3D driving scenes via semantic voxel-guided diffusion, supporting realistic, large-scale simulation for autonomous driving systems without view limitations.
Zirui Li, Xinghao Chen, Lingyu Jiang et al.
PVIR introduces the first physics-aware benchmark for video object removal, forcing models to preserve physical consistency like shadows and reflections—critical for realistic video editing in production systems.
Haoxuan Han, Weijie Wang, Zeyu Zhang et al.
DDP shows that deliberately blurring images can make AI answer visual questions more accurately by forcing it to focus on core structures instead of distracting details. This runs against conventional wisdom: less visual detail can mean better performance, and the method is easy to plug into existing VQA systems.
Hyunsoo Cha, Wonjung Woo, Byungjun Kim et al.
Vanast eliminates the need for separate try-on and animation steps by doing both in one go, reducing distortions and identity drift. This means you can generate realistic, coherent videos of people wearing new clothes from just one image—useful for e-commerce and virtual fashion without complex pipelines.
Ahan Shabanov, Peter Hedman, Ethan Weber et al.
This paper changes how 3D scenes are built by removing the need for a rigid grid structure, allowing for more efficient and detailed models from just a few photos. It solves the problem of missing data in unobserved areas by generating plausible details rather than leaving gaps. Practitioners can use this to create lighter, faster 3D assets for games or VR without needing extensive camera rigs.