Field

Computer Vision

Image, video, and 3D perception plus visual generation.

14 papers · latest 2026-04-23

Common topics in this field

3D Vision (7) · Diffusion Models (5) · Video Generation (4)

GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers

Yuxuan Xue, Ruofan Liang, Egor Zakharov et al.

significant · 🔴 Advanced · Computer Vision · Diffusion Models · 3D Vision

cs.CV

Presents GeoRelight, a unified framework for joint geometrical relighting and 3D reconstruction using diffusion transformers, improving physical consistency and reducing error accumulation in single-image relighting.

Wan-Image: Pushing the Boundaries of Generative Visual Intelligence

Chaojie Mao, Chen-Wei Xie, Chongyang Zhong et al.

breakthrough · 🔴 Advanced · Computer Vision · Diffusion Models

cs.CV

Wan-Image transforms image generation from aesthetic synthesis to professional-grade control, enabling precise typography, identity preservation, and workflow integration—essential for designers and product builders needing pixel-perfect outputs.

MetaCloak-JPEG: JPEG-Robust Adversarial Perturbation for Preventing Unauthorized DreamBooth-Based Deepfake Generation

Tanjim Rahaman Fardin, S M Zunaid Alam, Mahadi Hasan Fahim et al.

breakthrough · 🔴 Advanced · Computer Vision · Diffusion Models

cs.CV

MetaCloak-JPEG delivers JPEG-robust adversarial perturbations that block unauthorized DreamBooth deepfakes even after compression—essential for real-world privacy protection where images are routinely shared in degraded formats.

APC: Transferable and Efficient Adversarial Point Counterattack for Robust 3D Point Cloud Recognition

Geunyoung Jung, Soohong Kim, Inseok Kong et al.

significant · 🔴 Advanced · Computer Vision · 3D Vision

cs.CV

APC introduces a lightweight, transferable counterattack module that boosts 3D point cloud robustness without sacrificing accuracy—critical for real-time systems facing adversarial inputs in robotics or autonomous driving.

Rethinking Patient Education as Multi-turn Multi-modal Interaction

Zonghai Yao, Zhipeng Tang, Chengtao Lin et al.

breakthrough · 🔴 Advanced · Computer Vision · 3D Vision

cs.AI · cs.CL · cs.CV

Reframes patient education as dynamic multi-modal interaction, not static QA. Enables systems to guide users through images and respond to distress—critical for real-world medical AI interfaces.

IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation

Haoyu Zheng, Tianwei Lin, Wei Wang et al.

breakthrough · 🔴 Advanced · Computer Vision · 3D Vision

cs.CV · cs.AI

IAD-Unify combines defect segmentation, explanation, and generation in one model, enabling end-to-end industrial inspection with real-time interpretability for AI-driven manufacturing quality control.

Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

Rong Wang, Ruyi Zha, Ziang Cheng et al.

breakthrough · 🔴 Advanced · Computer Vision · Video Generation · 3D Vision

cs.CV

Uses 3D foundation priors to generate geometrically consistent orbital videos from single images, addressing long-range view synthesis with applications in AR/VR and robotics perception systems.

AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

Ziwei Zhou, Zeyuan Lai, Rui Wang et al.

significant · 🟡 Intermediate · Computer Vision · Video Generation

cs.CV · cs.AI · cs.CL

AVGen-Bench finds that today's flashy text-to-audio-video systems are still semantically unreliable, especially for speech, text rendering, physical reasoning, and musical pitch control.

INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

InSpatio Team, Donghui Shen, Guofeng Zhang et al.

significant · 🔴 Advanced · Computer Vision · Video Generation

cs.CV

A real-time 4D world simulator from a single video that emphasizes spatial consistency and controllable interaction, pointing toward more usable interactive environments for embodied training and evaluation.

SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

Hiba Dahmani, Nathan Piasco, Moussab Bennehar et al.

breakthrough · 🔴 Advanced · Computer Vision · Diffusion Models

cs.CV

SEM-ROVER generates scalable, geometrically coherent 3D driving scenes via semantic voxel-guided diffusion, supporting realistic, large-scale simulation for autonomous driving systems without view limitations.

Physics-Aware Video Instance Removal Benchmark

Zirui Li, Xinghao Chen, Lingyu Jiang et al.

breakthrough · 🔴 Advanced · Computer Vision · Video Generation

cs.CV

PVIR introduces the first physics-aware benchmark for video object removal, testing whether models maintain physical consistency in effects such as shadows and reflections, which is critical for realistic video editing in production systems.

Less Detail, Better Answers: Degradation-Driven Prompting for VQA

Haoxuan Han, Weijie Wang, Zeyu Zhang et al.

breakthrough · 🟡 Intermediate · Computer Vision · 3D Vision

cs.CV

DDP shows that deliberately blurring images can make AI answer visual questions more accurately by forcing it to focus on core structures instead of distracting details. This flips conventional wisdom: less detail can mean better answers, and the method is easy to plug into existing VQA systems.

Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

Hyunsoo Cha, Wonjung Woo, Byungjun Kim et al.

significant · 🔴 Advanced · Computer Vision · Diffusion Models

cs.CV

Vanast eliminates the need for separate try-on and animation steps by doing both in one go, reducing distortions and identity drift. This means you can generate realistic, coherent videos of people wearing new clothes from just one image—useful for e-commerce and virtual fashion without complex pipelines.

Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction

Ahan Shabanov, Peter Hedman, Ethan Weber et al.

significant · 🔴 Advanced · Computer Vision · 3D Vision

cs.CV

This paper changes how 3D scenes are built by removing the need for a rigid grid structure, allowing for more efficient and detailed models from just a few photos. It solves the problem of missing data in unobserved areas by generating plausible details rather than leaving gaps. Practitioners can use this to create lighter, faster 3D assets for games or VR without needing extensive camera rigs.
