GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers

Yuxuan Xue, Ruofan Liang, Egor Zakharov, Timur Bagautdinov, Chen Cao, Giljoo Nam, Shunsuke Saito, Gerard Pons-Moll, Javier Romero

Recommendation Score

significant🔴 AdvancedComputer Vision Diffusion Models 3D VisionMethodBest for builders

Research context

Primary field

Computer Vision

Image, video, and 3D perception plus visual generation.

Topics

Diffusion Models, 3D Vision

Paper type

Method

Best for

Best for builders

arXiv categories

cs.CVcs.CV

Why It Matters

Presents GeoRelight, a unified framework for joint geometrical relighting and 3D reconstruction using diffusion transformers, improving physical consistency and reducing error accumulation in single-image relighting.

Abstract

Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequential pipelines that suffer from error accumulation, or they do not explicitly leverage 3D geometry during relighting, which limits physical consistency. Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: GeoRelight. We make this possible through two key technical contributions: isotropic NDC-Orthographic Depth (iNOD), a distortion-free 3D representation compatible with latent diffusion models; and a strategic mixed-data training method that combines synthetic and auto-labeled real data. By solving geometry and relighting jointly, GeoRelight achieves better performance than both sequential models and previous systems that ignored geometry.

More in Computer Vision → More on Diffusion Models →

View on arXiv → Download PDF →

Published April 22, 2026