Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is to combine the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages—Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation—EvoScene alternates between the 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.
Figure 2. Our self-evolving framework consists of three coupled stages that form a virtuous cycle: (A) Prior Initialization extracts depth from 2D observations and back-projects to 3D point clouds, providing geometric constraints. (B) Visual-guided 3D Scene Mesh Generation uses a 3D diffusion model with test-time rendering optimization to complete the mesh guided by point cloud priors. (C) Spatial-guided Novel View Generation renders depth from the mesh to guide depth-conditioned video diffusion, synthesizing photorealistic multi-view images. These new views are converted back to point clouds (via depth re-estimation and multi-view filtering) and fed into the next iteration. This cycle progressively refines geometry and appearance to produce a complete 3D scene from a single input image.
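The back-projection step in stage (A) — lifting an estimated depth map into a 3D point cloud — can be sketched as follows. This is a generic pinhole-camera illustration, not the authors' released implementation; the intrinsics `fx, fy, cx, cy` and the zero-depth validity mask are assumptions for the sketch.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into an (N, 3) camera-space point cloud.

    Assumes a pinhole camera with intrinsics (fx, fy, cx, cy).
    Pixels with non-positive depth are treated as invalid and dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid, shape (H, W)
    z = depth
    x = (u - cx) * z / fx  # invert the pinhole projection: u = fx * x / z + cx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # keep only valid-depth points

# Toy example: a 2x2 depth map with one invalid (zero) pixel.
depth = np.array([[1.0, 2.0],
                  [0.0, 4.0]])
pts = backproject_depth(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In the full pipeline, these camera-space points would additionally be transformed into a shared world frame using the (estimated or known) camera pose, and points from the synthesized novel views would be merged after multi-view filtering, as described in stage (C).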
Qualitative examples from EvoScene and baselines on 3D scene asset generation from single images. The comparisons show that EvoScene produces more coherent and fine-grained 3D scenes than the baselines across diverse scene styles.