Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is to combine the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages—Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation—EvoScene alternates between the 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.
Figure 2. Our self-evolving framework consists of three coupled stages that form a virtuous cycle: (A) Prior Initialization extracts depth from 2D observations and back-projects to 3D point clouds, providing geometric constraints. (B) Visual-guided 3D Scene Mesh Generation uses a 3D diffusion model with test-time rendering optimization to complete the mesh guided by point cloud priors. (C) Spatial-guided Novel View Generation renders depth from the mesh to guide depth-conditioned video diffusion, synthesizing photorealistic multi-view images. These new views are converted back to point clouds (via depth re-estimation and multi-view filtering) and fed into the next iteration. This cycle progressively refines geometry and appearance to produce a complete 3D scene from a single input image.
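The back-projection step in stage (A) — lifting an estimated depth map into a 3D point cloud — can be sketched as follows. This is a generic pinhole-camera illustration, not the authors' released implementation; the intrinsics `fx, fy, cx, cy` and the zero-depth validity mask are assumptions for the sketch.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into an (N, 3) camera-space point cloud.

    Assumes a pinhole camera with intrinsics (fx, fy, cx, cy).
    Pixels with non-positive depth are treated as invalid and dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid, shape (H, W)
    z = depth
    x = (u - cx) * z / fx  # invert the pinhole projection: u = fx * x / z + cx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # keep only valid-depth points

# Toy example: a 2x2 depth map with one invalid (zero) pixel.
depth = np.array([[1.0, 2.0],
                  [0.0, 4.0]])
pts = backproject_depth(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In the full pipeline, these camera-space points would additionally be transformed into a shared world frame using the (estimated or known) camera pose, and points from the synthesized novel views would be merged after multi-view filtering, as described in stage (C).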
Qualitative examples from EvoScene and baselines on 3D scene asset generation from single images. The comparisons show that EvoScene produces more coherent and fine-grained 3D scenes than the baselines across diverse scene styles.