Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane

SIGGRAPH Asia 2024 (Conference Track)

1Shanghai Jiao Tong University 2Tencent XR Vision Labs
3The University of Tokyo 4Australian National University
*: Work done during an internship at Tencent XR Vision Labs. †: Corresponding author.

Frankenstein generates semantic-compositional 3D scenes in a single forward pass.

Abstract

We present Frankenstein, a diffusion-based framework that generates semantic-compositional 3D scenes in a single pass. Unlike existing methods that output a single, unified 3D shape, Frankenstein simultaneously generates multiple separated shapes, each corresponding to a semantically meaningful part. The 3D scene information is encoded in a single tri-plane tensor, from which multiple signed distance function (SDF) fields can be decoded to represent the compositional shapes. During training, an auto-encoder compresses the tri-planes into a latent space, and a denoising diffusion process is then trained to approximate the distribution of the compositional scenes. Frankenstein demonstrates promising results in generating room interiors as well as human avatars with automatically separated parts. The generated scenes facilitate many downstream applications, such as part-wise re-texturing, object rearrangement within a room, or avatar clothing re-targeting.
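
To make the shared-representation idea concrete, here is a minimal sketch in PyTorch of how several part-wise SDFs can be decoded from one tri-plane; the feature dimensions, head widths, and part count are illustrative assumptions, not the authors' released code. Each query point is projected onto the three axis-aligned planes, plane features are bilinearly sampled and summed, and one small MLP head per semantic part maps the shared feature to that part's signed distance.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TriPlaneSDF(nn.Module):
        """Decode several part-wise SDFs from one shared tri-plane (sketch)."""

        def __init__(self, feat_dim=32, res=128, num_parts=8):  # illustrative sizes
            super().__init__()
            # Three axis-aligned feature planes: XY, XZ, YZ.
            self.planes = nn.Parameter(torch.randn(3, feat_dim, res, res) * 0.01)
            # One lightweight SDF head per semantic part.
            self.heads = nn.ModuleList([
                nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
                for _ in range(num_parts)])

        def _sample(self, plane, uv):
            # uv in [-1, 1]^2, shape (N, 2) -> bilinear features, shape (N, feat_dim).
            feat = F.grid_sample(plane.unsqueeze(0), uv.view(1, -1, 1, 2),
                                 align_corners=True)
            return feat.view(plane.shape[0], -1).t()

        def forward(self, xyz):
            # xyz in [-1, 1]^3, shape (N, 3) -> per-part SDFs, shape (N, num_parts).
            f = (self._sample(self.planes[0], xyz[:, [0, 1]])
                 + self._sample(self.planes[1], xyz[:, [0, 2]])
                 + self._sample(self.planes[2], xyz[:, [1, 2]]))
            return torch.cat([head(f) for head in self.heads], dim=-1)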

Video

Pipeline


1. Tri-plane fitting: each training scene is converted into a tri-plane.
2. VAE training: the tri-planes are compressed into latent tri-planes by an auto-encoder.
3. Conditional denoising: the distribution of latent tri-planes is approximated by a diffusion model conditioned on layout maps (see the training-step sketch below).
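
The third stage can be condensed into a single DDPM-style training step, sketched below. The VAE encoder, denoiser, linear noise schedule, and channel-wise concatenation of the layout map are assumed, plausible choices for the sketch rather than a description of the exact architecture.

    import torch
    import torch.nn.functional as F

    def diffusion_train_step(vae_encoder, denoiser, triplane, layout_map,
                             num_steps=1000):
        """One epsilon-prediction DDPM step on a latent tri-plane (schematic)."""
        with torch.no_grad():
            z0 = vae_encoder(triplane)               # latent tri-plane, (B, C, H, W)
        b = z0.shape[0]
        t = torch.randint(0, num_steps, (b,), device=z0.device)
        betas = torch.linspace(1e-4, 0.02, num_steps, device=z0.device)
        a_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1, 1)
        eps = torch.randn_like(z0)
        z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps     # forward noising
        # Condition on the layout map by channel-wise concatenation.
        eps_pred = denoiser(torch.cat([z_t, layout_map], dim=1), t)
        return F.mse_loss(eps_pred, eps)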

Room Generation

Semantic-compositional scenes provide semantic priors that are compatible with off-the-shelf, object-targeted texturing models. Objects can also be rearranged to customize the room layout and appearance. Finally, retrieval and refinement can be applied as a post-processing stage to further improve the quality of the generated 3D models.
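
These applications come naturally because each part can be meshed on its own. Below is a minimal sketch that evaluates every part's SDF on a dense grid and runs marching cubes per part; it assumes the TriPlaneSDF interface from the earlier sketch and scikit-image, not the released pipeline.

    import torch
    from skimage.measure import marching_cubes

    @torch.no_grad()
    def extract_part_meshes(model, res=128):
        """Mesh each semantic part separately so it can be textured or moved alone."""
        axis = torch.linspace(-1.0, 1.0, res)
        pts = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
        sdf = model(pts.reshape(-1, 3)).reshape(res, res, res, -1).cpu().numpy()
        meshes = []
        for k in range(sdf.shape[-1]):
            field = sdf[..., k]
            if field.min() < 0.0 < field.max():   # zero crossing -> part is present
                verts, faces, _, _ = marching_cubes(field, level=0.0)
                meshes.append((verts / (res - 1) * 2.0 - 1.0, faces))  # to [-1, 1]^3
            else:
                meshes.append(None)               # part absent from this scene
        return meshes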


Generalization

Our model adheres to the conditioning layout while maintaining its generative capacity when the layout configuration is altered.


Avatar Generation

The generated compositional avatar facilitates numerous downstream applications, including component-wise texture generation, random cloth swapping, cloth re-targeting, and automatic rigging and animation.
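
As a toy illustration, a clothing swap can be expressed at the mesh level by exchanging the garment part between two generated avatars, reusing the hypothetical extract_part_meshes helper from above. The fixed part index is an illustrative convention, not something the paper specifies.

    CLOTH_ID = 2  # hypothetical index of the garment part in the head ordering

    def swap_clothes(avatar_a, avatar_b, part_id=CLOTH_ID):
        """Return avatar A's part meshes with avatar B's garment swapped in."""
        parts_a = extract_part_meshes(avatar_a)
        parts_b = extract_part_meshes(avatar_b)
        parts_a[part_id] = parts_b[part_id]   # exchange only the garment mesh
        return parts_a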


Gallery


BibTeX


      @inproceedings{yan2024frankenstein,
        author    = {Yan, Han and Li, Yang and Wu, Zhennan and Chen, Shenzhou and Sun, Weixuan and Shang, Taizhang and Liu, Weizhe and Chen, Tian and Dai, Xiaqiang and Ma, Chao and Li, Hongdong and Ji, Pan},
        title     = {Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane},
        booktitle = {ACM SIGGRAPH Asia 2024 Conference Papers},
        year      = {2024},
      }