Multimodal Triplane Diffusion Transformer for High-Fidelity 3D Shape Generation
Shuang WU, Youtian LIN, Yifei ZENG, Feihu ZHANG, Hao ZHU, Xun CAO, Yao YAO
Recent advancements in native 3D generation have demonstrated remarkable capabilities in producing high-quality 3D assets from image or text prompts. However, these methods face a critical challenge: insufficient alignment between generated meshes and input conditions. In this paper, we propose the Multimodal Triplane Diffusion Transformer to address this issue, featuring two core components: (i) a triplane-based 3D variational autoencoder that compresses point clouds, sampled uniformly from mesh surfaces and concentrated near sharp edges, into a triplane latent space; and (ii) a diffusion model trained on this latent space, empowered by multimodal diffusion transformer blocks that establish cross-modality interactions between latent representations and conditions. Extensive experiments demonstrate that, on the image-to-3D task, our method achieves not only superior generalization but also significantly better geometric alignment with input images than state-of-the-art approaches.
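The multimodal diffusion transformer blocks mentioned above let latent tokens and condition tokens exchange information through joint attention over their concatenation. The following is a minimal, hypothetical sketch of that interaction pattern in pure Python; the function name `joint_attention` and the identity (projection-free, single-head) attention are illustrative assumptions, not the paper's actual implementation, which would use learned per-modality QKV projections and multi-head attention.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def joint_attention(latent, cond):
    """Toy MM-DiT-style joint attention (hypothetical sketch).

    `latent` and `cond` are lists of token vectors (lists of floats) of
    equal dimension. Both streams attend over their concatenation in a
    single attention pass, then are split back per modality. Learned
    projections and multiple heads are omitted for brevity.
    """
    tokens = latent + cond                       # concatenate token streams
    d = len(tokens[0])
    out = []
    for q in tokens:
        # scaled dot-product scores of this query against all tokens
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        # attention output: convex combination of all token vectors
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    n = len(latent)
    return out[:n], out[n:]                      # split back per modality
```

In this scheme, condition tokens (e.g. image features) influence latent triplane tokens and vice versa within every block, rather than through one-way cross-attention.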
Native 3D Generation / Diffusion Model / Multimodal Diffusion Transformer
Higher Education Press 2026