Multimodal Triplane Diffusion Transformer for High Fidelity 3D Shape Generation

Shuang WU, Youtian LIN, Yifei ZENG, Feihu ZHANG, Hao ZHU, Xun CAO, Yao YAO

RESEARCH ARTICLE
Front. Comput. Sci. DOI: 10.1007/s11704-026-50361-3

Abstract

Recent advances in native 3D generation have demonstrated remarkable capability in producing high-quality 3D assets from image or text prompts. However, these methods face a critical challenge: insufficient alignment between the generated meshes and the input conditions. In this paper, we propose the Multimodal Triplane Diffusion Transformer to address this issue, featuring two core components: a triplane-based 3D variational autoencoder that compresses point clouds, sampled uniformly from mesh surfaces and concentrated near sharp edges, into a triplane latent space; and a diffusion model trained on this latent space, whose multimodal diffusion transformer blocks establish cross-modality interactions between the latent representations and the conditions. Extensive experiments on the image-to-3D task demonstrate that our method achieves not only superior generalization but also significantly better geometric alignment with input images than state-of-the-art approaches.
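To make the second component concrete, the sketch below shows one plausible multimodal diffusion transformer block in PyTorch, assuming an SD3-style MM-DiT design in which the flattened triplane latent tokens and the condition tokens keep separate projection weights but attend jointly. Every module name, shape, and hyperparameter here is an illustrative assumption, not the paper's implementation.

```python
# A minimal sketch of one multimodal diffusion transformer (MM-DiT) block,
# assuming an SD3-style design: two token streams with modality-specific
# weights, joined in a single shared attention pass. All names, shapes, and
# hyperparameters are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MMDiTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        # Modality-specific norms/projections; the attention itself is shared.
        self.norm_x, self.norm_c = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv_x, self.qkv_c = nn.Linear(dim, 3 * dim), nn.Linear(dim, 3 * dim)
        self.out_x, self.out_c = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.mlp_x = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.mlp_c = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, c: torch.Tensor):
        # x: (B, Nx, D) flattened triplane latent tokens
        # c: (B, Nc, D) condition tokens (e.g., image or text embeddings)
        B, Nx, D = x.shape
        h = D // self.num_heads

        # Project each stream with its own weights, then concatenate so the
        # two modalities attend to each other in one joint attention pass.
        qkv = torch.cat([self.qkv_x(self.norm_x(x)), self.qkv_c(self.norm_c(c))], dim=1)
        q, k, v = (t.view(B, -1, self.num_heads, h).transpose(1, 2) for t in qkv.chunk(3, dim=-1))
        attn = F.scaled_dot_product_attention(q, k, v)  # (B, heads, Nx+Nc, h)
        attn = attn.transpose(1, 2).reshape(B, Nx + c.shape[1], D)

        # Split the joint output back into the two streams, then apply the
        # per-modality output projections and feed-forward networks.
        x = x + self.out_x(attn[:, :Nx])
        c = c + self.out_c(attn[:, Nx:])
        return x + self.mlp_x(x), c + self.mlp_c(c)


# Usage with made-up sizes: three 16x16 latent planes, 77 condition tokens.
block = MMDiTBlock(dim=512)
x = torch.randn(2, 3 * 16 * 16, 512)
c = torch.randn(2, 77, 512)
x, c = block(x, c)
```

Joint attention over the concatenated token sequence is one natural reading of the abstract's "cross-modality interactions": latent tokens can attend to condition tokens and vice versa within every block, rather than through a one-way cross-attention layer.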

Keywords

Native 3D Generation / Diffusion Model / Multimodal Diffusion Transformer

Cite this article

Shuang WU, Youtian LIN, Yifei ZENG, Feihu ZHANG, Hao ZHU, Xun CAO, Yao YAO. Multimodal Triplane Diffusion Transformer for High Fidelity 3D Shape Generation. Front. Comput. Sci. DOI: 10.1007/s11704-026-50361-3


RIGHTS & PERMISSIONS

© Higher Education Press 2026
