HexaDream: hexaview prior and constraint for text to 3D creation

Zhi-Chao ZHANG , Hui CHEN , Jin-Sheng DENG , Ming XU , Zheng-Bin PANG

Front. Comput. Sci. ›› 2026, Vol. 20 ›› Issue (2) : 2002311
DOI: 10.1007/s11704-025-40774-x
Artificial Intelligence
RESEARCH ARTICLE


Abstract

The burgeoning field of text-to-3D synthesis offers transformative potential in diverse domains such as computer-aided design, gaming, virtual reality, and artistic creation. However, generated content often suffers from inconsistency and low resolution, primarily because critical visual cues such as views and attributes are missing from the text prompt. Furthermore, randomly constrained rendering viewpoints may impair model inference, leading to the Janus problem. In response to these challenges, we introduce HexaDream to produce high-quality 3D content. Its Hexaview Generation Diffusion Model merges object types, attributes, and view-specific text into a unified latent space. In addition, a feature aggregation attention mechanism significantly enhances the detail and consistency of the generated output by mapping point features from the orthogonal views into the 3D domain. Another innovation is the Dynamic-weighted HexaConstraint: this module employs a projection matrix to generate projected views and computes a differential loss between these projections and the hexaviews, ensuring high fidelity. Comparative experiments show that HexaDream improves on existing methods by 8% in CLIP-R, 12% in Keypart Fidelity, and especially 20.6% in Multihead Alleviation.
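To make the Dynamic-weighted HexaConstraint concrete, the following is a minimal PyTorch sketch of the idea described above: render the scene through six orthogonal-view projection matrices, compare each projection with its hexaview prior, and combine the per-view losses with dynamic weights. All names here (render, scene_params, the annealing schedule, and the softmax re-weighting) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def hexa_constraint_loss(render, scene_params, proj_mats, hexaview_priors,
                         step, total_steps):
    """Hypothetical differential loss between projected views and hexaview priors.

    render:          differentiable callable(scene_params, proj_mat) -> (3, H, W) image
    proj_mats:       (6, 4, 4) projection matrices for the six orthogonal views
    hexaview_priors: (6, 3, H, W) images from the hexaview diffusion prior
    """
    # Assumed dynamic weight 1: anneal the constraint over training so the
    # 2D prior dominates early and relaxes as the 3D geometry stabilizes.
    anneal = 1.0 - step / total_steps

    per_view = []
    for proj, prior in zip(proj_mats, hexaview_priors):
        projected = render(scene_params, proj)         # projected view via projection matrix
        per_view.append(F.mse_loss(projected, prior))  # per-view differential loss
    per_view = torch.stack(per_view)

    # Assumed dynamic weight 2: softly emphasize views with larger error,
    # e.g., a view where a spurious extra face (Janus artifact) is forming.
    view_w = torch.softmax(per_view.detach(), dim=0)
    return anneal * (view_w * per_view).sum()

A loss of this form would be added to the overall text-to-3D optimization objective at each training step; the exact weighting schedule used by HexaDream is not specified in the abstract.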


Keywords

text-to-3D / Janus problem / HexaConstraint

Cite this article

Zhi-Chao ZHANG, Hui CHEN, Jin-Sheng DENG, Ming XU, Zheng-Bin PANG. HexaDream: hexaview prior and constraint for text to 3D creation. Front. Comput. Sci., 2026, 20(2): 2002311. DOI: 10.1007/s11704-025-40774-x



RIGHTS & PERMISSIONS

© The Author(s) 2025. This article is published with open access at link.springer.com and journal.hep.com.cn.
