Semantics-aware transformer for 3D reconstruction from binocular images

Xin Jia, Shourui Yang, Diyi Guan

Optoelectronics Letters ›› 2022, Vol. 18 ›› Issue (5) : 293-299. DOI: 10.1007/s11801-022-2055-0

Abstract

Existing multi-view three-dimensional (3D) reconstruction methods capture only a single type of feature from each input view, failing to obtain the fine-grained semantics needed to reconstruct complex shapes. They also rarely explore the semantic association between input views, which leads to rough 3D shapes. To address these challenges, we propose a semantics-aware transformer (SATF) for 3D reconstruction. It is composed of two parallel view transformer encoders and a point cloud transformer decoder; it takes two red, green and blue (RGB) images as input and outputs a dense point cloud with rich details. Each view transformer encoder learns a multi-level feature, which helps characterize the fine-grained semantics of its input view. The point cloud transformer decoder derives a semantically-associated feature by aligning the semantics of the two input views, thereby describing the semantic association between them. The decoder then generates a sparse point cloud from the semantically-associated feature and finally enriches it to produce a dense point cloud with richer details. Extensive experiments on the ShapeNet dataset show that our SATF outperforms the state-of-the-art methods.
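For concreteness, the following is a minimal PyTorch sketch of the pipeline described above: two parallel view encoders, cross-attention that aligns the semantics of the two views, sparse point generation, and sparse-to-dense enrichment. All module names, dimensions, patch sizes, and point counts are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ViewEncoder(nn.Module):
    """One of two parallel view transformer encoders: turns an RGB view into a
    sequence of patch tokens refined by self-attention (the paper's multi-level
    feature learning is simplified here to a single token resolution)."""

    def __init__(self, dim=256, depth=4, heads=8, num_patches=196):
        super().__init__()
        # 224x224 image -> 14x14 = 196 patch tokens (assumed input size)
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, img):                                        # img: (B, 3, 224, 224)
        tokens = self.patch_embed(img).flatten(2).transpose(1, 2)  # (B, 196, dim)
        return self.encoder(tokens + self.pos)                     # (B, 196, dim)


class PointCloudDecoder(nn.Module):
    """Point cloud transformer decoder: aligns the semantics of the two views
    with cross-attention, predicts a sparse cloud, then enriches it."""

    def __init__(self, dim=256, heads=8, n_sparse=256, up_ratio=8):
        super().__init__()
        self.view_align = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.point_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.point_queries = nn.Parameter(torch.randn(1, n_sparse, dim))
        self.to_xyz = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))
        self.refine = nn.Sequential(nn.Linear(dim + 3, dim), nn.ReLU(),
                                    nn.Linear(dim, 3 * up_ratio))
        self.up_ratio = up_ratio

    def forward(self, feat_left, feat_right):
        # Semantically-associated feature: left-view tokens attend to right-view tokens.
        assoc, _ = self.view_align(feat_left, feat_right, feat_right)   # (B, N, dim)
        # Learnable point queries attend to the associated feature.
        q = self.point_queries.expand(assoc.size(0), -1, -1)
        point_feat, _ = self.point_attn(q, assoc, assoc)                # (B, n_sparse, dim)
        sparse = self.to_xyz(point_feat)                                # (B, n_sparse, 3)
        # Enrich the sparse cloud: predict up_ratio offsets around each sparse point.
        offsets = self.refine(torch.cat([point_feat, sparse], dim=-1))  # (B, n_sparse, 3*r)
        dense = sparse.unsqueeze(2) + offsets.view(sparse.size(0), -1, self.up_ratio, 3)
        return sparse, dense.reshape(sparse.size(0), -1, 3)             # (B, n_sparse*r, 3)


class SATF(nn.Module):
    """Two parallel view encoders feeding one point cloud transformer decoder."""

    def __init__(self):
        super().__init__()
        self.enc_left, self.enc_right = ViewEncoder(), ViewEncoder()
        self.decoder = PointCloudDecoder()

    def forward(self, left, right):                 # two RGB views of the same object
        return self.decoder(self.enc_left(left), self.enc_right(right))


if __name__ == "__main__":
    left = torch.randn(2, 3, 224, 224)
    right = torch.randn(2, 3, 224, 224)
    sparse, dense = SATF()(left, right)
    print(sparse.shape, dense.shape)  # torch.Size([2, 256, 3]) torch.Size([2, 2048, 3])
```

In this sketch the sparse cloud (256 points) is produced directly from the cross-view feature, and the dense cloud (2048 points) is obtained by predicting eight coordinate offsets around each sparse point; the actual SATF may use different sampling, losses, and upsampling strategies.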

Cite this article

Xin Jia, Shourui Yang, Diyi Guan. Semantics-aware transformer for 3D reconstruction from binocular images. Optoelectronics Letters, 2022, 18(5): 293‒299 https://doi.org/10.1007/s11801-022-2055-0
