Graph-Segmenter: graph transformer with boundary-aware attention for semantic segmentation
Zizhang WU, Yuanzhu GAN, Tianhao XU, Fan WANG
Transformer-based semantic segmentation approaches, which divide the image into regions with sliding windows and model the relations inside each window, have achieved outstanding success. However, relation modeling between windows was not the primary emphasis of previous work and has therefore not been fully exploited. To address this issue, we propose Graph-Segmenter, an effective network that comprises a graph transformer and a boundary-aware attention module. It simultaneously models the deeper relations between windows in a global view and among the pixels inside each window in a local view, and it performs substantial boundary adjustment at low cost. Specifically, we treat every window, and every pixel inside a window, as a node to construct graphs for the two views, and we devise the graph transformer on these graphs. The boundary-aware attention module refines the edge information of target objects by modeling the relationships among pixels on object edges. Extensive experiments on three widely used semantic segmentation datasets (Cityscapes, ADE20K and PASCAL Context) demonstrate that the proposed network, a graph transformer with boundary-aware attention, achieves state-of-the-art segmentation performance.
graph transformer / graph relation network / boundary-aware / attention / semantic segmentation
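The two-view graph construction described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`partition_windows`, `graph_attention`, `two_view_update`), the mean pooling used to summarize each window as a node, and the additive fusion of the two views are all assumptions made for illustration, and the learned query/key/value projections of a real transformer are omitted for brevity. The attention weights play the role of a soft adjacency matrix over the nodes.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def graph_attention(nodes):
    # Treat the rows of `nodes` (..., N, C) as graph nodes. The scaled
    # dot-product attention weights act as a soft adjacency matrix.
    # Assumption: no learned projections, purely illustrative.
    scale = np.sqrt(nodes.shape[-1])
    adjacency = softmax(nodes @ np.swapaxes(nodes, -1, -2) / scale, axis=-1)
    return adjacency @ nodes, adjacency


def partition_windows(feat, win):
    # Split an (H, W, C) feature map into (num_windows, win * win, C).
    # H and W must be divisible by `win`.
    h, w, c = feat.shape
    x = feat.reshape(h // win, win, w // win, win, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, win * win, c)


def two_view_update(feat, win):
    windows = partition_windows(feat, win)            # (Nw, P, C)

    # Local view: every pixel inside a window is a node.
    local_out, _ = graph_attention(windows)           # (Nw, P, C)

    # Global view: every window is a node (assumed: mean-pooled summary).
    window_nodes = windows.mean(axis=1)               # (Nw, C)
    global_out, window_adj = graph_attention(window_nodes)

    # Assumed fusion: broadcast window-level context back to its pixels.
    return local_out + global_out[:, None, :], window_adj
```

For an 8×8 feature map with 4 channels and a window size of 4, `two_view_update` returns pixel features of shape `(4, 16, 4)` and a 4×4 window-level adjacency matrix whose rows each sum to 1.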
Zizhang Wu received the BSc and MSc degrees in Pattern Recognition and Intelligent Systems from Northeastern University, China in 2010 and 2012, respectively. He is currently a perception algorithm manager in the Computer Vision Perception Department of ZongMu Technology, China. He is mainly responsible for developing core perception algorithms, optimizing algorithms and model effects, and driving business development with technology.
Yuanzhu Gan received the MSc degree in Pattern Recognition and Artificial Intelligence from Nanjing University, China in 2021. He is now an algorithm engineer at ZongMu Technology, China. His current interests include 3D object detection for autonomous driving.
Tianhao Xu received the BE degree in Vehicle Engineering from Jilin University, China in 2017. He is currently pursuing the MSc degree in Electric Mobility at Technical University of Braunschweig, Germany. His current research interests include computer vision and microstructural analysis in machine learning.
Fan Wang received the BSc and MSc degrees in Computer Science and Artificial Intelligence from Northwestern Polytechnical University, China in 1997 and 2000, respectively. He is currently with ZongMu Technology as a Vice President and Chief Technology Officer. His current research interests include computer vision, sensor fusion, automatic parking, planning & control, and L2/L3/L4 autonomous driving.