SSA: semantic structure aware inference on CNN networks for weakly pixel-wise dense predictions without cost

Yanpeng SUN, Zechao LI

Front. Comput. Sci., 2025, Vol. 19, Issue 2: 192702. DOI: 10.1007/s11704-024-3571-9
Image and Graphics
RESEARCH ARTICLE

Abstract

Pixel-wise dense prediction tasks under weak supervision currently use Class Activation Maps (CAMs) to generate pseudo masks that serve as ground truth. However, existing methods often incorporate trainable modules to expand the immature CAMs, which can incur significant computational overhead and complicate the training process. In this work, we investigate the semantic structure information concealed within the CNN, and propose a semantic structure aware inference (SSA) method that uses this information to obtain high-quality CAMs without any additional training cost. Specifically, a semantic structure modeling module (SSM) is first proposed to generate a class-agnostic semantic correlation representation, where each item denotes the affinity degree between one category of objects and all the others. Then, the immature CAMs are refined through a dot product operation that utilizes the semantic structure information. Finally, the polished CAMs from different backbone stages are fused as the output. The advantage of SSA lies in its parameter-free nature and the absence of additional training costs, which makes it suitable for various weakly supervised pixel-wise dense prediction tasks. We conducted extensive experiments on weakly supervised object localization and weakly supervised semantic segmentation, and the results confirm the effectiveness of SSA.
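The three steps named in the abstract (semantic structure modeling, dot-product refinement, multi-stage fusion) can be illustrated with a minimal, parameter-free sketch. It is not the authors' implementation: the abstract does not state how affinities are computed or how stages are fused, so the clipped cosine similarity between spatial positions, the row normalisation, the mean fusion, and the assumption that every stage has already been resized to the same flattened resolution are all illustrative choices. The function names `cosine_affinity`, `refine_cam`, `fuse`, and `ssa_inference` are hypothetical.

```python
import math


def cosine_affinity(features):
    """Build an N x N affinity matrix from N per-position feature vectors.

    ASSUMPTION: affinity is cosine similarity clipped at zero; the abstract
    only says a semantic correlation representation is generated.
    """
    norms = [math.sqrt(sum(v * v for v in f)) or 1.0 for f in features]
    n = len(features)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(features[i], features[j]))
            affinity[i][j] = max(dot / (norms[i] * norms[j]), 0.0)
    return affinity


def refine_cam(cam, affinity):
    """Propagate CAM scores through the row-normalised affinity (a dot product per row)."""
    refined = []
    for row in affinity:
        total = sum(row) or 1.0
        refined.append(sum(w * c for w, c in zip(row, cam)) / total)
    return refined


def fuse(cams):
    """ASSUMPTION: fuse stages by averaging, then min-max normalise to [0, 1]."""
    n = len(cams[0])
    mean = [sum(c[i] for c in cams) / len(cams) for i in range(n)]
    lo, hi = min(mean), max(mean)
    span = (hi - lo) or 1.0
    return [(m - lo) / span for m in mean]


def ssa_inference(stage_features, cam):
    """Refine one flattened CAM with each stage's affinity and fuse the results."""
    refined = [refine_cam(cam, cosine_affinity(f)) for f in stage_features]
    return fuse(refined)


if __name__ == "__main__":
    # Four positions: 0 and 1 share object-like features, 2 and 3 are background.
    stage1 = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    stage2 = [[2.0, 0.0], [1.0, 0.0], [0.0, 3.0], [0.0, 1.0]]
    immature_cam = [1.0, 0.0, 0.0, 0.0]  # only position 0 is activated
    print(ssa_inference([stage1, stage2], immature_cam))
```

In this toy input the immature CAM fires only at position 0, and the refinement spreads the activation to position 1 because the two positions have similar features, while the background positions stay at zero. No parameters are trained at any step, which is the property the abstract emphasises.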

Keywords

class attention maps / semantic structure / weakly-supervised object localization / weakly-supervised semantic segmentation

Cite this article

Yanpeng SUN, Zechao LI. SSA: semantic structure aware inference on CNN networks for weakly pixel-wise dense predictions without cost. Front. Comput. Sci., 2025, 19(2): 192702. https://doi.org/10.1007/s11704-024-3571-9

Yanpeng Sun received the MS degree from Guilin University of Electronic Technology, China in 2019. He is currently pursuing the PhD degree with the School of Computer Science and Engineering, Nanjing University of Science and Technology, China. His research interests include deep learning, visual segmentation, and visual understanding.

Zechao Li is currently a professor at Nanjing University of Science and Technology, China. He received his PhD degree from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, China in 2013, and his BE degree from the University of Science and Technology of China, China in 2008. His research interests include big media analysis and computer vision. He serves as an Associate Editor for IEEE TNNLS.

Acknowledgements

This work was partially supported by the National Key R&D Program of China (2022ZD0118802) and the National Natural Science Foundation of China (Grant Nos. U20B2064 and U21B2043).

Competing interests

The authors declare that they have no competing interests or financial conflicts to disclose.

RIGHTS & PERMISSIONS

2025 Higher Education Press