DCPNet: a comprehensive framework for multimodal sarcasm detection via graph topology extraction and multi-scale feature fusion

Youjiang FANG, Liang ZHANG, Shihao WANG, Wenyuan ZHANG, Yuxin WANG, Yuanyuan LIU, Xiaopeng WEI, Xin YANG

Front. Comput. Sci., 2026, 20(7): 2007336. DOI: 10.1007/s11704-025-50196-4
Artificial Intelligence
RESEARCH ARTICLE


Abstract

Recent advances in multimodal sarcasm detection (MSD) have made significant progress in understanding the interplay between textual and visual cues. However, existing methods tend to overemphasize cross-modal semantic alignment and consequently neglect sarcasm cues embedded independently within each modality. In this paper, we present DCPNet, a novel Dual-channel Cross-modal Perception Network that integrates unimodal and cross-modal features by incorporating comprehensive structural semantics. To capture rich topological relationships within each modality, we introduce a Graph Topology Extraction and Enhancement (GTEE) module that builds graph structures from both text and image features, facilitating deeper semantic representation. Additionally, we propose a Cross-Modal Multi-Scale Feature Fusion (CMFF) module that aligns and integrates textual and visual features at multiple scales, ensuring that comprehensive contextual information is captured. An attention mechanism assigns appropriate weights to the textual and visual features, optimizing the fusion process for more accurate sarcasm detection. Extensive experiments on the MMSD and MMSD2.0 benchmark datasets demonstrate that DCPNet outperforms existing state-of-the-art (SOTA) methods in both accuracy and robustness.
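The attention-weighted fusion step described above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the feature dimensions, the scalar scoring vectors `w_t`/`w_i`, and the function names are all our assumptions, chosen only to show how softmax-normalized modality weights produce a convex combination of textual and visual features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(text_feat, image_feat, w_t, w_i):
    """Hypothetical sketch of attention-weighted modality fusion.

    Each modality receives a scalar relevance score; softmax turns the
    two scores into weights summing to 1, and the fused representation
    is the weighted sum of the two feature vectors.
    """
    scores = np.array([text_feat @ w_t, image_feat @ w_i])
    weights = softmax(scores)  # convex weights over the two modalities
    fused = weights[0] * text_feat + weights[1] * image_feat
    return fused, weights

# Toy usage with random features (dimension 8 is an arbitrary choice).
rng = np.random.default_rng(0)
d = 8
t, v = rng.standard_normal(d), rng.standard_normal(d)
fused, w = attention_fuse(t, v, rng.standard_normal(d), rng.standard_normal(d))
```

In a full model the scores would come from learned projections of both modalities rather than fixed vectors, but the normalization-then-weighted-sum structure is the same.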


Keywords

multimodal sarcasm detection / graph topology extraction / cross-modal feature fusion

Cite this article

Youjiang FANG, Liang ZHANG, Shihao WANG, Wenyuan ZHANG, Yuxin WANG, Yuanyuan LIU, Xiaopeng WEI, Xin YANG. DCPNet: a comprehensive framework for multimodal sarcasm detection via graph topology extraction and multi-scale feature fusion. Front. Comput. Sci., 2026, 20(7): 2007336 DOI:10.1007/s11704-025-50196-4



RIGHTS & PERMISSIONS

Higher Education Press
