DCPNet: a comprehensive framework for multimodal sarcasm detection via graph topology extraction and multi-scale feature fusion
Youjiang FANG, Liang ZHANG, Shihao WANG, Wenyuan ZHANG, Yuxin WANG, Yuanyuan LIU, Xiaopeng WEI, Xin YANG
Front. Comput. Sci., 2026, Vol. 20, Issue (7): 2007336
Recent work on multimodal sarcasm detection (MSD) has made significant progress in understanding the interplay between textual and visual cues. However, existing methods tend to overemphasize cross-modal semantic alignment and consequently neglect sarcasm cues that are embedded independently within each modality. In this paper, we present DCPNet, a novel Dual-channel Cross-modal Perception Network that integrates unimodal and cross-modal features by incorporating comprehensive structural semantics. To capture rich topological relationships within each modality, we introduce a Graph Topology Extraction and Enhancement (GTEE) module that builds graph structures from both text and image features, facilitating deeper semantic representation. Additionally, we propose a Cross-Modal Multi-Scale Feature Fusion (CMFF) module that aligns and integrates text and image features at multiple scales to capture comprehensive contextual information. An attention mechanism assigns weights to the textual and visual features, optimizing the fusion process for more accurate sarcasm detection. Extensive experiments on the MMSD and MMSD2.0 benchmark datasets demonstrate that DCPNet outperforms existing state-of-the-art (SOTA) methods in both accuracy and robustness.
multimodal sarcasm detection / graph topology extraction / cross-modal feature fusion
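The abstract names two ideas that a small example can make concrete: building a graph over the features of one modality, and weighting textual and visual features with attention before fusing them. The sketch below is only an illustration under stated assumptions and is not the DCPNet implementation. The k-nearest-neighbor rule based on cosine similarity, the softmax over two scalar relevance scores, and all function names (`knn_graph`, `fuse`, and so on) are hypothetical choices made here, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_graph(features, k):
    """Assumed graph rule: link each node to its k most similar other nodes.

    `features` is a list of vectors (e.g. token or image-patch embeddings).
    Returns an adjacency list {node_index: [neighbor_indices]}.
    """
    graph = {}
    for i, u in enumerate(features):
        scored = [(cosine(u, v), j) for j, v in enumerate(features) if j != i]
        scored.sort(reverse=True)  # most similar first
        graph[i] = [j for _, j in scored[:k]]
    return graph


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(text_vec, image_vec, text_score, image_score):
    """Assumed fusion rule: convex combination weighted by softmax scores."""
    w_text, w_image = softmax([text_score, image_score])
    return [w_text * t + w_image * v for t, v in zip(text_vec, image_vec)]


if __name__ == "__main__":
    toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(knn_graph(toy_features, k=1))
    print(fuse([1.0, 0.0], [0.0, 1.0], text_score=0.0, image_score=0.0))
```

In a trained model the relevance scores would be learned rather than supplied by hand, and the graph would typically feed a graph neural network layer; both are omitted here to keep the sketch self-contained.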
Higher Education Press