Bridging modalities: a unified framework for textual and multimodal dialogue discourse parsing

Chen GONG; Nan YU; Guo-Hong FU

doi:10.1007/s11704-025-50170-0

Front. Comput. Sci., 2026, Vol. 20, Issue 9: 2009351. DOI: 10.1007/s11704-025-50170-0

Artificial Intelligence

RESEARCH ARTICLE


Abstract

Dialogue discourse parsing is a fundamental task in natural language understanding. It aims to capture the relationships between utterances in a dialogue, facilitating a deeper understanding of dialogue structure and semantics, especially in long and complex dialogues. Existing research often develops separate dialogue discourse parsers for text-only and multimodal scenarios, largely due to the scarcity of parallel multimodal annotated datasets. This separation limits the ability to fully utilize data from different modalities and poses challenges for real-world artificial intelligence applications. To address this limitation, we propose a unified dialogue discourse parsing framework that bridges text-only and multimodal parsing within a single model. We first develop a basic text-only parser, pre-trained on textual datasets. Then, we extend it to multimodal scenarios by adding multimodal encoders and fusion modules, while freezing the parameters learned during the text-only stage. We conduct extensive experiments on three datasets, covering both text-only and multimodal dialogues. Experimental results show that our approach achieves significant average improvements over several existing benchmarks. This demonstrates the generalizability and effectiveness of our framework for dialogue discourse parsing across different modalities.
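The two-stage recipe described in the abstract (train a text-only parser, then freeze it and train only the newly added multimodal modules) can be illustrated with a minimal, framework-free sketch. Everything in it is an assumption made for illustration: the parameter names, the scalar "weights", the gradient values, and the learning rate are hypothetical and are not taken from the paper. It shows only the freezing mechanics, not the authors' model.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Param:
    """A single scalar weight with a flag that mimics parameter freezing."""
    value: float
    trainable: bool = True


def sgd_step(params: Dict[str, Param], grads: Dict[str, float], lr: float = 0.1) -> None:
    """Apply one plain gradient step, skipping every frozen parameter."""
    for name, param in params.items():
        if param.trainable:
            param.value -= lr * grads.get(name, 0.0)


# Stage 1 (hypothetical names): train the text-only parser.
params = {
    "text_encoder.w": Param(1.0),
    "link_scorer.w": Param(0.5),
}
sgd_step(params, {"text_encoder.w": 0.2, "link_scorer.w": -0.4})
text_stage_snapshot = {name: p.value for name, p in params.items()}

# Stage 2: freeze what was learned in stage 1, then add new multimodal modules.
for param in params.values():
    param.trainable = False
params["image_encoder.w"] = Param(0.3)
params["fusion.w"] = Param(0.0)

# Every parameter receives a gradient, but only the new modules are updated.
sgd_step(params, {name: 1.0 for name in params})

frozen_unchanged = all(
    params[name].value == value for name, value in text_stage_snapshot.items()
)
print("text-only parameters unchanged:", frozen_unchanged)
```

In a deep learning framework the same effect is usually obtained by disabling gradient computation for the pre-trained parameters, so that the text-only behaviour is preserved while the added encoders and fusion modules are trained.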

Keywords

dialogue discourse parsing / dialogue systems / multimodal data / unified framework / natural language processing

Cite this article

Chen GONG, Nan YU, Guo-Hong FU. Bridging modalities: a unified framework for textual and multimodal dialogue discourse parsing. Front. Comput. Sci., 2026, 20(9): 2009351 DOI:10.1007/s11704-025-50170-0

RIGHTS & PERMISSIONS

Higher Education Press
