RiverEcho-2.0: A Real-Time Interactive System for Yellow River Culture via Enhanced MultiModal Document RAG

Haofeng Wang, Yilin Guo, Tiange Zhang, Zehao Li, Tong Yue, Yizong Wang, Rongqun Lin, Feng Gao, Shiqi Wang, Siwei Ma

Transactions on Artificial Intelligence, 2025, 1(1): 212-226. DOI: 10.53941/tai.2025.100014

Abstract

The Yellow River culture is a cornerstone of Chinese civilization, embodying rich historical, social, and ecological significance. To conserve and promote this invaluable cultural heritage, we propose RiverEcho-2.0, a real-time interactive digital system designed to facilitate user engagement with Yellow River culture. As the foundation of our system, we curated and digitized a comprehensive collection of books and documents related to Yellow River heritage, constructing a dedicated multimodal corpus. To effectively leverage this corpus, we introduce a novel multi-modal Document Retrieval-Augmented Generation (RAG) framework that enhances document retrieval through context-aware image-text alignment and joint embedding. Experimental results demonstrate that our method substantially outperforms existing state-of-the-art multi-modal RAG baselines, yielding significant gains in downstream tasks.
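
To make the retrieval idea concrete, the following minimal Python sketch illustrates what context-aware image-text joint embedding for page retrieval can look like. It uses an off-the-shelf CLIP model as a stand-in for the paper's joint encoder; the `embed_page` and `retrieve` helpers, the additive late-fusion step, and the model choice are illustrative assumptions, not the released implementation (see the code repository linked in the Data Availability Statement).

```python
# Minimal sketch of joint image-text page retrieval (illustrative only;
# CLIP stands in for the paper's context-aware joint encoder).
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_page(image: Image.Image, context_text: str) -> np.ndarray:
    """Fuse a page image with its surrounding text into one unit vector."""
    img = processor(images=image, return_tensors="pt")
    txt = processor(text=[context_text], return_tensors="pt", truncation=True)
    img_vec = model.get_image_features(**img)[0].detach().numpy()
    txt_vec = model.get_text_features(**txt)[0].detach().numpy()
    fused = img_vec + txt_vec  # simple additive late fusion (an assumption)
    return fused / np.linalg.norm(fused)

def retrieve(query: str, index: list[tuple[str, np.ndarray]], k: int = 3):
    """Rank indexed pages by cosine similarity to a text query."""
    q_in = processor(text=[query], return_tensors="pt", truncation=True)
    q = model.get_text_features(**q_in)[0].detach().numpy()
    q /= np.linalg.norm(q)
    scores = [(page_id, float(vec @ q)) for page_id, vec in index]
    return sorted(scores, key=lambda s: -s[1])[:k]
```

A production system would learn the fusion rather than add vectors; the point of the sketch is only that the query, the page image, and the page's textual context land in a single comparable embedding space.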

Keywords

Yellow River culture / dataset construction / multi-modal document RAG

Cite this article

Download citation ▾
Haofeng Wang, Yilin Guo, Tiange Zhang, Zehao Li, Tong Yue, Yizong Wang, Rongqun Lin, Feng Gao, Shiqi Wang, Siwei Ma. RiverEcho-2.0: A Real-Time Interactive System for Yellow River Culture via Enhanced MultiModal Document RAG. Transactions on Artificial Intelligence, 2025, 1(1): 212-226. DOI: 10.53941/tai.2025.100014

Author Contributions

H.W.: Experiments, system implementation, and writing. Y.G.: System implementation, writing, and revision. Z.L.: Data guidance. T.Y.: Assistance with system implementation. Y.W.: Writing guidance. T.Z.: Data engineering. R.L.: Experimental design. F.G.: Artistic guidance. S.W.: Academic supervision and writing guidance. S.M.: Conceptualization, academic supervision, and writing guidance. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China (2022YFF0902400), NSFC (62025101), BNSF (L242014), and the New Cornerstone Science Foundation through the XPLORER PRIZE.

Data Availability Statement

The datasets generated and/or analysed during the current study are available from the following sources: the M3DocVQA dataset at https://github.com/bloomberg/m3docrag/tree/main/m3docvqa; the MP-DocVQA dataset at https://rrc.cvc.uab.es/?ch=17&com=downloads; and the Yellow River Corpus at https://pan.baidu.com/s/1uPo206qqeTDGaWRxZufG5w?pwd=tlpd. The source code developed for this study is available in the GitHub repository: https://github.com/hfwang2001/MMRAG.

Conflicts of Interest

The authors declare no conflict of interest.

