Position-aware modeling for fine-grained visually-rich long document understanding

Yixiao MA, Shulan RUAN, Zijie SONG, Xin ZHANG, Yuze ZHAO, Zhenya HUANG, Enhong CHEN

Front. Comput. Sci. ›› 2027, Vol. 21 ›› Issue (4): 2104704. DOI: 10.1007/s11704-026-51131-x

Image and Graphics
RESEARCH ARTICLE

Abstract

The visually-rich long document understanding task requires accurate extraction of answers from documents such as manuals and academic papers, which often consist of dozens of text-rich images. Recently, multimodal large language models (MLLMs) have demonstrated strong performance on this task. To alleviate the inefficiency of MLLMs as document length grows, retrieval-augmented methods select key pages and perform answer generation only on the retrieved pages, reducing computational cost. Despite this significant progress, existing methods still face some inherent challenges. For one thing, relevant pages are usually retrieved based on textual content alone, which neglects spatial layout information. For another, coarse-grained retrieval at the page level can leave a semantic gap between the retrieved pages and the query. In this paper, we propose PDU, a position-aware fine-grained retrieval-augmented model for long document understanding. Specifically, to bridge the semantic gap between the query and full pages, we first develop a fine-grained document encoding module that partitions each document page into chunks and encodes them with MLLMs. Then, we design a position-enhanced similarity calculation approach that computes the similarity between the query and each document chunk to retrieve the most relevant ones. To improve the model's understanding of document layout and structure, we further encode the bounding-box coordinates and page number of each document chunk and add them to the MLLM-derived visual features. Next, we propose a chunk-to-page answer generation method that maps the retrieved chunks back to their corresponding pages and generates the final answer. To support training, we construct a minimal answerable region (MAR) dataset using a bidirectional approximation algorithm to precisely link queries to relevant document chunks. Our method achieves strong results on public benchmarks, highlighting the value of incorporating layout information into retrieval-augmented document understanding.
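The position-enhanced similarity calculation and the chunk-to-page mapping described in the abstract can be illustrated with a minimal sketch. The PyTorch code below is an assumption-laden illustration, not the authors' released implementation: the class name PositionEnhancedScorer, the feature dimension, and the choice of cosine similarity for scoring are all hypothetical; the only element taken from the abstract is the idea of encoding each chunk's bounding-box coordinates and page number and adding them to the MLLM-derived visual features before scoring against the query.

```python
# Minimal sketch of position-enhanced chunk retrieval (hypothetical names;
# not the paper's actual implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionEnhancedScorer(nn.Module):
    def __init__(self, d_model: int = 768, max_pages: int = 512):
        super().__init__()
        # Project normalized (x0, y0, x1, y1) bounding-box coordinates
        # of a chunk into the same space as the MLLM visual features.
        self.bbox_proj = nn.Linear(4, d_model)
        # Learnable embedding for each chunk's page number.
        self.page_emb = nn.Embedding(max_pages, d_model)

    def forward(self, chunk_feats, bboxes, page_ids, query_feat):
        # chunk_feats: (num_chunks, d_model) MLLM-derived chunk features
        # bboxes:      (num_chunks, 4) coordinates normalized to [0, 1]
        # page_ids:    (num_chunks,) integer page indices
        # query_feat:  (d_model,) encoded query
        pos = self.bbox_proj(bboxes) + self.page_emb(page_ids)
        enhanced = chunk_feats + pos  # add positional signal to visual features
        # Cosine similarity between the query and every chunk.
        return F.cosine_similarity(enhanced, query_feat.unsqueeze(0), dim=-1)

def topk_pages(scores, page_ids, k: int = 5):
    # Chunk-to-page step: map the top-k retrieved chunks back to the
    # (deduplicated) set of pages fed to the answer generator.
    k = min(k, scores.numel())
    top_idx = scores.topk(k).indices
    return sorted(set(page_ids[top_idx].tolist()))
```

Under these assumptions, retrieval reduces to scoring every chunk once and collecting the pages of the highest-scoring chunks, so answer generation runs only on the few pages that contain query-relevant regions.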

Keywords

document understanding / retrieval augmented generation / multimodal large language models

Cite this article

Yixiao MA, Shulan RUAN, Zijie SONG, Xin ZHANG, Yuze ZHAO, Zhenya HUANG, Enhong CHEN. Position-aware modeling for fine-grained visually-rich long document understanding. Front. Comput. Sci., 2027, 21(4): 2104704. DOI: 10.1007/s11704-026-51131-x


