Position-aware modeling for fine-grained visually-rich long document understanding
Yixiao MA, Shulan RUAN, Zijie SONG, Xin ZHANG, Yuze ZHAO, Zhenya HUANG, Enhong CHEN
Front. Comput. Sci., 2027, Vol. 21, Issue (4): 2104704
Visually-rich long document understanding requires accurately extracting answers from documents such as manuals and academic papers, which often consist of dozens of text-rich images. Recently, multimodal large language models (MLLMs) have demonstrated strong performance on this task. To alleviate the inefficiency of MLLMs as document length grows, retrieval-augmented methods reduce computational cost by selecting key pages and generating answers only from the retrieved pages. Despite significant progress, existing methods still face some inherent challenges. For one thing, relevant pages are usually retrieved based on textual content, which neglects spatial layout information. For another, coarse-grained retrieval at the page level can lead to a semantic gap between the retrieved pages and the query. In this paper, we propose PDU, a position-aware fine-grained retrieval-augmented model for long document understanding. Specifically, to bridge the semantic gap between the query and full pages, we first develop a fine-grained document encoding module that partitions each document page into chunks and encodes them with MLLMs. Then, we design a position-enhanced similarity calculation approach that computes the similarity between the query and each document chunk to retrieve the most relevant ones. To improve the model's understanding of document layout and structure, we further encode the bounding-box coordinates and page number of each document chunk and add them to the MLLM-derived visual features. Next, we propose a chunk-to-page answer generation method that maps the retrieved chunks back to their corresponding pages and generates the final answer. To support training, we construct a minimal answerable region (MAR) dataset using a bidirectional approximation algorithm to precisely link queries to relevant document chunks. Our method achieves strong results on public benchmarks, highlighting the value of incorporating layout information in retrieval-augmented document understanding.
document understanding / retrieval-augmented generation / multimodal large language models
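
The position-enhanced retrieval step described in the abstract can be pictured with a short sketch. The Python code below is a minimal illustration, not the authors' implementation: it assumes pooled MLLM-derived chunk features are already available, projects each chunk's normalized bounding box through a hypothetical box_proj layer, adds a learned page-number embedding, and ranks chunks by cosine similarity to the query before mapping the top chunks back to their pages (the chunk-to-page step). All module names, dimensions, and the fusion-by-addition choice are illustrative assumptions.

    # Minimal sketch of position-enhanced chunk retrieval (illustrative only;
    # names and dimensions are assumptions, not the paper's released code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PositionEnhancedScorer(nn.Module):
        def __init__(self, dim: int = 256, max_pages: int = 128):
            super().__init__()
            # Project normalized (x1, y1, x2, y2) box coordinates into the
            # same space as the MLLM-derived chunk features.
            self.box_proj = nn.Linear(4, dim)
            # Learned embedding for the page index of each chunk.
            self.page_emb = nn.Embedding(max_pages, dim)

        def forward(self, query_feat, chunk_feats, boxes, page_ids):
            # query_feat: (dim,)            pooled query embedding
            # chunk_feats: (n_chunks, dim)  MLLM-derived chunk features
            # boxes: (n_chunks, 4)          normalized boxes in [0, 1]
            # page_ids: (n_chunks,)         page index of each chunk
            pos = self.box_proj(boxes) + self.page_emb(page_ids)
            fused = chunk_feats + pos  # add position info to visual features
            # Cosine similarity between the query and every chunk.
            return F.cosine_similarity(query_feat.unsqueeze(0), fused, dim=-1)

    # Toy usage: retrieve top-k chunks, then map them back to their pages
    # so answer generation can run on the corresponding page images.
    scorer = PositionEnhancedScorer()
    q = torch.randn(256)
    chunks = torch.randn(10, 256)
    boxes = torch.rand(10, 4)
    pages = torch.randint(0, 5, (10,))
    scores = scorer(q, chunks, boxes, pages)
    topk = scores.topk(3).indices
    retrieved_pages = sorted(set(pages[topk].tolist()))
    print("retrieved pages:", retrieved_pages)

Additive fusion keeps the position-augmented chunk features in the same space as the plain visual features, so a single similarity function serves for both layout-aware and layout-agnostic retrieval; how the actual model fuses position and visual information may differ.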