Heterogeneous Tabular Data Query and Analysis System Based on Large Language Models
Jiajun Li, Xiaojun SHI, Angeline Mary Marchella, Tianze HU, Lianpeng Qiao, Kaiyu Li, Guoren WANG
As the volume of digital information grows, so does the demand for efficiently extracting insights and answering queries from tables. However, real-world tables often feature complex structures such as merged cells, multi-level headers, and block layouts. These heterogeneous tables pose significant challenges for question-answering systems. Existing research focuses primarily on normalized tables, and even state-of-the-art large language models for tabular data achieve 31% to 42% lower accuracy on heterogeneous tables. Current methods either linearize tables into text and lose structural information, or adopt graph-based representations that are computationally inefficient and yield suboptimal accuracy. Moreover, these methods typically require extensive manual processing to convert complex tables into specific formats, incurring additional annotation costs. To address these limitations, this paper presents DTI-HTQA (Dual-Tree-Indexing Heterogeneous Table Question Answering), an end-to-end framework for heterogeneous table question answering. It uses a graph-based representation to preserve table semantics, an LLM-driven header recognition algorithm to eliminate manual annotations, and dual-tree indices for intelligent retrieval. Four core operators guide precise cell retrieval, while a chain-of-thought reasoning strategy enhances transparency and accuracy for complex queries. Experiments on HiTab and AIT-QA show that DTI-HTQA achieves top performance in exact match and LLM-evaluated accuracy, approaching or surpassing state-of-the-art methods. The header recognition algorithm achieves an accuracy of 88%, and an upper-bound analysis confirms that using its output in place of gold-standard annotations causes only a 2–4% accuracy gap, validating the feasibility of the LLM-based approach. Comprehensive ablation studies demonstrate that each component contributes meaningfully to the overall performance, with the dual-tree index providing the most substantial gains.
Cost-efficiency analysis shows that DTI-HTQA achieves 52–71% lower token consumption than comparable multi-step frameworks while delivering higher accuracy. This study tackles key challenges in heterogeneous table question answering and offers a practical solution by reducing reliance on costly manual annotations, enabling more efficient use of tabular data.
Heterogeneous Tables / Large Language Models / Tree Indexing / Header Recognition / Tabular Data Question Answering
Higher Education Press 2026