Heterogeneous Tabular Data Query and Analysis System Based on Large Language Models

Jiajun Li; Xiaojun SHI; Angeline Mary Marchella; Tianze HU; Lianpeng Qiao; Kaiyu Li; Guoren WANG

Front. Comput. Sci., doi:10.1007/s11704-026-60020-2

RESEARCH ARTICLE


Abstract

As the volume of digital information grows, so does the demand for efficiently extracting insights and answering queries from tables. However, real-world tables often have complex structures such as merged cells, multi-level headers, and block layouts. These heterogeneous tables pose significant challenges for question-answering systems. Existing research focuses primarily on normalized tables, and even state-of-the-art large language models for tabular data achieve 31% to 42% lower accuracy on heterogeneous tables. Current methods either linearize tables into text, losing structural information, or adopt graph-based representations that are computationally inefficient and yield suboptimal accuracy. Moreover, these methods typically require extensive manual processing to convert complex tables into specific formats, incurring additional annotation costs. To address these limitations, this paper presents DTI-HTQA (Dual-Tree-Indexing Heterogeneous Table Question Answering), an end-to-end framework for heterogeneous table question answering. It uses a graph-based representation to preserve table semantics, an LLM-driven header recognition algorithm to eliminate manual annotations, and dual-tree indices for intelligent retrieval. Four core operators guide precise cell retrieval, while a chain-of-thought reasoning strategy enhances transparency and accuracy for complex queries. Experiments on HiTab and AIT-QA show that DTI-HTQA achieves top performance in exact match and LLM-evaluated accuracy, approaching or surpassing state-of-the-art methods. The header recognition algorithm achieves an accuracy of 88%, and an upper-bound analysis confirms that recognition errors cause only a 2–4% accuracy gap compared with gold-standard annotations, validating the feasibility of the LLM-based approach. Comprehensive ablation studies demonstrate that each component contributes meaningfully to overall performance, with the dual-tree index providing the most substantial gains. Cost-efficiency analysis shows that DTI-HTQA consumes 52–71% fewer tokens than comparable multi-step frameworks while delivering higher accuracy. This study tackles key challenges in heterogeneous table question answering and offers a practical solution that reduces reliance on costly manual annotations, enabling more efficient use of tabular data.
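To make the idea of a dual-tree index concrete, the following is a minimal sketch of how two header trees (one over row headers, one over column headers) could locate cells in a table with multi-level headers. It is an illustration only and not the paper's implementation: the names HeaderNode, build_tree, and lookup, as well as the toy table, are assumptions introduced here, and the abstract does not specify how DTI-HTQA builds or queries its indices.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class HeaderNode:
    """One node of a header tree; only leaves carry a row or column position."""
    label: str
    index: Optional[int] = None
    children: Dict[str, "HeaderNode"] = field(default_factory=dict)

    def insert(self, path: Sequence[str], index: int) -> None:
        # Walk (and create) one child per header level, then mark the leaf.
        node = self
        for label in path:
            node = node.children.setdefault(label, HeaderNode(label))
        node.index = index

    def find(self, path: Sequence[str]) -> List[int]:
        # A partial path selects every leaf below the node it ends at.
        node = self
        for label in path:
            node = node.children[label]
        return node.leaf_indices()

    def leaf_indices(self) -> List[int]:
        if not self.children:
            return [self.index]
        found: List[int] = []
        for child in self.children.values():
            found.extend(child.leaf_indices())
        return found


def build_tree(paths: Sequence[Tuple[str, ...]]) -> HeaderNode:
    """Build a header tree from one header path per row (or per column)."""
    root = HeaderNode("root")
    for position, path in enumerate(paths):
        root.insert(path, position)
    return root


def lookup(row_tree, col_tree, cells, row_path, col_path):
    """Return the values of all cells matched by a row path and a column path."""
    return [cells[r][c]
            for r in row_tree.find(row_path)
            for c in col_tree.find(col_path)]


# Hypothetical toy table with two-level row and column headers.
row_paths = [("Revenue", "Product"), ("Revenue", "Service"), ("Cost", "Product")]
col_paths = [("2022", "Q1"), ("2022", "Q2"), ("2023", "Q1"), ("2023", "Q2")]
cells = [
    [10, 12, 14, 16],
    [5, 6, 7, 8],
    [4, 5, 6, 7],
]

row_tree = build_tree(row_paths)
col_tree = build_tree(col_paths)

single = lookup(row_tree, col_tree, cells, ("Revenue", "Service"), ("2023", "Q1"))
block = lookup(row_tree, col_tree, cells, ("Revenue",), ("2023",))
```

In this sketch a full header path selects one cell, while a partial path such as ("Revenue",) selects the whole block under that header, which is the kind of structure that linearizing a table into text tends to lose.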

Keywords

Heterogeneous Tables / Large Language Models / Tree Indexing / Header Recognition / Tabular Data Question Answering

Cite this article

Jiajun Li, Xiaojun SHI, Angeline Mary Marchella, Tianze HU, Lianpeng Qiao, Kaiyu Li, Guoren WANG. Heterogeneous Tabular Data Query and Analysis System Based on Large Language Models. Front. Comput. Sci. DOI:10.1007/s11704-026-60020-2

RIGHTS & PERMISSIONS

Higher Education Press 2026
