A survey of table reasoning with large language models

Xuanliang ZHANG, Dingzirui WANG, Longxu DOU, Qingfu ZHU, Wanxiang CHE

Front. Comput. Sci., 2025, 19(9): 199348. DOI: 10.1007/s11704-024-40330-z
Artificial Intelligence
REVIEW ARTICLE


Abstract

Table reasoning aims to generate inference results based on the user requirement and the provided table. Enhancing the table reasoning capability of the model can help obtain information efficiently. Recent advancements have positioned Large Language Models (LLMs) as the predominant method for table reasoning, primarily due to their substantial reduction in annotation costs and superior performance compared to previous methods. However, existing research still lacks a summary of LLM-based table reasoning works. Therefore, questions about which techniques can improve table reasoning performance in the era of LLMs, and how to enhance table reasoning abilities in the future, remain largely unexplored. This gap significantly limits progress in research. To answer these questions and advance table reasoning research with LLMs, we present this survey to analyze existing research and inspire future work. In this paper, we analyze the mainstream techniques used to improve table reasoning performance in the LLM era. Also, we provide research directions for improving on existing methods to inspire future research.


Keywords

table reasoning / large language models / survey

Cite this article

Xuanliang ZHANG, Dingzirui WANG, Longxu DOU, Qingfu ZHU, Wanxiang CHE. A survey of table reasoning with large language models. Front. Comput. Sci., 2025, 19(9): 199348 https://doi.org/10.1007/s11704-024-40330-z

1 Introduction

Table reasoning, a significant task in the study of Natural Language Processing (NLP), focuses on the extraction and processing of data from extensive tabular datasets [1]. An illustration of the table reasoning task is shown in Fig.1. Given one or more tables, this task requires the model to complete the user requirement accordingly (e.g., table QA [2], table fact verification [3]).
Fig.1 The illustration of various table reasoning tasks


In the past, research in table reasoning has gone through several phases: rule-based, neural network-based, and pre-trained language model-based [1], which we call the pre-LLM era. Recent research [4] has shown that Large Language Models (LLMs) exhibit compelling performance across NLP tasks and, in particular, dramatically reduce annotation requirements when applied to downstream tasks, which we call the LLM era. Consequently, many works have applied LLMs to the table reasoning task to reduce overhead and outperform the methods of the pre-LLM era, and LLMs have become the current mainstream approach, as shown in Fig.2.
Fig.2 The trend of research on the table reasoning task in the LLM era over recent months. #Paper denotes the number of papers. The specifics of these categories are elaborated in Section 3


However, there is currently a lack of summary and analysis of table reasoning works with LLMs, so how to improve performance remains under exploration, which limits existing research to some extent. Besides, the table reasoning surveys of the pre-LLM era are not suitable for LLMs. Some mainstream techniques of the pre-LLM era, such as changing the model structure [1], are not applicable to LLMs in table reasoning, while LLM methods focus more on designing prompts or pipelines [4]. Therefore, this paper summarizes the existing works on table reasoning with LLMs to shed light on future research. In detail, we focus on two questions of table reasoning: 1. What techniques can improve table reasoning performance in the LLM era; 2. How to enhance table reasoning ability in the future. The structure of this survey is shown in Fig.3.
Fig.3 The structure overview of our paper, taking the most representative works as an example


Regarding the first question, to better adapt to table reasoning research in the LLM era, we introduce the mainstream techniques and detailed methods in Section 3. Specifically, we categorize existing works according to the different techniques they utilize and detail them respectively. Also, we compare the best performance of pre-LLM and LLM methods on different benchmarks and show that LLMs consistently surpass pre-LLMs in the table reasoning task. Then, we discuss the advantages of LLMs in solving the table reasoning task based on the two inherent challenges of the task. Regarding the second question, we discuss potential future directions of table reasoning in Section 4. To promote table reasoning research, we analyze how to further improve table reasoning performance from the aspect of the mainstream techniques.

2 Background

2.1 Paper selection criteria

To ensure the selected papers are highly related to the survey, the papers we select should meet the following criteria: 1. Each question in the task that the paper aims to solve must be related to at least one table. 2. The paper utilizes LLMs for fine-tuning, inference, or both.

2.2 Task definition

As the basis for subsequent analysis, in this section, we present the definition of the table and the table reasoning task. A table is a structured arrangement of data organized into rows and columns, used for systematically displaying information in a compact, accessible format. In the table reasoning task, the input consists of the table and the user requirement corresponding to various tasks (e.g., table QA, table fact verification, table-to-text, and text-to-SQL), and the output is the inference result.

2.3 Benchmarks

To help researchers understand the existing application scenarios of table reasoning in detail, we introduce four mainstream table reasoning tasks, which more than 90% of the selected papers address, covering question answering, entailment, text generation, and semantic parsing. An illustration of the four tasks is shown in Fig.1. Although most works that solve table reasoning tasks with LLMs do not need fine-tuning data, they still rely on labeled data to validate performance.
Table QA is to answer the user question according to the given table [2]. Typically, the answer is a short span, which needs to be extracted from the table or obtained through inference on the table, such as calculation and common sense reasoning. WikiTableQuestions [2] serves as the initial benchmark in the table QA task, which has open-domain tables accompanied by complex questions.
Table fact verification aims to verify whether a textual hypothesis is entailed or refuted based on the evidence tables [3]. The answers for table fact verification can be categorized into two or three types: Supports, Refutes [3], and optionally NotEnoughInformation, which means that the table does not provide enough information to predict the claim [18]. Among them, it is more challenging to predict NotEnoughInformation, because it especially requires the model to understand the overall table [19,20]. TabFact [3], as the first benchmark in the table fact verification task, features large-scale cross-domain table data and complex reasoning requirements.
Table-to-text aims to generate a natural language description corresponding to the given question over a table [21]. Different from the table QA task, which only generates a short span, table-to-text requires the answer to be a paragraph. This task requires accurately conveying the information contained within tabular data in a readable and informative manner. FeTaQA [21] requires the model to generate a free-form answer to the question, with large-scale and high-quality data.
Text-to-SQL aims to convert a textual question under a database to executable structured query language (SQL). This task necessitates an understanding of both the syntactic structure of natural language and the semantic requirements of SQL to accurately map user queries to database operations. Spider [22] is the first multi-domain, multi-table benchmark on the text-to-SQL task.
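To make the four task formats concrete, the toy instances below (constructed for illustration only; not drawn from any benchmark) show the typical input and expected output form of each task.

```python
# Toy instances of the four task types (illustrative only; not from any dataset).
toy_table = {
    "header": ["Player", "Team", "Points"],
    "rows": [["A. Smith", "Lions", "31"], ["B. Jones", "Bears", "27"]],
}

examples = {
    "table_qa": {
        "table": toy_table,
        "question": "Who scored the most points?",
        "answer": "A. Smith",                       # short span
    },
    "table_fact_verification": {
        "table": toy_table,
        "claim": "B. Jones scored over 30 points.",
        "label": "Refutes",                         # class label
    },
    "table_to_text": {
        "table": toy_table,
        "question": "Summarize the scoring.",
        "answer": "A. Smith led with 31 points, ahead of B. Jones with 27.",  # free-form text
    },
    "text_to_sql": {
        "schema": "players(name, team, points)",
        "question": "Which player scored the most points?",
        "sql": "SELECT name FROM players ORDER BY points DESC LIMIT 1",       # executable query
    },
}
```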

2.4 Table formats

In this subsection, we categorize tables by their formats as shown in Fig.4, and summarize the current mainstream table reasoning datasets, including their table formats, in Tab.1. We also specify the challenges of different table formats, because different formats may call for different solutions in practical application scenarios [38-40]; we discuss them here to help readers understand and distinguish them.
Fig.4 Examples of the four table formats. (a) Matrix table; (b) hierarchical table; (c) info box; (d) database tables


Tab.1 Information on the current table reasoning datasets. For text-to-SQL, #Table denotes the number of databases
Task Dataset #Example #Table Domain Table format Cross tables
Table QA WikiTableQuestions [2] 18,496 2,108 Wikipedia Matrix table ×
TabMWP [23] 38,431 7,549 Math Matrix table ×
HiTab [24] 10,672 3,597 Wikipedia Hierarchical table ×
AIT-QA [25] 515 113 Airline industry Hierarchical table ×
Table fact verification SciTab [26] 1,225 216 Scientific Matrix table ×
TabFact [3] 117,854 16,573 Wikipedia Matrix table ×
InfoTabS [27] 23,738 2,540 Wikipedia Info-Box ×
Table-to-text FeTaQA [21] 10,330 992 Wikipedia Matrix table ×
ToTTo [28] 136,161 94,385 Wikipedia Hierarchical table ×
QTSumm [29] 7,111 2,934 Wikipedia Hierarchical table ×
SciGen [30] 53,136 47,866 Scientific Hierarchical table ×
LogicNLG [31] 37,015 7,392 Wikipedia Matrix table ×
WikiTableT [32] 1,462,678 840,586 Wikipedia Info-Box ×
Text-to-SQL WikiSQL [33] 80,654 26,521 Wikipedia Database table
Spider [22] 10,181 200 Wikipedia Database table
CoSQL [34] 15,598 200 Wikipedia Database table
SQUALL [35] 11,468 1,679 Wikipedia Database table
KaggleDBQA [36] 272 8 Wikipedia Database table
BIRD [37] 12,751 95 Wikipedia Database table

2.4.1 Matrix table

The matrix table refers to a table in which each row name/column name corresponds to one row/column. Generally, the first column presents row names and the first row shows column names, e.g., the tables in WikiTableQuestions [2]. Matrix tables are widely used in various fields because of their simple and common structure. However, because the format is so general, representing other types of tables as matrix tables causes a certain degree of structural information loss [38].
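As a concrete illustration, the sketch below shows one common way to linearize a matrix table into plain text before feeding it to an LLM; the row-by-row, pipe-separated scheme is only one simple assumption, and the exact serialization format varies across works.

```python
def linearize_matrix_table(header, rows):
    """Serialize a matrix table row by row into plain text for LLM input.

    Flattening richer formats (e.g., hierarchical headers) into this layout
    can lose structural information, as discussed above.
    """
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)

print(linearize_matrix_table(
    ["Year", "Champion", "Score"],
    [["2021", "Lions", "3-1"], ["2022", "Bears", "2-0"]],
))
# Year | Champion | Score
# 2021 | Lions | 3-1
# 2022 | Bears | 2-0
```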

2.4.2 Hierarchical table

The hierarchical table is a table with row or column names corresponding to multiple rows or columns, e.g., the tables in AIT-QA [25]. Hierarchical tables are often used in numerical-intensive fields to represent the same attributes of different objects, such as income in different years, results in different experiments, etc. The cells in hierarchical tables often correspond to multiple rows and columns, increasing the difficulty of understanding the table and locating the cells relevant to the user query. Also, LLMs struggle to capture multi-level relationships in hierarchical tables since LLMs can only accept sequential input [38].

2.4.3 Info-box

The info-box is a table that presents the information of one entity and has only two columns, e.g., the tables in InfoTabS [27]. Encyclopedias often contain info-boxes, for example introducing the personal information of an actor. Because info-boxes revolve around a single entity, the rows are more closely related, and reasoning across multiple rows is more often required [27].

2.4.4 Database table

Different from the above table formats, the data stored in a database is usually structured, e.g., the tables in Spider [22]. Databases are widely used in finance, education, medical care, and other fields. It is usually necessary to convert user queries in natural language into database query languages, such as SQL. This not only requires a correct understanding of the user query and linking to the entities in the database, but also requires generating SQL with the correct syntactic structure [41-43].

3 What techniques can improve table reasoning performance in the LLM era

There are significant differences between the model abilities of the pre-LLM era and the LLM era, leading to changes in the mainstream techniques [4]. To help researchers better transition from the pre-LLM era to the LLM era, in this section, we discuss the mainstream techniques in the LLM era from two aspects: 1. the techniques following the pre-LLM era (Section 3.1) and 2. the techniques unique to the LLM era (Section 3.2). We categorize the table reasoning methods into five categories based on the techniques they use, as shown in Fig.5. Then, we introduce the methods and highlight the changes in each technique, aiming to explain how to utilize the mainstream techniques in the LLM era.
Fig.5 The mainstream techniques that can be utilized to improve table reasoning performance in the LLM era


3.1 Mainstream techniques following pre-LLMs

Despite the considerable change in research brought about by LLMs, many pre-LLM techniques can still be applied to LLM. Therefore, we introduce the mainstream techniques following the pre-LLM era in this subsection.

3.1.1 Supervised fine-tuning

Supervised fine-tuning refers to fine-tuning the LLM with annotated data to enhance the table reasoning capability. Since some open-source small-scale LLMs are weak in solving table tasks [5], researchers utilize the supervised fine-tuning technique to enhance their performance without incurring the high costs associated with training larger models from scratch.
Existing works on supervised fine-tuning of LLM table reasoning include two types: 1. leveraging pre-existing or manually labeled data, and 2. leveraging distilled data generated by LLMs. Focusing on pre-existing or manually labeled data, to better complete the table reasoning task, TableGPT [44] fine-tunes the LLM by constructing instruction datasets. Considering the lack of generalization of previous work, TableLlama [5] constructs the training data by selecting representative table task datasets. Noting that annotating SQL data is too challenging, APEL [6] proposes a method to annotate SQL, which generates the database according to schema and judges SQL correctness based on execution results.
Focusing on distilled data, [45] observes that the performance of open-source models lags behind that of LLMs on table-to-text tasks; therefore, this work utilizes the LLM as a teacher model to distill rationales and table descriptions and fine-tunes the open-source model with the distilled data. Besides, HELLaMA [46] is concerned that some models cannot locate the evidence based on the inputs, so it uses other LLMs to obtain training data that predicts where the labeled description is located, and then fine-tunes models with it. To enhance multi-step table reasoning ability, ProTrix [47] proposes to use an LLM to generate instruction tuning data following the Plan-then-Reason framework, and then uses the data to fine-tune models. To increase the diversity of training data, TableLLM [48] proposes to use an LLM to augment the training set by generating questions and answers according to the given table and verifying the correctness of the answers.
Pre-existing or manually labeled data and distilled data embody the two main approaches to obtaining training data in the LLM era. Pre-existing datasets are generally of high quality but limited to certain domains and tasks, whereas distilled data is less restricted but suffers from lower quality. Therefore, how to significantly enhance the quality of distilled data with as little manual intervention as possible is an urgent issue to study.
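As a minimal sketch of how labeled or distilled table QA examples can be wrapped into instruction-tuning records, the snippet below uses a generic instruction/input/output schema; the field names and prompt wording are illustrative assumptions, since works such as TableGPT [44] and TableLlama [5] each define their own formats.

```python
def to_instruction_record(table_text, question, answer,
                          instruction="Answer the question using the table."):
    """Wrap one labeled (or distilled) table QA example into an
    instruction-tuning record with a generic instruction/input/output schema."""
    return {
        "instruction": instruction,
        "input": f"Table:\n{table_text}\n\nQuestion: {question}",
        "output": answer,
    }

record = to_instruction_record(
    "Year | Champion\n2021 | Lions\n2022 | Bears",
    "Who won in 2022?",
    "Bears",
)
# Records like this, mixed across multiple table tasks, can then be fed to a
# standard supervised fine-tuning pipeline.
```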
Highlight The supervised fine-tuning methods in the pre-LLM era, limited by the model capabilities, cannot generalize to unseen tasks [49]. In contrast, in the LLM era, researchers design instruction-based and multi-task data to fine-tune the model, enhancing the table reasoning ability to generalize to different tasks, even tasks not seen in the training phase.

3.1.2 Result ensemble

Result ensemble denotes improving table reasoning ability by selecting the most suitable answer from multiple results generated by the LLM. Because models of both the pre-LLM era and the LLM era can fail to maintain correct results under slight disturbances (e.g., random number seeds, meaningless words in questions), leading to performance degradation [8], researchers continue to utilize the result ensemble technique from the pre-LLM era. By generating multiple results and then carefully evaluating and selecting the most suitable one, this method effectively reduces the impact of disturbances and enhances the overall accuracy and robustness of the table reasoning task.
Existing methods of result ensemble in the LLM era mainly focus on two problems: 1. how to obtain diverse results for a question, and 2. how to select the correct result among the multiple results.
Considering the work of obtaining diverse results, SQLPrompt [7] notes that the low diversity of results under a fixed prompt and model causes the results to concentrate on specific incorrect answers, so it proposes to generate results with multiple prompts and models. Mix Self-Consistency [50] finds that natural language and programs have different advantages in solving the table reasoning task, so it proposes to aggregate the results of the two reasoning pathways to improve table reasoning performance.
Regarding the work of selecting the correct result, Lever [8] trains a verifier to score each generated answer and selects the result with the highest score as the answer. To select the correct one from multiple candidate SQL queries, [51] proposes to construct test cases by generating new databases and using the LLM to predict the execution results, so that the test cases can distinguish all SQL queries with different execution results.
Methods addressing these two problems can each enhance ensemble performance independently. Therefore, the two problems can be addressed together to further improve the table reasoning performance of LLMs.
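The sketch below illustrates the two steps in their simplest form: sampling several candidate answers from an LLM and either majority-voting over them (self-consistency style) or scoring them with a verifier as in Lever [8]. The `sample_answer` and `verifier` callables are assumptions standing in for an actual LLM call and a trained scoring model.

```python
from collections import Counter

def ensemble_answer(prompt, sample_answer, n_samples=10, verifier=None):
    """Sample several candidate answers and return a single ensembled one.

    `sample_answer(prompt)` is an assumed callable that queries an LLM once
    (with sampling enabled); if a `verifier` scoring function is given, the
    highest-scoring candidate is chosen, otherwise a simple majority vote
    (self-consistency style) is used.
    """
    candidates = [sample_answer(prompt) for _ in range(n_samples)]
    if verifier is not None:
        return max(candidates, key=verifier)
    return Counter(candidates).most_common(1)[0][0]
```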
Highlight Compared with pre-LLM methods, LLMs can generate more diverse results in more, and simpler, ways. For example, LLMs can obtain diverse results by only changing the instruction without changing the question, while pre-LLM methods have to ensure that the instructions used in fine-tuning and inference are aligned [52].

3.2 Mainstream techniques unique to LLMs

In the LLM era, in addition to the mainstream techniques following the pre-LLM era, there are also techniques unique to LLMs due to emergent abilities [4]. We present three typical emergent abilities mentioned in previous research [4].

3.2.1 In-context learning

In-context learning refers to making the model generate the expected answer by using a suitable natural language instruction and multiple demonstrations (that is, the prompt), without requiring additional training or gradient updates [4]. Since LLM performance is significantly affected by the prompt [10,53], researchers employ the in-context learning technique by designing prompts to solve the table reasoning task directly. By incorporating specific examples in the prompt that illustrate how to interpret and analyze the table, the model can better grasp the context and apply similar reasoning to generate accurate responses for the given task.
Regarding the works utilizing the in-context learning ability in the table reasoning task, [53] is the first to explore and demonstrate that LLMs can reason about tables with in-context learning. ODIS [9] observes that in-domain demonstrations can improve model performance, so it synthesizes in-domain SQL based on SQL similarity. To address the challenge of demonstration selection, DAIL-SQL [10] and [54] select demonstrations based on masked question similarity and SQL similarity, respectively. To better parse complex tables, [38] proposes to represent each table cell as a tuple containing rich information in the input. TAP4LLM [55] notices that tables can contain noisy and ambiguous information; therefore, it decomposes the table and then augments the sub-tables. ACT-SQL [56] finds that existing rationale annotation methods consume intensive resources, so it uses a rule-based schema linking method to generate rationales. To obtain hybrid-domain data, [57] proposes to adopt table-to-text generation methods, using Markdown formatting, template serialization, a TPLM-based method, and an LLM-based method, respectively.
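A minimal sketch of assembling an in-context learning prompt is shown below: an instruction, a few (table, question, answer) demonstrations, and the query table and question. The serialization and wording are illustrative assumptions; selecting the demonstrations by similarity to the query is one strategy studied in DAIL-SQL [10] and [54].

```python
def build_few_shot_prompt(demonstrations, table_text, question,
                          instruction="Answer the question based on the table."):
    """Assemble an in-context learning prompt: instruction, demonstrations, query.

    `demonstrations` is a list of (table_text, question, answer) triples; how
    they are chosen (e.g., by similarity to the query) is left to the caller.
    """
    parts = [instruction]
    for demo_table, demo_question, demo_answer in demonstrations:
        parts.append(f"Table:\n{demo_table}\nQuestion: {demo_question}\nAnswer: {demo_answer}")
    parts.append(f"Table:\n{table_text}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)
```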
Highlight Because the models in the pre-LLM era can only learn fixed types of prompts through fine-tuning, it is hard to flexibly adjust prompts to enhance the reasoning performance of the model [49]. Due to the in-context learning capability, LLMs can use various prompts that are suitable for different questions without further fine-tuning, which greatly reduces labeling overhead while enhancing performance.

3.2.2 Instruction design

Instruction design denotes utilizing LLMs to solve tasks unseen during the training phase by designing instruction descriptions, relying on the instruction-following ability of LLMs [4]. In the table reasoning task, researchers utilize the instruction design technique to solve the task indirectly by instructing the LLM to complete multiple decomposed sub-tasks, which may be novel and require the model to learn from the instructions. Existing works using instruction design on table reasoning with LLMs fall into two types of methods: 1. based on modular decomposition, and 2. based on tool use.
Researchers find that it is easier to complete decomposed sub-tasks than to complete the whole table reasoning task [42], and LLMs can generalize to different sub-tasks through instruction following, thereby improving performance on the table reasoning task via modular decomposition. Both DATER [11] and DIN-SQL [42] note that decomposing table reasoning can effectively facilitate multi-step inference, so they design pipelines for the table reasoning task to reduce the difficulty of inference. TableQAKit [58] identifies that table QA tasks face different data and task forms, hindering ease of research, so it divides the table QA task into a configuration module, a data module, a model module, and an evaluation module. In the open-domain setting, CRUSH4SQL [59], OpenTab [60], and DB-GPT [61] decompose the task into two distinct phases, retrieving and reasoning, to alleviate the increased difficulty caused by extraneous irrelevant information. DBCopilot [62] notices that retrieval can suffer from diverse expressions and vocabulary mismatch, so it decomposes the task into first generating the question-relevant schema instead of retrieving, and then reasoning. MAC-SQL [63] finds that the limited context window, single-pass generation, and the lack of verification result in poor performance, so it modularly decomposes the task into three modules to address these problems. To enhance the model's understanding of tables, Self-augmented prompting [39] first prompts the model to select the most critical table contents and generate a natural language description of the table accordingly, which helps the model understand the table and complete the reasoning. To solve the text-to-SQL problem in the medical field, EHRAgent [64] proposes to first generate code to extract relevant medical knowledge, then gradually solve the problem, and finally combine environmental feedback to iteratively debug the generated SQL. To reduce the cost of classifying tables with LLMs, FeatLLM [65] uses the LLM to summarize classification rules and employs the summarized rules to classify the tables.
Faced with the decomposed sub-tasks of table reasoning, the LLM, despite maintaining acceptable performance on most sub-tasks, does not excel at solving all of them (e.g., retrieval, numerical reasoning) [66], so researchers instruct the LLM to invoke diverse tools to solve some sub-tasks, which is the method of tool use. StructGPT [67] observes that the amount of structured data is too large to input to the model, so it provides different interfaces to extract multiple types of data, and the model obtains valid data by calling the appropriate interfaces. To explore and evaluate the action and reasoning capacities of LLMs, [68] proposes the long-form database question-answering task, where LLMs need to decide an interaction strategy by reasoning and then generate interaction commands to invoke external models. To extend the model capabilities for various table QA tasks, [66] enables querying knowledge and performing extra tabular operations by calling other LLM APIs. Also, some works focus on making tools and then employing them. Binder [12], noting that existing neural-symbolic works are model- and language-specific and require large training data, proposes to utilize the LLM to parse the sub-questions that are not translatable into the target program, such as SQL, and then invoke the LLM to solve those sub-questions. Recognizing the challenge of automatically transforming an arbitrary table in response to the question, ReAcTable [69] proposes to leverage the LLM to generate a sequence of functions, which are then executed to produce intermediate tables and ultimately yield the answer.
In summary, the methods of modular decomposition and tool use can be combined. Specifically, when solving the task with multiple modules, each module can enhance performance by employing tools. For example, in the retrieval module, a program can be used to filter out the rows unrelated to the user question, as sketched below.
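The sketch below illustrates this retrieve-then-reason combination under simple assumptions: a rule-based retrieval module keeps only rows that overlap lexically with the question, and the reduced sub-table is handed to an assumed LLM-backed reasoning callable.

```python
def filter_relevant_rows(header, rows, question):
    """Rule-based retrieval module: keep rows sharing at least one token with the question."""
    question_tokens = set(question.lower().split())
    kept = [row for row in rows
            if question_tokens & set(" ".join(map(str, row)).lower().split())]
    return kept or rows  # fall back to the full table if nothing matches

def retrieve_then_reason(header, rows, question, reason_with_llm):
    """Modular pipeline: a retrieval module shrinks the table, then an assumed
    LLM-backed reasoning module answers over the serialized sub-table."""
    sub_rows = filter_relevant_rows(header, rows, question)
    sub_table = "\n".join([" | ".join(header)] +
                          [" | ".join(map(str, r)) for r in sub_rows])
    return reason_with_llm(sub_table, question)
```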
Highlight The pre-LLMs do not have the instruction-following capability due to their weak generalization, so researchers had to train separate models for each sub-task when using modular decomposition to solve table reasoning tasks [70]. Also, it was hard to flexibly use or make diverse tools in the pre-LLM era [4]. In contrast, LLMs can achieve superior performance without being individually fine-tuned for each sub-task or tool, saving training overhead.

3.2.3 Step-by-step reasoning

Step-by-step reasoning refers to solving complex reasoning tasks by employing prompt mechanisms that incorporate intermediate reasoning stages; the term refers to both the technique and the capability [4]. In the context of the table reasoning task, step-by-step reasoning means that the LLM automatically decomposes a complex question into multiple simpler sub-questions. This technique is distinct from modular decomposition, where researchers manually break down tasks into widely different sub-tasks. By using step-by-step reasoning, the LLM incrementally works through each sub-question, ensuring a more thorough and accurate interpretation and analysis of the table data. This technique enhances the capability of handling complicated queries by systematically addressing each component of the query, thereby leading to more robust and reliable output [71].
MURMUR [13] notices that prompting the LLM to reason step by step lacks explicit conditions between reasoning steps, so it proposes to first select the potentially correct models at each step and then select the best one based on a scoring model. Chain-of-Table [14], to reduce the difficulty of single-hop reasoning, provides predefined table operations, from which the LLM selects and executes one operation at each step.
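A minimal sketch in the spirit of Chain-of-Table [14] is given below: at each step an assumed LLM-backed `choose_operation` callable picks one predefined single-hop table operation (or decides to answer), and the operation is executed to produce the next intermediate table. The specific operations and their arguments are illustrative assumptions, not the original method's full operation set.

```python
def chain_of_table(table, question, choose_operation, max_steps=5):
    """Step-by-step table reasoning loop with predefined single-hop operations.

    `table` is a dict with "header" and "rows"; `choose_operation(table, question)`
    is an assumed LLM-backed callable that returns an operation name and its
    argument, or ("answer", final_answer) once the table is simple enough.
    """
    operations = {
        "select_rows": lambda t, idx: {**t, "rows": [t["rows"][i] for i in idx]},
        "select_columns": lambda t, idx: {
            "header": [t["header"][i] for i in idx],
            "rows": [[row[i] for i in idx] for row in t["rows"]],
        },
    }
    for _ in range(max_steps):
        op_name, arg = choose_operation(table, question)
        if op_name == "answer":
            return arg  # final answer read off the reduced table
        table = operations[op_name](table, arg)
    return None  # give up if the chain does not terminate
```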
Highlight The methods of the pre-LLM era do not have the capability for step-by-step reasoning, so it is difficult for them to improve performance on complex table reasoning by decomposing the complex question into several easy-to-verify sub-tasks. In contrast, LLMs can decompose the reasoning into multiple steps, where the difficulty of each step is lower than that of the full question, thereby decreasing the complexity of table reasoning.

3.3 Expanding application

In addition to solving the table reasoning task with mainstream technology, in this subsection, we analyze the requirements of table reasoning tasks in real-life scenarios and present expandable directions accordingly.

3.3.1 Multi-modal table reasoning

The multi-modal setting requires the model to encompass automated comprehension, classification, and extraction of information from textual, visual, and other forms of evidence [72]. Because many tables are stored as images in actual scenarios, and direct Optical Character Recognition (OCR) causes information loss due to recognition errors, Multi-modal Large Language Models (MLLMs) are needed to better understand and reason about table images [15]. To comprehensively and strictly evaluate the performance of MLLMs on multi-modal table reasoning tasks, ChartX [15] constructs an evaluation set in the chart domain and develops ChartVLM, which can automatically select different cascaded decoders according to the difficulty of the instruction. To improve the efficiency of chart reasoning, TinyChart [73] proposes a 3B model, which is trained to generate Python programs to reduce the overhead of numerical calculation and employs a Vision Token Merging module to merge similar vision features. Also, mChartQA [74] proposes to integrate the chart text generated by a chart-to-text engine when inputting the text and visual information, to better align and process complex visual and textual information. To enhance the multimodal table understanding ability of the model, [75] builds the MMTab dataset, including a variety of table images and instruction-following data, and develops Table-LLaVA with a pre-training task of converting table images into textual tables, followed by instruction fine-tuning.

3.3.2 Multi-turn table reasoning

Dialogue systems are designed to converse with humans as assistants through conversational interactions [76]. When interacting with users, there can be problems such as incorrect model results and ambiguous questions, which require multiple turns to correct [34,77]. To handle ambiguous or unanswerable questions in the multi-turn text-to-SQL task, QDA-SQL [16] proposes a data augmentation method that generates diverse multi-turn questions and answers, including answerable, ambiguous, and unanswerable questions, with verification and modification mechanisms to ensure data correctness, and fine-tunes the model with the augmented data.

3.3.3 Retrieval-augmented table reasoning

Retrieval-augmented generation (RAG) denotes retrieving reasoning-related information from a large number of documents before reasoning [78]. Since the table reasoning task often faces knowledge-intensive scenarios in which the in-domain knowledge of the LLM is insufficient, future work should focus on enhancing table reasoning capabilities by retrieving knowledge [79]. To address this challenge, Knowledge-to-SQL [17] proposes to train a Data Expert LLM (DELLM), which generates relevant knowledge based on the question and the given database tables to help generate SQL.
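The sketch below shows a deliberately simple version of this idea for text-to-SQL: rank knowledge snippets by word overlap with the question and prepend the top ones to the generation prompt. The word-overlap retriever and the prompt layout are assumptions for illustration; real systems typically use dense retrievers, and Knowledge-to-SQL [17] instead trains a dedicated data expert LLM to generate the knowledge.

```python
def retrieve_knowledge(question, knowledge_base, top_k=3):
    """Toy retriever: rank knowledge snippets by word overlap with the question."""
    question_tokens = set(question.lower().split())
    ranked = sorted(knowledge_base,
                    key=lambda doc: len(question_tokens & set(doc.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_rag_sql_prompt(question, schema, knowledge_base):
    """Prepend retrieved knowledge to a text-to-SQL prompt before generation."""
    knowledge = "\n".join(retrieve_knowledge(question, knowledge_base))
    return (f"Relevant knowledge:\n{knowledge}\n\n"
            f"Database schema:\n{schema}\n\n"
            f"Question: {question}\nSQL:")
```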

3.4 Why LLMs excel at table reasoning

LLMs surpass pre-LLMs in the table reasoning task across different benchmarks, as shown in Tab.2. We analyze the key insights behind this from the perspectives of structure understanding and schema linking, the two main challenges of table reasoning [82].
Tab.2 The best performance of pre-LLM and LLM methods on different benchmarks. WikiTQ denotes WikiTableQuestions
Benchmarks pre-LLM LLM
WikiTQ (accuracy) 0.63 [80] 0.74 [50]
TabFact (accuracy) 0.85 [81] 0.93 [11]
FeTaQA (ROUGE-1) 0.65 [49] 0.71 [69]
Spider (execution accuracy) 0.80 [41] 0.87 [10]

3.4.1 Why are LLMs proficient at structure understanding

Structure understanding means understanding the table schema (e.g., columns, rows) and their relationships, which provides the key evidence and necessary context information for decoding [82]. LLMs demonstrate enhanced capability in understanding structure compared to pre-LLMs [38], primarily for the following two reasons. 1. The code parsing ability at which LLMs excel can benefit table understanding, because both require recognizing hierarchical structure from plain input (e.g., a linearized table to a structured table, contextualized code to structured code) [66]. 2. The spatial understanding capability of LLMs helps the model understand structured tables, since both require imagining a two-dimensional spatial structure from a linear description [83].

3.4.2 Why LLMs perform well in schema linking

Schema linking refers to aligning the entities mentioned in the question with the entities in the tables [82]. Compared to pre-LLMs, LLMs exhibit superior schema linking capability for two primary reasons: 1. Due to the step-by-step reasoning ability, LLMs can simplify the linking from sentence-level to span-level by decomposing the complete question and filtering the irrelevant context [42]. 2. The dense commonsense knowledge of LLMs is superior for domain-specific or abbreviation schema linking [84], compared with the sparse knowledge of PLMs [85].

4 How to enhance table reasoning ability in the future

To promote table reasoning research in the LLM era, we discuss future research directions in this section. Specifically, we analyze the shortcomings and possible improvements of existing works on the table reasoning task under each category of Section 3.

4.1 Supervised fine-tuning: establishing diverse training data

Due to the strong generalization of LLMs, researchers should construct diverse data for multiple table tasks when performing supervised fine-tuning of LLMs to improve the overall performance on table reasoning tasks. As discussed in Section 3.1.1, current methods based on pre-existing or manually labeled data simply mix diverse data from different table tasks as training data to fine-tune the LLMs. However, the proportion of different tasks in the training data has a significant impact on model performance [86]. Future work should balance the diverse training data from multiple tasks in different proportions to explore the optimal proportion for optimizing the table reasoning capabilities of fine-tuned LLMs [87,88]. For example, to optimize the data mixture proportion effectively, future works can explore the impact of different proportions of diverse data on performance and summarize the rules, or fine-tune small-scale models to predict the performance of fine-tuning the LLM [89].
Apart from labeled data, existing distilled-data methods only focus on certain features or specific tasks, resulting in a lack of diversity in the distilled data, so the table reasoning performance of the model cannot be comprehensively improved by fine-tuning with the distilled data. Therefore, it is worth exploring how to distill diverse data for different tasks to improve the comprehensive ability and generalization of LLMs in table reasoning tasks [90,91].

4.2 Result ensemble: sampling results more efficiently

To obtain the correct answer after the ensemble, researchers should focus on how to sample the possible result space effectively. The main purpose of obtaining multiple results is to widen the sampling space so that the correct answer can be sampled multiple times. However, existing works do not consider changing the demonstrations in the prompt to improve the correctness of the results, even though the demonstrations have a significant impact on the table reasoning performance of LLMs. Given this significant impact, it is imperative for future research to explore methodologies for altering the demonstrations included in the prompts [92]. For example, researchers can employ LLMs to synthesize more demonstrations according to the table, or retrieve examples similar to the table or query from the demonstration pool, thus obtaining diverse results [10,93,94].
Current studies on selecting the correct answer only rely on the final result and do not take into account that the number of possible results increases exponentially with the number of reasoning steps, making it difficult to sample the correct answer in an exponentially large search space. Future work should narrow the search space by selecting the correct reasoning path at each step, and then selecting the correct answer based on the searched path [95]. This process involves developing robust algorithms and methodologies that can accurately evaluate and prioritize potential reasoning pathways. By doing so, the selection process can be streamlined, reducing unnecessary computational effort and improving overall efficiency.
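A minimal sketch of such step-wise path selection is given below: instead of scoring only complete answers, an assumed LLM-backed `expand` callable proposes next steps for each partial reasoning path, and an assumed `score` function keeps only the most promising paths at every step, which keeps the search space from growing exponentially.

```python
def stepwise_search(initial_path, expand, score, steps=3, beam=2):
    """Keep only the top-scoring partial reasoning paths at each step.

    `expand(path)` proposes candidate extensions of a partial path and
    `score(path)` estimates how promising it is; both are assumed to be
    backed by an LLM or a trained verifier.
    """
    paths = [initial_path]
    for _ in range(steps):
        candidates = [new_path for path in paths for new_path in expand(path)]
        if not candidates:
            break
        paths = sorted(candidates, key=score, reverse=True)[:beam]
    return max(paths, key=score)  # best path after the final step
```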

4.3 In-context learning: optimizing prompts automatically

Since the in-context learning performance of LLMs relies heavily on prompts, researchers should focus on how to automatically optimize prompts for table reasoning based on the question. Prompt design research on single-step reasoning compares candidate prompts from a limited range of human-labeled instructions and examples, so the resulting performance improvement is also limited. To design better table reasoning prompts, future work should automatically generate and optimize the prompt with LLMs [96]. Specifically, the LLM itself can be used as a scorer and optimizer for prompts.

4.4 Instruction design: automatically refining design with verification

Following the discussion in Section 3.2.2, how to make fuller use of the instruction-following capability to reduce the difficulty of each table reasoning question deserves the attention of researchers. Current methods of modular decomposition require manually decomposing the task into different modules in advance. However, such a decomposition only applies to a certain table task, while a fixed decomposition applicable to all table tasks is too general and does not reduce the difficulty of reasoning well. Therefore, rather than specifying the decomposition for a particular table task, future work should automatically decompose the task according to the question, which would suit all table tasks without human involvement and greatly reduce the difficulty of single-step reasoning [97]. Such advancements could ensure that the system is adaptable to a diverse range of table reasoning tasks, thereby enhancing its efficiency and robustness.
For the methods of tool using, current works do not notice that the process of invoking tools may cause extra errors in the table reasoning process. Future work should include a tool verification process that prompts the LLMs to revise the tools to ensure that the tools can be applied correctly in the table reasoning task, thereby enhancing the accuracy [98]. For example, LLMs can generate multiple test cases, including the table, query, and answer to determine the correctness of the tool [51].
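A small sketch of such a verification loop is shown below: LLM-generated test cases (each a table, a query, and an expected answer) are run against the tool, and the failing cases can then be fed back to the LLM to revise the tool. The test-case format and the `tool` callable are illustrative assumptions.

```python
def verify_tool(tool, test_cases):
    """Run generated test cases against a tool and collect the failures.

    Each test case is a (table, query, expected_answer) triple; the failing
    cases can be fed back to the LLM so it revises the tool before the tool
    is used in the table reasoning pipeline.
    """
    failures = []
    for table, query, expected_answer in test_cases:
        if tool(table, query) != expected_answer:
            failures.append((table, query, expected_answer))
    return failures
```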

4.5 Step-by-step reasoning: mitigating the error cascade in multi-step reasoning

Existing studies on step-by-step reasoning overlook the issue of error cascades in table reasoning, resulting in incorrect intermediate outcomes that propagate errors in subsequent reasoning steps. The Tree-of-Thought (ToT) [99] approach mitigates this issue by preserving multiple potential intermediate steps in multi-step reasoning. Therefore, exploring the application of ToT to table reasoning tasks warrants future research.

5 Conclusion

In this paper, we summarize existing research on table reasoning with LLMs. In the LLM era, the supervised fine-tuning and result ensemble methods following the pre-LLM era are still effective. Besides, the in-context learning, instruction following, and step-by-step reasoning techniques unique to the LLM era can also be used to improve table reasoning performance. To inspire future research, we explore potential future directions for improving table reasoning performance. Finally, we summarize the current resources for table reasoning on GitHub and will continue to update them.

Xuanliang Zhang is a graduate student of Harbin Institute of Technology, China, where she is a member of Language Analysis Group of HIT-SCIR Lab under the supervision of Prof. Wanxiang Che. Her research interest is table reasoning

Dingzirui Wang is a PhD student of Harbin Institute of Technology, China, where he is a member of Language Analysis Group of HIT-SCIR Lab under the supervision of Prof. Wanxiang Che. His research interest is text-to-SQL semantic parsing and mathematical reasoning

Longxu Dou is a PhD student of Harbin Institute of Technology, China, where he is a member of Language Analysis Group of HIT-SCIR Lab under the supervision of Prof. Wanxiang Che. His research interest is text-to-SQL semantic parsing, which could greatly facilitate the interaction between database and data analyst

Qingfu Zhu is an assistant professor of School of Computer Science and Technology, Harbin Institute of Technology, China. He is a joint training PhD of the University of California, Santa Barbara, USA. His main research directions include natural language processing, code generation, and pre-training language models

Wanxiang Che is a professor of School of Computer Science and Technology, Harbin Institute of Technology, China. He is the vice director of Research Center for Social Computing and Information Retrieval. He is a young scholar of “Heilongjiang Scholar” and a visiting scholar of Stanford University, USA. He is currently the vice director and secretary-general of the Computational Linguistics Professional Committee of the Chinese Information Society of China; Officer and Secretary of AACL Executive Board; a senior member of the China Computer Federation (CCF)

References

[1]
Jin N, Siebert J, Li D, Chen Q. A survey on table question answering: recent advances. In: Proceedings of the 7th China Conference on Knowledge Graph and Semantic Computing: Knowledge Graph Empowers the Digital Economy. 2022, 174–186
[2]
Pasupat P, Liang P. Compositional semantic parsing on semi-structured tables. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2015, 1470–1480
[3]
Chen W, Wang H, Chen J, Zhang Y, Wang H, Li S, Zhou X, Wang W Y. TabFact: a large-scale dataset for table-based fact verification. In: Proceedings of the 8th International Conference on Learning Representations. 2020
[4]
Zhao W X, Zhou K, Li J, Tang T, Wang X, Hou Y, Min Y, Zhang B, Zhang J, Dong Z, Du Y, Yang C, Chen Y, Chen Z, Jiang J, Ren R, Li Y, Tang X, Liu Z, Liu P, Nie J Y, Wen J R. A survey of large language models. 2023, arXiv preprint arXiv: 2303.18223
[5]
Zhang T, Yue X, Li Y, Sun H. TableLlama: towards open large generalist models for tables. In: Proceedings of 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024, 6024−6044
[6]
Zhong R, Snell C, Klein D, Eisner J. Non-programmers can label programs indirectly via active examples: a case study with text-to-SQL. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 5126−5152
[7]
Sun R, Arik S, Sinha R, Nakhost H, Dai H, Yin P, Pfister T. SQLPrompt: in-context text-to-SQL with minimal labeled data. In: Proceedings of the Association for Computational Linguistics: EMNLP 2023. 2023, 542−550
[8]
Ni A, Iyer S, Radev D, Stoyanov V, Yih W T, Wang S, Lin X V. LEVER: learning to verify language-to-code generation with execution. In: Proceedings of the 40th International Conference on Machine Learning. 2023, 26106−26128
[9]
Chang S, Fosler-Lussier E. Selective demonstrations for cross-domain text-to-SQL. In: Proceedings of the Association for Computational Linguistics: EMNLP 2023. 2023, 14174−14189
[10]
Gao D, Wang H, Li Y, Sun X, Qian Y, Ding B, Zhou J . Text-to-SQL empowered by large language models: a benchmark evaluation. Proceedings of the VLDB Endowment, 2024, 17( 5): 1132–1145
[11]
Ye Y, Hui B, Yang M, Li B, Huang F, Li Y. Large language models are versatile decomposers: decomposing evidence and questions for table-based reasoning. In: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2023, 174−184
[12]
Cheng Z, Xie T, Shi P, Li C, Nadkarni R, Hu Y, Xiong C, Radev D, Ostendorf M, Zettlemoyer L, Smith N A, Yu T. Binding language models in symbolic languages. In: Proceedings of the 11th International Conference on Learning Representations. 2023
[13]
Saha S, Yu X, Bansal M, Pasunuru R, Celikyilmaz A. MURMUR: modular multi-step reasoning for semi-structured data-to-text generation. In: Proceedings of the Association for Computational Linguistics: ACL 2023. 2023, 11069−11090
[14]
Wang Z, Zhang H, Li C L, Eisenschlos J M, Perot V, Wang Z, Miculicich L, Fujii Y, Shang J, Lee C Y, Pfister T. Chain-of-table: evolving tables in the reasoning chain for table understanding. In: Proceedings of the 12th International Conference on Learning Representations. 2024
[15]
Xia R, Zhang B, Ye H, Yan X, Liu Q, Zhou H, Chen Z, Dou M, Shi B, Yan J, Qiao Y. ChartX & ChartVLM: a versatile benchmark and foundation model for complicated chart reasoning. 2024, arXiv preprint arXiv: 2402.12185
[16]
Sun Y, Guo Z, Yu H, Liu C, Li X, Wang B, Yu X, Zhao T. QDA-SQL: questions enhanced dialogue augmentation for multi-turn text-to-SQL. 2024, arXiv preprint arXiv: 2406.10593
[17]
Hong Z, Yuan Z, Chen H, Zhang Q, Huang F, Huang X. Knowledge-to-SQL: enhancing SQL generation with data expert LLM. In: Proceedings of the Association for Computational Linguistics ACL 2024. 2024, 10997−11008
[18]
Aly R, Guo Z, Schlichtkrull M S, Thorne J, Vlachos A, Christodoulopoulos C, Cocarascu O, Mittal A. The fact extraction and VERification over unstructured and structured information (FEVEROUS) shared task. In: Proceedings of the 4th Workshop on Fact Extraction and VERification (FEVER). 2021, 1−13
[19]
Wuehrl A, Menchaca Resendiz Y, Grimminger L, Klinger R. What makes medical claims (UN)verifiable? Analyzing entity and relation properties for fact verification. In: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2024, 2046−2058
[20]
Guo Z, Schlichtkrull M, Vlachos A . A survey on automated fact-checking. Transactions of the Association for Computational Linguistics, 2022, 10: 178–206
[21]
Nan L, Hsieh C, Mao Z, Lin X V, Verma N, Zhang R, Kryściński W, Schoelkopf H, Kong R, Tang X, Mutuma M, Rosand B, Trindade I, Bandaru R, Cunningham J, Xiong C, Radev D, Radev D . FeTaQA: free-form table question answering. Transactions of the Association for Computational Linguistics, 2022, 10: 35–49
[22]
Yu T, Zhang R, Yang K, Yasunaga M, Wang D, Li Z, Ma J, Li I, Yao Q, Roman S, Zhang Z, Radev D. Spider: a large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In: Proceedings of 2018 Conference on Empirical Methods in Natural Language Processing. 2018, 3911−3921
[23]
Lu P, Qiu L, Chang K W, Wu Y N, Zhu S C, Rajpurohit T, Clark P, Kalyan A. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In: Proceedings of the 11th International Conference on Learning Representations. 2023
[24]
Cheng Z, Dong H, Wang Z, Jia R, Guo J, Gao Y, Han S, Lou J G, Zhang D. HiTab: a hierarchical table dataset for question answering and natural language generation. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022, 1094−1110
[25]
Katsis Y, Chemmengath S, Kumar V, Bharadwaj S, Canim M, Glass M, Gliozzo A, Pan F, Sen J, Sankaranarayanan K, Chakrabarti S. AIT-QA: question answering dataset over complex tables in the airline industry. In: Proceedings of 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track. 2022, 305−314
[26]
Lu X, Pan L, Liu Q, Nakov P, Kan M Y. SCITAB: a challenging benchmark for compositional reasoning and claim verification on scientific tables. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 7787−7813
[27]
Gupta V, Mehta M, Nokhiz P, Srikumar V. INFOTABS: inference on tables as semi-structured data. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, 2309−2324
[28]
Parikh A, Wang X, Gehrmann S, Faruqui M, Dhingra B, Yang D, Das D. ToTTo: a controlled table-to-text generation dataset. In: Proceedings of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020, 1173−1186
[29]
Zhao Y, Qi Z, Nan L, Mi B, Liu Y, Zou W, Han S, Chen R, Tang X, Xu Y, Radev D, Cohan A. QTSumm: query-focused summarization over tabular data. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 1157−1172
[30]
Moosavi N, Rücklé A, Roth D, Gurevych I. SciGen: a dataset for reasoning-aware text generation from scientific tables. In: Proceedings of the 1st Neural Information Processing Systems Track on Datasets and Benchmarks. 2021
[31]
Chen W, Chen J, Su Y, Chen Z, Wang W Y. Logical natural language generation from open-domain tables. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, 7929−7942
[32]
Chen M, Wiseman S, Gimpel K. WikiTableT: a large-scale data-to-text dataset for generating Wikipedia article sections. In: Proceedings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021, 193−209
[33]
Zhong V, Xiong C, Socher R. Seq2SQL: generating structured queries from natural language using reinforcement learning. 2017, arXiv preprint arXiv: 1709.00103
[34]
Yu T, Zhang R, Er H, Li S, Xue E, Pang B, Lin X V, Tan Y C, Shi T, Li Z, Jiang Y, Yasunaga M, Shim S, Chen T, Fabbri A, Li Z, Chen L, Zhang Y, Dixit S, Zhang V, Xiong C, Socher R, Lasecki W, Radev D. CoSQL: a conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. In: Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019, 1962−1979
[35]
Shi T, Zhao C, Boyd-Graber J, Daumé III H, Lee L. On the potential of Lexico-logical alignments for semantic parsing to SQL queries. In: Proceedings of the Association for Computational Linguistics: EMNLP 2020. 2020, 1849−1864
[36]
Lee C H, Polozov O, Richardson M. KaggleDBQA: realistic evaluation of text-to-SQL parsers. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021, 2261−2273
[37]
Li J, Hui B, Qu G, Yang J, Li B, Li B, Wang B, Qin B, Geng R, Huo N, Zhou X, Ma C, Li G, Chang K C C, Huang F, Cheng R, Li Y. Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2024, 1835
[38]
Zhao B, Ji C, Zhang Y, He W, Wang Y, Wang Q, Feng R, Zhang X. Large language models are complex table parsers. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 14786−14802
[39]
Sui Y, Zhou M, Zhou M, Han S, Zhang D. Table meets LLM: can large language models understand structured table data? A benchmark and empirical study. In: Proceedings of the 17th ACM International Conference on Web Search and Data Mining. 2024, 645−654
[40]
Singha A, Cambronero J, Gulwani S, Le V, Parnin C. Tabular representation, noisy operators, and impacts on table structure understanding tasks in LLMs. In: Proceedings of NeurIPS 2023 Second Table Representation Learning Workshop. 2023
[41]
Li H, Zhang J, Li C, Chen H. RESDSQL: decoupling schema linking and skeleton parsing for text-to-SQL. In: Proceedings of the 37th AAAI Conference on Artificial Intelligence. 2023, 13067−13075
[42]
Pourreza M, Rafiei D. DIN-SQL: decomposed in-context learning of text-to-SQL with self-correction. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 1577
[43]
Wang D, Dou L, Zhang X, Zhu Q, Che W. DAC: decomposed automation correction for text-to-SQL. 2024, arXiv preprint arXiv: 2408.08779
[44]
Zha L, Zhou J, Li L, Wang R, Huang Q, Yang S, Yuan J, Su C, Li X, Su A, Zhang T, Zhou C, Shou K, Wang M, Zhu W, Lu G, Ye C, Ye Y, Ye W, Zhang Y, Deng X, Xu J, Wang H, Chen G, Zhao J. TableGPT: towards unifying tables, nature language and commands into one GPT. 2023, arXiv preprint arXiv: 2307.08674
[45]
Yang B, Tang C, Zhao K, Xiao C, Lin C. Effective distillation of table-based reasoning ability from LLMs. In: Proceedings of 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024, 5538−5550
[46]
Bian J, Qin X, Zou W, Huang M, Zhang W. HELLaMA: LLaMA-based table to text generation by highlighting the important evidence. 2023, arXiv preprint arXiv: 2311.08896
[47] Wu Z, Feng Y. ProTrix: building models for planning and reasoning over tables with sentence context. 2024, arXiv preprint arXiv: 2403.02177
[48] Zhang X, Zhang J, Ma Z, Li Y, Zhang B, Li G, Yao Z, Xu K, Zhou J, Zhang-Li D, Yu J, Zhao S, Li J, Tang J. TableLLM: enabling tabular data manipulation by LLMs in real office usage scenarios. 2024, arXiv preprint arXiv: 2403.19318
[49] Xie T, Wu C H, Shi P, Zhong R, Scholak T, Yasunaga M, Wu C S, Zhong M, Yin P, Wang S I, Zhong V, Wang B, Li C, Boyle C, Ni A, Yao Z, Radev D, Xiong C, Kong L, Zhang R, Smith N A, Zettlemoyer L, Yu T. UnifiedSKG: unifying and multi-tasking structured knowledge grounding with text-to-text language models. In: Proceedings of 2022 Conference on Empirical Methods in Natural Language Processing. 2022, 602−631
[50] Liu T, Wang F, Chen M. Rethinking tabular data understanding with large language models. In: Proceedings of 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024, 450−482
[51] Li Z, Xie T. Using LLM to select the right SQL query from candidates. 2024, arXiv preprint arXiv: 2401.02115
[52] Gan Y, Chen X, Huang Q, Purver M, Woodward J R, Xie J, Huang P. Towards robustness of text-to-SQL models against synonym substitution. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021, 2505−2515
[53] Chen W. Large language models are few(1)-shot table reasoners. In: Proceedings of the Association for Computational Linguistics: EACL 2023. 2023, 1120−1130
[54] Nan L, Zhao Y, Zou W, Ri N, Tae J, Zhang E, Cohan A, Radev D. Enhancing text-to-SQL capabilities of large language models: a study on prompt design strategies. In: Proceedings of the Association for Computational Linguistics: EMNLP 2023. 2023, 14935−14956
[55] Sui Y, Zou J, Zhou M, He X, Du L, Han S, Zhang D. TAP4LLM: table provider on sampling, augmenting, and packing semi-structured data for large language model reasoning. 2023, arXiv preprint arXiv: 2312.09039
[56] Zhang H, Cao R, Chen L, Xu H, Yu K. ACT-SQL: in-context learning for text-to-SQL with automatically-generated chain-of-thought. In: Proceedings of the Association for Computational Linguistics: EMNLP 2023. 2023, 3501−3532
[57] Min D, Hu N, Jin R, Lin N, Chen J, Chen Y, Li Y, Qi G, Li Y, Li N, Wang Q. Exploring the impact of table-to-text methods on augmenting LLM-based question answering with domain hybrid data. In: Proceedings of 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track). 2024, 464−482
[58] Lei F, Luo T, Yang P, Liu W, Liu H, Lei J, Huang Y, Wei Y, He S, Zhao J, Liu K. TableQAKit: a comprehensive and practical toolkit for table-based question answering. 2023, arXiv preprint arXiv: 2310.15075
[59] Kothyari M, Dhingra D, Sarawagi S, Chakrabarti S. CRUSH4SQL: collective retrieval using schema hallucination for Text2SQL. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 14054−14066
[60] Kong K, Zhang J, Shen Z, Srinivasan B, Lei C, Faloutsos C, Rangwala H, Karypis G. OpenTab: advancing large language models as open-domain table reasoners. In: Proceedings of the 12th International Conference on Learning Representations. 2024
[61] Xue S, Jiang C, Shi W, Cheng F, Chen K, Yang H, Zhang Z, He J, Zhang H, Wei G, Zhao W, Zhou F, Qi D, Yi H, Liu S, Chen F. DB-GPT: empowering database interactions with private large language models. 2024, arXiv preprint arXiv: 2312.17449
[62] Wang T, Lin H, Han X, Sun L, Chen X, Wang H, Zeng Z. DBCopilot: scaling natural language querying to massive databases. 2023, arXiv preprint arXiv: 2312.03463
[63] Wang B, Ren C, Yang J, Liang X, Bai J, Zhang Q W, Yan Z, Li Z. MAC-SQL: multi-agent collaboration for text-to-SQL. 2023, arXiv preprint arXiv: 2312.11242
[64] Shi W, Xu R, Zhuang Y, Yu Y, Zhang J, Wu H, Zhu Y, Ho J, Yang C, Wang M D. EHRAgent: code empowers large language models for few-shot complex tabular reasoning on electronic health records. 2024, arXiv preprint arXiv: 2401.07128
[65] Han S, Yoon J, Arik S O, Pfister T. Large language models can automatically engineer features for few-shot tabular learning. In: Proceedings of the 41st International Conference on Machine Learning. 2024
[66] Cao Y, Chen S, Liu R, Wang Z, Fried D. API-assisted code generation for question answering on varied table structures. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 14536−14548
[67] Jiang J, Zhou K, Dong Z, Ye K, Zhao X, Wen J R. StructGPT: a general framework for large language model to reason over structured data. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 9237−9251
[68] Nan L, Zhang E, Zou W, Zhao Y, Zhou W, Cohan A. On evaluating the integration of reasoning and action in LLM agents with database question answering. In: Proceedings of the Association for Computational Linguistics: NAACL 2024. 2024, 4556−4579
[69] Zhang Y, Henkel J, Floratou A, Cahoon J, Deep S, Patel J M. ReAcTable: enhancing ReAct for table question answering. Proceedings of the VLDB Endowment, 2024, 17(8): 1981–1994
[70] Dou L, Gao Y, Pan M, Wang D, Che W, Lou J G, Zhan D. UNISAR: a unified structure-aware autoregressive language model for text-to-SQL semantic parsing. International Journal of Machine Learning and Cybernetics, 2023, 14(12): 4361–4376
[71] Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, Chi E H, Le Q V, Zhou D. Chain-of-thought prompting elicits reasoning in large language models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 1800
[72] Cui L, Xu Y, Lv T, Wei F. Document AI: benchmarks, models and applications. 2021, arXiv preprint arXiv: 2111.08609
[73] Zhang L, Hu A, Xu H, Yan M, Xu Y, Jin Q, Zhang J, Huang F. TinyChart: efficient chart understanding with visual token merging and program-of-thoughts learning. 2024, arXiv preprint arXiv: 2404.16635
[74] Wei J, Xu N, Chang G, Luo Y, Yu B, Guo R. mChartQA: a universal benchmark for multimodal chart question answer based on vision-language alignment and reasoning. 2024, arXiv preprint arXiv: 2404.01548
[75] Zheng M, Feng X, Si Q, She Q, Lin Z, Jiang W, Wang W. Multimodal table understanding. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024, 9102−9124
[76] Ni J, Young T, Pandelea V, Xue F, Cambria E. Recent advances in deep learning based dialogue systems: a systematic survey. Artificial Intelligence Review, 2023, 56(4): 3055–3155
[77] Yu T, Zhang R, Yasunaga M, Tan Y C, Lin X V, Li S, Er H, Li I, Pang B, Chen T, Ji E, Dixit S, Proctor D, Shim S, Kraft J, Zhang V, Xiong C, Socher R, Radev D. SParC: cross-domain semantic parsing in context. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, 4511−4523
[78] Gao Y, Xiong Y, Gao X, Jia K, Pan J, Bi Y, Dai Y, Sun J, Guo Q, Wang M, Wang H. Retrieval-augmented generation for large language models: a survey. 2024, arXiv preprint arXiv: 2312.10997
[79] Dou L, Gao Y, Liu X, Pan M, Wang D, Che W, Zhan D, Kan M Y, Lou J G. Towards knowledge-intensive text-to-SQL semantic parsing with formulaic knowledge. In: Proceedings of 2022 Conference on Empirical Methods in Natural Language Processing. 2022, 5240−5253
[80] Jiang Z, Mao Y, He P, Neubig G, Chen W. OmniTab: pretraining with natural and synthetic data for few-shot table-based question answering. In: Proceedings of 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022, 932−942
[81] Zhao G, Yang P. Table-based fact verification with self-labeled keypoint alignment. In: Proceedings of the 29th International Conference on Computational Linguistics. 2022, 1401−1411
[82] Yin P, Neubig G, Yih W T, Riedel S. TaBERT: pretraining for joint understanding of textual and tabular data. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, 8413−8426
[83] Bubeck S, Chandrasekaran V, Eldan R, Gehrke J, Horvitz E, Kamar E, Lee P, Lee Y T, Li Y, Lundberg S, Nori H, Palangi H, Ribeiro M T, Zhang Y. Sparks of artificial general intelligence: early experiments with GPT-4. 2023, arXiv preprint arXiv: 2303.12712
[84] Liu Q, Yang D, Zhang J, Guo J, Zhou B, Lou J G. Awakening latent grounding from pretrained language models for semantic parsing. In: Proceedings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021, 1174−1189
[85] Yu W, Iter D, Wang S, Xu Y, Ju M, Sanyal S, Zhu C, Zeng M, Jiang M. Generate rather than retrieve: large language models are strong context generators. In: Proceedings of the 11th International Conference on Learning Representations. 2023
[86] Longpre S, Hou L, Vu T, Webson A, Chung H W, Tay Y, Zhou D, Le Q V, Zoph B, Wei J, Roberts A. The flan collection: designing data and methods for effective instruction tuning. In: Proceedings of the 40th International Conference on Machine Learning. 2023, 22631−22648
[87] Liu W, Zeng W, He K, Jiang Y, He J. What makes good data for alignment? A comprehensive study of automatic data selection in instruction tuning. In: Proceedings of the 12th International Conference on Learning Representations. 2024
[88] Bukharin A, Zhao T. Data diversity matters for robust instruction tuning. 2024, arXiv preprint arXiv: 2311.14736
[89] Liu Q, Zheng X, Muennighoff N, Zeng G, Dou L, Pang T, Jiang J, Lin M. RegMix: data mixture as regression for language model pre-training. 2024, arXiv preprint arXiv: 2407.01492
[90] Xu C, Sun Q, Zheng K, Geng X, Zhao P, Feng J, Tao C, Jiang D. WizardLM: empowering large language models to follow complex instructions. 2023, arXiv preprint arXiv: 2304.12244
[91] Luo Z, Xu C, Zhao P, Sun Q, Geng X, Hu W, Tao C, Ma J, Lin Q, Jiang D. WizardCoder: empowering code large language models with evol-instruct. In: Proceedings of the 12th International Conference on Learning Representations. 2024
[92] Chang T Y, Jia R. Data curation alone can stabilize in-context learning. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023, 8123−8144
[93] Tonglet J, Reusens M, Borchert P, Baesens B. SEER: a knapsack approach to exemplar selection for in-context HybridQA. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 13569−13583
[94] Wang D, Dou L, Zhang X, Zhu Q, Che W. Improving demonstration diversity by human-free fusing for text-to-SQL. 2024, arXiv preprint arXiv: 2402.10663
[95] Xie Y, Kawaguchi K, Zhao Y, Zhao X, Kan M Y, He J, Xie M Q. Self-evaluation guided beam search for reasoning. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 1802
[96] Yang C, Wang X, Lu Y, Liu H, Le Q V, Zhou D, Chen X. Large language models as optimizers. In: Proceedings of the 12th International Conference on Learning Representations. 2024
[97] Yao S, Zhao J, Yu D, Du N, Shafran I, Narasimhan K R, Cao Y. ReAct: synergizing reasoning and acting in language models. In: Proceedings of the 11th International Conference on Learning Representations. 2023
[98] Cai T, Wang X, Ma T, Chen X, Zhou D. Large language models as tool makers. In: Proceedings of the 12th International Conference on Learning Representations. 2024
[99] Yao S, Yu D, Zhao J, Shafran I, Griffiths T L, Cao Y, Narasimhan K R. Tree of thoughts: deliberate problem solving with large language models. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 517

Acknowledgment

We thank all anonymous reviewers for their constructive comments. We gratefully acknowledge the support of the National Natural Science Foundation of China (NSFC) (Grant Nos. 62236004, 62206078, 62441603, and 62476073).

Competing interests

The authors declare that they have no competing interests or financial conflicts to disclose.

Rights and permissions

© 2025 Higher Education Press