Date: 2024-03-12



Posted by: Zilong Wang (Student Researcher), Chen-Yu Lee (Research Scientist, Cloud AI Team)

People use tables every day to organize and interpret complex information in a structured and easily accessible format. Because of the ubiquity of such tables, reasoning on tabular data has long been a central topic in natural language processing (NLP). Researchers in this field have aimed to leverage language models to help users answer questions, validate statements, and analyze data based on tables. However, because language models are trained using large amounts of plain text, it can be difficult for language models to fully understand and take advantage of the inherently structured nature of tabular data.

Recently, large language models (LLMs) have achieved outstanding performance across diverse natural language understanding (NLU) tasks by generating reliable reasoning chains, as shown in works such as Chain-of-Thought and Least-to-Most. However, the most suitable way for LLMs to reason over tabular data remains an open question.

In “Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding,” we propose a framework for tackling table understanding tasks, in which we train LLMs to outline their reasoning step by step, iteratively updating a given table to reflect each part of the thought process, much as people solve table-based problems. This enables the LLM to transform the table into simpler, more manageable segments so that it can understand and analyze each part of the table in depth. This approach yields significant improvements and achieves new state-of-the-art results on the WikiTQ, TabFact, and FeTaQA benchmarks. The figure below gives a high-level overview of the proposed Chain-of-Table and other methods.

Given a complex table in which a cyclist's nationality and name share the same cell, (a) generic multi-step reasoning cannot provide the correct answer; (b) program-aided reasoning generates and executes a program (such as an SQL query) to deliver an answer, but falls short of answering the question accurately. In contrast, (c) Chain-of-Table iteratively samples a chain of operations that effectively transform the complex table into a version tailored to the question.

Chain-of-Table

Chain-of-Table uses in-context learning to guide the LLM to iteratively generate operations and update the table, so that the sequence of tables represents its reasoning chain over tabular data. This allows the LLM to dynamically plan the next operation based on the results of the previous ones. The continuous evolution of the table forms a chain that makes the reasoning process for a given problem more structured and explicit, enabling more accurate and reliable predictions from the LLM.

For example, when asked, “Which actor has won the most NAACP Image Awards?”, the Chain-of-Table framework prompts the LLM to generate tabular operations that mirror a tabular reasoning process. It first identifies the relevant columns. Then, it aggregates the rows based on shared content. Finally, it sorts the aggregated results to produce a final table that clearly answers the question posed.
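The three operations in this example can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`select_columns`, `group_by_count`, `sort_by`) are hypothetical, and the table rows are invented placeholders rather than real award data.

```python
from collections import Counter

# Toy table invented for illustration only (not real award data).
# Each row stands for one award win.
table = [
    {"Year": 2001, "Actor": "Actor A", "Award": "Image Award"},
    {"Year": 2002, "Actor": "Actor B", "Award": "Image Award"},
    {"Year": 2003, "Actor": "Actor A", "Award": "Image Award"},
    {"Year": 2004, "Actor": "Actor C", "Award": "Image Award"},
    {"Year": 2005, "Actor": "Actor A", "Award": "Image Award"},
    {"Year": 2006, "Actor": "Actor B", "Award": "Image Award"},
]


def select_columns(rows, columns):
    """Keep only the columns that matter for the question."""
    return [{c: row[c] for c in columns} for row in rows]


def group_by_count(rows, column):
    """Aggregate rows that share the same value in `column`."""
    counts = Counter(row[column] for row in rows)
    return [{column: value, "Count": n} for value, n in counts.items()]


def sort_by(rows, column, descending=True):
    """Order rows by `column` so the answer sits in the first row."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)


# Each step yields an intermediate table, forming the chain of tables.
step1 = select_columns(table, ["Actor"])
step2 = group_by_count(step1, "Actor")
step3 = sort_by(step2, "Count")
top_actor = step3[0]["Actor"]
print(step3)
```

Each intermediate table is smaller and closer to the question than the one before it, which is the property the chain relies on.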

These operations transform the table to suit the question posed. To balance performance and computational cost on large tables, we construct the operation chain from a subset of the table's rows. Meanwhile, the step-by-step operations reveal the underlying reasoning process by exposing the intermediate results of each tabular operation, which improves interpretability.

Diagram of the tabular inference process in Chain-of-Table. This iterative process involves dynamically planning the chain of operations and accurately storing intermediate results in the transformed table. These intermediate tables serve as a tabular thought process to guide the LLM more reliably to the correct answer.

Chain-of-Table consists of three main stages. The first stage instructs the LLM to dynamically plan the next operation through in-context learning. Specifically, the prompt includes three components, as shown in the following image.

Question Q: “Which country had the most cyclists in the top 3?”
Operation history chain: f_add_col(Country) and f_select_row(1, 2, 3)
Latest intermediate table T: the transformed intermediate table

Given the triplet (T, Q, chain) in the prompt, the LLM can observe the previous tabular reasoning process and select the next operation from the operation pool, completing the reasoning chain step by step.
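To make the triplet concrete, here is a rough sketch of how such a planning prompt could be assembled. The prompt wording, the pipe-separated table layout, and the helper names are all assumptions made for illustration, since the post does not give the exact prompt format. The pool entries `f_select_col` and `f_sort_by` are likewise assumed names that follow the post's naming style, and the table rows are placeholders.

```python
def format_table(rows):
    """Serialize a list-of-dicts table as pipe-separated text (assumed layout)."""
    headers = list(rows[0].keys())
    lines = [" | ".join(headers)]
    lines += [" | ".join(str(row[h]) for h in headers) for row in rows]
    return "\n".join(lines)


def format_chain(chain):
    """Render the operation history, e.g. f_add_col(Country) -> f_select_row(1, 2, 3)."""
    steps = [f"{op}({', '.join(map(str, args))})" for op, args in chain]
    return " -> ".join(steps) or "(empty)"


def build_planning_prompt(table, question, chain, operation_pool):
    """Combine T, Q, and the chain into one prompt that asks for the next operation."""
    return "\n".join([
        "Operation pool: " + ", ".join(operation_pool),
        "",
        "Table T:",
        format_table(table),
        "",
        "Question Q: " + question,
        "Operation history: " + format_chain(chain),
        "Next operation:",
    ])


# Placeholder intermediate table after the two operations in the history.
intermediate_table = [
    {"Rank": 1, "Cyclist": "Rider A (ESP)", "Country": "ESP"},
    {"Rank": 2, "Cyclist": "Rider B (RUS)", "Country": "RUS"},
    {"Rank": 3, "Cyclist": "Rider C (ITA)", "Country": "ITA"},
]
chain = [("f_add_col", ["Country"]), ("f_select_row", [1, 2, 3])]
pool = ["f_add_col", "f_select_row", "f_select_col", "f_group_by", "f_sort_by"]

prompt = build_planning_prompt(
    intermediate_table,
    "Which country had the most cyclists in the top 3?",
    chain,
    pool,
)
print(prompt)
```

In a real system the few-shot demonstrations used for in-context learning would be prepended to this prompt; they are omitted here.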

Illustration of how Chain-of-Table selects the next operation from the operation pool and generates the arguments for that operation. (a) Chain-of-Table samples the next operation from the operation pool. (b) It takes the selected operation as input and generates its arguments.

After the next operation f has been determined, its arguments must be generated in the second stage. As before, Chain-of-Table considers three components in the prompt, as shown in the figure: (1) the question, (2) the selected operation and its required arguments, and (3) the latest intermediate table.

For example, if operation f_group_by is selected, it requires a header name as its argument.

The LLM selects a suitable header from the table. Using the selected operation and the generated arguments, Chain-of-Table executes the operation and constructs a new intermediate table for the subsequent reasoning.

Chain-of-Table repeats the previous two stages to plan the next operation and generate its arguments. During this process, we build an operation chain that acts as a proxy for the tabular reasoning steps. These operations produce intermediate tables that present the result of each step to the LLM. Consequently, the output table contains comprehensive information about the intermediate phases of tabular reasoning. In the final stage, we use this output table to formulate the final query and prompt the LLM, together with the question, for the final answer.
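The overall control flow (plan an operation, generate its arguments, execute it, repeat, then query for the answer) can be sketched as a short loop. This is an assumption-laden outline, not the published implementation: the `plan`, `generate_args`, and `answer` callables stand in for LLM calls and are scripted here so the example runs offline, the `END` marker and `max_steps` limit are hypothetical, and the toy table is invented.

```python
from collections import Counter

END = "<END>"  # hypothetical marker meaning "the chain is complete"


def f_select_col(rows, columns):
    return [{c: row[c] for c in columns} for row in rows]


def f_group_by(rows, header):
    counts = Counter(row[header] for row in rows)
    return [{header: value, "Count": n} for value, n in counts.items()]


def f_sort_by(rows, header):
    return sorted(rows, key=lambda row: row[header], reverse=True)


OPERATION_POOL = {
    "f_select_col": f_select_col,
    "f_group_by": f_group_by,
    "f_sort_by": f_sort_by,
}


def chain_of_table(table, question, plan, generate_args, answer, max_steps=8):
    """Iteratively transform `table`, then ask for the final answer."""
    chain = []
    for _ in range(max_steps):
        op = plan(table, question, chain)           # stage 1: dynamic planning
        if op == END:
            break
        args = generate_args(table, question, op)   # stage 2: argument generation
        table = OPERATION_POOL[op](table, *args)    # execute and update the table
        chain.append((op, args))
    return answer(table, question), chain           # stage 3: final query


# Scripted stand-ins for the LLM calls (assumptions for this demo only).
SCRIPT = [
    ("f_select_col", (["Actor"],)),
    ("f_group_by", ("Actor",)),
    ("f_sort_by", ("Count",)),
]


def scripted_plan(table, question, chain):
    return SCRIPT[len(chain)][0] if len(chain) < len(SCRIPT) else END


def scripted_args(table, question, op):
    return dict(SCRIPT)[op]


def scripted_answer(table, question):
    return table[0]["Actor"]


toy_table = [
    {"Year": 2001, "Actor": "Actor A"},
    {"Year": 2002, "Actor": "Actor B"},
    {"Year": 2003, "Actor": "Actor A"},
]
final_answer, chain = chain_of_table(
    toy_table, "Which actor has won the most awards?",
    scripted_plan, scripted_args, scripted_answer,
)
print(final_answer, [op for op, _ in chain])
```

Passing the three LLM-backed steps in as callables keeps the loop itself independent of any particular model or prompt format.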

Experimental setup

We use PaLM 2-S and GPT 3.5 as the backbone LLMs and conduct experiments on three public table understanding benchmarks: WikiTQ, TabFact, and FeTaQA. WikiTQ and FeTaQA are table-based question answering datasets. TabFact is a table-based fact verification benchmark. This blog post focuses on the WikiTQ and TabFact results. We compare Chain-of-Table with generic reasoning methods (e.g., End-to-End QA, Few-Shot QA, and Chain-of-Thought) and program-aided methods (e.g., Text-to-SQL, Binder, and Dater).

More accurate answers

Compared with generic reasoning methods and program-aided reasoning methods, Chain-of-Table achieves better performance with both PaLM 2 and GPT 3.5. We attribute this to the dynamically sampled operations and the informative intermediate tables.

Table understanding results on WikiTQ and TabFact with PaLM 2 and GPT 3.5, compared with various models.

Better robustness on harder questions

In Chain-of-Table, a longer operation chain indicates a more difficult and complex question and corresponding table. We categorize the test samples according to the length of their operation chain in Chain-of-Table. We compare Chain-of-Table with Chain-of-Thought and Dater, as representative generic and program-aided reasoning methods, respectively. We illustrate this using the PaLM 2 results on WikiTQ.
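As a rough sketch of this analysis, test samples can be bucketed by operation chain length and scored per bucket. The record fields (`chain_length`, `correct`) are assumed names, and the records below are made-up placeholders, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_chain_length(samples):
    """Return {chain_length: accuracy} over samples with that chain length."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for sample in samples:
        length = sample["chain_length"]
        totals[length] += 1
        correct[length] += int(sample["correct"])
    return {length: correct[length] / totals[length] for length in sorted(totals)}


# Placeholder evaluation records, invented for illustration.
samples = [
    {"chain_length": 1, "correct": True},
    {"chain_length": 1, "correct": True},
    {"chain_length": 2, "correct": True},
    {"chain_length": 2, "correct": False},
]
per_length = accuracy_by_chain_length(samples)
print(per_length)
```

Running the same function on each method's predictions, with the buckets fixed by the Chain-of-Table chain lengths, gives the per-length comparison described above.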

Performance of Chain-of-Thought, Dater, and the proposed Chain-of-Table on WikiTQ for questions that require operation chains of various lengths. Our proposed atomic operations significantly improve performance over the generic and program-aided reasoning counterparts.

Notably, Chain-of-Table consistently outperforms both baseline methods across all operation chain lengths, with a margin of up to 11.6% over Chain-of-Thought and up to 7.9% over Dater. Furthermore, the performance of Chain-of-Table declines more gracefully as the number of operations increases than that of the baseline methods, showing only a minimal decrease when the number of operations increases from four to five.

Better robustness with larger tables

We categorize the WikiTQ tables into three groups based on token count: small (<2000 tokens), medium (2000 to 4000 tokens), and large (>4000 tokens). We then compare Chain-of-Table with Dater and Binder, the two latest and strongest baselines.
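The grouping rule can be written down directly. In this sketch the function name is hypothetical, and treating both boundary values (2000 and 4000) as medium is an assumption read off the stated ranges. How tokens are counted depends on the tokenizer used, which the post does not specify, so the function takes a precomputed token count.

```python
def size_group(num_tokens):
    """Map a table's token count to the small / medium / large group."""
    if num_tokens < 2000:
        return "small"
    if num_tokens <= 4000:
        return "medium"
    return "large"


# Example token counts chosen to hit each group and both boundaries.
groups = {n: size_group(n) for n in (500, 2000, 4000, 4001)}
print(groups)
```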

Performance of Binder, Dater, and the proposed Chain-of-Table on small (<2000 tokens), medium (2000 to 4000 tokens), and large (>4000 tokens) tables from WikiTQ. While performance decreases as the input table grows, Chain-of-Table degrades gracefully and achieves significant improvements over competing methods. (As above, underlined text denotes the second-best performance and bold denotes the best performance.)

As expected, performance decreases as the input table grows, because the model has to reason over longer contexts. Nevertheless, the performance of the proposed Chain-of-Table degrades gracefully, and it achieves a significant improvement of more than 10% over the second-best competing method when processing large tables. This demonstrates the effectiveness of the reasoning chain in handling long tabular inputs.

Conclusion

Our proposed Chain-of-Table method enhances the reasoning ability of LLMs by using the table structure to express the intermediate steps of table-based reasoning. It instructs the LLM to dynamically plan an operation chain according to the input table and its associated question. This evolving table design sheds new light on how to prompt LLMs for table understanding.

Acknowledgment

This research was conducted by Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, and Tomas Pfister. We thank Sergey Ioffe for his valuable feedback.

