Investigating effective LLM-based in-context tool use: what matters and how to improve

Yining ZHENG, Haiyang WEI, Jiahao LU, Linqi YIN, Yunke ZHANG, Chengguo XU, Hetao CUI, Tianxiang SUN, Shuang CHEN, Xipeng QIU

Front. Comput. Sci. ›› 2026, Vol. 20 ›› Issue (7) : 2007323. DOI: 10.1007/s11704-025-41365-6
Artificial Intelligence
RESEARCH ARTICLE


Abstract

Large Language Models (LLMs) have demonstrated the capability to use tools after training. However, there remains limited understanding of how best to enhance this ability. In this paper, we focus on in-context tool use in LLMs and investigate effective methods to enable and improve it. Through preliminary analysis, three key factors influencing in-context tool use are identified: (1) the number of tools, (2) the number of instances per tool, and (3) the model parameter size. We further construct RapidTools, a large, high-quality tool-use dataset, and use it to investigate these factors through two series of experiments that vary the number of tools and the number of instances per tool in the training data. The results show that increasing the model parameter size and the number of tools in the training data consistently improves performance, whereas increasing the number of instances per tool produces mixed effects. This work provides critical direction toward establishing a foundation for future research on tool use in LLMs.
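To make the setting concrete: in-context tool use typically works by placing tool specifications directly in the model's context and asking it to emit a structured call, rather than fine-tuning on each individual tool. The Python sketch below illustrates this under stated assumptions: the OpenAI-style JSON parameter schema, the tool names (get_weather, convert_currency), and the call format are illustrative inventions, not the paper's RapidTools format or prompt template.

```python
import json

# Illustrative tool schemas in an OpenAI-style function-calling format.
# Tool names and fields are hypothetical, not drawn from RapidTools.
TOOLS = [
    {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    {
        "name": "convert_currency",
        "description": "Convert an amount between two currencies.",
        "parameters": {
            "type": "object",
            "properties": {
                "amount": {"type": "number"},
                "src": {"type": "string"},
                "dst": {"type": "string"},
            },
            "required": ["amount", "src", "dst"],
        },
    },
]


def build_prompt(tools, query):
    """Place tool specifications directly in the context window so the
    model can select and call a tool without tool-specific fine-tuning."""
    spec = json.dumps(tools, indent=2)
    return (
        "You may call one of the following tools. Respond with a JSON\n"
        'object: {"tool": <name>, "arguments": {...}}.\n\n'
        f"Tools:\n{spec}\n\nUser: {query}\nAssistant:"
    )


def parse_tool_call(model_output):
    """Parse the model's JSON tool call into (tool_name, arguments)."""
    call = json.loads(model_output)
    return call["tool"], call["arguments"]


if __name__ == "__main__":
    prompt = build_prompt(TOOLS, "What's the weather in Shanghai?")
    # A real system would send `prompt` to an LLM; here we stand in for it
    # with a simulated completion to keep the sketch self-contained.
    simulated_output = '{"tool": "get_weather", "arguments": {"city": "Shanghai"}}'
    name, args = parse_tool_call(simulated_output)
    print(name, args)  # get_weather {'city': 'Shanghai'}
```

In this framing, the factors studied in the paper map directly onto the sketch: the number of tools corresponds to the length of TOOLS, and the number of instances per tool to how many such prompt-call pairs per tool appear in the training data.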

Keywords

large language model (LLM) / tool use / function calling / tool learning

Cite this article

Yining ZHENG, Haiyang WEI, Jiahao LU, Linqi YIN, Yunke ZHANG, Chengguo XU, Hetao CUI, Tianxiang SUN, Shuang CHEN, Xipeng QIU. Investigating effective LLM-based in-context tool use: what matters and how to improve. Front. Comput. Sci., 2026, 20(7): 2007323. DOI: 10.1007/s11704-025-41365-6



RIGHTS & PERMISSIONS

Higher Education Press
