1 Introduction
Large language models (LLMs) represent a significant breakthrough in AI, achieved by scaling up language models [1-9]. Recent LLMs, such as the GPT series [10,11], the PaLM series [12], OPT [13], and LLaMA [14], have shown impressive zero-shot performance. In addition, LLMs exhibit emergent abilities, including instruction following [15], chain-of-thought reasoning [16], and in-context learning [17], which have attracted increasing attention [18].
With the advancement of large language models, as shown in Fig. 1, various natural language processing (NLP) tasks (e.g., zero-shot mathematical reasoning [16,19], text summarization [20,21], machine translation [22,23], information extraction [24,25], and sentiment analysis [26,27]) can now be addressed through a unified generative paradigm, with remarkable success [1,28-30]. Moreover, some LLMs require no additional training data and can even surpass traditional models fine-tuned with supervised learning. This progress has driven an exponential growth of studies on LLMs for NLP, motivating us to investigate the following questions: (1) How are LLMs currently applied to NLP tasks in the literature? (2) Have traditional NLP tasks already been solved by LLMs? (3) What is the future of LLMs for NLP?
To answer the above questions, we present a comprehensive and detailed analysis of LLMs from the perspective of individual NLP tasks. The overarching goal of this work is to survey current developments in LLMs for NLP. To this end, we first introduce the relevant background and preliminaries. We then present a unified paradigm for LLMs in NLP: (1) the parameter-frozen paradigm, comprising (i) zero-shot learning and (ii) few-shot learning; and (2) the parameter-tuning paradigm, comprising (i) full-parameter tuning and (ii) parameter-efficient tuning, aiming to provide a unified perspective on the current progress of LLMs for NLP tasks:
● Parameter-frozen paradigm directly applies prompting to LLMs for NLP tasks without any parameter tuning. This category includes zero-shot and few-shot learning, depending on whether few-shot demonstrations are required.
● Parameter-tuning paradigm tunes the parameters of LLMs for NLP tasks. This category includes full-parameter and parameter-efficient tuning, depending on whether all model parameters are fine-tuned.
Finally, we conclude by identifying potential frontier areas for future research, along with the associated challenges to stimulate further exploration. In summary, this work offers the following contributions:
(1) First survey: We present the first comprehensive survey of Large Language Models (LLMs) for Natural Language Processing (NLP) tasks.
(2) New taxonomy: We introduce a new taxonomy including (i) parameter-frozen paradigm and (ii) parameter-tuning paradigm, which provides a unified view to understand LLMs for NLP tasks.
(3) New frontiers: We discuss emerging areas of research in LLMs for NLP and highlight the challenges associated with them, aiming to inspire future breakthroughs.
(4) Abundant resources: We create the first curated collection of LLM resources for NLP, including open-source implementations, relevant corpora, and a list of research papers. These resources are available at github.com/LightChen233/Awesome-LLM-for-NLP.
We hope this work will be a valuable resource for researchers and spur further advancements in the field of LLM-based NLP.
2 Background
As shown in Fig. 2, this section describes the background of parameter-frozen paradigm (§2.1) and parameter-tuning paradigm (§2.2).
2.1 Parameter-frozen paradigm
The parameter-frozen paradigm directly applies prompting to NLP tasks without any parameter tuning. As shown in Fig. 2(a), this category encompasses zero-shot learning and few-shot learning [10,31].
● Zero-shot learning
In zero-shot learning, LLMs leverage their instruction-following capability to solve NLP tasks based on a given instruction prompt, which is defined as:

$\mathcal{Y} = \mathrm{LLM}(\mathcal{X})$,

where $\mathcal{X}$ and $\mathcal{Y}$ denote the input and output of prompting, respectively.
● Few-shot learning
Few-shot learning uses in-context learning capabilities to solve NLP tasks by imitating few-shot demonstrations. Formally, given demonstrations $\mathcal{D} = \{(\mathcal{X}_i, \mathcal{Y}_i)\}_{i=1}^{k}$, the process of few-shot learning is defined as:

$\mathcal{Y} = \mathrm{LLM}(\mathcal{D}, \mathcal{X})$.
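To make the two parameter-frozen settings concrete, the sketch below assembles prompts for both cases. It is a minimal sketch: the `llm` callable is a placeholder for any text-in/text-out model API, and the dummy model at the bottom is purely illustrative.

```python
from typing import Callable, List, Tuple

def zero_shot(llm: Callable[[str], str], instruction: str, x: str) -> str:
    """Y = LLM(X): solve the task from the instruction prompt alone."""
    prompt = f"{instruction}\nInput: {x}\nOutput:"
    return llm(prompt)

def few_shot(llm: Callable[[str], str], instruction: str,
             demos: List[Tuple[str, str]], x: str) -> str:
    """Y = LLM(D, X): prepend k demonstrations D = {(x_i, y_i)}."""
    shots = "\n".join(f"Input: {xi}\nOutput: {yi}" for xi, yi in demos)
    prompt = f"{instruction}\n{shots}\nInput: {x}\nOutput:"
    return llm(prompt)

if __name__ == "__main__":
    dummy = lambda p: "positive"  # stand-in for a real LLM call
    print(zero_shot(dummy, "Classify the sentiment.", "Great movie!"))
    print(few_shot(dummy, "Classify the sentiment.",
                   [("Awful plot.", "negative")], "Great movie!"))
```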
2.2 Parameter-tuning paradigm
As shown in Fig. 2(b), the parameter-tuning paradigm involves adjusting LLM parameters for NLP tasks, covering both full-parameter and parameter-efficient tuning.
● Full-parameter tuning
In the full-parameter tuning approach, all parameters $\theta$ of the model are fine-tuned on the training dataset $\mathcal{T}$:

$\theta' = \arg\min_{\theta} \mathcal{L}(\mathrm{LLM}_{\theta}; \mathcal{T})$,

where $\mathrm{LLM}_{\theta'}$ is the fine-tuned model with the updated parameters $\theta'$.
● Parameter-efficient tuning
Parameter-efficient tuning (PET) involves adjusting a small set of existing parameters or incorporating additional tunable parameters (such as Bottleneck Adapter [32], Low-Rank Adaptation (LoRA) [33], Prefix-tuning [34], and QLoRA [35]) to efficiently adapt models to specific NLP tasks. Formally, parameter-efficient tuning freezes the backbone parameters $\theta$ and tunes a small set of parameters $\phi$ (with $|\phi| \ll |\theta|$), denoted as:

$\phi' = \arg\min_{\phi} \mathcal{L}(\mathrm{LLM}_{\theta,\phi}; \mathcal{T})$,

where $\phi'$ stands for the trained parameters.
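As an illustration of the parameter-efficient route, the following minimal sketch applies LoRA with the Hugging Face peft library; the gpt2 backbone and the hyperparameters are placeholder choices for illustration, not a prescription from the works surveyed here. Only the injected LoRA matrices (the $\phi$ above) receive gradients.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder backbone
config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"],
                    lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, config)  # freeze theta, inject trainable phi
model.print_trainable_parameters()     # typically well under 1% of all parameters
# ...then train as usual; only the LoRA matrices (phi) are updated.
```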
2.3 Comparison of paradigms
To further understand the advantages of the different paradigms, we summarize the resource consumption and performance of each paradigm in Table 1. Generally speaking, zero-shot learning offers the highest application efficiency, moderate improvements on in-domain tasks, and robust out-of-domain generalization. In contrast, few-shot learning typically yields superior in-domain performance relative to zero-shot learning; however, it demands greater computational resources, achieves lower overall efficiency, and exhibits reduced generalization to novel domains. Full-parameter tuning, when ample training data and resources are available, attains the best in-domain performance but at the expense of the least efficient deployment and the weakest transfer to out-of-domain settings. Finally, parameter-efficient tuning strikes a balance: with limited resources, it can match or exceed the performance of full-parameter tuning in certain cases, while offering higher efficiency and often improved generalization beyond the training domain.
3 Natural language understanding
As shown in Fig. 3, we first describe several typical natural language understanding tasks: Sentiment Analysis (§3.1), Information Extraction (§3.2), Dialogue Understanding (§3.3), and Table Understanding (§3.4).
3.1 Sentiment analysis
Sentiment analysis, a key task in natural language processing, identifies the emotional tone of a text, such as positive opinions or criticism [37].
3.1.1 Parameter-frozen paradigm
● Zero-shot learning
With the help of instruction tuning, LLMs are equipped with excellent zero-shot learning ability [38]. Recent studies [39] find that simple instructions can elicit ChatGPT's strong capabilities on a range of sentiment analysis tasks, such as sentiment classification and aspect-based sentiment analysis. Current mainstream LLMs also possess multilingual understanding, enabling them to analyze the sentiment conveyed in different languages based on sentiment lexicons [40]. Moreover, Du et al. [41] propose a prompting framework to evaluate and reveal LLMs' limitations in financial attribute reasoning for sentiment analysis, highlighting weaknesses in numerical reasoning and understanding.
● Few-shot learning
Few-shot prompting not only elicits in-context learning in LLMs but also conveys user intent more clearly. According to previous studies [39,42-44], incorporating exemplars into the prompts significantly boosts LLMs' performance on aspect-based sentiment analysis and emotion recognition tasks. Furthermore, Sun et al. [45] apply few-shot learning to more complex procedures, introducing a multi-LLM negotiation framework for deeper sentiment analysis.
3.1.2 Parameter-tuning paradigm
● Full-parameter tuning
Full-parameter instruction tuning has been shown to be an effective approach to bridging the gap between task-agnostic pre-training and task-specific inference. Specifically, Wang et al. [118] design unified sentiment instructions for various aspect-based sentiment analysis tasks to elicit the LLMs. Varia et al. [119] utilize task-specific sentiment instructions to fine-tune LLMs for inter-task dependency. Yang and Li [120] transform the visual input into plain text during prompt construction for instruction tuning. Moreover, Zhang et al. [46] conduct an empirical study of bigger LLMs (bLLMs) for sentiment analysis in software engineering, revealing their advantages in data-scarce scenarios and their limitations compared to fine-tuned smaller LLMs (sLLMs) when sufficient training data is available. These works demonstrate the potential of tuning LLMs for advanced sentiment analysis.
● Parameter-efficient tuning
Sentiment analysis techniques have numerous real-world applications, such as opinion mining [121]; efficiency is therefore a vital dimension for evaluating sentiment analysis methods. Qiu et al. [122] use LoRA to tune LLMs on SMILECHAT, an empathetic multi-turn conversation dataset, to develop emotional support dialogue systems.
3.2 Information extraction
Information extraction (IE) tasks aim to extract structured information from plain text, typically including relation extraction (RE), named entity recognition (NER), and event extraction (EE) [179].
3.2.1 Parameter-frozen paradigm
● Zero-shot learning
Inspired by the impressive capabilities of LLMs on various tasks, recent studies [24,47] explore zero-shot prompting methods that solve IE tasks by leveraging the knowledge embedded in LLMs. Wei et al. [24], Xie et al. [48], and Zhang et al. [47] propose a series of methods that decompose NER into smaller, simpler question-answering subproblems, improving the overall process. Xie et al. [48] introduce two methods, syntactic prompting and tool augmentation, to improve LLM performance by incorporating syntactic information. Siepmann et al. [180] explore the use of GPT-4 for automated extraction of diagnoses, medications, and allergies from discharge letters, demonstrating high accuracy with prompt tuning and highlighting its potential to reduce administrative burden in healthcare. In addition, Gu et al. [181] conduct a cross-sectional study evaluating advanced open-source LLMs for extracting social determinants of health (SDoH) from clinical notes, using human-annotated EHR data and comparing against a pattern-matching baseline.
● Few-shot learning
Given the gap between sequence labeling and text generation, providing exemplars helps LLMs better understand the given task and follow the problem-solving steps, especially in tasks requiring structured outputs and strict format adherence. To select pertinent demonstrations, Li and Zhang [49] deploy a retrieval module that retrieves the most suitable examples for a given test sentence, enhancing task relevance and response accuracy. Instead of using natural language for structured output, Li et al. [50] and Bi et al. [51] reformulate IE tasks as code generation with code LLMs such as Codex, effectively leveraging their powerful syntax-aware generation and reasoning capabilities. Fornasiere et al. [52] introduce prompt-based strategies for small-scale LLMs to extract structured and unstructured medical information from clinical texts, demonstrating strong zero-shot performance and enhanced explainability through line-number references to the source text. Tang et al. [53] explore the impact of various prompt engineering strategies (persona, chain-of-thought, and few-shot prompting) on the performance of GPT-3.5 and GPT-4 in extracting key information from medical publications, evaluating alignment with the ground truth using multiple comprehensive metrics.
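To illustrate the structured-output style of few-shot IE discussed above, the hedged sketch below prompts for entities as JSON and parses the reply. The one-shot demonstration, the schema, and the `llm` callable are invented for illustration; they do not reproduce any specific cited system.

```python
import json
from typing import Callable

# One-shot NER prompt asking for a fixed JSON schema (illustrative only).
PROMPT = """Extract named entities as JSON {"PER": [...], "ORG": [...], "LOC": [...]}.
Text: Tim Cook visited Apple's campus in Cupertino.
Entities: {"PER": ["Tim Cook"], "ORG": ["Apple"], "LOC": ["Cupertino"]}
Text: {text}
Entities:"""

def extract_entities(llm: Callable[[str], str], text: str) -> dict:
    raw = llm(PROMPT.replace("{text}", text))
    try:
        return json.loads(raw)       # schema adherence is not guaranteed;
    except json.JSONDecodeError:     # production systems validate or repair here
        return {}
```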
3.2.2 Parameter-tuning paradigm
● Full-parameter tuning
A common way to customize LLMs is to fine-tune them on a collected dataset. Three tuning paradigms are typically adopted to enhance LLMs' abilities. The first tunes LLMs on a single dataset to strengthen a specific ability. The second standardizes data formats across all IE subtasks, enabling a single model to efficiently handle diverse tasks [123-124]. The last tunes LLMs on a mixed dataset and tests on unseen tasks [125-126], which is commonly used to improve the generalization ability of LLMs. Rixewa et al. [131] introduce a unified interleaved representation with cross-modal attention to enhance multi-modal information retrieval, enabling accurate and efficient processing of complex content across text and image formats.
● Parameter-efficient tuning
Tuning the huge number of parameters in LLMs poses a significant challenge for both research and development. To address this challenge [182-183], Das et al. [127] propose dynamic sparse fine-tuning, which focuses on a specific subset of parameters during IE training and is particularly useful with limited data. Meanwhile, Liang et al. [128] introduce Lottery Prompt Tuning (LPT), which efficiently tunes only a portion of the prompt vectors for lifelong information extraction, optimizing both parameter and deployment efficiency. Dagdelen et al. [129] introduce a simple and flexible approach to fine-tuning LLMs for joint named entity recognition and relation extraction, enabling the generation of structured scientific knowledge records from complex materials chemistry texts. Xue et al. [130] introduce AutoRE, an end-to-end document-level relation extraction model using the RHF paradigm and parameter-efficient fine-tuning, achieving state-of-the-art performance without relying on predefined options.
3.3 Dialogue understanding
Dialogue understanding typically consists of spoken language understanding (SLU) [184] and dialogue state tracking (DST) [185].
3.3.1 Parameter-frozen paradigm
● Zero-shot learning
Recent studies highlight the effectiveness of LLMs in dialogue understanding through zero-shot prompting [54-57,186]. Gao et al. [58] and Addlesee et al. [67] introduce zero-shot chain-of-thought prompting strategies, enhancing understanding through step-by-step reasoning. Moreover, Zhang et al. [60] and Wu et al. [62] cast SLU and DST as agent systems and code generation tasks to effectively improve task performance. Further, Chung et al. [68], Chi et al. [64], and Zhang et al. [61] extend the task to real-world scenarios, using zero-shot prompting for efficient interaction and dialogue management. Recently, Qin et al. [187] and Qin et al. [188] propose multi-stage solution frameworks that leverage the interactive capabilities of LLMs to address single-intent and multi-intent SLU tasks, respectively. Dong et al. [189] propose ProTOD, a multi-agent active DST planner framework based on the interaction of multiple LLMs, designed to enhance dialogue proactivity and goal completion rate.
● Few-shot learning
Limited by the instruction-following ability of LLMs, recent studies have focused on improving dialogue understanding through relevant few-shot demonstrations [56]. To address overfitting to the given few-shot demonstrations, Hu et al. [65], King and Flanigan [66], Das et al. [63], Li et al. [59], Lee et al. [69], and Addlesee et al. [67] introduce methods for retrieving diverse few-shot demonstrations to improve understanding performance. Lin et al. [70] and Cao [71] integrate DST tasks with an agent through in-context learning, enhancing dialogue understanding capabilities.
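A common thread in these works is selecting demonstrations similar to the current dialogue. The toy sketch below illustrates similarity-based demonstration retrieval; the cited systems typically use learned retrievers, whereas TF-IDF here merely keeps the example dependency-light.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_demos(query: str, pool: list[str], k: int = 2) -> list[str]:
    """Return the k pool utterances most similar to the query."""
    vec = TfidfVectorizer().fit(pool + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(pool))[0]
    top = sims.argsort()[::-1][:k]   # indices of the k highest similarities
    return [pool[i] for i in top]

demos = retrieve_demos("I need a cheap hotel in the north",
                       ["book a flight to Paris",
                        "find an inexpensive hotel downtown",
                        "what's the weather tomorrow"])
```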
3.3.2 Parameter-tuning paradigm
● Full-parameter tuning
Full-parameter tuning freezes no parameters, training all of them on dialogue understanding tasks [135]. Specifically, Xie et al. [132] and Zhao et al. [133] unify structured tasks into a textual format and train all parameters, demonstrating significant improvement and generalization. Gupta et al. [134] use input with demonstrations as a new DST representation format to train LLMs with full parameters, achieving strong results. Acikgoz et al. [190] argue that DST, typically trained on a limited set of APIs, needs new data to maintain quality; they propose a unified instruction-tuning paradigm for multi-turn DST and advanced function calling, enhancing dialogue management and generalization.
● Parameter-efficient tuning
Limited by the huge cost of full-parameter fine-tuning, much work now focuses on parameter-efficient tuning (PET) for lower-cost training on dialogue understanding tasks. Specifically, Feng et al. [136] present LDST, a LLaMA-driven DST framework that leverages LoRA for parameter-efficient fine-tuning, achieving performance comparable to ChatGPT. Liu et al. [137] provide a pool of key-value soft prompts, selecting soft prompts from the pool based on the conversation history for better PET. Further, Yin et al. [191] address the multi-intent detection task with MIDLM, a bidirectional LLM framework that enables autoregressive LLMs to leverage bidirectional information through post-training, eliminating the need to train the model from scratch.
3.4 Table understanding
Table understanding involves comprehending and analyzing structured data presented in tables, focusing on interpreting and extracting meaningful information, as in table question answering [192-194].
3.4.1 Parameter-frozen paradigm
● Zero-shot learning
Recently, advances in LLMs have paved the way for exploring zero-shot learning in understanding and interpreting tabular data [72-73,75]. Ye et al. [74] and Sui et al. [76] concentrate on breaking large tables into smaller segments to reduce interference from irrelevant data during table understanding. Further, Patnaik et al. [73] introduce CABINET, a framework that includes a module for generating parsing statements to emphasize the data relevant to a given question. Sui et al. [77] develop TAP4LLM, enhancing LLMs' table understanding by incorporating reliable information from external knowledge sources into prompts. Additionally, Ye et al. [75] propose the DataFrameQA framework, which uses secure Pandas queries to address data leakage in table understanding. These efforts mark a significant stride toward leveraging LLMs for more effective and efficient zero-shot table comprehension.
● Few-shot learning
Few-shot learning has become an increasingly focal point for addressing the limitations of LLMs, particularly regarding table understanding and instruction following [82-83,195]. Luo et al. [84] propose a hybrid prompt strategy coupled with retrieval-of-thought to further improve example quality for table understanding tasks. Cheng et al. [78] introduce Binder, which recasts table understanding as a coding task, executing generated code to derive answers directly from tables. Furthermore, Li et al. [85], Jiang et al. [86], and Zhang et al. [79-80] conceptualize table understanding as a more complex agent task that uses external tools to augment LLMs. Building on these developments, ReAcTable [81] integrates additional actions into the process, such as generating SQL queries, producing Python code, and directly answering questions, further enriching the few-shot learning landscape. Wang et al. [87] introduce Chain-of-Table, a framework that guides LLMs to perform table-based reasoning by iteratively updating tabular data as intermediate steps, enabling structured, dynamic reasoning chains that significantly improve performance on table understanding tasks. Kong et al. [88] propose OpenTab, an open-domain table reasoning framework that retrieves relevant tables and generates SQL programs for reasoning, significantly improving accuracy over existing methods in both open- and closed-domain scenarios.
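As a minimal illustration of the "table task as code" idea behind Binder, DataFrameQA, and OpenTab, the sketch below asks a hypothetical LLM for a Pandas expression over a toy DataFrame and executes it; the prompt, table, and `llm` callable are invented, and real systems sandbox this execution rather than calling eval directly.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Rome"], "population": [2.1, 2.8]})

def table_qa(llm, question: str) -> object:
    prompt = (f"Columns: {list(df.columns)}\n"
              f"Write one Pandas expression over `df` answering: {question}")
    code = llm(prompt)                # e.g. 'df.loc[df["population"].idxmax(), "city"]'
    return eval(code, {"df": df})     # demo only; never eval untrusted code

# Stand-in "model" that returns a fixed Pandas expression:
print(table_qa(lambda p: 'df.loc[df["population"].idxmax(), "city"]',
               "Which city has the larger population?"))  # -> Rome
```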
3.4.2 Parameter-tuning paradigm
● Full-parameter tuning
Leveraging the existing capabilities of LLMs, full-parameter tuning optimizes these models for specific table understanding tasks. Li et al. [138] and Xie et al. [132] adapt a substantial volume of table-related data for table instruction tuning, which leads to better generalization on table understanding tasks. Additionally, Xue et al. [139] introduce DB-GPT, which fine-tunes LLMs and integrates a retrieval-augmented generation component to better support table understanding.
● Parameter-efficient tuning
Xie et al. [132] utilize prompt tuning for efficient fine-tuning within a unified framework of table representation instructions. Moreover, Zhang et al. [140], Zhu et al. [141], and Bai et al. [142] adopt Low-Rank Adaptation (LoRA) during instruction tuning for better table understanding and table cleaning. Furthermore, Zhang et al. [143] address long table inputs with LongLoRA, demonstrating its efficacy for long-context issues in table understanding tasks. He et al. [144] introduce TableLoRA, a table-specific fine-tuning module that enhances LLMs' understanding of tabular data under parameter-efficient settings by combining specialized table serialization with 2D positional encoding to improve performance on structured table tasks. Li et al. [145] introduce a "table fine-tuning" paradigm that trains language models such as GPT-3.5 and ChatGPT with synthesized table-based instructions, significantly improving their performance and generalizability on diverse table-understanding tasks.
4 Natural language generation
This section presents LLMs for classic NLP generation tasks, comprising Summarization (§4.1), Code Generation (§4.2), Machine Translation (§4.3), and Mathematical Reasoning (§4.4), as illustrated in Fig. 3.
4.1 Summarization
Summarization aims to distill the essential information from a text document, producing a concise and coherent synopsis that retains the original content's themes [196].
4.1.1 Parameter-frozen paradigm
● Zero-shot learning
In the exploration of zero-shot learning for text summarization, LLMs such as GPT-3 have demonstrated superior performance in generating concise and factually accurate summaries, challenging the need for traditional fine-tuning approaches [20,89,91]. Zhang et al. [92] highlight instruction tuning as pivotal to LLMs' summarization success. Ravaut et al. [90] scrutinize LLMs' context utilization, identifying a bias toward initial document segments in summarization tasks [197-198]. Furthermore, Yun et al. [199] enhance automatic summarization by integrating human interaction and semantic graphs, enabling the generation of higher-quality, personalized summaries tailored to individual users' interests and needs. These studies collectively underscore the versatility and challenges of deploying LLMs in zero-shot summarization.
● Few-shot learning
For few-shot learning, LLMs like ChatGPT have been scrutinized for their summarization abilities. Zhang et al. [93] and Tang et al. [95] demonstrate that leveraging in-context learning and a dialogue-like approach can enhance LLMs' extractive summarization, particularly for summary faithfulness. Adams et al. [94] introduce a "Chain of Density" prompting technique, revealing a preference for denser, entity-rich summaries over sparser ones. Moreover, recent studies leverage the reflective capabilities [200], deeper reasoning abilities [201], and planning abilities [202] of large reasoning models to enhance the depth of thought as well as the conciseness and clarity of summaries. Together, these studies reveal evolving strategies for optimizing LLMs for summarization tasks.
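The following hedged sketch illustrates the iterative-densification loop in the spirit of Chain of Density [94]; the exact prompts in the original work differ, and the `llm` callable is a placeholder for any text-in/text-out model.

```python
def chain_of_density(llm, article: str, steps: int = 3) -> str:
    """Repeatedly fold missing entities into the summary at fixed length."""
    summary = llm(f"Summarize in 3 sentences:\n{article}")
    for _ in range(steps):
        summary = llm(
            "Identify 1-2 informative entities from the article that are "
            "missing from the summary, then rewrite the summary to include "
            "them without making it longer.\n"
            f"Article:\n{article}\nSummary:\n{summary}"
        )
    return summary
```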
4.1.2 Parameter-tuning paradigm
● Full-parameter tuning
Full-parameter tuning for text summarization leverages the power of LLMs, optimizing them for specific summarization tasks. DIONYSUS [203] adapts to new domains through a novel pre-training strategy tailored for dialogue summarization. Socratic pretraining [146] introduces a question-driven approach to improve the summarization process. Further, Wang et al. [204] and Lu et al. [205] demonstrate that carefully prompting LLMs produces well-structured rationales, which can guide fully tuned smaller models to generate summaries that are both more concise and of higher quality. More recently, Aali et al. [206] and Wu et al. [207] employ meticulously annotated supervised fine-tuning (SFT) data and prediction-feedback-based reinforcement learning, respectively, enabling their models to match or even surpass proprietary closed-source models. Overall, these approaches allow models to be easily adapted to different summarization tasks, yielding more controllable and relevant summaries.
● Parameter-efficient tuning
PET strategies have revolutionized the adaptability of large pre-trained models for specific summarization tasks, demonstrating the power of fine-tuning with minimal parameter adjustments [149]. Zhao et al. [147] and Yuan et al. [148] adapt prefix-tuning [34] for dialogue summarization, enhancing model knowledge and generalization across domains. Ravaut et al. [150] develop PromptSum, combining prompt tuning with discrete entity prompts for controllable abstractive summarization. These approaches collectively show the efficacy of PET in enabling robust, domain-adaptive, and controllable summarization at minimal additional computational cost.
4.2 Code generation
Code generation involves automatically creating executable code from natural language specifications, providing a more intuitive interface for programming [96].
4.2.1 Parameter-frozen paradigm
● Zero-shot learning
Recent advances in code generation have been significantly propelled by LLMs, with studies showcasing their proficiency in generating code in a zero-shot manner. Code LLMs, trained on both code and natural language, exhibit robust zero-shot learning capability for programming tasks [97,104]. Moreover, CodeT5+ enriches the landscape with a flexible encoder-decoder architecture and a suite of pretraining objectives, leading to notable improvements [152]. These models collectively push the boundary of what is achievable in code generation, offering promising avenues for zero-shot learning. Recent releases of code-specific LLMs, such as CodeGemma [208] and Qwen2.5-Coder [209], further advance LLM-based code generation with superior benchmark performance. Additionally, Seed-Coder [210] introduces a model-centric data curation pipeline, while Ling-Coder-Lite [211] leverages a Mixture-of-Experts architecture to balance efficiency and performance, marking state-of-the-art progress in open-source code generation LLMs.
● Few-shot learning
Code generation is being revolutionized by few-shot learning, which allows models to create precise code snippets from only minimal examples [212]. Chen et al. [96], Allal et al. [100], Li et al. [101], Luo et al. [99], and Christopoulou et al. [98] illustrate the efficacy of few-shot learning, demonstrating code generation that surpasses predecessor models. The development of smaller yet powerful models [102-103] further highlights the accessibility of few-shot code generation, making it an indispensable tool for modern developers. Importantly, most modern code LLMs, including Code Llama [104], Seed-Coder [210], Qwen2.5-Coder [209], and CodeGemma [208], provide both base and instruct variants, enabling flexible few-shot learning across diverse programming tasks.
4.2.2 Parameter-tuning paradigm
● Full-parameter tuning
Full-parameter tuning represents a pivotal strategy for enhancing code generation models, allowing comprehensive model optimization. Specifically, the CodeT series [151-152] epitomizes this approach, incorporating code-specific pre-training tasks and architectural flexibility to excel at both code understanding and generation. CodeRL [153] and PPOCoder [154] introduce deep reinforcement learning, leveraging compiler feedback and execution-based strategies for model refinement, whereas StepCoder [154] advances this further with reinforcement learning, curriculum learning, and fine-grained optimization techniques. These models collectively demonstrate significant improvements across a spectrum of code-related tasks, embodying the evolution of AI-driven programming aids. Emerging work such as PRLCoder [213] leverages process-supervised reinforcement learning, Focused-DPO [214] enhances preference optimization at error-prone points, and ACECoder [215] applies automated test-case synthesis to refine reward models. Furthermore, SWE-RL [216] extends reinforcement learning to real-world software engineering, significantly advancing the reasoning capacities of LLMs. Reinforcement learning thus shows strong potential for training code LLMs and warrants further exploration.
● Parameter-efficient tuning
PET has emerged as a pivotal adaptation for code tasks, striking a balance between performance and computational efficiency [157]. Studies [155-156] exploring adapters and LoRA showcase PET's viability for code understanding and generation, albeit with some performance limitations. Recent investigations, such as Storhaug and Li [217], demonstrate that PEFT methods can rival full fine-tuning for unit test generation while reducing resource demands. Additionally, Zhang et al. [218] provide a comprehensive evaluation of PEFT for method-level code smell detection, revealing that small models often perform competitively and reinforcing the scalability and cost-effectiveness of PET for specialized software engineering tasks.
4.3 Machine translation
Machine translation is a classical task in which computers automatically translate given information from one language to another, striving for accuracy and preserving the semantic essence of the original material [219]. Recent work [220] revisits key challenges in neural machine translation (NMT), highlighting how LLMs address issues such as long-sentence translation and reduced reliance on parallel data while facing new challenges such as inference efficiency and low-resource language translation.
4.3.1 Parameter-frozen paradigm
● Zero-shot learning
In the realm of zero-shot learning, Zhu et al. [107] and Wei et al. [106] enhance LLMs' multilingual performance through cross-lingual and multilingual instruction tuning, significantly improving translation. OpenBA contributes to the bilingual model space, demonstrating superior performance on Chinese-oriented tasks with a novel architecture [109]. These advances highlight the potential of LLMs for language alignment in zero-shot settings.
● Few-shot learning
In the exploration of few-shot learning for machine translation (MT), recent studies present innovative strategies to enhance the capabilities of LLMs [108,221]. Lu et al. [112] introduce Chain-of-Dictionary prompting (CoD), improving the translation of rare words through in-context learning in low-resource languages. Raunak et al. [111] investigate the impact of demonstration attributes on in-context learning, revealing the critical role of output text distribution in translation quality. Zhu et al. [222] propose a robust multi-view approach for selecting fine-grained demonstrations, effectively reducing noise in in-context learning and significantly improving domain adaptation. Together, these works illustrate the significant potential of few-shot learning for advancing MT with LLMs.
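As a toy illustration of dictionary-augmented few-shot MT in the spirit of CoD [112], the sketch below prepends rare-word hints and demonstrations to the prompt; the lexicon, language pair, and `llm` callable are invented for illustration.

```python
LEXICON = {"hedgehog": "hérisson"}  # rare-word hints for the target language

def translate(llm, src: str, demos: list[tuple[str, str]]) -> str:
    """Few-shot EN->FR prompt with dictionary hints for rare source words."""
    hints = "\n".join(f"{w} -> {t}" for w, t in LEXICON.items() if w in src)
    shots = "\n".join(f"EN: {s}\nFR: {t}" for s, t in demos)
    return llm(f"Dictionary:\n{hints}\n{shots}\nEN: {src}\nFR:")
```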
4.3.2 Parameter-tuning paradigm
● Full-parameter tuning
Full-parameter tuning of LLMs for machine translation represents a frontier for enhancing translation accuracy and adaptability [158]. Iyer et al. [160] demonstrate the potential of LLMs for disambiguating polysemous words through in-context learning and fine-tuning on ambiguous datasets, achieving superior performance in multiple languages. Moslem et al. [161] and Wu et al. [164] explore fine-tuning methods that enhance real-time and context-aware translation. Xu et al. [159] propose Contrastive Preference Optimization (CPO) to further refine translation quality, pushing LLMs toward better performance. Feng et al. [223] introduce MT-R1-Zero, applying reinforcement learning to MT without supervised fine-tuning, achieving competitive results on multilingual benchmarks and offering insights into emergent reasoning patterns. Feng et al. [224] present MT-Ladder, a cost-effective hierarchical fine-tuning framework that boosts general-purpose LLMs' translation performance to match state-of-the-art models. These studies reveal the efficacy of fine-tuning approaches and point to reinforcement learning as a promising direction for advancing machine translation by leveraging LLMs' emergent reasoning and adaptability.
● Parameter-efficient tuning
PET is emerging as a transformative approach for integrating LLMs into machine translation, balancing performance and efficiency. The authors of [162] empirically assess PET's efficacy across languages and model sizes, highlighting adapters' effectiveness given adequate parameter budgets. Alves et al. [110] optimize the fine-tuning process with adapters, striking a balance between few-shot learning and fine-tuning efficiency. Recent work further demonstrates PET's scalability and robustness in multilingual and domain-specific tasks, confirming its potential to make LLMs more adaptable and resource-efficient while maintaining competitive performance. These studies collectively underline PET's promise for scalable, cost-effective MT.
4.4 Mathematical reasoning
Mathematical reasoning tasks in NLP involve using NLP techniques to understand mathematical text, perform logical reasoning, and ultimately generate accurate answers to mathematical questions [225-226].
4.4.1 Parameter-frozen paradigm
● Zero-shot learning
Mathematics serves as a testbed for investigating the reasoning capabilities of LLMs [14,227]. Vanilla prompting asks LLMs to directly produce the final answer to a given mathematical problem, which is challenging and leaves the reasoning process opaque to humans. To address this, Kojima et al. [31] develop a zero-shot chain-of-thought technique that uses the simple prompt "Let's think step by step" to elicit mathematical reasoning in LLMs, allowing the model to break the problem into smaller, easier-to-solve pieces before arriving at a final answer. Further, Wang et al. [114] propose a new decoding strategy, self-consistency, which aggregates a set of prompting results to boost performance. Tang et al. [228] propose an automatically enhanced zero-shot prompting strategy that adjusts prompts through model retrieval to improve LLM performance on mathematical reasoning tasks. Moreover, Yuksekgonul et al. [229] and Peng et al. [230] employ reflection-based, iterative prompting strategies to improve zero-shot mathematical reasoning accuracy.
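The sketch below illustrates the core of self-consistency decoding [114]: sample several chain-of-thought completions and take a majority vote over the parsed final answers. The `sample_llm` (a stochastic text-in/text-out model call) and `parse_answer` helpers are assumed here, not part of the original method's code.

```python
from collections import Counter

def self_consistency(sample_llm, parse_answer, question: str, k: int = 10) -> str:
    """Majority vote over k sampled chain-of-thought completions."""
    prompt = f"{question}\nLet's think step by step."
    answers = [parse_answer(sample_llm(prompt)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```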
● Few-shot learning
Recent studies explore constructing more suitable exemplars for LLMs to improve mathematical reasoning. Wei et al. [16] introduce chain-of-thought prompting, using a few demonstrations to guide LLMs through step-by-step reasoning. However, creating these examples by hand is laborious, so Zhang et al. [113] and Lu et al. [115] propose methods to select in-context examples automatically. To improve numerical precision, PAL [116] generates and executes intermediate program steps in a runtime environment. Building on this idea, Das et al. [117] present MathSensei, a tool-augmented LLM that integrates web search, code execution, and symbolic solving, showing greater gains on harder problems. Liu et al. [174] propose XoT, a unified framework that dynamically switches among diverse prompting methods for better mathematical reasoning. To probe consistency, Yu et al. [175] use symbolic programs to reveal that LLMs often rely on brittle reasoning despite strong static performance. More recently, the authors of [176] introduce QuaSAR, which blends natural language with selective formalization to enhance chain-of-thought robustness without full symbolic translation. Moreover, Zhang et al. [231] enhance the mathematical capabilities of LLMs by improving single-step reasoning in the context of fine-grained in-context learning.
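A minimal sketch of the program-aided pattern used by PAL [116] follows: the model writes a small Python program, and the runtime, not the LLM, computes the answer. The `llm` callable is hypothetical, and real systems execute generated code in a sandbox.

```python
def pal_solve(llm, problem: str) -> object:
    """Ask the model for a solution() function, then execute it locally."""
    code = llm(f"Write a Python function solution() returning the numeric "
               f"answer to:\n{problem}")
    scope: dict = {}
    exec(code, scope)   # demo only; never exec untrusted code unsandboxed
    return scope["solution"]()

# e.g. for "Tom has 3 apples and buys 5 more twice", the model might return:
# def solution():
#     return 3 + 5 * 2
```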
4.4.2 Parameter-tuning paradigm
● Full-parameter tuning
Full-parameter tuning is a standard method for equipping LLMs with mathematical reasoning abilities [178]. Several studies improve general math-solving ability by creating high-quality instruction-tuning datasets from web-curated collections [166], advanced LLM distillation [167], and self-generated samples [178,232]. Moreover, the authors of [168] introduce Toolformer, which leverages a calculator for numeric operations. Chen et al. [173] propose perturbing the token-level chain of thought during fine-tuning, improving accuracy without external labels. Yu et al. [233] develop Chain-of-Reasoning, which integrates natural-language, algorithmic, and symbolic reasoning to boost benchmark performance. Beyond supervised tuning, reinforcement learning has also shown promise: Luo et al. [165] apply RLEIF to enhance math reasoning; Luo et al. [172] propose OmegaPRM, an MCTS-based method for training reward models on MATH500 and GSM8K without human oversight; Shao et al. [171] train DeepSeekMath 7B on 120B tokens of web data with GRPO; and Qian et al. [234] introduce ToolRL, examining tool selection and reward design in RL-based fine-tuning.
● Parameter-efficient tuning
Fine-tuning LLMs with full parameter updates incurs significant memory overhead, limiting accessibility for many users. Parameter-efficient tuning techniques such as LoRA [33] offer a promising alternative. Additionally, Hu et al. [169] propose a user-friendly framework for integrating various adapters into LLMs, enabling them to tackle tasks like mathematical reasoning. SPHERE [235] introduces a self-evolving data-generation pipeline leveraging LoRA to improve small-scale language models on mathematical reasoning through the self-generation, refinement, and diversification of reasoning chains. Prottasha et al. [236] present Semantic Knowledge Tuning (SK-Tuning), which employs semantically meaningful vocabulary instead of random tokens for prompt and prefix tuning, boosting LLM performance on mathematical reasoning. Srivastava et al. [177] propose DTE, a ground-truth-free training framework that uses multi-agent debates and a Reflect-Critique-Refine strategy to enhance LLM reasoning, achieving notable accuracy gains and strong cross-domain generalization. Further, Alazraki and Rei [237] introduce a meta-reasoning-based tool selection framework: a two-stage system that first performs meta-reasoning over the given task and then leverages a custom, fine-tuned language-modeling head to generate candidate tools, substantially improving mathematical reasoning performance.
● Takeaways
(1) LLMs offer a unified generative solution paradigm for various NLP tasks. (2) On some NLP tasks, LLMs still lag behind smaller supervised learning models. (3) Continuing to fine-tune LLMs on NLP tasks brings substantial improvements.
5 Future work and new frontier
In this section, as shown in Fig. 4, we highlight some new frontiers, aiming to inspire further innovations and groundbreaking advancements in the near future.
5.1 Multilingual LLMs for NLP
Despite the significant success of LLMs on English NLP tasks, there are over 7,000 languages worldwide, and extending the success of English-centric LLMs to NLP tasks in other languages is an important research question [238-242]. Inspired by this, researchers have worked to enhance multilingual LLMs through parameter-tuning strategies, including multilingual pretraining [243-246], supervised fine-tuning [245-247], and reinforcement learning [248]. Other studies focus on cross-lingual alignment via prompting, using few-shot approaches [249-252] and zero-shot instructions [253-254] to enhance alignment.
Two main challenges in this direction are as follows. (1) Enhancing low-resource language performance: Current LLMs still perform poorly in low-resource languages; building universal multilingual LLMs that achieve promising performance on NLP tasks across languages is a direction worth exploring. (2) Improving cross-lingual alignment: The key to multilingual LLMs is improving the alignment between English and other languages. Effectively achieving this alignment is critical for ensuring optimal performance on cross-lingual NLP tasks, making it a challenging yet essential area for advancement.
5.2 Multi-modal LLMs for NLP
Current LLMs achieve excellent performance in the text modality. However, integrating modalities is one of the key paths to AGI [255-258]. Therefore, much work has begun to explore multi-modal LLMs for multi-modal NLP tasks [259-265].
There are two primary challenges in this field. (1) Complex multi-modal reasoning: Currently, most multi-modal LLMs focus on simple multi-modal reasoning, such as recognition [266-267], while neglecting complex multi-modal reasoning [268-270]. How to effectively support complex multi-modal reasoning for NLP is therefore a crucial topic [256,271-273]. (2) Effective multi-modal interaction: Existing methods often simply add a direct multi-modal projection or prompting to LLMs to bridge the multi-modality gap [266-267,274-276]. Crafting a more effective multi-modal interaction mechanism in the inference process of multi-modal LLMs for NLP tasks remains an essential problem.
5.3 Tool-usage in LLMs for NLP
While LLMs have shown success on NLP tasks, they still face challenges when applied in real-world scenarios [277-278]. Therefore, much work explores using LLMs as central controllers that invoke or construct tools and agents to solve practical NLP tasks [279-284].
There are two primary concerns. (1) Appropriate tool usage: Current work usually considers static tool usage, neglecting the choice of appropriate tools. Identifying the correct tools and using them accurately is a key issue for solving NLP tasks efficiently. (2) Efficient tool planning: Current work still focuses on the use of a single tool for NLP tasks, so there is a pressing need for efficient tool chains that leverage multiple tools in a coordinated manner. For example, a task-oriented dialogue task may involve three tools (booking flight tickets, booking train tickets, and booking bus tickets); coordinating them so that the trip is as short and as cheap as possible is a typical tool-planning problem, as sketched below.
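A toy sketch of that trip-planning example follows: choosing among booking "tools" to minimize a weighted combination of trip time and cost. All values and the trade-off weight are invented for illustration.

```python
options = {"flight": (2.0, 300), "train": (5.0, 120), "bus": (9.0, 60)}  # (hours, $)

def plan(alpha: float = 20.0) -> str:
    """Pick the tool minimizing alpha * hours + dollars (alpha trades time for money)."""
    return min(options, key=lambda t: alpha * options[t][0] + options[t][1])

print(plan())  # -> 'train' under this weighting
```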
5.4 X-of-thought in LLMs for NLP
When LLMs solve complex NLP problems, they often cannot directly give correct answers and require elaborate reasoning. Therefore, some works adopt X-of-thought (XoT) for advanced logical reasoning. XoT primarily focuses on refining the model's ability to process and reason through complex logic, ultimately improving overall performance and accuracy on challenging NLP tasks [31,113,253,285-288].
Key challenges in this direction include: (1) Universal step decomposition: Developing a universally applicable step-decomposition method that generalizes LLMs to various NLP tasks is the core challenge of XoT. (2) Prompting knowledge integration: Diverse prompting strategies enhance model performance across various scenarios; better integrating the knowledge of different XoT variants to solve NLP problems is an important direction.
5.5 Hallucination in LLMs for NLP
When solving NLP tasks, LLMs inevitably suffer from hallucinations, producing outputs that deviate from world knowledge [289-290], the user request [291], or self-generated context [292]. This deviation harms the reliability of LLMs in practical scenarios.
The primary challenges in hallucination are: (1) Efficient hallucination evaluation: Finding appropriate and unified evaluation benchmarks and metrics for LLMs across NLP tasks is a key challenge. (2) Leveraging hallucinations for creativity: Hallucinations can often stimulate certain creative abilities; how to leverage hallucinations to stimulate creativity and generate innovative knowledge is an interesting topic.
5.6 Safety in LLMs for NLP
Applying large models to downstream NLP tasks also raises unavoidable safety concerns, including copyright issues [293], hate speech and toxicity [294], social bias [295-296], and psychological safety [26]. Inspired by this, a growing body of research focuses on ensuring the safety of LLMs across NLP tasks [297-300].
The main challenges to safety in LLMs are: (1) Safety benchmark construction: There are currently few safety-related benchmarks for LLMs on NLP tasks; establishing effective safety benchmarks is a critical objective in this area. (2) Multilingual safety risks: LLMs face greater safety risks across languages and cultures [301]; identifying and mitigating these risks in a multilingual context is a significant challenge.
5.7 Long Chain-of-Thought in LLMs for NLP
Long Chain-of-Thought (Long-CoT) extends standard CoT prompting by allowing models to reason more deeply, explore multiple solution paths, and reflect on intermediate outcomes instead of following a single linear chain of thought [28,302-303]. By organizing reasoning into hierarchical levels or segmented sub-chains, Long-CoT equips large language models to address complex NLP challenges and compositional reasoning tasks beyond the reach of conventional CoT [16,285,304-308]. Recent innovations integrate reflective mechanisms [279,309], inference-time scaling techniques [114,310-311], and reinforcement-learning enhancements [7,312-315].
Key challenges in this direction include: (1) Adaptive reasoning length control: Selecting the appropriate depth and breadth for each sub-chain is challenging [316-318]. If a sub-chain is too shallow, the model may overlook critical intermediate abstractions; if it is too deep, it risks propagating errors or exceeding token limits [19,319]. (2) Interactive reasoning: Enabling a dynamic, iterative problem-solving process, in which models pose clarifying questions [320], integrate external feedback [278,321], and refine intermediate steps [322], remains insufficiently explored [28,323]. Such interactive chains could substantially improve performance and accuracy on tasks requiring real-time adaptation [324-325].
6 Conclusion
In this work, we make the first attempt to offer a systematic overview of LLMs in NLP, introducing a unified taxonomy of the parameter-frozen and parameter-tuning paradigms. Besides, we highlight new research frontiers and challenges, hoping to facilitate future research. Additionally, we maintain a publicly available resource website to track the latest developments in the literature. We hope this work can provide valuable insights and resources to build effective LLMs in NLP.
© The Author(s) 2025. This article is published with open access at link.springer.com and journal.hep.com.cn.