1 Introduction
Rapidly increasing the parameter scale of pre-trained language models improves their generalization ability and brings emergent abilities. In the last few years, the parameter scales of pre-trained language models have increased by thousands of times (e.g., from the 330 M parameter BERT [
1] to the 540 B parameter PaLM [
2]). Pre-trained language models with such large parameter scales are termed large language models (LLMs). Nevertheless, due to their knowledge boundaries, the abilities of LLMs on some downstream tasks are still limited. To expand these knowledge boundaries, it remains necessary to fine-tune LLMs on downstream tasks.
However, fine-tuning the full parameters of an LLM, namely full fine-tuning, is extremely computationally expensive; for example, full fine-tuning of a LLaMA2-7B [
3] model requires approximately 60 GB of memory, which exceeds the capacity of common consumer GPUs [
4]. To reduce the computational cost, various parameter-efficient fine-tuning (PEFT) methods have been proposed [
5]. They adapt LLMs to downstream tasks by only fine-tuning a small number of (extra) model parameters. From the perspective of whether extra parameters are involved, PEFT methods can be divided into two categories: extra-parameter methods and intra-parameter methods. The extra-parameter methods freeze all of the original parameters of an LLM and insert a set of learnable parameters to optimize the model input or model layers such as adapter tuning [
6] and prompt tuning [
7]. By contrast, intra-parameter methods freeze most of the original parameters of an LLM and only tune a small number of parameters of the LLM such as BitFit [
8], LISA [
4] and LoRA [
9].
When we cannot modify the model architecture, intra-parameter methods are desirable. Among the intra-parameter methods, LoRA is the most widely used, because it achieves downstream adaptation performance comparable to or better than full fine-tuning on a range of downstream tasks [
9] and is easy to implement. Besides, many variants have been proposed to further improve the downstream adaptation ability of LoRA on more challenging downstream tasks.
LoRA achieves parameter efficiency by updating the dense neural network layers of an LLM with pluggable low-rank matrices. These matrices (a.k.a. LoRA plugins) are independent of the LLM and can be stored and reused for other related downstream tasks. Furthermore, these LoRA plugins can be combined to achieve cross-task generalization, which facilitates multi-task learning, domain adaptation, and continual learning.
As LoRA modules accumulate, the computational cost of managing them grows. Although a single LoRA module is computation-efficient, the cost of managing a large number of LoRA modules is no longer negligible, so it is necessary to further improve the computational efficiency of LoRA. The improvement can come from reducing the computation cost of single LoRA modules and from accelerating the scalable serving of multiple modules. Such improvements can boost the application of LoRA in real-world use cases, such as Generative-as-a-Service (GaaS) cloud products.
In some cases, the training data are privately owned by multiple clients and cannot be centralized. To adapt LLMs with such distributed training data, we can adopt federated learning to protect the data privacy of each client. However, federated learning suffers from expensive communication and computation costs. To reduce these costs, LoRA is a natural choice. Its parameter-efficient nature helps to reduce the computation cost of each client and the communication cost of sharing parameters across clients. Furthermore, the pluggable feature of LoRA, which supports the localization or encryption of personalized parameters, enhances privacy protection within federated learning. Therefore, LoRA has great potential for privacy preservation.
While some previous surveys have mentioned LoRA [
5,
10,
11], they mainly focus on PEFT and only introduce a small number of LoRA-related works, lacking a systematic treatment and comprehensive overview of LoRA and its variants. In this survey, we give a comprehensive overview of the current progress on LoRA, covering methods that (1) improve the downstream adaptation performance of LoRA; (2) mix LoRA modules to achieve cross-task generalization; (3) boost the computational efficiency of LoRA; (4) adopt LoRA in federated learning. Besides, the applications of LoRA are briefly introduced. The taxonomy of LoRA-related methods is illustrated in Fig.1. This survey is expected to provide comprehensive background knowledge, research trends and technical insights on LoRA.
The rest of this survey is organized as follows. Section 2 introduces the background knowledge of LoRA, and Section 3 introduces the LoRA variants that aim to improve the downstream adaptation performance. In Section 4, we review the LoRA mixture methods that mix LoRA modules to achieve cross-task generalization. Section 5 discusses the methods that are proposed to improve the computational efficiency of LoRA. The LoRA-driven federated learning methods are introduced in Section 6. Section 7 reports the applications of LoRA. We conclude this survey and discuss the future directions in Section 8.
2 Low-rank adaptation (LoRA)
The low intrinsic dimensionality hypothesis [
189] posits that over-parameterized models reside on a low intrinsic dimension, which suggests that we can achieve proper learning performance by only updating parameters related to this intrinsic rank. Based on this hypothesis, LoRA [
9] proposes to update the dense layers of a model with low-rank matrices, which achieves both parameter and computational efficiency. In this section, we first introduce the details of LoRA and then review existing works that focus on its theoretical analysis. Furthermore, we demonstrate LoRA's efficiency in practice. Finally, this section shows that LoRA can also be used in learning paradigms beyond fine-tuning.
2.1 LoRA
Given a dense neural network layer parameterized by a weight matrix $W_0 \in \mathbb{R}^{d \times k}$, to adapt it to a downstream task, we update it with an increment $\Delta W$ and obtain an updated layer parameterized by $W = W_0 + \Delta W$. For full fine-tuning, $\Delta W$ is computed based on the gradients of all the parameters of the layer, which is computationally expensive and requires a large amount of GPU memory for LLMs. To improve the computational efficiency, LoRA decomposes $\Delta W$ into two small matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, i.e.,
$$W = W_0 + \Delta W = W_0 + \frac{\alpha}{r} BA,$$
where $r \ll \min(d, k)$, $A$ and $B$ are initialized with a random Gaussian distribution and zero respectively, and $\alpha$ represents the scaling factor that controls the strength of the updates. The number of trainable parameters of LoRA is $r(d + k)$, which is significantly less than the $dk$ parameters of full fine-tuning. Fig.2(a) and (b) compare the structures of full fine-tuning and LoRA.
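To make the formulation concrete, the following minimal PyTorch sketch implements a LoRA-augmented linear layer with the initialization described above; the rank and scaling values are illustrative assumptions rather than settings prescribed by the original paper, and the frozen base weight is randomly initialized here purely for demonstration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        # Frozen pre-trained weight W0 (random here only for illustration).
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Trainable low-rank factors: A is Gaussian-initialized, B starts at zero,
        # so the initial update BA is zero and training starts from the pre-trained model.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight is W0 + (alpha / r) * B A, computed without materializing BA.
        base = F.linear(x, self.weight)
        update = F.linear(F.linear(x, self.lora_A), self.lora_B)
        return base + self.scaling * update

layer = LoRALinear(1024, 1024, r=8, alpha=16)
y = layer(torch.randn(4, 1024))   # only lora_A and lora_B receive gradients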
LoRA is highly parameter-efficient because it updates only a small subset of model parameters, which reduces the memory and computational requirements of fine-tuning without increasing inference latency [190]. Furthermore, the parameter efficiency can be further improved by extending the low-rank matrix to a low-rank tensor [191] or combining LoRA with the Kronecker decomposition [192, 193]. Beyond parameter efficiency, LoRA is also pluggable: the LoRA parameters can be separated from the model after training, which enables them to be shared and reused by multiple users [194]. When we have LoRA modules for multiple tasks, we can combine these modules and expect a proper cross-task generalization performance [60]. Besides, the low-rank mechanism of LoRA is compatible with other parameter-efficient methods, such as adapters [195, 196]. Finally, LoRA can achieve proper downstream adaptation performance on various downstream tasks. For example, on the MMLU [197] benchmark, fine-tuning with LoRA can achieve comparable or even better performance than full fine-tuning across 57 tasks [4].
In practice, for a Transformer-based LLM, the dense layers typically consist of two types of weight matrices: the projection matrices in the attention modules and those in the feed-forward network (FFN) modules. The experiments mentioned above are conducted with the original LoRA setting, which applies LoRA to the query and value weight matrices in the attention modules. It is worth mentioning that subsequent work shows that applying it to the FFN layers can further improve model performance [
198].
2.2 Theoretical analysis
To understand why LoRA is effective and how LoRA can be made more effective, several works have provided theoretical analyses from various aspects. To answer the question of why LoRA is effective, Malladi et al. [
12] analyze the fine-tuning dynamics of LoRA from the kernel view and demonstrate that in the lazy regime, LoRA fine-tuning is nearly equivalent to full fine-tuning. Besides, Zeng et al. [
16] provide a theoretical analysis of LoRA's expressive power for both fully connected neural networks (FNNs) and Transformer networks (TFNs). They prove that, under a mild assumption, LoRA can adapt any model $f$ to accurately represent any smaller target model $\bar{f}$ if the LoRA-rank is at least $(\text{width of } f) \times \frac{\text{depth of } \bar{f}}{\text{depth of } f}$, where the depth and width are the number of layers and the number of neurons of the layer having the largest number of neurons, respectively. Moreover, they quantify the approximation error when the LoRA-rank falls below this threshold. Regarding TFNs, they show that any model can be adapted to a target model of equivalent size using rank-$(\text{embedding size}/2)$ LoRA adapters. Additionally, Koubbi et al. [
13] utilize the mathematical framework for Transformers established by [
199–
201] to investigate how low-rank perturbations of the attention parameters affect the model's behavior.
As to the question of how LoRA can be made more effective, Jang et al. [
14] analyze the fine-tuning of LoRA within the neural tangent kernel (NTK) [
202] framework when $N$ data points are available. They demonstrate that employing a rank $r \gtrsim \sqrt{N}$ in LoRA helps to avoid spurious local minima and facilitates the discovery of low-rank solutions that exhibit good generalization. Besides, Zhu et al. [
15] observe that the project-down matrix $A$ is utilized for extracting features from the input, while the project-up matrix $B$ employs these features to create the desired output. Based on this observation, they demonstrate that freezing the project-down matrix $A$ while tuning only the project-up matrix $B$ leads to better generalization compared to tuning both matrices, in addition to roughly halving the number of trainable parameters.
2.3 Efficiency in practice
The computational efficiency of LoRA is significantly higher than that of full fine-tuning. Taking the dense weight matrix of the first FFN layer in LLaMA2-7B as an example, full fine-tuning needs to update all $d \times k$ parameters of the matrix, while LoRA only needs to tune $r(d + k)$ parameters for a small rank $r$. For this layer, LoRA only adjusts nearly one-thousandth of the parameters compared to full fine-tuning.
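To make the arithmetic concrete, the short snippet below computes both parameter counts; the layer shape (4096 x 11008) corresponds to LLaMA2-7B's first FFN projection, while the rank r = 8 is an assumed illustrative value rather than the setting used in the cited study.

# Back-of-the-envelope parameter counts for the FFN-layer example above.
d_in, d_out, r = 4096, 11008, 8
full_ft_params = d_in * d_out        # ~45.1M parameters updated by full fine-tuning
lora_params = r * (d_in + d_out)     # ~0.12M parameters updated by LoRA
print(full_ft_params, lora_params, lora_params / full_ft_params)  # fraction is on the order of 1e-3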
LoRA can significantly decrease the memory usage of fine-tuning an LLM, which can be divided into four parts: (1) Model Memory: the memory required to store the model weights; (2) Activation Memory: the memory occupied by intermediate activations during forward propagation. It mainly depends on factors such as batch size and sequence length; (3) Gradient Memory: the memory required to store gradients during backpropagation. The gradients are only calculated for trainable parameters; (4) Optimization Memory: the memory used to store optimizer states. For example, the Adam optimizer stores the “first moment” and “second moment” of trainable parameters.
Pan et al. [
4] provide a comprehensive empirical comparison between full fine-tuning and LoRA fine-tuning on an LLaMA2-7B model with batch size 1, utilizing a single NVIDIA RTX4090 (24 GB) GPU. According to this study, full fine-tuning requires approximately 60 GB of memory, which exceeds the capacity of an RTX4090 GPU; by contrast, LoRA fine-tuning only needs about 23 GB of memory. LoRA significantly reduces memory usage and makes fine-tuning LLaMA2-7B feasible on a single NVIDIA RTX4090 (24 GB) GPU. Specifically, due to fewer trainable parameters, both optimization memory and gradient memory decrease significantly, by approximately 25 GB and 14 GB respectively. On the other hand, while LoRA introduces additional “incremental parameters”, resulting in slight increases in activation memory and model memory (totaling about 2 GB), this increase is negligible when considering the overall reduction in memory. Moreover, the reduced memory footprint also accelerates forward propagation, making LoRA fine-tuning faster than full fine-tuning.
2.4 Beyond fine-tuning
Besides fine-tuning, LoRA can be applied to other learning paradigms, such as pre-training [
17,
19] and continual training [
20]. For pre-training,
ReLoRA [
17] and
MoRA [
18] are proposed to use low-rank updates to train high-rank networks; moreover,
LTE [
19] is proposed to perform parallel training of multiple low-rank heads across computing nodes to minimize the need for frequent synchronization, which facilitates the utilization of LoRA in pre-training. As for continual training, several methods have been proposed to address the catastrophic forgetting problem.
InfLoRA [
20] addresses catastrophic forgetting by reparameterizing pre-trained weights with a minimal set of parameters in a subspace.
GS-LoRA [
21] uses group sparse regularization to automatically select specific LoRA groups while zeroing out others to mitigate catastrophic forgetting effects.
I-LoRA [
22] leverages dual-memory experience replay combined with LoRA parameter interpolation to combat catastrophic forgetting.
Furthermore, LoRA can be used to overcome the limited context size for LLMs [
3,
23]. For instance,
LongLoRA [
3] efficiently extends the context window of LLaMA2-7B [
203] from 4k to 100k tokens by combining LoRA with shifted sparse attention. However, LongLoRA does not match the efficiency of vanilla attention due to chaotic attention head structures and unnecessary information exchange between token groups. To address these issues,
SinkLoRA [
23] introduces Sink Fixed Attention (SF-Attn), which proportionally returns cyclically shifted groups of attention heads to their un-shifted state and achieves proper performance.
3 Downstream adaptation improving
Although LoRA can achieve proper adaptation performance on some downstream tasks, there is still a performance gap between LoRA and full fine-tuning on many downstream tasks, such as mathematical reasoning [
204–
206]. To fill this gap, many methods are proposed to further improve the downstream adaptation performance of LoRA. Typically, existing methods improve the downstream adaptation performance from the following perspectives: (1) breaking the low-rank bottleneck, refer to Fig.2(c); (2) adaptively allocating the ranks of different LoRA modules, refer to Fig.2(d); (3) optimizing the learning procedure of LoRA; (4) combining with other learning paradigms. In this section, we introduce these four types of methods respectively.
3.1 Breaking the low-rank bottleneck
The low-rank update enables LoRA to be parameter-efficient; however, it restricts LLMs’ ability to memorize downstream knowledge and to generalize on downstream tasks [
18,
205–
208]. This low-rank limitation causes inferior performance of LoRA compared to full fine-tuning in knowledge- and skill-intensive domains, such as code and math. An experimental study [
206] demonstrates that the rank of the updates in full fine-tuning is significantly (10–100$\times$) higher than that of LoRA, and that increasing the rank of the LoRA update can narrow the performance gap between LoRA and full fine-tuning. To increase the rank of LoRA and improve its performance, several methods have been proposed [
17,
24,
27,
209], which typically increase the rank through (1) stacking LoRAs along learning iterations; (2) updating as gradient compressors; (3) co-updating LLM and LoRA modules during fine-tuning.
3.1.1 Stacking LoRAs along fine-tuning
Matrix rank is subadditive, i.e., $\mathrm{rank}(M_1 + M_2) \le \mathrm{rank}(M_1) + \mathrm{rank}(M_2)$ for matrices $M_1$ and $M_2$ that have the same size. Based on this subadditivity, we can aggregate multiple LoRA modules together to increase the rank and break the low-rank bottleneck. Following this idea,
ReLoRA [
17] proposes a merge-and-reinit procedure for LoRA, which periodically merges the LoRA modules into the LLM and then reinitializes them during fine-tuning. This is equivalent to stacking multiple LoRA modules over the course of fine-tuning, which can increase the rank of the overall update. Similarly,
COLA [
24] proposes another merge-and-reinit method based on Frank-Wolfe algorithm [
210]. However,
MELoRA [
25] points out that the merge-and-reinit procedure does not necessarily guarantee an increase in rank, because there can be overlap between the series of LoRA modules accumulated along fine-tuning. To solve this problem, MELoRA proposes to decompose the LoRA modules into smaller mini LoRAs and then stack these mini LoRAs in parallel, whose effectiveness in increasing the rank is theoretically verified.
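To illustrate the merge-and-reinit idea, the following hedged sketch periodically folds the current low-rank update into the frozen weight and then re-initializes the LoRA factors, so that the accumulated update can exceed the rank of any single module. The toy objective, shapes, and merge interval are illustrative assumptions, and details of ReLoRA such as the partial optimizer reset and the jagged learning-rate schedule are omitted.

import torch

d, k, r, steps, merge_every = 512, 512, 4, 200, 50
W = torch.randn(d, k)                          # frozen base weight
A = (torch.randn(r, k) * 0.01).requires_grad_()
B = torch.zeros(d, r, requires_grad=True)
opt = torch.optim.AdamW([A, B], lr=1e-3)

x, target = torch.randn(32, k), torch.randn(32, d)
for step in range(1, steps + 1):
    loss = torch.nn.functional.mse_loss(x @ (W + B @ A).T, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % merge_every == 0:
        with torch.no_grad():
            W += B @ A                         # merge the current low-rank update into W
            A.normal_(std=0.01)                # re-initialize the LoRA factors
            B.zero_()
        opt = torch.optim.AdamW([A, B], lr=1e-3)  # fresh optimizer state for the new module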
3.1.2 Updating as gradient compressor
The above methods break the low-rank bottleneck in the parameter space. As a supplement,
FLoRA [
26] finds that LoRA performs a fixed random projection to compress gradients, which restricts the total weight matrix change to be low-rank. To overcome this low-rank bottleneck in the gradient space,
FLoRA proposes to resample the random projection, which is demonstrated to largely recover the performance of full-matrix SGD.
3.1.3 Co-updating LLM and LoRA
The above two kinds of methods focus on improving the representation ability of LoRA itself. Different from them,
Delta-LoRA [
27] proposes to jointly update the LLM and LoRA modules, which directly updates the high-rank LLM and can gain better representation capability than updating LoRA alone. It updates the LLM based on the difference between the LoRA modules of two consecutive iterations, which enables it to update the LLM without any extra memory.
3.2 Dynamic rank allocation
For the rank of LoRA, higher is not always better. Excessively large LoRA ranks may cause degeneration in both performance and efficiency. Furthermore, the importance of the weights can vary across different layers of a Transformer model during fine-tuning, requiring different ranks for each layer [
28,
31,
33,
211]. Therefore, assigning the same rank to LoRA modules of different layers is not the optimal choice. It is better to adaptively allocate ranks to LoRA modules of different layers. Existing methods adaptively allocate ranks for LoRA modules from the perspectives of (1) singular value decomposition (SVD); (2) single-rank decomposition (SRD); (3) rank sampling.
3.2.1 SVD-based methods
Decomposing a matrix with singular value decomposition (SVD) and selectively truncating its singular values is an effective way to control the rank of the matrix. Inspired by SVD, we can decompose the LoRA parameter matrix $\Delta W$ into an SVD-like form, i.e., $\Delta W = P \Lambda Q$, where $P$ and $Q$ are orthogonal and $\Lambda$ is a non-negative diagonal matrix. By controlling the elements in $\Lambda$, we can control the rank of $\Delta W$ and allocate ranks for LoRA modules. Following this idea, several rank allocation methods approximate the SVD decomposition of $\Delta W$ and allocate the ranks by filtering the diagonal matrix. For instance, AdaLoRA [28] approximates the SVD decomposition by regularizing the orthogonality of $P$ and $Q$. Then, it drops unimportant singular values based on novel importance scoring methods. Similarly, SaLoRA [29] also introduces an orthogonality regularization for $P$ and $Q$; by contrast, it drops unimportant singular values based on a sparsity-inducing norm. However, the above methods are not efficient enough because they start with a high rank and then reduce the rank iteratively, which requires a pre-defined budget [30]. To solve this problem, IncreLoRA [30] proposes to start from a single rank and then automatically increase the rank based on a heuristic importance score, where the orthogonality regularization is also involved while the elements in $\Lambda$ are not required to be non-negative.
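As an illustration of the SVD-style parameterization shared by these methods, the sketch below builds $\Delta W = P \Lambda Q$ with an orthogonality penalty on $P$ and $Q$ and then allocates rank by keeping only the largest entries of $\Lambda$. The shapes, the rank budget, and the use of $|\lambda|$ as an importance proxy are simplifying assumptions, not the exact scoring rules of AdaLoRA or SaLoRA.

import torch

d, k, r = 256, 256, 12
P = torch.nn.Parameter(torch.randn(d, r) * 0.01)
lam = torch.nn.Parameter(torch.zeros(r))         # pseudo singular values
Q = torch.nn.Parameter(torch.randn(r, k) * 0.01)

def delta_w() -> torch.Tensor:
    return P @ torch.diag(lam) @ Q

def orth_penalty() -> torch.Tensor:
    # Added to the training loss to push P and Q toward orthogonality.
    I = torch.eye(r)
    return ((P.T @ P - I) ** 2).sum() + ((Q @ Q.T - I) ** 2).sum()

budget = 4                                       # target rank for this module
with torch.no_grad():
    keep = lam.abs().topk(budget).indices        # importance proxy: |lambda|
    mask = torch.zeros_like(lam)
    mask[keep] = 1.0
    lam.mul_(mask)                               # effective rank of delta_w() is now <= budget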
3.2.2 SRD-based methods
However, the orthogonality regularization brings non-negligible computational costs for LoRA and degrades its efficiency. To address this problem, several methods omit the orthogonality requirement of SVD and directly decompose $\Delta W$ into single-rank components. Then, they allocate the ranks by selecting the proper components.
DoRA (Dynamic Low-Rank Adaptation) [31] proposes to decompose the LoRA parameter matrix $\Delta W$ into single-rank components and prunes the components based on a heuristic importance score. Similarly, AutoLoRA [32] also decomposes the LoRA parameter matrix $\Delta W$ into single-rank components, but it prunes the components based on meta-learning. SoRA [33] eliminates the orthogonality regularization and filters the columns of $B$ and the rows of $A$ (whose combination can be regarded as single-rank components) by directly controlling the diagonal matrix. It controls the diagonal matrix by formulating its elements as a set of learnable gating units which are updated during the fine-tuning procedure.
ALoRA [
34] also filters the components by using gating units; by contrast, it learns the gating units based on neural architecture search [
212].
3.2.3 Rank sampling-based methods
In the SVD parameterization- and component-wise decomposition-based methods, we need to spend extra computational cost to search for proper ranks. To avoid this extra cost, DyLoRA [35] points out that we can allocate ranks directly by random sampling. In each training step, it samples a value $b$ from a pre-defined discrete distribution and allocates $b$ as the rank. Then, the matrices $A$ and $B$ are truncated to rank $b$. In the fine-tuning procedure, only the parameters in the $b$-th row of $A$ and the $b$-th column of $B$ are tunable while the other parameters are frozen. Besides, the distribution can be defined based on users’ preferences.
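A minimal sketch of this rank-sampling scheme is shown below, assuming a uniform sampling distribution purely for illustration.

import torch

d, k, r_max = 256, 256, 16
A = torch.nn.Parameter(torch.randn(r_max, k) * 0.01)
B = torch.nn.Parameter(torch.zeros(d, r_max))

# Sample a rank b for this training step from a pre-defined discrete distribution.
b = int(torch.randint(low=1, high=r_max + 1, size=(1,)))

# The forward pass uses only the leading b rows of A and b columns of B.
delta_w = B[:, :b] @ A[:b, :]

# In DyLoRA's training scheme, only the b-th row of A and the b-th column of B
# would then be updated at this step, while the remaining parameters stay frozen.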
3.3 Optimizing the learning procedure
In practice, LoRA converges more slowly than full fine-tuning. Moreover, it is also sensitive to hyperparameters and suffers from overfitting. These issues affect LoRA’s efficiency and hinder its downstream adaptation performance. To address these issues, researchers have developed several approaches to optimize the learning procedure of LoRA, which can be categorized into the following three types: (1) Initialization Improvement; (2) Gradient Update Optimization; (3) Overfitting Mitigation.
3.3.1 Initialization improvement
LoRA usually initializes its parameter matrices
A and
B using Gaussian noise and zeros respectively. There are two simple schemes: Init[A], which sets matrix
B to zero and randomly initializes matrix
A, and Init[B], which does the reverse. Literature [
36] compares these two schemes and concludes that Init[A] is better through theoretical analysis. It reveals that Init[A] allows using a larger learning rate without causing instability, making the learning process more efficient. However, even with Init[A], this random initialization method still results in small initial gradients, leading to slower convergence. To solve this,
PiSSA [
37] initializes LoRA with the principal singular components of the pre-trained matrix. Since principal singular components represent the most significant directions in the matrix, aligning the initial weights with these components can accelerate convergence and improve performance. In contrast,
MiLoRA [
38] initializes LoRA with the minor singular components. Given that random initialization of low-rank matrices can interfere with the important features learned in the pre-trained matrix, it reduces this interference to improve overall performance while adapting to new tasks.
3.3.2 Gradient update optimization
To further enhance the convergence and reliability of LoRA, several studies have proposed improvements from the perspective of gradient updates. [
39] introduces a scaled gradient method based on Riemannian optimization, which incorporates an $r \times r$ preconditioner in the gradient update step to improve the convergence and hyperparameter robustness of LoRA. Through theoretical analysis,
LoRA+ [
40] discovered the necessity of setting a proportional learning rate for matrices
A and
B to achieve stable feature learning and accelerate convergence.
ResLoRA [
41] introduced residual connections into LoRA to optimize the gradient propagation path, speeding up training convergence and enhancing model performance. Similarly,
SIBO [
42] mitigates over-smoothing by injecting residual connections of the initial token representations into LoRA's input. Additionally, to further reduce computational cost, literature [
43] employs gradient-free optimization methods such as CMA-ES and FWA to optimize LoRA, demonstrating competitive performance in few-shot NLU tasks. Besides,
DoRA (Weight-Decomposed Low-Rank Adaptation) [
44] constrains the gradient update to focus on the directional change of the parameters. It decomposes each pre-trained weight into two components, direction and magnitude, and applies LoRA only to the direction component to enhance training stability.
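As a concrete illustration of the learning-rate asymmetry advocated by LoRA+, the following hedged sketch builds an AdamW optimizer with separate parameter groups for the A and B factors; the base learning rate, the 16x ratio, and the "lora_A"/"lora_B" naming convention are illustrative assumptions rather than prescriptions for any specific model.

import torch

def build_lora_plus_optimizer(model: torch.nn.Module, lr: float = 2e-4, lr_ratio: float = 16.0):
    params_a, params_b = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "lora_A" in name:
            params_a.append(param)
        elif "lora_B" in name:
            params_b.append(param)
    # B factors are trained with a larger learning rate than A factors.
    return torch.optim.AdamW([
        {"params": params_a, "lr": lr},
        {"params": params_b, "lr": lr * lr_ratio},
    ])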
3.3.3 Overfitting mitigation
Although LoRA effectively reduces the number of trainable parameters compared to full fine-tuning, some studies have shown that LoRA is also prone to overfitting [
47], which contradicts previous views. To address this issue,
BiLoRA [
45] adopts a bi-level optimization strategy. It alternately trains the singular vectors and singular values of the low-rank increment matrix on different subsets of the training data. This approach avoids the simultaneous optimization of parameters at different levels on a single dataset, thus mitigating overfitting. In addition, literature [
46] applies dropout to LoRA parameters to reduce overfitting, while
HiddenKey [
47] employs column-wise dropout for attention layers and element-wise dropout for feedforward layers.
3.4 Combining with other learning paradigms
LoRA is compatible with other learning paradigms, such as Bayesian Learning, In-context Learning and Active Learning. Combining LoRA with these learning paradigms can address several problems that hurt the downstream adaptation performance. For example, combining with Bayesian Learning,
Laplace-LoRA [
48] can relieve the overconfidence phenomenon that arises in downstream adaptation. Combining with In-context Learning,
PILLOW [
49] aims to solve the low-resource dilemmas existing in some downstream tasks. Combining with Active Learning,
STAR [
50] can effectively improve the data efficiency.
Finally, to illustrate the performance differences between LoRA and some of its variants, we report their performance with the RoBERTa-base [
213] model on the GLUE benchmark [
214] in Tab.1. These results are derived from previous studies [
9,
16,
32,
45,
90].
4 Cross-task generalization
LoRA’s pluggable nature enables users to accumulate LoRA plugins for different tasks. For example, on the Hugging Face platform, there are more than 20,000 LoRA plugins compatible with various LLMs for different tasks. These accumulated LoRA plugins can not only be utilized independently but also be mixed to achieve cross-task generalization [
60]. Mixing multiple LoRA plugins together, namely LoRA mixture, has been widely applied in areas requiring cross-task generalization, such as multi-task learning, domain adaptation, and continual learning. Existing LoRA mixture methods can be categorized into (1) mixture with manually designed weights; (2) mixture with learnt weights; (3) mixture of LoRA experts. This section introduces each category of methods respectively, as shown in Fig.3.
4.1 Mixture with manually designed weights
Early LoRA mixture methods attempt to linearly combine different LoRA modules with manually designed weights. Some research demonstrates that we can achieve proper cross-task generalization ability by simply averaging LoRA modules or their related outputs [
51–
53]. Furthermore, several methods have been proposed to further improve the performance of the LoRA mixture via adopting manually designed weights. For example,
ControlPE [
54], [
55] and [
56] set the weight factors as hyperparameters, and ControlPE uses hyperparameter search to determine the optimal combination of two LoRA modules. Additionally,
Token-Level Adaptation [
57] utilizes cosine similarity between the input feature and the adapter dataset center as weight factors, while
BYOM [
58] applies basic model fusion methods such as Task Arithmetic, Fisher-Merging, and RegMean.
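A minimal sketch of this kind of manually weighted mixture is shown below: the merged weight is the frozen base weight plus a weighted sum of the per-task low-rank updates, where the mixture weights are hand-chosen hyperparameters (or found by search, as in ControlPE); the shapes and weight values are illustrative.

import torch

d, k, r = 256, 256, 8
W0 = torch.randn(d, k)                                        # frozen pre-trained weight
B1, A1 = torch.randn(d, r) * 0.01, torch.randn(r, k) * 0.01   # LoRA plugin for task 1
B2, A2 = torch.randn(d, r) * 0.01, torch.randn(r, k) * 0.01   # LoRA plugin for task 2

w1, w2 = 0.6, 0.4                                             # manually designed mixture weights
W_merged = W0 + w1 * (B1 @ A1) + w2 * (B2 @ A2)               # merged layer for cross-task use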
Mixture with manually designed weights can quickly mix multiple LoRAs without extra training, which demonstrates simplicity and computational efficiency. However, it often fails to find the optimal weights, leading to unstable performance and limited generalization. Subsequently, researchers have explored using learning-based methods to achieve more precise and adaptive mixtures.
4.2 Mixture with learnt weights
To learn the optimal mixture weights, several methods have been proposed at task level, instance level and token level to meet different needs. Task-level methods focus on enhancing task transferability, which can be either gradient-based, such as [
59], or gradient-free, as seen in
LoRAHub [
60]. LoRAHub employs a black-box algorithm named CMA-ES [
216] to optimize weight factors for LoRA modules, simplifying the training process. Later,
ComPEFT [
61] and
L-LoRA [
62] use LoRAHub to mix quantized LoRA modules, further improving computational efficiency.
Compared to task-level methods, instance-level and token-level methods can provide flexibility and precision for complex inputs. For multimodal instruction tuning,
MixLoRA [
63] dynamically chooses appropriate low-rank decomposition vectors based on the input instance, which are then integrated into LoRA matrices for training. To conduct protein mechanics analysis and design tasks,
X-LoRA [
64] develops a dynamic gating mechanism to assign weights for LoRA modules at the token level and layer granularity. These approaches demonstrate better performance in specific tasks or application scenarios.
4.3 Mixture of LoRA experts
When the LoRA modules are trainable, we can jointly learn the mixture weights and the LoRA modules, which can further improve the performance of the LoRA mixture. To jointly learn the mixture weights and LoRA modules, Mixture of LoRA Experts (LoRA MoE) is a natural choice, where each LoRA module acts as an expert, while a router network typically assigns the mixture weights. LoRA MoE has been proven to be effective in many tasks, such as continual learning [
65,
66], vision-language tasks [
67] and multi-task medical applications [
68].
Existing methods improve the performance of LoRA MoE from the perspectives of initialization, task relationship management and efficiency. For initialization,
Mixture-of-LoRAs [
69] first trains multiple LoRAs separately as initialization and then optimizes the router and LoRAs jointly.
MultiLoRA [
70] proposes refining the initialization to reduce parameter dependency, which can yield more balanced unitary subspaces. As for task balance,
MLoRE [
71] adds a low-rank convolution path in the MoE structure to capture global task relationships.
MTLoRA [
72] adopts both task-agnostic and task-specific LoRA modules to address task conflicts. For efficiency,
MoLA [
73] adaptively allocates different numbers of LoRA experts to different layers of the Transformer model to reduce the total number of LoRA modules.
LLaVA-MoLE [
74] and
SiRA [
75] leverage sparse computation to reduce computational cost. Additionally,
Octavius [
76] sparsely activates independent LoRA experts with instance-level instructions to mitigate task interference and improve efficiency.
Fast LoRA [
77] allows each sample in a minibatch to have its unique low-rank adapters, enabling efficient batching.
Besides, some methods are not explicitly based on MoE but follow MoE ideas. For example,
MoSLoRA [
78] decomposes LoRA into subspaces and employs a learnable mixer to fuse these subspaces.
5 Efficiency improving
With the popularization of LLMs, the demand for training and running LoRA modules increases rapidly. This increasing demand brings a non-negligible computational burden; thus, for LoRA, the smaller, the faster, the better. To meet this demand, existing methods improve the computational efficiency of LoRA from the perspectives of (1) parameter reduction; (2) parameter quantization; (3) parallel LoRA computing frameworks. This section introduces each category of methods, as illustrated in Fig.4.
5.1 Parameter reduction
LoRA significantly reduces the number of tunable parameters for fine-tuning LLMs. However, it still requires expensive activation memory to update low-rank matrices. To further reduce the memory cost, existing methods reduce the number of tunable parameters of LoRA via parameter freezing, parameter pruning, and parameter sharing.
5.1.1 Parameter freezing
Parameter freezing methods reduce the number of tunable parameters for LoRA via freezing some of its parameters. They can be divided into two categories: intra-parameter methods and extra-parameter methods.
The intra-parameter methods tune a subset of parameters of LoRA while freezing the others.
LoRA-SP [
79] randomly selects half of the LoRA parameters to freeze during fine-tuning.
LoRA-FA [
80] freezes the down-projection weights and updates the up-projection weights in each layer of LoRA.
AFLoRA [
81] constructs a low-rank trainable path and gradually freezes parameters during training LoRA. Additionally,
DropBP [
82] accelerates the training process by randomly dropping some LoRA gradient calculations during backpropagation.
By contrast, the extra-parameter methods introduce and tune a set of extra parameters while freezing the original parameters of LoRA. Most of them are proposed based on Singular Value Decomposition (SVD).
LoRA-XS [
83] adds a small $r \times r$ weight matrix between the frozen LoRA matrices, which are constructed using the SVD of the original weight matrix; then it tunes only these small $r \times r$ matrices during fine-tuning. Similarly,
BYOM-LoRA [
58] adopts SVD to compress LoRA matrices for multi-task models.
5.1.2 Parameter pruning
Parameter pruning methods aim to remove unimportant LoRA parameters during training and inference. They prune parameters by either pruning LoRA independently or jointly pruning LoRA and the LLM.
LoRA-drop [
84] uses the output of LoRA at each layer to evaluate the importance of parameters and prune the unimportant parameters. By contrast,
LoRAPrune [
85] jointly prunes the LoRA matrices and the LLM parameters based on LoRA’s gradients. Besides, we can also use LoRA to support parameter pruning for LLMs [
86,
87].
5.1.3 Parameter sharing
Parameter-sharing methods reduce the number of parameters by sharing parameters across different layers or modules of LLMs.
VeRA [
88] and
VB-LoRA [
89] are two representative parameter-sharing methods for LoRA. Specifically, VeRA proposes to share a pair of frozen random matrices across all layers and conduct layer-wise adaptation with “scaling vectors”. By contrast, VB-LoRA proposes a “divide-and-share” paradigm, which divides LoRA’s low-rank decomposition by a rank-one decomposition and achieves global sharing based on an admixture model. Instead of sharing parameters in the original parameter space,
FourierFT [
90] converts the incremental matrix $\Delta W$ into the spatial domain using the Fourier transform. It shares spectral entries across all layers and only learns sparse spectral coefficients for each layer, thus reducing the number of trainable parameters.
5.2 Parameter quantization
Quantization, which reduces the bit width of parameters (e.g., from 32-bit floats to 4-bit integers), can be used to reduce the memory and computational cost of LoRA. Existing quantization-aware LoRA methods consist of post-training quantization (PTQ)-based methods and quantization-aware training (QAT)-based methods [
95].
5.2.1 PTQ-based methods
In PTQ-based methods, we first quantize an LLM and then fine-tune the quantized model, namely quantization and fine-tuning are sequentially conducted.
QLoRA [
91] is the first PTQ-based quantization-aware LoRA method. In the fine-tuning stage, it first quantizes an LLM to 4 bits and then fine-tunes a LoRA module on it with a higher precision, such as BFloat16 or Float16. In the inference stage, it dequantizes the LLM to the same precision as LoRA and then adds the LoRA updates to the LLM.
Although QLoRA can significantly reduce memory cost for fine-tuning, it does not bring benefits for inference, because it requires dequantizing the LLM to high precision again. To solve this problem,
QA-LoRA [
92] is proposed to reduce memory cost for both the fine-tuning and inference stages. QA-LoRA uses group-wise operators to balance the degrees of freedom of the LLM quantization and fine-tuning, which enables it to obtain a LoRA module having identical precision with the quantized LLM. Thus, it can perform inference without dequantization.
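For reference, the following hedged sketch shows how a QLoRA-style setup is commonly assembled with the Hugging Face transformers, peft, and bitsandbytes libraries; it is not the original authors' implementation, and the model name, target modules, and hyperparameters are placeholder assumptions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize the frozen base model to 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # LoRA computations run in higher precision
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # placeholder model name
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "v_proj"],     # assumed target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the LoRA factors are trainable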
5.2.2 QAT-based methods
In QAT-based methods, we jointly quantize and fine-tune an LLM, namely quantization and fine-tuning are simultaneously conducted. These methods can alleviate the quantization discrepancies observed in PTQ-based methods. To address the quantization discrepancy of QLoRA,
LoftQ [
93] alternatively applies quantization and low-rank approximation during fine-tuning to minimize the quantization error. However,
ApiQ [
94] points out that LoftQ ignores the error propagation across layers and proposes activation-preserved initialization to avoid error propagation. Besides,
L4Q [
95] is another QAT-based method that has an advanced layer design.
5.3 Parallel LoRA computing frameworks
LoRA’s parameter-efficient nature enables us to fine-tune or infer multiple modules on a single GPU or a GPU cluster, which can save computational resources and improve the efficiency of LoRA. This section introduces the parallel fine-tuning and parallel inference frameworks, respectively.
5.3.1 Parallel fine-tuning
Parallelly fine-tuning multiple LoRA modules on a single GPU can reduce GPU memory usage and improve computation efficiency.
ASPEN [
96] proposes a high-throughput parallel finetuning framework for LoRA, which consists of a BatchFusion approach and an adaptive job scheduling algorithm. Specifically, the BatchFusion approach supports parallelly fine-tuning multiple LoRA modules on a shared LLM by fusing multiple input batches into a single batch, while the adaptive job scheduling algorithm allocates computation resources to the fine-tuning jobs.
5.3.2 Parallel inference
Parallel inference framework for LoRA can not only improve the computational efficiency but also support the needs of multi-tenant service.
Punica [
97] uses a new CUDA kernel design to batch GPU operations for different LoRA modules. Based on Punica,
S-LoRA [
98] further optimizes the parallel inference framework by introducing a unified paging mechanism and a new tensor parallelism strategy, which enables the service of thousands of concurrent LoRA modules. Then, based on Punica and S-LoRA,
CaraServe [
99] reduces the cold-start overhead and further improves the service efficiency and SLO (service-level objective) attainment rates by CPU-GPU cooperation and rank-aware scheduling.
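The core computation behind such multi-LoRA serving systems can be sketched in plain PyTorch as below: the base projection is computed once for the whole batch, while each request's hidden state is routed through the LoRA plugin it asked for. Real systems such as Punica and S-LoRA implement this gather-and-multiply pattern with custom CUDA kernels and paged memory; the shapes and adapter assignment here are illustrative.

import torch

d, k, r, n_adapters, batch = 256, 256, 8, 3, 5
W0 = torch.randn(d, k)                          # shared base weight
A = torch.randn(n_adapters, r, k) * 0.01        # stacked A factors, one per LoRA plugin
B = torch.randn(n_adapters, d, r) * 0.01        # stacked B factors, one per LoRA plugin

x = torch.randn(batch, k)                       # one hidden state per request
adapter_ids = torch.tensor([0, 2, 1, 0, 2])     # which plugin each request uses

base = x @ W0.T                                 # base-model projection, computed once
Ax = torch.einsum("brk,bk->br", A[adapter_ids], x)        # per-request down-projection
update = torch.einsum("bdr,br->bd", B[adapter_ids], Ax)   # per-request up-projection
y = base + update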
6 LoRA for federated learning
When adapting LLMs to vertical domains such as medicine and finance, the available training data can be privately owned by multiple clients. In this scenario, the training data is not centralized, and we have to fine-tune LLMs while keeping the data localized, namely federated learning. In federated learning, the clients typically compute weight updates locally and then share these updates with others to globally update the LLM. It brings both communication and computation costs for the clients. Fortunately, LoRA is parameter efficient and pluggable, which can reduce communication costs and lower computational resource requirements. LoRA can enhance the overall efficiency and scalability of federated learning.
However, adopting LoRA in federated learning is not trivial for federated learning faces challenges such as data heterogeneity, device heterogeneity, and model heterogeneity. To address these issues, recent studies have designed various methods for LoRA to meet the diverse needs of federated learning, as shown in Fig.5. Additionally, as a localized parameter component, LoRA’s pluggable nature allows it to support parameter privacy protection in federated learning.
6.1 Data heterogeneity
Data heterogeneity refers to differences in data distribution across clients. In federated learning, different clients usually have different data distributions. The inconsistency in data distribution affects the overall performance of the model. Research reveals that in federated learning, as user data becomes more diverse, the performance gap between LoRA and full fine-tuning widens [
100]. To address this issue, researchers have proposed several improvement methods.
SLoRA [
100] introduces a data-driven initialization method for LoRA. It first performs sparse federated fine-tuning before applying LoRA and then performs SVD to decompose the accumulated gradient updates into low-rank matrices for LoRA initialization. The goal is to enable the LoRA modules to better adapt to the data distribution of each client, thereby integrating these heterogeneous data characteristics into the global model more effectively.
FeDeRA [
101] uses a simpler initialization method. It directly applies SVD to pre-trained weights to initialize LoRA. Retaining the principal components of the pre-trained weights aligns the direction and magnitude of weight updates across different clients to handle data heterogeneity. Additionally,
FFA-LoRA [
102] freezes one low-rank matrix and fine-tunes only the other. This reduces inconsistency during server aggregation of LoRA gradients, alleviating the optimization instability caused by non-IID data.
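As a point of reference for these methods, the sketch below shows the naive baseline they improve upon: FedAvg applied directly to the clients' LoRA factors. The uniform weighting and shapes are illustrative assumptions, and the final comment notes the aggregation inconsistency that motivates approaches such as FFA-LoRA.

import torch

d, k, r, n_clients = 128, 128, 4, 3
# Each client fine-tunes its own LoRA factors locally and uploads them.
client_updates = [
    {"A": torch.randn(r, k) * 0.01, "B": torch.randn(d, r) * 0.01}
    for _ in range(n_clients)
]

def fedavg_lora(updates):
    # Server-side aggregation: element-wise average of the clients' LoRA factors.
    avg_A = torch.stack([u["A"] for u in updates]).mean(dim=0)
    avg_B = torch.stack([u["B"] for u in updates]).mean(dim=0)
    return avg_A, avg_B

global_A, global_B = fedavg_lora(client_updates)
# Note: averaging A and B separately is not the same as averaging the products
# B @ A, which is one source of the aggregation inconsistency that freezing one
# factor (as in FFA-LoRA) avoids.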
6.2 Device heterogeneity
Device heterogeneity refers to differences in hardware capabilities and network connectivity among the clients participating in federated learning. Traditional federated learning methods often encounter the “bucket effect”, meaning that the system’s overall performance is limited by the capability of the least powerful client. Specifically, these methods use the smallest LoRA rank to accommodate all clients, which prevents many resource-rich clients from fully utilizing their potential.
To address this issue, a dynamic parameter allocation strategy can be adopted.
FedMS [
103] dynamically adjusts the number of activated LoRA matrices based on the real-time computational resources of clients.
FlexLoRA [
104] uses a dynamic parameter allocation strategy. It adjusts the LoRA rank and redistributes the SVD components of the global LoRA weights based on resource constraints. Similarly,
HETLORA [
105] assigns different ranks to different clients. Moreover, it performs weighted aggregation according to the sparsity of the updates from different clients, balancing update information better than simple aggregation.
6.3 Model heterogeneity
Model heterogeneity indicates differences in model structures among clients. In traditional federated learning, clients use local models with the same architecture, allowing their parameters to be aggregated into a global model on the server. However, in practice, clients may prefer unique local model architectures due to personal needs and often do not want to disclose model details. Thus, it is necessary to transfer knowledge between heterogeneous models without sharing private data or revealing local model structures [
217].
Previous work has used knowledge distillation, model ensembling, and mutual learning to address model heterogeneity. However, these methods have limitations, such as reliance on public datasets, additional communication costs and poor local model performance. To avoid these limitations,
pFedLoRA [
106] uses LoRA as a carrier of both global and local knowledge. It adopts an iterative training strategy to facilitate knowledge transfer and integration, enabling knowledge sharing among heterogeneous models across different clients.
6.4 Parameter privacy
In federated learning, protecting client-specific parameters is crucial because ensuring the privacy of these parameters also indirectly safeguards client data privacy. As a modular approach to adjusting personalized parameters, LoRA can be effectively integrated into federated learning systems to achieve parameter privacy protection.
Literature [
107] proposes a secure distributed language model training framework based on model slicing. They deploy LoRA in a Trusted Execution Environment (TEE) and use OTP encryption to transmit features between the GPU and TEE, protecting model parameter privacy.
PrivateLoRA [
108] introduces a distributed system based on LoRA. It adds a square matrix $M$ between the low-rank matrices $A$ and $B$. The non-trainable matrices $A$ and $B$, along with most of the pre-trained weights, are deployed on the global server to enhance computation. Meanwhile, the trainable matrix $M$ is stored on the client as personalized parameters, thus ensuring parameter privacy protection.
Furthermore, recent works have integrated differential privacy (DP) techniques with LoRA in federated learning to enhance data privacy.
DP-LoRA [
218] ensures differential privacy by adding Gaussian noise to LoRA’s weight updates during the update process. This approach maintains privacy and improves communication efficiency. To solve the noise amplification when applying differential privacy in LoRA,
FFA-LoRA [
102] fixes the matrix $A$, avoiding the local semi-quadratic structure and enhancing robustness and performance.
7 Applications of LoRA
In the rapidly evolving field of deep learning, LoRA has become widely used due to its unique advantages. Researchers utilize LoRA to fine-tune pre-trained models for various downstream tasks, reducing computational resource requirements while enhancing performance. LoRA’s strong adaptability and efficiency have significantly improved various applications. In this section, we will introduce LoRA’s applications in the following scenarios: (1) language tasks; (2) vision tasks; (3) multimodal tasks.
7.1 Language tasks
Recently, the rapid development of pre-trained language models, especially LLMs, has been revolutionizing the approach to language tasks thanks to their outstanding performance. However, these pre-trained models are trained on a large amount of general data and still require further fine-tuning on task-specific data to adapt to downstream tasks. Therefore, it is natural to use LoRA to fine-tune these pre-trained language models, as it reduces computational resource requirements. We mainly focus on some representative downstream tasks, including traditional NLP tasks, code tasks, model alignment and vertical domain tasks.
7.1.1 Traditional NLP tasks
Given the strong instruction-following and contextual understanding abilities of LLMs, some studies apply LoRA to fine-tune these models for traditional NLP tasks. For example, LoRA is widely adopted in LLaMA for various tasks, such as emotion recognition [
109], text classification [
110] and role recognition [
111].
AutoRE [
112] applies QLoRA to three document-level relation extraction tasks, achieving great performance on different LLMs. Some studies [
113–
115] leverage LoRA from different perspectives to enhance the model’s capability in machine translation tasks. Additionally, LoRA can also improve the performance of models like BERT and T5 for text understanding tasks [
116,
117].
7.1.2 Code tasks
Some studies apply LoRA to improve model performance in various code-related tasks. For example, BERT-style models fine-tuned with LoRA are suitable for code-change-related tasks, specifically in Just-In-Time defect prediction (JIT-DP) [
118,
119]. Similarly, training CodeT5 and PLBART with LoRA can enhance their adaptability for code summarization and code clone detection [
120]. As for the decoder-only model,
RepairLLaMA [
121] uses LoRA to fine-tune Llama for automated program repair (APR), while WizardCoder-15B is fine-tuned with LoRA for Text-to-SQL task [
122]. Additionally,
SteloCoder [
123], a fine-tuned version of StarCoder, is designed for multi-language to Python code translation.
7.1.3 Model alignment tasks
Model alignment tasks focus on adjusting a machine learning model to align with human values and intentions, often using techniques like Reinforcement Learning from Human Feedback (RLHF). To reduce memory requirements of RLHF, some studies use LoRA to fine-tune the reward model and policy model [
124–
126]. Furthermore, other works improve reward models by integrating multiple LoRA adapters. For example,
DMoERM [
127] combines MoE with LoRA, routing model inputs to multiple LoRA experts while another work [
128] proposes a LoRA-based ensemble method as well. The integration can also benefit the quantification of uncertainty in reward models [
129]. Besides, literature [
130] applies Laplace-LoRA with a Gaussian prior assumption [
131] to train Bayesian reward models, which mitigates reward overoptimization in best-of-n sampling.
7.1.4 Vertical domain tasks
LLMs often perform suboptimally in vertical domains, requiring fine-tuning with domain-specific expertise. Some works apply LoRA to improve the performance of LLMs on domain-specific tasks. For example, some studies fine-tune LLMs on medical datasets with LoRA to adapt them to the medical domain [
132–
134]. Additionally, other studies improve medical tasks like clinical dialogue summarization [
135], assertion detection [
136] and medical QA tasks [
137,
138]. Similarly, several studies fine-tune LLMs with LoRA on financial data to solve tasks such as financial news analytics and sentiment classification [
139–
142]. Besides, LoRA can also be used to enhance the performance in database tasks like query rewrite and index tuning [
143].
7.2 Vision tasks
In vision tasks, LoRA is primarily applied to image generation and image segmentation, significantly improving training efficiency and optimizing model performance.
7.2.1 Image generation
Image generation tasks hold significant importance in the field of computer vision. In recent years, diffusion models have demonstrated exceptional performance in image generation tasks. LoRA is widely used in diffusion models to address various image generation tasks while reducing computational resources. Some works use LoRA to fine-tune diffusion models for image style transfer [
144–
148], while others apply it to text-to-image generation [
149–
153].
Furthermore, researchers have designed several LoRA-based methods to improve image generation quality. For instance,
Smooth Diffusion [
154] uses LoRA to achieve smoothness in the latent space, leading to better performance in various image generation and editing tasks.
ResAdapter [
155] employs LoRA to learn resolution priors, adjusting the receptive fields of convolutional layers to dynamic resolutions. Additionally, to specifically enhance text-to-image quality,
STAMINA [
156] uses LoRA to fine-tune diffusion models for longer concept sequences.
DreamSync [
157] and
StyleAdapter [
158] use LoRA to improve text fidelity and image quality.
Mix-of-Show [
159] captures out-of-domain information with LoRA weights to combine multiple customized concepts with high fidelity, reducing concept conflicts. Other studies combine LoRA with model distillation to accelerate image generation [
160,
161]. Moreover, LoRA can also be applied to video generation [
162–
167] and 3D generation tasks [
168–
172].
7.2.2 Image segmentation
Image segmentation is a significant challenge in computer vision, aiming to divide an image into multiple meaningful regions or objects. To address this, the Segment Anything Model (SAM) has been proposed as a foundation model for image segmentation and has demonstrated superior generalization ability. To further enhance its performance in specific vertical domains, many studies utilize LoRA to fine-tune it. For instance, in license plate detection,
SamLP [
173] utilizes LoRA to adapt SAM for efficient segmentation of license plates. In structural damage detection, literature [
174] fine-tunes SAM’s encoder using LoRA for instance segmentation task. In the medical domain, many studies also apply LoRA to fine-tune SAM for a variety of tasks, including nuclei segmentation [
175], OCTA image segmentation [
176], brain tumor segmentation [
177], organ segmentation [
178], and surgical instrument segmentation [
179]. Additionally, some studies use LoRA to fine-tune Vision Transformer (ViT) for visual tracking [
180] and face forgery detection [
181].
7.3 Multimodal tasks
Multimodal Large Language Models (MLLMs) aim to integrate text with various modalities such as audio, image and video, enabling cross-modal understanding and reasoning through a unified embedding space. The success of LoRA in both NLP and vision tasks has sparked considerable interest in applying it to MLLMs.
In MLLMs, LoRA can not only improve training efficiency but also facilitate effective modality alignment. In audio-text tasks,
SALM [
182] comprises LoRA layers, a frozen text-based LLM, an audio encoder and a modality adapter to handle speech inputs and corresponding task instructions. For image-text tasks,
InternLM-XComposer2 [
183] achieves modality alignment by applying LoRA to image tokens,
mPLUG-Owl [
184] freezes the visual module while jointly fine-tuning LoRA and abstractor of the text module, and
CoLLaVO [
185] employs QLoRA to preserve object-level image understanding. In the realm of video-text tasks,
VSP-LLM [
186] fine-tunes the text module with QLoRA for visual speech processing,
MolCA [
187] uses LoRA to understand 2D molecular graphs and text, while
TPLLM [
188] employs LoRA for efficient traffic prediction by integrating sequence and spatial features. These applications demonstrate the versatility and power of LoRA in MLLM tasks.
8 Conclusion and future direction
In this survey, the recent progress of LoRA has been systematically reviewed from the perspectives of downstream adaptation improvement, cross-task generalization, efficiency improvement, federated learning and applications. From this review, we can see that LoRA is parameter-efficient, pluggable, compatible with other methods and able to achieve cross-task generalization, which makes it one of the most important technologies for LLM applications. Recent progress further boosts the generalization and efficiency of LoRA and stimulates its potential to be used in more scenarios. Here, we list three future directions where LoRA will be indispensable.
8.1 LoRA for GaaS
In Generative-as-a-Service (GaaS), cloud-based platforms provide users with generative artificial intelligence services. GaaS enables users to enjoy generative AI without deploying local computational resources. Because users' needs are diverse, it is necessary to provide various functions for GaaS. To implement these functions, we can construct a LoRA module for each function. The parameter efficiency and pluggability of LoRA can facilitate efficient construction and execution of these functions. Besides, the services on GaaS platforms can change rapidly over time. To follow these changes, we can train new LoRA modules initialized by combining previous LoRA modules. The cross-task generalization ability of LoRA can facilitate fast adaptation to service updates.
8.2 LoRA for continued pre-training
In continued pre-training, a foundation model is continually trained with unlabeled user data to adapt it to specific domains. Typically, the self-supervised training objective is the same as that used in pre-training, and the learning rate is much smaller than that used in pre-training. Continued pre-training is an important stage for constructing vertical-domain LLMs. However, it is highly computationally expensive, which impedes the development of vertical-domain LLMs, especially for organizations with limited computational resources. Enhancing LoRA for continued pre-training and reducing its computational cost is worth exploring.
8.3 LoRA for autonomous agents
In LLM-based autonomous agents, the agents are assigned specific roles. Based on the roles and the environment, agents take actions to respond to requests from users or other agents. The actions can be made based on self-knowledge or tools designed for domain-specific tasks. The requests and the actions are stored in memory to support future requests.
In current agents, the roles are typically assigned by prompts; however, a prompt may not give a comprehensive description of a role when the role is complex and the amount of related data is large. Assigning roles with LoRA modules trained on data related to the roles can be a better choice. Furthermore, the tools of an agent can also be LoRA modules. Besides, the memory usually augments the agents with retrieval-augmented generation (RAG); however, due to input token limitations and the shortcomings of in-context learning, the RAG-based support may be less effective. By contrast, we can use LoRA-based continual learning to construct memory modules, which can alleviate the problems of RAG. Therefore, LoRA-driven agents are worth exploring.