1 Introduction
Large Language Models (LLMs) possess massive parameters and are trained on vast datasets, demonstrating exceptional proficiency in various tasks. These remarkable advancements also inspire the exploration of leveraging LLMs as recommenders (LLMRec), whose effectiveness stems from the extensive open-world knowledge and reasoning ability of LLMs [1]. LLMRec obtains its recommendation ability through instruction tuning on user interaction data. In many cases, however, it is also crucial for LLMRec to forget specific user data, which is referred to as recommendation unlearning [2], as shown in Fig.1.
Fig.1 The process of recommendation unlearning
The necessity of recommendation unlearning mainly arises from two aspects. 1) Privacy: according to privacy legislation, recommenders are obligated to erase sensitive data upon user request in order to protect user privacy. 2) Utility: noisy or polluted data can severely degrade recommendation performance; once such data are identified, recommenders need to forget them to regain utility [3]. However, recommendation unlearning in the era of LLMs poses a new challenge of inefficiency to existing approaches. Current unlearning methods all require updating all parameters of the model, which is expensive and time-consuming given the billions of parameters in LLMs [4, 5]. Some studies [6] explore unlearning in LLMs, but they use a gradient-ascent-based approach to unlearn knowledge, which breaks the classification boundary and harms model utility on normal data.
To this end, we propose E2URec, Efficient and Effective Unlearning for LLMRec. Our main contributions are summarized as follows. 1) We study the unlearning problem for LLMRec. Our proposed E2URec outperforms existing approaches in terms of both efficiency and effectiveness. 2) For efficiency, we propose to add a lightweight low-rank adaptation (LoRA) module to the original LLM. Only the LoRA parameters are updated during unlearning, while the parameters of the LLM are frozen. 3) For effectiveness, we design a novel forgetting teacher and a remembering teacher to guide the unlearned model, so that it can forget the data and maintain the recommendation performance, respectively.
2 LLMs as recommenders
LLMRec aims to utilize an LLM to predict whether an item will be clicked by a user. We denote the recommendation dataset as $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ with $N$ samples, where $x_i$ is the feature set of the $i$-th sample and $y_i \in \{0, 1\}$ is the label. The features $x_i$ are converted into a textual sentence $t_i$ via hard prompt templates. Similarly, the label $y_i$ (click or not) is converted into the corresponding answer words $a_i$.
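As an illustration, a hard prompt template might render tabular features and the binary label into text as follows (the field names and wording are hypothetical, not the paper's actual template):

```python
def to_prompt(features):
    """Hypothetical hard prompt template: tabular features -> textual sentence t_i."""
    return (f"The user is a {features['age']}-year-old {features['gender']} "
            f"who likes {features['genre']} movies. "
            f"Will the user click the movie '{features['title']}'?")

def to_answer(label):
    """Binary label y_i -> answer words a_i."""
    return "Yes" if label == 1 else "No"

prompt = to_prompt({"age": 25, "gender": "female", "genre": "sci-fi",
                    "title": "Blade Runner"})
answer = to_answer(1)  # "Yes"
```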
The causal language modeling objective is used to optimize the LLM on dataset $\mathcal{D}$, by minimizing the negative log-likelihood of generating $a_i$ conditioned on the input $t_i$:

$$\mathcal{L}_{rec} = -\sum_{i=1}^{N} \sum_{j=1}^{|a_i|} \log P\big(a_i^j \mid t_i, a_i^{<j}\big), \tag{1}$$

where $a_i^j$ is the $j$-th token of $a_i$, and $a_i^{<j}$ denotes the tokens before $a_i^j$.
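For a single sample, this objective reduces to summing the negative log-probability of each answer token given the prompt and all preceding answer tokens. A minimal sketch (the token probabilities below are made up for illustration):

```python
import math

def causal_lm_nll(token_probs):
    """Negative log-likelihood of one answer sequence.

    token_probs[j] is the model's probability of the j-th answer token
    conditioned on the prompt and all answer tokens before position j.
    """
    return -sum(math.log(p) for p in token_probs)

# Example: a two-token answer ("Yes", "</s>") with per-token probabilities.
loss = causal_lm_nll([0.9, 0.8])
```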
3 Methodologies
Formally, suppose that on the recommendation dataset $\mathcal{D}$ we have trained an LLMRec model, called the original model. Subsequently, a deletion request is received: the data to be removed is denoted as the forgotten data $\mathcal{D}_f$, and the remaining data is the retained data $\mathcal{D}_r = \mathcal{D} \setminus \mathcal{D}_f$. The goal of unlearning is then to learn an unlearned model that forgets the information about $\mathcal{D}_f$ without hurting the recommendation performance. The unlearned model is initialized with the parameters of the original model.
3.1 Parameter-efficient unlearning
Considering the billions of parameters of an LLM, updating all of them for forgetting is resource-intensive. Inspired by recent advances in parameter-efficient finetuning, we propose to insert lightweight LoRA modules into the LLM, as shown in Fig.2(b). The LoRA modules add pairs of rank-decomposition weight matrices to the original parameters of the LLM while introducing only a few parameters. In this way, we model the unlearned model as $g(x; \theta, \phi)$, where $\theta$ denotes the parameters of the LLM and $\phi$ the LoRA parameters. During the unlearning process, we only need to update $\phi$, while $\theta$ remains frozen. This greatly reduces the required computing resources and time.
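The savings come from the rank decomposition: a LoRA pair replaces the dense update of a $d_{out} \times d_{in}$ weight with two matrices $B$ ($d_{out} \times r$) and $A$ ($r \times d_{in}$) with $r \ll d$. A back-of-the-envelope sketch (the 768-dimensional projection and rank 8 are illustrative values, not the paper's configuration):

```python
def lora_trainable_params(d_in, d_out, r):
    """Trainable parameters added by one LoRA pair: B (d_out x r) + A (r x d_in)."""
    return r * (d_out + d_in)

def full_finetune_params(d_in, d_out):
    """Trainable parameters when updating the full weight matrix."""
    return d_out * d_in

# Example: a 768 x 768 attention projection with rank r = 8.
lora = lora_trainable_params(768, 768, 8)   # 12_288 parameters
full = full_finetune_params(768, 768)       # 589_824 parameters
ratio = lora / full                         # roughly 2% of the full matrix
```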
Fig.2 The framework of our proposed E2URec
3.2 Unlearning with forgetting/remembering teachers
We aim to achieve unlearning by using two teachers, as depicted in Fig.2(b). To remove knowledge, we update the unlearned model to produce distributions similar to those of the forgetting teacher on the forgotten data $\mathcal{D}_f$. Simultaneously, to preserve recommendation performance, we update the unlearned model to produce distributions similar to those of the remembering teacher on the retained data $\mathcal{D}_r$. The whole process can be formulated as:

$$\min_{\phi} \; \mathbb{E}_{x \in \mathcal{D}_f}\big[\mathrm{KL}\big(p_{ft}(x) \,\|\, p_u(x)\big)\big] + \mathbb{E}_{x \in \mathcal{D}_r}\big[\mathrm{KL}\big(p_{rt}(x) \,\|\, p_u(x)\big)\big],$$

where $\mathrm{KL}(\cdot\,\|\,\cdot)$ is the KL divergence between the output probability distributions of the teacher and the unlearned model.
The forgetting teacher should never have seen the forgotten data. The retrained model, i.e., the model trained from scratch without observing $\mathcal{D}_f$, seems a natural forgetting teacher; however, retraining is inefficient and not viable in practice. We therefore design an approximate forgetting teacher.
As shown in Fig.2(a), we first finetune an augmented model on the forgotten data $\mathcal{D}_f$. The augmented model, with additional training on $\mathcal{D}_f$, outputs logits that are more relevant to $\mathcal{D}_f$. Therefore, the difference between the logits of the augmented model and those of the original model represents the information related to the forgotten data. Denoting the logits of the augmented model and the original model as $z_a$ and $z_o$ respectively, the difference is $z_a - z_o$. Subtracting this difference from the logits of the original model yields logits $z_{ft}$ that exclude the information about $\mathcal{D}_f$:

$$z_{ft} = z_o - \alpha (z_a - z_o),$$

where $\alpha$ is a positive hyper-parameter. The output probability distribution of the forgetting teacher is then the normalized $z_{ft}$, defined as $p_{ft} = \mathrm{softmax}(z_{ft})$.
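The construction above can be sketched as follows (pure-Python softmax; the logit values in the example are made up for illustration):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def forgetting_teacher_probs(z_orig, z_aug, alpha):
    """p_ft = softmax(z_o - alpha * (z_a - z_o)).

    z_orig: logits of the original model,
    z_aug:  logits of the augmented model (extra training on the forgotten data),
    alpha:  positive strength of the subtraction.
    """
    z_ft = [zo - alpha * (za - zo) for zo, za in zip(z_orig, z_aug)]
    return softmax(z_ft)

# Example: the augmented model is more confident in class 0 (the memorized answer);
# subtracting the difference pushes the forgetting teacher away from that class.
p = forgetting_teacher_probs([2.0, 1.0], [3.0, 0.5], alpha=1.0)
```

With alpha = 0 the forgetting teacher simply reproduces the original model's distribution; larger alpha removes the forgotten-data signal more aggressively.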
So far, we have acquired the forgetting teacher’s outputs. The forgetting loss can then be formulated as:

$$\mathcal{L}_{fgt} = \mathbb{E}_{x \in \mathcal{D}_f}\big[\mathrm{KL}\big(p_{ft}(x) \,\|\, p_u(x)\big)\big].$$
Simply forgetting will hurt the model’s recommendation performance. To retain the original recommendation ability, we encourage the unlearned model to “stay close” to the remembering teacher on the retained data. We choose the original model as the remembering teacher, because it has the best recommendation performance. Besides, to further strengthen the knowledge related to the recommendation task, we also add the prediction loss $\mathcal{L}_{rec}$ from Eq. (1). Formally, the remembering loss is:

$$\mathcal{L}_{rem} = \mathbb{E}_{x \in \mathcal{D}_r}\big[\mathrm{KL}\big(p_{rt}(x) \,\|\, p_u(x)\big)\big] + \mathcal{L}_{rec}.$$
Finally, the loss of E2URec is the weighted sum of the forgetting loss and the remembering loss, controlled by the hyper-parameter $\lambda$:

$$\mathcal{L} = \lambda \mathcal{L}_{fgt} + (1 - \lambda) \mathcal{L}_{rem}.$$
4 Experiments
We conduct experiments on two public recommendation datasets: MovieLens-1M (ML-1M) and GoodReads (GD). Both datasets are split into training, validation and testing sets with a ratio of 6:2:2 according to the global timestamp. In the experiments, 20% of randomly chosen users request removal of their training data. We use T5-base [1] as the LLM backbone and set the hyper-parameters $\alpha$ and $\lambda$ to fixed values. Our method only needs to update 0.7% of the total parameters. The code is available at github.com/justarter/E2URec.
We compare our E2URec with the state-of-the-art methods:
Original: the original model without unlearning.
Retrain: the model retrained from scratch without the forgotten data, included as a gold standard.
SISA [2]: Sharded, Isolated, Sliced and Aggregated training.
RecEraser [3]: improves SISA by collaborative sharding and aggregation.
NegKL: uses the KL loss to finetune the original model on both the retained and forgotten data, negating the KL loss for the latter.
NegGrad [4]: uses the prediction loss to finetune the original model on both the retained and forgotten data, negating the gradient for the latter.
Bad-T [5]: uses the prediction loss to finetune the original model on both the retained and forgotten data, randomly assigning arbitrary labels to the latter.
We use the following metrics for analysis. 1) AUC, ACC and LogLoss (LL) on the test set: measure the recommendation performance of the unlearned model. 2) JS-Divergence (JSD) and L2-norm on the forgotten data: the JSD and L2-norm between the outputs of the unlearned and retrained models measure the effectiveness of unlearning; the smaller these metrics, the better the unlearning. 3) Unlearning Time and the number of Trainable Parameters (#Params): measure the efficiency of the unlearning method.
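For concreteness, the JSD between two categorical output distributions can be sketched as follows (a minimal pure-Python version; the reported numbers average this quantity over the forgotten data):

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) for categorical distributions (0 log 0 := 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical unlearned/retrained outputs give JSD = 0 (perfect unlearning);
# maximally different one-hot outputs give the upper bound log(2).
```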
4.1 Results analysis
We list the comparison results in Tab.1 and Tab.2. From the results on the two datasets, we observe that: 1) Our method E2URec better maintains the recommendation performance, achieving better AUC, ACC and LogLoss than the other baselines. This is because E2URec minimizes the KL distance between the forgetting teacher and the unlearned model to remove knowledge, instead of reversing gradients as in previous methods, thereby preserving model performance. 2) The prediction distributions of our unlearned model on the forgotten data closely align with those of the retrained model, as evidenced by the smallest JSD and L2-norm. This indicates that E2URec achieves the best unlearning effect thanks to our forgetting teacher design, which only requires modifying the model’s output to mimic the retrained model. 3) E2URec attains superior unlearning efficiency: it has the lowest time cost and #Params since it only updates the lightweight LoRA parameters instead of all model parameters.
Tab.1 All metrics comparison results (in %) on ML-1M. The best results (except for Original and Retrain) are in bold

Method | AUC ↑ | ACC ↑ | LL ↓ | JSD ↓ | L2-norm ↓ | Time (s) ↓ | #Params ↓
Original | 77.44 | 70.60 | 56.78 | − | − | 9048 | 2.2×10^8
Retrain | 76.85 | 69.98 | 57.35 | − | − | 5279 | 2.2×10^8
SISA | 75.35 | 68.52 | 58.89 | 2.05 | 9.82 | 3042 | 8.9×10^8
RecEraser | 75.59 | 68.84 | 58.86 | 2.03 | 9.64 | 4009 | 8.9×10^8
NegKL | 75.65 | 69.19 | 59.34 | 3.67 | 12.47 | 1805 | 2.2×10^8
NegGrad | 75.97 | 69.31 | 59.20 | 4.64 | 14.13 | 1940 | 2.2×10^8
Bad-T | 75.61 | 69.41 | 58.83 | 4.35 | 14.95 | 1684 | 2.2×10^8
E2URec | 76.34 | 69.76 | 57.75 | 1.91 | 9.51 | 941 | 1.7×10^6
Tab.2 All metrics comparison results (in %) on GD. The best results (except for Original and Retrain) are in bold

Method | AUC ↑ | ACC ↑ | LL ↓ | JSD ↓ | L2-norm ↓ | Time (s) ↓ | #Params ↓
Original | 73.52 | 70.67 | 55.46 | − | − | 9152 | 2.2×10^8
Retrain | 73.39 | 70.53 | 55.56 | − | − | 5448 | 2.2×10^8
SISA | 72.19 | 70.07 | 56.76 | 2.04 | 8.85 | 3008 | 8.9×10^8
RecEraser | 72.29 | 69.93 | 56.57 | 1.66 | 7.71 | 3208 | 8.9×10^8
NegKL | 72.88 | 70.15 | 56.38 | 2.02 | 10.01 | 1866 | 2.2×10^8
NegGrad | 72.85 | 70.26 | 57.21 | 2.56 | 10.44 | 1608 | 2.2×10^8
Bad-T | 72.75 | 70.13 | 61.43 | 8.02 | 19.09 | 1753 | 2.2×10^8
E2URec | 73.41 | 70.42 | 55.48 | 0.90 | 6.54 | 800 | 1.7×10^6
We also conduct an ablation study to explore the contribution of each loss. In Tab.3, “w/o $\mathcal{L}_{fgt}$” and “w/o $\mathcal{L}_{rem}$” denote removing $\mathcal{L}_{fgt}$ and $\mathcal{L}_{rem}$, respectively. We observe that removing $\mathcal{L}_{fgt}$ increases JSD significantly, indicating that $\mathcal{L}_{fgt}$ is the main factor in forgetting the data. Removing $\mathcal{L}_{rem}$ results in a notable drop in AUC, suggesting that $\mathcal{L}_{rem}$ is essential to maintaining recommendation performance.
Tab.3 Ablation results (in %). The best results are in bold

Variants | AUC (ML-1M) | JSD (ML-1M) | AUC (GD) | JSD (GD)
E2URec | 76.34 | 1.91 | 73.41 | 0.90
w/o $\mathcal{L}_{fgt}$ | 76.25 | 2.62 | 73.40 | 1.25
w/o $\mathcal{L}_{rem}$ | 75.75 | 2.27 | 72.98 | 0.99
5 Conclusion
In this letter, we propose E2URec, an efficient and effective unlearning method for LLMRec. Our method enables LLMRec to efficiently forget specific data by only updating the lightweight LoRA modules. Besides, to enhance effectiveness, our method develops two teacher models that instruct the unlearned model to forget information without harming the recommendation performance. Extensive experiments show that E2URec outperforms state-of-the-art baselines on two real-world datasets.