The gains do not make up for the losses: a comprehensive evaluation for safety alignment of large language models via machine unlearning

Weixiang ZHAO, Yulin HU, Xingyu SUI, Zhuojun LI, Yang DENG, Yanyan ZHAO, Bing QIN, Wanxiang CHE

Front. Comput. Sci., 2026, Vol. 20, Issue (2): 2002319. DOI: 10.1007/s11704-024-41099-x
Artificial Intelligence
RESEARCH ARTICLE


Abstract

Machine Unlearning (MU) has emerged as a promising technique for aligning large language models (LLMs) with safety requirements by steering them to forget specific harmful content. Despite the significant progress in previous studies, we argue that the current evaluation criteria, which focus solely on safety, are impractical and biased, raising concerns about the true effectiveness of MU techniques. To address this, we propose to comprehensively evaluate LLMs after MU from three aspects: safety, over-safety, and general utility. Specifically, a novel benchmark MUBENCH with 18 related datasets is first constructed, where safety is measured with both vanilla harmful inputs and 10 types of jailbreak attacks. Furthermore, we examine whether MU introduces side effects, focusing on over-safety and utility loss. Extensive experiments are performed on 3 popular LLMs with 7 recent MU methods. The results highlight a challenging trilemma in safety alignment without side effects, indicating that there is still considerable room for further exploration. MUBENCH serves as a comprehensive benchmark, fostering future research on MU for safety alignment of LLMs.


Keywords

machine unlearning / safety alignment / large language models



1 Introduction

Extensive efforts have been made to align the behavior of large language models (LLMs) [1–6] with human values [7–10], ensuring that they are helpful and harmless [11,12]. To achieve this, beyond the commonly adopted yet computationally intensive preference optimization techniques [7,13], a more efficient approach called machine unlearning (MU) has recently emerged.

Using only harmful data, MU aims to guide models to “forget” content that does not align with human values and behaviors. Specifically, the goal is to fine-tune the model to behave like a “benign” one, which is trained only on the harmless data. In other words, the model is required to perform as if the samples in the harmful data set were never used in its training. To evaluate the performance of models after applying MU, existing studies use the criteria established by [14], which assess safety based on the harmful rate in a forget set and general utility through perplexity in a retain set. However, we argue that these evaluation criteria are both impractical and biased, and may result in misguided progress in this field.

On one hand, as shown in the left of Fig.1, it is impractical because whitespace [14] or irrelevant content [15] is regarded as a valid result in response to harmful inputs after MU. However, we believe that, compared with direct refusal responses, these “hallucination-style” outputs are irresponsible for an AI assistant that aims to be harmless and honest, and they significantly harm the user interaction experience. Therefore, the impracticality of the existing evaluation criteria leaves it uncertain whether MU can be an effective method for safety alignment.

On the other hand, the evaluation is biased because it ignores the potential side effects caused by MU, such as over-safety and utility loss. Over-safety refers to an imbalance between being helpful and being harmless, where the model refuses to respond even to harmless user queries. Recent studies indicate that models with enhanced safety often tend to be overly safe [16,17]. In addition, the utility of the resulting model should not be measured solely by the perplexity on a retain set [14,15]. It is more important to assess whether its more crucial capabilities, such as world knowledge, reasoning, and instruction following, are compromised. Therefore, all the above shortcomings raise the central research question of this work: Can existing machine unlearning methods effectively improve the safety of large language models without causing any side effects?

To answer this question, we propose to comprehensively assess the performance of MU on LLMs across three aspects: safety, over-safety, and utility. Specifically, MUBENCH, a novel MU benchmark with 18 datasets, is first constructed. The evaluation of safety includes both vanilla harmful inputs and jailbreak attacks, conducted with 4 relevant datasets and 10 jailbreak attack methods, to fully assess whether the model’s internal unsafe content can still be triggered post-MU. For the evaluation of over-safety, we adopt XSTest [16] and OKTest [17], two datasets whose prompts contain superficially unsafe sensitive words but are inherently safe. Finally, for utility, we assess the model’s performance on world knowledge, reasoning, reading comprehension, mathematics, and instruction following across 12 relevant datasets.

We first categorize existing MU methods into 2 distinct groups, gradient-based [14,18,19] and task-vector-based [15,20,21], and then evaluate these methods on 3 widely used LLMs: LLaMA-2-Chat-7B [5], Mistral-7B-Instruct-v0.1 [22], and Vicuna-7b-v1.5 [23]. Experimental results reveal a challenging trilemma, where the safety enhancement brought by MU is accompanied by a considerable increase in over-safety and a decline in overall utility. Through qualitative and quantitative analysis, we further disclose the causes of the changes in these three aspects due to MU. Finally, we summarize directions that are worth exploring in the future.

The main contributions of this work are summarized as follows:

● Based on our constructed MUBENCH, we comprehensively assess the performance of current MU methods on LLMs across three aspects: safety, over-safety, and utility.

● We offer empirical insights into 7 existing MU methods for safety alignment by comprehensively evaluating them on 3 popular LLMs.

● Through extensive experiments and analysis, we uncover the trilemma of current MU approaches and identify potential solutions.

2 Related works

● Machine unlearning for LLM safety

Machine unlearning has emerged as a post-hoc remediation technique to correct content or behaviors in LLMs that do not align with human expectations. Recent surveys [24–27] have detailed its applications in various scenarios, including issues related to copyright and personal information [28–30], hallucinations [14], and unsafe content [15,19,31]. This work mainly focuses on the application of MU in erasing unsafe content to enhance the safety of LLMs. Existing works follow the evaluation defined by [14], measuring safety through the harmful rate on a forget set and general utility through perplexity on a retain set.

By contrast, this work stands out in the following ways: (1) Empirical insights: In contrast to the simplistic summaries in existing surveys, this work offers intuitive experimental results paired with thorough evaluation and analysis, ultimately suggesting promising directions for future exploration. (2) Comprehensive benchmarking: Targeting the impractical and biased evaluation criteria, we introduce MUBENCH to provide a more comprehensive assessment of the safety, over-safety, and general utility of LLMs post-MU.

● Safety alignment for large language models

Aligning the behavior of LLMs with human instructions and values has gained significant interest. Existing works can be divided into two categories. (1) One group of works seeks solutions outside the LLM backbones, filtering out inputs that could potentially make LLMs produce harmful content with a trained unsafe prompt detector [32–34]. (2) Another branch of works endeavors to achieve alignment inside the LLMs, applied at different stages of the LLM development cycle, including pre-training, pre-alignment, and post-alignment.

For safety alignment in the pre-training stage, existing works apply strict filtering mechanisms to remove harmful data from pre-training datasets [2,5,35,36].

Most current safety alignment efforts are concentrated on the pre-alignment stage, primarily utilizing supervised fine-tuning [37,38] and preference optimization techniques [7,8,13,39–41]. Recently, some works have also explored the potential of LLMs to achieve self-alignment [42–45].

However, even aligned LLMs still pose safety risks [46–48], which directly stimulates further research to fill safety gaps in the post-alignment stage. On one hand, some current work focuses on the decoding phase of aligned LLMs to enhance safety [17,49] or reduce the likelihood of successful jailbreak attacks [50–53]. On the other hand, we believe that current MU techniques [14,15,18] should be applied at this post-alignment stage to further eliminate unsafe content in aligned LLMs, with the expectation that this could be orthogonal to the aforementioned alignment methods or provide additional benefits. Therefore, we construct MUBENCH to comprehensively evaluate MU methods in terms of safety, over-safety and utility.

3 Machine unlearning

3.1 Preliminary

MU for the safety alignment of LLMs addresses the following issue: Given an initial model (also the reference model) $\pi_{\text{ref}}(y \mid x)$ that has been trained on a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i \in [n]}$, MU seeks to make the model forget a specific subset related to harmful content (referred to as the harmful forget set) $\mathcal{D}_f \subseteq \mathcal{D}$ of the training data. Specifically, the goal is to fine-tune the model to behave like the retrained model $\pi_{\text{retain}}$, which is trained only on the harmless retain set $\mathcal{D}_r = \mathcal{D} \setminus \mathcal{D}_f$. In other words, we want the model to perform as if the samples in the harmful forget set $\mathcal{D}_f$ were never used in its training. Ideally, the best method for machine unlearning would be to retrain the model from scratch using only $\mathcal{D}_r$, but this is often impractical in reality.

3.2 Methodology

We comprehensively evaluate recent machine unlearning techniques for safety alignment of LLMs. They are divided into two groups:

● Gradient-based methods

One of the most straightforward methods is gradient ascent (GA) [14], which updates the model parameters by maximizing the prediction loss on the samples within the forget set $\mathcal{D}_f$:

$\mathcal{L}_{\mathrm{GA}} = \mathbb{E}_{\mathcal{D}_f}\left[\log \pi_\theta(y \mid x)\right].$

The rationale of gradient ascent is that, since the initial model $\pi_{\text{ref}}$ is trained on $\mathcal{D} = \mathcal{D}_f \cup \mathcal{D}_r$, a subsequent maximization of the prediction loss on the forget set $\mathcal{D}_f$ would approximately revert the optimization on it, thus unlearning $\mathcal{D}_f$ and approximating a model trained on $\mathcal{D}_r$ only.
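To make the objective concrete, the following is a minimal PyTorch-style sketch of one GA unlearning step, assuming a Hugging Face causal LM and a pre-tokenized batch from the forget set $\mathcal{D}_f$; the checkpoint name, learning rate, and batch construction are illustrative assumptions rather than the exact setup of [14].

```python
import torch
from transformers import AutoModelForCausalLM

# Minimal sketch of one gradient-ascent (GA) unlearning step (illustrative setup).
# Assumption: `forget_batch` is a tokenized batch from the forget set D_f whose
# `labels` mask out prompt tokens (-100), so only the harmful response tokens count.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def ga_step(forget_batch):
    outputs = model(
        input_ids=forget_batch["input_ids"],
        attention_mask=forget_batch["attention_mask"],
        labels=forget_batch["labels"],
    )
    # outputs.loss is the mean negative log-likelihood on D_f; minimizing its
    # negation is equivalent to gradient ascent on the prediction loss.
    loss = -outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```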

In addition, we also include negative preference optimization (NPO) [18], a simple drop-in fix for GA that makes it more stable.
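For reference, NPO replaces the unbounded GA objective with a bounded one that down-weights samples the model has already moved away from relative to the reference model. Below is a sketch of the NPO forget loss as formulated in [18]; the per-sequence log-probability inputs and the value of β are assumptions on our side, not the settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def npo_loss(logp_theta: torch.Tensor, logp_ref: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Sketch of the NPO forget loss [18].

    logp_theta / logp_ref: per-sequence log-probabilities of the harmful response y
    under the current model pi_theta and the frozen reference model pi_ref.
    L_NPO = (2/beta) * E[ log(1 + (pi_theta / pi_ref)^beta) ]; as beta -> 0 it
    recovers plain gradient ascent, but the loss stays bounded, stabilizing training.
    """
    return (2.0 / beta) * F.softplus(beta * (logp_theta - logp_ref)).mean()
```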

● Task-vector-based methods

The concept of a task vector [54] refers to the difference between the original weights of a pre-trained model and its weights after it has been fine-tuned for a specific task. Thus, at the heart of these approaches [15,20,21] is to seek a task vector $\theta_f$ representing the harmful content over $\mathcal{D}_f$ and then negate it from the pre-trained backbone to unlearn such harmful content, which can be formulated as:

$\theta_{\mathrm{unlearn}} = \theta - \theta_f.$
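As a concrete illustration of the negation step, the sketch below assumes we already have a copy of the backbone fine-tuned on the harmful forget set, whose weight difference from the backbone serves as $\theta_f$; the checkpoint paths and scaling coefficient are illustrative, and the cited methods [15,20,21] each refine how $\theta_f$ is obtained.

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch of task-vector negation: theta_unlearn = theta - alpha * theta_f.
# Assumption: `harmful_model` is a copy of the backbone fine-tuned on D_f, so
# (harmful weights - backbone weights) approximates the harmful task vector theta_f.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
harmful_model = AutoModelForCausalLM.from_pretrained("path/to/harmful-finetuned-model")

alpha = 1.0  # negation strength (illustrative hyper-parameter)
with torch.no_grad():
    for p_base, p_harm in zip(base.parameters(), harmful_model.parameters()):
        task_vector = p_harm - p_base      # theta_f for this parameter tensor
        p_base.sub_(alpha * task_vector)   # theta - alpha * theta_f

base.save_pretrained("path/to/unlearned-model")  # backbone with the harmful direction negated
```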

● Auxiliary loss function

In addition to the primary forgetting objective, the above works also introduce other auxiliary loss functions that either encourage unlearning or preserve utility, including:

● Random Loss: $\mathcal{L}_{\mathrm{Rand}} = \mathbb{E}_{\mathcal{D}_f}\left[\log \pi_\theta(\tilde{y} \mid x)\right]$, where $(x, \tilde{y}) \notin \mathcal{D}_f$ and $\tilde{y}$ is any replaced random response from $\mathcal{D}_f$ or $\mathcal{D}_r$ other than the golden one paired with $x$.

● Retain Loss: $\mathcal{L}_{\mathrm{RT}} = \mathbb{E}_{\mathcal{D}_r}\left[\log \pi_\theta(y \mid x)\right]$, which encourages the model to still perform well on the retain set $\mathcal{D}_r$.

● Forget KL Loss: $\mathrm{KL}_f = \mathbb{E}_{\mathcal{D}_f}\left[D_{\mathrm{KL}}\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\right)\right]$, which measures the distance to the initial model $\pi_{\text{ref}}$ in terms of KL divergence on the forget set $\mathcal{D}_f$.

● Retain KL Loss: $\mathrm{KL}_r = \mathbb{E}_{\mathcal{D}_r}\left[D_{\mathrm{KL}}\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\right)\right]$, which measures the distance to the initial model $\pi_{\text{ref}}$ in terms of KL divergence on the retain set $\mathcal{D}_r$.

For more details on the aforementioned baseline methods and the combinations of different loss functions, please refer to Appendix A.5.
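For readers who want a concrete picture of how such terms are typically combined with a forget objective, the sketch below computes the retain loss and the retain-side KL term against a frozen reference model; the weighting of the terms and the batch format are illustrative assumptions rather than the configuration of any specific baseline (see Appendix A.5 for the actual combinations).

```python
import torch
import torch.nn.functional as F

def auxiliary_retain_terms(model, ref_model, retain_batch):
    """Sketch of the retain loss and retain KL loss on a batch from D_r (illustrative).

    `retain_batch` is a tokenized batch (with `labels`) from the retain set D_r;
    `ref_model` is the frozen initial model pi_ref.
    """
    out = model(**retain_batch)
    with torch.no_grad():
        ref_out = ref_model(**retain_batch)

    # Retain loss: negative log-likelihood on D_r, keeping performance on harmless data.
    l_retain = out.loss

    # Retain KL loss: KL(pi_theta(.|x) || pi_ref(.|x)), averaged over positions.
    log_p = F.log_softmax(out.logits, dim=-1)        # current model
    log_q = F.log_softmax(ref_out.logits, dim=-1)    # frozen reference model
    kl_r = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")

    # A full objective would be, e.g., L_forget + w1 * l_retain + w2 * kl_r,
    # with weights w1, w2 chosen per method.
    return l_retain, kl_r
```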

MUBENCH steps further and refines the existing evaluation protocol, focusing on a comprehensive evaluation of the effect of machine unlearning on LLMs. Specifically, as shown in Fig.2, MUBENCH covers 3 aspects with a total of 18 datasets. (1) Safety (with 4 datasets) focuses on whether harmful content can still be elicited under both the vanilla harmful input and the jailbreak attack settings. Only direct refusal responses are accepted as valid results after MU, rather than “hallucination-style” ones such as whitespace and irrelevant content. (2) Over-safety (with 2 datasets) aims to measure whether MU causes an imbalance between being helpful and being harmless, i.e., situations where the model refuses to respond even to harmless user inputs. (3) Utility (with 12 datasets) assesses the model’s practical performance, examining whether its world knowledge, reasoning ability, reading comprehension, mathematical skills, and instruction-following capability are impaired after MU. The subsequent sections offer a detailed introduction to these aspects.

3.3 Safety evaluation

We evaluate the effect of MU on the safety of LLMs under two settings: vanilla harmful prompt and jailbreak attack. Examples of both are shown in Table A3 in Appendix A.3.1.

● Vanilla harmful prompt

This setting assesses whether harmful responses can be directly elicited with harmful inputs.

● Jailbreak attack

Recent studies have exposed a significant threat termed jailbreak attack, which can successfully bypass existing safety mechanisms. We also evaluate models after MU under this scenario. Existing jailbreak methods can be classified into the following three types [55]: (1) Human Design [52,56–58], which encompasses jailbreak prompts crafted manually, leveraging human creativity to bypass safeguards; (2) Long-tail Encoding [59–61], which leverages the limited cross-task generalization ability of LLMs to data unseen during safety alignment; and (3) Prompt Optimization [62–66], which aims at automatically designing jailbreak prompts to induce harmful content. Please refer to Appendix A.3.2 for more details of these jailbreak methods.

● Harmful prompts

Following [14], we include harmful inputs from Beavertails [67]. To further diversify the testing samples, harmful prompts are also compiled from AdvBench [62], DoNotAnswer [68], and HarmfulQA [69]. Detailed descriptions of these datasets are as follows:

AdvBench [62]: A set of 520 harmful instructions, encompassing a wide spectrum of detrimental content such as profanity, graphic depictions, threatening behavior, misinformation, discrimination, cybercrime, and dangerous or illegal suggestions.

Beavertails [67]: 16,851 unique prompts sampled from AnthropicRedTeam [46], covering 14 harm categories and annotated for safety by 3.34 crowdworkers on average. We randomly sample up to 100 prompts from each harm category, leading to 2,388 testing samples in total.

DoNotAnswer [68]: A set of 939 prompts under a three-level hierarchical risk taxonomy, based on the criterion that all instructions should not be followed.

HarmfulQA [69]: utilizes Chain-of-Utterances-based prompting to collect harmful data within 10 topics, leading to 1,960 harmful prompts in total.

● Evaluation protocol

We adopt the refusal rate as the evaluation metric; it is automatically calculated with a Longformer-based [70] classifier provided by [68], which performs on par with GPT-4 and human annotators in judging whether a response is harmful or not. We only consider the direct refusal responses (rather than whitespace or irrelevant ones) of the model to be harmless. Please refer to Appendix A.3.3 for more details of the classifier. For jailbreak attacks, the attack success rate (ASR) is adopted as the evaluation metric; it is also automatically derived from the above classifier.
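To make the protocol concrete, the sketch below scores generated responses with a text-classification checkpoint and aggregates the refusal rate and ASR; the checkpoint path and label names are placeholders, since the exact Longformer-based classifier and its label set are those released by [68] (see Appendix A.3.3).

```python
from transformers import pipeline

# Sketch of the automatic evaluation protocol. The model path and label names are
# placeholders; the actual classifier is the Longformer-based one provided by [68].
clf = pipeline("text-classification", model="path/to/longformer-response-classifier")

def classify(prompts, responses):
    return [clf(f"Question: {p}\nResponse: {r}")[0]["label"]
            for p, r in zip(prompts, responses)]

def refusal_rate(prompts, responses):
    """Share of responses classified as direct refusals; whitespace or
    irrelevant outputs are not counted as refusals."""
    labels = classify(prompts, responses)
    return sum(l == "refusal" for l in labels) / len(labels)

def attack_success_rate(prompts, responses):
    """ASR under jailbreak attacks: share of responses judged harmful."""
    labels = classify(prompts, responses)
    return sum(l == "harmful" for l in labels) / len(labels)
```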

3.4 Over-safety evaluation

The evaluation for over-safety is performed on XSTest [16] and OKTest [17], two widely-adopted datasets for measuring the false rejection of benign inputs. Examples from these two datasets are displayed in Table A5, and more detailed descriptions of them are as follows:

XSTest [16] comprises 250 safe prompts across ten prompt types (with 25 samples in each type) that well-calibrated models should not refuse to comply with. To be more specific, each prompt consists of a single, handcrafted English sentence. All prompts are designed in such a way that a) they are clearly safe, and b) they superficially resemble unsafe prompts in terms of the vocabulary they use. Each prompt type tests a distinct kind of resemblance. Examples of prompts for each type are provided in Table A5.

OKTest [17] is an automatically constructed dataset of 300 safe questions containing harmful words. The construction process is: (1) Harmful word collection. To ensure that the constructed sentences invariably contain harmful words, a list of over a thousand sensitive words is initially compiled. (2) Safe question generation. To obtain safe questions with harmful words from the previous step, GPT-4 is prompted to generate the questions. (3) Data filtering. The resulting data are also manually checked to make sure they are indeed harmless, with slight grammar corrections to improve the data quality.

● Evaluation protocol

As suggested by [16], we hire three human annotators to manually evaluate the generated responses in terms of refusal rate. Each response is categorized into one of three types: Full Compliance, Full Refusal, and Partial Refusal. Please refer to Appendix A.4 for their detailed definitions.

3.5 Utility evaluation

To evaluate the utility of the backbone model, following [5], we conduct evaluations across five crucial dimensions: (1) World Knowledge: MMLU (5-shot) [71]; (2) Reasoning: 0-shot for HellaSwag [72], ARC-easy [73], ARC-challenge [73], WinoGrande [74], PIQA [75], and BBH (3-shot) [76]; (3) Reading Comprehension: BoolQ (0-shot) [77]; (4) MATH: GSM8K (8-shot) [78] and MATH (4-shot) [79]; (5) Multi-Turn Instruction-Following: MT-bench [23].

● Evaluation protocol

The evaluation for utility benchmarks is performed with lm-evaluation-harness [80]. Performance on MT-Bench is rated by GPT-4 (gpt-4o).
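For reference, utility evaluation with lm-evaluation-harness can be driven from its Python entry point as sketched below; the checkpoint path is a placeholder and the exact argument names may vary slightly across harness versions.

```python
import lm_eval

# Sketch: evaluating an unlearned checkpoint on part of the utility suite with
# lm-evaluation-harness [80]. The checkpoint path is a placeholder; per-task
# few-shot settings (e.g., 5-shot MMLU, 3-shot BBH) follow the configuration above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/unlearned-model,dtype=bfloat16",
    tasks=["mmlu", "hellaswag", "arc_easy", "arc_challenge", "winogrande", "piqa", "boolq"],
    batch_size=8,
)
print(results["results"])  # per-task metrics, e.g., accuracy
```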

4 MUBENCH benchmark

As pointed out by existing works [14,15,26], MU methods should not only unlearn the harmful forget set, but also maintain the model’s utility on the retain set.

5 Experiment

5.1 Experimental setup

● Models

We apply the above MU methods to three popular open-source aligned LLMs.

LLaMA-2-Chat-7B [5]: LLaMA-2 is an open foundation model trained on 2T tokens, and the chat version is the official aligned model obtained with SFT and RLHF.

Mistral-7B-Instruct-v0.1 [22] is an instruction fine-tuned version of the Mistral-7B-v0.1. It achieves the strongest performance among models of its size.

Vicuna-7b-v1.5 [23]: Vicuna is a chat assistant trained by fine-tuning LLaMA-2 on user-shared conversations gathered from ShareGPT.com via its public APIs.

5.2 Implementation details

Our experiments are implemented with PyTorch [81] and the Transformers library [82]. 1,400 harmful data points from Beavertails [67] constitute the forget set $\mathcal{D}_f$, and we adopt TruthfulQA [83] as the retain set $\mathcal{D}_r$ following [14]. We carefully re-implement the official codes of all baselines, strictly following most of their hyper-parameter settings. The forgetting steps stop when the model becomes unable to produce meaningful responses on the validation set, preventing the generation of whitespace or random answers. For more detailed settings, please refer to Appendix A.6.2.

5.3 Overall results

We demonstrate the experimental results for safety, over-safety and utility evaluation of existing machine unlearning methods based on LLaMA-2-Chat-7B in Tab.1–Tab.3, respectively. Please refer to Appendix A.7 for more results on Mistral-7B-Instruct-v0.1 and Vicuna-7b-v1.5.

Gradient-based MU methods are most effective for safety enhancement.

Please note that the responses used to calculate refusal rates in Tab.1 and Tab.2 are all “meaningful” refusal responses, excluding any whitespace or irrelevant ones. Such meaningful rejections can directly reflect the benefits of the MU methods in enhancing the model’s safety. Under this evaluation criterion, the gradient-based MU methods (GA, GA + Mismatch, and NPO) prove to be the most effective and transferable in enhancing safety across all three base models and in both the vanilla harmful prompt and jailbreak settings. This demonstrates the feasibility of directly maximizing the prediction loss on the forget set. More importantly, the success in jailbreak prevention shows that MU can tackle the root cause of harmful outputs from LLMs by erasing detrimental knowledge within the model, thereby preventing its induction.

The introduction of auxiliary losses undermines the effectiveness of safety enhancement.

Various auxiliary loss functions have been introduced in RMU and SKU to balance the effects of unlearning and maintain general utility. However, these two methods do not lead to improvements in safety across all three backbone models. This highlights the need for future work to further balance the impact of auxiliary loss functions.

MU is more effective at enhancing the safety of backbone LLMs that have undergone preference alignment.

Among the three LLMs, only LLaMA-2-Chat-7B has been preference-aligned, while Mistral-7B-Instruct-v0.1 and Vicuna-7b-v1.5 have been instruction-aligned through supervised fine-tuning. However, the current MU methods demonstrate significant safety improvement only in LLaMA-2-Chat-7B. This highlights a notable difference in the effectiveness of MU across different backbone LLMs, indicating that it cannot be considered a universal and transferable method for improving the safety performance of all aligned models. To ensure the generalizability of the findings, we perform a more appropriate comparison between the same backbone model trained with different alignment methods. Please refer to Appendix A.7 for detailed results. Additionally, this emphasizes the critical impact of preference alignment on the safety alignment of LLMs, which we will discuss in detail through qualitative analysis in Section 5.5.

The enhanced safety of a MU method may worsen the over-safety problem.

As shown in Tab.3 with LLaMA-2-Chat-7B, MU methods that significantly improve safety performance (such as GA and NPO) also tend to exacerbate the over-safety issue. This creates an imbalance in the model’s helpfulness and harmlessness, causing it to refuse to respond to benign user inputs. This problem should be carefully considered in future works involving MU, in the hope that they can achieve a better balance between the model’s helpfulness and harmlessness. In addition, the backbone that is inherently more prone to over-safety (i.e., LLaMA-2-Chat-7B) shows a deeper exacerbation of its over-safety after MU.

After MU, the overall utility of the backbones, especially their instruction-following ability, is significantly affected.

We present the changes in overall utility after MU for the three backbone models in Tab.3, A12, and A15, respectively. The results consistently demonstrate a significant decline in performance on MT-Bench for the MU methods that are more effective in safety enhancement, indicating that MU severely compromises the utility of the backbone LLMs. In contrast, the performance on other utility datasets remains relatively unaffected, likely because these tasks primarily consist of simple multiple-choice questions. This suggests that MU may impact the models’ expressive capabilities more than their internal knowledge. Finally, since most MU methods involve the retain set TruthfulQA for training, the performance of the resulting models on it has significantly improved. However, this could also pose a potential risk of overfitting the model’s general utility to this specific dataset.

5.4 Case study

We present the responses to harmful inputs from the backbone model after different MU methods in Table A16 from Appendix A.8.1. The results indicate that MU does elicit refusal answers to harmful inputs, manifesting valid and effective safety enhancement. It is important to note that the harmful forget set $\mathcal{D}_f$ used to train these MU methods lacks such “rejection-style” supervision signals. Nevertheless, the backbone LLM can automatically generate refusal responses. We will discuss the reason behind this in detail in Section 5.5.

In addition, cases from MT-Bench are displayed in Table A22 from Appendix A.8.3. The main reason for the performance decline of the resulting LLMs after MU is that the increased over-safety makes the model overly sensitive, so that it criticizes the wording and phrasing of user inputs or even refuses them, deviating from directly generating helpful responses.

5.5 Qualitative analysis

In this section, we analyze why MU methods can effectively enhance the safety of LLMs. This phenomenon is surprising, as the backbone model, without any rejection-signal supervision during the MU process, can spontaneously produce rejection responses. Inspired by [84], which shows that a rejection region exists within the representation space of aligned LLMs, we hypothesize that the effectiveness of MU is also related to this. Specifically, as shown in the left of Fig.3, we first perform Principal Component Analysis (PCA) to visualize the hidden states of harmful inputs from AdvBench that the original LLaMA-2-Chat-7B either refuses (yellow) or complies with (blue), and find a clear distinction between these two types of data points. Then, PCA is conducted again on these data points for models after the most effective (GA) and least effective (RMU) MU methods. We find that the data points of GA (red) shift more towards the rejection region (yellow), while the data points of RMU (green) remain in place. This confirms that the reason for effective MU is that it can shift harmful data towards the established rejection region within the aligned LLMs.

The same reason applies to the exacerbation of over-safety. As shown in the right side of Fig.3, the GA (red) also clearly shifts the harmless data on XSTest towards the rejection region, while these data are unaffected by the RMU (green). For more analyses on the Mistral and Vicuna backbones, please refer to Appendix A.9.
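For readers who wish to reproduce this analysis, the sketch below extracts prompt representations and projects them with PCA; representing each prompt by the last-layer hidden state of its final token is our assumption about the feature choice, and the prompt lists are placeholders.

```python
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of the hidden-state PCA analysis. Assumption: each prompt is represented by
# the last-layer hidden state of its final token; the exact feature choice may differ.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

@torch.no_grad()
def prompt_features(prompts):
    feats = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        hs = model(**ids, output_hidden_states=True).hidden_states[-1]  # (1, seq, hidden)
        feats.append(hs[0, -1].float().cpu().numpy())                   # last-token state
    return feats

# Placeholder prompt lists: AdvBench inputs the original model refuses / complies with.
refused_prompts = ["..."]
complied_prompts = ["..."]

features = prompt_features(refused_prompts) + prompt_features(complied_prompts)
coords = PCA(n_components=2).fit_transform(features)  # 2-D projection for plotting
# The fitted PCA can then project hidden states from unlearned models (e.g., after GA
# or RMU) to check whether they move towards the rejection region.
```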

6 Future direction and discussion

We propose two directions for future research on using MU for the safety alignment of LLMs.

● Safety alignment without side effects

One of the primary challenges for future work is effectively addressing the existing trilemma: enhancing the safety of LLMs while ensuring that this does not lead to over-safety (or even mitigating it) and protecting the existing general utility from being compromised. To tackle this, a possible solution is to incorporate continual learning techniques [85] into the MU process. This includes, but is not limited to, replaying relevant over-safety and utility data for multi-task training [86–88], further regularizing the gradient optimization process [89,90], and introducing additional architectures [91,92].

● Towards cost-efficient machine unlearning

Future work can aim to further reduce the computational overhead of MU by focusing on two main areas. On one hand, although introducing more over-safety and general data as a retrain set may be effective for achieving comprehensive performance, it also significantly increases computational demands. Therefore, developing data selection strategies to choose the most impactful samples for the MU process could help minimize training volume [93,94]. On the other hand, current approaches often depend on an additional reference model for utility maintenance through distillation, which requires extra GPU memory and is impractical for larger-scale LLMs. Future research could explore leveraging the self-improvement mechanism [95,96] of LLMs to obtain the supervision signal directly from the target model itself [45,97].

7 Conclusion

The current evaluation criteria for LLMs after machine unlearning (MU) are impractical and biased, raising concerns about the true effectiveness of existing MU techniques. To fill this gap, we propose to comprehensively assess the performance of MU across three aspects: safety, over-safety, and utility. To achieve this, we present a novel benchmark MUBENCH with 18 datasets and conduct extensive experiments with 7 state-of-the-art MU methods based on 3 open-sourced aligned LLMs. Our findings highlight a challenging trilemma, where effective safety alignment is always accompanied by exacerbated over-safety and utility loss, indicating the gains from MU do not make up for the losses. Moreover, we provide thorough analysis, suggesting potential directions for further exploration.

References

[1]

Brown T B, Mann B, Ryder N, Subbiah M, Kaplan J D, et al. Language models are few-shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 159

[2]

Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu P J . Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020, 21( 1): 140

[3]

Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, et al. GPT-4 technical report. 2024, arXiv preprint arXiv: 2303.08774

[4]

Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M A, Lacroix T, Rozière B, Goyal N, Hambro E, Azhar F, Rodriguez A, Joulin A, Grave E, Lample G. LLaMA: open and efficient foundation language models. 2023, arXiv preprint arXiv: 2302.13971

[5]

Touvron H, Martin L, Stone K, Albert P, Almahairi A, et al. Llama 2: open foundation and fine-tuned chat models. 2023, arXiv preprint arXiv: 2307.09288

[6]

Gemini Team Google. Gemini: a family of highly capable multimodal models. 2024, arXiv preprint arXiv: 2312.11805

[7]

Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C L, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, Schulman J, Hilton J, Kelton F, Miller L, Simens M, Askell A, Welinder P, Christiano P, Leike J, Lowe R. Training language models to follow instructions with human feedback. In: Proceedings of the 36th Conference on Neural Information Processing Systems. 2022, 27730−27744

[8]

Bai Y, Kadavath S, Kundu S, Askell A, Kernion J, et al. Constitutional AI: harmlessness from AI feedback. 2022, arXiv preprint arXiv: 2212.08073

[9]

Glaese A, McAleese N, Trębacz M, Aslanides J, Firoiu V, et al. Improving alignment of dialogue agents via targeted human judgements. 2022, arXiv preprint arXiv: 2209.14375

[10]

Korbak T, Shi K, Chen A, Bhalerao R V, Buckley C, Phang J, Bowman S R, Perez E. Pretraining language models with human preferences. In: Proceedings of the 40th International Conference on Machine Learning. 2023, 17506−17533

[11]

Askell A, Bai Y, Chen A, Drain D, Ganguli D, et al. A general language assistant as a laboratory for alignment. 2021, arXiv preprint arXiv: 2112.00861

[12]

Bai Y, Jones A, Ndousse K, Askell A, Chen A, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. 2022, arXiv preprint arXiv: 2204.05862

[13]

Rafailov R, Sharma A, Mitchell E, Ermon S, Manning C D, Finn C. Direct preference optimization: Your language model is secretly a reward model. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 2338

[14]

Yao Y, Xu X, Liu Y. Large language model unlearning. 2024, arXiv preprint arXiv: 2310.10683

[15]

Liu Z, Dou G, Tan Z, Tian Y, Jiang M. Towards safer large language models through machine unlearning. In: Proceedings of Findings of the Association for Computational Linguistics: ACL 2024. 2024, 1817−1829

[16]

Röttger P, Kirk H, Vidgen B, Attanasio G, Bianchi F, Hovy D. XSTest: a test suite for identifying exaggerated safety behaviours in large language models. In: Proceedings of 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2024, 5377−5400

[17]

Shi C, Wang X, Ge Q, Gao S, Yang X, Gui T, Zhang Q, Huang X, Zhao X, Lin D. Navigating the OverKill in large language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 2024, 4602−4614

[18]

Zhang R, Lin L, Bai Y, Mei S. Negative preference optimization: from catastrophic collapse to effective unlearning. 2024, arXiv preprint arXiv: 2404.05868

[19]

Li N, Pan A, Gopal A, Yue S, Berrios D, et al. The WMDP benchmark: measuring and reducing malicious use with unlearning. In: Proceedings of the 41st International Conference on Machine Learning. 2024

[20]

Zhang J, Chen S, Liu J, He J. Composing parameter-efficient modules with arithmetic operations. In: Proceedings of the 37th Conference on Neural Information Processing Systems. 2023, 12589−12610

[21]

Gao L, Niu Y, Tang T, Avestimehr S, Annavaram M. Ethos: rectifying language models in orthogonal parameter space. In: Proceedings of Findings of the Association for Computational Linguistics. 2024, 2054−2068

[22]

Jiang A Q, Sablayrolles A, Mensch A, Bamford C, Chaplot D S, et al. Mistral 7B. 2023, arXiv preprint arXiv: 2310.06825

[23]

Zheng L, Chiang W L, Sheng Y, Zhuang S, Wu Z, et al. Judging LLM-as-a-judge with MT-bench and chatbot arena. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 2020

[24]

Si N, Zhang H, Chang H, Zhang W, Qu D, Zhang W. Knowledge unlearning for LLMs: tasks, methods, and challenges. 2023, arXiv preprint arXiv: 2311.15766

[25]

Zhang D, Finckenberg-Broman P, Hoang T, Pan S, Xing Z, Staples M, Xu X. Right to be forgotten in the era of large language models: implications, challenges, and solutions. 2024, arXiv preprint arXiv: 2307.03941

[26]

Liu S, Yao Y, Jia J, Casper S, Baracaldo N, Hase P, Yao Y, Liu Y, Xu X, Li H, Varshney K R, Bansal M, Koyejo S, Liu Y. Rethinking machine unlearning for large language models. 2024, arXiv preprint arXiv: 2402.08787

[27]

Qu Y, Ding M, Sun N, Thilakarathna K, Zhu T, Niyato D. The frontier of data erasure: machine unlearning for large language models. 2024, arXiv preprint arXiv: 2403.15779

[28]

Jang J, Yoon D, Yang S, Cha S, Lee M, Logeswaran L, Seo M. Knowledge unlearning for mitigating privacy risks in language models. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. 2023, 14389−14408

[29]

Wu X, Li J, Xu M, Dong W, Wu S, Bian C, Xiong D. DEPN: detecting and editing privacy neurons in pretrained language models. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 2875−2886

[30]

Wang L, Zeng X, Guo J, Wong K F, Gottlob G. Selective forgetting: advancing machine unlearning techniques and evaluation in language models. 2024, arXiv preprint arXiv: 2402.05813

[31]

Bhardwaj R, Do D A, Poria S. Language models are homer Simpson! Safety re-alignment of fine-tuned language models through task arithmetic. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 2024, 14138−14149

[32]

Lin Z, Wang Z, Tong Y, Wang Y, Guo Y, Wang Y, Shang J. ToxicChat: unveiling hidden challenges of toxicity detection in real-world user-AI conversation. In: Proceedings of Findings of the Association for Computational Linguistics. 2023, 4694−4702

[33]

Inan H, Upasani K, Chi J, Rungta R, Iyer K, Mao Y, Tontchev M, Hu Q, Fuller B, Testuggine D, Khabsa M. Llama guard: LLM-based input-output safeguard for human-AI conversations. 2023, arXiv preprint arXiv: 2312.06674

[34]

Xie Y, Fang M, Pi R, Gong N. GradSafe: detecting jailbreak prompts for LLMs via safety-critical gradient analysis. 2024, arXiv preprint arXiv: 2402.13494

[35]

Ngo H, Raterink C, Araújo J G M, Zhang I, Chen C, Morisot A, Frosst N. Mitigating harm in language models with conditional-likelihood filtration. 2021, arXiv preprint arXiv: 2108.07790

[36]

Anil R, Dai A M, Firat O, Johnson M, Lepikhin D, et al. PaLM 2 technical report. 2023, arXiv preprint arXiv: 2305.10403

[37]

Wei J, Bosma M, Zhao V Y, Guu K, Yu A W, Lester B, Du N, Dai A M, Le Q V. Finetuned language models are zero-shot learners. In: Proceedings of the 10th International Conference on Learning Representations. 2022

[38]

Longpre S, Hou L, Vu T, Webson A, Chung H W, Tay Y, Zhou D, Le Q V, Zoph B, Wei J, Roberts A. The flan collection: designing data and methods for effective instruction tuning. In: Proceedings of the 40th International Conference on Machine Learning. 2023, 941

[39]

Lee H, Phatale S, Mansoor H, Mesnard T, Ferret J, Lu K, Bishop C, Hall E, Carbune V, Rastogi A, Prakash S. RLAIF vs. RLHF: scaling reinforcement learning from human feedback with AI feedback. In: Proceedings of the 41st International Conference on Machine Learning. 2024

[40]

Ethayarajh K, Xu W, Muennighoff N, Jurafsky D, Kiela D. KTO: model alignment as prospect theoretic optimization. 2024, arXiv preprint arXiv: 2402.01306

[41]

Duan S, Yi X, Zhang P, Liu Y, Liu Z, Lu T, Xie X, Gu N. Negating negatives: alignment with human negative samples via distributional dispreference optimization. In: Proceedings of Findings of the Association for Computational Linguistics. 2024, 1012−1042

[42]

Sun Z, Shen Y, Zhou Q, Zhang H, Chen Z, Cox D, Yang Y, Gan C. Principle-driven self-alignment of language models from scratch with minimal human supervision. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2024, 115

[43]

Li X, Yu P, Zhou C, Schick T, Levy O, Zettlemoyer L, Weston J, Lewis M. Self-alignment with instruction backtranslation. In: Proceedings of the 12th International Conference on Learning Representations. 2024

[44]

Sun Z, Shen Y, Zhang H, Zhou Q, Chen Z, Cox D, Yang Y, Gan C. SALMON: Self-alignment with principle-following reward models. 2024, arXiv preprint arXiv: 2310.05910v1

[45]

Chen Z, Deng Y, Yuan H, Ji K, Gu Q. Self-play fine-tuning converts weak language models to strong language models. 2024, arXiv preprint arXiv: 2401.01335

[46]

Ganguli D, Lovitt L, Kernion J, Askell A, Bai Y, et al. Red teaming language models to reduce harms: methods, scaling behaviors, and lessons learned. 2022, arXiv preprint arXiv: 2209.07858

[47]

Perez F, Ribeiro I. Ignore previous prompt: attack techniques for language models. 2022, arXiv preprint arXiv: 2211.09527

[48]

Qi X, Zeng Y, Xie T, Chen P Y, Jia R, Mittal P, Henderson P. Fine-tuning aligned language models compromises safety, even when users do not intend to! In: Proceedings of the 12th International Conference on Learning Representations. 2024

[49]

Zhong Q, Ding L, Liu J, Du B, Tao D. ROSE doesn’t do that: boosting the safety of instruction-tuned large language models with reverse prompt contrastive decoding. In: Proceedings of Findings of the Association for Computational Linguistics. 2024, 13721−13736

[50]

Xie Y, Yi J, Shao J, Curl J, Lyu L, Chen Q, Xie X, Wu F . Defending ChatGPT against jailbreak attack via self-reminders. Nature Machine Intelligence, 2023, 5( 1): 1486–1496

[51]

Phute M, Helbling A, Hull M, Peng S, Szyller S, Cornelius C, Chau D H. LLM self defense: by self examination, LLMs know they are being tricked. In: Proceedings of the 2nd Tiny Papers Track at ICLR 2024. 2024

[52]

Wei Z, Wang Y, Li A, Mo Y, Wang Y. Jailbreak and guard aligned language models with only few in-context demonstrations. 2024, arXiv preprint arXiv: 2310.06387

[53]

Xu Z, Jiang F, Niu L, Jia J, Lin B Y, Poovendran R. SafeDecoding: defending against jailbreak attacks via safety-aware decoding. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 2024, 5587−5605

[54]

Ilharco G, Ribeiro M T, Wortsman M, Schmidt L, Hajishirzi H, Farhadi A. Editing models with task arithmetic. In: Proceedings of the 11th International Conference on Learning Representations. 2023

[55]

Zhou W, Wang X, Xiong L, Xia H, Gu Y, et al. EasyJailbreak: a unified framework for jailbreaking large language models. 2024, arXiv preprint arXiv: 2403.12171

[56]

Li H, Guo D, Fan W, Xu M, Huang J, Meng F, Song Y. Multi-step jailbreaking privacy attacks on ChatGPT. In: Proceedings of Findings of the Association for Computational Linguistics. 2023, 4138−4153

[57]

Li X, Zhou Z, Zhu J, Yao J, Liu T, Han B. Deepinception: hypnotize large language model to be jailbreaker. 2024, arXiv preprint arXiv: 2311.03191

[58]

Shayegani E, Dong Y, Abu-Ghazaleh N. Jailbreak in pieces: compositional adversarial attacks on multi-modal language models. In: Proceedings of the 12th International Conference on Learning Representations. 2024

[59]

Yuan Y, Jiao W, Wang W, Huang J T, He P, Shi S, Tu Z. GPT-4 is too smart to be safe: stealthy chat with LLMs via cipher. In: Proceedings of the 12th International Conference on Learning Representations. 2024

[60]

Deng Y, Zhang W, Pan S J, Bing L. Multilingual jailbreak challenges in large language models. In: Proceedings of the 12th International Conference on Learning Representations. 2024

[61]

Lv H, Wang X, Zhang Y, Huang C, Dou S, Ye J, Gui T, Zhang Q, Huang X. CodeChameleon: personalized encryption framework for jailbreaking large language models. 2024, arXiv preprint arXiv: 2402.16717

[62]

Zou A, Wang Z, Carlini N, Nasr M, Kolter J Z, Fredrikson M. Universal and transferable adversarial attacks on aligned language models. 2023, arXiv preprint arXiv: 2307.15043

[63]

Liu X, Xu N, Chen M, Xiao C. AutoDAN: generating stealthy jailbreak prompts on aligned large language models. 2024, arXiv preprint arXiv: 2310.04451

[64]

Yu J, Lin X, Yu Z, Xing X. GPTFUZZER: red teaming large language models with auto-generated jailbreak prompts. 2024, arXiv preprint arXiv: 2309.10253

[65]

Chao P, Robey A, Dobriban E, Hassani H, Pappas G J, Wong E. Jailbreaking black box large language models in twenty queries. 2024, arXiv preprint arXiv: 2310.08419

[66]

Ding P, Kuang J, Ma D, Cao X, Xian Y, Chen J, Huang S. A wolf in sheep’s clothing: generalized nested jailbreak prompts can fool large language models easily. In: Proceedings of 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2024, 2136−2153

[67]

Ji J, Liu M, Dai J, Pan X, Zhang C, Bian C, Chen B, Sun R, Wang Y, Yang Y. BEAVERTAILS: towards improved safety alignment of LLM via a human-preference dataset. In: Proceedings of the 37th Conference on Neural Information Processing Systems. 2023, 36

[68]

Wang Y, Li H, Han X, Nakov P, Baldwin T. Do-not-answer: a dataset for evaluating safeguards in LLMs. 2023, arXiv preprint arXiv: 2308.13387

[69]

Bhardwaj R, Poria S. Red-teaming large language models using chain of utterances for safety-alignment. 2023, arXiv preprint arXiv: 2308.09662

[70]

Beltagy I, Peters M E, Cohan A. Longformer: the long-document transformer. 2020, arXiv preprint arXiv: 2004.05150

[71]

Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, Steinhardt J. Measuring massive multitask language understanding. In: Proceedings of the 9th International Conference on Learning Representations. 2021

[72]

Zellers R, Holtzman A, Bisk Y, Farhadi A, Choi Y. HellaSwag: can a machine really finish your sentence? In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, 4791−4800

[73]

Clark P, Cowhey I, Etzioni O, Khot T, Sabharwal A, Schoenick C, Tafjord O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. 2018, arXiv preprint arXiv: 1803.05457

[74]

Sakaguchi K, Le Bras R, Bhagavatula C, Choi Y . WinoGrande: an adversarial winograd schema challenge at scale. Communications of the ACM, 2021, 64( 9): 99–106

[75]

Bisk Y, Zellers R, Le bras R, Gao J, Choi Y. PIQA: reasoning about physical commonsense in natural language. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020, 7432−7439

[76]

Suzgun M, Scales N, Schärli N, Gehrmann S, Tay Y, Chung H W, Chowdhery A, Le Q V, Chi E H, Zhou D, Wei J. Challenging BIG-bench tasks and whether chain-of-thought can solve them. In: Proceedings of Findings of the Association for Computational Linguistics. 2023, 13003−13051

[77]

Clark C, Lee K, Chang M W, Kwiatkowski T, Collins M, Toutanova K. BoolQ: exploring the surprising difficulty of natural yes/no questions. In: Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019, 2924−2936

[78]

Cobbe K, Kosaraju V, Bavarian M, Chen M, Jun H, Kaiser L, Plappert M, Tworek J, Hilton J, Nakano R, Hesse C, Schulman J. Training verifiers to solve math word problems. 2021, arXiv preprint arXiv: 2110.14168

[79]

Hendrycks D, Burns C, Kadavath S, Arora A, Basart S, Tang E, Song D, Steinhardt J. Measuring mathematical problem solving with the MATH dataset. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021

[80]

Gao L, Tow J, Biderman S, Black S, DiPofi A, Foster C, Golding L, Hsu J, McDonell K, Muennighoff N, Phang J, Reynolds L, Tang E, Thite A, Wang B, Wang K, Zou A. A framework for few-shot language model evaluation. See github.com/EleutherAI/lm-evaluation-harness website, 2021

[81]

Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Alban Desmaison, Köpf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang Lu, Bai J, Chintala S. PyTorch: an imperative style, high-performance deep learning library. In: Proceedings of the 33rd Conference on Neural Information Processing Systems. 2019, 32

[82]

Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, et al. Transformers: state-of-the-art natural language processing. In: Proceedings of 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020, 38−45

[83]

Lin S, Hilton J, Evans O. TruthfulQA: measuring how models mimic human falsehoods. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022, 3214−3252

[84]

Zheng C, Yin F, Zhou H, Meng F, Zhou J, Chang K W, Huang M, Peng N. On prompt-driven safeguarding for large language models. In: Proceedings of the 41st International Conference on Machine Learning. 2024

[85]

Wu T, Luo L, Li Y F, Pan S, Vu T T, Haffari G. Continual learning for large language models: a survey. 2024, arXiv preprint arXiv: 2402.01364

[86]

Lopez-Paz D, Ranzato M. Gradient episodic memory for continual learning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 6470−6479

[87]

Sun F K, Ho C H, Lee H Y. LAMOL: language modeling for lifelong language learning. In: Proceedings of the 8th International Conference on Learning Representations. 2020

[88]

Qin C, Joty S R. LFPT5: a unified framework for lifelong few-shot language learning based on prompt tuning of T5. In: Proceedings of the 10th International Conference on Learning Representations. 2022

[89]

Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu A A, Milan K, Quan J, Ramalho T, Grabska-Barwinska A, Hassabis D, Clopath C, Kumaran D, Hadsell R . Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America, 2017, 114( 13): 3521–3526

[90]

Wang X, Chen T, Ge Q, Xia H, Bao R, Zheng R, Zhang Q, Gui T, Huang X. Orthogonal subspace learning for language model continual learning. In: Proceedings of Findings of the Association for Computational Linguistics. 2023, 10658−10671

[91]

Song C, Han X, Zeng Z, Li K, Chen C, Liu Z, Sun M, Yang T. ConPET: continual parameter-efficient tuning for large language models. 2023, arXiv preprint arXiv: 2309.14763

[92]

Zhao W, Wang S, Hu Y, Zhao Y, Qin B, Zhang X, Yang Q, Xu D, Che W. SAPT: a shared attention framework for parameter-efficient continual learning of large language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 2024, 11641−11661

[93]

Bai A, Yeh C K, Hsieh C J, Taly A. Which pretrain samples to rehearse when finetuning pretrained models? 2024, arXiv preprint arXiv: 2402.08096

[94]

Xia M, Malladi S, Gururangan S, Arora S, Chen D. LESS: selecting influential data for targeted instruction tuning. In: Proceedings of the 41st International Conference on Machine Learning. 2024

[95]

Tao Z, Lin T E, Chen X, Li H, Wu Y, Li Y, Jin Z, Huang F, Tao D, Zhou J. A survey on self-evolution of large language models. 2024, arXiv preprint arXiv: 2404.14387

[96]

Cao B, Lu K, Lu X, Chen J, Ren M, Xiang H, Liu P, Lu Y, He B, Han X, Sun L, Lin H, Yu B. Towards scalable automated alignment of LLMs: a survey. 2024, arXiv preprint arXiv: 2406.01252

[97]

Yuan W, Pang R Y, Cho K, Li X, Sukhbaatar S, Xu J, Weston J. Self-rewarding language models. 2024, arXiv preprint arXiv: 2401.10020

[98]

Wei A, Haghtalab N, Steinhardt J. Jailbroken: how does LLM safety training fail? In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2024, 3508

[99]

Yi J, Ye R, Chen Q, Zhu B B, Chen S, Lian D, Sun G, Xie X, Wu F. Open-source can be dangerous: on the vulnerability of value alignment in open-source LLMs. See openreview.net/pdf?id=NIouO0C0ex website, 2023

[100]

He L, Xia M, Henderson P. What’s in your “safe” data?: Identifying benign data that breaks safety. 2024, arXiv preprint arXiv: 2404.01099v1

[101]

Hu E J, Shen P, Wallis P, Allen-Zhu Z, Li Y, Wang S, Wang L, Chen W. LoRA: low-rank adaptation of large language models. In: Proceedings of the 10th International Conference on Learning Representations. 2022

RIGHTS & PERMISSIONS

The Author(s) 2025. This article is published with open access at link.springer.com and journal.hep.com.cn
