The gains do not make up for the losses: a comprehensive evaluation for safety alignment of large language models via machine unlearning
Weixiang ZHAO, Yulin HU, Xingyu SUI, Zhuojun LI, Yang DENG, Yanyan ZHAO, Bing QIN, Wanxiang CHE
Front. Comput. Sci., 2026, Vol. 20, Issue (2): 2002319
Machine Unlearning (MU) has emerged as a promising technique for aligning large language models (LLMs) with safety requirements by steering them to forget specific harmful content. Despite significant progress in previous studies, we argue that current evaluation criteria, which focus solely on safety, are impractical and biased, raising concerns about the true effectiveness of MU techniques. To address this, we propose to comprehensively evaluate LLMs after MU from three aspects: safety, over-safety, and general utility. Specifically, we first construct a novel benchmark, MUBENCH, comprising 18 related datasets, in which safety is measured with both vanilla harmful inputs and 10 types of jailbreak attacks. Furthermore, we examine whether MU introduces side effects, focusing on over-safety and utility loss. Extensive experiments are performed on 3 popular LLMs with 7 recent MU methods. The results highlight a challenging trilemma in achieving safety alignment without side effects, indicating that there is still considerable room for further exploration. MUBENCH serves as a comprehensive benchmark, fostering future research on MU for safety alignment of LLMs.
machine unlearning / safety alignment / large language models
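As a rough illustration of the three-aspect evaluation described in the abstract, the sketch below aggregates per-response judgments into one score per aspect. It is not the MUBENCH implementation: the `EvalRecord` type, the aspect labels, and the simple pass-rate metric are all assumptions made for this example, and the judgments themselves are assumed to come from some external evaluator.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One model response, already judged by an external evaluator (assumed)."""
    aspect: str   # hypothetical labels: "safety", "over_safety", "utility"
    passed: bool  # True if the response was judged acceptable for that aspect


def aspect_scores(records):
    """Return the fraction of passed records for each aspect present."""
    totals, passes = {}, {}
    for r in records:
        totals[r.aspect] = totals.get(r.aspect, 0) + 1
        passes[r.aspect] = passes.get(r.aspect, 0) + int(r.passed)
    return {aspect: passes[aspect] / totals[aspect] for aspect in totals}


# Toy, made-up judgments purely to exercise the function.
records = [
    EvalRecord("safety", True),        # harmful prompt correctly refused
    EvalRecord("safety", False),       # jailbreak prompt that succeeded
    EvalRecord("over_safety", True),   # benign prompt answered normally
    EvalRecord("utility", True),       # general task answered correctly
    EvalRecord("utility", True),
]
scores = aspect_scores(records)
```

Reporting the three scores side by side, rather than safety alone, is what makes a trade-off visible: a method that raises the safety score while lowering the other two would not show up under a safety-only criterion.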
The Author(s) 2025. This article is published with open access at link.springer.com and journal.hep.com.cn