The gains do not make up for the losses: a comprehensive evaluation for safety alignment of large language models via machine unlearning

Weixiang ZHAO , Yulin HU , Xingyu SUI , Zhuojun LI , Yang DENG , Yanyan ZHAO , Bing QIN , Wanxiang CHE

Front. Comput. Sci. ›› 2026, Vol. 20 ›› Issue (2) : 2002319

PDF (1261KB)
Front. Comput. Sci. ›› 2026, Vol. 20 ›› Issue (2) : 2002319 DOI: 10.1007/s11704-024-41099-x
Artificial Intelligence
RESEARCH ARTICLE

The gains do not make up for the losses: a comprehensive evaluation for safety alignment of large language models via machine unlearning

Author information +
History +
PDF (1261KB)

Abstract

Machine Unlearning (MU) has emerged as a promising technique for aligning large language models (LLMs) with safety requirements to steer them forgetting specific harmful contents. Despite the significant progress in previous studies, we argue that the current evaluation criteria, which solely focus on safety evaluation, are actually impractical and biased, leading to concerns about the true effectiveness of MU techniques. To address this, we propose to comprehensively evaluate LLMs after MU from three aspects: safety, over-safety, and general utility. Specifically, a novel benchmark MUBENCH with 18 related datasets is first constructed, where the safety is measured with both vanilla harmful inputs and 10 types of jailbreak attacks. Furthermore, we examine whether MU introduces side effects, focusing on over-safety and utility-loss. Extensive experiments are performed on 3 popular LLMs with 7 recent MU methods. The results highlight a challenging trilemma in safety alignment without side effects, indicating that there is still considerable room for further exploration. MUBENCH serves as a comprehensive benchmark, fostering future research on MU for safety alignment of LLMs.

Graphical abstract

Keywords

machine unlearning / safety alignment / large language models

Cite this article

Download citation ▾
Weixiang ZHAO, Yulin HU, Xingyu SUI, Zhuojun LI, Yang DENG, Yanyan ZHAO, Bing QIN, Wanxiang CHE. The gains do not make up for the losses: a comprehensive evaluation for safety alignment of large language models via machine unlearning. Front. Comput. Sci., 2026, 20(2): 2002319 DOI:10.1007/s11704-024-41099-x

登录浏览全文

4963

注册一个新账户 忘记密码

References

[1]

Brown T B, Mann B, Ryder N, Subbiah M, Kaplan J D, , . Language models are few-shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 159

[2]

Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu P J . Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020, 21( 1): 140

[3]

Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, , . Gpt-4 technical report. 2024, arXiv preprint arXiv: 2303.08774

[4]

Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M A, Lacroix T, Rozière B, Goyal N, Hambro E, Azhar F, Rodriguez A, Joulin A, Grave E, Lample G. LlaMA: open and efficient foundation language models. 2023, arXiv preprint arXiv: 2302.13971

[5]

Touvron H, Martin L, Stone K, Albert P, Almahairi A, , . Llama 2: open foundation and fine-tuned chat models. 2023, arXiv preprint arXiv: 2307.09288

[6]

Gemini Team Google. Gemini: a family of highly capable multimodal models. 2024, arXiv preprint arXiv: 2312.11805

[7]

Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C L, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, Schulman J, Hilton J, Kelton F, Miller L, Simens M, Askell A, Welinder P, Christiano P, Leike J, Lowe R. Training language models to follow instructions with human feedback. In: Proceedings of the 36th Conference on Neural Information Processing Systems. 2022, 27730−27744

[8]

Bai Y, Kadavath S, Kundu S, Askell A, Kernion J, , . Constitutional AI: harmlessness from AI feedback. 2022, arXiv preprint arXiv: 2212.08073

[9]

Glaese A, McAleese N, Trębacz M, Aslanides J, Firoiu V, , . Improving alignment of dialogue agents via targeted human judgements. 2022, arXiv preprint arXiv: 2209.14375

[10]

Korbak T, Shi K, Chen A, Bhalerao R V, Buckley C, Phang J, Bowman S R, Perez E. Pretraining language models with human preferences. In: Proceedings of the 40th International Conference on Machine Learning. 2023, 17506−17533

[11]

Askell A, Bai Y, Chen A, Drain D, Ganguli D, , . A general language assistant as a laboratory for alignment. 2021, arXiv preprint arXiv: 2112.00861

[12]

Bai Y, Jones A, Ndousse K, Askell A, Chen A, , . Training a helpful and harmless assistant with reinforcement learning from human feedback. 2022, arXiv preprint arXiv: 2204.05862

[13]

Rafailov R, Sharma A, Mitchell E, Ermon S, Manning C D, Finn C. Direct preference optimization: Your language model is secretly a reward model. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 2338

[14]

Yao Y, Xu X, Liu Y. Large language model unlearning. 2024, arXiv preprint arXiv: 2310.10683

[15]

Liu Z, Dou G, Tan Z, Tian Y, Jiang M. Towards safer large language models through machine unlearning. In: Proceedings of Findings of the Association for Computational Linguistics: ACL 2024. 2024, 1817−1829

[16]

Röttger P, Kirk H, Vidgen B, Attanasio G, Bianchi F, Hovy D. XSTest: a test suite for identifying exaggerated safety behaviours in large language models. In: Proceedings of 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2024, 5377−5400

[17]

Shi C, Wang X, Ge Q, Gao S, Yang X, Gui T, Zhang Q, Huang X, Zhao X, Lin D. Navigating the OverKill in large language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 2024, 4602−4614

[18]

Zhang R, Lin L, Bai Y, Mei S. Negative preference optimization: from catastrophic collapse to effective unlearning. 2024, arXiv preprint arXiv: 2404.05868

[19]

Li N, Pan A, Gopal A, Yue S, Berrios D, , . The WMDP benchmark: measuring and reducing malicious use with unlearning. In: Proceedings of the 41st International Conference on Machine Learning. 2024

[20]

Zhang J, Chen S, Liu J, He J. Composing parameter-efficient modules with arithmetic operations. In: Proceedings of the 37th Conference on Neural Information Processing Systems. 2023, 12589−12610

[21]

Gao L, Niu Y, Tang T, Avestimehr S, Annavaram M. Ethos: rectifying language models in orthogonal parameter space. In: Proceedings of Findings of the Association for Computational Linguistics. 2024, 2054−2068

[22]

Jiang A Q, Sablayrolles A, Mensch A, Bamford C, Chaplot D S, , . Mistral 7B. 2023, arXiv preprint arXiv: 2310.06825

[23]

Zheng L, Chiang W L, Sheng Y, Zhuang S, Wu Z, , . Judging LLM-as-a-judge with MT-bench and chatbot arena. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 2020

[24]

Si N, Zhang H, Chang H, Zhang W, Qu D, Zhang W. Knowledge unlearning for LLMs: tasks, methods, and challenges. 2023, arXiv preprint arXiv: 2311.15766

[25]

Zhang D, Finckenberg-Broman P, Hoang T, Pan S, Xing Z, Staples M, Xu X. Right to be forgotten in the era of large language models: implications, challenges, and solutions. 2024, arXiv preprint arXiv: 2307.03941

[26]

Liu S, Yao Y, Jia J, Casper S, Baracaldo N, Hase P, Yao Y, Liu Y, Xu X, Li H, Varshney K R, Bansal M, Koyejo S, Liu Y. Rethinking machine unlearning for large language models. 2024, arXiv preprint arXiv: 2402.08787

[27]

Qu Y, Ding M, Sun N, Thilakarathna K, Zhu T, Niyato D. The frontier of data erasure: machine unlearning for large language models. 2024, arXiv preprint arXiv: 2403.15779

[28]

Jang J, Yoon D, Yang S, Cha S, Lee M, Logeswaran L, Seo M. Knowledge unlearning for mitigating privacy risks in language models. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. 2023, 14389−14408

[29]

Wu X, Li J, Xu M, Dong W, Wu S, Bian C, Xiong D. DEPN: detecting and editing privacy neurons in pretrained language models. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 2875−2886

[30]

Wang L, Zeng X, Guo J, Wong K F, Gottlob G. Selective forgetting: advancing machine unlearning techniques and evaluation in language models. 2024, arXiv preprint arXiv: 2402.05813

[31]

Bhardwaj R, Do D A, Poria S. Language models are homer Simpson! Safety re-alignment of fine-tuned language models through task arithmetic. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 2024, 14138−14149

[32]

Lin Z, Wang Z, Tong Y, Wang Y, Guo Y, Wang Y, Shang J. ToxicChat: unveiling hidden challenges of toxicity detection in real-world user-AI conversation. In: Proceedings of Findings of the Association for Computational Linguistics. 2023, 4694−4702

[33]

Inan H, Upasani K, Chi J, Rungta R, Iyer K, Mao Y, Tontchev M, Hu Q, Fuller B, Testuggine D, Khabsa M. Llama guard: LLM-based input-output safeguard for human-AI conversations. 2023, arXiv preprint arXiv: 2312.06674

[34]

Xie Y, Fang M, Pi R, Gong N. GradSafe: detecting jailbreak prompts for LLMs via safety-critical gradient analysis. 2024, arXiv preprint arXiv: 2402.13494

[35]

Ngo H, Raterink C, Araújo J G M, Zhang I, Chen C, Morisot A, Frosst N. Mitigating harm in language models with conditional-likelihood filtration. 2021, arXiv preprint arXiv: 2108.07790

[36]

Anil R, Dai A M, Firat O, Johnson M, Lepikhin D, , . PaLM 2 technical report. 2023, arXiv preprint arXiv: 2305.10403

[37]

Wei J, Bosma M, Zhao V Y, Guu K, Yu A W, Lester B, Du N, Dai A M, Le Q V. Finetuned language models are zero-shot learners. In: Proceedings of the 10th International Conference on Learning Representations. 2022

[38]

Longpre S, Hou L, Vu T, Webson A, Chung H W, Tay Y, Zhou D, Le Q V, Zoph B, Wei J, Roberts A. The flan collection: designing data and methods for effective instruction tuning. In: Proceedings of the 40th International Conference on Machine Learning. 2023, 941

[39]

Lee H, Phatale S, Mansoor H, Mesnard T, Ferret J, Lu K, Bishop C, Hall E, Carbune V, Rastogi A, Prakash S. RLAIF vs. RLHF: scaling reinforcement learning from human feedback with AI feedback. In: Proceedings of the 41st International Conference on Machine Learning. 2024

[40]

Ethayarajh K, Xu W, Muennighoff N, Jurafsky D, Kiela D. KTO: model alignment as prospect theoretic optimization. 2024, arXiv preprint arXiv: 2402.01306

[41]

Duan S, Yi X, Zhang P, Liu Y, Liu Z, Lu T, Xie X, Gu N. Negating negatives: alignment with human negative samples via distributional dispreference optimization. In: Proceedings of Findings of the Association for Computational Linguistics. 2024, 1012−1042

[42]

Sun Z, Shen Y, Zhou Q, Zhang H, Chen Z, Cox D, Yang Y, Gan C. Principle-driven self-alignment of language models from scratch with minimal human supervision. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2024, 115

[43]

Li X, Yu P, Zhou C, Schick T, Levy O, Zettlemoyer L, Weston J, Lewis M. Self-alignment with instruction backtranslation. In: Proceedings of the 12th International Conference on Learning Representations. 2024

[44]

Sun Z, Shen Y, Zhang H, Zhou Q, Chen Z, Cox D, Yang Y, Gan C. SALMON: Self-alignment with principle-following reward models. 2024, arXiv preprint arXiv: 2310.05910v1

[45]

Chen Z, Deng Y, Yuan H, Ji K, Gu Q. Self-play fine-tuning converts weak language models to strong language models. 2024, arXiv preprint arXiv: 2401.01335

[46]

Ganguli D, Lovitt L, Kernion J, Askell A, Bai Y, , . Red teaming language models to reduce harms: methods, scaling behaviors, and lessons learned. 2022, arXiv preprint arXiv: 2209.07858

[47]

Perez F, Ribeiro I. Ignore previous prompt: attack techniques for language models. 2022, arXiv preprint arXiv: 2211.09527

[48]

Qi X, Zeng Y, Xie T, Chen P Y, Jia R, Mittal P, Henderson P. Fine-tuning aligned language models compromises safety, even when users do not intend to! In: Proceedings of the 12th International Conference on Learning Representations. 2024

[49]

Zhong Q, Ding L, Liu J, Du B, Tao D. ROSE doesn’t do that: boosting the safety of instruction-tuned large language models with reverse prompt contrastive decoding. In: Proceedings of Findings of the Association for Computational Linguistics. 2024, 13721−13736

[50]

Xie Y, Yi J, Shao J, Curl J, Lyu L, Chen Q, Xie X, Wu F . Defending ChatGPT against jailbreak attack via self-reminders. Nature Machine Intelligence, 2023, 5( 1): 1486–1496

[51]

Phute M, Helbling A, Hull M, Peng S, Szyller S, Cornelius C, Chau D H. LLM self defense: by self examination, LLMs know they are being tricked. In: Proceedings of the 2nd Tiny Papers Track at ICLR 2024. 2024

[52]

Wei Z, Wang Y, Li A, Mo Y, Wang Y. Jailbreak and guard aligned language models with only few in-context demonstrations. 2024, arXiv preprint arXiv: 2310.06387

[53]

Xu Z, Jiang F, Niu L, Jia J, Lin B Y, Poovendran R. SafeDecoding: defending against jailbreak attacks via safety-aware decoding. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 2024, 5587−5605

[54]

Ilharco G, Ribeiro M T, Wortsman M, Schmidt L, Hajishirzi H, Farhadi A. Editing models with task arithmetic. In: Proceedings of the 11th International Conference on Learning Representations. 2023

[55]

Zhou W, Wang X, Xiong L, Xia H, Gu Y, , . EasyJailbreak: a unified framework for jailbreaking large language models. 2024, arXiv preprint arXiv: 2403.12171

[56]

Li H, Guo D, Fan W, Xu M, Huang J, Meng F, Song Y. Multi-step jailbreaking privacy attacks on ChatGPT. In: Proceedings of Findings of the Association for Computational Linguistics. 2023, 4138−4153

[57]

Li X, Zhou Z, Zhu J, Yao J, Liu T, Han B. Deepinception: hypnotize large language model to be jailbreaker. 2024, arXiv preprint arXiv: 2311.03191

[58]

Shayegani E, Dong Y, Abu-Ghazaleh N. Jailbreak in pieces: compositional adversarial attacks on multi-modal language models. In: Proceedings of the 12th International Conference on Learning Representations. 2024

[59]

Yuan Y, Jiao W, Wang W, Huang J T, He P, Shi S, Tu Z. GPT-4 is too smart to be safe: stealthy chat with LLMs via cipher. In: Proceedings of the 12th International Conference on Learning Representations. 2024

[60]

Deng Y, Zhang W, Pan S J, Bing L. Multilingual jailbreak challenges in large language models. In: Proceedings of the 12th International Conference on Learning Representations. 2024

[61]

Lv H, Wang X, Zhang Y, Huang C, Dou S, Ye J, Gui T, Zhang Q, Huang X. CodeChameleon: personalized encryption framework for jailbreaking large language models. 2024, arXiv preprint arXiv: 2402.16717

[62]

Zou A, Wang Z, Carlini N, Nasr M, Kolter J Z, Fredrikson M. Universal and transferable adversarial attacks on aligned language models. 2023, arXiv preprint arXiv: 2307.15043

[63]

Liu X, Xu N, Chen M, Xiao C. AutoDAN: generating stealthy jailbreak prompts on aligned large language models. 2024, arXiv preprint arXiv: 2310.04451

[64]

Yu J, Lin X, Yu Z, Xing X. GPTFUZZER: red teaming large language models with auto-generated jailbreak prompts. 2024, arXiv preprint arXiv: 2309.10253

[65]

Chao P, Robey A, Dobriban E, Hassani H, Pappas G J, Wong E. Jailbreaking black box large language models in twenty queries. 2024, arXiv preprint arXiv: 2310.08419

[66]

Ding P, Kuang J, Ma D, Cao X, Xian Y, Chen J, Huang S. A wolf in sheep’s clothing: generalized nested jailbreak prompts can fool large language models easily. In: Proceedings of 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2024, 2136−2153

[67]

Ji J, Liu M, Dai J, Pan X, Zhang C, Bian C, Chen B, Sun R, Wang Y, Yang Y. BEAVERTAILS: towards improved safety alignment of LLM via a human-preference dataset. In: Proceedings of the 37th Conference on Neural Information Processing Systems. 2023, 36

[68]

Wang Y, Li H, Han X, Nakov P, Baldwin T. Do-not-answer: a dataset for evaluating safeguards in LLMs. 2023, arXiv preprint arXiv: 2308.13387

[69]

Bhardwaj R, Poria S. Red-teaming large language models using chain of utterances for safety-alignment. 2023, arXiv preprint arXiv: 2308.09662

[70]

Beltagy I, Peters M E, Cohan A. Longformer: the long-document transformer. 2020, arXiv preprint arXiv: 2004.05150

[71]

Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, Steinhardt J. Measuring massive multitask language understanding. In: Proceedings of the 9th International Conference on Learning Representations. 2021

[72]

Zellers R, Holtzman A, Bisk Y, Farhadi A, Choi Y. HellaSwag: can a machine really finish your sentence? In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, 4791−4800

[73]

Clark P, Cowhey I, Etzioni O, Khot T, Sabharwal A, Schoenick C, Tafjord O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. 2018, arXiv preprint arXiv: 1803.05457

[74]

Sakaguchi K, Le Bras R, Bhagavatula C, Choi Y . WinoGrande: an adversarial winograd schema challenge at scale. Communications of the ACM, 2021, 64( 9): 99–106

[75]

Bisk Y, Zellers R, Le bras R, Gao J, Choi Y. PIQA: reasoning about physical commonsense in natural language. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020, 7432−7439

[76]

Suzgun M, Scales N, Schärli N, Gehrmann S, Tay Y, Chung H W, Chowdhery A, Le Q V, Chi E H, Zhou D, Wei J. Challenging BIG-bench tasks and whether chain-of-thought can solve them. In: Proceedings of Findings of the Association for Computational Linguistics. 2023, 13003−13051

[77]

Clark C, Lee K, Chang M W, Kwiatkowski T, Collins M, Toutanova K. BoolQ: exploring the surprising difficulty of natural yes/no questions. In: Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019, 2924−2936

[78]

Cobbe K, Kosaraju V, Bavarian M, Chen M, Jun H, Kaiser L, Plappert M, Tworek J, Hilton J, Nakano R, Hesse C, Schulman J. Training verifiers to solve math word problems. 2021, arXiv preprint arXiv: 2110.14168

[79]

Hendrycks D, Burns C, Kadavath S, Arora A, Basart S, Tang E, Song D, Steinhardt J. Measuring mathematical problem solving with the MATH dataset. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021

[80]

Gao L, Tow J, Biderman S, Black S, DiPofi A, Foster C, Golding L, Hsu J, McDonell K, Muennighoff N, Phang J, Reynolds L, Tang E, Thite A, Wang B, Wang K, Zou A. A framework for few-shot language model evaluation. See github.com/EleutherAI/lm-evaluation-harness website, 2021

[81]

Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Alban Desmaison, Köpf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang Lu, Bai J, Chintala S. PyTorch: an imperative style, high-performance deep learning library. In: Proceedings of the 33rd Conference on Neural Information Processing Systems. 2019, 32

[82]

Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, , . Transformers: state-of-the-art natural language processing. In: Proceedings of 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020, 38−45

[83]

Lin S, Hilton J, Evans O. TruthfulQA: measuring how models mimic human falsehoods. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022, 3214−3252

[84]

Zheng C, Yin F, Zhou H, Meng F, Zhou J, Chang K W, Huang M, Peng N. On prompt-driven safeguarding for large language models. In: Proceedings of the 41st International Conference on Machine Learning. 2024

[85]

Wu T, Luo L, Li Y F, Pan S, Vu T T, Haffari G. Continual learning for large language models: a survey. 2024, arXiv preprint arXiv: 2402.01364

[86]

Lopez-Paz D, Ranzato M. Gradient episodic memory for continual learning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 6470−6479

[87]

Sun F K, Ho C H, Lee H Y. LAMOL: language modeling for lifelong language learning. In: Proceedings of the 8th International Conference on Learning Representations. 2020

[88]

Qin C, Joty S R. LFPT5: a unified framework for lifelong few-shot language learning based on prompt tuning of T5. In: Proceedings of the 10th International Conference on Learning Representations. 2022

[89]

Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu A A, Milan K, Quan J, Ramalho T, Grabska-Barwinska A, Hassabis D, Clopath C, Kumaran D, Hadsell R . Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America, 2017, 114( 13): 3521–3526

[90]

Wang X, Chen T, Ge Q, Xia H, Bao R, Zheng R, Zhang Q, Gui T, Huang X. Orthogonal subspace learning for language model continual learning. In: Proceedings of Findings of the Association for Computational Linguistics. 2023, 10658−10671

[91]

Song C, Han X, Zeng Z, Li K, Chen C, Liu Z, Sun M, Yang T. ConPET: continual parameter-efficient tuning for large language models. 2023, arXiv preprint arXiv: 2309.14763

[92]

Zhao W, Wang S, Hu Y, Zhao Y, Qin B, Zhang X, Yang Q, Xu D, Che W. SAPT: a shared attention framework for parameter-efficient continual learning of large language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 2024, 11641−11661

[93]

Bai A, Yeh C K, Hsieh C J, Taly A. Which pretrain samples to rehearse when finetuning pretrained models? 2024, arXiv preprint arXiv: 2402.08096

[94]

Xia M, Malladi S, Gururangan S, Arora S, Chen D. LESS: selecting influential data for targeted instruction tuning. In: Proceedings of the 41st International Conference on Machine Learning. 2024

[95]

Tao Z, Lin T E, Chen X, Li H, Wu Y, Li Y, Jin Z, Huang F, Tao D, Zhou J. A survey on self-evolution of large language models. 2024, arXiv preprint arXiv: 2404.14387

[96]

Cao B, Lu K, Lu X, Chen J, Ren M, Xiang H, Liu P, Lu Y, He B, Han X, Sun L, Lin H, Yu B. Towards scalable automated alignment of LLMs: a survey. 2024, arXiv preprint arXiv: 2406.01252

[97]

Yuan W, Pang R Y, Cho K, Li X, Sukhbaatar S, Xu J, Weston J. Self-rewarding language models. 2024, arXiv preprint arXiv: 2401.10020

[98]

Wei A, Haghtalab N, Steinhardt J. Jailbroken: how does LLM safety training fail? In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2024, 3508

[99]

Yi J, Ye R, Chen Q, Zhu B B, Chen S, Lian D, Sun G, Xie X, Wu F. Open-source can be dangerous: on the vulnerability of value alignment in open-source LLMs. See openreview.net/pdf?id=NIouO0C0ex website, 2023

[100]

He L, Xia M, Henderson P. What’s in your “safe” data?: Identifying benign data that breaks safety. 2024, arXiv preprint arXiv: 2404.01099v1

[101]

Hu E J, Shen P, Wallis P, Allen-Zhu Z, Li Y, Wang S, Wang L, Chen W. LoRA: low-rank adaptation of large language models. In: Proceedings of the 10th International Conference on Learning Representations. 2022

RIGHTS & PERMISSIONS

The Author(s) 2025. This article is published with open access at link.springer.com and journal.hep.com.cn

AI Summary AI Mindmap
PDF (1261KB)

Supplementary files

Highlights

474

Accesses

0

Citation

Detail

Sections
Recommended

AI思维导图

/