Recent advances in attack and defense approaches of large language models

Jing CUI , Yishi XU , Zhewei HUANG , Zekeng ZENG , Jianbin JIAO , Junge ZHANG

Front. Comput. Sci. ›› 2027, Vol. 21 ›› Issue (4): 2104805 DOI: 10.1007/s11704-025-50297-0

Information Security
REVIEW ARTICLE

Abstract

Large Language Models (LLMs) have revolutionized artificial intelligence and machine learning through their advanced generation and reasoning capabilities. However, their widespread deployment has raised significant safety and reliability concerns. Emerging threats, combined with vulnerabilities long established in deep neural networks, can compromise model security and even create a false sense of security. Given the extensive research on LLM security, especially studies from late 2023 and 2024, a survey that starts from model behavior and traces it back to its internal representational roots is crucial for providing the community with key insights and guiding future development. In this survey, we analyze recent studies on various attack vectors and threat models, providing insights into how attack mechanisms are improving. We also examine current defense strategies, highlighting their strengths and remaining limitations. Our goal is to deepen the understanding of LLM safety challenges and to contribute to the development of more robust security measures.

Keywords

large language models / safety / attack methods / defense mechanisms

Cite this article

Jing CUI, Yishi XU, Zhewei HUANG, Zekeng ZENG, Jianbin JIAO, Junge ZHANG. Recent advances in attack and defense approaches of large language models. Front. Comput. Sci., 2027, 21(4): 2104805. DOI: 10.1007/s11704-025-50297-0


