Recent advances in attack and defense approaches of large language models
Jing CUI, Yishi XU, Zhewei HUANG, Zekeng ZENG, Jianbin JIAO, Junge ZHANG
Front. Comput. Sci., 2027, 21(4): 2104805
Large Language Models (LLMs) have revolutionized artificial intelligence and machine learning through their advanced generation and reasoning capabilities. However, their widespread deployment has raised significant safety and reliability concerns. Emerging threats, combined with vulnerabilities inherited from deep neural networks, can compromise model security and foster a false sense of safety. Given the extensive body of LLM security research, particularly the studies published in late 2023 and 2024, a survey that starts from observable model behavior and traces it to its internal representational roots is essential for giving the community key insights and guiding future development. In this survey, we analyze recent studies of attack vectors and threat models, offering insight into how attack mechanisms can be improved. We also examine current defense strategies, highlighting both their strengths and their limitations. Our goal is to deepen the understanding of LLM safety challenges and to contribute to the development of more robust security measures.
large language models / safety / attack methods / defense mechanisms