A survey of backdoor attacks and defences: From deep neural networks to large language models

Ling-Xin Jin, Wei Jiang, Xiang-Yu Wen, Mei-Yu Lin, Jin-Yu Zhan, Xing-Zhi Zhou, Maregu Assefa Habtie, Naoufel Werghi

Journal of Electronic Science and Technology, 2025, 23(3): 100326. DOI: 10.1016/j.jnlest.2025.100326

Abstract

Deep neural networks (DNNs) have found extensive applications in safety-critical artificial intelligence systems, such as autonomous driving and facial recognition. However, recent research has revealed their susceptibility to backdoors maliciously injected by adversaries. This vulnerability arises from the intricate architecture and opacity of DNNs, which leave numerous redundant neurons embedded within the models. Adversaries exploit these redundant neurons to conceal malicious backdoor behavior inside DNNs, causing erroneous outputs on trigger-bearing inputs and posing substantial threats to the reliability of DNN-based applications. This article presents a comprehensive survey of backdoor attacks against DNNs and the countermeasures employed to mitigate them. We first trace the evolution from traditional backdoor attacks to backdoor attacks against DNNs, highlighting the feasibility and practicality of mounting such attacks. We then review notable works spanning a variety of attack and defense strategies and compare their approaches, offering constructive insights for refining these techniques. Finally, we extend our perspective to large language models (LLMs) and synthesize the characteristics and developmental trends of backdoor attacks and defense methods targeting LLMs. Through a systematic review of existing studies on backdoor vulnerabilities in LLMs, we identify critical open challenges in this field and propose actionable directions for future research.
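
To make the threat model concrete, the sketch below illustrates the classic data-poisoning setting surveyed here (in the spirit of BadNets-style attacks): an adversary stamps a small trigger patch onto a fraction of the training images and relabels them to a chosen target class, so that a model trained on the poisoned data behaves normally on clean inputs but predicts the target class whenever the trigger appears. This is a minimal, hypothetical sketch in plain NumPy; the helper `poison_dataset` and the toy array shapes are illustrative assumptions, not code from the article.

```python
# Minimal BadNets-style data-poisoning sketch (illustrative only; not from the article).
# Assumes images are float arrays in [0, 1] with shape (N, H, W, C).
import numpy as np


def poison_dataset(images, labels, target_class, poison_rate=0.05,
                   trigger_size=3, trigger_value=1.0, seed=0):
    """Stamp a small square trigger in the bottom-right corner of a random
    subset of the images and relabel those samples to the target class."""
    rng = np.random.default_rng(seed)
    images = images.copy()
    labels = labels.copy()

    n = len(images)
    n_poison = int(poison_rate * n)
    idx = rng.choice(n, size=n_poison, replace=False)

    # Apply the trigger patch and flip the label for the selected samples.
    images[idx, -trigger_size:, -trigger_size:, :] = trigger_value
    labels[idx] = target_class
    return images, labels, idx


if __name__ == "__main__":
    # Toy data standing in for a real training set (e.g., 32x32 RGB images).
    x = np.random.rand(1000, 32, 32, 3).astype(np.float32)
    y = np.random.randint(0, 10, size=1000)

    x_poisoned, y_poisoned, poisoned_idx = poison_dataset(x, y, target_class=7)
    print(f"Poisoned {len(poisoned_idx)} of {len(x)} samples; "
          f"relabeled to class {y_poisoned[poisoned_idx[0]]}")
```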

Keywords

Backdoor attacks / Backdoor defenses / Deep neural networks / Large language model

Cite this article

Ling-Xin Jin, Wei Jiang, Xiang-Yu Wen, Mei-Yu Lin, Jin-Yu Zhan, Xing-Zhi Zhou, Maregu Assefa Habtie, Naoufel Werghi. A survey of backdoor attacks and defences: From deep neural networks to large language models. Journal of Electronic Science and Technology, 2025, 23(3): 100326. DOI: 10.1016/j.jnlest.2025.100326


CRediT authorship contribution statement

Ling-Xin Jin: Writing – original draft, Formal analysis, Investigation, Conceptualization. Wei Jiang: Writing – original draft, Validation, Supervision. Xiang-Yu Wen: Writing – original draft, Formal analysis, Investigation. Mei-Yu Lin: Validation, Supervision. Jin-Yu Zhan: Validation, Supervision. Xing-Zhi Zhou: Validation. Maregu Assefa Habtie: Validation. Naoufel Werghi: Validation.

Declaration of competing interest

The authors declare that they have no conflict of interest. All figures are obtained from publicly available datasets with proper attribution and reproduced with permission.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grants No. 62372087 and No. 62072076, the Research Fund of State Key Laboratory of Processors under Grant No. CLQ202310, and the CSC scholarship under Grant No. 202406070152.

Appendix.

For conciseness, the derivation of neural backdoor attacks, the methodology comparisons, research challenges, and some figures are included in the Supplementary Materials.

Appendix A. Supplementary data

The following is the Supplementary data to this article: Multimedia component 1 (Word document, 674KB).

