A prompt-based approach to adversarial example generation and robustness enhancement
Yuting YANG , Pei HUANG , Juan CAO , Jintao LI , Yun LIN , Feifei MA
Front. Comput. Sci., 2024, Vol. 18, Issue 4: 184318
Recent years have seen the wide application of natural language processing (NLP) models in crucial areas such as finance, medical treatment, and news media, raising concerns about model robustness and vulnerabilities. We find that the prompt paradigm can probe special robustness defects of pre-trained language models: malicious prompt texts are first constructed for inputs, and a pre-trained language model then generates adversarial examples for victim models via mask-filling. Experimental results show that the prompt paradigm can efficiently generate adversarial examples that are more diverse than those produced by synonym substitution. We then propose a novel robust training approach based on the prompt paradigm, which incorporates prompt texts as alternatives to adversarial examples and enhances robustness under a lightweight minimax-style optimization framework. Experiments on three real-world tasks and two deep neural models show that our approach significantly improves the robustness of models against adversarial attacks.
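To make the mask-filling attack described above concrete, here is a minimal, self-contained sketch. All names (`mask_fill_candidates`, `victim_model`, `prompt_attack`) are illustrative, not the paper's actual implementation; in a real system, `mask_fill_candidates` would query a pre-trained masked language model (e.g., BERT) for its top predictions at a [MASK] position, and `victim_model` would be the classifier under attack.

```python
def mask_fill_candidates(tokens, i):
    """Stub for a pre-trained LM's top mask-filling predictions at position i.
    A real implementation would mask tokens[i] and rank the LM's predictions."""
    vocab = {"good": ["fine", "decent", "great"],
             "bad": ["poor", "awful", "weak"]}
    return vocab.get(tokens[i], [tokens[i]])

def victim_model(tokens):
    """Stub victim sentiment classifier: 1 = positive, 0 = negative."""
    positive = {"good", "great", "fine"}
    return 1 if any(t in positive for t in tokens) else 0

def prompt_attack(tokens):
    """Replace each token in turn with mask-filling candidates until the
    victim model's prediction flips; return the adversarial example."""
    original = victim_model(tokens)
    for i in range(len(tokens)):
        for cand in mask_fill_candidates(tokens, i):
            adv = tokens[:i] + [cand] + tokens[i + 1:]
            if victim_model(adv) != original:
                return adv  # label flipped: adversarial example found
    return None  # no successful perturbation found

print(prompt_attack(["the", "movie", "was", "good"]))
```

Note that because candidates come from a language model's mask-filling distribution rather than a fixed synonym table, the search space also covers insertions and context-dependent rewrites, which is what enables the more diverse perturbations the abstract reports.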
robustness / adversarial example / prompt learning / pre-trained language model
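The "minimax-style optimization framework" mentioned in the abstract presumably takes the standard adversarial-training form; a generic sketch (the symbols are illustrative, not the paper's own notation):

```latex
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
\left[ \max_{x' \in S(x)} \mathcal{L}\bigl(f_{\theta}(x'), y\bigr) \right]
```

Here $f_{\theta}$ is the model being trained, $\mathcal{L}$ is the task loss, and $S(x)$ is the set of perturbed inputs; in this paper's approach, prompt texts serve as lightweight alternatives to fully searched adversarial examples when approximating the inner maximization.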
Higher Education Press