Proxy robustness in vision language models is effortlessly transferable

Xiaowei FU; Fuxiang HUANG; Lei ZHANG

doi:10.1007/s11704-026-50951-1

Front. Comput. Sci., 2027, Vol. 21, Issue 7: 2107343. DOI: 10.1007/s11704-026-50951-1

Artificial Intelligence

RESEARCH ARTICLE

Abstract

As a pivotal technique for improving the defense of deep models, adversarial robustness transfer via distillation has demonstrated remarkable success in conventional image classification tasks. However, this paradigm encounters a critical challenge when applied to vision-language models (VLMs) such as CLIP: constructing an adversarially robust teacher for large-scale multi-modal models demands prohibitively high computational resources. We bridge this gap by revealing an interesting phenomenon: vanilla CLIP (without adversarial training) exhibits intrinsic defensive capabilities against adversarial examples generated by another CLIP with a different architecture. We formally define this as proxy adversarial robustness and propose a Heterogeneous Proxy Transfer (HPT) framework that establishes cross-architectural robustness distillation channels between CLIP variants, effortlessly enabling VLM robustness transfer from proxy to target models. However, such a proxy transfer paradigm easily induces severe overfitting, leading to a sharp degradation in zero-shot natural generalization. To resolve this, we design Generalization-Pivot Decoupling (GPD), which leverages the difference in learning rate scheduling. GPD decouples the proxy transfer process into a generalization-anchored warm-up that maintains generalization and a generalization-pulled HPT that promotes adversarial robustness, achieving an equilibrium between natural generalization and adversarial robustness. Extensive experiments on 15 zero-shot datasets demonstrate the effectiveness of our HPT-GPD method. The code is available at github.com/fxw13/HPT-GPD.

Keywords

adversarial defense / vision language model / adversarial distillation / proxy robustness

Cite this article

Xiaowei FU, Fuxiang HUANG, Lei ZHANG. Proxy robustness in vision language models is effortlessly transferable. Front. Comput. Sci., 2027, 21(7): 2107343. DOI: 10.1007/s11704-026-50951-1
