1. Laboratory of Intelligent Recognition and Image Processing, School of Computer Science and Engineering, Beihang University, Beijing 100191, China
2. The State Key Laboratory of Internet of Things for Smart City and the Department of Computer and Science, University of Macau, Macao 999078, China
andyguo@buaa.edu.cn
Abstract
This paper focuses on an important type of black-box attack, i.e., the transfer-based adversarial attack, where the adversary generates adversarial examples with a substitute (source) model and utilizes them to attack an unseen target model without knowing any of its internal information. Existing methods tend to yield unsatisfactory adversarial transferability when the source and target models come from different types of DNN architectures (e.g., ResNet-18 and Swin Transformer). In this paper, we observe that this phenomenon is induced by the output inconsistency problem. To alleviate this problem while effectively utilizing the existing DNN models, we propose a common knowledge learning (CKL) framework that learns better network weights, under fixed network architectures, for generating adversarial examples with better transferability. Specifically, to reduce the model-specific features and obtain better output distributions, we construct a multi-teacher framework, where knowledge is distilled from different teacher architectures into one student network. Considering that the gradient of the input is usually utilized to generate adversarial examples, we further impose constraints on the gradients between the student and teacher models, which alleviates the output inconsistency problem and enhances the adversarial transferability. Extensive experiments demonstrate that our proposed work significantly improves the adversarial transferability.
1 Introduction
Deep neural networks (DNNs), such as convolutional neural networks (CNNs) [1–3] and transformers [4,5], have demonstrated significant advancements across various machine learning tasks [6–10]. However, recent studies reveal that DNNs are vulnerable to adversarial attacks [11–13], which add special yet imperceptible perturbations to benign data to deceive the deep models into making wrong predictions. This vulnerability poses a considerable threat to the safety of the DNN-based systems [14–16], especially those applied in security-sensitive domains, such as autonomous driving and face-scan payment.
Since different DNN architectures usually function differently, their corresponding vulnerabilities also differ. Therefore, existing adversarial attack techniques are usually specifically designed for particular DNN architectures, to discover their respective safety threats. Due to privacy or copyright protection considerations, black-box attacks tend to be more applicable in real scenarios. In this paper, we focus on an extensively studied black-box setting, i.e., the transfer-based adversarial attack, which assumes that the adversary has no access to the target model. Instead, the attacker trains substitute models to generate adversarial examples, which are then used to attack the target model.
For common CNN models, to enhance the transferability of adversarial examples, various techniques have been proposed, which can be briefly classified into two categories according to their mechanisms, i.e., gradient modifications [12,14,17] and input transformations [18–21]. The former type of methods usually improves the gradient ascent process of adversarial attacks to prevent the adversarial examples from over-fitting to the source model. The latter type of methods usually manipulates input images with various transformations, which enables the generated adversarial perturbations to adapt to different input transformations. Consequently, these adversarial examples possess a higher probability of successfully transferring to the unknown target model.
The recent success of vision transformers (ViTs) has also prompted several studies on devising successful attacks on ViT-type architectures. The works in [22–24] construct substitute-model-based attack methods for ViTs according to their unique architectural components, such as the attention module and the classification token, to generate transferable adversarial examples.
Currently, existing methods usually directly employ pre-trained classification models as the source (substitute) model (as well as the target model in the experiments) to achieve transfer-based adversarial attacks. One of the core factors affecting the transferability of these adversarial attacks is the similarity between the source (substitute) model and the target model. If the source model and the target model are identical, the attack becomes a white-box attack. Then, the transferability is expected to be high, and the attack success rate is equivalent to that in the corresponding white-box setting. Intuitively, two models with similar architectures and similar weights tend to exhibit high transferability [25]. On the contrary, models with significantly different architectures and weights usually exhibit low transferability. For example, when we generate adversarial examples on ResNet-18 and test them on ViT-S, the attack success rate is 45.99%, which is lower than the transferability from ResNet-18 to Inception-v3 (62.87%).
Since different network architectures and weights induce different outputs, we believe that the low adversarial transferability is caused by the output inconsistency problem. As can be observed in Fig.1, even when each model gives the correct classification result, the output probabilities are still inconsistent. Besides, the inconsistency between two different CNN models is usually smaller than that between two models from different architectural categories, e.g., a CNN model and a transformer-based model. Apparently, this inconsistency is harmful to adversarial transferability, because adversarial attacks are usually designed to manipulate the target model's output probabilities, and this inconsistency increases the uncertainty of the target model's outputs. To describe this output inconsistency, the KL divergence is employed to represent it numerically. As shown in Fig.2, higher output inconsistency tends to induce lower transferability, and vice versa.
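For concreteness, the output inconsistency of the kind reported in Fig.2 can be measured as the KL divergence between the softmax outputs of two pre-trained classifiers on the same images. The sketch below is only illustrative: the direction of the KL divergence and the averaging over a mini-batch are our assumptions, not necessarily the exact protocol used in the paper.

```python
import torch
import torch.nn.functional as F

def output_inconsistency(model_a, model_b, images):
    """Average KL(p_A || p_B) between the softmax outputs of two models.

    A minimal sketch of the measurement behind Fig.2; the KL direction and
    batch averaging are assumptions.
    """
    model_a.eval()
    model_b.eval()
    with torch.no_grad():
        p_a = F.softmax(model_a(images), dim=1)          # output distribution of model A
        log_p_b = F.log_softmax(model_b(images), dim=1)  # log output distribution of model B
        # F.kl_div(input=log_q, target=p) computes KL(p || q)
        kl = F.kl_div(log_p_b, p_a, reduction="batchmean")
    return kl.item()
```

For example, comparing a ResNet-18 with a ViT-S on a batch of validation images in this way would yield a larger value than comparing two CNNs, matching the trend described above.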
To alleviate the above problem, a straightforward solution is to construct a universal network architecture that possesses relatively similar output distributions to different types of DNN models. Unfortunately, both this universal network architecture and its training strategy are difficult to design and implement. Besides, such a solution is unlikely to effectively utilize the existing pre-defined DNN architectures, which are much more convenient to apply in real scenarios.
To alleviate this output inconsistency problem and effectively utilize the existing pre-defined DNN models, in this paper, we propose a common knowledge learning (CKL) method that enables the substitute (source) model to learn better network weights, under fixed network architectures, for generating adversarial examples with better transferability. Specifically, to reduce the model-specific features and obtain better output distributions, we adopt a multi-teacher approach, where knowledge is distilled from different teacher architectures into one student network. Considering that the gradient of the input is usually utilized to generate adversarial examples, we impose constraints on the gradients between the student and teacher models. Since multiple teacher models may generate conflicting gradients, which will interfere with the optimization process, motivated by PCGrad [26], we introduce a model gradient constraint term into our work to diminish the gradient conflicts of the teacher models.
Our contributions are summarized as follows.
● We analyze the relationship between adversarial transferability and the properties of the substitute (source) model, and observe that a substitute model with less output inconsistency with respect to the target model tends to possess better adversarial transferability.
● To reduce the model-specific features and obtain better output distributions, we propose a common knowledge learning framework to distill multi-teacher knowledge into one single student network.
● For generating adversarial examples with better transferability, we propose to learn the input gradients of the teacher models and utilize gradient projection to reduce the conflicts in the gradients of multiple teachers.
● Extensive experiments on CIFAR-10, CIFAR-100, and TinyImageNet demonstrate that our method is effective and can be easily integrated into transfer-based adversarial attack methods to significantly improve their attack performances.
2 Related work
2.1 Adversarial attacks
The adversarial attack was first proposed in [27]. Subsequently, a large number of adversarial attack methods have been proposed, which are usually classified into two categories according to the adversary's knowledge [28], i.e., white-box and black-box attacks. Black-box attacks can be further classified into query-based and transfer-based attacks. White-box attacks usually assume that the adversary can access all the necessary information about the target model, including its architecture, parameters, and training strategy. Query-based attacks usually assume that the adversary can obtain the outputs by querying the target model [29–33]. Transfer-based attacks generate adversarial examples without access to the target model, which is the assumption closest to practice. Under such circumstances, the adversary usually exploits a substitute model to generate adversarial examples and utilizes these examples to deceive the target model [18,19,22,34–36]. Since our work focuses on the transfer-based scenario, we introduce this attack scenario in detail in the next subsection.
2.2 Transfer-based attacks
Since different DNN architectures usually function differently, existing transfer-based attack techniques are usually specifically designed for different DNN architectures.
For CNN architectures, the attack approaches in transfer-based scenarios can mainly be classified into two categories, i.e., gradient modifications and input transformations. Among gradient-modification-based methods, [14] first proposes MI-FGSM, which stabilizes the update directions with a momentum term to improve the transferability of adversarial examples. [12] adopts Nesterov accelerated gradients in the iterative attacks. [17] proposes an Adam iterative fast gradient tanh method (AI-FGTM) to generate adversarial examples with high transferability. Besides, [37] adopts the AdaBelief optimizer in the update of the gradients and constructs ABI-FGM to further boost the attack success rates of adversarial examples. Recently, [38] introduces variance tuning to further enhance the adversarial transferability of iterative gradient-based attack methods.
On the contrary, input-transformation-based methods usually apply various transformations to the input image in each iteration to prevent the attack method from overfitting to the substitute model. [19] presents a translation-invariant attack method, named TIM, which optimizes a perturbation over an ensemble of translated images. Inspired by data augmentation, [37] optimizes adversarial examples by adding an image cropping operation to each iteration of the perturbation generation process. Recently, Admix [20] calculates the gradient on the input image admixed with a small portion of each add-in image, while using the original label, to craft adversarial examples with better transferability. [21] improves adversarial transferability via an adversarial transformation network, which studies efficient image transformations to boost transferability. [39] proposes AutoMA to search for a strong model augmentation policy based on reinforcement learning.
Since the above approaches are designed for CNNs, their performance degrades when the generated adversarial examples are directly fed to other types of DNN architectures, such as vision transformers [4] and MLP-Mixer [40]. Since transformer-based architectures have also been widely applied to image classification, several works have presented transfer-based adversarial attack methods in which transformer-based architectures are employed as the source (substitute) model. [22] proposes a self-attention gradient attack (SAGA) to enhance the adversarial transferability. [23] introduces two novel strategies, i.e., self-ensemble and token refinement, to improve the adversarial transferability of vision transformers. Motivated by the observation that the gradients of attention in each head impair the generation of highly transferable adversarial examples, [24] presents a pay-no-attention (PNA) attack, which ignores the backpropagated gradient from the attention branch.
2.3 Knowledge distillation
A common technique for transferring knowledge from one model to another is knowledge distillation. The mainstream knowledge distillation algorithms can be classified into three categories, i.e., response-based, feature-based, and relation-based methods [41]. The feature-based methods [42,43] exploit the outputs of intermediate layers in the teacher model to supervise the training of the student model. The relation-based methods [44] explore the relationships between different layers or data samples. These two types of methods require synchronized layers in both the teacher and student models. However, when the architectures of the teacher and student models are inconsistent, selecting proper synchronized layers becomes difficult. On the contrary, [45], which is a response-based method, constrains the logit layers of the teacher and student models, and can be easily implemented for different tasks without the above-mentioned synchronization problem. Therefore, [45] is adopted in our proposed work.
3 Methodology
3.1 Notations
Here, we define the notations that will be utilized in the rest of this paper. Let $x \in \mathcal{X}$ denote a clean image and $y$ its corresponding label, where $\mathcal{X}$ is the image space. Let $f(x) \in \mathbb{R}^{K}$ be the output logit of a DNN model $f$, where $K$ is the number of classes. $T_1, T_2, \ldots, T_M$ denote the teacher networks and $M$ is the total number of teacher models. $S$ stands for the student network. $J(x, f, y)$ is utilized as the loss function, e.g., the cross-entropy loss, with respect to the input image $x$, model $f$, and label $y$. For simplicity, we abbreviate these notations in the rest of this paper when the input does not need to be emphasized. The goal of a generated adversarial example $x^{adv}$ is to deceive the target DNN model, such that the prediction result of the target model is not $y$, i.e., $\arg\max_k f_k(x^{adv}) \neq y$. Meanwhile, the adversarial example is usually desired to be similar to the original input, which is usually achieved by constraining the adversarial perturbation with the $\ell_\infty$ norm, i.e., $\|x^{adv} - x\|_\infty \leq \epsilon$, where $\epsilon$ is a predefined small constant.
3.2 Overview
According to Fig.1 and Fig.2, the output inconsistency problem significantly affects the transferability of adversarial examples, i.e., high output inconsistency usually indicates low transferability, and vice versa. Since the output inconsistency within each type of DNN architecture is usually smaller than that between cross-architecture models, the adversarial examples generated with a substitute model from one type of DNN architecture (e.g., CNNs) usually give relatively poor attack performance on target models from other types of DNN architectures (e.g., ViT, MLPMixer).
To alleviate the above problems, in this paper, we propose a common knowledge learning (CKL) framework, which distills the common knowledge of multiple teacher models with different architectures into a single student model, to obtain a better substitute model. The overall framework is shown in Fig.3. Firstly, we select teacher models from different types of DNN architectures. The student model learns from their outputs to reduce model-specific features and obtain common (model-agnostic) features, which alleviates the output inconsistency problem. Since the input gradient is always utilized in the typical adversarial attack process, we also design a constraint on the input gradients between the teacher models and the student model, to further promote the transferability of the generated adversarial examples. After training, the student model is utilized as the source (substitute) model to generate adversarial examples in the testing stage.
3.3 Common knowledge distillation
As can be observed from Fig.1, when the same input is fed into different models, the output probabilities are quite different, which reveals that deep models exhibit feature preferences. Apparently, the output inconsistency problem is induced by these model-specific features, because features that are considered distinctive by one model may not be distinctive enough to others. Under such circumstances, when an adversarial example is misclassified by the source model into a class other than $y$, it contains certain manipulated features that are distinctive to the source model. However, if these manipulated features are not distinctive enough to the target model, the adversarial example may not be able to deceive the target model.
According to the above analysis, it is vital to identify and emphasize the common (model-agnostic) features among different DNN models in the substitute model, such that when these model-agnostic features are manipulated to generate the adversarial examples, the target model has a higher possibility of being deceived, i.e., the adversarial transferability is improved. Therefore, we construct a multi-teacher knowledge distillation method to force the student model to learn and emphasize the common features of various DNN models. Since different DNN models usually possess different architectures, we constrain the model outputs between the student and teacher models by adopting [45]. Specifically, we employ the KL divergence to measure the output discrepancy between the student model $S$ and each teacher model $T_i$. We denote the output probability of the student model as $p^{S}(x)$ and that of teacher model $T_i$ as $p^{T_i}(x)$. Then, the KL divergence is formulated as
$$D_{\mathrm{KL}}\big(p^{T_i}(x) \,\|\, p^{S}(x)\big) = \sum_{k=1}^{K} p_k^{T_i}(x)\, \log \frac{p_k^{T_i}(x)}{p_k^{S}(x)}, \qquad (1)$$
where $K$ represents the number of classes.
To jointly utilize all the teacher models, the knowledge distillation (KD) loss is defined as
$$\mathcal{L}_{\mathrm{KD}} = \frac{1}{M} \sum_{i=1}^{M} D_{\mathrm{KL}}\big(p^{T_i}(x) \,\|\, p^{S}(x)\big). \qquad (2)$$
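A minimal sketch of this multi-teacher distillation loss is given below. Equal weighting of the teachers follows the averaged form of Eq. (2) reconstructed above, which is an assumption; each term is the KL divergence of Eq. (1) between a teacher's output distribution and the student's.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits_list):
    """Multi-teacher knowledge distillation loss (sketch of Eq. (2))."""
    log_p_s = F.log_softmax(student_logits, dim=1)
    loss = torch.zeros((), device=student_logits.device)
    for teacher_logits in teacher_logits_list:
        p_t = F.softmax(teacher_logits, dim=1)
        # KL(p_teacher || p_student), averaged over the mini-batch
        loss = loss + F.kl_div(log_p_s, p_t, reduction="batchmean")
    return loss / len(teacher_logits_list)
```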
3.4 Gradient distillation
Since input gradients are commonly utilized in mainstream adversarial attack methods, such as FGSM [11], MI-FGSM [14], DI-FGSM [18], TI-FGSM [19], and VNI-FGSM [38], if two networks $f_1$ and $f_2$ satisfy $\nabla_x J(x, f_1, y) = \nabla_x J(x, f_2, y)$ for every input $x$, the adversarial examples generated by these methods will be identical no matter which of the two networks is employed as the source model. Additionally, if this gradient equality holds for all $x$, the losses of $f_1$ and $f_2$ differ by at most a constant, i.e., there exists a constant $c$ such that $J(x, f_1, y) = J(x, f_2, y) + c$, because the gradient of the loss difference vanishes everywhere. Since the outputs of two models are more likely to be inconsistent when their losses are different, if the input gradients of two models are similar, the two models tend to possess less output inconsistency. Although this assumption cannot be exactly satisfied in real scenarios, it is still useful for generating transferable adversarial examples. Therefore, we make the student model learn the input gradients from the teacher networks, to further improve the adversarial transferability.
As our framework utilizes multiple teacher networks to teach one student model, it is vital to design a suitable approach to learn from multiple input gradients. Under such circumstances, a simple solution is to formulate a multi-objective optimization problem that minimizes the distances between the input gradient of the student model and that of each teacher model. This optimization problem can be simplified by minimizing the distance between the input gradient of the student model and the averaged input gradient of the teacher models (which can be regarded as a representative value among all the input gradients of the teacher models), as
$$\mathcal{L}'_{\mathrm{grad}} = \big\| g^{S}(x) - \bar{g}(x) \big\|_2^2, \qquad (3)$$
where $g^{S}(x) = \nabla_x J(x, S, y)$ and $\bar{g}(x) = \frac{1}{M}\sum_{i=1}^{M} g^{T_i}(x)$ with $g^{T_i}(x) = \nabla_x J(x, T_i, y)$. For convenience, we let $g^{S}$ denote $g^{S}(x)$ and $g^{T_i}$ denote $g^{T_i}(x)$ in the rest of this section, when the input $x$ does not need to be emphasized.
Unfortunately, there usually exist conflicting gradients among them, which is similar to the gradient conflict problem [26,46] in multi-task learning. Here, the gradient conflict problem means that there exists a teacher gradient $g^{T_i}$ satisfying $\langle g^{T_i}, \bar{g} \rangle < 0$. If $g^{S}$ gradually moves closer to $\bar{g}$, it actually moves further away from $g^{T_i}$, which is the input gradient of the $i$th teacher model.
To address this issue, inspired by the PCGrad [26] method, we adjust our optimization objective by projecting one of the two conflicting gradients onto the normal plane of the other gradient. Specifically, when there exist two conflicting gradients $g^{T_i}$ and $g^{T_j}$, i.e., $\langle g^{T_i}, g^{T_j} \rangle < 0$, $g^{T_i}$ will be replaced by
$$\hat{g}^{T_i} = g^{T_i} - \frac{\langle g^{T_i}, g^{T_j} \rangle}{\|g^{T_j}\|_2^2}\, g^{T_j}, \qquad (4)$$
to decrease the conflicts among them. This replacement step is performed for all the gradients. After the replacement step, we calculate the average of the replaced gradients, i.e., $\hat{g} = \frac{1}{M}\sum_{i=1}^{M} \hat{g}^{T_i}$, for the student model to learn. Then, the gradient loss function becomes
$$\mathcal{L}_{\mathrm{grad}} = \big\| g^{S} - \hat{g} \big\|_2^2. \qquad (5)$$
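The sketch below illustrates this PCGrad-style de-conflicting step and the resulting gradient loss. The pairwise iteration order, the batch-flattened treatment of the gradients, and the squared-L2 distance are our assumptions; the projection formula itself follows PCGrad [26].

```python
import torch

def project_conflicting_gradients(teacher_grads):
    """De-conflict teacher input gradients and return their average (Eq. (4) sketch).

    teacher_grads: list of 1-D (batch-flattened) gradient tensors, one per teacher.
    """
    projected = [g.clone() for g in teacher_grads]
    for i in range(len(projected)):
        g_i = projected[i]
        for j, g_j in enumerate(teacher_grads):
            if i == j:
                continue
            dot = torch.dot(g_i, g_j)
            if dot < 0:  # conflicting pair: project g_i onto the normal plane of g_j
                g_i = g_i - dot / (g_j.norm() ** 2 + 1e-12) * g_j
        projected[i] = g_i
    # averaged de-conflicted gradient, to be learned by the student
    return torch.stack(projected).mean(dim=0)

def gradient_loss(student_grad, teacher_grads):
    """L_grad of Eq. (5): distance between the student's input gradient and the
    averaged de-conflicted teacher gradients (squared L2 is an assumption)."""
    target = project_conflicting_gradients([g.detach() for g in teacher_grads])
    return (student_grad - target).pow(2).mean()
```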
By combining Eqs. (2) and (5), the final loss of our common knowledge learning (CKL) for training the student network can be obtained as
$$\mathcal{L}_{\mathrm{CKL}} = \mathcal{L}_{\mathrm{KD}} + \lambda\, \mathcal{L}_{\mathrm{grad}}, \qquad (6)$$
where $\lambda$ is the hyperparameter employed to balance the two loss terms, which is further discussed in Section 4.6. The detailed process of our CKL method for one mini-batch is summarized in Algorithm 1.
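The following sketch outlines one mini-batch of CKL training in the spirit of Algorithm 1, reusing the `kd_loss` and `gradient_loss` helpers sketched above. The value of `lam` is only a placeholder, and the use of per-teacher cross-entropy for the input gradients and second-order backpropagation through the student's input gradient are our assumptions.

```python
import torch
import torch.nn.functional as F

def ckl_training_step(student, teachers, images, labels, optimizer, lam=100.0):
    """One CKL mini-batch (sketch of Algorithm 1); `lam` is a placeholder."""
    images = images.clone().detach().requires_grad_(True)

    # Input gradients and outputs of the frozen teachers
    teacher_grads, teacher_logits = [], []
    for t in teachers:
        t.eval()
        logits_t = t(images)
        teacher_logits.append(logits_t.detach())
        loss_t = F.cross_entropy(logits_t, labels)
        g_t = torch.autograd.grad(loss_t, images)[0]
        teacher_grads.append(g_t.detach().flatten())  # batch-flattened for simplicity

    # Student forward pass and input gradient (create_graph so L_grad trains the weights)
    logits_s = student(images)
    loss_ce_s = F.cross_entropy(logits_s, labels)
    g_s = torch.autograd.grad(loss_ce_s, images, create_graph=True)[0].flatten()

    loss_kd = kd_loss(logits_s, teacher_logits)     # Eq. (2)
    loss_grad = gradient_loss(g_s, teacher_grads)   # Eq. (5)
    loss = loss_kd + lam * loss_grad                # Eq. (6)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```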
3.5 Generating adversarial examples with CKL
After the training process, we utilize the trained student model $S$ as the source (substitute) model to generate adversarial examples. Our framework can be easily combined with existing transfer-based adversarial attack methods. For demonstration, we take MI-FGSM [14] as an example to explain the adversarial example generation process. Let $J_{CE}$ denote the commonly utilized cross-entropy loss. We set $x_0^{adv} = x$ and $g_0 = 0$. Then, the adversarial example generation process can be formulated as
$$g_{t+1} = \mu \cdot g_t + \frac{\nabla_x J_{CE}(x_t^{adv}, S, y)}{\big\| \nabla_x J_{CE}(x_t^{adv}, S, y) \big\|_1}, \qquad x_{t+1}^{adv} = \mathrm{Clip}_x^{\epsilon}\big( x_t^{adv} + \alpha \cdot \mathrm{sign}(g_{t+1}) \big), \qquad (7)$$
where $\epsilon$ is a predefined small constant that constrains the maximum magnitude of the generated adversarial perturbation, and $\mathrm{Clip}_x^{\epsilon}(\cdot)$ forces the modified value to stay inside the $\ell_\infty$ ball ($\|x^{adv} - x\|_\infty \leq \epsilon$). $\alpha$ is the step size and $\mu$ is the momentum decay factor. This process terminates when $t$ reaches the maximum number $N$ of iterations, and $x_N^{adv}$ is the finally generated adversarial example.
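A minimal MI-FGSM sketch on the CKL-trained student is given below. The momentum factor `mu = 1.0` and the step size `eps / steps` are common defaults for MI-FGSM, not values restated by the paper, and images are assumed to lie in [0, 1].

```python
import torch
import torch.nn.functional as F

def mi_fgsm(student, images, labels, eps=8/255, steps=30, mu=1.0):
    """MI-FGSM [14] using the CKL student as the substitute model (Eq. (7) sketch)."""
    x_adv = images.clone().detach()
    alpha = eps / steps            # step size (assumption)
    g = torch.zeros_like(images)   # momentum accumulator
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(student(x_adv), labels)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # accumulate momentum with the L1-normalized gradient
        g = mu * g + grad / (grad.abs().flatten(1).sum(dim=1).view(-1, 1, 1, 1) + 1e-12)
        x_adv = x_adv.detach() + alpha * g.sign()
        # project back into the L_inf ball around the clean images and the valid pixel range
        x_adv = torch.min(torch.max(x_adv, images - eps), images + eps).clamp(0, 1)
    return x_adv.detach()
```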
4 Experiments
In this section, we firstly introduce the experimental settings in Section 4.1. Next, we present non-targeted attack experiments in Section 4.2, and targeted attack experiments in Section 4.3. Then, we give the experimental results compared to ensemble attack and intermediate-based attack, in Section 4.4 and Section 4.5. At last, we conduct ablation studies to evaluate the effectiveness of our proposed modules in Section 4.6.
4.1 Experimental settings
Datasets We consider three widely used classification datasets, i.e., CIFAR-10, CIFAR-100 [47], and TinyImageNet, in our experiments. The CIFAR-10 and CIFAR-100 datasets consist of images of size 32 × 32, with 10 and 100 classes, respectively. For these two CIFAR datasets, we utilize the official training and testing sets, which contain 50000 and 10000 images respectively, to train the student models and to generate adversarial examples. TinyImageNet, which contains 200 classes, possesses 100000 images in its official training set and 10000 images in its official validation set, with an image size of 64 × 64. For TinyImageNet, we employ its training set to train the student models and utilize its validation set to generate adversarial examples. Since widely used DNNs usually give a low accuracy on TinyImageNet and it is not quite reasonable to attack samples that are already misclassified, we select 1000 images that are correctly classified by all the target models to produce adversarial examples.
Networks Nine networks with different types of DNN architectures are employed as either the source model or the target model, including ResNets [3], VGG-16 [2], DenseNet-121 [48], Inception-v3 [49], MobileNet-v2 [50], ViT-S [4], Swin Transformer [5], MLPMixer [40], and ConvMixer [51]. To learn the common knowledge of neural networks, we select teacher models from different types of DNN architectures. Thus, ResNet-50, Inception-v3, Swin-T, and MLPMixer are employed as the teacher models in the subsequent experiments.
Baselines Five baselines are employed in our work, including four non-targeted attacks, i.e., MI-FGSM [14], DI-FGSM [18], VNI-FGSM [38], and ILA-DA [52], and one targeted attack, i.e., the Logit attack [53]. For the non-targeted attacks, the number of attack iterations is set to 30. For the targeted attack, following the experimental settings in [53], the maximum number of iterations is set to 300 and the step size is set to 2/255.
Implementation details We employ the training set to train the student model $S$ and the testing set to generate adversarial examples. In the training process, the momentum SGD optimizer with weight decay is employed, where the initial learning rate is annealed down to zero following a cosine schedule. In the attack stage, we constrain the adversarial example $x^{adv}$ and the original input $x$ by the $\ell_\infty$ ball with $\epsilon = 8/255$, i.e., $\|x^{adv} - x\|_\infty \leq 8/255$. For DI-FGSM [18], each input benign image is randomly resized and then padded back to the size 32 × 32 (64 × 64 for TinyImageNet) in a random manner.
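For reference, the diversity-input transformation used by DI-FGSM can be sketched as follows. The resize lower bound `low`, the transform probability `prob`, and the interpolation mode are illustrative placeholders, since the exact range is not restated here.

```python
import torch
import torch.nn.functional as F

def di_input_transform(images, out_size=32, low=28, prob=0.5):
    """Random resize-and-pad of DI-FGSM [18] (sketch; `low` and `prob` are placeholders)."""
    if torch.rand(1).item() > prob:
        return images
    rnd = int(torch.randint(low, out_size, (1,)).item())
    resized = F.interpolate(images, size=(rnd, rnd), mode="nearest")
    pad = out_size - rnd
    left = int(torch.randint(0, pad + 1, (1,)).item())
    top = int(torch.randint(0, pad + 1, (1,)).item())
    # pad order for 4-D tensors: (left, right, top, bottom)
    return F.pad(resized, (left, pad - left, top, pad - top), value=0)
```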
Evaluation metric The attack success rate (ASR) is employed to evaluate the attack performance. It is defined as the proportion of the generated adversarial examples that fool the target model, i.e.,
$$\mathrm{ASR} = \frac{\big|\{\, x^{adv} \in \mathcal{A} : f(x^{adv}) \neq y \,\}\big|}{|\mathcal{A}|},$$
where $|\cdot|$ denotes the number of elements in a set, $\mathcal{A}$ represents the set of generated adversarial examples, and $f(\cdot)$ denotes the output class of the target model.
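This metric is straightforward to compute; a minimal sketch is shown below. For the targeted variant (tASR, Section 4.3), the inequality is replaced by equality against the target labels.

```python
import torch

@torch.no_grad()
def attack_success_rate(target_model, adv_images, labels):
    """Non-targeted ASR: fraction of adversarial examples misclassified by the target model."""
    target_model.eval()
    preds = target_model(adv_images).argmax(dim=1)
    return (preds != labels).float().mean().item()
```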
4.2 Non-targeted attack evaluation
4.2.1 CIFAR10 results
The attack success rates (ASRs) of the non-targeted attack on CIFAR-10 are reported in Tab.1. We compare our method with MI-FGSM, DI-FGSM, and VNI-FGSM, which are abbreviated as MI, DI, and VNI, respectively. The original methods generate adversarial examples by directly employing the pre-trained models, while our adversarial examples are generated by the student model, which is trained with our CKL framework. As can be observed, except for the white-box attacks, i.e., where the source model and target model are the same, our CKL gives significant improvements over the corresponding baseline methods, which proves the effectiveness of our proposed work for adversarial transferability. Since the transferability to unseen models is more consistent with the transfer attack scenario, we calculate the averaged ASR over the unseen models (which do not overlap with the teacher models) in the last column.
Transferability to the unseen models Note that ResNet-50, Inception-v3, Swin Transformer, and MLPMixer are employed as teacher models and utilized to train the student models. As can be observed, when these four models are selected as the target models, the results with our CKL framework are significantly better than those of their corresponding baselines. To better verify the effectiveness of our CKL, we also employ unseen models, e.g., VGG-16, DenseNet-121, ConvMixer, and ViT-S, as target models for evaluation, and our CKL also achieves significant improvements. We further compute the averaged ASR over the unseen models in the last column, and observe that our method gives consistent improvements for all baselines and substitute models. Our CKL method improves the averaged value by at least 10 percentage points, which indicates that our CKL framework learns genuinely effective common knowledge from the teacher models and leverages it on the unseen models.
Transferability to the cross-architecture models The cross-architecture transferability is usually a challenging problem for the baseline attack methods, as can be observed from the results. For example, when the source model is selected as ResNet-18, the correspondingly generated adversarial examples’ transferability to ViT-S is relatively low, i.e., only 37.16% for MI-FGSM. On the contrary, by integrating our CKL framework, the attack transferability can be largely improved (62.14% for MI-FGSM-CKL), which benefits from the common knowledge learned at the training stage.
4.2.2 CIFAR100 results
For a better assessment of our proposed work, we further validate our CKL method on CIFAR-100 and report the results in Tab.2. The experimental setups are identical to those in Section 4.1. MI-FGSM, DI-FGSM, and VNI-FGSM are employed as the baseline methods.
As can be observed, our method gives a consistent improvement on the CIFAR-100 dataset, whatever the attack method and the source model are. Besides, we compute the averaged ASRs of the unseen models in the last column of Tab.2. From the results, we can observe that our CKL method achieves a decent improvement over the corresponding baseline methods with different substitute models. In addition, for cross-architecture transferability, our method usually gives an improvement of about 10 percentage points. For example, when the source model is ResNet-18 and the target model is ConvMixer, 'MI+CKL' outperforms 'MI' by up to 10.86% and 'DI+CKL' outperforms 'DI' by up to 11.93%. These results further verify the effectiveness of our CKL method.
4.2.3 TinyImageNet results
We also conduct non-targeted adversarial attack experiments on TinyImageNet, to validate the effectiveness of our CKL method. Considering that some models perform poorly on TinyImageNet, we choose VGG-16, ResNets, DenseNet-121, ViT-S, Swin-T, and ConvMixer in this experiment. ResNet-50 and ViT-S are employed as teacher models, and the others are employed as source or target models. According to the strategy presented in Section 4.1, we randomly select 1,000 testing samples that are all correctly classified by the networks in the non-targeted attack experiments. The results are shown in Tab.3.
As can be observed, the attack success rates of the baseline methods are relatively low. After applying our CKL method, which enables the substitute model to learn common knowledge from different architectures, the generated adversarial examples usually possess a much higher ASR. We point out that even when attacking unseen models, our CKL method still gives a significant improvement. We also compute the averaged attack success rates over the unseen models in the last column. As can be observed, our method improves the averaged ASR of the unseen models by at least 20 percentage points, compared to the baselines.
4.3 Targeted attack evaluation
The transferable targeted attack is even more worthy of study, because the attacker can directly steer the unknown model to output a desired prediction. Specifically, this scenario requires the unknown target model to classify the adversarial examples into a pre-specified class $y^{t}$ ($y^{t} \neq y$). Therefore, the targeted attack is indeed a more challenging problem.
To evaluate the performance of our method in the targeted attack scenario, we employ the Logit attack [53] as our baseline method. Following [53], we set the maximum number of iterations to 300, the step size to 2/255, and $\epsilon$ to 8/255. The target label $y^{t}$ is randomly generated for each data pair $(x, y)$. Note that for the targeted attack, the attack success rate is computed as
$$\mathrm{tASR} = \frac{\big|\{\, x^{adv} \in \mathcal{A} : f(x^{adv}) = y^{t} \,\}\big|}{|\mathcal{A}|},$$
where $f(\cdot)$ denotes the output class of the target model and $\mathcal{A}$ represents the set of adversarial examples. We then give the experimental results on the CIFAR-10, CIFAR-100, and TinyImageNet datasets.
4.3.1 CIFAR10 results
We utilize the entire CIFAR-10 testing set to generate adversarial examples. The results are shown in Tab.4. As can be observed, the tASR scores are usually significantly lower than the corresponding ASR scores shown in Tab.1 under the same settings. Besides, we observe that the cross-architecture transfer attack usually leads to lower tASR values than attacking a target model of the same type of DNN architecture. For example, when the Logit attack is employed as the attack method and ResNet-18 is utilized as its source model, ViT-S only obtains a 9.91% tASR, because the distinctive features of ResNet-18 and ViT-S tend to be different. On the contrary, when our CKL framework is integrated, the corresponding results obtain large gains, e.g., an improvement of up to 23.79% in the above example. The last column gives the averaged tASR over the unseen models. Although these test models are not touched at all during the training stage, our method still improves the averaged tASR by at least 20 percentage points. This phenomenon also indicates that our CKL framework enables the student (substitute) model to learn common knowledge from multiple teacher models, which significantly improves the adversarial transferability.
4.3.2 CIFAR100 results
The CIFAR-100 results are shown in Tab.5. The entire CIFAR-100 testing set is utilized to generate adversarial examples. The first column lists the source models and the second row presents the target models. 'Avg.' represents the averaged tASR over the unseen test models. The targeted attack on CIFAR-100 is a harder problem, as the dataset contains more categories. From the results, it is observed that the tASR scores on CIFAR-100 are usually lower than the corresponding scores on CIFAR-10. Nevertheless, with different source models, our method still outperforms the baseline method by a large margin. Our method even outperforms the baseline by more than 10% when the source/target model is ResNet-18/VGG-16, which validates the effectiveness of our CKL method.
4.3.3 TinyImageNet results
The TinyImageNet results are shown in Tab.6. Here, we employ the entire validation set, i.e., 10,000 images, to generate adversarial examples to attack the target models. Compared to the attack results on the CIFAR datasets, all the methods possess relatively low tASR scores on TinyImageNet, which implies that pushing a sample to a specific class on TinyImageNet is more difficult. Under such a difficult circumstance, our method can still significantly improve the attack performance of the baseline methods.
4.4 Comparison to ensemble attack
A commonly used technique for combining multiple models to generate adversarial examples is the ensemble attack. Our CKL method instead distills the knowledge of multiple teacher models into one single student model. In this section, we compare our CKL method with the ensemble strategy. MI-FGSM is deployed as the attack method. ResNet-50, Inception-v3, Swin-T, and MLPMixer are employed as the teacher models, which are also the models used for the ensemble. We conduct the experiment on the CIFAR-10 testing set. The ensemble is constructed by utilizing the average of the four networks' outputs to generate adversarial examples. The results are shown in Tab.7. We observe that if the test model is one of the teacher models, the ensemble of models performs better. However, when tested on unseen models, our CKL method outperforms the ensemble strategy, which validates the effectiveness of the common knowledge learned by our proposed method.
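For reference, the ensemble baseline can be wrapped so that it is attacked with the same MI-FGSM routine sketched earlier; averaging the raw logits (rather than the softmax probabilities) is an assumption here.

```python
import torch.nn as nn

class LogitEnsemble(nn.Module):
    """Ensemble baseline of Section 4.4: average the outputs of several models
    and attack the averaged output directly (logit averaging is an assumption)."""
    def __init__(self, models):
        super().__init__()
        self.models = nn.ModuleList(models)

    def forward(self, x):
        return sum(m(x) for m in self.models) / len(self.models)
```

For example, `mi_fgsm(LogitEnsemble([resnet50, inception_v3, swin_t, mlpmixer]), images, labels)` would reproduce this baseline, whereas our method only needs the single trained student in the same call.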
In addition, the ensemble strategy has two obvious defects. Firstly, when non-CNN models are present in the ensemble, it cannot employ any intermediate-level-based attack, because such attacks need a pre-specified intermediate layer to obtain the feature map, which non-CNN models do not possess in the corresponding form. Secondly, directly utilizing the ensemble of models to generate adversarial examples incurs high computational complexity, especially when the teacher models are much larger than the student model, which is a common circumstance in knowledge distillation. We compare the time cost of generating adversarial examples in the last column; this experiment is conducted on a single RTX 3080Ti GPU. It is obvious that our method is faster than the ensemble-model attack. When using ResNet-18 as the student model, our method (41.10 s) is more than 28 times faster than the ensemble strategy (1174.65 s).
4.5 Comparison to intermediate feature based attack
To further demonstrate the broad suitability of our CKL method, we compare it with an intermediate-feature-based attack method. The previously proposed intermediate level attack (ILA) [34] finetunes an adversarial example by designing the loss function on the output feature map of a pre-specified intermediate layer of the source model. Recently, ILA-DA [52] employs three novel augmentation techniques to enhance ILA. In this section, we use ILA-DA as the baseline method for comparison. As ILA-DA requires a pre-specified intermediate layer to obtain the output feature map, it cannot directly employ transformer-based models as its source model. Therefore, we utilize VGG-16, ResNet-18, and ResNet-50 as the source models to generate adversarial examples. The test models include both CNN and non-CNN models. The ASR results are shown in Tab.8.
The first column gives the source models and the first row presents the test models. 'Average' represents the averaged ASR over all test models. From the results, we can see that our CKL method consistently improves the ILA-DA performance, whatever the test model is. Across different source models, the averaged ASR improves by more than 18%, which proves the effectiveness of our CKL method when combined with intermediate-level-based attacks.
4.6 Ablation study
Input gradient distillation scheme Here, the effectiveness of our input gradient distillation scheme is validated. For better comparisons, we firstly introduce several variants of the objective function in learning the input gradients.
(i) The student model is trained without gradient learning, i.e., it only employs Eq. (2) as the objective function, which is denoted as 'w/o teacher gradients'.
(ii) The gradient objective is replaced by directly learning the averaged value of the multiple teacher models' gradients, i.e., Eq. (3) with the plain average $\bar{g} = \frac{1}{M}\sum_{i=1}^{M} g^{T_i}$, which is denoted as 'average of teacher gradients'.
(iii) The objective function is adjusted by gradient projection, which is introduced in Section 3.4.
Here, ResNet-18 is employed as the source model. The teacher models are identical to those in Section 4.1. As can be observed in Tab.9, learning the input gradients from the teacher models is effective. Besides, our method is more effective than directly learning the average value of the teacher gradients.
Effects of the hyper-parameter $\lambda$. Here, we study the effects of the hyperparameter $\lambda$ in Eq. (6), which balances the two loss terms, i.e., $\mathcal{L}_{\mathrm{KD}}$ and $\mathcal{L}_{\mathrm{grad}}$. To assess its impact, we set $\lambda$ to 1, 5, 10, 50, 100, 500, 1000, and 2000. ResNet-18 is employed as the source model, and the target models include VGG-16, ResNet-18, ResNet-50, DenseNet-121, Inception-v3, ConvMixer, MLPMixer, Swin-T, and ViT-S. The results are shown in Fig.4. As can be observed, when $\lambda$ is small, the objective function in Eq. (6) is dominated by the first term, i.e., the knowledge distillation objective, and the result remains essentially unchanged. As $\lambda$ increases, the second term, i.e., the input gradient distillation objective, gradually starts to take effect, and thus the performance gradually increases. However, when $\lambda$ is relatively large, the attack success rate declines. Thus, to achieve a good balance between the knowledge distillation and input gradient distillation objectives on all the models, we select a moderate $\lambda$ in our experiments.
Selection of the teacher models. To learn a common knowledge from different types of DNN architectures, ResNet-50, Inception-v3, Swin-T, and MLPMixer are employed as the teacher models in our experiments. Here, we study the effects of different selections of the teacher models, i.e.,
(a). 4 CNN models, including ResNet-18, ResNet-50, Inception-v3, and VGG-16;
(b). 2 CNN models and 2 non-CNN models, including ResNet-50, Inception-v3, Swin-T, and MLPMixer;
(c). 4 non-CNN models, including Swin-T, MLPMixer, ConvMixer, and ViT-S.
For convenience, ResNet-18 is employed as the student (source) model. The results are shown in Tab.10. As can be observed, when the teacher models are all selected from non-CNN models, the adversarial transferability to non-CNN models is relatively high while that to CNN models is relatively low, because the student model learns more bias from the non-CNN models. Besides, we observe an interesting phenomenon: when all the teacher models are selected from CNNs, the attack success rates on the target CNN models are actually not the best. The best results are obtained by selecting 2 CNN and 2 non-CNN models as the teacher models, which implies that learning from both CNN and non-CNN models is more effective, even when attacking CNN models.
4.7 Visualization results
We provide several visualization cases to further illustrate the advantage of our method, as shown in Fig.5. MI-FGSM is employed as the baseline attack method. ResNet-18 is utilized as the substitute model, and ResNet-50 is the test model. '+CKL' denotes that the substitute model is trained by our CKL method. We employ interpretable visualization with GradCAM [54] to obtain visual explanations for both clean and adversarial images. As can be observed, while both our method and the baseline method can mislead the test model, our method usually suppresses the attention region on the true object and moves it to other regions. This suggests that our method possesses better interpretability and can successfully divert the attention of the network to the wrong region.
5 Discussion and analyses
5.1 Discussion on common knowledge
This paper draws on the idea of common knowledge to enhance the transferability of adversarial examples. The concept of common knowledge is also used in other tasks, such as novel object captioning [55]. The idea is somewhat similar to Multiple Knowledge Representation (MKR) [56], which integrates multiple knowledge representations from different aspects to make AI techniques more explainable and generalizable. However, our goal is to obtain a better substitute model, which can generate more transferable adversarial examples. Therefore, we design the common knowledge learning framework to distill knowledge from the teacher models into the student model, which is then used as the substitute model in adversarial attacks. Besides, since the input gradient is always utilized in the typical adversarial attack process, we also design a constraint on the input gradients between the teacher models and the student model, to further promote the transferability of the generated adversarial examples.
5.2 Complexity analyses
We propose to learn common knowledge from multiple teacher networks of different architectures. In practice, compared to the standard training of one DNN model, our method requires approximately $M$ times the complexity in the training process, where $M$ is the number of teacher models. Fortunately, after the student model is trained, our method can generate adversarial examples with high efficiency as well as a significant improvement over the baselines, as demonstrated in the experiments. Since our student model can generate adversarial examples with much better transferability, we can release the trained student model to users, such that they do not need to re-train it. Meanwhile, we can utilize $M$ as a hyperparameter to control the trade-off between the generalization ability of the adversarial examples and the training complexity, i.e., we can reduce the number of teacher models to reduce the training complexity.
6 Conclusion
In this paper, we analyze the output inconsistency problem among different deep neural networks and show that the output inconsistency is a significant cause of the poor transferability of adversarial examples. To alleviate this problem while effectively utilizing the existing DNN models, we propose a common knowledge learning (CKL) framework, which distills the knowledge of multiple teacher models with different architectures into a single student model, to obtain a better substitute model for adversarial attacks. Specifically, to emphasize the model-agnostic features, the student model is required to learn the outputs of multiple teacher models. To further reduce the output inconsistencies among models and enhance the adversarial transferability, we propose an input gradient distillation to learn the input gradients of the teacher models. Extensive experiments on CIFAR-10, CIFAR-100, and TinyImageNet demonstrate the superiority of our proposed work.
References
[1]
Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. 2012, 1097–1105
[2]
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of the 3rd International Conference on Learning Representations. 2015, 1–14
[3]
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770−778
[4]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An image is worth 16x16 words: transformers for image recognition at scale. In: Proceedings of the 9th International Conference on Learning Representations. 2021, 1–21
[5]
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. 2021, 9992–10002
[6]
Sun Y P, Zhang M L. Compositional metric learning for multi-label classification. Frontiers of Computer Science, 2021, 15(5): 155320
[7]
Ma F, Wu Y, Yu X, Yang Y. Learning with noisy labels via self-reweighting from class centroids. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(11): 6275–6285
[8]
Yang Y, Guo J, Li G, Li L, Li W, Yang J. Alignment efficient image-sentence retrieval considering transferable cross-modal representation learning. Frontiers of Computer Science, 2024, 18(1): 181335
[9]
Hu T, Long C, Xiao C. CRD-CGAN: category-consistent and relativistic constraints for diverse text-to-image generation. Frontiers of Computer Science, 2024, 18(1): 181304
[10]
Liang X, Qian Y, Guo Q, Zheng K. A data representation method using distance correlation. Frontiers of Computer Science, 2025, 19(1): 191303
[11]
Goodfellow I J, Shlens J, Szegedy C. Explaining and harnessing adversarial examples. In: Proceedings of the 3rd International Conference on Learning Representations. 2015, 1−11
[12]
Lin J, Song C, He K, Wang L, Hopcroft J E. Nesterov accelerated gradient and scale invariance for adversarial attacks. In: Proceedings of the 8th International Conference on Learning Representations. 2020, 1−12
[13]
Miao H, Ma F, Quan R, Zhan K, Yang Y. Autonomous LLM-enhanced adversarial attack for text-to-motion. 2024, arXiv preprint arXiv: 2408.00352
[14]
Dong Y, Liao F, Pang T, Su H, Zhu J, Hu X, Li J. Boosting adversarial attacks with momentum. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 9185−9193
[15]
Yang Y, Huang P, Cao J, Li J, Lin Y, Ma F. A prompt-based approach to adversarial example generation and robustness enhancement. Frontiers of Computer Science, 2024, 18(4): 184318
[16]
Lu S, Li R, Liu W. FedDAA: a robust federated learning framework to protect privacy and defend against adversarial attack. Frontiers of Computer Science, 2024, 18(2): 182307
[17]
Zou J, Duan Y, Li B, Zhang W, Pan Y, Pan Z. Making adversarial examples more transferable and indistinguishable. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2022, 3662−3670
[18]
Xie C, Zhang Z, Zhou Y, Bai S, Wang J, Ren Z, Yuille A L. Improving transferability of adversarial examples with input diversity. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 2725−2734
[19]
Dong Y, Pang T, Su H, Zhu J. Evading defenses to transferable adversarial examples by translation-invariant attacks. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 4307−4316
[20]
Wang X, He X, Wang J, He K. Admix: enhancing the transferability of adversarial attacks. In: Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. 2021, 16138−16147
[21]
Wu W, Su Y, Lyu M R, King I. Improving the transferability of adversarial samples with adversarial transformations. In: Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 9020−9029
[22]
Mahmood K, Mahmood R, van Dijk M. On the robustness of vision transformers to adversarial examples. In: Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. 2021, 7818−7827
[23]
Naseer M, Ranasinghe K, Khan S, Khan F S, Porikli F. On improving adversarial transferability of vision transformers. In: Proceedings of the 10th International Conference on Learning Representations. 2022, 1−24
[24]
Wei Z, Chen J, Goldblum M, Wu Z, Goldstein T, Jiang Y G. Towards transferable adversarial attacks on vision transformers. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence. 2022, 2668−2676
[25]
Waseda F, Nishikawa S, Le T N, Nguyen H H, Echizen I. Closer look at the transferability of adversarial examples: how they fool different models differently. In: Proceedings of 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. 2023, 1360−1368
[26]
Yu T, Kumar S, Gupta A, Levine S, Hausman K, Finn C. Gradient surgery for multi-task learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 489
[27]
Bruna J, Szegedy C, Sutskever I, Goodfellow I, Zaremba W, Fergus R, Erhan D. Intriguing properties of neural networks. In: Proceedings of the 2nd International Conference on Learning Representations. 2014, 1−10
[28]
Dong Y, Fu Q A, Yang X, Pang T, Su H, Xiao Z, Zhu J. Benchmarking adversarial robustness on image classification. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 318−328
[29]
Li Y, Li L, Wang L, Zhang T, Gong B. NATTACK: learning the distributions of adversarial examples for an improved black-box attack on deep neural networks. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 3866−3876
[30]
Cheng M, Le T, Chen P Y, Zhang H, Yi J, Hsieh C J. Query-efficient hard-label black-box attack: an optimization-based approach. In: Proceedings of the 7th International Conference on Learning Representations. 2019, 1−14
[31]
Shi Y, Han Y, Tian Q. Polishing decision-based adversarial noise with a customized sampling. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 1027−1035
[32]
Zhou L, Cui P, Zhang X, Jiang Y, Yang S. Adversarial Eigen attack on BlackBox models. In: Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 15233−15241
[33]
Lord N A, Mueller R, Bertinetto L. Attacking deep networks with surrogate-based adversarial black-box methods is easy. In: Proceedings of the 10th International Conference on Learning Representations. 2022, 1−17
[34]
Huang Q, Katsman I, Gu Z, He H, Belongie S J, Lim S N. Enhancing adversarial example transferability with an intermediate level attack. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 4732−4741
[35]
Yang D, Li W, Ni R, Zhao Y. Enhancing adversarial examples transferability via ensemble feature manifolds. In: Proceedings of the 1st International Workshop on Adversarial Learning for Multimedia. 2021, 49−54
[36]
Gumus F, Amasyali M F. Exploiting natural language services: a polarity based black-box attack. Frontiers of Computer Science, 2022, 16(5): 165325
[37]
Yang B, Zhang H, Li Z, Zhang Y, Xu K, Wang J. Adversarial example generation with adabelief optimizer and crop invariance. Applied Intelligence, 2023, 53(2): 2332–2347
[38]
Wang X, He K. Enhancing the transferability of adversarial attacks through variance tuning. In: Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 1924−1933
[39]
Yuan H, Chu Q, Zhu F, Zhao R, Liu B, Yu N. AutoMA: towards automatic model augmentation for transferable adversarial attacks. IEEE Transactions on Multimedia, 2023, 25: 203–213
[40]
Tolstikhin I O, Houlsby N, Kolesnikov A, Beyer L, Zhai X, Unterthiner T, Yung J, Steiner A, Keysers D, Uszkoreit J, Lucic M, Dosovitskiy A. MLP-mixer: an all-MLP architecture for vision. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. 2021, 24261−24272
[41]
Gou J, Yu B, Maybank S J, Tao D. Knowledge distillation: a survey. International Journal of Computer Vision, 2021, 129(6): 1789–1819
[42]
Passban P, Wu Y, Rezagholizadeh M, Liu Q. ALP-KD: attention-based layer projection for knowledge distillation. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. 2021, 13657−13665
[43]
Chen D, Mei J P, Zhang Y, Wang C, Wang Z, Feng Y, Chen C. Cross-layer distillation with semantic calibration. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. 2021, 7028−7036
[44]
Lee S, Song B C. Graph-based knowledge distillation by multi-head attention network. In: Proceedings of the 30th British Machine Vision Conference. 2019, 141
[45]
Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. 2015, arXiv preprint arXiv: 1503.02531
[46]
Liu B, Liu X, Jin X, Stone P, Liu Q. Conflict-averse gradient descent for multi-task learning. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. 2021, 1443
[47]
Krizhevsky A. Learning multiple layers of features from tiny images. University of Toronto, Dissertation, 2009
[48]
Huang G, Liu Z, Van Der Maaten L, Weinberger K Q. Densely connected convolutional networks. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 2261−2269
[49]
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 2818−2826
[50]
Sandler M, Howard A G, Zhu M, Zhmoginov A, Chen L C. MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 4510−4520
[51]
Ng D, Chen Y, Tian B, Fu Q, Chng E S. Convmixer: feature interactive convolution with curriculum learning for small footprint and noisy far-field keyword spotting. In: Proceedings of the ICASSP 2022−2022 IEEE International Conference on Acoustics, Speech and Signal Processing. 2022, 3603−3607
[52]
Yan C W, Cheung T H, Yeung D Y. ILA-DA: improving transferability of intermediate level attack with data augmentation. In: Proceedings of the 11th International Conference on Learning Representations. 2023, 1−25
[53]
Zhao Z, Liu Z, Larson M A. On success and simplicity: a second look at transferable targeted attacks. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. 2021, 468
[54]
Selvaraju R R, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 2020, 128(2): 336–359
[55]
Wu Y, Jiang L, Yang Y. Switchable novel object captioner. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 1162–1173
[56]
Yang Y, Zhuang Y, Pan Y. Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering, 2021, 22(12): 1551–1558