1 Introduction
In recent years, foundation Vision-Language Models (VLMs), such as CLIP [1], which enable zero-shot transfer to a wide variety of domains without fine-tuning, have led to a significant shift in machine learning systems. Despite these impressive capabilities, it is concerning that VLMs are prone to inheriting biases from the uncurated datasets scraped from the Internet [2–5]. We examine these biases from three perspectives. (1) Label bias: certain classes (words) appear more frequently than others in the pre-training data. (2) Spurious correlation: non-target features, e.g., image background, are correlated with labels, resulting in poor group robustness. (3) Social bias: a special form of spurious correlation that concerns societal harm; unaudited image-text pairs may contain human prejudices, e.g., about gender, ethnicity, and age, that are correlated with targets. These biases are subsequently propagated to downstream tasks, leading to biased predictions.
In this survey, we provide an overview of the three biases prevalent in visual classification with VLMs, along with strategies to mitigate them. We hope this survey serves as a useful resource for the debiasing and VLM communities.
2 Biases in VLMs
Label bias Real-world large-scale datasets often exhibit long-tailed distributions, wherein most labels are associated with only a few samples. Formally, we consider a dataset in which each input $x$ is associated with a class label $y \in \mathcal{Y}$ (the class name in natural language can be regarded as a surrogate of the class label). Label bias exists when the label distribution $P(y)$ is highly skewed. Naive learning on such data develops an undesirable bias towards dominant labels. Recently, label bias in vision-language models has gained significant attention [3, 4, 6]. These works found that the predictions of VLMs are skewed toward frequent semantics. As illustrated in Fig. 1(a), a zero-shot CLIP model yields disproportionately skewed predictions on the ImageNet-1K dataset. An example of this imbalance is the over-prediction of more than 3,500 instances as belonging to class 0, three times the actual number of samples in that class.
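To make the notion of label skew concrete, the following sketch (ours, not from [3, 4, 6]) compares per-class zero-shot prediction counts with the true class counts; `image_features`, `text_features`, and `labels` are assumed to be pre-computed, L2-normalized CLIP embeddings and ground-truth labels.

```python
import numpy as np

def prediction_skew(image_features, text_features, labels, num_classes):
    """Compare per-class zero-shot prediction counts with true class counts.

    image_features: (N, d) L2-normalized image embeddings.
    text_features:  (C, d) L2-normalized class-prompt embeddings.
    labels:         (N,)   ground-truth class indices.
    """
    logits = image_features @ text_features.T            # cosine similarities
    preds = logits.argmax(axis=1)                        # zero-shot predictions
    pred_counts = np.bincount(preds, minlength=num_classes)
    true_counts = np.bincount(labels, minlength=num_classes)
    # Ratios far above 1 indicate classes that the zero-shot model
    # over-predicts, e.g., the class-0 over-prediction described above.
    over_prediction_ratio = pred_counts / np.maximum(true_counts, 1)
    return pred_counts, true_counts, over_prediction_ratio
```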
Spurious correlation Consider a dataset in which each input $x$ is associated with a class label $y$ and a spurious attribute $a$. Spurious correlation occurs when non-target attributes $a$, such as the image background, are correlated with the labels $y$. The spurious correlation might come from many biases: (1) geography bias, where data are collected from specific locations, (2) sub-population bias, where training data only cover a part of the sub-categories of a class, and (3) style bias, where data have a certain style (e.g., sketches). For example, a model might misclassify cows ($y$) on a beach, because it was trained to recognize cows in typical settings like grassy fields and relies on background cues ($a$) rather than the actual animal. These spurious correlations result in poor group robustness. Existing studies [8, 9] illustrate the presence of spurious correlation in the predictions of VLMs. Fig. 1(b.2) reports the worst-group (WG) and average accuracies of zero-shot CLIP on the Waterbirds dataset. The groups in Waterbirds are presented in Fig. 1(b.1), where groups are defined by bird type (waterbird or landbird) and background (water or land). Zhang et al. [9] find that zero-shot VLMs can exhibit a 55.6-point gap between worst-group and average accuracy, indicating the existence of spurious correlations.
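The worst-group metric used above can be computed in a few lines; the sketch below is an illustrative implementation (not from [8, 9]) that assumes integer arrays of predictions, class labels, and spurious attributes such as the Waterbirds background.

```python
import numpy as np

def group_accuracies(preds, labels, attributes):
    """Worst-group and average accuracy, with groups defined by (label, attribute).

    preds, labels, attributes: (N,) integer arrays; on Waterbirds the label is
    the bird type (waterbird/landbird) and the attribute is the background
    (water/land), giving the four groups of Fig. 1(b.1).
    """
    correct = (preds == labels)
    group_acc = {}
    for y in np.unique(labels):
        for a in np.unique(attributes):
            mask = (labels == y) & (attributes == a)
            if mask.any():
                group_acc[(int(y), int(a))] = float(correct[mask].mean())
    worst_group = min(group_acc.values())
    average = float(correct.mean())
    return worst_group, average, group_acc
```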
Social bias Social bias is a special form of spurious correlation, where the associations learned by a model reflect societal stereotypes or prejudices, e.g., about gender, ethnicity, and age. Web-scraped datasets are too large to be manually audited to filter out image-text pairs containing stereotypes, limiting the usefulness of VLMs in real-world high-stakes scenarios. Many works have found social biases in VLMs [2, 5, 10]. For example, as illustrated in Fig. 1(c), a greater similarity between the text "doctor" and male images than female ones in VLMs could undermine trust in models using these biased representations.
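A simple way to probe this kind of association is to compare the mean similarity of an occupation prompt with images of different gender groups; the sketch below is illustrative only (it is not taken from [2, 5, 10]) and assumes L2-normalized CLIP embeddings.

```python
import numpy as np

def association_gap(prompt_feature, group_a_features, group_b_features):
    """Mean cosine-similarity gap between one text prompt (e.g., "a photo of
    a doctor") and two groups of images (e.g., male vs. female portraits).

    All embeddings are assumed L2-normalized; a large positive value means the
    prompt is more strongly associated with group A than with group B.
    """
    sim_a = group_a_features @ prompt_feature             # (N_a,) similarities
    sim_b = group_b_features @ prompt_feature             # (N_b,) similarities
    return float(sim_a.mean() - sim_b.mean())
```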
3 Debiasing methods
In Tab. 1, we summarize the properties of existing debiasing methods, categorizing them by the type of bias they address, whether training data is required, and whether retraining the VLM is necessary.
Debiasing label bias The core of debiasing label bias is to estimate the pre-training label prior. Allingham et al. [4] propose ZPE, which assumes access to the pre-training data $\mathcal{D}_{\text{pre}}$. The label bias is removed by subtracting the expected logits computed over the pre-training data, i.e., $\tilde{s}_y(x) = s_y(x) - \mathbb{E}_{x' \sim \mathcal{D}_{\text{pre}}}[s_y(x')]$, where $s_y(x)$ denotes the zero-shot logit of class $y$. However, the pre-training data are often inaccessible because of copyright or privacy concerns. Instead, GLA [3] proposes to use only the downstream data $\mathcal{D}_{\text{ds}}$ to estimate the label bias. Specifically, GLA adjusts the margin $m_y$ of the VLM classifier to achieve the lowest error on the downstream data, i.e., $m^{*} = \arg\min_{m} \operatorname{Err}\big(\mathcal{D}_{\text{ds}};\, s_y(x) - m_y\big)$, and the label bias is mitigated by subtracting this margin. Recently, REAL [11] replaces the original class names with synonyms found in the pre-training text to mitigate the imbalanced performance.
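Both ZPE and GLA ultimately subtract a per-class offset from the zero-shot logits. The sketch below shows only this shared logit-adjustment step, using a simple mean-logit estimate over whatever reference data is available (pre-training data for a ZPE-style estimate, downstream data for a GLA-style one); the actual methods are more involved, e.g., GLA optimizes the margin rather than averaging, so treat this as an assumption-laden illustration with our own variable names.

```python
import numpy as np

def estimate_class_offset(reference_logits):
    """Per-class offset estimated as the expected zero-shot logit over a
    reference set (a crude stand-in for the ZPE/GLA estimates).

    reference_logits: (N, C) zero-shot logits on the reference data.
    """
    return reference_logits.mean(axis=0)                  # shape (C,)

def debiased_predict(test_logits, class_offset):
    """Subtract the estimated label-bias offset before taking the argmax."""
    return (test_logits - class_offset).argmax(axis=1)
```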
Improving group robustness Zhu et al. [8] found that, when adapting VLMs to downstream tasks, the pre-training knowledge and the downstream knowledge carry different biases. They propose the ProReg loss to enable unbiased learning from both sources of biased knowledge. Concretely, during fine-tuning, ProReg calculates a Kullback-Leibler divergence and a cross-entropy loss for each training sample, and then merges them using an adaptive sample-specific weight. Zhang et al. [9] found that poor group robustness comes with poor similarity between sample embeddings from different groups but the same class. They propose to train an adapter that enforces similarity between "hard" samples, which the zero-shot VLM predicts incorrectly, and "positive" samples of the same class that the model predicts correctly, while pushing the "hard" samples away from their nearest samples in different classes. In [10], the authors propose to debias foundation VLMs by projecting out the biased directions in the text embedding space without additional data. Concretely, they first describe a series of irrelevant features through natural language, i.e., "a photo of a [irrelevant attribute]", and then construct a projection matrix that makes the classifier weights orthogonal to these irrelevant features. This approach has a closed-form solution, is training-free, and also addresses social bias by designating protected attributes as the irrelevant features.
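The projection step of [10] can be sketched as follows; this is a simplified, uncalibrated version in which the text embeddings of the irrelevant-attribute prompts span the subspace being projected out, and the function and variable names are ours.

```python
import numpy as np

def project_out_spurious(classifier_weights, spurious_embeddings):
    """Make class (text) embeddings orthogonal to spurious-attribute directions.

    classifier_weights:  (C, d) text embeddings used as zero-shot classifier weights.
    spurious_embeddings: (K, d) embeddings of prompts such as
                         "a photo of a [irrelevant attribute]".
    """
    A = spurious_embeddings.T                              # (d, K)
    # Orthogonal projector onto the complement of span(A):
    # P = I - A (A^T A)^+ A^T  (pseudo-inverse handles redundant prompts).
    P = np.eye(A.shape[0]) - A @ np.linalg.pinv(A.T @ A) @ A.T
    debiased = classifier_weights @ P                      # P is symmetric
    # Re-normalize so cosine-similarity classification still applies.
    return debiased / np.linalg.norm(debiased, axis=1, keepdims=True)
```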
Debiasing social bias Wang et al. [5] target mitigating gender bias in image search. They propose FairSampling, which samples image-text pairs of each gender equiprobably during contrastive learning. To avoid re-training the VLMs, [5] also proposes to remove the feature embedding dimensions that are most strongly associated with the gender label, where the association is measured by mutual information. Berg et al. [12] employ an adversarial classifier during prompt tuning for bias mitigation. Specifically, the adversarial classifier is trained to predict protected attributes, e.g., gender, ethnicity, and age, while the proposed method maximizes the adversarial loss to obtain a representation that is blind to protected attributes. Similarly, DeAR [2] learns a residual visual representation that, when added to the original, prevents the prediction of protected attributes while remaining close to the original. Unlike [12], which jointly trains the adversarial classifier, DeAR trains the classifier separately.
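As an illustration of the post-hoc, re-training-free variant of [5], the sketch below drops the embedding dimensions with the highest mutual information with the protected label; it assumes scikit-learn's mutual-information estimator and is not the authors' implementation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def clip_biased_dimensions(features, protected_labels, num_drop):
    """Remove the feature dimensions most associated with a protected attribute.

    features:         (N, d) image or text embeddings.
    protected_labels: (N,)   protected-attribute labels (e.g., gender).
    num_drop:         number of highest-MI dimensions to remove.
    """
    mi = mutual_info_classif(features, protected_labels)   # MI per dimension
    keep = np.argsort(mi)[: features.shape[1] - num_drop]  # keep low-MI dims
    return features[:, keep], keep
```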
4 Conclusion
While VLMs showcase remarkable capabilities, it is crucial to recognize the potential biases these models might carry into downstream applications. In this survey article, we have reviewed recent work on debiasing VLMs, focusing on label bias, spurious correlation, and social bias. Currently, most VLM debiasing methods target discriminative tasks, such as image classification, while generative tasks like image captioning and image generation receive little attention in terms of debiasing. This could become a significant research direction in the future.