Revisiting multi-dimensional classification from a dimension-wise perspective

Yi SHI; Hanjia YE; Dongliang MAN; Xiaoxu HAN; Dechuan ZHAN; Yuan JIANG

doi:10.1007/s11704-023-3272-9

Front. Comput. Sci. ›› 2025, Vol. 19 ›› Issue (1) : 191304. DOI: 10.1007/s11704-023-3272-9

Artificial Intelligence

RESEARCH ARTICLE

Yi SHI¹, Hanjia YE¹, Dongliang MAN²,³, Xiaoxu HAN²,³, Dechuan ZHAN¹, Yuan JIANG¹

Abstract

Real-world objects exhibit intricate semantic properties that can be characterized from a multitude of perspectives, which necessitates the development of a model capable of discerning multiple patterns within data, while concurrently predicting several Labeling Dimensions (LDs) — a task known as Multi-dimensional Classification (MDC). While the class imbalance issue has been extensively investigated within the multi-class paradigm, its study in the MDC context has been limited due to the imbalance shift phenomenon. A sample’s classification as a minor or major class instance becomes ambiguous when it belongs to a minor class in one LD and a major class in another. Previous MDC methodologies predominantly emphasized instance-wise criteria, neglecting prediction capabilities from a dimension aspect, i.e., the average classification performance across LDs. We assert the significance of dimension-wise metrics in real-world MDC applications and introduce two such metrics. Furthermore, we observe imbalanced class distributions within each LD and propose a novel Imbalance-Aware fusion Model (IMAM) for addressing the MDC problem. Specifically, we first decompose the task into multiple multi-class classification problems, creating imbalance-aware deep models for each LD separately. This straightforward method performs well across LDs without sacrificing performance in instance-wise criteria. Subsequently, we employ LD-wise models as multiple teachers and transfer their knowledge across all LDs to a unified student model. Experimental results on several real-world datasets demonstrate that our IMAM approach excels in both instance-wise evaluations and the proposed dimension-wise metrics.

Keywords

multi-dimensional classification / dimension perspective / class imbalance learning

Yi SHI, Hanjia YE, Dongliang MAN, Xiaoxu HAN, Dechuan ZHAN, Yuan JIANG. Revisiting multi-dimensional classification from a dimension-wise perspective. Front. Comput. Sci., 2025, 19(1): 191304 https://doi.org/10.1007/s11704-023-3272-9

1 Introduction

Recent years have witnessed the success of deep learning methods in various domains, such as image recognition [1–4] and retrieval [5–7]. In most classification problems, we typically extract multi-variate features to represent an object and assume a single labeling aspect associated with each instance. However, real-world objects often possess rich semantics, and it may be necessary to label an object from different perspectives or Labeling Dimensions (LDs). For example, in fashion recommendation [8–11] (as shown in Fig.1), a system may recommend shoes based on a user’s preference, such as “color” (like red, yellow, black), “type” (like boot, slipper, sandal), and “style” (like simple, fashionable, retro, trendy). Similarly, in the medical diagnosis of blood disorders [12, 13], patients may have a variety of blood disorders, such as ‘hemolysis’, ‘lipemia’, and ‘jaundice’. Each disorder can be categorized into normal, mild, moderate, or severe, depending on the severity of the condition. It is essential to predict all types of disorders simultaneously to avoid affecting diagnosis. The same problem also exists in music classification [14], web mining [15], text classification [16], genetics [17], brain-computer interface [18], etc.

Fig.1 In a fashion recommendation system, images encapsulate rich semantic information. For example, the left image can be labeled across multiple dimensions such as “type”, “color”, and “style”. Each of these dimensions encompasses several potential classes, of which only one is accurate. For the left image, the appropriate labels across the three dimensions are respectively “boot”, “black”, and “trendy”

Although learning with multiple LDs is equivalent to jointly learning a set of multi-class classification tasks over a single instance, it is not appropriate to follow the same criteria for the multi-class problem with single LD in evaluation. Evaluating models with multiple LDs requires considering both the instance-wise and dimension-wise perspectives to fully capture the performance of the model. For example, in fashion recommendation, it is necessary to evaluate whether a model can predict all LDs correctly for a particular user and whether a certain type of outfit can be successfully portrayed. Similarly, in blood diagnosis, it is essential to evaluate not only the number of patients diagnosed correctly but also the extent to which a certain type of disorder can be detected accurately among all patients. Hence, evaluation metrics should be constructed to holistically encode the gap between a model’s prediction and the ground-truth, taking into account both the instance-wise and dimension-wise perspectives. In summary, the goal is to develop a model that can predict multiple LDs with good performance on both these aspects of criteria.

In the last decade, learning with multiple LDs has been studied as Multi-Dimensional Classification (MDC) [19–21], where each dimension corresponds to a certain labeling aspect. In multi-class classification, we aim to learn a map predicting the label $y_{i}$ given an instance $x_{i}$, while in MDC there is a size-$L$ set of labels $y_{i} = (y_{i1}, \dots, y_{ij}, \dots, y_{iL})$ associated with one $x_{i}$. Each $y_{ij}$ is a class label for the $j$th LD. Although MDC has become an intuitive and mature way to solve the task, current MDC methods [19, 20, 22–24] focus on instance-wise metrics, e.g., Instance Accuracy and Hamming Accuracy, while ignoring dimension-wise ones. Therefore, we propose two dimension-wise metrics, meanMacroF1 and meanMacroAUC, to address this limitation.

In conventional multi-class classification paradigms, the challenge of class imbalance exerts a substantial influence on the discriminative efficacy of the algorithm, notably compromising the classification of underrepresented or minority classes. This phenomenon also perturbs the model’s performance in multi-dimensional classification contexts. Moreover, the class imbalance problem in MDC is more complicated than in the multi-class paradigm, since an instance is labeled from multiple LDs, and a minor class in one LD could become a major class when annotated from another aspect. For example, a patient who is ‘severe’ (a minor class) in dimension ‘Hemolysis’ can be ‘normal’ (a major class) in dimension ‘Lipemia’ and ‘Jaundice’. We name this unique phenomenon in MDC — an instance’s affiliation with a major or a minor class changes across LDs — as Imbalance Shift. Therefore, in the MDC paradigm, the class imbalance problem is usually canceled out between different dimensions within a sample, and it is not so essential if we only evaluate instance-wise performance. However, when we evaluate dimension-wise performance of MDC methods, it is necessary to consider imbalance-sensitive criteria along LDs since all classes are equally important in most scenarios.

In this paper, we investigate MDC based on deep neural networks, and train multiple imbalance-aware models, one for each LD, to address the diverse imbalanced class distributions along different LDs. This simple method achieves high dimension-wise performance without losing prediction ability on the instance-wise side. To model the relationship between different LDs, we then distill both the prediction and comparison ability from these multiple models into a single IMbalance-Aware fusion Model (IMAM). Specifically, our IMAM works in a ‘decompose-and-fuse’ manner. IMAM first decomposes the MDC task into a set of multi-class classification problems by training deep models separately for each LD. Since the class distribution per LD is not balanced, we adapt the logits of predictions according to the LD-specific imbalance ratio. These imbalance-aware models avoid being biased towards major classes with more examples and achieve high dimension-wise metrics even for minor classes. At the same time, we surprisingly find that these separate models do not lose instance-wise performance. However, keeping one model per LD requires multiple forward passes, one per dimension, at inference time. Thus, we propose to construct a single deep MDC model, which reduces the computational cost and jointly balances prediction ability across multiple LDs. We claim that both the comparison ability of the embeddings and the prediction ability of the classifier are essential for MDC, and propose a novel knowledge reuse method to fuse multiple imbalance-aware models into a compact one. Experimental results show that our final IMAM model maintains good performance on both types of metrics in various real-world tasks, such as shoe recommendation and medical diagnosis.

We summarize our contributions as follows:

● We emphasize the importance of not only instance-wise metrics but also dimension-wise metrics for evaluating MDC models, and introduce two dimension-wise metrics for MDC.

● We explore the distinct phenomenon of imbalance shift in MDC paradigms and assess its impact on dimension-wise evaluation metrics.

● From the LD aspect, we employ the imbalanced LD-specific class distribution, and propose a simple yet effective method IMAM.

● A comprehensive analysis of various MDC methodologies is provided, taking into account both instance-wise and dimension-wise metrics. IMAM can get state-of-the-art performance on both types of metrics over various real-world datasets.

2 Related work

Multi-dimensional classification (MDC) addresses the setting in which an object may possess multiple dimensions of labels w.r.t. different semantics. MDC can be considered as a combination of multiple multi-class learning tasks [25], one per dimension, once the relationship between LDs is ignored. MDC also degenerates to multi-label learning [26–28] if we restrict each LD to a binary classification task. Therefore, some multi-label classification solutions are intuitive for MDC problems, like Binary Relevance (BR) [23] and Classifier Chains (CC) [22]. BR trains multiple multi-class classifiers independently for each dimension. CC constructs a chain of multi-class classifiers for each dimension sequentially, where the subsequent classifiers are built by augmenting the feature space with the predictions of preceding classifiers. Recent MDC solutions mainly augment features by incorporating feature and label information, such as KRAM [24], LEFA [20], and SLEM [19]. KRAM enriches the original feature space with label similarity based on KNN techniques, while LEFA uses a cross-correlation-aware network to learn latent label vectors. SLEM encodes the class space by encoding operations like pairwise grouping and one-hot conversion. In [29], a feature selection strategy is used to enable better feature augmentation. The current feature-augmentation-based MDC methods synthesize features manually and cannot learn features in an end-to-end manner. Besides, SEEM [30] uses a stacked dependency exploitation strategy to model the dependencies among class spaces. Moreover, existing MDC evaluation emphasizes whether a model predicts classes for all dimensions correctly from the instance aspect, while neglecting the dimension-wise prediction ability. To the best of our knowledge, we propose novel MDC evaluation criteria from the labeling aspect and the first deep MDC method IMAM.

Class imbalance learning. Imbalanced class distributions exist in many real-world datasets, where a few ‘major’ classes occupy most training instances while the remaining ‘minor’ classes contain relatively few instances [31–34]. Since all classes are equally important, vanilla multi-class classification techniques bias the model towards major classes and perform poorly on minor classes [35, 36]. To relieve the dominance of the major classes, various strategies have been proposed, such as sampling methods [37–39], loss functions [40], and logit adjustment [41, 42]. We analyze the class imbalance phenomenon for LDs, and we find that directly incorporating class imbalance methods into an MDC architecture does not yield good overall performance. Thus, our IMAM works in a ‘decompose-and-fuse’ manner to take advantage of multiple imbalance-aware models.

Knowledge reuse utilizes the learning experience from related pre-trained models and transfers the knowledge to the current training process [43]. There are various strategies for knowledge reuse, e.g., parameter matching [44–49] or prediction matching [50–54]. Usually, the fixed well-trained model is called the ‘teacher’, and the model for the target task is called the ‘student’. By aligning the two models on the target task’s data, the knowledge from the teacher is transferred to the student [55–59]. In the fusion stage of IMAM, we treat multiple separate LD-specific models as teachers, and then transfer their knowledge to a single student MDC model. This knowledge reuse process keeps the MDC ability, incorporates the knowledge across multiple dimensions, and reduces the model size. We also verify that both the comparison ability of the embeddings and the prediction ability of the classifiers need to be transferred to obtain a better student MDC model.

3 Notations and background

We describe the MDC task formally in this section, including its classical evaluation metrics from the instance aspects. Then we propose our new criteria from the dimension aspect, which take account of the imbalanced class distribution per LD.

3.1 Multi-dimensional classification (MDC)

An instance $x_{i} \in \mathbb{R}^{D}$ in a data mining task can be described by $D$-dimensional features. Different from vanilla multi-class classification, in an MDC problem we associate $x_{i}$ with $L$-dimensional labels. In other words, we annotate $x_{i}$ from $L$ aspects. The set of labels for $x_{i}$ is $y_{i} = \{y_{i1}, \dots, y_{ij}, \dots, y_{iL}\}$. For the $j$th Labeling Dimension (LD), we have $K_{j}$ classes in total, and the class label for $x_{i}$ given this LD is $y_{ij} \in \{1, \dots, k, \dots, K_{j}\}$. In the fashion recommendation example in Fig.1, there are three dimensions ($L = 3$), namely, ‘type’, ‘color’, and ‘style’, and the second dimension ‘color’ has four classes ($K_{2} = 4$).

Define the label space for the $j$th dimension as $Y_{j}$ and the instance (feature) space as $X$. Instead of learning a map from $X$ to a single $Y_{j}$, MDC aims to learn a map $F$ from $X$ to the Cartesian product of $L$ semantic spaces, i.e., $\mathrm{Cart}(Y_{1}, Y_{2}, \dots, Y_{L})$. We learn $F$ from a training set $D = \{(x_{i}, y_{i})\}_{i = 1}^{N}$, and we expect the learned $F$ to predict all $L$ labels of an unseen instance from $X$ correctly. Based on the definition, if $L > 1$ and $K_{j} = 2$ hold for all $1 \leq j \leq L$, MDC degenerates to a multi-label classification problem where all dimensions contain only binary classification tasks. Besides, if $L = 1$, MDC becomes a multi-class classification task.

Since an MDC model outputs $L$ labels for an instance, we decompose the model $F$ via $F = (F_{1}, F_{2}, \dots, F_{L})$ for each LD. With a slight abuse of notation, we obtain the logit vector, i.e., the confidence vector over those $K_{j}$ classes in the $j$th LD, with $\hat{f}_{ij} = F_{j}(x_{i}) \in \mathbb{R}^{K_{j}}$. The larger a certain value in $\hat{f}_{ij}$, the more confident $F_{j}$ is with the corresponding class among $\{1, \dots, K_{j}\}$. Define $\hat{f}_{ij}^{k}$ as the $k$th element in the vector $\hat{f}_{ij}$; the predicted label for the $j$th LD given $\hat{f}_{ij}$ is

$$\hat{y}_{ij} = \underset{k = 1, \dots, K_{j}}{\arg\max}\; \hat{f}_{ij}^{k}. \tag{1}$$

In summary, we obtain the logit set for $x_{i}$ as $\{\hat{f}_{i1}, \dots, \hat{f}_{iL}\}$, and then we get the predicted $L$-dimension label for $x_{i}$ as $\hat{y}_{i} = \{\hat{y}_{i1}, \dots, \hat{y}_{iL}\}$. The goal of MDC is to get a model $F$ from $D$ which is able to make its prediction $\hat{y}_{i}$ over any $x_{i} \in X$ close to the ground-truth $y_{i}$.
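As a minimal sketch of Eq. (1), the snippet below turns the per-LD logit vectors of one instance into an $L$-dimensional label by taking the arg max within each LD. The function name `predict_labels` and the toy logits are illustrative assumptions, not part of the original method description.

```python
from typing import List, Sequence


def predict_labels(logits_per_ld: Sequence[Sequence[float]]) -> List[int]:
    """Apply Eq. (1) to one instance: arg max over the K_j logits of each LD.

    logits_per_ld[j] holds the K_j confidence values of LD j (K_j may differ per LD).
    Returned labels are 1-based, matching y_ij in {1, ..., K_j}.
    """
    labels = []
    for logits in logits_per_ld:
        best = max(range(len(logits)), key=lambda k: logits[k])
        labels.append(best + 1)
    return labels


# Hypothetical logits for one instance with L = 3 LDs and K = (3, 4, 2) classes.
example_logits = [[0.2, 1.5, -0.3], [0.1, 0.0, 2.2, 0.4], [-1.0, 0.7]]
predicted = predict_labels(example_logits)
```

Each LD is handled independently here, which mirrors the decomposition $F = (F_{1}, \dots, F_{L})$; ties are resolved in favor of the lowest class index, which is a choice made for this sketch only.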

3.2 Evaluation metrics for MDC

To measure the gap with a perfect MDC model, we introduce several evaluation metrics for MDC.

Instance-wise criteria, such as Hamming Accuracy and Instance Accuracy, are widely used in the literature [19, 20, 22–24]. These evaluations extend the common accuracy criterion and measure whether an MDC model predicts the labels across all LDs correctly. Let $n(i) = \sum_{j = 1}^{L} [[y_{ij} = \hat{y}_{ij}]]$ denote the number of dimensions that a model $F$ predicts correctly for the instance $x_{i}$. The indicator function $[[\alpha]]$ returns 1 if $\alpha$ holds and 0 otherwise.

● Instance Accuracy (I-Acc, a.k.a. Exact Match Accuracy) computes the proportion of instances for which an MDC model predicts all $L$ LDs correctly. An instance contributes to the accuracy only if all of its $L$ labels are correctly predicted simultaneously.

● Hamming Accuracy (H-Acc) is not as strict as Instance Accuracy. It treats the labels in all LDs independently, and computes the fraction of the $L$ labels that an MDC model predicts correctly, averaged over instances. In other words, every correctly predicted label increases the Hamming Accuracy.
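The two instance-wise criteria can be sketched directly from $n(i)$. The code below is an illustrative implementation with hypothetical function names and toy labels; it follows the formulations of I-Acc and H-Acc given in Tab.1.

```python
from typing import Sequence


def correct_dims(y_true: Sequence[int], y_pred: Sequence[int]) -> int:
    """n(i): number of LDs whose label is predicted correctly for one instance."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p)


def instance_accuracy(Y_true: Sequence[Sequence[int]], Y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of instances with all L labels correct (exact match)."""
    L = len(Y_true[0])
    hits = sum(1 for t, p in zip(Y_true, Y_pred) if correct_dims(t, p) == L)
    return hits / len(Y_true)


def hamming_accuracy(Y_true: Sequence[Sequence[int]], Y_pred: Sequence[Sequence[int]]) -> float:
    """Average over instances of n(i) / L."""
    L = len(Y_true[0])
    return sum(correct_dims(t, p) / L for t, p in zip(Y_true, Y_pred)) / len(Y_true)


# Toy data: N = 4 instances, L = 3 LDs.
Y_true = [[1, 2, 1], [2, 2, 3], [1, 1, 1], [3, 2, 2]]
Y_pred = [[1, 2, 1], [2, 1, 3], [1, 1, 2], [1, 1, 1]]
i_acc = instance_accuracy(Y_true, Y_pred)  # only the first instance is an exact match
h_acc = hamming_accuracy(Y_true, Y_pred)   # 7 of the 12 labels are correct
```

On this toy data the exact-match criterion credits a single instance, while the Hamming criterion also credits the two partially correct instances, which illustrates why H-Acc is the less strict of the two.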

The instance-wise criteria reveal the ability of an MDC model to some extent. However, we claim that dimension-wise metrics are also essential in real-world MDC tasks. For example, in fashion recommendation, it is important to know whether users are satisfied with the various types of outfit recommended to them, but sometimes we also care about whether different classes of items from a particular outfit type can be successfully recommended. In summary, the dimension-wise criteria calculate the performance of an MDC model along each LD separately, and then average the results.

We can also interpret the Hamming Accuracy from the dimension-wise aspect, which is the averaged accuracy among all LDs. However, it is unable to faithfully reflect the ability of an MDC model due to the impact of the class imbalance problem. For example, in the LD related to ‘style’ in fashion recommendation, there should be more boots annotated by ‘simple’ than ‘retro’. Then, if an MDC model predicts all instances as major classes, it achieves higher accuracy, i.e., the Hamming Accuracy, easily.

In a multitude of practical applications, both majority and minority classes within individual LDs possess equivalent significance. Therefore, our objective is to conceptualize dimension-wise evaluation criteria that impartially encapsulate the algorithm’s performance across the entire class spectrum, eschewing a skewed representation dominated by majority-class performance. To this end, we propose the utilization of a “macro” averaging technique for aggregating performance metrics across diverse categories within each LD. This is followed by a computation of the mean of these dimension-specific metrics, thereby guaranteeing that our proposed evaluative criteria sustain a heightened level of sensitivity toward minority classes across all LDs. Details of the two metrics are as follows:

● meanMacroF1 (M²F1). MacroF1 is an extension of the F1 metric for multi-class classification tasks. A model that balances major and minor classes, with high precision and recall on both, attains a higher MacroF1. We then compute the mean of MacroF1 over all LDs. The larger the meanMacroF1, the better the MDC model.

● meanMacroAUC (M²AUC). Similarly, meanMacroAUC averages the LD-wise MacroAUC (the multi-class extension of Area Under the ROC Curve), the larger the better.
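The following sketch shows one way to compute the two dimension-wise criteria. It rests on several assumptions that are not stated in the text: macro averages divide by $K_{j}$, the mean over LDs divides by $L$, MacroF1 is taken as the harmonic mean of macro-precision and macro-recall (as the note in Tab.1 describes), MacroAUC is computed as a one-vs-rest pairwise ranking AUC with ties counted as one half, and classes with an empty denominator contribute 0 (F1) or are skipped (AUC). All function names and toy data are hypothetical.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], num_classes: int) -> float:
    """Harmonic mean of macro-precision and macro-recall for one LD (1-based labels)."""
    precisions, recalls = [], []
    for k in range(1, num_classes + 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        n_pred = sum(1 for p in y_pred if p == k)
        n_true = sum(1 for t in y_true if t == k)
        # Assumption: an empty denominator contributes 0 for that class.
        precisions.append(tp / n_pred if n_pred else 0.0)
        recalls.append(tp / n_true if n_true else 0.0)
    macro_p = sum(precisions) / num_classes
    macro_r = sum(recalls) / num_classes
    if macro_p + macro_r == 0:
        return 0.0
    return 2 * macro_p * macro_r / (macro_p + macro_r)


def macro_auc(y_true: Sequence[int], scores: Sequence[Sequence[float]], num_classes: int) -> float:
    """One-vs-rest ranking AUC averaged over classes; scores[i][k-1] is the logit of class k."""
    aucs = []
    for k in range(1, num_classes + 1):
        pos = [s[k - 1] for t, s in zip(y_true, scores) if t == k]
        neg = [s[k - 1] for t, s in zip(y_true, scores) if t != k]
        if not pos or not neg:
            continue  # Assumption: skip classes whose AUC is undefined.
        wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
        aucs.append(wins / (len(pos) * len(neg)))
    return sum(aucs) / len(aucs) if aucs else 0.0


def mean_over_lds(per_ld_values: Sequence[float]) -> float:
    """Average a per-LD metric over all L labeling dimensions."""
    return sum(per_ld_values) / len(per_ld_values)


# Toy LD 1: nine instances of class 1, one of class 2; the model always predicts class 1.
ld1_true, ld1_pred = [1] * 9 + [2], [1] * 10
# Toy LD 2: balanced and perfectly predicted.
ld2_true = [1, 2] * 5
ld2_pred = list(ld2_true)

ld1_accuracy = sum(t == p for t, p in zip(ld1_true, ld1_pred)) / 10
ld1_f1 = macro_f1(ld1_true, ld1_pred, 2)
m2f1 = mean_over_lds([ld1_f1, macro_f1(ld2_true, ld2_pred, 2)])

# Toy scores for the AUC: four instances, two classes.
auc_true = [1, 1, 2, 2]
auc_scores = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7], [0.6, 0.4]]
ld_auc = macro_auc(auc_true, auc_scores, 2)
```

In the first toy LD, the majority-only predictor reaches an accuracy of 0.9 while its MacroF1 is only about 0.47, which is the kind of gap between accuracy-style and imbalance-sensitive criteria that motivates the dimension-wise metrics.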

Both F1 and AUC penalize models that focus mainly on major classes while ignoring minor classes. We summarize both the previous instance-wise MDC criteria and our newly proposed dimension-wise MDC criteria in Tab.1. We expect an MDC model to balance both instance-wise and dimension-wise criteria. We will demonstrate that almost all real-world MDC tasks face imbalanced class distributions in most LDs, and that classical MDC methods cannot achieve as strong performance in the dimension aspect as in the instance aspect. Special designs are therefore needed for an MDC model.

Tab.1 Definitions of four MDC performance measures. The latter two are our newly proposed dimension-wise criteria

Measure	Formulation	Note
Instance Accuracy (I-Acc)	$\frac{1}{N} \sum_{i = 1}^{N} [[n (i) = L]]$	The probability that labels from all L LDs are correctly predicted
Hamming Accuracy (H-Acc)	$\frac{1}{N} \sum_{i = 1}^{N} \frac{1}{L} \cdot n (i)$	The average fraction of the L LDs whose labels are correctly predicted
meanMacroF1 (M²F1)	$MacroPrecision (j) = \frac{1}{K_{j}} \sum_{k = 1}^{K_{j}} \frac{\sum_{i = 1}^{N} [[{\hat{y}}_{i j} = y_{i j} = k]]}{\sum_{i = 1}^{N} [[{\hat{y}}_{i j} = k]]}$ $MacroRecall (j) = \frac{1}{K_{j}} \sum_{k = 1}^{K_{j}} \frac{\sum_{i = 1}^{N} [[{\hat{y}}_{i j} = y_{i j} = k]]}{\sum_{i = 1}^{N} [[y_{i j} = k]]}$ $\frac{1}{L} \sum_{j = 1}^{L} \frac{2}{1 / MacroPrecision (j) + 1 / MacroRecall (j)}$	Averaged MacroF1, the harmonic mean of Macro-Precision and Macro-Recall, across LDs
meanMacroAUC (M²AUC)	$\frac{1}{L} \sum_{j = 1}^{L} (\frac{1}{K_{j}} \sum_{k = 1}^{K_{j}} \frac{\sum_{i = 1}^{N} [[{\hat{f}}_{i j}^{k} = max ({\hat{f}}_{i j}^{1}, {\hat{f}}_{i j}^{2}, \dots, {\hat{f}}_{i j}^{K_{j}})]]}{\sum_{i = 1}^{N} [[y_{i j} = k]] \cdot \sum_{i = 1}^{N} [[y_{i j} \neq k]]})$	Averaged MacroAUC across LDs

4 Analyses of deep MDC

In this section, we analyze how to construct a deep MDC method and how the modeling factors influence the two aspects of criteria. We first show that imbalanced class distributions exist in real-world MDC tasks, which is a challenge for classical MDC methods. Then, we claim that model decomposition is vital for deep MDC, which lays the foundation for our IMAM approach.

4.1 Imbalanced labeling dimensions

Imbalanced class distribution is ubiquitous in multi-class classification tasks, which also exists in MDC problems among different LDs. We take a real-world dataset Zappos [60] as an example, which is constructed for fashion recommendation. The details of the dataset are in Section 6. We represent two LDs of Zappos by different colors, which have 11 and 13 classes, respectively. Fig.2 demonstrates clear long-tailed class distribution for both LDs of Zappos. We provide the results of other datasets in the appendix.

Fig.2 Imbalanced class distribution on different dimensions of the Zappos dataset (with 11 and 13 classes, respectively)

Remarkably, the class imbalance problem becomes more complicated in MDC. Since an instance is labeled from multiple LDs, a minor-class instance in one LD could become a major-class one when annotated from another aspect. For example, a patient who is ‘severe’ (a minor class) in dimension ‘Hemolysis’ can be ‘normal’ (a major class) in dimensions ‘Lipemia’ and ‘Jaundice’. We name this unique phenomenon in MDC, in which an instance’s affiliation with a major or a minor class changes across LDs, Imbalance Shift. Therefore, we cannot determine the major/minor class property of an instance without specifying its LD. In Fig.3, we count the instances for each pair of classes across the two LDs of Zappos; many instances have a major class label in one LD but a minor class label in the other. Since all classes are equally important in most scenarios, the imbalanced LDs indicate the necessity of considering imbalance-sensitive criteria along LDs.
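To make the imbalance shift concrete, the sketch below counts class co-occurrences between two LDs (the raw counts behind a heat map like the one described for Fig.3) and measures how many instances are ‘major’ in one LD but ‘minor’ in the other. The text does not give a formal threshold for major versus minor classes, so the rule used here (a class is major if it holds more than the uniform share $N / K_{j}$ of instances) is purely an assumption, as are the function names and the toy labels.

```python
from collections import Counter
from typing import Sequence


def imbalance_shift_rate(Y: Sequence[Sequence[int]], num_classes: Sequence[int],
                         ld_a: int, ld_b: int) -> float:
    """Fraction of instances whose major/minor status differs between two LDs."""
    n = len(Y)
    counts_a = Counter(y[ld_a] for y in Y)
    counts_b = Counter(y[ld_b] for y in Y)
    # Hypothetical rule: 'major' means more instances than the uniform share N / K_j.
    thr_a, thr_b = n / num_classes[ld_a], n / num_classes[ld_b]
    shifted = sum(
        1 for y in Y
        if (counts_a[y[ld_a]] > thr_a) != (counts_b[y[ld_b]] > thr_b)
    )
    return shifted / n


def co_occurrence(Y: Sequence[Sequence[int]], ld_a: int, ld_b: int) -> Counter:
    """Count instances for every (class in LD a, class in LD b) pair."""
    return Counter((y[ld_a], y[ld_b]) for y in Y)


# Toy data: N = 6 instances, two LDs with two classes each; class 1 dominates both LDs.
Y = [[1, 1], [1, 1], [1, 1], [1, 2], [2, 1], [1, 1]]
shift_rate = imbalance_shift_rate(Y, [2, 2], 0, 1)
pair_counts = co_occurrence(Y, 0, 1)
```

In this toy example the instances labeled `[1, 2]` and `[2, 1]` each belong to a major class in one LD and a minor class in the other, so no single major/minor tag can be attached to them without naming the LD.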

Fig.3 Imbalance shift from one LD to another. The color map counts the number of instances over two LDs on Zappos. The numerical values annotated on the colored blocks are the counts after a logarithmic transformation. Many major class instances become minor ones when the LD changes. In other words, the major/minor class property of an instance is difficult to preserve across LDs

4.2 Deep MDC from labeling dimension aspect

Based on current observations, we’d like to construct an end-to-end MDC model, which balances two types of criteria. We first show the challenges of classical MDC methods, and then analyze several baselines to demonstrate the key factors in deep MDC.

Challenges for traditional MDC methods. Current MDC methods [19, 20, 22–24] rely on pre-extracted features and cannot be applied to structural data such as images in an end-to-end manner. We first answer a question: what about directly using the current MDC methods? One main requirement is to get strong pre-extracted features. For a fair comparison, we extract features for classical MDC methods by pre-training with deep neural networks. We decompose the model $F_{j}$ for the $j$th LD into two parts, the linear classifier $W_{j} \in \mathbb{R}^{d \times K_{j}}$ and the feature extractor (a.k.a. embedding function) $\phi_{j} : \mathbb{R}^{D} \to \mathbb{R}^{d}$. The embedding maps the raw input to a $d$-dimensional space, and the classifier $W_{j}$ further predicts the confidence of $K_{j}$ classes with a linear transformation.† We have $W_{j} = [w_{j}^{1}, \dots, w_{j}^{K_{j}}]$, where each column of $W_{j}$ corresponds to the classifier for one of the $K_{j}$ classes in the $j$th LD.

To cast the extracted features to classical MDC methods, we assume there is a shared embedding space $\phi$ across different LDs, but there are multiple linear classifiers $\{W_{1}, \dots, W_{L}\}$ for various LDs. In other words, we implement the LD-specific prediction $\hat{f}_{ij}^{k} = (w_{j}^{k})^{\top} \phi(x_{i})$ for $k = 1, \dots, K_{j}$. Then, we train the embedding extractor and the multi-head classifier by using the cross-entropy:

$$\min_{\phi, W_{1}, \dots, W_{L}} - \sum_{i = 1}^{N} \sum_{j = 1}^{L} \log \left( \frac{\exp \hat{f}_{ij}^{y_{ij}}}{\sum_{k = 1}^{K_{j}} \exp \hat{f}_{ij}^{k}} \right). \tag{2}$$

Equation (2) minimizes the discrepancy between the posterior class probability on each LD and the ground-truth, which makes the embedding and multiple classifiers discriminative. We name the method in Eq. (2) ‘Multi-Head’ (abbreviated as ‘M-Head’), which predicts different LDs with the corresponding classifier. By keeping $\phi$, we get one general feature to depict all dimensions of an instance, so that we can apply traditional MDC methods.
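The M-Head objective in Eq. (2) can be sketched in a few lines. The code below assumes that the shared embedding $\phi(x_{i})$ has already been computed by some feature extractor and is given as a plain list of floats; each head $W_{j}$ is stored as a list of its $K_{j}$ column vectors. These data layouts, the function names, and the toy values are assumptions made for illustration, and no gradient-based training is shown.

```python
import math
from typing import Sequence


def head_logits(embedding: Sequence[float], W_j: Sequence[Sequence[float]]) -> list:
    """f_ij^k = (w_j^k)^T phi(x_i); W_j is a list of K_j column vectors of length d."""
    return [sum(w * e for w, e in zip(w_k, embedding)) for w_k in W_j]


def log_softmax_at(logits: Sequence[float], target: int) -> float:
    """Log-probability of the 1-based target class under a softmax over the logits."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[target - 1] - log_z


def multi_head_loss(embeddings, labels, heads) -> float:
    """Eq. (2): negative log-likelihood summed over all instances and all LDs."""
    loss = 0.0
    for emb, y in zip(embeddings, labels):
        for j, W_j in enumerate(heads):
            loss -= log_softmax_at(head_logits(emb, W_j), y[j])
    return loss


# Toy setup: N = 2 instances, d = 2, L = 2 LDs with K = (2, 3) classes.
embeddings = [[1.0, -1.0], [0.5, 2.0]]
labels = [[1, 3], [2, 1]]
zero_heads = [[[0.0, 0.0]] * 2, [[0.0, 0.0]] * 3]  # untrained heads give uniform posteriors
initial_loss = multi_head_loss(embeddings, labels, zero_heads)
```

With all-zero heads every posterior is uniform, so the loss equals $N (\log 2 + \log 3)$ on this toy setup, which is a convenient sanity check before any training.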

We apply two representative MDC methods, KRAM [24] and LEFA [20], on the extracted features. We implement $\phi$ with ResNet-18 [61], and their results on Zappos and another real-world MDC dataset Med (with three LDs) are shown in Tab.2. We have the following observations based on the results:

Tab.2 MDC performance comparison on Med and Zappos. H-Acc/I-Acc and M²F1/M²AUC are instance-wise and dimension-wise criteria, respectively

Med	H-Acc	I-Acc	M²F1	M²AUC
KRAM	91.11	74.62	72.38	94.33
LEFA	90.75	75.27	69.88	92.84
M-Head	90.66	77.29	76.78	97.00

Zappos	H-Acc	I-Acc	M²F1	M²AUC
KRAM	67.23	50.93	44.46	78.69
LEFA	67.66	50.12	44.85	80.87
M-Head	66.41	47.89	43.24	87.35

● M-Head is an effective way for MDC feature extraction: based on its features, KRAM and LEFA achieve higher instance-wise performance than M-Head itself, especially on Zappos.

● The two kinds of criteria do not show consistent phenomena. For example, there exist large gaps between the performance on dimension-wise criteria and on instance-wise criteria, which indicates that considering the imbalanced class distributions per LD is important to improve the dimension-wise MDC performance.

● M-Head gets higher dimension-wise performance than traditional MDC methods. For the instance-wise criteria, M-Head’s values are lower but still competitive due to its strong extracted features.

Therefore, since the LD-wise imbalanced distribution is a big obstacle to MDC, especially for dimension-wise metrics, constructing an MDC model from the LD aspect like M-Head helps. The dimension-wise model construction makes it possible to assign an instance to a ‘minor’ or ‘major’ class within each LD, and may not lose performance on instance-wise criteria thanks to the discriminative features.

Baselines for Deep MDC. Following M-Head, we study a variety of deep MDC baselines along the LD aspect.

● Multi-Emb (M-Emb): In addition to using multiple classifier heads, we can also use a light-weight strategy to generate multiple embedding spaces. In detail, we learn multiple projections, one for each LD, which map the original shared embedding to different ones. For the $j$th LD, we have a projection $P_{j} \in \mathbb{R}^{d \times d}$, and then $\hat{f}_{ij}^{k} = (w_{j}^{k})^{\top} (P_{j} \phi(x_{i}))$ for $k = 1, \dots, K_{j}$. Thus, we get different embedding spaces to deal with the label diversity in various LDs, but also model their relatedness with the shared embedding $\phi$.

● M-Head$_{Imb}$ and M-Emb$_{Imb}$: To cope with the class imbalance problem, we apply several deep class imbalance learning methods [40–42] to M-Head and M-Emb on each LD. We empirically find CDT [41] performs the best in general, so we report the results of M-Head and M-Emb with CDT in the main paper. Results with other imbalance learning methods can be found in the appendix. Notably, any existing method that hinges on logit adjustment for managing class imbalances can be used here. Such methods are orthogonal to M-Head and M-Emb. We tune and use the best hyper-parameters separately on classifiers of different LDs.
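The difference between M-Head and M-Emb lies only in the LD-specific projection $P_{j}$ applied before the classifier. The sketch below illustrates the M-Emb logit computation with plain lists; the helper names and the toy matrices are hypothetical, and the shared embedding is assumed to be given.

```python
from typing import Sequence


def matvec(P: Sequence[Sequence[float]], v: Sequence[float]) -> list:
    """Multiply a d x d matrix (list of rows) with a length-d vector."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]


def m_emb_logits(embedding: Sequence[float], P_j: Sequence[Sequence[float]],
                 W_j: Sequence[Sequence[float]]) -> list:
    """f_ij^k = (w_j^k)^T (P_j phi(x_i)); W_j is a list of K_j column vectors."""
    projected = matvec(P_j, embedding)  # LD-specific embedding
    return [sum(w * z for w, z in zip(w_k, projected)) for w_k in W_j]


# Toy setup with d = 2 and K_j = 2.
shared_embedding = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
W_j = [[1.0, 0.0], [0.0, 1.0]]

logits_identity = m_emb_logits(shared_embedding, identity, W_j)
logits_swap = m_emb_logits(shared_embedding, swap, W_j)
```

With an identity projection the computation reduces to the M-Head logits, so M-Emb can be read as M-Head plus one extra $d \times d$ linear map per LD, which is why the text calls it light-weight.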

The results of various deep MDC methods on Med and Zappos are shown in Tab.3. We make the following observations:

Tab.3 Deep MDC performance comparison on Med and Zappos. H-Acc/I-Acc and M²F1/M²AUC are instance-wise and dimension-wise criteria, respectively

Med	H-Acc	I-Acc	M²F1	M²AUC
M-Head	90.66	77.29	76.78	95.20
M-Emb	91.83	78.08	78.66	96.98
M-Head$_{Imb}$	92.05	78.73	78.59	97.00
M-Emb$_{Imb}$	91.77	77.95	78.32	96.73

Zappos	H-Acc	I-Acc	M²F1	M²AUC
M-Head	66.41	47.89	43.24	87.35
M-Emb	67.16	48.87	45.58	88.36
M-Head$_{Imb}$	68.26	49.76	45.75	88.75
M-Emb$_{Imb}$	66.95	48.12	45.45	88.00

● M-Emb outperforms M-Head on all evaluation metrics, which indicates the effectiveness of using multiple embedding spaces.

● M-Head$_{Imb}$ improves over the vanilla M-Head even more clearly. So imbalanced class distribution is the key issue in deep MDC, and incorporating imbalance learning methods per LD helps.

● Although M-Emb$_{Imb}$ combines the previous two useful components, i.e., multiple embedding spaces and an imbalanced training strategy, it does not yield further improvement. On the contrary, the performance of M-Emb$_{Imb}$ drops w.r.t. both M-Emb and M-Head$_{Imb}$. One possible reason is that the deep imbalance learning methods push the major and minor class features towards different directions and reduce the feature deviation between training and test instances of minor classes [41]. Due to the imbalance shift in MDC, the embedding of one instance may receive different update directions because its major/minor property changes across LDs, which weakens the comprehensive ability of the model.

We have empirically studied several baseline models for deep MDC. Considering multiple embedding spaces or the imbalanced class distribution in LDs each improves deep MDC, but directly combining the two brings no additional benefit. Therefore, in IMAM, we fuse the strong discriminative ability on imbalanced classes and the strong representative ability together for a better deep MDC model.

5 IMAM for deep MDC

In this section, we introduce our proposed IMbalance-Aware fusion Model (IMAM), which works in a ‘decompose-and-fuse’ manner. We first apply imbalance learning methods on multiple branches of models. Then we use M-Emb to fuse the knowledge from those multiple teachers, via embedding and classifier distillation. The detailed architecture is shown in Fig.4.

Fig.4 An illustration of our proposed IMAM approach based on an MDC problem with two LDs. In the decomposition step (a), we construct an imbalance-aware deep model for each LD. In the fusion step (b), we use the models from the former step as fixed teachers and fuse their knowledge into a compact student. Both embedding (green) and classifier (red) distillations help in matching knowledge between models. The subscript ‘T’ denotes a component of a teacher. ‘CE’ denotes the cross-entropy loss

5.1 Step one: deep imbalanced decomposition

Based on our observations in Section 4, we first decompose the MDC task into a set of multi-class classification problems along the labeling dimensions. Different from M-Emb, we further increase the representative ability of the models and consider multiple branches with fully different feature extractors $\{\phi_{1}, \dots, \phi_{L}\}$ for the $L$ dimensions. Specifically, for $k = 1, \dots, K_{j}$, we have

$$\hat{f}_{ij}^{k} = (w_{j}^{k})^{\top} \phi_{j}(x_{i}). \tag{3}$$

Recall that $W_{j} = [w_{j}^{1}, \dots, w_{j}^{K_{j}}]$ is the multi-class classifier for the $j$th LD. Furthermore, we take the imbalanced class distribution in each LD into account and adjust the logits of a classifier conditioned on the number of instances in each class. Major classes receive larger confidence (logit values) than minor ones, which biases the classifier towards the major classes at test time. To calibrate the logits among all classes during the test phase, we need to raise the logit values of minor classes relative to those of major ones. Following [41], we use a class-dependent temperature in the cross-entropy, only during training, with the objective:

$$\min_{\{\phi_{j}, W_{j}\}_{j=1}^{L}} \; - \sum_{i=1}^{N} \sum_{j=1}^{L} \log \left( \frac{\exp \left( \hat{f}_{ij}^{y_{ij}} / a_{j}^{y_{ij}} \right)}{\sum_{k=1}^{K_{j}} \exp \left( \hat{f}_{ij}^{k} / a_{j}^{k} \right)} \right). \tag{4}$$

Denote by $N_{j}^{k}$ the number of training instances that belong to class $k$ in the $j$th LD, and by $N_{j}^{\max}$ the maximum of $N_{j}^{k}$ over all $k$. We then set $a_{j}^{k} = (N_{j}^{\max} / N_{j}^{k})^{\gamma}$, where $\gamma \geq 0$ is a hyper-parameter. Hence, $a_{j}^{k} = 1$ for the most major class and $a_{j}^{k} > 1$ for the other (including minor) classes if $\gamma > 0$. Minimizing Eq. (4) forces $\hat{f}_{ij}^{k}$ to be enlarged by a factor of $a_{j}^{k}$ when $x_{i}$ belongs to class $k$ in the $j$th LD. Note that during the test stage we do not apply any special temperature to the logits.
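
As a minimal sketch of the per-instance, per-LD term of Eq. (4), the snippet below computes the class-dependent temperatures and the resulting cross-entropy in plain Python. The class counts, logits, and the value of γ are hypothetical numbers chosen for illustration, and the function names are not from the paper.

```python
import math


def class_temperatures(class_counts, gamma):
    """a_k = (N_max / N_k) ** gamma for the classes of one LD (gamma >= 0)."""
    n_max = max(class_counts)
    return [(n_max / n) ** gamma for n in class_counts]


def temperature_cross_entropy(logits, label, temps):
    """-log softmax(logits / temps)[label], i.e., one summand of Eq. (4)."""
    scaled = [f / a for f, a in zip(logits, temps)]
    m = max(scaled)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_z - scaled[label]


counts = [900, 90, 10]    # hypothetical class sizes in one LD
logits = [2.0, 1.0, 1.5]  # hypothetical logits of one instance
temps = class_temperatures(counts, gamma=0.5)

# The same minor-class instance (label 2) with and without temperatures.
loss_plain = temperature_cross_entropy(logits, 2, [1.0, 1.0, 1.0])
loss_temp = temperature_cross_entropy(logits, 2, temps)
```

With these numbers the temperature-scaled loss is larger than the plain one for the minor-class instance, so training has to push the minor-class logit higher to reach the same loss. At test time the raw logits would be used without any division.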

In the first step of IMAM (IMAM$_{1st}$), we train multiple imbalance-aware models for the LDs separately, so different LDs have disjoint embedding spaces. We empirically find that IMAM$_{1st}$ performs better than M-Head$_{Imb}$ on both instance-wise and dimension-wise criteria. By using more branches, the representative ability of the feature extractors as well as their diversity among LDs is improved, which alleviates the imbalance shift problem. The advantages of multiple feature spaces and of the imbalance learning strategy are both fully utilized.

Although IMAM$_{1st}$ is similar to Binary Relevance (BR) [23] in classical MDC, the separate imbalance configurations on LDs matter. In a nutshell, we obtain a set of models that solve the MDC problem as multiple imbalanced multi-class problems.

5.2 Step two: deep imbalanced fusion

IMAM$_{1st}$ learns discriminative MDC models, but two issues remain. First, the separate imbalance-aware models ignore the relationships between LDs, which are quite common in practice. For example, in fashion recommendation, high-heeled shoes are usually less comfortable than slippers, which means the LD ‘Heel Height’ is strongly related to the LD ‘Comfort’. Previous MDC methods [19, 20, 24] have demonstrated that modeling the relationship between different LDs can effectively improve model performance. Second, one forward pass per LD is required at inference time with the multiple models, which results in a high computational cost. Therefore, in the second step of IMAM, we treat the first-stage models as multiple fixed teachers and fuse their knowledge into a single student deep MDC model.

We propose a novel knowledge reuse method to fuse the information from multiple LDs. By minimizing Eq. (4), we obtain multiple embeddings and classifiers for the $L$ LDs, i.e., $\{\phi_{T_{j}}, W_{T_{j}}\}_{j=1}^{L}$. We use the subscript $T$ to denote that they are obtained from the first stage and are fixed as teachers in the second stage. Since M-Emb models the relatedness and difference between LDs with a shared $\phi$ and multiple projections, it is well suited to serve as a compact student model. In particular, the student model learns a shared embedding $\phi$, multiple LD-specific projections $\{P_{j}\}_{j=1}^{L}$, and classifiers $\{W_{j}\}_{j=1}^{L}$. We claim that both the comparison ability of embeddings and the prediction ability of classifiers are essential for MDC, so we transfer the knowledge from multiple teachers to a student in two aspects.
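
The structure of the student (a shared embedding, LD-specific projections, and LD-specific classifiers) can be sketched as a single forward pass. This is only an illustration under stated assumptions: the shared extractor is replaced by a toy linear map with ReLU instead of a deep network, and all weights and sizes are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def student_forward(x, phi_weights, projections, classifiers):
    """Return the per-LD embeddings and logits of an M-Emb-style student.

    phi_weights : stand-in for the shared extractor phi (linear map + ReLU here,
                  which is an assumption; the paper uses a deep network).
    projections : list of P_j, one matrix per LD.
    classifiers : list of W_j, each stored as K_j rows w_j^k.
    """
    shared = [max(0.0, h) for h in matvec(phi_weights, x)]  # phi(x)
    embeddings = [matvec(p, shared) for p in projections]   # phi_j(x) = P_j phi(x)
    logits = [matvec(w, e) for w, e in zip(classifiers, embeddings)]
    return embeddings, logits


# Hypothetical toy sizes: 3-d input, 2-d shared embedding, two LDs with 2 and 3 classes.
phi_weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
projections = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
classifiers = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, -1.0], [0.0, 2.0]]]

embeddings, logits = student_forward([2.0, 1.0, 1.0], phi_weights, projections, classifiers)
predictions = [row.index(max(row)) for row in logits]  # one predicted class per LD
```

Only one pass through the shared extractor is needed for all LDs, which is the source of the computational saving over running one teacher per LD.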

Embedding distillation We aim to make the embedding $\phi, \{P_{j}\}_{j=1}^{L}$ of the student as strong as those of the teachers. In addition to the vanilla cross-entropy loss as in Eq. (2), we align embeddings between multiple teachers and the student via the following training objective

$$\min_{\phi, \{P_{j}, W_{j}\}_{j=1}^{L}} \sum_{i=1}^{N} \sum_{j=1}^{L} \left[ - \log \left( \frac{\exp \hat{f}_{ij}^{y_{ij}}}{\sum_{k=1}^{K_{j}} \exp \hat{f}_{ij}^{k}} \right) + \lambda_{1} R_{1} \left( \phi_{j}(x_{i}), \phi_{T_{j}}(x_{i}) \right) \right], \tag{5}$$

where $\phi_{j}(x_{i}) = P_{j} \phi(x_{i})$ is the specific embedding of $x_{i}$ for the $j$th LD. $R_{1}(\cdot)$ measures the difference between two embeddings, and we use the Euclidean distance between L2-normalized vectors. $\lambda_{1}$ is a trade-off hyper-parameter. Benefiting from Eq. (5), the compact student model with a shared component $\phi$ can perform as well as the multiple teachers’ embeddings. The embeddings capture multiple kinds of similarity measures between instances for MDC. More analyses and visualizations are in Section 6.
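
A minimal sketch of the regularizer $R_{1}$ in Eq. (5), the Euclidean distance between L2-normalized embeddings, is given below. It assumes that the student and teacher embeddings of an LD have the same dimensionality; the function names and the example vectors are hypothetical.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]


def embedding_distance(student_emb, teacher_emb):
    """R_1: Euclidean distance between the L2-normalized embeddings."""
    s = l2_normalize(student_emb)
    t = l2_normalize(teacher_emb)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))


def embedding_distillation_term(student_embs, teacher_embs, lam1):
    """lambda_1 * sum_j R_1(phi_j(x), phi_{T_j}(x)) for a single instance."""
    return lam1 * sum(
        embedding_distance(s, t) for s, t in zip(student_embs, teacher_embs)
    )


same_direction = embedding_distance([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = embedding_distance([1.0, 0.0], [0.0, 3.0])      # orthogonal vectors
```

Because of the normalization, the term depends only on the direction of the embeddings, so a student embedding that is a rescaled copy of the teacher's incurs (numerically) no penalty.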

Classifier distillation After optimizing Eq. (5), we align logits between the teachers and the student via the following training objective

$$\min_{\phi, \{P_{j}, W_{j}\}_{j=1}^{L}} \sum_{i=1}^{N} \sum_{j=1}^{L} \left[ - \log \left( \frac{\exp \hat{f}_{ij}^{y_{ij}}}{\sum_{k=1}^{K_{j}} \exp \hat{f}_{ij}^{k}} \right) + \lambda_{2} R_{2} \left( s_{\tau}(\hat{f}_{ij}), s_{\tau}(\hat{f}_{T_{ij}}) \right) \right], \tag{6}$$

where $s_{\tau}$ transforms the logit vector $\hat{f}_{ij} = \hat{f}_{j}(x_{i})$ into a softened $K_{j}$-way probability:

$$s_{\tau}(\hat{f}_{ij}) = \mathrm{softmax} \left( \frac{\hat{f}_{ij}}{\tau} \right). \tag{7}$$

$R_{2}(\cdot)$ in Eq. (6) measures the difference between two distributions, and we use the Kullback-Leibler divergence. $\tau$ is a hyper-parameter that softens the teacher’s posterior probability, and we set $\tau = 1$ when dealing with the student’s logits. $\lambda_{2}$ is also a trade-off hyper-parameter.
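
The classifier distillation term of Eq. (6) and the softened softmax of Eq. (7) can be sketched as follows. The direction of the divergence, KL(teacher ‖ student), is an assumption borrowed from common knowledge distillation practice, since the text does not state it; the logits and hyper-parameter values are hypothetical.

```python
import math


def softened_softmax(logits, tau):
    """s_tau: softmax of the logits divided by the temperature tau."""
    scaled = [f / tau for f in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def classifier_distillation_term(student_logits, teacher_logits, tau_teacher, lam2):
    """lambda_2 * R_2 for one instance and one LD."""
    teacher_probs = softened_softmax(teacher_logits, tau_teacher)
    student_probs = softened_softmax(student_logits, 1.0)  # tau = 1 for the student
    # Assumed direction: KL(teacher || student).
    return lam2 * kl_divergence(teacher_probs, student_probs)


match = classifier_distillation_term([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], 1.0, 1.0)
mismatch = classifier_distillation_term([3.0, 2.0, 1.0], [1.0, 2.0, 3.0], 2.0, 0.5)
```

The term vanishes when the student reproduces the teacher's distribution and grows as the two distributions diverge.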

Summary of IMAM. IMAM works in a ‘decompose-and-fuse’ manner. It first learns multiple separate imbalance-aware models, one for each LD. Then, treating these models as fixed teachers, IMAM learns a shared embedding as well as multiple projections and classifiers to distill their knowledge. The two-step approach not only reduces the computational burden with a more compact model but also relates different LDs to each other, which especially helps when the relatedness between LDs is obvious. The two steps of IMAM also handle the imbalance shift in deep MDC: the first stage does not restrict the update directions of different LDs, while the second stage fuses the discriminative ability of the multiple models.

6 Experiments

In this section, we evaluate the performance of the proposed IMAM and various classical and deep MDC methods on three real-world datasets and a synthesized dataset.

6.1 Experimental setup

Datasets Unlike previous small-scale MDC datasets with pre-extracted features, we collect four image-based datasets to evaluate various deep MDC approaches.

● Med is a real-world dataset for the medical diagnosis of blood disorders, collected from hospitals. The raw images contain patients’ blood samples, and we crop the serum portion of the blood as a pre-processing step. The three LDs are ‘hemolysis’, ‘lipemia’, and ‘jaundice’, which have four, five, and two ordered severity classes (such as normal, mild, moderate, and severe), respectively.

● Zappos contains 50025 images of shoes collected from an online shopping website [60]. Each image has rich annotations from various aspects. Since most annotations are missing, we choose the two most comprehensively annotated aspects as two LDs, i.e., ‘closure’ and ‘subcategory’.

● Calligraphy contains 23195 images of Chinese characters from different ancient Chinese calligraphy works [62]. Each image is annotated along two LDs, i.e., ‘calligraphy’ and ‘font type’.

● Letter is a computer-generated image dataset [63]. Each image contains one alphabet letter and is annotated from various aspects. We use the uppercase ‘letters’ with 26 categories as one LD. The other two LDs are ‘font colors’ (10 colors) and ‘background colors’ (9 colors). Each distinct combination of the three LDs contains 300 images. We construct an exponentially distributed long-tailed dataset with 13634 images based on Letter. The details are in the appendix.

Tab.4 summarizes the statistics of all datasets, including the number of instances, the number of LDs, and the number of class labels per LD. For all datasets, we randomly select 80% of the whole dataset as the training set and use the remaining part as the test set.

Tab.4 Statistics of MDC datasets

Name	# of Instance	# of LD	# of Class per LD
Med	4567	3	4, 5, 2
Zappos [60]	40020	2	11, 13
Calligraphy [62]	23195	2	14, 5
Letter [63]	13634	3	26, 10, 9
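
The random 80%/20% split described above can be sketched on instance indices as follows. The fixed seed and the function name are assumptions added for reproducibility of the illustration; the paper does not specify them.

```python
import random


def random_split(num_instances, train_fraction=0.8, seed=0):
    """Shuffle the instance indices and cut them into a training and a test part."""
    indices = list(range(num_instances))
    rng = random.Random(seed)  # hypothetical seed, not from the paper
    rng.shuffle(indices)
    cut = int(train_fraction * num_instances)
    return indices[:cut], indices[cut:]


# Example with the size of the Med dataset from Tab.4.
train_idx, test_idx = random_split(4567)
```

Note that a purely random split does not guarantee that every minor class of every LD appears in the test set; whether the authors stratified the split is not stated.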

Comparison methods:

● Classical MDC methods on pre-extracted features. As described in Section 4, we apply several MDC methods on the features extracted by M-Head. In detail, we compare with BR [23], ECC [22], KRAM [24], and LEFA [20]. The latter two enrich the feature space across LDs with special strategies.^†

● M-Head, M-Emb, M-Head$_{Imb}$, and M-Emb$_{Imb}$: the four deep MDC baselines mentioned in Section 4.

Implementation details We use ResNet-18 [61] as the feature extractor $\phi$. For all datasets, we resize images to 32 $\times$ 32 and use Stochastic Gradient Descent as the default optimizer. Hyper-parameters such as $\lambda_{1}$ and $\lambda_{2}$ in IMAM are tuned by cross-validation over the training set. More details are in the appendix.

6.2 Experimental results

We report the results of IMAM and all comparison methods, including four classical MDC methods based on pre-extracted features and four deep MDC baselines, in Tab.5. Instance-wise criteria, i.e., Hamming Accuracy (H-Acc) and Instance Accuracy (I-Acc), and dimension-wise criteria, i.e., meanMacroF1 (M²F1) and meanMacroAUC (M²AUC), are calculated. Based on the results, we have the following observations:

Tab.5 Performance of four traditional MDC methods, four deep MDC methods, and our IMAM on four datasets. Four evaluation metrics from instance-wise and dimension-wise aspects are calculated

Method	Med				Zappos				Calligraphy				Letter
Method	H-Acc	I-Acc	M²F1	M²AUC	H-Acc	I-Acc	M²F1	M²AUC	H-Acc	I-Acc	M²F1	M²AUC	H-Acc	I-Acc	M²F1	M²AUC
BR	90.63	74.62	70.09	92.69	66.54	47.89	43.35	85.87	80.65	72.32	78.71	96.62	71.00	44.25	67.37	92.07
ECC	87.91	68.35	51.25	84.69	60.74	37.86	35.44	69.53	74.27	71.94	71.19	87.38	65.45	40.06	59.34	86.53
KRAM	91.11	74.62	72.38	94.33	67.23	48.65	44.46	78.69	81.40	73.60	78.88	94.53	72.03	44.05	68.12	92.07
LEFA	90.75	75.27	69.88	92.84	67.66	48.36	44.85	80.87	80.42	72.85	77.06	91.86	72.33	42.80	66.88	91.75
M-Head	90.66	77.29	76.78	95.20	66.41	47.89	43.24	87.35	81.12	73.01	78.97	97.30	72.66	47.31	68.75	94.30
M-Emb	91.83	78.08	78.66	96.98	67.16	48.87	45.58	88.36	83.21	75.68	81.39	97.54	74.49	49.57	71.17	94.87
M-Head$_{Imb}$	92.05	78.73	78.59	95.60	68.26	49.76	45.75	88.75	82.86	75.24	81.49	97.54	75.33	50.73	71.46	94.60
M-Emb$_{Imb}$	91.77	77.95	78.32	96.73	66.95	48.12	45.45	88.00	82.22	74.56	80.40	97.47	73.52	49.33	70.58	94.37
IMAM	93.10	80.82	81.79	97.71	68.33	50.01	45.90	89.32	83.44	76.06	81.98	97.61	81.48	61.43	79.21	96.70

● Traditional MDC methods achieve good instance-wise performance with the features extracted by M-Head. KRAM and LEFA work better than BR and ECC because they model the LD-wise relationship with feature augmentation. However, these methods ignore the class imbalance within LDs and cannot extract features in an end-to-end manner. Compared to imbalance-aware deep models, all of them perform relatively poorly, especially on dimension-wise metrics. The performance of ECC is notably inferior to that of the other methods. We hypothesize that the chain-like dependency between dimensions assumed by ECC might not match the actual data, since in reality some dimensions may be uncorrelated.

● Simple deep models like M-Head and M-Emb are effective in MDC tasks. If incorporated with class imbalance methods, they can get better dimension-wise performance while maintaining good performance on instance-wise metrics.

● Our IMAM achieves the best performance on both instance-wise and dimension-wise metrics. The ‘decompose-and-fuse’ strategy enables IMAM to effectively learn knowledge of class imbalance and model the relatedness among all LDs.
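
To illustrate why instance-wise and dimension-wise criteria can disagree, the sketch below computes H-Acc, I-Acc, and M²F1 on a toy prediction. The formal definitions are those of Tab.1; the code assumes the usual readings (H-Acc as the fraction of correct instance-LD pairs, I-Acc as exact match over all LDs, M²F1 as the macro-F1 averaged over LDs), and the labels are hypothetical. M²AUC is omitted because it needs predicted scores rather than hard labels.

```python
def hamming_accuracy(y_true, y_pred):
    """Fraction of correct (instance, LD) pairs."""
    total = sum(len(row) for row in y_true)
    correct = sum(
        t == p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp)
    )
    return correct / total


def instance_accuracy(y_true, y_pred):
    """Fraction of instances that are correct in every LD."""
    return sum(rt == rp for rt, rp in zip(y_true, y_pred)) / len(y_true)


def macro_f1(true_col, pred_col):
    """Unweighted mean of per-class F1 within one LD (classes from the true labels)."""
    scores = []
    for c in sorted(set(true_col)):
        tp = sum(t == c and p == c for t, p in zip(true_col, pred_col))
        fp = sum(t != c and p == c for t, p in zip(true_col, pred_col))
        fn = sum(t == c and p != c for t, p in zip(true_col, pred_col))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_macro_f1(y_true, y_pred):
    """Macro-F1 averaged over all LDs."""
    num_lds = len(y_true[0])
    per_ld = [
        macro_f1([r[j] for r in y_true], [r[j] for r in y_pred])
        for j in range(num_lds)
    ]
    return sum(per_ld) / num_lds


# Toy case: the model always predicts the major class of each LD.
y_true = [[0, 1], [0, 0], [1, 1], [0, 1]]
y_pred = [[0, 1], [0, 1], [0, 1], [0, 1]]
h_acc = hamming_accuracy(y_true, y_pred)
i_acc = instance_accuracy(y_true, y_pred)
m2f1 = mean_macro_f1(y_true, y_pred)
```

In this toy case a model that ignores the minor classes still reaches an H-Acc of 0.75, while M²F1 drops to 3/7, which is the kind of gap the dimension-wise criteria are meant to expose.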

6.3 Ablation study

Do all components in IMAM help? We compare IMAM with its variants. IMAM$_{CLS-Distill}$ and IMAM$_{EMB-Distill}$ denote the variants that use only classifier distillation and only embedding distillation, respectively, to transfer the knowledge from IMAM$_{1st}$. Tab.6 shows the performance of the different IMAM variants on the Med dataset. We find that IMAM$_{1st}$ performs better than M-Head$_{Imb}$ on both instance-wise and dimension-wise criteria. Benefiting from the use of more branches, the feature extractors in IMAM$_{1st}$ obtain better representative ability and diversity among LDs, which resolves the imbalance shift to some extent. Besides, both IMAM$_{EMB-Distill}$ and IMAM$_{CLS-Distill}$ perform better than M-Emb, and the full IMAM performs best. Therefore, both the embedding and the classifier distillation steps are indispensable in IMAM.

Tab.6 Performance comparison between IMAM and its variants on the Med dataset

Method	H-Acc	I-Acc	M²F1	M²AUC
M-Head$_{Imb}$	92.05	78.73	78.56	95.60
M-Emb	91.83	78.08	78.66	96.98
IMAM$_{1st}$	93.05	80.11	78.88	97.42
IMAM$_{CLS-Distill}$	92.77	80.53	78.95	97.63
IMAM$_{EMB-Distill}$	92.14	78.51	78.74	97.52
IMAM	93.10	80.82	81.79	97.71

Does IMAM learn multiple meaningful embedding spaces? To better illustrate the quality of the feature embeddings, we provide t-SNE [64] visualizations of each LD-specific embedding on the Med dataset in Fig.5. We choose the two LDs with more than two classes. The results show that instances of the same class are clustered together in both maps, which indicates that IMAM learns semantically meaningful embeddings.

Fig.5 t-SNE visualization of the learned feature embeddings of IMAM on the Med dataset. Different colors denote different classes

7 Conclusion

In this study, we revisit the practical issue of Multi-Dimensional Classification (MDC) and delve into the distinct imbalance shift phenomenon inherent to MDC paradigms. We highlight the necessity of evaluating MDC methodologies from a dimension-wise standpoint and introduce two dimension-wise metrics. Furthermore, we present an efficient approach, denoted as IMAM, for MDC that operates through a ‘decompose-and-fuse’ process. During the decomposition phase, IMAM’s dimension-oriented capabilities are optimized by recasting MDC as a series of multi-class classification problems, with imbalance-aware deep models applied independently to each dimension. In the fusion stage, we distill the comparative and predictive capacities of multiple models, developed in the prior phase, into a more compact model. Experimental evaluations on an array of real-world datasets, comparing various MDC methods—encompassing earlier MDC techniques, deep baseline approaches, and IMAM—demonstrate the state-of-the-art performance of IMAM in both instance-wise and dimension-wise assessments.

Yi Shi received the master’s degree in computer science from Nanjing University, China in 2022. In the same year, he became a PhD student with the School of Artificial Intelligence, Nanjing University, China. His research interests are mainly in machine learning and data mining, including distance metric learning, multi-modal/multi-task learning, and semantic mining.

Hanjia Ye received the PhD degree in computer science from Nanjing University, China in 2019. In the same year, he became a faculty member with the School of Artificial Intelligence, Nanjing University, China. His research interests lie primarily in machine learning, including distance metric learning, multi-modal/multi-task learning, meta-learning, and semantic mining. He serves as a PC member for leading conferences such as ICLR, CVPR, ICCV, ICML, and NeurIPS.

Dongliang Man received the BA degree in medicine clinical laboratory diagnostics from China Medical University, China in 2010. In the same year, he became a faculty member with the Department of Laboratory Medicine at the First Hospital of China Medical University, China. He is currently a lecturer and PhD student with China Medical University, China. He has authored or coauthored more than 10 papers in national and international journals. His research interests include laboratory medicine and laboratory medicine intelligence.

Xiaoxu Han received the MD degree in clinical medicine from China Medical University, China in 2006. In the same year, she became a faculty member with the First Hospital of China Medical University, China. She is currently a professor with China Medical University. She has authored or coauthored more than 100 papers in national and international journals. Her research interests include molecular diagnostic techniques and laboratory medicine intelligence. She is an invited reviewer for J Biol Chem, BMC Genomics, Virologica Sinica, and other journals. She is currently the vice director of the National Clinical Research Center for Laboratory Medicine, vice director of the Department of Laboratory Medicine at the First Hospital of China Medical University, and director of the Laboratory of Molecular Biology at the Key Laboratory of AIDS Immunology of the National Health Care Commission.

Dechuan Zhan received the PhD degree in computer science from Nanjing University, China in 2010. In the same year, he became a faculty member in the Department of Computer Science and Technology at Nanjing University, China. He has been a professor with the School of Artificial Intelligence at Nanjing University since 2019. His research interests are mainly in machine learning and data mining. To date, he has published over 90 papers in national and international journals and conferences such as TPAMI, TKDD, TIFS, TSMSB, IJCAI, AAAI, ICML, and NeurIPS. He serves as an editorial board member of IDA and IJAPR, and as an SPC/PC member for leading conferences such as IJCAI, AAAI, ICML, and NeurIPS. He is the deputy director of the LAMDA group, NJU.

Yuan Jiang is a full professor in the School of Artificial Intelligence, Nanjing University, China. She received the PhD degree in computer science from Nanjing University, China in 2004. Her research interests are mainly in machine learning, data mining, and artificial intelligence applications. She has published more than 50 papers in leading international/national journals and conferences. She was selected for the Program for New Century Excellent Talents in University, Ministry of Education in 2009.

References

[1]	Zhang C, Yankov D, Wu C T, Shapiro S, Hong J, Wu W. What is that building?: an end-to-end system for building recognition from streetside images. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020, 2425−2433

[2]	Sahoo D, Wang H, Shu K, Wu X, Le H, Achananuparp P, Lim E P, Hoi S C H. FoodAI: food image recognition via deep learning for smart food logging. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019, 2260−2268

[3]	Borisyuk F, Gordo A, Sivakumar V. Rosetta: large scale system for text detection and recognition in images. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018, 71−79

[4]	Yang X, Zeng Z, Teo S G, Wang L, Chandrasekhar V, Hoi S. Deep learning for practical image recognition: case study on kaggle competitions. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018, 923−931

[5]	Wang Z, Long C, Cong G, Ju C. Effective and efficient sports play retrieval with deep representation learning. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019, 499−509

[6]	Huang J T, Sharma A, Sun S, Xia L, Zhang D, Pronin P, Padmanabhan J, Ottaviano G, Yang L. Embedding-based retrieval in facebook search. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020, 2553−2561

[7]	Jia X, Zhao H, Lin Z, Kale A, Kumar V. Personalized image retrieval with sparse graph representation learning. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020, 2735−2743

[8]	Tan R, Vasileva M I, Saenko K, Plummer B A. Learning similarity conditions without explicit supervision. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 10373−10382

[9]	Lin Y L, Tran S D, Davis L S. Fashion outfit complementary item retrieval. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 3311−3319

[10]	Kim D, Saito K, Mishra S, Sclaroff S, Saenko K, Plummer B A. Self-supervised visual attribute learning for fashion compatibility. In: Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. 2021, 1057−1066

[11]	Wang Z, Wang Y, Feng B, Mudigere D, Muthiah B, Ding Y. El-rec: efficient large-scale recommendation model training via tensor-train embedding table. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2022, 1−14

[12]	Kononenko I. Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine, 2001, 23(1): 89–109

[13]	Amato F, López A, María Peña-Méndez E, Vaňhara P, Hampl A, Havel J. Artificial neural networks in medical diagnosis. Journal of Applied Biomedicine, 2013, 11(2): 47–58

[14]	Turnbull D, Barrington L, Torres D, Lanckriet G. Semantic annotation and retrieval of music and sound effects. IEEE Transactions on Audio, Speech, and Language Processing, 2008, 16(2): 467–476

[15]	Serafino F, Pio G, Ceci M, Malerba D. Hierarchical multidimensional classification of web documents with MultiWebClass. In: Proceedings of the 18th International Conference on Digital Society. 2015, 236−250

[16]	Shatkay H, Pan F, Rzhetsky A, Wilbur W J. Multi-dimensional classification of biomedical text: toward automated, practical provision of high-utility text to diverse users. Bioinformatics, 2008, 24(18): 2086–2093

[17]	Barutcuoglu Z, Schapire R E, Troyanskaya O G. Hierarchical multi-label prediction of gene function. Bioinformatics, 2006, 22(7): 830–836

[18]	Feng B, Wang Y, Ding Y. Saga: sparse adversarial attack on EEG-based brain computer interface. In: Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. 2021, 975−979

[19]	Jia B B, Zhang M L. Multi-dimensional classification via sparse label encoding. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 4917−4926

[20]	Wang H, Chen C, Liu W, Chen K, Hu T, Chen G. Incorporating label embedding and feature augmentation for multi-dimensional classification. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020, 6178−6185

[21]	Ma Z, Chen S. Multi-dimensional classification via a metric approach. Neurocomputing, 2018, 275: 1121–1131

[22]	Read J, Pfahringer B, Holmes G, Frank E. Classifier chains for multi-label classification. Machine Learning, 2011, 85(3): 333–359

[23]	Zhang M L, Zhou Z H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(8): 1819–1837

[24]	Jia B B, Zhang M L. Multi-dimensional classification via kNN feature augmentation. Pattern Recognition, 2020, 106: 107423

[25]	Wu T F, Lin C J, Weng R C. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 2004, 5: 975–1005

[26]	Zhang M L, Zhou Z H. A k-nearest neighbor based algorithm for multi-label classification. In: Proceedings of 2005 IEEE International Conference on Granular Computing. 2005, 718−721

[27]	Tang L, Rajan S, Narayanan V K. Large scale multi-label classification via metalabeler. In: Proceedings of the 18th International Conference on World Wide Web. 2009, 211−220

[28]	Wu J H, Wu X, Chen Q G, Hu Y, Zhang M L. Feature-induced manifold disambiguation for multi-view partial multi-label learning. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020, 557−565

[29]	Jia B B, Zhang M L. Multi-dimensional classification via selective feature augmentation. Machine Intelligence Research, 2022, 19(1): 38–51

[30]	Jia B B, Zhang M L. Multi-dimensional classification via stacked dependency exploitation. Science China Information Sciences, 2020, 63(12): 222102

[31]	Zhang Y, Zhao P, Cao J, Ma W, Huang J, Wu Q, Tan M. Online adaptive asymmetric active learning for budgeted imbalanced data. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018, 2768−2777

[32]	Wang J, Zhang M L. Towards mitigating the class-imbalance problem for partial label learning. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018, 2427−2436

[33]	Marchant N G, Rubinstein B I P. Needle in a haystack: label-efficient evaluation under extreme class imbalance. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021, 1180−1190

[34]	Feng B, Wang Y, Li G, Xie Y, Ding Y. Palleon: a runtime system for efficient video processing toward dynamic class skew. In: Proceedings of 2021 USENIX Annual Technical Conference. 2021, 427−441

[35]	Buda M, Maki A, Mazurowski M A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 2018, 106: 249–259

[36]	Liu Z, Miao Z, Zhan X, Wang J, Gong B, Yu S X. Large-scale long-tailed recognition in an open world. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 2537−2546

[37]	Shen L, Lin Z, Huang Q. Relay backpropagation for effective learning of deep convolutional neural networks. In: Proceedings of the 14th European Conference on Computer Vision. 2016, 467−482

[38]	Wang S, Liu W, Wu J, Cao L, Meng Q, Kennedy P J. Training deep neural networks on imbalanced data sets. In: Proceedings of 2016 International Joint Conference on Neural Networks. 2016, 4368−4374

[39]	Cui Y, Jia M, Lin T Y, Song Y, Belongie S. Class-balanced loss based on effective number of samples. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 9268−9277

[40]	Cao K, Wei C, Gaidon A, Aréchiga N, Ma T. Learning imbalanced datasets with label-distribution-aware margin loss. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 140

[41]	Ye H J, Chen H Y, Zhan D C, Chao W L. Identifying and compensating for feature deviation in imbalanced deep learning. 2020, arXiv preprint arXiv:2001.01385

[42]	Ren J, Yu C, Sheng S, Ma X, Zhao H, Yi S, Li H. Balanced meta-softmax for long-tailed visual recognition. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 351

[43]	Zhou Z H. Learnware: on the future of machine learning. Frontiers of Computer Science, 2016, 10(4): 589–590

[44]	Kuzborskij I, Orabona F. Fast rates by transferring from auxiliary hypotheses. Machine Learning, 2017, 106(2): 171–195

[45]	Du S S, Koushik J, Singh A, Póczos B. Hypothesis transfer learning via transformation functions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 574−584

[46]	Li X, Grandvalet Y, Davoine F. Explicit inductive bias for transfer learning with convolutional networks. In: Proceedings of the 35th International Conference on Machine Learning. 2018, 2830−2839

[47]	Srinivas S, Fleuret F. Knowledge transfer with Jacobian matching. In: Proceedings of the 35th International Conference on Machine Learning. 2018, 4730−4738

[48]	Ye H J, Zhan D C, Jiang Y, Zhou Z H. Rectify heterogeneous models with semantic mapping. In: Proceedings of the 35th International Conference on Machine Learning. 2018, 1904−1913

[49]	Ye H J, Zhan D C, Jiang Y, Zhou Z H. Heterogeneous few-shot model rectification with semantic mapping. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(11): 3878–3891

[50]	Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. 2015, arXiv preprint arXiv:1503.02531

[51]	Phuong M, Lampert C. Towards understanding knowledge distillation. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 5142−5151

[52]	Gotmare A, Keskar N S, Xiong C, Socher R. A closer look at deep learning heuristics: learning rate restarts, warmup and distillation. In: Proceedings of the 7th International Conference on Learning Representations. 2019

[53]	Heo B, Kim J, Yun S, Park H, Kwak N, Choi J Y. A comprehensive overhaul of feature distillation. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 1921−1930

[54]	Cho J H, Hariharan B. On the efficacy of knowledge distillation. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 4794−4802

[55]	Sau B B, Balasubramanian V N. Deep model compression: distilling knowledge from noisy teachers. 2016, arXiv preprint arXiv:1610.09650

[56]	Wang Q, Zhan L, Thompson P, Zhou J. Multimodal learning with incomplete modalities by knowledge distillation. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020, 1828−1838

[57]	Liang S, Gong M, Pei J, Shou L, Zuo W, Zuo X, Jiang D. Reinforced iterative knowledge distillation for cross-lingual named entity recognition. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021, 3231−3239

[58]	Zhang W, Jiang Y, Li Y, Sheng Z, Shen Y, Miao X, Wang L, Yang Z, Cui B. ROD: reception-aware online distillation for sparse graphs. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021, 2232−2242

[59]	Xu C, Li Q, Ge J, Gao J, Yang X, Pei C, Sun F, Wu J, Sun H, Ou W. Privileged features distillation at Taobao recommendations. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020, 2590−2598

[60]	Yu A, Grauman K. Fine-grained visual comparisons with local learning. In: Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. 2014, 192−199

[61]	He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770−778

[62]	Liu C, Zhao P, Huang S J, Jiang Y, Zhou Z H. Dual set multi-label learning. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 2018, 3635−3642

[63]	Ge Y, Abu-El-Haija S, Xin G, Itti L. Zero-shot synthesis with group-supervised learning. In: Proceedings of the 9th International Conference on Learning Representations. 2021

[64]	van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008, 9(86): 2579−2605

Acknowledgments

This research was supported by the National Key R&D Program of China (2020AAA0109401, 2020AAA0109405), grants 62376118, 62006112, 62250069, and 62206245, the Young Elite Scientists Sponsorship Program of the Jiangsu Association for Science and Technology (2021-020), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Competing interests

The authors declare that they have no competing interests or financial conflicts to disclose.

RIGHTS & PERMISSIONS

2025 Higher Education Press

Supplementary files

FCS-23272-OF-YS_suppl_1 (432 KB)

Received	Accepted	Published
03 Apr 2023	06 Nov 2023	15 Jan 2025
Just Accepted Date	Issue Date
14 Nov 2023	14 Mar 2024
