Similarity-based multi-dimensional multi-label classification

Zi-Zhan GU , Bin-Bin JIA , Min-Ling ZHANG

Front. Comput. Sci. ›› 2026, Vol. 20 ›› Issue (2): 2002322 DOI: 10.1007/s11704-025-41432-y

Artificial Intelligence
RESEARCH ARTICLE

Abstract

In multi-dimensional multi-label classification (MDML), a number of heterogeneous label spaces are assumed to characterize the rich semantics of one object from different dimensions, and a set of proper labels can be assigned to the object from each heterogeneous label space. In recent years, the similarity-based framework has achieved promising performance in classification tasks (e.g., multi-class/multi-label classification), while its effectiveness has not been investigated in solving MDML problems. Moreover, existing similarity-based approaches only utilize either instance-based or label-based information, which limits their generalization ability. In this paper, we propose a novel similarity-based MDML approach named SIDLE, which attempts to utilize both instance-based and label-based information. To extract similarity information, SIDLE first identifies the k nearest neighbors in the instance space and in an enhanced label space, respectively. Then, with these identified samples, SIDLE calculates simple counting statistics based on their labels as well as a bias based on the distance between the sample and these identified samples. Finally, the instance space is enriched with the extracted similarity information, which in turn updates the instance space and the enhanced label space. These three steps are conducted iteratively until convergence. Experiments validate the effectiveness of the proposed SIDLE approach.

Keywords

machine learning / multi-dimensional multi-label classification / similarity-based learning / k nearest neighbor / feature augmentation

Cite this article

Zi-Zhan GU, Bin-Bin JIA, Min-Ling ZHANG. Similarity-based multi-dimensional multi-label classification. Front. Comput. Sci., 2026, 20(2): 2002322 DOI:10.1007/s11704-025-41432-y

1 Introduction

In supervised learning, the most commonly-used assumption is that each object is annotated with a unique class label from a single label space, e.g., multi-class classification. However, the rich semantics of objects in some applications need to be characterized by encompassing heterogeneous label spaces and multi-label annotations. For example, songs can be categorized based on the emotions they contain, the genres they belong to, and the scenarios in which they are suitable to play. Moreover, each song might contain different kinds of emotions (e.g., happy, sad, relaxed), belong to different kinds of genres (e.g., classical, rock, popular), and be suitable to play in different kinds of scenarios (e.g., bedtime, wedding, walking). These problems can be naturally formalized under the multi-dimensional multi-label classification (MDML) setting [1]. Here, the multi-dimensional semantics of one MDML object is characterized by a number of heterogeneous label spaces, and multiple proper labels from each heterogeneous label space can be assigned to the object to characterize its ambiguous semantics along this dimension. Specifically, MDML problems can be found in different kinds of applications [2–5].

To learn from MDML samples, the most intuitive way is to decompose the original MDML problem into q multi-label classification problems, one per label space, or to concatenate the q label spaces into a single multi-label classification problem. However, the former completely ignores potential label correlations across different label spaces, and the latter ignores the multi-dimensional nature of the semantics of MDML objects. The pioneering work CLIM considers label correlations within each dimension and across multiple dimensions in different ways [1]. To consider label correlations within each dimension, CLIM maximizes the likelihood of relevant labels, relying on the fact that predictive confidences of labels from the same label space are comparable. To consider label correlations across multiple dimensions, CLIM manipulates the feature space by augmenting it with the predictions for each label space.

Similarity-based learning techniques have shown promising performance in different kinds of classification tasks [6–13], while their effectiveness has not been investigated in solving MDML problems. Moreover, existing similarity-based approaches only utilize either instance-based or label-based information, which limits their generalization ability. In this paper, we make a first attempt to adapt the similarity-based learning technique to MDML, and propose a novel approach named SImilarity-based multi-Dimensional multi-LabEl classification (SIDLE). SIDLE extracts both instance-based and label-based similarity information by identifying k nearest neighbors. For instance-based information, the identification is conducted in the original instance space, while for label-based information, it is conducted in an enhanced label space, which is initialized by optimizing a cross-entropy-like loss. With these identified samples, similarity information is represented by simple counting statistics based on their labels as well as a bias based on the distance between the sample and these identified samples. SIDLE combines instance-based and label-based similarity information with an element-wise addition and enriches the instance space with the extracted similarity information. As the instance space is enriched, the enhanced label space is also updated. These steps are conducted iteratively until convergence. Comparative studies clearly demonstrate the superiority of the proposed SIDLE approach against state-of-the-art baselines.

The rest of this paper is organized as follows. In Section 2, related works on MDML are briefly discussed. In Section 3, technical details of the proposed SIDLE approach are introduced. In Section 4, experimental results of comparative studies are reported. Finally, we conclude this paper in Section 5.

2 Related work

The learning frameworks most closely related to MDML include multi-class classification (MCC), multi-label classification (MLC) [14–18], and multi-dimensional classification (MDC) [19–22]. As shown in Tab.1, MDML generalizes multi-label classification from a single homogeneous label space to multiple heterogeneous label spaces, and generalizes multi-dimensional classification from the unique-relevant-label assumption in each label space to multiple relevant labels.

The study of MDML was initiated by [1], where the MDML problem is formalized and a specially designed MDML approach named CLIM is proposed. On one hand, based on the assumption that predictive confidences of labels from the same label space are comparable, CLIM considers the label correlations within each label space by maximizing the likelihood of relevant labels. On the other hand, motivated by [8,23], CLIM augments the instance space with binary predictions of all label spaces to update the predictive model.

Similarity-based classification refers to a category of classifiers that make their judgment based on the computed similarity (e.g., Euclidean distance) between the target instance and the set of training instances [10]. Existing works can be roughly categorized into instance-based and label-based approaches according to the type of similarity information used, whose representative algorithms are the k nearest neighbors (kNN) classifier [24] and the nearest class mean (NCM) classifier [25], respectively. kNN determines the prediction via majority voting according to the labels of the k nearest neighbors identified in the instance space. NCM calculates the mean instance vector for each label and assigns an unseen instance to the class label with the closest mean. Similarity-based learning techniques have been utilized to solve many learning problems and applied in real-world applications, including MCC [10,25,26], MLC [7,27], MDC [8,9,28], multiple-instance classification [6], drug-target interaction prediction [29], etc.

However, similarity-based techniques have not been utilized to deal with the MDML problem. Moreover, existing works utilize either instance-based or label-based similarity information, which limits their generalization ability. In the next section, we present the SIDLE approach, which solves the MDML problem with both instance-based and label-based similarity information.

3 The SIDLE approach

3.1 Problem formulation

Formally, let $\mathcal{X}=\mathbb{R}^d$ denote the $d$-dimensional input space and $\mathcal{Y}=\bigcup_{j=1}^{q}\mathcal{Y}_j$ be the output space, which is the union of $q$ heterogeneous label spaces. Here, each label space $\mathcal{Y}_j=\{y_1^j,y_2^j,\ldots,y_{K_j}^j\}$ $(1\le j\le q)$ comprises $K_j$ labels, and the total number of labels across all $q$ label spaces is given by $K=\sum_{j=1}^{q}K_j$. Let $\mathcal{D}=\{(\boldsymbol{x}_i,\boldsymbol{l}_i)\mid 1\le i\le m\}$ represent the MDML training set consisting of $m$ training samples. The task of MDML is to learn a mapping function $f:\mathcal{X}\mapsto\mathcal{Y}$ from $\mathcal{D}$, enabling it to assign an appropriate set of labels to any unseen instance $\boldsymbol{x}$, denoted as $\hat{\boldsymbol{l}}=f(\boldsymbol{x})$. For the $i$th sample $(\boldsymbol{x}_i,\boldsymbol{l}_i)\in\mathcal{D}$, where $\boldsymbol{x}_i=[x_{i1},x_{i2},\ldots,x_{id}]\in\mathcal{X}$ is the length-$d$ feature vector, its associated label vector is usually represented as a length-$K$ binary label vector $\boldsymbol{l}_i=[\boldsymbol{l}_i^1;\boldsymbol{l}_i^2;\ldots;\boldsymbol{l}_i^q]\in\{0,1\}^K$ for notational convenience. Here, $\boldsymbol{l}_i^j=[l_{i1}^j,l_{i2}^j,\ldots,l_{iK_j}^j]\in\{0,1\}^{K_j}$ signifies the relevance of labels within the $j$th label space. Specifically, the $a$th element $l_{ia}^j=1$ indicates that the $a$th label (i.e., $y_a^j$) in the $j$th label space (i.e., $\mathcal{Y}_j$) is relevant to $\boldsymbol{x}_i$, and $l_{ia}^j=0$ suggests otherwise. To facilitate understanding, we summarize these notations in Tab.2.

The technical details of SIDLE can be divided into three parts, including k nearest neighbors identification, similarity information extraction, and predictive model updating. Following the notations in Tab.2, we will present these technical details in the rest of this section.

3.2 k nearest neighbors identification

Given one instance $\boldsymbol{x}\in\mathbb{R}^d$, to identify its $k$ nearest neighbors in the MDML training set $\mathcal{D}=\{(\boldsymbol{x}_i,\boldsymbol{l}_i)\mid 1\le i\le m\}$, in this paper, we simply calculate the Euclidean distance between $\boldsymbol{x}$ and each training sample $\boldsymbol{x}_i$ in $\mathcal{D}$:

$$\mathrm{d}(\boldsymbol{x},\boldsymbol{x}_i)=\|\boldsymbol{x}-\boldsymbol{x}_i\|_2,\quad(1\le i\le m). \tag{1}$$

Then we sort the obtained $m$ distances and identify the $k$ smallest ones among them. For the identified $k$ nearest training samples of $\boldsymbol{x}$, their indices are stored as follows:

$$\mathcal{N}_k^{I}(\boldsymbol{x})=\{i_r\mid 1\le r\le k\}. \tag{2}$$

For convenience, the indices $i_r$ are arranged so that the distances between $\boldsymbol{x}$ and $\boldsymbol{x}_{i_r}$ are in ascending order. In other words, among the identified $k$ nearest training samples, $\boldsymbol{x}_{i_1}$ is the closest one to $\boldsymbol{x}$ and $\boldsymbol{x}_{i_k}$ is the farthest one. Note that Eq. (1) measures the similarity among samples in the instance space, so we refer to the identified samples as instance-based $k$ nearest neighbors. In Eq. (2), we use the superscript $I$ (i.e., the first letter of Instance) to indicate this meaning.
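
As a concrete, non-authoritative illustration of Eqs. (1) and (2), the following Python/NumPy sketch identifies the instance-based $k$ nearest neighbors by exhaustive Euclidean distance computation; the function name instance_knn and the toy data are illustrative assumptions, not part of the reported MATLAB implementation.

```python
import numpy as np

def instance_knn(x, X_train, k):
    """Identify the instance-based k nearest neighbours of x (Eqs. (1)-(2)).

    x       : (d,) query instance
    X_train : (m, d) training instance matrix
    Returns the k training indices sorted by ascending Euclidean distance,
    i.e., the first index is the closest neighbour.
    """
    dists = np.linalg.norm(X_train - x, axis=1)   # d(x, x_i) = ||x - x_i||_2
    return np.argsort(dists)[:k]                  # N_k^I(x) = {i_1, ..., i_k}

# Toy usage: five training samples in R^3, query the two nearest neighbours.
X = np.array([[0., 0., 0.], [1., 0., 0.], [0., 2., 0.], [3., 3., 3.], [1., 1., 0.]])
print(instance_knn(np.array([0.9, 0.1, 0.0]), X, k=2))   # e.g. [1 0]
```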

However, for an unseen instance, as its labeling information is unknown, we cannot directly identify its $k$ nearest neighbors in the label space. Motivated by the idea of label enhancement [30], we choose to learn an enhanced label space by inducing a predictive model. Without loss of generality, let $\Theta^j=[\boldsymbol{\theta}_1^j,\ldots,\boldsymbol{\theta}_{K_j}^j]\in\mathbb{R}^{d\times K_j}$ denote the model parameters for the $j$th label space $(1\le j\le q)$. We determine these parameters by optimizing the following cross-entropy-like loss over the MDML training set $\mathcal{D}$:

$$\min_{\Theta^j}\ -\sum_{i=1}^{m}\sum_{a=1}^{K_j} l_{ia}^j \ln\frac{e^{\langle\boldsymbol{\theta}_a^j,\boldsymbol{x}_i\rangle}}{\sum_{s=1}^{K_j}e^{\langle\boldsymbol{\theta}_s^j,\boldsymbol{x}_i\rangle}}+\frac{\lambda}{2}\|\Theta^j\|_F^2, \tag{3}$$

where $\langle\cdot,\cdot\rangle$ denotes the inner product of two vectors, and $\lambda$ is the trade-off parameter that balances the importance of the two terms. Since a closed-form solution to Eq. (3) cannot be derived, we optimize it via gradient descent.
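
Since Eq. (3) is a regularized softmax cross-entropy objective, plain batch gradient descent suffices for a sketch. The following Python/NumPy snippet is a minimal illustration under that assumption; the learning rate, iteration count, regularization value, and the function name train_dimension are illustrative and not taken from the paper.

```python
import numpy as np

def train_dimension(X, L_j, lam=1.0, lr=0.01, n_iter=500):
    """Batch gradient descent for the cross-entropy-like loss of Eq. (3).

    X     : (m, d) instance matrix
    L_j   : (m, K_j) binary relevance matrix for the j-th label space
    lam   : trade-off parameter lambda (illustrative value)
    Returns Theta^j of shape (d, K_j).
    """
    m, d = X.shape
    K_j = L_j.shape[1]
    Theta = np.zeros((d, K_j))
    for _ in range(n_iter):
        scores = X @ Theta                              # <theta_a^j, x_i> for all a, i
        scores -= scores.max(axis=1, keepdims=True)     # numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)               # softmax over the K_j labels
        # Gradient of -sum_i sum_a l_ia^j * ln softmax_a(x_i) plus the L2 term:
        # each softmax row is scaled by the number of relevant labels of that sample.
        grad = X.T @ (L_j.sum(axis=1, keepdims=True) * P - L_j) + lam * Theta
        Theta -= lr * grad
    return Theta
```

Note that, because an instance may have several relevant labels within one dimension, each softmax row in the gradient is scaled by the number of relevant labels of that sample, which follows directly from differentiating the loss in Eq. (3).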

With the determined model parameters $\Theta^j$ $(1\le j\le q)$, for each training sample $(\boldsymbol{x}_i,\boldsymbol{l}_i)\in\mathcal{D}$, we can transform its length-$K$ binary label vector $\boldsymbol{l}_i=[\boldsymbol{l}_i^1;\ldots;\boldsymbol{l}_i^j;\ldots;\boldsymbol{l}_i^q]\in\{0,1\}^K$ into a length-$K$ real-valued label vector $\boldsymbol{p}_i=[\boldsymbol{p}_i^1;\ldots;\boldsymbol{p}_i^j;\ldots;\boldsymbol{p}_i^q]\in\mathbb{R}^K$, where $\boldsymbol{p}_i^j$ is defined as follows:

$$\boldsymbol{p}_i^j=(\Theta^j)^\top\boldsymbol{x}_i,\quad(1\le j\le q). \tag{4}$$

Given one instance $\boldsymbol{x}$ without labels, we can also determine its length-$K$ real-valued label vector $\boldsymbol{p}$ in a similar way. Then we can calculate the Euclidean distance between $\boldsymbol{p}$ and each transformed label vector $\boldsymbol{p}_i$:

$$\mathrm{d}(\boldsymbol{p},\boldsymbol{p}_i)=\|\boldsymbol{p}-\boldsymbol{p}_i\|_2,\quad(1\le i\le m). \tag{5}$$

By identifying the $k$ smallest distances, we can identify the label-based $k$ nearest training samples of $\boldsymbol{x}$ and store their indices as follows:

$$\mathcal{N}_k^{L}(\boldsymbol{x})=\{i_r\mid 1\le r\le k\}. \tag{6}$$

Here, we use the superscript $L$ (i.e., the first letter of Label) to indicate that $\mathcal{N}_k^{L}(\boldsymbol{x})$ stores the set of indices for $\boldsymbol{x}$'s $k$ nearest neighbors identified in the label space.
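
Under the same illustrative assumptions as before, the label-based neighbors of Eqs. (4)–(6) can be sketched by first mapping instances into the enhanced label space with the learned parameters and then reusing Euclidean nearest-neighbor search; label_knn below is a hypothetical helper, not the authors' code.

```python
import numpy as np

def label_knn(x, X_train, Thetas, k):
    """Identify the label-based k nearest neighbours of x (Eqs. (4)-(6)).

    Thetas : list of q parameter matrices Theta^j of shape (d, K_j),
             e.g. obtained by optimizing Eq. (3) for each label space.
    """
    def enhance(A):
        # Enhanced label representation: concatenate (Theta^j)^T x over all j (Eq. (4)).
        return np.concatenate([A @ Theta_j for Theta_j in Thetas], axis=-1)

    p = enhance(x)                               # length-K real-valued label vector of x
    P_train = enhance(X_train)                   # (m, K) enhanced vectors of training set
    dists = np.linalg.norm(P_train - p, axis=1)  # d(p, p_i) = ||p - p_i||_2, Eq. (5)
    return np.argsort(dists)[:k]                 # N_k^L(x), Eq. (6)
```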

3.3 Similarity information extraction

Motivated by the rationale of the kNN classifier, we extract similarity information for one instance with simple counting statistics based on the labels of its $k$ nearest neighbors. For convenience, we temporarily use $\mathcal{N}_k(\boldsymbol{x})=\{i_r\mid 1\le r\le k\}$ to denote the set of indices of $\boldsymbol{x}$'s $k$ nearest neighbors and no longer distinguish whether they are instance-based or label-based in this subsection.

According to the notations used in previous sections, for the $r$th nearest neighbor of $\boldsymbol{x}$, $l_{i_r a}^j=1$ indicates that the $a$th label (i.e., $y_a^j$) in the $j$th label space (i.e., $\mathcal{Y}_j$) is relevant to $\boldsymbol{x}_{i_r}$, and $l_{i_r a}^j=0$ suggests otherwise. Here, $i_r\in\mathcal{N}_k(\boldsymbol{x})$ and $1\le r\le k$. Then we can define the following length-$k$ indicating vector w.r.t. $y_a^j$ over $\boldsymbol{x}$'s $k$ nearest neighbors:

$$\boldsymbol{l}_{ja}^{\boldsymbol{x}}=[l_{i_1 a}^{j},l_{i_2 a}^{j},\ldots,l_{i_k a}^{j}]\in\{0,1\}^k. \tag{7}$$

Let $\boldsymbol{1}_k$ be a length-$k$ column vector of all ones. We can calculate the following statistic:

$$\delta_{ja}^{\boldsymbol{x}}=\langle\boldsymbol{1}_k,\boldsymbol{l}_{ja}^{\boldsymbol{x}}\rangle,\quad(1\le j\le q,\ 1\le a\le K_j). \tag{8}$$

It is easy to see that $\delta_{ja}^{\boldsymbol{x}}$ records the number of samples associated with $y_a^j$ among $\boldsymbol{x}$'s $k$ nearest neighbors, and $0\le\delta_{ja}^{\boldsymbol{x}}\le k$ always holds.
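
A minimal sketch of Eqs. (7) and (8): given the binary label matrix of the training set and the (ordered) neighbor indices, the counting statistics for all $K$ labels are simply column sums over the neighbors' label rows. The function name counting_statistics and the toy matrix are assumptions for illustration only.

```python
import numpy as np

def counting_statistics(neighbor_idx, L_train):
    """Counting statistics delta_ja^x for all K labels at once (Eqs. (7)-(8)).

    neighbor_idx : indices of x's k nearest neighbours (instance- or label-based)
    L_train      : (m, K) binary label matrix, columns ordered dimension by dimension
    Each returned entry counts how many of the k neighbours are relevant to the
    corresponding label, so it always lies in [0, k].
    """
    return L_train[neighbor_idx].sum(axis=0)     # <1_k, l_ja^x> for every label

# Toy usage: k = 3 neighbours, K = 4 labels.
L = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 1]])
print(counting_statistics(np.array([0, 1, 2]), L))   # -> [2 2 2 0]
```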

Note that in the calculation of the above $\delta_{ja}^{\boldsymbol{x}}$, the $k$ nearest neighbors contribute equally. Generally, closer neighbors are of greater importance than farther neighbors in determining the final judgment. Thus, it might be better to assign nonuniform weights to the $k$ nearest neighbors based on their distances from the target sample. In this paper, we simply set $\boldsymbol{w}=[1,1/2,\ldots,1/k]$ as the weight vector, so that the weight for the closest neighbor is 1 and the weights gradually decrease. Then we define a corresponding bias $\epsilon_{ja}^{\boldsymbol{x}}$ for $\delta_{ja}^{\boldsymbol{x}}$ as follows:

$$\epsilon_{ja}^{\boldsymbol{x}}=\frac{\langle\boldsymbol{w},\boldsymbol{l}_{ja}^{\boldsymbol{x}}\rangle-\min(\boldsymbol{l}_{ja}^{\boldsymbol{x}})}{\max(\boldsymbol{l}_{ja}^{\boldsymbol{x}})-\min(\boldsymbol{l}_{ja}^{\boldsymbol{x}})}\cdot(\epsilon_{\max}-\epsilon_{\min})+\epsilon_{\min}. \tag{9}$$

Here, $\max(\boldsymbol{l}_{ja}^{\boldsymbol{x}})$ and $\min(\boldsymbol{l}_{ja}^{\boldsymbol{x}})$ represent the possible maximum and minimum of $\langle\boldsymbol{w},\boldsymbol{l}_{ja}^{\boldsymbol{x}}\rangle$, respectively. It is not difficult to see that $\max(\boldsymbol{l}_{ja}^{\boldsymbol{x}})$ corresponds to the sum of the first $\delta_{ja}^{\boldsymbol{x}}$ elements of $\boldsymbol{w}$ and $\min(\boldsymbol{l}_{ja}^{\boldsymbol{x}})$ corresponds to the sum of the last $\delta_{ja}^{\boldsymbol{x}}$ elements of $\boldsymbol{w}$. In other words, their definitions can be respectively given as follows:

$$\max(\boldsymbol{l}_{ja}^{\boldsymbol{x}})=\sum_{r=1}^{\delta_{ja}^{\boldsymbol{x}}}w(r),\qquad \min(\boldsymbol{l}_{ja}^{\boldsymbol{x}})=\sum_{r=k-\delta_{ja}^{\boldsymbol{x}}+1}^{k}w(r), \tag{10}$$

where $w(r)$ denotes the $r$th element of the weight vector $\boldsymbol{w}$. $\epsilon_{\max}$ and $\epsilon_{\min}$ are two hyper-parameters that control the range of $\epsilon_{ja}^{\boldsymbol{x}}$. In this paper, we set $\epsilon_{\max}$ to 0.5 and $\epsilon_{\min}$ to 0.

In fact, Eq. (9) acts as a normalization process: it normalizes the weighted count $\langle\boldsymbol{w},\boldsymbol{l}_{ja}^{\boldsymbol{x}}\rangle$ into the range $[\epsilon_{\min},\epsilon_{\max}]$, similar to min-max normalization. Therefore, it fine-tunes Eq. (8), which serves as the primary component for augmenting features.
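
The bias of Eqs. (9) and (10) can be sketched as below for a single label. The paper does not spell out how the degenerate cases $\delta_{ja}^{\boldsymbol{x}}=0$ or $\delta_{ja}^{\boldsymbol{x}}=k$ (where the normalizing denominator vanishes) are handled, so the snippet simply falls back to $\epsilon_{\min}$ there; that fallback, the function name distance_bias, and the toy vector are our assumptions.

```python
import numpy as np

def distance_bias(l_ja, eps_max=0.5, eps_min=0.0):
    """Bias epsilon_ja^x of Eqs. (9)-(10) for a single label y_a^j.

    l_ja : length-k 0/1 vector indicating relevance of y_a^j among the k
           neighbours, ordered from closest to farthest (Eq. (7)).
    The weighted count <w, l_ja> is min-max normalized into [eps_min, eps_max].
    """
    k = len(l_ja)
    w = 1.0 / np.arange(1, k + 1)        # weight vector w = [1, 1/2, ..., 1/k]
    delta = int(np.sum(l_ja))            # counting statistic of Eq. (8)
    hi = w[:delta].sum()                 # max(l_ja): sum of the first delta weights
    lo = w[k - delta:].sum()             # min(l_ja): sum of the last delta weights
    if hi == lo:                         # degenerate cases delta = 0 or delta = k (assumption)
        return eps_min
    return (w @ l_ja - lo) / (hi - lo) * (eps_max - eps_min) + eps_min

# The per-label quantity delta + epsilon then fine-tunes the raw count, e.g.:
l_ja = np.array([1, 0, 1, 0, 0])
print(np.sum(l_ja) + distance_bias(l_ja))   # 2 plus a bias in [0, 0.5]
```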

Thereafter, we combine $\delta_{ja}^{\boldsymbol{x}}$ and $\epsilon_{ja}^{\boldsymbol{x}}$ together:

$$\zeta_{ja}^{\boldsymbol{x}}=\delta_{ja}^{\boldsymbol{x}}+\epsilon_{ja}^{\boldsymbol{x}}.$$

After traversing all labels, we can obtain the following length-$K$ vector containing the similarity information:

$$\boldsymbol{z}_{\boldsymbol{x}}=[\underbrace{\zeta_{11}^{\boldsymbol{x}},\zeta_{12}^{\boldsymbol{x}},\ldots,\zeta_{1K_1}^{\boldsymbol{x}}}_{\text{The 1st dim.}},\underbrace{\zeta_{21}^{\boldsymbol{x}},\zeta_{22}^{\boldsymbol{x}},\ldots,\zeta_{2K_2}^{\boldsymbol{x}}}_{\text{The 2nd dim.}},\ldots,\underbrace{\zeta_{q1}^{\boldsymbol{x}},\zeta_{q2}^{\boldsymbol{x}},\ldots,\zeta_{qK_q}^{\boldsymbol{x}}}_{\text{The }q\text{th dim.}}]. \tag{11}$$
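
Putting the pieces together, a vectorized sketch of the similarity vector $\boldsymbol{z}_{\boldsymbol{x}}$ in Eq. (11) is given below; it computes $\delta_{ja}^{\boldsymbol{x}}+\epsilon_{ja}^{\boldsymbol{x}}$ for all $K$ labels at once. The degenerate-case guard and the function name similarity_vector are illustrative assumptions.

```python
import numpy as np

def similarity_vector(neighbor_idx, L_train, eps_max=0.5, eps_min=0.0):
    """Vectorized length-K similarity vector z_x of Eq. (11).

    neighbor_idx : x's k nearest neighbours, ordered from closest to farthest
    L_train      : (m, K) binary label matrix, columns ordered dimension by dimension
    """
    L_nb = L_train[neighbor_idx].astype(float)   # (k, K) neighbour label rows
    k = L_nb.shape[0]
    w = 1.0 / np.arange(1, k + 1)                # weight vector [1, 1/2, ..., 1/k]
    delta = L_nb.sum(axis=0).astype(int)         # counting statistics, Eq. (8)
    weighted = w @ L_nb                          # <w, l_ja^x> for every label
    head = np.concatenate(([0.0], np.cumsum(w)))        # head[d] = sum of first d weights
    tail = np.concatenate(([0.0], np.cumsum(w[::-1])))  # tail[d] = sum of last d weights
    hi, lo = head[delta], tail[delta]            # max(.) and min(.) used in Eq. (9)
    span = np.where(hi > lo, hi - lo, 1.0)       # guard the degenerate delta = 0 or k
    eps = np.where(hi > lo,
                   (weighted - lo) / span * (eps_max - eps_min) + eps_min,
                   eps_min)                      # bias of Eq. (9)
    return delta + eps                           # zeta = delta + epsilon, stacked as z_x
```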

3.4 Predictive model updating

For the predictive model obtained by optimizing problem Eq. (3), to help improve its generalization performance with the extracted similarity information, we choose to manipulate the feature space. Specifically, for each instance $\boldsymbol{x}_i$, an augmented feature vector can be generated as follows:

$$\boldsymbol{\xi}_i=\boldsymbol{z}_{\boldsymbol{x}_i}^{I}+\boldsymbol{z}_{\boldsymbol{x}_i}^{L}+\boldsymbol{p}_i. \tag{12}$$

Here, $\boldsymbol{z}_{\boldsymbol{x}_i}^{I}$ and $\boldsymbol{z}_{\boldsymbol{x}_i}^{L}$ are the instance-based and label-based versions of the similarity vector in Eq. (11) for $\boldsymbol{x}_i$, and $\boldsymbol{p}_i$ is the real-valued label vector defined in Eq. (4). It is worth noting that, since $\boldsymbol{p}_i$ is a probability vector while $\boldsymbol{z}_{\boldsymbol{x}_i}^{I}$ and $\boldsymbol{z}_{\boldsymbol{x}_i}^{L}$ are counting statistics, both $\boldsymbol{z}_{\boldsymbol{x}_i}^{I}$ and $\boldsymbol{z}_{\boldsymbol{x}_i}^{L}$ are normalized into the range $[0,1]$ before the addition, so that the three parts carry comparable importance in the augmented feature. Then the MDML dataset can be transformed into the new version:

$$\tilde{\mathcal{D}}=\{(\tilde{\boldsymbol{x}}_i,\boldsymbol{l}_i)\mid 1\le i\le m\},\quad\text{where }\ \tilde{\boldsymbol{x}}_i=[\boldsymbol{x}_i;\boldsymbol{\xi}_i], \tag{13}$$

where $[\boldsymbol{x}_i;\boldsymbol{\xi}_i]$ denotes concatenation. With the transformed dataset, the optimization problem in Eq. (3) is updated as follows $(1\le j\le q)$:

$$\min_{\Theta^j}\ -\sum_{i=1}^{m}\sum_{a=1}^{K_j} l_{ia}^j \ln\frac{e^{\langle\boldsymbol{\theta}_a^j,\tilde{\boldsymbol{x}}_i\rangle}}{\sum_{s=1}^{K_j}e^{\langle\boldsymbol{\theta}_s^j,\tilde{\boldsymbol{x}}_i\rangle}}+\frac{\lambda}{2}\|\Theta^j\|_F^2. \tag{14}$$

Here, we slightly abuse the notation: the model parameter $\boldsymbol{\theta}_a^j$ is updated from a length-$d$ vector to a length-$(d+K)$ one due to the change of the feature space, which matches the dimension update in Eq. (13).
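
A minimal sketch of the augmentation in Eqs. (12) and (13): the two similarity vectors are rescaled into $[0,1]$, added element-wise together with $\boldsymbol{p}_i$, and concatenated to the original features. The exact normalization scheme is not specified in the text, so a global min-max rescaling is assumed here; the function name augment_features is hypothetical.

```python
import numpy as np

def augment_features(X, Z_I, Z_L, P):
    """Augmented features of Eqs. (12)-(13).

    X        : (m, d) original instances
    Z_I, Z_L : (m, K) instance-based and label-based similarity vectors (Eq. (11))
    P        : (m, K) real-valued label vectors from Eq. (4)
    Z_I and Z_L are rescaled into [0, 1] before the element-wise addition;
    the global min-max rescaling used here is an assumption.
    """
    def minmax(Z):
        lo, hi = Z.min(), Z.max()
        return (Z - lo) / (hi - lo) if hi > lo else np.zeros_like(Z)

    Xi = minmax(Z_I) + minmax(Z_L) + P    # augmented vector xi_i, Eq. (12)
    return np.hstack([X, Xi])             # x~_i = [x_i; xi_i], Eq. (13)
```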

With the transformed dataset $\tilde{\mathcal{D}}$ and the updated predictive model, the instance-based and label-based $k$ nearest neighbors for any instance $\boldsymbol{x}$ will also be updated. In this paper, we iteratively conduct the three steps (i.e., $k$ nearest neighbors identification, similarity information extraction, and predictive model updating) until convergence.

Algorithm 1 shows the pseudo-code of the working procedure of the SIDLE approach. Specifically, steps 1−10 initialize the predictive model over the original dataset $\mathcal{D}$. Then, steps 11−23 repeatedly update the predictive model over the transformed dataset $\tilde{\mathcal{D}}$. Finally, the predicted label vector $\hat{\boldsymbol{l}}$ for an unseen instance $\boldsymbol{x}$ is determined based on its augmented version $\tilde{\boldsymbol{x}}$ and the converged predictive model.

4 Experiments

4.1 Experimental setup

In this paper, we collect five benchmark MDML datasets to conduct comparative study. Tab.3 summarizes their detailed characteristics, including number of dimensions (#Dimension), number of labels per dimension (#Label/Dim.), number of features (#Feature), number of samples (#Sample), and their application domain.

The Song datasets are collected from Chinese songs, where each dimension corresponds to one kind of semantics, i.e., emotion, genre, and scenario [1]. For Song-v1, the label regarded as relevant to an instance is the one with the largest confidence value within each dimension. For Song-v2, assume that the labels in the same dimension are sorted in descending order of confidence values; the relevant labels are then those that appear before the largest difference between two adjacent labels.

The Yeast datasets are collected from the budding yeast Saccharomyces cerevisiae [31]. The dimensions correspond to alpha factor arrest & release, cdc15 arrest & release, elutriation, diauxic shift, heat shock, and sporulation. For Yeast-v1, if the current gene expression level is larger than the average level in the biological experiment, the corresponding label is considered relevant. For Yeast-v2, similar to Song-v2, a label is treated as relevant if it appears before the position of the largest difference between two adjacent (ordered) time points.

The Flickr dataset is a set of pictures from mirflickr25k [32] categorized along different dimensions. Each example corresponds to one picture, and the three dimensions correspond to circumstance, item, and light, respectively. The circumstance dimension covers the sky condition, indoor or outdoor, and whether the scene is far or close. The item dimension covers whether there are people in the scene and their genders, as well as other decorative items such as plants. The light dimension covers the light condition of the picture, e.g., daytime, night, or cannot decide. A label is regarded as relevant if the corresponding scenario actually appears in the picture.

To evaluate the performance of different MDML approaches, five popular multi-label evaluation metrics are utilized in the experiments, including Hamming loss, ranking loss, coverage, one-error, and average precision. Their definitions can be easily found in the literature [14,33], so we omit them in this paper. We compute the value of these metrics dimension by dimension, and then report the average value over all dimensions.
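
As a small illustration of this dimension-by-dimension evaluation protocol, the sketch below averages Hamming loss over the $q$ dimensions; the other four metrics are handled analogously. The function name and the column layout of the label matrices are assumptions for illustration only.

```python
import numpy as np

def dimensionwise_hamming_loss(Y_true, Y_pred, dim_sizes):
    """Average a multi-label metric over the q dimensions, as in the experiments.

    Y_true, Y_pred : (n, K) binary ground-truth and predicted label matrices,
                     columns ordered dimension by dimension
    dim_sizes      : [K_1, ..., K_q], number of labels per dimension
    Hamming loss is used purely as an illustration; the remaining metrics are
    evaluated in the same dimension-by-dimension fashion and then averaged.
    """
    losses, start = [], 0
    for K_j in dim_sizes:
        cols = slice(start, start + K_j)
        losses.append(np.mean(Y_true[:, cols] != Y_pred[:, cols]))  # per-dimension loss
        start += K_j
    return float(np.mean(losses))                                   # average over dimensions
```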

The performance is compared with five competing approaches, including CLIM [1], Multi-Label k-Nearest Neighbors (MLKNN) [34], Binary Relevance (BR) [35], Classifier Chains (CC) [36], and wrapping multi-label classification with label-specific features generation (WRAP) [37]. Specifically, CLIM is an MDML approach which considers label correlations within each dimension and across multiple dimensions in different ways. The remaining four competing approaches are proposed for multi-label classification. They can be used to solve the MDML problem by transforming an MDML problem into a multi-label one via simply concatenating all label spaces as a whole. Here, we include them because there are not enough dedicated MDML approaches currently. MLKNN is based on the k nearest neighbors technique. BR learns a binary classifier for each label independently, while CC learns a chain of binary classifiers in a cascaded manner. WRAP learns a multi-label predictive model based on the label-specific features generation technique.

The parameters for each approach are set as suggested in their original literature. Specifically, λ=23 for CLIM, k=10 for MLKNN, and α=0.9, λ1=5, λ2=5, λ3=0.1 for WRAP. For BR and CC, which necessitate a base binary classifier, we use cross-entropy-based logistic regression for fair comparison, implemented by the popular LIBLINEAR software [38]. For our proposed SIDLE approach, λ=23 and k=10, which are consistent with CLIM and MLKNN, respectively. Ten-fold cross-validation is conducted over each dataset, and we report the mean metric value as well as the standard deviation.

All experiments were conducted on a Windows 11 operating system using MATLAB R2023b. The hardware used in the system includes an Intel i7-12700 CPU and 32GB of memory.

4.2 Experimental results

The detailed results can be found in Tab.4. To facilitate comparison, the performance ranks are also shown in subscript style and the best performance is shown in boldface.

According to these experiment results, the following observations can be made:

● Across all 25 configurations (5 metrics × 5 datasets), the proposed SIDLE approach achieves the best performance in 24 cases and ranks second in the remaining one.

● Compared with the up-to-date approach CLIM, the proposed SIDLE approach achieves superior performance in terms of all the five evaluation metrics. While CLIM enhances feature representations by augmenting the feature space with predictions from each label space to capture label correlations, it relies solely on label-based information. In contrast, SIDLE integrates both instance-based and label-based similarity information. It identifies k-nearest neighbors in the original instance space and in an enhanced label space, extracting similarity features via counting statistics with distance-based biasing. This dual approach enriches the feature space, capturing nuanced inter-instance and inter-label similarities for superior classification performance. Empirical results across metrics and datasets consistently show that SIDLE outperforms CLIM, underscoring the benefit of complementary similarity cues.

● Compared with the popular MLKNN approach, the proposed SIDLE approach achieves overwhelmingly superior performance in terms of each metric. MLKNN relies only on similarity information in the instance space; SIDLE's superiority shows the necessity of also leveraging label-based similarity information.

● Both BR and CC work by learning a binary classifier for each label and do not utilize any similarity information. It can be seen that, by introducing similarity information into the process of predictive model induction, the proposed SIDLE approach achieves superior performance against them in terms of all five evaluation metrics.

● Compared with the WRAP approach, even though it is a state-of-the-art baseline for multi-label classification problems, it still achieves inferior performance to the proposed SIDLE approach. The possible reason is that the WRAP approach ignores the heterogeneous nature of MDML problems, which degrades its performance.

● Both the Song and Yeast datasets are collected from real-world information, and their two versions have differently organized label spaces. The results show that both MDML approaches perform better than traditional multi-label approaches, which means that considering dimension correlations helps improve the performance on MDML tasks. The Flickr dataset is also a real-world dataset but with relatively larger numbers of features and instances. The results show that, for large data, MDML approaches still perform better than traditional multi-label approaches, showing the high robustness of the proposed approach. From the analysis of these datasets, it can be found that taking the new nature of the MDML framework into account significantly helps improve the final classification performance.

● CLIM has already demonstrated that label space can enhance prediction performance, and SIDLE extends this by showing that both label space and instance space play similar roles in the MDML learning process. The performance improvements in SIDLE validate that combining label-based and instance-based similarity information can enhance the feature representations from different instances and then improve classification performance.

4.3 Further analysis

4.3.1 Ablation study

To validate the effectiveness of the technical design of SIDLE, an ablation study is conducted in this section. Specifically, SIDLE is compared with its four variants regarding the way of constructing the augmented features in Eq. (12), denoted by SIDLE-I, SIDLE-L, SIDLE-C, and SIDLE-H, respectively. SIDLE-I and SIDLE-L construct the augmented features with only the instance-based similarity information $\boldsymbol{z}_{\boldsymbol{x}_i}^{I}$ and only the label-based similarity information $\boldsymbol{z}_{\boldsymbol{x}_i}^{L}$, respectively. SIDLE-C constructs the augmented features in a concatenated manner, i.e., $\boldsymbol{\xi}_i=[\boldsymbol{z}_{\boldsymbol{x}_i}^{I};\boldsymbol{z}_{\boldsymbol{x}_i}^{L};\boldsymbol{p}_i]$, while SIDLE-H constructs them in a hybrid manner (addition + concatenation), i.e., $\boldsymbol{\xi}_i=[\boldsymbol{z}_{\boldsymbol{x}_i}^{I}+\boldsymbol{z}_{\boldsymbol{x}_i}^{L};\boldsymbol{p}_i]$.
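
The following tiny sketch contrasts the feature constructions of SIDLE and its SIDLE-C/SIDLE-H variants on random placeholder matrices, purely to make the addition-versus-concatenation distinction concrete; all names, shapes, and values are illustrative assumptions.

```python
import numpy as np

# Random placeholders standing in for z^I, z^L and p of m = 4 samples, K = 6 labels.
m, K = 4, 6
Z_I, Z_L, P = np.random.rand(m, K), np.random.rand(m, K), np.random.rand(m, K)

xi_sidle = Z_I + Z_L + P                        # SIDLE: element-wise addition, Eq. (12)
xi_c     = np.hstack([Z_I, Z_L, P])             # SIDLE-C: pure concatenation
xi_h     = np.hstack([Z_I + Z_L, P])            # SIDLE-H: addition + concatenation
print(xi_sidle.shape, xi_c.shape, xi_h.shape)   # (4, 6) (4, 18) (4, 12)
```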

The detailed results can be found in Tab.5. To facilitate comparison, the best performance is shown in boldface. It can be observed that, across all 25 configurations (5 metrics × 5 datasets), the proposed SIDLE approach achieves the best performance in 13 cases. Although SIDLE may not achieve the top score for every metric, it consistently performs well, with only minor gaps compared to the best-performing versions on specific metrics. When averaging these metrics and ranking them, SIDLE emerges as the best overall choice.

It is worth noting that SIDLE-H outperforms SIDLE-C. The reason for this phenomenon is complicated; some possible factors are listed as follows. 1) Since both instance-based and label-based similarity information are derived from neighbor information, they may contain redundant or overlapping features. Therefore, simply concatenating them may lead to feature redundancy. In contrast, adding them together inherently merges the shared neighborhood information while suppressing redundant components, and the resulting dimensionality reduction mitigates the risk of overfitting and improves computational efficiency. 2) The noise in either the instance-based or label-based similarity information would be amplified if concatenation were applied. The addition of instance-based and label-based similarity information also acts as a linear smoothing mechanism, which can cancel out erratic variations while reinforcing stable similarity information.

4.3.2 Parameter sensitivity analysis

In our proposed SIDLE approach, the similarity information is extracted from the k nearest neighbors. To investigate how this parameter affects the performance of SIDLE, we repeat the experiments of SIDLE with k increasing from 6 to 14 over all datasets.

Fig.1 illustrates the performance fluctuation of SIDLE with varying values of k. It can be observed that the proposed SIDLE approach achieves relatively stable performance when the value of k changes. The reason why the value of k does not affect the final performance much lies in the weighted kNN strategy, which minimizes the effect of the choice of k because a limited number of close neighbors is enough to make the final judgment. In the previous experiments, the value of k is simply fixed at a moderate value of 10, which can be used as a default setting.

Another parameter sensitivity experiment concerns the thresholds $\epsilon_{\max}$ and $\epsilon_{\min}$ in Eq. (9). From Fig.2, it can be found that when the range ($\epsilon_{\max}-\epsilon_{\min}$) becomes larger, the metrics are slightly improved. Once the range exceeds 0.5, the performance remains stable. Therefore, in the previous experiments, the values of $\epsilon_{\max}$ and $\epsilon_{\min}$ are fixed at 0.5 and 0, respectively. In addition, the position of the range $[\epsilon_{\min},\epsilon_{\max}]$ within $[0,1]$ does not heavily affect the performance either (i.e., $[0,0.5]$ and $[0.5,1]$ return similar metric values), which motivates fixing $\epsilon_{\min}$ at 0 because it simplifies Eq. (9) to a single term. In fact, the function of Eq. (9) is similar to min-max normalization, and $\epsilon_{\max}$ and $\epsilon_{\min}$ act as the bounds of the normalization.

4.3.3 Complexity analysis

The majority of the complexity of SIDLE comes from optimizing Eq. (3), which applies gradient descent. The number of iterations that gradient descent needs to converge is hard to analyze theoretically because it depends on the objective function. Here we measure the time costs of SIDLE and the comparable approach CLIM under the same hardware and software conditions. All approaches can be run on a computer with at least 16 GB of RAM, and the experiment is performed in a 32 GB RAM hardware environment. The results can be seen in Tab.6.

It can be seen that SIDLE is more efficient than the compared approach CLIM, and it enjoys a time cost comparable to the traditional approach WRAP and longer than the other multi-label approaches.

5 Conclusion

The main contribution of this paper can be summarized from the following two aspects. 1) From a similarity-based learning perspective, we show that simultaneously leveraging instance-based and label-based similarity information helps induce better similarity-based predictive models. Moreover, different from the nearest class mean, we introduce another effective strategy to leverage label-based similarity information. 2) From the MDML perspective, we propose a novel approach named SIDLE that utilizes both instance-based and label-based similarity information. The experimental results clearly demonstrate its superiority over the competing baselines.

Looking ahead, one promising direction for SIDLE is to refine similarity information extraction by exploring adaptive or learned distance metrics (metric learning) and sophisticated weight schemes for kNN statistics. These enhancements could better capture the nuances of label and instance relationships and boost classification performance. Moreover, similarity information strategies can extend to complex classification tasks and other machine learning schemes, such as semi-supervised, unsupervised, or deep feature augmentation learning. Broadening this approach to future works could improve SIDLE’s performance and underscore the MDML paradigm’s versatility in multi-dimensional and multi-label challenges.

References

[1]

Jia B B, Zhang M L . Multi-dimensional multi-label classification: Towards encompassing heterogeneous label spaces and multi-label annotations. Pattern Recognition, 2023, 138: 109357

[2]

Song L, Liu J, Qian B, Sun M, Yang K, Sun M, Abbas S . A deep multi-modal CNN for multi-instance multi-label image classification. IEEE Transactions on Image Processing, 2018, 27( 12): 6025–6038

[3]

Yin T, Chen H, Wang Z, Li T. Missing labels feature selection based on multilabel multi-scale fusion fuzzy rough sets. In: Proceedings of the 3rd International Conference on Digital Society and Intelligent Systems. 2023, 348−352

[4]

He X, Ma P, Chen Y, Liu Y. MOD-YOLO: improved YOLOv5 based on multi-softmax and omni-dimensional dynamic convolution for multi-label bridge defect detection. In: Proceedings of the 20th International Conference on Advanced Intelligent Computing Technology and Applications. 2024, 44−55

[5]

Yin T, Chen H, Wang Z, Liu K, Yuan Z, Horng S J, Li T . Feature selection for multilabel classification with missing labels via multi-scale fusion fuzzy uncertainty measures. Pattern Recognition, 2024, 154: 110580

[6]

Xiao Y, Liu B, Hao Z, Cao L . A similarity-based classification framework for multiple-instance learning. IEEE Transactions on Cybernetics, 2014, 44( 4): 500–515

[7]

Rossi R A, Ahmed N K, Eldardiry H, Zhou R. Similarity-based multi-label learning. In: Proceedings of 2018 International Joint Conference on Neural Networks. 2018, 1−8

[8]

Jia B B, Zhang M L . Multi-dimensional classification via kNN feature augmentation. Pattern Recognition, 2020, 106: 107423

[9]

Jia B B, Zhang M L. MD-KNN: An instance-based approach for multi-dimensional classification. In: Proceedings of the 25th International Conference on Pattern Recognition. 2021, 126−133

[10]

Ma Z, Chen S . A similarity-based framework for classification task. IEEE Transactions on Knowledge and Data Engineering, 2023, 35( 5): 5438–5443

[11]

Qin Y, Yang J, Zhou J, Pu H, Mao Y . A new supervised multi-head self-attention autoencoder for health indicator construction and similarity-based machinery RUL prediction. Advanced Engineering Informatics, 2023, 56: 101973

[12]

Foumani N M, Tan C W, Webb G I, Rezatofighi H, Salehi M . Series2vec: similarity-based self-supervised representation learning for time series classification. Data Mining and Knowledge Discovery, 2024, 38( 4): 2520–2544

[13]

Gou Q, Dong Y, Wu Y, Ke Q . Semantic similarity-based program retrieval: a multi-relational graph perspective. Frontiers of Computer Science, 2024, 18( 3): 183209

[14]

Zhang M L, Zhou Z H . A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 2014, 26( 8): 1819–1837

[15]

Liu W, Wang H, Shen X, Tsang I W . The emerging trends of multi-label learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44( 11): 7955–7974

[16]

Si C, Jia Y, Wang R, Zhang M L, Feng Y, Qu C . Multi-label classification with high-rank and high-order label correlations. IEEE Transactions on Knowledge and Data Engineering, 2024, 36( 8): 4076–4088

[17]

Wang Y B, Hang J Y, Zhang M L. Multi-label open set recognition. In: Proceedings of the 38th Conference on Neural Information Processing Systems. 2024

[18]

Tang W, Zhang W, Zhang M L . Multi-instance partial-label learning: towards exploiting dual inexact supervision. Science China Information Sciences, 2024, 67( 3): 132103

[19]

Gil-Begue S, Bielza C, Larrañaga P . Multi-dimensional Bayesian network classifiers: a survey. Artificial Intelligence Review, 2021, 54( 1): 519–559

[20]

Jia B B, Zhang M L . Multi-dimensional classification: paradigm, algorithms and beyond. Vicinagearth, 2024, 1( 1): 3

[21]

Jia B B, Zhang M L . Multi-dimensional classification via decomposed label encoding. IEEE Transactions on Knowledge and Data Engineering, 2023, 35( 2): 1844–1856

[22]

Huang T, Jia B B, Zhang M L. Deep multi-dimensional classification with pairwise dimension-specific features. In: Proceedings of the 33rd International Joint Conference on Artificial Intelligence. 2024, 4183−4191

[23]

Wang H, Chen C, Liu W, Chen K, Hu T, Chen G. Incorporating label embedding and feature augmentation for multi-dimensional classification. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020, 6178−6185

[24]

Cover T M, Hart P E . Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 1967, 13( 1): 21–27

[25]

Datta P, Kibler D F. Symbolic nearest mean classifiers. In: Proceedings of the 14th National Conference on Artificial Intelligence and 9th Innovative Applications of Artificial Intelligence Conference. 1997, 82−87

[26]

Guerriero S, Caputo B, Mensink T. DeepNCM: Deep nearest class mean classifiers. In: Proceedings of the 6th International Conference on Learning Representations. 2018

[27]

Zhang Z, Zou Q, Lin Y, Chen L, Wang S . Improved deep hashing with soft pairwise similarity for multi-label image retrieval. IEEE Transactions on Multimedia, 2020, 22( 2): 540–553

[28]

Shi Y, Ye H, Man D, Han X, Zhan D, Jiang Y . Revisiting multi-dimensional classification from a dimension-wise perspective. Frontiers of Computer Science, 2025, 19( 1): 191304

[29]

Ding H, Takigawa I, Mamitsuka H, Zhu S . Similarity-based machine learning methods for predicting drug-target interactions: a brief review. Briefings in Bioinformatics, 2014, 15( 5): 734–747

[30]

Xu N, Liu Y P, Geng X . Label enhancement for label distribution learning. IEEE Transactions on Knowledge and Data Engineering, 2021, 33( 4): 1632–1643

[31]

Eisen M B, Spellman P T, Brown P O, Botstein D . Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 1998, 95( 25): 14863–14868

[32]

Huiskes M J, Lew M S. The MIR flickr retrieval evaluation. In: Proceedings of the 1st ACM SIGMM International Conference on Multimedia Information Retrieval. 2008, 39−43

[33]

Wu X Z, Zhou Z H. A unified view of multi-label performance measures. In: Proceedings of the 34th International Conference on Machine Learning. 2017, 3780−3788

[34]

Zhang M L, Zhou Z H . ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 2007, 40( 7): 2038–2048

[35]

Zhang M L, Li Y K, Liu X Y, Geng X . Binary relevance for multi-label learning: an overview. Frontiers of Computer Science, 2018, 12( 2): 191–202

[36]

Read J, Pfahringer B, Holmes G, Frank E . Classifier chains for multi-label classification. Machine Learning, 2011, 85( 3): 333–359

[37]

Yu Z B, Zhang M L . Multi-label classification with label-specific feature generation: a wrapped approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44( 9): 5199–5210

[38]

Fan R E, Chang K W, Hsieh C J, Wang X R, Lin C J . LIBLINEAR: a library for large linear classification. The Journal of Machine Learning Research, 2008, 9: 1871–1874

RIGHTS & PERMISSIONS

The Author(s) 2025. This article is published with open access at link.springer.com and journal.hep.com.cn
