1 Introduction
Predicting the click-through rate (CTR) is an essential task in recommendation systems based on deep learning [1–3]. Typically, CTR models measure the probability that a user will click on an item based on the user's profile and behavior history [2, 4], such as clicking, purchasing, adding to cart, etc. The behavior histories are represented by sequences of item IDs, titles, and some statistical features, such as monthly sales and favorable rate [1].
From an intuitive perspective, the visual representations of items hold significant importance in online item recommendation, particularly in categories like women's clothing. Recent studies have shown that incorporating modality information through end-to-end training can enhance the effectiveness of recommendations [5]. However, when dealing with large-scale product recommendation systems, where users generate billions of clicks daily, implementing end-to-end training becomes impractical [6]. A feasible alternative is to enhance image features with representation learning methods designed specifically for them. Nevertheless, our experiments reveal that embeddings derived from existing image feature learning methods such as SimCLR [7], SimSiam [8], and CLIP [9] yield only marginal gains on the downstream CTR prediction task. We ascribe this lack of success to two factors. First, in recommendation scenarios, user preferences for visual appearance tend to be vague and imprecise, which may not be captured by existing image feature learning methods that focus on label prediction. Moreover, augmentation methods that excel in tasks like image classification are unsuitable for recommendation, because they significantly alter an image's appearance. Second, the label information embedded in the pre-trained embeddings, such as classes or language descriptions, can already be utilized directly in item recommendation, rendering the information gain provided by pre-trained embeddings redundant. For instance, in Taobao we already possess categories, item titles, and style tags provided by the merchants. Consequently, a pre-trained model that performs well in predicting categories or titles contributes little novel information and does not enhance the CTR prediction task.
To boost the performance of downstream CTR prediction tasks, we argue that the representation learning method should be aware of the downstream CTR task while remaining decoupled from it to reduce computation. To achieve this goal, we propose a COntrastive UseR IntEntion Reconstruction (COURIER) method. Our method is based on an intuitive assumption: an item clicked by a user is likely to have visual characteristics similar to some of the items in the user's click history. One straightforward approach is to optimize the distance between user embeddings (comprised of item images previously clicked by the user) and clicked item image embeddings. Unlike the typical one-to-one correspondence in common contrastive learning, this establishes a many-to-one correspondence. Consequently, we must aggregate multiple item image embeddings from the user's historical clicks. Common aggregation methods include self-attention or pooling. However, user click histories often contain many images that have low relevance to the clicked items, and directly minimizing the distance between such aggregated embeddings may result in all images having very similar embeddings. To mine the visual features related to user interests, we propose reconstructing the next clicked item with a cross-attention mechanism over the click-history items. The reconstruction can be interpreted as a weighted sum of history item embeddings, which effectively selects related images from history in an end-to-end manner, as depicted in Fig.1. We propose to optimize a contrastive loss that not only encourages lower reconstruction error but also pushes embeddings of un-clicked items further apart.
We conducted various experiments to verify our motivation and the design of the method. Our pre-trained image embedding achieves 12% improvements in NDCG and Recall on several public datasets compared with strong baselines. On a large-scale Taobao dataset, our method achieves a 0.46% absolute offline improvement in AUC. In online A/B tests, we achieve a 0.88% improvement in GMV in the women's clothing category, which is significant considering the volume of the Taobao online shopping platform.
Fig.1 (a) Existing image feature learning methods are tailored for cross-modal prediction tasks. (b) We propose a user intention reconstruction method to mine potential visual features that cannot be reflected by cross-modal labels. In this example, the user searched for "Coat" and received two recommendations (page-viewed items), and clicked on the one on the right. Through user intention reconstruction, similar items from the user's click history receive larger attention and are combined into reconstructed PV item embeddings. We then optimize the PV embeddings and their reconstructions to be closer if the corresponding item is clicked and farther apart otherwise
Our contribution can be summarized as follows:
● To establish a one-to-one correspondence between images clicked by the user in the past and the image currently clicked by the user, we propose a user intention reconstruction method, which can mine latent user intention from history click sequences without any explicit semantic labels.
● The user intention reconstruction objective alone may lead to a collapsed solution. To solve this problem, we propose a contrastive training method that utilizes un-clicked items efficiently.
● We conduct comprehensive experiments on both private and public datasets to validate the effectiveness of our method. Additionally, we provide insights and share our practical experience regarding the deployment of image feature models in real-world large-scale recommendation systems.
2 Related work
From collaborative filtering [10, 11] to deep learning based recommender systems [12–14], IDs and categories (user IDs, item IDs, advertisement IDs, tags, etc.) are the essential features in item recommendation systems, as they straightforwardly represent identities. However, with the development of the multi-media internet, ID features alone can hardly cover all the important information. Thus, recommendation based on content such as images, videos, and texts has become an active research field in recommender systems. In this section, we briefly review the related work and discuss their differences compared with our method.
2.1 Content-based recommendation
In general, content is more important when the content itself is the concerned information. Thus, content-based recommendation has already been applied in news [15], music [16], image [17], and video [18] recommendations.
Research on content-based recommendation in the item recommendation task is much scarcer because of the dominating ID features. Early applications of image features typically use image embeddings extracted from a pre-trained image classification model, such as [19–25]. The image features adopted by these methods are trained on general classification datasets such as ImageNet, which may not fit recommendation tasks. Thus, [26] proposes to train on multi-modal data from the recommendation task. However, [26] does not utilize any recommendation labels such as clicks and payments, which yields marginal improvement if information from other modalities is already being used.
With the increase in computing power, some recent papers propose to train item recommendation models end-to-end with image feature networks [5, 27]. However, the datasets used in these papers are much smaller than our application scenario. For example, [27] uses a dataset consisting of 25 million interactions, while in our online system the average daily interactions in a single category (women's clothing) are about 850 million. We found it infeasible to train image networks end-to-end in our scenario, which motivates our decoupled two-stage framework with user-interest-aware embedding learning.
2.2 Image representation learning in recommendation
Self-supervised pre-training has been a hot topic in recent years. We classify the self-supervised learning methods into two categories: Augmentation-based and prediction-based.
Augmentation-based. Augmentation-based methods generate multiple different views of an image by random transformations; the model is then trained to pull the embeddings of different views closer and push other embeddings (augmented from different images) further apart. SimCLR [7], SimSiam [8], and BYOL [28] are well-known self-supervised methods in this category. These augmentation-based methods do not perform well in the item recommendation task, as shown in our experiments in Section 4.2.4. Since the augmentations are designed for classification (or segmentation, etc.) tasks, they change the visual appearance of the images without changing their semantic class, which contradicts the fact that visual appearance is also important in recommendation (e.g., color, shape).
Prediction-based. If the data can be split into more than one part, we can train a model that takes some of the parts as inputs and predicts the rest, which is the basic idea of prediction-based pre-training. Representative prediction-based methods include BERT [29], CLIP [9], etc. Prediction-based methods can be used to train on multi-modal recommendation data as proposed by [26]. However, if we are already utilizing multi-modal information, the improvements are limited, as shown in our experiments. To learn user interest information that cannot be provided by other modalities, we argue that user behaviors should be utilized, as in our proposed method.
2.3 Contrastive learning in recommender systems
Contrastive learning methods have also been adopted in recommender systems in recent years. The most explored augmentation-based approach is augmenting data by dropping, reordering, and masking some features [30], items [31, 32], and graph edges [33]. Prediction-based methods are also adopted for recommendation tasks, e.g., BERT4Rec [34] randomly masks some items and predicts them. However, all these recommender contrastive learning methods concentrate on augmentation and pre-training with ID features, while our method tackles representation learning of image features. Several recent works also consider mining users' intents with contrastive learning [35, 36]. Different from our concentration on visual features, they focus on learning with graph structures, and image features are not considered.
3 Contrastive user intention reconstruction
We briefly introduce some essential concepts and notations, then introduce our method in detail.
3.1 Preliminary
Notations. A data sample for CTR prediction in item search can be represented by a tuple (user, item, query, label). Typically, in recommendation tasks there is no explicit query, and the remaining aspects align with search; our approach is applicable in both scenarios. For simplicity, the subsequent sections consistently use the term "recommendation". A user searches for a query text, and several items are shown to the user. The items that the user clicks are labeled as positive and the rest as negative. When a user views a page, the list of items presented to the user is called the page-view (PV) list. The length of this PV list is denoted by $N$. Each PV item has a cover image, denoted by $p_i$, where $i = 1, \dots, N$. The corresponding click labels are denoted by $y_i \in \{0, 1\}$, where $y_i = 1$ indicates a click. Each user has a list of clicked item history; the image of the $j$th history item is denoted by $c_j$, $j = 1, \dots, M$, where $M$ is the length of the click history. $N$ and $M$ may vary across users and pages; in practice we trim or pad to the same length.
Attention. An attention layer is defined as follows:
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$
where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, and $d$ is the embedding dimension. The mechanism of the attention layer can be interpreted intuitively: for each query, a similarity score is computed with every key, and the values corresponding to the keys are weighted by their similarity scores and summed up to obtain the outputs. We refer interested readers to [37] for more details of the attention mechanism.
3.2 User intention reconstruction
In the following discussion, we only consider a single line of data (a PV list and the corresponding list of user click history); the batch training method will be discussed later. We use $P = [p_1, \dots, p_N]$ and $C = [c_1, \dots, c_M]$ to denote the matrices of all the PV and click-history images. All the images are fed to the image backbone (IB) to get their embeddings, denoted as $E^p = \mathrm{IB}(P) \in \mathbb{R}^{N \times d}$ and $E^c = \mathrm{IB}(C) \in \mathbb{R}^{M \times d}$ correspondingly. In our user intention reconstruction method, we treat the embeddings of PV images as queries $Q = E^p$, and we use the embeddings of click images as both keys and values $K = V = E^c$. The user intention reconstruction is then calculated by
$$\hat{E}^p = \mathrm{Attn}(E^p, E^c, E^c),$$
where the $i$th row satisfies $\hat{e}^p_i = \sum_{j=1}^{M} a_{ij} e^c_j$, and $a_{ij}$ is the attention weight on the $j$th history click item. The reason for the name (user intention reconstruction) is that the attention layer forces the outputs to be weighted sums of the embeddings of the historical click sequence. Thus, the output space is limited to convex combinations of $\{e^c_j\}$ within a simplex.
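A minimal PyTorch sketch of this reconstruction step may help fix the shapes; the single-head formulation and tensor names are our own simplification of the mechanism described above, not the production implementation:

```python
import torch
import torch.nn.functional as F

def reconstruct_pv(pv_emb: torch.Tensor, clk_emb: torch.Tensor):
    """Reconstruct PV embeddings as attention-weighted sums of click-history embeddings.

    pv_emb:  (N, d) embeddings of page-view images (queries)
    clk_emb: (M, d) embeddings of click-history images (keys and values)
    Returns: (N, d) reconstructions and the (N, M) attention weights.
    """
    d = pv_emb.size(-1)
    scores = pv_emb @ clk_emb.t() / d ** 0.5   # similarity of each PV item to each history item
    attn = F.softmax(scores, dim=-1)           # attention over the click history
    recon = attn @ clk_emb                     # each row is a convex combination of history embeddings
    return recon, attn
```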
3.3 Contrastive training method
The user intention reconstruction module alone cannot prevent the trivial solution in which all the embeddings collapse to the same value. Thus, we propose a contrastive method to train the user intention reconstruction module.
Given the PV embeddings $e^p_1, \dots, e^p_N$ and the corresponding reconstructions $\hat{e}^p_1, \dots, \hat{e}^p_N$, we calculate the pairwise similarity score of $e^p_i$ and $\hat{e}^p_j$, where $e^p_i$ represents the embedding of the $i$th PV item and $\hat{e}^p_j$ represents the reconstruction of the $j$th PV item:
$$s_{ij} = \frac{{e^p_i}^\top \hat{e}^p_j}{\lVert e^p_i \rVert\, \lVert \hat{e}^p_j \rVert}.$$
Then we calculate the contrastive loss by
$$\mathcal{L}_{\mathrm{PV}} = -\sum_{i=1}^{N}\sum_{j=1}^{N} \mathbb{1}[i = j,\, y_j = 1]\, \log \frac{\exp(s_{ij}/\tau)}{\sum_{k=1}^{N} \exp(s_{kj}/\tau)},$$
where $\tau$ is a temperature hyper-parameter. Here $\mathbb{1}[i = j,\, y_j = 1]$ is an indicator function that equals $1$ when $i = j$ and $y_j = 1$, and equals $0$ otherwise.
The contrastive loss with user intention reconstruction is depicted in Fig.2. The softmax is calculated column-wise, and only the positive PV images are optimized to be reconstructed (the two columns in dashed boxes). The behavior of the contrastive loss aligns with our assumption: positive PV images $e^p_j$ (with $y_j = 1$) are pulled closer to the corresponding reconstructions $\hat{e}^p_j$, while the negative PV images are not encouraged to be reconstructed. All the other PV embeddings $e^p_i$ with $i \neq j$ are pushed further away from $\hat{e}^p_j$, which prevents the trivial solution in which all the embeddings are the same. Some elements in the similarity matrix are not used in our loss, namely the columns with $y_j = 0$, since negative PV images are not supposed to be reconstructible; we keep these columns for ease of implementation.
Fig.2 The contrastive user intention reconstruction method. The images are fed into the image backbone model to obtain the corresponding embeddings. The embeddings of PV (page-view) sequences are blue-colored, the embeddings of click sequences are yellow-colored, and the reconstructions are in green. Red boxes denote positive PV items
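A sketch of the single-page loss under the notation above; the temperature value and the use of a mean over positive columns (rather than a sum) are placeholder choices of ours:

```python
import torch
import torch.nn.functional as F

def pv_contrastive_loss(pv_emb, recon, labels, tau=0.07):
    """Contrastive user-intention-reconstruction loss for one page.

    pv_emb: (N, d) PV image embeddings; recon: (N, d) reconstructions;
    labels: (N,) click labels in {0, 1}. tau is an illustrative temperature.
    """
    pv_n = F.normalize(pv_emb, dim=-1)
    rc_n = F.normalize(recon, dim=-1)
    sim = pv_n @ rc_n.t() / tau              # s_ij: PV item i vs. reconstruction of item j
    log_prob = F.log_softmax(sim, dim=0)     # softmax over PV items within each column j
    pos = labels.bool()
    # only diagonal entries of positive columns are pulled together
    return -log_prob.diagonal()[pos].mean()
```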
Extending to batched contrastive loss. The above contrastive loss is calculated using PV items within a single page (we have at most 10 items on a page), which provides only a limited number of negative samples. However, a well-known property of contrastive loss is that performance increases as the number of negative samples increases, which has been verified both practically [7] and theoretically [38]. To increase the number of negative samples, we propose to treat all other in-batch PV items as negative samples. Specifically, a batch of size $B$ contains $B \times N$ PV items; for a positive PV item, all the other $B \times N - 1$ PV items are used as negatives, which significantly increases the number of negatives.
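One way to realize the in-batch extension, continuing the sketch above (the exact batching of the production system may differ), is to flatten all PV items in the batch so that every reconstruction is contrasted against $B \times N$ candidates:

```python
import torch
import torch.nn.functional as F

def batched_pv_contrastive_loss(pv_emb, recon, labels, tau=0.07):
    """pv_emb, recon: (B, N, d); labels: (B, N). All other in-batch PV items act as negatives."""
    B, N, d = pv_emb.shape
    pv_flat = F.normalize(pv_emb.reshape(B * N, d), dim=-1)
    rc_flat = F.normalize(recon.reshape(B * N, d), dim=-1)
    sim = pv_flat @ rc_flat.t() / tau        # (B*N, B*N) similarity matrix
    log_prob = F.log_softmax(sim, dim=0)     # softmax over all B*N PV items per column
    pos = labels.reshape(-1).bool()
    return -log_prob.diagonal()[pos].mean()
```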
3.4 Contrastive learning on click sequences
In the contrastive loss introduced in Section 3.3, the negative and positive samples come from the same query. Since all the PV items presented to a user are already ranked by our online recommender system, they may be visually and conceptually similar and are relatively hard to distinguish with a contrastive loss. Although the batch training method introduces many easier negatives, the hard negatives still dominate the contrastive loss at the beginning of training, which makes the model hard to train.
Thus, we propose another contrastive loss on the user click history, which does not contain hard PV negatives. Specifically, suppose the user's click history is represented by the embeddings $e^c_1, \dots, e^c_M$. We treat the last item $e^c_M$ as the next item to be clicked (since the items are sorted by their click timestamps), and the remaining items are treated as the click history. Its reconstruction $\hat{e}^c_M$ is computed with the same cross-attention module over $e^c_1, \dots, e^c_{M-1}$, and the user click sequence loss is calculated as follows:
$$\mathcal{L}_{\mathrm{UCS}} = -\log \frac{\exp\!\big(\mathrm{sim}(e^c_M, \hat{e}^c_M)/\tau\big)}{\sum_{k} \exp\!\big(\mathrm{sim}(e^c_k, \hat{e}^c_M)/\tau\big)},$$
where $\mathrm{sim}(\cdot,\cdot)$ is the same cosine similarity as in Section 3.3 and the denominator ranges over the negative candidates. Here the label is always positive ($y = 1$) because all the samples in history click sequences are clicked. The user sequence loss provides an easier objective at the start of training, which helps the model learn in a curriculum style. It also introduces more training signals, which improves data efficiency. The overall loss of COURIER is:
$$\mathcal{L}_{\mathrm{COURIER}} = \mathcal{L}_{\mathrm{PV}} + \mathcal{L}_{\mathrm{UCS}}.$$
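A sketch of the leave-last-out construction of this loss; using the target itself as the attention query and drawing negatives from the remaining history items (rather than from other in-batch items) are simplifying assumptions made for brevity:

```python
import torch
import torch.nn.functional as F

def ucs_loss(clk_emb, tau=0.07):
    """clk_emb: (M, d) click-history embeddings sorted by click timestamp.
    The last click is the (always positive) target; the first M-1 items form the history."""
    history, target = clk_emb[:-1], clk_emb[-1:]                         # (M-1, d), (1, d)
    attn = F.softmax(target @ history.t() / history.size(-1) ** 0.5, dim=-1)
    recon = attn @ history                                               # (1, d) reconstruction of the next click
    cand = F.normalize(clk_emb, dim=-1)                                  # candidates: target plus easy negatives
    rc = F.normalize(recon, dim=-1)
    logits = (cand @ rc.t() / tau).t()                                   # (1, M)
    target_idx = torch.tensor([clk_emb.size(0) - 1])                     # the last item is the positive class
    return F.cross_entropy(logits, target_idx)
```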
3.5 Differences compared to established contrastive methods
Although COURIER is also a contrastive method, it differs significantly from classic contrastive methods. Firstly, in contrastive methods such as SimCLR [7] and CLIP [9], every sample has a corresponding positive counterpart. In our method, a negative PV item does not have a corresponding positive reconstruction, yet it still serves as a negative sample in the loss. Secondly, SimCLR and CLIP have a straightforward one-to-one matching, e.g., between a text and the corresponding image, or between views generated by augmentation. In recommendation systems, image augmentations are not applicable due to the distortion of appearance. Instead, a positive PV item corresponds to a list of history click items, which is transformed into a one-to-one matching with our user intention reconstruction module introduced in Section 3.2. Thirdly, another way to convert this many-to-one matching into one-to-one is to apply self-attention on the many-side, as suggested in [39], which turned out to perform worse in our scenario. We experiment with and analyze this method in Section 4.3.2 (w/o Reconstruction).
4 Experiments
To validate the efficacy of our proposed image representation learning method in enhancing recommendation performance, we conducted experiments on several publicly available product recommendation datasets. Subsequently, we proceeded with systematic experiments on actual data from Taobao and integrated our approach into the online system.
4.1 Experiments on public datasets
Since nearly all recommendation methods incorporate ID features, the addition of image or text features typically characterizes multimodal recommendation methods. Our CTR prediction model falls under this category as well. These methods primarily rely on fixed image and text backbone models to extract image features. However, our proposed image feature pre-training method aims to enhance the representational capabilities of image features in recommendation system methods. In this section, we will experimentally investigate whether our pre-trained image features can further enhance the performance of these methods.
Datasets. To validate the effectiveness of our proposed visual feature pre-training method in learning user-intention-related features, we select three categories from the Amazon Review dataset [40], namely Baby, Sports, and Clothing. Each item has corresponding image and text descriptions. We apply our method to the image features while keeping the text features unchanged. The statistics are shown in Tab.1. To evaluate the performance of different visual feature extraction methods, we divided the dataset into pre-training, training, validation, and testing sets, with proportions of 50%, 30%, 10%, and 10%, respectively. We group the datasets by user IDs to construct a list of positive items, and then we uniformly sample 10 negative items for each user to construct the pre-training dataset.
Tab.1 Statistics of public datasets

| Dataset | # Users | # Items | # Interactions |
|---|---|---|---|
| Baby | 19445 | 7050 | 160792 |
| Sports | 35598 | 18357 | 296337 |
| Clothing | 39387 | 23033 | 278677 |
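As a concrete illustration of the pre-training sample construction described above, the following sketch groups interactions by user and attaches 10 uniform negatives per user; the DataFrame column names are hypothetical:

```python
import numpy as np
import pandas as pd

def build_pretrain_samples(interactions: pd.DataFrame, all_items: np.ndarray,
                           n_neg: int = 10, seed: int = 0):
    """Group (user, item) interactions into positive lists and sample uniform negatives."""
    rng = np.random.default_rng(seed)
    samples = []
    for user, group in interactions.groupby("user_id"):
        positives = group["item_id"].tolist()
        pool = np.setdiff1d(all_items, positives)           # do not sample clicked items as negatives
        negatives = rng.choice(pool, size=n_neg, replace=False).tolist()
        samples.append({"user_id": user, "pos_items": positives, "neg_items": negatives})
    return samples
```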
Baselines. We compare with several representative ID-based and multi-modal recommendation methods. We select two ID-based recommendation methods, namely BPR [41] and SLMRec [23]. We select five multi-modal recommendation methods, namely DualGNN [21], LATTICE [22], MGCN [24], MMGCN [20], and BM3 [25]. To validate the information gained from our visual feature learning method, we concatenate the pre-trained embeddings with the original embeddings provided by [42]. Then, we apply the augmented embeddings to BM3 and MMGCN, which corresponds to BM3+ours and MMGCN+ours, respectively.
Implementation. All the baseline methods are tuned following [42]. Specifically, we tune the hyper-parameters (learning rate, weight decay, and method-specific hyper-parameters such as the contrastive loss weight in MGCN) of each method with a validation dataset. Then, we run each method with 5 different random seeds and report the average performance.
Results. As observed in Tab.2, our method, when combined with existing multimodal recommendation algorithms, can further enhance performance, achieving an average improvement of around 12% in terms of both Recall and NDCG.
Tab.2 Average Recall and NDCG performance comparison on public datasets

| Method | Sports R@20 | Sports R@50 | Sports N@50 | Baby R@20 | Baby R@50 | Baby N@50 | Clothing R@20 | Clothing R@50 | Clothing N@50 |
|---|---|---|---|---|---|---|---|---|---|
| BPR | 0.004 | 0.008 | 0.003 | 0.007 | 0.014 | 0.005 | 0.004 | 0.007 | 0.002 |
| SLMRec | 0.006 | 0.016 | 0.006 | 0.014 | 0.028 | 0.011 | 0.004 | 0.010 | 0.003 |
| DualGNN | 0.012 | 0.023 | 0.009 | 0.018 | 0.037 | 0.014 | 0.007 | 0.014 | 0.005 |
| LATTICE | 0.012 | 0.021 | 0.008 | 0.010 | 0.022 | 0.008 | − | − | − |
| MGCN | 0.015 | 0.027 | 0.011 | 0.017 | 0.032 | 0.013 | 0.011 | 0.019 | 0.008 |
| MMGCN | 0.015 | 0.028 | 0.012 | 0.024 | 0.061 | 0.024 | 0.008 | 0.017 | 0.006 |
| BM3 | 0.018 | 0.033 | 0.014 | 0.031 | 0.065 | 0.026 | 0.009 | 0.016 | 0.006 |
| MMGCN+ours | 0.017 | 0.031 | 0.013 | 0.030 | 0.061 | 0.024 | 0.010 | 0.019 | 0.007 |
| BM3+ours | 0.019 | 0.034 | 0.015 | 0.035 | 0.068 | 0.027 | 0.012 | 0.019 | 0.007 |
4.2 Experiments on Taobao offline dataset
To evaluate the information increment of pre-trained image embeddings on the CTR prediction task, we use an architecture that aligns with our online recommender system. In essence, each item is associated with a unique item ID and features. The item ID is transformed into an embedding and concatenated with other features, including the image embedding we have learned, to form the item features. Users consist of user features and their behavioral history, where the behavioral history comprises numerous items. Therefore, the user behavioral history is aggregated using self-attention, which is then concatenated with other user features. Subsequently, the user and item features are concatenated and serve as inputs to an MLP. The MLP’s final output is the user’s click probability, representing the predicted CTR. We provide a detailed description of this CTR model in the Appendixes.
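A highly simplified sketch of this downstream CTR architecture follows; the layer sizes, feature dimensions, and pooling choices are illustrative assumptions only, and the production model is the one described in the Appendixes:

```python
import torch
import torch.nn as nn

class SimpleCTRModel(nn.Module):
    def __init__(self, n_items, id_dim=32, img_dim=64, user_dim=64, hidden=256):
        super().__init__()
        self.item_id_emb = nn.Embedding(n_items, id_dim)
        self.hist_attn = nn.MultiheadAttention(id_dim + img_dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(user_dim + 2 * (id_dim + img_dim), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, item_id, item_img_emb, user_feat, hist_id, hist_img_emb):
        # item features: ID embedding concatenated with the pre-trained image embedding
        item = torch.cat([self.item_id_emb(item_id), item_img_emb], dim=-1)      # (B, id+img)
        # user behavior history aggregated with self-attention, then mean-pooled
        hist = torch.cat([self.item_id_emb(hist_id), hist_img_emb], dim=-1)      # (B, H, id+img)
        hist_agg, _ = self.hist_attn(hist, hist, hist)
        hist_agg = hist_agg.mean(dim=1)                                          # (B, id+img)
        x = torch.cat([user_feat, hist_agg, item], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)                            # predicted CTR
```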
Downstream usage of image representations. Practically, we find that how we feed the image features into the downstream CTR model is critical for the final performance. We experimented with three different methods: 1. Directly using embedding vectors. 2. Using similarity scores to the target item. 3. Using the cluster IDs of the embeddings. Cluster-ID is the best-performing method among the three methods, bringing about 0.1%−0.2% improvements on AUC compared to using embedding vectors directly. We attribute the success of Cluster-ID to its better alignment with our pre-training method. We provide a more detailed analysis in the Appendix.
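To illustrate the Cluster-ID variant, here is a plausible offline sketch using scikit-learn mini-batch k-means; the cluster count is a placeholder and the production system uses the learning-based clustering described in Section 5:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_cluster_id_feature(image_embeddings: np.ndarray, n_clusters: int = 10000, seed: int = 0):
    """Quantize image embeddings into discrete cluster IDs that the CTR model
    consumes as an ordinary categorical (ID) feature with its own embedding table."""
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed, batch_size=4096)
    cluster_ids = km.fit_predict(image_embeddings)      # (num_items,) integer cluster IDs
    return cluster_ids, km.cluster_centers_             # centers are kept to map new items later
```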
4.2.1 Taobao pre-training and CTR dataset
Pre-training dataset. The pre-training dataset was collected from 2022.11.18 to 2022.11.25 on our online search service. To reduce the computational burden, we down-sample the negative samples to 20%, so the click-through rate (CTR) increases to around 13%. We further sort the PV items by their labels to place positive samples in the front, and select the first 5 PV items to constitute the training targets ($N = 5$). We retain the latest 5 user-click-history items ($M = 5$). Thus, there are at most 10 items in one sample. There are three reasons for such data reduction: First, our dataset is still large enough after reduction. Second, the number of positive items in a PV sequence is less than 5 most of the time, so trimming PV sequences to 5 loses few positive samples, which are generally much more important than negatives. Third, we experimented with larger $N$ and $M$ and did not observe significant improvements, while the training time became significantly longer. Thus, we keep $N$ and $M$ fixed in all the experiments. We remove samples without a click history or without positive items within the page. Intuitively, women's clothing is one of the most difficult recommendation tasks (its testing AUC is significantly lower than average) and it also largely depends on the visual appearance of the items. Thus, we select the women's clothing category to form the training dataset of the pre-training task. The statistics of the pre-training dataset are summarized in Tab.3.
Tab.3 Pre-training dataset collected from the women's clothing category

| # Users | # Items | # Samples | CTR | # Hist | # PV Items |
|---|---|---|---|---|---|
| 71.7 million | 35.5 million | 311.6 million | 0.13 | 5 | 5 |
In the pre-training dataset, we only retain the item images and the click labels. All other information, such as item title, item price, user properties, and even the query by the users are dropped. We report the experimental results of training with text information in Section 4.3.4, which indicates that additional information is unnecessary.
CTR dataset. The average daily statistics of the downstream CTR datasets in all categories and women’s clothing are summarized in Tab.4. To avoid any information leakage, we use data collected from 2022.11.27 to 2022.12.04 on our online shopping platform to train the downstream CTR model, and we use data collected on 2022.12.05 to evaluate the performance. In the evaluation stage, the construction of the dataset aligns with our online system. We use all the available information to train the downstream CTR prediction model. The negative samples are also down-sampled to 20%. Different from the pre-training dataset, we do not group the page-view data in the evaluation dataset, so each sample corresponds to an item.
Tab.4 Daily average statistics of the downstream dataset

| Category | # Users | # Items | # Samples | CTR | # Hist |
|---|---|---|---|---|---|
| All | 0.118 billion | 0.117 billion | 4.64 billion | 0.139 | 98 |
| Women's | 26.39 million | 12.29 million | 874.39 million | 0.145 | 111.3 |
4.2.2 Evaluation metrics
● Area Under the ROC Curve (AUC): AUC is the most commonly used evaluation metric in evaluating ranking methods, which denotes the probability that a random positive sample is ranked before a random negative sample.
● Grouped AUC (GAUC): The AUC is a global metric that ranks all the predicted probabilities. However, in the online item-searching task, only the recalled items (relevant to the user's query) are considered by the ranking stage, so the ranking performance among them is more meaningful than global AUC. Thus, we propose a Grouped AUC metric, which is the average AUC within search sessions; a minimal computation sketch is given after this list.
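The sketch below computes GAUC as the plain average of per-session AUCs, assuming per-sample session IDs, labels, and predicted scores; sessions whose labels are all 0 or all 1 are skipped because AUC is undefined there:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def grouped_auc(session_ids, labels, scores):
    """Average AUC computed within each search session."""
    session_ids, labels, scores = map(np.asarray, (session_ids, labels, scores))
    aucs = []
    for sid in np.unique(session_ids):
        mask = session_ids == sid
        y, s = labels[mask], scores[mask]
        if y.min() == y.max():        # need both positive and negative samples in the session
            continue
        aucs.append(roc_auc_score(y, s))
    return float(np.mean(aucs)) if aucs else float("nan")
```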
4.2.3 Compared methods
● Baseline: Our baseline is the CTR model that serves our online system. It is noteworthy that we adopt a warmup strategy that uses our online model, trained with more than one year's data, to initialize all the weights (user ID embeddings, item ID embeddings, etc.), which makes it a fairly strong baseline.
● Supervised: To pre-train image embeddings with user behavior information, a straightforward method is to train a CTR model end-to-end with click labels and the image backbone network. We use the trained image network to extract embeddings in the same way as for the other compared methods.
● SimCLR [7]: SimCLR is a self-supervised image pre-training method based on augmentations and contrastive learning.
● SimSiam [8]: SimSiam is also an augmentation-based method. Different from SimCLR, SimSiam suggests that the contrastive loss is unnecessary and proposes directly minimizing the distance between matched embeddings.
● CLIP [9]: CLIP is a multi-modal pre-training method that optimizes a contrastive loss between image embeddings and text embeddings. We treat an item's cover image and its title as a matched sample. We use a pre-trained BERT [43] as the feature network of item titles, which is also trained end-to-end.
● MaskCLIP [44]: MaskCLIP is an improved version of CLIP with masked self-distillation on images and text.
4.2.4 Performance in downstream CTR task
The performances of the compared methods are summarized in Tab.5. We draw the following conclusions. Firstly, since all the methods are pre-trained on the women's clothing category, they all improve the AUC of the downstream women's clothing category: SimCLR, SimSiam, CLIP, MaskCLIP and COURIER outperform the Baseline and Supervised pre-training. Among them, our COURIER performs best, outperforming the Baseline by 0.46% AUC and the second-best method MaskCLIP by 0.15% AUC, which verifies our analysis that traditional CV pre-training methods provide little information gain to the CTR model. Secondly, we also check the performance across all categories. COURIER again performs best, with a 0.16% improvement in AUC. However, the other methods' performances differ significantly from the women's clothing category: the Supervised and SimSiam methods turn out to have a negative impact across all categories, and the improvements of CLIP, SimCLR and MaskCLIP become marginal. The reason is that these pre-training methods fail to extract general user interest information and overfit the women's clothing category. We analyze the performance in categories other than women's clothing in Section 4.3.5. Thirdly, the GAUC results indicate that the gains of CLIP, SimCLR and MaskCLIP vanish when we consider in-page ranking, which is indeed more important than global AUC, as discussed in Section 4.2.2. The GAUC performance further validates that COURIER can learn fine-grained user interest features that distinguish between in-page items.
Tab.5 The improvements of AUC (ΔAUC) in the women's clothing category, and of AUC and GAUC in all categories. We report the relative improvements compared to the Baseline method; the raw values of the metrics are in parentheses

| Methods | ΔAUC (Women's Clothing) | ΔAUC (All) | ΔGAUC (All) |
|---|---|---|---|
| Baseline | 0.00% (0.7785) | 0.00% (0.8033) | 0.00% (0.7355) |
| Supervised | +0.06% (0.7790) | −0.14% (0.8018) | −0.06% (0.7349) |
| CLIP [9] | +0.26% (0.7810) | +0.04% (0.8036) | −0.09% (0.7346) |
| SimCLR [7] | +0.28% (0.7812) | +0.05% (0.8037) | −0.08% (0.7347) |
| SimSiam [8] | +0.10% (0.7794) | −0.10% (0.8022) | −0.29% (0.7327) |
| MaskCLIP [44] | +0.31% (0.7815) | +0.03% (0.8035) | −0.03% (0.7352) |
| COURIER (ours) | +0.46% (0.7830) | +0.16% (0.8048) | +0.19% (0.7374) |
4.3 Further experimental analysis
We have validated the efficacy of the COURIER approach in enhancing the overall performance of recommendation algorithms on both public datasets and the Taobao dataset. However, we also aim to address the following inquiries:
1. Existing research has highlighted the significant impact of temperature coefficients in pre-training. Does the temperature coefficient similarly affect our pre-training task?
2. What is the respective contribution of each module in the COURIER approach towards performance improvement?
3. Currently, we only utilize image and user click information in the training of embeddings. Would the inclusion of other modalities, such as text, further enhance the performance?
4. As a pre-training method, can the embeddings acquired in the women’s clothing category also yield improvements in other categories?
5. Can our method solely utilize user click information to learn features relevant to user intent?
We will address the aforementioned inquiries through experimental investigation.
4.3.1 Influence of temperature
We experiment with different temperatures $\tau$. The results are shown in Fig.3. Note that these experiments were run with the Simscore method, which performs worse than Cluster-ID but is much faster. We pick the best-performing $\tau$ for COURIER and keep it fixed in all other experiments.
Fig.3 The impact of different values of $\tau$ on the performance of downstream CTR tasks. The horizontal axis represents the values of $\tau$, while the vertical axis denotes the change (%) in the metrics
4.3.2 Ablation study
We conduct the following ablation experiments to verify the effect of each component of COURIER.
● w/o UCS: Remove the user click sequence loss.
● w/o Contrast: Remove the contrastive loss and only minimize the reconstruction loss, similar to SimSiam [8].
● w/o Reconstruction: Use self-attention instead of cross-attention in the user intention reconstruction module.
● w/o Neg PV: Remove negative PV samples, only use positive samples.
The results in Tab.6 indicate that all the proposed components are necessary for the best performance.
Tab.6 Ablation studies of COURIER

| Method | ΔAUC (Women's Clothing) | ΔAUC (All) | ΔGAUC (All) |
|---|---|---|---|
| w/o UCS | 0.06% | −0.13% | 0.11% |
| w/o Contrast | 0.23% | 0.03% | −0.11% |
| w/o Reconstruction | 0.25% | 0.02% | −0.11% |
| w/o Neg PV | 0.30% | 0.07% | −0.06% |
| COURIER | 0.46% | 0.16% | 0.19% |
4.3.3 Influence of batch size
Due to both theoretical [38] and practical evidence [7] indicating that increasing the batch size in contrastive learning can enhance model generalization, we conducted experiments with different batch sizes. The results in Tab.7 also demonstrate that in our method, the larger the batch size, the better the performance.
Tab.7 Influence of batch size on performance

| Batch size | ΔAUC (Women's Clothing) | ΔAUC (All) | ΔGAUC (All) |
|---|---|---|---|
| 64 | 0.15% | −0.06% | −0.08% |
| 256 | 0.23% | 0.04% | 0.02% |
| 512 | 0.36% | 0.09% | 0.10% |
| 2048 | 0.43% | 0.15% | 0.17% |
| 3072 | 0.46% | 0.16% | 0.19% |
| 4096 | 0.47% | 0.16% | 0.21% |
On the flip side, increasing the batch size requires a substantial increase in computational power. In our framework, a batch size of 3072 requires 48 NVIDIA V100 GPUs, while a batch size of 4096 demands 64 GPUs. Considering the diminishing returns of further escalating the batch size, coupled with the training costs, we ultimately opted for a batch size of 3072 for deployment.
4.3.4 Train with text information
Text information is important for search and recommendation since it is directly related to the user's query and the properties of items. Thus, raw text information is already widely used in real-world systems. Co-training of texts and images also shows significant performance gains in computer vision tasks such as classification and segmentation. We are therefore interested in the influence of co-training with text information on COURIER. Specifically, we add a CLIP [9] objective besides COURIER, so that the loss function becomes $\mathcal{L}_{\mathrm{COURIER}} + \mathcal{L}_{\mathrm{CLIP}}$. The CLIP loss is calculated with item cover images and item titles. However, such multi-task training leads to worse downstream CTR performance, as shown in Tab.8, which indicates that co-training with text information may not help generalization when the text information is already available in the downstream task.
Tab.8 Train with text information

| Method | ΔAUC (Women's Clothing) | ΔAUC (All) | ΔGAUC (All) |
|---|---|---|---|
| w CLIP | 0.26% | 0.04% | −0.09% |
| COURIER | 0.46% | 0.16% | 0.19% |
4.3.5 Generalization in unseen categories
In Fig.4, we plot the AUC improvements of COURIER in different categories. We draw the following conclusions. Firstly, the performance improvement is largest in the women's clothing category, which is intuitive since the embeddings are trained with women's clothing data. Secondly, there are also significant improvements in women's shoes, children's shoes, children's clothing, underwear, etc. These categories are not used in the pre-training task, which indicates that COURIER can learn general visual characteristics that reflect user interests. Thirdly, the performance in bedding, cosmetics, knapsack, and handicrafts is also improved by more than 0.1%. These categories differ significantly from women's clothing in visual appearance, so COURIER has learned some features that transfer to them. Fourthly, COURIER does not have a significant impact on some categories and has a negative impact on the car category; these categories are less influenced by visual appearance and can be excluded when using our method to avoid performance drops. Fifthly, the performance is also related to the amount of data: categories with more data generally tend to perform better.
Fig.4 The AUC improvements of COURIER compared to the Baseline on different categories. The x-axis is sorted by the improvements |
4.3.6 Visualization of trained embeddings
Did COURIER really learn features related to user interests? We verified the quantitative improvements in CTR in Section 4.2.4; here, we also provide some qualitative analysis. During the training of COURIER, we did not use any information other than images and user clicks. Thus, if the embeddings contain semantic information, it must have been extracted from user behaviors. We plot some randomly selected embeddings with specific categories and style tags in Fig.5 and Fig.6. Firstly, embeddings from different categories are clearly separated, which indicates that COURIER can learn categorical semantics from user behaviors. Secondly, some of the style tags can be separated, such as Cool vs. Sexy; the well-separated tags are also intuitively easy to distinguish. Thirdly, some of the tags cannot be separated clearly, such as Mature vs. Cuties and Grace vs. Antique, which is also intuitive since these tags have relatively vague meanings and may overlap. Despite this, COURIER still learned some gradation between the two concepts. To conclude, the proposed COURIER method can learn meaningful user-interest-related features using only images and click labels.
Fig.5 T-SNE visualization of embeddings in different categories. (a) Dress and Jeans; (b) Shirt and Cheongsam; (c) Skirt and Fur
Fig.6 T-SNE visualization of embeddings with different style tags. We also plot some item images with different tags below the corresponding figures. (a) Cool and Sexy; (b) Mature and Cuties; (c) Grace and Antique
Tab.9 The A/B testing improvements of COURIER

| Category | # Order | CTR | GMV |
|---|---|---|---|
| All categories | +0.1% | +0.18% | +0.66% |
| Women's clothing | +0.31% | +0.34% | +0.88% |
5 Online experiments and deployment
During the deployment of our image feature learning method, we devoted substantial efforts to optimizing the model’s performance. Below, we summarize some of the impactful optimization measures that were undertaken:
Image backbone and performance optimization. We use the Swin-tiny transformer [45] trained on ImageNet as the image backbone model, which is faster than the full Swin model and has comparable performance. We train the backbone network end-to-end in the pre-training phase. However, since our pre-training dataset contains roughly 230 times more images than the well-known ImageNet dataset (which has about 1.3 million images), even the Swin-tiny model is slow. After applying gradient checkpointing [46], mixed-precision computation [47], and optimizing the distributed IO of images, we reduced the training time for one epoch on the pre-training dataset from 150 hours to 50 hours (consuming daily data in < 10 hours) with 48 NVIDIA V100 GPUs (32 GB).
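The following sketch shows the kind of training-loop optimizations mentioned above using standard PyTorch utilities; the distributed setup and image IO pipeline are not reproduced, the backbone is assumed to be an nn.Sequential, and the hypothetical `head` is assumed to return the COURIER loss:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

scaler = torch.cuda.amp.GradScaler()          # mixed-precision loss scaling

def train_step(backbone, head, images, optimizer, n_segments=4):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # run the forward pass in mixed precision
        # gradient checkpointing: recompute backbone activations segment by segment in backward
        feats = checkpoint_sequential(backbone, n_segments, images)
        loss = head(feats)                    # assumed to compute the pre-training loss from features
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```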
Efficient clustering. In our scenario, both the number of image embeddings to be clustered and the number of clusters are very large, and the computational complexity of a vanilla k-means implementation, $O(NKd)$ per iteration for $N$ embeddings, $K$ clusters, and dimension $d$, is unaffordable at this scale. Practically, we adopt the high-speed learning-based clustering method proposed in [48], which reduces the computing time significantly from more than 7 days to about 6 hours. A new image is assigned to the ID of its closest cluster center. The number of new items is relatively small and has little impact on the performance. We re-train and replace all the cluster IDs regularly (e.g., every half a year).
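A sketch of mapping a new item image to an existing cluster ID by simple nearest-center assignment; the learned clustering model of [48] itself is not reproduced here, and for very large numbers of centers the distance computation would be chunked or delegated to an ANN library:

```python
import numpy as np

def assign_cluster_id(new_embeddings: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Assign each new image embedding to the ID of its closest cluster center."""
    # squared Euclidean distance between every new embedding and every center
    d2 = ((new_embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)                  # (num_new_items,) cluster IDs
```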
Online serving performance. To evaluate the improvement brought by COURIER to our online system, we conducted online A/B testing on our online shopping platform for 30 days; the results are reported in Tab.9.
We report improvements in the number of orders (# Order), click-through rate (CTR), and gross merchandise volume (GMV). Compared with the strongest deployed online baseline, COURIER significantly (p-value < 0.01) improves the CTR and GMV by +0.34% and +0.88% in women's clothing, respectively (the noise level is less than 0.1% according to the online A/A test). Such improvements are considered significant given the large volume of our online shopping platform. The model has been successfully deployed into production and serves the main traffic.
6 Conclusion
The visual information of a product has a significant impact on whether a user clicks. However, we have observed that the features extracted by existing image pre-training methods have limited utility for improving downstream CTR models. We attribute this to the fact that the labels typically used by existing methods, such as those for classification or image-text retrieval, have already been applied as features in CTR models, so existing image pre-training methods are insufficient for extracting features relevant to user interests. To address this issue, we propose a user intention reconstruction method to extract image features relevant to user click behavior. Specifically, to mitigate the problem that images in a user's click history may not be relevant to the current image, we employ an attention-based approach to identify images in the click sequence that are similar to the current image, and we then reconstruct the user's next-click image as a weighted sum of the embeddings of these related images. Furthermore, to prevent trivial solutions, we optimize a contrastive learning loss, reducing the reconstruction error for clicked images while increasing it for non-clicked images. Consequently, our method learns visual embeddings that influence whether a user clicks without relying on any downstream features directly; instead, it enhances the information available for downstream utilization from the perspective of intention reconstruction. Experiments conducted on various datasets and large-scale online evaluations confirm that the embeddings learned by our method significantly enhance the performance of downstream recommendation models.