1 Introduction
Predicting the click-through rate (CTR) is an essential task in recommendation systems based on deep learning [1–3]. Typically, CTR models measure the probability that a user will click on an item based on the user's profile and behavior history [2, 4], such as clicking, purchasing, adding to cart, etc. The behavior histories are represented by sequences of item IDs, titles, and some statistical features, such as monthly sales and favorable rate [1].
From an intuitive perspective, the visual representations of items hold significant importance in online item recommendation, particularly in categories like women's clothing. Recent studies have shown that incorporating modality information through end-to-end training can enhance the effectiveness of recommendations [5]. However, when dealing with large-scale product recommendation systems, where users generate billions of clicks daily, implementing end-to-end training becomes impractical [6]. A feasible alternative is to enhance image features with representation learning methods designed specifically for them. Nevertheless, our experiments reveal that embeddings derived from existing image feature learning methods such as SimCLR [7], SimSiam [8], and CLIP [9] yield only marginal gains on the downstream CTR prediction task. We ascribe this lack of success to two factors. First, in recommendation scenarios, user preferences for visual appearance tend to be vague and imprecise, which may not be captured by existing image feature learning methods that focus on label prediction. Moreover, augmentation methods that excel in tasks like image classification are unsuitable for recommendation, because they significantly alter an image's appearance. Second, the label information embedded in the pre-trained embeddings, such as classes or language descriptions, can already be utilized directly in item recommendation, rendering the information gain provided by pre-trained embeddings redundant. For instance, in Taobao we already possess categories, item titles, and style tags provided by the merchants. Consequently, a pre-trained model that performs well in predicting categories or titles contributes little novel information and does not enhance the CTR prediction task.
To boost the performance of downstream CTR prediction tasks, we argue that the representation learning method should be aware of the downstream CTR task while remaining decoupled from it to reduce computation. To achieve this goal, we propose a COntrastive UseR IntEntion Reconstruction (COURIER) method. Our method is based on an intuitive assumption: an item clicked by a user is likely to have visual characteristics similar to some of the items in the user's click history. One straightforward approach is to optimize the distance between user embeddings (comprised of item images previously clicked by the user) and clicked item image embeddings. Unlike the typical one-to-one correspondence in common contrastive learning, this establishes a many-to-one correspondence. Consequently, we must aggregate multiple item image embeddings from the user's historical clicks. Common aggregation methods include self-attention or pooling. However, user click histories often contain many images that have low relevance to the clicked items, and directly minimizing the distance between such aggregated embeddings may result in all images having very similar embeddings. To mine the visual features related to user interests, we propose reconstructing the next clicked item with a cross-attention mechanism over the click-history items. The reconstruction can be interpreted as a weighted sum of history item embeddings, which effectively selects related images from history in an end-to-end manner, as depicted in Fig.1. We propose to optimize a contrastive loss that not only encourages lower reconstruction error but also pushes embeddings of un-clicked items further apart.
We conducted various experiments to verify our motivation and the design of the method. Our pre-trained image embedding achieves 12% improvements in NDCG and Recall on several public datasets compared with strong baselines. On a large-scale Taobao dataset, our method achieves a 0.46% absolute offline improvement in AUC. In online A/B tests, we achieve a 0.88% improvement in GMV in the women's clothing category, which is significant considering the volume of the Taobao online shopping platform.
Fig.1 (a) Existing image feature learning methods are tailored for cross-modal prediction tasks. (b) We propose a user intention reconstruction method to mine potential visual features that cannot be reflected by cross-modal labels. In this example, the user searched for "Coat" and received two recommendations (page-viewed items), and clicked on the one on the right. Through user intention reconstruction, similar items from the user's click history receive larger attention and are combined into reconstructed PV item embeddings. We then optimize the PV embeddings and their reconstructions to be closer if the corresponding item is clicked and farther apart otherwise
Our contribution can be summarized as follows:
● To establish a one-to-one correspondence between images clicked by the user in the past and the image currently clicked by the user, we propose a user intention reconstruction method, which can mine latent user intention from history click sequences without any explicit semantic labels.
● The user intention reconstruction objective alone may lead to a collapsed solution. To solve this problem, we propose a contrastive training method that utilizes un-clicked items efficiently.
● We conduct comprehensive experiments on both private and public datasets to validate the effectiveness of our method. Additionally, we provide insights and share our practical experience regarding the deployment of image feature models in real-world large-scale recommendation systems.
2 Related work
From collaborative filtering [10, 11] to deep learning based recommender systems [12–14], IDs and categories (user IDs, item IDs, advertisement IDs, tags, etc.) are the essential features in item recommendation systems, as they straightforwardly represent identities. However, with the development of the multi-media internet, ID features alone can hardly cover all the important information. Thus, recommendation based on content such as images, videos, and texts has become an active research field in recommender systems. In this section, we briefly review the related work and discuss their differences compared with our method.
2.1 Content-based recommendation
In general, content is more important when the content itself is the concerned information. Thus, content-based recommendation has already been applied in news [15], music [16], image [17], and video [18] recommendations.
Research on content-based recommendation in the item recommendation task is much scarcer because of the dominating ID features. Early applications of image features typically use image embeddings extracted from a pre-trained image classification model, such as [19–25]. The image features adopted by these methods are trained on general classification datasets such as ImageNet, which may not fit recommendation tasks. Thus, [26] proposes to train on multi-modal data from the recommendation task. However, [26] does not utilize any recommendation labels such as clicks and payments, which yields marginal improvement if information from other modalities is already being used.
With the increase in computing power, some recent papers propose to train item recommendation models end-to-end with image feature networks [5, 27]. However, the datasets used in these papers are much smaller than our application scenario. For example, [27] uses a dataset consisting of 25 million interactions, while in our online system the average daily interactions in a single category (women's clothing) are about 850 million. We found it infeasible to train image networks end-to-end in our scenario, which motivates our decoupled two-stage framework with user-interest-aware embedding learning.
2.2 Image representation learning in recommendation
Self-supervised pre-training has been a hot topic in recent years. We classify the self-supervised learning methods into two categories: Augmentation-based and prediction-based.
Augmentation-based. Augmentation-based methods generate multiple different views of an image by random transformations; the model is then trained to pull the embeddings of different views closer and push other embeddings (augmented from different images) further apart. SimCLR [7], SimSiam [8], and BYOL [28] are well-known self-supervised methods in this category. These augmentation-based methods do not perform well in the item recommendation task, as shown in our experiments in Section 4.2.4. Since the augmentations are designed for classification (or segmentation, etc.) tasks, they change the visual appearance of the images without changing their semantic class, which contradicts the fact that visual appearance is also important in recommendation (e.g., color, shape).
Prediction-based. If the data can be split into more than one part, we can train a model that takes some of the parts as inputs and predicts the rest, which is the basic idea of prediction-based pre-training. Representative prediction-based methods include BERT [29], CLIP [9], etc. Prediction-based methods can be used to train on multi-modal recommendation data as proposed by [26]. However, if we are already utilizing multi-modal information, the improvements are limited, as shown in our experiments. To learn user interest information that cannot be provided by other modalities, we argue that user behaviors should be utilized, as in our proposed method.
2.3 Contrastive learning in recommender systems
Contrastive learning methods have also been adopted in recommender systems in recent years. The most explored augmentation-based approach is augmenting data by dropping, reordering, and masking some features [30], items [31, 32], and graph edges [33]. Prediction-based methods are also adopted for recommendation tasks, e.g., BERT4Rec [34] randomly masks some items and predicts them. However, all these recommender contrastive learning methods concentrate on augmentation and pre-training with ID features, while our method tackles representation learning of image features. Several recent works also consider mining users' intents with contrastive learning [35, 36]. Different from our concentration on visual features, they focus on learning with graph structures, and image features are not considered.
3 Contrastive user intention reconstruction
We briefly introduce some essential concepts and notations, then introduce our method in detail.
3.1 Preliminary
Notations. A data sample for CTR prediction in item search can be represented by a tuple (user, item, query, label). Typically, in recommendation tasks there is no explicit query, and the remaining aspects align with search; our approach is applicable in both scenarios. For simplicity, the subsequent sections consistently use the term "recommendation". A user searches for a query text, and several items are shown to the user. The items that the user clicks are labeled as positive and the rest as negative. When a user views a page, the list of items presented to the user is called the page-view (PV) list. The length of this PV list is denoted by $N$. Each PV item has a cover image, denoted by $p_i$, where $i = 1, \dots, N$. The corresponding click labels are denoted by $y_i \in \{0, 1\}$, where $y_i = 1$ indicates a click. Each user has a list of clicked item history; the image of the $j$th history item is denoted by $c_j$, $j = 1, \dots, M$, where $M$ is the length of the click history. $N$ and $M$ may vary across users and pages; in practice we trim or pad to the same length.
Attention. An attention layer is defined as follows:
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$
where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, and $d$ is the embedding dimension. The mechanism of the attention layer can be interpreted intuitively: for each query, a similarity score is computed with every key, and the values corresponding to the keys are weighted by their similarity scores and summed up to obtain the outputs. We refer interested readers to [37] for more details of the attention mechanism.
3.2 User intention reconstruction
In the following discussion, we only consider a single line of data (a PV list and the corresponding list of user click history); the batch training method will be discussed later. We use $P = [p_1, \dots, p_N]$ and $C = [c_1, \dots, c_M]$ to denote the matrices of all the PV and click-history images. All the images are fed to the image backbone (IB) to get their embeddings, denoted as $E^p = \mathrm{IB}(P) \in \mathbb{R}^{N \times d}$ and $E^c = \mathrm{IB}(C) \in \mathbb{R}^{M \times d}$ correspondingly. In our user intention reconstruction method, we treat the embeddings of PV images as queries $Q = E^p$, and we use the embeddings of click images as both keys and values $K = V = E^c$. The user intention reconstruction is then calculated by
$$\hat{E}^p = \mathrm{Attn}(E^p, E^c, E^c),$$
where the $i$th row satisfies $\hat{e}^p_i = \sum_{j=1}^{M} a_{ij} e^c_j$, and $a_{ij}$ is the attention weight on the $j$th history click item. The reason for the name (user intention reconstruction) is that the attention layer forces the outputs to be weighted sums of the embeddings of the historical click sequence. Thus, the output space is limited to convex combinations of $\{e^c_j\}$ within a simplex.
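A minimal PyTorch sketch of this reconstruction step may help fix the shapes; the single-head formulation and tensor names are our own simplification of the mechanism described above, not the production implementation:

```python
import torch
import torch.nn.functional as F

def reconstruct_pv(pv_emb: torch.Tensor, clk_emb: torch.Tensor):
    """Reconstruct PV embeddings as attention-weighted sums of click-history embeddings.

    pv_emb:  (N, d) embeddings of page-view images (queries)
    clk_emb: (M, d) embeddings of click-history images (keys and values)
    Returns: (N, d) reconstructions and the (N, M) attention weights.
    """
    d = pv_emb.size(-1)
    scores = pv_emb @ clk_emb.t() / d ** 0.5   # similarity of each PV item to each history item
    attn = F.softmax(scores, dim=-1)           # attention over the click history
    recon = attn @ clk_emb                     # each row is a convex combination of history embeddings
    return recon, attn
```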
3.3 Contrastive training method
The user intention reconstruction module alone cannot prevent the trivial solution in which all the embeddings collapse to the same value. Thus, we propose a contrastive method to train the user intention reconstruction module.
Given the PV embeddings $e^p_1, \dots, e^p_N$ and the corresponding reconstructions $\hat{e}^p_1, \dots, \hat{e}^p_N$, we calculate the pairwise similarity score of $e^p_i$ and $\hat{e}^p_j$, where $e^p_i$ represents the embedding of the $i$th PV item and $\hat{e}^p_j$ represents the reconstruction of the $j$th PV item:
$$s_{ij} = \frac{{e^p_i}^\top \hat{e}^p_j}{\lVert e^p_i \rVert\, \lVert \hat{e}^p_j \rVert}.$$
Then we calculate the contrastive loss by
$$\mathcal{L}_{\mathrm{PV}} = -\sum_{i=1}^{N}\sum_{j=1}^{N} \mathbb{1}[i = j,\, y_j = 1]\, \log \frac{\exp(s_{ij}/\tau)}{\sum_{k=1}^{N} \exp(s_{kj}/\tau)},$$
where $\tau$ is a temperature hyper-parameter. Here $\mathbb{1}[i = j,\, y_j = 1]$ is an indicator function that equals $1$ when $i = j$ and $y_j = 1$, and equals $0$ otherwise.
The contrastive loss with user intention reconstruction is depicted in Fig.2. The softmax is calculated column-wise, and only the positive PV images are optimized to be reconstructed (the two columns in dashed boxes). The behavior of the contrastive loss aligns with our assumption: positive PV images $e^p_j$ (with $y_j = 1$) are pulled closer to the corresponding reconstructions $\hat{e}^p_j$, while the negative PV images are not encouraged to be reconstructed. All the other PV embeddings $e^p_i$ with $i \neq j$ are pushed further away from $\hat{e}^p_j$, which prevents the trivial solution in which all the embeddings are the same. Some elements in the similarity matrix are not used in our loss, namely the columns with $y_j = 0$, since negative PV images are not supposed to be reconstructible; we keep these columns for ease of implementation.
Fig.2 The contrastive user intention reconstruction method. The images are fed into the image backbone model to obtain the corresponding embeddings. The embeddings of PV (page-view) sequences are blue-colored, the embeddings of click sequences are yellow-colored, and the reconstructions are in green. Red boxes denote positive PV items
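A sketch of the single-page loss under the notation above; the temperature value and the use of a mean over positive columns (rather than a sum) are placeholder choices of ours:

```python
import torch
import torch.nn.functional as F

def pv_contrastive_loss(pv_emb, recon, labels, tau=0.07):
    """Contrastive user-intention-reconstruction loss for one page.

    pv_emb: (N, d) PV image embeddings; recon: (N, d) reconstructions;
    labels: (N,) click labels in {0, 1}. tau is an illustrative temperature.
    """
    pv_n = F.normalize(pv_emb, dim=-1)
    rc_n = F.normalize(recon, dim=-1)
    sim = pv_n @ rc_n.t() / tau              # s_ij: PV item i vs. reconstruction of item j
    log_prob = F.log_softmax(sim, dim=0)     # softmax over PV items within each column j
    pos = labels.bool()
    # only diagonal entries of positive columns are pulled together
    return -log_prob.diagonal()[pos].mean()
```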
Extending to batched contrastive loss. The above contrastive loss is calculated using PV items within a single page (we have at most 10 items on a page), which provides only a limited number of negative samples. However, a well-known property of contrastive loss is that performance increases as the number of negative samples increases, which has been verified both practically [7] and theoretically [38]. To increase the number of negative samples, we propose to treat all other in-batch PV items as negative samples. Specifically, a batch of size $B$ contains $B \times N$ PV items; for a positive PV item, all the other $B \times N - 1$ PV items are used as negatives, which significantly increases the number of negatives.
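One way to realize the in-batch extension, continuing the sketch above (the exact batching of the production system may differ), is to flatten all PV items in the batch so that every reconstruction is contrasted against $B \times N$ candidates:

```python
import torch
import torch.nn.functional as F

def batched_pv_contrastive_loss(pv_emb, recon, labels, tau=0.07):
    """pv_emb, recon: (B, N, d); labels: (B, N). All other in-batch PV items act as negatives."""
    B, N, d = pv_emb.shape
    pv_flat = F.normalize(pv_emb.reshape(B * N, d), dim=-1)
    rc_flat = F.normalize(recon.reshape(B * N, d), dim=-1)
    sim = pv_flat @ rc_flat.t() / tau        # (B*N, B*N) similarity matrix
    log_prob = F.log_softmax(sim, dim=0)     # softmax over all B*N PV items per column
    pos = labels.reshape(-1).bool()
    return -log_prob.diagonal()[pos].mean()
```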
3.4 Contrastive learning on click sequences
In the contrastive loss introduced in Section 3.3, the negative and positive samples come from the same query. Since all the PV items presented to a user are already ranked by our online recommender system, they may be visually and conceptually similar and are relatively hard to distinguish with a contrastive loss. Although the batch training method introduces many easier negatives, the hard negatives still dominate the contrastive loss at the beginning of training, which makes the model hard to train.
Thus, we propose another contrastive loss on the user click history, which does not contain hard PV negatives. Specifically, suppose the user's click history is represented by the embeddings $e^c_1, \dots, e^c_M$. We treat the last item $e^c_M$ as the next item to be clicked (since the items are sorted by their click timestamps), and the remaining items are treated as the click history. Its reconstruction $\hat{e}^c_M$ is computed with the same cross-attention module over $e^c_1, \dots, e^c_{M-1}$, and the user click sequence loss is calculated as follows:
$$\mathcal{L}_{\mathrm{UCS}} = -\log \frac{\exp\!\big(\mathrm{sim}(e^c_M, \hat{e}^c_M)/\tau\big)}{\sum_{k} \exp\!\big(\mathrm{sim}(e^c_k, \hat{e}^c_M)/\tau\big)},$$
where $\mathrm{sim}(\cdot,\cdot)$ is the same cosine similarity as in Section 3.3 and the denominator ranges over the negative candidates. Here the label is always positive ($y = 1$) because all the samples in history click sequences are clicked. The user sequence loss provides an easier objective at the start of training, which helps the model learn in a curriculum style. It also introduces more training signals, which improves data efficiency. The overall loss of COURIER is:
$$\mathcal{L}_{\mathrm{COURIER}} = \mathcal{L}_{\mathrm{PV}} + \mathcal{L}_{\mathrm{UCS}}.$$
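A sketch of the leave-last-out construction of this loss; using the target itself as the attention query and drawing negatives from the remaining history items (rather than from other in-batch items) are simplifying assumptions made for brevity:

```python
import torch
import torch.nn.functional as F

def ucs_loss(clk_emb, tau=0.07):
    """clk_emb: (M, d) click-history embeddings sorted by click timestamp.
    The last click is the (always positive) target; the first M-1 items form the history."""
    history, target = clk_emb[:-1], clk_emb[-1:]                         # (M-1, d), (1, d)
    attn = F.softmax(target @ history.t() / history.size(-1) ** 0.5, dim=-1)
    recon = attn @ history                                               # (1, d) reconstruction of the next click
    cand = F.normalize(clk_emb, dim=-1)                                  # candidates: target plus easy negatives
    rc = F.normalize(recon, dim=-1)
    logits = (cand @ rc.t() / tau).t()                                   # (1, M)
    target_idx = torch.tensor([clk_emb.size(0) - 1])                     # the last item is the positive class
    return F.cross_entropy(logits, target_idx)
```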
3.5 Differences compared to established contrastive methods
Although COURIER is also a contrastive method, it differs significantly from classic contrastive methods. Firstly, in contrastive methods such as SimCLR [7] and CLIP [9], every sample has a corresponding positive counterpart. In our method, a negative PV item does not have a corresponding positive reconstruction, yet it still serves as a negative sample in the loss. Secondly, SimCLR and CLIP have a straightforward one-to-one matching, e.g., between a text and the corresponding image, or between views generated by augmentation. In recommendation systems, image augmentations are not applicable due to the distortion of appearance. Instead, a positive PV item corresponds to a list of history click items, which is transformed into a one-to-one matching with our user intention reconstruction module introduced in Section 3.2. Thirdly, another way to convert this many-to-one matching into one-to-one is to apply self-attention on the many-side, as suggested in [39], which turned out to perform worse in our scenario. We experiment with and analyze this method in Section 4.3.2 (w/o Reconstruction).
4 Experiments
To validate the efficacy of our proposed image representation learning method in enhancing recommendation performance, we conducted experiments on several publicly available product recommendation datasets. Subsequently, we proceeded with systematic experiments on actual data from Taobao and integrated our approach into the online system.
4.1 Experiments on public datasets
Since nearly all recommendation methods incorporate ID features, the addition of image or text features typically characterizes multimodal recommendation methods. Our CTR prediction model falls under this category as well. These methods primarily rely on fixed image and text backbone models to extract image features. However, our proposed image feature pre-training method aims to enhance the representational capabilities of image features in recommendation system methods. In this section, we will experimentally investigate whether our pre-trained image features can further enhance the performance of these methods.
Datasets. To validate the effectiveness of our proposed visual feature pre-training method in learning user-intention-related features, we select three categories from the Amazon Review dataset [40], namely Baby, Sports, and Clothing. Each item has corresponding image and text descriptions. We apply our method to the image features while keeping the text features unchanged. The statistics are shown in Tab.1. To evaluate the performance of different visual feature extraction methods, we divided the dataset into pre-training, training, validation, and testing sets, with proportions of 50%, 30%, 10%, and 10%, respectively. We group the datasets by user IDs to construct a list of positive items, and then we uniformly sample 10 negative items for each user to construct the pre-training dataset.
Tab.1 Statistics of public datasets

| Dataset | # Users | # Items | # Interactions |
|---|---|---|---|
| Baby | 19445 | 7050 | 160792 |
| Sports | 35598 | 18357 | 296337 |
| Clothing | 39387 | 23033 | 278677 |
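As a concrete illustration of the pre-training sample construction described above, the following sketch groups interactions by user and attaches 10 uniform negatives per user; the DataFrame column names are hypothetical:

```python
import numpy as np
import pandas as pd

def build_pretrain_samples(interactions: pd.DataFrame, all_items: np.ndarray,
                           n_neg: int = 10, seed: int = 0):
    """Group (user, item) interactions into positive lists and sample uniform negatives."""
    rng = np.random.default_rng(seed)
    samples = []
    for user, group in interactions.groupby("user_id"):
        positives = group["item_id"].tolist()
        pool = np.setdiff1d(all_items, positives)           # do not sample clicked items as negatives
        negatives = rng.choice(pool, size=n_neg, replace=False).tolist()
        samples.append({"user_id": user, "pos_items": positives, "neg_items": negatives})
    return samples
```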
Baselines. We compare with several representative ID-based and multi-modal recommendation methods. We select two ID-based recommendation methods, namely BPR [41] and SLMRec [23]. We select five multi-modal recommendation methods, namely DualGNN [21], LATTICE [22], MGCN [24], MMGCN [20], and BM3 [25]. To validate the information gained from our visual feature learning method, we concatenate the pre-trained embeddings with the original embeddings provided by [42]. Then, we apply the augmented embeddings to BM3 and MMGCN, which corresponds to BM3+ours and MMGCN+ours, respectively.
Implementation. All the baseline methods are tuned following [42]. Specifically, we tune the hyper-parameters (learning rate, weight decay, and method-specific hyper-parameters such as the contrastive loss weight in MGCN) of each method with a validation dataset. Then, we run each method with 5 different random seeds and report the average performance.
Results. As observed in Tab.2, our method, when combined with existing multimodal recommendation algorithms, can further enhance performance, achieving an average improvement of around 12% in terms of both Recall and NDCG.
Tab.2 Average Recall and NDCG performance comparison on public datasets

| Method | Sports R@20 | Sports R@50 | Sports N@50 | Baby R@20 | Baby R@50 | Baby N@50 | Clothing R@20 | Clothing R@50 | Clothing N@50 |
|---|---|---|---|---|---|---|---|---|---|
| BPR | 0.004 | 0.008 | 0.003 | 0.007 | 0.014 | 0.005 | 0.004 | 0.007 | 0.002 |
| SLMRec | 0.006 | 0.016 | 0.006 | 0.014 | 0.028 | 0.011 | 0.004 | 0.010 | 0.003 |
| DualGNN | 0.012 | 0.023 | 0.009 | 0.018 | 0.037 | 0.014 | 0.007 | 0.014 | 0.005 |
| LATTICE | 0.012 | 0.021 | 0.008 | 0.010 | 0.022 | 0.008 | − | − | − |
| MGCN | 0.015 | 0.027 | 0.011 | 0.017 | 0.032 | 0.013 | 0.011 | 0.019 | 0.008 |
| MMGCN | 0.015 | 0.028 | 0.012 | 0.024 | 0.061 | 0.024 | 0.008 | 0.017 | 0.006 |
| BM3 | 0.018 | 0.033 | 0.014 | 0.031 | 0.065 | 0.026 | 0.009 | 0.016 | 0.006 |
| MMGCN+ours | 0.017 | 0.031 | 0.013 | 0.030 | 0.061 | 0.024 | 0.010 | 0.019 | 0.007 |
| BM3+ours | 0.019 | 0.034 | 0.015 | 0.035 | 0.068 | 0.027 | 0.012 | 0.019 | 0.007 |
4.2 Experiments on Taobao offline dataset
To evaluate the information increment of pre-trained image embeddings on the CTR prediction task, we use an architecture that aligns with our online recommender system. In essence, each item is associated with a unique item ID and features. The item ID is transformed into an embedding and concatenated with other features, including the image embedding we have learned, to form the item features. Users consist of user features and their behavioral history, where the behavioral history comprises numerous items. Therefore, the user behavioral history is aggregated using self-attention, which is then concatenated with other user features. Subsequently, the user and item features are concatenated and serve as inputs to an MLP. The MLP’s final output is the user’s click probability, representing the predicted CTR. We provide a detailed description of this CTR model in the Appendixes.
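A highly simplified sketch of this downstream CTR architecture follows; the layer sizes, feature dimensions, and pooling choices are illustrative assumptions only, and the production model is the one described in the Appendixes:

```python
import torch
import torch.nn as nn

class SimpleCTRModel(nn.Module):
    def __init__(self, n_items, id_dim=32, img_dim=64, user_dim=64, hidden=256):
        super().__init__()
        self.item_id_emb = nn.Embedding(n_items, id_dim)
        self.hist_attn = nn.MultiheadAttention(id_dim + img_dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(user_dim + 2 * (id_dim + img_dim), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, item_id, item_img_emb, user_feat, hist_id, hist_img_emb):
        # item features: ID embedding concatenated with the pre-trained image embedding
        item = torch.cat([self.item_id_emb(item_id), item_img_emb], dim=-1)      # (B, id+img)
        # user behavior history aggregated with self-attention, then mean-pooled
        hist = torch.cat([self.item_id_emb(hist_id), hist_img_emb], dim=-1)      # (B, H, id+img)
        hist_agg, _ = self.hist_attn(hist, hist, hist)
        hist_agg = hist_agg.mean(dim=1)                                          # (B, id+img)
        x = torch.cat([user_feat, hist_agg, item], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)                            # predicted CTR
```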
Downstream usage of image representations. Practically, we find that how we feed the image features into the downstream CTR model is critical for the final performance. We experimented with three different methods: 1. Directly using embedding vectors. 2. Using similarity scores to the target item. 3. Using the cluster IDs of the embeddings. Cluster-ID is the best-performing method among the three methods, bringing about 0.1%−0.2% improvements on AUC compared to using embedding vectors directly. We attribute the success of Cluster-ID to its better alignment with our pre-training method. We provide a more detailed analysis in the Appendix.
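To illustrate the Cluster-ID variant, here is a plausible offline sketch using scikit-learn mini-batch k-means; the cluster count is a placeholder and the production system uses the learning-based clustering described in Section 5:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_cluster_id_feature(image_embeddings: np.ndarray, n_clusters: int = 10000, seed: int = 0):
    """Quantize image embeddings into discrete cluster IDs that the CTR model
    consumes as an ordinary categorical (ID) feature with its own embedding table."""
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed, batch_size=4096)
    cluster_ids = km.fit_predict(image_embeddings)      # (num_items,) integer cluster IDs
    return cluster_ids, km.cluster_centers_             # centers are kept to map new items later
```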
4.2.1 Taobao pre-training and CTR dataset
Pre-training dataset. The pre-training dataset was collected from 2022.11.18 to 2022.11.25 on our online search service. To reduce the computational burden, we down-sample the negative samples to 20%, so the click-through rate (CTR) increases to around 13%. We further sort the PV items by their labels to place positive samples in the front, and select the first 5 PV items to constitute the training targets ($N = 5$). We retain the latest 5 user-click-history items ($M = 5$). Thus, there are at most 10 items in one sample. There are three reasons for such data reduction: First, our dataset is still large enough after reduction. Second, the number of positive items in a PV sequence is less than 5 most of the time, so trimming PV sequences to 5 loses few positive samples, which are generally much more important than negatives. Third, we experimented with larger $N$ and $M$ and did not observe significant improvements, while the training time became significantly longer. Thus, we keep $N$ and $M$ fixed in all the experiments. We remove samples without a click history or without positive items within the page. Intuitively, women's clothing is one of the most difficult recommendation tasks (its testing AUC is significantly lower than average) and it also largely depends on the visual appearance of the items. Thus, we select the women's clothing category to form the training dataset of the pre-training task. The statistics of the pre-training dataset are summarized in Tab.3.
Tab.3 Pre-training dataset collected from the women's clothing category

| # Users | # Items | # Samples | CTR | # Hist | # PV Items |
|---|---|---|---|---|---|
| 71.7 million | 35.5 million | 311.6 million | 0.13 | 5 | 5 |
In the pre-training dataset, we only retain the item images and the click labels. All other information, such as item title, item price, user properties, and even the query by the users are dropped. We report the experimental results of training with text information in Section 4.3.4, which indicates that additional information is unnecessary.
CTR dataset. The average daily statistics of the downstream CTR datasets in all categories and women’s clothing are summarized in Tab.4. To avoid any information leakage, we use data collected from 2022.11.27 to 2022.12.04 on our online shopping platform to train the downstream CTR model, and we use data collected on 2022.12.05 to evaluate the performance. In the evaluation stage, the construction of the dataset aligns with our online system. We use all the available information to train the downstream CTR prediction model. The negative samples are also down-sampled to 20%. Different from the pre-training dataset, we do not group the page-view data in the evaluation dataset, so each sample corresponds to an item.
Tab.4 Daily average statistics of the downstream dataset

| Category | # Users | # Items | # Samples | CTR | # Hist |
|---|---|---|---|---|---|
| All | 0.118 billion | 0.117 billion | 4.64 billion | 0.139 | 98 |
| Women's | 26.39 million | 12.29 million | 874.39 million | 0.145 | 111.3 |
4.2.2 Evaluation metrics
● Area Under the ROC Curve (AUC): AUC is the most commonly used evaluation metric in evaluating ranking methods, which denotes the probability that a random positive sample is ranked before a random negative sample.
● Grouped AUC (GAUC): The AUC is a global metric that ranks all the predicted probabilities. However, in the online item-searching task, only the recalled items (relevant to the user's query) are considered by the ranking stage, so the ranking performance among them is more meaningful than global AUC. Thus, we propose a Grouped AUC metric, which is the average AUC within search sessions; a minimal computation sketch is given after this list.
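The sketch below computes GAUC as the plain average of per-session AUCs, assuming per-sample session IDs, labels, and predicted scores; sessions whose labels are all 0 or all 1 are skipped because AUC is undefined there:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def grouped_auc(session_ids, labels, scores):
    """Average AUC computed within each search session."""
    session_ids, labels, scores = map(np.asarray, (session_ids, labels, scores))
    aucs = []
    for sid in np.unique(session_ids):
        mask = session_ids == sid
        y, s = labels[mask], scores[mask]
        if y.min() == y.max():        # need both positive and negative samples in the session
            continue
        aucs.append(roc_auc_score(y, s))
    return float(np.mean(aucs)) if aucs else float("nan")
```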
4.2.3 Compared methods
● Baseline: Our baseline is the CTR model that serves our online system. It is noteworthy that we adopt a warmup strategy that uses our online model, trained with more than one year's data, to initialize all the weights (user ID embeddings, item ID embeddings, etc.), which makes it a fairly strong baseline.
● Supervised: To pre-train image embeddings with user behavior information, a straightforward method is to train a CTR model end-to-end with click labels and the image backbone network. We use the trained image network to extract embeddings in the same way as for the other compared methods.
● SimCLR [7]: SimCLR is a self-supervised image pre-training method based on augmentations and contrastive learning.
● SimSiam [8]: SimSiam is also an augmentation-based method. Different from SimCLR, SimSiam suggests that the contrastive loss is unnecessary and proposes directly minimizing the distance between matched embeddings.
● CLIP [9]: CLIP is a multi-modal pre-training method that optimizes a contrastive loss between image embeddings and text embeddings. We treat an item's cover image and its title as a matched sample. We use a pre-trained BERT [43] as the feature network of item titles, which is also trained end-to-end.
● MaskCLIP [44]: MaskCLIP is an improved version of CLIP with masked self-distillation on images and text.
4.2.4 Performance in downstream CTR task
The performances of the compared methods are summarized in Tab.5. We draw the following conclusions. Firstly, since all the methods are pre-trained on the women's clothing category, they all improve the AUC of the downstream women's clothing category: SimCLR, SimSiam, CLIP, MaskCLIP and COURIER outperform the Baseline and Supervised pre-training. Among them, our COURIER performs best, outperforming the Baseline by 0.46% AUC and the second-best method MaskCLIP by 0.15% AUC, which verifies our analysis that traditional CV pre-training methods provide little information gain to the CTR model. Secondly, we also check the performance across all categories. COURIER again performs best, with a 0.16% improvement in AUC. However, the other methods' performances differ significantly from the women's clothing category: the Supervised and SimSiam methods turn out to have a negative impact across all categories, and the improvements of CLIP, SimCLR and MaskCLIP become marginal. The reason is that these pre-training methods fail to extract general user interest information and overfit the women's clothing category. We analyze the performance in categories other than women's clothing in Section 4.3.5. Thirdly, the GAUC results indicate that the gains of CLIP, SimCLR and MaskCLIP vanish when we consider in-page ranking, which is indeed more important than global AUC, as discussed in Section 4.2.2. The GAUC performance further validates that COURIER can learn fine-grained user interest features that distinguish between in-page items.
Tab.5 The improvements of AUC (ΔAUC) in the women's clothing category, and of AUC and GAUC in all categories. We report the relative improvements compared to the Baseline method; the raw values of the metrics are in parentheses

| Methods | ΔAUC (Women's Clothing) | ΔAUC (All) | ΔGAUC (All) |
|---|---|---|---|
| Baseline | 0.00% (0.7785) | 0.00% (0.8033) | 0.00% (0.7355) |
| Supervised | +0.06% (0.7790) | −0.14% (0.8018) | −0.06% (0.7349) |
| CLIP [9] | +0.26% (0.7810) | +0.04% (0.8036) | −0.09% (0.7346) |
| SimCLR [7] | +0.28% (0.7812) | +0.05% (0.8037) | −0.08% (0.7347) |
| SimSiam [8] | +0.10% (0.7794) | −0.10% (0.8022) | −0.29% (0.7327) |
| MaskCLIP [44] | +0.31% (0.7815) | +0.03% (0.8035) | −0.03% (0.7352) |
| COURIER (ours) | +0.46% (0.7830) | +0.16% (0.8048) | +0.19% (0.7374) |
4.3 Further experimental analysis
We have validated the efficacy of the COURIER approach in enhancing the overall performance of recommendation algorithms on both public datasets and the Taobao dataset. However, we also aim to address the following inquiries:
1. Existing research has highlighted the significant impact of temperature coefficients in pre-training. Does the temperature coefficient similarly affect our pre-training task?
2. What is the respective contribution of each module in the COURIER approach towards performance improvement?
3. Currently, we only utilize image and user click information in the training of embeddings. Would the inclusion of other modalities, such as text, further enhance the performance?
4. As a pre-training method, can the embeddings acquired in the women’s clothing category also yield improvements in other categories?
5. Can our method solely utilize user click information to learn features relevant to user intent?
We will address the aforementioned inquiries through experimental investigation.
4.3.1 Influence of temperature
We experiment with different temperatures $\tau$. The results are shown in Fig.3. Note that these experiments were run with the Simscore method, which performs worse than Cluster-ID but is much faster. We pick the best-performing $\tau$ for COURIER and keep it fixed in all other experiments.
Fig.3 The impact of different values of $\tau$ on the performance of downstream CTR tasks. The horizontal axis represents the values of $\tau$, while the vertical axis denotes the change (%) in the metrics
4.3.2 Ablation study
We conduct the following ablation experiments to verify the effect of each component of COURIER.
● w/o UCS: Remove the user click sequence loss.
● w/o Contrast: Remove the contrastive loss and only minimize the reconstruction loss, similar to SimSiam [8].
● w/o Reconstruction: Use self-attention instead of cross-attention in the user intention reconstruction module.
● w/o Neg PV: Remove negative PV samples, only use positive samples.
The results in Tab.6 indicate that all the proposed components are necessary for the best performance.
Tab.6 Ablation studies of COURIER

| Method | ΔAUC (Women's Clothing) | ΔAUC (All) | ΔGAUC (All) |
|---|---|---|---|
| w/o UCS | 0.06% | −0.13% | 0.11% |
| w/o Contrast | 0.23% | 0.03% | −0.11% |
| w/o Reconstruction | 0.25% | 0.02% | −0.11% |
| w/o Neg PV | 0.30% | 0.07% | −0.06% |
| COURIER | 0.46% | 0.16% | 0.19% |
4.3.3 Influence of batch size
Due to both theoretical [38] and practical evidence [7] indicating that increasing the batch size in contrastive learning can enhance model generalization, we conducted experiments with different batch sizes. The results in Tab.7 also demonstrate that in our method, the larger the batch size, the better the performance.
Tab.7 Influence of batch size on performance

| Batch size | ΔAUC (Women's Clothing) | ΔAUC (All) | ΔGAUC (All) |
|---|---|---|---|
| 64 | 0.15% | −0.06% | −0.08% |
| 256 | 0.23% | 0.04% | 0.02% |
| 512 | 0.36% | 0.09% | 0.10% |
| 2048 | 0.43% | 0.15% | 0.17% |
| 3072 | 0.46% | 0.16% | 0.19% |
| 4096 | 0.47% | 0.16% | 0.21% |
On the flip side, increasing the batch size requires a substantial increase in computational power. In our framework, a batch size of 3072 requires 48 NVIDIA V100 GPUs, while a batch size of 4096 demands 64 GPUs. Considering the diminishing returns of further escalating the batch size, coupled with the training costs, we ultimately opted for a batch size of 3072 for deployment.
4.3.4 Train with text information
Text information is important for search and recommendation since it is directly related to the user's query and the properties of items. Thus, raw text information is already widely used in real-world systems. Co-training of texts and images also shows significant performance gains in computer vision tasks such as classification and segmentation. We are therefore interested in the influence of co-training with text information on COURIER. Specifically, we add a CLIP [9] objective besides COURIER, so that the loss function becomes $\mathcal{L}_{\mathrm{COURIER}} + \mathcal{L}_{\mathrm{CLIP}}$. The CLIP loss is calculated with item cover images and item titles. However, such multi-task training leads to worse downstream CTR performance, as shown in Tab.8, which indicates that co-training with text information may not help generalization when the text information is already available in the downstream task.
Tab.8 Train with text information

| Method | ΔAUC (Women's Clothing) | ΔAUC (All) | ΔGAUC (All) |
|---|---|---|---|
| w CLIP | 0.26% | 0.04% | −0.09% |
| COURIER | 0.46% | 0.16% | 0.19% |
4.3.5 Generalization in unseen categories
In Fig.4, we plot the AUC improvements of COURIER in different categories. We draw the following conclusions. Firstly, the performance improvement is largest in the women's clothing category, which is intuitive since the embeddings are trained with women's clothing data. Secondly, there are also significant improvements in women's shoes, children's shoes, children's clothing, underwear, etc. These categories are not used in the pre-training task, which indicates that COURIER can learn general visual characteristics that reflect user interests. Thirdly, the performance in bedding, cosmetics, knapsack, and handicrafts is also improved by more than 0.1%. These categories differ significantly from women's clothing in visual appearance, so COURIER has learned some features that transfer to them. Fourthly, COURIER does not have a significant impact on some categories and has a negative impact on the car category; these categories are less influenced by visual appearance and can be excluded when using our method to avoid performance drops. Fifthly, the performance is also related to the amount of data: categories with more data generally tend to perform better.
Fig.4 The AUC improvements of COURIER compared to the Baseline on different categories. The x-axis is sorted by the improvements |
4.3.6 Visualization of trained embeddings
Did COURIER really learn features related to user interests? We verified the quantitative improvements in CTR in Section 4.2.4; here, we also provide some qualitative analysis. During the training of COURIER, we did not use any information other than images and user clicks. Thus, if the embeddings contain semantic information, it must have been extracted from user behaviors. We plot some randomly selected embeddings with specific categories and style tags in Fig.5 and Fig.6. Firstly, embeddings from different categories are clearly separated, which indicates that COURIER can learn categorical semantics from user behaviors. Secondly, some of the style tags can be separated, such as Cool vs. Sexy; the well-separated tags are also intuitively easy to distinguish. Thirdly, some of the tags cannot be separated clearly, such as Mature vs. Cuties and Grace vs. Antique, which is also intuitive since these tags have relatively vague meanings and may overlap. Despite this, COURIER still learned some gradation between the two concepts. To conclude, the proposed COURIER method can learn meaningful user-interest-related features using only images and click labels.
Fig.5 T-SNE visualization of embeddings in different categories. (a) Dress and Jeans; (b) Shirt and Cheongsam; (c) Skirt and Fur
Fig.6 T-SNE visualization of embeddings with different style tags. We also plot some item images with different tags below the corresponding figures. (a) Cool and Sexy; (b) Mature and Cuties; (c) Grace and Antique
Tab.9 The A/B testing improvements of COURIER

| Category | # Order | CTR | GMV |
|---|---|---|---|
| All categories | +0.1% | +0.18% | +0.66% |
| Women's clothing | +0.31% | +0.34% | +0.88% |
5 Online experiments and deployment
During the deployment of our image feature learning method, we devoted substantial efforts to optimizing the model’s performance. Below, we summarize some of the impactful optimization measures that were undertaken:
Image backbone and performance optimization. We use the Swin-tiny transformer [45] trained on ImageNet as the image backbone model, which is faster than the full Swin model and has comparable performance. We train the backbone network end-to-end in the pre-training phase. However, since our pre-training dataset contains roughly 230 times more images than the well-known ImageNet dataset (which has about 1.3 million images), even the Swin-tiny model is slow. After applying gradient checkpointing [46], mixed-precision computation [47], and optimizing the distributed IO of images, we reduced the training time for one epoch on the pre-training dataset from 150 hours to 50 hours (consuming daily data in < 10 hours) with 48 NVIDIA V100 GPUs (32 GB).
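The following sketch shows the kind of training-loop optimizations mentioned above using standard PyTorch utilities; the distributed setup and image IO pipeline are not reproduced, the backbone is assumed to be an nn.Sequential, and the hypothetical `head` is assumed to return the COURIER loss:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

scaler = torch.cuda.amp.GradScaler()          # mixed-precision loss scaling

def train_step(backbone, head, images, optimizer, n_segments=4):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # run the forward pass in mixed precision
        # gradient checkpointing: recompute backbone activations segment by segment in backward
        feats = checkpoint_sequential(backbone, n_segments, images)
        loss = head(feats)                    # assumed to compute the pre-training loss from features
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```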
Efficient clustering. In our scenario, both the number of image embeddings to be clustered and the number of clusters are very large, and the computational complexity of a vanilla k-means implementation, $O(NKd)$ per iteration for $N$ embeddings, $K$ clusters, and dimension $d$, is unaffordable at this scale. Practically, we adopt the high-speed learning-based clustering method proposed in [48], which reduces the computing time significantly from more than 7 days to about 6 hours. A new image is assigned to the ID of its closest cluster center. The number of new items is relatively small and has little impact on the performance. We re-train and replace all the cluster IDs regularly (e.g., every half a year).
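A sketch of mapping a new item image to an existing cluster ID by simple nearest-center assignment; the learned clustering model of [48] itself is not reproduced here, and for very large numbers of centers the distance computation would be chunked or delegated to an ANN library:

```python
import numpy as np

def assign_cluster_id(new_embeddings: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Assign each new image embedding to the ID of its closest cluster center."""
    # squared Euclidean distance between every new embedding and every center
    d2 = ((new_embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)                  # (num_new_items,) cluster IDs
```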
Online serving performance. To evaluate the improvement brought by COURIER to our online system, we conducted online A/B testing on our online shopping platform for 30 days; the results are reported in Tab.9.
We report improvements in the number of orders (# Order), click-through rate (CTR), and gross merchandise volume (GMV). Compared with the strongest deployed online baseline, COURIER significantly (p-value < 0.01) improves the CTR and GMV by +0.34% and +0.88% in women's clothing, respectively (the noise level is less than 0.1% according to the online A/A test). Such improvements are considered significant given the large volume of our online shopping platform. The model has been successfully deployed into production and serves the main traffic.
6 Conclusion
The visual information of a product has a significant impact on whether a user clicks. However, we have observed that the features extracted by existing image pre-training methods have limited utility for improving downstream CTR models. We attribute this to the fact that the labels typically used by existing methods, such as those for classification or image-text retrieval, have already been applied as features in CTR models, so existing image pre-training methods are insufficient for extracting features relevant to user interests. To address this issue, we propose a user intention reconstruction method to extract image features relevant to user click behavior. Specifically, to mitigate the problem that images in a user's click history may not be relevant to the current image, we employ an attention-based approach to identify images in the click sequence that are similar to the current image, and we then reconstruct the user's next-click image as a weighted sum of the embeddings of these related images. Furthermore, to prevent trivial solutions, we optimize a contrastive learning loss, reducing the reconstruction error for clicked images while increasing it for non-clicked images. Consequently, our method learns visual embeddings that influence whether a user clicks without relying on any downstream features directly; instead, it enhances the information available for downstream utilization from the perspective of intention reconstruction. Experiments conducted on various datasets and large-scale online evaluations confirm that the embeddings learned by our method significantly enhance the performance of downstream recommendation models.