COURIER: contrastive user intention reconstruction for large-scale visual recommendation

Jia-Qi YANG, Chenglei DAI, Dan OU, Dongshuai LI, Ju HUANG, De-Chuan ZHAN, Xiaoyi ZENG, Yang YANG

Front. Comput. Sci., 2025, 19(7): 197602. DOI: 10.1007/s11704-024-3939-x
Information Systems
RESEARCH ARTICLE

Abstract

With the advance of the multimedia internet, visual characteristics increasingly influence whether users click on items in online retail. Incorporating visual features is therefore a promising direction for further improving click-through rate (CTR) prediction. However, experiments on our production system revealed that simply injecting image embeddings trained with established pre-training methods yields only marginal improvements. We believe the main strength of existing image pre-training methods lies in cross-modal prediction, which differs significantly from CTR prediction in recommendation systems: other modalities of information (such as text) can already be used directly as features in downstream models, so even excellent cross-modal prediction performance provides little additional information gain for them. We argue that a visual feature pre-training method tailored for recommendation is necessary to improve beyond existing modality features. To this end, we propose an effective user intention reconstruction module that mines visual features related to user interests from behavior histories, constructing a many-to-one correspondence. Extensive experimental evaluations on public datasets and on our production system verify that our method learns users' visual interests. It achieves a 0.46% improvement in offline AUC and a 0.88% improvement in Taobao GMV (Gross Merchandise Volume) with p-value < 0.01.
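To make the abstract's idea concrete, the following is a minimal PyTorch sketch of how a user intention reconstruction module could be paired with a contrastive objective: the behavior-history image embeddings are attended with the candidate item as the query to reconstruct its embedding, and an InfoNCE loss with in-batch negatives pulls each reconstruction toward its own candidate. The attention form, embedding dimensions, temperature, and loss are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of contrastive user intention reconstruction.
import torch
import torch.nn.functional as F
from torch import nn


class IntentionReconstructor(nn.Module):
    """Reconstructs a candidate-item embedding from the user's behavior history."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # candidate item -> attention query
        self.key = nn.Linear(dim, dim)     # history items  -> attention keys
        self.value = nn.Linear(dim, dim)   # history items  -> attention values

    def forward(self, history: torch.Tensor, candidate: torch.Tensor) -> torch.Tensor:
        # history: (B, T, D) image embeddings of previously clicked items
        # candidate: (B, D) image embedding of the item being scored
        q = self.query(candidate).unsqueeze(1)             # (B, 1, D)
        k, v = self.key(history), self.value(history)      # (B, T, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return (attn @ v).squeeze(1)                       # (B, D) reconstructed intention


def contrastive_loss(recon: torch.Tensor, candidate: torch.Tensor, tau: float = 0.07):
    """InfoNCE with in-batch negatives: each reconstruction should match its own candidate."""
    recon = F.normalize(recon, dim=-1)
    candidate = F.normalize(candidate, dim=-1)
    logits = recon @ candidate.t() / tau                   # (B, B) similarity matrix
    labels = torch.arange(recon.size(0), device=recon.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    B, T, D = 32, 20, 128                                  # batch size, history length, embedding dim
    model = IntentionReconstructor(D)
    history = torch.randn(B, T, D)                         # stand-in for pre-extracted image features
    candidate = torch.randn(B, D)
    loss = contrastive_loss(model(history, candidate), candidate)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

Because many history items attend to a single candidate, this sketch reflects the many-to-one correspondence mentioned above; in practice the image features would come from a pre-trained visual backbone rather than random tensors.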

Keywords

user intention reconstruction / contrastive learning / personalized searching / image features

Cite this article

Jia-Qi YANG, Chenglei DAI, Dan OU, Dongshuai LI, Ju HUANG, De-Chuan ZHAN, Xiaoyi ZENG, Yang YANG. COURIER: contrastive user intention reconstruction for large-scale visual recommendation. Front. Comput. Sci., 2025, 19(7): 197602 https://doi.org/10.1007/s11704-024-3939-x

Jia-Qi Yang earned his ME degree in 2021 and is currently pursuing a PhD at the State Key Laboratory for Novel Software Technology, Nanjing University, China. His research is primarily focused on machine learning and data mining, with specific expertise in uncertainty calibration, recommendation systems, and AI for science. He serves as a program committee member and reviewer for conferences such as AAAI, NeurIPS, and ICLR.

Chenglei Dai received his master's degree in statistics from Zhejiang University, China in 2016. He currently works on Alibaba's search ranking team; his research interests include large language models and reinforcement learning. He has published many papers in domestic and international journals and conferences.

Dan Ou received her bachelor's and master's degrees from Wuhan University, China. She is currently a Senior Algorithm Expert in the Algorithm Technology Team at TaoTian Group, responsible for text search algorithms.

Dongshuai Li received his master's degree from Tongji University, China in 2020. He is currently a senior algorithm engineer at Alibaba Inc., China, and has placed in the top three in several competitions. His research interests include multimodal representation learning and computer vision.

Ju Huang is currently employed at Alibaba Group, working on the development of deep learning training systems. Her work includes the optimization and acceleration of large language model training on PyTorch and the Nebula Algorithm Platform.

De-Chuan Zhan joined the LAMDA Group in 2004 and received his PhD degree in computer science from Nanjing University, China in 2010 under the supervision of Prof. Zhi-Hua Zhou. He then served in the Department of Computer Science and Technology of Nanjing University as an Assistant Professor from 2010 and as an Associate Professor from 2013, and joined the School of Artificial Intelligence of Nanjing University as a Professor in 2019. His research interests mainly include machine learning and data mining, especially mobile intelligence, distance metric learning, and multi-modal learning. He has published over 90 papers in national and international journals and conferences such as TPAMI, TKDD, TIFS, TSMSB, IJCAI, ICML, NeurIPS, and AAAI. He served as the deputy director of the LAMDA Group at Nanjing University and as the director of the AI Innovation Institute of AI Valley, Nanjing, China.

Xiaoyi Zeng is the intelligent technology leader of Alibaba International Digital Commerce, responsible for the search, recommendation, advertising, and user growth algorithms of international e-commerce shopping platforms.

Yang Yang received his PhD degree in computer science from Nanjing University, China in 2019. In the same year, he became a faculty member at Nanjing University of Science and Technology, China, where he is currently a Professor in the School of Computer Science and Engineering. His research interests lie primarily in machine learning and data mining, including heterogeneous learning, model reuse, and incremental mining. He has published prolifically in refereed journals and conference proceedings, including IEEE Transactions on Knowledge and Data Engineering (TKDE), ACM Transactions on Information Systems (TOIS), ACM Transactions on Knowledge Discovery from Data (TKDD), ACM SIGKDD, ACM SIGIR, WWW, IJCAI, and AAAI. He received the Best Paper Award at ACML 2017 and serves as a PC/SPC member for leading conferences such as IJCAI, AAAI, ICML, and NeurIPS.

Acknowledgements

This work was supported by the National Science and Technology Major Project (No. 2022ZD0114805), the National Key R&D Program of China (2022YFF0712100), the National Natural Science Foundation of China (Grant No. 62276131), the Fundamental Research Funds for the Central Universities (No. 30922010317), the Collaborative Innovation Center of Novel Software Technology and Industrialization, the Postgraduate Research & Practice Innovation Program of Jiangsu Province (No. KYCX24_0229), and the Scholarship of China Scholarship Council.

Competing interests

The authors declare that they have no competing interests or financial conflicts to disclose.

RIGHTS & PERMISSIONS

2025 Higher Education Press