Abstract
Clothing attribute recognition has become an essential technology that enables users to automatically identify the characteristics of clothes and search for clothing images with similar attributes. However, existing methods cannot recognize newly added attributes and may fail to capture region-level visual features. To address these issues, a region-aware fashion contrastive language-image pre-training (RaF-CLIP) model was proposed. The model aligned cropped and segmented images with category and multiple fine-grained attribute texts, achieving the matching of fashion regions and their corresponding texts through contrastive learning. Clothing retrieval finds suitable clothing based on user-specified clothing categories and attributes, and to further improve retrieval accuracy, an attribute-guided composed network (AGCN) was introduced as an additional component on RaF-CLIP, specifically designed for composed image retrieval. This task aims to modify a reference image according to a textual expression so as to retrieve the expected target. By adopting a transformer-based bidirectional attention and gating mechanism, the network realizes the fusion and selection of image features and attribute text features. Experimental results show that the proposed model achieves a mean precision of 0.663 3 on the attribute recognition task and a recall@10 of 39.18 on the composed image retrieval task (recall@k is defined as the percentage of correct samples appearing in the top k retrieval results), satisfying user needs for freely searching for clothing through images and texts.
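As a concrete illustration of the recall@k metric defined above, the sketch below computes it for a toy retrieval setup with cosine similarity over embedding vectors. The function name, the random toy data, and the embedding dimensions are illustrative assumptions, not the paper's actual evaluation code:

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, target_idx, k=10):
    """Fraction of queries whose correct gallery item appears in the
    top-k results when the gallery is ranked by cosine similarity."""
    # L2-normalize so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                            # (num_queries, gallery_size)
    # indices of the k most similar gallery items for each query
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.asarray(target_idx)[:, None]).any(axis=1)
    return hits.mean()

# toy example: 3 queries whose targets are gallery items 0, 5, and 7
rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 16))
gallery = rng.normal(size=(20, 16))
print(recall_at_k(queries, gallery, [0, 5, 7], k=10))
```

In the paper's setting the query embedding would come from fusing the reference-image and attribute-text features, and the gallery would hold the candidate clothing images.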
Keywords
attribute recognition / image retrieval / contrastive language-image pre-training (CLIP) / image text matching / transformer
Cite this article
Kangping WANG, Mingbo ZHAO.
Region-Aware Fashion Contrastive Learning for Unified Attribute Recognition and Composed Retrieval.
Journal of Donghua University (English Edition), 2024, 41(4): 405-415. DOI: 10.19884/j.1672-5220.202405006
Funding
National Natural Science Foundation of China (61971121)