Alignment efficient image-sentence retrieval considering transferable cross-modal representation learning
Yang YANG , Jinyi GUO , Guangyu LI , Lanyu LI , Wenjie LI , Jian YANG
Front. Comput. Sci. ›› 2024, Vol. 18 ›› Issue (1) : 181335
Traditional image-sentence cross-modal retrieval methods usually aim to learn consistent representations of heterogeneous modalities, so that instances in one modality can be retrieved by a query from the other. The basic assumption behind these methods is that parallel multi-modal data (i.e., different modalities of the same example are aligned) is available in advance. In other words, image-sentence cross-modal retrieval is a supervised task with the alignments as ground truths. However, in many real-world applications, it is difficult to align a large amount of parallel data for new scenarios due to the substantial labor costs, leaving only non-parallel multi-modal data, to which existing methods cannot be applied directly. On the other hand, auxiliary parallel multi-modal data with similar semantics often exists and can help the non-parallel data learn consistent representations. Therefore, in this paper, we study “Alignment Efficient Image-Sentence Retrieval” (AEIR), which resorts to auxiliary parallel image-sentence data as the source domain and takes the non-parallel data as the target domain. Unlike single-modal transfer learning, AEIR learns consistent image-sentence cross-modal representations for the target domain by transferring the alignments of existing parallel data. Specifically, AEIR learns image-sentence consistent representations in the source domain with parallel data, while transferring alignment knowledge across domains by jointly optimizing a novel cross-domain cross-modal metric learning constraint with an intra-modal domain adversarial loss. Consequently, we can effectively learn consistent representations for the target domain, considering both structure and semantic transfer. Furthermore, extensive experiments on different transfer scenarios validate that AEIR achieves better retrieval results than the baselines.
image-sentence retrieval / transfer learning / semantic transfer / structure transfer
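The alignment supervision on the parallel source-domain data is typically realized by a cross-modal metric learning objective. As a concrete illustration (a minimal sketch, not the paper's exact formulation: the embedding dimensionality, margin value, and hinge-based bidirectional triplet form are assumptions based on common practice in image-sentence retrieval), a ranking loss can require each aligned image-sentence pair to score higher than any mismatched pair by a margin:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def triplet_ranking_loss(images, sentences, margin=0.2):
    """Bidirectional hinge loss over a mini-batch of aligned
    (image, sentence) embedding pairs: each aligned pair must outscore
    every mismatched pair, in both retrieval directions, by `margin`."""
    n = len(images)
    loss = 0.0
    for i in range(n):
        pos = cosine(images[i], sentences[i])  # score of the aligned pair
        for j in range(n):
            if j == i:
                continue
            # image -> sentence direction: penalize a mismatched sentence
            loss += max(0.0, margin - pos + cosine(images[i], sentences[j]))
            # sentence -> image direction: penalize a mismatched image
            loss += max(0.0, margin - pos + cosine(images[j], sentences[i]))
    return loss / n
```

When the aligned pairs already score well above all mismatches, the loss is zero; swapping the alignments drives it positive, which is the signal the paper's metric learning constraint transfers to the non-parallel target domain.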
Higher Education Press