A novel deep high-level concept-mining jointing hashing model for unsupervised cross-modal retrieval

Chun-Ru Dong, Jun-Yan Zhang, Feng Zhang, Qiang Hua, Dachuan Xu

High-Confidence Computing ›› 2025, Vol. 5 ›› Issue (2) : 100274. DOI: 10.1016/j.hcc.2024.100274

Research article

Abstract

Unsupervised cross-modal hashing has achieved great success in various information retrieval applications owing to its efficient storage usage and fast retrieval speed. Recent studies have primarily focused on training the hash-encoding networks with a sample-based similarity matrix to improve retrieval performance. However, two issues remain to be solved: (1) the current sample-based similarity matrix considers only the similarity between image-text pairs and ignores the different information densities of the two modalities, which may introduce additional noise and fail to mine the key information needed for retrieval; (2) most existing unsupervised cross-modal hashing methods consider only the alignment between different modalities while ignoring the consistency within each modality, resulting in semantic conflicts. To tackle these challenges, this study proposes a novel Deep High-level Concept-mining Jointing Hashing (DHCJH) model for unsupervised cross-modal retrieval. DHCJH captures the essential high-level semantic information from the image modality and integrates it into the text modality to improve the accuracy of the guidance information. Additionally, a new hashing loss with a regularization term is introduced to avoid cross-modal semantic collisions and false-positive pairs. Extensive comparison experiments on benchmark datasets validate the proposed method, and the experimental findings reveal that DHCJH achieves superior performance in both accuracy and efficiency. The code of DHCJH is available on GitHub.
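
To make the two ingredients described above concrete, the following is a minimal sketch in PyTorch of the general unsupervised cross-modal hashing recipe: a fused similarity matrix built from both modalities as guidance, and a hashing loss with a regularization term. The function names, the fusion weight, and the exact loss form are illustrative assumptions for this sketch, not the actual DHCJH objective or training code.

```python
# Hypothetical sketch of a fused similarity matrix and a hashing loss with a
# quantization-style regularizer; an illustration only, not the DHCJH model.
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(x: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarities of a batch of feature vectors."""
    x = F.normalize(x, dim=1)
    return x @ x.t()

def fused_similarity(img_feat: torch.Tensor, txt_feat: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Blend image- and text-side similarity matrices into one guidance matrix."""
    s_img = cosine_similarity_matrix(img_feat)
    s_txt = cosine_similarity_matrix(txt_feat)
    return alpha * s_img + (1.0 - alpha) * s_txt

def hashing_loss(img_hash: torch.Tensor, txt_hash: torch.Tensor,
                 target_sim: torch.Tensor, reg_weight: float = 0.1) -> torch.Tensor:
    """Align cross-modal hash-code similarities with the guidance matrix and
    penalize deviation of the relaxed codes from +/-1 (regularization term)."""
    s_hash = F.normalize(img_hash, dim=1) @ F.normalize(txt_hash, dim=1).t()
    align = F.mse_loss(s_hash, target_sim)
    quant = ((img_hash.abs() - 1.0) ** 2).mean() + ((txt_hash.abs() - 1.0) ** 2).mean()
    return align + reg_weight * quant

# Toy usage with random tensors standing in for backbone features and relaxed codes.
img_feat, txt_feat = torch.randn(8, 512), torch.randn(8, 512)
img_hash, txt_hash = torch.tanh(torch.randn(8, 64)), torch.tanh(torch.randn(8, 64))
loss = hashing_loss(img_hash, txt_hash, fused_similarity(img_feat, txt_feat))
```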

Keywords

Concept-mining / Unsupervised learning / Cross-modal / Information retrieval

Cite this article

Chun-Ru Dong, Jun-Yan Zhang, Feng Zhang, Qiang Hua, Dachuan Xu. A novel deep high-level concept-mining jointing hashing model for unsupervised cross-modal retrieval. High-Confidence Computing, 2025, 5(2): 100274 DOI:10.1016/j.hcc.2024.100274

CRediT authorship contribution statement

Chun-Ru Dong: Data curation, Formal analysis, Investigation. Jun-Yan Zhang: Conceptualization, Data curation, Investigation, Methodology, Writing - original draft. Feng Zhang: Conceptualization, Methodology, Supervision, Writing - review & editing. Qiang Hua: Conceptualization, Resources. Dachuan Xu: Resources, Validation.

Code availability

The code is available at https://github.com/JunyanZhang/DHCJH.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is supported by the National Key R&D Program of China (2022YFE0196100), the Beijing Natural Science Foundation (Z200002), and the Natural Science Foundation of Hebei Province (F2018201115).

Data availability

All data used in this study come from openly accessible benchmark datasets.

