AVCLNet: Multimodal Multispeaker Tracking Network Using Audio-Visual Contrastive Learning

Yihan Li, Yidi Li, Zhenhuan Xu, Hao Guo, Mengyuan Liu, Weiwei Wan

CAAI Transactions on Intelligence Technology, 2026, Vol. 11, Issue (1): 238-255. DOI: 10.1049/cit2.70092

ORIGINAL RESEARCH

Abstract

Audio-visual speaker tracking aims to determine the locations of multiple speakers in a scene by leveraging signals captured from multisensor platforms. Multimodal fusion methods can improve both the accuracy and robustness of speaker tracking. However, in complex multispeaker tracking scenarios, critical challenges such as cross-modal feature discrepancy, ambiguity in localising weak sound sources and frequent identity switch errors remain unresolved; these severely hinder the modelling of speaker identity consistency and consequently lead to degraded tracking accuracy and unstable tracking trajectories. To this end, this paper proposes a multimodal multispeaker tracking network using audio-visual contrastive learning (AVCLNet). AVCLNet integrates heterogeneous modal representations into a unified space through audio-visual contrastive learning, which facilitates cross-modal feature alignment, mitigates cross-modal feature bias and enhances identity-consistent representations. In the audio-visual measurement stage, we design a vision-guided weak sound source weighted enhancement method, which leverages visual cues to establish cross-modal mappings and employs a spatiotemporal dynamic weighting mechanism to improve the detectability of weak sound sources. Furthermore, in the data association phase, a dual geometric constraint strategy is introduced by combining 2D and 3D spatial geometric information, reducing frequent identity switch errors. Experiments on the AV16.3 and CAV3D datasets show that AVCLNet outperforms state-of-the-art methods, demonstrating superior robustness in multispeaker scenarios.
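To make the contrastive alignment idea at the core of AVCLNet concrete, the sketch below implements a generic symmetric audio-visual InfoNCE objective: matched audio and visual embeddings of the same speaker are pulled together in a shared space, while all other pairings in the batch act as negatives. This is a minimal sketch under assumed interfaces; the embedding dimension, temperature and the `audio_visual_contrastive_loss` helper are illustrative assumptions, not the paper's actual loss or architecture.

```python
# Minimal sketch of a symmetric audio-visual contrastive (InfoNCE) objective,
# in the spirit of the alignment stage the abstract describes. All names,
# dimensions and hyperparameters here are assumptions for illustration.
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_emb: torch.Tensor,
                                  visual_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, visual_emb: (B, D) embeddings of the same B speakers.
    Matched audio-visual pairs share a row index; every other pairing
    in the batch serves as a negative."""
    a = F.normalize(audio_emb, dim=-1)      # unit-norm audio features
    v = F.normalize(visual_emb, dim=-1)     # unit-norm visual features
    logits = a @ v.t() / temperature        # (B, B) scaled cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric InfoNCE: audio-to-visual and visual-to-audio retrieval.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)

# Example: a batch of 8 speakers embedded in an assumed 128-D shared space.
if __name__ == "__main__":
    audio = torch.randn(8, 128)
    visual = torch.randn(8, 128)
    print(audio_visual_contrastive_loss(audio, visual))
```

Similarly, the dual geometric constraint in the data association phase can be pictured as a fused matching cost over 2D box overlap and 3D position distance; the weighting scheme and normalisation below are assumptions for illustration only, not the paper's actual association rule.

```python
def dual_geometric_cost(iou_2d: float, dist_3d: float,
                        alpha: float = 0.5, max_dist: float = 2.0) -> float:
    """Hypothetical fused association cost: low when the 2D boxes overlap
    strongly AND the 3D positions are close (dist_3d in metres)."""
    return alpha * (1.0 - iou_2d) + (1.0 - alpha) * min(dist_3d / max_dist, 1.0)
```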

Keywords

computer vision / machine perception / multimodal approaches / pattern recognition / video signal processing

Cite this article

Yihan Li, Yidi Li, Zhenhuan Xu, Hao Guo, Mengyuan Liu, Weiwei Wan. AVCLNet: Multimodal Multispeaker Tracking Network Using Audio-Visual Contrastive Learning. CAAI Transactions on Intelligence Technology, 2026, 11(1): 238-255. DOI: 10.1049/cit2.70092


Funding

This work was supported by the National Natural Science Foundation of China (62403345), the Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology (2024B1212010006), and the Shanxi Provincial Department of Science and Technology Basic Research Project (202403021212174, 202403021221074).

Conflicts of Interest

The authors declare no conflicts of interest.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.
