A variational stochastic Dirichlet process-based autoencoder model for fine-grained music source separation

Yin ZHU , Jingqi LI , Cong JIN , Qiuqiang KONG , Hongming SHAN , Junping ZHANG

Front. Comput. Sci. ›› 2027, Vol. 21 ›› Issue (1): 2101308 DOI: 10.1007/s11704-025-51043-2
Artificial Intelligence
RESEARCH ARTICLE

Abstract

Traditional source separation methods rely on coarse-grained categorical labels, labeling all vocals in an audio mixture collectively without distinguishing individual voices, which inherently limits the ability to isolate single tracks. While fine-grained annotations could partially address this issue, they demand substantial resources and still face challenges in extracting tracks from raw signals. To overcome these limitations, we propose to extract each track by decomposing the patterns of data generation. Specifically, we propose the Variational Stochastic Dirichlet Process VAE, a variational autoencoder framework obtained by replacing the standard variational distribution with a variational stochastic Dirichlet process (VSDP). In our framework, the encoder, leveraging stick-breaking constructions, adaptively partitions the latent space into clusters, while the decoder, designed to recover each component, achieves implicit signal separation. The advantage is that the reconstruction target can be shifted from the raw input to its individual components. Experiments demonstrate our method's efficacy in two scenarios: (1) under coarse-grained source definitions, it reaches near-state-of-the-art performance (SDR = 10.3); (2) for fine-grained track separation, it identifies 83% of individual vocal tracks with an average SDR of 7.8, which other SOTA methods cannot achieve without the help of annotations.
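For readers unfamiliar with the stick-breaking construction that the encoder leverages, the following is a minimal NumPy sketch of how a truncated Dirichlet process yields mixture weights; the function name, the truncation level K, and the concentration value are illustrative assumptions, not details from the paper.

```python
import numpy as np

def stick_breaking_weights(alpha: float, K: int, seed=None) -> np.ndarray:
    """Truncated stick-breaking construction of Dirichlet process weights.

    Draws v_k ~ Beta(1, alpha) and sets pi_k = v_k * prod_{j<k}(1 - v_j):
    each v_k takes a fraction of the stick left over by the previous breaks,
    so the K weights are non-negative and sum to at most 1, with the
    unassigned tail mass shrinking as K grows.
    """
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=K)
    # Length of stick remaining before each break: 1, (1-v_1), (1-v_1)(1-v_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining

weights = stick_breaking_weights(alpha=2.0, K=10, seed=0)
```

A larger concentration alpha spreads mass over more sticks (more active clusters), while a small alpha concentrates mass on the first few; this is the mechanism that lets the number of effective latent clusters adapt to the data.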

Keywords

music source separation / variational stochastic process / generative model / fine-grained track separation

Cite this article

Download citation ▾
Yin ZHU, Jingqi LI, Cong JIN, Qiuqiang KONG, Hongming SHAN, Junping ZHANG. A variational stochastic Dirichlet process-based autoencoder model for fine-grained music source separation. Front. Comput. Sci., 2027, 21(1): 2101308 DOI: 10.1007/s11704-025-51043-2



RIGHTS & PERMISSIONS

Higher Education Press

