A data representation method using distance correlation
Xinyan LIANG, Yuhua QIAN, Qian GUO, Keyin ZHENG
Associations between features have been shown to improve the representation ability of data. However, the original association data reconstruction method may face two issues: the dimension of the reconstructed data is necessarily higher than that of the original data, and the adopted association measure does not balance effectiveness and efficiency well. To address these two issues, this paper proposes a novel association-based representation improvement method named AssoRep. AssoRep first obtains the association between features via distance correlation, which has some advantages over Pearson’s correlation coefficient. An enhancement matrix is then formed by stacking the association values of every pair of features. Next, an improved feature representation is obtained by aggregating the original features with the enhancement matrix. Finally, the improved feature representation is mapped to a low-dimensional space via principal component analysis. The effectiveness of AssoRep is validated on 120 datasets, and the results further improve our previous work on association data reconstruction.
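The four steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the original features are aggregated with the association matrix, so the sketch assumes a simple concatenation of each sample `x` with `x @ R`, and the function names (`distance_correlation`, `association_matrix`, `assorep_sketch`) are hypothetical. The distance correlation follows the standard sample definition of Székely et al. (double-centered pairwise distance matrices), and the final projection is a plain SVD-based principal component analysis.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation of two 1-D arrays of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute-difference (distance) matrices.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double centering: subtract column and row means, add the grand mean.
    A = a - a.mean(axis=0) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = (A * B).mean()   # squared distance covariance
    dvar_x = (A * A).mean()  # squared distance variance of x
    dvar_y = (B * B).mean()  # squared distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0  # at least one input is constant
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def association_matrix(X):
    """Symmetric matrix of pairwise distance correlations between columns of X."""
    n_features = X.shape[1]
    R = np.eye(n_features)  # diagonal set to 1 by convention
    for i in range(n_features):
        for j in range(i + 1, n_features):
            R[i, j] = R[j, i] = distance_correlation(X[:, i], X[:, j])
    return R


def assorep_sketch(X, n_components):
    """Association matrix -> aggregation -> PCA, under the stated assumptions."""
    X = np.asarray(X, dtype=float)
    R = association_matrix(X)
    # ASSUMPTION: aggregation is modeled as concatenating X with X @ R.
    Z = np.hstack([X, X @ R])
    # PCA via SVD of the centered matrix; keep the leading components.
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:n_components].T
```

A quick way to see the advantage over Pearson’s coefficient mentioned in the abstract is to compare both measures on `x = np.linspace(-1, 1, 51)` and `y = x ** 2`: the Pearson coefficient is essentially zero because the relationship is symmetric and non-linear, while the distance correlation is clearly positive. Note that the pairwise loop computes an n-by-n distance matrix for every feature pair, so this sketch is only practical for small datasets.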
association / representation / distance correlation / classification
Xinyan Liang received the PhD degree in computer science and technology from Shanxi University, China in 2022. He is currently a Lecturer at the Institute of Big Data Science and Industry, Shanxi University, China. He was a visiting scholar at The University of Hong Kong, China in 2018. His main research interests include multi-modal machine learning, evolutionary intelligence, and their applications. He has published several journal papers in his research fields, including in IEEE TPAMI and IEEE TEVC.
Yuhua Qian received the MS and PhD degrees in computers with applications from Shanxi University, China in 2005 and 2011, respectively. He is currently a Professor with the Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, China. He is best known for multigranulation rough sets in learning from categorical data and granular computing. His research covers machine learning, pattern recognition, feature selection, granular computing, and artificial intelligence. He has authored over 150 articles on these topics in international journals. He served on the Editorial Board of the International Journal of Knowledge-Based Organizations and Artificial Intelligence Research.
Qian Guo received the PhD degree in computer science and technology from Shanxi University, China in 2022. She is currently a Lecturer at the School of Computer Science and Technology, Taiyuan University of Science and Technology, China. She was a visiting scholar at The University of Hong Kong, China in 2018. Her current research interests include logic learning, abstract reasoning, deep learning, and their applications.
Keyin Zheng received the BS degree in information and computing science and the Master’s degree in pattern recognition and intelligent systems from the School of Mathematical Sciences, Shanxi University, China in 2012 and 2015, respectively. She is a PhD candidate at the Institute of Big Data Science and Industry, Shanxi University, China. Her research interests include concept learning and machine learning.