A novel classifier for multivariate instance using graph class signatures
Parnika PARANJAPE, Meera DHABU, Parag DESHPANDE
A novel classifier for multivariate instance using graph class signatures
Applications like identifying different customers from their unique buying behaviours, determining ratingsof a product given by users based on different sets of features, etc. require classification using class-specific subsets of features. Most of the existing state-of-the-art classifiers for multivariate data use complete feature set for classification regardless of the different class labels. Decision tree classifier can produce class-wise subsets of features. However, none of these classifiers model the relationship between features which may enhance classification accuracy. We call the class-specific subsets of features and the features’ interrelationships as class signatures. In this work, we propose to map the original input space of multivariate data to the feature space characterized by connected graphs as graphs can easily model entities, their attributes, and relationships among attributes. Mostly, entities are modeled using graphs, where graphs occur naturally, for example, chemical compounds. However, graphs do not occur naturally in multivariate data. Thus, extracting class signatures from multivariate data is a challenging task. We propose some feature selection heuristics to obtain class-specific prominent subgraph signatures. We also propose two variants of class signatures based classifier namely: 1) maximum matching signature (gMM), and 2) score and size of matched signatures (gSM). The effectiveness of the proposed approach on real-world and synthetic datasets has been studied and compared with other established classifiers. Experimental results confirm the ascendancy of the proposed class signatures based classifier on most of the datasets.
multivariate instance / data transformation / subgraph feature selection / class signatures / classification
[1] |
Theodoridis S, Koutroumbas K. Pattern Recognition. 4th ed. USA: Academic Press, Inc., 2008
|
[2] |
Esmeir S, Markovitch S. Lookahead-based algorithms for anytime induction of decision trees. In: Proceedings of the 21st ACM International Conference on Machine Learning. 2004, 33
CrossRef
Google scholar
|
[3] |
Tan P J, Dowe D L. MML inference of decision graphs with multiway joins and dynamic attributes. In: Proceedings of Australasian Joint Conference on Artificial Intelligence. 2003, 269–281
CrossRef
Google scholar
|
[4] |
Zhu Y, Yu J X, Cheng H, Qin L. Graph classification: a diversified discriminative feature selection approach. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 2012, 205–214
CrossRef
Google scholar
|
[5] |
Ranu S, Hoang M, Singh A K. Mining discriminative subgraphs from global-state networks. In: Proceedings of the 19th ACM International Conference on Knowledge Discovery and DataMining. 2013, 509–517
CrossRef
Google scholar
|
[6] |
Dang X H, Singh A K,Bogdanov P, You H, Hsu B. Discriminative subnetworks with regularized spectral learning for global-state network data. In: Proceedings of Joint European Conference onMachine Learning and Knowledge Discovery in Databases. 2014, 290–306
CrossRef
Google scholar
|
[7] |
Dang X H, You H, Bogdanov P, Singh A K. Learning predictive substructures with regularization for network data. In: Proceedings of IEEE International Conference on Data Mining. 2015, 81–90
CrossRef
Google scholar
|
[8] |
Vert J P. Kernel methods in computational biology. Bussei Kenkyu, 2004, 81(97): 35–90
|
[9] |
Mahhé P, Ueda N, Akutsu T, Perret J L. Extensions of marginalized graph kernels. In: Proceedings of the 21st ACM International Conference on Machine Learning. 2004, 70
CrossRef
Google scholar
|
[10] |
Seeland M, Karwath A, Kramer S. A structural cluster kernel for learning on graphs. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and DataMining. 2012, 516–524
CrossRef
Google scholar
|
[11] |
Riesen K, Bunke H. Graph classification by means of Lipschitz embedding. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2009, 39(6): 1472–1483
CrossRef
Google scholar
|
[12] |
Bandyopadhyay D, Huan J, Liu J, Prins J, Snoeyink J, Wang W, Tropsha A. Structure-based function inference using protein family-specific fingerprints. Protein Science, 2006, 15(6): 1537–1543
CrossRef
Google scholar
|
[13] |
Cheng H, Yan X, Han J, Hsu C W. Discriminative frequent pattern analysis for effective classification. In: Proceedings of the 23rd IEEE International Conference on Data Engineering. 2007, 716–725
CrossRef
Google scholar
|
[14] |
Deshpande M, Kuramochi M, Wale N, Karypis G.Frequent substructure-based approaches for classifying chemical compounds. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(8): 1036–1050
CrossRef
Google scholar
|
[15] |
Yan X, Han
|
[16] |
Inokuchi A, Washio T, Motoda H. An apriori-based algorithm for mining frequent substructures from graph data. In: Proceedings of European Conference on Principles of Data Mining and Knowledge Discovery. 2000, 13–23
CrossRef
Google scholar
|
[17] |
Kuramochi M, Karypis G. Frequent subgraph discovery. In: Proceedings of the IEEE International Conference on Data Mining. 2001, 313–320
|
[18] |
Huan J, Wang W, Prins J. Efficient mining of frequent subgraphs in the presence of isomorphism. In: Proceedings of the 3rd IEEE International Conference on Data Mining. 2003, 549–552
|
[19] |
Nijssen S, Kok J N. A quickstart in frequent structure mining can make a difference. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and DataMining. 2004, 647–652
CrossRef
Google scholar
|
[20] |
Zhou K. gBolt: very fast implementation for gSpan algorithm in data mining. GitHub, 2017
|
[21] |
Yan X, Cheng H, Han J, Yu P S. Mining significant graph patterns by leap search. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. 2008, 433–444
CrossRef
Google scholar
|
[22] |
Saigo H, Krämer N, Tsuda K. Partial least squares regression for graph mining. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008, 578–586
CrossRef
Google scholar
|
[23] |
Thoma M, Cheng H, Gretton A, Han J, Kriegel H P, Smola A, Song L, Yu P S, Yan X, Borgwardt K. Near-optimal supervised feature selection among frequent subgraphs. In: Proceedings of the SIAM International Conference on Data Mining. 2009, 1076–1087
CrossRef
Google scholar
|
[24] |
Ranu S, Singh A K. Graphsig: a scalable approach to mining significant subgraphs in large graph databases. In: Proceedings of the 25th IEEE International Conference on Data Engineering. 2009, 844–855
CrossRef
Google scholar
|
[25] |
Jin N, Young C, Wang W. Graph classification based on pattern cooccurrence. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management. 2009, 573–582
CrossRef
Google scholar
|
[26] |
Jin N, Young C, Wang W. GAIA: graph classification using evolutionary computation. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. 2010, 879–890
CrossRef
Google scholar
|
[27] |
Kong X, Yu P S. Semi-supervised feature selection for graph classification. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2010, 793–802
CrossRef
Google scholar
|
[28] |
Kong X, Philip S Y.gMLC: a multi-label feature selection framework for graph classification. Knowledge and Information Systems, 2012, 31(2): 281–305
CrossRef
Google scholar
|
[29] |
Pan S, Zhu X, Zhang C, Philip S Y. Graph stream classification using labeled and unlabeled graphs. In: Proceedings of the 29th IEEE International Conference on Data Engineering. 2013, 398–409
|
[30] |
Kong X, Yu P S, Wang X, Ragin A B. Discriminative feature selection for uncertain graph classification. In: Proceedings of the SIAM International Conference on Data Mining. 2013, 82–93
CrossRef
Google scholar
|
[31] |
Wu J, Zhu X, Zhang C, Philip S Y. Bag constrained structure pattern mining for multi-graph classification. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(10): 2382–2396
CrossRef
Google scholar
|
[32] |
Pan S, Wu J, Zhu X, Zhang C. Joint structure feature exploration and regularization for multi-task graph classification. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(3): 715–728
CrossRef
Google scholar
|
[33] |
Pan S, Wu J, Zhu X, Long G, Zhang C. Task sensitive feature exploration and learning for multitask graph classification. IEEE Transactions on Cybernetics, 2017, 47(3): 744–758
CrossRef
Google scholar
|
[34] |
Pan S, Wu J, Zhu X, Long G, Zhang C. Finding the best not the most: regularized loss minimization subgraph selection for graph classification. Pattern Recognition, 2015, 48(11): 3783–3796
CrossRef
Google scholar
|
[35] |
Saigo H, Nowozin S, Kadowaki T, Kudo T, Tsuda K. gBoost: a mathematical programming approach to graph classification and regression. Machine Learning, 2009, 75(1): 69–89
CrossRef
Google scholar
|
[36] |
Pan S, Zhu X. Graph classification with imbalanced class distributions and noise. In: Proceedings of the 23rd International Joint Conference on Artificial Intelligence. 2013, 1586–1592
|
[37] |
Kudo T, Maeda E, Matsumoto Y. An application of boosting to graph classification. In: Proceedings of the 17th International Conference on Neural Information Processing Systems. 2004, 729–736
|
[38] |
Pan S, Wu J, Zhu X, Zhang C. Graph ensemble boosting for imbalanced noisy graph stream classification. IEEE Transactions on Cybernetics, 2015, 45(5): 954–968
CrossRef
Google scholar
|
[39] |
Pan S, Wu J, Zhu X. CogBoost: boosting for fast cost-sensitive graph classification. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(11): 2933–2946
CrossRef
Google scholar
|
[40] |
Cook D J, Holder L B. Mining Graph Data. John Wiley & Sons, 2006
CrossRef
Google scholar
|
[41] |
Witten I H, Frank E, Hall M A, Pal C J. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2016
|
[42] |
Thrun S B, Bala J W, Bloedorn E, Bratko I, Cestnik B, Cheng J, De Jong K A, Dzeroski S, Fisher D H, Fahlman S E, Hamann R. The monk’s problems: a performance comparison of different learning algorithms. Technical Report of Carnegie Mellon University, 1991
|
/
〈 | 〉 |