Automatic malware classification and new malware detection using machine learning
Liu LIU, Bao-sheng WANG, Bo YU, Qiu-xi ZHONG
The explosive growth of malware variants poses a major threat to information security. Traditional signature-based anti-virus systems fail to classify unknown malware into its corresponding family and to detect new kinds of malware. We therefore propose a machine-learning-based malware analysis system composed of three modules: data processing, decision making, and new malware detection. The data processing module extracts malware features from gray-scale images, opcode n-grams, and import functions. The decision-making module uses these features to classify malware and to identify suspicious samples. Finally, the detection module uses the shared nearest neighbor (SNN) clustering algorithm to discover new malware families. We evaluate our approach on more than 20 000 malware instances collected by Kingsoft, ESET NOD32, and Anubis. The results show that the system classifies unknown malware with a best accuracy of 98.9% and detects 86.7% of the new malware.
Malware classification / Machine learning / n-gram / Gray-scale image / Feature extraction / Malware detection
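To make two of the ingredients named in the abstract concrete, the following is a minimal sketch of opcode n-gram counting and of the shared nearest neighbor (SNN) similarity that SNN clustering builds on. It is not the authors' implementation: the n-gram length, the use of cosine similarity, the neighborhood size k, and all function names are assumptions chosen for illustration, and the opcode sequences are assumed to have been produced by a disassembler beforehand.

```python
from collections import Counter
from math import sqrt


def opcode_ngrams(opcodes, n=3):
    """Count contiguous opcode n-grams in one disassembled sample (n is an assumption)."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (an assumed base similarity)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knn_lists(vectors, k):
    """For each sample, return the index set of its k most similar other samples."""
    neighbors = []
    for i, vec in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        ranked = sorted(others, key=lambda j: cosine(vec, vectors[j]), reverse=True)
        neighbors.append(set(ranked[:k]))
    return neighbors


def snn_similarity(neighbors, i, j):
    """Number of shared neighbors, counted only when i and j list each other."""
    if i in neighbors[j] and j in neighbors[i]:
        return len(neighbors[i] & neighbors[j])
    return 0


if __name__ == "__main__":
    # Hypothetical toy opcode sequences; real input would come from disassembly.
    samples = [
        ["push", "mov", "call", "pop", "ret"],
        ["push", "mov", "call", "pop", "ret", "nop"],
        ["push", "mov", "call", "pop", "jmp"],
        ["xor", "add", "sub", "xor", "add", "sub"],
        ["xor", "add", "sub", "xor", "add", "mul"],
    ]
    vectors = [opcode_ngrams(s, n=2) for s in samples]
    neighbors = knn_lists(vectors, k=2)
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            print(i, j, snn_similarity(neighbors, i, j))
```

In an SNN-style clustering, pairs with a high shared-neighbor count would be linked and dense groups of linked samples treated as candidate families. Samples that fit no known family are the ones a new-malware detection step would examine.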