Impact of preprocessing on medical data classification
Sarab ALMUHAIDEB, Mohamed El Bachir MENAI
The significance of the preprocessing stage in any data mining task is well known. Before attempting medical data classification, characteristics of medical datasets, including noise, incompleteness, and the presence of multiple and possibly irrelevant features, need to be addressed. In this paper, we show that selecting the right combination of preprocessing methods has a considerable impact on the classification potential of a dataset. The preprocessing operations considered include the discretization of numeric attributes, the selection of attribute subsets, and the handling of missing values. The classification is performed by an ant colony optimization algorithm as a case study. Experimental results on 25 real-world medical datasets show a significant relative improvement in predictive accuracy, exceeding 60% in some cases.
classification / ant colony optimization / medical data classification / preprocessing / feature subset selection / discretization
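For illustration only, the sketch below chains the three preprocessing operations named above (missing-value imputation, discretization of numeric attributes, and feature subset selection) in front of a classifier, assuming Python with scikit-learn. The decision tree is merely a stand-in for the ant colony optimization rule learner used in the paper, and the dataset, bin count, and number of selected attributes are arbitrary choices rather than the paper's experimental setup.

# Minimal sketch of a preprocessing-then-classify pipeline (assumes scikit-learn).
# The ACO classifier from the paper is not available here, so a decision tree
# stands in for it; only the preprocessing chain mirrors the operations above.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

# A UCI medical dataset (Wisconsin breast cancer); it happens to have no
# missing values, so the imputation step is a harmless no-op here.
X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    # Replace missing values with the most frequent value of each attribute.
    ("impute", SimpleImputer(strategy="most_frequent")),
    # Discretize numeric attributes into five ordinal bins per attribute.
    ("discretize", KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")),
    # Keep the 10 attributes with the highest mutual information with the class.
    ("select", SelectKBest(score_func=mutual_info_classif, k=10)),
    # Stand-in classifier; the paper uses an ACO-based rule learner instead.
    ("classify", DecisionTreeClassifier(random_state=0)),
])

scores = cross_val_score(pipeline, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

Swapping individual steps (a different discretizer, selector, or imputation strategy) and re-running the cross-validation is one way to compare preprocessing combinations, which is the kind of comparison the paper carries out with its ACO classifier.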