School of Computer Science and Technology, Tianjin University, Tianjin 300350, China
pufengdu@gmail.com
Received: 2016-06-08 · Accepted: 2016-10-21 · Published: 2016-12-01 · Revised: 2016-11-23
Abstract
Background: Many existing bioinformatics predictors are based on machine learning technology. When applying these predictors in practical studies, their predictive performances should be well understood. Different studies apply different performance measures as well as different evaluation methods. Even for the same performance measure, different terms, nomenclatures or notations may appear in different contexts.
Results: We carried out a review of the most commonly used performance measures and evaluation methods for bioinformatics predictors.
Conclusions: Correctly understanding and interpreting predictive performance is important in bioinformatics, as it is the key to rigorously comparing different predictors and choosing the right one.
High-throughput sequencing technology, developed in recent years, has enabled us to acquire high-quality genomic sequences at different levels for every living cell on this planet [1]. Sequencing technology has been widely applied in clinical trials to find the target genes of a drug or to develop precise therapies for complex diseases [2]. All these developments rely on the fact that the cost of sequencing is dropping exponentially (https://www.genome.gov/sequencingcosts/). In contrast, high-throughput technologies for determining the function of a gene are still lacking. Even with comprehensive genomic sequences in hand, it is still costly and time consuming to find out the molecular functions of a particular gene in a cellular process. There is a huge information gap between biological sequences and molecular functions [3]. Moreover, this gap is becoming wider every day.
To bridge this gap, one feasible solution is to use computational methods to predict molecular functions from primary sequences. Computational predictions of molecular functions may contain errors. However, they still provide informative hints for experimental studies [4]. Since the cost of computational predictions is minimal, computational approaches have become increasingly popular in the life sciences over the last two decades.
Various computational predictors have been established to estimate functional information from the primary sequences of proteins and genes. Most of these predictors use machine learning algorithms, such as support vector machines [5], artificial neural networks [6], K-nearest neighbor algorithms [7], Bayesian analysis [8] and many more [9]. Because machine learning algorithms usually use existing data to establish predictive models, it is possible that a predictive model is over-fitted or over-optimized on the existing data. Here, the terms “over-fitted” and “over-optimized” mean that the predictive model works well when it is tested with existing data, but its prediction performance drops drastically when it is applied in practical studies with novel data. Therefore, when applying these computational predictors, it is vitally important to understand their mechanisms and the conditions of their performance in the first place. Without such knowledge, applying predictive computational methods in studies is no better than mixing reagents in tubes without knowing what they are.
When a paper describes a predictive model, it always reports its predictive performance as well as how that performance was estimated and measured. We would like to emphasize that the way these performances are estimated and measured is much more important than the values themselves. For example, when one predictor reports a prediction accuracy of 86% and another reports 84%, it is not certain that the former performs better than the latter. The actual qualities of their predictions are tightly related to how these numbers were obtained. This is often confusing to users of predictors who do not have a strong statistics or machine learning background.
In addition, the authors of bioinformatics papers usually have different expertise and professional backgrounds. They may use different terminologies for the same performance measure, or the same term for different measures. For example, “accuracy”, “sensitivity” and “precision” appear commonly in descriptions of the predictive performance of bioinformatics predictors. However, these terms may have the same interpretation in some studies and different interpretations in others. The readers should pay more attention to the definitions of these measures than to their values.
In this paper, we will discuss in detail the performance measures as well as how these measures should be estimated and reported. We will first introduce the general concept of machine learning, followed by a discussion of the basic methods for testing a machine learning system. We will then introduce several statistics that are commonly used as performance measures in bioinformatics studies. We will also discuss how to interpret the values of these measures and how to compare the performances of different predictors.
BASIC CONCEPTS OF MACHINE LEARNING IN BIOINFORMATICS
Machine learning, a subfield of computer science, aims at developing algorithms that create models of naturally existing systems from examples of their inputs and outputs [10]. Machine learning algorithms are usually based on statistics. Machine learning is also called learning from data or learning from examples. This is different from other kinds of systems, which follow strictly static program instructions [11].
Figure 1 gives a diagram of a machine learning system. Let S be a naturally existing system. It can be described by a deterministic generalized function f(x). Due to our limited knowledge of its mechanism, it is difficult to obtain the details of f(x). However, when we give an x as the input of S, S will give an output y that depends completely on x. When we have many different pairs of x and y, a learning machine, the LM in Figure 1, can be established using machine learning algorithms. LM can be described as a generalized function fe(x), which is an estimate of f(x).
We expect LM to behave like S as much as possible. When we give a series of x, which have never been seen by LM, as the input of both LM and S, the series of ye, the outputs of LM, should be almost the same as the series of y, the outputs of S. In effect, LM can predict the output of S, which is exactly why LM can be called a “predictor”.
Due to our limited knowledge of f(x), the following facts exist: (i) there is no way to guarantee that LM always produces exactly the same results as S; (ii) it is difficult to use analytical methods to compare fe(x) and f(x); (iii) fe(x) and f(x) can be very different even if LM works well on many examples.
In bioinformatics, S is usually a biological system. For example, S can be the cellular mechanism that recognizes the cleavage sites of signal peptides on protein sequences. Given x, which represents a protein sequence, the output of S is y, which represents the position of the cleavage site on the sequence. LM is a predictor, like LabCaS [12] or Cascleave [13]. Although the exact mechanism by which S determines the position of cleavage sites is still unknown, LM can predict a large proportion of cleavage sites correctly. However, without thorough experimental validation, we cannot say that S works with the same or a similar mechanism as LM.
A number of existing predictors in bioinformatics are classifiers, which are a type of learning machine. The input of a classifier is usually a vector x, which is assumed to contain all the necessary information of a sample. The vector x is usually defined in Rd (x∈Rd). In machine learning, the vector x is termed the features of a sample. Rd, which contains all possible x, is called the feature space. The output of a classifier is usually an integer y (y∈Z), which indicates the class that x belongs to. The integer y is called the class label of the vector x. In bioinformatics, the vector x may represent different kinds of biological information, such as DNA sequences, RNA sequences, protein sequences, expression profiles and genetic variations. The output integer y can represent different attributes of samples, such as subcellular locations of proteins [14], structural classes of proteins [15,16], nucleosome positions in the genome [17], modification states of proteins [18,19], epistatic interactions [20,21] and many others [22–30].
The “multi-class classifier” [31] and the “multi-label classifier” [32] are two different terms in machine learning. Both of them have been introduced to bioinformatics [33]. Their meanings should not be confused. The term “multi-class classifier” describes a classifier that can assign one and only one label, out of more than two possible labels, to every sample. This is to be distinguished from the “binary classifier”, which can assign one of only two possible labels to every sample. The “multi-” in “multi-class classifier” means that the number of all possible labels is more than two. However, the “multi-” in “multi-label classifier” has nothing to do with the number of all possible labels. It means that the classifier can assign one or more than one label to every sample. This is to be distinguished from the classical “single-label classifiers”, which assign only one label to every sample. For example, Hum-PLoc [34], which can predict protein subcellular locations, is a “multi-class classifier”. It can assign only a single subcellular location to every protein. On the contrary, iLoc-Hum [35], which can also predict protein subcellular locations, is a “multi-label classifier”. It can assign one or more than one subcellular location to every protein.
Training and testing are two required procedures in establishing a practically applicable machine learning based predictor. The training procedure establishes the learning machine with a dataset, which is called the training dataset. The training procedure is usually an optimization procedure that aims at minimizing the error rate on the training dataset. The testing procedure validates whether the learning machine can actually work in predicting class labels from the features. It aims at simulating the practical application scenario. The dataset used in the testing procedure is called the testing dataset. The predictive performance values reported in papers are usually obtained from the testing procedures. The choice of the training and testing datasets and the design of the training and testing procedures are important in evaluating predictors. In the next part of this review, we will discuss the protocols commonly used to evaluate machine learning based predictors in bioinformatics.
EVALUATION METHODS
When evaluating a bioinformatics predictor, it is important to understand that the predictor has a real predictive performance, which exists objectively and is independent of the evaluation method. This real predictive performance could only be obtained by testing the predictor with an infinite number of samples that uniformly fill the entire feature space. Since this is impossible in practice, the purpose of all evaluation methods is to estimate this real predictive performance using a finite number of samples with known features and labels. In bioinformatics, samples, and especially the labels of samples, are usually expensive. The number of samples available to train a predictor is usually less than sufficient. For example, when we created the first protein submitochondrial location predictor, we had only 317 proteins to train a three-class predictor [36]. This is different from other common applications of machine learning, such as voice recognition, image recognition or natural language processing, where millions of samples can be acquired easily. Therefore, some concepts of testing and evaluating predictors in machine learning cannot be directly introduced into bioinformatics.
In machine learning, there are three different approaches to evaluating a predictor. As displayed in Figure 2, they are known as the independent dataset test, the re-substituting test and cross validation. The cross validation method can be further divided into two different methods, the leave-one-out cross validation and the n-fold cross validation.
From the machine learning point of view, the independent dataset test should be the most recommended. In an independent dataset test, the testing dataset must have a sufficient size, which is usually much larger than the training dataset. There should be no overlap between the testing dataset and the training dataset. Ideally, the testing dataset should be uniformly distributed over the entire feature space. However, in bioinformatics, since the available samples are usually limited, it is hardly possible to fulfill all these requirements.
As an accommodation in bioinformatics, independent dataset tests are carried out in a different manner. The whole dataset is randomly partitioned into two parts: a larger part and a smaller part. The larger part, which usually contains over 70% of all samples, is used as the training dataset. The remaining smaller part serves as the testing dataset. This method can be called a subsampling test. Although this method fulfills the requirement of independence, the size of the testing dataset is usually small, which results in a large variance of the estimated performance. That is to say, if this test were carried out several times, the predictive performance would differ from test to test and spread over a wide range. Therefore, in practical applications, the users are likely to find that the predictive performance is much higher or lower than the value estimated from a single run of this kind of independent dataset test. As a remedy, in evaluating the predictor SubMem, we reported the predictive performance as the average over several subsampling tests with different partitions of the dataset [37].
The re-substituting test should be the least recommended. This is also true in bioinformatics. In a re-substituting test, the testing dataset is identical to the training dataset. As the training process aims at minimizing the error rate on the training dataset, using the same dataset to test the predictor will always yield an over-estimated performance. The predictive performance in practical applications can be much poorer than in the re-substituting test. An impressive performance value, like 99% accuracy in a re-substituting test, usually indicates that the predictor is over-optimized and useless in practice.
In machine learning, cross-validation methods are considered a compromise solution when the number of available samples is very limited. Because of the limited number of available samples in bioinformatics, cross validation methods have become the most recommended evaluation methods. In a cross-validation test, all the data are used as both training and testing data. However, they are not used in the same way as in the re-substituting test. In an n-fold cross-validation test, the entire dataset is randomly partitioned into n parts of equal size. The training and testing process is carried out for n rounds. In the k-th round, the k-th part is used as the testing dataset, while the remaining n−1 parts form the training dataset. After all n rounds of training and testing, every sample in the dataset has been used as a testing sample once and only once. The prediction performance can be estimated by averaging the prediction results over the whole dataset. Figure 3 illustrates the process of a 5-fold cross validation. When n is set to its maximum possible value, which is the number of all samples, an n-fold cross validation becomes a leave-one-out cross validation. A leave-one-out cross validation is also called a jackknife test.
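As an illustration, the following minimal Python sketch outlines an n-fold cross-validation loop; the train and predict functions are placeholders for any learning algorithm and are not part of any specific predictor discussed here.

```python
import random

def n_fold_cross_validation(samples, labels, n, train, predict, seed=0):
    """Estimate the accuracy of a learning algorithm by n-fold cross-validation.

    `train(X, y)` should return a fitted model and `predict(model, X)` should
    return predicted labels; both are placeholders for any learning algorithm.
    """
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)         # random partition of the dataset
    folds = [indices[i::n] for i in range(n)]    # n parts of (almost) equal size

    correct = 0
    for k in range(n):                           # k-th round: fold k is the testing set
        test_idx = set(folds[k])
        train_idx = [i for i in indices if i not in test_idx]
        model = train([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        predictions = predict(model, [samples[i] for i in folds[k]])
        correct += sum(p == labels[i] for p, i in zip(predictions, folds[k]))

    return correct / len(samples)                # every sample is tested exactly once

# Setting n = len(samples) turns this loop into a leave-one-out (jackknife) test.
```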
The jackknife test has been widely applied in evaluating many bioinformatics predictors [4,17,38,39], as it always generates a unique value for each performance measure, which makes it easy to compare different predictors [40,41]. However, there is a minor flaw in the jackknife test. In an n-fold cross-validation, the predictions and the real results have a negative correlation even when there is no relationship between them (http://www.russpoldrack.org/2012/12/the-perils-of-leave-one-out.html). This effect becomes more significant as n increases. As the jackknife test uses the maximal possible value of n, it is affected more than 3-fold or 5-fold cross-validations. A more quantitative discussion has been provided by Hastie et al. [42]. Therefore, we suggest that n-fold cross-validation with a small n, such as 3 or 5, should be preferred in practice.
No matter which evaluation method is applied, the predictive performance has to be represented quantitatively with numbers. A set of statistics, called performance measures, has been introduced to represent the predictive performance in many different aspects.
PERFORMANCE MEASURES
Intuitively, users of a predictor would like to know how much they can trust the prediction results. This can be roughly interpreted as the probability that the predictor makes correct predictions. However, in practice, it is not enough to give only a single number indicating the probability of getting correct predictions. For example, the users may only need the correct predictions among the samples that they are interested in. The users may also only care about the correct predictions under certain conditions. Therefore, a set of statistical performance measures has been developed to quantitatively describe the predictive performance in different aspects and under different conditions.
Different sets of performance measures are applied to the single-label predictors and multi-label predictors. Both types of predictors have been widely applied in bioinformatics. We will first focus on the performance measures of single-label predictors. After that, we will continue to introduce the performance measures of multi-label predictors.
In a binary predictor, which we have mentioned before, we call the two classes the positives and the negatives. When we have m (m>2) classes, we measure the performance on each class separately. When the performance on the j-th class is measured, samples of the j-th class are the positives and all samples of the remaining m−1 classes serve as the negatives. When the overall performance is calculated, we use a different measure, which will be discussed later. Therefore, we now focus only on measuring the prediction performance of binary predictors.
For every sample in a testing process, there is always a real label and a predicted label. The real label indicates the class that the testing sample really belongs to. The predicted label is the output of the predictor. Let sk (k = 1, 2, …, n) be one of the n testing samples, y(sk) the real label of sk and ye(sk) the prediction result for sk. Without loss of generality, in a binary predictor, we use +1 as the label of a positive sample and −1 as the label of a negative sample. We define the four most basic counts as follows:
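In set notation, these counts can be written as:

$$RP = \left|\{ s_k \mid y(s_k) = +1 \}\right|, \qquad RN = \left|\{ s_k \mid y(s_k) = -1 \}\right|,$$
$$PP = \left|\{ s_k \mid y_e(s_k) = +1 \}\right|, \qquad PN = \left|\{ s_k \mid y_e(s_k) = -1 \}\right|,$$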
where RP, RN, PP and PN are the numbers of real positives, real negatives, predicted positives and predicted negatives, and |·| is the cardinality operator in set theory. These four counts give the numbers of positives and negatives among all testing samples according to either their real labels or their predicted labels. Although these four counts are not commonly seen in bioinformatics studies, they are components of commonly used performance measures.
The other four commonly used counts are the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). They can be defined as follows:
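In the same notation, these counts can be written as:

$$TP = \left|\{ s_k \mid y(s_k) = +1 \text{ and } y_e(s_k) = +1 \}\right|, \qquad TN = \left|\{ s_k \mid y(s_k) = -1 \text{ and } y_e(s_k) = -1 \}\right|,$$
$$FP = \left|\{ s_k \mid y(s_k) = -1 \text{ and } y_e(s_k) = +1 \}\right|, \qquad FN = \left|\{ s_k \mid y(s_k) = +1 \text{ and } y_e(s_k) = -1 \}\right|.$$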
According to the above definitions, if both the real label and the predicted label of a sample are positive, the sample is a true positive. If both the real label and the predicted label of a sample are negative, the sample is a true negative. If the real label of a sample is positive while its predicted label is negative, the sample is a false negative. If the real label of a sample is negative while its predicted label is positive, the sample is a false positive. These definitions must be crystal clear, as almost all performance measures rely on the correct values of these basic counts.
In addition, these four counts and the former four counts form a 2-by-2 contingency table, as displayed in Figure 4. The 2-by-2 matrix that contains TP, TN, FP and FN is called a confusion matrix. Intuitively, the following relationships exist:
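Namely, with n denoting the total number of testing samples:

$$RP = TP + FN, \qquad RN = TN + FP, \qquad PP = TP + FP, \qquad PN = TN + FN,$$
$$n = RP + RN = PP + PN = TP + TN + FP + FN.$$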
Because of these relationships, the performance measures are not always expressed in terms of the four commonly used counts TP, TN, FN and FP. A different set of notations may appear as well. For example, in the literature [43,44], the four basic counts were actually RP, RN, FP and FN, denoted by a different set of symbols. If the values of TP and TN were required, they were obtained through Equation (10) and Equation (12). These notations differ only in appearance; their interpretations are identical to our equivalents [5].
Based on the above definitions, three basic performance measures, which are called sensitivity (Sen), specificity (Spe) and accuracy (Acc), can be defined as follows:
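In terms of the counts defined above, the standard formulations are:

$$Sen = \frac{TP}{TP + FN} = \frac{TP}{RP}, \qquad Spe = \frac{TN}{TN + FP} = \frac{TN}{RN}, \qquad Acc = \frac{TP + TN}{TP + TN + FP + FN}.$$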
These three measures indicate the performance of a predictor in three different aspects. Sensitivity is the frequency of correctly predicted positive samples among all real positive samples. It measures the ability of a predictor to identify positive samples. Similarly, specificity measures the ability of a predictor to identify negative samples. Accuracy measures the ability of a predictor to correctly identify all samples, no matter whether they are positive or negative. These performance measures have aliases in different studies. Sensitivity may be termed the true positive rate (TPR) or recall. Specificity may be termed the true negative rate (TNR) or inverse recall.
In bioinformatics studies, there are two common challenges that most predictors must face. The first challenge is that users of a predictor usually care only about the positive outputs. The negative outputs are never important, no matter whether they are true negatives or false negatives. The second challenge is that the datasets are very imbalanced, whether in training or testing. For example, when predicting protein phosphorylation sites, the negative samples outnumber the positives by more than 100 times [23,24]. The users of the predictor always care only about the positive results, as only the phosphorylated sites are worth further study.
To address the first challenge, several other performance measures have to be introduced. The positive predictive value (PPV) is one of them, which can be defined as follows:
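In the notation above:

$$PPV = \frac{TP}{TP + FP} = \frac{TP}{PP}.$$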
The PPV indicates the frequency of true positives among all positive outputs. Usually, the positive outputs are subject to experimental validation in the wet lab. The wet-lab cost is always much higher than that of computational predictions. The average cost of obtaining each validated positive result depends on the PPV. Even if the sensitivity is very high, a predictor is still useless in practice if its PPV is not acceptable. The PPV is also termed the precision [45]. Another performance measure that is tightly related to the PPV is the false discovery rate (FDR), which can be defined as follows:
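In the same notation:

$$FDR = \frac{FP}{TP + FP} = \frac{FP}{PP} = 1 - PPV.$$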
The FDR is widely used in analyzing expression profiles [ 46] and genetic association studies [ 47, 48].
The PPV and FDR do not take the false negatives into consideration. However, false negatives are potential discoveries that are missed by the predictor. The number of false negatives may therefore be considered in measuring predictive performance. By simultaneously removing the true negatives from both the numerator and the denominator of the accuracy, a performance measure called the Jaccard index is obtained. It can be defined as follows:
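Removing TN from the accuracy in this way gives:

$$J = \frac{TP}{TP + FP + FN},$$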
where J is the Jaccard index. The value of J indicates the frequency of true positives among all samples that are either real positives or positive predictions.
When the true positives are thought to be more important than the other samples, more weight can be put on the true positives when calculating the Jaccard index. If true positives are considered twice as important as the other samples, a performance measure called the F1-score can be defined as follows:
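Doubling the weight of the true positives in the Jaccard index gives the standard form:

$$F1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN},$$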
where F1 is the F1-score. The F1-score is actually the harmonic mean of PPV and sensitivity. The following relationship exists:
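That is:

$$F1 = \frac{2 \cdot PPV \cdot Sen}{PPV + Sen} = \left( \frac{1}{2} \left( \frac{1}{PPV} + \frac{1}{Sen} \right) \right)^{-1}.$$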
The above performance measures remove the true negatives from the calculation. When the users care only about positive outputs, these performance measures work well.
In other cases, the training and testing datasets are very imbalanced, but because of practical requirements, the large number of true negatives cannot be ignored. Some other performance measures must then be used. Here, we introduce two of them, the balanced accuracy (BAcc) and the Matthews correlation coefficient (MCC). The balanced accuracy can be defined as the average of sensitivity and specificity:
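That is:

$$BAcc = \frac{Sen + Spe}{2} = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right).$$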
The balanced accuracy solves most of the bias problems in an imbalanced dataset test [ 45, 49]. The balanced accuracy has been applied in evaluating predictors for RNA editing [ 50] and posttranslational modification sites [ 51].
The Matthews correlation coefficient was first proposed in 1975 [52]. It is well recognized and widely applied in evaluating predictors on imbalanced datasets. In particular, when a predictor is evaluated with a jackknife test, the MCC is usually reported along with the sensitivity, specificity and accuracy. When the classes are of very different sizes, there is no perfect way to describe the confusion matrix of true and false positives and negatives with a single number, and the MCC is generally regarded as one of the most balanced measures [45]. In most bioinformatics papers, the MCC is defined as follows:
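In its most familiar form:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$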
Using Equations 9–12, the MCC can also be written as follows:
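Substituting the relationships between the counts:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{PP \cdot RP \cdot RN \cdot PN}}.$$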
This form makes the calculation of the MCC very easy. However, its statistical interpretation should be carried out carefully. The MCC ranges from −1 to +1. A zero MCC value indicates that the predictor performs no better than random guessing; its prediction results have no relationship to the real class labels of the samples. A non-zero MCC value indicates that the predictor is better than random guessing. An MCC of +1 indicates a perfect predictor, which reports the class label of every sample correctly. An MCC of −1 also indicates a perfect predictor, but its outputs should be interpreted with the opposite meaning. In statistics, calculating the MCC is equivalent to performing a chi-square test on the confusion matrix. The following relationship exists:
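In its standard form:

$$MCC^{2} = \frac{\chi^{2}}{n},$$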
where χ2 is the chi-square statistic of the confusion matrix and n the total number of testing samples. In statistics, the MCC is also termed the phi coefficient or the mean square contingency coefficient, in the context of Pearson correlation coefficients [45].
All the above performance measures rely on the values of TP, TN, FP and FN. However, most bioinformatics predictors give scores to samples before the class labels are assigned. The class labels are then assigned according to the scores. If the score is higher than a pre-defined cut-off value, the sample is predicted as positive; otherwise, it is predicted as negative. Therefore, the choice of the cut-off value can largely affect the prediction performance. An improper choice of cut-off value can make some performance measures extremely high while others are very low. To avoid this kind of bias, a comprehensive performance measure is necessary. Therefore, the receiver operating characteristic (ROC) curve methods were introduced.
An ROC curve describes the relationship between the sensitivity and the false positive rate (FPR). The FPR can be defined as follows:
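In the notation above:

$$FPR = \frac{FP}{FP + TN} = \frac{FP}{RN} = 1 - Spe.$$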
Given a scoring scheme, the values of sensitivity and FPR change along with the cut-off value. For every cut-off value, a dot can be plotted at the coordinate (FPR, Sen). The curve that connects all these dots is called an ROC curve. An example ROC curve is shown in Figure 5. The diagonal in Figure 5 separates the square between (0,0) and (1,1) into two parts. This diagonal is called the line of no-discrimination. An ROC curve should appear in the top left part. An ROC curve that is close to the diagonal indicates that the predictions are close to random guesses. An ROC curve that is close to the top left corner indicates that the predictor has good performance. When an ROC curve is plotted, we can use the area under the curve (AUC) to measure the performance of the predictor. As the ROC curve is not tied to any particular cut-off value, the performance bias related to the choice of cut-off value does not exist. In fact, the AUC of an ROC curve equals the probability that a randomly selected positive sample gets a higher score than a randomly selected negative sample. This can be formulated as:
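In probabilistic form:

$$AUC = P(X^{+} > X^{-}),$$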
where AUC is the area under the curve, X+ the score of a randomly selected positive sample, and X− the score of a randomly selected negative sample. Given the scores of all testing samples, the probability on the right side of Equation (27) can be estimated.
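As an illustration of this estimate, the following short Python sketch computes the empirical probability that a positive sample outscores a negative one (ties counted as one half); the example scores are hypothetical.

```python
def auc_from_scores(pos_scores, neg_scores):
    """Empirical AUC: the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative sample
    (ties are counted as 0.5)."""
    wins = 0.0
    for xp in pos_scores:
        for xn in neg_scores:
            if xp > xn:
                wins += 1.0
            elif xp == xn:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical scores assigned by a predictor to 3 positive and 4 negative samples
print(auc_from_scores([0.9, 0.8, 0.6], [0.7, 0.4, 0.3, 0.2]))  # about 0.917
```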
It should be noted that the shape of an ROC curve may be misleading when the dataset is highly imbalanced. An ROC curve that is close to the upper-left corner may not indicate a good predictor under imbalanced dataset conditions [53]. To solve this problem, an alternative utility, known as the precision-recall (PR) curve, should be applied. The PR curve is plotted in a very similar way to the ROC curve. The only difference is that the coordinates are (recall, precision). The recall, as we mentioned before, equals the sensitivity. The precision, as we mentioned before, equals the PPV. As every dot on an ROC curve corresponds to a confusion matrix, from which a pair of recall and precision values can be calculated, every dot on an ROC curve can be mapped to a dot on a PR curve [54]. Although the AUC of a PR curve has no interpretation similar to Equation (27) for the ROC curve, it can also be used to compare the performances of different models. When comparing different models or algorithms on the same dataset, the AUC of the ROC curve and the AUC of the PR curve usually, although not always, rank the models in the same order [54]. A larger AUC of a PR curve indicates a better performance.
All the above discussions are based on binary classifiers. For a multi-class classifier, there is one more aspect to consider: the overall performance. The overall accuracy is the only measure that is commonly used to quantify the overall performance of a multi-class predictor. The overall accuracy of a multi-class predictor is usually defined as follows:
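In a common formulation:

$$OA = \frac{1}{n} \sum_{j=1}^{m} TP_{j},$$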
where n is the total number of testing samples, m the total number of classes and TPj the number of true positives of the j-th class. The other performance measures are usually reported separately for every class, as it is difficult to define the basic counts in an overall manner.
Recently, multi-label classifiers have been introduced to bioinformatics [55]. In a multi-label predictor, the label of a sample is no longer a single integer. Instead, a set of integers is used to label a sample. A new set of performance measures is therefore introduced. For the convenience of readers, we first restate the definitions of several notations in the multi-label context. Let sk be the k-th of n testing samples, y(sk) the set of real class labels of sk, and ye(sk) the set of predicted class labels of sk. The labels in y(sk) but not in ye(sk) are called the under-predicted labels. The labels in ye(sk) but not in y(sk) are called the over-predicted labels. We define an indicator function as follows:
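In a standard form:

$$\delta_k = \begin{cases} 1, & \text{if } y_e(s_k) = y(s_k) \\ 0, & \text{otherwise,} \end{cases}$$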
where δk is the indicator function of sk. This function indicates whether the prediction result for sk is completely correct, without any over-predicted or under-predicted labels. If the prediction result for sk is completely correct, its value is 1; otherwise it is 0.
The set of performance measures for multi-label predictors includes the Aiming (Aim), Coverage (Cov), Accuracy (Acc), Absolute-True-Rate (ATR) and Absolute-False-Rate (AFR). They can be defined as follows:
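In the set notation introduced above, standard formulations of these measures [33] are:

$$Aim = \frac{1}{n} \sum_{k=1}^{n} \frac{\left| y(s_k) \cap y_e(s_k) \right|}{\left| y_e(s_k) \right|}, \qquad Cov = \frac{1}{n} \sum_{k=1}^{n} \frac{\left| y(s_k) \cap y_e(s_k) \right|}{\left| y(s_k) \right|},$$
$$Acc = \frac{1}{n} \sum_{k=1}^{n} \frac{\left| y(s_k) \cap y_e(s_k) \right|}{\left| y(s_k) \cup y_e(s_k) \right|}, \qquad ATR = \frac{1}{n} \sum_{k=1}^{n} \delta_k,$$
$$AFR = \frac{1}{n} \sum_{k=1}^{n} \frac{\left| y(s_k) \cup y_e(s_k) \right| - \left| y(s_k) \cap y_e(s_k) \right|}{m},$$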
where n is the total number of testing samples and m the number of all possible labels.
In the above equations, Aim is the aiming, which describes the average frequency of correctly predicted labels among all predicted labels [33]. Aiming is similar to the PPV in the single-label context. Therefore, it is also called the multi-label PPV [55] or multi-label precision [56]. Cov is the coverage, which describes the average rate of correctly predicted labels among all true labels [33]. Coverage is similar to the sensitivity in the single-label context. Therefore, it is also called the multi-label sensitivity [55] or multi-label recall [56]. Acc is the multi-label accuracy, which reflects the average rate of correctly predicted labels among the labels that are either real labels or prediction results [33]. Acc is similar to the Jaccard index in the single-label context. ATR is the absolute-true-rate, which describes the frequency of absolutely correct predictions [33]. Neither over-predicted nor under-predicted samples are counted as correct predictions. This is the strictest performance measure in the multi-label context. AFR is the absolute-false-rate, which describes the average rate of wrongly predicted labels among all possible labels. The wrongly predicted labels include both the over-predicted and the under-predicted ones [33]. AFR is also called the Hamming loss [55,56]. As the AFR describes the rate of wrongly predicted labels, the lower the AFR, the better the predictor performs. In this respect it differs from the other measures.
The readers may have already noticed that none of these multi-label performance measures can be defined for every class separately. In fact, we agree with the literature [33,55–57] that it is meaningless and misleading to apply performance measures to every class separately in a multi-label context. The above multi-label performance measures should only be used for overall performance.
CONCLUDING REMARKS
The readers of a bioinformatics paper tend to focus more on the performance values than on how these values are obtained and what they actually indicate. However, to correctly understand the performance of a predictor, knowledge of the performance measures and an understanding of the evaluation methods are necessary. Besides all the performance measures that we have defined and explained in the current review, we would like to take this opportunity to give the readers three tips for interpreting performance values.
The first tip is on performance comparison. Ideally, a fair and rigorous performance comparison must be carried out using an identical testing dataset, an identical training dataset and identical evaluation protocols [41,55]. These requirements are not easy to satisfy in practice. However, the readers should notice that, if the comparison is not carried out in a rigorous way, better values of performance measures do not guarantee a better performance in practical applications. The readers should try to use their own datasets to confirm the predictive performance in their own studies [58].
The second tip is on feature selection. Feature selection should be regarded as part of the training procedure. If the feature selection uses the whole dataset, and a cross validation is then carried out on the same whole dataset with the selected features, the predictive performance is likely to be over-estimated [59]. A strict mathematical explanation of why this happens is beyond the scope of this review. The readers may generally think of it as some information of the testing samples slipping into the training dataset by helping to decide which features are selected. A safe evaluation protocol involving feature selection is to leave the testing samples out before the feature selection process [60].
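As a minimal sketch of this protocol (assuming generic select_features, train and predict functions that stand in for any feature selection method and learning algorithm), the feature selection can be repeated inside every fold of a cross-validation:

```python
def cross_validation_with_selection(X, y, n, select_features, train, predict):
    """n-fold cross-validation in which feature selection is repeated inside
    every fold, using only the training part of that fold, so that no
    information about the held-out samples can leak into feature selection.
    `select_features(X, y)` is assumed to return the indices of the chosen
    feature columns; `train` and `predict` are placeholders as before."""
    folds = [list(range(len(X)))[k::n] for k in range(n)]
    correct = 0
    for k in range(n):
        test_idx = set(folds[k])
        train_idx = [i for i in range(len(X)) if i not in test_idx]
        # Select features using the training samples of this fold only.
        selected = select_features([X[i] for i in train_idx], [y[i] for i in train_idx])
        model = train([[X[i][j] for j in selected] for i in train_idx],
                      [y[i] for i in train_idx])
        predictions = predict(model, [[X[i][j] for j in selected] for i in folds[k]])
        correct += sum(p == y[i] for p, i in zip(predictions, folds[k]))
    return correct / len(X)
```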
The last tip is on how to interpret the performance values obtained by a subsampling test or an n-fold test. As we have discussed, the performance values of subsampling tests and n-fold tests may vary with the random partitioning of the dataset. Without the details of the partitioning, which are usually not reported, it is not easy to reproduce the performance values exactly. Therefore, we recommend that a subsampling test or an n-fold test be carried out several times with different random partitionings. The average performance values should be reported together with their standard deviations [37].
With all the above discussions and explanations, we hope that the current work covers most of the evaluation methods and performance measures that have been widely applied in bioinformatics. We expect this paper to be a useful reference for the readers when they encounter the terminology of performance measures and evaluation methods in academic papers or reports.
[1]
Eberwine, J., Sul, J.-Y., Bartfai, T. and Kim, J. (2014) The promise of single-cell sequencing. Nat. Methods, 11, 25–27
[2]
Ashley, E. A. (2015) The precision medicine initiative: a new national effort. JAMA, 313, 2119–2120
[3]
Chou, K.-C. (2009) Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Curr. Proteomics, 6, 262–274
[4]
Chou, K.-C. (2015) Impacts of bioinformatics to medicinal chemistry. Med. Chem., 11, 218–234
[5]
Jiao, Y.-S. and Du, P.-F. (2016) Predicting Golgi-resident protein types using pseudo amino acid compositions: approaches with positional specific physicochemical properties. J. Theor. Biol., 391, 35–42
[6]
Wang, Y. and Zeng, J. (2013) Predicting drug-target interactions using restricted Boltzmann machines. Bioinformatics, 29, i126–i134
[7]
Lee, K., Byun, K., Hong, W., Chuang, H. Y., Pack, C. G., Bayarsaikhan, E., Paek, S. H., Kim, H., Shin, H. Y., Ideker, T., (2013) Proteome-wide discovery of mislocated proteins in cancer. Genome Res., 23, 1283–1294
[8]
Shao, J., Xu, D., Hu, L., Kwan, Y. W., Wang, Y., Kong, X. and Ngai, S. M. (2012) Systematic analysis of human lysine acetylation proteins and accurate prediction of human lysine acetylation through bi-relative adapted binomial score Bayes feature representation. Mol. Biosyst., 8, 2964–2973
[9]
Libbrecht, M. W. and Noble, W. S. (2015) Machine learning applications in genetics and genomics. Nat. Rev. Genet., 16, 321–332
[10]
Kohavi, R. and Provost, F. (1998) Glossary of terms. Mach. Learn., 30, 271–274
[11]
Simon, P. (2013) Too Big to Ignore: The Business Case for Big Data. New Jersey: Wiley
[12]
Fan, Y.-X., Zhang, Y. and Shen, H.-B. (2013) LabCaS: labeling calpain substrate cleavage sites from amino acid sequence using conditional random fields. Proteins, 81, 622–634
[13]
Song, J., Tan, H., Shen, H., Mahmood, K., Boyd, S. E., Webb, G. I., Akutsu, T. and Whisstock, J. C. (2010) Cascleave: towards more accurate prediction of caspase substrate cleavage sites. Bioinformatics, 26, 752–760
[14]
Chou, K.-C. and Shen, H.-B. (2008) Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms. Nat. Protoc., 3, 153–162
[15]
Li, X., Liu, T., Tao, P., Wang, C. and Chen, L. (2015) A highly accurate protein structural class prediction approach using auto cross covariance transformation and recursive feature elimination. Comput. Biol. Chem., 59, 95–100
[16]
Kong, L., Zhang, L. and Lv, J. (2014) Accurate prediction of protein structural classes by incorporating predicted secondary structure information into the general form of Chou’s pseudo amino acid composition. J. Theor. Biol., 344, 12–18
[17]
Guo, S.-H., Deng, E.-Z., Xu, L.-Q., Ding, H., Lin, H., Chen, W. and Chou, K. C. (2014) iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics, 30, 1522–1529
[18]
Xu, Y., Wen, X., Wen, L.-S., Wu, L. Y., Deng, N. Y. and Chou, K. C. (2014) iNitro-Tyr: prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition. PLoS One, 9, e105018
[19]
Xu, Y. and Chou, K.-C. (2016) Recent progress in predicting posttranslational modification sites in proteins. Curr. Top. Med. Chem., 16, 591–603
[20]
Jiang, R., Tang, W., Wu, X. and Fu, W. (2009) A random forest approach to the detection of epistatic interactions in case-control studies. BMC Bioinformatics, 10, S65
[21]
Tang, W., Wu, X., Jiang, R. and Li, Y. (2009) Epistatic module detection for case-control studies: a Bayesian model with a Gibbs sampling strategy. PLoS Genet., 5, e1000464
[22]
Wu, X., Jiang, R., Zhang, M. Q. and Li, S. (2008) Network-based global inference of human disease genes. Mol. Syst. Biol., 4, 189
[23]
Li, T., Du, P. and Xu, N. (2010) Identifying human kinase-specific protein phosphorylation sites by integrating heterogeneous information from various sources. PLoS One, 5, e15411
[24]
Xue, Y., Liu, Z., Cao, J., Ma, Q., Gao, X., Wang, Q., Jin, C., Zhou, Y., Wen, L. and Ren, J. (2011) GPS 2.1: enhanced prediction of kinase-specific phosphorylation sites with an algorithm of motif length selection. Protein Eng. Des. Sel., 24, 255–260
[25]
Zhao, Q., Xie, Y., Zheng, Y., Jiang, S., Liu, W., Mu, W., Liu, Z., Zhao, Y., Xue, Y. and Ren, J. (2014) GPS-SUMO: a tool for the prediction of sumoylation sites and SUMO-interaction motifs. Nucleic Acids Res., 42, W325–W330
[26]
Nanni, L., Brahnam, S. and Lumini, A. (2012) Combining multiple approaches for gene microarray classification. Bioinformatics, 28, 1151–1157
[27]
Dong, X. and Weng, Z. (2013) The correlation between histone modifications and gene expression. Epigenomics, 5, 113–116
[28]
Dong, X., Greven, M. C., Kundaje, A., Djebali, S., Brown, J. B., Cheng, C., Gingeras, T. R., Gerstein, M., Guigó, R., Birney, E., (2012) Modeling gene expression using chromatin features in various cellular contexts. Genome Biol., 13, R53
[29]
Cheng, C., Shou, C., Yip, K. Y. and Gerstein, M. B. (2011) Genome-wide analysis of chromatin features identifies histone modification sensitive and insensitive yeast transcription factors. Genome Biol., 12, R111
[30]
Huang, J., Marco, E., Pinello, L. and Yuan, G. C. (2015) Predicting chromatin organization using histone marks. Genome Biol., 16, 162
[31]
Bishop, C. M. (2006) Pattern Recognition and Machine Learning. New York: Springer
[32]
Zhang, M.-L. and Zhou, Z.-H. (2007) ML-KNN: a lazy learning approach to multi-label learning. Pattern Recognit., 40, 2038–2048
[33]
Chou, K.-C. (2013) Some remarks on predicting multi-label attributes in molecular biosystems. Mol. Biosyst., 9, 1092–1100
[34]
Chou, K.-C. and Shen, H.-B. (2006) Hum-PLoc: a novel ensemble classifier for predicting human protein subcellular localization. Biochem. Biophys. Res. Commun., 347, 150–157
[35]
Chou, K.-C., Wu, Z.-C. and Xiao, X. (2012) iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. Mol. Biosyst., 8, 629–641
[36]
Du, P. and Li, Y. (2006) Prediction of protein submitochondria locations by hybridizing pseudo-amino acid composition with various physicochemical features of segmented sequence. BMC Bioinformatics, 7, 518
[37]
Du, P., Tian, Y. and Yan, Y. (2012) Subcellular localization prediction for human internal and organelle membrane proteins with projected gene ontology scores. J. Theor. Biol., 313, 61–67
[38]
Lin, H., Deng, E.-Z., Ding, H., Chen, W. and Chou, K. C. (2014) iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res., 42, 12961–12972
[39]
Chou, K.-C. (2011) Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol., 273, 236–247
[40]
Chou, K. C. and Zhang, C. T. (1995) Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol., 30, 275–349
[41]
Du, P., Li, T. and Wang, X. (2011) Recent progress in predicting protein sub-subcellular locations. Expert Rev. Proteomics, 8, 391–404
[42]
Hastie, T., Tibshirani, R. and Friedman, J. (2009) Model Assessment and Selection. In The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 219–260, New York: Springer-Verlag
[43]
Chou, K. C. (2001) Using subsite coupling to predict signal peptides. Protein Eng., 14, 75–79
[44]
Chen, W., Feng, P., Ding, H., Lin, H. and Chou, K. C. (2015) iRNA-Methyl: identifying N(6)-methyladenosine sites using pseudo nucleotide composition. Anal. Biochem., 490, 26–33
[45]
Powers, D. M. W. (2011) Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Inter. J. Mach. Learn. Tech., 2, 37–63
[46]
Li, J., Witten, D. M., Johnstone, I. M. and Tibshirani, R. (2012) Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics, 13, 523–538
[47]
Andreassen, O. A., Thompson, W. K., Schork, A. J., Ripke, S., Mattingsdal, M., Kelsoe, J. R., Kendler, K. S., O’Donovan, M. C., Rujescu, D., Werge, T., (2013) Improved detection of common variants associated with schizophrenia and bipolar disorder using pleiotropy-informed conditional false discovery rate. PLoS Genet., 9, e1003455
[48]
Chen, J. J., Roberson, P. K. and Schell, M. J. (2010) The false discovery rate: a key concept in large-scale genetic studies. Cancer Control, 17, 58–62
[49]
Brodersen, K. H., Ong, C. S., Stephan, K. E., Buhmann, J. M. (2010) The Balanced Accuracy and Its Posterior Distribution. In 2010 20th International Conference on Pattern Recognition (ICPR). 3121–3124
[50]
Mower, J. P. (2005) PREP-Mt: predictive RNA editor for plant mitochondrial genes. BMC Bioinformatics, 6, 96
[51]
Dayarian, A., Romero, R., Wang, Z., Biehl, M., Bilal, E., Hormoz, S., Meyer, P., Norel, R., Rhrissorrakrai, K., Bhanot, G., (2015) Predicting protein phosphorylation from gene expression: top methods from the IMPROVER Species Translation Challenge. Bioinformatics, 31, 462–470
[52]
Matthews, B. W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. BBA – Protein Structure, 405, 442–451
[53]
Saito, T. and Rehmsmeier, M. (2015) The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One, 10, e0118432
[54]
Davis, J. and Goadrich, M. (2006) The relationship between precision-recall and ROC curves. In Proceedings of the 23rd international conference on Machine learning. 233–240, New York: the Association for Computing Machinery
[55]
Du, P. and Xu, C. (2013) Predicting multisite protein subcellular locations: progress and challenges. Expert Rev. Proteomics, 10, 227–237
[56]
Tsoumakas, G., Katakis, I. and Vlahavas, I. (2010) Mining Multi-label Data. In Data Mining and Knowledge Discovery Handbook. 667–685, New York: Springer US
[57]
Tsoumakas, G. and Katakis, I. (2007) Multi-label classification: an overview. Int. J. Data Warehous. Min., 3, 1–13
[58]
Sprenger, J., Fink, J. L. and Teasdale, R. D. (2006) Evaluation and comparison of mammalian subcellular localization prediction methods. BMC Bioinformatics, 7, S3
[59]
Bermingham, M. L., Pong-Wong, R., Spiliopoulou, A., Hayward, C., Rudan, I., Campbell, H., Wright, A. F., Wilson, J. F., Agakov, F., Navarro, P., (2015) Application of high-dimensional feature selection: evaluation for genomic prediction in man. Sci. Rep., 5, 10312
[60]
Varma, S. and Simon, R. (2006) Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7, 91
RIGHTS & PERMISSIONS
Higher Education Press and Springer-Verlag Berlin Heidelberg