piEnPred: a bi-layered discriminative model for enhancers and their subtypes via novel cascade multi-level subset feature selection algorithm

Zaheer Ullah KHAN; Dechang PI; Shuanglong YAO; Asif NAWAZ; Farman ALI; Shaukat ALI

Front. Comput. Sci., 2021, Vol. 15, Issue 6: 156904. DOI: 10.1007/s11704-020-9504-3

RESEARCH ARTICLE

Abstract

Enhancers are short cis-acting DNA elements that can be bound by proteins (activators) to increase the likelihood that a particular gene will be transcribed. Enhancers play a significant role in protein formation and in regulating gene transcription. Human diseases such as cancer, inflammatory bowel disease, Parkinson’s disease, addiction, and schizophrenia are due to genetic variation in enhancers. In the current study, we built a more robust and novel bi-layered computational model. The representative feature vector was constructed as a linear combination of six features. The optimum hybrid feature vector was obtained via the novel Cascade Multi-Level Subset Feature Selection (CMSFS) algorithm. The first layer predicts enhancers, and the second layer predicts their subtypes. The baseline model obtained 87.88% accuracy, 95.29% sensitivity, 80.47% specificity, an MCC of 0.766, and an ROC value of 0.9603 on layer 1. Similarly, it obtained 68.24% accuracy, 65.54% sensitivity, 70.95% specificity, an MCC of 0.3654, and an ROC value of 0.7568 on layer 2. On an independent dataset, piEnPred achieved 80.4% accuracy, 82.5% sensitivity, 78.4% specificity, and an MCC of 0.6099 on layer 1, and 72.5% accuracy, 70.0% sensitivity, 75% specificity, and an MCC of 0.4506 on layer 2. The proposed method performed remarkably well compared with other state-of-the-art predictors. For the convenience of experimental scientists, a user-friendly, freely accessible web server has been developed at bienhancer.pythonanywhere.com.
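
The threshold-based metrics reported in the abstract (accuracy, sensitivity, specificity, and MCC) can all be derived from a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions, which we assume match those used in the study. The function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration; they are not taken from the paper's datasets.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute accuracy, sensitivity, specificity and MCC from
    confusion-matrix counts (assumes both classes are non-empty)."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    # Matthews correlation coefficient; defined as 0 when the
    # denominator vanishes (a common convention, assumed here).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Illustrative counts only (not from the paper): 100 positives, 100 negatives.
example = binary_metrics(tp=95, fn=5, tn=80, fp=20)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

On a class-balanced dataset, accuracy is the mean of sensitivity and specificity, which is consistent with the layer-1 figures above: (95.29% + 80.47%) / 2 = 87.88%.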

Keywords

enhancer / enhancer types / novel CM-SFS algorithm / feature selection / SVM

Cite this article

Zaheer Ullah KHAN, Dechang PI, Shuanglong YAO, Asif NAWAZ, Farman ALI, Shaukat ALI. piEnPred: a bi-layered discriminative model for enhancers and their subtypes via novel cascade multi-level subset feature selection algorithm. Front. Comput. Sci., 2021, 15(6): 156904. DOI: 10.1007/s11704-020-9504-3
