NPEST: a nonparametric method and a database for transcription start site prediction

Tatiana Tatarinova, Alona Kryshchenko, Martin Triska, Mehedi Hassan, Denis Murphy, Michael Neely, Alan Schumitzky

Quant. Biol. 2013, Vol. 1, Issue 4: 261-271. DOI: 10.1007/s40484-013-0022-2
RESEARCH ARTICLE



Abstract

In this paper we present NPEST, a novel tool for analyzing expressed sequence tag (EST) distributions and predicting transcription start sites (TSSs). The method estimates the unknown probability distribution of ESTs using a maximum likelihood (ML) approach, and this estimate is then used to predict TSS positions. Accurate identification of TSSs is an important task in genomics: the position of regulatory elements relative to the TSS can have large effects on gene regulation, and the performance of promoter motif-finding methods depends on correct TSS identification. Our probabilistic approach extends recognition to multiple TSSs per locus, which may help improve the understanding of alternative splicing mechanisms. This paper presents an analysis of simulated data as well as a statistical analysis of promoter regions of the model dicot plant Arabidopsis thaliana. Using our statistical tool, we analyzed 16520 loci and developed a database of TSSs, which is now publicly available at www.glacombio.net/NPEST.
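
The abstract does not spell out the estimator, so the following is only a minimal sketch of one common way to compute a nonparametric maximum likelihood estimate: EM updates of mixing weights on a fixed grid of candidate positions. The Gaussian kernel, the known spread `sigma`, the grid, the weight threshold, and the function names `npml_grid_em` and `predicted_tss` are all assumptions made for illustration and are not taken from NPEST itself.

```python
import math


def npml_grid_em(positions, grid, sigma=3.0, n_iter=200):
    """Estimate mixing weights over a fixed grid of candidate sites by EM.

    Illustrative only: assumes each EST 5' end is a Gaussian-distributed
    observation around one of the grid positions, with known spread sigma.
    """
    k = len(grid)
    n = len(positions)
    weights = [1.0 / k] * k  # start from a uniform mixing distribution
    norm = sigma * math.sqrt(2.0 * math.pi)
    # dens[i][j]: density of observation i under a component centred at grid[j]
    dens = [
        [math.exp(-0.5 * ((x - g) / sigma) ** 2) / norm for g in grid]
        for x in positions
    ]
    for _ in range(n_iter):
        updated = [0.0] * k
        for row in dens:
            mix = sum(w * d for w, d in zip(weights, row))
            for j in range(k):
                # posterior probability that this observation came from grid[j]
                updated[j] += weights[j] * row[j] / mix
        weights = [u / n for u in updated]
    return weights


def predicted_tss(grid, weights, min_weight=0.1):
    """Report grid positions whose estimated weight reaches a threshold."""
    return [g for g, w in zip(grid, weights) if w >= min_weight]


# Toy data: EST 5' ends clustered around two hypothetical start sites.
est_positions = [98, 99, 100, 100, 101, 102, 158, 160, 160, 161, 162]
candidate_grid = list(range(80, 181, 10))
weights = npml_grid_em(est_positions, candidate_grid)
sites = predicted_tss(candidate_grid, weights)
```

In this toy run the estimated weight concentrates on the two grid points nearest the clusters, which is how a mixture-based estimate can report more than one TSS for a single locus. A fixed grid keeps the sketch short; an adaptive grid or a different kernel would change the details but not the idea.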

Keywords

transcription start site (TSS) / nonparametric maximum likelihood

Cite this article

Tatiana Tatarinova, Alona Kryshchenko, Martin Triska, Mehedi Hassan, Denis Murphy, Michael Neely, Alan Schumitzky. NPEST: a nonparametric method and a database for transcription start site prediction. Quant Biol, 2013, 1(4): 261‒271 https://doi.org/10.1007/s40484-013-0022-2


ACKNOWLEDGEMENTS

The authors thank Prof. Xiaoyu Zhang from the Department of Plant Biology, University of Georgia, for providing Pol II occupancy data and Prof. Yoshi Yamamoto from the Laboratory for Plant Molecular Physiology, Gifu University, for access to the PlantPromoter database.
Support from NIH-NIGMS GM068968, NIH-NICHD HD070996, USC Center for High-Performance Computing and Communications, Fujitsu Lab Europe, and HPC Wales is gratefully acknowledged. We are grateful to the two anonymous reviewers for providing valuable feedback and suggestions for improvement.

RIGHTS & PERMISSIONS

2014 Higher Education Press and Springer-Verlag Berlin Heidelberg