Abstract
Background: With recent advances in sequencing technology, the volume of RNA expression (RNA-seq) data has been growing rapidly. RNA-seq data are count-type measurements, and the Poisson distribution is a basic probability distribution for modeling such data. With Poisson regression models, various experimental factors, GC content, and alternative splicing isoforms can be flexibly considered in RNA-seq data analysis. However, the biochemical and technical limitations of sequencing technology are recognized to introduce biases into RNA-seq data.
Methods: In this study, we propose an artificial censoring approach for an isoform-specific Poisson regression model for analyzing RNA-seq data. Low expression values are grouped (censored) into one probability category, and high expression values are grouped (censored) into another. We have implemented a Newton-Raphson numerical procedure to obtain the maximum likelihood estimates for our censored-Poisson regression model, and we have derived mathematical simplifications that make the numerical computation stable and convenient.
Results: The advantages of our artificial censoring approach are demonstrated in both simulation studies and the analysis of experimental data.
Conclusions: Our proposed artificial censoring approach allows us to focus on the majority of the data. Because the extreme values (tails) of the data are artificially censored, more efficient analysis results can be obtained, even from relatively simple Poisson regression models. The approach can also be considered for other well-developed models or methods for RNA-seq data analysis.
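The censoring idea in the Methods paragraph can be illustrated with a minimal sketch. This is not the authors' isoform-specific regression model: it assumes a single Poisson rate with no covariates, and the function names (`pmf`, `cdf`, `score_and_hessian`, `fit_rate`) and the thresholds `low` and `high` are hypothetical choices made for illustration. Counts at or below `low` contribute log P(Y <= low) to the likelihood, counts at or above `high` contribute log P(Y >= high), and the remaining counts contribute the usual Poisson log-probability. The rate is estimated by Newton-Raphson on theta = log(rate).

```python
import math


def pmf(k, lam):
    """Poisson probability P(Y = k); defined as zero for negative k."""
    if k < 0:
        return 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def cdf(k, lam):
    """Poisson probability P(Y <= k), by direct summation."""
    return sum(pmf(j, lam) for j in range(k + 1))


def score_and_hessian(theta, counts, low, high):
    """First and second derivatives of the censored log-likelihood
    with respect to theta = log(rate). Illustrative helper only."""
    lam = math.exp(theta)
    d1 = d2 = 0.0
    for y in counts:
        if y <= low:
            # Left-censored: contributes log P(Y <= low).
            F = cdf(low, lam)
            g = -pmf(low, lam) / F
            h = -(pmf(low - 1, lam) - pmf(low, lam)) / F - g * g
        elif y >= high:
            # Right-censored: contributes log P(Y >= high).
            # Note: 1 - cdf loses precision far out in the tail.
            S = 1.0 - cdf(high - 1, lam)
            g = pmf(high - 1, lam) / S
            h = (pmf(high - 2, lam) - pmf(high - 1, lam)) / S - g * g
        else:
            # Uncensored: contributes log P(Y = y).
            g = y / lam - 1.0
            h = -y / (lam * lam)
        # Chain rule from derivatives in lam to derivatives in theta.
        d1 += lam * g
        d2 += lam * lam * h + lam * g
    return d1, d2


def fit_rate(counts, low, high, tol=1e-10, max_iter=100):
    """Newton-Raphson estimate of the Poisson rate under censoring."""
    clipped = [min(max(y, low), high) for y in counts]
    theta = math.log(max(sum(clipped) / len(clipped), 0.5))
    for _ in range(max_iter):
        d1, d2 = score_and_hessian(theta, counts, low, high)
        step = d1 / d2
        theta -= step
        if abs(step) < tol:
            break
    return math.exp(theta)
```

For example, with the made-up counts `[0, 1, 2, 3, 3, 4, 4, 5, 5, 5, 6, 7, 8, 12, 40]`, calling `fit_rate(counts, low=1, high=10)` gives an estimate below the sample mean of 7, because the outlying value 40 enters the likelihood only as "at least 10". Setting `low=-1` and `high` above the largest count disables censoring, and the estimate reduces to the ordinary sample mean.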
Keywords
RNA-seq / Poisson models / censored distribution
Cite this article
Xing Chen, Yinglei Lai. A censored-Poisson model based approach to the analysis of RNA-seq data. Quant. Biol., 2020, 8(2): 155-171. DOI: 10.1007/s40484-020-0208-3
RIGHTS & PERMISSIONS
Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature