RNA binding proteins (RBPs) are known as key post-transcriptional regulators. A recent technology, cross-linking and immunoprecipitation followed by sequencing (CLIP-seq), has made it possible to investigate the interactions between RBPs and RNAs. However, the association between the function and the binding of RBPs has not been systematically studied. In this issue, Lin and Ouyang present a large-scale analysis of the functional targets of human RBPs based on the enhanced C ...
Background: The past decade has witnessed rapid progress in our understanding of the genetics of cancer and its progression. Probabilistic and statistical modeling has played a pivotal role in the discovery of general patterns from cancer genomics datasets and continues to be of central importance for personalized medicine.
Results: In this review, we introduce cancer genomics from a probabilistic and statistical perspective. We start from (1) functional classification of genes into oncogenes and tumor suppressor genes, then (2) demonstrate the importance of comprehensive analysis of different mutation types for individual cancer genomes, followed by (3) tumor purity analysis, which in turn leads to (4) the concepts of ploidy and clonality, which are then connected to (5) tumor evolution under treatment pressure, which yields insights into cancer drug resistance. We also discuss future challenges, including non-coding genomic regions, integrative analysis of genomics and epigenomics, and early cancer detection.
Conclusion: We believe probabilistic and statistical modeling will continue to play important roles for novel discoveries in the field of cancer genomics and personalized medicine.
Background: RNA structure is the crucial basis for RNA function in various cellular processes. Over the last decade, high throughput structure profiling (SP) experiments have brought enormous insight into RNA secondary structure.
Results: In this review, we first provide an overview of approaches for RNA secondary structure prediction, including free energy-based algorithms and comparative sequence analysis. Then we introduce SP technologies, databases to document SP data, and pipelines/algorithms to normalize and interpret SP data. Computational frameworks that incorporate SP data in RNA secondary structure prediction are also presented.
Conclusions: We finally discuss potential directions for improvement in the prediction and differential analysis of RNA secondary structure.
Background: RNA binding proteins (RBPs) play essential roles in the regulation of RNA metabolism. Recent studies have shown that RBPs achieve their functions by binding to their targets in a position-dependent pattern on RNAs. However, few studies have systematically addressed the associations between RBPs’ functions and their positional binding preferences.
Methods: Here, we present large-scale analyses on the functional targets of human RBPs by integrating the enhanced cross-linking and immunoprecipitation followed by sequencing (eCLIP-seq) datasets and the shRNA knockdown followed by RNA-seq datasets that are deposited in the integrated ENCyclopedia of DNA Elements in the human genome (ENCODE) data portal.
Results: We found that (1) binding to the translation termination site and the 3′ untranslated region is important for most human RBPs in regulating RNA decay; (2) RBPs’ binding and regulation follow a cell-type specific pattern.
Conclusions: These results show a strong relationship between the binding position and the functions of RBPs, providing novel insights into RBPs’ regulation mechanisms.
Background: Multi-view -omics datasets offer rich opportunities for integrative analysis across genomic, transcriptomic, and epigenetic data platforms. Statistical methods are needed to rigorously implement current research on functional biology, matching the complex dynamics of systems genomic datasets.
Methods: We apply imputation for missing data and a structural, graph-theoretic pathway model to a dataset of 22 cancers across 173 signaling pathways. Our pathway model integrates multiple data platforms, and we test for differential activation between cancerous tumor and healthy tissue populations.
Results: Our pathway analysis reveals significant disturbance in signaling pathways that are known to relate to oncogenesis. We identify several pathways that suggest new research directions, including the Trk signaling and focal adhesion kinase activation pathways in sarcoma.
Conclusions: Our integrative analysis confirms contemporary research findings, which supports the validity of our results. We implement an interactive data visualization for exploring the pathway analyses, which is publicly available online.
Background: Markov chains (MC) have been widely used to model molecular sequences. The estimation of the MC transition matrix and of confidence intervals for the transition probabilities from long sequence data has been intensively studied in the past decades. In next generation sequencing (NGS), a large number of short reads is generated. These short reads can overlap, and some regions of the genome may not be sequenced, resulting in a new type of data. Based on NGS data, the transition probabilities of an MC can be estimated by moment estimators. However, the classical asymptotic distribution theory for MC transition probability estimators based on long sequences is no longer valid.
Methods: In this study, we present the asymptotic distributions of several statistics related to MC based on NGS data. We show that, after scaling by the effective coverage d defined in a previous study by the authors, these statistics based on NGS data approximately follow the same distributions as the corresponding statistics for long sequences.
Results: We apply the asymptotic properties of these statistics to find theoretical confidence regions for MC transition probabilities based on NGS short read data. We validate our theoretical confidence intervals using both simulated and real data sets, and compare the results with those from the parametric bootstrap method.
Conclusions: We find that the asymptotic distributions of these statistics and the theoretical confidence intervals of transition probabilities based on NGS data given in this study are highly accurate, providing a powerful tool for NGS data analysis.
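As a rough illustration of estimating transition probabilities from short reads, the sketch below pools adjacent-letter counts across all reads and normalizes each row. It is a minimal, assumed version of a count-based estimator and not the authors' implementation: the function name `estimate_transitions` and the toy reads are hypothetical, and neither the effective coverage d nor the confidence regions described above are computed.

```python
ALPHABET = "ACGT"

def estimate_transitions(reads, alphabet=ALPHABET):
    """Estimate first-order Markov transition probabilities from short reads.

    Adjacent-letter pairs are counted within each read (never across read
    boundaries), pooled over all reads, and each row is normalized.
    Pairs containing letters outside the alphabet (e.g. 'N') are skipped.
    """
    counts = {a: {b: 0 for b in alphabet} for a in alphabet}
    for read in reads:
        for prev, nxt in zip(read, read[1:]):
            if prev in counts and nxt in counts[prev]:
                counts[prev][nxt] += 1

    probs = {}
    for a in alphabet:
        row_total = sum(counts[a].values())
        # Rows with no observations are left as all zeros.
        probs[a] = {
            b: (counts[a][b] / row_total if row_total else 0.0)
            for b in alphabet
        }
    return probs

# Hypothetical toy reads, for illustration only.
reads = ["ACGT", "CGTA", "GTAC", "ACGA"]
P = estimate_transitions(reads)
for a in ALPHABET:
    print(a, P[a])
```

Because overlapping reads contribute correlated counts, the variance of such estimates differs from the long-sequence case, which is the situation the asymptotic results above are meant to address.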
Background: With recent advances in sequencing technology, the collection of RNA expression (RNA-seq) data has been growing rapidly. Statistically, RNA-seq data are count-type measurements. The Poisson distribution is a basic probability distribution for modeling count-type data. With Poisson regression models, various experimental factors, GC content, and alternative splicing isoforms can be flexibly considered in RNA-seq data analysis. Due to the biochemical and technical limitations of sequencing technology, biases in RNA-seq data have been recognized.
Methods: In this study, we propose an artificial censoring approach for an isoform-specific Poisson regression model for analyzing RNA-seq data. Low expression values can be grouped (censored) into one probability category, and high expression values can be grouped (censored) into another probability category. We have implemented a Newton-Raphson numerical procedure to obtain the maximum likelihood estimates for our censored-Poisson regression model. Mathematical simplifications have been derived to make the numerical computation stable and convenient.
Results: The advantages of our artificial censoring approach have been demonstrated in both simulation studies and application analysis of experimental data.
Conclusions: Our proposed artificial censoring approach allows us to focus on the majority of data. As the extreme values (tails) of data are artificially censored, more efficient analysis results can be obtained, even from relatively simple Poisson regression models. Our proposed artificial censoring approach can certainly be considered for other well-developed models or methods for RNA-seq data analysis.
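To make the censoring idea concrete, the following sketch writes out a log-likelihood for a Poisson regression in which counts at or below a lower cutoff share one probability category and counts at or above an upper cutoff share another. It is an assumed, simplified form: the log link, the cutoffs `lower` and `upper`, and all function names are illustrative choices, and the Newton-Raphson fitting and isoform-specific structure of the original model are not reproduced.

```python
import math

def poisson_pmf(k, mu):
    """P(Y = k) for Y ~ Poisson(mu), computed on the log scale for stability."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Poisson(mu); zero when k is negative."""
    if k < 0:
        return 0.0
    return sum(poisson_pmf(j, mu) for j in range(k + 1))

def censored_loglik(beta, X, y, lower, upper):
    """Log-likelihood of a doubly censored Poisson regression (log link).

    Counts <= lower contribute P(Y <= lower); counts >= upper contribute
    P(Y >= upper); all other counts contribute the usual Poisson pmf.
    """
    total = 0.0
    for xi, yi in zip(X, y):
        mu = math.exp(sum(b * v for b, v in zip(beta, xi)))
        if yi <= lower:
            p = poisson_cdf(lower, mu)
        elif yi >= upper:
            p = 1.0 - poisson_cdf(upper - 1, mu)
        else:
            p = poisson_pmf(yi, mu)
        total += math.log(max(p, 1e-300))  # guard against log(0)
    return total

# Hypothetical intercept-only example: one design column of ones.
y = [0, 2, 3, 5, 4, 40]
X = [[1.0] for _ in y]
# Crude grid search over the intercept, standing in for Newton-Raphson.
grid = [i / 100 for i in range(0, 301)]
best = max(grid, key=lambda b0: censored_loglik([b0], X, y, lower=1, upper=10))
print("intercept estimate:", best)
```

With `upper=10`, the count 40 only contributes the information that it is at least 10, so a much larger outlier would give exactly the same likelihood; this is the sense in which the tails stop dominating the fit.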
This tutorial presents a mathematical theory that relates the probability of sample frequencies of M phenotypes in an isogenic population of N cells to the probability distribution of the sample mean of a quantitative biomarker when N is very large. An analogue to the statistical mechanics of the canonical ensemble is discussed.
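One standard way to write such a relation is sketched below. The notation is assumed for illustration and may differ from the tutorial's: $p_k$ is the probability of phenotype $k$, $n_k$ the number of sampled cells with that phenotype, $\nu_k = n_k/N$ the sample frequency, and, as a simplifying assumption, each phenotype carries a fixed biomarker value $x_k$.

```latex
% Multinomial probability of the phenotype counts (n_1 + ... + n_M = N)
P(n_1,\dots,n_M) = \frac{N!}{\prod_{k=1}^{M} n_k!}\,\prod_{k=1}^{M} p_k^{\,n_k}

% Stirling's approximation for large N gives a relative-entropy rate
\frac{1}{N}\ln P(\nu_1,\dots,\nu_M) \approx -\sum_{k=1}^{M} \nu_k \ln\frac{\nu_k}{p_k}
  \equiv -D(\nu \,\|\, p)

% Sample mean of the biomarker and its large-N probability
\bar{x} = \sum_{k=1}^{M} \nu_k x_k , \qquad
P(\bar{x} \approx a) \asymp e^{-N I(a)}, \qquad
I(a) = \min_{\nu:\ \sum_k \nu_k x_k = a} D(\nu \,\|\, p)

% The minimizer has an exponential (canonical-ensemble-like) form
\nu_k^{*} = \frac{p_k\, e^{\beta x_k}}{Z(\beta)}, \qquad
Z(\beta) = \sum_{k=1}^{M} p_k\, e^{\beta x_k}, \qquad
I(a) = \beta a - \ln Z(\beta)
```

Here $\beta$ is fixed by the constraint $\sum_k \nu_k^{*} x_k = a$, so $Z(\beta)$ plays the role of a partition function and $\beta$ that of an inverse temperature, up to the sign convention $e^{-\beta E}$ used in physics.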