1 INTRODUCTION
Although primarily known for transmitting information from DNA to proteins, RNA has been discovered to play multiple important roles in various cellular processes in recent years [
1–
4]. To deepen understanding of the underlying mechanisms of multiple roles of RNA in cellular processes, it is crucial to outline the relationship between RNA structure and function, which is generally provided based on prediction and differential analysis of RNA secondary structure [
5–
7].
RNA secondary structure is formed by complementary base pairing, which exhibits dynamics in many aspects [
3,
8]. First, RNA secondary structure is flexible and dynamic even under the same environment. Second, many RNAs have been reported to adopt and convert between several secondary structures, thus different conformations of secondary structure may coexist in cellular context. Third, the structural landscape of RNA can change in response to experimental stimuli. The change may take place in the composition of different conformations, or in the adoption of new structures. These dynamics collectively contribute to the complexity of the depiction of the structural landscapes.
The dynamics of RNA secondary structure is extensive. Consequently, the combinatory landscape cannot be fully resolved by traditional experimental techniques, such as X-ray and cryogenic electron microscopy. Despite their high accuracy, traditional techniques are generally labor intensive. More importantly, the ability to conduct
in vivo measurements is lacking, further limiting the application of traditional techniques in resolving RNA secondary structure [
7,
9]. To overcome these difficulties, computational methods have been developed to predict RNA folding based on free energy or comparative sequence analysis. More recently,
in vivo structure profiling (SP) experiments and computational tools to analyze such data have brought new insight into the problem.
While current computational approaches mainly focus on the prediction of RNA secondary structure, there is a growing demand for differential analysis of RNA secondary structure,
i.e., identifying regions where the structural landscape of an RNA differs between different conditions [
10–
13]. Applications of differential analysis of RNA secondary structure have shed new light on many related problems, such as RNA-protein interactions and RNA structural patterns [
11,
14,
15].
The rest of the article is organized as follows. In Section of “Computational methods for RNA secondary structure prediction”, we overview state of art computational methods for RNA secondary structure prediction. In Section of “Data-driven RNA secondary structure prediction”, we overview the new generation of SP experiments as well as the corresponding databases and briefly introduce the analysis of SP data. Approaches combining computational methods and SP data for RNA secondary structure prediction will also be reviewed in this section. Lastly, we will review the differential analysis of RNA secondary structure in Section of “Discussion” . Potential directions for improvement in the prediction and differential analysis of RNA secondary structure will also be discussed in this section.
2 COMPUTATIONAL METHODS FOR RNA SECONDARY STRUCTURE PREDICTION
Due to the limitations of traditional techniques for measuring RNA secondary structure, for many years, computational methods for RNA secondary structure prediction are the primary origin of our knowledge on RNA structure and the corresponding potential functions for most RNAs. These computational methods are commonly based on thermodynamic models that calculate free energy of RNA secondary structure. An alternative approach, comparative sequence analysis, predicts RNA secondary structure by borrowing information from homologous RNA sequences. The computational approaches have been comprehensively reviewed elsewhere [
16,
17]. Here we briefly introduce several representative algorithms and benchmarking of computational methods. Combination of computational methods and recent SP data will be discussed in later sections.
2.1 Free energy based algorithms
Under thermodynamical assumptions, the problem of RNA folding can be transformed into optimization problems based on free energy. To elaborate, the probability of observing an RNA secondary structure
at the Boltzmann equilibrium,
, with the free energy of
,
is specified as follows:
where
are known physical constants [
18]. Through this law, we can access the probability of observing a specific structure by estimating its free energy, which is the sum of free energies of its substructures. The calculation depends on experimentally measured parameters [
19,
20]. An in-depth description of free energy calculation can be found in Zuker and Stiegler [
20].
2.2 Free energy minimization (MFE)
MFE is a pioneering computational algorithm that predicts RNA secondary structure through free energy minimization [
20]. According to the above thermodynamic statistical law, by minimizing free energy, MFE identifies the most likely structure. Specifically, the number of possible secondary structures grows exponentially with RNA sequence length (denoted by
), posing a severe computational challenge. Zuker
et al. devised a dynamic programming algorithm for the optimization problem, reducing the time and space complexity to
and
respectively. This algorithm provides a broadly applicable computational tool for RNA secondary structure prediction for the first time, through which RNA secondary structures are explored in batches.
However, cases of unsatisfactory prediction accuracy of MFE have been reported [
21], probably due to simplifications in the calculation of free energy, inaccuracies of the experimentally measured parameters, and neglected effects of ions and other molecules. To account for such complexity, several algorithms adapted MFE to predict a list of suboptimal structures [
19,
22,
23]. The length of such list of suboptimal structures, however, often grows exponentially with the length of RNA sequence, which is computationally extensive and brings greater difficulty to experimental validation.
2.3 Sfold
Instead of searching for structures with low free energy, the Sfold algorithm employs a sampling scheme in secondary structure prediction [
24–
26] that samples most representative structures from the Boltzmann ensemble. Specifically, the algorithm takes two steps. The forward step calculates the Boltzmann equilibrium probability for a secondary structure, while the key technique is the recursive calculation of equilibrium partition functions. The backward step samples structures from the Boltzmann distribution in a recursive fashion based on the sampling probabilities derived in the forward step. The algorithm has time and space complexity of
and
respectively. Next Sfold groups the sampled structures into clusters and identifies the centroid of each cluster. These centroids are thus representative of the ensemble and are supposed to approximate the true structures.
Prediction with the centroids has been reported to achieve higher accuracy on some RNAs compared to MFE [
24]. Moreover, the number of clusters of sampled structures is usually small, which is more practical for experimental validation.
2.4 Expected accuracy maximization (MEA)
As another popular computational method, MEA predicts RNA secondary structure by maximizing expected accuracy of base pairing prediction [
27]. Here, the expected accuracy of a structure
measures the consistency between
and the marginal pairing probabilities of the Boltzmann ensemble. Specifically,
where
and
is a scaling factor that balances sensitivity and specificity in evaluating prediction accuracy. A parsing algorithm is developed to solve the optimization problem with time and space complexity of
and
respectively, which are the same as those of MFE and Sfold.
To address the computational difficulty to calculate the expected accuracy, pseudo-expected accuracy is developed as an approximation, which can be optimized via stochastic sampling. The pseudo-expected accuracy serves as a good alternative under various measures of prediction accuracy [
28].
2.5 Comparative sequence analysis
Comparative sequence analysis often achieves higher prediction accuracy compared to free-energy based algorithms [
29]. Comparative sequence analysis leverages the fact that RNA sequence and structure are closely related to RNA function and thus are conserved during evolution [
30]. By integrating evolutionary information, a consensus structure from alignment of homologous sequences across different species is predicted. Some variants further incorporate other information including free energy of RNA structure to improve prediction accuracy [
31–
33]. Extensive reviews for tools for comparative sequence analysis can be found in [
16,
17,
30]. Yet, several complications greatly limit the scope of application of these tools. In general, enough homologous sequences are prerequisites to guarantee prediction accuracy, which seldom happens. In addition, laborious collection of homologous sequences and computational challenges in multiple sequence alignment also put limits on the application of this approach. Moreover, similar to free-energy based algorithms, it remains unclear how to extend the current framework of comparative sequence analysis to construct structural landscapes.
2.6 Benchmarking of computational methods for RNA secondary structure prediction
Systematic benchmarking is necessary to evaluate the accuracy of RNA secondary structure prediction algorithms. Benchmark datasets haven been constructed by collecting experimentally solved RNA secondary structures from databases Protein Data Bank [
34] and RNAstrand [
35]. Hajiaghayi
et al. [
36] have evaluated the accuracy of energy-based prediction algorithms, and concluded that improvement in thermodynamic parameters can greatly increase prediction accuracy, and relative performance of methods can be affected by the parameter set used for evaluation. In general, average F measure of energy-based methods lies between 0.6 and 0.7, depending on the RNA classes being considered. Puton
et al. have implemented a web server, CompaRNA, for continuous benchmarking of computational methods [
29]. Rankings of computational methods for RNA secondary structure prediction under various scenarios are systematically reported, including free energy based algorithms and approaches to comparative sequence analysis.
It is noteworthy that accuracies and rankings of computational methods can be strongly influenced by several factors, including the length of RNA sequence, RNA classes, and the existence of pseudoknots [
29,
37]. Moreover, there are several issues for the use of the benchmarks. First, it is hard to guarantee the separation of training data for existing computational methods and reference structures for benchmarking [
29,
38]. Second, in general the benchmarks are designed to evaluate single structure prediction [
29,
36,
37]. For computational methods predicting multiple structures, one of the structures is properly selected for evaluation [
29]. Third, the length of reference RNA sequences in benchmark datasets is limited by the computational burden for comparing computational methods [
29]. Thus it is suggested the rankings of methods should only be considered in specific context.
3 DATA-DRIVEN RNA SECONDARY STRUCTURE PREDICTION
Despite the successful prediction on a portion of RNAs, computational methods are hampered by their limitations, as described in Section of “Computational methods for RNA secondary structure prediction”, from a wider application range and higher prediction accuracy. Moreover, as a common defect of computational methods, the prediction relies on the sequence information only and hence cannot be adapted to make predictions in varied conditions.
Recent advancements in SP experiments can help to overcome these difficulties. SP experiments utilize chemicals or enzymes to probe RNA structure. The enzymes act differently to RNA nucleotides depending on their pairing status, through which local structural characteristics are decoded into profiling data by tracking the footprints of the additives. The history of SP experiments to be used for RNA secondary structure profiling can be dated back to the 1970s [
39,
40]. In recent years, with the development of experimental technology, SP experiments have entered a new generation, from low resolution, low-throughput,
in vitro experiments, to high resolution (nucleotide level), high-throughput experiments that are applicable under different conditions, such as various
in vivo environments. Meanwhile, it is worth noting that profiling data from SP experiments (SP data) reflects local structural characteristics, and thus behaves stable when the length of RNA increases. This is valuable considering the instability of computational methods for long RNAs. SP experiments have become powerful and cost-effective tools for transcriptome-wide structure profiling under different conditions, thus are also promising for the study of relationship between RNA structure and function [
5,
7,
14,
41].
3.1 High-throughput SP experiments and databases
In this section, we overview SP experiments and the databases of SP data.
3.1.1 SP experiments
Once RNAs fold
in vivo or
in vitro, specific RNase can selectively cleave either unpaired regions or paired regions, generating corresponding cleavage ends. Combining such RNases with high-throughput sequencing technology
, parallel analysis of RNA structures (PARS) and several alternative approaches have the capacity to interrogate RNA secondary structure at the transcriptome level [
42–
45].
However, approaches utilizing RNases are limited to
in vitro structure profiling, since in general nucleases cannot penetrate cell membrane. An alternative approach is chemical modification, which enables RNA modification
in vivo. Given the smaller molecular size of chemical probes [
14,
46–
67], chemical modification approaches also have higher resolution than RNases-based approaches. DMS-seq, for example, utilizes dimethyl sulphate (DMS), which specifically reacts with unpaired nucleotides, for genome-wide structure profiling
in vivo [
60]. However, such reaction is limited to adenine and cytosine residues, providing only a partial view of RNA secondary structure. Alternatively, selective 2’-hydroxyl acylation analyzed by primer extension (SHAPE) can be used as the reagent.
In vivo click selective 2’-hydroxyl acylation and profiling experiment (icSHAPE) [
14,
52,
53] can probe unpaired nucleotides
in vivo for all bases.
It's noteworthy that different chemical probes have distinct chemical reactivities, and probe combinations are suggested under certain conditions [
41]. Thorough discussions on experimental details of SP experiments can be found in [
7,
41,
68–
71].
3.1.2 Databases
With the rapid development of SP experiments, numerous SP data of different types have been generated. There are comprehensive databases that systematically archive and manage SP data [
72–
77]. The databases differ by the type of SP experiments, the types of RNAs, and the source species. Moreover, some databases provide visualization tools and facilitate application of computational methods. Details of the databases are summarized in Table 1.
3.2 SP reactivity estimation
While a large amount of high resolution SP data have been generated at transcriptome level, estimation of nucleotide reactivities from SP data has proven challenging [
15,
78]. The reactivity of a nucleotide measures the extent of the nucleotide reacting to the probing enzymes and thus infers the pairing status of the nucleotide. The inference procedure is complicated by several factors. First, multiple secondary structures of an RNA can co-exist, thus the observed reactivity of a nucleotide reflects the combination of pairing status in a mixture of secondary structures. Deconvolution of the reactivity measures is computationally challenging. Second, due to technical limitations of SP experiments, SP data are usually very noisy [
15,
79]. These collectively make the inference of RNA secondary structure very difficult. To account for the confounding factors, various approaches have been developed in the last decade. In particular, computational effort roughly falls into two classes, heuristic approaches and statistical model-based approaches.
3.2.1 Heuristic approaches
Previous to interpreting data from SP experiments, substantial preprocessing, or normalization, is needed, and the approach is usually tailored to specific experiment protocols. For example, PARS first normalizes SP counts to account for different sequencing depth and then derives reactivities as the log2 ratio between the normalized counts of RNase V1 and of RNase S1 [
45]. Here, RNase V1 and RNase S1 are enzymes that preferentially cleave paired and unpaired regions respectively. Another SP experiment, Structure-seq [
65], first normalizes natural logs of SP counts by transcript length and abundances in case experiment and control experiment separately, and then subtracts the normalized counts in control experiment from those in case experiment as raw reactivity measurements. Finally, 2%/8% normalization technique [
80] and a reactivity-capping procedure are sequentially applied to the raw reactivities to obtain final reactivity estimations [
65].
The normalization steps are customized to specific experimental procedures in SP experiments, hence they seldom directly apply to data generated by other SP experiments. Nevertheless, SP experiments share a set of core experimental procedures [
7,
41,
71], therefore general pipelines of reactivity estimation for various SP data have been proposed, which provided integrated computational framework for reads mapping, background correction, and reactivity derivation [
78,
81].
3.2.2 Statistical model-based approaches
The output of normalization approaches is the estimation of reactivity, and it is straightforward to determine RNA secondary structures by thresholding. However, model-based approaches often yield improved estimations and can be adapted to more complicated situations.
Aviran
et al. have devised a rigorous probabilistic model to describe the fragment distribution in SHAPE-Seq (
selective 2’-hydroxyl acylation analyzed by primer extension sequencing) experiment, which accounts for the effect of chemical modification and natural polymerase drop-off [
82]. In contrast to heuristic approaches that rely on expert knowledge, the structure inference is automatic through maximum likelihood estimation. Extending the strategy of generative models, another approach, PROBer, has been developed to simultaneously model reactivities of multiple RNAs in a Bayesian framework [
83]. Specifically, reactivities, as well as noise parameters, of multiple RNA isoforms are set to follow a common prior distribution, and maximum posterior estimates are further employed to incorporate the generative model of SP data. The Bayesian framework of PROBer not only greatly reduces the number of parameters, but also enables borrowing information across RNA transcripts, which is important in the estimation of reactivities considering the high noise level and limited number of replicates in SP data.
Unlike generative modeling approaches, BUM-HMM (Beta-Uniform Mixture Hidden Markov Model) first calculated empirical
P values for treatment-control LDR (log-ratio of drop-off rates) and then made structure inference by modeling the distribution of the
P values [
79]. In detail, an empirical null distribution of the LDR was constructed to account for the variability of replicates in SP data. By comparing the observed LDRs, log-ratios of drop-off rates of case experiments with those of control experiments, to the empirical null distribution, quantitative differences between case and control experiments are transferred into
P values. Next, BUM-HMM utilizes a hidden Markov model to leverage the dependency of RNA secondary structure between neighboring nucleotides. Hidden states in the model correspond to the status of modification. Several confounding factors, including coverage biases and sequence biases, are explicitly accounted for in the model. It is worth noticing that these factors, along with the way BUM-HMM deals with them, have general implications for the analysis of SP data beyond the estimation of reactivities. The idea to model structural dependencies in adjacent nucleotides has also been exploited in JPGM, a statistical model-based approach to interpret RNase footprint sequencing data [
84].
3.3 RNA secondary structure prediction integrating SP data
Free-energy based algorithms and comparative sequence analysis are based on nucleotide sequences, while SP experiment directly interrogates the RNA secondary structure by chemical probing regardless of the sequence context. Thus, these approaches provide complementary information, and it is reasonable to hypothesize that combining SP data with computational methods could potentially improve RNA secondary structure prediction.
To leverage on SP data, several approaches use reactivities as constraints to guide RNA folding. For example, Deigan
et al. have developed an approach to combine the MFE algorithm with reactivities measurements from SHAPE experiments [
85]. Instead of using hard thresholding,
i.e., forcing nucleotides with high (low) reactivities to be unpaired (paired) [
86], reactivities are incorporated as pseudo free energy terms for paired nucleotides. Specifically, the pseudo free energy term of nucleotide
i is defined as
where
and
are parameters that are pre-trained on datasets with reference structures. By adding pseudo free energy terms to the original energy function of MFE, structures that assign nucleotides with high reactivities as paired are penalized. In consequence, RNA folding is guided to match the structural information reflected by reactivities. This approach has been shown to improve prediction accuracy compared to MFE [
85]. However, the choice of parameters (
) for the whole transcriptome reflects the contribution of SP data, and it is not straightforward to specify their values due to their nonphysical nature [
87]. Moreover, it remains an open problem how to decide the format of pseudo free energy terms [
88–
90].
Similarly, SeqFold extended the Sfold algorithm to combine diverse types of SP data as a structure preference profile [
91]. After grouping structures sampled by Sfold into clusters, SeqFold assigns the structure preference profile into the most likely cluster and makes prediction with centroid of the cluster. This approach is robust and is flexible to accommodate a variety of SP data. In addition to integrate SP data, Rsample predicts multiple RNA structures, and provides estimates of the mixing proportions [
89]. Particularly, it allows more flexible options for RNA secondary structure prediction, either to predict a single structure with the MEA algorithm or to predict multiple representative structures with the Sfold algorithm.
Methods that combine SP experiment with free energy based prediction of RNA secondary structure have achieved comparable and even better performance compared to existing methods [
88,
90]. While integrating SP data as reactivities is usually convenient to implement, compressing SP data into reactivities can lead to loss of information. Alternatively, SLEQ (Structure Landscape Explorer and Quantifier) incorporates SP data at the reads level [
92]. Moreover, structural landscapes, that is, different conformations of secondary structure along with their relative abundances, are considered in this approach. In SLEQ, candidate structures are linked with patterns of sequencing reads through a generative model, where candidate structures can be flexibly selected based on prior knowledge, such as results from various computational methods. Then the approach maximizes the agreement between candidate structures and patterns of sequencing reads to reconstruct the structural landscape. Remarkably, SLEQ alleviates the dependence on thermodynamic models, allowing for versatile RNA secondary structure modeling. Nevertheless, the method enforces stringent constraints on completeness of candidate structures,
i.e., all structures in the landscape are required to be included in the list of candidate structures.
Some of the above approaches that combine computational methods with SP data have been integrated into databases listed in Table 1.
4 DISCUSSION
RNA secondary structure prediction has been a classical problem in computational biology. However, with the rapid development of SP experiments, the problem has attracted new attention in the last decade. Emerging approaches that combine SP data with conventional algorithms have been shown to improve prediction accuracy. However, existing approaches mainly focus on identification of single optimized structure, leaving deciphering the comprehensive structural landscapes, i.e., estimating the content and abundance of multiple co-existing structures, open problems.
In addition to RNA secondary structure prediction, SP data can also facilitate comparison of RNA secondary structure in different conditions, which brings insight regarding the conversion of RNA structure and their cellular roles [
10–
15]. Recently, dStruct proposed a statistical framework to measure dissimilarity of SP data across conditions and to screen for regions with significant difference [
13]. Undoubtfully, the identification of regions with structure change under different conditions brings important functional implications. More effort is needed in this line of work.
Last, we highlight a few emerging themes in analysis of SP data, which should be properly addressed in future development of models and algorithms. First, statistical models that accommodate the systematic biases and high noise nature of SP data are often required to make accurate and robust inference [
15,
79,
84,
93]. However, it is a challenging task as the number of replicates is limited and variability between replicates is usually substantial. Second, considering the complications in preprocessing SP data, it would be beneficial to incorporate SP data at raw reads level when combined with free energy based predictions. The reason is that summarizing SP data into reactivity profiles might lose information, and the procedure is highly dependent on the normalization method of choice. More importantly, for the problem of construction of structural landscapes, utilizing SP data at a less compressed level rather than reactivities might retain quantitative information. Similarly, it is important to balance between complexity and information retention of SP data in other related problems. Third, recent advances in high throughput mapping of RNA-RNA interactions, such as PARIS [
94], SPLASH [
95] and LIGR-seq [
96], shed new light on the prediction of RNA secondary structure. Rather than detecting pairing states of single nucleotides, these experiments directly capture the pairing of two nucleotides. Such two-dimensional information is valuable as it provides direct evidences for alternative conformations (if exist) and lays crucial foundations for constructing long-range structures.
Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature