INTRODUCTION
The diversity in morphology and function of the numerous cell types in the human body is driven primarily through differential gene expression. Perturbations in normal gene expression, at the level of both transcription and post-transcriptional regulatory mechanisms, lead to a number of human diseases, including cancer and neurodegenerative disorders [1–3]. Elucidating regulatory mechanisms that control gene expression networks is therefore critical to understanding the complexity of development, cell differentiation, and disease.
RNA binding proteins (RBPs) orchestrate many essential processes in the expression of protein-coding genes. This includes co-transcriptional pre-mRNA processing events such as exon splicing and 3′ end formation, as well as mRNA export to the cytoplasm, localization, translation, and decay [
4]. It is estimated that up to 1,500 proteins are able to bind RNA in humans [
5,
6], illustrating the complexity of post-transcriptional gene regulatory networks. Furthermore, protein-RNA interactions are dynamically assembled and remodeled at different stages of the mRNA lifecycle [
7], with the fate of a particular mRNA determined by the composition of attendant proteins. Accordingly, the diverse spectrum of RBPs expressed in different cell types and developmental stages generates distinct programs of combinatorial RNA regulation to fine tune gene output in different biological contexts.
Here, we review high-throughput, in vivo, protein-centric techniques used to explore protein-RNA interactions and highlight the power of integrative approaches to advance our understanding of how complex protein-RNA interaction networks regulate the transcriptome to specialize cell properties.
REDUCTIONIST APPROACHES TO INTERROGATE PROTEIN-RNA INTERACTIONS
Historically, the study of protein-RNA interactions was limited to reconstituted in vivo and in vitro approaches, including quantitative methods such as electrophoretic mobility shift assays (EMSA) [
8], surface plasmon resonance [
9], and yeast three-hybrid systems [
10]. Additionally,
in vitro selection of RNA [
11] or systematic evolution of ligands by exponential enrichment (SELEX) [
12] can identify subpopulations of RNA molecules that bind to a particular ligand with high affinity. While these approaches provide detailed kinetic information regarding the specificity and affinity of protein-RNA interactions, they do not recapitulate the native cellular environments that occur
in vivo. In contrast, RNA immunoprecipitation (RIP) uses a protein-specific antibody to co-precipitate associated RNAs followed by sequencing (RIP-seq) or microarray analysis (RIP-chip) [
13] to identify global RNA substrates. While RIP has been used successfully in many studies to identify relevant RNA targets [
13,
14], the technique does not identify direct sites of protein interaction with RNA. Additionally, in some conditions, protein-RNA complexes can remodel after cell lysis, leading to the capture of non-physiological interactions [
15].
The use of UV crosslinking to lock protein-RNA interactions
in vivo overcomes the limitations inherent to traditional RIP procedures. When exposed to UV light, photoreactive molecules of RNA form a covalent association with directly bound proteins, a principle first described by Dreyfuss and colleagues in the 1980s [
16]. The combination of UV crosslinking with immunoprecipitation, or crosslinking and immunoprecipitation (CLIP), is a powerful approach to identify direct protein-RNA interactions [
17]. Because of the covalent crosslinks formed, stringent purification techniques can be applied during the immunoprecipitation, improving the signal to noise ratio. Additionally, by partially digesting the RNA prior to immunoprecipitation, the interactions captured by CLIP provide positional binding information, as well as a list of direct RNA substrates.
The first method for isolating and purifying small RNA fragments crosslinked to a particular RBP was developed in the Darnell lab by Ule et al. in 2003 [17]. Briefly, triturated tissue or cells are UV irradiated to crosslink protein to RNA in vivo, followed by cell lysis and RNA fragmentation with RNase. The protein of interest and associated RNA fragments are purified by immunoprecipitation with stringent washes. This is followed by ligation of an RNA linker to the 3′ end of the RNA fragments, and radiolabeling of the 5′ end. The radiolabeled protein-RNA complexes are then subjected to SDS-PAGE and transferred to nitrocellulose. These steps allow the separation of protein-RNA complexes based on RNA fragment length, removal of free RNA, and visualization of protein-RNA complexes by autoradiography. After extraction of protein-RNA complexes from the nitrocellulose membrane, the target protein is degraded by proteinase K and a second RNA linker is ligated to the 5′ end of the RNA. Reverse-transcription PCR is then performed to generate cDNA that can either be cloned into plasmids for Sanger sequencing (as in the original study) or sequenced using massively parallel sequencing (Figure 1). In the first study [
17], CLIP was used to identify RNAs in mouse brain that are directly bound by NOVA, a neuron-specific splicing factor. Importantly, the functional importance of specific NOVA-RNA interactions identified could be interrogated using a gene knockout mouse lacking NOVA expression. These efforts showed that many RNA targets dependent on NOVA for proper alternative splicing had known functions in synaptic biology. The data therefore provided new insights into the aberrant neurological phenotype of NOVA KO mice while also validating CLIP as a powerful tool for RNA molecular biology research.
HIGH-RESOLUTION TRANSCRIPTOME-WIDE MAPPING OF PROTEIN-RNA INTERACTIONS
Applications of CLIP methods have vast potential to facilitate new mechanistic insights into RBP functions and post-transcriptional control of gene expression. Over the last decade, a plethora of CLIP based techniques have been developed, including hiCLIP [
18], CLASH [
19], irCLIP [
20], RipIT-seq [
21], TRIBE [
22], sCLIP [
23], dCLIP [
24], and Fr-iCLIP [
25]. Each technique improves or refines a particular aspect of the methodology. Below, we review the most widely used adaptations.
HITS-CLIP/CLIP-seq
The emergence of next-generation sequencing platforms in the mid-2000s, such as 454 Life Sciences and Solexa, ushered in a new era of RNA molecular biology research. For the first time, the transcriptome could be examined in a global, unbiased, and high-resolution manner, thus revolutionizing our understanding of transcriptome complexity and its regulation. Integrating high-throughput sequencing with CLIP (HITS-CLIP or CLIP-seq) led to a robust expansion of the number of protein-RNA interactions that could be identified from a single experiment. To emphasize the magnitude of this advance, the first CLIP experiment generated 340 NOVA CLIP reads from mouse brain using Sanger sequencing [
17]. In comparison, a 2008 study coupling CLIP with high-throughput sequencing [
26] generated ~168,000 CLIP reads and revealed thousands of biologically reproducible NOVA-RNA interactions across the brain transcriptome.
PAR-CLIP
UV crosslinking efficiency in vivo is estimated to be quite low, on the order of 1%–5% [27]. Photoactivatable ribonucleoside-enhanced crosslinking and immunoprecipitation (PAR-CLIP) improves the crosslinking efficiency, reportedly improving RNA recovery 100- to 1,000-fold as compared to traditional UV 254 nm crosslinking [
28]. Such improvement is achieved by the incorporation of UV-reactive ribonucleosides into nascent RNA molecules by adding a modified nucleoside, such as 4-thiouridine (4sU), to the cell media. At low concentrations 4sU is not reported to affect mRNA processing or protein synthesis; however, detection of protein-RNA interactions is limited to newly synthesized RNAs with recognition sites that contain the supplied nucleoside. Additionally, this technique is restricted to cell culture applications and cannot be used to investigate RBPs in whole tissue. Direct comparisons between HITS-CLIP and PAR-CLIP show highly reproducible binding sites and similar transcriptomic landscapes [
29].
iCLIP
Individual nucleotide resolution CLIP (iCLIP) [30] is similar to conventional HITS-CLIP with notable improvements. These include circularization of single-stranded cDNA in place of a 5′ RNA linker ligation step. Not only is this reaction more efficient than linker ligation, it also allows for the capture of truncated reverse transcription products. After protein degradation, several amino acids remain covalently attached to the RNA molecule at the crosslink site [
31] which cause frequent premature termination of reverse transcription at this position. The vast majority of these sequences are lost in traditional HITS-CLIP as PCR amplification requires reverse transcription read-through into the 5′ adaptor sequence [
32]. In iCLIP, the circularization step is performed after reverse transcription, and captures both full-length and truncated products. The site of crosslinking is predicted to correspond to the nucleotide immediately upstream of the 5′ end of the sequenced CLIP tag, providing a means to identify the direct site of protein-RNA interaction with single nucleotide resolution. While iCLIP improves upon the traditional HITS-CLIP protocol with greater library complexity and superior binding site resolution, the technique remains time consuming and challenging.
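The positional rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not code from any published iCLIP pipeline: the function names and the input layout (0-based, half-open alignment coordinates, as in BED files) are assumptions made for the example.

```python
from collections import Counter

def crosslink_site(start, end, strand):
    """Infer the crosslinked nucleotide from one aligned cDNA read.

    Assumes 0-based, half-open genomic coordinates [start, end).
    The crosslink is taken to be the nucleotide immediately upstream
    (in transcript orientation) of the read's 5' end.
    """
    if strand == "+":
        return start - 1   # 5' end is at `start`; upstream is one base lower
    if strand == "-":
        return end         # 5' end is at `end - 1`; upstream is one base higher
    raise ValueError("strand must be '+' or '-'")

def crosslink_counts(reads):
    """Count reads supporting each inferred crosslink site.

    `reads` is an iterable of (chrom, start, end, strand) tuples.
    """
    counts = Counter()
    for chrom, start, end, strand in reads:
        counts[(chrom, crosslink_site(start, end, strand), strand)] += 1
    return counts

# Hypothetical reads: two truncated cDNAs share the same 5' end on the + strand.
example_reads = [
    ("chr1", 100, 130, "+"),
    ("chr1", 100, 125, "+"),
    ("chr1", 50, 80, "-"),
]
site_counts = crosslink_counts(example_reads)
```

Reads of different lengths that terminate at the same position collapse onto one site, which is why truncated cDNAs add to binding site resolution rather than detract from it.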
eCLIP
The complexity of traditional CLIP protocols has proven to be a significant deterrent to large scale profiling efforts [33]. Enhanced CLIP (eCLIP) improves on iCLIP with a reported ~1,000-fold increase in preamplification products [
34] and reduced library preparation time, while maintaining single nucleotide resolution. These results are achieved by replacing the circularization step after reverse transcription with a more efficient ssDNA linker ligation. The ssDNA adapter contains a unique bar code to identify PCR duplicates generated during PCR amplification. Notably, eCLIP reduces the percentage of CLIP reads discarded due to PCR duplicates by ~60%, producing libraries with greater complexity even in cases with limiting samples or low abundance RBPs. Additional optimization and modifications shorten the protocol from almost two weeks to approximately 4 days. These advances have led to a significant expansion in the number of RBPs that have been interrogated by CLIP, with the datasets publicly available at www.encodeproject.org. Additional resources available to researchers include databases of experimentally validated antibodies [
35]. As the RBP of interest is immunoprecipitated using monoclonal or polyclonal antibodies, the quality and specificity of the antibody are critical to the success of the experiment. Alternatively, a fusion protein with an exogenous tag can be used [36, 37]; however, validation must be performed to verify that the fusion protein remains functional and that its expression is comparable to that of the endogenous protein.
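To illustrate how a unique bar code in the adapter allows PCR duplicates to be identified, the following sketch collapses reads that share both a mapping position and a bar code. It is a simplified illustration under stated assumptions, not the eCLIP processing pipeline: the tuple layout and the function name are hypothetical, and sequencing errors within the bar code are ignored.

```python
def collapse_pcr_duplicates(reads):
    """Keep one read per (chrom, 5' position, strand, barcode) combination.

    `reads` is an iterable of (chrom, five_prime_pos, strand, barcode) tuples.
    Reads with the same position but different barcodes are treated as
    independent molecules; identical combinations are treated as PCR copies.
    """
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique

# Hypothetical input: the first two reads are PCR copies of one molecule,
# the third maps to the same position but carries a different barcode.
example_reads = [
    ("chr2", 5000, "+", "ACGTAC"),
    ("chr2", 5000, "+", "ACGTAC"),
    ("chr2", 5000, "+", "TTGCAA"),
    ("chr2", 7200, "-", "ACGTAC"),
]
deduplicated = collapse_pcr_duplicates(example_reads)
```

Without the bar code, all three reads at position 5000 would be indistinguishable, and a position-only filter would discard a genuine independent binding event.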
Considerations for interpreting protein-RNA interactions from global analyses
The staggering amount of descriptive information generated in the modern CLIP era has led to significant advances in our understanding of RNA regulation. However, with comprehensive, global interaction maps comes the need to distinguish biologically functional from non-functional opportunistic interactions. Thus, integration of CLIP maps with ancillary data sets and experimental validation must be performed. Knockdown or knockout of the RBP of interest, followed by RNA analyses, can determine which RNAs and regulatory events are sensitive to the depletion of the RBP in question. A caveat of these approaches is the potential for global remodeling of protein-RNA interactions, and functional compensation by other RBPs. Multiple strategies can be used to further interrogate mechanism and confirm functionality of specific RBP binding sites defined by CLIP, such as the mutation of protein binding sites. Alternatively, artificial tethering assays using MS2 hairpins or RNA-guided Cas9 [
38] can bring an RBP to an atypical target. However, artificial manipulation of any system may have unanticipated outcomes. An additional consideration is the inefficiency of 254 nm UV crosslinking and intrinsic crosslinking biases [
39], whereby pyrimidines crosslink more efficiently than purines, thus potentially influencing capture of specific transcripts based on their nucleotide composition.
COMPUTATIONAL TOOLS
The emergence of CLIP has prompted a new generation of bioinformatic tools and platforms developed to investigate protein-RNA regulation. At a fundamental level, analysis of deep sequencing data involves steps to distinguish signal from noise. General strategies include analysis of uniquely mapped tags, removal of potential PCR duplicates, and identification of biologically reproducible interactions. With the widespread adoption and use of CLIP methods, a large number of computational approaches (discussed below) have been developed to facilitate data analysis and to gain insights into the molecular function(s) of the RBP of interest.
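One simple way to express the idea of biological reproducibility is to retain only genomic regions supported by tags from more than one replicate. The sketch below does this with fixed-width bins; the bin size, the replicate threshold, and the input layout are assumptions chosen for illustration, and dedicated tools generally use cluster-based rather than bin-based definitions.

```python
from collections import defaultdict

def reproducible_bins(tags, bin_size=50, min_replicates=2):
    """Return bins containing tags from at least `min_replicates` replicates.

    `tags` is an iterable of (replicate_id, chrom, position, strand) tuples
    describing uniquely mapped, de-duplicated tags.
    Each returned key is (chrom, bin_index, strand).
    """
    support = defaultdict(set)
    for replicate, chrom, pos, strand in tags:
        support[(chrom, pos // bin_size, strand)].add(replicate)
    return sorted(
        key for key, replicates in support.items()
        if len(replicates) >= min_replicates
    )

# Hypothetical tags: the region near position 120-140 is seen in both
# replicates, while the region near 900 is seen in replicate 1 only.
example_tags = [
    ("rep1", "chr1", 120, "+"),
    ("rep2", "chr1", 140, "+"),
    ("rep1", "chr1", 900, "+"),
    ("rep1", "chr1", 910, "+"),
]
kept = reproducible_bins(example_tags)
```

Note that two tags from the same replicate do not make a bin reproducible; only the number of distinct replicates counts.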
Peak calling tools
Since the advent of CLIP methods, an array of software tools has been developed to distinguish RBP binding sites from the raw data. Such tools include CLIPSeqTools [
40], ASPeak [
41], CLIPZ [
42], PIPE-CLIP [
43], Piranha [
44], MiClip [
45], Pyicoclip [
46], CLIPper [
47], PYCRAC [
48] and PARalyzer [
49]. Some of these algorithms can be applied to data from different CLIP platforms and include commands that pre-process input files to remove adapters, mask repeats, align reads to the genome, remove PCR duplicates, identify chemical modifications, and invoke binomial models to identify enriched peaks. The majority of these packages provide outputs that are compatible with common genome browsers for easy visualization. Given the many bioinformatic tools that have emerged from CLIP technologies, users should be able to find one that can interrogate data generated from HITS-, PAR-, i-, or eCLIP and that runs on their operating system (Linux, macOS, or Windows).
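As a rough illustration of how a binomial model can flag an enriched peak, the sketch below asks how likely it is to see at least the observed number of reads in a window if reads were scattered uniformly along the transcript. The uniform null model, the function names, and the example numbers are assumptions made for this example; the tools listed above each define their own background models and corrections.

```python
from math import comb

def binomial_sf(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), computed by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def window_p_value(window_reads, transcript_reads, window_len, transcript_len):
    """P-value for read enrichment in a window under a uniform null model.

    Assumes each of the `transcript_reads` reads lands in the window with
    probability window_len / transcript_len if binding is non-specific.
    """
    p = window_len / transcript_len
    return binomial_sf(window_reads, transcript_reads, p)

# Hypothetical example: 100 reads on a 5,000-nt transcript, 50-nt window.
enriched = window_p_value(20, 100, 50, 5000)     # 20 reads where ~1 is expected
background = window_p_value(1, 100, 50, 5000)    # 1 read, consistent with chance
```

In a transcriptome-wide analysis the resulting p-values would still need multiple-testing correction, which is omitted here for brevity.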
Binding site predictions
High-resolution RBP-RNA interactome maps generated by CLIP methodologies present an opportunity to identify RBP binding signatures. Multiple motif discovery tools, such as MEME [
50], compseq (EMBOSS utilities), HOMER [
51], mCarts [
52], and ChIPMunk [
53] can be applied to determine RBP binding motifs. Additionally, software specifically designed to interrogate CLIP data has been developed. In 2011, a computational framework, known as CIMS, was developed to identify protein-RNA interactions at single nucleotide resolution [
54]. The foundation of this platform takes advantage of the residual amino-acid-RNA adduct remaining after UV crosslinking and protein degradation, which can obstruct the reverse transcriptase or result in a mutation in the cDNA at the crosslink site. Another motif discovery algorithm, called Zagros, was modeled to exploit sequence, secondary structure, and technology-specific crosslinking events from CLIP data [
55]. These techniques can be readily utilized to characterize sequences within CLIP peaks.
Overcoming CLIP biases
As described above, several resources are available to annotate RBP binding sites based on CLIP data; however, many of these methods cannot distinguish binding of a single RBP from the binding of a protein complex, nor can they account for preferential detection of uridine-rich sequences [
32]. To circumvent these biases, CLIP data can be integrated with other protein-RNA binding methods. One such method is RNACompete, which systematically analyzes RNA binding specificities of RBPs in a rapid and low-cost fashion [
56]. In brief, RNACompete uses in vitro RNA-protein binding, followed by high-throughput analyses to identify RNA binding motifs. Notably, this approach provides binding preferences for a large number of sequences with only one binding reaction, and is free of any crosslinking-induced modifications. On the other hand, in vitro approaches do not account for the competitive and cooperative binding of other proteins in the cell. Furthermore, the influence of secondary structure on binding is complicated by the number of probes required for the assay.
Following RNACompete, RNA Bind-n-Seq (RBNS) [57] was developed to quantitatively measure RBP binding affinities to a range of bound RNAs. The high-throughput nature of RBNS makes it more robust than previous quantitative assays, such as electrophoretic mobility shift assays and surface plasmon resonance, which provide Kd values only at low throughput. Evolved from previous protein-DNA binding protocols, RBNS optimized RBP concentration relative to binding affinities, and advanced an analytical framework to estimate the effects of secondary structure on protein binding [57]. Importantly, Lambert et al. demonstrated that motifs enriched by CLIP, but not by RBNS, are not associated with regulatory activity, suggesting that RBNS is a valuable supplement in creating RBP profiles when using CLIP [57].
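A common way to summarize in vitro selection data of this kind is a k-mer enrichment ratio: the frequency of a k-mer among protein-bound sequences divided by its frequency in the input pool. The sketch below illustrates the calculation; the function names, the k-mer length, and the fallback frequency for k-mers absent from the input are assumptions for the example, not parameters taken from the RBNS publication.

```python
from collections import Counter

def kmer_frequencies(sequences, k):
    """Return the relative frequency of every k-mer across `sequences`."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def kmer_enrichment(bound, input_pool, k=6, floor=1e-6):
    """Return bound/input frequency ratios for k-mers seen in `bound`.

    `floor` is an assumed fallback frequency for k-mers absent from the
    input pool, used only to avoid division by zero.
    """
    bound_freq = kmer_frequencies(bound, k)
    input_freq = kmer_frequencies(input_pool, k)
    return {
        kmer: freq / input_freq.get(kmer, floor)
        for kmer, freq in bound_freq.items()
    }

# Toy sequences: UGCAUG occurs twice among 7 bound k-mers and once among
# 7 input k-mers, giving a two-fold enrichment.
ratios = kmer_enrichment(["UGCAUGUGCAUG"], ["UGCAUGAAAAAA"], k=6)
```

The same ratio can be computed for sequences under CLIP peaks relative to a matched background, which makes the two data types directly comparable at the motif level.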
DELINEATING FUNCTIONS OF PROTEIN-RNA INTERACTIONS
RBP binding profiles and splicing maps
Our understanding of splicing regulatory networks advanced in the late 1990s, when SELEX-based experiments were adopted to investigate the position of SR proteins relative to spliced exons [
58]. Liu
et al. showed that individual SR proteins have a high affinity to bind exonic splicing enhancers, thereby promoting the inclusion of alternative exons [
58]. Since then, the development of splicing microarrays [
59–
62], RIP-chip [
14], and CLIP [
26,
63–
72] technologies has allowed metagene analyses and the construction of “splicing maps” to reveal position-dependent effects of RBPs on splicing function. Generally, RBPs bound upstream of an alternative exon repress splicing of the exon, while RBPs bound downstream promote exon inclusion. More specifically, NOVA, hnRNP C, hnRNP L, hnRNP H, PTBP1, PTBP2, and MBNL1 have been shown to silence exon inclusion by binding at positions close to the branch site or the 3′ splice site, thus interfering with the spliceosome machinery. In contrast, NOVA, RBFOX, hnRNP L, and TIA proteins bind downstream to promote alternative exon inclusion, potentially by RNA looping. Intron versus exon binding also dictates splicing behavior. In general, hnRNPs inhibit splicing when bound to an exon, yet can regulate splicing either positively or negatively when bound to intronic sites [
73,
74].
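The metagene logic behind a splicing map can be sketched as follows: binding sites are re-expressed relative to the first nucleotide of each regulated exon and pooled separately for each class of exon. This is a simplified illustration; the class labels, the flank size, and the input layout are assumptions, and published maps typically also anchor on the 5′ splice site and on flanking constitutive exons.

```python
from collections import Counter

def splicing_map(exons, sites, flank=200):
    """Aggregate binding positions relative to the start of regulated exons.

    `exons`: (chrom, start, end, strand, label) tuples, 0-based half-open,
             where `label` is a regulation class such as "enhanced".
    `sites`: (chrom, position, strand) tuples, e.g. crosslink sites.
    Returns {label: Counter({relative_position: count})}; negative offsets
    lie in the upstream intron, offsets >= exon length lie downstream.
    """
    profiles = {}
    for chrom, start, end, strand, label in exons:
        profile = profiles.setdefault(label, Counter())
        for site_chrom, pos, site_strand in sites:
            if site_chrom != chrom or site_strand != strand:
                continue
            # First exonic nucleotide in transcript orientation:
            # `start` on the + strand, `end - 1` on the - strand.
            rel = pos - start if strand == "+" else (end - 1) - pos
            if -flank <= rel < flank:
                profile[rel] += 1
    return profiles

# Hypothetical data: one silenced exon with upstream binding and one
# enhanced exon (minus strand) with downstream binding.
example_exons = [
    ("chr1", 1000, 1100, "+", "silenced"),
    ("chr1", 5000, 5100, "-", "enhanced"),
]
example_sites = [("chr1", 980, "+"), ("chr1", 4950, "-"), ("chr1", 3000, "+")]
profiles = splicing_map(example_exons, example_sites)
```

Plotting each profile against relative position then shows whether binding upstream or downstream of the exon tracks with silencing or enhancement.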
While many splicing maps have been generated for individual RBPs, global splicing networks of combinatorial RBPs remain poorly understood. In 2010, the binding patterns of NOVA and its subsequent contribution to splicing were characterized using RBP-RNA transcriptome-wide maps [75]. Not only did Zhang et al. show conserved NOVA binding sites, they also discovered an enrichment of the RBFOX (UGCAUG) binding element, suggesting combinatorial regulation of splicing by NOVA and RBFOX. Shortly after, additional evidence for widespread combinatorial regulation of alternative splicing was reported for ESRPs and RBFOX2/RBM47 during the epithelial-mesenchymal transition [76, 77]. More recently, Damianov et al. demonstrated that nuclear RBFOX proteins are bound within a large assembly of splicing regulators (LASR), which includes hnRNP M, hnRNP H, hnRNP C, Matrin3, NF110/NFAR-2, NF45, and DDX5 [70]. This study implies that protein complexes may function in controlling splicing networks, and that identification of such complexes may be required to fully decipher regulatory codes.
Position-dependent regulation of alternative polyadenylation
In addition to elucidating general regulatory rules associated with known functions of RBPs, high-throughput sequencing after CLIP can uncover new functions for RBPs. The unanticipated identification of NOVA CLIP tags overlapping or near polyadenylation (pA) sites led to a hypothesis that this protein could have a second, splicing-independent function in mRNA processing in the brain [
26]. Indeed, subsequent experiments indicated that NOVA impacts alternative polyadenylation choices to promote expression of mRNAs with brain-specific 3′ UTRs. In 2014, Batra
et al. also utilized HITS-CLIP data alongside minigene reporter assays to show that direct binding of MBNL to target RNAs regulates pA site selection [
78]. Importantly, polyadenylation regulatory maps revealed that pA sites are repressed when MBNL binds in close proximity to the core 3′ end processing region, while more distal binding (upstream) activates pA selection. Similarly, CLIP-seq technology was used to investigate the functional impact of FUS binding clusters near alternative pA sites. Extensive transcriptome-wide mapping demonstrated a positional dependence of FUS binding to RNA for activation of alternative polyadenylation and suggested the involvement of RNA polymerase accumulation in this regulation [
79]. Two new computational platforms, expressRNA and RNAmotifs2, were developed to advance the position-dependent principles of pre-mRNA processing [
80]. These platforms demonstrated that TDP-43 most often binds the proximal same-exon pA site to repress site usage, but also has enriched binding downstream of activated pA sites (consistent with TDP-43 splicing regulation) [
80]. While RNA binding maps are fundamental to understanding the role of RBPs in alternative polyadenylation, a need remains for an integrative platform, which accounts for structure and combinatorial regulation with other
trans-acting factors, including the cleavage and polyadenylation machinery.
Insights into cytoplasmic mRNA regulation from protein-RNA interaction maps
CLIP methods have also been applied to the study of cytoplasmic mRNA metabolism. In conjunction with RNA-seq analysis and ribosome profiling, CLIP analysis of mammalian UPF1 has been used to identify direct targets of nonsense-mediated decay (NMD) [81]. Contrary to expectation, Hurt et al. identified reproducible UPF1 binding to 3′ UTRs with a density 10 times greater than that seen in the coding region [81]. While CLIP analysis did not suggest a clear binding motif, bound sequences were enriched for guanosine residues and secondary structure. Further characterization found UPF1 targets to have longer 3′ UTRs and increased translational efficiency. Another study found that MBNL1 and CUGBP1 preferentially bind 3′ UTRs to facilitate mRNA decay in C2C12 cells [82]. Although NMD has been studied for nearly forty years, mechanisms of NMD substrate recognition remain unclear.
CLIP methods have also been used to study gene regulation by microRNAs (miRNAs), which post-transcriptionally control gene expression by binding to 3′ UTRs. Functional miRNAs are loaded on AGO complexes to bind target RNAs, leading to silencing by translation repression or nucleolytic turnover. While bioinformatic predictions can identify miRNA binding sites, even the most stringent analyses yield high rates of false positives. Application of HITS-CLIP to AGO
in vivo allows functional miRNA target sites to be mapped [
83]. Since miRNA-mRNA base pairing occurs within short seed regions, size selection by SDS-PAGE after immunoprecipitation is especially crucial. Moore
et al. have published a detailed protocol for AGO HITS-CLIP library construction, as well as downstream computational analyses [
83] that can be applied to cells or tissues of interest. Nearly five years after the AGO HITS-CLIP method was published, CLASH (crosslinking, ligation, and sequencing of hybrids) [
84] was developed as a technique to capture and map miRNA-RNA duplexes associated with human AGO1. Using this approach, Helwak
et al. annotated several miRNA “seed” binding sites, a substantial number of interactions that do not involve contacts within the seed region, and enriched motifs within these sites. CLASH technology and the nearly forty AGO HITS-CLIP datasets published inspired the development of new algorithms to resolve highly expressed miRNAs associated with AGO CLIP peaks, including microMUMMIE [
85], DIANA-microT-CDS [
86], STarMir [
87], MIRZA [
88], CLIPZ [
42], and chimiRic [
89]. These computational models use crosslinking mutations, energy-based duplex predictions, protocol-specific sequence signals, and/or common AGO sequence preferences to assign the most likely canonical and noncanonical seed sites. Together, these experimental and bioinformatic technologies improve miRNA target mapping transcriptome-wide, and demonstrate that miRNA-mRNA targeting is much more widespread than anticipated.
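For orientation, the canonical seed search that these tools refine can be sketched in a few lines. The example assumes the commonly used definition of the seed as miRNA nucleotides 2–8 and looks only for perfect Watson-Crick matches within a sequence such as an AGO CLIP peak; it does not model the noncanonical interactions, duplex energies, or crosslinking signals used by the algorithms listed above. The function name and the example sequences are illustrative.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_match_sites(mirna, target_rna, seed_start=1, seed_end=8):
    """Return 0-based start positions of perfect seed matches in `target_rna`.

    The seed is taken as mirna[seed_start:seed_end], i.e. nucleotides 2-8
    with the default (assumed) settings. The matching site on the target
    is the reverse complement of the seed.
    """
    seed = mirna[seed_start:seed_end]
    site = "".join(COMPLEMENT[nt] for nt in reversed(seed))
    return [
        i for i in range(len(target_rna) - len(site) + 1)
        if target_rna[i:i + len(site)] == site
    ]

# Example miRNA sequence and a toy target fragment containing one
# perfect 7-nt seed match (CUACCUC) starting at index 3.
example_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
example_target = "AAACUACCUCAGGG"
matches = seed_match_sites(example_mirna, example_target)
```

Restricting such a scan to sequences under AGO CLIP peaks, rather than to entire 3′ UTRs, is one way in which experimental binding maps reduce the false positives of purely sequence-based prediction.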
FUTURE OF TRANSCRIPTOME-WIDE PROTEIN-RNA MAPPING TO INVESTIGATE REGULATION OF GENE EXPRESSION
The ability to map protein-RNA contacts in a transcriptome-wide manner has revolutionized our understanding of RNA regulatory mechanisms and functions. Future studies of RBP regulation will significantly benefit from the advancement of tools to map RNA secondary structure. Interestingly, it has been postulated that not all regulatory factors recognize consensus sequences [
90], but instead may be recruited to the RNA via secondary structure or other co-factors. Moving forward, integration of RNA structure predictions with protein-protein and protein-RNA interaction tools will give rise to a more complete characterization of RNA regulation by RBPs.
The experimental methods described in this review are fundamentally based on the purification of RNA bound to a specific protein of interest. These protein-focused techniques provide detailed positional information and may illuminate networks of RNAs under coordinate regulation. However, they do not yield data regarding the composition of the unique complex of proteins bound to each mRNA molecule. While not discussed in depth in this review, a variety of methods make use of the RNA molecule as bait to capture and identify bound proteins using mass spectrometry analysis. Such methodologies include CHART [
91], ChIRP [
92], and RAP-MS [
93]. Additionally, global interactome capture of the entire mRNA-associated proteome has identified many previously unknown RBPs [
94,
95]. Coupling CLIP methods with increasingly powerful biochemical and genetic tools [
96] has the potential to reveal RNA regulatory networks that underlie cell type-specific functions and developmental programs.
In summary, unraveling the complexity of RNA regulatory events and gene regulation is key to understanding gene expression and disease pathogenesis. In addition to elucidating general regulatory rules associated with known functions of RBPs, high-throughput sequencing after CLIP can uncover new functions for RBPs and shed light on global RNA networks. In the future, CLIP technologies can be adapted to address questions relating to countless aspects of RNA metabolism.
Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature