1 INTRODUCTION
With the exception of replication-dependent histone mRNA processing, all mRNA 3′-end processing events involve endonucleolytic cleavage at poly(A) sites (PAS) followed by the addition of a poly(A) tail. Because mRNAs, in general, have more than one PAS, alternative polyadenylation (APA) can generate mRNAs of varying lengths through the selection of a specific cleavage site [
1]. Besides these protein-coding mRNAs, long non-coding RNAs (lncRNAs) have also been regulated by APA. About 66% of lncRNAs undergo APA and mostly located within the upstream poly(A) exons [
2]. Thus, APA is a widespread and evolutionarily conserved mechanism in the regulation of mammalian genes and it is estimated that approximately 70% of human genes undergo APA [
3].
APA can be generally classified into four different types. The two most common APA classes are tandem 3′-UTR APA, in which alternative PASs are located within the same terminal exon (Fig.1A), and alternative terminal exon APAs (sometimes called ‘splicing-APA’) (Fig. 1B), in which PASs are located in distinct terminal exons. The less-frequent APA classes include intronic APA and internal exon APA (Fig.1C, D). APA is tightly regulated via a combination of cis-regulatory sequences and APA regulators. Because the 3'-UTR region hosts many essential regulatory elements, such as AU-rich elements, microRNAs or RNA-binding protein sites, alteration of 3′-UTR length could alter function, stability, and translation efficiency of target mRNAs [
4]. The consequence of APA on mRNA stability/degradation is primarily mediated through regulating the presence and accessibility of these regulatory elements in different cellular contexts [
5]. APA can also impact other mRNA functions such as mRNA translation, mRNA nuclear export, and localization [
4]. Not only on the mRNA level, but APA mediated 3′-UTR alteration can also regulate the protein localization that is independent of RNA localization [
1]. For example, long 3′-UTR of CD47 localizes to the plasma membrane by recruiting the effect protein to the site of translation, while short 3′-UTR isoform of CD47 localizes to the ER and functions in the apoptosis regulation [
6]. Recent studies also show that 3′-UTR can be cleaved off to form small non-coding RNAs which act independently as repressors or activators in prokaryotes [
7‒
9] and polyadenylation near the end of 3′-UTR preserved that of the prokaryotic ancestor, which is also essential to the stability and degradation of mRNAs in human mitochondria [
10]. Such events can have phenotypic impacts on both normal development and the progression of diseases, such as cancer [
1,
5,
11].
Genome-wide association studies (GWAS) have significantly expanded our understanding of common inherited genetic variation effects on complex human diseases and traits, for which thousands of non-coding variants have been identified and associated with gene regulatory activities [
12]. Despite massive experimental efforts, many GWAS loci cannot be rationalized as impacting normal mRNA expression levels. For example, Chun
et al. found that only a small fraction of autoimmune disease risks are likely explained by impacting basal gene expression [
12]. Recent studies also suggest that genetic variations in RNA processing, such as alternative splicing and APA, likely perform an independent but equally as important role as genetic variations impacting transcription [
13] in associating with GWAS loci. Here we focused on the emerging evidence regarding the genetic role of APA in translating common genetic variations to phenotypes.
2 EXPERIMENTAL AND COMPUTATIONAL TOOLS TO DETECT GENETIC INFLUENCES OF APA EVENTS
2.1 RT-PCR/exon arrays
Quantitative RT-PCR analyses revealed that genetic variations can have critical regulatory consequences impacting APA events (Fig.2, Table 1). For example, Yang
et al. reported that a single-base change in the poly(A) signal could alter the polyadenylation pattern of
DHFR [
14]. APA of the
HLA-DQA1 gene was also associated with genetic variations [
17]. A single-nucleotide polymorphism (SNP) (rs10954213) within the
IRF5 3′-UTR region can alter both the length of 3′-UTR itself and mRNA stability [
19]. In addition to these individual cases, a limited number of studies investigate the genetic influence on APA on a more global scale using exon arrays. Fraser
et al. analyzed 176 human lymphoblastoid cell lines and reported that 37.9% exhibited changes in 3′-UTR constituency. They further investigated the influence of genetic polymorphisms on alternative transcript isoforms and found that 16 out of 20 significant genetic associations were APA-related [
27]. Kwan
et al. analyzed the 57 lymphoblastoid cell lines (LCL) samples sequenced by exon tiling arrays [
21] and found that 55% of the genes were strongly associated with isoform changes. The selection of an alternative splice site could also result in differential stop codon usage and create further variability in 3'-UTR length, such as with the genes
ATPIF1 and
TAP2 [
21]. However, potential weaknesses of all of these methods are that they depend on annotated poly(A) sites (Table 2) and are limited by inherent experimental caveats associated with microarrays such as poor or cross-hybridization [
11].
2.2 3′-End enriched RNA sequencing (3′-Seq) has advanced the global discovery and functional characterization of APA sites
There are now nearly twenty 3′-end enriched sequencing approaches [
14], including the QuantSeq 3′mRNA-Seq Library Prep Kit, Quantitative tag-based sequencing (3′-Seq) [
17], Poly(A)-ClickSeq [
19] and 3′READS+ [
27]. These 3′-end enriched approaches, which primarily detect 3′-ends using oligo(dT) primer based reverse transcription, have been employed to efficiently investigate the global effects of variations on APA (Fig. 2). Yoon
et al. performed 3′-end RNA sequencing in human B-lymphoblastoid cell lines from six individuals and identified the essential role of genetic variants by altering polyadenylation signals that could further lead to gene expression changes [
23]. Mittleman
et al. applied 3′-Seq to both the nuclear and total mRNA fractions of 52 lymphoblastoid cell lines. They identified 602 genetic variants that were associated with APA [
28]. These 3′-end approaches have also been applied to studies of non-human model organisms. Using 80 inbred
Drosophila wild isolates, Cannavo
et al. identified 311 genes with genetic variations affecting their 3'-UTR length [
29]. Although these APA protocols have increased the sensitivity of detecting precise locations of poly(A) sites, they are restricted by several technical issues, such as internal priming and general noise [
30]. In addition, they have not been widely adopted due to the lack of a unified APA profiling method and the small sample sizes of previous studies (Table 2).
In addition to these sequencing technologies, analytical methods have been developed for these 3′-end enriched techniques. DPAC [
31] is one of the first tools that streamlined the data analysis and includes preprocessing, poly(A)-site identification, poly(A)-clustering and differential Poly(A) clusters usage. PolyA-miner [
30] is another recently developed tool that enables
de novo detection of differential APA using iterative consensus clustering and vector projections algorithms.
2.3 The RNA-seq approach can be used to detect genetic control of APA
RNA-seq is routinely used to measure gene expression in human genetic studies [
32]. Capitilizing on the rich RNA-seq databases, the relative isoform abundance of individual APA events can also be captured by RNA-seq, which enables scientists to investigate genetic variants that act on APA on a population-scale. Early studies identified the genetic control of APA primarily through analyses of generic mRNA transcript structure with limited variants or coupling with splicing [
33]. For example, DeepSAGE sequencing can target mRNA 3′-ends [
25]. Using this technique to analyze 94 individual samples, Zhernakova
et al. identified and validated SNPs that could affect APA usage, thereby potentially influencing mRNA stability. Lappalainen
et al. analyzed the RNA-seq data from 462 LCLs cell lines and identified 639 genes in which genetic variants were associated with their altered mRNA transcript isoforms with 43% of these variants associated with an alternative 3′-end [
32]. In another study, Mariella
et al. analyzed Geuvadis RNA sequencing data for 373 European individuals and identified 2530 APA events associated with genetic variants [
34]. More recently, Li
et al. has constructed the first comprehensive atlas of human 3′-UTR alternative polyadenylation quantitative trait loci (3′aQTLs) using data from the Genotype-Tissue-Expression (GTEx) Project [
35]: ~0.4 million genetic variants associated with APA of target genes across 46 tissues from 467 individuals [
36]. Although the RNA-seq-based approaches may not be as accurate as 3′-end enriched approaches, they have tremendous upsides to allow for population-scale studies with larger sample sizes and broader conditions.
Along with these sequencing technology advances, several computational approaches have also been developed for genome-wide investigations of how genetic variations impact APA. These can be broadly classified as either annotation-based or
de novo methods. The annotation-based methods such as QAPA [
37] and Roar [
38] rely on existing poly(A) sites that are annotated in the GENCODE, polyADB database [
39], PolyAsite [
40], or APASdb databases [
41]. For example, QAPA employs sailfish to calculate isoform expression and then estimate the poly(A) usage by using the ratio of isoform expression to the sum of the expression of all detected 3′-UTR isoforms. These tools often provide greater accuracy and sensitivity. However, a potential limitation is that annotated transcript isoforms can only account for a small percentage of poly(A) sites [
40]. These annotated poly(A) sites are compiled for various tissues and conditions, thus may lack the comprehensive information for a particular tissue or cell type. There are also several
de novo analytical algorithms, such as GETUTR [
42], TAPAS [
43], APAtrap [
44] and DaPars [
11,
45]. DaPars is the first of its kind for the
de novo identification of dynamic APA events based on localized changes in 3’-UTR RNA-seq read density and was used to identify differential APA events in tumors and normal tissues. For a given transcript, DaPars identifies the distal poly(A) site independent of the gene model, corrects for potential RNA-seq non-uniformity bias, and uses a linear regression model to infer the
de novo proximal poly(A) site as an optimal fitting point that can best explain localized read density changes. APA usage differences can be quantified as the change in percentage of distal poly(A) site usage index. DaPars version 2 extended an earlier DaPars analysis of pairwise tumor/normal tissue comparisons [
11,
45] to include multiple RNA-seq joint analyses. The dynamics of APA genes can be determined based on a two-normal mixture model. Another important advantage of DaPars version 2 is that it supports multi-threading and requires significantly less processing time than other tools such as APAtrap, which required an over 100-fold longer runtime than DaPars v2 to analyze the same 1,000 transcripts. Therefore, DaPars version 2 is quite suitable for running large-scale population RNA-seq analyses.
3 EMERGING MOLECULAR MECHANISMS OF APA REGULATION
As a crucial post-transcriptional regulation mechanism, APA is precisely controlled by cis regulatory elements and trans-acting factors [
1,
4,
46]. Dysregulation of APA often results in hematological diseases or immunological diseases and cancer [
11,
47–
50]. Several mechanisms underlying these dysregulated APA events have been investigated, such as loss or gain of individual poly(A) sites due to poly(A) signal (PAS) mutations and alterations in canonical cleavage and polyadenylation factors [
51]. Recent studies aimed at discovering genetic variants affecting APA on the population level indicated a key role for genetically controlled APA in human diseases [
52]. Here we primarily discuss the emerging molecular mechanisms of APA regulations in human diseases.
3.1 Cis regulatory elements
Among these cis regulatory elements involved in RNA 3′-end processing, the AAUAAA hexamer is the core and canonical poly(A) signal recognized by the cleavage and polyadenylation specificity factor. Other non-canonical variants can also function as a PAS similar to AAUAAA but with a relative lower recognition efficiency [
53]. In addition to hexamers, the upstream regulatory element UGUA motif and the downstream G/U-rich region can modulate the relative strength of the PAS in several specific conditions [
4]. Perturbations of these cis-acting elements could lead to certain human diseases, especially variations in the PAS, which directly change PAS binding affinity of the cleavage and polyadenylation machinery [
51]. In addition to altering the PAS, mutations can also affect the binding of miRNAs resulting in altered gene expression [
54]. For example, mutants in the 3′-UTR of
ACTB mRNA facilitate its interaction with miR-1 and miR-29a via AGO2, promoting hepatocellular carcinoma (HCC) cell migration and invasion [
55]. Interestingly, mRNA 3′-UTR shortening can also play a key role in repressing tumor suppressor by disrupting competing endogenous RNAs (ceRNA) interactions in trans. Hyun
et al. performed a model-based analysis of the trans effect of 3′-UTR shortening and predicted many trans-targets of 3′-UTR shortening, including PTEN, a tumor-suppressor gene involved in ceRNA crosstalk with other 3′-UTR shortening genes [
56].
3.2 Trans factors
3.2.1 CFIm25
CFIm25 is encoded by the
Nudt21 gene and is an essential cleavage and polyadenylation factor that plays a crucial role in APA regulation.
CFIm25 knockdown can result in extensive 3′-UTR shortening of transcripts by increasing the proximal poly(A) site usage [
49,
57,
58]. While 3′-UTR global shortening occurs in different cancers, CFIm25 is likely an essential factor in this process. In glioblastoma,
CFIm25 depletion causes the up-regulation of several known oncogenes, such as cyclin D1 and
Pak1 [
49,
57]. Also in HCC, knockdown of
CFIm25 promotes HCC cell proliferation and metastasis, in part by increasing the expression levels of
PSMB2 and
CXXC5 [
58]. Recent findings showed that CFIm25 plays an essential role in bladder cancer progression through
ANXA2 and
LIMK2 by APA [
59]. In addition to cancer [
46], CFIm25 is also involved in other human diseases, such as systemic sclerosis and idiopathic pulmonary fibrosis [
60]. In both of these diseases, CFIm25 is down-regulated in key cell types present in the skin or lung promoting the 3′-UTR shortening of key TGFβ-regulated fibrotic genes [
61]. Finally, CFIm25 is critical to normal brain development as patients with copy number variation of
CFIm25 or mutations present significant intellectual disability [
62,
63].
3.2.2 PCF11
PCF11, a subunit of the CFIIm complex, is involved in tumor progression, cell cycle regulation, cell proliferation, apoptosis, and neurodifferentiation [
64]. Low expression of PCF11 in neuroblastoma is correlated with a favorable outcome and spontaneous tumor regression [
64]. A recent study revealed that PCF11 can impact the expression of longer genes through regulating intronic polyadenylation [
65].
3.2.3 MAGE-A11
The melanoma-associated antigen (MAGE) gene family is a large and conserved group of genes defined by a common MAGE homology domain [
66]. One member of this family, MAGE-A11, was found to be aberrantly expressed in cancer [
67,
68]. By designing a computational approach from existing cancer “big-data”, Seung
et al. recently reported that MAGE-A11-HUWE1 promotes APA and 3′-UTR shortening in cancer through ubiquitination of CFIm25 [
69]. This finding provides new insights into the functions of MAGE genes regarding APA processing in cancer development.
3.2.4 HnRNPC
Heterogeneous nuclear ribonucleoprotein C (hnRNPC) is an RNA-binding protein that aberrantly up-regulated in multiple cancer types [
70–
73]. hnRNPC plays a critical role in regulating APA in metastatic colon cancer cells [
74]. Mechanistically, hnRNPC regulates poly(A) site selection in a subset of genes implicated in cancer progression [
74].
3.3 Long-distance APA regulation
In addition to cis regulatory elements and trans-acting factors, recent studies also demonstrated how functional elements outside of transcribed regions of genes can profoundly impact RNA processing within that gene. Nanavaty et al. elegantly demonstrates the DNA methylation patterns can have significant impact on APA via the Cohesin and CTCF factors [
75]. In this instance, mutations/SNPs or changes in DNA methylation state outside the transcribed region of genes indeed causes APA changes via disruption in gene looping. Xiong
et al. demonstrate an unexpected finding that enhancers regulate alternative polyadenylation
in trans [
59]. Specifically, they reveal how enhancers can, independent of transcription output, specifically alter polyA site selection. In another study, Oktaba
et al. demonstrate that in
Drosophila, longer 3′-UTRs emanate from specific promoters active in neural tissue [
76]. Their data reveal a connection between RNA Pol II pausing and loading of the ELAV RNA binding protein as a key factor to repress proximal polyA site usage thereby promoting long 3′-UTRs. Importantly, they show that swapping out these promoters causes significant changes in APA events.
4 GENETIC ARCHITECTURE OF APA VARIATIONS
Although recent population-scale RNA-seq data revealed associations between genetic variants and APA, characterization of genetic architecture remains a significant challenge. Mittleman
et al. found many apaQTLs are intronic, and the genetic variants associated with increasing intronic poly(A) site usage tend to have lower gene expression levels [
28]. In addition, Li
et al. found that these genetic variants could alter poly(A) motifs and RNA-binding protein binding sites [
35]. Another challenge is to identify the causal APA variants. Several innovative statistical fine-mapping algorithms have been proposed to identify these causal variants. CAVIAR is a method that models the association between the local linkage disequilibrium (LD) structure and effect sizes to quantify the posterior probability of causality for each variant [
77]. The Sum of Single Effect (SuSiE) method operates on individual-level data to efficiently analyze loci with many independent effect variables [
78]. SuSiE produces clusters of association signals, formally defined as 95% bayesian credible sets and each signal clusters are highly correlated due to LD.
In addition to the above, experimental approaches such as massive parallel reporter assays (MPRAs) [
79] can test the effect of genetic variants on a tens-of-thousands scale for several target genes. These massive datasets can couple with machine learning approach to model APA patterns and predict the effects of genetic variants on APA. A recent study [
80] generated mini gene libraries of over 3 million unique UTRs constructs. Each library varied in mRNA structure and 3′-UTR region. These constructs were then transiently transfected into HEK293 cells for high-throughput sequencing. The massive sequencing data were then trained using the deep learning method APARENT (APA REgression NeT) to predict the impact and putative casual mechanism of genetic variants on APA.
5 GENETIC CONTROL OF APA IS ASSOCIATED WITH COMPLEX HUMAN DISEASES AND TRAITS
Genetic variations impacting APA play essential roles in the dysregulation of RNA 3′-end processing and have been frequently associated with human diseases and phenotypic changes. In one common scenario, genetic variants can alter poly(A) site usage and thus influence the gene expression of key genes, leading to a series of different human complex diseases. For example, You-Jun
et al. found an A-to-G mutation in the polyadenylation site of the α2-hemoglobin gene
HBA2 (AAUAAA to AAUAAG) that causes α-thalassemia (hemoglobin H disease) by reducing the expression of the αl-globin gene, resulting in a down-regulation of both the α1-globin and α2-globin genes [
81,
82]. In another thalassemia study, a T-to-C substitution within the 3′-end conserved sequence (AAUAAA to AACAAA) was shown to disrupt polyadenylation signals for APA, which results in an at least 900-bp extension of the human β-globin transcription and the subsequent β-thalassemia [
83,
84]. In addition, an A-to-C mutation in the canonical PAS within
TP53 (AAUAAA to AAUACT) was associated with the impaired 3′-end processing of
TP53 transcripts and increased susceptibility to multiple cancers, including cutaneous basal cell carcinoma, prostate cancer, glioma, and colorectal adenoma [
85]. Bennett
et al. reported a rare A-to-G PAS mutation of the
FOXP3 gene (AAUAAA to AAUGAA), which results in decreased expression of
FOXP3 due to degradation of the transcripts of
FOXP3, leading to the immunodysregulation polyendocrinopathy enteropathy X-linked syndrome [
86]. Besides these variations within poly(A) sites, a few studies have demonstrated that variants located within other cis regulatory elements are also associated with disease risk. For example, a single-nucleotide change within the GU-rich downstream element of
FGG gene can lead to increased distal poly(A) site usage. These differential APA usages are strongly associated with disease risk of deep vein thrombosis [
87].
6 SUMMARY AND OUTLOOK
Traditional RNA-seq reads cannot provide connectivity for most mRNA transcripts, but with the advance of long-read Oxford nanopore sequencing and PacBio isoform sequencing, we can rebuild the coordinate regulation of transcription initiation, alternative splicing, and APA. A recent study using MinION nanopore sequencing found the tight coordination between the
Dscam1 long 3′-UTR and skipping of exon 19 in neurons. This co-regulation generates a specific isoform essential for neural development [
88]. In another study, Anvar et al. also revealed an interdependent relationship in cultured human MCF-7 breast cancer cells [
89]. We expect more coordination of the landscape of APA with other transcript elements with the expansion of long-read sequencing in coming years. In addition, current studies are primarily restricted to the LCL. LCLs is well-studied cell type for investigating the genetic influence of APA and have been widely employed in large consortia studies, including the HapMap and 1000 Genome projects. Expansion to multiple other tissues, as in the case of the GTEx Project, or other cell types, as in the case of DICE [
90], will provide further insights into interpretation of GWAS non-coding variants.
Moreover, single-cell sequencing coupled with genetic information has opened a new era of population genetics [
91]. A recent study used single-cell RNA sequencing to sequence ~25,000 peripheral blood mononuclear cells from 45 donors and detect many cell type-specific expression quantitative trait loci (eQTLs) and gene-gene interaction network [
92]. More recently, several innovative computational tools such as scDaPars [
93] can quantify and recover APA usage at single-cell and single-gene resolution. Thus, we expect more population-scale studies with these computational approaches to precisely define the cellular contexts in which GWAS variants affect polyadenylation usage from these emerging single-cell data. This will help better understand the molecular mechanisms by which GWAS variant is conferred and therapeutic design strategies.
APA is also a conservaed phenomenon across species, and our recent work indicated that genetic influences on APA are mostly in a tissue-specific manner [
36]. It would be interesting to investigate the tissue-dependent genetic influence of APA across species. It has been known that species type, rather than tissue type, is the primary determinant of the splicing patterns [
94] and RNA-editing [
95]. However, it is currently unclear the evolutionary forces of tissue-dependent APA patterns. These studies could provide important insight into understanding the relationship between evolutionary patterns of APA and variation of phenotype across species.
The Author(s) 2022. Published by Higher Education Press