THE CRISPR-CAS9 SYSTEM
The CRISPR-Cas system is widely found in bacterial and archaeal genomes as a defense mechanism against invading viruses and plasmids [1–6]. The type II CRISPR-Cas system from Streptococcus pyogenes relies on only one protein, the nuclease Cas9, and two noncoding RNAs, crRNA and tracrRNA, to target DNA [7]. These two noncoding RNAs can further be fused into one single guide RNA (sgRNA). The Cas9/sgRNA complex binds double-stranded DNA sequences that contain a sequence match to the first 17-20 nucleotides of the sgRNA if the target sequence is followed by a protospacer adjacent motif (PAM) (Figure 1). Once bound, two independent nuclease domains in Cas9 each cleave one of the DNA strands 3 bases upstream of the PAM, leaving a blunt-ended DNA double-strand break (DSB). DSBs are repaired mainly through either the nonhomologous end joining (NHEJ) pathway or homology-directed repair (HDR). NHEJ typically leads to short insertions/deletions (indels) near the cutting site, whereas HDR can be used to introduce specific sequences into the cutting site if exogenous template DNA is provided. This discovery paved the way for use of Cas9 as a genome-engineering tool in other species. In this review, we focus on target specificity of the CRISPR-Cas9 system. We refer readers to other excellent reviews for further discussion of the CRISPR-Cas9 technology [8–11].
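The targeting rule described above (a guide-matching sequence immediately followed by an NGG PAM, with a blunt cut 3 bases upstream of the PAM) can be illustrated with a minimal Python sketch. The function names, the 20-nucleotide default guide length, and the toy sequence are illustrative assumptions and are not taken from any published tool; the sketch scans only for exact NGG PAMs and ignores mismatches entirely.

```python
import re


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_target_sites(dna, guide_len=20):
    """List candidate sites (protospacer + NGG PAM) on both strands.

    Each hit is (strand, protospacer, pam, cut_index). cut_index is the
    0-based index, on the scanned strand, of the first base 3' of the
    blunt cut, i.e. 3 bases upstream of the PAM.
    """
    hits = []
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for m in pattern.finditer(seq):
            pam_start = m.start() + guide_len
            hits.append((strand, m.group(1), m.group(2), pam_start - 3))
    return hits


# Hypothetical toy sequence: one 20-nt protospacer followed by an AGG PAM.
toy = "GATTACAGATTACAGATTAC" + "AGG" + "TT"
for hit in find_target_sites(toy):
    print(hit)
```

For minus-strand hits the reported coordinates refer to the reverse-complemented sequence; converting them back to plus-strand coordinates is left out to keep the sketch short.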
APPLICATIONS OF CRISPR-CAS9
Genome editing
The use of the CRISPR-Cas9 system as a tool to manipulate the genome was first demonstrated in 2013 in mammalian cells [
12,
13]. Both studies showed that expressing a codon-optimized Cas9 protein and a guide RNA leads to efficient cleavage and short indels of target loci, which could inactivate protein-coding genes by inducing frameshifts. Up to five genes have been mutated simultaneously in mouse and fish cells by delivering five guide RNAs [
14,
15]. Targeting two sites on the same chromosome can be used to create deletions and inversions of regions ranging from 100 bp to 1,000,000 bp [
16,
17]. Defined interchromosomal translocations such as those found in specific cancers can be created by targeting Cas9 to different chromosomes [
18]. With exogenous template oligos, specific sequences such as HA-tag or GFP could be inserted into genes to label proteins [
19,
20], or to correct mutations in disease genes in human and mouse [
21–
23]. The system has also been adapted to many other species, including monkey, pig, rat, zebrafish, worm, yeast, and several plants [
9].
Transcriptome modulation
Mutating the two nuclease domains of Cas9 generates the catalytically inactive Cas9 (dCas9), or nuclease-null Cas9, which can bind DNA without introducing cleavage or mutation [
7]. When targeted to promoters, dCas9 binding alone can interfere with transcription initiation, likely by blocking binding of transcription factors or RNA polymerases. When targeted to the non-template strand within the gene body, dCas9 complex blocks RNA polymerase II transcription elongation [
24–
26]. Fusing dCas9 with transcription repressor domains such as the Krueppel-associated box (KRAB) leads to stronger silencing of mammalian genes, a technology termed CRISPRi [
24]. Activation of transcription is also possible by fusing dCas9 with activator domains such as VP64. However, several studies showed that multiple sgRNAs targeting the same promoter need to be used simultaneously to change target gene expression substantially [
27–
29]. The position of target sites with respect to the transcription start site (TSS) affects the efficiency of silencing or activation, a subject that needs to be further investigated for optimal target design [
30].
Genomic loci imaging and other applications
To enable site-specific labeling and imaging of endogenous loci in living cells, GFP has also been fused to dCas9 [
31]. In this case, tens of sgRNAs are required to target the same locus such that individual loci show up as punctate dots, unless the target locus contains targetable tandem repeats. The fusion of dCas9 with other heterologous effector domains could enable many other applications. For example, one could fuse dCas9 with chromatin modifiers to change the epigenetic state of a locus. Other potential applications of the system have been previously reviewed extensively [
8,
9].
ASSESSING CAS9 TARGET SPECIFICITY
The original characterization of the Cas9/sgRNA system showed that not every position in the guide RNA needs to match the target DNA, suggesting the existence of off-target sites [
7]. Concerns about off-target effects depend on the purpose of the targeting. As discussed above and below, Cas9/sgRNA binding at a site does not necessarily lead to DNA cutting or mutation, and binding or cutting may not have any functional consequence either, especially when the off-target sites are outside of genes or regulatory elements. The off-target effects of Cas9 cutting/mutation have been studied extensively but sensitive and unbiased genome-wide characterization is still missing. Below we review existing approaches that have been or can be used to study Cas9 target specificity.
Assay of predicted off-targets
Typically a list of potential off-target sites is predicted based on sequence homology to the on-target, or using more sophisticated tools that incorporate various rules previously described in the literature (see section “Tools for target design and off-target prediction”). Two types of assays are commonly used to detect and quantify indels formed at those selected sites: the mismatch-detection nuclease assay and next generation sequencing (NGS). In the mismatch-detection nuclease assay, genomic DNA from cells treated with Cas9 and sgRNA is PCR amplified, denatured and rehybridized to form heteroduplex DNA, containing one wildtype strand and one strand with indels. Mismatches can be recognized and cleaved by mismatch detection nucleases, such as Surveyor nuclease [32] or T7 endonuclease I [33], enabling quantitation of the products by electrophoresis. It is challenging to use this assay to detect loci with less than 1% indels, and the assay is difficult to scale up. Alternatively, the PCR product can be sequenced directly using an NGS platform. The fraction of reads with indels is quantified after mapping to the genome or directly to the amplicon. When combined with proper controls and statistical models, NGS-based approaches are more accurate and sensitive than nuclease-based assays.
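As a rough illustration of the NGS readout described above, the fraction of reads with indels can be computed from the CIGAR strings of aligned amplicon reads. The sketch below is a simplification under stated assumptions: the function names and the toy CIGAR lists are hypothetical, any insertion (I) or deletion (D) anywhere in a read is counted, and the no-nuclease control is handled by simple subtraction rather than by a statistical model. A real analysis would typically restrict counting to a window around the expected cut site.

```python
import re

# CIGAR operation letters as defined in the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def has_indel(cigar):
    """True if a SAM CIGAR string contains an insertion (I) or deletion (D)."""
    return any(op in "ID" for _, op in CIGAR_OP.findall(cigar))


def indel_fraction(cigars):
    """Fraction of aligned reads carrying at least one indel."""
    cigars = list(cigars)
    if not cigars:
        raise ValueError("no reads supplied")
    return sum(has_indel(c) for c in cigars) / len(cigars)


# Hypothetical toy data: CIGAR strings from treated and control amplicons.
treated = ["100M", "48M3D49M", "50M2I48M", "100M"]
control = ["100M", "100M", "100M", "60M1D39M"]

# Crude background correction: subtract the control indel fraction.
excess = indel_fraction(treated) - indel_fraction(control)
print(indel_fraction(treated), indel_fraction(control), excess)
```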
Systematic mutagenesis
To characterize Cas9/sgRNA specificity, several groups performed systematic mutagenic analysis of the sgRNA or target DNA to evaluate the importance of the position, identity, and number of mismatches in the RNA/DNA duplex [
12,
34,
35]. These studies revealed a very complicated picture of Cas9 specificity [
36]. However, it is unclear whether the observed variation truly reflects specificity requirement, or is confounded by unintended changes caused by the mutations introduced in the sgRNA or target DNA. For example, mutations in the sgRNA could change the sgRNA abundance dramatically, which would alter the targeting efficiency [
37]. Also, mutations in DNA might create or disrupt binding sites for endogenous proteins that interfere with Cas9 binding. The number of variants evaluated is also limited in these studies. Finally, each study typically examines fewer than four target sites, leaving open the question of whether the observations can be generalized.
In vitro cleavage site selection
A more comprehensive way to study Cas9 cutting specificity is in vitro selection. In this assay a large pool of partially randomized targets is synthesized and cleaved by Cas9 or other nucleases in vitro [38–40]. The cleavage leaves a 5' phosphate group in the DNA, which can then be ligated to an adaptor and selectively amplified using PCR. The advantages of this approach are that the sequence space explored by the target library can be very large (10^12 molecules, even larger than all possible sites in any genome), and that target specificity can be evaluated independently of genome or species used and is not affected by chromatin structure that is usually cell-type specific. However, these advantages also impose potential limitations on this assay. Although the sequence space of the library can be huge, most substrates contain on average only 4-5 mismatches to the on-target [
38]. Given that efficient cleavage with 7 mismatches has been observed [
7], such an assay could still miss a significant fraction of genomic off-targets. For example, when the
in vitro cleavage site selection approach was applied to another type of nuclease, the Zinc Finger Nuclease (ZFN), only one of the four off-targets identified by an
in vivo assay was detected [
40–
42]. Alternatively, instead of using partially randomized synthetic DNA library, one could perform the same assay with genomic DNA to detect possible genomic off-targets.
It has also been reported that compared to in vivo conditions, Cas9 cutting is more promiscuous in vitro [43], i.e. off-targets are cleaved at much higher frequency in vitro than in vivo. This can be potentially explained by chromatin blockage of accessibility of the off-target sites in vivo [37, 44, 45]. Therefore a potential solution is to perform in vitro selection assay using native or fixed chromatin prepared from cells. However, the higher rate of off-target cutting could also be due to higher effective concentrations of Cas9/sgRNA used in vitro. A titration series of Cas9/sgRNA concentration is needed to assess the in vivo relevance of off-target sites identified by in vitro approaches.
DSB capture and sequencing
Cas9 and other DNA endonucleases typically induce DSBs, and several assays have been developed to capture DSBs induced in cells [
41,
46,
47], although none of them have been applied to the Cas9 system. Gabriel et al. transformed human cells with integrase-defective lentiviral vectors (IDLVs), which are incorporated into DSBs via NHEJ pathway, thus tagging those transient cutting events [
41]. This approach uncovered four
in vivo off-target cleavage sites for a ZFN targeting the CCR5 locus. In another
in situ assay called BLESS [
47], cells are fixed first and then chromatin is purified and ligated with biotinylated DNA linkers. Both approaches could in principle be applied to Cas9 treated cells to uncover genome-wide cutting sites. Compared to
in vitro cleavage site selection approaches, DSB capture approaches are physiologically more relevant, but can be less efficient since most DSBs exist very transiently, and the capture can be biased since both in vivo IDLV labeling and in situ linker ligation can be affected by local chromatin and sequence composition near the cutting site. Thus certain DSBs induced by the nuclease will not be tagged. For instance, of the 36 ZFN off-target sites identified by the in vitro selection approach, only one was identified by the IDLV-based DSB capture [42]. In addition, DSB capture approaches may identify a large number of false positive sites, since DSBs can be generated by endogenous cellular processes independent of Cas9 cutting, or during the library preparation process. Proper controls, such as cells treated with no Cas9 or no sgRNA, can be used to filter false positives.
Whole genome sequencing
Compared to assays described above, whole genome sequencing (WGS) would be a less biased assessment of off-target mutations caused by Cas9, although it will miss off-target sites that are bound without cutting, or are cut but then always perfectly repaired. In addition to small indels, WGS can also detect Cas9 induced structural changes, such as inversions [
16]. So far relatively high coverage (30-60X) of WGS has been performed in single clones of Cas9 treated cells in a variety of species, including worm [
48],
Arabidopsis [
49], rice [
50], and human pluripotent stem cells [
51,
52]. Interestingly, although a number of mutations were identified in Cas9 treated clones, none were found to be near sites with sequences similar to the target, indicating Cas9 induced off-target mutations are rare and it is possible to obtain clones without off-target mutations. However, due to the high cost, only a few clones have been sequenced for each target, which would miss most low-frequency off-target events. For example, if there was a single possible off-target site per genome mutated at a 40% frequency relative to the on-target site, this could have escaped detection in these experiments. However, if there were 10 possible off-target sites per genome mutated at a 40% frequency, then at least one of these sites should have been detected. Therefore, WGS is ideal for screening individual clones for off-targets, but at the moment, it is not practical for systematic study of a large number of guide RNAs to determine the rules governing Cas9 specificity.
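The reasoning behind this example can be made explicit with a short calculation. The sketch below rests on assumptions that are ours rather than taken from the cited studies: every sequenced clone carries the on-target mutation (so the relative frequency can be read as a per-clone probability), off-target sites are mutated independently of one another, and two clones are sequenced per target (the text says only "a few"). The function name and parameters are illustrative.

```python
def prob_all_missed(freq, n_sites, n_clones):
    """Probability that no off-target mutation is seen in any sequenced clone.

    Assumes each of n_sites off-target sites is mutated independently with
    probability freq in each of n_clones independently derived clones.
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must lie between 0 and 1")
    return (1.0 - freq) ** (n_sites * n_clones)


# Assumed number of sequenced clones per target (illustrative only).
clones = 2

# One off-target site at 40%: missed in both clones with probability 0.6**2.
print(prob_all_missed(0.4, 1, clones))

# Ten such sites: all missed only with probability 0.6**20.
print(prob_all_missed(0.4, 10, clones))
```

Under these assumptions a single 40% site is missed roughly one time in three, whereas missing all of ten such sites is very unlikely, which matches the qualitative argument above.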
Whole genome binding
Chromatin immunoprecipitation (ChIP) is a widely used technique for assaying genome-wide binding of proteins on DNA
in vivo [
53]. Briefly, live cells are crosslinked, lysed and chromatin fragmented and then immunoprecipitated to pull down DNA bound by a specific protein. The DNA is then purified and assayed by microarray or NGS. Compared to other readouts, such as indels that are downstream of the repair pathway, or gene expression changes, which are also affected by relative position of binding to the transcription start site, ChIP provides direct evidence for Cas9 binding on the genome. We and other groups recently generated the first maps of dCas9 binding on mammalian genomes [
37,
44,
45]; all three studies revealed a large number of binding sites, for example up to six thousand in mouse embryonic stem cells, as well as substantial variation (200 fold) in the number of off-target sites between sgRNAs [
37]. Specificity was not altered by fusion to an effector domain, as dCas9-KRAB had a similar binding profile to dCas9 alone [
45]. Surprisingly, two of these studies observed little cutting/mutation at most off-targets tested, while one study observed significant cleavage at 30 out of 57 selected off-target sites, albeit at a substantially lower rate than on-target cleavage [
44]. We further observed little to no off-target gene expression change, which would presumably result from strong dCas9 binding at many off-target sites (Wu et al., unpublished data). It is possible that most of the off-targets detected by ChIP are weak and transient interactions stabilized by crosslinking. Native ChIP without crosslinking may help to clarify this question. The other limitation of the ChIP approach is that it is inherently biased towards open chromatin and highly transcribed genes [
54]. There could be other biases that remain to be discovered. For example, we failed to detect binding at previously validated off-target sites using an NAG PAM [
37]. It is also unclear whether the two mutations introduced in dCas9 alter the target binding specificity as compared to wild type active Cas9.
Transcriptome profiling
For applications in transcription modulation, transcriptome profiling by either microarray or RNA-seq is the ultimate readout for assessing off-target effects. In all published cases [
25,
27,
55], no significant off-target gene expression changes were observed, which again is unexpected given the large number of off-target binding sites reported in ChIP-based studies, and that off-target binding is enriched in accessible active regulatory elements [
37]. It also remains to be seen whether marginally affected genes are enriched for off-target binding sites.
DETERMINANTS OF CAS9/sgRNA SPECIFICITY
Despite potential bias, the assays and studies described above revealed many factors that could affect Cas9/sgRNA targeting specificity (Figure 2), and these can be broadly classified into two categories. The first is the intrinsic specificity encoded in the Cas9 protein, which likely determines the relative importance of each position in the sgRNA for target recognition and may vary for different sgRNA sequences. The second is the relative abundance of effective Cas9/sgRNA complex with respect to effective target concentration. Below we discuss factors that could affect target specificity.
PAM
The protospacer-adjacent motif (PAM) is strictly required to be immediately next to the 3' end of the target sequence. The PAM is recognized by an individual domain in the Cas9 protein [
56], and the PAM sequence varies across bacterial species [57, 58]. Presumably Cas9 from species with longer PAMs, having fewer targetable sites in the genome, will have correspondingly fewer off-targets, although this has not been directly tested. For the widely used Cas9 from
Streptococcus pyogenes, the PAM is typically NGG, where the first position shows no nucleotide bias. Recent data suggested that PAM binding is required for both opening the DNA and target cleavage [
56,
59]. Both
in vitro [
38] and
in vivo [
29,
34,
60] cleavage data suggested that NAG is also tolerated to some extent, especially when Cas9/sgRNA is in excess to target DNA. In addition, other variants that contain at least one of the two G’s at position 2 and 3, i.e. NNG or NGN, could lead to some cleavage activity
in vitro under Cas9 excess conditions [
38]. Interestingly recent genome-wide ChIP-seq data revealed no significant Cas9 binding at NAG targets [
37,
44,
45], including previously validated off-target NAG cleavage sites, suggesting ChIP may not be able to detect off-target sites with certain PAMs.
Seed
In the original characterization of CRISPR-Cas9 [
7], mismatches in the first 7 positions (PAM-distal) of the guide RNA were well tolerated in terms of cleavage of a plasmid in vitro. Further studies in bacteria and mammalian cells showed that mismatches in the 10-12 base pairs in the PAM-proximal region usually lead to a decrease in, or even complete abolishment of, target cleavage activity. Another study reported that Cas9 can even cleave DNA sequences that contain insertions or deletions relative to the guide RNA; however, many of these sites could be alternatively aligned to contain only mismatches to the guide [
61]. Thus, the PAM-proximal 10-12 bases have been defined as the seed region for Cas9 cutting activity [
12,
62]. However, a relatively comprehensive
in vitro cleavage and selection approach revealed no clearly defined seed region for four guide RNAs, although the results confirmed that mismatches near the PAM region are less tolerated [
38]. In contrast, in two genome-wide binding datasets, one out of two [
45] and three of the four [
37] sgRNAs tested showed a clearly defined seed region consisting of only the first 5 nucleotides next to the PAM. A third genome-wide binding dataset detected no obvious seed for twelve sgRNAs tested, although PAM-proximal bases tended to be more preserved than PAM-distal bases in binding sites [
44]. However, the same data, when analyzed with our pipeline, revealed the 5-nucleotide seed region for three out of twelve sgRNAs (Wu et al., unpublished data); this is likely due to differences in selecting the best match to the guide region near binding sites, e.g. accepting matches with alternative PAMs. Hundreds of binding sites detected by ChIP
in vivo contain only a seed match, with mismatches at all the other 15 positions in the guide RNA [
37]. We also showed that seed-only sites could be bound by Cas9/sgRNA complex
in vitro using a gel shift assay. The variation in the length of the seed detected by different assays likely stems from different concentrations of factors and lengths of dwell times required for Cas9 binding and cleavage.
Cas9/sgRNA abundance
Cas9 cutting becomes less specific at higher effective concentrations of Cas9/sgRNA complexes. For example,
in vitro, when excessive amounts of Cas9/sgRNA complex are present, mismatches in the guide matching region are more tolerated, and Cas9 can even cut at sites with mismatches in the PAM region [
38]. Hsu et al. also showed that
in vivo the specificity (ratio of indel frequency at target vs off-target) increases when decreasing amounts of Cas9 and sgRNA plasmids are transfected into cells [
34]. Genome-wide, we have found that increasing Cas9 protein levels by 2.6 fold leads to a 2.6 fold increase in the number of off-target binding peaks in the genome. On the other hand, at a constant level of Cas9 protein, titrating the amount of sgRNA expression plasmid transfected, and thus the abundance of sgRNA, largely determines the number of off-target binding sites in the mouse genome [
37].
Target or guide sequence
In addition to targeting Cas9 to a certain region in the genome, the sequence of the sgRNA alone appears to affect specificity [
13,
34,
35,
38]. For example, the tolerance of mismatches at each position varies dramatically between different sgRNAs, an observation that remains to be understood.
Possible mechanisms whereby a change in sgRNA sequence could affect Cas9 specificity include: 1) Changes that alter the effective concentration of sgRNA (by modulating transcription of the sgRNA, the stability of the sgRNA, or sgRNA loading into Cas9). For example, we found that two mutations in the seed region can increase U6 promoter transcribed sgRNA’s abundance by at least 7 fold [
37]. 2) Changes that alter the number of seed-matching sites in the genome, which can vary by 100-fold. 3) Changes that depend on the local chromatin environment of the target DNA sequences (i.e. chromatin accessibility). 4) Changes that might cause off-target effects by blocking the binding of trans-acting factors that may potentially affect Cas9 binding or reporter gene transcription. 5) Changes that alter the thermodynamic stability of the guide RNA-DNA duplex. It is likely that the observed effects of sgRNA sequence on specificity are the result of multiple mechanisms described above. Below we will discuss some of these effects in detail.
Accessibility of seed match genomic sites
In cells DNA is packed in chromatin and may have limited accessibility for Cas9 PAM recognition and target binding. DNase I hypersensitivity (DHS) is typically considered to be an indicator of chromatin accessibility. We have shown that DHS is a strong predictor of whether a 5-nucleotide seed followed by NGG (seed+NGG) site will be bound
in vivo [
37], and others have also observed a strong correlation between Cas9-bound sites and open chromatin [
44,
45]. In fact, the number of seed+NGG sites in DHS peaks (accessible seed+NGG sites) accurately predicts the number of ChIP peaks detected
in vivo (R^2 = 0.92) [
37]. Interestingly, designed target sites not in DHS peaks show significant ChIP enrichment over background, in our case comparable to that of target sites in open chromatin, suggesting that chromatin accessibility is not a requirement for binding to the on-target site [
37,
44]. This is consistent with previous studies showing that dCas9-VP64 fusion protein could be targeted to non-open chromatin regions to activate target gene transcription [
55]. In sum, chromatin accessibility seems to preferentially facilitate off-target binding.
The preferential enrichment of off-targets in accessible chromatin has implications for dCas9-based transcriptome modulation. In fact, we found that regulatory elements of active genes, such as promoters and enhancers, are significantly enriched for off-target binding since those elements are accessible when active. To what extent these off-target binding events lead to gene expression change remains to be addressed.
Abundance of seed match genomic sites
Given that the binding seed length is relatively short (5 to 12), each guide RNA potentially has thousands to hundreds of thousands of seed match sites in the mammalian genome that are followed by NGG [
37]. However, due to mutational bias and other sequence bias in the genome, the occurrence of specific seed sequences could vary dramatically. For example, there are about 1 million AAGGA+NGG sites in the mouse genome, compared to fewer than 10,000 CGTCG+NGG sites. Therefore it is important to consider the abundance of seed sites when designing sgRNA targets for dCas9-based applications. We have shown that the number of accessible seed+NGG sites in the genome can very accurately predict the number of peaks detected by ChIP (R^2 = 0.92), although we only tested four guide RNAs [
37].
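Counting seed+NGG sites for a candidate guide is straightforward to sketch. The Python example below is a minimal illustration, not the pipeline used in the cited work: the function name and toy sequence are hypothetical, the sequence is assumed to be upper-case A/C/G/T held in memory, both strands are scanned, and chromatin accessibility is not considered. Restricting the count to DHS peaks, as discussed above, would require interval data that is outside the scope of this sketch.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_seed_ngg(genome, seed):
    """Count sites where `seed` is immediately followed by NGG, on both strands.

    Overlapping occurrences are counted separately (lookahead match).
    """
    pattern = re.compile("(?=%s[ACGT]GG)" % re.escape(seed))
    revcomp = genome.translate(COMPLEMENT)[::-1]
    plus = sum(1 for _ in pattern.finditer(genome))
    minus = sum(1 for _ in pattern.finditer(revcomp))
    return plus + minus


# Hypothetical toy sequence with one AAGGA+NGG site on each strand.
toy_genome = "TTAAGGATGGTTCCGTCCTTAA"
print(count_seed_ngg(toy_genome, "AAGGA"))
print(count_seed_ngg(toy_genome, "CGTCG"))
```

Applied to a real genome, the same count would be run per chromosome and summed; comparing the totals for candidate seeds gives a first indication of which guides are likely to have more seed-matching sites.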
Epigenetics
In addition to chromatin accessibility, we have also shown that for target sites with CpG dinucleotides, methylation status strongly correlates with ChIP signal [
37]. Specifically, more methylation is associated with less binding, a correlation even stronger than DHS for the same set of sites. Consistent with the observation that CpG methylation is typically associated with chromatin silencing, we observed a strong negative correlation between DHS and CpG methylation. However, the correlation between CpG methylation and Cas9 binding remained strong even after subtracting the effect of DHS. Previously Hsu et al. showed that
in vitro CpG methylation has no effect on Cas9 cutting of substrates with no mismatches to the guide RNA, and
in vivo, Cas9 could mutate a promoter that is highly methylated, albeit with low indel frequency [
34]. Taking this information together, we speculate that CpG methylation may represent chromatin accessibility not detected by DHS and like DHS, CpG methylation only affects binding at off-target sites. Similarly histone modifications may affect target site accessibility, although so far this has not been investigated.
Target sequence length
One might expect that if the guide region is longer than 20 nucleotides, a longer RNA-DNA duplex may be formed and thus the Cas9/sgRNA complex might have higher specificity. Ran et al. increased the length of the guide region to 30 nucleotides by extending the 5' end of the sgRNA. Interestingly Northern blots detected that the extended 5' end was trimmed
in vivo [
60], suggesting that Cas9 only protects about 20 nucleotides of the guide RNA and free sgRNA is largely unstable. On the other hand, it has been recently reported that when sgRNA is truncated to 17 or 18 nucleotides, the specificity increases dramatically [
63]. The mechanism underlying this increased specificity is unclear. One proposed explanation is that the first 2-3 nucleotides are not necessary for on-target binding but instead stabilize off-target binding [
63]. The other possibility is that shortened sgRNA may simply be less abundant or less efficiently loaded into Cas9.
sgRNA scaffold
In addition to the 5' end, various modifications have been introduced to the scaffold region of the guide RNA, although their impact on target specificity is not well studied. Extension or truncation at the 3' end can drastically change sgRNA expression levels [
34], likely due to change in transcription or RNA stability, which in principle could affect specificity by tuning the effective concentration of Cas9/sgRNA complexes. Modifications have also been introduced to stabilize the sgRNA by flipping an A-U base pair at the beginning of the scaffold [
31]. Increasing the length of a hairpin that is supposed to be bound by Cas9 also helps to increase the efficiencies for both imaging and transcription regulation, likely due to more efficient loading of sgRNA into Cas9. The effect of these modifications on the specificity of binding or cutting remains unclear, although it is reported that these modifications lead to higher signal to background ratio for imaging [
31].
STRATEGIES TO INCREASE SPECIFICITY
Controlling Cas9/sgRNA abundance and duration
Typically Cas9 and sgRNA are expressed in cells by transient transfection of expression plasmids. Titrating down the amount of plasmid DNA used in transfection increases specificity, although there is a trade-off in decreased efficiency at the on-target site. This is particularly an issue when the promoter is very strong, because successfully transfected cells then express large amounts of Cas9 and sgRNA, leading to off-target effects. More recently, sgRNA has been expressed by RNA Pol II transcription and processed from introns, microRNAs, ribozymes, and RNA-triplex-helix structures, providing more flexible control of the sgRNA abundance [
64,
65].
Alternative delivery methods have also been developed to increase specificity. Compared to plasmid transfection based delivery, direct delivery of recombinant Cas9 protein and
in vitro transcribed sgRNA, either individually or as purified complex, reduces off-targets in cells [
66,
67]. This is likely due to the rapid degradation of the protein and RNA in cells, which would lower the effective concentration of the Cas9-sgRNA effector complex and its duration in cells.
Paired nickase
The Cas9 “nickase” generated by mutating only one nuclease domain can only cleave one strand of the target DNA, which creates a nick thought to be repaired efficiently in cells. When the nickase is targeted to two neighboring regions on opposite strands, the offset double nicking creates a double-strand break with overhangs that are degraded, subsequently producing indels in the target region. The requirement of dual Cas9 targeting to a nearby region dramatically increases the specificity, since it is generally unlikely that two guide RNAs will also have nearby off-targets. The limitation of this strategy is that nicks induced by Cas9 could still lead to mutations in off-target sites via unknown mechanisms [
13,
29,
60,
63].
dCas9-FokI dimerization
FokI nuclease only cuts DNA when dimerized [
68]. Fusion of dCas9 to FokI monomers creates an RNA-guided nuclease that only cuts the DNA when two guide RNAs bind nearby regions with defined spacing and orientation, thus substantially reducing off-target cleavage [
69,
70]. It has been reported that RNA-guided FokI nuclease is at least four fold more specific than paired Cas9 nickase [
69], likely due to FokI nuclease only functioning when dimerized whereas Cas9 nickase can cleave as a monomer [
70]. Similar to paired nickases, the requirement of two nearby PAM sites with defined spacing and orientation reduces the frequency of target sites in the genome.
TOOLS FOR TARGET DESIGN AND OFF-TARGET PREDICTION
Several tools have been developed for designing sgRNA targets, with the primary consideration to avoid off-targets in the genome [
34,
71–
80]. These tools typically consider an input sequence, a genomic region, or a gene and output potential target/guide sequences with predicted minimized off-target effects. Many of the tools also provide predicted off-target sites for a given sgRNA. These tools vary in their scheme for scoring potential guides and off-targets. Some tools incorporate data from previous systematic mutagenic studies [
34] or user-input penalties [
72,
77] to individually score off-targets based on location and number of mismatches to the guide. Other tools have binary criteria for off-targets, such as sites with fewer than a certain number of mismatches to the entire guide region [
74,
79,
80], or to some defined PAM proximal or distal region [
71,
73,
75,
76,
78]. Potential guides are generally ranked by a weighted sum of off-target scores, or by number of off-targets.
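To make the general idea of position-dependent off-target scoring concrete, the following Python sketch scores each predicted off-target by penalizing mismatches according to their distance from the PAM and then ranks guides by their summed off-target scores (a weighted sum with uniform weights). This is a toy scheme, not the scoring function of any of the tools cited here: the function names, the 12-nucleotide seed length, and the penalty values are all illustrative assumptions, and sequences are written 5' to 3' with the PAM-adjacent base last.

```python
def mismatch_positions(guide, site):
    """Return distances from the PAM (1 = PAM-adjacent) of mismatched bases."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    n = len(guide)
    return [n - i for i, (g, s) in enumerate(zip(guide, site)) if g != s]


def off_target_score(guide, site, seed_len=12,
                     seed_penalty=1.0, distal_penalty=0.2):
    """Toy score in (0, 1]: 1.0 for a perfect match, lower with mismatches.

    Mismatches within seed_len bases of the PAM are penalized more heavily
    than PAM-distal ones (penalty values are arbitrary assumptions).
    """
    penalty = sum(seed_penalty if d <= seed_len else distal_penalty
                  for d in mismatch_positions(guide, site))
    return 1.0 / (1.0 + penalty)


def rank_guides(candidates):
    """Rank guides by summed off-target score, lowest predicted risk first.

    candidates maps each guide sequence to a list of predicted off-target
    site sequences of the same length.
    """
    totals = {g: sum(off_target_score(g, s) for s in sites)
              for g, sites in candidates.items()}
    return sorted(totals, key=totals.get)


# Hypothetical guides, each with one predicted single-mismatch off-target.
guide_a = "A" * 20  # off-target mismatch next to the PAM (poorly tolerated)
guide_b = "C" * 20  # off-target mismatch at the PAM-distal end (tolerated)
print(rank_guides({guide_a: ["A" * 19 + "T"], guide_b: ["G" + "C" * 19]}))
```

A scheme of this kind only captures sequence homology; as discussed elsewhere in this review, factors such as seed abundance, sgRNA abundance, and chromatin accessibility are not represented.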
Several tools consider factors beyond position and number of mismatches. Some tools [
77] include the option to score off-targets with alternate PAMs based on the finding that Cas9 cleaves these sites with lower efficiency [
29,
34,
38,
60]. In terms of the on-target site, various tools consider presence of SNPs and secondary structure [
71] in the potential guide, which could impact targeting and loading of the sgRNA [
81], genomic context of the guide (e.g. exons, transcripts, CpG islands), which could impact the intended purpose of the sgRNA [
72,
75], and GC content, which could impact effectiveness of the sgRNA [
72,
75,
78,
82].
Information from these tools is usually downloadable and sometimes viewable in an interactive format [
34,
75]. In addition, some tools provide support beyond finding potential guides, such as sequences of oligonucleotides for sgRNA construction [
78–
80] or primers for validation of cleavage at the target site [
75,
78]. Some tools also provide specialized modes for design of sgRNA with paired Cas9 nickases [
34,
72,
73,
78–
80] or RNA-guided FokI nucleases [
78–
80].
Each of these tools has its advantages and disadvantages. Researchers seeking to design CRISPR-Cas9 targets in less well-studied organisms or alternative species of Cas9 will need to use tools that accept user-input genomes [
71,
73,
74,
77,
78], are tailored for their organism [
76], or accept alternate PAMs [
73–
75] or user-input PAMs [
77]. The desired purpose of the CRISPR-Cas9 guide is also an important factor to consider. For example, some tools focus on designing sgRNAs to target genes with high efficacy [
75]. If off-target effects are more of a concern, it may be helpful to use a tool that scores predicted off-targets quantitatively [
34,
72,
77]. The type of off-targets detected by each tool also varies; most tools only search for off-targets with few (typically three or fewer) PAM-proximal or total mismatches to the guide [
71,
75,
76,
79,
80]. Considering what we have discussed in this review, especially for applications of dCas9, these tools may fail to detect many potential off-targets compared to tools that consider off-targets with more mismatches to the guide [
34,
72–
74,
77,
78]. Since almost every tool has unique features, it may be useful to incorporate multiple tools during the design process. We refer readers to the Supplementary Table for a more detailed comparison.
Overall, these tools could aid in designing sgRNA targets that have minimal sequence homology to other sites in the genome. However, many features that are important to sgRNA specificity remain to be implemented, such as the impact of seed sequence on sgRNA abundance, seed abundance in the genome, and epigenetic features. These factors, as we have discussed, are currently thought to primarily affect binding, and thus dCas9-based applications.
PERSPECTIVE
Despite intense study, the rules governing the specificity of Cas9/sgRNA targeting, especially target cutting and mutation, remain elusive. At this stage, it is still challenging to predict genome-wide off-targets of Cas9 with confidence. Although our genome-wide binding data set shows that the number of off-target peaks can be accurately predicted from the number of accessible seed+NGG sites, predicting binding at individual sites remains challenging [
37]. This suggests that there could be other factors, such as higher-level chromatin structure, that further limit binding of Cas9.
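Counting seed+NGG sites can be sketched as follows. This is a minimal illustration, not the analysis pipeline of the cited study: the default seed length of 5 PAM-proximal nucleotides is an assumption exposed as a parameter, and chromatin accessibility is not modeled, so in practice the input sequence would first be restricted to accessible regions.

```python
import re

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_seed_pam_sites(genome: str, guide: str, seed_len: int = 5) -> int:
    """Count occurrences of the guide's PAM-proximal seed followed by NGG,
    scanning both strands. seed_len is an assumed, adjustable parameter."""
    seed = guide.upper()[-seed_len:]
    # A lookahead lets overlapping sites be counted individually.
    pattern = re.compile("(?=" + re.escape(seed) + "[ACGT]GG)")
    forward = genome.upper()
    strands = (forward, reverse_complement(forward))
    return sum(len(pattern.findall(strand)) for strand in strands)
```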
In addition, the relationship between Cas9 binding and functional consequences such as cleavage, mutation and transcriptional perturbation remains unclear. Several lines of evidence suggest that most Cas9 off-target binding events may be transient and have little functional impact. First, in two separate studies, only one of 295 off-target binding sites [
37] or one out of 473 off-target binding sites [
45] tested showed evidence of mutations in cells expressing active Cas9 and corresponding sgRNAs. Second, transcriptome profiling revealed negligible off-target gene expression changes [
25,
27,
55]. Furthermore, a theoretical calculation implies an exponential decay in activity from Cas9 binding to downstream effects such as gene expression change [
29]. However, a direct comparison between genome-wide binding, cutting, and transcriptome change will be needed to support this claim.
The current rules of Cas9/sgRNA specificity are likely incomplete and biased. Most assays described here are biased: they may detect only a fraction of the off-target sites in cells while reporting many false positives. Integration of multiple assays will likely lead to more comprehensive and more accurate identification of off-targets. For example, intersecting ChIP-detected Cas9 binding sites with whole-genome sequencing data will likely identify authentic Cas9 target sites while removing Cas9-independent false positives, such as sequencing errors or ChIP bias.
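The intersection of binding sites with sequencing-detected mutations can be sketched as a simple interval lookup. The data layout is an assumption made for illustration: peaks are `(chromosome, start, end)` tuples with 0-based, half-open coordinates, and variants are `(chromosome, position)` tuples.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Tuple

Peak = Tuple[str, int, int]   # (chromosome, start, end), half-open
Variant = Tuple[str, int]     # (chromosome, 0-based position)


def merge_intervals(intervals: List[Tuple[int, int]]) -> List[List[int]]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def variants_in_peaks(peaks: List[Peak], variants: List[Variant]) -> List[Variant]:
    """Return the variants that fall inside at least one binding peak."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    merged = {c: merge_intervals(iv) for c, iv in by_chrom.items()}
    starts = {c: [s for s, _ in iv] for c, iv in merged.items()}

    hits = []
    for chrom, pos in variants:
        if chrom not in merged:
            continue
        # Index of the last merged peak starting at or before pos.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < merged[chrom][i][1]:
            hits.append((chrom, pos))
    return hits
```

Variants outside every peak are discarded as likely Cas9-independent, which corresponds to the filtering step described above.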
In addition to biased assays, the rules learned from each study are also likely biased by the small number of sgRNAs studied, given that target specificity depends strongly on the target sequence. Most of the assays described here, such as ChIP, in vitro selection, and whole-genome sequencing, are difficult to scale up. Further development of multiplexable unbiased assays, such as DSB capture with barcoded linkers, could facilitate the study of large numbers of sgRNAs at the same time.
The issue of off-targets is most critical when the Cas9 system is used to mutate specific genes. Here, off-targets could generate spurious phenotypes and mistaken interpretations. This is a particular concern when a large library of Cas9 vectors is screened under selective conditions for specific phenotypes. In this case, a rare off-target mutation could be selected and the phenotype attributed to the on-target gene. The only fully valid assay under these conditions is deep sequencing of the whole genome of the cloned mutant cell. However, this is far too expensive for most experiments and will only be done in particular cases. The principles summarized here about the specificity of the Cas9 system will hopefully lead to experimental designs that optimize the probability of obtaining desired on-target mutants in the absence of unknown off-target changes.
Lastly, alternative Cas9 proteins and guide RNA architectures may improve specificity. Several alternative Cas9 proteins from various bacteria have been studied and display very different PAM sequences [
83]. Comprehensive characterization of the specificity, such as genome-wide binding and cutting, may identify novel Cas9 proteins with dramatically improved specificity. With an available crystal structure [
56], it is also possible to design a new Cas9 protein with increased specificity via protein engineering and in vitro evolution.
Higher Education Press and Springer-Verlag Berlin Heidelberg