Introduction
The CRISPR-Cas (Clustered Regularly Interspaced Short Palindromic Repeats – CRISPR-associated) RNA-guided endonuclease is the most recent development in genome editing technology (
Esvelt et al., 2013,
Ran et al., 2013b). CRISPR-Cas editing technology borrows a strategy from the adaptive immune mechanisms that bacteria and archaea use to fight invading viruses and plasmids (
Koonin and Makarova, 2009,
2013;
Horvath and Barrangou, 2010;
Jinek et al., 2012;
Sampson et al., 2013;
Doudna and Charpentier, 2014). In brief, CRISPR stores DNA sequences from invading viruses or plasmids in a transcriptional array; when the same type of virus invades again, the system recognizes it using the transcribed RNA sequences and directs the Cas nuclease to make a double-stranded break (DSB). One type of nuclease, Cas9 from the bacterium
Streptococcus pyogenes (
S. pyogenes) cuts DNA at the exact location dictated by a single guide RNA (sgRNA) that can be programmed to target a genomic DNA sequence for editing (
Hsu et al., 2013). Once Cas9 makes a DSB, random insertions or deletions can be generated via the error-prone non-homologous end-joining (NHEJ) pathway, or a desired modification can be introduced by the homology-directed repair (HDR) pathway templated from exogenous DNA (
Wyman and Kanaar, 2006). Recently, efficient genome editing by the CRISPR-Cas9 system has been demonstrated in multiple organisms, including human, mouse, rat, zebrafish,
Drosophila and
C. elegans (
Cong et al., 2013;
Friedland et al., 2013;
Gratz et al., 2013;
Hou et al., 2013;
Hwang et al., 2013;
Jinek et al., 2013;
Li et al., 2013;
Mali et al., 2013b;
Yang et al., 2013). In contrast to previous genome-editing techniques, such as zinc-finger nucleases (ZFNs) (
Meng et al., 2008;
Gupta et al., 2011;
Chu et al., 2012;
Enuameh et al., 2013) and transcription activator-like effector nucleases (TALENs) (
Joung and Sander, 2013), the target specificity of CRISPR-Cas9 is primarily dictated by Watson-Crick pairing of a 20-base sequence at the 5′-end of the sgRNA with the target site instead of protein-DNA recognition, which makes it much easier to target multiple genes simultaneously. Compared with ZFNs and TALENs, CRISPR-Cas–mediated gene targeting has been shown to have similar or greater efficiency in human cells, zebrafish and the metazoan
Nematostella vectensis (
Ding et al., 2013;
Ikmi et al., 2014;
Smith et al., 2014). Recently, several laboratories have established CRISPR-Cas9 as a screening tool for systematic genetic analysis in mammalian cells, analogous to shRNA screens (
Shalem et al., 2014) (
Chen et al., 2015;
Koike-Yusa et al., 2014;
Wang et al., 2014). At least three companies, CRISPR Therapeutics, Intellia Therapeutics and Editas Medicine, have been founded to apply this technology therapeutically to correct genetic disorders and battle invading pathogens.
Figure 1 depicts the two components of the CRISPR-Cas9 system from
S. pyogenes and their recognition sites. The first component is the Cas9 nuclease from the bacterium, depicted as a purple oval, and the second component is a single guide RNA (sgRNA), which is derived from a fusion of the tracrRNA and crRNA found in bacteria (
Jinek et al., 2012). In the engineered form, the sgRNA has two parts. One is the constant region, colored in peach, which forms several stem-loop structures that serve as scaffolding for Cas9 binding. The other is a 20-base variable region (referred to as gRNA hereafter), colored in green, which can be altered to target different sequences. The target site recognized by this complex is also composed of two parts. One part, colored in blue, is complementary to the gRNA. The other part, colored in red, is called the protospacer adjacent motif (PAM) and is bound by Cas9. The PAM is a very short region (NGG for S. pyogenes, abbreviated Sp) adjacent to the 20 bases that are recognized by the gRNA. In summary, for the most commonly engineered CRISPR-Cas9 system, derived from Sp, the Cas9 nuclease binds to the NGG PAM sequence and then, if the 20-base gRNA base-pairs with the target DNA sequence, makes a DSB. The DNA is then repaired by NHEJ, or by HDR if a donor template is provided, leading to random indels or the desired modification of the targeted gene.
Overview of gRNA design tools
Finding target sites is generally quite easy by just scanning for the PAM sequence e.g., NGG for the CRISPR-Cas9 system from
S. pyogenes. The challenge is to design a predictive algorithm that identifies target sites that can be cleaved efficiently (efficacy) and whose cognate gRNAs cause little or no cleavage at other genomic locations (specificity). Ideal gRNAs therefore combine high efficacy with high specificity. To help researchers select the best gRNAs for their input sequences, it is essential to identify gRNAs and their potential off-targets and to accurately predict their relative cleavage rates. To facilitate gRNA design, many computational tools have been developed (
Hsu et al., 2013;
Ma et al., 2013;
Doench et al., 2014;
Heigwer et al., 2014;
Xiao et al., 2014;
Zhu et al., 2014;
Prykhozhij et al., 2015), and a few representative ones are summarized in Table 1.
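The PAM scan described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used by any of the tools in Table 1; the function names `reverse_complement` and `find_target_sites` are hypothetical, and the scan assumes the S. pyogenes configuration (20-base gRNA followed by an NGG PAM).

```python
import re


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def find_target_sites(seq, grna_len=20):
    """List every gRNA-length protospacer followed by an NGG PAM, on both strands.

    Hypothetical helper for illustration; 'start' is 0-based and, for the
    minus strand, refers to the reverse-complemented sequence.
    """
    seq = seq.upper()
    # The lookahead lets overlapping candidate sites be reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % grna_len)
    sites = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for match in pattern.finditer(strand_seq):
            sites.append({
                "strand": strand,
                "start": match.start(),
                "protospacer": match.group(1),
                "pam": match.group(2),
            })
    return sites
```

Calling `find_target_sites` on an input sequence returns every candidate site; ranking those candidates by efficacy and specificity is the harder problem discussed in the rest of this section.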
The Root laboratory assessed the rules governing gRNA efficacy by creating a pool of 1841 sgRNAs tiling across all possible target sites for a panel of six endogenous mouse and three endogenous human genes, and quantitatively assessing their ability to produce null alleles by antibody staining and flow cytometry (
Doench et al., 2014). The data from the 1841 sgRNAs were used to construct a model that predicts efficacy by fitting a logistic regression to sequence features of the expanded gRNA, which includes 4 bases upstream of the gRNA and 3 bases downstream of the PAM sequence. The predictive model includes 72 features found to contribute significantly to gRNA efficacy, including GC content and some single-nucleotide and dinucleotide features. For example, at position 20, C is highly disfavored and G is strongly favored. Based on this model, the Root laboratory provides an online tool (sgRNA Designer, the 6th tool in Table 1) for predicting gRNA efficacy to facilitate the design of highly active sgRNAs for any gene of interest. Recently, the Liu laboratory further refined the model by incorporating additional features gleaned from genome-wide sgRNA screens, such as a preference for cytosine at the cleavage site (
Xu et al., 2015). However, both tools output only an efficacy score, i.e., the cleavage likelihood of a given gRNA at its intended target, without considering potential off-target cleavage.
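To make the shape of such a model concrete, the sketch below scores an expanded 30-base sequence (4 upstream bases, the 20-base gRNA, the 3-base PAM and 3 downstream bases) with a logistic function. All numeric weights here are invented for illustration and are not the published coefficients; only the direction of the position-20 effect (G favored, C disfavored) is taken from the text, and the GC-content term is a simplifying assumption.

```python
import math

# HYPOTHETICAL weights for illustration only; NOT the published model coefficients.
INTERCEPT = 0.0
GC_WEIGHT = -0.2  # assumed penalty per base of GC count away from 10
POSITION_WEIGHTS = {
    (20, "G"): 1.0,   # direction from the text (G favored); magnitude made up
    (20, "C"): -1.0,  # direction from the text (C disfavored); magnitude made up
}


def efficacy_score(expanded):
    """Return a logistic score in (0, 1) for a 30-base expanded gRNA sequence."""
    if len(expanded) != 30:
        raise ValueError("expected 4 upstream + 20 gRNA + 3 PAM + 3 downstream bases")
    grna = expanded.upper()[4:24]
    gc_count = grna.count("G") + grna.count("C")
    z = INTERCEPT + GC_WEIGHT * abs(gc_count - 10)
    for (position, base), weight in POSITION_WEIGHTS.items():
        if grna[position - 1] == base:  # positions are 1-based within the gRNA
            z += weight
    return 1.0 / (1.0 + math.exp(-z))
```

A real model would carry one fitted weight per significant feature; the dictionary layout above simply shows how position-specific nucleotide features can enter the linear predictor.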
To effectively apply the CRISPR-Cas9 genome editing system, we need to select gRNAs that not only have high efficacy but also have low off-target cleavage, i.e., high specificity. Most tools have adopted a simple counting approach to predict off-target effects, listing all genomic sequences that contain 0-3 mismatches, or up to a user-defined maximum number of mismatches, to the gRNA (
Cradick et al., 2014;
Heigwer et al., 2014), and some provide a relative cleavage score for each potential off-target by dividing the target into seed and non-seed regions and penalizing all seed-region mismatches equally (
Bae et al., 2014;
Xiao et al., 2014).
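The counting approach can be illustrated with a brute-force scan. This is a minimal sketch under stated assumptions: the function name `find_off_targets` is hypothetical, only the forward strand is scanned, the PAM is fixed to NGG, and no indexing is used, so it is suitable only for short sequences rather than whole genomes.

```python
def find_off_targets(grna, genome, max_mismatch=3):
    """List forward-strand sites with an NGG PAM and at most max_mismatch mismatches.

    Returns (start, site, mismatch_positions) tuples; start is 0-based and
    mismatch positions are 1-based from the PAM-distal end of the gRNA.
    """
    grna, genome = grna.upper(), genome.upper()
    n = len(grna)
    hits = []
    for i in range(len(genome) - n - 2):
        site, pam = genome[i:i + n], genome[i + n:i + n + 3]
        if pam[1:] != "GG":
            continue  # no NGG PAM directly after this window
        mismatches = [p + 1 for p in range(n) if grna[p] != site[p]]
        if len(mismatches) <= max_mismatch:
            hits.append((i, site, mismatches))
    return hits
```

The returned mismatch positions are what a position-aware scoring scheme would consume; a pure counting tool would report only their number.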
The Zhang laboratory studied the effect of the number and positions of gRNA mismatches on the cleavage rate by testing >700 gRNA variants for 15 target sequences in a human cell line (
Hsu et al., 2013). Briefly, cells were transfected with gRNA variants containing all possible single-nucleotide mismatches and a subset of multiple mismatches, and the lesion rates were compared to those of the cognate gRNAs by deep sequencing PCR products spanning each target site. It turns out that not only the number of mismatches but also their positions affect the activity of the gRNA. Table 2 contains the penalty weights (0-1) that capture the position-dependent mismatch effect on cleavage, where 0 means no mismatch effect and 1 indicates the largest effect on cleavage. For example, mismatches at positions 1 (most distal to the PAM) to 5 have almost no effect on cleavage activity, while mismatches at positions 13 to 20 have a large influence. The Zhang laboratory developed a position-specific penalty matrix from these experimental data and used it in a web application that evaluates gRNAs based on an aggregated off-target score calculated from the top 100 off-target cleavage scores within the genome (the second tool in Table 1). However, this web application does not provide gRNA efficacy prediction. To date, CRISPRseek is the only tool that performs both efficacy and specificity prediction (the first tool in Table 1) (
Zhu et al., 2014).
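A position-dependent penalty can be applied as in the following sketch. The weights are placeholders chosen only to mimic the trend described above (negligible at positions 1-5, large at positions 13-20); they are not the values in Table 2. The multiplicative form and the plain sum used for aggregation are simplifying assumptions, not the published scoring formula.

```python
# ILLUSTRATIVE weights only (positions 1-5, 6-12, 13-20); NOT the values in Table 2.
WEIGHTS = [0.0] * 5 + [0.3] * 7 + [0.7] * 8


def cleavage_score(grna, site, weights=WEIGHTS):
    """Relative cleavage score in [0, 1]; 1.0 means no penalized mismatch."""
    if not (len(grna) == len(site) == len(weights)):
        raise ValueError("gRNA, site and weight vector must have equal length")
    score = 1.0
    for g, s, w in zip(grna.upper(), site.upper(), weights):
        if g != s:
            score *= 1.0 - w  # each mismatch scales the score by its position penalty
    return score


def aggregate_off_target(scores, top_n=100):
    """Sum the top_n largest off-target scores (assumed aggregation rule)."""
    return sum(sorted(scores, reverse=True)[:top_n])
```

Under these placeholder weights, a single PAM-distal mismatch leaves the score at 1.0, whereas a single mismatch at position 20 reduces it to about 0.3, reflecting the positional trend reported by the Zhang laboratory.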
Other considerations of gRNA design
CRISPR-Cas9 technology evolves rapidly with the characterization of new CRISPR-Cas systems from different species, which will likely have different PAM sequence preferences and different gRNA lengths. For example, Cas9 from
Neisseria meningitidis (
N. meningitidis) has a different PAM preference of NNNNGATT instead of NGG for
S. pyogenes (
Hou et al., 2013). As new off-target analysis data become available, more informative and accurate penalty matrices and scoring systems will be generated (
Tsai et al., 2015). Strategies have been developed to reduce off-target cleavage, such as using paired Cas9 nickases (
Mali et al., 2013a;
Ran et al., 2013a;
Cho et al., 2014). RNA-guided Cas9 nickases function as a pair: they bind neighboring genomic sequences with flexible spacing but a defined orientation and generate two single-stranded breaks that together form a DSB. The requirement for two nickases to create a DSB increases specificity, since the likelihood of a pair of nickases binding neighboring off-target sites is low. Another paired configuration uses dimeric RNA-guided dCas9-FokI nucleases (RFNs) (
Tsai et al., 2014), which function similarly to RNA-guided nickases but have more restricted spacing requirements. More novel configurations that increase cleavage specificity, as paired nickases and FokI dimerization do, are likely to emerge. Furthermore, there are different methods for the synthesis and delivery of nucleases to cells, and each method might impose different constraints on the gRNAs. For example, synthesis of gRNAs
in vivo from host U6 promoters is more efficient if the first base is guanine, and gRNA synthesis in vitro using T7 promoters is most efficient when the first two bases are GG.
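The promoter constraints mentioned above translate into a simple prefix check. The sketch below is illustrative; the function name `promoter_compatible` and the dictionary of required prefixes are hypothetical, and the rule encodes only the two preferences stated in the text.

```python
# 5' prefixes preferred by each promoter, as stated in the text.
REQUIRED_PREFIX = {"U6": "G", "T7": "GG"}


def promoter_compatible(grna, promoter):
    """Return True if the gRNA starts with the prefix preferred by the promoter."""
    if promoter not in REQUIRED_PREFIX:
        raise ValueError("unknown promoter: %s" % promoter)
    return grna.upper().startswith(REQUIRED_PREFIX[promoter])
```

Such a check could be used to filter candidate gRNAs before scoring.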
Once mutations are introduced, methods are needed to screen the resulting cells or animals for sequence alterations at the target sites. One of the simplest and least expensive methods is restriction enzyme digestion, as shown in Fig. 2. In this example, the target site contains the recognition site of the restriction enzyme PstI, colored in red, which overlaps with the Cas9 cleavage site, shown as a green arrow. After PCR amplification of the target locus, if there is no mutation in the target sequence, the PstI site stays intact: the PstI-treated DNA produces two bands, while the untreated DNA produces one. However, if there is a modification, such as the insertion of a single A, the PstI site is disrupted and PstI can no longer recognize and cut it, so the PstI-treated sample produces only one band, just like the untreated sample. It is therefore useful to identify restriction enzymes whose recognition sites overlap with the Cas9 cleavage site, so that users can filter out gRNAs without overlapping restriction enzyme sites (RES).
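The overlap test between a recognition site and the Cas9 cleavage site can be sketched as follows. Two details are assumptions that do not come from the text: the cut is placed 3 bases upstream of the PAM, as is generally reported for S. pyogenes Cas9, and a site counts as overlapping only if the cut falls strictly inside it. The function name and the one-entry enzyme table are hypothetical, and only the given strand is checked, which is sufficient for palindromic sites such as that of PstI (CTGCAG).

```python
# Minimal enzyme table for illustration; PstI recognizes the palindrome CTGCAG.
ENZYMES = {"PstI": "CTGCAG"}


def enzymes_overlapping_cut(seq, protospacer_start, grna_len=20,
                            cut_offset=3, enzymes=ENZYMES):
    """Return names of enzymes whose recognition site spans the Cas9 cut.

    protospacer_start is the 0-based index of the first gRNA-matching base;
    the cut is assumed to lie cut_offset bases upstream of the PAM.
    """
    seq = seq.upper()
    cut = protospacer_start + grna_len - cut_offset  # index of first base after the cut
    found = []
    for name, site in enzymes.items():
        start = seq.find(site)
        while start != -1:
            if start < cut < start + len(site):
                found.append(name)
                break
            start = seq.find(site, start + 1)
    return found
```

A gRNA for which this function returns a non-empty list can be screened by digestion as in Fig. 2; for non-palindromic recognition sites, the reverse complement would also need to be searched.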
In addition, there is a need to design gRNAs that discriminate between or jointly target closely related sequences, such as targeting one allele but not the other, or targeting both equally well (
Zhu et al., 2014). Recently, a specialized web tool CRISPR MultiTargeter was developed to find common and unique CRISPR single guide RNA targets in a set of similar sequences (
Prykhozhij et al., 2015), a functionality that is also implemented as the
compare2Sequences function in the CRISPRseek package (
Zhu et al., 2014).
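The idea of comparing closely related sequences can be sketched as follows. This is not the algorithm of CRISPR MultiTargeter or of the compare2Sequences function; it is a simplified illustration with hypothetical function names that scans only the forward strand for NGG sites and reports, for each gRNA found in the first sequence, the smallest number of mismatches to any candidate site in the second.

```python
import re


def protospacers(seq, grna_len=20):
    """Forward-strand protospacers that are followed by an NGG PAM."""
    pattern = re.compile(r"(?=([ACGT]{%d})[ACGT]GG)" % grna_len)
    return [match.group(1) for match in pattern.finditer(seq.upper())]


def compare_two_sequences(seq1, seq2, grna_len=20):
    """For each gRNA in seq1, report its minimum mismatch count against seq2.

    0 means the gRNA also matches seq2 perfectly, larger values suggest
    sequence-specific targeting, and None means seq2 has no candidate site.
    """
    sites2 = protospacers(seq2, grna_len)
    report = []
    for grna in protospacers(seq1, grna_len):
        counts = [sum(a != b for a, b in zip(grna, site)) for site in sites2]
        report.append((grna, min(counts) if counts else None))
    return report
```

In practice the mismatch count would be replaced by a position-weighted cleavage score, since a single PAM-distal mismatch may not be enough to discriminate two alleles.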
As described in Fig. 1, the variable region of the sgRNA (gRNA) base-pairs with the target sequence, and the constant region of the sgRNA forms several stem-loop structures that serve as scaffolding. Therefore, it is important to avoid gRNAs that disrupt the secondary structure of the constant region of the sgRNA, e.g., GUUUUAGAGCUAGAAAUAGCAAGUUAAAAUAAGGCUAGUCCGUUAUCAACUUGAAAAAGUGGCACCGAGUCGGUGCUUUUUU, and hence to predict the secondary structure of the whole sgRNA. To date, there are two tools that output the secondary structure of the concatenated sgRNA sequence (
Ma et al., 2013;
Zhu et al., 2014).
Overview of CRISPRseek functionalities
CRISPRseek was developed with the above considerations in mind to be versatile, flexible and adaptive to rapidly changing needs. Besides efficacy and off-target prediction,
CRISPRseek provides the flexibility to incorporate alternative paired configurations and other Cas9 types, to plug in alternative penalty matrices and scoring systems for efficacy and off-target score prediction from newly published or unpublished sources, and to require or exclude specific features within the target site. Additional features include RES annotation, secondary structure prediction and comparison of two sets of sequences. To make the package easy to use, all the above functions have been wrapped into two main workflow functions in
CRISPRseek. One is
offTargetAnalysis, a workflow for gRNA (paired or unpaired) search and off-target analysis for one or a set of input sequences (Fig. 3). Several report files are generated, including gRNAs in different formats, i.e., fasta format, GenBank format and bed format for visualization in the UCSC genome browser (Fig. 4); a tab-delimited file containing gRNAs that overlap with restriction sites; a tab-delimited file containing gRNAs in paired configuration; a tab-delimited file containing detailed off-target information such as genomic location, whether the site is inside an exon, mismatch positions, sequence and cleavage score (OfftargetAnalysis.xls); and a tab-delimited file containing a summary of the gRNAs such as efficacy, RES annotation and the top 5 (or a user-specified number) off-target cleavage scores (Summary.xls). If the RNA secondary structure prediction software
ViennaRNA (
Lorenz et al., 2011) and
GeneRfold are installed, the minimum free energy and bracket notation of the sgRNA secondary structure are also generated and included in the summary file.
ViennaRNA and
GeneRfold are available at http://www.tbi.univie.ac.at/RNA/index.html#download and http://www.bioconductor.org/packages/2.9/bioc/html/GeneRfold.html.
There are 44 parameters in
offTargetAnalysis for creating a customized search. To make the function easy to use, all parameter defaults are set for the widely used CRISPR-Cas9 system from
S. pyogenes, composed of a 20-base gRNA sequence and a 3-base preferred PAM sequence (NGG). With the default settings, you only need to provide the input sequence file path and the genome in which to search for off-targets. The gRNA efficacy and off-target cleavage score calculations are based on the models from the Root laboratory (
Doench et al., 2014) and the Zhang laboratory respectively (
Hsu et al., 2013). Alternative efficacy scoring matrices and off-target mismatch weight matrices can be plugged in as more data and more accurate prediction algorithms become available. To identify guide sequences for CRISPR-Cas9 systems from other species that utilize different PAM/gRNA lengths (
Hou et al., 2013) or for truncated gRNAs (
Fu et al., 2014), which may provide greater specificity, simply adjust the parameters
gRNA.size, PAM, PAM.size, weights, PAM.pattern and
allowed.mismatch.PAM accordingly. There is evidence that even though the preferred site is NGG for SpCas9, there is some reduced activity at sites with NAG (
Hsu et al., 2013). Therefore, it is recommended to scan for NGG to identify target sites and include both NGG and NAG for off-target search. Parameter
PAM specifies PAM preference for gRNA search while
PAM.pattern specifies the degenerate PAM for off-target search.
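The distinction between a strict PAM for gRNA search and a degenerate PAM for off-target search can be illustrated with IUPAC codes translated into regular expressions. This sketch is independent of CRISPRseek: the function name `pam_to_regex` is hypothetical, and the string format shown here is not necessarily the one that the PAM and PAM.pattern parameters expect.

```python
import re

# Standard IUPAC nucleotide codes expanded into regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def pam_to_regex(pam):
    """Compile an IUPAC-coded PAM such as 'NGG' into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in pam.upper()))


search_pam = pam_to_regex("NGG")     # strict PAM used to pick gRNAs
offtarget_pam = pam_to_regex("NRG")  # degenerate PAM (NGG or NAG) for off-target search
```

The same helper handles longer motifs, for example `pam_to_regex("NNNNGATT")` for the N. meningitidis PAM mentioned earlier.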
The other workflow function is compare2Sequences, which identifies gRNAs that specifically target one of the two sets of input sequences or both (Fig. 5). Its parameters are almost the same as those of the offTargetAnalysis workflow function. With the default settings, all it needs is the file paths of the two sequences or sequence sets. compare2Sequences first identifies gRNAs that target one of the input sequences, using the same parameters available for the offTargetAnalysis function. Next, for each gRNA, off-target search and scoring are performed against the other sequence(s). Please note that once you have identified gRNAs that fit your needs, you still need to run the other workflow function, offTargetAnalysis, to perform genome-wide off-target analysis on the chosen gRNAs, to ensure that the ones you selected not only target one or all input sequences but also rarely cut elsewhere in the genome. For detailed information on parameter settings and example use cases, please refer to the reference manual and user guide at http://www.bioconductor.org/packages/release/bioc/manuals/CRISPRseek/man/CRISPRseek.pdf and http://www.bioconductor.org/packages/release/bioc/vignettes/CRISPRseek/inst/doc/CRISPRseek.pdf. The ability to easily alter all parameters in both workflow functions is key to adapting to a rapidly advancing field.
Future directions
There are hurdles to overcome before CRISPR-Cas9 genome editing technology can be successfully applied for therapeutic uses. Computationally, there is a need to develop more precise models for predicting gRNA efficacy and off-target cleavage rates. Although additional features have been discovered to improve the gRNA efficacy prediction (
Xu et al., 2015), ~40% of inefficient sgRNAs are not predictable with the improved sequence model, probably due to the small size of the training and testing data sets and/or determinants not included in the model, such as chromatin structure and sgRNA secondary structure. Recently, Cradick and colleagues developed a web application for searching off-targets allowing indels for the human, mouse,
Caenorhabditis elegans, and rhesus macaque genomes but without off-target cleavage score prediction (
Cradick et al., 2014). It is unclear how bulges formed in the gRNA or protospacer due to indels affect off-target cleavage. In addition, experiments suggest that not only mismatch positions but also mismatch types, e.g., A->T, A->G and A->C, affect off-target cleavage (
Hsu et al., 2013). With the development of GUIDE-seq and the expansion of CRISPR experimental data sets (
Tsai et al., 2015), more comprehensive and accurate predictive models are expected to be developed, which can be easily plugged into CRISPRseek to improve gRNA design.
Compliance with ethics guidelines
Lihua Julie Zhu declares no conflict of interest. This article does not contain any studies with human or animal subjects performed by the author.
Higher Education Press and Springer-Verlag Berlin Heidelberg