INTRODUCTION
The normal human genome is diploid, composed of two sets of 23 chromosomes: one set is inherited from the mother and the other from the father. The combination of alleles at multiple loci along a single chromosome is referred to as a haplotype. Haplotype information is essential for explaining the relationships between genotypes and phenotypes [1–3], mapping disease genes thoroughly [4] and describing genetic ancestry completely [5].
Although the diploid nature of the genome has been recognized for over 50 years [6–8], phasing a diploid genome is still a laborious task. To date, karyotyping remains the gold standard in clinical laboratories. DNA microarrays and chromosomal fluorescence in situ hybridization (FISH) provide additional but still limited haplotype information [9, 10]. DNA sequencing, which reads the nucleotide sequence base by base, is a direct and efficient haplotyping technology. In fact, the first two assembled human genomes generated by the Human Genome Project contained extensive haplotype information [11, 12], obtained by constructing 50–200 kb bacterial artificial chromosomes (BACs). Although mate-pair libraries may help, sequencing an individual diploid genome by Sanger dideoxy technology remains a gigantic project [13].
With the advent of next generation sequencing (NGS) in 2005, the cost of DNA sequencing has fallen more than 100,000-fold, while its speed has increased greatly [14–18]. However, the short read length is a challenge for haplotype analysis, because reads shorter than 150 bp span no more than one variant in most cases. Paired-end libraries extend the linkage obtained to 250–500 bp. The more complex mate-pair libraries, which require an in vitro circularization step, provide linkage in blocks with a maximum length of 3.5 kb [19].
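The read-length limitation can be made concrete with a rough calculation. The sketch below assumes that heterozygous variants fall along the genome as a Poisson process with a density of about one per 1,000 bp. That density and the function name are illustrative assumptions, not values taken from the studies cited above.

```python
import math

# Assumed density of heterozygous variants (illustrative, not from the text).
HET_VARIANTS_PER_BP = 1 / 1000


def prob_links_two_variants(span_bp, density=HET_VARIANTS_PER_BP):
    """Probability that a span of span_bp covers at least two heterozygous
    variants, modelling variant positions as a Poisson process."""
    lam = span_bp * density  # expected number of variants in the span
    # P(N >= 2) = 1 - P(N = 0) - P(N = 1) for N ~ Poisson(lam)
    return 1.0 - math.exp(-lam) * (1.0 + lam)


# Spans roughly matching single reads, paired-end and mate-pair libraries.
for span in (150, 500, 3500):
    p = prob_links_two_variants(span)
    print(f"{span:>5} bp span: P(>=2 variants) = {p:.3f}")
```

Under these assumptions only about 1% of 150 bp reads would link two variants, which is consistent with the observation that such reads rarely span more than one.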
In the past decade, numerous experimental technologies have been developed for whole genome haplotyping based on NGS. The distance, complexity and accuracy of the linkages generated are among the main factors for evaluating the efficiency of whole genome haplotyping methods. Here, we review these experimental technologies and evaluate them in terms of the linkages they obtain and the complexity of their experimental systems. Some statistical methods can resolve haplotypes independently by population analysis [20–22]. However, most computational methods are designed to optimize the haplotype-resolving efficiency of a particular experimental strategy. In this paper, therefore, we focus on the experimental technologies. They are organized into four categories based on the following strategies: (i) chromosome separation, (ii) dilution pools, (iii) crosslinking and proximity ligation, and (iv) long-read technologies. Within each category, subsections classify the individual technologies. Long-read technologies are not an independent strategy but technical innovations in sequencing itself. We treat them as a separate section because of their potential to improve the performance of all the other experimental technologies.
CHROMOSOME SEPARATION
Human gametes naturally contain a single haploid set of chromosomes. For other cells, physically separating homologous chromosomes before sequencing library construction is a direct way to obtain long-distance, complex and pure linkages. This artificial separation is limited by its need for intact mitotic cells, complex experimental pipelines and specialized devices.
Human gametes
The human gamete, which packages a natural haploid set of chromosomes, is an ideal sample for haplotype studies. Recently, several groups have performed whole genome sequencing and haplotyping of individual sperm [23–25]. Because a single sperm contains very little DNA, its nucleic acids were amplified by multiple displacement amplification (MDA) before sequencing library construction. This isothermal amplification uses random hexamer primers and phi29 DNA polymerase to amplify DNA in a cascading, strand-displacement reaction [26]. After 4–6 h of amplification at 30°C, the reaction yields amplified DNA across a 5-log range of starting material (100 fg–10 ng), with products exceeding 10 kb in length. Hou et al. [27] performed genome-wide haplotyping of a human oocyte by analyzing its polar bodies, using multiple annealing and looping based amplification cycles (MALBAC) to achieve highly uniform amplification across the genome. To some extent, studies of human gametes offer an alternative to the artificial separation of chromosomes. However, the most obvious shortcoming is that almost all other samples and cells are not haploid.
Laser capture microdissection
Ma et al. [28] determined haplotypes through chromosome microdissection. A subset of the chromosomes from one cell was collected by computer-directed laser microdissection (Figure 1). The collection may contain only one copy of some chromosomes and no copy or both copies of others. The haplotypes of the monosomic chromosomes were revealed by conventional genotyping after MDA. Although Ma et al. [28] used microarrays for genotyping, the MDA products of microdissection harvests are also suitable for NGS. However, microdissection depends on where the chromosomes happen to lie, and which chromosomes are collected as a single copy is random. A large and unpredictable number of microdissections and collections is therefore required to analyze every chromosome of both sets.
Fluorescence-activated sorting
Fluorescence-activated cell sorting (FACS), an efficient cell sorting technology, was used by Yang et al. [29] to place individual chromosomes into the wells of a 96-well plate. In their study, chromomycin A3 (which binds guanine-cytosine-rich regions) and Hoechst 33258 (which binds adenine-thymine-rich regions) were used for staining. Each chromosome was identified by its distinct bivariate distribution of fluorescent signals. MDA and NGS were then performed for haplotype analysis. Additional molecular typing was required to distinguish chromosomes with similar bivariate distribution patterns [29].
Microfluidic devices
Microfluidics was developed to meet the need for more efficient analysis of cells and biomolecules. Fan et al. [30] developed a microfluidic device to separate and amplify the homologous copies of each chromosome from a single human metaphase cell. The amplification products of each chromosome, or of small pools of chromosomes, were genotyped by microarrays. The microfluidic device is compact and precise, with only two chromosomes not collected. Manual identification of metaphase cells and manufacture of the microfluidic devices are the labor-intensive steps.
DILUTION POOLS
The dilution pools strategy, a classic method for linkage mapping, was first conceived over 20 years ago [31]. Long, intact genomic DNA fragments are compartmentalized into pools after limiting dilution, so that each pool contains a sub-haploid amount of DNA. Some genomic regions are represented once within a pool, while the rest are not represented (Figure 2). Collectively, the pools cover the entire genome a sufficient number of times. Because no comprehensive whole genome amplification method was available for microscale DNA, the dilution pools strategy was first applied to systematic haplotyping by means of fosmid cloning [13, 32], with genotyping by microarray or Sanger dideoxy sequencing. MDA [26] provides uniform representation across the genome from microscale DNA and makes cloning no longer indispensable to the dilution pools strategy (Table 1). Dilution pools strategies perform haplotype analysis without physically separating homologous chromosomes, and their system complexity is lower than that of chromosome separation strategies. However, the distance and purity of the linkages are not as good as those obtained by chromosome separation.
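The principle behind dilution pools can be illustrated with a toy phasing routine. It assumes that each fragment recovered from a pool has already been reduced to the alleles (0 for reference, 1 for alternate) it shows at heterozygous SNPs, and that the data are error-free. The function name and input format are hypothetical, and this greedy union-find sketch is not the pipeline used by any of the studies reviewed here.

```python
def phase_fragments(fragments):
    """Greedy phasing of haploid fragments.

    fragments: list of dicts mapping SNP position -> allele (0 or 1),
    one dict per fragment, each assumed to come from a single haplotype.
    Returns a dict: SNP position -> (block id, phase), where SNPs with the
    same block id and the same phase carry their alternate alleles on the
    same haplotype.
    """
    parent = {}  # union-find parent pointer for each SNP
    offset = {}  # phase of each SNP relative to its parent (0 or 1)

    def find(snp):
        # Return (root, phase of snp relative to root), compressing the path.
        if parent[snp] == snp:
            return snp, 0
        root, parent_phase = find(parent[snp])
        parent[snp] = root
        offset[snp] ^= parent_phase
        return root, offset[snp]

    for fragment in fragments:
        snps = sorted(fragment)
        for snp in snps:
            if snp not in parent:
                parent[snp] = snp
                offset[snp] = 0
        # Adjacent SNPs on one fragment are linked: equal alleles mean the
        # same phase, different alleles mean opposite phase.
        for left, right in zip(snps, snps[1:]):
            relation = fragment[left] ^ fragment[right]
            root_l, phase_l = find(left)
            root_r, phase_r = find(right)
            if root_l != root_r:
                parent[root_r] = root_l
                offset[root_r] = phase_l ^ phase_r ^ relation
            # Conflicting evidence within a block is ignored in this sketch.

    return {snp: find(snp) for snp in parent}


# Three hypothetical fragments: the first two overlap at SNP 250 and merge
# into one block; the third forms a separate block.
blocks = phase_fragments([{101: 0, 250: 1}, {250: 0, 400: 0}, {9000: 1, 9100: 1}])
print(blocks)
```

Overlapping fragments extend a block, which is why longer input fragments and more pools lead to longer haplotype blocks.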
Fosmid clone pools
In the fosmid clone method, microscale DNA is amplified by cloning for NGS. Pure, intact genomic DNA fragments of about 35 kb are separated into pools for haplotype analysis. Because the cloning pipeline is simpler than that for BAC clones, several groups have used fosmid clones to implement the dilution pools strategy [33–38]. Kitzman et al. [33] constructed a single, complex fosmid library in 2011. Each pool captured ~5,000 fosmids with ~37 kb inserts (~3% of the 6 Gb diploid genome). A different barcode was applied to each of the 115 pools to construct barcoded libraries. After haplotype analysis, half of the resolved sequence lay within blocks of at least 350 kb (N50 of 350 kb).
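Because the block N50 is the summary statistic quoted throughout this review, a minimal sketch of how it can be computed from a list of haplotype block lengths may be helpful. The function name and the example lengths are hypothetical.

```python
def n50(block_lengths):
    """Return the N50 of a collection of block lengths: the length L such
    that blocks of length >= L contain at least half of the total length."""
    total = sum(block_lengths)
    running = 0
    for length in sorted(block_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("block_lengths must contain at least one positive length")


# Hypothetical block lengths in kb.
example_blocks_kb = [900, 350, 350, 120, 80, 40, 10]
print(n50(example_blocks_kb))
```

An N50 of 350 kb therefore means that half of the phased sequence lies in blocks of 350 kb or longer, not that the typical block is 350 kb long.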
BAC clone pools
Compared with fosmid clones, BAC clones have longer inserts of about 140 kb. Longer fragments are crucial for haplotype phasing and reduce the number of pools required. In the work of Lo et al. [39], only 24 pools (5,000 clones per pool) were captured to construct indexed libraries. The N50 of the assembled haplotype blocks was greater than 2.6 Mb.
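The effect of insert length on the number of pools can be checked with simple arithmetic on the figures quoted for the fosmid and BAC designs. The sketch below assumes that every clone carries an insert of the nominal length and ignores overlap between clones; the function name is hypothetical.

```python
DIPLOID_GENOME_BP = 6_000_000_000  # 6 Gb diploid genome, as quoted in the text


def pool_design(n_pools, clones_per_pool, insert_bp, genome_bp=DIPLOID_GENOME_BP):
    """Return (fraction of the diploid genome held in one pool,
    fold physical coverage of the diploid genome summed over all pools)."""
    per_pool = clones_per_pool * insert_bp / genome_bp
    return per_pool, per_pool * n_pools


fosmid = pool_design(115, 5_000, 37_000)  # figures quoted for Kitzman et al. [33]
bac = pool_design(24, 5_000, 140_000)     # figures quoted for Lo et al. [39]

for name, (per_pool, total) in (("fosmid", fosmid), ("BAC", bac)):
    print(f"{name}: {per_pool:.1%} of the diploid genome per pool, {total:.1f}x in total")
```

With these nominal values, 24 BAC pools give a total physical coverage of the same order as 115 fosmid pools, at the price of a larger share of the genome in each pool.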
Fragment pools with MDA
The advent of MDA [26] in 2002 provided uniform representation across the genome and convenient amplification of microscale DNA. The length of the DNA fragments in the pools is determined by the DNA extraction process, and longer fragments are preferred because they generate longer haplotype blocks. Peters et al. [40] and Kaper et al. [41] captured fragment pools, amplified them by MDA, and sequenced them on the Complete Genomics platform and the Illumina platform, respectively. Peters et al. [40] captured a total of 384 fragment pools, each containing 10%–20% of a haploid genome, and placed 92% of the phasable heterozygous SNPs into long contigs with N50s of ~1 Mb and 500 kb for two samples, respectively. The 8% of variants left unphased resulted mainly from amplification bias.
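Why sub-haploid pools keep linkages pure can be estimated with a simple model. The sketch assumes that the DNA in a pool, expressed in haploid genome equivalents, is split evenly between the two haplotypes and that the number of fragments covering a locus follows a Poisson distribution. These modelling assumptions and the function name are illustrative and are not taken from the cited studies.

```python
import math


def prob_both_haplotypes(pool_fraction):
    """Probability that a locus covered in a pool is covered by fragments
    from both haplotypes.

    pool_fraction: DNA per pool in haploid genome equivalents (e.g. 0.1-0.2).
    """
    if pool_fraction <= 0:
        raise ValueError("pool_fraction must be positive")
    per_haplotype = pool_fraction / 2.0
    p_one_haplotype = 1.0 - math.exp(-per_haplotype)  # a given haplotype is present
    p_covered = 1.0 - math.exp(-pool_fraction)        # at least one fragment present
    return p_one_haplotype ** 2 / p_covered


for fraction in (0.1, 0.2):
    print(f"pool fraction {fraction:.1f}: {prob_both_haplotypes(fraction):.3f}")
```

Under this model, roughly 2.5% to 5% of the covered loci in a pool holding 10%–20% of a haploid genome would carry both haplotypes, and the proportion grows with pool size.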
Fragment pools with CPT-seq
In the fragment pools with MDA strategy, smaller pools are preferred to ensure that the overwhelming majority of genomic regions are represented at most once per pool. However, smaller pools mean that more pools are required to represent the genome. Amini et al. [42] designed contiguity-preserving transposition sequencing (CPT-seq) to ease this trade-off (Figure 3). Tn5 transposition was used to modify DNA with adaptor and index sequences. Because the transposase remains tightly bound to the target DNA until compartmentalization, it introduces a second dimension of barcodes into the libraries. The two dimensions of barcodes created 96×96 virtual compartments, although only 96 sequencing libraries were actually constructed.
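The idea of virtual compartments can be sketched as a grouping of reads by a pair of indexes. The record layout (first-level index, second-level index, read identifier) and all names below are hypothetical and serve only to illustrate the two-dimensional barcoding described above.

```python
from collections import defaultdict

N_INDEXES = 96  # indexes per barcode dimension, as described in the text
N_VIRTUAL_COMPARTMENTS = N_INDEXES * N_INDEXES


def group_by_virtual_compartment(reads):
    """Group read identifiers by their (first index, second index) pair.

    reads: iterable of (first_index, second_index, read_id) tuples, where
    both indexes are integers in the range 0..N_INDEXES-1.
    """
    compartments = defaultdict(list)
    for first_index, second_index, read_id in reads:
        compartments[(first_index, second_index)].append(read_id)
    return dict(compartments)


# Hypothetical reads: the first two share both indexes and fall into the
# same virtual compartment; the third shares only the second index.
example_reads = [(3, 17, "read_a"), (3, 17, "read_b"), (40, 17, "read_c")]
print(N_VIRTUAL_COMPARTMENTS)
print(group_by_virtual_compartment(example_reads))
```

Reads that share both indexes are treated as coming from the same small compartment, even though they were sequenced in only one of the 96 physical libraries.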
Fragment pools with long-range PCR
In a previous study, an MDA-based method left 8% of variants unphased at high coverage [41]. Kuleshov et al. [43] used PCR instead of MDA as the amplification approach in order to reduce amplification bias. After ligation with amplification adaptors, minute amounts of DNA can be amplified by PCR. They diluted DNA fragments into 384 wells at about 3,000 fragments per well. Although the fragments were only about 10 kb in length, 99% of single-nucleotide variants in three human genomes were phased into haplotype blocks 0.2–1 Mb in length after optimization of the statistical pipeline. The proportion of unphased variants decreased markedly.
CROSSLINKING AND PROXIMITY LIGATION
Both chromosome separation and dilution pools strive to separate homologous chromosomes or fragments into different pools, so that the chromosomes or DNA fragments in each pool can be treated as approximately haploid. An alternative strategy is to ligate two distant parts of a chromosome into a single sequencing read. A series of such reads, with random distances between the two parts, provides accurate linkages over a range of distances. Obtaining such reads was difficult before the advent of chromosome conformation capture (3C) [44]. 3C and chromosome conformation capture-on-chip (4C) [45] were first developed to identify chromosomal interactions. Their ability to bring two discontinuous sequences of one chromosome into a single read or read pair was later exploited [46–48] for whole genome haplotyping. Although it leaves many variants unphased [46], the crosslinking and proximity ligation approach is a highly innovative strategy for haplotype analysis.
Crosslinking and proximity ligation
Chromatin is cross-linked within the cell nucleus. Two cross-linked sequences may be far apart in the linear sequence but close together in space and, more importantly, physically linked. In the experiments [46, 47], chromatin was fixed with formaldehyde, digested by a restriction enzyme and ligated to form artificial fragments (Figure 4). After sequencing, the distance between the two cross-linked sequences ranged from several hundred base pairs to tens of millions of base pairs [46]. Selvaraj et al. [46] phased ~81% of alleles at 17× sequencing coverage. After adjusting the procedure, Vree et al. [47] applied a similar method to selectively sequence and phase entire genes.
Cell free crosslinking and proximity ligation
The crosslinking and proximity ligation strategy relies on intact cells or nuclei. The signal can be confounded by the complex, large-scale organization of chromosomes in nuclei; for example, some structures of different chromosomes, such as telomeres, are often associated in cells. To overcome this limitation, Putnam et al. [48] reconstituted chromatin in vitro to produce DNA linkages of up to several hundred kilobases. With the help of 210 million reads generated by the cell-free crosslinking and proximity ligation approach, they increased the haplotype block N50 from 508 kb to 10 Mb [48].
LONG-READ TECHNOLOGIES
The mismatch between short read lengths and the need for long-distance linkage is the main barrier to convenient whole genome haplotype analysis. Innovative sequencing technologies have the potential to extend read length, which is the most direct way to remove this barrier. The innovative sequencers reported so far include single-molecule real-time (SMRT) sequencing [49] and nanopore sequencing [50]. It should be noted that long-read technologies are not independent approaches but can be combined with the chromosome separation, dilution pools, and crosslinking and proximity ligation strategies.
SMRT sequencing
The SMRT sequencing platform produced raw reads with a median length of 2 kb, clearly longer than those of other sequencing platforms [51]. Its low read accuracy (~85%) seems to prevent the method from being widely used in genome analysis, including whole genome haplotyping. However, with longer reads (mapped reads averaging 5.8 kb in length) and an optimized analysis process, Chaisson et al. [49] analyzed a haploid human genome, closing or extending 55% of the remaining gaps. Even so, low accuracy will continue to hamper SMRT sequencing in haplotype analysis until it improves significantly.
Nanopore sequencing
Nanopore platforms sequence nucleotides by measuring changes in electric current as a DNA molecule passes through a nanoscale pore, and they promise to generate long reads. However, single nucleotides cannot always be identified correctly. A recent study [50] reported the identification of four-nucleotide combinations and drew quadromer maps of sequences up to 4,500 bases in length. Accurate and high-throughput nanopore sequencing methods are worth anticipating.
DISCUSSION AND CONCLUSION
For all the experimental haplotyping strategies, obtaining long-distance, highly complex and accurate linkages is the core goal (Table 2). System complexity and requirements also determine whether a haplotyping method can be widely applied (Table 3). An efficient, low-cost, scalable and labor-saving workflow is desired.
Chromosome separation strategies obtain long, complex and accurate linkages directly. However, their workflows are complex and not scalable, and they require precise devices and metaphase cells. More importantly, the key steps of the experiment are difficult to control. As a special case, the haplotyping of human gametes is not applicable to other samples. Chromosome separation strategies are therefore suitable only for special applications and specialized labs.
Dilution pools do not rely on specific devices or samples and are accessible to most labs in almost every haplotyping study. However, constructing numerous sequencing libraries is labor-intensive. Smaller pools generate more accurate linkages but require more libraries to be constructed and a higher total sequencing depth. CPT-seq requires fewer sequencing libraries at the same pool scale, but it cannot fully resolve this trade-off.
Crosslinking and proximity ligation approaches open a new window for obtaining solid linkages. Although the crosslinking and ligation workflow appears complex, it requires only one sequencing library, which is a great advantage. More importantly, these approaches offer an alternative direction to the separation of homologous chromosomes or fragments. Lower sequencing depth is another advantage. Somewhat incomplete allele phasing currently limits the wide application of crosslinking and proximity ligation approaches, and more studies are required before these methods can generate more comprehensive haplotypes.
Generating sufficiently long reads is expected to be the final solution for haplotyping. Until then, technical improvements in read length will help to improve the performance of all the strategies above.
In the next several years, with the help of longer reads and accurate bioinformatic pipelines, experimental strategies will phase the whole genome efficiently and comprehensively. Innovative experimental strategies that combine high-quality performance, low cost and labor savings will be highly desirable in the future.
Higher Education Press and Springer-Verlag Berlin Heidelberg