A bacterial artificial chromosome-based physical map of Manihot esculenta ssp. flabellifolia

Cassava (Manihot esculenta) is known as the third most important food crop in the tropics and also used for industrial feedstock for biofuels. Two new bacterial artificial chromosome (BAC) libraries were constructed for W14 (M. Esculenta ssp. flabellifolia), a wild ancestor of domesticated cassava. The libraries were constructed with EcoRI and HindIII insertion vectors, respectively. The EcoRI library has 29952 clones with an average insert size of 115 kb, while the HindIII library consists of 29952 clones with an average insert of 129 kb. The combined libraries contain a total of 59904 clones with an average insert size of 125 kb, representing approximately 10 haploid genome equivalents. A total of 29952 clones were fingerprinted and resulted in a cassava physical map composed of 2485 contigs with an average physical length of 336 kb and 2909 singletons, representing approximately 762 Mb of the cassava genome. 5000 clones located at the ends of BAC contigs were selected and sequenced. A total of 6077 SNPs and 231 indels were identified, that covered 459 gene sequences, of which 6 genes were associated with starch and sucrose metabolism. This BAC-based physical map provides valuable tools to understand the genetics and evolution of cassava.


Introduction
Cassava (Manihot esculenta), an allopolyploid species (2n = 36) with an estimated genome size of approximately 770 Mb [1] , is an important tropical crop belonging to the Euphorbiaceae family. It is grown throughout Africa, Asia and South America, and its starchy roots and edible leaves are consumed by more than 800 million people [2] . Besides being a staple food, it is also an industrial feedstock for biofuels, animal feed and financial source in some countries [3][4][5] . Cassava is considered the third most important source of calories in the tropics [6] .
Despite the importance of cassava, progress has been slow in the areas of both conventional breeding and molecular breeding compared to other major crops, in part this is due to its heterozygous genome, long breeding cycle and low seed production [7] . Great efforts worldwide have been made to improve cassava properties and many important QTLs have been identified underlying cassava mosaic disease [8] , bacterial blight [9,10] , plant architecture and productivity [11,12] , yield-related traits [13] , cyanogenic potential [14] and dry matter content [15] . Current genomic resources for cassava research include several genetic linkage maps constructed with multiple molecular markers [16][17][18][19][20][21][22] ; a large EST database [23][24][25][26][27] ; DNA microarrays [28,29] ; a non-redundant set of SSRs and validated SNPs [26,30] ; bacterial artificial chromosome (BAC) libraries [31,32] and annotated draft cassava genome sequence in Phytozome (http://www.phytozome.net/cassava). To date, four cassava BAC libraries have been developed. The first two libraries were constructed to facilitate the cloning of resistance genes by using a cassava mosaic disease (CMD) resistant and cassava bacterial blight (CBB) resistant clone TMS 30001 and a cassava white fly resistant clone MECU72 [31] . The TMS 30001 library has a 5 Â haploid genome equivalent with an average insert size of 80 kb and was screened by southern hybridization using a cassava analog (CBB1) of the Xa21 from rice, while the MECU72 library has a genome coverage of 11.3 Â with an average insert size of 93 kb and was analyzed by end sequencing of 2301 clones in order to scan for genes and repeat DNA contents. The third BAC library with 11 Â genome coverage was constructed from TME3, a CMD-resistant genotype, for contig mapping. Two RAPD markers which flanked a resistance gene (CMD2) were used to screen the TME3 library by PCR amplification of BAC pools. A total of 16 positive clones were identified. A recent cassava BAC library was developed from a partially inbred genotype AM560-2 for physical mapping [32] .
A bacterial artificial chromosome physical map is an important genomic resource for basic and applied research in plants. Physical maps of either the whole genome, a specific chromosome or region of interest have proven to be useful tools for comparative genomics, gene cloning and understanding the size and complexity of a genome [33][34][35][36][37] . Physical maps which are integrated with genetic map are more valuable in map-based cloning, genomic analysis of closely related species and enhance the understanding of unsequenced genomes [38][39][40] . For any genome sequencing project, a physical map is a fundamental component of the clone-by-clone whole-genome sequencing method that was the predominant approach used to access crop genomes until 2010 or an important intermediate layer for providing a good template for completing gaps and correcting misassemblies in a hybrid sequencing strategy, especially for large and complex genomes [35][36][37][38][39][40][41][42][43][44] . For the species with a sequenced genome, physical maps are being constructed for wild relatives for comparative purposes [35,45] . Studies in other plant species have demonstrated an evolution of BAC physical map construction and the high-information-content fingerprinting method (HICF) and five-enzyme digestion plus SNaPshot labeling have proven to be the most effective [35,[46][47][48] . One BAC-based physical map in cassava was constructed from AM560-2 by Pablo Rabinowicz and Mingcheng Luo (unpublished data), which has been used in improving the genome sequence assembly (Cassava Genome Database, http://cassava.igs.umaryland.edu/cgi-bin/index.cgi).
Phylogeography shows cultivated cassava originated from wild progenitors from the southern rim of the Amazon basin. Chromosome variation studies show M. esculenta ssp. flabellifolia (W14) is probably the only ancestor of cultivated cassava. Scientists carried out cassava genetic breeding using W14 and cultivars such as SC 205, SC 8 and Arg 7 as materials to explore the mechanism of the high photosynthesis efficiency of the cultivated cassava through physiology and biochemistry, transcriptomics and proteomics. In this study, two additional cassava BAC libraries were constructed from W14 using different restriction enzymes. In addition, a physical map was built using the HICF method and single nucleotide polymorphisms (SNPs) were developed from the BAC-end sequence.

Plant materials
A wild ancestral cassava species, M. esculenta spp. flabellifolia accession W14, was used to construct the two BAC libraries. W14 has low storage root yield with an average of 0.8 kg per plant, low starch content and high disease resistance. Seeds were planted in a greenhouse and young fully-developed leaves were collected three weeks after planting and stored at -80°C before use.

BAC libraries construction
Megabase-sized nuclear DNA was prepared according to Zhang's protocol [49] . Approximately 60.0 g of frozen young leaves was used for the extraction of nuclear DNA; nuclei were washed and centrifuged several times in order to minimize plastid DNA contamination. Nuclei were resuspended in 2 mL of 1 Â homogenization buffer (nuclei isolation buffer), mixed with an equal volume of 1% lowmelting-point agarose and dispensed into ice-cold 100-µL plug molds. The solidified nuclei plugs were incubated in lysis buffer (0.5 mol$L -1 EDTA, pH 9.0-9.3, 1% sodium lauryl sarcosine, 0.3 mg$mL -1 proteinase K) with gentle shaking at 50°C overnight. The plugs were washed three times in ice-cold 1 Â TE plus 0.1 mmol$L -1 phenylmethylsulfonyl fluoride (PMSF) on ice, 1 h each wash, followed by three washes on ice in 1 Â TE without PMSF, 1 h each wash. The plugs were stored in 1 Â TE at 4°C before use.
Size selection after partial digestion and BAC cloning were performed according to the protocol developed by Tao et al. [50] with minor modifications. Two plugs were first used to determine the optimal partial digestion condition. They were cut into equal small pieces (nine slices for each plug), equilibrated on ice in the enzyme reaction buffer plus 2 mmol$L -1 spermidine and 1 mmol$L -1 DTT twice for 30 min each time, and then another 60 min in the enzyme reaction buffer plus 2 mmol$L -1 spermidine, 1 mmol$L -1 DTT and 0.5 mg$mL -1 BSA with a range of EcoRI or HindIII. The reactions were incubated at 37°C for 7 min and stopped by transferring onto ice and adding 1/10 volume of 0.5 mol$L -1 EDTA. Partially digested DNA was fractionated on a 1% agarose gel in 0.5 Â TBE using the CHEF DRIII System (Bio-Rad Laboratories, Hercules, CA, USA) and the enzyme conditions producing the largest number of fragments ranging from 100 to 350 kb were selected. Six DNA plugs were used to construct the library and partially digested using EcoRI or HindIII as described as above. The fragments were size-selected twice by pulsed-field electrophoresis (PFGE) and ligated into either the CopyControl pCC1BAC (HindIII-cloning ready) vector or the Copy-Control pCC1BAC (EcoRI-cloning ready) vector (Epicenter Technology, Madison, WI, USA). The ligation mixture was transformed into Escherichia coli strain DH10B electrocompetent cells by electroporation. The cells were cultured in 0.5 mL SOC medium (Invitrogen, Carlsbad, CA, USA) and plated on LB agar medium containing IPTG, X-GAL and chloramphenicol. White colonies were arrayed in 384-well microplates containing freezing medium with appropriate chloramphenicol, incubated at 37°C overnight and stored at -80°C.

Analysis of BAC clones
Random clones from the two BAC libraries were selected to estimate the insert sizes. BAC plasmids were isolated and digested with NotI to release the inserts from the vector. The insert size was analyzed by PFGE and the genome coverage of the libraries was estimated.

BAC fingerprinting
In total, 29952 clones were randomly selected from the HindIII and EcoRI BAC libraries and then fingerprinted as described by Luo et al. [47] with minor modifications. From each 384-well plate from the two libraries, samples were inoculated into 96-deep well plates containing 1 mL TB (Terrific Broth) medium with 12.5 g$mL -1 chloramphenicol using a 96-pin replicator (Boekel Scientific, Feasterville, PA, USA), such that the clones from one 384-well plate were inoculated into four 96-deep well plates. Each of the 96-deepwell plates were covered with air-permeable seals (Excel Scientific, Wrightwood, CA, USA) and incubated in an orbital shaker agitated at 350 r$min -1 at 37°C for 22 h. Cells were harvested by centrifugation at 2500 r$min -1 for 10 min and BAC DNA was isolated using a modified alkaline lysis method [53] . Approximately 0.5-1.2 µg BAC DNA was simultaneously digested using BamHI, EcoRI, NdeI, XbaI and HaeIII enzymes (New England Biolabs, Ipswich, MA, USA) for 3 h at 37°C and then the restricted products were labeled using the SNaPshot Multiplex Kit (Applied Biosystems, Foster City, CA, United States). Fragmented and labeled DNA was precipitated by adding chilled 95% ethanol, washed, air-dried and dissolved in a mixture of 9.9 mL of Hi-Di formamide and 0.3 mL of internal size marker LIZ 1200 (Applied Biosystems). The DNA was denatured at 95°C for 5 min and loaded on the ABI3730xl DNA Analyzer (Applied Biosystems) using the default module (50 cm array and POP7).

Fingerprints editing and contig assembly
The fragment size from the fingerprint profile of each BAC clone was collected by the GeneMapper v3.7 (Applied Biosystems) and output data of size-calling files was edited with the FPMiner program (http://www.bioinforsoft.com). This software package was further used to distinguish peaks corresponding to true fragments from those generated by background noise and remove vector bands and substandard restriction fragments. The edited profiles were exported as sizes files. Fragments in the size range 75 to 1000 bp were used to perform contig assembly using the program FPC V 8.9 (http://www.agcol.arizona.edu/software/fpc) with adjusted parameters for the HICF method [47,48,54] .

BAC-end sequencing
The BAC DNAs were isolated using a modified alkaline lysis method. BAC-end sequencing was carried out using an ABI Big Dye Terminator v3.1 and analyzed on the ABI 3730xl Analyzer. The sequences were processed and trimmed by the programs Phred and Lucy [55,56] .

SNP mining
All the BAC-end sequences were mapped to a reference genome using software Bowtie 2 [57] . The SAM tools [58] pileup command was used to determine the variable positions of single-nucleotide polymorphisms (SNPs). The annotation was performed by BLAST, with blastp parameter as 1Â10 -5 to search the best match of peptide sequence of genes from KEGG (http://www.genome.jp/ kegg) and NCBI nrviridiplantae. The functional annotation was based on BLASTX results with the NCBI nrviridiplantae database. The program BLAST2GO [59] was used to assign biological functions, cellular components and cellular processes to the transcripts.

BAC library construction and characterization
Two BAC libraries were successfully constructed from the nuclear DNA of a wild ancestral cassava species, W14, in the CopyControl pCC1BAC vectors. The HMW DNA was partially digested either by HindIII or by EcoRI and restricted fragments generated from these two digestions were subjected to two-size selections in order to select big restricted fragments for the library construction. Each of the two libraries, EcoRI library and HindIII library, contained 29952 clones arrayed in 78384-well microplates, respectively. To estimate the insert size of the EcoRI library, a random sample of 96 clones was analyzed by NotI digestion and pulsed-field gel electrophoresis (Fig. 1). The results showed EcoRI library has an average insert size of 115 kb in a range of 9 to 183 kb. 75% of the clones contained inserts of greater than 100 kb with a majority having a range from 101 to 150 kb. 5 out of 96 clones (approximately 5%) were found to be insert-empty clones. From HindIII library, a sample of 96 clones was also selected randomly and analyzed by PFGE in order to evaluate the insert size. Approximately 97% of the clones contained DNA inserts with an average size of 129 kb, ranging from 8 to 286 kb. More than 71% of the clones had inserts larger than 125 kb with a majority having a range from 126 to 150 kb (Fig. 2). Therefore, these two libraries consist of a total of 59904 clones with an average insert size of 125 kb. Based on the estimated size of the haploid cassava genome of 770 Mb, the combined two libraries have a genome coverage equivalent to approximately 10 Â haploid genomes.

BAC fingerprinting and contig assembly
A total of 29952 clones from the two libraries representing approximately 5.0 Â genome equivalents were finger-printed using the HICF method of Luo et al. [47] . Of these clones, 5168 clones were removed during fingerprint editing because they were small insert clones, contaminated clones, empty clones or clones with poor quality fingerprints. Therefore, a total of 24784 clones were successfully fingerprinted and subsequently subjected to contig assembly using the FPC software.
To determine the optimal cutoff and minimize the contig number, a tolerance of 0.5 bp and an initial cutoff of 1E-45 were set, followed by DQer and several rounds of end-toend and single-to-end merging at progressively lower stringencies. In the end, a total of 2485 contigs were assembled from the validated 24784 BAC fingerprints whereas 2909 clones remained as singletons (Table 1). Each contig contained an average of nine BACs and spanned an average length of 336 kb. The contiguous bands across all contigs were equivalent to 762 Mb (98.96% of the cassava genome).

BAC-end sequencing and integrated into draft assembly sequences
A total of 5000 BAC clones located at the ends of BAC contigs were randomly selected for end sequencing and 8361 BAC-end sequences (BESs) were generated. The average read length was 699 bp with minimum length of 19 bp, representing 5.8 Mb of genomic DNA in total ( Table 2). The GC content was estimated to be 38.54%.
The BESs were matched to the whole-genome sequence of W14 (sequenced using a next generation sequencing platform Illumina HiSeq2000 and Roche/454 GS FLX) in order to improve the sequence assembly and create superscaffolds. This analysis allowed the original 33166 contigs from the primary sequence assembly to be merged into 7393 scaffolds and a total of 357 mega scaffolds in the physical map were generated with the average scaffold length of 183 kb ( Table 3). The N50 was 202 kb in the physical map assembly and 33 kb in the scaffolds of W14. This alignment made a total of 65 Mb of the physical map successfully integrated with the sequence assembly of W14.

Mining SNP involved in functional genes
A total of 6077 SNPs and 231 indels were discovered through alignment of BESs and whole-genome sequence assembly of AM560. Among the 6077 SNPs, the most abundant SNPs were A/G and C/T, accounting for 28.85% and 27.96% of the total SNPs, respectively (Fig. 3). In addition, 5172 single-nucleotide variations (SNVs) fell into 459 gene sequences, of which 2317 SNVs positioned in the coding sequence of 399 genes and 167 SNVs were in the untranslated region of 50 genes. Gene ontology analysis of the 459 genes showed these SNPs and indels to be related to pathways in cellular component, molecular function and biological process (Fig. 4). Further KEGG analysis showed a total of six genes were associated with starch and sucrose metabolism including sucrose synthase, trehalose 6-phosphate synthase, pectinesterase, trehalose 6-phosphate phosphatase, α-glucosidase and α-1,4galacturonosyltransferase.

Discussion
Two new large-insert libraries were constructed for cassava with different restriction endonucleases. The combined libraries contain a total of 59904 clones with an average insert size of 125 kb and approximate 10 Â genome coverage. In comparison to the previously described cassava BAC libraries, these two libraries were constructed from a wild ancestral cassava subspecies and have relatively large insert size with two different cloning sites. The strategy of constructing libraries with multiple enzymes with a different recognition sequence is to limit the risk of under-representation of genome regions of interest due to restriction enzyme bias, and this strategy has been applied in several physical mapping projects [60][61][62] . Therefore, the two cassava libraries reported here are valuable complements for cassava genomic resources, especially in physical map construction and studying genes that were lost during domestication. A genome-wide physical map of cassava, consisting of 2485 contigs and 2909 singletons with an estimated physical length of 762 Mb, was built based on two BAC libraries using the HICF method. The quality of a BACbased map was determined by several key factors such as source BAC library coverage, the density of bands per fingerprint and the critical parameters (tolerance and cutoff value). It has been reported that some singletons can be produced not only because of low quality fingerprints but also because of low coverage in the source BAC library [61] , and that the contig length increases as the genomic coverage of source clones increases [46] . In the present study, although the source of clones for the cassava physical map was equivalent to 5 Â haploid genomes, the fact that they were from two BAC libraries of different origin ensured higher genomic representation in the physical map and compensated for the low coverage of the source clones. In addition, a high stringency setting was used to build the contig map in order to minimize the false positive and obtain high accuracy of contig assemblies. However, the high stringency parameters employed in physical mapping are likely lead to overestimation of genome coverage due to failures in detecting overlaps between adjacent contigs, as reported for other plant physical maps such as poplar [63] , soybean [60] and Table 2 The index statistics of bacterial artificial chromosome-end sequence and assembly   apple [64] . Therefore, the estimate of the genome coverage of the cassava physical map reported here may be greater than the actual percentage. Furthermore, the reliability of this cassava map was assessed by aligning the BAC-end sequences of the BAC clones with the whole-genome sequence assembly of W14. This BAC-based physical map will be of great help in improving the quality of the W14 genome sequence, cloning potentially valuable genes and comparative evolutionary studies. Sequencing paired ends of BAC clones to discover molecular markers, mainly SSRs and SNPs, has been successfully applied in several plant species, and BESbased markers have proven to be useful tools for integration of physical and genetic maps [65][66][67][68] . Alignment BES to a reference genome, not only allow researchers to develop SNPs in a cost-effective way but also to make use of the genome annotation to predict SNPs, both in coding and non-coding regions. A total of 5172 SNVs discovered in this research are within 459 gene sequences of which six genes are associated with starch and sucrose metabolism. Given that starch content is one of the most important traits of cassava, the SNPs discovered in these six genes are valuable marker resources for better understanding of starch and sucrose metabolism in cassava and potentially for improving the yield and quality. However, further study of the association between these SNPs and heritable phenotypes is needed before they can be applied for marker-assisted selection.

Conclusions
The present study reported the construction of a genomewide BAC contig map of W14, an accession of wild M. esculenta spp. flabellifolia. Two BAC libraries were constructed with a total of 59904 clones with an average insert size of 125 kb. These libraries are valuable complementary resources for cassava genomic studies providing a useful platform for multiple genetic researches. A physical map was constructed using 24784 BAC clones with 8361 BAC-end sequences. This map is suitable for comparative genomics study among species of M. esculenta and positional cloning of disease resistance genes from wild species. We hope this study could accelerate the genetic and genomic researches on this important tropical crop. The BAC library and physical map are shared through our database at http://www.cassavagenome.cn.