Manihot esculenta Assembly and Gene Annotation

About Manihot esculenta

Cassava (Manihot esculenta Crantz) is grown throughout tropical Africa, Asia, and the Americas for its starchy storage roots, and feeds an estimated 1 billion people each day. Farmers choose it for its high productivity and its ability to withstand a variety of environmental conditions (including significant water stress) in which other crops fail. However, it has low protein content and is susceptible to a range of biotic stresses. Despite these problems, the crop production potential for cassava is enormous, and its capacity to grow in a variety of environmental conditions makes it a plant of the future for emerging tropical nations. Cassava is also an excellent energy source: its roots contain 20-40% starch that costs 15-30% less to produce per hectare than starch from corn, making it an attractive and strategic source of renewable energy.

Assembly

Cassava has an estimated genome size of ~772 Mbp. The assembly spans 637.2 Mbp and was assembled at the University of California, Berkeley.

Sequence data were generated from a partially inbred (third-generation self, or S3, of MCOL1505) line called AM560-2, which was developed at the International Center for Tropical Agriculture (CIAT) in Cali, Colombia. Etiolated AM560-2 leaves were collected and high molecular weight (HMW) genomic DNA was extracted into low-melting agarose plugs. HMW DNA was extracted from fresh leaf tissue using the CTAB method. The sequencing was done with PacBio RSII machines using P6-C4 chemistry. Additionally, HMW DNA was extracted from AM560-2 leaves at CIAT and sequenced using PacBio Sequel (v2 chemistry). These combined data totalled 88.8 Gbp of sequence, representing 115× depth of coverage. Lastly, over 389 million Hi-C pairs (254 million non-redundant contacts) were generated from fresh AM560-2 leaf tissue.
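The depth figure quoted above can be checked with simple arithmetic. The following is a minimal sketch that assumes depth is computed against the estimated genome size (~772 Mbp) rather than the assembly span; the variable and function names are illustrative and do not come from the assembly pipeline.

```python
# Figures quoted in the text above; names are illustrative only.
GENOME_SIZE_BP = 772e6       # estimated genome size (~772 Mbp)
TOTAL_SEQUENCE_BP = 88.8e9   # combined PacBio data (88.8 Gbp)
HIC_PAIRS = 389e6            # Hi-C read pairs ("over 389 million")
HIC_NONREDUNDANT = 254e6     # non-redundant Hi-C contacts


def depth_of_coverage(total_bp: float, genome_bp: float) -> float:
    """Average number of times each genomic base is covered by sequence."""
    return total_bp / genome_bp


depth = depth_of_coverage(TOTAL_SEQUENCE_BP, GENOME_SIZE_BP)
# Approximate, because the pair count is given only as a lower bound.
hic_unique_fraction = HIC_NONREDUNDANT / HIC_PAIRS

print(f"Depth of coverage: {depth:.0f}x")
print(f"Non-redundant Hi-C fraction: {hic_unique_fraction:.1%}")
```

Under that assumption the result rounds to the 115× stated in the text, and roughly two thirds of the Hi-C pairs are non-redundant contacts.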

Contigs were assembled with Canu v1.6 from raw PacBio SMRT CLR data. Redundant contig sequences were identified by their median depth of coverage and the extent of their sequence overlap assessed by alignment. The shorter redundant sequences were then discarded. Misassembled contigs were broken in JuiceBox using the JuiceBox Assembly Tools (JBAT). Contigs were ordered and oriented into chromosome-scale scaffolds with SSPACE v3 and 3D-DNA using, respectively, 40 kb fosmid-end sequencing reads and the newly generated Hi-C sequences. The resulting scaffolds were manually curated with JBAT to remedy contig inversion errors and incorrect placements, and then to incorporate additional small contigs. Scaffold gap sizes were estimated using custom scripts. Gaps were patched via the local de novo assembly of gap-flanking long reads. Base-level errors were corrected by two rounds of signal-based polishing with Arrow (from SMRT Link Suite v6.0.0.47841), followed by four rounds of Illumina short-read polishing with FreeBayes v1.3.1-17-gaa2ace8 and custom scripts. As evaluated by Merqury v1.0.0-g9917ad8, the polished assembly has an average base-level quality value (QV) of 43.5, or less than one error in 22 kb.
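The relationship between the reported QV and the "one error in 22 kb" figure follows from the standard Phred scale, QV = -10 · log10(P_error). A minimal sketch of the conversion is shown here; the function names are illustrative and are not part of Merqury.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability back to a Phred-scaled QV."""
    return -10 * math.log10(error_rate)


rate = qv_to_error_rate(43.5)   # QV reported for the polished assembly
bases_per_error = 1 / rate      # average number of bases per expected error

print(f"Error rate: {rate:.2e} per base")
print(f"About one error every {bases_per_error / 1000:.1f} kb")
```

A QV of 43.5 corresponds to roughly one expected error per 22.4 kb, which is consistent with the figure given in the text.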

95.7% of the 122.7k cassava ESTs available in NCBI map to the v8 assembly.

Annotation

The Manihot esculenta v8 reference was annotated at DOE-JGI. To produce the current cassava v8.1 gene set, the homology-based gene prediction programs FgenesH and GenomeScan were used, along with the PASA program to integrate expression information from cassava ESTs and RNA-Seq.

Transcript assemblies (TAs) were constructed with PERTRAN from roughly three billion strand-specific and five billion standard paired-end Illumina RNA-seq read pairs and five million LS454 ESTs. Subsequently, 282,674 transcript assemblies were constructed with PASA from the TAs above and ESTs. Protein-coding loci were inferred from TA alignments and protein homology to Hevea brasiliensis, Jatropha curcas, Arabidopsis, soybean, poplar, peach, rice, sorghum, foxtail millet, tomato, grape, castor bean, and Swiss-Prot proteins. Gene model prediction used the following homology-based methods: FGENESH+, FGENESH_EST, and assembly sequence-based open reading frames inferred by EXONERATE and refined with PASA. The best-scoring predictions for each locus were selected using a heuristic, rule-based method based on multiple positive factors, including EST and peptide coverage, homology scores, and mRNA expression levels. Negative factors included overlap with annotated genomic repeats, presence of known transposable element domains, and minimum single-exon coding sequence length. Filtered gene models were then improved with PASA to refine splice sites and add UTRs and alternative transcripts. This v8.1 annotation represents 32,447 protein-coding gene loci and, using BUSCO v3 benchmarking, is estimated to be 99.2% complete, with only 0.2% of BUSCO genes fragmented and 0.6% missing (eudicotyledons_odb10, N=2,121).
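To give a sense of scale for the BUSCO percentages above, they can be converted back to approximate gene counts. This is only a sketch: the published percentages are rounded, so the derived counts are estimates rather than values taken from the BUSCO report, and the names used are illustrative.

```python
# Values quoted in the text above (eudicotyledons_odb10, N = 2,121).
BUSCO_TOTAL = 2121
BUSCO_PERCENT = {"complete": 99.2, "fragmented": 0.2, "missing": 0.6}


def percent_to_count(percent: float, total: int) -> int:
    """Approximate number of BUSCO genes represented by a rounded percentage."""
    return round(total * percent / 100)


busco_counts = {
    category: percent_to_count(percent, BUSCO_TOTAL)
    for category, percent in BUSCO_PERCENT.items()
}

for category, count in busco_counts.items():
    print(f"{category}: ~{count} of {BUSCO_TOTAL}")
```

The three categories sum to 100%, and the estimated counts (about 2,104 complete, 4 fragmented, and 13 missing) add up to the full set of 2,121 BUSCO genes.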

References

- The Cassava Genome: Current Progress, Future Directions.
  Prochnik S, Marri PR, Desany B, Rabinowicz PD, Kodira C, Mohiuddin M, Rodriguez F, Fauquet C, Tohme J, Harkins T, Rokhsar DS, Rounsley S. Trop Plant Biol 5 (1)
- Phytozome: a comparative platform for green plant genomics.
  Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N, Rokhsar DS. Nucleic Acids Res 40 (database issue)

Picture credit: By Ton Rulkens from Mozambique, CC BY-SA 2.0, via Wikimedia Commons

Statistics

Summary

Assembly	M.esculenta_v8, INSDC Assembly GCA_001659605.2
Database version	116.2
Golden Path Length	639,586,700
Genebuild by	JGI
Genebuild method	Import
Data source	JGI

Gene counts

Coding genes	32,805
Gene transcripts	59,151
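The summary figures above can be combined into two derived statistics. The sketch below assumes that the golden path length is compared against the estimated genome size of ~772 Mbp quoted in the Assembly section; the variable names are illustrative.

```python
# Values taken from the Summary and Gene counts tables above.
GOLDEN_PATH_LENGTH_BP = 639_586_700
CODING_GENES = 32_805
GENE_TRANSCRIPTS = 59_151

# Estimated genome size from the Assembly section (~772 Mbp).
ESTIMATED_GENOME_SIZE_BP = 772_000_000

# Average number of annotated transcripts per coding gene.
transcripts_per_gene = GENE_TRANSCRIPTS / CODING_GENES

# Share of the estimated genome represented in the golden path.
assembled_fraction = GOLDEN_PATH_LENGTH_BP / ESTIMATED_GENOME_SIZE_BP

print(f"Transcripts per coding gene: {transcripts_per_gene:.2f}")
print(f"Fraction of estimated genome assembled: {assembled_fraction:.1%}")
```

On these figures there are about 1.8 transcripts per coding gene, and the golden path covers roughly 83% of the estimated genome size.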
