Theobroma cacao Matina 1-6 Assembly and Gene Annotation
About Theobroma cacao
Theobroma cacao (cacao or chocolate tree) is a neotropical plant native to Amazonian rainforests and is now cultivated in over 50 countries. A member of the Malvaceae family, it bears pods whose beans are harvested to make chocolate and for use in confections and cosmetics. Cacao is a diploid species (2n=2x=20) with a relatively small genome (411 Mb to 494 Mb). This is the genome assembly and annotation of the Matina 1-6 cultivar, which belongs to the most widely cultivated cacao type worldwide.
The 445 Mb chromosome-level assembly of this cacao genome has scaffolds anchored and oriented to ten chromosomes, with chromosome assignment based on BAC end sequencing and FISH karyotyping. The anchored scaffolds cover 94.6% of the total assembly, which includes 4.4% gap sequence; the remainder is in unlinked scaffolds. The assembly is primarily composed of Roche 454 Titanium STD reads, augmented by six recombination paired Titanium libraries, three fosmid-packaged Sanger-sequenced libraries, and three Sanger-sequenced BAC end libraries. It was produced by the Cacao Genome Consortium.
Illumina and 454 RNA-Seq reads totaling 62 Gb were collected from leaf, pistil and bean tissues and assembled with the de novo RNA assemblers Velvet/Oases, SOAPdenovo-Trans and Roche Newbler, and with the genome-guided assembler Cufflinks. From this over-assembly of 1.2 million transcripts, the 48,404 most accurate cacao gene assemblies were selected by scoring homology and genome-mapping evidence.
Proteins from 297,061 genes of eight plants (arabidopsis, poplar, castor bean, grape, strawberry, potato, soybean and sorghum) were used for homology evidence. Transposons were annotated as class I retrotransposons and class II DNA transposons, with subcategories, totaling 8,542 intact transposons with 122,552 copies, covering 137 Mb or 42% of the genome assembly. Gene evidence (RNA-mapped introns, plant protein homology, expressed RNA alignments and transposons) was mapped to the genome with GMAP/GSNAP, tBLASTn and exonerate. Several gene prediction sets were produced with AUGUSTUS, after HMM training of this predictor on cacao transcripts, by varying the proportions of this gene evidence and the parameters.
EvidentialGene software was used to annotate, score and classify gene assemblies and models with weighted evidence scores. This includes heuristics to identify and reduce gene joins and fragment models. At each locus, the model or assembly with the highest evidence score was selected as the primary transcript. Alternate transcripts were selected from the remaining transcript assemblies (but not predictions) that differ from the primary transcript in exon-intron structure. The resulting gene set was annotated by reciprocal best BLASTP against several plant gene sets, clustered into gene families with OrthoMCL, and annotated with family consensus function names, homology scores and references, and other gene quality scores.
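The per-locus selection described above can be illustrated with a minimal Python sketch. This is not EvidentialGene's code: the `Model` record, the `select_transcripts` function, the use of an intron-coordinate tuple as a stand-in for exon-intron structure, and all identifiers and scores in the example are assumptions made purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """Hypothetical record for one gene model or transcript assembly."""
    model_id: str
    locus: str
    source: str      # "assembly" (RNA assembly) or "prediction" (e.g. AUGUSTUS)
    score: float     # weighted evidence score (made-up values below)
    introns: tuple   # intron (start, end) pairs, a proxy for exon-intron structure


def select_transcripts(models):
    """Pick one primary and any alternate transcripts per locus.

    Primary: the highest-scoring model of either source.
    Alternates: remaining assemblies (never predictions) whose intron
    structure differs from the primary and from alternates already kept.
    """
    by_locus = defaultdict(list)
    for m in models:
        by_locus[m.locus].append(m)

    selected = {}
    for locus, group in by_locus.items():
        ranked = sorted(group, key=lambda m: m.score, reverse=True)
        primary = ranked[0]
        seen_structures = {primary.introns}
        alternates = []
        for m in ranked[1:]:
            if m.source != "assembly":
                continue  # predictions are not eligible as alternates
            if m.introns in seen_structures:
                continue  # same exon-intron structure as one already kept
            seen_structures.add(m.introns)
            alternates.append(m)
        selected[locus] = (primary, alternates)
    return selected


# Illustrative, invented data for a single locus.
models = [
    Model("t1", "locusA", "assembly", 9.0, ((100, 200),)),
    Model("t2", "locusA", "prediction", 9.5, ((100, 200), (300, 400))),
    Model("t3", "locusA", "assembly", 7.0, ((100, 250),)),
    Model("t4", "locusA", "assembly", 6.0, ((100, 200),)),
    Model("t5", "locusA", "prediction", 8.0, ((120, 220),)),
]
selected = select_transcripts(models)
primary, alternates = selected["locusA"]
print(primary.model_id, [m.model_id for m in alternates])
```

In this toy input the prediction `t2` becomes the primary transcript because it has the top score, while only the assemblies `t1` and `t3` are kept as alternates; `t4` repeats the intron structure of `t1`, and `t5` is a prediction. A real pipeline would also need the gene-join and fragment heuristics mentioned above, which this sketch omits.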
This gene set includes 29,188 protein coding genes and was produced by Don Gilbert (Indiana University, 2012-03-08).
- The genome of Theobroma cacao.
X Argout, J Salse, JM Aury et al. 2011. Nature Genetics. 43:101-108.
Picture credit: Cocoa beans in a cacao pod. Photo by Keith Weller. This image was released by the Agricultural Research Service, the research agency of the United States Department of Agriculture, with the ID K4636-14.
General information about this species can be found in Wikipedia.
|Assembly||Theobroma_cacao_20110822, INSDC Assembly GCA_000403535.1, May 2013|
|Golden Path Length||345,993,675|
|Genebuild method||External annotation import|
|Data source||Cacao Genome Consortium|
|Non coding genes||465|
|Small non coding genes||465|