Theobroma cacao Assembly and Gene Annotation

The Cocoa Genome Hub is a collaboration between CIRAD and the South Green bioinformatics platform.

About Theobroma cacao

Theobroma cacao (cacao or chocolate tree) is a neotropical plant native to Amazonian rainforests and is now cultivated in over 50 countries. A member of the Malvaceae family, it bears pods whose beans are harvested to make chocolate and for use in confections and cosmetics. Cacao is a diploid species (2n=2x=20) with a relatively small genome (between 411 Mb and 494 Mb). This is the V2 genome assembly and annotation of the Belizean Criollo B97-61/B2 cultivar.

Assembly

Four Illumina large-insert mate-pair libraries were combined with 52x coverage of Pacific Biosciences long reads to correct misassembled regions and reduce the number of scaffolds. In addition, genotyping-by-sequencing SNPs from a UF676 x ICS95 mapping population of 434 individuals were used to increase the proportion of the assembly anchored to chromosomes. The number of scaffolds decreased from 4,792 in assembly V1 to 554 in V2, while the scaffold N50 increased from 0.47 Mb in V1 to 6.5 Mb in V2. A total of 96.7% of the assembly was anchored to the 10 chromosomes. Unknown sites (Ns) were reduced from 10.8% to 5.7%. This assembly was produced by the Cocoa Genome Hub.
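For readers unfamiliar with the metrics quoted above, the following is a minimal Python sketch of how a scaffold N50 and a fraction of unknown sites are commonly computed. It is not the pipeline used by the Cocoa Genome Hub. The function names (scaffold_n50, unknown_fraction) and the toy values are hypothetical and chosen only for illustration.

```python
def scaffold_n50(lengths):
    """Return the N50 of a collection of scaffold lengths (in bp).

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def unknown_fraction(sequence):
    """Return the fraction of bases in `sequence` that are N (unknown)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


# Toy values, invented for illustration; not taken from the cacao assembly.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = scaffold_n50(toy_lengths)
toy_unknown = unknown_fraction("ACGTNNACGT")
```

Under this definition, merging many short scaffolds into fewer long ones raises the N50, which is the direction of change reported between V1 and V2.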

Annotation

V1 gene annotations produced by EUGene, following specific training for T. cacao, were combined with a new, de novo RefSeq structural annotation performed by the NCBI Eukaryotic Genome Annotation Pipeline based on RNA-seq evidence. Of the V1 gene models, 98.6% were relocated to the V2 assembly. In total, 345 genes from V1 were relocated to a different chromosome in V2. The best structural predictions from the two datasets were then selected to build a consensus annotation, yielding 29,071 consensus protein-coding gene models.

References

  1. The cacao Criollo genome v2.0: an improved version of the genome for genetic and functional genomic studies.
    X Argout, G Martin, G Droc, O Fouet, K Labadie, E Rivals, JM Aury, C Lanaud. 2017. BMC Genomics. 18:730.
  2. The genome of Theobroma cacao.
    X Argout, J Salse, JM Aury et al. 2011. Nature Genetics. 43:101-108.

Picture credit: Cocoa beans in a cacao pod. Photo by Keith Weller. This image was released by the Agricultural Research Service, the research agency of the United States Department of Agriculture, with the ID K4636-14.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly: Criollo_cocoa_genome_V2, INSDC Assembly GCA_000208745.2
Database version: 97.2
Base Pairs: 324,719,311
Golden Path Length: 324,719,311
Genebuild by: CGH
Genebuild method: Generated from CGH annotation
Data source: CIRAD

Gene counts

Coding genes: 21,257
Non coding genes: 2,164
Small non coding genes: 2,164
Pseudogenes: 1,140
Gene transcripts: 35,849
