About Theobroma cacao

Theobroma cacao (cacao or chocolate tree) is a neotropical plant native to Amazonian rainforests and is now cultivated in over 50 countries. It is a member of the Malvaceae family. Its beans are harvested from pods and used to make chocolate, confections and cosmetics. Cacao is a diploid species (2n=2x=20) with a relatively small genome (411 Mb to 494 Mb). This is the V2 genome assembly and annotation of the Belizean Criollo B97-61/B2 cultivar.


Four Illumina large-insert mate-pair libraries were combined with 52x coverage of Pacific Biosciences long reads to correct misassembled regions and reduce the number of scaffolds. In addition, genotyping-by-sequencing SNPs from a UF676 x ICS95 mapping population of 434 individuals were used to increase the proportion of the assembly anchored to chromosomes. The scaffold number decreased from 4,792 in assembly V1 to 554 in V2, while the scaffold N50 size increased from 0.47 Mb in V1 to 6.5 Mb in V2. A total of 96.7% of the assembly was anchored to the 10 chromosomes. Unknown sites (Ns) were reduced from 10.8% to 5.7%. This assembly was produced by the Cocoa Genome Hub.
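For readers unfamiliar with the scaffold N50 statistic quoted above, the following is a minimal sketch of how it is conventionally computed: scaffolds are sorted from longest to shortest, and the N50 is the length of the scaffold at which the running total first reaches half of the total assembly length. The function name n50 and the toy scaffold lengths are illustrative assumptions, not data or code from the Cocoa Genome Hub.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (in bp).

    Conventional definition: the length L such that scaffolds of
    length >= L together cover at least half of the assembly.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no scaffold lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths (hypothetical, not the real V1 or V2 scaffolds).
toy_scaffolds = [80, 70, 50, 40, 30, 20]
print(n50(toy_scaffolds))
```

A larger N50 for the same total length means that fewer, longer scaffolds carry most of the sequence, which is why the statistic is reported alongside the drop in scaffold number.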


The V1 gene annotations, produced by EUGene following specific training for T. cacao, were combined with a new de novo RefSeq structural annotation performed by the NCBI Eukaryotic Genome Annotation Pipeline based on RNA-seq evidence. 98.6% of the V1 gene models were relocated to the V2 assembly. In total, 345 genes from V1 were relocated to a different chromosome in V2. A consensus annotation step then selected the best structural predictions from the two datasets, yielding 29,071 consensus protein-coding gene models.



  1. The cacao Criollo genome v2.0: an improved version of the genome for genetic and functional genomic studies.
    X Argout, G Martin, G Droc, O Fouet, K Labadie, E Rivals, JM Aury, C Lanaud. 2017. BMC Genomics. 18:730.

Picture credit: Cocoa beans in a cacao pod. Photo by Keith Weller. This image was released by the Agricultural Research Service, the research agency of the United States Department of Agriculture, with the ID K4636-14.

Assembly: Criollo_cocoa_genome_V2, INSDC Assembly GCA_000208745.2
Database version: 100.1
Base Pairs: 324,719,311
Golden Path Length: 324,719,311
Genebuild by: CGH
Genebuild method: External annotation import
Data source: CIRAD

Gene counts

Coding genes: 21,257
Non coding genes: 2,164
Small non coding genes: 2,164
Gene transcripts: 35,849

