Cannabis sativa female (cs10)

Cannabis sativa female Assembly and Gene Annotation

About Cannabis sativa

Cannabis sativa (hemp) has been cultivated for millennia, with distinct cultivars providing fiber, oil and confectionery seed, or tetrahydrocannabinol. It is diploid (2n=20) and its native range is SE European Russia to NW China and Pakistan. It is a dioecious species, with sexual dimorphism appearing at a late stage of plant development. Sex is determined by heteromorphic chromosomes: the male is the heterogametic sex (XY) and the female is the homogametic one (XX). Sex is considered an important trait for hemp genetic improvement.


A female of cultivar CBDRx was sequenced with ultra-long Nanopore reads (34x), and its chromosomes were resolved using markers from an ultra-high-density genetic map of 96 recombinant individuals resulting from a cross between near-isogenic lines Skunk#1 and Carmen. The genome was assembled using a correction-less pipeline consisting of overlap (minimap2), layout (miniasm2) and consensus (racon) steps, followed by a polishing step (pilon) using 64x Illumina 2x100 bp paired-end reads. The resulting assembly was 746 Mbp in 1,986 contigs, with an N50 length of 742 kb and a longest contig of 4.5 Mbp.
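The N50 length quoted for the assembly is the contig length at which contigs of that size or longer account for at least half of the assembled bases. The sketch below shows one common way to compute it; the function name n50 and the toy contig lengths are illustrative assumptions, not values or code from the cs10 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (in kb), chosen only to illustrate the idea.
toy_contigs_kb = [10, 8, 6, 4, 2]
print(n50(toy_contigs_kb))
```

Applied to the real contig lengths of an assembly (for example, read from a FASTA index), the same function would yield the kind of figure reported above.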


Full-length cDNAs, a StringTie assembly of 142 RNA-seq libraries, and Trinity transcripts were assembled into gene models with PASA. Genes were also predicted ab initio using Augustus. Non-redundant RefSeq Viridiplantae proteins were clustered at 90% identity with CD-HIT, and representative sequences were aligned to the reference. Pairwise hits were locally realigned with AAT and Exonerate protein2genome, and the PASA gene models were updated.

Repeats were annotated with the Ensembl Genomes repeat feature pipeline. There are: 1,532,461 low-complexity (Dust) features, covering 87 Mb (9.9% of the genome); 175,269 RepeatMasker features (with the REdat library), covering 97 Mb (11.1% of the genome); 9,445 RepeatMasker features (with the RepBase library), covering 1 Mb (0.2% of the genome); 461,225 tandem repeat (TRF) features, covering 33 Mb (3.8% of the genome); and Repeat Detector repeats, covering 439 Mb (50.2% of the genome).
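The percentages above can be approximately reproduced by dividing each coverage figure by the genome length. The sketch below assumes the denominator is the Golden Path Length listed in the assembly statistics (875,732,045 bp); the page does not state this explicitly, and the dictionary keys and function name are illustrative. Because the Mb figures are rounded to whole megabases, the recomputed values can differ from the published ones by about a tenth of a percentage point.

```python
# Assumption: percentages are relative to the Golden Path Length.
GOLDEN_PATH_LENGTH_BP = 875_732_045

# Coverage per repeat class in Mb, as quoted (rounded) in the text.
REPEAT_COVERAGE_MB = {
    "Low complexity (Dust)": 87,
    "RepeatMasker (REdat)": 97,
    "RepeatMasker (RepBase)": 1,
    "Tandem repeats (TRF)": 33,
    "Repeat Detector": 439,
}


def percent_of_genome(covered_mb, genome_bp=GOLDEN_PATH_LENGTH_BP):
    """Convert a coverage in Mb to a percentage of the genome length."""
    return 100.0 * covered_mb * 1_000_000 / genome_bp


for name, covered_mb in REPEAT_COVERAGE_MB.items():
    print(f"{name}: {percent_of_genome(covered_mb):.1f}% of the genome")
```

The classes overlap (a tandem repeat can also be a low-complexity region, for instance), so the individual percentages are not expected to sum to a single total.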


Assembly: cs10, INSDC Assembly GCA_900626175.1
Database version: 107.1
Golden Path Length: 875,732,045
Genebuild by: Cannabis Genome
Genebuild method: External annotation import
Data source: HARVARD OEB

Gene counts

Coding genes: 27,249
Gene transcripts: 36,254