Camelina sativa Assembly and Gene Annotation
About Camelina sativa
Camelina sativa (false flax, gold of pleasure, German sesame) is a relict oilseed crop of the Crucifer family (Brassicaceae) with centres of origin in SE Europe and SW Asia. C. sativa was cultivated in Europe as an important oilseed crop for many centuries before being displaced by higher-yielding crops such as canola and wheat. It has several agronomic advantages, including early maturity, low water and nutrient requirements, adaptability to adverse environmental conditions and resistance to common cruciferous pests and pathogens. It is currently being re-embraced as an industrial oil platform crop. C. sativa is diploid (2n=40) with an estimated genome size of 785 Mb, and retains a well-preserved hexaploid genome as a result of a whole-genome triplication event.
Assembly
The genome of a homozygous doubled haploid line (DH55) was sequenced using a hybrid Illumina and Roche 454 next-generation sequencing (NGS) approach. Filtered sequence data (96.53 Gb) provided 123x coverage of the estimated genome size and was assembled into 37,871 scaffolds using a hierarchical assembly strategy. A high-density genetic map based on 3,575 polymorphic markers allowed 608.54 Mb of the assembled genome, represented by 588 scaffolds, to be anchored to the 20 chromosomes of C. sativa, producing a highly contiguous final assembly with an N50 size of >30 Mb. The final genome assembly contains 641.45 Mb of sequence, covering 82% of the estimated genome size; 95% of the assembled sequence is anchored to the 20 chromosomes.
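The coverage and percentage figures above follow from simple ratios, and the short Python sketch below reproduces them from the numbers quoted in this section. It assumes 1 Gb = 1000 Mb. The `n50` helper illustrates how an N50 statistic is commonly computed; the scaffold lengths passed to it are hypothetical toy values, not lengths from the C. sativa assembly.

```python
# Figures quoted in the Assembly section; sizes in megabases (Mb).
ESTIMATED_GENOME_MB = 785.0
FILTERED_READS_MB = 96.53 * 1000  # 96.53 Gb, assuming 1 Gb = 1000 Mb
ASSEMBLY_MB = 641.45
ANCHORED_MB = 608.54


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


coverage = FILTERED_READS_MB / ESTIMATED_GENOME_MB
assembled_pct = percent(ASSEMBLY_MB, ESTIMATED_GENOME_MB)
anchored_pct = percent(ANCHORED_MB, ASSEMBLY_MB)

print(f"read coverage: {coverage:.0f}x")
print(f"assembly vs estimated genome size: {assembled_pct:.0f}%")
print(f"anchored fraction of assembly: {anchored_pct:.0f}%")

# Hypothetical scaffold lengths, for illustration only.
toy_lengths = [10, 8, 6, 4, 2]
print(f"toy N50: {n50(toy_lengths)}")
```

The rounded results agree with the 123x, 82% and 95% values stated in the text.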
Annotation
RNA-seq data (78.5 Gb) was generated from tissue samples collected at 12 different growth stages to assist with annotation of protein-coding genes. Based on a comprehensive strategy combining ab initio gene prediction with homology evidence from proteome data sets, ESTs and RNA-seq transcripts, 89,418 non-redundant genes were predicted, of which 4,753 (5.3%) encoded two or more alternatively spliced isoforms. More than 95% (85,274) of these annotated genes were located on the pseudochromosomes, with the remainder on unanchored scaffolds. Based on sequence identity, 97% of the predicted C. sativa genes have homologues in UniProt. RNA-seq evidence suggested that >90% of the genes were expressed (FPKM>0) in one or more developmental stages.
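As a quick check of the gene proportions above, the sketch below recomputes them from the quoted counts. It also includes the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments) that underlies the FPKM>0 expression criterion. The arguments in the `fpkm` example call are hypothetical and are not taken from the C. sativa RNA-seq data.

```python
# Gene counts quoted in the Annotation section.
PREDICTED_GENES = 89_418
MULTI_ISOFORM_GENES = 4_753
GENES_ON_PSEUDOCHROMOSOMES = 85_274


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


multi_isoform_pct = percent(MULTI_ISOFORM_GENES, PREDICTED_GENES)
anchored_gene_pct = percent(GENES_ON_PSEUDOCHROMOSOMES, PREDICTED_GENES)
unanchored_genes = PREDICTED_GENES - GENES_ON_PSEUDOCHROMOSOMES

print(f"genes with two or more isoforms: {multi_isoform_pct:.1f}%")
print(f"genes on pseudochromosomes: {anchored_gene_pct:.1f}%")
print(f"genes on unanchored scaffolds: {unanchored_genes}")

# Hypothetical example: 500 fragments on a 2,000 bp transcript,
# in a library of 20 million mapped fragments.
example_fpkm = fpkm(500, 2_000, 20_000_000)
print(f"example FPKM: {example_fpkm}")
```

Under this definition a gene counts as expressed (FPKM>0) as soon as a single fragment maps to it, which makes the criterion a permissive one.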
Repeats were annotated with the Ensembl Genomes repeat feature pipeline. There are:
- 1,298,859 Low complexity (Dust) features, covering 84 Mb (13.1% of the genome)
- 333,331 RepeatMasker features (with the REdat library), covering 128 Mb (20.0% of the genome)
- 5,216 RepeatMasker features (with the RepBase library), covering 1 Mb (0.1% of the genome)
- 333,890 Tandem repeats (TRF) features, covering 30 Mb (4.7% of the genome)
- Repeat Detector repeats, with a total length of 230 Mb (35.9% of the genome)
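The repeat percentages can be sanity-checked against the 641.45 Mb assembly size given in the Assembly section, as in the sketch below. Because the covered lengths are quoted only to the nearest megabase, the recomputed values are expected to match the quoted ones only approximately; the 0.1 percentage point tolerance is an assumption chosen for this check. The categories may overlap one another, so the percentages are not summed.

```python
ASSEMBLY_MB = 641.45  # final assembly size quoted in the Assembly section

# (category, covered length in Mb, quoted percentage of the genome)
REPEAT_FEATURES = [
    ("Low complexity (Dust)", 84, 13.1),
    ("RepeatMasker (REdat)", 128, 20.0),
    ("RepeatMasker (RepBase)", 1, 0.1),
    ("Tandem repeats (TRF)", 30, 4.7),
    ("Repeat Detector", 230, 35.9),
]

TOLERANCE_PCT = 0.1  # assumed tolerance, since Mb values are rounded


def recompute(features, assembly_mb):
    """Return {category: (recomputed %, quoted %)} for each feature."""
    return {
        name: (100.0 * covered_mb / assembly_mb, quoted_pct)
        for name, covered_mb, quoted_pct in features
    }


results = recompute(REPEAT_FEATURES, ASSEMBLY_MB)
for name, (computed, quoted) in results.items():
    agrees = abs(computed - quoted) <= TOLERANCE_PCT
    print(f"{name}: computed {computed:.2f}%, quoted {quoted}%, agrees={agrees}")
```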
References
- The emerging biofuel crop Camelina sativa retains a highly undifferentiated hexaploid genome structure.
Kagale S, Koh C, Nixon J, Bollina V, Clarke WE, Tuteja R, Spillane C, Robinson SJ, Links MG, Clarke C, Higgins EE, Huebert T, Sharpe AG, Parkin IA.
Image credit: Fornax CC BY-SA 3.0
Links
More information
General information about this species can be found in Wikipedia.
Statistics
Summary
Assembly | Cs, INSDC Assembly GCA_000633955.1 |
Database version | 113.1 |
Golden Path Length | 641,356,059 |
Genebuild by | Camelina sativa Genome Project |
Genebuild method | External annotation import |
Data source | Agriculture & AgriFood Canada |
Gene counts
Coding genes | 89,402 |
Gene transcripts | 94,479 |
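For readers who want to work with these statistics programmatically, the sketch below parses rows in the `name | value |` layout used above and derives two summary numbers. The `parse_rows` helper and the embedded `SUMMARY_ROWS` string are illustrative assumptions, not part of any Ensembl tool. The transcripts-per-gene ratio is only indicative, since the transcript count may include transcripts of genes other than coding genes.

```python
# Numeric rows copied from the Statistics section above.
SUMMARY_ROWS = """\
Golden Path Length | 641,356,059 |
Coding genes | 89,402 |
Gene transcripts | 94,479 |
"""


def parse_rows(text):
    """Parse 'name | value |' rows into a dict of integer values."""
    stats = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) < 2 or not cells[0]:
            continue  # skip blank or malformed rows
        stats[cells[0]] = int(cells[1].replace(",", ""))
    return stats


stats = parse_rows(SUMMARY_ROWS)
golden_path_mb = stats["Golden Path Length"] / 1_000_000
transcripts_per_gene = stats["Gene transcripts"] / stats["Coding genes"]

print(f"golden path length: {golden_path_mb:.2f} Mb")
print(f"transcripts per coding gene: {transcripts_per_gene:.2f}")
```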