Rosa chinensis Assembly and Gene Annotation
About Rosa chinensis
Rose is the world’s most important ornamental plant, with economic, cultural and symbolic value. Roses are cultivated worldwide and sold as garden roses, cut flowers and potted plants. They appeared as decoration on 5,000-year-old Asian pottery, and the Romans cultivated them for their flowers and essential oil. Roses are also used for scent production and for culinary purposes. This genome corresponds to a doubled haploid (2n=14) of Rosa chinensis derived from the Chinese variety ‘Old Blush’. ‘Old Blush’ (syn. Parsons’ Pink China) was brought from China to Europe and North America in the eighteenth century and is one of the most influential genotypes in the history of rose breeding.
Assembly
The homozygous doubled haploid was sequenced on the PacBio RS II platform. Sequencing coverage of 80x was obtained from 40 single-molecule real-time (SMRT) cells. A preliminary assembly of the rose data with a single assembler generated several hundred contigs, illustrating the challenge of assembling plant genomes even with long-read data. A key step in improving the contiguity of the assembly was detecting and filtering spurious edges in the overlap graph. The assembler CANU implements filter parametrization at the read level, leading to more accurate and contiguous assemblies. To apply comparable filtering to the FALCON assembler, a tool called til-r was developed, which implements similar and alternative heuristics to clean FALCON's overlap graph. CANU was then used to perform a meta-assembly of six complementary raw assemblies generated by CANU and FALCON/til-r. The final assembly was composed of 82 contigs with an N50 of 24 Mb, a threefold increase in contiguity over a simple assembly that demonstrates the power of meta-assembly approaches.
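As an illustration of the N50 contiguity metric quoted above, here is a minimal Python sketch. It assumes the usual definition (the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length). The function name `n50` and the toy contig lengths are hypothetical and are not taken from the rose assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    # Walk the contigs from longest to shortest.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total to at least
        # half of the assembly length defines the N50.
        if running >= half_total:
            return length


# Hypothetical toy lengths (arbitrary units), NOT the rose contigs.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # prints 70
```

A higher N50 for the same total length means that more of the assembly sits in long contigs, which is why the metric is used to compare the simple and meta-assemblies.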
The seven pseudo-chromosomes were built by integrating 86.4% of the 25,695 markers of the K5 high-density genetic map. A large fraction of the assembly (97.7%, 503 Mb) was oriented, with Pearson's correlation coefficients ranging from 0.986 to 0.996, illustrating the high congruence between sequence and genetic data. The genome structure and quality were confirmed by mapping Hi-C chromosomal contact data. With very few remaining gaps and high consistency between genetic and sequence data, the rose genome assembly is one of the most contiguous obtained so far for a plant genome.
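The congruence measure mentioned above can be sketched as a Pearson correlation between the genetic-map position and the physical position of each marker on a pseudo-chromosome. The page does not describe how the coefficients were actually computed, so the following is only a generic illustration: the function name `pearson` and the marker positions are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical markers on one pseudo-chromosome (NOT real K5 map data):
genetic_cm = [0.0, 5.0, 12.0, 20.0, 31.0]    # genetic positions in cM
physical_mb = [0.5, 4.0, 9.8, 30.1, 55.0]    # physical positions in Mb
print(pearson(genetic_cm, physical_mb))
```

A coefficient close to 1 indicates that marker order along the sequence follows the genetic map; a value close to -1 would suggest that the sequence is in the reverse orientation relative to the map.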
Annotation
The genome encodes 36,377 inferred protein-coding genes and 3,971 long non-coding RNAs. Annotation assessment with the Plantae BUSCO v2 dataset identified 96.5% complete gene models. Based on transcriptomic data from pooled tissues, 207 miRNA precursors were predicted.
Repeats were annotated with the Ensembl Genomes repeat feature pipeline. There are: 744,291 low complexity (Dust) features, covering 22 Mb (4.3% of the genome); 187,134 RepeatMasker features (with the REdat library), covering 64 Mb (12.4% of the genome); 8,522 RepeatMasker features (with the RepBase library), covering 1 Mb (0.2% of the genome); 312,755 tandem repeat (TRF) features, covering 28 Mb (5.5% of the genome); and Repeat Detector repeats with a total length of 247.9 Mb (48% of the genome).
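The genome percentages above can be approximately reproduced by dividing each covered length by the total assembly length. The sketch below assumes that the denominator is the Golden Path Length listed in the Statistics section (515,588,973 bp), which this page does not state explicitly. Because the covered lengths are quoted here in rounded megabases, recomputed values may differ from the quoted ones in the last digit. The names `GOLDEN_PATH_BP` and `percent_of_genome` are hypothetical.

```python
# Assumption: percentages are relative to the Golden Path Length
# given in the Statistics table on this page.
GOLDEN_PATH_BP = 515_588_973


def percent_of_genome(covered_bp, genome_bp=GOLDEN_PATH_BP):
    """Return the percentage of the genome covered by covered_bp bases."""
    if genome_bp <= 0:
        raise ValueError("genome length must be positive")
    return 100.0 * covered_bp / genome_bp


# Covered lengths as quoted in the text, converted from Mb to bp.
repeat_classes = {
    "Dust": 22e6,
    "RepeatMasker (REdat)": 64e6,
    "RepeatMasker (RepBase)": 1e6,
    "TRF": 28e6,
    "Repeat Detector": 247.9e6,
}

for name, covered in repeat_classes.items():
    print(f"{name}: {percent_of_genome(covered):.1f}%")
```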
References
The Rosa genome provides new insights into the domestication of modern roses.
Olivier Raymond, Jérôme Gouzy, Jérémy Just et al. 2018. Nature Genetics. 50(6):772-777.
Picture credit: Wikimedia Commons
More information
General information about this species can be found in Wikipedia.
Statistics
Summary
Assembly | RchiOBHm-V2, INSDC Assembly GCA_002994745.2 |
Database version | 113.1 |
Golden Path Length | 515,588,973 |
Genebuild by | INRA/CNRS |
Genebuild method | Import |
Data source | INRA/CNRS |
Gene counts
Coding genes | 45,464 |
Non coding genes | 4,969 |
Small non coding genes | 4,969 |
Pseudogenes | 84 |
Gene transcripts | 50,517 |