Juglans regia (Walnut_2.0)

Juglans regia Assembly and Gene Annotation

About Juglans regia

The Persian walnut (Juglans regia L.), a diploid species (2n=32) native to the mountainous regions of Central Asia, is the major walnut species cultivated for nut production and is one of the most widespread tree nut species in the world. It belongs to the Juglandaceae family and has a genome size of 620-667 Mbp. The high nutritional value of J. regia nuts is associated with a rich array of polyphenolic compounds. This genome sequence was obtained from the cultivar Chandler.

Assembly

A total of 3.7M Illumina super-reads were produced with the MaSuRCA assembler and then combined with 7M (35x) Oxford Nanopore Technologies long reads. The resulting mega-reads were then combined into a hybrid assembly, which comprised 1,498 scaffolds, 258 contigs, and 25,007 old scaffolds from Chandler v1.0 [2]. To improve the assembly further and build chromosome-scale scaffolds, Hi-C sequencing was applied. The HiRise scaffolding pipeline processed 356M paired-end Illumina reads to generate the final assembly.

To assess the quality of the HiRise assembly, two genetic maps were used [3,4]. Almost perfect collinearity was observed between the HiRise assembly and both Chandler maps. Therefore, the HiRise scaffolds were oriented, ordered, and named accordingly, generating the final 16 chromosomal pseudomolecules.

Annotation

Full-length transcripts from single-molecule real-time sequencing were used to predict 37,554 gene models, with a mean gene length greater than that of the previous v1.0 gene annotation. Most of the new protein-coding genes (90%) have both start and stop codons, a significant improvement over Chandler v1.0. Of the 40,884 transcripts identified, 84% were multi-exonic, with 5.9 exons each on average. In addition, 2,801 gene models had two to four transcript isoforms each, with a mean length of 9,389 bp.
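The counts above imply a few derived figures that are not stated directly. The following is a minimal sketch of that arithmetic in Python; the variable names are illustrative, and because the 84% figure is rounded, the derived transcript counts are approximate.

```python
# Figures quoted in the Annotation paragraph above.
GENE_MODELS = 37_554
TRANSCRIPTS = 40_884
MULTI_EXONIC_FRACTION = 0.84  # rounded in the source, so results are approximate

# Average number of transcripts per predicted gene model.
transcripts_per_gene = TRANSCRIPTS / GENE_MODELS

# Approximate split between multi-exonic and single-exon transcripts.
multi_exonic_transcripts = round(TRANSCRIPTS * MULTI_EXONIC_FRACTION)
single_exon_transcripts = TRANSCRIPTS - multi_exonic_transcripts

print(f"Transcripts per gene model: {transcripts_per_gene:.2f}")
print(f"Multi-exonic transcripts (approx.): {multi_exonic_transcripts:,}")
print(f"Single-exon transcripts (approx.): {single_exon_transcripts:,}")
```

The ratio of just over one transcript per gene model is consistent with only a small share of gene models (2,801) carrying multiple isoforms.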

Repeated sequences were annotated with the Repeat Detector and the Ensembl Genomes repeat feature pipeline. There are: 1,215,333 Low complexity (Dust) features, covering 52 Mb (9.0% of the genome); 213,484 RepeatMasker features (with the nrTEplants library), covering 79 Mb (13.8% of the genome); 473,692 Tandem repeats (TRF) features, covering 37 Mb (6.5% of the genome).
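The coverage percentages can be checked against the Golden Path Length reported in the Statistics section (572,786,133 bp). The sketch below assumes that this length is the denominator used for the percentages, which the page does not state explicitly; the dictionary and function names are illustrative. Because the Mb values are rounded, the results match the quoted percentages only to within about 0.1 percentage points.

```python
# Golden Path Length from the Statistics section, in base pairs.
GENOME_LENGTH_BP = 572_786_133

# Rounded coverage figures (Mb) quoted in the paragraph above.
REPEAT_COVERAGE_MB = {
    "Low complexity (Dust)": 52,
    "RepeatMasker (nrTEplants)": 79,
    "Tandem repeats (TRF)": 37,
}


def coverage_percent(covered_mb, genome_bp=GENOME_LENGTH_BP):
    """Return the percentage of the genome spanned by `covered_mb` megabases."""
    return 100.0 * covered_mb * 1_000_000 / genome_bp


percentages = {
    name: coverage_percent(mb) for name, mb in REPEAT_COVERAGE_MB.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct:.1f}% of the genome")
```

The three classes are reported separately and their features may overlap, so the percentages should not be summed to estimate total repeat content.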

Picture credit: Thecupermat [CC BY-SA 3.0]

Statistics

Summary

Assembly: Walnut_2.0, INSDC Assembly GCA_001411555.2, Jul 2020
Database version: 111.1
Golden Path Length: 572,786,133
Genebuild by: UC Davis
Genebuild method: External annotation import
Data source: UCDavis

Gene counts

Coding genes: 40,487
Gene transcripts: 41,077