Glycine soja (Wild soybean) (ASM419377v2)

Glycine soja (Wild soybean) Assembly and Gene Annotation

About Glycine soja

Glycine soja, known as wild soybean, is an annual plant in the legume family. It is the closest living relative of the cultivated soybean, Glycine max, an important crop, and may be treated either as a separate species or as a subspecies of the cultivated soybean, Glycine max subsp. soja. The plant is native to eastern China, Japan, Korea and far-eastern Russia.

Assembly

Whole-genome sequencing with PacBio long reads and Illumina short reads was used to assemble a high-quality reference genome for wild soybean accession W05 with long contigs and high sequence fidelity. PacBio subreads (85.5 Gb) were error-corrected and de novo assembled into primary contigs. The sequences were then polished with PacBio subreads and Illumina paired-end reads (101.3 Gb). The polished contigs total 989.7 Mb in 2281 sequences, with a contig N50 (the length such that contigs of this length or longer cover 50% of the assembly) of 2.0 Mb. Details of the assembly procedures can be found in the Methods section of the publication cited below.
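The N50 statistic quoted above can be illustrated with a short Python sketch. This is a minimal example of the standard definition, not the procedure used by the assembly authors; the function name `n50` and the toy contig lengths are hypothetical and are not W05 data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembled length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down and stop once half is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical values, for illustration only).
toy_contigs = [8, 5, 4, 3, 2, 1]
toy_n50 = n50(toy_contigs)
print(toy_n50)
```

For the toy input, the total is 23, and the two longest contigs (8 + 5 = 13) already cover more than half of it, so the sketch reports 5.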

To anchor the polished contigs onto chromosomes with high accuracy, two complementary technologies, optical mapping (OM) and Hi-C sequencing, were employed. Based on the optical contigs generated with the nickases Nt.BspQI and Nb.BssSI, two-enzyme hybrid scaffolding was performed to generate OM-sequence hybrid scaffolds (hybrid scaffolds). The hybrid scaffolds comprised 1438 sequences with a total length of 1019.8 Mb and a scaffold N50 of 13.9 Mb. In addition, Hi-C contact frequencies derived from Hi-C sequencing were used to order and orient the polished contigs into Hi-C scaffolds. The resulting Hi-C scaffolds comprised 1161 sequences, with a total length of 989.8 Mb and a scaffold N50 of 48.5 Mb. Superscaffolds were generated by merging the hybrid scaffolds and the Hi-C scaffolds.

After gap filling and polishing, the final assembly for W05 is 1013.2 Mb in length, with 988.6 Mb of unambiguous bases. In total, 95.7% of the sequence is anchored to 20 superscaffolds, corresponding to the 20 chromosomes, whereas 43.6 Mb in 1098 contigs remains unplaced. The contig N50 of the final assembly is 3.3 Mb. The longest contig of the W05 final assembly is 23.2 Mb in length, spanning 47.7% of chromosome 6. The contiguity of the W05 assembly represents an approximately 17-fold improvement over the current reference genome Wm82_v2 and is similar in quality to the recently published Chinese cultivated soybean reference genome ZH135. Contig N50, scaffold N50, and total assembled genome size were also compared for other soybean genome assemblies, but these genomes were not included in subsequent analysis because they are highly fragmented.
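The anchored fraction follows directly from the figures in this paragraph. The sketch below redoes that arithmetic in Python; the variable and function names are hypothetical, and the inputs are the rounded values quoted in the text, so the results are only accurate to about one decimal place.

```python
# Figures quoted in the text above (in Mb, already rounded).
total_mb = 1013.2        # final W05 assembly length
unambiguous_mb = 988.6   # unambiguous (non-gap) bases
unplaced_mb = 43.6       # sequence in the 1098 unplaced contigs


def anchored_percent(total, unplaced):
    """Percentage of the assembly placed on chromosomes."""
    return 100.0 * (total - unplaced) / total


anchored_pct = round(anchored_percent(total_mb, unplaced_mb), 1)
# Length made up of ambiguous bases (gaps), by subtraction.
ambiguous_mb = round(total_mb - unambiguous_mb, 1)

print(anchored_pct, ambiguous_mb)
```

With these inputs the anchored percentage rounds to 95.7, matching the value stated in the text, and the remaining ambiguous sequence comes to about 24.6 Mb.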

Annotation

Protein-coding genes and alternatively spliced isoforms were annotated by combining evidence from RNA-seq and PacBio IsoSeq transcript mapping, homology-based protein mapping, and ab initio prediction. In total, 234.7 Gb of Illumina RNA-seq reads were collected from 31 samples at various developmental and physiological stages. PacBio IsoSeq libraries were constructed to generate 414,750 full-length, non-chimeric transcripts. In total, 89,477 protein-coding transcripts were annotated for 55,539 gene loci, with 69,455 transcripts (77.6%) having a 5′-untranslated region (UTR) and 71,271 transcripts (79.7%) having a 3′-UTR.
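The UTR percentages above can be reproduced from the transcript counts. The following Python sketch does so and also derives the average number of annotated transcripts per gene locus, a figure that is not stated in the text; all names in the code are hypothetical.

```python
# Counts quoted in the annotation paragraph above.
transcripts = 89_477   # protein-coding transcripts
gene_loci = 55_539     # gene loci
with_5utr = 69_455     # transcripts with a 5'-UTR
with_3utr = 71_271     # transcripts with a 3'-UTR


def percent(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100.0 * part / whole, 1)


pct_5utr = percent(with_5utr, transcripts)
pct_3utr = percent(with_3utr, transcripts)
# Derived value, not given in the source: mean transcripts per locus.
isoforms_per_locus = round(transcripts / gene_loci, 2)

print(pct_5utr, pct_3utr, isoforms_per_locus)
```

The two percentages come out as 77.6 and 79.7, in agreement with the text, and the derived average is roughly 1.61 transcripts per locus.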

  • A reference-grade wild soybean genome.
    Xie M, Chung CY, Li MW, Wong FL, Wang X, Liu A, Wang Z, Leung AK, Wong TH, Tong SW, Xiao Z, Fan K, Ng MS, Qi X, Yang L, Deng T, He L, Chen L, Fu A, Ding Q, He J, Chung G, Isobe S, Tanabata T, Valliyodan B, Nguyen HT, Cannon SB, Foyer CH, Chan TF, Lam HM. Nat Commun 10 (1)


Statistics

Summary

Assembly: ASM419377v2, INSDC Assembly GCA_004193775.2
Database version: 113.1
Golden Path Length: 1,013,766,566
Genebuild by: NCBI
Genebuild method: Import
Data source: Chinese University of Hong Kong

Gene counts

Coding genes: 47,201
Non coding genes: 6,773
Small non coding genes: 2,837
Long non coding genes: 3,936
Pseudogenes: 5,391
Gene transcripts: 87,380
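
As a quick consistency check on the gene counts, the Python sketch below parses the displayed figures and tests whether the small and long non-coding categories add up to the non-coding total. That the total is the sum of those two categories is an assumption suggested by the numbers, not something the page states; the dictionary and helper names are hypothetical.

```python
# Gene counts as displayed on the page, with thousands separators.
displayed = {
    "Coding genes": "47,201",
    "Non coding genes": "6,773",
    "Small non coding genes": "2,837",
    "Long non coding genes": "3,936",
    "Pseudogenes": "5,391",
    "Gene transcripts": "87,380",
}


def to_int(text):
    """Convert a figure such as '47,201' to an integer."""
    return int(text.replace(",", ""))


counts = {name: to_int(value) for name, value in displayed.items()}

# Assumption: the non-coding total is the sum of small and long categories.
noncoding_sum = counts["Small non coding genes"] + counts["Long non coding genes"]
print(noncoding_sum == counts["Non coding genes"])
```

The check passes for the figures shown (2,837 + 3,936 = 6,773).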