Brassica juncea (ASM1870372v1)

Brassica juncea Assembly and Gene Annotation

About Brassica juncea

Brassica juncea (L.) Czern & Coss is a diverse and agriculturally important species. An allotetraploid (AABB, 2n = 36), B. juncea arose from interspecific hybridization between the diploid progenitors Brassica rapa (AA, 2n = 20) and Brassica nigra (BB, 2n = 16). Four subspecies have been proposed based on crop use and morphology: juncea (seed mustard), integrifolia (leaf mustard), napiformis (root mustard) and tumida (stem mustard). B. juncea has a wide geographic range, occurring as native plants, adapted crops and introduced weeds across Asia, Europe, Africa, America and Australia. It is an important oilseed crop in India, Bangladesh, China and Ukraine, and has recently been gaining importance in Canada and Australia. It is also grown as a condiment in Europe, North America, Argentina and China. Root mustard is distributed in Mongolia and northeastern China, whereas leaf mustards are most common in China and Southeast Asia. B. juncea is regarded as one of the earliest domesticated plants, with mustard mentioned as a condiment in Sanskrit and Sumerian texts from as early as 3,000 BC.
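
The chromosome arithmetic behind the AABB designation can be sketched in a few lines of Python. This is only an illustration of the numbers quoted above; the function and constant names are hypothetical, and it assumes the allotetraploid retains both complete parental chromosome sets.

```python
# Somatic (2n) chromosome numbers quoted in the text.
B_RAPA_2N = 20   # AA genome
B_NIGRA_2N = 16  # BB genome


def allotetraploid_counts(parent_a_2n, parent_b_2n):
    """Return (2n, n) for an allotetraploid, assuming both full parental sets are kept."""
    somatic = parent_a_2n + parent_b_2n
    haploid = somatic // 2
    return somatic, haploid


somatic, haploid = allotetraploid_counts(B_RAPA_2N, B_NIGRA_2N)
print(somatic, haploid)  # 36 18
```

The haploid number of 18 matches the 18 chromosome-scale scaffolds described in the Assembly section.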

Assembly

For de novo assembly of the SY genome, four sequencing and assembly technologies were integrated: PacBio long-read sequencing, Illumina short-read sequencing, BioNano optical mapping and Hi-C data. The SY genome size was estimated at 1,056.53 Mb by k-mer analysis, close to the 1,068 Mb estimated by flow cytometry. PacBio reads (~93×) were first assembled using FALCON, and the resulting contigs were corrected with Illumina reads (~130×) to generate the V.1 assembly. An optical consensus map, generated from 202-fold coverage of BioNano data, was then used to assemble 1,897 super-scaffolds with an N50 of 5.87 Mb (assembly V.2). These contigs were categorized and ordered into 18 chromosome-scale scaffolds using a 15,543-marker high-density linkage map. Finally, Hi-C data were used to confirm the pseudo-chromosomes, and 165 mis-joined contigs were manually adjusted in Juicebox. The final SY assembly captured 933.5 Mb of genome sequence, with 867.3 Mb (~92.9%) anchored to chromosomes, surpassing previous assemblies of stem and Indian mustard in genome size, contiguity and anchorage.
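
As a rough illustration of two figures quoted above, the following Python sketch computes an N50 from a list of sequence lengths and reproduces the anchored percentage. The function names and the toy length list are hypothetical and are not taken from the assembly itself; only the 867.3 Mb and 933.5 Mb values come from the text.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembled length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


def anchored_percent(anchored_mb, total_mb):
    """Percentage of the assembly placed on chromosomes."""
    return 100 * anchored_mb / total_mb


# Toy example with made-up scaffold lengths (Mb), not real data.
toy_lengths = [8, 6, 5, 3, 2, 1]
print(n50(toy_lengths))  # total 25, half 12.5; 8 + 6 = 14 crosses it -> 6

# Values quoted in the text for the final SY assembly.
print(round(anchored_percent(867.3, 933.5), 1))  # 92.9
```

Sorting in descending order and accumulating until half the total is reached is the usual way to define N50; an empty input is rejected rather than returning a misleading zero.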

Annotation

Of the 92,878 predicted gene models, 95.5% were functionally annotated in public databases. Based on alignment to known proteins and expression in at least one tissue type, 82,723 gene models were classified as high-confidence (HC) genes, with an average coding sequence length of ~1.13 kb and an average of five exons per gene, similar to predictions in other Brassica genomes (Supplementary Table 13). A total of 5,756 genes (6.96% of the HC genes) encoded putative transcription factors belonging to 58 different families.
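
The proportions in this paragraph can be checked with simple arithmetic. The Python sketch below uses only the counts quoted above; the variable and function names are hypothetical, and the high-confidence share it derives is a computed value rather than one stated in the source.

```python
PREDICTED_MODELS = 92_878   # predicted gene models
HIGH_CONFIDENCE = 82_723    # high-confidence (HC) genes
TF_GENES = 5_756            # putative transcription factor genes


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


# Share of HC genes that encode putative transcription factors.
tf_share = percent(TF_GENES, HIGH_CONFIDENCE)
print(round(tf_share, 2))  # 6.96

# Derived, not stated in the source: share of predicted models that are HC.
hc_share = percent(HIGH_CONFIDENCE, PREDICTED_MODELS)
print(round(hc_share, 1))  # 89.1
```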

Picture credit: Wikipedia

Statistics

Summary

Assembly: ASM1870372v1, INSDC Assembly GCA_018703725.1, Jun 2021
Database version: 111.1
Golden Path Length: 933,495,403
Genebuild method: External annotation import
Data source: HUNAU
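
The INSDC assembly accession listed above follows the usual pattern of a GCA (GenBank) or GCF (RefSeq) prefix, nine digits and a version suffix. A minimal Python sketch for splitting such an accession is shown below; the function name and the regular expression are assumptions written for illustration, not part of any Ensembl or INSDC tool.

```python
import re

# Assumed pattern: GCA or GCF, underscore, nine digits, dot, integer version.
ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def parse_assembly_accession(accession):
    """Split an assembly accession into (prefix, number, version)."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not a recognised assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    return prefix, number, int(version)


print(parse_assembly_accession("GCA_018703725.1"))  # ('GCA', '018703725', 1)
```

The numeric part is kept as a string so that the leading zero is preserved.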

Gene counts

Coding genes: 92,887
Gene transcripts: 92,887