Triticum aestivum Assembly and Gene Annotation

For information about the assembly and annotation please view the IWGSC announcement.

The previous wheat assembly (TGACv1) and every other plant from release 31 are available in the new Ensembl Plants archive site.

About Triticum aestivum

Triticum aestivum (bread wheat) is a major global cereal grain essential to human nutrition. Wheat was one of the first cereals to be domesticated, originating in the Fertile Crescent around 7000 years ago. Bread wheat is hexaploid, with a genome size estimated at ~17 Gbp, composed of three closely related and independently maintained genomes that are the result of a series of naturally occurring hybridization events. The ancestral progenitor genomes are considered to be Triticum urartu (the A-genome donor) and an unknown grass thought to be related to Aegilops speltoides (the B-genome donor). The first hybridization event, between these two progenitors, produced tetraploid emmer wheat (AABB, T. dicoccoides), which then hybridized with Aegilops tauschii (the D-genome donor) to produce modern bread wheat.

Assembly

Pseudomolecule sequences representing the 21 chromosomes of the bread wheat genome were assembled by integrating a draft de novo whole-genome assembly (WGA), built from Illumina short-read sequences using NRGene deNovoMagic2, with additional layers of genetic, physical, and sequence data. In the resulting 14.5-Gb genome assembly, contigs and scaffolds with N50s of 52 kb and 7 Mb, respectively, were linked into superscaffolds (N50 = 22.8 Mb), with 97% (14.1 Gb) of the sequences assigned and ordered along the 21 chromosomes and almost all of the assigned sequence scaffolds oriented relative to each other (13.8 Gb, 98%). Unanchored scaffolds comprising 481 Mb (2.8% of the assembly length) formed the “unassigned chromosome” (ChrUn) bin. The quality and contiguity of the IWGSC RefSeq v1.0 genome assembly were assessed through alignments with radiation hybrid maps for the A, B, and D subgenomes [average Spearman’s correlation coefficient (r) of 0.98], the genetic positions of 7832 and 4745 genotyping-by-sequencing derived genetic markers in 88 double haploid and 993 recombinant inbred lines (Spearman’s r of 0.986 and 0.987, respectively), and 1.24 million pairs of neighbor insertion site–based polymorphism (ISBP) markers, of which 97% were collinear and mapped in a similar size range (difference of less than 2 kb) between the de novo WGA and the available bacterial artificial chromosome (BAC)–based sequence assemblies. Finally, IWGSC RefSeq v1.0 was assessed with independent data derived from coding and noncoding sequences, revealing that 99 and 98% of the previously known coding exons and transposable element (TE)–derived (ISBP) markers (table S9), respectively, were present in the assembly. The approximate 1-Gb size difference between IWGSC RefSeq v1.0 and the new genome size estimates of 15.4 to 15.8 Gb can be accounted for by collapsed or unassembled sequences of highly repeated clusters, such as ribosomal RNA coding regions and telomeric sequences.[1]
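The N50 values quoted above summarise assembly contiguity: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The following is a minimal Python sketch of that calculation. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from the IWGSC pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches at least
    half of the summed length of all sequences.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running against total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Toy scaffold lengths (hypothetical values, arbitrary units).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # total is 300; 80 + 70 reaches 150, so N50 is 70
```

Applied separately to contig, scaffold, and superscaffold lengths, the same calculation gives the three kinds of N50 reported for the assembly.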

Annotation

Gene models were predicted with two independent pipelines previously utilized for wheat genome annotation and then consolidated to produce the RefSeq Annotation v1.0. Subsequently, a set of manually curated gene models was integrated to build RefSeq Annotation v1.1. In total, 107,891 high-confidence (HC) protein-coding loci were identified, with relatively equal distribution across the A, B, and D subgenomes. In addition, 161,537 other protein-coding loci were classified as low-confidence (LC) genes, representing partially supported gene models, gene fragments, and orphans. A predicted function was assigned to 82.1% (90,919) of HC genes in RefSeq Annotation v1.0 (tables S19 and S20 in [1]), and evidence for transcription was found for 85% (94,114) of the HC genes versus 49% of the LC genes [1].

In addition, 98,270 high-confidence genes from the TGACv1 annotation [3] were aligned to the assembly using Exonerate. Up to three alignments are displayed for each gene, comprising 196,243 alignments in total, of which 90,686 are protein coding.

Variation

Data from CerealsDB [5]

768,664 markers from the 820K Axiom SNP array from CerealsDB were aligned to the assembly.
CerealsDB [5] performed the alignment using BLAST with a cutoff of 1e-05. The top three hits were parsed and compared to CerealsDB genetic map data. Where two or more of the top three hits had an identical score, the hit that agreed with the genetic map was selected. Where no genetic map information was available for a SNP, the top hit was selected.
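The hit-selection rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the CerealsDB implementation: the `Hit` type, the `select_hit` function, and the use of a chromosome name as the genetic map evidence are all hypothetical. The sketch also assumes that "identical score" means a tie with the best-scoring hit, and that the top hit is kept whenever there is no tie or no tied hit matches the map, since the text does not spell out those cases.

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    """One alignment hit for a SNP marker (hypothetical structure)."""
    chromosome: str
    position: int
    score: float


def select_hit(hits: Sequence[Hit], map_chromosome: Optional[str]) -> Optional[Hit]:
    """Pick one placement for a marker from its alignment hits.

    map_chromosome is the chromosome given by the genetic map,
    or None when the marker has no genetic map information.
    """
    if not hits:
        return None
    # Keep only the three best-scoring hits (stable sort keeps input order on ties).
    top = sorted(hits, key=lambda h: h.score, reverse=True)[:3]
    best = top[0]
    if map_chromosome is None:
        # No genetic map information: take the top hit.
        return best
    # Assumption: ties are judged against the best score.
    tied = [h for h in top if h.score == best.score]
    if len(tied) >= 2:
        for hit in tied:
            if hit.chromosome == map_chromosome:
                return hit
    # Assumption: fall back to the top hit in every other case.
    return best


example_hits = [
    Hit("1A", 100, 200.0),
    Hit("1B", 150, 200.0),
    Hit("1D", 90, 180.0),
]
print(select_hit(example_hits, "1B"))
print(select_hit(example_hits, None))
```

In a hexaploid genome a marker sequence often aligns equally well to its homoeologous copies on the A, B, and D subgenomes, which is why a tie-break based on independent genetic map evidence is useful.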

EMS Mutation data [2]

This dataset contains EMS-type variants from sequenced tetraploid (cv ‘Kronos’) and hexaploid (cv ‘Cadenza’) TILLING populations. Mutations were called on the IWGSC RefSeq v1.0 assembly using the DRAGEN system [4].

  • 4.4 million Kronos mutations
  • 9.0 million Cadenza mutations
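EMS (ethyl methanesulfonate) mutagenesis predominantly induces G-to-A transitions, which appear as C-to-T changes when read on the opposite strand. A minimal Python sketch of a check for such EMS-type substitutions is shown below. The function name `is_ems_type` is a hypothetical helper for illustration and does not describe how the variants in this dataset were actually filtered.

```python
# The two substitutions typical of EMS mutagenesis, written as (reference, alternate).
EMS_TRANSITIONS = {("G", "A"), ("C", "T")}


def is_ems_type(ref: str, alt: str) -> bool:
    """Return True if a single-base substitution is a G>A or C>T transition."""
    return (ref.upper(), alt.upper()) in EMS_TRANSITIONS


print(is_ems_type("G", "A"))
print(is_ems_type("A", "G"))
```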

Researchers and breeders can search this database online, identify mutations in the different copies of their target gene, and request seeds to study gene function or improve wheat varieties. Seeds can be requested from the UK SeedStor (https://www.seedstor.ac.uk/shopping-cart-tilling.php) or from the US-based Dubcovsky lab (http://dubcovskylab.ucdavis.edu/wheat-tilling).

This resource was generated as part of a joint project between the University of California Davis, Rothamsted Research, The Earlham Institute, and the John Innes Centre.

References

  1. Shifting the limits in wheat research and breeding using a fully annotated reference genome.
    Rudi Appels, Kellye Eversole, Catherine Feuillet, Beat Keller, Jane Rogers, Nils Stein ... Hana Šimková, Ian Small, Manuel Spannagl, David Swarbreck, Cristobal Uauy. 2018. Science. 361
  2. Uncovering hidden variation in polyploid wheat.
    Ksenia V. Krasileva, Hans A. Vasquez-Gross, Tyson Howell, Paul Bailey, Francine Paraiso, Leah Clissold, James Simmonds, Ricardo H. Ramirez-Gonzalez et al. 2016. PNAS. 114:E913-E921.
  3. An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations.
    Bernardo J. Clavijo, Luca Venturini, Christian Schudoma, Gonzalo Garcia Accinelli, Gemy Kaithakottil, Jonathan Wright, Philippa Borrill, George Kettleborough, Darren Heavens, Helen Chapman et al. 2017. Genome Research.
  4. Ultra-Fast Next Generation Human Genome Sequencing Data Processing Using DRAGEN™ Bio-IT Processor for Precision Medicine.
    Goyal, A., Kwon, H.J., Lee, K., Garg, R., Yun, S.Y. et al. 2017. Open Journal of Genetics. 7:9-19.
  5. CerealsDB 3.0: expansion of resources and data integration.
    Wilkinson PA, Winfield MO, Barker GL, Tyrrell S, Bian X, Allen AM, Burridge A, Coghill JA, Waterfall C, Caccamo M et al. 2016. BMC Bioinformatics. 17:256.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly: IWGSC, INSDC Assembly GCA_900519105.1, Jul 2018
Database version: 94.4
Base Pairs: 14,547,261,565
Golden Path Length: 14,547,261,565
Genebuild by: IWGSC
Genebuild method: Imported from IWGSC
Data source: International Wheat Genome Sequencing Consortium

Gene counts

Coding genes: 107,891
Non coding genes: 12,853
Small non coding genes: 12,491
Long non coding genes: 362
Gene transcripts: 146,597

Other

Short Variants: 14,142,687
