Triticum aestivum Assembly and Gene Annotation

For information about the assembly and annotation please view the IWGSC announcement.

The previous wheat assembly (TGACv1) and every other plant from release 31 are available in the new Ensembl Plants archive site.

About Triticum aestivum

Triticum aestivum (bread wheat) is a major global cereal grain essential to human nutrition. Wheat was one of the first cereals to be domesticated, originating in the Fertile Crescent around 7000 years ago. Bread wheat is hexaploid, with a genome size estimated at ~17 Gb, composed of three closely related and independently maintained genomes that are the result of a series of naturally occurring hybridisation events. The ancestral progenitor genomes are considered to be Triticum urartu (the A-genome donor) and an unknown grass thought to be related to Aegilops speltoides (the B-genome donor). This first hybridisation event produced tetraploid emmer wheat (AABB, T. dicoccoides), which hybridised again with Aegilops tauschii (the D-genome donor) to produce modern bread wheat.

Assembly

The wheat genome was assembled by the International Wheat Genome Sequencing Consortium. Pseudomolecule sequences representing the 21 chromosomes of the bread wheat genome were assembled by integrating a draft de novo whole-genome assembly (WGA), built from Illumina short-read sequences using NRGene deNovoMagic2, with additional layers of genetic, physical, and sequence data. In the resulting 14.5-Gb genome assembly, contigs and scaffolds with N50s of 52 kb and 7 Mb, respectively, were linked into superscaffolds (N50 = 22.8 Mb), with 97% (14.1 Gb) of the sequences assigned and ordered along the 21 chromosomes and almost all of the assigned sequence scaffolds oriented relative to each other (13.8 Gb, 98%). Unanchored scaffolds comprising 481 Mb (2.8% of the assembly length) formed the “unassigned chromosome” (ChrUn) bin. The quality and contiguity of the IWGSC RefSeq v1.0 genome assembly were assessed through alignments with radiation hybrid maps for the A, B, and D subgenomes [average Spearman’s correlation coefficient (r) of 0.98], the genetic positions of 7832 and 4745 genotyping-by-sequencing derived genetic markers in 88 double haploid and 993 recombinant inbred lines (Spearman’s r of 0.986 and 0.987, respectively), and 1.24 million pairs of neighbour insertion site–based polymorphism (ISBP) markers, of which 97% were collinear and mapped in a similar size range (difference of less than 2 kb) between the de novo WGA and the available bacterial artificial chromosome (BAC)–based sequence assemblies. Finally, IWGSC RefSeq v1.0 was assessed with independent data derived from coding and noncoding sequences, revealing that 99 and 98% of the previously known coding exons and transposable element (TE)–derived (ISBP) markers (table S9), respectively, were present in the assembly. 
The difference of approximately 1 Gb between IWGSC RefSeq v1.0 and the new genome size estimates of 15.4 to 15.8 Gb can be accounted for by collapsed or unassembled sequences of highly repeated clusters, such as ribosomal RNA coding regions and telomeric sequences.[1] This assembly is of the reference wheat cultivar Chinese Spring.
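As an illustration of the contiguity statistic quoted above, the following minimal Python sketch computes an N50 from a list of sequence lengths. The function name and the toy lengths are hypothetical and are not IWGSC data; the last two lines only re-derive the 97% and 98% figures from the rounded sizes given in the text.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy scaffold lengths (made-up values, not taken from the assembly).
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))  # 70: 80 + 70 reaches half of the 300 total

# Re-deriving the percentages from the rounded sizes quoted in the text.
print(round(14.1 / 14.5 * 100))  # share of the assembly assigned to chromosomes
print(round(13.8 / 14.1 * 100))  # share of the assigned sequence that is oriented
```

Sorting from longest to shortest and accumulating until half of the total is reached is the usual way to compute this statistic; the same function applies to contigs, scaffolds or superscaffolds.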

Annotation

Gene models were predicted with two independent pipelines previously utilised for wheat genome annotation and then consolidated to produce the RefSeq Annotation v1.0. Subsequently, a set of manually curated gene models was integrated to build RefSeq Annotation v1.1. In total, 107,891 high-confidence (HC) protein-coding loci were identified, with relatively equal distribution across the A, B, and D subgenomes. In addition, 161,537 other protein-coding loci were classified as low-confidence (LC) genes, representing partially supported gene models, gene fragments, and orphans. A predicted function was assigned to 82.1% (90,919) of HC genes in RefSeq Annotation v1.0 (tables S19 and S20), and evidence for transcription was found for 85% (94,114) of the HC genes versus 49% of the LC genes.[1]


98,270 high-confidence genes from the TGACv1 annotation [3] were aligned to the assembly using Exonerate. Up to three alignments are displayed for each gene, comprising 196,243 alignments in total, of which 90,686 are protein coding.
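The "up to three alignments per gene" rule can be sketched as follows. This is a hedged illustration only: the text does not say how the displayed alignments were chosen, so ranking by alignment score is an assumption, and the function name, tuple layout and example identifiers are hypothetical.

```python
from collections import defaultdict


def top_alignments(alignments, max_per_gene=3):
    """Keep at most `max_per_gene` alignments for each gene.

    `alignments` is an iterable of (gene_id, location, score) tuples.
    Ranking by score is an assumption made for this sketch.
    """
    by_gene = defaultdict(list)
    for gene_id, location, score in alignments:
        by_gene[gene_id].append((location, score))

    kept = {}
    for gene_id, hits in by_gene.items():
        # Highest score first, then truncate to the display limit.
        hits.sort(key=lambda hit: hit[1], reverse=True)
        kept[gene_id] = hits[:max_per_gene]
    return kept


# Hypothetical input: one gene with four alignments, one with a single alignment.
example = [
    ("geneA", "1A:100", 90.0),
    ("geneA", "1B:120", 95.0),
    ("geneA", "1D:90", 80.0),
    ("geneA", "Un:5", 10.0),
    ("geneB", "2A:7", 50.0),
]
for gene_id, hits in top_alignments(example).items():
    print(gene_id, hits)
```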

Each high-confidence coding gene is externally linked to KnetMiner [7] and WheatExpression [8,9].

Variation

Data from CerealsDB

31,779 (out of 35,143) markers from the 35K Axiom SNP array and 768,664 (out of 819,570) markers from the 820K Axiom SNP array were aligned to the assembly. The alignment was performed by CerealsDB using 101 bp sequences with the SNP located centrally at position 51. The Blastn E-value cutoff was set to 1e-05. The top three hits were parsed and compared to genetic map data. Where two or more of the top three hits had an identical score, the hit that agreed with the genetic map was selected. Where no genetic map information was available for a SNP, the top hit was selected. Markers that failed to align to the assembly, aligned only to the chrUn (unassigned) contigs, or could not be unambiguously assigned to a particular chromosome were removed.
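The hit-selection rules described above can be sketched in Python. This is an interpretation, not the CerealsDB implementation: the `Hit` class and `choose_hit` function are hypothetical names, chrUn hits are discarded up front as a simplification, only ties at the best score are considered, and a tie that the genetic map cannot resolve is treated as an ambiguous marker and dropped.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Hit:
    chromosome: str  # e.g. "1A"; "Un" stands for the unassigned contigs
    position: int
    score: float


def choose_hit(hits: Sequence[Hit], map_chromosome: Optional[str] = None) -> Optional[Hit]:
    """Pick one placement for a marker, or None if the marker is removed."""
    # Assumption: hits on unassigned contigs never yield a placement.
    placed = [h for h in hits if h.chromosome != "Un"]
    top = sorted(placed, key=lambda h: h.score, reverse=True)[:3]
    if not top:
        return None  # no alignment, or chrUn-only alignment

    tied = [h for h in top if h.score == top[0].score]
    if len(tied) > 1 and map_chromosome is not None:
        # Tie at the best score: prefer the hit that agrees with the map.
        agreeing = [h for h in tied if h.chromosome == map_chromosome]
        # Assumption: anything other than exactly one agreeing hit is ambiguous.
        return agreeing[0] if len(agreeing) == 1 else None

    # Unique best hit, or no genetic map information: take the top hit.
    return top[0]


# Hypothetical marker with two equally good hits on homoeologous chromosomes.
hits = [Hit("1A", 100, 200.0), Hit("1B", 150, 200.0), Hit("1D", 90, 180.0)]
print(choose_hit(hits, map_chromosome="1B"))
print(choose_hit(hits))
```

In a hexaploid genome, near-identical hits on the A, B and D homoeologues are expected, which is why the genetic map is used as the tie-breaker rather than the alignment score alone.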

The 35K set includes genotype data from 1,963 samples while the 820K has 475 samples. In cases where a marker belongs to both sets, genotype data from both 820K and 35K samples will be displayed. [5]

EMS Mutation data

EMS-type variants from sequenced tetraploid (cv ‘Kronos’) and hexaploid (cv ‘Cadenza’) TILLING populations are included. Mutations were called on the IWGSC RefSeq v1.0 assembly using the Dragen system [4].

  • 3.9 million Kronos mutations
  • 8.2 million Cadenza mutations

Researchers and breeders can search this database online, identify mutations in the different copies of their target gene, and request seeds to study gene function or improve wheat varieties. Seeds can be requested from the UK SeedStor or from the US-based Dubcovsky lab. This resource was generated as part of a joint project between the University of California Davis, Rothamsted Research, The Earlham Institute, and the John Innes Centre. [2]
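For readers unfamiliar with the term "EMS-type", the sketch below shows the usual definition: EMS mutagenesis predominantly produces G>A and C>T transitions. The function name is hypothetical, and the actual filters applied when building this dataset may well be more involved than this check.

```python
# EMS alkylates guanine, giving G>A changes on one strand and the
# complementary C>T changes on the other.
EMS_TYPE_CHANGES = {("G", "A"), ("C", "T")}


def is_ems_type(ref: str, alt: str) -> bool:
    """Return True if a single-base change is a canonical EMS-type transition."""
    return (ref.upper(), alt.upper()) in EMS_TYPE_CHANGES


print(is_ems_type("G", "A"))  # True
print(is_ems_type("c", "t"))  # True
print(is_ems_type("A", "G"))  # False
```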

Inter-Homoeologous Variants

3.6 million Inter-Homoeologous Variants (IHVs), called from alignments of the A, B and D component genomes, were added as SNP markers.
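The idea of an inter-homoeologous variant can be illustrated with a toy example. The sketch below is a deliberate simplification and not the pipeline used to produce this dataset: it assumes three already-aligned, equal-length homoeologous sequences, and the function name and the sequences are hypothetical.

```python
def inter_homoeolog_variants(aligned):
    """Report alignment columns where the subgenome bases differ.

    `aligned` maps a subgenome name ("A", "B", "D") to an aligned sequence.
    Columns containing a gap ("-") or an unknown base ("N") are skipped.
    Returns a list of (1-based column, {subgenome: base}) tuples.
    """
    names = sorted(aligned)
    seqs = [aligned[name].upper() for name in names]
    if len({len(seq) for seq in seqs}) != 1:
        raise ValueError("aligned sequences must have equal length")

    variants = []
    for column_number, column in enumerate(zip(*seqs), start=1):
        if "-" in column or "N" in column:
            continue
        if len(set(column)) > 1:
            variants.append((column_number, dict(zip(names, column))))
    return variants


# Toy aligned homoeologues (made-up sequences).
toy = {"A": "ACGTAC", "B": "ACGTTC", "D": "ACCTAC"}
for position, bases in inter_homoeolog_variants(toy):
    print(position, bases)
```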

SIFT scores

SIFT predicts whether an amino acid substitution affects protein function based on sequence homology and the physical properties of amino acids. SIFT can be applied to naturally occurring nonsynonymous polymorphisms and laboratory-induced missense mutations.

SIFT scores and predictions (whether each substitution is 'tolerated' or 'deleterious') have been calculated for all missense variants across all bread wheat variation datasets.
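As a rough guide to how a score maps to a prediction, the sketch below applies the commonly used SIFT cutoff of 0.05, with lower scores indicating a substitution that is less likely to be tolerated. The function name is hypothetical, and the treatment of a score of exactly 0.05 is an assumption, since descriptions of SIFT differ on that boundary.

```python
def sift_prediction(score: float, threshold: float = 0.05) -> str:
    """Translate a SIFT score (0 to 1) into a qualitative prediction.

    Scores below the threshold are reported as 'deleterious', all others
    as 'tolerated'. Boundary handling is an assumption of this sketch.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores range from 0 to 1")
    return "deleterious" if score < threshold else "tolerated"


print(sift_prediction(0.01))  # deleterious
print(sift_prediction(0.30))  # tolerated
```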

KASP markers

KASP markers designed to be genome-specific with PolyMarker [6] are displayed for the EMS-type variants. For details on how to interpret the annotation, please visit http://www.polymarker.info/about.

References

  1. Shifting the limits in wheat research and breeding using a fully annotated reference genome.
    Rudi Appels, Kellye Eversole, Catherine Feuillet, Beat Keller, Jane Rogers, Nils Stein, ... Hana Šimková, Ian Small, Manuel Spannagl, David Swarbreck, Cristobal Uauy. 2018. Science. 361.
  2. Uncovering hidden variation in polyploid wheat.
    Ksenia V. Krasileva, Hans A. Vasquez-Gross, Tyson Howell, Paul Bailey, Francine Paraiso, Leah Clissold, James Simmonds, Ricardo H. Ramirez-Gonzalez et al. 2016. PNAS. 114:E913-E921.
  3. An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations.
    Bernardo J. Clavijo, Luca Venturini, Christian Schudoma, Gonzalo Garcia Accinelli, Gemy Kaithakottil, Jonathan Wright, Philippa Borrill, George Kettleborough, Darren Heavens, Helen Chapman et al. 2017. Genome Research.
  4. Ultra-Fast Next Generation Human Genome Sequencing Data Processing Using DRAGEN™ Bio-IT Processor for Precision Medicine.
    Goyal, A., Kwon, H.J., Lee, K., Garg, R., Yun, S.Y. et al. 2017. Open Journal of Genetics. 7:9-19.
  5. CerealsDB 3.0: expansion of resources and data integration.
    Wilkinson PA, Winfield MO, Barker GL, Tyrrell S, Bian X, Allen AM, Burridge A, Coghill JA, Waterfall C, Caccamo M et al. 2016. BMC Bioinformatics. 17:256.
  6. PolyMarker: A fast polyploid primer design pipeline.
    Ricardo H. Ramirez-Gonzalez, Cristobal Uauy, Mario Caccamo. 2015. Bioinformatics. 31:2038-2039.
  7. Developing integrated crop knowledge networks to advance candidate gene discovery.
    Keywan Hassani-Pak, Martin Castellote, Maria Esch, Matthew Hindle, Artem Lysenko, Jan Taubert and Christopher Rawlings. 2016. Applied & Translational Genomics. 11:18-26.
  8. The transcriptional landscape of hexaploid wheat across tissues and cultivars.
    Ramírez-González RH, Borrill P, Lang D, Harrington SA, Brinton J, Venturini L, Davey M, Jacobs J, van Ex F, Pasha A et al. 2018. Science. 361.
  9. expVIP: a customisable RNA-seq data analysis and visualisation platform.
    Philippa Borrill, Ricardo Ramirez-Gonzalez, and Cristobal Uauy. 2016. Plant Physiology.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly: IWGSC, INSDC Assembly GCA_900519105.1, Jul 2018
Database version: 98.4
Base Pairs: 14,547,261,565
Golden Path Length: 14,547,261,565
Genebuild by: IWGSC
Genebuild method: Import
Data source: International Wheat Genome Sequencing Consortium

Gene counts

Coding genes: 107,891
Non coding genes: 12,853
Small non coding genes: 12,491
Long non coding genes: 362
Gene transcripts: 146,597

Other

Short Variants: 16,448,754
