Triticum aestivum Assembly and Gene Annotation

For information about the assembly and annotation please view the IWGSC announcement.

The previous wheat assembly (TGACv1) and every other plant from release 37 are available in the Ensembl Plants archive site.

About Triticum aestivum

Triticum aestivum (bread wheat) is a major global cereal grain essential to human nutrition. Wheat was one of the first cereals to be domesticated, originating in the Fertile Crescent around 7000 years ago. Bread wheat is hexaploid, with a genome size estimated at ~17 Gb, composed of three closely related and independently maintained genomes that are the result of a series of naturally occurring hybridisation events. The ancestral progenitor genomes are considered to be Triticum urartu (the A-genome donor) and an unknown grass thought to be related to Aegilops speltoides (the B-genome donor). This first hybridisation event produced tetraploid emmer wheat (AABB, T. dicoccoides), which hybridised again with Aegilops tauschii (the D-genome donor) to produce modern bread wheat.

Guidelines for gene nomenclature in wheat can be found in the 2013 edition of the Wheat Gene Catalogue, available in GrainGenes. The Wheat Gene Catalogue sets out the internationally agreed rules of nomenclature for wheat genes.

Assembly

The wheat genome was assembled by the International Wheat Genome Sequencing Consortium. Pseudomolecule sequences representing the 21 chromosomes of the bread wheat genome were assembled by integrating a draft de novo whole-genome assembly, built from Illumina short-read sequences using NRGene deNovoMagic2, with additional layers of genetic, physical, and sequence data. In the resulting 14.5-Gb genome assembly, contigs and scaffolds with N50s of 52 kb and 7 Mb, respectively, were linked into superscaffolds (N50 = 22.8 Mb), with 97% (14.1 Gb) of the sequences assigned and ordered along the 21 chromosomes and almost all of the assigned sequence scaffolds oriented relative to each other (13.8 Gb, 98%). Unanchored scaffolds comprising 481 Mb (2.8% of the assembly length) formed the "unassigned chromosome" (ChrUn) bin. The quality and contiguity of the IWGSC RefSeq v1.0 genome assembly were assessed through alignments with radiation hybrid maps for the A, B, and D subgenomes [average Spearman's correlation coefficient (r) of 0.98], the genetic positions of 7832 and 4745 genotyping-by-sequencing derived genetic markers in 88 double haploid and 993 recombinant inbred lines (Spearman's r of 0.986 and 0.987, respectively), and 1.24 million pairs of neighbour insertion site-based polymorphism (ISBP) markers, of which 97% were collinear and mapped in a similar size range (difference of less than 2 kb) between the de novo assembly and the available bacterial artificial chromosome (BAC)-based sequence assemblies. Finally, IWGSC RefSeq v1.0 was assessed with independent data derived from coding and noncoding sequences, revealing that 99 and 98% of the previously known coding exons and transposable element (TE)-derived (ISBP) markers (table S9), respectively, were present in the assembly. 
The approximately 1 Gb size difference between IWGSC RefSeq v1.0 and the new genome size estimates of 15.4 to 15.8 Gb can be accounted for by collapsed or unassembled sequences of highly repeated clusters, such as ribosomal RNA coding regions and telomeric sequences [1]. Centromere data can be viewed here.
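The contig, scaffold and superscaffold figures above are N50 values. As a reminder of what that statistic measures, here is a minimal sketch of an N50 calculation. The function name and the toy lengths are illustrative assumptions and are not taken from the IWGSC pipeline or the wheat assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembled length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy scaffold lengths (hypothetical values, not from the wheat assembly).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

Because N50 depends only on the sorted lengths, the same function applies equally to contigs, scaffolds or superscaffolds.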

This assembly is for the reference Chinese Spring wheat cultivar.

Annotation

Gene models were predicted with two independent pipelines previously utilised for wheat genome annotation and then consolidated to produce the RefSeq Annotation v1.0. Subsequently, a set of manually curated gene models was integrated to build RefSeq Annotation v1.1. In total, 107,891 high-confidence (HC) protein-coding loci were identified, with relatively equal distribution across the A, B, and D subgenomes. In addition, 161,537 other protein-coding loci were classified as low-confidence (LC) genes, representing partially supported gene models, gene fragments, and orphans. A predicted function was assigned to 82.1% (90,919) of HC genes in RefSeq Annotation v1.0 (tables S19 and S20), and evidence for transcription was found for 85% (94,114) of the HC genes versus 49% of the LC genes [1].

98,270 high-confidence genes from the TGACv1 annotation were aligned to the assembly using Exonerate. For each gene, up to three alignments are displayed, comprising 196,243 alignments, of which 90,686 are protein coding.

Each high-confidence coding gene is externally linked to KnetMiner [7] and WheatExpression [8,9].

Variation

Watkins Collection

~90 million variants were loaded from the Watkins Landrace Wheat Collection, a diverse assortment of heritage wheat varieties from the 1930s, preserved for its unique genetic potential in modern breeding [15].

Exome Capture Watkins

4,493,110 markers were loaded from 103 accessions of the Watkins landrace collection [13].

Data from CerealsDB

31,779 (out of 35,143) markers from the 35K Axiom SNP array and 768,664 (out of 819,570) markers from the 820K Axiom SNP array were aligned to the assembly. This was done by CerealsDB using 101 bp sequences with the SNP located centrally at position 51. The Blastn E-value cutoff was set to 1e-05. The top three hits were parsed and compared to genetic map data. In cases where two or more of the top three hits had an identical score, the hit that agreed with the genetic map was selected. Where no genetic map information was available for a particular SNP, the top hit was selected. Markers that failed to align to the assembly, aligned only to the chrUn contigs (unassigned contigs), or could not be unambiguously assigned to a particular chromosome were removed.
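The hit-selection rules described above can be sketched roughly as follows. This is not the CerealsDB code. The `Hit` class, the `choose_hit` function and the "Un" chromosome label are hypothetical names introduced for illustration, and two behaviours are assumptions the text does not spell out: a unique best score wins even when map data exist, and a tie that the genetic map cannot resolve counts as ambiguous. E-value filtering (1e-05) is assumed to have happened upstream.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Hit:
    chromosome: str   # e.g. "1A"; "Un" stands for the unassigned bin (assumed label)
    position: int
    bitscore: float


def choose_hit(hits: Sequence[Hit], map_chromosome: Optional[str]) -> Optional[Hit]:
    """Pick one placement for a marker, or return None to drop the marker."""
    # Keep only the three best-scoring hits.
    top = sorted(hits, key=lambda h: h.bitscore, reverse=True)[:3]
    # Drop markers with no hit, or with hits only on the unassigned bin.
    if not top or all(h.chromosome == "Un" for h in top):
        return None
    best = top[0]
    # No genetic map information: fall back to the top hit.
    if map_chromosome is None:
        return best
    tied = [h for h in top if h.bitscore == best.bitscore]
    # Assumption: a unique best score is accepted as-is.
    if len(tied) == 1:
        return best
    # Tie: prefer the hit on the chromosome given by the genetic map.
    agreeing = [h for h in tied if h.chromosome == map_chromosome]
    # Assumption: zero or several agreeing hits means the marker is ambiguous.
    return agreeing[0] if len(agreeing) == 1 else None


hits = [Hit("1A", 100, 180.0), Hit("1B", 200, 180.0), Hit("1D", 300, 150.0)]
print(choose_hit(hits, "1B"))
```

Returning `None` for every dropped marker keeps the three removal cases (no alignment, chrUn only, ambiguous) in one place. A real pipeline would probably record which case applied.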

The 35K set includes genotype data from 1,963 samples while the 820K has 475 samples. In cases where a marker belongs to both sets, genotype data from both 820K and 35K samples will be displayed [5].

EMS Mutation data

EMS-type variants were loaded from sequenced tetraploid (cv 'Kronos') and hexaploid (cv 'Cadenza') TILLING populations. Sequencing was performed using exome capture [2] for both varieties and promoter capture [11] for Kronos only. Mutations were called on the IWGSC RefSeq v1.0 assembly using the Dragen system for Cadenza and the MAPS pipeline for Kronos [12].

4.8 million Kronos mutations in coding regions and 4.3 million in promoter regions (estimated average error rate lower than 0.7%).
8.2 million Cadenza mutations in coding regions

Researchers and breeders can search this database online, identify mutations in the different copies of their target gene, and request seeds to study gene function or improve wheat varieties. Seeds can be requested from the UK SeedStor or from the US-based Dubcovsky lab. This resource was generated as part of a joint project between the University of California Davis, Rothamsted Research, the Earlham Institute, and the John Innes Centre.

Inter-Homoeologous Variants

3.6 million Inter-Homoeologous Variants (IHVs), called from alignments of the A, B and D component genomes, were added as SNP markers.
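To illustrate what an inter-homoeologous variant is, the sketch below scans a toy alignment of three homoeologous sequences and reports the columns where the subgenomes differ. It is a conceptual illustration only. The actual alignment and calling procedure is not described here, and the function name, the input layout and the choice to skip gap and N columns are all assumptions.

```python
def inter_homoeologous_sites(aligned):
    """Return (column, {subgenome: base}) for columns where the bases differ.

    `aligned` maps a subgenome label to an aligned sequence of equal length,
    with '-' marking gaps. Columns are 1-based alignment positions.
    """
    if len({len(seq) for seq in aligned.values()}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    sites = []
    for column, bases in enumerate(zip(*aligned.values()), start=1):
        # Assumption: columns with gaps or ambiguous bases are not called.
        if "-" in bases or "N" in bases:
            continue
        if len(set(bases)) > 1:
            sites.append((column, dict(zip(aligned, bases))))
    return sites


# Toy alignment (hypothetical sequences).
toy = {"A": "ACGTAC", "B": "ACGTTC", "D": "AC-TAC"}
print(inter_homoeologous_sites(toy))
```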

SIFT scores

SIFT predicts whether an amino acid substitution affects protein function based on sequence homology and the physical properties of amino acids. SIFT can be applied to naturally occurring nonsynonymous polymorphisms and laboratory-induced missense mutations.

SIFT scores and predictions (whether 'tolerated' or 'deleterious') have been calculated for all missense variants across all bread wheat variation datasets.
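As a rough guide to how a score relates to its label, the sketch below maps a SIFT score to a prediction. The 0.05 cutoff is the conventional SIFT threshold and is an assumption here, since the text above does not state the cutoff used. The function name is illustrative.

```python
def sift_prediction(score: float, threshold: float = 0.05) -> str:
    """Map a SIFT score (0 to 1) to a qualitative prediction.

    Lower scores mean the substitution is less likely to be tolerated.
    The default threshold of 0.05 is an assumed, conventional cutoff.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores lie between 0 and 1")
    return "deleterious" if score < threshold else "tolerated"


for example_score in (0.01, 0.30):
    print(example_score, sift_prediction(example_score))
```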

KASP markers

KASP markers designed to be genome-specific with PolyMarker [6] are displayed for the EMS-type variants. For details on how to interpret the annotation, please visit http://www.polymarker.info/about.

In addition to the EMS-type variants, there are KASP markers from the Nottingham BBSRC Wheat Research Centre (WRC). These 710 KASP markers were developed from SNPs between wheat and ten wild relative species (Amblyopyrum muticum, Aegilops caudata, Aegilops speltoides, Secale cereale, Thinopyrum bessarabicum, Thinopyrum intermedium, Thinopyrum elongatum, Thinopyrum ponticum, Triticum timopheevii and Triticum urartu). These SNPs were aligned to the assembly by the WRC using the marker sequences and a Blastn E-value cutoff of 1e-05. Of these, 620 markers are genome-specific in design, so those SNPs have been aligned to their chromosome of specificity. Where the KASP markers are genome-nonspecific, the top Blast hit was selected for SNP alignment [10].

Exome Capture Diversity 2019

3,039,822 markers were loaded from 890 diverse wheat landraces and cultivars [14].

Linkage Disequilibrium (LD) data

LD data calculated here have been added for the 820K and 35K Axiom SNP arrays.
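For readers unfamiliar with LD statistics, the sketch below computes r², one commonly reported measure, for a pair of biallelic markers. Which statistic and software were used for the data above is not stated, so this is a generic illustration. The 0/1 allele coding, the function name and the toy data are assumptions.

```python
def r_squared(locus_a, locus_b):
    """Compute the LD statistic r^2 between two biallelic loci.

    Each argument is a list of 0/1 allele codes, one entry per haplotype
    (or per inbred line), in the same sample order.
    """
    if len(locus_a) != len(locus_b) or not locus_a:
        raise ValueError("loci must be non-empty and of equal length")
    n = len(locus_a)
    p_a = sum(locus_a) / n                                   # frequency of allele 1 at locus A
    p_b = sum(locus_b) / n                                   # frequency of allele 1 at locus B
    p_ab = sum(x * y for x, y in zip(locus_a, locus_b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                     # disequilibrium coefficient D
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denominator


# Toy data (hypothetical): identical markers, then unrelated markers.
print(r_squared([1, 1, 0, 0], [1, 1, 0, 0]))
print(r_squared([1, 1, 0, 0], [1, 0, 1, 0]))
```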

QTL links

43 variants from the 820K and 35K Axiom SNP arrays have been linked to QTL in CerealsDB. This number will increase as the QTL database grows. An example can be seen here.

References

1. Shifting the limits in wheat research and breeding using a fully annotated reference genome.
Rudi Appels, Kellye Eversole, Catherine Feuillet, Beat Keller, Jane Rogers, Nils Stein ... Hana Šimková, Ian Small, Manuel Spannagl, David Swarbreck, Cristobal Uauy. 2018. Science. 361.
2. Uncovering hidden variation in polyploid wheat.
Ksenia V. Krasileva, Hans A. Vasquez-Gross, Tyson Howell, Paul Bailey, Francine Paraiso, Leah Clissold, James Simmonds, Ricardo H. Ramirez-Gonzalez et al. 2016. PNAS. 114:E913–E921.
3. An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations.
Bernardo J. Clavijo, Luca Venturini, Christian Schudoma, Gonzalo Garcia Accinelli, Gemy Kaithakottil, Jonathan Wright, Philippa Borrill, George Kettleborough, Darren Heavens, Helen Chapman et al. 2017. Genome Research.
4. Ultra-Fast Next Generation Human Genome Sequencing Data Processing Using DRAGEN™ Bio-IT Processor for Precision Medicine.
Goyal, A., Kwon, H.J., Lee, K., Garg, R., Yun, S.Y. et al. 2017. Open Journal of Genetics. 7:9–19.
5. CerealsDB 3.0: expansion of resources and data integration.
Wilkinson PA, Winfield MO, Barker GL, Tyrrell S, Bian X, Allen AM, Burridge A, Coghill JA, Waterfall C, Caccamo M et al. 2016. BMC Bioinformatics. 17:256.
6. PolyMarker: A fast polyploid primer design pipeline.
Ricardo H. Ramirez-Gonzalez, Cristobal Uauy, Mario Caccamo. 2015. Bioinformatics. 31:2038–2039.
7. Developing integrated crop knowledge networks to advance candidate gene discovery.
Keywan Hassani-Pak, Martin Castellote, Maria Esch, Matthew Hindle, Artem Lysenko, Jan Taubert and Christopher Rawlings. 2016. Applied & Translational Genomics. 11:18–26.
8. The transcriptional landscape of hexaploid wheat across tissues and cultivars.
Ramírez-González RH, Borrill P, Lang D, Harrington SA, Brinton J, Venturini L, Davey M, Jacobs J, van Ex F, Pasha A et al. 2018. Science. 361.
9. expVIP: a customisable RNA-seq data analysis and visualisation platform.
Philippa Borrill, Ricardo Ramirez-Gonzalez, and Cristobal Uauy. 2016. Plant Physiology.
10. Rapid identification of homozygosity and site of wild relative introgressions in wheat through chromosome‐specific KASP genotyping assays.
Surbhi Grewal, Stella Hubbart‐Edwards, Caiyun Yang, Urmila Devi, Lauren Baker, Jack Heath, Stephen Ashling, Duncan Scholefield, Caroline Howells, Jermaine Yarde, Peter Isaac, Ian P. King and Julie King. 2020. Plant Biotechnology Journal. 18:743–755.
11. Integrating genomic resources to present full gene and putative promoter capture probe sets for bread wheat.
Gardiner LJ, Brabbs T, Akhunov A, Jordan K, Budak H et al. 2019. Gigascience. 8(4):giz018.
12. Efficient genome-wide detection and cataloging of EMS-induced mutations using next-generation sequencing and exome capture.
Henry IM, Nagalakshmi U, Lieberman MC, Ngo KJ, Krasileva KV et al. 2014. Plant Cell. 26:1382–1397.
13. High-density genotyping of the A.E. Watkins Collection of hexaploid landraces identifies a large molecular diversity compared to elite bread wheat.
Mark O Winfield, Alexandra M Allen, Paul A Wilkinson, Amanda J Burridge, Gary L A Barker, Jane Coghill, Christy Waterfall, Luzie U Wingen, Simon Griffiths, Keith J Edwards. 2018. Plant Biotechnol J. 16(1):165–175.
14. Exome sequencing highlights the role of wild-relative introgression in shaping the adaptive landscape of the wheat genome.
Fei He, Raj Pasam, Fan Shi, Surya Kant et al. 2019. Nature Genetics. 51:896–904.
15. Harnessing landrace diversity empowers wheat breeding.
Cheng, S., Feng, C., Wingen, L.U. et al. 2024. Nature. 632:823–831.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly	IWGSC, INSDC Assembly GCA_900519105.1, Jul 2018
Database version	114.4
Golden Path Length	14,547,261,565
Genebuild by	International Wheat Genome Sequencing Consortium
Genebuild method	Import
Data source	International Wheat Genome Sequencing Consortium

Gene counts

Coding genes	107,891
Non coding genes	12,853
Small non coding genes	12,491
Long non coding genes	362
Gene transcripts	146,597

Other

Short Variants	120,232,985
