Triticum aestivum Assembly and Gene Annotation
About Triticum aestivum
Triticum aestivum (bread wheat) is a major global cereal grain essential to human nutrition. Wheat was one of the first cereals to be domesticated, originating in the Fertile Crescent around 7000 years ago. Bread wheat is hexaploid, with a genome size estimated at ~17 Gb, composed of three closely related and independently maintained genomes that are the result of a series of naturally occurring hybridisation events. The ancestral progenitor genomes are considered to be Triticum urartu (the A-genome donor) and an unknown grass thought to be related to Aegilops speltoides (the B-genome donor). This first hybridisation event produced tetraploid emmer wheat (AABB, T. dicoccoides), which hybridised again with Aegilops tauschii (the D-genome donor) to produce modern bread wheat.
Guidelines for gene nomenclature in wheat can be found in the 2013 edition of the Wheat Gene Catalogue, available in GrainGenes. The Wheat Gene Catalogue sets out the internationally agreed rules of nomenclature for wheat genes.
Assembly
The wheat genome was assembled by the International Wheat Genome Sequencing Consortium. Pseudomolecule sequences representing the 21 chromosomes of the bread wheat genome were assembled by integrating a draft de novo whole-genome assembly, built from Illumina short-read sequences using NRGene deNovoMagic2, with additional layers of genetic, physical, and sequence data. In the resulting 14.5-Gb genome assembly, contigs and scaffolds with N50s of 52 kb and 7 Mb, respectively, were linked into superscaffolds (N50 = 22.8 Mb), with 97% (14.1 Gb) of the sequences assigned and ordered along the 21 chromosomes and almost all of the assigned sequence scaffolds oriented relative to each other (13.8 Gb, 98%). Unanchored scaffolds comprising 481 Mb (2.8% of the assembly length) formed the "unassigned chromosome" (ChrUn) bin. The quality and contiguity of the IWGSC RefSeq v1.0 genome assembly were assessed through alignments with radiation hybrid maps for the A, B, and D subgenomes [average Spearman's correlation coefficient (r) of 0.98], the genetic positions of 7832 and 4745 genotyping-by-sequencing derived genetic markers in 88 double haploid and 993 recombinant inbred lines (Spearman's r of 0.986 and 0.987, respectively), and 1.24 million pairs of neighbour insertion site-based polymorphism (ISBP) markers, of which 97% were collinear and mapped in a similar size range (difference of less than 2 kb) between the de novo assembly and the available bacterial artificial chromosome (BAC)-based sequence assemblies. Finally, IWGSC RefSeq v1.0 was assessed with independent data derived from coding and noncoding sequences, revealing that 99 and 98% of the previously known coding exons and transposable element (TE)-derived (ISBP) markers (table S9), respectively, were present in the assembly. 
The difference of approximately 1 Gb between the size of IWGSC RefSeq v1.0 and the new genome size estimates of 15.4 to 15.8 Gb can be accounted for by collapsed or unassembled sequences of highly repeated clusters, such as ribosomal RNA coding regions and telomeric sequences [1]. Centromere data can be viewed here.
This assembly is for the reference Chinese Spring wheat cultivar.
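The contig, scaffold and superscaffold N50 values quoted above summarise assembly contiguity: the N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled length. The following is a minimal Python sketch of that calculation. The function name `n50` and the toy lengths are illustrative assumptions, not part of the IWGSC pipeline or data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are taken from longest to shortest until their cumulative
    length reaches at least half of the total; the length of the last
    sequence taken is the N50.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths (hypothetical, arbitrary units); total = 300.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

Applied to the real scaffold lengths of an assembly, the same function would give the kind of figure reported here, such as the 7 Mb scaffold N50.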
Annotation
Gene models were predicted with two independent pipelines previously utilised for wheat genome annotation and then consolidated to produce the RefSeq Annotation v1.0. Subsequently, a set of manually curated gene models was integrated to build RefSeq Annotation v1.1. In total, 107,891 high-confidence (HC) protein-coding loci were identified, with relatively equal distribution across the A, B, and D subgenomes. In addition, 161,537 other protein-coding loci were classified as low-confidence (LC) genes, representing partially supported gene models, gene fragments, and orphans. A predicted function was assigned to 82.1% (90,919) of HC genes in RefSeq Annotation v1.0 (tables S19 and S20), and evidence for transcription was found for 85% (94,114) of the HC genes versus 49% of the LC genes [1].
98,270 high confidence genes from the TGACv1 annotation [3] were aligned to the assembly using Exonerate. For each gene up to three alignments are displayed, comprising 196,243 alignments of which 90,686 are protein coding.
Each high confidence coding gene is externally linked to KnetMiner [7] and WheatExpression [8,9].
Variation
Watkins Collection
~90 million variants were loaded from the Watkins Landrace Wheat Collection, a diverse assortment of heritage wheat varieties from the 1930s, preserved for its unique genetic potential in modern breeding [15].
Exome Capture Watkins
4,493,110 markers were loaded from 103 accessions of the Watkins landrace collection [13].
31,779 (out of 35,143) markers from the 35K Axiom SNP array and 768,664 (out of 819,570) markers from the 820K Axiom SNP array were aligned to the assembly. This was done by CerealsDB using 101 bp sequences with the SNP located centrally at position 51. The Blastn E-value cutoff was set to 1e-05. The top three hits were parsed and compared to genetic map data. Where two or more of the top three hits had an identical score, the hit that agreed with the genetic map was selected. Where no genetic map information was available for a particular SNP, the top hit was selected. Markers that failed to align to the assembly, aligned only to the chrUn (unassigned) contigs, or could not be unambiguously assigned to a particular chromosome were removed.
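The hit-selection rule described above can be sketched in Python as follows. This is one possible reading of the prose, not the CerealsDB code: the `Hit` class, the `choose_hit` function, the "Un" chromosome label, and the handling of cases the text does not spell out (for example, a single best hit that disagrees with the genetic map is kept) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Hit:
    chromosome: str  # e.g. "1A"; "Un" stands for the unassigned bin (assumed label)
    position: int
    score: float     # alignment score; higher is better


def choose_hit(hits: Sequence[Hit], map_chromosome: Optional[str]) -> Optional[Hit]:
    """Pick one placement for a marker, or None if the marker is dropped."""
    # Keep the three best hits; the sort is stable, so ties keep input order.
    top = sorted(hits, key=lambda h: h.score, reverse=True)[:3]
    # No alignment at all, or alignments only to the unassigned bin: drop.
    if not top or all(h.chromosome == "Un" for h in top):
        return None
    tied = [h for h in top if h.score == top[0].score]
    if map_chromosome is None:
        # No genetic map information: take the top hit.
        return tied[0]
    # Prefer a best-scoring hit on the chromosome given by the genetic map.
    agreeing = [h for h in tied if h.chromosome == map_chromosome]
    if agreeing:
        return agreeing[0]
    # A unique best hit is kept (assumption); an unresolved tie is dropped.
    return tied[0] if len(tied) == 1 else None


hits = [Hit("1A", 100, 180.0), Hit("1B", 200, 180.0), Hit("1D", 300, 150.0)]
print(choose_hit(hits, "1B"))  # tie resolved by the genetic map
print(choose_hit(hits, None))  # no map information: top hit
print(choose_hit(hits, "5D"))  # tie not resolved by the map: dropped
```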
The 35K set includes genotype data from 1,963 samples while the 820K has 475 samples. In cases where a marker belongs to both sets, genotype data from both 820K and 35K samples will be displayed [5].
EMS-type variants were loaded from sequenced tetraploid (cv 'Kronos') and hexaploid (cv 'Cadenza') TILLING populations. Sequencing was performed using exome capture [2] for both varieties and promoter capture [11] for Kronos only. Mutations were called on the IWGSC RefSeq v1.0 assembly using the Dragen system [4] for Cadenza and the MAPS pipeline for Kronos [12].
- 4.8 million Kronos mutations in coding regions and 4.3 million in promoter regions (estimated average error rate lower than 0.7%).
- 8.2 million Cadenza mutations in coding regions
Researchers and breeders can search this database online, identify mutations in the different copies of their target gene, and request seeds to study gene function or improve wheat varieties. Seeds can be requested from the UK SeedStor or from the US-based Dubcovsky lab. This resource was generated as part of a joint project between the University of California Davis, Rothamsted Research, the Earlham Institute, and the John Innes Centre.
Inter-Homoeologous Variants
3.6 million Inter-Homoeologous Variants (IHVs), called by alignments of the A, B and D component genomes, were added as SNP markers.
SIFT scores
SIFT predicts whether an amino acid substitution affects protein function based on sequence homology and the physical properties of amino acids. SIFT can be applied to naturally occurring nonsynonymous polymorphisms and laboratory-induced missense mutations.
SIFT scores and predictions (whether 'tolerated' or 'deleterious') have been calculated for all missense variants across all bread wheat variation datasets.
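As an illustration of how a SIFT score relates to the 'tolerated' or 'deleterious' label, here is a minimal Python sketch. The page does not state the cutoff used; the sketch assumes the conventional SIFT threshold of 0.05 on a 0 to 1 scale, and the treatment of a score of exactly 0.05 is likewise an assumption. The function name `sift_prediction` is hypothetical.

```python
def sift_prediction(score: float, cutoff: float = 0.05) -> str:
    """Map a SIFT score to a qualitative prediction.

    Lower scores indicate substitutions that are less likely to be
    tolerated. The default cutoff of 0.05 is the conventional SIFT
    threshold and is an assumption here, not a value from this page.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores lie between 0 and 1")
    return "deleterious" if score < cutoff else "tolerated"


# Hypothetical scores for two missense variants.
for score in (0.01, 0.40):
    print(score, sift_prediction(score))
```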
KASP markers
KASP markers designed to be genome-specific with PolyMarker [6] are displayed for the EMS-type variants. For details on how to interpret the annotation, please visit http://www.polymarker.info/about.
In addition to the EMS-type variants, there are KASP markers from the Nottingham BBSRC Wheat Research Centre (WRC). These 710 KASP markers were developed from SNPs between wheat and ten wild relative species (Amblyopyrum muticum, Aegilops caudata, Aegilops speltoides, Secale cereale, Thinopyrum bessarabicum, Thinopyrum intermedium, Thinopyrum elongatum, Thinopyrum ponticum, Triticum timopheevii and Triticum urartu). These SNPs were aligned to the assembly by the WRC using the marker sequences and a Blastn E-value cutoff of 1e-05. Of these, 620 markers are genome-specific in design, so those SNPs have been aligned to their chromosome of specificity. Where the KASP markers are genome-nonspecific, the top Blast hit was selected for SNP alignment [10].
Exome Capture Diversity 2019
3,039,822 markers were loaded from 890 diverse wheat landraces and cultivars [14].
Linkage Disequilibrium (LD) data
LD data calculated here have been added to the 820K and 35K Axiom SNP arrays.
QTL links
43 variants from the 820K and 35K Axiom SNP arrays have been linked to QTL in CerealsDB. This number will increase as the QTL database grows. An example can be seen here.
References
- [1] Shifting the limits in wheat research and breeding using a fully annotated reference genome. Rudi Appels, Kellye Eversole, Catherine Feuillet, Beat Keller, Jane Rogers, Nils Stein ... Hana Šimková, Ian Small, Manuel Spannagl, David Swarbreck, Cristobal Uauy. 2018. Science. 361.
- [2] Uncovering hidden variation in polyploid wheat. Ksenia V. Krasileva, Hans A. Vasquez-Gross, Tyson Howell, Paul Bailey, Francine Paraiso, Leah Clissold, James Simmonds, Ricardo H. Ramirez-Gonzalez et al. 2016. PNAS. 114:E913–E921.
- [3] An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations. Bernardo J. Clavijo, Luca Venturini, Christian Schudoma, Gonzalo Garcia Accinelli, Gemy Kaithakottil, Jonathan Wright, Philippa Borrill, George Kettleborough, Darren Heavens, Helen Chapman et al. 2017. Genome Research.
- [4] Ultra-Fast Next Generation Human Genome Sequencing Data Processing Using DRAGEN™ Bio-IT Processor for Precision Medicine. Goyal, A., Kwon, H.J., Lee, K., Garg, R., Yun, S.Y. et al. 2017. Open Journal of Genetics. 7:9-19.
- [5] CerealsDB 3.0: expansion of resources and data integration. Wilkinson PA, Winfield MO, Barker GL, Tyrrell S, Bian X, Allen AM, Burridge A, Coghill JA, Waterfall C, Caccamo M et al. 2016. BMC Bioinformatics. 17:256.
- [6] PolyMarker: A fast polyploid primer design pipeline. Ricardo H. Ramirez-Gonzalez, Cristobal Uauy, Mario Caccamo. 2015. Bioinformatics. 31:2038-2039.
- [7] Developing integrated crop knowledge networks to advance candidate gene discovery. Keywan Hassani-Pak, Martin Castellote, Maria Esch, Matthew Hindle, Artem Lysenko, Jan Taubert and Christopher Rawlings. 2016. Applied & Translational Genomics. 11:18-26.
- [8] The transcriptional landscape of hexaploid wheat across tissues and cultivars. Ramírez-González RH, Borrill P, Lang D, Harrington SA, Brinton J, Venturini L, Davey M, Jacobs J, van Ex F, Pasha A et al. 2018. Science. 361.
- [9] expVIP: a customisable RNA-seq data analysis and visualisation platform. Philippa Borrill, Ricardo Ramirez-Gonzalez, and Cristobal Uauy. 2016. Plant Physiology.
- [10] Rapid identification of homozygosity and site of wild relative introgressions in wheat through chromosome‐specific KASP genotyping assays. Surbhi Grewal, Stella Hubbart‐Edwards, Caiyun Yang, Urmila Devi, Lauren Baker, Jack Heath, Stephen Ashling, Duncan Scholefield, Caroline Howells, Jermaine Yarde, Peter Isaac, Ian P. King and Julie King. Plant Biotechnology Journal. 18:743–755.
- [11] Integrating genomic resources to present full gene and putative promoter capture probe sets for bread wheat. Gardiner LJ, Brabbs T, Akhunov A, Jordan K, Budak H et al. 2019. Gigascience 8(4): giz018.
- [12] Efficient genome-wide detection and cataloging of EMS-induced mutations using next-generation sequencing and exome capture. Henry IM, Nagalakshmi U, Lieberman MC, Ngo KJ, Krasileva KV et al. 2014. Plant Cell 26:1382–1397.
- [13] High-density genotyping of the A.E. Watkins Collection of hexaploid landraces identifies a large molecular diversity compared to elite bread wheat. Mark O Winfield, Alexandra M Allen, Paul A Wilkinson, Amanda J Burridge, Gary L A Barker, Jane Coghill, Christy Waterfall, Luzie U Wingen, Simon Griffiths, Keith J Edwards. 2018. Plant Biotechnol J. 16(1):165-175.
- [14] Exome sequencing highlights the role of wild-relative introgression in shaping the adaptive landscape of the wheat genome. Fei He, Raj Pasam, Fan Shi, Surya Kant et al. 2019. Nature Genetics 51, 896-904.
- [15] Harnessing landrace diversity empowers wheat breeding. Cheng, S., Feng, C., Wingen, L.U. et al. 2024. Nature 632, 823–831.
More information
General information about this species can be found in Wikipedia.
Statistics
Summary
Assembly | IWGSC, INSDC Assembly GCA_900519105.1, Jul 2018 |
Database version | 113.4 |
Golden Path Length | 14,547,261,565 |
Genebuild by | International Wheat Genome Sequencing Consortium |
Genebuild method | Import |
Data source | International Wheat Genome Sequencing Consortium |
Gene counts
Coding genes | 107,891 |
Non coding genes | 12,853 |
Small non coding genes | 12,491 |
Long non coding genes | 362 |
Gene transcripts | 146,597 |
Other
Short Variants | 120,232,985 |