Triticum aestivum Assembly and Gene Annotation
About Triticum aestivum
Triticum aestivum (bread wheat) is a major global cereal grain essential to human nutrition. Wheat was one of the first cereals to be domesticated, originating in the Fertile Crescent around 7000 years ago. Bread wheat is hexaploid, with a genome size estimated at ~17 Gb, composed of three closely related and independently maintained genomes that are the result of a series of naturally occurring hybridisation events. The ancestral progenitor genomes are considered to be Triticum urartu (the A-genome donor) and an unknown grass thought to be related to Aegilops speltoides (the B-genome donor). This first hybridisation event produced tetraploid emmer wheat (AABB, T. dicoccoides), which hybridised again with Aegilops tauschii (the D-genome donor) to produce modern bread wheat.
The wheat genome was assembled by the International Wheat Genome Sequencing Consortium. Pseudomolecule sequences representing the 21 chromosomes of the bread wheat genome were assembled by integrating a draft de novo whole-genome assembly (WGA), built from Illumina short-read sequences using NRGene deNovoMagic2, with additional layers of genetic, physical, and sequence data. In the resulting 14.5-Gb genome assembly, contigs and scaffolds with N50s of 52 kb and 7 Mb, respectively, were linked into superscaffolds (N50 = 22.8 Mb), with 97% (14.1 Gb) of the sequences assigned and ordered along the 21 chromosomes and almost all of the assigned sequence scaffolds oriented relative to each other (13.8 Gb, 98%). Unanchored scaffolds comprising 481 Mb (2.8% of the assembly length) formed the “unassigned chromosome” (ChrUn) bin.
The quality and contiguity of the IWGSC RefSeq v1.0 genome assembly were assessed through alignments with radiation hybrid maps for the A, B, and D subgenomes [average Spearman’s correlation coefficient (r) of 0.98], the genetic positions of 7832 and 4745 genotyping-by-sequencing derived genetic markers in 88 double haploid and 993 recombinant inbred lines (Spearman’s r of 0.986 and 0.987, respectively), and 1.24 million pairs of neighbour insertion site–based polymorphism (ISBP) markers, of which 97% were collinear and mapped in a similar size range (difference of less than 2 kb) between the de novo WGA and the available bacterial artificial chromosome (BAC)–based sequence assemblies. Finally, IWGSC RefSeq v1.0 was assessed with independent data derived from coding and noncoding sequences, revealing that 99 and 98% of the previously known coding exons and transposable element (TE)–derived (ISBP) markers (table S9), respectively, were present in the assembly.
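For readers unfamiliar with the N50 values quoted above, the statistic can be sketched as follows. This is a minimal illustration of the usual definition (the length at which the longest sequences together cover at least half of the assembled bases); the function name and the toy lengths are hypothetical and are not taken from the wheat assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy scaffold lengths (hypothetical values, not from the wheat assembly).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

The same function applies to contigs, scaffolds, or superscaffolds; only the input list of lengths changes.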
The approximately 1-Gb size difference between IWGSC RefSeq v1.0 and the new genome size estimates of 15.4 to 15.8 Gb can be accounted for by collapsed or unassembled sequences of highly repeated clusters, such as ribosomal RNA coding regions and telomeric sequences.
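The size difference mentioned above follows from simple subtraction. The sketch below uses only the rounded figures quoted in the text (14.5 Gb assembled, 15.4 to 15.8 Gb estimated); the variable and function names are illustrative.

```python
ASSEMBLY_GB = 14.5           # IWGSC RefSeq v1.0 assembly size quoted in the text
ESTIMATES_GB = (15.4, 15.8)  # genome size estimates quoted in the text


def size_gap(assembly_gb, estimate_gb):
    """Return the estimated-minus-assembled size in Gb, rounded to 0.1 Gb."""
    return round(estimate_gb - assembly_gb, 1)


gaps = [size_gap(ASSEMBLY_GB, estimate) for estimate in ESTIMATES_GB]
print(gaps)
```

With these rounded inputs the gap spans roughly 0.9 to 1.3 Gb, which is the "approximately 1 Gb" referred to above.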
Gene models were predicted with two independent pipelines previously utilised for wheat genome annotation and then consolidated to produce the RefSeq Annotation v1.0. Subsequently, a set of manually curated gene models was integrated to build RefSeq Annotation v1.1. In total, 107,891 high-confidence (HC) protein-coding loci were identified, with relatively equal distribution across the A, B, and D subgenomes. In addition, 161,537 other protein-coding loci were classified as low-confidence (LC) genes, representing partially supported gene models, gene fragments, and orphans. A predicted function was assigned to 82.1% (90,919) of HC genes in RefSeq Annotation v1.0 (tables S19 and S20), and evidence for transcription was found for 85% (94,114) of the HC genes versus 49% of the LC genes.
98,270 high-confidence genes from the TGACv1 annotation were aligned to the assembly using Exonerate. For each gene, up to three alignments are displayed, comprising 196,243 alignments, of which 90,686 are protein coding.
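The "up to three alignments per gene" rule can be sketched as below. This is an illustration only: the text does not say how the displayed alignments were chosen, so ranking by alignment score is an assumption, and the tuple layout, gene identifiers, and function name are hypothetical. Parsing of Exonerate output is not shown.

```python
from collections import defaultdict


def top_alignments(alignments, max_per_gene=3):
    """Keep at most `max_per_gene` alignments for each gene.

    `alignments` is an iterable of (gene_id, target, score) tuples.
    Ranking by score is an assumption, not something the source states.
    """
    by_gene = defaultdict(list)
    for gene_id, target, score in alignments:
        by_gene[gene_id].append((gene_id, target, score))
    kept = {}
    for gene_id, hits in by_gene.items():
        hits.sort(key=lambda hit: hit[2], reverse=True)
        kept[gene_id] = hits[:max_per_gene]
    return kept


# Hypothetical alignments for two made-up gene identifiers.
example = [
    ("geneA", "1A", 900),
    ("geneA", "1B", 850),
    ("geneA", "1D", 870),
    ("geneA", "Un", 300),
    ("geneB", "2A", 500),
]
for gene_id, hits in top_alignments(example).items():
    print(gene_id, [target for _, target, _ in hits])
```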
31,779 markers from the 35K Axiom SNP array and 768,664 markers from the 820K Axiom SNP array from CerealsDB were aligned to the assembly by CerealsDB using BLAST with an E-value cutoff of 1e-05. The top three hits were parsed and compared to CerealsDB genetic map data. Where two or more of the top three hits had an identical score, the hit that agreed with the genetic map was selected. Where no genetic map information was available for a particular SNP, the top hit was selected.
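The hit-selection rule described above can be sketched as follows. This is not the CerealsDB code: the function name and tuple layout are hypothetical, hits are assumed to arrive sorted best-first, and "agreeing with the genetic map" is taken to mean that a hit lies on the chromosome given by the map. The text does not say what happens when there is no tie, or when no tied hit matches the map; the sketch falls back to the top hit in both cases.

```python
def select_hit(hits, map_chromosome=None):
    """Pick one BLAST hit for a marker.

    `hits` is a list of (chromosome, score) tuples sorted best-first.
    `map_chromosome` is the chromosome from the genetic map, or None
    when the marker has no genetic map information.
    """
    if not hits:
        return None
    top = hits[:3]  # only the top three hits are considered
    scores = [score for _, score in top]
    has_tie = len(set(scores)) < len(scores)
    if has_tie and map_chromosome is not None:
        for hit in top:
            if hit[0] == map_chromosome:
                return hit
    # No tie, no map information, or no hit on the mapped chromosome
    # (the last case is an assumption): fall back to the top hit.
    return top[0]


print(select_hit([("1A", 200), ("1B", 200), ("1D", 150)], "1B"))
print(select_hit([("1A", 200), ("1B", 200), ("1D", 150)]))
```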
The 35K array includes genotype data from 1,963 samples, while the 820K array has 475 samples. Where a marker belongs to both sets, genotype data from both the 820K and 35K samples are displayed.
EMS-type variants are included from sequenced tetraploid (cv ‘Kronos’) and hexaploid (cv ‘Cadenza’) TILLING populations. Mutations were called on the IWGSC RefSeq v1.0 assembly using the Dragen system.
- 3.9 million Kronos mutations
- 8.2 million Cadenza mutations
Researchers and breeders can search this database online, identify mutations in the different copies of their target gene, and request seeds to study gene function or improve wheat varieties. Seeds can be requested from the UK SeedStor or from the US-based Dubcovsky lab. This resource was generated as part of a joint project between the University of California Davis, Rothamsted Research, The Earlham Institute, and the John Innes Centre.
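As a rough illustration of what "EMS-type" means in practice, the sketch below flags single-base changes that are G>A or C>T transitions, the changes typically induced by EMS mutagenesis. The text does not define the term, so this reading is an assumption, and the function name and example alleles are hypothetical.

```python
# Assumption: "EMS-type" means G>A or C>T transitions.
EMS_TYPE_CHANGES = {("G", "A"), ("C", "T")}


def is_ems_type(ref, alt):
    """Return True if a single-base change is a G>A or C>T transition."""
    return (ref.upper(), alt.upper()) in EMS_TYPE_CHANGES


# Hypothetical (reference, alternative) allele pairs.
calls = [("G", "A"), ("C", "T"), ("A", "G"), ("c", "t"), ("G", "T")]
print([is_ems_type(ref, alt) for ref, alt in calls])
```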
3.6 million Inter-Homoeologous Variants (IHVs), called from alignments of the A, B, and D component genomes, were added as SNP markers.
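The idea behind an inter-homoeologous variant can be sketched as below: given pre-aligned homoeologous sequences from the A, B, and D subgenomes, report the columns where the bases differ. This only illustrates the concept and is not the pipeline used to call the 3.6 million IHVs; the function name, the skipping of gapped or ambiguous columns, and the toy sequences are all assumptions.

```python
def inter_homoeologous_variants(aligned):
    """Yield (position, {subgenome: base}) for columns where bases differ.

    `aligned` maps subgenome labels (e.g. "A", "B", "D") to equal-length,
    pre-aligned sequences. Positions are 0-based. Columns containing a
    gap ("-") or an ambiguous base ("N") are skipped.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    for pos in range(lengths.pop()):
        column = {sub: seq[pos].upper() for sub, seq in aligned.items()}
        bases = set(column.values())
        if "-" in bases or "N" in bases:
            continue
        if len(bases) > 1:
            yield pos, column


# Toy alignment (hypothetical sequences).
toy = {"A": "ACGTAC", "B": "ACGTAT", "D": "ATGTAC"}
for pos, column in inter_homoeologous_variants(toy):
    print(pos, column)
```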
SIFT predicts whether an amino acid substitution affects protein function based on sequence homology and the physical properties of amino acids. SIFT can be applied to naturally occurring nonsynonymous polymorphisms and laboratory-induced missense mutations.
SIFT scores and predictions (whether a substitution is 'tolerated' or 'deleterious') have been calculated for all missense variants across all bread wheat variation datasets.
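A SIFT score is turned into a 'tolerated' or 'deleterious' call by comparing it with a cutoff. The sketch below assumes the commonly used cutoff of 0.05 on a 0 to 1 scale, with scores below the cutoff called deleterious; the text does not state the cutoff or how a score exactly at the boundary is treated, so both are assumptions, and the function name is hypothetical.

```python
def sift_prediction(score, threshold=0.05):
    """Map a SIFT score (0 to 1) to a qualitative prediction.

    Assumption: scores below `threshold` are called 'deleterious' and
    all others 'tolerated'.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores are expected to lie between 0 and 1")
    return "deleterious" if score < threshold else "tolerated"


for score in (0.0, 0.01, 0.2, 1.0):
    print(score, sift_prediction(score))
```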
- Shifting the limits in wheat research and breeding using a fully annotated reference genome.
Rudi Appels, Kellye Eversole, Catherine Feuillet, Beat Keller, Jane Rogers, Nils Stein ... Hana Šimková, Ian Small, Manuel Spannagl, David Swarbreck, Cristobal Uauy. 2018. Science. 361
- Uncovering hidden variation in polyploid wheat.
Ksenia V. Krasileva, Hans A. Vasquez-Gross, Tyson Howell, Paul Bailey, Francine Paraiso, Leah Clissold, James Simmonds, Ricardo H. Ramirez-Gonzalez et al. 2016. PNAS. 114:E913–E921.
- An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations.
Bernardo J. Clavijo, Luca Venturini, Christian Schudoma, Gonzalo Garcia Accinelli, Gemy Kaithakottil, Jonathan Wright, Philippa Borrill, George Kettleborough, Darren Heavens, Helen Chapman et al. 2017. Genome Research.
- Ultra-Fast Next Generation Human Genome Sequencing Data Processing Using DRAGEN™ Bio-IT Processor for Precision Medicine.
Goyal, A., Kwon, H.J., Lee, K., Garg, R., Yun, S.Y. et al. 2017. Open Journal of Genetics. 7:9-19.
- CerealsDB 3.0: expansion of resources and data integration.
Wilkinson PA, Winfield MO, Barker GL, Tyrrell S, Bian X, Allen AM, Burridge A, Coghill JA, Waterfall C, Caccamo M et al. 2016. BMC Bioinformatics. 17:256.
General information about this species can be found in Wikipedia.
|Assembly||IWGSC, INSDC Assembly GCA_900519105.1, Jul 2018|
|Golden Path Length (bp)||14,547,261,565|
|Genebuild method||Imported from IWGSC|
|Data source||International Wheat Genome Sequencing Consortium|
|Non coding genes||12,853|
|Small non coding genes||12,491|
|Long non coding genes||362|