Triticum aestivum Assembly and Gene Annotation

This assembly and annotation are available for use under the terms of the Toronto Agreement on prepublication data sharing.

The previous wheat assembly (IWGSC CSS) and every other plant from release 31 are available on the new Ensembl Plants archive site.

About Triticum aestivum

Triticum aestivum (bread wheat) is a major global cereal grain essential to human nutrition. Wheat was one of the first cereals to be domesticated, originating in the Fertile Crescent around 7000 years ago. Bread wheat is hexaploid, with a genome size estimated at ~17 Gbp, composed of three closely related and independently maintained genomes that are the result of a series of naturally occurring hybridization events. The ancestral progenitor genomes are considered to be Triticum urartu (the A-genome donor) and an unknown grass thought to be related to Aegilops speltoides (the B-genome donor). This first hybridization event produced tetraploid emmer wheat (AABB, T. dicoccoides), which hybridized again with Aegilops tauschii (the D-genome donor) to produce modern bread wheat.

Assembly

This assembly (TGACv1) and its annotation were generated by the Earlham Institute, formerly The Genome Analysis Centre (TGAC).

Paired-end (PE) reads of 250 bp were generated from two CS42 libraries constructed using a PCR-free protocol. In total, 1.1 billion PE reads were generated, providing 32.78x coverage of the CS42 genome. The w2rap-contigger (based on DISCOVAR de novo [8]) was used to assemble contigs. The contigger is available on GitHub and is fully described elsewhere [9]. It uses PCR-free libraries to reduce coverage bias, takes advantage of the long 250 bp reads generated by the latest Illumina sequencing technology, and retains the majority of the variation present in the reads when generating contigs. Multiple Nextera long mate-pair (LMP) libraries were generated for scaffolding, with insert sizes ranging from 2 to 12 Kb. LMP reads were processed using NextClip [10] and contigs were scaffolded using the SOAPdenovo2 [11] prepare -> map -> scaffold pipeline.
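
As a rough sanity check on the coverage figure, fold coverage can be estimated as total sequenced bases divided by genome size. The sketch below assumes that "1.1 billion PE reads" counts read pairs and that the genome size is the ~17 Gbp estimate quoted above; neither the exact read count nor the genome size used for the reported value is stated here, so the result is only expected to land near 32.78x.

```python
def fold_coverage(n_pairs: float, read_length_bp: int, genome_size_bp: float) -> float:
    """Estimate fold coverage from paired-end sequencing.

    Each pair contributes two reads of read_length_bp bases.
    """
    return n_pairs * 2 * read_length_bp / genome_size_bp


# Rounded inputs taken from the text; both are assumptions for this estimate.
N_PAIRS = 1.1e9          # "1.1 billion PE reads", read as read pairs
READ_LENGTH_BP = 250     # 250 bp reads
GENOME_SIZE_BP = 17e9    # ~17 Gbp estimated genome size

coverage = fold_coverage(N_PAIRS, READ_LENGTH_BP, GENOME_SIZE_BP)
print(f"Estimated coverage: {coverage:.2f}x")
```

With these rounded inputs the estimate comes out a little above 32x, which is consistent with the reported value given the rounding of the read count and genome size.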

Scaffolds were classified into chromosome-arm bins using arm-specific Chromosome Survey Sequence (CSS) reads [12]. Scaffolds from 3B were not separated into short/long arm bins as individual arm datasets were not generated for this chromosome in the CSS project. The ‘sect’ method of KAT [13] was used to compute kmer coverage over each scaffold using each CSS read set. Each non-repetitive kmer in a scaffold was scored proportionally to coverage on each CSS arm and scaffolds were classified using the following set of rules:

  1. Scaffolds with less than 10% of the kmers producing a vote were left as unclassified (marked as Chromosome arm “U”). These are mostly small and/or repetitive sequences. 

  2. Scaffolds with a top score towards a CSS set at least double the second top score were classified to the highest scoring chromosome arm. 

  3. Scaffolds with a top score towards a CSS set less than double the second top score were left as unclassified (marked as Chromosome arm “U”, but with the two top scores and CSS sets included in the sequence name). This category contains scaffolds that are classified as combinations of the two arms from the same chromosome, probably due to imprecise identification during flow-sorting. It also contains scaffolds from regions of the genome with specific flow-sorting biases, and assembly chimeras, which will all be investigated further. 
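
The three rules above can be sketched as a small decision function. This is an illustrative reading of the rules, not the code used to build the assembly: the function name, the input layout (a per-arm score dictionary plus the fraction of voting kmers) and the example arm labels are all assumptions.

```python
from typing import Dict, Optional, Tuple


def classify_scaffold(
    arm_scores: Dict[str, float],
    voting_fraction: float,
    min_voting_fraction: float = 0.10,
    dominance_ratio: float = 2.0,
) -> Tuple[str, Optional[Tuple[str, str]]]:
    """Return (arm, candidates) for one scaffold.

    arm is the assigned chromosome arm, or "U" if unclassified.
    candidates holds the two best-scoring arms when rule 3 applies,
    otherwise None.
    """
    # Rule 1: fewer than 10% of kmers produced a vote.
    if voting_fraction < min_voting_fraction or not arm_scores:
        return "U", None

    ranked = sorted(arm_scores.items(), key=lambda kv: kv[1], reverse=True)
    top_arm, top_score = ranked[0]
    second_arm, second_score = ranked[1] if len(ranked) > 1 else (None, 0.0)

    # Rule 2: top score is at least double the second top score.
    if top_score >= dominance_ratio * second_score:
        return top_arm, None

    # Rule 3: ambiguous; keep the two top arms for the sequence name.
    return "U", (top_arm, second_arm)


# Hypothetical arm labels and scores, for illustration only.
print(classify_scaffold({"1AL": 90.0, "1AS": 10.0}, voting_fraction=0.5))
print(classify_scaffold({"1AL": 50.0, "1AS": 40.0}, voting_fraction=0.5))
print(classify_scaffold({"1AL": 90.0}, voting_fraction=0.05))
```

Returning the two candidate arms separately mirrors the description of rule 3, where both top-scoring CSS sets are carried into the sequence name.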


Rather than using a simple length cutoff to include scaffolds in the final assembly, a content filter was applied to the scaffolds classified into each chromosome-arm bin in order to ensure short scaffolds containing unique content were not excluded from the assembly. Scaffolds were sorted by length, longest first. Scaffolds longer than 5 Kbp were automatically added to the assembly. Scaffolds between 500 bp and 5 Kbp were added, from longest to shortest, if at least 20% of the kmers in the scaffold were not already present in the assembly. Scaffolds shorter than 500 bp were excluded.
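
A minimal sketch of this kind of content filter is shown below. Several details are assumptions rather than facts from the text: the kmer size (k=31 here), the use of distinct kmers without reverse-complement canonicalisation, and reading the 20% rule as "at least 20% novel kmers". The function names are hypothetical.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct kmers in seq (no reverse complements)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def content_filter(
    scaffolds: Dict[str, str],
    k: int = 31,                      # assumed kmer size
    long_cutoff: int = 5000,          # longer scaffolds are always kept
    short_cutoff: int = 500,          # shorter scaffolds are always dropped
    min_novel_fraction: float = 0.20,
) -> List[str]:
    """Return names of scaffolds kept for one chromosome-arm bin."""
    seen: Set[str] = set()
    kept: List[str] = []
    # Process scaffolds longest first, as described in the text.
    for name, seq in sorted(scaffolds.items(), key=lambda kv: len(kv[1]), reverse=True):
        if len(seq) < short_cutoff:
            continue
        scaffold_kmers = kmers(seq, k)
        if len(seq) > long_cutoff:
            keep = True
        else:
            # Fraction of this scaffold's kmers not yet in the assembly.
            novel = len(scaffold_kmers - seen) / len(scaffold_kmers) if scaffold_kmers else 0.0
            keep = novel >= min_novel_fraction
        if keep:
            kept.append(name)
            seen |= scaffold_kmers
    return kept


# Toy example with tiny cutoffs so the behaviour is visible.
toy = {"a": "ACGTACGTTGCA", "b": "ACGTACG", "c": "GGGGCC", "d": "AAA"}
print(content_filter(toy, k=3, long_cutoff=10, short_cutoff=4))
```

In the toy run, "b" is dropped because all of its kmers already occur in the longer scaffold "a", while "c" is kept because its content is new; "d" falls below the minimum length.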

For assigned scaffolds, the arm assignment is included in the FASTA identifier. For unassigned scaffolds with more than 10% voting kmers, the highest and second-highest votes are included in the FASTA identifier to indicate possible arms.

Annotation

A high quality gene set for wheat was generated using a custom pipeline integrating wheat-specific transcriptomic data, protein similarity, and evidence-guided gene predictions generated with AUGUSTUS [1].

RNA-Seq data from three datasets were used: ERP004714, ERP004505 and a dataset of 250 bp paired-end strand-specific reads from six different tissues. In total, the three datasets comprised over 3.2 billion paired-end reads. RNA-Seq reads were assembled using four alternative assembly methods (Cufflinks [2], StringTie [3], CLASS [4] and Trinity [5]) and integrated together with PacBio transcripts into a coherent and non-redundant set of models using Mikado [6]. In the second phase, PacBio reads were classified based on protein similarity, and a subset of high-quality (e.g. full-length, canonically spliced, non-redundant) transcripts was used to train an AUGUSTUS wheat-specific gene prediction model. AUGUSTUS was then used to generate a first draft of the genome annotation, using as input Mikado-filtered transcript models, reliable junctions identified with Portcullis [7], and peptide alignments of proteins from five species closely related to wheat (Brachypodium distachyon, maize, rice, Sorghum bicolor and Setaria italica). In the fourth stage, this draft annotation was refined and polished by identifying and correcting probable gene fusions, missing loci and alternative splice variants. Finally, the polished annotation was functionally annotated and all loci were assigned a confidence rank based on their similarity to known proteins and their agreement with wheat transcriptomic data.

A total of 217,907 loci and 273,739 transcripts were identified. Of these loci, 104,091 were classified as high-confidence protein-coding genes (154,798 transcripts) and 10,156 as high-confidence long ncRNAs. The high-confidence protein-coding set contained 51,851 genes confirmed by a PacBio transcript and an additional 29,996 genes fully supported by assembled RNA-Seq data. Taken together, transcriptome evidence fully supported the annotated gene structures of 81,847 (78.63%) high-confidence genes. Another 103,660 loci were defined as low confidence because similarity searches indicated a potential gene fragment or pseudogene, because they were classified as non-homology supported or repeat associated, or because they had only partial or no transcriptome data to support the annotation.
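
The counts in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below reads the 104,091 coding loci as the high-confidence protein-coding set (the reading under which the quoted 78.63% holds); the variable names are illustrative only.

```python
# Counts quoted in the paragraph above.
total_loci = 217_907
hc_coding = 104_091        # high-confidence protein-coding loci
hc_lncrna = 10_156         # high-confidence long ncRNA loci
low_confidence = 103_660   # low-confidence loci

pacbio_confirmed = 51_851  # confirmed by a PacBio transcript
rnaseq_supported = 29_996  # fully supported by assembled RNA-Seq data

# The three confidence classes should partition the full locus set.
classified = hc_coding + hc_lncrna + low_confidence

# Genes fully supported by transcriptome evidence, and their share
# of the high-confidence protein-coding set.
fully_supported = pacbio_confirmed + rnaseq_supported
supported_pct = 100 * fully_supported / hc_coding

print(classified == total_loci)
print(fully_supported, f"{supported_pct:.2f}%")
```

Both checks agree with the figures in the text: the three classes sum to the total number of loci, and the supported fraction rounds to 78.63%.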

Alongside the new gene annotation, a total of 99,000 genes (99% of the total) annotated on the previous IWGSC CSS [15] assembly (MIPS) have been mapped to the new assembly.

Variation

Markers from two SNP arrays, provided by CerealsDB [14] at the University of Bristol, have been mapped to the TGACv1 reference assembly using an ungapped alignment model. Markers that did not align uniquely with 100% identity have been discarded.

  • The Axiom 820K SNP Array (contains ~820,000 SNP markers, of which 504,092 have been mapped).
  • The Axiom 35K SNP Array (contains 35,000 SNP markers, a subset of the 820K set, of which 21,423 have been mapped).
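
The "unique, 100% identity" filter can be illustrated with a short sketch. It assumes alignment results are already available as simple records and interprets the rule as "keep a marker only if it has exactly one perfect-identity hit"; the record layout, function name and example identifiers are hypothetical and do not reflect the aligner or pipeline actually used.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    marker: str      # marker identifier
    scaffold: str    # target scaffold name
    position: int    # alignment start on the scaffold
    identity: float  # percent identity of the ungapped alignment


def uniquely_mapped(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep markers that have exactly one hit at 100% identity."""
    perfect: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        if hit.identity == 100.0:
            perfect[hit.marker].append(hit)
    # Markers with zero or several perfect hits are discarded.
    return {marker: hs[0] for marker, hs in perfect.items() if len(hs) == 1}


# Made-up hits for illustration.
example_hits = [
    Hit("m1", "scaffold_A", 10, 100.0),   # unique perfect hit: kept
    Hit("m2", "scaffold_A", 20, 100.0),   # two perfect hits: discarded
    Hit("m2", "scaffold_B", 5, 100.0),
    Hit("m3", "scaffold_C", 7, 98.0),     # imperfect hit: discarded
]
print(sorted(uniquely_mapped(example_hits)))
```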

References

  1. Gene prediction with a hidden Markov model and a new intron submodel.
    Stanke M, Waack S. 2003. Bioinformatics. 19
  2. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.
    Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. 2010. Nat. Biotechnol. 28:511-515.
  3. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads.
    Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT et al.
  4. CLASS2: accurate and efficient splice variant annotation from RNA-seq reads.
    Song L, Sabunciyan S, Florea L. 2016. Nucleic Acids Research. 44
  5. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis.
    Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M et al. 2013. Nat Protoc. 8:1494-1512.
  6. Mikado: leveraging multiple transcriptome assembly methods for improved gene structure annotation (manuscript in preparation).
  7. Portcullis - Fast, reliable and accurate splice junction prediction from RNAseq data (manuscript in preparation).
  8. Comprehensive variation discovery in single human genomes.
    Weisenfeld NI, Yin S, Sharpe T, Lau B, Hegarty R, Holmes L, Sogoloff B, Tabbaa D, Williams L, Russ C et al. 2014. Nat. Genet. 46:1350-1355.
  9. Generating robust assemblies using w2rap (manuscript in preparation).
  10. NextClip: an analysis and read preparation tool for Nextera Long Mate Pair libraries.
    Leggett RM, Clavijo BJ, Clissold L, Clark MD, Caccamo M. 2014. Bioinformatics. 30:566-568.
  11. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.
    Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y et al. 2012. Gigascience. 1:18.
  12. A chromosome-based draft sequence of the hexaploid bread wheat (Triticum aestivum) genome.
    The International Wheat Genome Sequencing Consortium (IWGSC). 2014. Science. 345:1251788.
  13. KAT: A K-mer Analysis Toolkit to quality control NGS datasets and genome assemblies.
    Mapleson D et al.
  14. CerealsDB 2.0: an integrated resource for plant breeders and scientists.
    Wilkinson PA, Winfield MO, Barker GL, Allen AM, Burridge A, Coghill JA, Edwards KJ. 2012. BMC Bioinformatics. 13:219.
  15. A chromosome-based draft sequence of the hexaploid bread wheat (Triticum aestivum) genome.
    The International Wheat Genome Sequencing Consortium (IWGSC). 2014. Science. 345:1251788.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly: TGACv1, INSDC Assembly GCA_900067645, Dec 2015
Database version: 85.3
Base Pairs: 13,427,354,022
Golden Path Length: 13,427,354,022
Genebuild by: TGACv1
Genebuild method: Imported from TGAC
Data source: TGAC

Gene counts

Coding genes: 103,539
Non coding genes: 22,043
Small non coding genes: 22,043
Gene transcripts: 273,739
