Triticum aestivum Assembly and Gene Annotation
About Triticum aestivum
Triticum aestivum (bread wheat) is a major global cereal grain essential to human nutrition. Wheat was one of the first cereals to be domesticated, originating in the fertile crescent around 7000 years ago. Bread wheat is hexaploid, with a genome size estimated at ~17 Gbp, composed of three closely-related and independently maintained genomes that are the result of a series of naturally occurring hybridization events. The ancestral progenitor genomes are considered to be Triticum urartu (the A-genome donor) and an unknown grass thought to be related to Aegilops speltoides (the B-genome donor). This first hybridization event produced tetraploid emmer wheat (AABB, T. dicoccoides) which hybridized again with Aegilops tauschii (the D-genome donor) to produce modern bread wheat.
This assembly (TGACv1) and its annotation were generated by the Earlham Institute, formerly The Genome Analysis Centre (TGAC).
Paired-end (PE) reads of 250 bp were generated from two CS42 libraries constructed using a PCR-free protocol. In total, 1.1 billion PE reads were generated, providing 32.78x coverage of the CS42 genome. The w2rap-contigger (based on DISCOVAR de novo) was used to assemble contigs. The contigger is available on GitHub and is fully described elsewhere. It uses PCR-free libraries to reduce coverage bias, takes advantage of long 250 bp Illumina reads, and retains the majority of the variation present in the reads when generating contigs. Multiple Nextera long mate-pair (LMP) libraries were generated for scaffolding, with insert sizes ranging from 2 to 12 Kb. LMP reads were processed using NextClip and contigs were scaffolded using the SOAPdenovo2 prepare -> map -> scaffold pipeline.
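As a rough sanity check on the quoted coverage, the arithmetic can be sketched as below. It assumes that "1.1 billion PE reads" counts read pairs and uses the rounded ~17 Gbp genome size estimate, so it only approximates the reported 32.78x, which was presumably computed from unrounded figures. The function name is hypothetical.

```python
def fold_coverage(read_pairs: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Sequenced bases divided by genome size (each pair contributes two reads)."""
    total_bases = read_pairs * 2 * read_length_bp
    return total_bases / genome_size_bp

# Rounded inputs taken from the text; both are approximations.
coverage = fold_coverage(
    read_pairs=1_100_000_000,        # assumed to be pairs, not single reads
    read_length_bp=250,
    genome_size_bp=17_000_000_000,   # ~17 Gbp estimate
)
print(f"approximate coverage: {coverage:.1f}x")
```

With these rounded inputs the estimate lands close to, but not exactly on, the reported value.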
Scaffolds were classified into chromosome-arm bins using arm-specific Chromosome Survey Sequence (CSS) reads. Scaffolds from 3B were not separated into short/long arm bins, as individual arm datasets were not generated for this chromosome in the CSS project. The ‘sect’ method of KAT was used to compute kmer coverage over each scaffold using each CSS read set. Each non-repetitive kmer in a scaffold was scored proportionally to coverage on each CSS arm and scaffolds were classified using the following set of rules:
- Scaffolds with less than 10% of the kmers producing a vote were left as unclassified (marked as Chromosome arm “U”). These are mostly small and/or repetitive sequences.
- Scaffolds with a top score towards a CSS set at least double the second top score were classified to the highest scoring chromosome arm.
- Scaffolds with a top score towards a CSS set less than double the second top score were left as unclassified (marked as Chromosome arm “U”, but with the two top scores and CSS sets included in the sequence name). This category contains scaffolds that are classified as combinations of the two arms from the same chromosome, probably due to imprecise identification during flow-sorting. It also contains scaffolds from regions of the genome with specific flow-sorting biases, and assembly chimeras, which will all be investigated further.
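The three rules above can be sketched as a small decision function. This is only an illustration of the stated thresholds, not the pipeline's actual code: the function name, the input layout (a dict of per-arm scores plus the fraction of voting kmers) and the returned tuple are all assumptions.

```python
def classify_scaffold(arm_scores, voting_fraction, min_voting=0.10, min_ratio=2.0):
    """Return (label, candidates) for one scaffold.

    arm_scores: dict mapping CSS arm name -> score for this scaffold.
    voting_fraction: fraction of the scaffold's kmers that produced a vote.
    label is an arm name or "U"; candidates lists the two top (arm, score)
    pairs when the call is ambiguous, otherwise it is empty.
    """
    # Rule 1: too few voting kmers, leave unclassified.
    if voting_fraction < min_voting or not arm_scores:
        return "U", []

    ranked = sorted(arm_scores.items(), key=lambda item: item[1], reverse=True)
    top_arm, top_score = ranked[0]
    second_score = ranked[1][1] if len(ranked) > 1 else 0.0

    # Rule 2: top score at least double the runner-up, assign to that arm.
    if top_score > 0 and top_score >= min_ratio * second_score:
        return top_arm, []

    # Rule 3: ambiguous, keep "U" but report the two best candidates.
    return "U", ranked[:2]


print(classify_scaffold({"3DL": 80, "3DS": 10}, voting_fraction=0.5))
print(classify_scaffold({"3DL": 50, "3DS": 40}, voting_fraction=0.5))
print(classify_scaffold({"3DL": 80, "3DS": 10}, voting_fraction=0.05))
```

Returning the two top candidates mirrors the description that ambiguous scaffolds carry both scores and CSS sets in their sequence name. The exact naming format is not reproduced here.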
Rather than using a simple length cutoff to include scaffolds in the final assembly, a content filter was applied to the scaffolds classified into each chromosome-arm bin, to ensure that short scaffolds containing unique content were not excluded from the assembly. Scaffolds were sorted by length, longest first. Scaffolds longer than 5 Kbp were automatically added to the assembly. Scaffolds between 500 bp and 5 Kbp were added, from longest to shortest, if at least 20% of the kmers in the scaffold were not already present in the assembly. Scaffolds shorter than 500 bp were excluded.
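A minimal sketch of such a content filter follows. It assumes plain (non-canonical) kmers with k=31 and measures novelty over the distinct kmers of each scaffold. The kmer size and the handling of reverse complements are not given in the text, so both are assumptions, as are the function names.

```python
def kmer_set(seq, k):
    """All distinct kmers of length k in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def content_filter(scaffolds, k=31, auto_len=5000, min_len=500, min_novel=0.20):
    """Select scaffolds longest-first, keeping short ones only if novel enough."""
    seen = set()   # kmers already present in the growing assembly
    kept = []
    for seq in sorted(scaffolds, key=len, reverse=True):
        if len(seq) < min_len:
            break  # sorted by length, so everything after this is also too short
        kmers = kmer_set(seq, k)
        if len(seq) > auto_len:
            keep = True  # long scaffolds are always included
        else:
            novel = len(kmers - seen) / len(kmers) if kmers else 0.0
            keep = novel >= min_novel
        if keep:
            kept.append(seq)
            seen |= kmers
    return kept


# Toy example with tiny thresholds so the behaviour is visible.
toy = ["ACGTACGTACGTAC", "ACGTACG", "TTTTGG", "AC"]
print(content_filter(toy, k=3, auto_len=10, min_len=4))
```

In the toy run, the second scaffold is dropped because all of its kmers already occur in the first, while the third is kept because its content is new. A real implementation over a 13 Gbp assembly would need a compact kmer index rather than a Python set.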
For assigned scaffolds, the arm assignment is included in the FASTA identifier. For unassigned scaffolds with more than 10% voting kmers, the highest and second highest vote is included in the FASTA identifier to indicate possible arms.
A high quality gene set for wheat was generated using a custom pipeline integrating wheat-specific transcriptomic data, protein similarity, and evidence-guided gene predictions generated with AUGUSTUS.
RNA-Seq data from three datasets were used: ERP004714, ERP004505 and a dataset of 250 bp paired-end strand-specific reads from six different tissues. In total, the three datasets comprised over 3.2 billion paired-end reads. In the first phase, RNA-Seq reads were assembled using four alternative assembly methods (Cufflinks, StringTie, CLASS and Trinity) and integrated together with PacBio transcripts into a coherent and non-redundant set of models using Mikado. In the second phase, PacBio reads were classified based on protein similarity, and a subset of high quality (e.g. full length, canonical splicing, non-redundant) transcripts was used to train an AUGUSTUS wheat-specific gene prediction model. In the third phase, AUGUSTUS was used to generate a first draft of the genome annotation, using as input Mikado-filtered transcript models, reliable junctions identified with Portcullis, and peptide alignments of proteins from five species closely related to wheat (Brachypodium distachyon, maize, rice, Sorghum bicolor and Setaria italica). In the fourth phase, this draft annotation was refined and polished by identifying and correcting probable gene fusions, missing loci and alternative splice variants. Finally, the polished annotation was functionally annotated and all loci were assigned a confidence rank based on their similarity to known proteins and their agreement with wheat transcriptomic data.
A total of 217,907 loci and 273,739 transcripts were identified. Of these, 104,091 protein-coding genes (154,798 transcripts) and 10,156 long ncRNAs were classified as high confidence. The high confidence protein-coding set contained 51,851 genes confirmed by a PacBio transcript and an additional 29,996 genes fully supported by assembled RNA-Seq data. Taken together, transcriptome evidence fully supported the annotated gene structures of 81,847 (78.63%) high confidence protein-coding genes. Another 103,660 loci were defined as low confidence because similarity searches indicated a potential gene fragment or pseudogene, because they were classified as non-homology supported or repeat associated, or because they had partial or no transcriptome data to support the annotation.
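The counts in this paragraph are internally consistent, which can be checked with a few lines of arithmetic. The variable names are illustrative. Treating the three categories as a complete partition of all loci is an inference from the fact that they sum exactly to the total, not something the text states.

```python
total_loci = 217_907
hc_coding_genes = 104_091     # high confidence protein-coding genes
hc_long_ncrnas = 10_156       # high confidence long ncRNAs
low_confidence_loci = 103_660

pacbio_confirmed = 51_851     # genes confirmed by a PacBio transcript
rnaseq_supported = 29_996     # genes fully supported by assembled RNA-Seq

# The three categories add up to the reported number of loci.
assert hc_coding_genes + hc_long_ncrnas + low_confidence_loci == total_loci

fully_supported = pacbio_confirmed + rnaseq_supported
share = 100 * fully_supported / hc_coding_genes
print(f"{fully_supported:,} genes fully supported ({share:.2f}% of high confidence coding genes)")
```

The percentage is therefore relative to the 104,091 high confidence protein-coding genes, not to all loci.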
Alongside the new gene annotation, a total of 99,000 genes (99% of the total) annotated on the previous IWGSC CSS assembly (MIPS) have been mapped to the TGACv1 wheat assembly.
Transcriptome assembly in diploid einkorn wheat Triticum monococcum 
Genome-wide transcriptomes were constructed for two Triticum monococcum subspecies, the wild winter wheat T. monococcum ssp. aegilopoides (accession G3116) and the domesticated spring wheat T. monococcum ssp. monococcum (accession DV92), by generating de novo assemblies of RNA-Seq data derived from both etiolated and green seedlings. Assembled data are available from the Jaiswal lab and raw reads are available from INSDC projects PRJNA203221 and PRJNA195398.
The de novo transcriptome assemblies of DV92 and G3116 represent 120,911 and 117,969 transcripts, respectively. They were mapped to the TGACv1 wheat assembly using BWA.
T. urartu and T. turgidum ORFs predicted by "findorf"
Triticum urartu and Triticum turgidum ORFs were predicted using the findorf program as part of a study on separating homeologs by phasing in the tetraploid wheat transcriptome.
34,277 (out of 37,806) T. urartu ORFs and 60,416 (out of 66,633) T. turgidum ORFs were aligned to the TGACv1 wheat assembly using BWA.
RNA-Seq track hub examples
Over 50 RNA-Seq alignment studies can be found in the track hub registry and attached from there. We have attached an example from a study of flowering in wheat.
It can be displayed under the "mRNA and protein alignments" track in the location panel.
Data from CerealsDB 
Markers from three SNP arrays have been provided by CerealsDB and mapped to the reference assembly. Markers that did not align uniquely with 100% identity have been discarded.
- The Axiom 820K SNP Array (contains ~820,000 SNP markers of which 504,092 have been mapped).
- The Axiom 35K SNP Array (contains 35,000 SNP markers - a subset of the 820k set - of which 21,423 have been mapped). Includes genotype data from 590 sample genotypes.
- The KASP array (contains 960 SNP markers of which 486 have been mapped). Includes genotype data from 169 genotypes.
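The "unique, 100% identity" criterion can be illustrated with a small filter over alignment hits. This is a sketch under one reading of the rule, namely that a marker is kept only if it has exactly one alignment at 100% identity. The input layout (tuples of marker ID, location and percent identity) and the function name are hypothetical, and the marker IDs are made up.

```python
from collections import defaultdict


def unique_perfect_markers(hits, identity_required=100.0):
    """Map each retained marker ID to its single perfect-identity location.

    hits: iterable of (marker_id, location, percent_identity) tuples.
    """
    perfect = defaultdict(list)
    for marker_id, location, identity in hits:
        if identity >= identity_required:
            perfect[marker_id].append(location)
    # Keep only markers with exactly one perfect alignment.
    return {m: locs[0] for m, locs in perfect.items() if len(locs) == 1}


example_hits = [
    ("marker_a", "scaffold_1:1200", 100.0),   # unique and perfect: kept
    ("marker_b", "scaffold_2:88", 100.0),     # two perfect hits: discarded
    ("marker_b", "scaffold_9:431", 100.0),
    ("marker_c", "scaffold_4:57", 98.5),      # imperfect: discarded
]
print(unique_perfect_markers(example_hits))
```

A stricter reading, which would also discard markers that have any additional imperfect alignment, would need a count of all hits per marker rather than only the perfect ones.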
EMS Mutation data from Earlham 
EMS-type variants from sequenced tetraploid (cv ‘Kronos’) and hexaploid (cv ‘Cadenza’) TILLING populations. Mutations were originally called in the IWGSC CSS scaffolds as described in Krasileva et al. Here the high confidence (HetMC5HomMC3) mutations have been projected onto the TGAC Chinese Spring scaffolds. Please note that the raw data have not been reanalysed with the TGAC scaffolds and gene models, so there may be inconsistencies. Because the mutations were projected rather than re-analysed, only a fraction of the SNPs were transferred.
- 2.8 million Kronos mutations projected from a total of 4.2 million
- 4.6 million Cadenza mutations projected from a total of 6.4 million
Researchers and breeders can search this database online, identify mutations in the different copies of their target gene, and request seeds to study gene function or improve wheat varieties. Seeds can be requested from the UK SeedStor (https://www.seedstor.ac.uk/shopping-cart-tilling.php) or from the US-based Dubcovsky lab (http://dubcovskylab.ucdavis.edu/wheat-tilling).
This resource was generated as part of a joint project between the University of California Davis, Rothamsted Research, The Earlham Institute, and the John Innes Centre.
HapMap markers 
A sample of 62 diverse lines was re-sequenced using the whole exome capture (WEC) and genotyping-by-sequencing (GBS) approaches.
- 160,256 GBS SNP markers projected from a total of 210,564
- 1,061,989 WEC SNP markers projected from a total of 1,341,352
SIFT predicts whether an amino acid substitution affects protein function based on sequence homology and the physical properties of amino acids. SIFT can be applied to naturally occurring nonsynonymous polymorphisms and laboratory-induced missense mutations.
SIFT scores and predictions (whether it is 'tolerated' or 'deleterious') have been calculated for all missense variants across all bread wheat variation datasets. See SIFT predictions for missense variants present in TRIAE_CS42_3DL_TGACv1_249024_AA0834820, as an example.
- Gene prediction with a hidden Markov model and a new intron submodel.
Stanke M, Waack S. 2003. Bioinformatics. 19
- Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. 2010. Nat. Biotechnol. 28:511-515.
- StringTie enables improved reconstruction of a transcriptome from RNA-seq reads.
Pertea, M., Pertea, G. M., Antonescu, C. M., Chang, T.-C., Mendell, J. T. et al.
- CLASS2: accurate and efficient splice variant annotation from RNA-seq reads.
Song, L., Sabunciyan, S., & Florea, L. 2016. Nucleic Acids Research. 44
- De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis.
Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M et al. 2013. Nat Protoc. 8:1494-1512.
- Mikado: leveraging multiple transcriptome assembly methods for improved gene structure annotation (manuscript in preparation).
- Portcullis - Fast, reliable and accurate splice junction prediction from RNAseq data (manuscript in preparation).
- Comprehensive variation discovery in single human genomes.
Weisenfeld NI, Yin S, Sharpe T, Lau B, Hegarty R, Holmes L, Sogoloff B, Tabbaa D, Williams L, Russ C et al. 2014. Nat. Genet. 46:1350-1355.
- Generating robust assemblies using w2rap (manuscript in preparation).
- NextClip: an analysis and read preparation tool for Nextera Long Mate Pair libraries.
Leggett RM, Clavijo BJ, Clissold L, Clark MD, Caccamo M. 2014. Bioinformatics. 30:566-568.
- SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.
Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y et al. 2012. Gigascience. 1:18.
- A chromosome-based draft sequence of the hexaploid bread wheat (Triticum aestivum) genome.
The International Wheat Genome Sequencing Consortium (IWGSC). 2014. Science. 345:1251788.
- KAT: A K-mer Analysis Toolkit to quality control NGS datasets and genome assemblies.
D. Mapleson et al.
- CerealsDB 2.0: an integrated resource for plant breeders and scientists.
Wilkinson PA, Winfield MO, Barker GL, Allen AM, Burridge A, Coghill JA, Edwards KJ. 2012. BMC Bioinformatics. 13:219.
- An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations.
Christian Schudoma, Gonzalo Garcia Accinelli, Gemy Kaithakottil, Jonathan Wright, Philippa Borrill, George Kettleborough, Darren Heavens, Helen Chapman, James Lipcombe, Tom Barker et al. 2017. Genome Research.
- Uncovering hidden variation in polyploid wheat.
Ksenia V. Krasileva, Hans A. Vasquez-Gross, Tyson Howell, Paul Bailey, Francine Paraiso, Leah Clissold, James Simmonds, Ricardo H. Ramirez-Gonzalez et al. 2016. PNAS. 114:E913-E921.
- A haplotype map of allohexaploid wheat reveals distinct patterns of selection on homoeologous genomes.
Katherine W Jordan, Shichen Wang, Yanni Lun, Laura-Jayne Gardiner, Ron MacLachlan, Pierre Hucl, Krysta Wiebe, Debbie Wong, Kerrie L Forrest, IWGS Consortium et al. 2015. Genome Biology.
- De Novo Transcriptome Assembly and Analyses of Gene Expression during Photomorphogenesis in Diploid Wheat Triticum monococcum.
Samuel E. Fox, Matthew Geniza, Mamatha Hanumappa et al. 2014. PLOS ONE. 9(8)
- Separating homeologs by phasing in the tetraploid wheat transcriptome.
Ksenia V Krasileva, Vince Buffalo, Paul Bailey, Stephen Pearce, Sarah Ayling, Facundo Tabbita, Marcelo Soria, Shichen Wang, IWGS Consortium, Eduard Akhunov et al. 2013. Genome Biology. 14
General information about this species can be found in Wikipedia.
| Assembly | TGACv1, INSDC Assembly GCA_900067645.1, Dec 2015 |
| Golden Path Length | 13,427,354,022 bp |
| Genebuild method | Imported from TGAC |
| Data source | The Earlham Institute |
| Non coding genes | 10,156 |
| Small non coding genes | 10,156 |