Aegilops umbellulata (TA1851)

Aegilops umbellulata Assembly and Gene Annotation

About Aegilops umbellulata

Aegilops umbellulata (2n = 2x = 14, UU genome) is the only diploid Aegilops species containing the U genome. Compared to the bread wheat A, B and D genomes, the U genome contains several large chromosome rearrangements and is a source of disease resistance genes that have been transferred into wheat. The genus Aegilops contains several grass species, commonly referred to as goatgrass. The genus comprises at least 23 diploid and polyploid species and six different genomes (C, D, M, N, S, and U). Aegilops species belong to the same tribe as the major cereal crops bread wheat (Triticum aestivum, 2n = 6x = 42; AABBDD genome), durum wheat (Triticum durum, 2n = 4x = 28; AABB genome) and barley (Hordeum vulgare, 2n = 2x = 14). The genus has thus been explored to increase genetic diversity of wheat via wide hybridization and chromosome recombination.

Assembly

The Aegilops umbellulata assembly TA1851 was created by KAUST (King Abdullah University of Science and Technology). This chromosome-scale assembly was sequenced using PacBio Sequel, and the HiFi reads were then assembled using La Jolla Assembler (LJA) and DeepConsensus (https://www.nature.com/articles/s41587-022-01435-7). The genome assembly resulted in seven pseudomolecules and one unanchored chromosome, with a total size of 4.25 Gb and a contig N50 of 17.7 Mb.
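For readers unfamiliar with the contig N50 statistic quoted above, the following is a minimal sketch of how it is commonly computed: contigs are sorted from longest to shortest, and the N50 is the length of the contig at which the running total first reaches half of the assembly size. The function name and the toy contig lengths are illustrative assumptions, not values from the TA1851 assembly.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    The N50 is the length L such that contigs of length >= L
    together cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk contigs from longest to shortest, accumulating their lengths.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (hypothetical, not taken from TA1851).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)
print(toy_n50)
```

With real data, the lengths would typically be read from the contig-level FASTA index rather than typed in by hand.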

Annotation

Gene model prediction was performed by combining a lifting approach using Liftoff (v1.6.3) and a genome-guided approach using transcriptomics data with HISAT2 (v2.2.1), StringTie (v2.1.7) and TransDecoder (v5.7.0). Post-processing and filtering of GFF3 files were performed using AGAT and gffread (v0.11.7).

For the gene lifting, gene models of hexaploid wheat line Chinese Spring, Ae. tauschii, and Triticum monococcum accession TA29932 were independently transferred using Liftoff. For the genome-guided approach, publicly available RNA-Seq data of 12 representative Ae. umbellulata accessions and the RNA-Seq data of two bulks representing Ae. umbellulata leaf tissues were used. Each RNA-Seq data set was mapped individually against the reference sequence using HISAT2; the transcripts were then assembled using StringTie and merged into a single GTF file. The TransDecoder.LongOrfs script was used to identify open reading frames (ORFs) of at least 100 amino acids from the merged GTF file. The predicted protein sequences were compared to the UniProt and Pfam databases using BLASTP and HMMER3. The TransDecoder.Predict script was used with the BLASTP and HMMER results to select the best translation per transcript. Finally, the annotation GFF3 file was computed using the Perl script cdna_alignment_orf_to_genome_orf.pl.
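To illustrate what the 100-amino-acid ORF threshold means in practice, here is a minimal sketch that scans the three forward reading frames of a transcript for ATG-to-stop ORFs and keeps those encoding at least 100 residues. It is not TransDecoder's implementation: the function name, the forward-strand-only scan, and the toy sequence are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def long_orfs(seq, min_aa=100):
    """Return (start, end, n_aa) for forward-strand ORFs with >= min_aa residues.

    start/end are 0-based, end-exclusive, and include the stop codon;
    n_aa counts the encoded residues (start methionine included, stop excluded).
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3
                if n_aa >= min_aa:
                    found.append((start, i + 3, n_aa))
                start = None  # close the ORF and look for the next ATG
    return found


# Toy transcript (hypothetical): one 121-residue ORF followed by an 11-residue ORF.
toy_transcript = "CC" + "ATG" + "GCT" * 120 + "TAA" + "ATG" + "GCT" * 10 + "TGA"
toy_orfs = long_orfs(toy_transcript)
print(toy_orfs)
```

Only the longer ORF passes the default threshold; lowering min_aa would also report the short one.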

All the output GFF files from the lifting and genome-guided approaches were merged and filtered to retain transcripts with start and stop codons. In total, 36,268 gene models were predicted. Putative functional annotations were assigned by comparing the predicted proteins against the UniProt database using DIAMOND. Pfam domain signatures and GO terms were assigned using InterProScan.
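The start/stop codon filter described above can be sketched as follows. This is a hedged illustration working directly on coding sequences held in a dictionary; the actual filtering was done on GFF3 files with AGAT and gffread, whose options are not reproduced here. The function names, the in-frame length check, and the toy transcripts are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_start_and_stop(cds):
    """True if the CDS begins with ATG and ends with a stop codon.

    The multiple-of-three length check is an extra assumption added here
    so that the start and stop codons are in the same reading frame.
    """
    cds = cds.upper()
    return (
        len(cds) >= 6
        and len(cds) % 3 == 0
        and cds.startswith("ATG")
        and cds[-3:] in STOP_CODONS
    )


def filter_complete(transcripts):
    """Keep only transcripts whose CDS has both a start and a stop codon."""
    return {tid: seq for tid, seq in transcripts.items() if has_start_and_stop(seq)}


# Toy coding sequences (hypothetical identifiers and sequences).
toy_cds = {
    "tx1": "ATGGCCTAA",  # complete
    "tx2": "GCCGCCTAA",  # missing start codon
    "tx3": "ATGGCCGCC",  # missing stop codon
    "tx4": "atggcttga",  # complete, lower case
}
complete = filter_complete(toy_cds)
print(sorted(complete))
```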

Repeated sequences were called with the Repeat Detector, which is part of the Ensembl Genomes repeat feature pipelines. Repeat length: 3,514,124,750 bp. Repeat content: 82.7%.
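As a quick consistency check, the repeat content can be recomputed from the repeat length and the Golden Path Length listed in the Statistics summary. The sketch below assumes that the reported percentage is simply the ratio of these two numbers; the variable names are illustrative.

```python
# Values as reported on this page.
repeat_length_bp = 3_514_124_750  # repeat length
golden_path_bp = 4_246_573_824    # Golden Path Length from the Statistics summary

# Assumption: repeat content = repeat length / golden path length.
repeat_fraction = repeat_length_bp / golden_path_bp
print(f"Repeat content: {repeat_fraction:.2%}")
```

The ratio comes out at roughly 82.75%, which agrees with the reported 82.7% up to rounding or truncation.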

References

Image credit: Abrouk et al. (see reference above)

Statistics

Summary

Assembly: TA1851, INSDC Assembly GCA_032464435.1
Database version: 112.1
Golden Path Length: 4,246,573,824
Genebuild by: KAUST
Genebuild method: Import
Data source: KAUST

Gene counts

Coding genes: 36,135
Non coding genes: 133
Small non coding genes: 133
Gene transcripts: 154,367