Zea mays Assembly and Gene Annotation

About Zea mays

Maize (Zea mays) had the highest worldwide production of all grain crops in 2019. Although it is a food staple in many regions of the world, most of the crop is used for animal feed and ethanol fuel. Maize was domesticated from wild teosinte in Central America, and its cultivation was spread throughout the Americas by pre-Columbian civilisations. In addition to its economic value, maize is an important model organism for studies in plant genetics, physiology, and development. It has a large genome of about 2.4 gigabases with a haploid chromosome number of 10 (2n=2x=20) [2-4]. Maize is distinguished from other grasses in that its genome arose from an ancient tetraploidy event unique to its lineage. This sequence corresponds to inbred line B73.

Assembly

The B73 genome was sequenced to high depth using PacBio long-read technology together with 25 founder inbred lines strategically selected to represent the breadth of maize diversity including lines from temperate, tropical, sweet corn, and popcorn germplasm. All lines were assembled into contigs using a hybrid approach, corrected with long-read and Illumina short-read data, scaffolded using Bionano optical maps and ordered into pseudomolecules using linkage data from the NAM recombinant inbred lines and maize pan-genome anchor markers. The 26 lines were also annotated with a common pipeline.

The current assembly is named Zm-B73-REFERENCE-NAM-5.0. It was aligned to the previous B73_RefGen_v4 version using the ATAC/A2Amapper software [5], with 96.76% of positions mapped. These mappings can be used to translate coordinates between the two assemblies with the Assembly Converter.
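As a rough illustration of how a set of aligned blocks can be used to translate coordinates between two assembly versions, the sketch below looks up the block containing a position and applies its offset. The `Block` layout, the `translate` function and the toy coordinates are all hypothetical; this is not the ATAC/A2Amapper output format or the Assembly Converter implementation.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """One ungapped aligned block (hypothetical layout, 1-based inclusive)."""
    src_start: int  # first position of the block on the old assembly
    src_end: int    # last position of the block on the old assembly
    dst_start: int  # lowest position of the block on the new assembly
    strand: int     # +1 same orientation, -1 inverted


def translate(blocks: List[Block], pos: int) -> Optional[int]:
    """Map a position on the old assembly to the new one.

    `blocks` must be sorted by src_start and must not overlap.
    Returns None when the position lies outside every block (unmapped).
    """
    starts = [b.src_start for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0 or pos > blocks[i].src_end:
        return None
    b = blocks[i]
    if b.strand == 1:
        return b.dst_start + (pos - b.src_start)
    # Inverted block: the end of the source block maps to dst_start.
    return b.dst_start + (b.src_end - pos)


# Toy blocks on a single chromosome (made-up numbers, for illustration only).
BLOCKS = [
    Block(src_start=100, src_end=199, dst_start=1100, strand=1),
    Block(src_start=300, src_end=349, dst_start=2000, strand=-1),
]

if __name__ == "__main__":
    for p in (150, 250, 300):
        print(p, "->", translate(BLOCKS, p))
```

Positions that fall in gaps between blocks return `None`, which corresponds to the fraction of positions that could not be mapped.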

Annotation

The gene annotations were produced with the CSHL gene pipeline developed under the NAM project. In summary, it is an automated, evidence-based method combining third-party software including Mikado, BRAKER and PASA. Gene models were filtered by conservation and Maker Annotation Edit Distance (AED) score, and then classified into protein_coding and misc_non-coding sets.

Gene models from version B73_RefGen_v4 have been mapped to the current Zm-B73-REFERENCE-NAM-5.0 and can be retrieved with the ID History Converter.

Evidence-based predictions were directly inferred from assembled transcripts, which were generated using five different genome-guided transcript assembly programs and processed using Mikado to pick the optimal set of transcripts for each locus. To generate assembled transcripts, quality-inspected RNA-seq reads were mapped to the genome. To pick the final transcripts, Mikado takes as input the assembled transcripts, high-confidence splice junctions derived from the mapped reads, ORFs predicted for the assembled transcripts, and homology results of the transcripts against SwissProt (Viridiplantae) sequences.

Ab initio predictions were performed using BRAKER with both evidence-based predicted proteins and mapped RNA-seq reads as input.

A working set (WS) of models was generated to capture the complete gene space by combining evidence-based gene models with non-overlapping BRAKER gene models using BEDtools. Additional structural improvements to the WS models were made using the PASA software. Transposable element-related genes were filtered from the evidence-based and non-overlapping BRAKER sets using the TEsorter tool, which uses the REXdb database of TEs.
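The combination step can be pictured as keeping every evidence-based model and adding only those BRAKER models that overlap none of them. The following is a minimal pure-Python sketch of that idea, not the BEDtools invocation used by the pipeline (which is not given here); it is comparable in spirit to `bedtools intersect -v`. The tuple layout, function names and toy coordinates are assumptions, and strand is ignored for simplicity.

```python
from typing import List, Tuple

# Hypothetical model record: (chromosome, start, end, model_id), 1-based inclusive.
Model = Tuple[str, int, int, str]


def overlaps(a: Model, b: Model) -> bool:
    """True if two models share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def build_working_set(evidence: List[Model], braker: List[Model]) -> List[Model]:
    """Keep all evidence-based models plus BRAKER models overlapping none of them."""
    non_overlapping = [
        m for m in braker if not any(overlaps(m, e) for e in evidence)
    ]
    # Sort by chromosome, then start, then end for a stable, readable order.
    return sorted(evidence + non_overlapping, key=lambda m: (m[0], m[1], m[2]))


# Toy input (made-up coordinates and IDs, for illustration only).
EVIDENCE = [("chr1", 100, 500, "ev1"), ("chr1", 900, 1200, "ev2")]
BRAKER = [
    ("chr1", 450, 700, "br1"),    # overlaps ev1, so it is dropped
    ("chr1", 1500, 1800, "br2"),  # no overlap, kept
    ("chr2", 100, 500, "br3"),    # different chromosome, kept
]

if __name__ == "__main__":
    for model in build_working_set(EVIDENCE, BRAKER):
        print(model)
```

The quadratic overlap check is fine for a toy example; a genome-scale run would use an interval index or a dedicated tool such as BEDtools.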

The TE-filtered WS models were given AED scores using MAKER-P (v.3.0). Only models with AED < 0.75 passed to the high-confidence set (HCS). The HCS gene models were then classified based on homology to related species and assigned coding or non-coding biotypes. They were also checked for missing start and stop codons: the CDS boundaries of the transcripts were modified based on conserved start codon positions or extended to a start or stop codon whenever possible. All conserved genes, together with lineage-specific genes that had a complete CDS, were marked as protein-coding; the remaining lineage-specific genes were marked as non-coding. HCS gene models were checked and, where necessary, split or merged using the GFF3toolkit. Gene IDs were assigned according to the MaizeGDB nomenclature schema.
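The filtering and biotype rules in this paragraph can be summarised as a small decision function. This is a sketch under stated assumptions: the `GeneModel` fields, the `classify` function and the example IDs are hypothetical, and the real pipeline derives conservation and CDS completeness from homology searches and codon checks that are not reproduced here. Only the 0.75 cutoff and the two set names come from the text.

```python
from dataclasses import dataclass
from typing import Optional

AED_CUTOFF = 0.75  # from the text: only models with AED < 0.75 enter the HCS


@dataclass
class GeneModel:
    """Hypothetical summary of one working-set gene model."""
    gene_id: str
    aed: float          # Annotation Edit Distance assigned by MAKER-P
    conserved: bool     # has homology support in related species
    complete_cds: bool  # has both a start and a stop codon


def classify(model: GeneModel) -> Optional[str]:
    """Return the biotype for a model, or None if it fails the AED filter."""
    if model.aed >= AED_CUTOFF:
        return None  # not part of the high-confidence set
    if model.conserved:
        return "protein_coding"  # all conserved genes are marked coding
    if model.complete_cds:
        return "protein_coding"  # lineage-specific gene with a complete CDS
    return "misc_non-coding"     # remaining lineage-specific genes


if __name__ == "__main__":
    examples = [
        GeneModel("model_a", 0.10, conserved=True, complete_cds=False),
        GeneModel("model_b", 0.40, conserved=False, complete_cds=True),
        GeneModel("model_c", 0.40, conserved=False, complete_cds=False),
        GeneModel("model_d", 0.80, conserved=True, complete_cds=True),
    ]
    for m in examples:
        print(m.gene_id, classify(m))
```

Note that the cutoff is strict: a model with AED exactly 0.75 does not pass.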

Repeats

Repeat features were annotated with the RepeatMasker pipeline using the RepBase and wessler-bennetzen-2015/TE_12-Feb-2015_15-35 libraries.

Variation

The following variation datasets were remapped from assembly B73_RefGen_v4 to Zm-B73-REFERENCE-NAM-5.0:

HapMap2 dataset

This variation set comprises the maize HapMap2 data (Chia et al., 2012). The dataset incorporates approximately 55 million SNPs and InDels identified in a collection of 103 pre-domesticated and domesticated Zea mays varieties, including a representative from the sister genus, Tripsacum dactyloides (Eastern gamagrass). Each line was sequenced to an average of 4.5-fold coverage using the Illumina GAIIx platform. The reads can be accessed from the SRA under accession ID SRA051245. Reads were initially mapped to the B73 RefGen_v3 reference genome using a combination of Bowtie, Novoalign and SOAP, then remapped to the B73 RefGen_v4 reference genome. The variations were scored by taking into account identity-by-descent blocks that are shared among the lines.

The Panzea 2.7 genotyping-by-sequencing (GBS) dataset

This variation dataset consists of 719,472 SNPs (excluding 332 SNPs that were removed because they mapped to scaffolds) typed in 16,718 maize and teosinte lines, and grouped into 14 overlapping populations according to the germplasm set in the corresponding metadata table.

Main references

PPR266733
MED/28605751
MED/19965430
MED/19936061

Other references:

MED/14769938
A genome-wide characterization of microRNA genes in maize.
Zhang L, Chia JM, Kumari S, Stein JC, Liu Z, Narechania A, Maher CA, Guill K, McMullen MD, Ware D. 2009. PLoS Genet. 5:e1000716.
A near complete snapshot of the Zea mays seedling transcriptome revealed from ultra-deep sequencing.
Martin JA, Johnson NV, Gross SM, Schnable J, Meng X, Wang M, Coleman-Derr D, Lindquist E, Wei CL, Kaeppler S et al. 2014. Sci Rep. 4:4519.
A novel hybrid gene prediction method employing protein multiple sequence alignments.
Keller O, Kollmar M, Stanke M, Waack S. 2011. Bioinformatics. 27:757-763.
Ab initio gene finding in Drosophila genomic DNA.
Salamov AA, Solovyev VV. 2000. Genome Res. 10:516-522.
Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes.
Law M, Childs KL, Campbell MS, Stein JC, Olson AJ, Holt C, Panchy N, Lei J, Jiao D, Andorf CM et al. 2015. Plant Physiol. 167:25-39.
Genome-wide discovery and characterization of maize long non-coding RNAs.
Li L, Eichten SR, Shimizu R, Petsch K, Yeh CT, Wu W, Chettoor AM, Givan SA, Cole RA, Fowler JE et al. 2014. Genome Biol. 15:R40.
Insights into corn genes derived from large-scale cDNA sequencing.
Alexandrov NN, Brover VV, Freidin S, Troukhan ME, Tatarinova TV, Zhang H, Swaller TJ, Lu YP, Bouck J, Flavell RB et al. 2009. Plant Mol. Biol. 69:179-194.
Maize HapMap2 identifies extant variation from a genome in flux.
Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC et al. 2012. Nat. Genet. 44:803-807.
MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations.
Campbell MS, Law M, Holt C, Stein JC, Moghe GD, Hufnagel DE, Lei J, Achawanantakun R, Jiao D, Lawrence CJ et al. 2014. Plant Physiol. 164:513-524.
miRBase: annotating high confidence microRNAs using deep sequencing data.
Kozomara A, Griffiths-Jones S. 2014. Nucleic Acids Res. 42:D68-73.
Sequencing, mapping, and analysis of 27,455 maize full-length cDNAs.
Soderlund C, Descour A, Kudrna D, Bomhoff M, Boyd L, Currie J, Angelova A, Collura K, Wissotski M, Ashley E et al. 2009. PLoS Genet. 5:e1000740.
Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing.
Wang B, Tseng E, Regulski M, Clark TA, Hon T, Jiao Y, Lu Z, Olson A, Stein JC, Ware D. 2016. Nat Commun. 7:11708.
Comparative population genomics of maize domestication and improvement.
Hufford MB, Xu X, van Heerwaarden J, Pyhäjärvi T, Chia JM, Cartwright RA, Elshire RJ, Glaubitz JC, Guill KE, Kaeppler SM et al. 2012. Nat. Genet. 44:808-811.
A first-generation haplotype map of maize.
Gore MA, Chia JM, Elshire RJ, Sun Q, Ersoz ES, Hurwitz BL, Peiffer JA, McMullen MD, Grills GS, Ross-Ibarra J et al. 2009. Science. 326:1115-1117.
Detailed analysis of a contiguous 22-Mb region of the maize genome.
Wei F, Stein JC, Liang C, Zhang J, Fulton RS, Baucom RS, De Paoli E, Zhou S, Yang L, Han Y et al. 2009. PLoS Genet. 5:e1000728.
A single molecule scaffold for the maize genome.
Zhou S, Wei F, Nguyen J, Bechner M, Potamousis K, Goldstein S, Pape L, Mehan MR, Churas C, Pasternak S et al. 2009. PLoS Genet. 5:e1000711.
Evidence-based gene predictions in plant genomes.
Liang C, Mao L, Ware D, Stein L. 2009. Genome Res. 19:1912-1923.

Picture credit: Nicolle Rager Fuller, National Science Foundation.

Statistics

Summary

Assembly	Zm-B73-REFERENCE-NAM-5.0, INSDC Assembly GCA_902167145.1, Dec 2019
Database version	116.8
Golden Path Length	2,182,075,994
Genebuild by	CSHL
Genebuild method	Curated
Data source	nam-genomes

Gene counts

Coding genes	39,756
Non coding genes	4,547
Misc non coding genes	4,547
Gene transcripts	77,341

Other

Short Variants	49,783,517
