Zea mays Assembly and Gene Annotation
About Zea mays
Maize (Zea mays) had the highest worldwide production of all grain crops in 2019. Although it is a food staple in many regions of the world, most of the crop is used for animal feed and ethanol fuel. Maize was domesticated from wild teosinte in Central America, and its cultivation was spread throughout the Americas by pre-Columbian civilisations. In addition to its economic value, maize is an important model organism for studies in plant genetics, physiology, and development. It has a large genome of about 2.4 gigabases with a haploid chromosome number of 10 (2n=2x=20) [2-4]. Maize is distinguished from other grasses in that its genome arose from an ancient tetraploidy event unique to its lineage. This sequence corresponds to inbred line B73.
Assembly
The B73 genome was sequenced to high depth using PacBio long-read technology, together with 25 founder inbred lines strategically selected to represent the breadth of maize diversity, including lines from temperate, tropical, sweet corn, and popcorn germplasm. All lines were assembled into contigs using a hybrid approach, corrected with long-read and Illumina short-read data, scaffolded using Bionano optical maps, and ordered into pseudomolecules using linkage data from the NAM recombinant inbred lines and maize pan-genome anchor markers. All 26 lines were also annotated with a common pipeline.
The current assembly is named Zm-B73-REFERENCE-NAM-5.0. It was aligned to the previous version, B73_RefGen_v4, with the ATAC/A2Amapper software [5], which mapped 96.76% of positions. These mappings can be used to translate coordinates between the two assemblies with the Assembly Converter.
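As an illustration of how alignment-based coordinate translation works in principle, the following minimal Python sketch maps a position through a list of ungapped aligned blocks. It is not the ATAC/A2Amapper algorithm or the Assembly Converter implementation; the `Block` structure, the function names, and the example coordinates are all hypothetical, and blocks are assumed not to overlap on the old assembly.

```python
from bisect import bisect_right
from typing import Dict, List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    """One ungapped aligned block (hypothetical format; 1-based, inclusive)."""
    old_chrom: str
    old_start: int
    old_end: int
    new_chrom: str
    new_start: int  # lowest new-assembly coordinate covered by the block
    strand: int     # +1 = same orientation, -1 = inverted


def build_index(blocks: List[Block]) -> Dict[str, List[Block]]:
    """Group blocks by old chromosome, sorted by old start position."""
    index: Dict[str, List[Block]] = {}
    for block in sorted(blocks, key=lambda b: (b.old_chrom, b.old_start)):
        index.setdefault(block.old_chrom, []).append(block)
    return index


def translate(index: Dict[str, List[Block]], chrom: str,
              pos: int) -> Optional[Tuple[str, int]]:
    """Return (chrom, pos) on the new assembly, or None if pos is unmapped."""
    blocks = index.get(chrom, [])
    starts = [b.old_start for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0 or pos > blocks[i].old_end:
        return None
    block = blocks[i]
    offset = pos - block.old_start
    if block.strand == 1:
        return block.new_chrom, block.new_start + offset
    # Inverted block: count the offset back from the block's new end.
    new_end = block.new_start + (block.old_end - block.old_start)
    return block.new_chrom, new_end - offset


def mapped_fraction(index: Dict[str, List[Block]], total_positions: int) -> float:
    """Fraction of old-assembly positions covered by at least one block."""
    covered = sum(b.old_end - b.old_start + 1
                  for chrom_blocks in index.values() for b in chrom_blocks)
    return covered / total_positions


# Made-up example blocks, not real B73 coordinates.
index = build_index([
    Block("1", 100, 199, "1", 150, 1),
    Block("1", 300, 399, "1", 500, -1),
])
print(translate(index, "1", 120))  # inside the forward block
print(translate(index, "1", 300))  # first base of the inverted block
print(translate(index, "1", 250))  # falls in a gap between blocks
```

Positions that fall between blocks return `None`, which corresponds to the small share of positions that an assembly-to-assembly alignment leaves unmapped.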
Annotation
The gene annotations were produced with the CSHL gene pipeline developed under the NAM project. In summary, it is an automated, evidence-based method that combines third-party software including Mikado, BRAKER and PASA. Gene models were filtered by conservation and MAKER Annotation Edit Distance (AED) score, and then classified into protein_coding and misc_non-coding sets.
Gene models from version B73_RefGen_v4 have been mapped to the current Zm-B73-REFERENCE-NAM-5.0 assembly and can be retrieved with the ID History Converter.
Evidence-based predictions were inferred directly from assembled transcripts, which were generated using five different genome-guided transcript assembly programs and processed with Mikado to pick the optimal set of transcripts for each locus. To generate the assembled transcripts, quality-inspected RNA-seq reads were mapped to the genome. To pick the final transcripts, Mikado takes as input the assembled transcripts, high-confidence splice junctions from the mapped reads, ORFs predicted for the assembled transcripts, and homology results of the transcripts against SwissProt (Viridiplantae) sequences.
Ab initio predictions were performed using BRAKER with both evidence-based predicted proteins and mapped RNA-seq reads as input.
A working set (WS) of models was generated to capture the complete gene space by combining evidence-based gene models with non-overlapping BRAKER gene models using BEDtools. Additional structural improvements to the WS models were made using PASA. Transposable element-related genes were filtered from the evidence-based and non-overlapping BRAKER sets using the TEsorter tool, which uses the REXdb database of TEs.
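The combination step can be sketched in plain Python as follows. The text does not say which BEDtools subcommand was used, so this only imitates the effect of an anti-overlap filter (comparable in spirit to `bedtools intersect -v`). The `Model` fields and gene identifiers are hypothetical, and strand is ignored for simplicity.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Model(NamedTuple):
    """A gene model reduced to its genomic span (1-based, inclusive)."""
    gene_id: str
    chrom: str
    start: int
    end: int
    source: str  # "evidence" or "braker"


def overlaps(a: Model, b: Model) -> bool:
    """True if the two spans share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def build_working_set(evidence: List[Model], braker: List[Model]) -> List[Model]:
    """Keep every evidence-based model plus BRAKER models that overlap none of them."""
    by_chrom: Dict[str, List[Model]] = defaultdict(list)
    for model in evidence:
        by_chrom[model.chrom].append(model)
    kept = [m for m in braker
            if not any(overlaps(m, e) for e in by_chrom[m.chrom])]
    return sorted(evidence + kept, key=lambda m: (m.chrom, m.start, m.end))


# Made-up example models.
evidence = [Model("ev1", "1", 100, 500, "evidence"),
            Model("ev2", "1", 900, 1200, "evidence")]
braker = [Model("br1", "1", 450, 700, "braker"),   # overlaps ev1, dropped
          Model("br2", "1", 600, 800, "braker"),   # no overlap, kept
          Model("br3", "2", 100, 300, "braker")]   # other chromosome, kept
working_set = build_working_set(evidence, braker)
print([m.gene_id for m in working_set])
```

The linear scan per chromosome is adequate for a sketch; a real pipeline working on tens of thousands of models would use sorted intervals or an interval tree, which is what dedicated tools provide.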
The TE-filtered WS models were given AED scores using MAKER-P (v.3.0). Only models with AED < 0.75 were passed to the high-confidence set (HCS). The HCS gene models were further classified based on homology to related species and assigned coding and non-coding biotypes. The HCS gene models were checked for missing start and stop codons. The CDS boundaries of the transcripts were modified based on conserved start codon positions or extended to a start or stop codon whenever possible. All conserved genes, as well as lineage-specific genes that had a complete CDS, were marked as protein-coding. The remaining lineage-specific genes were marked as non-coding. HCS gene models were checked and, where necessary, split or merged using the GFF3toolkit. Gene IDs were assigned according to the MaizeGDB nomenclature schema.
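The filtering and biotype rules described above can be summarised in a short Python sketch. It takes AED scores as given rather than computing them, and the `GeneModel` fields, function names, and example identifiers are hypothetical. Only the 0.75 cutoff and the classification rule come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List

AED_CUTOFF = 0.75  # from the text: only models with AED < 0.75 are kept


@dataclass
class GeneModel:
    gene_id: str
    aed: float          # Annotation Edit Distance, assumed precomputed
    conserved: bool     # homology support in related species (assumed flag)
    complete_cds: bool  # CDS has both a start and a stop codon (assumed flag)


def high_confidence(models: List[GeneModel]) -> List[GeneModel]:
    """Keep models strictly below the AED cutoff."""
    return [m for m in models if m.aed < AED_CUTOFF]


def biotype(model: GeneModel) -> str:
    """Conserved genes, and lineage-specific genes with a complete CDS, are coding."""
    if model.conserved or model.complete_cds:
        return "protein_coding"
    return "misc_non-coding"


def classify(models: List[GeneModel]) -> Dict[str, str]:
    """Map gene ID to biotype for every model in the high-confidence set."""
    return {m.gene_id: biotype(m) for m in high_confidence(models)}


# Made-up example models.
models = [
    GeneModel("g1", 0.10, conserved=True, complete_cds=True),
    GeneModel("g2", 0.40, conserved=False, complete_cds=True),
    GeneModel("g3", 0.60, conserved=False, complete_cds=False),
    GeneModel("g4", 0.75, conserved=True, complete_cds=True),  # not < 0.75
]
print(classify(models))
```

Note that the cutoff is strict, so a model with an AED of exactly 0.75 is excluded from the high-confidence set.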
Repeats
Repeat features were annotated with the RepeatMasker pipeline with RepBase and wessler-bennetzen-2015/TE_12-Feb-2015_15-35 libraries.
Regulation
Gene expression probes
Oligo probes from the GeneChip Maize Genome Array have been aligned using the standard Ensembl 2-step mapping procedure.
Variation
The following variation datasets were remapped from assembly B73_RefGen_v4 to Zm-B73-REFERENCE-NAM-5.0:
HapMap2 dataset
This variation set comprises the maize HapMap2 data (Chia et al., 2012). The dataset incorporates approximately 55 million SNPs and InDels identified in a collection of 103 pre-domesticated and domesticated Zea mays varieties, including a representative from the sister genus, Tripsacum dactyloides (Eastern gamagrass). Each line was sequenced to an average of 4.5-fold coverage using the Illumina GAIIx platform. The reads can be accessed from the SRA under accession SRA051245. Reads were initially mapped to the B73 RefGen_v3 reference genome using a combination of Bowtie, Novoalign and SOAP, then remapped to the B73 RefGen_v4 reference genome. The variations were scored by taking into account identity-by-descent blocks that are shared among the lines.
The Panzea 2.7 genotyping-by-sequencing (GBS) dataset
This variation dataset consists of 719,472 SNPs (excluding 332 SNPs that were removed because they mapped to scaffolds) typed in 16,718 maize and teosinte lines and grouped into 14 overlapping populations according to the germplasm set in the corresponding metadata table.
Main references
PPR266733
MED/28605751
MED/19965430
MED/19936061
Other references:
MED/14769938
- A genome-wide characterization of microRNA genes in maize.
  Zhang L, Chia JM, Kumari S, Stein JC, Liu Z, Narechania A, Maher CA, Guill K, McMullen MD, Ware D. 2009. PLoS Genet. 5:e1000716.
- A near complete snapshot of the Zea mays seedling transcriptome revealed from ultra-deep sequencing.
  Martin JA, Johnson NV, Gross SM, Schnable J, Meng X, Wang M, Coleman-Derr D, Lindquist E, Wei CL, Kaeppler S et al. 2014. Sci Rep. 4:4519.
- A novel hybrid gene prediction method employing protein multiple sequence alignments.
  Keller O, Kollmar M, Stanke M, Waack S. 2011. Bioinformatics. 27:757-763.
- Ab initio gene finding in Drosophila genomic DNA.
  Salamov AA, Solovyev VV. 2000. Genome Res. 10:516-522.
- Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes.
  Law M, Childs KL, Campbell MS, Stein JC, Olson AJ, Holt C, Panchy N, Lei J, Jiao D, Andorf CM et al. 2015. Plant Physiol. 167:25-39.
- Genome-wide discovery and characterization of maize long non-coding RNAs.
  Li L, Eichten SR, Shimizu R, Petsch K, Yeh CT, Wu W, Chettoor AM, Givan SA, Cole RA, Fowler JE et al. 2014. Genome Biol. 15:R40.
- Insights into corn genes derived from large-scale cDNA sequencing.
  Alexandrov NN, Brover VV, Freidin S, Troukhan ME, Tatarinova TV, Zhang H, Swaller TJ, Lu YP, Bouck J, Flavell RB et al. 2009. Plant Mol. Biol. 69:179-194.
- Maize HapMap2 identifies extant variation from a genome in flux.
  Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC et al. 2012. Nat. Genet. 44:803-807.
- MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations.
  Campbell MS, Law M, Holt C, Stein JC, Moghe GD, Hufnagel DE, Lei J, Achawanantakun R, Jiao D, Lawrence CJ et al. 2014. Plant Physiol. 164:513-524.
- miRBase: annotating high confidence microRNAs using deep sequencing data.
  Kozomara A, Griffiths-Jones S. 2014. Nucleic Acids Res. 42:D68-73.
- Sequencing, mapping, and analysis of 27,455 maize full-length cDNAs.
  Soderlund C, Descour A, Kudrna D, Bomhoff M, Boyd L, Currie J, Angelova A, Collura K, Wissotski M, Ashley E et al. 2009. PLoS Genet. 5:e1000740.
- Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing.
  Wang B, Tseng E, Regulski M, Clark TA, Hon T, Jiao Y, Lu Z, Olson A, Stein JC, Ware D. 2016. Nat Commun. 7:11708.
- Comparative population genomics of maize domestication and improvement.
  Hufford MB, Xu X, van Heerwaarden J, Pyhäjärvi T, Chia JM, Cartwright RA, Elshire RJ, Glaubitz JC, Guill KE, Kaeppler SM et al. 2012. Nat. Genet. 44:808-811.
- A first-generation haplotype map of maize.
  Gore MA, Chia JM, Elshire RJ, Sun Q, Ersoz ES, Hurwitz BL, Peiffer JA, McMullen MD, Grills GS, Ross-Ibarra J et al. 2009. Science. 326:1115-1117.
- Detailed analysis of a contiguous 22-Mb region of the maize genome.
  Wei F, Stein JC, Liang C, Zhang J, Fulton RS, Baucom RS, De Paoli E, Zhou S, Yang L, Han Y et al. 2009. PLoS Genet. 5:e1000728.
- A single molecule scaffold for the maize genome.
  Zhou S, Wei F, Nguyen J, Bechner M, Potamousis K, Goldstein S, Pape L, Mehan MR, Churas C, Pasternak S et al. 2009. PLoS Genet. 5:e1000711.
- Evidence-based gene predictions in plant genomes.
  Liang C, Mao L, Ware D, Stein L. 2009. Genome Res. 19:1912-1923.
Picture credit: Nicolle Rager Fuller, National Science Foundation.
Statistics
Summary
Assembly | Zm-B73-REFERENCE-NAM-5.0, INSDC Assembly GCA_902167145.1, Dec 2019 |
Database version | 113.8 |
Golden Path Length | 2,182,075,994 |
Genebuild by | CSHL |
Genebuild method | Curated |
Data source | nam-genomes |
Gene counts
Coding genes | 39,756 |
Non coding genes | 4,547 |
Misc non coding genes | 4,547 |
Gene transcripts | 77,341 |
Other
Short Variants | 49,783,517 |