Oryza sativa Japonica Assembly and Gene Annotation

About Oryza sativa Japonica

Oryza sativa Japonica (rice) is the staple food for 2.5 billion people. It is the grain with the second highest worldwide production after Zea mays. In addition to its agronomic importance, rice is an important model species for monocot plants and cereals such as maize, wheat, barley and sorghum. O. sativa has a compact diploid genome of approximately 500 Mbp (n=12) compared with the multi-gigabase genomes of maize, wheat and barley.


Scientists from the MSU Rice Genome Annotation Project (MSU) and the International Rice Genome Sequencing Project (IRGSP) / Rice Annotation Project Database (RAP-DB) generated a unified assembly of the 12 rice pseudomolecules of Oryza sativa Japonica Group cv. Nipponbare [1,2].

The pseudomolecule for each chromosome was constructed by joining the nucleotide sequences of each PAC/BAC clone based on the order of the clones on the physical map. Overlapping sequences were removed and physical gaps were replaced with Ns. Updated pseudomolecules were constructed based on the original IRGSP sequence data [1] in combination with a BAC-optical map and error correction using 44-fold coverage next generation sequencing reads [2]. The nucleotide sequences of 7 new clones mapped on the euchromatin/telomere junctions were added in the new genome assembly. In addition, several clones in the centromere region of chromosome 5 were improved and one gap on chromosome 11 was closed [2].
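The joining step described above can be illustrated with a minimal sketch. It is not the IRGSP pipeline: the `Clone` record, its field names and the toy sequences are assumptions made for illustration, and the overlap and gap sizes are taken as already known from the physical map.

```python
from dataclasses import dataclass


@dataclass
class Clone:
    """Hypothetical record for one PAC/BAC clone, listed in physical-map order."""
    name: str
    sequence: str
    overlap_with_previous: int = 0  # leading bases already covered by the previous clone
    gap_before: int = 0             # size of the physical gap preceding this clone


def build_pseudomolecule(clones):
    """Join clone sequences in map order, trimming overlaps and padding gaps with Ns."""
    parts = []
    for clone in clones:
        if clone.gap_before:
            # A physical gap is represented by a run of N characters.
            parts.append("N" * clone.gap_before)
        # Drop the bases that duplicate the end of the previous clone.
        parts.append(clone.sequence[clone.overlap_with_previous:])
    return "".join(parts)


# Toy data: clone_B shares its first 4 bases with the end of clone_A,
# and a 5-base gap separates clone_B from clone_C.
clones = [
    Clone("clone_A", "ACGTACGTAA"),
    Clone("clone_B", "GTAACCGGTT", overlap_with_previous=4),
    Clone("clone_C", "TTGACA", gap_before=5),
]
pseudomolecule = build_pseudomolecule(clones)
print(pseudomolecule)
```

In a real assembly the overlaps would be determined by sequence alignment between neighbouring clones rather than supplied by hand.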

Kawahara et al. (2013) describe the integrated Os-Nipponbare-Reference-IRGSP-1.0 pseudomolecules, also known as MSU7. Gene loci, gene models and associated annotations were independently created by each group, but can be easily compared using the common reference.

Read more about the assembly at MSU or on rap-db.


The IRGSP gene models were imported from rap-db [3]. The prediction pipeline uses a number of ab initio methods and EST/mRNA alignments. Gene models were combined using JIGSAW. In total, 35,681 protein-coding genes with transcript evidence were detected. Feature annotation and comparative analysis pipelines have been run and variations have been projected from the old assembly to the new one.

The MSU-7 gene models [4] have been added to the rice genome browser for visual comparison with the IRGSP set. Gene models were generated, refined and updated for the estimated 40,000 to 60,000 rice genes using a standardized automatic annotation pipeline, described in detail by the MSU project. Briefly, a number of ab initio methods were combined with homology-based evidence and refined with EST alignments.

Cross references between the two gene sets provided by rap-db allow searching and querying using either identifier space, but only the IRGSP gene models are used in our gene trees.
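Querying by either identifier space amounts to a two-way lookup over the cross-reference pairs. The sketch below shows one way this might be done. The identifier strings only follow the general shape of RAP-DB and MSU locus IDs, the pairings are placeholders rather than real cross references, and the function names are hypothetical.

```python
from collections import defaultdict

# Placeholder (RAP-DB ID, MSU ID) pairs. These pairings are NOT taken from
# RAP-DB. A locus in one set may correspond to several loci in the other.
xref_pairs = [
    ("Os01g0100100", "LOC_Os01g01010"),
    ("Os01g0100200", "LOC_Os01g01019"),
    ("Os01g0100200", "LOC_Os01g01020"),
]


def build_indexes(pairs):
    """Build one lookup table per direction from the cross-reference pairs."""
    rap_to_msu = defaultdict(list)
    msu_to_rap = defaultdict(list)
    for rap_id, msu_id in pairs:
        rap_to_msu[rap_id].append(msu_id)
        msu_to_rap[msu_id].append(rap_id)
    return rap_to_msu, msu_to_rap


def counterparts(identifier, rap_to_msu, msu_to_rap):
    """Return the IDs in the other gene set for an ID from either set."""
    if identifier in rap_to_msu:
        return sorted(rap_to_msu[identifier])
    if identifier in msu_to_rap:
        return sorted(msu_to_rap[identifier])
    return []  # unknown identifier, or a locus with no cross reference


rap_to_msu, msu_to_rap = build_indexes(xref_pairs)
print(counterparts("LOC_Os01g01010", rap_to_msu, msu_to_rap))
print(counterparts("Os01g0100200", rap_to_msu, msu_to_rap))
```

Lists are used as values because the two annotation sets were built independently, so a one-to-one correspondence cannot be assumed.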


  • Probes from the Rice Genome Array for two rice cultivars were aligned to the genome [5].


Variation data from six different large scale studies are available:

  1. The 3000 Rice Genome Project (2015, [6]), an international effort to sequence the genomes of 3,024 rice varieties from 89 countries providing 365,710 variant loci (SNPs and InDels).
  2. Whole genome sequencing of 104 elite rice cultivars (Duitama et al. 2015, [7]), described as, "a comprehensive information resource for marker assisted selection" providing 25,769,548 variant loci.
  3. Chip based analysis of 1,310 SNPs across 395 samples (Zhao et al. 2010, [8]), described as, "revealing the impact of domestication and breeding on the rice genome".
  4. Chip based analysis of approximately 160k SNPs across 20 diversity rice accessions (OryzaSNP, McNally et al. 2009 [9]), described as, "revealing relationships among landraces and modern varieties of rice".
  5. The Oryza Map Alignment Project (OMAP 2007): approximately 1.6M variant loci detected by comparing BAC End Sequences from four rice varieties to Japonica. [dbSNP]
  6. Adaptive loss-of-function in domesticated rice (BGI 2004, [10]): A collection of approximately 3M variant loci from the comparison of the Indica (93-11) and Japonica (Nipponbare) genomes. [dbSNP]

The following genetic markers were remapped to the IRGSP-1.0 assembly by industry collaborator KeyGene:



References

  1. The map-based sequence of the rice genome.
    2005. Nature. 436:793-800.
  2. Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data.
    Kawahara Y, de la Bastide M, Hamilton JP, Kanamori H, McCombie WR, Ouyang S, Schwartz DC, Tanaka T, Wu J, Zhou S et al. 2013. Rice (N Y). 6:4.
  3. Rice Annotation Project Database (RAP-DB): an integrative and interactive database for rice genomics.
    Sakai H, Lee SS, Tanaka T, Numa H, Kim J, Kawahara Y, Wakimoto H, Yang CC, Iwamoto M, Abe T et al. 2013. Plant Cell Physiol. 54:e6.
  4. The TIGR Rice Genome Annotation Resource: improvements and new features.
    Ouyang S, Zhu W, Hamilton J, Lin H, Campbell M, Childs K, Thibaud-Nissen F, Malek RL, Lee Y, Zheng L et al. 2007. Nucleic Acids Res. 35:D883-7.
  5. Global analysis of gene expression using GeneChip microarrays.
    Zhu T. 2003. Curr. Opin. Plant Biol. 6:418-425.
  6. The 3,000 rice genomes project.
    2014. Gigascience. 3:7.
  7. Whole genome sequencing of elite rice cultivars as a comprehensive information resource for marker assisted selection.
    Duitama J, Silva A, Sanabria Y, Cruz DF, Quintero C, Ballen C, Lorieux M, Scheffler B, Farmer A, Torres E et al. 2015. PLoS ONE. 10:e0124617.
  8. Genomic diversity and introgression in O. sativa reveal the impact of domestication and breeding on the rice genome.
    Zhao K, Wright M, Kimball J, Eizenga G, McClung A, Kovach M, Tyagi W, Ali ML, Tung CW, Reynolds A et al. 2010. PLoS ONE. 5:e10780.
  9. Genomewide SNP variation reveals relationships among landraces and modern varieties of rice.
    McNally KL, Childs KL, Bohnert R, Davidson RM, Zhao K, Ulat VJ, Zeller G, Clark RM, Hoen DR, Bureau TE et al. 2009. Proc. Natl. Acad. Sci. U.S.A. 106:12273-12278.
  10. The Genomes of Oryza sativa: a history of duplications.
    Yu J, Wang J, Lin W, Li S, Li H, Zhou J, Ni P, Dong W, Hu S, Zeng C et al. 2005. PLoS Biol. 3:e38.

More information

General information about this species can be found in Wikipedia.



Assembly: IRGSP-1.0, INSDC Assembly GCA_001433935.1
Database version: 90.7
Base Pairs: 374,424,240
Golden Path Length: 374,424,240
Genebuild by: IRGSP
Genebuild method: Imported from RAP-DB
Data source: RAP-DB

Gene counts

Coding genes: 35,679
Non coding genes: 56,313
Small non coding genes: 56,249
Long non coding genes: 64
Gene transcripts: 98,663


FGENESH gene prediction: 46,238
TE-related Gene (MSU): 17,272
Short Variants: 28,179,246
Structural variants: 1,278
