Oryza sativa Japonica Group Assembly and Gene Annotation

About Oryza sativa Japonica

Oryza sativa Japonica (rice) is the staple food for 2.5 billion people. It is the grain with the second highest worldwide production after Zea mays. In addition to its agronomic importance, rice is an important model species for monocot plants and cereals such as maize, wheat, barley and sorghum. O. sativa has a compact diploid genome of approximately 500 Mbp (n=12) compared with the multi-gigabase genomes of maize, wheat and barley.


Scientists from the MSU Rice Genome Annotation Project (MSU) and the International Rice Genome Sequencing Project (IRGSP) / Rice Annotation Project Database (RAP-DB) generated a unified assembly of the 12 rice pseudomolecules of Oryza sativa Japonica Group cv. Nipponbare [1,2].

The pseudomolecule for each chromosome was constructed by joining the nucleotide sequences of each PAC/BAC clone based on the order of the clones on the physical map. Overlapping sequences were removed and physical gaps were replaced with Ns. Updated pseudomolecules were constructed based on the original IRGSP sequence data [1] in combination with a BAC-optical map and error correction using 44-fold coverage next generation sequencing reads [2]. The nucleotide sequences of 7 new clones mapped on the euchromatin/telomere junctions were added in the new genome assembly. In addition, several clones in the centromere region of chromosome 5 were improved and one gap on chromosome 11 was closed [2].
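The joining step described above can be illustrated with a minimal sketch. It is not the IRGSP pipeline: the function name `build_pseudomolecule`, the `(sequence, overlap)` tuple format and the fixed gap length are all assumptions made for illustration. It assumes the clones are already ordered by the physical map and that the length of each overlap with the preceding clone is known.

```python
def build_pseudomolecule(clones, gap_length=100):
    """Join ordered clone sequences into a single pseudomolecule.

    clones: list of (sequence, overlap) tuples in physical-map order.
        overlap is the number of leading bases already covered by the
        previous clone, or None if a physical gap precedes the clone.
    gap_length: number of Ns written for each physical gap
        (a hypothetical fixed value, chosen only for this sketch).
    """
    parts = []
    for index, (sequence, overlap) in enumerate(clones):
        if index == 0:
            # The first clone is taken whole; its overlap value is ignored.
            parts.append(sequence)
        elif overlap is None:
            # Physical gap: fill with Ns, then append the full clone.
            parts.append("N" * gap_length)
            parts.append(sequence)
        else:
            # Drop the bases that duplicate the end of the previous clone.
            parts.append(sequence[overlap:])
    return "".join(parts)


# Toy example: the second clone overlaps the first by 3 bases,
# and a physical gap precedes the third clone.
toy_clones = [("ACGTAC", 0), ("TACGGA", 3), ("TTTT", None)]
print(build_pseudomolecule(toy_clones, gap_length=5))
# ACGTACGGANNNNNTTTT
```

In practice, overlaps have to be detected by alignment and gap sizes estimated from the physical map; the sketch takes both as given.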

Kawahara et al. (2013) describe the integrated Os-Nipponbare-Reference-IRGSP-1.0 pseudomolecules, also known as MSU7. Gene loci, gene models and associated annotations were independently created by each group, but can be easily compared using the common reference.

Read more about the assembly at MSU or on rap-db.


The IRGSP gene models were imported from rap-db [3]. The most recent update was taken from its August 4, 2017 release. This version corrected numerous protein-coding gene models through manual curation and deprecated some low-quality models. In total, 35,667 protein-coding genes were included in this release. Compared with the previous release of 37,830 genes, 35,340 stayed the same, 2,173 were deprecated, 317 were updated and 10 new genes were added. Feature annotation and comparative analysis pipelines have been run, and variations have been projected from the old annotation to the new one. In addition, 2,387 non-coding genes and 8,115 predicted genes were added as separate data sets.
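The release figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below assumes that updated genes are counted in both releases, that deprecated genes appear only in the previous release, and that new genes appear only in the current one. The function name `reconcile` is hypothetical.

```python
def reconcile(unchanged, deprecated, updated, added):
    """Return (genes accounted for in the previous release,
    genes expected in the current release)."""
    accounted_previous = unchanged + deprecated + updated
    expected_current = unchanged + updated + added
    return accounted_previous, expected_current


previous, current = reconcile(
    unchanged=35_340, deprecated=2_173, updated=317, added=10
)
print(previous == 37_830)  # True: matches the previous release total
print(current == 35_667)   # True: matches the current release total
```

Under these assumptions both totals agree with the numbers stated in the text.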

The MSU-7 gene models [4] have been added to the rice genome browser for visual comparison with the IRGSP set. Gene models were generated, refined and updated for the estimated 40,000 to 60,000 rice genes using a standardized automatic annotation pipeline described in detail by the MSU project. Briefly, a number of ab initio methods were combined with homology-based evidence and refined with EST alignments.

Cross references between the two gene sets provided by rap-db allow searching and querying using either identifier space, but only the IRGSP gene models are used in our gene trees.


  • Probes from the Rice Genome Array for two rice cultivars were aligned to the genome [5].


Variation data from six different large scale studies are available:

  1. The 3000 Rice Genome Project (2015, [6]), an international effort to sequence the genomes of 3,024 rice varieties from 89 countries providing 365,710 variant loci (SNPs and InDels).
  2. Whole genome sequencing of 104 elite rice cultivars (Duitama et al. 2015, [7]), described as "a comprehensive information resource for marker assisted selection", providing 25,769,548 variant loci.
  3. Chip-based analysis of 1,310 SNPs across 395 samples (Zhao et al. 2010, [8]), described as "revealing the impact of domestication and breeding on the rice genome".
  4. Chip-based analysis of approximately 160k SNPs across 20 diverse rice accessions (OryzaSNP, McNally et al. 2009, [9]), described as "revealing relationships among landraces and modern varieties of rice".
  5. The Oryza Map Alignment Project (OMAP 2007): approximately 1.6M variant loci detected by comparing BAC End Sequences from four rice varieties to Japonica. [dbSNP]
  6. Adaptive loss-of-function in domesticated rice (BGI 2004, [10]): A collection of approximately 3M variant loci from the comparison of the Indica (93-11) and Japonica (Nipponbare) genomes. [dbSNP]

The following genetic markers were remapped to the IRGSP-1.0 assembly by industry collaborator KeyGene:



References

  1. The map-based sequence of the rice genome.
    2005. Nature. 436:793-800.
  2. Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data.
    Kawahara Y, de la Bastide M, Hamilton JP, Kanamori H, McCombie WR, Ouyang S, Schwartz DC, Tanaka T, Wu J, Zhou S et al. 2013. Rice (N Y). 6:4.
  3. Rice Annotation Project Database (RAP-DB): an integrative and interactive database for rice genomics.
    Sakai H, Lee SS, Tanaka T, Numa H, Kim J, Kawahara Y, Wakimoto H, Yang CC, Iwamoto M, Abe T et al. 2013. Plant Cell Physiol. 54:e6.
  4. The TIGR Rice Genome Annotation Resource: improvements and new features.
    Ouyang S, Zhu W, Hamilton J, Lin H, Campbell M, Childs K, Thibaud-Nissen F, Malek RL, Lee Y, Zheng L et al. 2007. Nucleic Acids Res. 35:D883-7.
  5. Global analysis of gene expression using GeneChip microarrays.
    Zhu T. 2003. Curr. Opin. Plant Biol. 6:418-425.
  6. The 3,000 rice genomes project.
    2014. Gigascience. 3:7.
  7. Whole genome sequencing of elite rice cultivars as a comprehensive information resource for marker assisted selection.
    Duitama J, Silva A, Sanabria Y, Cruz DF, Quintero C, Ballen C, Lorieux M, Scheffler B, Farmer A, Torres E et al. 2015. PLoS ONE. 10:e0124617.
  8. Genomic diversity and introgression in O. sativa reveal the impact of domestication and breeding on the rice genome.
    Zhao K, Wright M, Kimball J, Eizenga G, McClung A, Kovach M, Tyagi W, Ali ML, Tung CW, Reynolds A et al. 2010. PLoS ONE. 5:e10780.
  9. Genomewide SNP variation reveals relationships among landraces and modern varieties of rice.
    McNally KL, Childs KL, Bohnert R, Davidson RM, Zhao K, Ulat VJ, Zeller G, Clark RM, Hoen DR, Bureau TE et al. 2009. Proc. Natl. Acad. Sci. U.S.A. 106:12273-12278.
  10. The Genomes of Oryza sativa: a history of duplications.
    Yu J, Wang J, Lin W, Li S, Li H, Zhou J, Ni P, Dong W, Hu S, Zeng C et al. 2005. PLoS Biol. 3:e38.

More information

General information about this species can be found in Wikipedia.



Assembly: IRGSP-1.0, INSDC Assembly GCA_001433935.1, Oct 2015
Database version: 94.7
Base Pairs: 375,049,285
Golden Path Length: 375,049,285
Genebuild by: IRGSP
Genebuild method: Imported from RAP-DB
Data source: RAP-DB

Gene counts

Coding genes: 35,825
Non coding genes: 1,017
Small non coding genes: 922
Long non coding genes: 95
Gene transcripts: 43,404


FGENESH gene prediction: 46,238
TE-related Gene (MSU): 17,272
Short Variants: 28,179,246
Structural variants: 1,278
