Gossypium raimondii Assembly and Gene Annotation

About Gossypium raimondii

Gossypium raimondii is a species of cotton plant endemic to northern Peru. The Gossypium genus is ideal for investigating emergent consequences of polyploidy. A-genome diploids native to Africa and Mexican D-genome diploids diverged ∼5–10 Myr ago. They were reunited ∼1–2 Myr ago by trans-oceanic dispersal of a maternal A-genome propagule resembling G. herbaceum to the New World, hybridization with a native D-genome species resembling G. raimondii, and chromosome doubling. The nascent AtDt allopolyploid spread throughout the American tropics and subtropics, diverging into at least five species; two of these species (G. hirsutum and G. barbadense) were independently domesticated to spawn one of the world’s largest industries (textiles) and become a major oilseed.[1]

Assembly

Assembly of 80,765,952 sequence reads used a modified version of Arachne, integrating linear (15x genome coverage) and paired (3.1x genome coverage) Roche 454 libraries corrected using 41.9 Gbp Illumina sequence, with 1.54x paired-end Sanger sequences from two subclone, six fosmid and two BAC libraries. Cotton genetic and physical maps, and Vitis vinifera and Theobroma cacao synteny were used to identify 51 joins across 64 scaffolds to form the 13 chromosomes. The remaining scaffolds were screened for contamination to produce a final assembly of 1,033 scaffolds (19,735 contigs) and 761.4 Mbp [1][2].

Annotation

Using PERTRAN, 85,746 transcript assemblies were made from about 1 billion pairs of D5 paired-end Illumina RNA-Seq reads, 55,294 transcript assemblies from about 0.25 billion D5 single-end Illumina RNA-Seq reads, and 62,526 transcript assemblies from 0.15 billion TET single-end Illumina RNA-Seq reads. 120,929 transcript assemblies were constructed using PASA from 56,638 D5 Sanger ESTs, 2.5M D5 454 RNA-Seq reads and all of the RNA-Seq transcript assemblies above. 133,073 transcript assemblies were constructed using PASA from 296,214 TET Sanger ESTs and about 2.9M TET 454 reads. The larger number of transcript assemblies from fewer TET sequences is due to the fragmented nature of the assemblies. Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis thaliana, cacao, rice, soybean, grape and poplar to the G. raimondii genome, repeat-soft-masked using RepeatMasker, with up to 2 kbp extension on both ends unless extending into another locus on the same strand.

Gene models were predicted by the homology-based predictors FGENESH+, FGENESH_EST (similar to FGENESH+, but using ESTs as splice site and intron input instead of protein/translated ORF), and GenomeScan. The best-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA. Improvement includes adding UTRs, correcting splicing, and adding alternative transcripts. PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to the mutual best hit (MBH) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best homolog. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. A transcript was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped with repeats. For gene models whose CDS overlapped with repeats by more than 20%, the Cscore had to be at least 0.9 and the homology coverage at least 70% for the model to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% in Pfam TE domains were removed. The final gene set has 37,505 protein-coding genes and 77,267 protein-coding transcripts [1][2].
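The selection thresholds described above can be sketched as a small filter. This is only an illustration of the stated rules, not the pipeline's actual code: the `Transcript` class, its field names, and the function names are hypothetical, all fractions are expressed on a 0 to 1 scale, "homology coverage" is assumed to mean the same quantity as protein coverage, and a repeat overlap of exactly 20% (which the text leaves unspecified) is assumed to fall under the stricter rule.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical record holding the evidence used for selection."""
    cscore: float            # BLASTP score / mutual-best-hit BLASTP score
    protein_coverage: float  # fraction of protein aligned to best homolog
    has_est_coverage: bool   # any EST support for the transcript
    repeat_overlap: float    # fraction of CDS overlapping repeats
    te_fraction: float = 0.0 # fraction of protein in Pfam TE domains


def is_selected(t: Transcript) -> bool:
    """Apply the Cscore / coverage / EST / repeat-overlap thresholds."""
    if t.repeat_overlap < 0.20:
        # Low repeat overlap: homology support OR EST support suffices.
        homology_ok = t.cscore >= 0.5 and t.protein_coverage >= 0.5
        return homology_ok or t.has_est_coverage
    # High repeat overlap (assumption: exactly 20% is treated as high):
    # strong homology support is required.
    return t.cscore >= 0.9 and t.protein_coverage >= 0.70


def passes_te_filter(t: Transcript) -> bool:
    """Drop models whose protein is more than 30% in Pfam TE domains."""
    return t.te_fraction <= 0.30


def keep(t: Transcript) -> bool:
    """Selection followed by the Pfam TE-domain filter."""
    return is_selected(t) and passes_te_filter(t)
```

For example, a transcript with EST support but weak homology is kept when only 5% of its CDS overlaps repeats, whereas the same transcript with 40% repeat overlap is rejected unless its Cscore reaches 0.9 and its coverage reaches 70%.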

References

  1. Repeated polyploidization of Gossypium genomes and the evolution of spinnable cotton fibres.
    Andrew H. Paterson, Jonathan F. Wendel, Heidrun Gundlach, Hui Guo, Jerry Jenkins, Dianchuan Jin, Danny Llewellyn, Kurtis C. Showmaker, Shengqiang Shu, Joshua Udall et al. 2012. Nature. 492:423–427.
  2. Gossypium raimondii (D5) genome JGI assembly v2.0 (annot v2.1).
    COTTONGEN.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly: Graimondii2_0, INSDC Assembly GCA_000327365.1, Sep 2017
Database version: 93.1
Base Pairs: 747,964,930
Golden Path Length: 761,405,269
Genebuild by: EnsemblPlants
Genebuild method: Generated from ENA annotation
Data source: European Nucleotide Archive

Gene counts

Coding genes: 38,208
Non coding genes: 511
Small non coding genes: 511
Gene transcripts: 78,882