Gossypium raimondii Assembly and Gene Annotation

About Gossypium raimondii

Gossypium raimondii is a species of cotton plant endemic to northern Peru. The Gossypium genus is ideal for investigating emergent consequences of polyploidy. A-genome diploids native to Africa and Mexican D-genome diploids diverged ∼5–10 Myr ago. They were reunited ∼1–2 Myr ago by trans-oceanic dispersal of a maternal A-genome propagule resembling G. herbaceum to the New World, hybridisation with a native D-genome species resembling G. raimondii, and chromosome doubling. The nascent AtDt allopolyploid spread throughout the American tropics and subtropics, diverging into at least five species; two of these species (G. hirsutum and G. barbadense) were independently domesticated to spawn one of the world's largest industries (textiles) and become a major oilseed.

Assembly

The genome was assembled by JGI, using a modified version of Arachne to assemble 80,765,952 sequence reads, integrating linear (15x genome coverage) and paired (3.1x genome coverage) Roche 454 libraries corrected using 41.9 Gb Illumina sequence, with 1.54x paired-end Sanger sequences from two subclone, six fosmid and two BAC libraries. Cotton genetic and physical maps, and Vitis vinifera and Theobroma cacao synteny, were used to identify 51 joins across 64 scaffolds to form the 13 chromosomes. The remaining scaffolds were screened for contamination to produce a final assembly of 1,033 scaffolds (19,735 contigs) and 761.4 Mbp [1][2]. This assembly corresponds to ENA Release 133, Version 6 (20-JUN-2017).

Annotation

Three sets of transcript assemblies were built from Illumina RNA-Seq reads using PERTRAN: 85,746 transcript assemblies from about 1 billion pairs of D5 paired-end reads, 55,294 transcript assemblies from about 0.25B D5 single-end reads, and 62,526 transcript assemblies from 0.15B TET single-end reads. 120,929 transcript assemblies were constructed using PASA from 56,638 D5 Sanger ESTs, 2.5M D5 454 RNA-Seq reads and all of the RNA-Seq transcript assemblies above. 133,073 transcript assemblies were constructed using PASA from 296,214 TET Sanger ESTs and about 2.9M TET 454 reads. Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis thaliana, cacao, rice, soybean, grape and poplar to the soft-masked genome, with up to 2 kb extension on both ends.

Gene models were predicted by the homology-based predictors FGENESH+, FGENESH_EST (similar to FGENESH+, but using ESTs rather than protein/translated ORFs as splice site and intron input), and GenomeScan. The best-scoring prediction for each locus was selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA; improvements include adding UTRs, correcting splicing, and adding alternative transcripts. PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to its mutual best hit (MBH) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologues. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the fraction of the CDS overlapping repeats. A transcript whose CDS overlaps repeats by less than 20% was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage. A gene model whose CDS overlaps repeats by more than 20% was selected only if its Cscore was at least 0.9 and its homology coverage at least 70%. The selected gene models were subjected to Pfam analysis, and gene models whose protein is more than 30% in Pfam TE domains were removed. This gene annotation corresponds to ENA Release 133, Version 6 (20-JUN-2017).
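The selection thresholds described above can be sketched as a small Python filter. This is an illustrative reading of the rule, not the JGI pipeline code: the `Transcript` class, its field names, and the `cscore` and `is_selected` functions are hypothetical. The source does not say how a CDS with exactly 20% repeat overlap is handled, so the sketch assumes it falls under the stricter rule.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical record holding the four quantities used for selection."""
    cscore: float              # BLASTP score / mutual-best-hit BLASTP score
    protein_coverage: float    # fraction (0-1) of the protein aligned to its best homologue
    has_est_coverage: bool     # True if the transcript has EST support
    cds_repeat_overlap: float  # fraction (0-1) of the CDS overlapping repeats


def cscore(blastp_score: float, mbh_blastp_score: float) -> float:
    """Cscore as defined in the text: ratio of BLASTP score to MBH BLASTP score."""
    return blastp_score / mbh_blastp_score


def is_selected(t: Transcript) -> bool:
    """Apply the thresholds from the annotation description."""
    if t.cds_repeat_overlap < 0.20:
        # Low repeat overlap: homology support OR EST support is enough.
        homology_ok = t.cscore >= 0.5 and t.protein_coverage >= 0.5
        return homology_ok or t.has_est_coverage
    # Assumption: overlap of exactly 20% is treated like the >20% case.
    # High repeat overlap: only strong homology support rescues the model.
    return t.cscore >= 0.9 and t.protein_coverage >= 0.70


if __name__ == "__main__":
    candidate = Transcript(cscore=cscore(450.0, 500.0), protein_coverage=0.75,
                           has_est_coverage=False, cds_repeat_overlap=0.35)
    print("selected" if is_selected(candidate) else "rejected")
```

Note that under this reading EST coverage alone does not rescue a repeat-rich model; only the stricter homology thresholds do.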

Repeats were annotated with the Ensembl Genomes repeat feature pipeline. There are: 1,632,582 Low complexity (Dust) features, covering 86 Mb (11.4% of the genome); 551,323 RepeatMasker features (with the nrTEplants library), covering 423 Mb (55.5% of the genome); 474,340 Tandem repeats (TRF) features, covering 44 Mb (5.8% of the genome).
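As a rough sanity check, the coverage percentages above can be recomputed from the Golden Path Length given in the Summary table (761,405,269 bp). The function name below is hypothetical, and because the Mb figures in the text are rounded, the recomputed values should only be expected to agree with the reported percentages to within about 0.1 percentage point.

```python
GENOME_BP = 761_405_269  # Golden Path Length from the Summary table


def percent_of_genome(covered_mb: float, genome_bp: int = GENOME_BP) -> float:
    """Convert a covered length in Mb to a percentage of the genome."""
    return 100.0 * covered_mb * 1_000_000 / genome_bp


# (feature class, covered Mb, percentage reported in the text)
repeat_classes = [
    ("Low complexity (Dust)", 86, 11.4),
    ("RepeatMasker (nrTEplants)", 423, 55.5),
    ("Tandem repeats (TRF)", 44, 5.8),
]

if __name__ == "__main__":
    for name, covered_mb, reported in repeat_classes:
        computed = percent_of_genome(covered_mb)
        print(f"{name}: computed {computed:.1f}%, reported {reported}%")
```

The three classes can annotate the same bases, so the percentages are not meant to be summed into a single repeat fraction.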

References

[1] Repeated polyploidization of Gossypium genomes and the evolution of spinnable cotton fibres.
Andrew H. Paterson, Jonathan F. Wendel, Heidrun Gundlach, Hui Guo, Jerry Jenkins, Dianchuan Jin, Danny Llewellyn, Kurtis C. Showmaker, Shengqiang Shu, Joshua Udall et al. 2012. Nature. 492:423–427.
[2] Gossypium raimondii (D5) genome JGI assembly v2.0 (annot v2.1).
COTTONGEN.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly	Graimondii2_0_v6, INSDC Assembly GCA_000327365.1
Database version	115.2
Golden Path Length	761,405,269
Genebuild by	DOE Joint Genome Institute
Genebuild method	Import
Data source	DOE Joint Genome Institute

Gene counts

Coding genes	38,208
Gene transcripts	78,371

Coding genes: genes or transcripts that contain an open reading frame (ORF).
Gene transcripts: a transcript is the operational unit of a gene. In a genomic context, transcripts consist of one or more exons, with adjoining exons being separated by introns. The exons and introns are transcribed and the introns are then spliced out. Transcripts may or may not encode a protein.
