Populus trichocarpa Assembly and Gene Annotation

About Populus trichocarpa

Populus trichocarpa (poplar) is a deciduous broadleaf tree native to western North America. It is an economically important source of timber. Its rapid growth and compact genome (~500 Mb, n=19) have led to its use as a model organism for tree species.

Assembly

The v3 Populus genome assembly was constructed with Arachne version 20071016HA, with an attempt to merge the outbred haplotypes and an extensive effort to remove contaminating sequence. 81 Mb of finished clone sequence was also integrated, and the latest genetic map information was used to construct the 19 chromosome-sized scaffolds, which contain 388 Mb of sequence, a majority (about 92%) of the assembled poplar sequence. Care was taken to minimise alternative haplotypes within the assembly. The first 19 scaffolds of the assembly correspond to the poplar chromosomes. The full release covers 423 Mb of sequence, assembled at an average read depth of 9.44x.

Annotation

Gene annotation was carried out by Phytozome. 75,566 RNA-seq transcript assemblies were constructed from about 600 million pairs of tremulas paired-end Illumina RNA-seq reads using PERTRAN (Shu et al., manuscript in preparation). Transcript assemblies (86,677 from Populus trichocarpa and 151,316 from related poplar ESTs/mRNAs) were then constructed using PASA from the Populus trichocarpa RNA-seq transcript assemblies, Populus trichocarpa ESTs/mRNAs, and ESTs/mRNAs of other poplar species, including >2.6M 454-sequenced Populus deltoides EST reads generated at JGI.

Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from the Arabidopsis thaliana, rice, soybean or grape genomes to the P. trichocarpa genome, which was soft-masked for repeats using RepeatMasker. Gene models were predicted by homology-based predictors, mainly FGENESH+, FGENESH_EST (similar to FGENESH+, but taking ESTs as splice site and intron input instead of protein/translated ORF), and GenomeScan. The best-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats.

The selected gene predictions were improved by PASA. Improvements included adding UTRs, correcting splicing, and adding alternative transcripts. The proteins of the PASA-improved gene models were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage. The Cscore is the ratio of a protein's BLASTP score to the mutual best hit BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologues. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the extent to which the CDS overlaps repeats. A transcript whose CDS overlaps repeats by less than 20% was selected if its Cscore was at least 0.5 and its protein coverage was at least 0.5, or if it had EST coverage. A gene model whose CDS overlaps repeats by more than 20% had to have a Cscore of at least 0.9 and homology coverage of at least 70% to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein is more than 30% in Pfam TE domains were removed. The final gene set comprises 73,013 protein-coding transcripts at 41,335 loci.
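The selection thresholds described above can be sketched as a small filter. This is an illustrative reading of the prose, not the Phytozome pipeline code: the `Transcript` class, its field names, and the functions `cscore` and `passes_selection` are hypothetical. The text does not say how a repeat overlap of exactly 20% is handled, so the sketch assumes it falls under the stricter rule.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical record of the per-transcript evidence named in the text."""
    cscore: float                  # BLASTP score / mutual best hit BLASTP score
    protein_coverage: float        # fraction (0-1) of protein aligned to best homologue
    has_est_coverage: bool         # True if the transcript has EST support
    repeat_overlap: float          # fraction (0-1) of the CDS overlapping repeats
    pfam_te_fraction: float = 0.0  # fraction (0-1) of the protein in Pfam TE domains


def cscore(blastp_score: float, mutual_best_hit_score: float) -> float:
    """Ratio of a protein's BLASTP score to the mutual best hit BLASTP score."""
    return blastp_score / mutual_best_hit_score


def passes_selection(t: Transcript) -> bool:
    """Apply the thresholds from the text to one transcript."""
    if t.repeat_overlap < 0.20:
        # Low repeat overlap: homology support or EST support is enough.
        homology_ok = t.cscore >= 0.5 and t.protein_coverage >= 0.5
        keep = homology_ok or t.has_est_coverage
    else:
        # High repeat overlap (assumption: includes exactly 20%):
        # strong homology support is required, EST support alone is not enough.
        keep = t.cscore >= 0.9 and t.protein_coverage >= 0.70
    # Final Pfam step: drop models that are more than 30% TE domains.
    return keep and t.pfam_te_fraction <= 0.30
```

Under this reading, a transcript with EST support but weak homology is kept only when its CDS is largely free of repeats, while a repeat-rich model needs a Cscore of at least 0.9 and at least 70% coverage regardless of EST support.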

Regulation

Probes from the Poplar Genome Array Affymetrix GeneChip were aligned to the genome using the Ensembl Functional Genomics pipeline.

References

  1. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray).
    Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A et al. 2006. Science. 313:1596-1604.
  2. Image credit: Cherubino (Own work) [Public domain], via Wikimedia Commons.
  3. Phytozome: a comparative platform for green plant genomics.
    Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N et al. 2012. Nucleic Acids Res. 40(D1):D1178-D1186.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly: Pop_tri_v3, INSDC Assembly GCA_000002775.3, Jan 2018
Database version: 98.3
Base Pairs: 422,940,594
Golden Path Length: 434,132,815
Genebuild by: JGI
Genebuild method: Import
Data source: Joint Genome Institute

Gene counts

Coding genes: 41,335
Non coding genes: 1,012
Small non coding genes: 993
Long non coding genes: 19
Gene transcripts: 74,024
