Populus trichocarpa Assembly and Gene Annotation
About Populus trichocarpa
Populus trichocarpa (poplar) is a deciduous broadleaf tree that is native to Western North America. It is an economically important source of timber. Its rapid growth and compact genome size (~500 Mb, 2n=38) have led to its use as a model organism for tree species.
The v3 Populus genome assembly was constructed with Arachne version 20071016HA, with an attempt to merge the outbred haplotypes and an extensive attempt to remove contaminating sequence. 81 Mb of finished clone sequence was also integrated, and the latest genetic map information was used to construct the 19 chromosome-size scaffolds, which contain 388 Mb of sequence, a majority of the assembled poplar sequence. Care was taken to minimise alternative haplotypes within the assembly. The first 19 scaffolds from the assembly correspond to the poplar chromosomes. The full release covers 423 Mb of assembled sequence at an average read depth of 9.44x.
Gene annotation was carried out by Phytozome. 75,566 RNA-seq transcript assemblies were constructed from about 600 million pairs of tremulas paired-end Illumina RNA-seq reads using PERTRAN (Shu et al., manuscript in preparation). Transcript assemblies (86,677 from Populus trichocarpa and 151,316 from related poplar ESTs/mRNAs) were constructed using PASA from Populus trichocarpa RNA-seq transcript assemblies, ESTs/mRNAs, and ESTs/mRNAs of other poplar species, including >2.6M 454-sequenced Populus deltoides EST reads generated at JGI.
Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from the Arabidopsis thaliana, rice, soybean or grape genomes to the P. trichocarpa genome, which was repeat-soft-masked using RepeatMasker. Gene models were predicted by homology-based predictors, mainly FGENESH+, FGENESH_EST (similar to FGENESH+, but with ESTs as splice site and intron input instead of protein/translated ORF), and GenomeScan. The best-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats.
The selected gene predictions were improved by PASA. Improvements include adding UTRs, correcting splicing, and adding alternative transcripts. PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain a C-score and protein coverage. The C-score is the ratio of a protein's BLASTP score to the mutual best hit BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologues. PASA-improved transcripts were selected based on C-score, protein coverage, EST coverage, and the extent to which the CDS overlaps with repeats. A transcript was selected if its C-score was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped with repeats. For gene models whose CDS overlapped with repeats by more than 20%, the C-score had to be at least 0.9 and the homology coverage at least 70% for the model to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% in Pfam TE domains were removed. The final result was 41,335 loci containing protein-coding transcripts and 73,013 protein-coding transcripts.
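The selection rules above can be sketched as a small filter. This is only an illustration of the stated thresholds, not Phytozome's actual pipeline code: the `Transcript` class, its field names, and the function names are hypothetical, and treating an overlap of exactly 20% as the stricter case is an assumption, since the text does not say which rule applies at the boundary.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Evidence for one PASA-improved transcript (all field names are hypothetical)."""
    cscore: float              # BLASTP score / mutual best hit BLASTP score
    protein_coverage: float    # fraction (0-1) of the protein aligned to its best homologue
    has_est_coverage: bool     # True if the transcript has EST support
    cds_repeat_overlap: float  # fraction (0-1) of the CDS overlapping repeats
    pfam_te_fraction: float    # fraction (0-1) of the protein in Pfam TE domains


def cscore(blastp_score: float, mutual_best_hit_score: float) -> float:
    """C-score as defined in the text: a ratio of two BLASTP scores."""
    if mutual_best_hit_score <= 0:
        raise ValueError("mutual best hit score must be positive")
    return blastp_score / mutual_best_hit_score


def passes_selection(t: Transcript) -> bool:
    """Apply the homology/EST thresholds, which depend on repeat overlap."""
    if t.cds_repeat_overlap < 0.20:
        # Low repeat overlap: homology support OR EST coverage is enough.
        homology_ok = t.cscore >= 0.5 and t.protein_coverage >= 0.5
        return homology_ok or t.has_est_coverage
    # High repeat overlap (assumption: exactly 20% is handled here too):
    # stricter homology thresholds, and EST coverage alone does not suffice.
    return t.cscore >= 0.9 and t.protein_coverage >= 0.70


def keep_model(t: Transcript) -> bool:
    """Final decision: selection thresholds plus the Pfam TE-domain filter."""
    return passes_selection(t) and t.pfam_te_fraction <= 0.30
```

For example, a transcript with EST support but weak homology is kept when its CDS is nearly repeat-free, while the same evidence is rejected once half of the CDS overlaps repeats.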
Repeated sequences were called with the Repeat Detector, which is part of the Ensembl Genomes repeat feature pipelines. Repeats length: 157,575,986. Repeats content: 36.3%.
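As a quick consistency check, the repeat content can be recomputed from the repeat length. The sketch below assumes the percentage is taken relative to the Golden Path Length of 434,132,815 listed in the statistics for this assembly; the page does not state the denominator explicitly, and the function name is illustrative.

```python
# Figures quoted on this page.
REPEAT_LENGTH = 157_575_986        # total length of called repeats
GOLDEN_PATH_LENGTH = 434_132_815   # assumed denominator (Golden Path Length)


def repeat_content_percent(repeat_length: int, genome_length: int) -> float:
    """Return the repeat content as a percentage of the genome length."""
    return 100.0 * repeat_length / genome_length


print(f"{repeat_content_percent(REPEAT_LENGTH, GOLDEN_PATH_LENGTH):.1f}%")  # prints 36.3%
```

Under that assumption the result agrees with the 36.3% reported above.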
Probes from the Poplar Genome Array Affymetrix GeneChip were aligned to the genome using the Ensembl Functional Genomics pipeline.
- The genome of black cottonwood, Populus trichocarpa (Torr. &
Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A et al. 2006. Science. 313:1596-1604.
- Image credit: Cherubino (Own work) [Public domain], via Wikimedia
- Phytozome: a comparative platform for green plant
David M. Goodstein, Shengqiang Shu, Russell Howson, Rochak Neupane, Richard D. Hayes, Joni Fazo, Therese Mitros, William Dirks, Uffe Hellsten, Nicholas Putnam et al. 2012. Nucleic Acids Res. 40 (D1)
General information about this species can be found in Wikipedia.
|Assembly||Pop_tri_v3, INSDC Assembly GCA_000002775.3, Jan 2018|
|Golden Path Length||434,132,815|
|Data source||Joint Genome Institute|
|Non coding genes||1,012|
|Small non coding genes||993|
|Long non coding genes||19|