Populus trichocarpa (Pop_tri_v4)

Populus trichocarpa Assembly and Gene Annotation

About Populus trichocarpa

Populus trichocarpa (poplar) is a deciduous broadleaf tree native to western North America. It is an economically important source of timber. Its rapid growth and compact genome size (~500 Mb, 2n=38) have led to its use as a model organism for tree species.

Assembly

The main assembly consisted of 133.2x PacBio coverage (average read size 6,139 bp) and was assembled using MECAT; the resulting sequence was polished using QUIVER. The primary transcripts from the V3 release of Populus trichocarpa (var. Nisqually) were used to identify misjoins in the Populus assembly. Misjoins were characterized by an abrupt change in the Populus linkage group. A total of 12 breaks were made to the assembly.

Scaffolds were then oriented, ordered, and joined together using gene synteny. Significant telomeric sequence was properly oriented in the assembly. A total of 81 joins were applied to the broken assembly to form the final assembly consisting of 19 chromosomes. 99.2% of the assembled sequence is contained in the chromosomes.

Adjacent alternative haplotypes (altHaps) were identified on the joined contig set. AltHap regions were collapsed using the longest common substring between the two haplotypes. A total of 21 adjacent altHaps were collapsed. A set of 676 finished BAC clones was then aligned to the chromosomes in an attempt to patch remaining gaps. One gap on Chr07 was patched, resulting in the addition of 16,801 bases.
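
The idea of a longest common substring between two haplotype sequences can be illustrated with a short dynamic-programming sketch. This is not the pipeline used for the assembly; the function name and the toy sequences are hypothetical, and a production tool would use a suffix-based index rather than a quadratic table for sequences of this size.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b.

    Illustrative O(len(a) * len(b)) dynamic programming; ties are
    resolved in favour of the first match found in `a`.
    """
    best_len = 0
    best_end = 0  # exclusive end index of the best match in `a`
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


# Toy haplotype fragments (made up for illustration).
hap_a = "ACGTTGCA"
hap_b = "TTGTTGCC"
print(longest_common_substring(hap_a, hap_b))  # GTTGC
```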

Heterozygous SNP/indel phasing errors were corrected using the 133.2x raw PacBio data. A total of 53,017 (2.03% of the 2,614,201 heterozygous SNPs/indels) were corrected. Additionally, homozygous SNPs and indels were corrected in the release sequence using 100x of Illumina reads (2x150, 500 bp insert, library ID IPFY).
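
As a quick check of the figures quoted above, the corrected fraction can be recomputed from the two counts. The helper name below is hypothetical; only the two counts come from the text.

```python
def percent_corrected(corrected: int, total: int) -> float:
    """Percentage of heterozygous SNPs/indels whose phasing was corrected."""
    return 100.0 * corrected / total


# Counts quoted in the assembly description.
pct = percent_corrected(53_017, 2_614_201)
print(f"{pct:.2f}%")  # 2.03%
```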

Annotation

Gene annotation was carried out by Phytozome.

Transcript assemblies were made from ~1.4B pairs of 2x150 stranded paired-end Illumina RNA-seq GeneAtlas reads and ~1.2B pairs of 2x100 stranded paired-end Illumina RNA-seq reads from Dr. Pankaj Jaiswal using PERTRAN. About 3M PacBio Iso-Seq CCSs were corrected and collapsed by a genome-guided correction pipeline (Shu, unpublished) to obtain ~0.5M putative full-length transcripts. 289,692 transcript assemblies were constructed using PASA from the RNA-seq transcript assemblies above.

Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis thaliana, soybean, peach, Kitaake rice, Setaria viridis, tomato, cassava, grape and Swiss-Prot proteomes to the Populus trichocarpa genome, repeat-soft-masked using RepeatMasker (Smit, 2013-2015), with up to 2 kb extension on both ends unless extending into another locus on the same strand. Gene models were predicted by the homology-based predictors FGENESH+, FGENESH_EST (similar to FGENESH+, but with ESTs rather than protein/translated ORFs as splice site and intron input) and EXONERATE, from PASA assembly ORFs (in-house homology-constrained ORF finder), and from AUGUSTUS via BRAKER1 (Hoff, 2015).

The best-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA. Improvements include adding UTRs, correcting splicing, and adding alternative transcripts. PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to the mutual best hit (MBH) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best homolog. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats.

A transcript was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped with repeats. For gene models whose CDS overlapped with repeats by more than 20%, the Cscore had to be at least 0.9 and the homology coverage at least 70% for the model to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% in Pfam TE domains were removed, as were weak gene models. Incomplete gene models, gene models with low homology support and without full transcriptome support, and short single-exon gene models (< 300 bp CDS) with neither a protein domain nor good expression were manually filtered out.
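
The transcript selection thresholds described above can be written out as a small predicate. This is a minimal sketch of the stated rules, not the Phytozome implementation: the class and field names are hypothetical, coverage and overlap are expressed as fractions between 0 and 1, "homology coverage" is assumed to be the same quantity as protein coverage, and a repeat overlap of exactly 20% (which the text does not specify) is assumed to fall under the more lenient rule.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    cscore: float            # BLASTP score / mutual-best-hit BLASTP score
    protein_coverage: float  # fraction of protein aligned to best homolog
    has_est_coverage: bool   # any EST/transcript support
    repeat_overlap: float    # fraction of CDS overlapping repeats


def is_selected(t: Transcript) -> bool:
    """Apply the Cscore / coverage / repeat-overlap thresholds from the text."""
    if t.repeat_overlap > 0.20:
        # Repeat-rich CDS: require strong homology support.
        return t.cscore >= 0.9 and t.protein_coverage >= 0.70
    # Low repeat overlap (assumption: includes exactly 20%).
    return (t.cscore >= 0.5 and t.protein_coverage >= 0.5) or t.has_est_coverage


# Made-up examples: one well-supported model, one repeat-rich model
# with only moderate homology support.
print(is_selected(Transcript(0.6, 0.6, False, 0.05)))  # True
print(is_selected(Transcript(0.6, 0.6, True, 0.50)))   # False
```

The later filters (Pfam TE domain content above 30%, and the manual removal of incomplete or weakly supported models) are not covered by this sketch.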

Regulation

Probes from the Poplar Genome Array Affymetrix GeneChip were aligned to the genome using the Ensembl Functional Genomics pipeline.

References

  1. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray).
    Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A et al. 2006. Science. 313:1596-1604.
  2. Image credit: Cherubino (Own work) [Public domain], via Wikimedia Commons.
  3. Phytozome: a comparative platform for green plant genomics.
    Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N et al. 2012. Nucleic Acids Res. 40(D1):D1178-D1186.

Statistics

Summary

Assembly: Pop_tri_v4, INSDC Assembly GCA_000002775.4, Feb 2022
Database version: 112.1
Golden Path Length: 392,162,179
Genebuild by: JGI
Genebuild method: External annotation import
Data source: JGI

Gene counts

Coding genes: 34,699
Gene transcripts: 52,400