Pistacia vera Assembly and Gene Annotation
About Pistacia vera
Pistachio (Pistacia vera, 2n = 30) is one of the most important commercial nut crops worldwide and originated in Central Asia and the Middle East. The pistachio tree is a deciduous, long-lived desert plant that tolerates high levels of salinity and drought stress. Pistachio is a member of the Anacardiaceae family and was domesticated about 8000 years ago.
An individual of the cultivar Batoury was chosen for genome sequencing and assembly. The genome was sequenced on the Illumina HiSeq 2500 platform from multiple paired-end libraries, including two small-insert libraries (270 bp and 500 bp) and six long-insert mate-pair libraries (3 kb, 4 kb, 8 kb, 10 kb, 15 kb, and 17 kb), achieving 270.47X coverage. From these reads, a draft genome of 569.12 Mb was assembled, with contig and scaffold N50 sizes of 20.69 kb and 768.39 kb, respectively. To improve continuity, a total of 4,038,150 filtered long reads with an average length of 14,568 bp were generated from 59 Gb of sequencing data on the PacBio Sequel System. Finally, a draft genome of 671 Mb was assembled, with contig and scaffold N50 sizes of 75.7 kb and 949.2 kb, respectively; the combined sequencing data provided a total of 373.84X coverage. The completeness of the genome assembly was confirmed with CEGMA and BUSCO.
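The N50 and coverage figures above can be illustrated with a minimal Python sketch. The function name `n50` and the toy length list are hypothetical, chosen only for illustration. The coverage calculation is an assumption rather than a method stated in the source: it treats the long-read yield as the read count times the average read length and divides by the 569.12 Mb draft assembly size, which is one reading that happens to reproduce the reported 373.84X total.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical lengths, not pistachio data).
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))  # 70

# Coverage check using the figures quoted in the text.
illumina_coverage = 270.47       # reported short-read coverage
pacbio_reads = 4_038_150         # filtered long reads
mean_read_length_bp = 14_568     # average long-read length
draft_assembly_bp = 569.12e6     # assumption: first draft size as denominator

pacbio_bases = pacbio_reads * mean_read_length_bp
pacbio_coverage = pacbio_bases / draft_assembly_bp
total_coverage = illumina_coverage + pacbio_coverage

print(f"Long-read yield: {pacbio_bases / 1e9:.2f} Gb")   # about 58.83 Gb
print(f"Long-read coverage: {pacbio_coverage:.2f}X")     # about 103.37X
print(f"Total coverage: {total_coverage:.2f}X")          # about 373.84X
```

Under this assumption the long reads contribute roughly 103X, and adding the 270.47X short-read coverage gives the quoted total.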
Protein-coding genes were predicted using de novo and protein homology-based approaches. Genscan v1.0, Augustus v2.5.5, GlimmerHMM v3.0.1, GeneID v1.3, and SNAP were used for de novo gene prediction, while homologous peptides from the A. thaliana (TAIR 10), Oryza sativa (Nipponbare, IRGSP-1.0), Theobroma cacao (Phytozome v12.1), and C. sinensis (Phytozome v12.1) genomes were aligned to the assembly with GeMoMa v1.4.2 to identify homologous genes. RNA-Seq reads were assembled using Trinity, the resulting unigenes were aligned to the repeat-masked assembly using BLAT, and gene structures were then modeled from the BLAT alignments using PASA. Protein-coding regions were identified with TransDecoder v3.0.1 and GeneMarkS-T. Consensus gene models were generated by integrating the de novo predictions and protein alignments using EVidenceModeler.
Repeated sequences were called with the Repeat Detector, which is part of the Ensembl Genomes repeat feature pipelines. Repeat length: 332,023,120 bp; repeat content: 49.5%.
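As a quick consistency check, the repeat content can be recomputed from the repeat length and the golden path length of 671,152,441 listed in the assembly statistics. This is a minimal sketch that assumes the percentage is simply the repeat length divided by the golden path length; the variable names are illustrative.

```python
repeat_length_bp = 332_023_120        # repeat length reported above
golden_path_length_bp = 671_152_441   # golden path length from the assembly statistics

# Assumption: repeat content = repeat length / golden path length.
repeat_fraction = repeat_length_bp / golden_path_length_bp
print(f"Repeat content: {repeat_fraction:.1%}")  # Repeat content: 49.5%
```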
- Whole genomes and transcriptomes reveal adaptation and
Zeng L, Tu XL, Dai H, Han FM, Lu BS, Wang MS, Nanaei HA, Tajabadipour A, Mansouri M, Li XL et al. 2019. Genome Biology. 20:79.
Picture credit: Professor Ali Esmailizadeh, Shahid Bahonar University of Kerman
General information about this species can be found in Wikipedia.
|Assembly||PisVer_v2, INSDC Assembly GCA_008641045.1|
|Golden Path Length||671,152,441|
|Genebuild method||External annotation import|
|Data source||Chinese Academy of Sciences|