Kalanchoe fedtschenkoi Assembly and Gene Annotation
About Kalanchoe fedtschenkoi
Crassulacean acid metabolism (CAM) is a water-use efficient adaptation of photosynthesis that has evolved independently many times in diverse lineages of flowering plants. Kalanchoe fedtschenkoi, an obligate CAM species, is a model species for research into the molecular biology and functional genomics of CAM due to its relatively small diploid genome, low repetitive sequence content, and efficient stable transformation protocols. Most species of the genus are native to Madagascar and tropical Africa.
Assembly
The diploid (2n=2x=34) genome size was estimated to be 260 Mb. It was assembled from 70x paired-end and 37x mate-pair reads generated using the Illumina MiSeq platform. The genome assembly consisted of 1324 scaffolds with a total length of 256 Mb and scaffold N50 of 2.45 Mb.
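As an illustration of the scaffold N50 statistic quoted above, the following minimal sketch computes N50 as the length of the scaffold at which the running total, taken from the longest scaffold downwards, first reaches half of the assembly length. The function name `n50` and the toy lengths are hypothetical and are not taken from the K. fedtschenkoi assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no scaffold lengths given")


# Hypothetical toy assembly (not real data): total length 20, half is 10.
toy_lengths = [8, 5, 4, 2, 1]
print(n50(toy_lengths))  # 8 + 5 = 13 >= 10, so the N50 is 5
```

Applied to the real scaffold lengths, the same calculation would be expected to give the reported 2.45 Mb.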
When compared to the grape genome, which is the best available reference for studying eudicot genome duplication events, synteny patterns reflect a 1:4 gene copy ratio between grape and K. fedtschenkoi.
Annotation
Transcript assemblies (TAs) were built from 414 million paired-end reads using PERTRAN. PASA then assembled the TAs and ESTs into 46,382 PASA transcript assemblies, and 185,671 TAs and ESTs into 62,501 sibling PASA transcript assemblies. Loci were determined by PASA transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis thaliana, rice, sorghum, mimulus, grape, Phalaenopsis equestris and the Swiss-Prot proteomes to the repeat-soft-masked genome sequence, with an extension of up to 2 kbp on both ends unless this extended into another locus on the same strand.
Gene models were predicted by homology-based predictors (FGENESH+, FGENESH_EST, GenomeScan). The best-scoring prediction for each locus was selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA, which added UTRs, corrected splicing and added alternative transcripts. The proteins of the PASA-improved gene models were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and a protein coverage value. The Cscore is the ratio of a protein's BLASTP score to the mutual best hit (MBH) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologs.
PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. A transcript was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped with repeats. For gene models with more than 20% of their CDS overlapping with repeats, the Cscore had to be at least 0.9 and the homology coverage at least 70% for the model to be selected. The selected gene models were then subjected to Pfam and Panther analysis, and gene models whose protein was more than 30% covered by Pfam/Panther TE domains were removed.
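The transcript selection thresholds described above can be sketched as a small filter. This is only an illustration of the stated rules, not the pipeline's actual code: the `Transcript` class, its field names and `is_selected` are hypothetical. The sketch also assumes that the 20% repeat-overlap limit applies to both the homology route and the EST route, and that a transcript with exactly 20% overlap falls under the stricter rule, since the text does not specify that boundary case.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical container for the per-transcript evidence used in selection."""
    cscore: float              # BLASTP score ratio to the mutual best hit
    protein_coverage: float    # fraction of the protein aligned to its best homolog (0-1)
    has_est_coverage: bool     # True if the transcript has EST support
    cds_repeat_overlap: float  # fraction of the CDS overlapping repeats (0-1)


def is_selected(t: Transcript) -> bool:
    """Apply the selection thresholds quoted in the text."""
    if t.cds_repeat_overlap < 0.20:
        # Low repeat overlap: homology support or EST support is sufficient.
        homology_ok = t.cscore >= 0.5 and t.protein_coverage >= 0.5
        return homology_ok or t.has_est_coverage
    # High repeat overlap (assumed to include exactly 20%): stricter homology cut-offs.
    return t.cscore >= 0.9 and t.protein_coverage >= 0.70


# Hypothetical examples (not real gene models).
print(is_selected(Transcript(0.6, 0.6, False, 0.05)))   # homology route
print(is_selected(Transcript(0.2, 0.1, True, 0.05)))    # EST route
print(is_selected(Transcript(0.6, 0.6, True, 0.40)))    # repeat-rich, fails strict rule
print(is_selected(Transcript(0.95, 0.80, False, 0.40))) # repeat-rich, passes strict rule
```

The subsequent removal of models with more than 30% Pfam/Panther TE domain content is a separate step and is not modelled here.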
Repeats were annotated with the Ensembl Genomes repeat feature pipeline. There are: 474,588 low complexity (Dust) features, covering 29 Mb (11.4% of the genome); 66,721 RepeatMasker features (with the nrTEplants library), covering 21 Mb (8.1% of the genome); 108,126 tandem repeat (TRF) features, covering 10 Mb (4.0% of the genome); and Repeat Detector repeats with a total length of 90 Mb (35.3% of the genome).
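The coverage percentages above can be roughly cross-checked against the assembly length. The sketch below assumes that the percentages are relative to the Golden Path Length given in the Statistics summary (256,351,415 bp), which the page does not state explicitly. Because the spans are quoted in whole megabases, the recomputed values only approximate the reported ones. The names `GENOME_LENGTH_BP`, `REPORTED` and `percent_of_genome` are hypothetical.

```python
GENOME_LENGTH_BP = 256_351_415  # Golden Path Length from the Statistics summary

# Repeat class -> (reported span in Mb, rounded on the page; reported percentage)
REPORTED = {
    "Dust": (29, 11.4),
    "RepeatMasker": (21, 8.1),
    "TRF": (10, 4.0),
    "Repeat Detector": (90, 35.3),
}


def percent_of_genome(span_mb, genome_bp=GENOME_LENGTH_BP):
    """Convert a span in megabases to a percentage of the genome length."""
    return 100 * span_mb * 1_000_000 / genome_bp


for name, (span_mb, reported_pct) in REPORTED.items():
    recomputed = percent_of_genome(span_mb)
    print(f"{name}: reported {reported_pct}%, recomputed {recomputed:.1f}%")
```

Small differences between the two columns are expected from the rounding of the spans to whole megabases.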
- The Kalanchoë genome provides insights into convergent evolution and building blocks of crassulacean acid metabolism.
Yang X, Hu R, Yin H, Jenkins J, Shu S, Tang H, Liu D, Weighill DA, Cheol Yim W, Ha J, Heyduk K, Goodstein DM, Guo HB, Moseley RC, Fitzek E, Jawdy S, Zhang Z, Xie M, Hartwell J, Grimwood J, Abraham PE, Mewalal R, Beltrán JD, Boxall SF, Dever LV, Palla KJ, Albion R, Garcia T, Mayer JA, Don Lim S, Man Wai C, Peluso P, Van Buren R, De Paoli HC, Borland AM, Guo H, Chen JG, Muchero W, Yin Y, Jacobson DA, Tschaplinski TJ, Hettich RL, Ming R, Winter K, Leebens-Mack JH, Smith JAC, Cushman JC, Schmutz J, Tuskan GA. Nat Commun 8 (1)
Picture credit: Yang et al. (see reference above)
More information
General information about this species can be found in Wikipedia.
Statistics
Summary
Assembly | K_fedtschenkoi_M2_v1, INSDC Assembly GCA_002312845.1 |
Database version | 113.1 |
Golden Path Length | 256,351,415 |
Genebuild by | Phytozome |
Genebuild method | External annotation import |
Data source | Oak Ridge National Laboratory |
Gene counts
Coding genes | 30,964 |
Gene transcripts | 45,190 |