Eutrema salsugineum (Eutsalg1_0)

Eutrema salsugineum Assembly and Gene Annotation

About Eutrema salsugineum

Halophytes are plants that naturally tolerate high concentrations of salt in the soil. Saltwater cress (Eutrema salsugineum) is a halophytic species in the Brassicaceae family that tolerates several abiotic stresses that typically limit crop productivity, including extreme salinity and cold. It is widely used as a laboratory model for stress biology research in plants. Comparative chromosome painting indicates that it is a diploid with 2n = 14 chromosomes. The plant, formerly known as Thellungiella halophila, is native to the saline seashore soils of eastern China.

Assembly

A high-quality genome sequence of E. salsugineum was obtained at 8x coverage using a traditional Sanger sequencing-based whole-genome shotgun approach. This was based on a BAC library with read lengths ranging from 500 to 1,000 bp. A total of 3,875,327 sequence reads were assembled using a modified version of Arachne into 1,107 scaffolds (4,292 contigs) with a scaffold L50 of 9.2 Mb, totalling 245.8 Mb of sequence.
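The scaffold L50 quoted above is a length (9.2 Mb); many other tools call the same quantity N50 and reserve L50 for the scaffold count. As a minimal sketch of how such a statistic is usually derived, assuming the common definition (the length of the scaffold at which the cumulative sum of sorted lengths first reaches half the assembly size), the hypothetical helper below returns both the length and the count. It is an illustration, not the procedure used for this assembly.

```python
def half_assembly_stats(lengths):
    """Return (length, count) at the point where the cumulative sum of
    scaffold lengths, sorted longest first, reaches half the total.

    'length' is what this page calls the scaffold L50 (often named N50
    elsewhere); 'count' is the number of scaffolds needed to get there.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count


# Toy example with made-up lengths (not the real scaffold lengths).
example = half_assembly_stats([10, 8, 6, 4, 2])
```

With the toy lengths, the total is 30, so the halfway point (15) is reached by the second scaffold, giving a length of 8 and a count of 2.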

The scaffolds were screened against bacterial proteins, organelle sequences, and the GenBank non-redundant database to remove potential contamination. Additional scaffolds were removed if more than 95% of their 24-mers occurred four other times in scaffolds larger than 50 kb. To improve assembly contiguity, the 1,107 scaffolds were further joined using a combination of A. thaliana synteny and BAC and fosmid pair support. The final release contains 639 scaffolds (243.1 Mb).
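The 24-mer redundancy filter can be illustrated with a short sketch. The exact implementation is not described here, so the following assumes one plausible reading: count every 24-mer in the scaffolds larger than 50 kb, then, for a candidate scaffold, measure the fraction of its 24-mers that appear at least four times in that set. The function names, the `min_other` parameter, and the choice to pass the large scaffolds in explicitly are all assumptions.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def repetitive_fraction(candidate, large_scaffolds, k=24, min_other=4):
    """Fraction of the candidate's k-mers seen at least min_other times
    in the large scaffolds (assumed to be those longer than 50 kb)."""
    counts = Counter()
    for scaffold in large_scaffolds:
        counts.update(kmers(scaffold, k))
    total = hits = 0
    for kmer in kmers(candidate, k):
        total += 1
        if counts[kmer] >= min_other:
            hits += 1
    return hits / total if total else 0.0


def is_redundant(candidate, large_scaffolds, threshold=0.95, **kwargs):
    """Apply the >95% rule described in the text."""
    return repetitive_fraction(candidate, large_scaffolds, **kwargs) > threshold
```

A real pipeline would stream k-mers from FASTA files and probably also count reverse complements; both are omitted to keep the sketch short.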

Annotation

To facilitate gene prediction, five RNA libraries were generated from tissues for sequencing. Completeness of the resulting assembly was assessed using 1,138,639 454 ESTs. The aim of this analysis was to measure the completeness of the assembly, rather than to examine gene expression abundance comprehensively. Briefly, ESTs shorter than 200 bp were removed, along with all duplicate ESTs, to avoid bias. The remaining ESTs were placed against the genome using BLAT, and the resulting placements were screened for alignments with ≥90% identity and ≥85% EST coverage. The screened alignments indicate that 98.76% of available expressed gene loci were included in the assembly.
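The EST screening steps above can be sketched as follows. This is a simplified illustration under stated assumptions: ESTs are held in a dictionary of name to sequence, alignments are given as hypothetical `(name, identity_percent, coverage_percent)` tuples already parsed from BLAT output, and completeness is approximated as the fraction of retained ESTs with at least one passing alignment. The reported 98.76% figure refers to expressed gene loci, so the real calculation may differ.

```python
def filter_ests(ests, min_len=200):
    """Drop ESTs shorter than min_len and exact duplicate sequences,
    keeping the first EST seen for each distinct sequence."""
    seen = set()
    kept = {}
    for name, seq in ests.items():
        seq = seq.upper()
        if len(seq) < min_len or seq in seen:
            continue
        seen.add(seq)
        kept[name] = seq
    return kept


def placed_fraction(kept, alignments, min_identity=90.0, min_coverage=85.0):
    """Fraction of kept ESTs with an alignment meeting both thresholds."""
    placed = {
        name
        for name, identity, coverage in alignments
        if name in kept and identity >= min_identity and coverage >= min_coverage
    }
    return len(placed) / len(kept) if kept else 0.0


# Toy data, not real ESTs.
toy_ests = {"a": "A" * 200, "b": "A" * 200, "c": "C" * 150, "d": "G" * 250}
toy_alignments = [("a", 95.0, 90.0), ("d", 89.0, 99.0), ("c", 99.0, 99.0)]
toy_kept = filter_ests(toy_ests)
toy_fraction = placed_fraction(toy_kept, toy_alignments)
```

In the toy data, "b" is a duplicate of "a" and "c" is too short, so two ESTs remain; only "a" passes both thresholds, giving a fraction of 0.5.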

Gene annotation combined homology-based searches, ab initio prediction with GenomeScan and FGENESH, and 454-derived ESTs. Specifically, protein sequences from angiosperm plants and the 61,797 EST sequences from E. salsugineum were aligned to the scaffolds to determine potential coding open reading frames (ORFs). The candidate genomic regions, extended by 2 kb in each direction from the center of the aligned ORFs, were then submitted to GenomeScan and FGENESH to predict full-length protein-coding genes.
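The region-extension step can be illustrated with a small coordinate calculation. This sketch takes the description literally (a window of 2 kb on each side of the ORF midpoint) and assumes 0-based, half-open coordinates clipped to the scaffold bounds; the function name and coordinate convention are assumptions, and the original pipeline may have defined the window differently.

```python
def candidate_window(orf_start, orf_end, scaffold_length, flank=2000):
    """Return (start, end) of the region submitted to the gene predictors:
    flank bp on each side of the aligned ORF's midpoint, clipped to the
    scaffold. Coordinates are 0-based and half-open (an assumption)."""
    if not 0 <= orf_start < orf_end <= scaffold_length:
        raise ValueError("ORF coordinates must lie within the scaffold")
    center = (orf_start + orf_end) // 2
    start = max(0, center - flank)
    end = min(scaffold_length, center + flank)
    return start, end


# An ORF aligned at 10,000-11,000 on a 50 kb scaffold.
window = candidate_window(10_000, 11_000, 50_000)
```

For the example ORF the midpoint is 10,500, so the window runs from 8,500 to 12,500. Near a scaffold end the window is simply truncated.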

Repeats were annotated with the Ensembl Genomes repeat feature pipeline. There are 439,214 low-complexity (Dust) features, covering 18 Mb (7.2% of the genome); 127,716 RepeatMasker features (with the nrTEplants library), covering 58 Mb (24.0% of the genome); and 111,125 tandem repeat (TRF) features, covering 11 Mb (4.6% of the genome). Repeat Detector repeats span 95 Mb (39.2% of the genome).

Picture credit: MED30237204

Statistics

Summary

Assembly: Eutsalg1_0, INSDC Assembly GCA_000478725.1
Database version: 111.1
Golden Path Length: 243,110,105
Genebuild by: Joint Genome Institute
Genebuild method: Import
Data source: Joint Genome Institute

Gene counts

Coding genes: 26,528
Gene transcripts: 29,485