Phaseolus vulgaris Assembly and Gene Annotation
About Phaseolus vulgaris
Legumes are the third largest family of angiosperms and include many populous species. Most legumes harbour symbiotic bacteria within root nodules that mediate nitrogen fixation and give them an advantage over competing plants. Legume seeds are rich in protein, so many species have long been used for human or animal consumption. Legumes as a whole constitute the second largest class of crops, including peas, soybeans, peanuts, and beans. Common bean (Phaseolus vulgaris) is fundamental to the nutrition of more than 500 million people in developing countries and is a major source of protein that complements carbohydrate-rich rice, maize, and cassava.
An inbred landrace (G19833) derived from the Andean pool (Race Peru) was sequenced using a whole-genome shotgun strategy that combined multiple linear libraries (18.6x assembled sequence coverage) and ten paired libraries of varying insert sizes sequenced with the Roche 454 platform together with 24.1 Gb of Illumina-sequenced fragment libraries. For longer-range linkage, three fosmid libraries and two BAC libraries on the Sanger platform were also sequenced. The resulting assembled sequences were organized into 11 chromosomal pseudomolecules by integration with a dense GoldenGate- and Infinium-based SNP map of 7,015 markers from a Stampede x Red Hawk cross and a similar set of Infinium markers and 261 SSRs (simple sequence repeats). Additional refinements to the pseudomolecules were made on the basis of synteny with soybean (Glycine max), where allowed by available map data. Almost all of these changes were made in pericentromeric regions, where recombination is generally too limited to resolve the ordering and orientation of small scaffolds. The pseudomolecules included 468.2 Mb of mapped sequence in 240 scaffolds. The total release includes 472.5 Mb of the estimated 587 Mb genome, with half of the assembled nucleotides in contigs longer than 39.5 kb.
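The statement that half of the assembled nucleotides lie in contigs longer than 39.5 kb is a contig N50. As a minimal sketch of how such a figure is usually computed, the Python below uses a made-up list of contig lengths (the real contig lengths are not given here) and also works out the assembled fraction from the 472.5 Mb and 587 Mb figures quoted above. The function name `n50` and the toy data are illustrative assumptions, not part of the assembly pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    The N50 is the length L such that sequences of length >= L
    together contain at least half of all bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)

# Fraction of the estimated genome covered by the release,
# using the figures quoted in the text (in Mb).
assembled_mb = 472.5
estimated_genome_mb = 587.0
assembled_pct = 100 * assembled_mb / estimated_genome_mb

print(f"toy N50: {toy_n50}")
print(f"assembled fraction: {assembled_pct:.1f}%")
```

On the quoted figures, the release covers roughly four fifths of the estimated genome size.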
Sanger-derived EST resources and a substantial amount of new RNA sequencing data (727 million reads from 11 tissues and developmental stages) were combined with homology-based and de novo gene prediction approaches. A total of 43,627 transcript assemblies were obtained with PERTRAN and PASA. Loci were identified by transcript assembly alignments and/or EXONERATE alignments of peptides from Arabidopsis, poplar, Medicago truncatula, grape and rice to the repeat-soft-masked genome. Gene models were predicted by the homology-based predictors FGENESH+, FGENESH_EST and GenomeScan. The highest scoring predictions for each locus were selected using multiple positive factors. Selected gene predictions were improved by PASA, which added UTRs, corrected splicing and added alternative transcripts. PASA-improved gene model peptides were subjected to peptide homology analysis with the above-mentioned proteomes to obtain Cscore values and peptide coverage. Selected gene models were subjected to Pfam analysis, and gene models whose encoded peptide contained more than 30% Pfam transposon element domains were removed. The resulting annotation includes 28,134 protein-coding loci. Most of these genes (91%) were retained in synteny blocks with soybean.
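The final filtering step can be illustrated with a short sketch. It assumes that "more than 30% Pfam transposon element domains" means the fraction of peptide length covered by transposable-element Pfam domains, which the text does not spell out. The `GeneModel` class, its fields, and the example records are hypothetical and do not reflect the actual annotation pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class GeneModel:
    """Hypothetical record: a gene model and its TE-related Pfam hits."""
    name: str
    peptide_length: int
    # (start, end) peptide coordinates, 1-based and inclusive.
    te_domain_spans: list = field(default_factory=list)


def te_fraction(model):
    """Fraction of the peptide covered by TE domains, counting overlaps once."""
    covered = 0
    last_end = 0
    for start, end in sorted(model.te_domain_spans):
        # Skip any part of this span already counted by an earlier span.
        start = max(start, last_end + 1)
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / model.peptide_length


def filter_te_models(models, max_fraction=0.30):
    """Keep models whose TE-domain coverage does not exceed the threshold."""
    return [m for m in models if te_fraction(m) <= max_fraction]


# Made-up examples, for illustration only.
models = [
    GeneModel("model_a", 200, [(1, 100)]),            # 50% covered: removed
    GeneModel("model_b", 200, [(1, 40)]),             # 20% covered: kept
    GeneModel("model_c", 300),                        # no TE domains: kept
    GeneModel("model_d", 100, [(10, 50), (40, 80)]),  # 71% covered: removed
]
kept = [m.name for m in filter_te_models(models)]
print(kept)
```

Merging overlapping domain hits before computing coverage avoids double counting residues matched by more than one domain.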
Repeated sequences were called with the Repeat Detector, which is part of the Ensembl Genomes repeat feature pipelines. Repeats length: 197,689,120 bp. Repeats content: 37.9%.
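The repeat content figure appears consistent with dividing the repeats length by the golden path length listed in the statistics table (521,076,696). That choice of denominator is an inference from the numbers, not something the page states. A quick check in Python:

```python
# Figures quoted on this page.
repeats_length_bp = 197_689_120
golden_path_length_bp = 521_076_696  # assumed denominator (from the statistics table)

repeat_pct = 100 * repeats_length_bp / golden_path_length_bp
print(f"repeat content: {repeat_pct:.1f}%")
```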
Variation from the European Variation Archive was added.
These data were derived from whole-genome sequencing and bioinformatic analysis of 29 cultivars of common bean (Phaseolus vulgaris), nine pools, and two cultivated relatives of P. coccineus and P. acutifolius (https://ciat.cgiar.org/what-we-do/breeding-better-crops/beans/).
- A reference genome for common bean and genome-wide analysis of dual
Schmutz J, McClean PE, Mamidi S et al. 2014. Nat Genet. 46(7):707-13.
General information about this species can be found in Wikipedia.
|Assembly|PhaVulg1_0, INSDC Assembly GCA_000499845.1, Dec 2013|
|Golden Path Length (bp)|521,076,696|
|Data source|Joint Genome Institute|
|Non coding genes|1,190|
|Small non coding genes|1,185|
|Long non coding genes|5|