Phaseolus vulgaris Assembly and Gene Annotation

About Phaseolus vulgaris

Legumes are the third largest family of angiosperms and include many populous species. Most legumes host symbiotic bacteria in root nodules; these bacteria mediate nitrogen fixation and give the plants an advantage over competitors. Legume seeds are rich in protein, so many species have long been used for human or animal consumption. Legumes as a whole constitute the second largest class of crops, including peas, soybeans, peanuts, and beans. Common bean (Phaseolus vulgaris L.), a major source of protein that complements carbohydrate-rich rice, maize, and cassava, is fundamental to the nutrition of more than 500 million people in developing countries [1].


The genome of the Mesoamerican common bean P. vulgaris BAT93 was assembled using a hybrid sequencing strategy involving 454 single reads and 8, 10, and 20 kb mate pair libraries; 3 and 5 kb SOLiD mate pair libraries; and Sanger bacterial artificial chromosome (BAC)-end and genomic read pairs. Data free of redundancies were used as input for a Newbler assembly, and Illumina reads (45x coverage) were used to correct homopolymer errors and to close or reduce gaps within scaffolds. Illumina genotyping-by-sequencing (GBS) data from a set of 60 F5 lines of a BAT93 x Jalo EEP558 advanced intercross (6.7x coverage per line on average), together with 827 public marker sequences, were used for assembly correction and scaffold anchoring. Discontinuous genotype profiles, observed in 48 cases, were corrected manually by breaking scaffolds at the mis-assembly points. Markers were aligned to the assembly, and the GBS profiles of the scaffolds carrying them were used as seeds to place other scaffolds with the same or similar profiles onto chromosomes, followed by genetic map calculation. The final BAT93 genome sequence encompassed 549.6 Mbp, close to previous size estimates, with 81% of the assembly anchored to eleven linkage groups. The assembly included 97% of the conserved core eukaryotic genes, which indicates a high level of completeness [1].

This assembly includes 68,335 scaffolds (N50 = 433,759 bp) and 111,805 contigs (N50 = 10,795 bp) [2].
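For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is commonly computed: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the total assembly size. The function name `n50` and the toy lengths are illustrative assumptions, not data from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the cumulative sum reaches half
        # of the total is the N50.
        if running >= half_total:
            return length


# Made-up scaffold lengths, for illustration only.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))  # 70: 80 + 70 = 150, half of the 300 bp total
```

A higher N50 means that more of the assembly sits in long sequences, which is why the scaffold N50 reported here is much larger than the contig N50.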


Transposable elements were identified by combining de novo and homology-based approaches, which showed that 35% of the P. vulgaris BAT93 genome assembly is covered by repeats, mostly long terminal repeats. To aid gene prediction and to obtain a global view of the transcriptome during development, 61 RNA samples from 34 different organs and/or developmental stages of healthy plants were sequenced with Illumina. In addition, two normalized libraries derived from 162 RNA samples from plants grown under optimal and stress conditions were used for 454 pyrosequencing. Illumina and 454 RNA-Seq reads, as well as public expressed sequence tag (EST) and cDNA sequences, were combined with ab initio predictions to produce an initial gene set. This set was filtered to remove genes lacking both similarity to other plant proteins and any evidence of expression, resulting in 30,491 protein-coding genes (PCGs), whose 66,634 transcripts encode 53,904 unique proteins. Using protein signatures and phylogeny-based transfer of functional annotations, it was possible to associate functions with 94% of the bean transcripts, with 76% of them specifically associated with Gene Ontology (GO) terms [2].
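The filtering rule described above removes a gene only when it lacks both kinds of support, so a gene with either plant protein similarity or expression evidence is kept. The sketch below illustrates that logic only; the `GeneModel` class, its field names, and the toy gene identifiers are hypothetical and do not reflect the actual annotation pipeline.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    # Hypothetical record; field names are assumptions for illustration.
    gene_id: str
    has_plant_protein_similarity: bool
    has_expression_evidence: bool


def keep_gene(gene):
    """Keep a gene unless it lacks BOTH similarity and expression evidence."""
    return gene.has_plant_protein_similarity or gene.has_expression_evidence


def filter_gene_set(genes):
    """Return the genes that pass the filter, preserving input order."""
    return [gene for gene in genes if keep_gene(gene)]


# Toy candidates covering all four combinations of evidence.
candidates = [
    GeneModel("toy_gene_1", True, True),
    GeneModel("toy_gene_2", True, False),
    GeneModel("toy_gene_3", False, True),
    GeneModel("toy_gene_4", False, False),
]
kept_ids = [gene.gene_id for gene in filter_gene_set(candidates)]
print(kept_ids)  # only toy_gene_4 is removed
```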


  1. Genome and transcriptome analysis of the Mesoamerican common bean and the role of gene duplications in establishing tissue and temporal specialization of genes.
    Vlasova A, Capella-Gutiérrez S, Rendón-Anaya M, Hernández-Oñate M, Minoche AE, Erb I, Câmara F, Prieto-Barja P, Corvelo A, Sanseverino W et al. 2016. Genome Biology. 17
  2. Phaseolus vulgaris assembly in NCBI.

More information

General information about this species can be found in Wikipedia.



Assembly: PhaVulg1_0, INSDC Assembly GCA_000499845.1, Dec 2013
Database version: 94.1
Base Pairs: 472,453,824
Golden Path Length: 521,076,696
Genebuild by: EnsemblPlants
Genebuild method: Imported from ENA
Data source: Joint Genome Institute

Gene counts

Coding genes: 28,134
Non coding genes: 1,190
Small non coding genes: 1,185
Long non coding genes: 5
Gene transcripts: 33,910
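
As a quick sanity check on the gene counts listed above, the small and long non-coding counts should add up to the non-coding total. The sketch below also derives an average number of transcripts per gene, on the assumption (not stated in the source) that the transcript count covers coding and non-coding genes together. The dictionary keys are illustrative names.

```python
# Counts copied from the gene count listing above.
gene_counts = {
    "coding": 28_134,
    "non_coding": 1_190,
    "small_non_coding": 1_185,
    "long_non_coding": 5,
    "transcripts": 33_910,
}

# Small and long non-coding genes should sum to the non-coding total.
non_coding_sum = gene_counts["small_non_coding"] + gene_counts["long_non_coding"]
counts_consistent = non_coding_sum == gene_counts["non_coding"]

# Assumption: transcripts are counted over coding plus non-coding genes.
total_genes = gene_counts["coding"] + gene_counts["non_coding"]
transcripts_per_gene = gene_counts["transcripts"] / total_genes

print(counts_consistent)
print(total_genes)
print(round(transcripts_per_gene, 2))
```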
