Cajanus cajan (pigeon pea) - GCA_000340665.1 Assembly and Gene Annotation
About Cajanus cajan
Pigeonpea (Cajanus cajan) is an important legume food crop grown primarily by smallholder farmers in many semi-arid tropical regions of the world. Pigeonpea is grown on ∼5 million hectares, making it the sixth most important legume food crop globally. Domesticated more than 3,500 years ago in India, it is the main protein source for more than a billion people in the developing world and a cash crop that supports the livelihoods of millions of resource-poor farmers in Asia, Africa, South America, Central America and the Caribbean.
Assembly
Illumina GA and HiSeq 2000 sequencing systems were used to sequence 11 small-insert (180–800 bp) and 11 large-insert (2–20 kb) libraries [1]. This generated a total of 237.2 Gb of paired-end reads, ranging from 50 to 100 bp in length. Filtering and correcting the sequence data to remove very short and/or low-quality sequences yielded 130.7 Gb of high-quality sequence, ∼163.4× coverage of the pigeonpea genome. Analysis of the GC content of the sequence data indicated a similar GC content distribution in the genomes of pigeonpea and soybean. Additionally, a set of 88,860 bacterial artificial chromosome (BAC) end sequences was generated by Sanger sequencing from two BAC libraries (69,120 clones in total), constructed with the HindIII (34,560 clones) and BamHI (34,560 clones) restriction enzymes.
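The sequencing figures above can be cross-checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the text does not state which genome size underlies the reported coverage, so the code merely derives the size that the quoted numbers imply.

```python
# Figures quoted in the Assembly section.
RAW_READS_GB = 237.2        # total paired-end reads before filtering
FILTERED_READS_GB = 130.7   # high-quality sequence after filtering
REPORTED_COVERAGE = 163.4   # reported fold coverage of the genome


def implied_genome_size_mb(bases_gb: float, coverage: float) -> float:
    """Genome size (Mb) implied by a base count (Gb) and a fold coverage."""
    return bases_gb * 1000.0 / coverage


def retained_fraction(filtered_gb: float, raw_gb: float) -> float:
    """Fraction of raw sequence that survived filtering and correction."""
    return filtered_gb / raw_gb


implied_size = implied_genome_size_mb(FILTERED_READS_GB, REPORTED_COVERAGE)
retained = retained_fraction(FILTERED_READS_GB, RAW_READS_GB)

print(f"Implied genome size: {implied_size:.1f} Mb")
print(f"Sequence retained after filtering: {retained:.1%}")
```

Under these numbers, about 55% of the raw sequence was retained, and the reported coverage corresponds to a genome of roughly 800 Mb.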
SOAPdenovo was used to assemble 605.78 Mb of the pigeonpea genome de novo, generating a sequence with a contig N50 of 21.95 kb and a longest contig of 185.39 kb. The assembly was then improved by using both the paired BAC-end sequences (41,302) that passed filtering with RepeatMasker and a genetic map comprising 833 marker loci. This increased the scaffold N50 to 516.06 kb (the longest chromosome-level scaffold is 48.97 Mb). The draft genome assembly has <5.69% (∼34 Mb) unclosed gaps. These analyses showed that mapped genetic loci provide additional information for assembling superscaffolds, especially in regions where scaffolds were not large enough to span the repeat-rich regions. The resulting chromosome-scale scaffolds can be considered 'pseudomolecules'. The estimated pigeonpea genome size, based on k-mer statistics, is 833.07 Mb, suggesting that the assembly captures 72.7% of the genome in the genome scaffolds. If only the 6,534 scaffolds >2 kb are considered, the assembly spans 578 Mb with an N50 of 0.58 Mb.
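For readers unfamiliar with the N50 statistic used above, here is a minimal sketch of how it is conventionally computed: the length of the sequence at which the cumulative length, counted from the longest sequence downwards, first reaches half of the total. The toy lengths are made up for illustration and are not pigeonpea data; the last lines reproduce the 72.7% figure from the assembled and estimated genome sizes quoted in the text.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the running sum to >= 50%.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy example (hypothetical lengths): total is 20, and 6 + 5 = 11 >= 10.
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)

# Fraction of the estimated genome captured by the assembly (values from the text).
ASSEMBLED_MB = 605.78
ESTIMATED_GENOME_MB = 833.07
captured_percent = 100.0 * ASSEMBLED_MB / ESTIMATED_GENOME_MB

print(f"Toy N50: {toy_n50}")
print(f"Genome captured: {captured_percent:.1f}%")
```

A higher N50 means that more of the assembly sits in long sequences, which is why anchoring scaffolds with BAC-end pairs and genetic markers raises the value.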
Annotation
A combination of de novo gene prediction programs and homology-based methods was used to predict gene models in the pigeonpea genome. The predictions were combined using the GLEAN algorithm, resulting in the identification of 48,680 genes with an average transcript length of 2,348.70 bp, an average coding sequence size of 959.35 bp and an average of 3.59 exons per gene. The majority of these predicted genes (99.6%) were supported by de novo gene prediction, expressed sequence tags (ESTs)/unigenes, homology-based searching, or a combination of these approaches. Of these genes, Ensembl Plants displays 2,212 genes that were imported from ENA (European Nucleotide Archive).
References
- Draft genome sequence of pigeonpea (Cajanus cajan), an orphan legume crop of resource-poor farmers.
Rajeev K Varshney, Wenbin Chen, Yupeng Li, Arvind K Bharti, Rachit K Saxena, Jessica A Schlueter, Mark T A Donoghue, Sarwar Azam, Guangyi Fan, Adam M Whaley et al. 2012. Nature Biotechnology. 30:83-89.
Statistics
Summary
Assembly | C.cajan_V1.0, INSDC Assembly GCA_000340665.1, Sep 2016 |
Database version | 113.1 |
Golden Path Length | 592,816,859 |
Genebuild by | IIPG |
Genebuild method | External annotation import |
Data source | BGI |
Gene counts
Coding genes | 48,654 |
Gene transcripts | 48,654 |