R&D CENTER

Plant Genome Database Release 2.5: A Standardized Plant Genome Repository for 233 species

Jongsun Park*, Hong Xi, and Yongsung Kim
Currently, more than 200 plant species have been sequenced and/or published; however, there is no central repository for plant genome sequences. The NCBI genome database, as a general sequence repository, does not contain all published plant genomes with gene models (e.g., Utricularia gibba). Ensembl (Release 37) and Phytozome (v 12.1) are other plant genome repositories, containing 45 and 82 genomes, respectively, far fewer than the plant genomes currently available. In addition, many re-sequencing projects, including those for Arabidopsis thaliana (>1,135 genomes) and Oryza sativa (>3,000 genomes), do not provide the assembled genome sequences needed to understand intraspecies divergence. To overcome these problems, we developed a standardized plant genome database (http://www.plantgenome.info/) that collects all available plant genomes and provides a gene annotation pipeline based on InterProScan as well as identification and comparison of simple sequence repeats (SSRs). Moreover, several genome-wide analyses can be conducted on the web site with the aid of GlobalScrap®. The Plant Genome Database release 2.5 contains 1,547 plant genomes (322 species) and five red algal genomes as an outgroup, which were collected from diverse sources, including NCBI, Phytozome, Ensembl, and independent plant databases, and analyzed with automated pipelines. The total length of 1,546 genomes is 557.26 Gbp, and 212 plant genomes contain a total of 7,086,488 genes and 9,112,328 ORFs. The largest genome is that of Pinus lambertiana (34.08 Gbp), a gymnosperm; the average gymnosperm genome length is 21.08 Gbp. The 327 species comprise five red algae, 37 chlorophytes, two charophytes, one liverwort, two mosses, six gymnosperm species, and 272 angiosperm species. Thirty angiosperm orders have sequenced genomes: Poales covers 40 species, Lamiales 36, and Brassicales 31. InterProScan detected 13,274 distinct functional domains in 83.57% (7,391,092) of plant ORFs.
In total, 19,364,029 simple sequence repeats (SSRs) were identified from 1,546 genomes. The Botryococcus braunii genome has the largest proportion of SSRs (6.23%), and Arabidopsis thaliana Castelfed-4.1 has the smallest (0.004%). These analyses show that the 557.46 Gbp of plant genome sequence is not just a collection of A, T, G, and C but a source of new possible indicators for understanding the characteristics of plant genomes in relation to taxonomy.
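To make the SSR proportion concrete, the following is a minimal Python sketch of how perfect SSRs could be located in a sequence and how the fraction of the genome they cover could be computed. The abstract does not state which SSR tool or thresholds the database uses, so the function names (find_ssrs, ssr_fraction), the unit size range of 1 to 6 bp, the minimum of three repeats, and the minimum tract length of 10 bp are all assumptions chosen for illustration, not the authors' pipeline.

```python
import re


def find_ssrs(seq, min_unit=1, max_unit=6, min_repeats=3, min_length=10):
    """Return (start, end, unit, repeat_count) for each perfect SSR tract.

    All thresholds are illustrative assumptions, not the database's settings.
    """
    seq = seq.upper()
    # Shortest possible unit (lazy match) followed by further exact copies of it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        length = match.end() - match.start()
        if length >= min_length:
            unit = match.group(1)
            hits.append((match.start(), match.end(), unit, length // len(unit)))
    return hits


def ssr_fraction(seq, **kwargs):
    """Fraction of the sequence covered by SSR tracts (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    covered = sum(end - start for start, end, _, _ in find_ssrs(seq, **kwargs))
    return covered / len(seq)


if __name__ == "__main__":
    # Toy sequence: an (AT)6 tract and an (A)12 tract with non-repetitive flanks.
    toy = "GGC" + "AT" * 6 + "CCGTTGACC" + "A" * 12 + "CGT"
    for start, end, unit, count in find_ssrs(toy):
        print(start, end, unit, count)
    print(round(100 * ssr_fraction(toy), 2), "% of bases in SSRs")
```

Applied per genome, a ratio of this kind is what allows values such as 6.23% and 0.004% to be compared across species, although the absolute numbers depend strongly on the thresholds chosen.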