DIVERGENOME

The production of biological data by high-throughput technologies has revolutionized biology. In genetics, classical and emerging scientific questions are being addressed using SNP and CNV genotyping and Next Generation Sequencing (NGS) platforms. Today, the community of investigators in biology is composed of a few large research groups that produce high-throughput data and thousands of small and medium-sized groups that, in addition to producing smaller amounts of data, use and integrate the data produced by the former to answer relevant scientific questions. While large-scale genomics initiatives such as the HapMap project, CGEMS and the 1000 Genomes Project rely on powerful computational and bioinformatics support to assist in the production and analysis of data, there are very few bioinformatics platforms designed to help small and medium-sized groups store, handle and integrate data from different sources and perform different kinds of analyses efficiently. As a consequence, these tasks are often performed sub-optimally, typically by handling data files manually, an error-prone practice that is seldom coupled with adequate quality control procedures. Here we present DIVERGENOME, a web-accessible, open-source bioinformatics platform (http://www.pggenetica.icb.ufmg.br/divergenome) developed to assist small and medium-scale research groups with data storage and analysis in population genetics and genetic epidemiology studies. The platform contains two components. The first component, DIVERGENOMEdb, is a relational database developed using MySQL. It allows the safe storage of individual genotypes from different types of data, such as contigs (resulting from re-sequencing projects), SNPs/INDELs and microsatellites. Genotype data can be linked to a description of the protocols used to generate them.
Individuals can be linked to populations, as well as to individual phenotypic information collected in biomedical studies, which allows different kinds of variables to be used. The database structure permits easy integration with other data types, including data from public databases such as the HapMap project, opening prospects for future implementations. The second component, DIVERGENOMEtools, is a dynamic pipeline composed of a set of scripts, developed using a graph-based coordination algorithm and implemented in the Perl programming language. It converts both the results of queries submitted to the database and independent files into many of the file formats required by popular population genetics and genetic epidemiology software.
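
The graph-based coordination of format conversions can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the Perl used by DIVERGENOMEtools, and it is not the platform's actual algorithm: it assumes that file formats are nodes, that each available conversion script is a directed edge, and that a breadth-first search chooses the shortest chain of scripts. The format names, the edges and the function `conversion_path` are all hypothetical.

```python
from collections import deque

# Hypothetical converter graph: each key is a source format and each value
# lists the formats it can be converted to directly. The names and edges are
# illustrative assumptions, not the converters shipped with the platform.
CONVERTERS = {
    "divergenome_query": ["tabular"],
    "tabular": ["ped", "genepop"],
    "ped": ["structure"],
    "genepop": ["arlequin"],
    "structure": [],
    "arlequin": [],
}


def conversion_path(graph, source, target):
    """Return the shortest chain of formats from source to target, or None."""
    queue = deque([[source]])  # each queue entry is a partial path
    seen = {source}            # formats already reached, to avoid cycles
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for next_format in graph.get(path[-1], []):
            if next_format not in seen:
                seen.add(next_format)
                queue.append(path + [next_format])
    return None  # no chain of converters links the two formats


if __name__ == "__main__":
    print(conversion_path(CONVERTERS, "divergenome_query", "arlequin"))
```

Under these assumptions, adding support for a new format requires only one converter to or from a format already in the graph, because the search composes the remaining steps.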