lactis strains, which would allow finding analogous genes that have similar function but different sequences. Even with DNA sequencing
prices dropping, determining the gene content of dozens of strains by genome sequencing could still be costly. Pan-genome arrays allow querying the occurrence of genes in multiple strains more cost-effectively, but genes absent from the reference sequences and strongly divergent genes would be missed. Though the presence/absence data can be linked to phenotypes, they cannot account for effects of regulatory control or post-translational modifications. Thus putative gene-phenotype relations should be experimentally tested by high-throughput techniques such as gene expression analysis. Annotating the genes of a genome is essential to understanding the genomic properties of any strain. Gene annotation is often based on sequence similarity,
so mistakes in annotating a single gene could propagate to genes of different organisms through annotation by sequence similarity. Therefore identified gene-phenotype relations should be experimentally validated and linked to other information sources such as pathway information. This would decrease the error propagation introduced by sequence-similarity-based gene function prediction approaches. Genotype-phenotype matching results show that the largest group of proteins related to phenotypes was hypothetical proteins, indicating that gene annotations could still be improved for all 4 reference strains. Genomes of more bacterial strains are sequenced on a daily basis, which shows the critical importance of accurate gene function prediction. The identified gene-phenotype relations would allow the functions of many genes to be determined more accurately, and hence a better understanding of genotype- and phenotype-level differences among the 38 L. lactis strains. We provide all identified relations as well as the complete genotype and phenotype data set (see Additional files). This data set not only serves as a collection of leads to phenotypes but, owing to its large size, could also be used to test different association methods.

Conclusions
Lactococcus lactis has been extensively studied due to its industrial importance. Here we provide a coherent genotype and phenotype dataset and its interpretation for the Lactococcus species. For 38 L. lactis strains, we integrated genotypic measurements with phenotypes derived from 207 different experiments (see Methods) to identify gene-phenotype relations. Our results are publicly available (see also Additional files) and contain many leads into Lactococcus species-wide genotype-phenotype relations that can be further analysed and experimentally validated. These relations could be used to refine the functions of genes. As new genome sequences emerge frequently, this would allow gene functions in these new genomes to be annotated more accurately and the phenotypes of new strains to be predicted from their DNA sequence.