The Use of Weighted Graphs for Large-Scale Genome Analysis: Datasets

This page contains the datasets used in the following paper:

Fang Zhou, Hannu Toivonen, Ross D. King: The Use of Weighted Graphs for Large-Scale Genome Analysis. PLoS One 9 (3): e89618. 2014.

In the paper, we propose the use of weighted graphs as a data structure to enable large-scale phylogenetic analysis of networks.

We downloaded all the needed biological information from the Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.genome.jp/kegg/) (Release 59.0, July 1, 2011). Below are links to the data we used in our work. All of the data can also be downloaded as a single zipped file: zhouetal_plosone_2014.zip.

To generate the super-metabolic graph (file pathway-subgraph-all-cpd-exclude-cofactor.bmg), we selected the 192 pathways that occur in prokaryota (folder: pathway-infor).
We selected 108 Archaea species and 1,287 Eubacteria species with complete genomes (folder: complete-genome).
To sample genomes, we first applied CD-HIT (http://weizhong-lab.ucsd.edu/cd-hit/) to cluster the species in each domain by the similarity of their 16S ribosomal RNA sequences, at a similarity threshold of 0.8. This gave 15 clusters of Archaea and 114 clusters of Eubacteria species. We then sampled 15 genomes from each domain and repeated the sampling process 100 times (folder: clustering-results).
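
The repeated sampling step can be illustrated with a short Python sketch. This page does not state exactly how the 15 genomes were drawn from the clusters, so the sketch assumes one plausible scheme: pick 15 distinct clusters at random, then pick one genome at random from each. The function name sample_genomes, the dictionary layout, and the toy cluster and genome names are all hypothetical and do not reflect the format of the files in clustering-results.

```python
import random


def sample_genomes(clusters, n_genomes=15, n_repeats=100, seed=0):
    """Draw repeated genome samples from clustered species.

    clusters: dict mapping a cluster id to a list of genome ids.
    ASSUMPTION (not stated on this page): each sample takes one genome
    from each of n_genomes distinct, randomly chosen clusters.
    """
    if n_genomes > len(clusters):
        raise ValueError("need at least as many clusters as genomes per sample")

    rng = random.Random(seed)       # fixed seed so the sketch is reproducible
    cluster_ids = sorted(clusters)  # stable order before random selection

    samples = []
    for _ in range(n_repeats):
        # Choose n_genomes distinct clusters, then one genome from each.
        chosen = rng.sample(cluster_ids, n_genomes)
        samples.append([rng.choice(clusters[c]) for c in chosen])
    return samples


# Toy input: 20 made-up clusters with 3 made-up genomes each.
toy_clusters = {
    f"cluster{i}": [f"genome{i}_{j}" for j in range(3)] for i in range(20)
}
samples = sample_genomes(toy_clusters)
```

Under this assumption, a domain with exactly 15 clusters (as for Archaea here) contributes one genome from every cluster in each sample, while a domain with more clusters (as for Eubacteria) contributes genomes from a different random subset of clusters each time.
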
Three types of enzyme weights:
- Taxonomic weights:
  Archaea (folder: Enzyme-weights/Archaea-taxonomic-weights) and
  Eubacteria (folder: Enzyme-weights/Eubacteria-taxonomic-weights)
- Isoenzymatic weights:
  Archaea (folder: Enzyme-weights/Archaea-isoenzymatic-weights) and
  Eubacteria (folder: Enzyme-weights/Eubacteria-isoenzymatic-weights)
- Sequence-similarity weights:
  Archaea (folder: Enzyme-weights/Archaea-sequence-similarity-weights) and
  Eubacteria (folder: Enzyme-weights/Eubacteria-sequence-similarity-weights)