Running StrainGST

Identify close reference genome(s) to strain(s) in a sample.

Prerequisites

A pre-built database for the genus or species of interest
A whole metagenomic sequencing (WMS) sample

Usage

1. K-merize the sample reads

StrainGST iteratively compares the k-mer profiles of references in the database to the k-mers in the sample to identify close reference genomes to strains in a sample.

Our first step is to kmerize the sample reads. For example, if you have a FASTQ file named patient1.fastq with all reads, then we generate its corresponding k-mer set as follows:

straingst kmerize -k 23 -o patient1.hdf5 patient1.fastq

Similar to the first step of the database creation section, this will generate a HDF5 file named patient1.hdf5 with all k-mers and their corresponding counts. Make sure the value of k is the same as used in the database creation step.

You can specify multiple FASTQ files to the command above, which is useful if you have paired-end reads. Furthermore, it will also automatically decompress gzipped files. For example, if you have gzipped paired-end FASTQ files, then the following command also works:

straingst kmerize -k 23 -o patient1.hdf5 \
    patient1.1.fastq.gz patient1.2.fastq.gz

2. Run StrainGST

We can now run straingst run with our database HDF5 and our sample HDF5:

straingst run -o results.tsv pan-genome-db.hdf5 patient1.hdf5

This will output a tab separated values (tsv) file, containing statistics about the sample k-mer set and a list of identified reference strains with accompanying metrics.

New in version 1.3: instead of writing both sample statistics and the identified strains to a single TSV file, which is generally not as easily read in Python’s pandas or R, you can now enable the option to write sample statistics and strains to separate files when enabling the --separate-output (-O) option. If enabling this option, use -o to specify the output filename prefix.

Example:

straingst run -O -o PREFIX pan-genome-df.hdf5 patient1.hdf5

This will result in two files: PREFIX.stats.tsv (sample statistics), and PREFIX.strains.tsv (list of identified strains).

Output file description

Example output (single file output)

sample    totalkmers      distinct   pkmers  pkcov   pan%
UMB11_01  2277023860      380759656  50090   6.984   1.536
i         strain          gkmers     ikmers  skmers  cov    kcov   gcov   acct   even   spec   rapct  wscore  score
0         Esch_coli_NGF1  49631      49622   50090   0.985  7.009  6.831  0.980  0.987  1.000  1.507  0.940   0.940

Example output (separate file output; new in version 1.3)

PREFIX.stats.tsv

sample    totalkmers      distinct   pkmers  pkcov   pan%
UMB11_01  2277023860      380759656  50090   6.984   1.536

PREFIX.strains.tsv

i         strain          gkmers     ikmers  skmers  cov    kcov   gcov   acct   even   spec   rapct  wscore  score
0         Esch_coli_NGF1  49631      49622   50090   0.985  7.009  6.831  0.980  0.987  1.000  1.507  0.940   0.940

Sample statistics

The first two lines contain statistics on the whole sample.

Columns:

sample: Sample name, derived from the k-mer set filename
totalkmers: total number of k-mers counted in the sample, including k-mers that occur multiple times
distinct: total unique number of k-mers
pkmers: total unique number of k-mers that are also present in the database
pkcov: average “coverage” (multiplicity) of each unique k-mer in the sample that is also present in the database.
pan%: total number of k-mers (including duplicates) that are present in both the sample and database divided by the total number of k-mers in the sample (totalkmers), i.e. an estimation of the relative abundance of the species/genus of interest in this sample.

Reference strain statistics

The next lines contain the close reference genomes identified by StrainGST.

Columns:

i: Iteration number
strain: Reference strain name
gkmers: Total number of unique k-mers in the original reference genome (or its fingerprint).
ikmers: Remaining unique k-mers in the genome after discarding k-mers excluded in an earlier iteration or because their average coverage was too high
skmers: Remaining unique k-mers from the sample
cov: Breadth of coverage of this reference, i.e. what fraction of k-mers in the reference is also present in the sample
kcov: Average depth of coverage of k-mers both present in the reference and in the sample
gcov: Average depth of coverage of all k-mers in the reference
acct: What fraction of the sample k-mers can be explained by this reference?
even: Evenness of coverage. A value close to 1 indicates that the coverage is evenly distributed along the genome, a value close to zero indicates that only a small part of the genome is well covered.
spec: Obsolete
rapct: Estimated strain relative abundance (relative to the whole sample).
wscore: Obsolete
score: Score used to rank each reference in the database at each iteration. A high score represents high confidence in this prediction. Scores cannot be compared across iterations or across samples, and it is possible that a strain in a second iteration has a higher score than the strain in the first iteration.

Tips and Tricks

Easily parse StrainGST file in Python (mainly useful for single file output):

from strainge.io.utils import parse_straingst

results = ['sample1.tsv', 'sample2.tsv']

for sample in results:
    print('#', sample)
    with open(sample) as f:
        for strain in parse_straingst(f):
            print(strain)  # strain is a dict with above columns

With sample statistics:

from strainge.io.utils import parse_straingst

results = ['sample1.tsv', 'sample2.tsv']

for sample in results:
    print('#', sample)
    with open(sample) as f:
        straingst_iter = iter(parse_straingst(f, return_sample_stats=True))
        sample_stats = next(straingst_iter)
        print(sample_stats)

        for strain in straingst_iter:
            print(strain)  # strain is a dict with above columns