Comparing strains across samples

Prerequisites

  • StrainGR call data (HDF5 files) for the samples of interest

Comparing strains in different samples

Strains in different samples that match the same close reference genome can be compared in more detail (at the nucleotide level) using StrainGR.

To compare strains, run straingr compare:

straingr compare sample1.hdf5 sample2.hdf5 \
   -o sample1.vs.sample2.summary.tsv -d sample1.vs.sample2.details.tsv

straingr compare takes two HDF5 files as generated by straingr call, and then compares the base calls in each sample for each scaffold in the concatenated reference. If different concatenated references were used for each sample, only the scaffolds that the two concatenated references have in common will be compared.
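When more than two samples are involved, each pair needs its own straingr compare run. The following is a minimal sketch of how the command lines could be generated for every unordered pair of samples. The helper name build_compare_commands and the output naming scheme are assumptions made for illustration (the naming simply mirrors the example above); only the -o and -d options shown above are used.

```python
import itertools
import shlex
from pathlib import Path


def build_compare_commands(hdf5_files, out_dir="."):
    """Return one `straingr compare` argument list per unordered sample pair.

    Hypothetical helper: not part of StrainGE itself.
    """
    commands = []
    for a, b in itertools.combinations(sorted(hdf5_files), 2):
        # Name the outputs after the two input files, as in the example above.
        prefix = Path(out_dir) / f"{Path(a).stem}.vs.{Path(b).stem}"
        commands.append([
            "straingr", "compare", str(a), str(b),
            "-o", f"{prefix}.summary.tsv",
            "-d", f"{prefix}.details.tsv",
        ])
    return commands


if __name__ == "__main__":
    samples = ["sample1.hdf5", "sample2.hdf5", "sample3.hdf5"]
    for cmd in build_compare_commands(samples):
        # Print the commands for review; to execute them instead, use
        # subprocess.run(cmd, check=True) from the standard library.
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the file names before anything is run.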

Output file description

Summary TSV

This file contains several metrics that summarize the comparison of each strain (scaffold).

Warning: this file currently contains many metrics, several of which are slight variations on others. In the final version of StrainGE we will likely remove a few and keep only the most relevant ones.

Columns:

  • sample1, sample2: Sample names (from filename)

  • ref: The name of the original reference this scaffold belongs to

  • scaffold: scaffold name

  • length: length of the scaffold

  • common (commonPct): Number (percentage) of positions of this scaffold that are callable in both samples

  • single (singlePct): Number (percentage) of positions where both samples have a single strong call (i.e. no evidence for multiple alleles)

  • singleAgree (singleAgreePct): Number (percentage) of positions where both samples have a single strong call, and the base call is the same. singleAgreePct is the ACNI metric as described in the paper.

  • sharedAlleles (sharedAllelesPct): Number (percentage) of positions where both samples share an allele. Positions may have multiple alleles, as long as at least one allele matches.

  • variants (variantsPct): Number (percentage) of positions where either sample has an allele other than the reference.

  • commonVariant (commonVariantPct): Number (percentage) of variants where both samples share an allele

  • variantExact (variantExactPct): Number (percentage) of variants that are exactly the same in both samples (including the same positions with multiple alleles).

  • AnotB (AnotBPct): Number (percentage) of variants in Sample A but not in Sample B

  • BnotA (BnotAPct): Number (percentage) of variants in Sample B but not in Sample A

  • gapJaccardSimilarity: Jaccard similarity between the two samples' sets of positions not marked as gap (i.e. analogous to gene content similarity).
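The columns above can be used to screen a summary file for closely matching strains. The following is a minimal sketch, assuming the TSV has a header row with the column names listed above and that the percentage columns hold plain numbers. The function names, the toy rows, and the threshold values are all made up for illustration; they are not recommendations from StrainGE.

```python
import csv
import io


def load_summary(handle):
    """Read a summary TSV into a list of dicts, one per compared scaffold."""
    return list(csv.DictReader(handle, delimiter="\t"))


def filter_pairs(rows, min_common_pct, min_single_agree_pct):
    """Keep rows with enough shared callable positions and a high ACNI.

    Both thresholds are placeholders to be chosen by the user.
    """
    return [
        row for row in rows
        if float(row["commonPct"]) >= min_common_pct
        and float(row["singleAgreePct"]) >= min_single_agree_pct
    ]


# Toy data with invented values; a real summary file has more columns.
EXAMPLE_TSV = (
    "sample1\tsample2\tref\tscaffold\tcommonPct\tsingleAgreePct\n"
    "s1\ts2\trefA\tscaffold_1\t85.0\t99.99\n"
    "s1\ts2\trefB\tscaffold_2\t0.2\t99.99\n"
    "s1\ts2\trefC\tscaffold_3\t60.0\t99.10\n"
)

if __name__ == "__main__":
    rows = load_summary(io.StringIO(EXAMPLE_TSV))
    for row in filter_pairs(rows, min_common_pct=1.0, min_single_agree_pct=99.95):
        print(row["scaffold"], row["commonPct"], row["singleAgreePct"])
```

For a real file, pass an open file handle instead of the in-memory string, e.g. open("sample1.vs.sample2.summary.tsv", newline=""). Filtering on commonPct first guards against a high singleAgreePct that rests on very few shared positions.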