Comparing strains across samples
Prerequisites
StrainGR call data (HDF5 files) for the samples of interest
Comparing strains in different samples
Strains in different samples that match the same close reference genome can be compared in more detail (at the nucleotide level) using StrainGR.
To compare strains, run straingr compare:
straingr compare sample1.hdf5 sample2.hdf5 \
-o sample1.vs.sample2.summary.tsv -d sample1.vs.sample2.details.tsv
straingr compare takes two HDF5 files as generated by straingr call, and then compares the base calls in each sample for each scaffold in the concatenated reference. If a different concatenated reference was used for each sample, only the scaffolds that the two concatenated references have in common will be compared.
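When more than two samples are involved, the same command can be repeated for every pair of HDF5 files. The following is a minimal sketch that only builds the command lines, using the -o and -d options shown above. The helper name build_compare_commands and the output naming scheme are assumptions made for illustration and are not part of StrainGR.

```python
from itertools import combinations
from pathlib import Path
import shlex


def build_compare_commands(hdf5_files, out_dir="."):
    """Return one `straingr compare` argument list per unordered sample pair.

    Hypothetical helper: it only assembles arguments and does not run anything.
    """
    commands = []
    # Sort first so each pair is always named in the same order.
    for a, b in combinations(sorted(hdf5_files), 2):
        name_a, name_b = Path(a).stem, Path(b).stem
        prefix = Path(out_dir) / f"{name_a}.vs.{name_b}"
        commands.append([
            "straingr", "compare", str(a), str(b),
            "-o", f"{prefix}.summary.tsv",
            "-d", f"{prefix}.details.tsv",
        ])
    return commands


if __name__ == "__main__":
    files = ["sample1.hdf5", "sample2.hdf5", "sample3.hdf5"]
    for cmd in build_compare_commands(files):
        # Print the commands for review; pass each list to subprocess.run to execute.
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the pairing and output names before anything is run.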
Output file description
Summary TSV
This file contains several metrics that summarize the comparison of each strain (scaffold).
Warning: this file currently contains many metrics, several of which are slight variations on others. In the final version of StrainGE we will likely remove a few and keep only the most relevant ones.
Columns:
sample1, sample2: Sample names (from filename)
ref: The name of the original reference this scaffold belongs to
scaffold: scaffold name
length: length of the scaffold
common (commonPct): Number (percentage) of positions of this scaffold that are callable in both samples
single (singlePct): Number (percentage) of positions where both samples have a single strong call (i.e. no evidence for multiple alleles)
singleAgree (singleAgreePct): Number (percentage) of positions where both samples have a single strong call, and the base call is the same. singleAgreePct is the ACNI metric as described in the paper.
sharedAlleles (sharedAllelesPct): Number (percentage) of positions where both samples share an allele. This allows for positions to have multiple alleles, and at least one allele should match.
variants (variantsPct): Number (percentage) of positions where either sample has an allele other than the reference.
commonVariant (commonVariantPct): Number (percentage) of variants where both samples share an allele
variantExact (variantExactPct): Number (percentage) of variants that are exactly the same in both samples (including the same positions with multiple alleles).
AnotB (AnotBPct): Number (percentage) of variants in Sample A but not in Sample B
BnotA (BnotAPct): Number (percentage) of variants in Sample B but not in Sample A
gapJaccardSimilarity: Jaccard similarity between the two samples' sets of positions not marked as a gap (analogous to gene content similarity).
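The columns above can be used to filter the summary file programmatically. The sketch below reads a summary TSV with Python's standard csv module and keeps the scaffolds that meet a minimum commonPct and singleAgreePct. It assumes the file has a header row with the column names exactly as listed above and that percentages are on a 0-100 scale. The function name select_similar, the threshold values, and the toy rows are made up for illustration and are not recommended cut-offs.

```python
import csv
import io


def select_similar(tsv_file, min_common_pct, min_single_agree_pct):
    """Yield rows whose commonPct and singleAgreePct meet both thresholds.

    Hypothetical helper; assumes a tab-separated file with a header row.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        if (float(row["commonPct"]) >= min_common_pct
                and float(row["singleAgreePct"]) >= min_single_agree_pct):
            yield row


# Toy stand-in for a summary file; the values are invented for illustration only.
toy = io.StringIO(
    "sample1\tsample2\tref\tscaffold\tcommonPct\tsingleAgreePct\n"
    "s1\ts2\trefA\tscafA\t80.0\t99.99\n"
    "s1\ts2\trefB\tscafB\t0.4\t99.0\n"
)
for row in select_similar(toy, min_common_pct=0.5, min_single_agree_pct=99.95):
    print(row["ref"], row["scaffold"], row["singleAgreePct"])
```

For a real comparison, replace the toy object with an open file handle, for example open("sample1.vs.sample2.summary.tsv", newline=""). If the percentages turn out to be stored as fractions, scale the thresholds accordingly.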