Analyze StrainGE output in Python

Now that we have run StrainGST and StrainGR (including the compare step), how do we analyze the outputs? This page uses Python and its commonly used data science stack (NumPy, SciPy, Pandas and matplotlib+seaborn) to parse the data, plot the relative abundances of strains over time, and generate an ACNI/gap similarity plot.

Download data

We download an archive containing StrainGE outputs part of the vignette described in the paper on the persistence of an E. coli strain in the gut of a woman with recurrent urinary tract infections. The extracted data is organized in a straingst and straingr folder.

[2]:
!curl --output umb_data.tar.gz https://raw.githubusercontent.com/broadinstitute/strainge-paper/master/umb/umb_data.tar.gz
!tar -xzvf umb_data.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12196  100 12196    0     0  42347      0 --:--:-- --:--:-- --:--:-- 42347
x straingst/UMB11_01.tsv
x straingst/UMB11_02.tsv
x straingst/UMB11_03.1.tsv
x straingst/UMB11_03.tsv
x straingst/UMB11_04.1.tsv
x straingst/UMB11_04.tsv
x straingst/UMB11_05.tsv
x straingst/UMB11_06.tsv
x straingst/UMB11_07.tsv
x straingst/UMB11_08.tsv
x straingst/UMB11_11.tsv
x straingst/UMB11_12.tsv
x straingr/UMB11_01.tsv
x straingr/UMB11_02.tsv
x straingr/UMB11_03.1.tsv
x straingr/UMB11_03.tsv
x straingr/UMB11_04.1.tsv
x straingr/UMB11_04.tsv
x straingr/UMB11_05.tsv
x straingr/UMB11_06.tsv
x straingr/UMB11_07.tsv
x straingr/UMB11_08.tsv
x straingr/UMB11_11.tsv
x straingr/UMB11_12.tsv
x straingr/compare.summary.chrom.txt

Import required modules

[3]:
from pathlib import Path

import numpy
import pandas
import matplotlib.pyplot as plt
from IPython.display import display

StrainGST

Read StrainGST outputs and combine it in a DataFrame

The TSV files written by StrainGST contain both sample statistics (the first two lines), and statistics for each identified strain (see StrainGST page). In this tutorial, we are mainly interested in the identified strains. In the code below, when calling pandas.read_csv, we give the argument skiprows=2 to skip the sample statistics.

[16]:
STRAINGST_DIR = Path("straingst/")

df_list = []
sample_names = []
for f in STRAINGST_DIR.glob("*.tsv"):
    sample_name = f.stem
    df = pandas.read_csv(f, sep='\t', comment='#', skiprows=2, index_col=1)

    df_list.append(df)
    sample_names.append(sample_name)


# Combine all StrainGST results from each sample into a single DataFrame.
straingst_df = pandas.concat(df_list, keys=sample_names, names=["sample"])

sample_names = list(sorted(sample_names, key=lambda e: float(e.replace("UMB11_", ""))))
straingst_df.sort_index()
[16]:
i gkmers ikmers skmers cov kcov gcov acct even spec rapct old_rapct wscore score
sample strain
UMB11_01 Esch_coli_NGF1 0 49631 49622 50090 0.985 7.009 6.831 0.980 0.987 1.000 1.536 1.505 0.940 0.940
UMB11_02 Esch_coli_NGF1 0 49631 49623 5358 0.103 1.546 0.158 0.932 0.707 1.032 0.048 0.045 0.047 0.048
UMB11_03 Esch_coli_1190 0 48261 48249 37144 0.711 2.814 1.975 0.926 0.826 0.972 0.395 0.366 0.437 0.449
UMB11_03.1 Esch_coli_1190 0 48261 48254 31201 0.595 2.152 1.264 0.918 0.830 1.030 0.711 0.652 0.365 0.376
UMB11_04.1 Esch_coli_1190 0 48261 48250 19042 0.362 1.870 0.668 0.908 0.743 1.047 0.135 0.122 0.173 0.182
UMB11_05 Esch_coli_1190 0 48261 48248 40411 0.775 2.741 2.097 0.923 0.884 0.967 0.484 0.447 0.540 0.558
UMB11_06 Esch_coli_1190 1 48261 21600 30714 0.794 7.102 5.565 0.283 0.797 0.552 1.005 0.391 0.079 0.142
Esch_coli_H3 0 45610 45560 74449 0.960 114.343 109.038 0.921 0.960 0.976 16.420 16.042 0.795 0.814
UMB11_07 Esch_coli_1190 0 48261 48237 58276 0.854 4.557 3.846 0.803 0.873 0.794 0.411 0.822 0.415 0.522
Esch_coli_f974b26a-5e81-11e8-bf7f-3c4a9275d6c8 1 47727 21265 17074 0.441 2.588 1.077 0.526 0.668 1.012 0.613 0.106 0.102 0.103
UMB11_08 Esch_coli_1190 0 48261 48248 31599 0.592 2.243 1.309 0.904 0.810 0.980 0.315 0.285 0.344 0.351
UMB11_11 Esch_coli_1190 0 48261 48248 49462 0.920 6.107 5.545 0.920 0.923 0.965 1.180 1.086 0.696 0.721
UMB11_12 Esch_coli_1190 0 48261 48235 66509 0.941 9.061 8.431 0.800 0.941 0.734 0.978 2.020 0.489 0.667
Esch_coli_26561 1 46249 19738 21112 0.853 4.773 3.983 0.780 0.869 0.968 1.548 0.395 0.486 0.502

Plot relative abundances

[5]:
plt.figure(figsize=(6, 4))

strain_order = ['Esch_coli_1190', 'Esch_coli_H3', 'Esch_coli_NGF1', 'Esch_coli_f974b26a-5e81-11e8-bf7f-3c4a9275d6c8', "Esch_coli_26561"]
strain_labels = ['1190', 'H3', 'NGF1', 'f974b26a...', "26561"]
xlabels = [s.replace("UMB11_", "") for s in sample_names]

x = numpy.arange(len(sample_names))
bottom = numpy.zeros((len(sample_names),))
for ref, label in zip(strain_order, strain_labels):
    # Create an array with all relative abundances for the current reference in each sample. If not available, set to zero.
    rel_abun = numpy.array([
        straingst_df.loc[(sample, ref), 'rapct'] if (sample, ref) in straingst_df.index else 0.0
        for sample in sample_names
    ])

    plt.bar(x, rel_abun, bottom=bottom, tick_label=xlabels, label=label, width=0.8)
    bottom += rel_abun

plt.xlabel("Sample (time point)")
plt.ylabel("Relative abundance")
plt.gca().yaxis.set_major_formatter("{x:g}%")
plt.legend(title="Strain", loc="center left", bbox_to_anchor=(1.05, 0.5), ncol=2)

plt.show()
_images/analysis_7_0.png

StrainGR

Load call data in a DataFrame

To load StrainGR outputs, we use a similar approach as descried above. In this case, the StrainGR TSV files can be directly loaded with pandas without skiprows.

One thing to note, StrainGR outputs metrics for every contig in the concatenated reference used for alignment. The output file thus contains metrics for contigs from strains that were not predicted to be present by StrainGST. We use the presence/absence predictions of StrainGST as our “truth” and remove the contigs from strains that weren’t present.

We apply a few other filters, including removing plasmid contigs, and contigs with less coverage than 0.5x.

[28]:
STRAINGR_DIR = Path("straingr/")

df_list = []
sample_names = []
for f in STRAINGR_DIR.glob("*.tsv"):
    df = pandas.read_csv(f, sep='\t', index_col=0)
    df = df.drop(index='TOTAL')  # Remove TOTAL statistics

    df_list.append(df)
    sample_names.append(f.stem)

straingr_df = pandas.concat(df_list, keys=sample_names, names=["sample"])
straingr_df['straingst_present'] = straingr_df.index.map(lambda ix: ix in straingst_df.index)
straingr_df['is_plasmid'] = straingr_df['length'] < 4e6
straingr_df['enough_cov'] = straingr_df['coverage'] > 0.5

# Filter and re-index
straingr_df = straingr_df[straingr_df['straingst_present'] & ~straingr_df['is_plasmid'] & straingr_df['enough_cov']].reset_index().set_index(['sample', 'ref'])
straingr_df
[28]:
name length coverage uReads abundance median callable callablePct confirmed confirmedPct ... multiPct lowmq lowmqPct high highPct gapCount gapLength straingst_present is_plasmid enough_cov
sample ref
UMB11_11 Esch_coli_1190 NZ_CP023386.1 4900891 2.492 106232 0.449 2 2819899 57.538 2818902 99.965 ... 0.004 435239 8.881 488 0.010 9 165998 True False True
UMB11_06 Esch_coli_H3 NZ_CP010167.1 4630919 47.519 1331869 7.902 48 3863102 83.420 3861892 99.969 ... 0.004 1439048 31.075 84996 1.835 9 120001 True False True
Esch_coli_1190 NZ_CP023386.1 4900891 1.963 69465 0.264 2 1515111 30.915 1513886 99.919 ... 0.015 1161951 23.709 699045 14.264 9 165099 True False True
UMB11_07 Esch_coli_f974b26a-5e81-11e8-bf7f-3c4a9275d6c8 NZ_LR536430.1 4975029 0.700 14686 0.132 0 378975 7.618 377743 99.675 ... 0.013 495996 9.970 17165 0.345 17 347002 True False True
Esch_coli_1190 NZ_CP023386.1 4900891 1.591 59112 0.347 1 1894740 38.661 1893628 99.941 ... 0.010 323110 6.593 3384 0.069 12 171056 True False True
UMB11_03 Esch_coli_1190 NZ_CP023386.1 4900891 0.822 35131 0.143 1 859998 17.548 859671 99.962 ... 0.001 160054 3.266 26 0.001 9 185085 True False True
UMB11_01 Esch_coli_NGF1 NZ_CP016007.1 5026105 3.549 85824 0.823 3 2506998 49.880 2506921 99.997 ... 0.003 1668501 33.197 3681 0.073 1 16868 True False True
UMB11_03.1 Esch_coli_1190 NZ_CP023386.1 4900891 0.596 24708 0.278 0 547015 11.162 546816 99.964 ... 0.001 99001 2.020 114 0.002 7 172806 True False True
UMB11_08 Esch_coli_1190 NZ_CP023386.1 4900891 0.580 24145 0.118 0 505713 10.319 505494 99.957 ... 0.001 115822 2.363 64 0.001 6 158445 True False True

9 rows × 23 columns

Load compare data in a DataFrame

The above data mainly contains data per sample of individual strains as compared to its closest reference. In general, we are often more interested how strains in each sample relate to each other. These kind of relationships are computed with the straingr compare command. Here, we load the data from compare, make sure we only include comparisons between strains that were predicted to be present by StrainGST, and plot the ACNI/gap similarity.

[27]:
compare_df = pandas.read_csv(STRAINGR_DIR / "compare.summary.chrom.txt", sep='\t', index_col=[0, 1, 2])

def both_straingst_present(ix):
    sample1, sample2, ref = ix

    return (sample1, ref) in straingr_df.index and (sample2, ref) in straingr_df.index

compare_df['both_present'] = compare_df.index.map(both_straingst_present)
compare_df = compare_df[compare_df['both_present']].copy()
compare_df

[27]:
scaffold length common commonPct single singlePct singleAgree singleAgreePct sharedAlleles sharedAllelesPct ... BnotAweak BnotAweakPct Agaps AsharedGaps AgapPct Bgaps BsharedGaps BgapPct gapJaccardSim both_present
sample1 sample2 ref
UMB11_03 UMB11_03.1 Esch_coli_1190 NZ_CP023386.1 4900891 126732 2.5859 126732 100.0000 126725 99.9945 126725 99.9945 ... 5 17.8571 185085 163905 88.5566 172806 172806 100.0000 0.9919 True
UMB11_06 Esch_coli_1190 NZ_CP023386.1 4900891 336010 6.8561 335954 99.9833 335809 99.9568 335865 99.9568 ... 160 57.3477 185085 163905 88.5566 165099 158136 95.7825 0.9895 True
UMB11_07 Esch_coli_1190 NZ_CP023386.1 4900891 405110 8.2660 405067 99.9894 405023 99.9891 405066 99.9891 ... 65 32.5000 185085 175075 94.5917 171056 154895 90.5522 0.9872 True
UMB11_08 Esch_coli_1190 NZ_CP023386.1 4900891 117635 2.4003 117635 100.0000 117629 99.9949 117629 99.9949 ... 2 3.8462 185085 145541 78.6347 158445 146928 92.7312 0.9971 True
UMB11_11 Esch_coli_1190 NZ_CP023386.1 4900891 611275 12.4727 611224 99.9917 611203 99.9966 611254 99.9966 ... 29 11.9835 185085 185085 100.0000 165998 165998 100.0000 0.9889 True
UMB11_03.1 UMB11_06 Esch_coli_1190 NZ_CP023386.1 4900891 215863 4.4046 215823 99.9815 215758 99.9699 215798 99.9699 ... 95 71.4286 172806 172806 100.0000 165099 158136 95.7825 0.9922 True
UMB11_07 Esch_coli_1190 NZ_CP023386.1 4900891 252557 5.1533 252524 99.9869 252490 99.9865 252523 99.9865 ... 56 43.0769 172806 172806 100.0000 171056 148033 86.5407 0.9879 True
UMB11_08 Esch_coli_1190 NZ_CP023386.1 4900891 75899 1.5487 75897 99.9974 75894 99.9960 75896 99.9960 ... 2 10.0000 172806 148091 85.6978 158445 146928 92.7312 0.9916 True
UMB11_11 Esch_coli_1190 NZ_CP023386.1 4900891 390913 7.9764 390894 99.9951 390880 99.9964 390899 99.9964 ... 14 14.0000 172806 172806 100.0000 165998 154893 93.3102 0.9916 True
UMB11_06 UMB11_07 Esch_coli_1190 NZ_CP023386.1 4900891 718395 14.6585 718228 99.9768 717916 99.9566 718083 99.9566 ... 151 22.1083 165099 158136 95.7825 171056 148033 86.5407 0.9898 True
UMB11_08 Esch_coli_1190 NZ_CP023386.1 4900891 199668 4.0741 199626 99.9790 199556 99.9649 199598 99.9649 ... 18 11.1111 165099 151355 91.6753 158445 158445 100.0000 0.9951 True
UMB11_11 Esch_coli_1190 NZ_CP023386.1 4900891 1059446 21.6174 1059249 99.9814 1058911 99.9681 1059108 99.9681 ... 50 6.2500 165099 158136 95.7825 165998 154893 93.3102 0.9964 True
UMB11_07 UMB11_08 Esch_coli_1190 NZ_CP023386.1 4900891 235328 4.8017 235291 99.9843 235271 99.9915 235308 99.9915 ... 6 5.1724 171056 131119 76.6527 158445 146928 92.7312 0.9909 True
UMB11_11 Esch_coli_1190 NZ_CP023386.1 4900891 1284089 26.2011 1283913 99.9863 1283763 99.9883 1283939 99.9883 ... 26 3.8981 171056 154895 90.5522 165998 160523 96.7018 0.9900 True
UMB11_08 UMB11_11 Esch_coli_1190 NZ_CP023386.1 4900891 358765 7.3204 358752 99.9964 358744 99.9978 358757 99.9978 ... 5 3.6765 158445 146928 92.7312 165998 139943 84.3040 0.9945 True

15 rows × 32 columns

[41]:
import seaborn

seaborn.scatterplot(x="gapJaccardSim", y="singleAgreePct", size="commonPct", data=compare_df)

plt.xlim(0.970, 1.0)
plt.xlabel("Gap similarity")

plt.ylim(99.9, 100)
plt.ylabel("ACNI")
plt.gca().yaxis.set_major_formatter("{x:g}%")

plt.grid('on')
plt.legend(title="Common\nCallable [%]", loc="center left", bbox_to_anchor=(1.05, 0.5))

[41]:
<matplotlib.legend.Legend at 0x160142700>
_images/analysis_12_1.png