Analyze StrainGE output in Python

Now that we have run StrainGST and StrainGR (including the compare step), how do we analyze the outputs? This page uses Python and its commonly used data science stack (NumPy, SciPy, Pandas and matplotlib+seaborn) to parse the data, plot the relative abundances of strains over time, and generate an ACNI/gap similarity plot.

Download data

We download an archive containing StrainGE outputs part of the vignette described in the paper on the persistence of an E. coli strain in the gut of a woman with recurrent urinary tract infections. The extracted data is organized in a straingst and straingr folder.

[2]:

!curl --output umb_data.tar.gz https://raw.githubusercontent.com/broadinstitute/strainge-paper/master/umb/umb_data.tar.gz
!tar -xzvf umb_data.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12196  100 12196    0     0  42347      0 --:--:-- --:--:-- --:--:-- 42347
x straingst/UMB11_01.tsv
x straingst/UMB11_02.tsv
x straingst/UMB11_03.1.tsv
x straingst/UMB11_03.tsv
x straingst/UMB11_04.1.tsv
x straingst/UMB11_04.tsv
x straingst/UMB11_05.tsv
x straingst/UMB11_06.tsv
x straingst/UMB11_07.tsv
x straingst/UMB11_08.tsv
x straingst/UMB11_11.tsv
x straingst/UMB11_12.tsv
x straingr/UMB11_01.tsv
x straingr/UMB11_02.tsv
x straingr/UMB11_03.1.tsv
x straingr/UMB11_03.tsv
x straingr/UMB11_04.1.tsv
x straingr/UMB11_04.tsv
x straingr/UMB11_05.tsv
x straingr/UMB11_06.tsv
x straingr/UMB11_07.tsv
x straingr/UMB11_08.tsv
x straingr/UMB11_11.tsv
x straingr/UMB11_12.tsv
x straingr/compare.summary.chrom.txt

Import required modules

[3]:

from pathlib import Path

import numpy
import pandas
import matplotlib.pyplot as plt
from IPython.display import display

StrainGST

Read StrainGST outputs and combine it in a DataFrame

The TSV files written by StrainGST contain both sample statistics (the first two lines), and statistics for each identified strain (see StrainGST page). In this tutorial, we are mainly interested in the identified strains. In the code below, when calling pandas.read_csv, we give the argument skiprows=2 to skip the sample statistics.

[16]:

STRAINGST_DIR = Path("straingst/")

df_list = []
sample_names = []
for f in STRAINGST_DIR.glob("*.tsv"):
    sample_name = f.stem
    df = pandas.read_csv(f, sep='\t', comment='#', skiprows=2, index_col=1)

    df_list.append(df)
    sample_names.append(sample_name)


# Combine all StrainGST results from each sample into a single DataFrame.
straingst_df = pandas.concat(df_list, keys=sample_names, names=["sample"])

sample_names = list(sorted(sample_names, key=lambda e: float(e.replace("UMB11_", ""))))
straingst_df.sort_index()

[16]:

		i	gkmers	ikmers	skmers	cov	kcov	gcov	acct	even	spec	rapct	old_rapct	wscore	score
sample	strain
UMB11_01	Esch_coli_NGF1	0	49631	49622	50090	0.985	7.009	6.831	0.980	0.987	1.000	1.536	1.505	0.940	0.940
UMB11_02	Esch_coli_NGF1	0	49631	49623	5358	0.103	1.546	0.158	0.932	0.707	1.032	0.048	0.045	0.047	0.048
UMB11_03	Esch_coli_1190	0	48261	48249	37144	0.711	2.814	1.975	0.926	0.826	0.972	0.395	0.366	0.437	0.449
UMB11_03.1	Esch_coli_1190	0	48261	48254	31201	0.595	2.152	1.264	0.918	0.830	1.030	0.711	0.652	0.365	0.376
UMB11_04.1	Esch_coli_1190	0	48261	48250	19042	0.362	1.870	0.668	0.908	0.743	1.047	0.135	0.122	0.173	0.182
UMB11_05	Esch_coli_1190	0	48261	48248	40411	0.775	2.741	2.097	0.923	0.884	0.967	0.484	0.447	0.540	0.558
UMB11_06	Esch_coli_1190	1	48261	21600	30714	0.794	7.102	5.565	0.283	0.797	0.552	1.005	0.391	0.079	0.142
UMB11_06	Esch_coli_H3	0	45610	45560	74449	0.960	114.343	109.038	0.921	0.960	0.976	16.420	16.042	0.795	0.814
UMB11_07	Esch_coli_1190	0	48261	48237	58276	0.854	4.557	3.846	0.803	0.873	0.794	0.411	0.822	0.415	0.522
UMB11_07	Esch_coli_f974b26a-5e81-11e8-bf7f-3c4a9275d6c8	1	47727	21265	17074	0.441	2.588	1.077	0.526	0.668	1.012	0.613	0.106	0.102	0.103
UMB11_08	Esch_coli_1190	0	48261	48248	31599	0.592	2.243	1.309	0.904	0.810	0.980	0.315	0.285	0.344	0.351
UMB11_11	Esch_coli_1190	0	48261	48248	49462	0.920	6.107	5.545	0.920	0.923	0.965	1.180	1.086	0.696	0.721
UMB11_12	Esch_coli_1190	0	48261	48235	66509	0.941	9.061	8.431	0.800	0.941	0.734	0.978	2.020	0.489	0.667
UMB11_12	Esch_coli_26561	1	46249	19738	21112	0.853	4.773	3.983	0.780	0.869	0.968	1.548	0.395	0.486	0.502

Plot relative abundances

[5]:

plt.figure(figsize=(6, 4))

strain_order = ['Esch_coli_1190', 'Esch_coli_H3', 'Esch_coli_NGF1', 'Esch_coli_f974b26a-5e81-11e8-bf7f-3c4a9275d6c8', "Esch_coli_26561"]
strain_labels = ['1190', 'H3', 'NGF1', 'f974b26a...', "26561"]
xlabels = [s.replace("UMB11_", "") for s in sample_names]

x = numpy.arange(len(sample_names))
bottom = numpy.zeros((len(sample_names),))
for ref, label in zip(strain_order, strain_labels):
    # Create an array with all relative abundances for the current reference in each sample. If not available, set to zero.
    rel_abun = numpy.array([
        straingst_df.loc[(sample, ref), 'rapct'] if (sample, ref) in straingst_df.index else 0.0
        for sample in sample_names
    ])

    plt.bar(x, rel_abun, bottom=bottom, tick_label=xlabels, label=label, width=0.8)
    bottom += rel_abun

plt.xlabel("Sample (time point)")
plt.ylabel("Relative abundance")
plt.gca().yaxis.set_major_formatter("{x:g}%")
plt.legend(title="Strain", loc="center left", bbox_to_anchor=(1.05, 0.5), ncol=2)

plt.show()

StrainGR

Load `call` data in a DataFrame

To load StrainGR outputs, we use a similar approach as descried above. In this case, the StrainGR TSV files can be directly loaded with pandas without skiprows.

One thing to note, StrainGR outputs metrics for every contig in the concatenated reference used for alignment. The output file thus contains metrics for contigs from strains that were not predicted to be present by StrainGST. We use the presence/absence predictions of StrainGST as our “truth” and remove the contigs from strains that weren’t present.

We apply a few other filters, including removing plasmid contigs, and contigs with less coverage than 0.5x.

[28]:

STRAINGR_DIR = Path("straingr/")

df_list = []
sample_names = []
for f in STRAINGR_DIR.glob("*.tsv"):
    df = pandas.read_csv(f, sep='\t', index_col=0)
    df = df.drop(index='TOTAL')  # Remove TOTAL statistics

    df_list.append(df)
    sample_names.append(f.stem)

straingr_df = pandas.concat(df_list, keys=sample_names, names=["sample"])
straingr_df['straingst_present'] = straingr_df.index.map(lambda ix: ix in straingst_df.index)
straingr_df['is_plasmid'] = straingr_df['length'] < 4e6
straingr_df['enough_cov'] = straingr_df['coverage'] > 0.5

# Filter and re-index
straingr_df = straingr_df[straingr_df['straingst_present'] & ~straingr_df['is_plasmid'] & straingr_df['enough_cov']].reset_index().set_index(['sample', 'ref'])
straingr_df

[28]:

		name	length	coverage	uReads	abundance	median	callable	callablePct	confirmed	confirmedPct	...	multiPct	lowmq	lowmqPct	high	highPct	gapCount	gapLength	straingst_present	is_plasmid	enough_cov
sample	ref
UMB11_11	Esch_coli_1190	NZ_CP023386.1	4900891	2.492	106232	0.449	2	2819899	57.538	2818902	99.965	...	0.004	435239	8.881	488	0.010	9	165998	True	False	True
UMB11_06	Esch_coli_H3	NZ_CP010167.1	4630919	47.519	1331869	7.902	48	3863102	83.420	3861892	99.969	...	0.004	1439048	31.075	84996	1.835	9	120001	True	False	True
UMB11_06	Esch_coli_1190	NZ_CP023386.1	4900891	1.963	69465	0.264	2	1515111	30.915	1513886	99.919	...	0.015	1161951	23.709	699045	14.264	9	165099	True	False	True
UMB11_07	Esch_coli_f974b26a-5e81-11e8-bf7f-3c4a9275d6c8	NZ_LR536430.1	4975029	0.700	14686	0.132	0	378975	7.618	377743	99.675	...	0.013	495996	9.970	17165	0.345	17	347002	True	False	True
UMB11_07	Esch_coli_1190	NZ_CP023386.1	4900891	1.591	59112	0.347	1	1894740	38.661	1893628	99.941	...	0.010	323110	6.593	3384	0.069	12	171056	True	False	True
UMB11_03	Esch_coli_1190	NZ_CP023386.1	4900891	0.822	35131	0.143	1	859998	17.548	859671	99.962	...	0.001	160054	3.266	26	0.001	9	185085	True	False	True
UMB11_01	Esch_coli_NGF1	NZ_CP016007.1	5026105	3.549	85824	0.823	3	2506998	49.880	2506921	99.997	...	0.003	1668501	33.197	3681	0.073	1	16868	True	False	True
UMB11_03.1	Esch_coli_1190	NZ_CP023386.1	4900891	0.596	24708	0.278	0	547015	11.162	546816	99.964	...	0.001	99001	2.020	114	0.002	7	172806	True	False	True
UMB11_08	Esch_coli_1190	NZ_CP023386.1	4900891	0.580	24145	0.118	0	505713	10.319	505494	99.957	...	0.001	115822	2.363	64	0.001	6	158445	True	False	True

9 rows × 23 columns

Load `compare` data in a DataFrame

The above data mainly contains data per sample of individual strains as compared to its closest reference. In general, we are often more interested how strains in each sample relate to each other. These kind of relationships are computed with the straingr compare command. Here, we load the data from compare, make sure we only include comparisons between strains that were predicted to be present by StrainGST, and plot the ACNI/gap similarity.

[27]:

compare_df = pandas.read_csv(STRAINGR_DIR / "compare.summary.chrom.txt", sep='\t', index_col=[0, 1, 2])

def both_straingst_present(ix):
    sample1, sample2, ref = ix

    return (sample1, ref) in straingr_df.index and (sample2, ref) in straingr_df.index

compare_df['both_present'] = compare_df.index.map(both_straingst_present)
compare_df = compare_df[compare_df['both_present']].copy()
compare_df

[27]:

			scaffold	length	common	commonPct	single	singlePct	singleAgree	singleAgreePct	sharedAlleles	sharedAllelesPct	...	BnotAweak	BnotAweakPct	Agaps	AsharedGaps	AgapPct	Bgaps	BsharedGaps	BgapPct	gapJaccardSim	both_present
sample1	sample2	ref
UMB11_03	UMB11_03.1	Esch_coli_1190	NZ_CP023386.1	4900891	126732	2.5859	126732	100.0000	126725	99.9945	126725	99.9945	...	5	17.8571	185085	163905	88.5566	172806	172806	100.0000	0.9919	True
	UMB11_06	Esch_coli_1190	NZ_CP023386.1	4900891	336010	6.8561	335954	99.9833	335809	99.9568	335865	99.9568	...	160	57.3477	185085	163905	88.5566	165099	158136	95.7825	0.9895	True
	UMB11_07	Esch_coli_1190	NZ_CP023386.1	4900891	405110	8.2660	405067	99.9894	405023	99.9891	405066	99.9891	...	65	32.5000	185085	175075	94.5917	171056	154895	90.5522	0.9872	True
	UMB11_08	Esch_coli_1190	NZ_CP023386.1	4900891	117635	2.4003	117635	100.0000	117629	99.9949	117629	99.9949	...	2	3.8462	185085	145541	78.6347	158445	146928	92.7312	0.9971	True
	UMB11_11	Esch_coli_1190	NZ_CP023386.1	4900891	611275	12.4727	611224	99.9917	611203	99.9966	611254	99.9966	...	29	11.9835	185085	185085	100.0000	165998	165998	100.0000	0.9889	True
UMB11_03.1	UMB11_06	Esch_coli_1190	NZ_CP023386.1	4900891	215863	4.4046	215823	99.9815	215758	99.9699	215798	99.9699	...	95	71.4286	172806	172806	100.0000	165099	158136	95.7825	0.9922	True
	UMB11_07	Esch_coli_1190	NZ_CP023386.1	4900891	252557	5.1533	252524	99.9869	252490	99.9865	252523	99.9865	...	56	43.0769	172806	172806	100.0000	171056	148033	86.5407	0.9879	True
	UMB11_08	Esch_coli_1190	NZ_CP023386.1	4900891	75899	1.5487	75897	99.9974	75894	99.9960	75896	99.9960	...	2	10.0000	172806	148091	85.6978	158445	146928	92.7312	0.9916	True
	UMB11_11	Esch_coli_1190	NZ_CP023386.1	4900891	390913	7.9764	390894	99.9951	390880	99.9964	390899	99.9964	...	14	14.0000	172806	172806	100.0000	165998	154893	93.3102	0.9916	True
UMB11_06	UMB11_07	Esch_coli_1190	NZ_CP023386.1	4900891	718395	14.6585	718228	99.9768	717916	99.9566	718083	99.9566	...	151	22.1083	165099	158136	95.7825	171056	148033	86.5407	0.9898	True
	UMB11_08	Esch_coli_1190	NZ_CP023386.1	4900891	199668	4.0741	199626	99.9790	199556	99.9649	199598	99.9649	...	18	11.1111	165099	151355	91.6753	158445	158445	100.0000	0.9951	True
	UMB11_11	Esch_coli_1190	NZ_CP023386.1	4900891	1059446	21.6174	1059249	99.9814	1058911	99.9681	1059108	99.9681	...	50	6.2500	165099	158136	95.7825	165998	154893	93.3102	0.9964	True
UMB11_07	UMB11_08	Esch_coli_1190	NZ_CP023386.1	4900891	235328	4.8017	235291	99.9843	235271	99.9915	235308	99.9915	...	6	5.1724	171056	131119	76.6527	158445	146928	92.7312	0.9909	True
UMB11_07	UMB11_11	Esch_coli_1190	NZ_CP023386.1	4900891	1284089	26.2011	1283913	99.9863	1283763	99.9883	1283939	99.9883	...	26	3.8981	171056	154895	90.5522	165998	160523	96.7018	0.9900	True
UMB11_08	UMB11_11	Esch_coli_1190	NZ_CP023386.1	4900891	358765	7.3204	358752	99.9964	358744	99.9978	358757	99.9978	...	5	3.6765	158445	146928	92.7312	165998	139943	84.3040	0.9945	True

15 rows × 32 columns

[41]:

import seaborn

seaborn.scatterplot(x="gapJaccardSim", y="singleAgreePct", size="commonPct", data=compare_df)

plt.xlim(0.970, 1.0)
plt.xlabel("Gap similarity")

plt.ylim(99.9, 100)
plt.ylabel("ACNI")
plt.gca().yaxis.set_major_formatter("{x:g}%")

plt.grid('on')
plt.legend(title="Common\nCallable [%]", loc="center left", bbox_to_anchor=(1.05, 0.5))

[41]:

<matplotlib.legend.Legend at 0x160142700>