Description
When applying a paired test, the row order of the data is crucial, because the observations of the two groups are matched by position.
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
... "person": ["A","B","C","C","A","B"],
... "day": [1,2,2,1,2,1],
... "value": [1,2,3,4,5,6],
... })
>>> df
  person  day  value
0      A    1      1
1      B    2      2
2      C    2      3
3      C    1      4
4      A    2      5
5      B    1      6
Applying scipy.stats.ttest_rel on the paired data:
>>> # sort each day's values by person (A, B, C) so that both arrays are aligned
>>> values_day_1 = df[df["day"]==1].sort_values(by="person")["value"].values
>>> values_day_2 = df[df["day"]==2].sort_values(by="person")["value"].values
>>> values_day_1, values_day_2
(array([1, 6, 4]), array([5, 2, 3]))
>>> from scipy.stats import ttest_rel
>>> ttest_rel(values_day_1, values_day_2) # call the function on the paired data
TtestResult(statistic=0.14285714285714288, pvalue=0.8994962184740788, df=2)
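But running the same comparison through statannotations:
import matplotlib.pyplot as plt
import seaborn as sns
from statannotations.Annotator import Annotator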
x = "day"
y = "value"
pairs = [(1,2)] # day 1 to day 2
ax = sns.boxplot(x=x, y=y, data=df)
ax = sns.stripplot(x=x, y=y, data=df)
annotator = Annotator(ax, pairs, data=df, x=x, y=y)
annotator.configure(test='t-test_paired', text_format='star', loc='inside', verbose=True)
annotator.apply_and_annotate()
plt.show()
reports a different, incorrect result:
1 vs. 2: t-test paired samples, P_val:6.667e-01 t=5.000e-01
Of course, the annotator cannot know that each person's value on day 1 must be compared with the same person's value on day 2; it apparently pairs the observations by their row order within each group.
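To make the effect of row order concrete, here is a minimal sketch that computes the paired t statistic by hand for both pairings. `paired_t_statistic` is a hypothetical helper written for this illustration, not part of scipy or statannotations, and the assumption that the annotator pairs by row order is inferred from the numbers reported above.

```python
import math


def paired_t_statistic(first, second):
    """Paired t statistic: mean of the differences divided by its standard error."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n)


# Values in original row order: day 1 is persons (A, C, B), day 2 is (B, C, A)
t_row_order = paired_t_statistic([1, 4, 6], [2, 3, 5])

# Same values aligned by person: (A, B, C) on both days
t_by_person = paired_t_statistic([1, 6, 4], [5, 2, 3])

print(round(t_row_order, 4), round(t_by_person, 4))
```

The first value matches the t=5.000e-01 reported by the annotator, the second matches the statistic returned by scipy.stats.ttest_rel on the person-sorted arrays.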
Sorting the data by person before plotting fixes it:
>>> df.sort_values(by="person", inplace=True)
>>> df
  person  day  value
0      A    1      1
4      A    2      5
1      B    2      2
5      B    1      6
2      C    2      3
3      C    1      4
With the sorted data, the annotator correctly reports
1 vs. 2: t-test paired samples, P_val:8.995e-01 t=1.429e-01
(the same as the scipy.stats TtestResult).
The same applies to every paired test, e.g. 't-test_paired' and 'Wilcoxon' (scipy.stats ttest_rel and wilcoxon).
Question
Is there a way to tell the annotator which column identifies the objects to pair (in this case "person")?
With that information it could also check whether every pair is complete, for example detect that a person did not participate on the second day.
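As a rough sketch of what such identifier-based pairing could look like outside the annotator: `paired_values` below is a hypothetical helper, not an existing statannotations function, and person "D" is an extra row added here only to show an incomplete pair.

```python
import pandas as pd


def paired_values(df, id_col, group_col, value_col, pair):
    """Return two arrays aligned on `id_col`, plus the ids lacking a complete pair.

    Hypothetical helper for illustration; not part of statannotations.
    """
    # One row per identifier, one column per group; missing cells become NaN.
    wide = df.pivot(index=id_col, columns=group_col, values=value_col)
    first, second = pair
    complete = wide[[first, second]].dropna()
    dropped = sorted(set(wide.index) - set(complete.index))
    return complete[first].to_numpy(), complete[second].to_numpy(), dropped


df = pd.DataFrame({
    "person": ["A", "B", "C", "C", "A", "B", "D"],  # "D" only took part on day 1
    "day": [1, 2, 2, 1, 2, 1, 1],
    "value": [1, 2, 3, 4, 5, 6, 7],
})

day_1, day_2, dropped = paired_values(df, "person", "day", "value", (1, 2))
print(day_1, day_2, dropped)
```

The two arrays can then be passed to a paired test such as scipy.stats.ttest_rel regardless of the row order of `df`. Note that `pivot` raises a ValueError when an identifier appears more than once in the same group, which also serves as a check for ambiguous pairs.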
Note
If we convert from wide format (with the identifier as index) to long format, the data is already sorted correctly within each group.
Alternatively, we can pivot to wide format, so that the values are aligned by identifier:
>>> df_wide = df.pivot(index='person', columns='day', values='value') # long to wide
>>> df_wide
day     1  2
person
A       1  5
B       6  2
C       4  3
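The wide-to-long round trip mentioned in the note could look like the following sketch. It uses only standard pandas calls (`pivot`, `reset_index`, `melt`); the variable name `df_long` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "person": ["A", "B", "C", "C", "A", "B"],
    "day": [1, 2, 2, 1, 2, 1],
    "value": [1, 2, 3, 4, 5, 6],
})

# Long to wide: one row per person, sorted by the index.
df_wide = df.pivot(index="person", columns="day", values="value")

# Wide back to long: each day's block keeps the person order of the wide index.
df_long = df_wide.reset_index().melt(
    id_vars="person", var_name="day", value_name="value"
)
print(df_long)
```

In `df_long` the persons appear in the same order (A, B, C) within each day, so passing it as `data` to the plotting and annotation calls should give the same pairing as the sorted DataFrame.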