(question) paired test, data frame sorting, t-test_paired, Wilcoxon #186

@pas-calc

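When applying a paired test, the row order of the data is crucial: the i-th value of one group has to belong to the same subject as the i-th value of the other group.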

Example

>>> df = pd.DataFrame({
...     "person": ["A","B","C","C","A","B"],
...     "day": [1,2,2,1,2,1],
...     "value": [1,2,3,4,5,6],
... })
>>> df
  person  day  value
0      A    1      1
1      B    2      2
2      C    2      3
3      C    1      4
4      A    2      5
5      B    1      6

Applying scipy.stats.ttest_rel to the data, paired by person:

>>> # sort value by person A,B,C
>>> values_day_1 = df[df["day"]==1].sort_values(by="person")["value"].values
>>> values_day_2 = df[df["day"]==2].sort_values(by="person")["value"].values
>>> values_day_1, values_day_2
(array([1, 6, 4]), array([5, 2, 3]))
>>> from scipy.stats import ttest_rel
>>> ttest_rel(values_day_1, values_day_2) # call the function on the paired data
TtestResult(statistic=0.14285714285714288, pvalue=0.8994962184740788, df=2)

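But running the same comparison through the annotator on the unsorted data frame:

import matplotlib.pyplot as plt
import seaborn as sns
from statannotations.Annotator import Annotator

x = "day"
y = "value"
pairs = [(1, 2)]  # day 1 to day 2

# draw the box plot and overlay the individual points
ax = sns.boxplot(x=x, y=y, data=df)
ax = sns.stripplot(x=x, y=y, data=df)

# run the paired t-test on the day 1 vs. day 2 pair and annotate the plot
annotator = Annotator(ax, pairs, data=df, x=x, y=y)
annotator.configure(test='t-test_paired', text_format='star', loc='inside', verbose=True)
annotator.apply_and_annotate()
plt.show()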

wrongly reports
1 vs. 2: t-test paired samples, P_val:6.667e-01 t=5.000e-01

Of course, the annotator cannot know that each person on day 1 has to be compared with the same person on day 2 for the paired test.
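The mismatch can be reproduced with scipy alone. This is a minimal sketch, assuming the annotator takes each day's values in row order without aligning them by person (an assumption inferred from the printed output, not checked against the statannotations source):

```python
import pandas as pd
from scipy.stats import ttest_rel

df = pd.DataFrame({
    "person": ["A", "B", "C", "C", "A", "B"],
    "day": [1, 2, 2, 1, 2, 1],
    "value": [1, 2, 3, 4, 5, 6],
})

# Take each day's values in row order, without aligning by person.
unsorted_day_1 = df.loc[df["day"] == 1, "value"].to_numpy()  # persons A, C, B
unsorted_day_2 = df.loc[df["day"] == 2, "value"].to_numpy()  # persons B, C, A

# The test pairs A with B, C with C and B with A.
unsorted_result = ttest_rel(unsorted_day_1, unsorted_day_2)
print(unsorted_result.statistic, unsorted_result.pvalue)
```

This gives t = 0.5 and p ≈ 0.667, the same numbers the annotator printed, which is consistent with the values being paired by position rather than by person.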

Sorting the data frame by person before plotting fixes it:

>>> df.sort_values(by="person", inplace=True)
>>> df
  person  day  value
0      A    1      1
4      A    2      5
1      B    2      2
5      B    1      6
2      C    2      3
3      C    1      4

With the sorted data frame, the annotator correctly reports 1 vs. 2: t-test paired samples, P_val:8.995e-01 t=1.429e-01 (the same as the scipy.stats TtestResult above).

The same applies to 't-test_paired', 'Wilcoxon' (scipy.stats ttest_rel, wilcoxon) and any other paired test.

Question
Is there a way to tell the annotator which column identifies the objects to pair (in this case "person")?
It could then also check whether every pair exists, or whether some person did not participate on the second day, for example.
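Until such an option exists, the pairing can be done by hand before calling the test. Here is a minimal sketch, where `paired_values` is a hypothetical helper (not part of statannotations) that aligns two groups by an identifier column and lists the identifiers that lack a partner:

```python
import pandas as pd
from scipy.stats import ttest_rel


def paired_values(data, id_col, group_col, value_col, pair):
    """Hypothetical helper: align the two groups in `pair` by `id_col`.

    Returns (first_values, second_values, unpaired_ids).
    """
    # One row per identifier, one column per group; missing cells become NaN.
    # pivot raises ValueError if an identifier occurs twice in the same group.
    wide = data.pivot(index=id_col, columns=group_col, values=value_col)
    wide = wide.reindex(columns=list(pair))
    complete = wide.dropna()
    unpaired_ids = wide.index.difference(complete.index).tolist()
    first, second = pair
    return complete[first].to_numpy(), complete[second].to_numpy(), unpaired_ids


df = pd.DataFrame({
    "person": ["A", "B", "C", "C", "A", "B"],
    "day": [1, 2, 2, 1, 2, 1],
    "value": [1, 2, 3, 4, 5, 6],
})

day_1, day_2, unpaired = paired_values(df, "person", "day", "value", (1, 2))
result = ttest_rel(day_1, day_2)

# Dropping person B's second day leaves B without a partner.
incomplete = df[~((df["person"] == "B") & (df["day"] == 2))]
_, _, unpaired_incomplete = paired_values(incomplete, "person", "day", "value", (1, 2))
print(result.statistic, unpaired, unpaired_incomplete)
```

Here the row order of `df` no longer matters, and `unpaired_incomplete` names the person who is missing on one of the days. Whether unpaired subjects should be dropped silently or raise an error is a design choice left open in this sketch.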

Note
If the data is transformed from wide format (with the identifier as index) to long format, it is already sorted correctly within each group.
Alternatively, we can reshape it to wide format, so that the values of each identifier end up in the same row:

>>> df_wide = df.pivot(index='person', columns='day', values='value') # long to wide
>>> df_wide
day     1  2
person      
A       1  5
B       6  2
C       4  3
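Both routes can be checked quickly. The sketch below passes the wide columns straight to scipy and then reshapes back to long format. Using `melt` for the wide-to-long step is an assumption; other reshaping methods may order the rows differently.

```python
import pandas as pd
from scipy.stats import ttest_rel

df = pd.DataFrame({
    "person": ["A", "B", "C", "C", "A", "B"],
    "day": [1, 2, 2, 1, 2, 1],
    "value": [1, 2, 3, 4, 5, 6],
})
df_wide = df.pivot(index="person", columns="day", values="value")

# In wide format each row is one person, so the two columns are already paired.
wide_result = ttest_rel(df_wide[1], df_wide[2])

# Wide back to long: melt stacks the day columns one after another,
# keeping the person order of the index within each day.
df_long = df_wide.reset_index().melt(
    id_vars="person", var_name="day", value_name="value"
)
print(wide_result.statistic)
print(df_long)
```

In `df_long` the persons appear in the order A, B, C within each day, so a position-based pairing lines up the same person across days.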
