(question) paired test, data frame sorting, t-test_paired, Wilcoxon #186

When applying a paired test, the row order of the data is crucial: the i-th value in one group must belong to the same subject as the i-th value in the other group.

Example

```
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "person": ["A","B","C","C","A","B"],
...     "day": [1,2,2,1,2,1],
...     "value": [1,2,3,4,5,6],
... })
>>> df
  person  day  value
0      A    1      1
1      B    2      2
2      C    2      3
3      C    1      4
4      A    2      5
5      B    1      6
```

Applying `scipy.stats.ttest_rel` on the paired data:

```
>>> # sort value by person A,B,C
>>> values_day_1 = df[df["day"]==1].sort_values(by="person")["value"].values
>>> values_day_2 = df[df["day"]==2].sort_values(by="person")["value"].values
>>> values_day_1, values_day_2
(array([1, 6, 4]), array([5, 2, 3]))
>>> from scipy.stats import ttest_rel
>>> ttest_rel(values_day_1, values_day_2) # call the function on the paired data
TtestResult(statistic=0.14285714285714288, pvalue=0.8994962184740788, df=2)
```

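However, running the same paired test through the annotator on the original, unsorted data frame
```
import matplotlib.pyplot as plt
import seaborn as sns
from statannotations.Annotator import Annotator

x = "day"
y = "value"
pairs = [(1, 2)]  # compare day 1 with day 2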

ax = sns.boxplot(x=x, y=y, data=df)
ax = sns.stripplot(x=x, y=y, data=df)

annotator = Annotator(ax, pairs, data=df, x=x, y=y)
annotator.configure(test='t-test_paired', text_format='star', loc='inside', verbose=True)
annotator.apply_and_annotate()
plt.show()
```
returns the wrong result:
`1 vs. 2: t-test paired samples, P_val:6.667e-01 t=5.000e-01`

Of course, the annotator cannot know that each person's day 1 value should be compared with the same person's day 2 value; judging by the result, it pairs the values by their row order within each group.
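The mismatch can be reproduced with scipy alone. The sketch below takes each day's values in plain row order, without aligning them by person, and arrives at the same statistic the annotator printed. The variable names `unsorted_day_1` and `unsorted_day_2` are only illustrative.

```python
import pandas as pd
from scipy.stats import ttest_rel

df = pd.DataFrame({
    "person": ["A", "B", "C", "C", "A", "B"],
    "day": [1, 2, 2, 1, 2, 1],
    "value": [1, 2, 3, 4, 5, 6],
})

# Take each day's values in row order, WITHOUT sorting by person.
unsorted_day_1 = df.loc[df["day"] == 1, "value"].to_numpy()  # persons A, C, B
unsorted_day_2 = df.loc[df["day"] == 2, "value"].to_numpy()  # persons B, C, A

# The test now pairs A with B, C with C, and B with A.
result = ttest_rel(unsorted_day_1, unsorted_day_2)
print(result.statistic, result.pvalue)  # t = 0.5, p = 0.667 (approximately)
```

This matches `t=5.000e-01` and `P_val:6.667e-01` from the annotator output, which suggests that the difference comes purely from the row order.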

Sorting the data frame by person before plotting and annotating fixes it:
```
>>> df.sort_values(by="person", inplace=True)
>>> df
  person  day  value
0      A    1      1
4      A    2      5
1      B    2      2
5      B    1      6
2      C    2      3
3      C    1      4
```
With the sorted data frame, the annotator returns the correct result, `1 vs. 2: t-test paired samples, P_val:8.995e-01 t=1.429e-01`, which matches the `TtestResult` from `scipy.stats` above.

The same applies to 't-test_paired', 'Wilcoxon' (scipy.stats `ttest_rel`, `wilcoxon`) and any other paired test.

**Question**
Is there a way to tell the annotator which column identifies the objects to pair (in this case "person")?
It could then also check whether every pair is complete, or whether, for example, some person did not participate on the second day.
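Until that is answered, one possible workaround is to align the data before handing it to the annotator. The sketch below is not part of statannotations: `align_pairs` is a hypothetical helper, and person "D" is an extra row added only to illustrate an incomplete pair. It keeps the subjects that appear exactly once in every group and sorts the rows so that row order within each group follows the identifier.

```python
import pandas as pd

def align_pairs(data, id_col, group_col):
    """Hypothetical helper: keep subjects observed exactly once in every
    group, and sort so that rows within each group follow the identifier."""
    n_groups = data[group_col].nunique()
    counts = data.groupby(id_col)[group_col].agg(["nunique", "size"])
    # A complete subject appears in all groups, once per group.
    is_complete = (counts["nunique"] == n_groups) & (counts["size"] == n_groups)
    complete_ids = counts.index[is_complete]
    dropped = sorted(set(data[id_col]) - set(complete_ids))
    aligned = (
        data[data[id_col].isin(complete_ids)]
        .sort_values(by=[group_col, id_col])
        .reset_index(drop=True)
    )
    return aligned, dropped

df = pd.DataFrame({
    "person": ["A", "B", "C", "C", "A", "B", "D"],  # "D" is illustrative only
    "day": [1, 2, 2, 1, 2, 1, 1],
    "value": [1, 2, 3, 4, 5, 6, 7],
})

aligned, dropped = align_pairs(df, id_col="person", group_col="day")
print(dropped)   # subjects without a complete pair
print(aligned)   # rows ordered by day, then person
```

The `aligned` frame can then be passed as `data=` to the plotting calls and to `Annotator`, and `dropped` lists who was excluded. Silently dropping incomplete subjects is a design choice; raising an error instead may be preferable in some analyses.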


**Note**
If the data is converted from wide format (with the identifier as index) to long format, the rows are already ordered by identifier within each group.
Alternatively, we can pivot the long data to wide format, so that the values of each identifier end up in the same row:
```
>>> df_wide = df.pivot(index='person', columns='day', values='value') # long to wide
>>> df_wide
day     1  2
person      
A       1  5
B       6  2
C       4  3

```
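As a rough sketch of how the wide format helps: the columns of the pivoted frame are aligned by person, so they can be passed to `ttest_rel` directly, and a person missing on one day would show up as `NaN` and could be dropped with `dropna()`. Melting the wide frame back gives a long frame whose rows are ordered by person within each day. The names `complete` and `df_long` are only illustrative.

```python
import pandas as pd
from scipy.stats import ttest_rel

df = pd.DataFrame({
    "person": ["A", "B", "C", "C", "A", "B"],
    "day": [1, 2, 2, 1, 2, 1],
    "value": [1, 2, 3, 4, 5, 6],
})

# Long to wide: one row per person, one column per day.
df_wide = df.pivot(index="person", columns="day", values="value")

# A person missing on one day would have NaN in that column.
complete = df_wide.dropna()

# The columns are aligned by person, so the pairing is correct.
result = ttest_rel(complete[1], complete[2])
print(result.statistic, result.pvalue)

# Wide back to long: rows come out ordered by person within each day.
df_long = df_wide.reset_index().melt(
    id_vars="person", var_name="day", value_name="value"
)
print(df_long)
```

The statistic and p-value agree with the sorted `ttest_rel` call shown earlier in the issue.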
