I evaluated the base model Llama-3.2-1B-Instruct and the zip2zip model zip2zip-Llama-3.2-1B-Instruct-v0.1 using the script banch/run_harness_pretrained.py. The evaluation results are as follows:
- Llama-3.2-1B-Instruct
| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_easy | 1 | none | 2 | acc | ↑ | 0.57 | ± | 0.0498 |
|  |  | none | 2 | acc_norm | ↑ | 0.64 | ± | 0.0482 |
| commonsense_qa | Yaml | none | 2 | acc | ↑ | 0.57 | ± | 0.0498 |
| hellaswag | 1 | none | 2 | acc | ↑ | 0.47 | ± | 0.0502 |
|  |  | none | 2 | acc_norm | ↑ | 0.63 | ± | 0.0485 |
| mathqa | 1 | none | 2 | acc | ↑ | 0.29 | ± | 0.0456 |
|  |  | none | 2 | acc_norm | ↑ | 0.30 | ± | 0.0461 |
| openbookqa | 1 | none | 2 | acc | ↑ | 0.24 | ± | 0.0429 |
|  |  | none | 2 | acc_norm | ↑ | 0.35 | ± | 0.0479 |
| piqa | 1 | none | 2 | acc | ↑ | 0.72 | ± | 0.0451 |
|  |  | none | 2 | acc_norm | ↑ | 0.78 | ± | 0.0416 |
| winogrande | 1 | none | 2 | acc | ↑ | 0.62 | ± | 0.0488 |
- zip2zip-Llama-3.2-1B-Instruct-v0.1
| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_easy | 1 | none | 2 | acc | ↑ | 0.42 | ± | 0.0496 |
|  |  | none | 2 | acc_norm | ↑ | 0.39 | ± | 0.0490 |
| commonsense_qa | Yaml | none | 2 | acc | ↑ | 0.23 | ± | 0.0423 |
| hellaswag | 1 | none | 2 | acc | ↑ | 0.19 | ± | 0.0394 |
|  |  | none | 2 | acc_norm | ↑ | 0.21 | ± | 0.0409 |
| mathqa | 1 | none | 2 | acc | ↑ | 0.26 | ± | 0.0441 |
|  |  | none | 2 | acc_norm | ↑ | 0.26 | ± | 0.0441 |
| openbookqa | 1 | none | 2 | acc | ↑ | 0.16 | ± | 0.0368 |
|  |  | none | 2 | acc_norm | ↑ | 0.29 | ± | 0.0456 |
| piqa | 1 | none | 2 | acc | ↑ | 0.55 | ± | 0.0500 |
|  |  | none | 2 | acc_norm | ↑ | 0.56 | ± | 0.0499 |
| winogrande | 1 | none | 2 | acc | ↑ | 0.55 | ± | 0.0500 |
I noticed that the zip2zip model scores noticeably lower than the base model on every task. Has anyone encountered this before? Is a gap of this size expected, or could there be an issue in my evaluation setup?
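For context, my setup is roughly equivalent to the following lm-evaluation-harness call. This is only a sketch, not the exact contents of banch/run_harness_pretrained.py: the zip2zip model may need the repo's own loading/wrapper code rather than the plain `hf` backend, and the HF repo id and sample limit shown here are assumptions on my part.

```python
# Rough sketch of the evaluation setup (assumed), not the exact script.
import lm_eval

TASKS = [
    "arc_easy", "commonsense_qa", "hellaswag",
    "mathqa", "openbookqa", "piqa", "winogrande",
]

for model_id in [
    "meta-llama/Llama-3.2-1B-Instruct",    # base model
    "zip2zip-Llama-3.2-1B-Instruct-v0.1",  # zip2zip model (full HF repo id assumed)
]:
    results = lm_eval.simple_evaluate(
        model="hf",                         # plain HF backend; zip2zip may require a custom wrapper
        model_args=f"pretrained={model_id}",
        tasks=TASKS,
        num_fewshot=2,                      # matches the n-shot column above
        limit=100,                          # assumed evaluation subset size; adjust to your run
    )
    print(model_id)
    for task, metrics in results["results"].items():
        print(task, metrics)
```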