I evaluated the base model Llama-3.2-1B-Instruct and the zip2zip model zip2zip-Llama-3.2-1B-Instruct-v0.1 using the script banch/run_harness_pretrained.py. The evaluation results are as follows:
- Llama-3.2-1B-Instruct
| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_easy | 1 | none | 2 | acc | ↑ | 0.57 | ± | 0.0498 |
|  |  | none | 2 | acc_norm | ↑ | 0.64 | ± | 0.0482 |
| commonsense_qa | Yaml | none | 2 | acc | ↑ | 0.57 | ± | 0.0498 |
| hellaswag | 1 | none | 2 | acc | ↑ | 0.47 | ± | 0.0502 |
|  |  | none | 2 | acc_norm | ↑ | 0.63 | ± | 0.0485 |
| mathqa | 1 | none | 2 | acc | ↑ | 0.29 | ± | 0.0456 |
|  |  | none | 2 | acc_norm | ↑ | 0.30 | ± | 0.0461 |
| openbookqa | 1 | none | 2 | acc | ↑ | 0.24 | ± | 0.0429 |
|  |  | none | 2 | acc_norm | ↑ | 0.35 | ± | 0.0479 |
| piqa | 1 | none | 2 | acc | ↑ | 0.72 | ± | 0.0451 |
|  |  | none | 2 | acc_norm | ↑ | 0.78 | ± | 0.0416 |
| winogrande | 1 | none | 2 | acc | ↑ | 0.62 | ± | 0.0488 |
- zip2zip-Llama-3.2-1B-Instruct-v0.1
| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_easy | 1 | none | 2 | acc | ↑ | 0.42 | ± | 0.0496 |
|  |  | none | 2 | acc_norm | ↑ | 0.39 | ± | 0.0490 |
| commonsense_qa | Yaml | none | 2 | acc | ↑ | 0.23 | ± | 0.0423 |
| hellaswag | 1 | none | 2 | acc | ↑ | 0.19 | ± | 0.0394 |
|  |  | none | 2 | acc_norm | ↑ | 0.21 | ± | 0.0409 |
| mathqa | 1 | none | 2 | acc | ↑ | 0.26 | ± | 0.0441 |
|  |  | none | 2 | acc_norm | ↑ | 0.26 | ± | 0.0441 |
| openbookqa | 1 | none | 2 | acc | ↑ | 0.16 | ± | 0.0368 |
|  |  | none | 2 | acc_norm | ↑ | 0.29 | ± | 0.0456 |
| piqa | 1 | none | 2 | acc | ↑ | 0.55 | ± | 0.0500 |
|  |  | none | 2 | acc_norm | ↑ | 0.56 | ± | 0.0499 |
| winogrande | 1 | none | 2 | acc | ↑ | 0.55 | ± | 0.0500 |
I noticed that the zip2zip model scores noticeably lower than the base model on every task. Has anyone encountered this before? Is a gap of this size expected, or could there be an issue in my evaluation setup?
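For context, my setup is roughly equivalent to the following lm-evaluation-harness call. This is only a sketch, not the exact contents of banch/run_harness_pretrained.py: the zip2zip model may need the repo's own loading/wrapper code rather than the plain `hf` backend, and the HF repo id and sample limit shown here are assumptions on my part.

```python
# Rough sketch of the evaluation setup (assumed), not the exact script.
import lm_eval

TASKS = [
    "arc_easy", "commonsense_qa", "hellaswag",
    "mathqa", "openbookqa", "piqa", "winogrande",
]

for model_id in [
    "meta-llama/Llama-3.2-1B-Instruct",    # base model
    "zip2zip-Llama-3.2-1B-Instruct-v0.1",  # zip2zip model (full HF repo id assumed)
]:
    results = lm_eval.simple_evaluate(
        model="hf",                         # plain HF backend; zip2zip may require a custom wrapper
        model_args=f"pretrained={model_id}",
        tasks=TASKS,
        num_fewshot=2,                      # matches the n-shot column above
        limit=100,                          # assumed evaluation subset size; adjust to your run
    )
    print(model_id)
    for task, metrics in results["results"].items():
        print(task, metrics)
```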