
The evaluation results did not meet expectations. #7

@jiyanxin

Description


I evaluated the base model Llama-3.2-1B-Instruct and the zip2zip model zip2zip-Llama-3.2-1B-Instruct-v0.1 using the script banch/run_harness_pretrained.py. The evaluation results are as follows:

  • Llama-3.2-1B-Instruct

| Tasks          | Version | Filter | n-shot | Metric   | Value | Stderr   |
|----------------|---------|--------|--------|----------|-------|----------|
| arc_easy       | 1       | none   | 2      | acc      | 0.57  | ± 0.0498 |
|                |         | none   | 2      | acc_norm | 0.64  | ± 0.0482 |
| commonsense_qa | Yaml    | none   | 2      | acc      | 0.57  | ± 0.0498 |
| hellaswag      | 1       | none   | 2      | acc      | 0.47  | ± 0.0502 |
|                |         | none   | 2      | acc_norm | 0.63  | ± 0.0485 |
| mathqa         | 1       | none   | 2      | acc      | 0.29  | ± 0.0456 |
|                |         | none   | 2      | acc_norm | 0.30  | ± 0.0461 |
| openbookqa     | 1       | none   | 2      | acc      | 0.24  | ± 0.0429 |
|                |         | none   | 2      | acc_norm | 0.35  | ± 0.0479 |
| piqa           | 1       | none   | 2      | acc      | 0.72  | ± 0.0451 |
|                |         | none   | 2      | acc_norm | 0.78  | ± 0.0416 |
| winogrande     | 1       | none   | 2      | acc      | 0.62  | ± 0.0488 |
  • zip2zip-Llama-3.2-1B-Instruct-v0.1

| Tasks          | Version | Filter | n-shot | Metric   | Value | Stderr   |
|----------------|---------|--------|--------|----------|-------|----------|
| arc_easy       | 1       | none   | 2      | acc      | 0.42  | ± 0.0496 |
|                |         | none   | 2      | acc_norm | 0.39  | ± 0.0490 |
| commonsense_qa | Yaml    | none   | 2      | acc      | 0.23  | ± 0.0423 |
| hellaswag      | 1       | none   | 2      | acc      | 0.19  | ± 0.0394 |
|                |         | none   | 2      | acc_norm | 0.21  | ± 0.0409 |
| mathqa         | 1       | none   | 2      | acc      | 0.26  | ± 0.0441 |
|                |         | none   | 2      | acc_norm | 0.26  | ± 0.0441 |
| openbookqa     | 1       | none   | 2      | acc      | 0.16  | ± 0.0368 |
|                |         | none   | 2      | acc_norm | 0.29  | ± 0.0456 |
| piqa           | 1       | none   | 2      | acc      | 0.55  | ± 0.0500 |
|                |         | none   | 2      | acc_norm | 0.56  | ± 0.0499 |
| winogrande     | 1       | none   | 2      | acc      | 0.55  | ± 0.0500 |

I noticed that the zip2zip model scores substantially lower than the base model on every task. Has anyone encountered this before? Is a drop of this size expected, or could there be an issue in my evaluation setup?
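For reference, below is a minimal sketch of how numbers like the ones above could be reproduced with the lm-evaluation-harness Python API. This is not the repo's banch/run_harness_pretrained.py (which presumably handles zip2zip-specific loading and tokenization); the model IDs, batch size, and loop are just placeholders based on this issue.

```python
# Minimal sketch (assumption: a plain lm-evaluation-harness run, NOT the repo's
# banch/run_harness_pretrained.py, which may load the zip2zip model differently).
import lm_eval
from lm_eval.models.huggingface import HFLM

TASKS = ["arc_easy", "commonsense_qa", "hellaswag",
         "mathqa", "openbookqa", "piqa", "winogrande"]

# For the second run, swap in the zip2zip-Llama-3.2-1B-Instruct-v0.1 checkpoint
# (loaded however the repo expects); the ID below is the standard Hugging Face one.
lm = HFLM(pretrained="meta-llama/Llama-3.2-1B-Instruct", batch_size=8)

results = lm_eval.simple_evaluate(
    model=lm,
    tasks=TASKS,
    num_fewshot=2,  # matches the n-shot column in the tables above
)

# Print per-task metrics (acc / acc_norm and their stderrs).
for task, metrics in results["results"].items():
    print(task, metrics)
```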
