Abnormal evaluation results for L&S

Hi, I’m reproducing the experiments from the paper, using Llama-3.1-8B-Instruct for both Dataless L&S and L&S. According to the paper, the two methods should achieve similar scores:

<img width="1037" height="281" alt="Image" src="https://github.com/user-attachments/assets/50eaee00-6c08-41e5-9fcc-2235731d01af" />

However, in my reproduction, L&S performs significantly worse than Dataless L&S:

<img width="1234" height="171" alt="Image" src="https://github.com/user-attachments/assets/382616a0-8281-4b39-a0ef-1285255af8b6" />

I used the evaluation script provided in the repository. The L&S merging command is as follows:
```
python ./merging/main.py --algo LocalizeAndStitch \
  --base-model /share/home/wenqingchen/zmj/keyan_merge/models/Meta-Llama-3.1-8B-Instruct \
  --lr 1e8 \
  --sparsity 0.1 \
  --n_epochs 1
```
When checking the model outputs on IFEval, I found that the L&S outputs often contain repeated text segments:

<img width="1001" height="373" alt="Image" src="https://github.com/user-attachments/assets/7f59d553-87e7-4804-be76-b196014fad41" />

I’m not sure what went wrong. Could you please help me identify the issue?
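In case it helps with triage, here is a rough sketch of how the repetition could be quantified per response, so the L&S and Dataless L&S outputs can be compared on the same prompts. It is not part of the repository: `repeated_ngram_ratio` is a hypothetical helper, the 4-word window is an arbitrary choice, and it assumes each model output is available as a plain string.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Return the fraction of word n-grams that repeat an earlier n-gram.

    Hypothetical helper for illustration only; not part of the repo.
    0.0 means no n-gram occurs twice; values near 1.0 mean heavy looping.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        # Text shorter than n words: nothing to compare.
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(ngrams)


if __name__ == "__main__":
    # Toy strings, not real model outputs.
    print(repeated_ngram_ratio("a b c d a b c d"))
    print(repeated_ngram_ratio("a b c d e f"))
```

Averaging this ratio over all IFEval responses for each merged model would give a single number to compare, which may be easier to discuss than screenshots.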
