Firstly, I would like to express my gratitude for your exceptional work.
Recently, I attempted to use Factor to evaluate instruction-tuned models such as llama2-chat. However, I observed that Factor's evaluation format is designed around text completion, which makes it better suited to base models than to instruction-tuned models.
To adapt the task for SFT models, I experimented with instruction prompts such as "Please complete the following text." (see the sketch below). However, their performance still falls behind that of base models, which differs from the results I obtained on other benchmarks, such as TruthfulQA.
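For reference, here is a minimal sketch of the prompt wrapping I tried, assuming the standard Llama-2-chat `[INST] ... [/INST]` delimiters. The function name and the example prefix are hypothetical; the actual Factor contexts would be substituted for `prefix`:

```python
def wrap_for_llama2_chat(prefix: str) -> str:
    """Wrap a Factor text-completion prefix in a Llama-2-chat-style prompt.

    Assumes the single-turn [INST] ... [/INST] template; the instruction
    sentence is the one described in this issue.
    """
    instruction = "Please complete the following text."
    return f"[INST] {instruction}\n\n{prefix} [/INST]"

# Hypothetical usage with a made-up prefix:
print(wrap_for_llama2_chat("The Eiffel Tower is located in"))
```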
I would greatly appreciate any insights or suggestions you may have. Thank you!