Resumable single step evals by jla-gardner · Pull Request #149 · microsoft/syntheseus

jla-gardner · 2026-02-17T16:23:12Z

This PR makes (or at least attempts to make) the syntheseus/cli/eval_single_step.py script resumable, mitigating the impact of e.g. job crashes or pre-emption.

To do this, the script now saves each model input, its outputs, and associated information (e.g. timing) to a JSON Lines file during evaluation. Re-invoking the script after e.g. a pre-emption or restart attempts to load this file if it is available, and skips calling the model for any inputs that have already been queried.
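As a rough illustration of the idea (not the code in this PR), a resumable loop over a JSON Lines file could look like the sketch below. The names `load_completed` and `run_resumable` are hypothetical, the model is assumed to be a plain callable returning a list of reaction strings, and correctness is checked by exact string match purely for simplicity.

```python
import json
from pathlib import Path


def load_completed(path):
    """Return records already written to `path`, skipping a truncated line."""
    records = []
    if not Path(path).exists():
        return records
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                # A crash mid-write can leave a partial line; ignore it.
                continue
    return records


def run_resumable(samples, model, path):
    """Evaluate (input, ground_truth) pairs, one JSON line per sample.

    Hypothetical sketch: returns how many times the model was called.
    """
    records = load_completed(path)

    # Rewrite only the valid records, so new lines are never appended
    # onto a partial trailing line left behind by a crash.
    tmp = Path(str(path) + ".tmp")
    tmp.write_text("".join(json.dumps(r) + "\n" for r in records))
    tmp.replace(path)

    done = {(r["input"], r["ground_truth"]) for r in records}
    num_model_calls = 0
    with open(path, "a") as f:
        for inp, truth in samples:
            if (inp, truth) in done:
                continue  # already evaluated in a previous run
            predictions = model(inp)
            num_model_calls += 1
            record = {
                "input": inp,
                "ground_truth": truth,
                "num_predictions": len(predictions),
                # Assumption: exact string equality stands in for the
                # real correctness check.
                "ground_truth_correct": [p == truth for p in predictions],
                "predictions": predictions,
            }
            f.write(json.dumps(record) + "\n")
            f.flush()  # push each record out promptly to limit lost work
    return num_model_calls
```

Calling `run_resumable` a second time with the same samples and path makes no further model calls, because every sample is already present in the file.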

The new output format is easier to work with than the existing stats.json files, since you can load line-by-line (see attached script and analysis for what this looks like).
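For example, a minimal sketch of line-by-line analysis, assuming records shaped like the demo output in this thread (with the timing fields left out for brevity). The helpers `load_records` and `top_k_accuracy` are hypothetical and not part of syntheseus.

```python
import json


def load_records(lines):
    """Parse an iterable of JSON Lines strings (e.g. an open file handle)."""
    return [json.loads(line) for line in lines if line.strip()]


def top_k_accuracy(records, k):
    """Fraction of records whose ground truth is among the first k predictions."""
    hits = sum(any(r["ground_truth_correct"][:k]) for r in records)
    return hits / len(records)


# Three records mirroring the demo output (timing and predictions omitted).
raw_lines = [
    '{"input": "CN", "ground_truth": "C.N>>CN", "ground_truth_correct": [true, false, false]}',
    '{"input": "CN", "ground_truth": "NC=O>>CN", "ground_truth_correct": [false, false, true]}',
    '{"input": "CC", "ground_truth": "C.C>>CC", "ground_truth_correct": [false, false, false]}',
]

records = load_records(raw_lines)
print(f"top-1: {top_k_accuracy(records, 1):.2f}")  # top-1: 0.33
print(f"top-3: {top_k_accuracy(records, 3):.2f}")  # top-3: 0.67
```

In practice `raw_lines` would be replaced by an open handle on the output file, since iterating over a file yields one line at a time.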

To preserve backwards compatibility, this feature is hidden behind the config.resumable flag, which is turned off by default.

Thank you to @kmaziarz for the suggested implementation.

jla-gardner · 2026-02-17T16:26:17Z

The demo_resume.py script mocks a simple model performing single-step evaluation, and demonstrates that the process can be resumed.

The raw output is:

{"input": "CN", "ground_truth": "C.N>>CN", "num_predictions": 3, "ground_truth_correct": [true, false, false], "timing": {"time_model_call": 2.1298726399739582e-05, "time_post_processing": 5.404154459635417e-06}, "predictions": ["C.N>>CN", "c1ccccc1>>CN", "NC=O>>CN"]}
{"input": "CN", "ground_truth": "NC=O>>CN", "num_predictions": 3, "ground_truth_correct": [false, false, true], "timing": {"time_model_call": 2.0265579223632812e-05, "time_post_processing": 2.9802322387695312e-06}, "predictions": ["C.N>>CN", "c1ccccc1>>CN", "NC=O>>CN"]}
{"input": "CC", "ground_truth": "C.C>>CC", "num_predictions": 3, "ground_truth_correct": [false, false, false], "timing": {"time_model_call": 2.0265579223632812e-05, "time_post_processing": 2.9802322387695312e-06}, "predictions": ["C.N>>CC", "c1ccccc1>>CC", "NC=O>>CC"]}

Pretty-printing the first record with head -1 predictions.jsonl | python3 -m json.tool gives:

{
    "input": "CN",
    "ground_truth": "C.N>>CN",
    "num_predictions": 3,
    "ground_truth_correct": [
        true,
        false,
        false
    ],
    "timing": {
        "time_model_call": 2.1298726399739582e-05,
        "time_post_processing": 5.404154459635417e-06
    },
    "predictions": [
        "C.N>>CN",
        "c1ccccc1>>CN",
        "NC=O>>CN"
    ]
}

kmaziarz

Thanks @jla-gardner, this has been annoying me since forever 🙂 I had an initial look and left a few comments, I can have another pass later. Before merging, I will also battle-test it on an actual problematic case (i.e. eval that spans multiple days yet gets pre-empted every 12h).

If you look at CHANGELOG.md, you'll find that we record all changes to syntheseus there (well, apart from those that don't affect any behaviour, e.g. changes purely in tests or fixing a typo in a comment). Feel free to add an entry there following the existing pattern (so, also append yourself to the contributors list)!

kmaziarz · 2026-02-20T14:40:01Z