Skip to content

Comments

Resumable single step evals#149

Open
jla-gardner wants to merge 2 commits intomicrosoft:mainfrom
jla-gardner:resumable-single-step-eval
Open

Resumable single step evals#149
jla-gardner wants to merge 2 commits intomicrosoft:mainfrom
jla-gardner:resumable-single-step-eval

Conversation

@jla-gardner
Copy link

This PR (attempts to) makes the syntheseus/cli/eval_single_step.py script resumable, mitigating the impact of e.g. job crashes/pre-emption.

To do this, the script now saves model inputs+outputs+information in a JSON Lines file during evaluation. Re-invoking this script after e.g. pre-emption/restarts attempts to load this file if available, and skips calling the model for any already-queried inputs.

The new output format is easier to work with than the existing stats.json files, since you can load line-by-line (see attached script and analysis for what this looks like).

To preserve backwards compatibility, this feature is hidden behind the config.resumable flag, which is turned off by default.

Thank you to @kmaziarz for the suggested implementation.

@jla-gardner
Copy link
Author

This script: demo_resume.py mocks the use of a simple model when performing single step evaluations, and shows the resumability of the process.

The raw output is:

{"input": "CN", "ground_truth": "C.N>>CN", "num_predictions": 3, "ground_truth_correct": [true, false, false], "timing": {"time_model_call": 2.1298726399739582e-05, "time_post_processing": 5.404154459635417e-06}, "predictions": ["C.N>>CN", "c1ccccc1>>CN", "NC=O>>CN"]}
{"input": "CN", "ground_truth": "NC=O>>CN", "num_predictions": 3, "ground_truth_correct": [false, false, true], "timing": {"time_model_call": 2.0265579223632812e-05, "time_post_processing": 2.9802322387695312e-06}, "predictions": ["C.N>>CN", "c1ccccc1>>CN", "NC=O>>CN"]}
{"input": "CC", "ground_truth": "C.C>>CC", "num_predictions": 3, "ground_truth_correct": [false, false, false], "timing": {"time_model_call": 2.0265579223632812e-05, "time_post_processing": 2.9802322387695312e-06}, "predictions": ["C.N>>CC", "c1ccccc1>>CC", "NC=O>>CC"]}

Parsing this into a more readable format: head -1 predictions.jsonl | python3 -m json.tool gives:

{
    "input": "CN",
    "ground_truth": "C.N>>CN",
    "num_predictions": 3,
    "ground_truth_correct": [
        true,
        false,
        false
    ],
    "timing": {
        "time_model_call": 2.1298726399739582e-05,
        "time_post_processing": 5.404154459635417e-06
    },
    "predictions": [
        "C.N>>CN",
        "c1ccccc1>>CN",
        "NC=O>>CN"
    ]
}

@kmaziarz kmaziarz requested review from AustinT and kmaziarz February 20, 2026 14:36
Copy link
Contributor

@kmaziarz kmaziarz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jla-gardner, this has been annoying me since forever 🙂 I had an initial look and left a few comments, I can have another pass later. Before merging, I will also battle-test it on an actual problematic case (i.e. eval that spans multiple days yet gets pre-empted every 12h).

If you look at CHANGELOG.md, you'll find that we record all changes to syntheseus there (well, apart from those that don't affect any behaviour, e.g. changes purely in tests or fixing a typo in a comment). Feel free to add an entry there following the existing pattern (so, also append yourself to the contributors list)!

Comment on lines +255 to +256
input=d.get("input", ""),
ground_truth=d.get("ground_truth", ""),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would input or ground_truth be missing? Also, going in that direction, I'm wondering why we can't do something akin to

return cls(
    **{
        k: TimingRecord(**v) if k.endswith("timing") else v
        for k, v in d.items()
    }
)

This would fail explicitly if extra stuff is provided in d, but maybe that's a good thing?

Same comment for to_dict; I suppose that also could be a for loop?

"""Reconstruct a Reaction from its reaction SMILES string."""
sep = REACTION_SEPARATOR * 2
reactants_str, products_str = rxn_smiles.split(sep)
reactants = Bag(Molecule(s) for s in reactants_str.split(SMILES_SEPARATOR) if s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have molecule_bag_from_smiles_strict in chem/utils.py doing almost exactly this, modulo the rejection of empty SMILES components via if s. But I'm not sure why we'd want the latter?

More generally, maybe _reaction_from_smiles could be a classmethod of the Reaction class? I was somewhat surprised we don't have it there already.

Comment on lines +207 to +212
@dataclass(frozen=True)
class TimingRecord:
"""Timing information for a single sample's model call."""

time_model_call: float
time_post_processing: float
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not an exact copy of ModelTimingResults from utils/metrics.py? I guess the existing one is used for timing of a step (batch) and the new one you use for timing of a single sample (by dividing by batch size), but not sure if that warrants having both of them around.

Comment on lines 542 to 544
if include_predictions:
all_predictions.extend(batch_predictions)
all_back_translation_predictions.extend(batch_back_translation_predictions)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in resumable mode, are we flushing the predictions to a file as we go and also keep all of them in memory? Or am I misunderstanding things?

input_smiles = input.smiles
else:
assert isinstance(input, Bag)
input_smiles = ".".join(mol.smiles for mol in input)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is molecule_bag_to_smiles from interface/molecule.py.

Copy link
Collaborator

@AustinT AustinT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @jla-gardner. The implementation roughly makes sense to me- I'll leave it to @kmaziarz for detailed review. I agree with his comments so far.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants