Consistently use flock and not the sometimes-non-interacting fcntl locks by adamnovak · Pull Request #5449 · DataBiosphere/toil

adamnovak · 2026-02-03T21:53:25Z

This should fix #5447 by using the same file locking mechanism on the write and read sides.
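The pattern the fix relies on is that the writer and the reader both take the same kind of lock. A minimal sketch of that pattern, assuming hypothetical helper names `safe_write` and `safe_read` (these are illustrations, not Toil's actual functions) and a POSIX system where `fcntl.flock` is available:

```python
import fcntl
import os
import tempfile


def safe_write(path: str, data: str) -> None:
    """Hypothetical helper: replace the file's contents under an exclusive flock."""
    # Open without O_TRUNC: truncating before the lock is held would let a
    # reader observe an empty or partially written file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    with os.fdopen(fd, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.truncate(0)
            f.write(data)
            f.flush()  # push the data out before the lock is released
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def safe_read(path: str) -> str:
    """Hypothetical helper: read the whole file under a shared flock."""
    with open(path, "r") as f:
        fcntl.flock(f, fcntl.LOCK_SH)
        try:
            return f.read()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


# Small usage example on a scratch file.
state_path = os.path.join(tempfile.mkdtemp(), "state.txt")
safe_write(state_path, "INITIALIZING")
safe_write(state_path, "RUNNING")
current_state = safe_read(state_path)
```

If one side used `fcntl.lockf` instead, the two locks might not exclude each other on every platform, which is the kind of mismatch this PR removes.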

Changelog Entry

To be copied to the draft changelog by merger:

Toil server mode should no longer be able to read partially-written state data. Workflows will no longer be reported as RUNNINGIZING, as fun as that sounds for them.

Reviewer Checklist

Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
- If it is coming from an external repo, make sure to pull it in for CI with:
```
contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
```
- If there is no associated issue, create one.
Read through the code changes. Make sure that it doesn't have:
- Addition of trailing whitespace.
- New variable or member names in camelCase that want to be in snake_case.
- New functions without type hints.
- New functions or classes without informative docstrings.
- Changes to semantics not reflected in the relevant docstrings.
- New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
- New features without tests.
Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
Finish the review with an overall description of your opinion.

Merger Checklist

Make sure the PR passed tests, including the GitLab tests, for the most recent commit in its branch.
Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
Merge with the GitHub "Squash and merge" feature.
- If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
Copy its recommended changelog entry to the Draft Changelog.
Append the issue number in parentheses to the changelog entry.

Add deterministic tests that verify the locking protocol is correct by mocking fcntl.flock and file operations to control thread execution order. Tests verify:
- Reader blocked while writer holds exclusive lock
- Writer blocked while reader holds shared lock
- Multiple readers can hold shared locks simultaneously
- Writers serialize (cannot hold exclusive locks concurrently)
- Reader never sees partial write content

Also add BOTS.md with development environment notes for AI assistants.
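The lock semantics those tests depend on can be modeled without touching the filesystem. The sketch below is a simplified, single-threaded stand-in, not the PR's test code: `FakeFlock` and `try_lock` are hypothetical names, the integers passed as file descriptors are just tokens, and it uses `LOCK_NB` to detect "would block" instead of controlling thread order. It does not cover the partial-write check.

```python
import fcntl
from unittest import mock


class FakeFlock:
    """Hypothetical in-memory model of flock: many shared holders or one exclusive holder."""

    def __init__(self) -> None:
        self.shared = set()    # tokens currently holding a shared lock
        self.exclusive = None  # token currently holding the exclusive lock

    def __call__(self, fd: int, operation: int) -> None:
        kind = operation & ~fcntl.LOCK_NB
        if kind == fcntl.LOCK_UN:
            self.shared.discard(fd)
            if self.exclusive == fd:
                self.exclusive = None
            return
        other_exclusive = self.exclusive is not None and self.exclusive != fd
        others_shared = self.shared - {fd}
        if kind == fcntl.LOCK_SH:
            if other_exclusive:
                raise BlockingIOError("exclusive lock held elsewhere")
            if self.exclusive == fd:
                self.exclusive = None  # converting our own lock to shared
            self.shared.add(fd)
        elif kind == fcntl.LOCK_EX:
            if other_exclusive or others_shared:
                raise BlockingIOError("lock held elsewhere")
            self.shared.discard(fd)
            self.exclusive = fd
        else:
            raise ValueError(f"unsupported operation: {operation}")


def try_lock(fd: int, kind: int) -> bool:
    """Attempt a non-blocking lock; report whether it was granted."""
    try:
        fcntl.flock(fd, kind | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False


with mock.patch.object(fcntl, "flock", FakeFlock()):
    reader_a, reader_b, writer_a, writer_b = 10, 11, 12, 13
    assert try_lock(reader_a, fcntl.LOCK_SH)
    assert try_lock(reader_b, fcntl.LOCK_SH)      # readers share
    assert not try_lock(writer_a, fcntl.LOCK_EX)  # writer blocked by readers
    fcntl.flock(reader_a, fcntl.LOCK_UN)
    fcntl.flock(reader_b, fcntl.LOCK_UN)
    assert try_lock(writer_a, fcntl.LOCK_EX)
    assert not try_lock(writer_b, fcntl.LOCK_EX)  # writers serialize
    assert not try_lock(reader_a, fcntl.LOCK_SH)  # reader blocked by writer
```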

Use Checkpoint class with arrive_and_wait/wait_for_arrival pattern instead of time.sleep() calls. Tests now run in ~0.7s instead of ~24s and are more deterministic.
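For readers unfamiliar with the pattern, here is one way such a checkpoint could look. This is a sketch under assumptions: only the method names `arrive_and_wait` and `wait_for_arrival` come from the commit message; the `release` method, the timeouts, and the `threading.Event` implementation are guesses for illustration and may differ from the PR's `Checkpoint` class.

```python
import threading


class Checkpoint:
    """Sketch of a rendezvous point between a worker thread and the test thread."""

    def __init__(self) -> None:
        self._arrived = threading.Event()
        self._released = threading.Event()

    def arrive_and_wait(self, timeout: float = 5.0) -> None:
        """Called by the worker: announce arrival, then pause until released."""
        self._arrived.set()
        if not self._released.wait(timeout):
            raise TimeoutError("checkpoint was never released")

    def wait_for_arrival(self, timeout: float = 5.0) -> None:
        """Called by the test: block until the worker has reached the checkpoint."""
        if not self._arrived.wait(timeout):
            raise TimeoutError("worker never reached the checkpoint")

    def release(self) -> None:
        """Called by the test (assumed method): let the worker continue."""
        self._released.set()


events = []
checkpoint = Checkpoint()


def worker() -> None:
    events.append("worker: before checkpoint")
    checkpoint.arrive_and_wait()
    events.append("worker: after checkpoint")


thread = threading.Thread(target=worker)
thread.start()
checkpoint.wait_for_arrival()  # no sleep needed: we know the worker is paused
events.append("test: worker is paused")
checkpoint.release()
thread.join()
```

Because each side waits on an event rather than a timer, the interleaving is fixed by the handshake and the test only takes as long as the threads need.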

adamnovak · 2026-02-03T22:44:34Z

I've added a ream or so of synthetic test code that does fiddly mocking and condition variable dances to try and actually lean on the locking-ness of the locks in the "safe" read and write file functions. I don't currently understand it and I'm going to have to review it with a real review before merging.

I'm not sure it's worth its maintenance; do we need to maintain 1 KLOC to ensure that I don't forget to call the right lock function at the right place again in 40 lines of locking/unlocking code? Just because we can spin this out in half an hour doesn't mean we should.

…en-safe-read-file

adamnovak

I think the tests are mostly on the right track, but I think the design of the various manager widgets should be simplified/unified around a kind of class that lets you hook a Checkpoint into an operation (of which we can have 3 implementations: one for flock, one for read, and one for write).

We also need to consolidate all the hooking into one context manager method that we implement in the test class.
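One possible shape for that single hooking entry point, as a rough sketch only: `hooked` is a hypothetical name, it is written as a free function rather than a test-class method, and the `before` callback stands in for something like a checkpoint's `arrive_and_wait`. The same helper could wrap `fcntl.flock`, a read, or a write.

```python
import contextlib
import fcntl
import tempfile
from typing import Any, Callable, Iterator
from unittest import mock


@contextlib.contextmanager
def hooked(owner: Any, name: str, before: Callable[[], None]) -> Iterator[None]:
    """Hypothetical helper: run `before` ahead of every call to owner.name."""
    original = getattr(owner, name)

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        before()  # e.g. a checkpoint's arrive_and_wait in a threaded test
        return original(*args, **kwargs)

    # mock.patch.object restores the original attribute on exit.
    with mock.patch.object(owner, name, wrapper):
        yield


calls = []
with tempfile.TemporaryFile() as f:
    with hooked(fcntl, "flock", lambda: calls.append("flock")):
        fcntl.flock(f, fcntl.LOCK_SH)  # hook runs, then the real flock
        fcntl.flock(f, fcntl.LOCK_UN)
```

Keeping the patching in one place means each test only has to say which operation to pause and what to run there.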

adamnovak · 2026-02-04T16:01:53Z