Skip to content

Comments

Consistently use flock and not the sometimes-non-interacting fcntl locks#5449

Merged
adamnovak merged 7 commits intomasterfrom
issues/5447-ensafen-safe-read-file
Feb 6, 2026
Merged

Consistently use flock and not the sometimes-non-interacting fcntl locks#5449
adamnovak merged 7 commits intomasterfrom
issues/5447-ensafen-safe-read-file

Conversation

@adamnovak
Copy link
Member

This should fix #5447 by using the same file locking mechanism on the write and read sides.

Changelog Entry

To be copied to the draft changelog by merger:

  • Toil server mode should no longer be able to read partially-written state data. Workflows will no longer be reported as RUNNINGIZING, as fun as that sounds for them.

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passed tests, including the Gitlab tests, for the most recent commit in its branch.
  • Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

adamnovak and others added 5 commits February 3, 2026 16:50
Add deterministic tests that verify the locking protocol is correct by
mocking fcntl.flock and file operations to control thread execution
order. Tests verify:
- Reader blocked while writer holds exclusive lock
- Writer blocked while reader holds shared lock
- Multiple readers can hold shared locks simultaneously
- Writers serialize (cannot hold exclusive locks concurrently)
- Reader never sees partial write content

Also add BOTS.md with development environment notes for AI assistants.
Use Checkpoint class with arrive_and_wait/wait_for_arrival pattern
instead of time.sleep() calls. Tests now run in ~0.7s instead of ~24s
and are more deterministic.
@adamnovak
Copy link
Member Author

I've added a ream or so of synthetic test code that does fiddly mocking and condition variable dances to try and actually lean on the locking-ness of the locks in the "safe" read and write file functions. I don't currently understand it and I'm going to have to review it with a real review before merging.

I'm not sure it's worth its maintenance; do we need to maintain 1 KLOC to ensure that I don't forget to call the right lock function at the right place again in 40 lines of locking/unlocking code? Just because we can spin this out in half an hour doesn't mean we should.

Copy link
Member Author

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the tests are mostly on the right track, but I think the design of the various manager widgets should be simplified/unified around a kind of class that lets you hook a Checkpoint into an operation (of which we can have 3 implementations: one for flock, one for read, and one for write).

We also need to consolidate all the hooking into one with on one context manager method we implement in the test class.

"""Release whatever lock this thread holds."""
thread_id = threading.current_thread().ident
with self._condition:
if self._exclusive_held and self._exclusive_holder == thread_id:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not stand up to thread ID re-use, but we can't have that in this test.



class LockManager:
"""Manages simulated locks for multiple files."""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manages how?

Comment on lines 218 to 219
except (OSError, ValueError):
pass
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to fail here? Or do we need to be able to pretend to wrap things we can't really wrap, as part of the hooks not breaking everything?

Comment on lines 211 to 240
self.during_write: dict[str, Checkpoint] = {}
self.during_read: dict[str, Checkpoint] = {}

def wrap_file(self, file_obj: Any, path: str) -> Any:
"""Wrap a file object to track operations."""
try:
self.lock_manager.register_fd(file_obj.fileno(), path)
except (OSError, ValueError):
pass

original_write = file_obj.write
original_read = file_obj.read
tracker = self

def tracked_write(data: str) -> int:
thread_name = threading.current_thread().name
checkpoint = tracker.during_write.get(thread_name)
if checkpoint:
checkpoint.arrive_and_wait()
return original_write(data)

def tracked_read(size: int = -1) -> str:
thread_name = threading.current_thread().name
checkpoint = tracker.during_read.get(thread_name)
if checkpoint:
checkpoint.arrive_and_wait()
return original_read(size)

file_obj.write = tracked_write
file_obj.read = tracked_read
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of two duplicate but slightly different versions of the wrapper function and the whole system, this could be like one function that sticks the sync point onto another function given the function and the checkpoint collection.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we probably should split this class up into one for read and one for write, which would make each be able to be the same shape as the one for flock, in terms of providing an interface to attach checkpoints.


def writer() -> None:
try:
safe_write_file(str(self.test_file), "BBBB")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This write isn't short enough that we could tell if we saw a partial value. Really the partial case we can trigger with the tools we have is that the write finished but the truncate() didn't happen yet. So we need to overwrite with a shorter value, not a same-length value. And I think the blocks-reader-mid-write test is where we want to test that.

return file_obj


# TODO: Add tests for AtomicFileCreate path (concurrent new file creation).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to test that for now.

checkpoint.arrive_and_wait()


class FileOperationTracker:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't really "track" file operations; it really provides a Checkpoint-based interface to blocking read and write calls by particular threads. Maybe it should be IOCheckpointer.

return self._shared_count


class LockManager:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really here to let us put Checkpoints on lock operations, so it should be a LockCheckpointer.

Comment on lines 272 to 274
def patched_open(path: Any, mode: str = "r", *args: Any, **kwargs: Any) -> Any:
f = original_open(path, mode, *args, **kwargs)
return file_tracker.wrap_file(f, str(path))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make this register the file with both the LockManager and the FileOperationTracker separately, neither has to know about the other, and I think we can get away with passing fewer arguments to the registration functions.


# Checkpoint to pause writer after acquiring lock
writer_checkpoint = Checkpoint()
self.lock_manager.after_acquire["writer"] = writer_checkpoint
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of storing checkpoints in after_acquire, during_read, during_write, etc. from outside the manager classes, they should both expose the same function to put a checkpoint on a thread by name.

Simplify test infrastructure by creating a unified Checkpointer base
class with three implementations (LockCheckpointer, ReadCheckpointer,
WriteCheckpointer). Each checkpointer provides its own install() context
manager that patches the necessary functions, composing naturally via
ExitStack.

- Remove LockManager, FileOperationTracker, and _create_patches()
- Add Checkpointer ABC with add()/get() and abstract install() method
- LockCheckpointer handles fd registration and flock simulation
- ReadCheckpointer/WriteCheckpointer hook into file read/write ops
- Consolidate patching into single patched_io() context manager
- Replace test_reader_never_sees_partial_write with
  test_reader_paused_mid_read_blocks_writer for better coverage
@adamnovak
Copy link
Member Author

I had to feed the code generator a bunch of the pieces here (separate hooking per kind of event, ExitStack, actually using the hooks to block in read), but I'm still having the commit come in as in the name of Anthropic Claude since I didn't actually edit the code.

I think the test code is good enough to be getting on with.

@adamnovak adamnovak merged commit 19ebf5f into master Feb 6, 2026
3 checks passed
@adamnovak adamnovak deleted the issues/5447-ensafen-safe-read-file branch February 6, 2026 16:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WES server can show spliced state names

2 participants