My understanding of the cache flush behavior is as follows:
- The event loop calls `epoll_wait` with a timeout (passed as the `timeout` parameter, set to 1000 in `main`, i.e., 1000 milliseconds or 1 second).
- If there are no FUSE events in that period (`n == 0`), it triggers `zdbfs_cache_sync(fs)`, which flushes the cache to backend storage.
- Also, if the loop has processed more than 8192 requests (`proceed > 8192`), it flushes the cache immediately, regardless of events.
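For reference, here is a minimal sketch of that loop as I understand it; the function shape and names like `zdbfs_t`, `evfd`, and the dispatch step are my approximation of the surrounding context, not the exact zdbfs code:

```c
#include <sys/epoll.h>

#define MAXEVENTS 64

// Sketch only: `zdbfs_t`, `evfd`, and the dispatch step are assumed
// from context; `proceed` and `zdbfs_cache_sync` are from my reading.
void event_loop(int evfd, zdbfs_t *fs) {
    struct epoll_event events[MAXEVENTS];
    int proceed = 0;

    while (1) {
        // Wait up to 1000 ms (the timeout set in main) for FUSE events.
        int n = epoll_wait(evfd, events, MAXEVENTS, 1000);

        if (n == 0) {
            // Idle for a full second: flush the cache to backend storage.
            zdbfs_cache_sync(fs);
            proceed = 0;
            continue;
        }

        // ... dispatch the n FUSE events here ...
        proceed += n;

        if (proceed > 8192) {
            // Request-count limit reached: flush regardless of idle state.
            zdbfs_cache_sync(fs);
            proceed = 0;
        }
    }
}
```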
That means that a scenario like this can occur:
- Some data is written, causing a small number of FUSE requests
- An ongoing stream of FUSE requests arrives at a rate of about one per second, so the 1-second idle timeout for cache flushing never expires
- Some time before 8192 FUSE requests have accumulated, `zdbfs` dies unexpectedly and the cached data is lost
So there can be a window of more than two hours (8192 requests at ~1 per second ≈ 8192 seconds ≈ 2.3 hours) during which the cache is never flushed and (meta)data can be lost.
For the purposes of QSFS, we want to be able to define a strict upper limit on the amount of time that can pass before data is made durable, regardless of the usage pattern.
Potential solutions
- Add a configurable, time-based upper limit on how long the system can go without flushing the cache (see the sketch below)
- Tweak the existing 1-second idle timeout and 8192-request limit, possibly making one or both configurable
I think the first option makes sense, but I could be overlooking some of the reasoning in the original design.
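As a rough illustration of the first option, the loop could track the time of the last flush and force a sync once a configured interval has elapsed, regardless of the request rate. This extends the earlier sketch; `cache_flush_interval` and `last_flush` are hypothetical names, not existing zdbfs code:

```c
#include <time.h>

// Hypothetical configurable upper bound (seconds) between flushes;
// not an existing zdbfs setting.
time_t cache_flush_interval = 60;
time_t last_flush = time(NULL);

while (1) {
    int n = epoll_wait(evfd, events, MAXEVENTS, 1000);

    if (n > 0) {
        // ... dispatch FUSE events and increment `proceed` as today ...
    }

    time_t now = time(NULL);

    // Flush on idle, on the request-count limit, or once the configured
    // interval has elapsed, whichever comes first.
    if (n == 0 || proceed > 8192 || now - last_flush >= cache_flush_interval) {
        zdbfs_cache_sync(fs);
        proceed = 0;
        last_flush = now;
    }
}
```

Since `epoll_wait` returns at least once per second, the elapsed-time check runs at least that often, so the configured interval would be respected to within about one second even under a steady stream of requests.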