Correctness bug in VSR implementation: A replica in recovery status participates in view changes #15

@jorangreef

The Viewstamped Replication Revisited paper in Section 4.2 requires that:

"When a replica recovers after a crash it cannot participate in request processing and view changes until it has a state at least as recent as when it failed. If it could participate sooner than this, the system can fail. For example, if it forgets that it prepared some operation, this operation might then be known to fewer than a quorum of replicas even though it committed, which could cause the operation to be forgotten in a view change."

However, I believe there may be a bug in https://github.com/UWSysLab/tapir/blob/master/replication/vr/replica.cc#L833-L835, where the implementation allows a replica in recovery status to participate in a view change to a higher view, which could lead to data loss.
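To illustrate the kind of guard the paper's requirement implies, here is a minimal sketch. It is not Tapir's code: `Status`, `StartViewChange`, `Replica`, and every method name are hypothetical and chosen only for this example. The view-change logic is heavily simplified (no quorum counting, no log). The only point is that a replica in recovering status drops view-change messages until recovery has completed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types for illustration only; these are not Tapir's definitions.
enum class Status { Normal, ViewChange, Recovering };

struct StartViewChange {
    std::uint64_t view;  // the view the sender wants to move to
};

class Replica {
public:
    explicit Replica(Status status, std::uint64_t view = 0)
        : status_(status), view_(view) {}

    // Returns true if the message was acted on, false if it was dropped.
    bool HandleStartViewChange(const StartViewChange &msg) {
        // A recovering replica may have forgotten operations it prepared
        // before the crash, so it must not take part in a view change.
        if (status_ == Status::Recovering) {
            return false;
        }
        // Simplification: only react to strictly higher views.
        if (msg.view <= view_) {
            return false;
        }
        view_ = msg.view;
        status_ = Status::ViewChange;
        return true;
    }

    // Called once the recovery protocol has restored state at least as
    // recent as the state the replica had when it failed.
    void CompleteRecovery(std::uint64_t recovered_view) {
        assert(status_ == Status::Recovering);
        view_ = recovered_view;
        status_ = Status::Normal;
    }

    Status status() const { return status_; }
    std::uint64_t view() const { return view_; }

private:
    Status status_;
    std::uint64_t view_;
};
```

With this guard, a replica constructed as `Replica r(Status::Recovering, 3)` ignores `StartViewChange{5}` and stays in view 3. After `r.CompleteRecovery(4)`, the same message moves it into view-change status for view 5.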

I found this while working on TigerBeetle's implementation of Viewstamped Replication, as I was doing a survey of existing implementations. By the way, Tapir's implementation of VSR is really nice and clean.

On a similar note, if anyone is interested, we just launched a $20k consensus challenge over at https://github.com/coilhq/viewstamped-replication-made-famous, where if you can find a correctness bug in an implementation of VSR you could earn bounties of up to $3,000.

The live launch event on Saturday also featured special interviews with Brian Oki and James Cowling, if you're a fan of the pioneering protocol and would like to watch: https://www.youtube.com/watch?v=_Jlikdtm4OA
