diff --git a/UIPS/UIP-0113.md b/UIPS/UIP-0113.md
index a82d723..90eef1a 100644
--- a/UIPS/UIP-0113.md
+++ b/UIPS/UIP-0113.md
@@ -11,8 +11,44 @@ created: ~2023.8.9
 
 ## Abstract
 
-Imposing a request/response discipline on all %ames messages and packets provides legibility at every layer, making the entire network easier to reason about, extending the urbit network to new use-cases and enabling a system that is more stable, reliable, and scalable.
+The "directed messaging" project consists of a full-stack rewrite of Urbit's networking intended for much higher (100-1000x) throughput and increased connection stability. The "directed" term refers to the directedness of the connection between a request and a response, i.e. the request goes in one direction and the response goes in the opposite direction.
+By imposing a request/response discipline on all %ames messages and packets, this design provides legibility at every layer, making the entire network stack easier to reason about, extending the Urbit network to new use-cases and enabling a system that is more stable, reliable, and scalable.
+
+At a high level:
+
+- Stateful relays (i.e. galaxies, stars) relieve publishers of any routing responsibility and make peer-discovery reliable. (This works particularly well with multi-hop forwarding through the sponsorship hierarchy.)
+- Packet legibility enables trivial request/response and packet/message correlation. (A fully-qualified, global, referentially-transparent namespace path uniquely identifies each response packet, and is trivially upgraded to the path of a semantic, complete network message.)
+- The request/response discipline locates all congestion control and retransmission at the client edge, where congestion is most likely.
+- Trivial packet-to-message correlation enables alternate transport and off-loop "de-packetization" (message assembly), our largest performance wins.
+- The entirety of the publisher-side implementation is a layering of small, stateless functions.
+- Existing semantics and interfaces for both %ames and fine can be straightforwardly rebased onto this model.
+
+## Motivation
+
+As of 2023 Urbit has two communication protocols: Ames, for sending commands and receiving acknowledgments; and Fine, a "remote scry protocol" for reading data out of other ships -- specifically, out of their scry namespaces. Both protocols are forms of one-to-one communication between two Urbit nodes. They both use Urbit's "galaxy" supernodes for peer discovery, and for packet relaying in case the nodes are behind firewalls.
+
+The designs and implementations of these protocols impose enormous performance overhead, causing both to have quite low throughput: on the order of a megabit per second, even on a gigabit internet connection. The directed messaging project removes these overheads almost entirely, bringing the throughput much closer to the maximum supported by the underlying hardware.
+
+The first change this new design makes is to tag every packet as either a request or a response. For reads, a request contains a "scry path" (Urbit's version of a URL), and a response contains the data that path refers to. For writes (commands), a request contains the contents of the command, and a response contains an acknowledgment. Tagging each packet this way enables "directed routing".
+
+Directed routing uses a simplification of the routing design from Van Jacobson's [Named Data Networking](https://named-data.net/project/) project, in which a request is routed to the next relay by looking at its request path (in Urbit's case, a scry path), and a response is routed back to wherever the request came from. This means a response takes the exact reverse path through the network of the request that it satisfies.
+
+Directed routing has a number of advantages over Ames's current routing, which has proven complex and finicky. The biggest advantage is that by applying the request/response constraint, the whole routing problem becomes simpler, making the routing semantics in turn easier to reason about and implement correctly. For example, unlike in the current design, publisher nodes never need to persist routes to subscriber nodes: each subscriber remembers the route to the publisher, and the publisher only needs to hold, temporarily, the immediate transport address (IP+port) from which it heard a request packet in order to route responses back to it; the other relays handle returning the response to the original requesting node.
+
+The second big change in this new design is to assign a scry path to every packet and message -- this explicitly lays out all packets and messages in Urbit's scry namespace. One reason to do this is that it allows all large pieces of data to be "pulled" by the node that will receive them, rather than "pushed" by the sending node, as is now the case for Ames commands. With everything as pull rather than push, packet-level operations can be abstracted away from senders, which means there can be a single state machine for managing downloads, no matter whether the datum being downloaded is an Ames command to be performed (a write) or another kind of data to be read from another node's namespace (a read).
+
+This state machine for managing downloads uses a congestion control algorithm to determine how many request packets to send at what rate, to maximize throughput without overloading the underlying hardware. In all previous protocols, this state machine has been part of the Ames vane (kernel module) inside Urbit's Arvo kernel. Arvo is a transactional system, so every time an Urbit node receives a response packet and wants to figure out how many new request packets it should send to max out the connection, it must first write the incoming response packet to disk. Since each packet contains a data payload of a kilobyte, this means the system needs to do a disk write for every kilobyte -- absurdly high overhead for a production system.
+
+With everything moved to pull, the congestion control state machine can be moved out of Urbit's kernel into its runtime -- the processor-native code that runs Urbit. In addition to getting rid of the per-packet disk writes, this means Urbit's packet handling will no longer need to engage in Unix interprocess communication, which has some overhead. Nor will it need to run any Nock: the packet processing state machine can be written in a hot loop in C, sidestepping the biggest slowdowns of the current system.
+
+The final remaining slowness of current Urbit packet processing lies in packet authentication: when a node receives a packet from another node, how does it verify that it was actually the sender node who sent the packet, not a malicious actor who forged a packet to deny service or otherwise interfere with healthy network operation?
+
+In the current Fine protocol, for performing reads, every packet contains a digital signature. This prevents forgery, but at the cost of multiple milliseconds for each packet -- another slapstick-level performance disaster.
+
+In directed messaging, a novel packet authentication scheme called "LockStep" reduces packet authentication time to a single Blake3 hash operation, which is orders of magnitude faster.
+
+These changes combine to ensure bulletproof peer-to-peer routing and low performance overhead, getting Urbit out of the way and letting application programmers make effective use of the networking hardware.
 
 ## Specification
 
@@ -29,7 +65,7 @@ A new message layer is introduced, unifying %ames and fine. %ames' existing flow
 
 ##### %peek: `path`
 
-A %peek message is a request for a %page at a path in the global namespace. It is unauthenticated and anonymous. In the future, request authentication could be used to gate access to computational resources and enable QoS, but never for access control to data itself.
+A %peek message is a read request: a request for a %page at a path in the global namespace. It is unauthenticated and anonymous. In the future, request authentication could be used to gate access to computational resources and enable QoS, but never for access control to data itself.
 
 %peek can be injected as an event, but must not change formal state. It should be handled by dereferencing the request path via arvo's `+peek` arm on the publishing ship.
 
 A %peek message produces at-most-one %page response -- blocking/crashing requests are dropped.
 
@@ -37,7 +73,7 @@ A %peek message is a request for a %page at a path in the global namespace. It i
 
 ##### %page: `[oath path page]`
 
-A %page message is a public, authenticated binding of marked data in the urbit namespace -- in practical terms, a response to a %peek or %poke request. The binding between path and data must never change, violation of this principle should be shared widely and have severe reputational consequences for the offending party.
+A %page message is a read response: a public, authenticated binding of marked data in the urbit namespace -- in practical terms, a response to a %peek or %poke request. The binding between path and data must never change; violation of this principle should be shared widely and have severe reputational consequences for the offending party.
 
 A %page message is injected as an event. If it correlates to an outstanding request -- via an exact match of its path -- it is processed with arbitrary stateful semantics. If it does not correlate, it is silently dropped.
 
@@ -45,54 +81,163 @@ A %page message is injected as an event. If it correlates to an outstanding requ
 
 ##### %poke: `[path oath path page]`
 
-A %poke messages pushes a %page message from one requester to requestee, specifying the path by which the requestee will acknowledge the %poke. It must be straightforward to unambiguously correlate the payload path to the prefix of the request path, such that the authentication of the payload is sufficient to authenticate the request (as is trivially the case for %ames' flows).
+A %poke message is a write request: it is used to implement a command+acknowledgment communication protocol, where the requesting ship sends a command and the receiving ship performs the command and replies with an acknowledgment.
+
+The poke protocol is initiated by the requesting ship sending a %peek to the receiving ship to try to read the acknowledgment out of the receiver's namespace. The receiving ship recognizes this request as a poke and injects it as a stateful Arvo event rather than trying to read it out of Arvo as it usually would for a %peek packet. To process the request, the receiving Arvo emits a new %peek request back to the sending ship, to fetch the command datum out of the sender's namespace. Once the full datum has been downloaded, the receiving ship attempts to perform the request. If it succeeds, it sends a positive acknowledgment; if it fails, it sends a "negative" acknowledgment containing an error message.
+
+An optimization for small messages is also included in the protocol: the first packet of the initiating peek request also includes the first fragment of the command datum. If the command is under a kilobyte, the entire command is included in the first packet, recovering the long-standing property of Ames protocols that small commands are processed and acknowledged in a single network roundtrip.
+
+A %poke pushes a %page message from one requester to requestee, specifying the path by which the requestee will acknowledge the %poke. It must be straightforward to unambiguously correlate the payload path to the prefix of the request path, such that the authentication of the payload is sufficient to authenticate the request (as is trivially the case for %ames' flows).
 
 A %poke message can be resolved by dereferencing the request path via arvo's `+peek` arm on the publish ship. If the path can be resolved, the message has been processed and the result is the response %page. If not, the %poke must be injected as an event.
 
 A valid %poke message produces exactly one %page response -- crashing requests are converted to negative acknowledgments.
 
-%poke generalizes %ames' %plea and %boon messages, simultaneously rendering "nacksplanations" superfluous.
+%poke generalizes %ames' %plea (command) and %boon (response) messages, without changing the interface presented to other vanes. Making the acknowledgment a scry response instead of a bespoke packet type allows for multi-packet error messages, rendering Ames's current "nacksplanations" machinery superfluous.
+
+
+## Modules
+
+The major logical modules involved in the system are the following:
+- client state machine (client, vane + driver)
+- flow state machine (both client and server, vane)
+- publisher namespace (server, vane)
+- relay (driver, vane possible)
+
+## Client State Machine
+
+This module is responsible for performing a request for data at a scry path on another ship. Its caller supplies the path to be resolved, and when the module has finished resolving that path, it delivers the data bound to that path back to the caller.
+
+To perform the request, the module issues as many request packets as needed to retrieve all the fragments of the response message. It is responsible for re-sending each packet on a timer until a response is received. The formal state machine in the Ames vane is specified with minimal timing details, just enough to ensure packet-by-packet progress toward a complete message. A production-level implementation of the Ames I/O driver in the runtime will need to implement a congestion control algorithm to achieve high bandwidth.
+
+The general pattern of interaction between the Ames vane and a production-level I/O driver is that Ames initiates a request by encoding and emitting the first packet of a request to the driver, then sends one packet at a time, re-sending on a timer, until it receives the corresponding response packet.
+
+Ames re-sends all unacknowledged packets on a single repeating global timer that fires every two minutes. This uses the same design as current Ames's "dead flow" timer, where a dead flow is a flow that has not received any response packets for 30 seconds. All flows will be formally treated the same as dead flows in this new design -- live flows will be recognized and accelerated by the runtime using congestion control.
+
+The driver, when it hears a response packet over the wire, will recognize the flow as live and implicitly assume responsibility for congestion control, i.e. setting fine-grained packet re-send timers, tracking statistics about the communication channel, and using the statistics to determine how many further request packets to emit at what times. Once the driver has received a complete message, it injects the message as a single Arvo event. This clears any packet-level state in Ames for that message.
+
+Note that because a production I/O driver intercepts them, Arvo will not hear response packets for a message until the message has been completed. This means that if Vere is shut down and restarted, it will not remember the message, and it will restart the download from scratch.
+
+When the I/O driver receives the first response packet for a message, it could inject the packet into Arvo as an event. A production-level I/O driver will likely opt instead for scrying into Arvo to ask Arvo to validate any HMAC on the first packet. The I/O driver does not contain any private keys in its state, as a security boundary, so it needs to ask Arvo to use its keys. This IPC request to scry into Arvo can be fired off concurrently with congestion control sending the next few request packets, to prevent unneeded latency.
+
+Later, the driver will hear the scry response from Arvo. If the packet validates, the driver will update its state on this message to reflect the confirmation. If it fails to validate, the driver will cancel its attempt to download this message, removing all state regarding it. Any duplicate responses received after cancellation are dropped. The flow is now considered dead, so Ames will take over re-sending once every two minutes.
+
+Keeping congestion control in the driver and out of the formal model allows improvement in congestion control design without modifying the Ames protocol or any code in the kernel. Even at Kelvin zero, a clever new congestion control algorithm could emerge and gain prominence.
+
+There is potentially minor extra overhead from duplicate timers: Ames's global dead-flow re-send timer might fire a few times during the download of a very large message. We do not expect noticeable performance issues from this, and notably, the Ames vane will only ever have one timer set in this design, as compared to thousands on some ships with current Ames.
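The division of labor described above -- a single coarse re-send timer in the formal model, with the runtime driver assuming fine-grained pacing once a flow is live -- can be sketched as follows. This is an illustrative model only, not part of the specification; the class, function, and path names (and the flat fragment bookkeeping) are invented here:

```python
DEAD_FLOW_RESEND = 120.0  # the vane's single global re-send interval, seconds
LIVE_WINDOW = 30.0        # a flow with no response for 30s is considered dead


class Flow:
    """Client-side download state for one message (hypothetical sketch)."""

    def __init__(self, path):
        self.path = path           # scry path being resolved
        self.unacked = set()       # fragment numbers requested but unanswered
        self.last_response = None  # time the last response packet was heard

    def live(self, now):
        # Live flows are paced by the driver's congestion control;
        # the formal model treats every flow as if it were dead.
        return (self.last_response is not None
                and now - self.last_response < LIVE_WINDOW)


def vane_global_timer(flows, now):
    """Fires every DEAD_FLOW_RESEND seconds: re-send every unacknowledged
    request on every dead flow, skipping live flows entirely, since the
    runtime driver is already re-sending those on fine-grained timers."""
    return [(flow.path, frag)
            for flow in flows if not flow.live(now)
            for frag in sorted(flow.unacked)]
```

Note that under this split the formal model needs no per-flow timers: one global timer plus the live/dead predicate is enough to guarantee progress, while all throughput-sensitive pacing stays outside the kernel.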
+
+
+State Transitions:
+- on-poke
+  - if our ack-path
+    - create request for payload-path (at message level) and begin processing as if a %page
+    - XX inline auth packet if required, ensure synchronous validation of first
+- on-page
+  - check for pending request (peek|poke)
+    - if none, drop
+  - if first fragment
+    - authenticate
+    - initialize hash-tree
+    - request auth-packet if necessary
+  - else if auth-packet
+    - finish initializing hash-tree
+  - else
+    - if out-of-order, stash to process later or just drop
+    - incorporate incremental hash update into hash-tree
+    - validate fragment via hash-tree
+    - save fragment
+  - if incomplete, request next fragment
+  - else produce completed message
+- send requests
+
+
+## Next Module (TODO)
+
+TODO
 
 #### packet structure and semantics
 
 Packets recapitulate the structure of messages exactly, specifying their precise serialization and fragmentation, with sufficient metadata (and sized for) straightforward, maximal deliverabity over current networks.
 
-Every packet has a maximum size of 1.472 bytes, and begins with a 4 byte header. The header must encode (in no particular order):
-
-- the protocol version (3 bits)
-- the packet type (2 bits)
-- the publisher rank (2 bits, only for requests)
-- a truncated mug checksum (20 bits)
-- hopcount (5-7 bits available)
-
-The remaining 1.468 bytes are allocated as follows:
-
-- %peek: { publisher[<=16], tag-byte[1], path[<=328] }
-- %poke: { publisher[<=16], path[<=328], total[4], authenticator[96], fragment[<=1024] }
-- %page: { next-hop[6], path[<=328], total[4], authenticator[96], fragment[<=1024] }
-
-These values are as follows:
-
-- publisher: ship from the root of the request type, variable length, based on rank in header
-- %peek tag-byte: future-proofing for signed requests, possibly hash-based exclusions
-- path: length-prefixed (2 bytes) request path, with 4-byte fragment number
-  - in the case of %poke: the request and payload paths are concatenated
-- total: number of fragments in %page response (max size: 4TiB)
-- authenticator: see discussion below
-- fragment: bloq 13 slice of the serialized message
-
-outstanding questions:
-
-- precise order and interpretation of header bits
-  - hopcount semantics (actual + max; saturating counter?)
-- authenticator details
-- fixed length limits
-  - %page
-    - max path length could increase by ten
-    - or next-hop could increase by 12 bytes and prepare for ipv6 (offsetting path)
-  - %poke structure
-    - currently, request and payload paths are concatenated to simplify max-length calculation
-    - if specified as variable length
-      - could be two separate paths
-      - or simply two concatenated packets (%peek and %page)
+Every packet has an 8-byte header. The first 4 bytes are used as follows:
+
+- 2 bits reserved
+- 2 bits next-hop
+  - 0b00 - no next-hop
+  - 0b01 - one, 6-byte next-hop at the end
+  - 0b10 - one, single-byte-length-prefixed next-hop at the end
+  - 0b11 - multiple single-byte-length-prefixed next-hops at the end
+- 3 bits for the protocol version
+  - 0-7
+- 2 bits for the packet type
+  - 0b00 - reserved (for pine, or open-ended packet-type-tag in body?)
+  - 0b01 - response %page
+  - 0b10 - request %peek
+  - 0b11 - request %poke
+- 3 bits for saturating hopcount
+  - 0-6 precise, 7 is >=
+- 20-bit truncated mug
+  - least-significant bits
+
+The next 4 bytes are a constant token (or "cookie") to identify the ames protocol suite:
+
+- 4-byte cookie
+  - `~tasfyn-partyv` i.e. `0x51ad.1d5e` i.e. `{ 0x5e, 0x1d, 0xad, 0x51 }`.
+
+Every packet has a theoretical limit of XX, but a practical limit of 1472 bytes (1500-byte de-facto MTU due to ethernet frame size). In practice, the remaining 1468 bytes are allocated as follows:
+
+- %peek: { encoded-path, optional-peek-attributes }
+- %page: { encoded-path, encoded-response, optional-page-attributes }
+- %poke: { peek, page }
+
+- encoded-path: { meta-byte, ship, rift, path-length, path, bloq, fragment-number }
+  - meta-byte: { rank[2], rift-length[2], path-length-length[1], bloq-length[1], fragment-length[2] }
+    - rank: 2 bits
+      - `(dec (met 0 (met 4 ship)))`
+    - rift-length: 2 bits
+      - `(dec (met 3 rift))`
+    - path-length-length: 1 bit
+    - bloq-length: 1 bit
+    - fragment-number-length: 2 bits
+      - `(dec (met 3 fragment-number))`
+  - ship: encoded in `(bex +(rank))` bytes
+  - rift: encoded in `1-4` bytes
+  - path-length: encoded in `(bex path-length-length)` bytes
+  - path: serialized without leading slash, path-length bytes
+  - bloq:
+    - bloq-length 0b0: implicit bloq size 13, 0 bytes used
+    - bloq-length 0b1: bloq size encoded in 1 byte
+  - fragment-number: 1-4 bytes
+- optional-peek-attributes:
+  - undefined
+- encoded-response: { meta-byte, total-fragments, authentication-length, authentication, fragment-length, fragment }
+  - meta-byte: { total-fragments-length[2], authentication-length-length[1], fragment-length[5] }
+    - total-fragments-length: 2 bits
+      - `(dec (met 3 total-fragments))`
+    - authentication-length-length: 1 bit
+    - fragment-length: 5 bits
+  - total-fragments: the total number of fragments in the message
+    - encoded in `1-4` bytes
+  - authentication-length: 0-1 bytes
+  - authentication: { tag[1], value[0-254] }
+    - tag:
+      - ed25519 signature
+      - digest (blake3)
+      - hmac (blake3)
+      - intermediate blake3 parent nodes
+      - combination signature + root hash
+      - ...
+  - fragment-length: 0-32 bytes
+  - fragment: 0 - 2^252 bytes
+- optional-page-attributes: next-hop
+- next-hop:
+  - dependent on header bits
+    - 0b00: not-present
+    - 0b01: 6 bytes, no length tag
+    - 0b10: { length[1], next-hop[length] }
+    - 0b11: multiple length-prefixed next-hops
+  - XX length-prefixed values should have tag bytes
+
+XX: next-hop only on response
+XX: total-bits only on first response
 
 #### routing topology and mechanisms
 
@@ -285,6 +430,9 @@ These features and changes may well be developed and released in the opposite or
 
 Existing %fine packets can be routed statefully in this model, but can only be multicast with each other (unless relays can convert bedirectionally between old and new structures). Existing %ames packets will continue be routed stateless, as (mutual) requests.
 
+## Reference Implementation
+
+Initial types, packet en/de-coding, and associated prototypes are on [`jb/dire`](https://github.com/urbit/urbit/compare/f58fc8b4628...jb/dire).
 
 ## Security Considerations