feat: unreliable datagram support (QUIC + libp2p-datagram)#6489
feat: unreliable datagram support (QUIC + libp2p-datagram)#6489royzah wants to merge 15 commits into
Conversation
Signed-off-by: Royyan Zahir <muhammad.royyan@tii.ae>
Signed-off-by: Royyan Zahir <muhammad.royyan@tii.ae>
…hangelog Signed-off-by: Royyan Zahir <muhammad.royyan@tii.ae>
Signed-off-by: Royyan Zahir <muhammad.royyan@tii.ae>
jxs
left a comment
There was a problem hiding this comment.
Hi, and thanks for starting this!
as commented on your specs post, wdyt of implementing libp2p/specs#680 and go from there? You're in need for this in production right? That would be a nice testbed for the spec
|
Nice, that's the direction we were hoping for. The quic/core/swarm bits here are the datagram primitive #680 needs anyway, so I'll keep those and rework the libp2p-datagram crate to match #680: |
|
specs#680 is implemented here: Two spec notes: one app-protocol per behaviour for now, and we drop on a bad/unknown stream id rather than raising PROTOCOL_VIOLATION (not reachable from this layer). |
Handler.outbound grew without limit while the /dg/1 control stream was still pending, so a peer that never opens it could pin memory. Cap at MAX_OUTBOUND_BACKLOG and head-drop the oldest; delivery is unreliable, so dropping is acceptable.
There was a problem hiding this comment.
A lot of nice things in the PR already!
I agree with the datagram support at the Muxer and Transport level implementations.
But I read the spec definition of the /dg/1 protocol control stream as a protocol that each application protocol may negotiate individually, not a standalone protocol with its own Behaviour. I.e:
// listen_protocol advertises both protocols
fn listen_protocol(&self) -> SubstreamProtocol<Self::InboundProtocol, Self::InboundOpenInfo> {
SelectUpgrade::new(
GossipsubUpgrade,
DgControlUpgrade { app_id: "/meshsub/1.0.0" },
)
}Where DgControlUpgrade::upgrade_inbound reads and validates [uvarint] [app_proto_id] from the stream per the spec. The handler then tracks control_stream_id and handles SendDatagram/Datagram events itself, framing/parsing the QUIC varint inline.
However, I'm torn on one point. The decentralized approach means every datagram-enabled handler reimplements the same pattern (manage /dg/1 control stream, filter by control_stream_id). An alternative would be to add a method like supports_datagrams(protocol: StreamProtocol) -> bool to the ConnectionHandler trait.
The Swarm could then aggregate which protocols each sub-handler supports datagrams for, and the connection layer could route Datagram events directly (parsing the QUIC varint once and dispatching to the correct handler by matching control_stream_id against known datagram-supporting streams). This centralizes routing and avoids every handler filtering every datagram, at the cost of more trait machinery.
What do you think?
Thanks!
| } | ||
| Poll::Ready(ConnectionHandlerEvent::SendDatagram(data)) => { | ||
| if let Err(e) = muxing.send_datagram_unpin(data) { | ||
| tracing::debug!("failed to send datagram: {e}"); |
| with: | ||
| # `libp2p-datagram` is new and unpublished, so it has no crates.io | ||
| # baseline to diff against. Remove this once it is first released. | ||
| exclude: libp2p-datagram |
There was a problem hiding this comment.
let's remove this, it's preferable to let CI fail while we haven't published than have this dangling afterwards and not catch semver errors
|
Thanks @jxs. We went decentralized deliberately, and measured it. Per handler, the Numbers, arm64 over two mesh bearers: 100 Mbit/s at 0% loss on Wi-Fi 802.11s, 5 Mbit/s on sub-GHz. At a 1200-byte cap that's ~10k datagram/s, so the filter is ~0.1% of a core. The QUIC path and the link are the bottleneck, never routing. Centralized |
A failed send_datagram_unpin is a local send rejection, not a routine drop, so surface it at error rather than debug.
Drop the exclude: better to let semver-checks fail until the crate is first published than leave a dangling exclude that masks later errors.
Thanks for te measurements! Yeah let's then tackle the centralized approach to see if it's clean enough the internal routing and matching the existing behaviors. Thanks! |
|
@jxs sounds good, going with the centralized route. Plan:
One thing to decide: the connection only sees the combined handler, so the final hop goes through |
|
Sounds good. The shape I'm picturing:
Question though, connection only sees the combined handler, so the last hop goes through |
Description
Adds unreliable datagram support to libp2p, end to end:
core:StreamMuxer::send_datagram/max_datagram_sizeandStreamMuxerEvent::Datagram(default: unsupported).quic: implements them over quinn datagrams, enabled by default and configurable.swarm:ConnectionEvent::Datagram(inbound) andConnectionHandlerEvent::SendDatagram(outbound).libp2p-datagramcrate: aBehaviour+Controlwithsend_datagram(peer, ...)and anIncomingDatagramsstream.Motivation and prior art: libp2p/specs#626 (IP over libp2p), #225, and the MTU edge in #4647. Tunnelling IP over reliable, ordered streams is 2-10x slower (HOL blocking; TCP-over-TCP meltdown); a datagram carrier fixes it.
Notes
quic::pollnow surfaces connection close via the datagram future (was alwaysPending).Open questions
quic::Config; happy to make it opt-in.#[non_exhaustive]?