
Improve sendsync performance#151

Open
omnibs wants to merge 5 commits into
trunkfrom
improve-sendsync-performance

@omnibs omnibs commented May 7, 2026

Context

Our synchronous writes to Kafka have been slow since forever, and we used to think that was on Kafka itself: high throughput but high latency.

Turns out it was a simple mistake on our side.

Synchronous sends are synchronous only in the sense that Erlang/Elixir's synchronous messaging is: they wrap an async send and block waiting on an async ack.

When a Kafka broker acks a message, librdkafka's I/O thread enqueues a delivery-report event onto an internal queue. That event sits there until some thread calls rd_kafka_poll, which drains the queue and dispatches callbacks. Those callbacks are what unblock pending sendSync calls.
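The mechanism above can be sketched with nothing but base. This is a simulation, not nri-kafka's or librdkafka's code: the `Chan` stands in for the internal delivery-report queue, `pollOnce` for rd_kafka_poll, and `sendAsync`, `DeliveryReport` and `demo` are hypothetical names introduced only for illustration.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forever)
import System.Timeout (timeout)

-- | A delivery report, modeled as the callback to run once it is polled.
type DeliveryReport = IO ()

-- | Hypothetical async send: a forked thread plays both the broker and the
-- I/O thread, enqueueing the report after a simulated 5ms round trip.
-- The callback is only enqueued here, never run.
sendAsync :: Chan DeliveryReport -> String -> IO () -> IO ()
sendAsync reports _payload onAck = do
  _ <- forkIO $ do
    threadDelay 5000
    writeChan reports onAck
  pure ()

-- | Stand-in for the poll call: wait for one report and run its callback.
pollOnce :: Chan DeliveryReport -> IO ()
pollOnce reports = do
  callback <- readChan reports
  callback

-- | "Synchronous" send: fire the async send, then block until the callback
-- fills the MVar. Nothing fills it unless some thread is polling.
sendSync :: Chan DeliveryReport -> String -> IO ()
sendSync reports payload = do
  acked <- newEmptyMVar
  sendAsync reports payload (putMVar acked ())
  takeMVar acked

-- | Runs one sendSync against a background poller. Nothing would mean the
-- send did not return within one second.
demo :: IO (Maybe ())
demo = do
  reports <- newChan
  _ <- forkIO (forever (pollOnce reports))
  timeout 1000000 (sendSync reports "hello")
```

The point of the sketch is that sendSync's latency is bounded below by how quickly the poller runs the callback, not only by how quickly the ack arrives.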

The issue

In our codebase, pollEvents is what triggers rd_kafka_poll, and we used to have it like this:

pollEvents producer = do
  Producer.produceMessageBatch producer []
    |> map (\_ -> ())
  Control.Concurrent.threadDelay 100_000 {- 100ms -}
  pollEvents producer

That 100ms delay means sendSync might take up to 100ms to be woken up after an ack comes back from the broker.

The fix

We could have simply reduced the delay, but there's another function called flushProducer, which does something similar to what we did, but in a better way.

flushProducer blocks on a call to rd_kafka_poll for up to 100ms. If no events arrive, it times out and you can call it again. If an event arrives, it returns immediately.

The 100ms timeout might seem superfluous, but ceding control back to Haskell allows us to respond to async exceptions and terminate our producers cleanly.
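To make the "block with a timeout, return early on an event" behaviour concrete, here is a base-only simulation. It does not call hw-kafka or librdkafka: `Events`, `pollWithTimeout`, `eventLoop` and `wakeUpLatencyMs` are hypothetical names, and the `Chan` is a stand-in for the delivery-report queue. One caveat: `readChan` is interruptible in Haskell anyway, so the sketch shows the latency behaviour but not the reason the timeout is needed around a blocking foreign call.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forever, void)
import GHC.Clock (getMonotonicTimeNSec)
import System.Timeout (timeout)

-- | Queue of pending callbacks, standing in for delivery-report events.
type Events = Chan (IO ())

-- | Blocking poll: wait up to the given number of microseconds for one
-- event. Returns False on timeout so the caller regains control.
pollWithTimeout :: Int -> Events -> IO Bool
pollWithTimeout micros events = do
  next <- timeout micros (readChan events)
  case next of
    Nothing -> pure False
    Just callback -> callback >> pure True

-- | Event loop: each iteration blocks for at most 100ms and returns early
-- as soon as an event arrives.
eventLoop :: Events -> IO ()
eventLoop events = forever (void (pollWithTimeout 100000 events))

-- | Time, in milliseconds, until a waiter is woken when the event is
-- enqueued 10ms after the clock starts.
wakeUpLatencyMs :: IO Double
wakeUpLatencyMs = do
  events <- newChan
  woken <- newEmptyMVar
  _ <- forkIO (eventLoop events)
  start <- getMonotonicTimeNSec
  _ <- forkIO $ do
    threadDelay 10000
    writeChan events (putMVar woken ())
  takeMVar woken
  end <- getMonotonicTimeNSec
  pure (fromIntegral (end - start) / 1e6)
```

With a sleep-then-drain loop, the same waiter could stay blocked for up to one full sleep interval after the event was enqueued. Here the wake-up follows the event closely.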

hw-kafka's own example producer uses this function.

Benchmarks

We created a simple benchmark script here to test sync sends with and without the fix.

$ KAFKA_BROKER_ADDRESSES=localhost:9092 BENCHMARK_MESSAGE_COUNT=200 BENCHMARK_WARMUP=20 cabal run sync-write-benchmark -fsync-write-benchmark

Without the fix

Benchmarking sync writes: warmup=20 count=200 topic=nri-kafka-sync-benchmark
Warming up...
Measuring 200 sends...
count=200  min=100.1ms  avg=101.1ms  p50=101.1ms  p95=101.4ms  p99=102.1ms  max=102.1ms

With the fix

Benchmarking sync writes: warmup=20 count=200 topic=nri-kafka-sync-benchmark
Warming up...
Measuring 200 sends...
count=200  min=5.6ms  avg=6.3ms  p50=6.2ms  p95=6.9ms  p99=8.6ms  max=8.8ms

This opens up the possibility of adopting synchronous Kafka writes more broadly.
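For readers unfamiliar with the min/avg/p50/p95/p99/max columns, here is one way such a summary can be computed from nanosecond samples. It is a sketch under assumptions, not the benchmark script's actual code: the nearest-rank percentile method and the names `nearestRank` and `summarize` are chosen for illustration, and the script may compute its figures differently.

```haskell
import Data.List (sort)
import Data.Word (Word64)

-- | Nearest-rank percentile over an already sorted, non-empty list.
-- rank = ceiling (p / 100 * n), computed with integer arithmetic.
nearestRank :: Int -> [Word64] -> Word64
nearestRank p sorted =
  let n = length sorted
      rank = (p * n + 99) `div` 100
      index = max 0 (min (n - 1) (rank - 1))
   in sorted !! index

-- | Convert nanoseconds to milliseconds.
nsToMs :: Word64 -> Double
nsToMs ns = fromIntegral ns / 1e6

-- | Summary of latency samples in milliseconds. Returns Nothing for an
-- empty sample list, so the average never divides by zero.
summarize :: [Word64] -> Maybe [(String, Double)]
summarize [] = Nothing
summarize samples =
  let sorted = sort samples
      n = length sorted
      avgNs = sum sorted `div` fromIntegral n
   in Just
        [ ("min", nsToMs (minimum sorted))
        , ("avg", nsToMs avgNs)
        , ("p50", nsToMs (nearestRank 50 sorted))
        , ("p95", nsToMs (nearestRank 95 sorted))
        , ("p99", nsToMs (nearestRank 99 sorted))
        , ("max", nsToMs (maximum sorted))
        ]
```

With only 200 samples, p99 under nearest-rank is the 198th smallest value, so it sits close to the maximum, as it does in both runs above.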

New issue

We uncovered a new issue while digging into this: we're not handling delivery failures in sendSync.

We just wrote a known-issues.md documenting future work.

omnibs added 3 commits May 7, 2026 13:37
Copilot AI review requested due to automatic review settings May 7, 2026 17:00

Copilot AI left a comment


Pull request overview

This PR reduces Kafka.sendSync latency by changing the producer event-polling loop to drain librdkafka delivery-report events promptly, and adds a small benchmark executable to measure per-message sync-write latency.

Changes:

  • Replaces the pollEvents loop’s fixed 100ms threadDelay + empty-batch trick with Producer.flushProducer to wake sendSync waiters as soon as delivery reports arrive.
  • Adds a sync-write-benchmark executable (behind a Cabal flag) to benchmark sequential sendSync latency.
  • Updates Cabal metadata (package.yaml + generated .cabal) to include the new flag/executable.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| nri-kafka/src/Kafka.hs | Switches the background event loop to use flushProducer and documents the rationale/shutdown behavior. |
| nri-kafka/scripts/sync-write-benchmark/Main.hs | New benchmark program for measuring sendSync per-message latency. |
| nri-kafka/package.yaml | Adds a flag-gated sync-write-benchmark executable definition. |
| nri-kafka/nri-kafka.cabal | Regenerated Cabal file reflecting the new flag/executable. |


Comment on lines +111 to +116
formatStats :: [Data.Word.Word64] -> String
formatStats samples =
  let sorted = Data.List.sort samples
      n = Prelude.length sorted
      total = Prelude.sum sorted
      avg = nsToMs (total `Prelude.div` fromIntegral n)