Commit 3e0f7f4

Restructure narrative: two flamegraphs, diagnose then confirm
Add diff flamegraph comparing perf-lab vs local (both unpatched) right after mpstat to show where kernel time goes. Move existing before/after flamegraph to after the fix as visual confirmation.
1 parent 97d9fc9 commit 3e0f7f4

2 files changed: 31 additions & 23 deletions
content/post/hidden-cost-rootless-container-networking/index.adoc

@@ -42,11 +42,19 @@ Both environments run Quarkus at 2.3GHz with the same workload and CPU pinning.
 | Perf-lab (RHEL, 24,472 TPS) | 87-94% | 5-11% | 0-2% | 0%
 |===
 
-`%usr` is time running application code. `%sys` is time in the kernel. On perf-lab, over 85% of CPU goes to the application. Locally, nearly half goes to the kernel — and the application has idle CPU it cannot use. Same application, same clock speed, same workload: **the local environment is burning CPU in the kernel instead of running the app.** We isolated the network path next.
+`%usr` is time running application code. `%sys` is time in the kernel. On perf-lab, over 85% of CPU goes to the application. Locally, nearly half goes to the kernel — and the application has idle CPU it cannot use. Same application, same clock speed, same workload: **the local environment is burning CPU in the kernel instead of running the app.**
+
+== Where is the kernel time going?
+
+A https://www.brendangregg.com/flamegraphs.html[differential flamegraph] of the JFR CPU profiles (collected via https://github.com/async-profiler/async-profiler[async-profiler]) from the perf-lab and local Quarkus runs shows exactly where the extra kernel time is spent:
+
+image::diff-flamegraph-gap.png[Differential flamegraph: perf-lab vs local]
+
+Red frames appear more in the local run; blue frames appear more in the perf-lab run. The red hotspots are all in the kernel network path: `tcp_sendmsg`, `ip_output`, softirq `net_rx_action`, and firewall evaluation (`nf_hook_slow`, `nft_do_chain`). The local environment is doing **real network I/O work** — sending and receiving TCP packets through extra hops — that the perf-lab doesn't need to do.
 
 == Isolating the network layer with pgbench
 
-To confirm the network path was the bottleneck, we ran `pgbench` with the same 2-query workload (50 clients, prepared statements, 30 seconds) over different network paths. We also tested with Fedora's https://wiki.nftables.org/[nftables] firewall disabled, since the JFR flamegraph showed `nft_do_chain` in the kernel stacks:
+To confirm the network path was the bottleneck, we ran `pgbench` with the same 2-query workload (50 clients, prepared statements, 30 seconds) over different network paths. We also tested with Fedora's https://wiki.nftables.org/[nftables] firewall disabled, since the flamegraph showed `nft_do_chain` in the kernel stacks:
 
 [cols="2,1,1", options="header"]
 |===
@@ -59,27 +67,6 @@ To confirm the network path was the bottleneck, we ran `pgbench` with the same 2
 
 With `--network=host`, statement latency drops from 1.38ms to 0.47ms — a 3x reduction. With 2 statements per HTTP request, that overhead adds up on every request.
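The pgbench run described above can be sketched as follows. This is an editor's illustration, not the commit's actual script: the script file name, host, and database name are placeholders.

```shell
# Hypothetical reconstruction of the pgbench run described above:
# 50 clients, prepared statements, 30 seconds, custom 2-query script.
# two-queries.sql, the host, and the database name are placeholders.
pgbench -h 127.0.0.1 -p 5432 \
  -c 50 -j 4 \
  -M prepared \
  -T 30 \
  -n -r \
  -f two-queries.sql \
  appdb
```

The `-r` flag makes pgbench report per-statement latencies, which is the kind of number (1.38ms vs 0.47ms) compared above; `-n` skips the vacuum step that only applies to the built-in TPC-B-like workload.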
 
-== The flamegraph tells the story
-
-JFR CPU profiles (collected via https://github.com/async-profiler/async-profiler[async-profiler]) from the default and host-networking Quarkus runs were compared using a https://www.brendangregg.com/flamegraphs.html[differential flamegraph]. Red frames appear more in the default (pasta) configuration; blue frames appear more with host networking.
-
-image::diff-flamegraph.png[Differential flamegraph: pasta vs host networking]
-
-Red means more CPU in the default (pasta) run; blue means more CPU with host networking. The red stacks split into two groups: the pasta proxy overhead — extra `tcp_sendmsg`, `ip_output`, and softirq `net_rx_action` from the two additional kernel/userspace boundary crossings — and the firewall overhead — `nf_hook_slow` and `nft_do_chain` from Fedora's 973 nftables rules. Both disappear with `--network=host`, because the app and postgres share the same network namespace and packets never leave the kernel.
-
-Per-request CPU cost confirms the picture:
-
-[cols="2,1", options="header"]
-|===
-| Configuration | CPU ms/req
-
-| Default pasta (15,504 TPS) | 0.231
-| Host networking (24,116 TPS) | 0.158
-| Perf-lab (24,472 TPS) | 0.158
-|===
-
-With host networking, per-request cost **matches the perf-lab exactly**: 0.158 ms/req.
-
 == Root cause: pasta, the userspace TCP proxy
 
 Rootless podman on Fedora uses https://passt.top/passt/[pasta (passt)] to forward container ports. Unlike rootful podman (which uses kernel-level port forwarding), pasta is a userspace process that proxies every TCP packet:
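For concreteness, the two configurations being compared look roughly like this. The image tag, container names, and credentials are editor's placeholders, not taken from the post.

```shell
# Default rootless setup: the published port goes through pasta,
# a userspace proxy that copies every TCP payload between the host
# and the container's network namespace.
podman run -d --name pg -e POSTGRES_PASSWORD=secret \
  -p 5432:5432 docker.io/library/postgres:16

# The fix: share the host network namespace. Postgres listens on the
# host's 127.0.0.1:5432 directly — no userspace proxy, no port mapping.
podman run -d --name pg-host -e POSTGRES_PASSWORD=secret \
  --network=host docker.io/library/postgres:16
```

Note that `--network=host` trades away network isolation for throughput, so it suits a local benchmarking workstation better than a shared host.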
@@ -134,6 +121,27 @@ Run the postgres container with `--network=host` instead of port-mapping (`-p 54
 
 **With host networking, the local Fedora workstation matches the perf-lab.** The remaining gap to the perf-lab's 2.08x ratio is accounted for by nftables (Fedora's 973 rules vs RHEL's minimal ruleset) and minor kernel differences.
 
+Per-request CPU cost confirms the picture:
+
+[cols="2,1", options="header"]
+|===
+| Configuration | CPU ms/req
+
+| Default pasta (15,504 TPS) | 0.231
+| Host networking (24,116 TPS) | 0.158
+| Perf-lab (24,472 TPS) | 0.158
+|===
+
+With host networking, per-request cost **matches the perf-lab exactly**: 0.158 ms/req.
+
+== Confirming the fix
+
+A second differential flamegraph — this time comparing the local default (pasta) run with the local `--network=host` run — confirms the overhead is gone:
+
+image::diff-flamegraph.png[Differential flamegraph: default pasta vs host networking]
+
+Red means more CPU in the default (pasta) run; blue means more CPU with host networking. The red stacks that dominated the first flamegraph — `tcp_sendmsg`, `ip_output`, `net_rx_action`, `nf_hook_slow` — have disappeared. With `--network=host`, the app and postgres share the same network namespace; packets never leave the kernel.
+
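To reproduce a differential flamegraph like the two shown here, collapsed-stack profiles from the two runs can be diffed with Brendan Gregg's FlameGraph tools. The commands below are an editor's sketch, assuming async-profiler's collapsed output format; file names and PIDs are placeholders.

```shell
# Collect collapsed (folded) CPU stacks from each Quarkus run
# with async-profiler's launcher (PIDs are placeholders).
asprof -e cpu -d 30 -o collapsed -f pasta.collapsed   <pid-of-pasta-run>
asprof -e cpu -d 30 -o collapsed -f hostnet.collapsed <pid-of-hostnet-run>

# Diff the folded stacks and render an SVG, using the scripts from
# https://github.com/brendangregg/FlameGraph
./difffolded.pl pasta.collapsed hostnet.collapsed \
  | ./flamegraph.pl > diff-flamegraph.svg
```

Which run renders red versus blue depends on the argument order to `difffolded.pl`, so check the result against a frame you already know grew or shrank.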
 == Takeaways
 
 * **A benchmark that doesn't stress what it claims to stress will deliver misleading results.** This is a textbook case of what Brendan Gregg calls https://www.brendangregg.com/activebenchmarking.html[active benchmarking]:
