`%usr` is time running application code. `%sys` is time in the kernel. On perf-lab, over 85% of CPU goes to the application. Locally, nearly half goes to the kernel — and the application has idle CPU it cannot use. Same application, same clock speed, same workload: **the local environment is burning CPU in the kernel instead of running the app.**
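
These figures come from `mpstat`. A minimal sketch of the invocation, assuming the standard sysstat tool (the sampling interval and count here are illustrative, not the exact values used):

[source,bash]
----
# Per-CPU utilization, sampled once per second for 30 seconds;
# compare the %usr and %sys columns across environments
mpstat -P ALL 1 30
----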
== Where is the kernel time going?
A https://www.brendangregg.com/flamegraphs.html[differential flamegraph] of the JFR CPU profiles (collected via https://github.com/async-profiler/async-profiler[async-profiler]) from the perf-lab and local Quarkus runs shows exactly where the extra kernel time is spent:
image::diff-flamegraph-gap.png[Differential flamegraph: perf-lab vs local]
Red frames appear more in the local run; blue frames appear more on the perf-lab. The red hotspots are all in the kernel network path: `tcp_sendmsg`, `ip_output`, softirq `net_rx_action`, and firewall evaluation (`nf_hook_slow`, `nft_do_chain`). The local environment is doing **real network I/O work** — sending and receiving TCP packets through extra hops — that the perf-lab doesn't need to do.
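
The profiles and the diff can be reproduced along these lines. This is a sketch, not the exact pipeline: it assumes async-profiler's `asprof` launcher and the `difffolded.pl`/`flamegraph.pl` scripts from Brendan Gregg's https://github.com/brendangregg/FlameGraph[FlameGraph] repo, and it collects collapsed stacks directly instead of converting from JFR; pids, durations, and file names are placeholders:

[source,bash]
----
# 60s of CPU samples from each Quarkus JVM, in collapsed-stack format
asprof -e cpu -d 60 -o collapsed -f perf-lab.folded "$PERF_LAB_PID"
asprof -e cpu -d 60 -o collapsed -f local.folded "$LOCAL_PID"

# Baseline first, regressed run second: red = frames that grew locally
./difffolded.pl perf-lab.folded local.folded | ./flamegraph.pl > diff-flamegraph-gap.svg
----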
== Isolating the network layer with pgbench
To confirm the network path was the bottleneck, we ran `pgbench` with the same 2-query workload (50 clients, prepared statements, 30 seconds) over different network paths. We also tested with Fedora's https://wiki.nftables.org/[nftables] firewall disabled, since the flamegraph showed `nft_do_chain` in the kernel stacks:
[cols="2,1,1", options="header"]
|===
With `--network=host`, statement latency drops from 1.38ms to 0.47ms — a 3x reduction. At 2 statements per HTTP request, that is roughly 1.8ms of avoidable network-path overhead on every request.
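
The invocation behind these numbers looks roughly like the following sketch; the script name, database, and connection details are placeholders rather than the exact values from the run:

[source,bash]
----
# 50 clients, prepared statements, 30 seconds, custom 2-query script
# (-n skips vacuuming pgbench's built-in tables, unused by a custom script)
pgbench -h 127.0.0.1 -p 5432 -U postgres \
        -c 50 -M prepared -T 30 -n \
        -f two-queries.sql benchdb
----

Only the connection path (and the firewall state) changes between rows; the workload itself stays fixed.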
== Root cause: pasta, the userspace TCP proxy
Rootless podman on Fedora uses https://passt.top/passt/[pasta (passt)] to forward container ports. Unlike rootful podman (which uses kernel-level port forwarding), pasta is a userspace process that proxies every TCP packet.
Run the postgres container with `--network=host` instead of port-mapping (`-p 5432:5432`).
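
Concretely, the change looks like this sketch (container name and image tag are illustrative; `5432:5432` is the standard postgres port mapping):

[source,bash]
----
# Before: rootless port-mapping; every packet transits the pasta proxy
podman run -d --name pg -p 5432:5432 docker.io/library/postgres:16

# After: share the host network namespace; packets stay in the kernel
podman run -d --name pg --network=host docker.io/library/postgres:16
----

With `--network=host` the container binds the host's port 5432 directly, so no `-p` mapping is needed.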
**With host networking, the local Fedora workstation matches the perf-lab.** The remaining gap to the perf-lab's 2.08x ratio is accounted for by nftables (Fedora's 973 rules vs RHEL's minimal ruleset) and minor kernel differences.
Per-request CPU cost confirms the picture:
[cols="2,1", options="header"]
|===
| Configuration | CPU ms/req
| Default pasta (15,504 TPS) | 0.231
| Host networking (24,116 TPS) | 0.158
| Perf-lab (24,472 TPS) | 0.158
|===
With host networking, per-request cost **matches the perf-lab exactly**: 0.158 ms/req.
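
As a cross-check (assuming CPU ms/req is total CPU time consumed divided by requests served), multiplying back by throughput shows both configurations burning a similar total CPU budget:

----
0.231 ms/req × 15,504 req/s ≈ 3.58 CPU-seconds per second  (default pasta)
0.158 ms/req × 24,116 req/s ≈ 3.81 CPU-seconds per second  (host networking)
----

The pasta run isn't short on CPU; it spends about 1.46x as much of it per request.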
== Confirming the fix
A second differential flamegraph — this time comparing the local default (pasta) run with the local `--network=host` run — confirms the overhead is gone:
image::diff-flamegraph.png[Differential flamegraph: default pasta vs host networking]
Red means more CPU in the default (pasta) run; blue means more CPU with host networking. The red stacks that dominated the first flamegraph — `tcp_sendmsg`, `ip_output`, `net_rx_action`, `nf_hook_slow` — have disappeared. With `--network=host`, the app and postgres share the same network namespace; packets never leave the kernel.
== Takeaways
* **A benchmark that doesn't stress what it claims to stress will deliver misleading results.** This is a textbook case of what Brendan Gregg calls https://www.brendangregg.com/activebenchmarking.html[active benchmarking]: