fix(mysql): probe over TCP so a moved socket can't kill a healthy DB #225

Merged
saadqbal merged 2 commits into develop from fix/mysql-probe-socket-tcp
Jun 9, 2026

@LukasWodka


Summary

All three MySQL probes (startupProbe, livenessProbe, readinessProbe) ran mysqladmin ping -h localhost. For the MySQL client, localhost means the unix socket rather than TCP, so the probe connected over the client-default socket /var/run/mysqld/mysqld.sock. The mysql-client image actually writes its socket to /var/lib/mysql/mysql.sock, so the probe could never reach a perfectly healthy mysqld: the startupProbe exhausted its failure budget (~130s) and the kubelet kill-looped the database, cascading jobs-manager and requests-proxy into CrashLoopBackOff.

This switches all three probes to TCP (mysqladmin ping -h 127.0.0.1, port 3306), which is immune to wherever the image places its socket, and adds a comment so it isn't reverted.
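
For readers without the diff open, the change is roughly of the following shape. This is a minimal sketch, not the chart's actual template: the container name and the surrounding structure are assumptions, and the timing fields (periodSeconds, failureThreshold, and so on) are left out on purpose because their real values are not stated here. Only the probe command, `mysqladmin ping -h 127.0.0.1`, comes from this PR.

```yaml
# Sketch of a container spec fragment. Everything except the probe command
# is assumed for illustration.
containers:
  - name: mysql-client              # assumed container name
    ports:
      - containerPort: 3306
    # Probe over TCP, not the unix socket. "-h localhost" makes mysqladmin
    # use its default socket path, which breaks if the image moves the
    # socket. "-h 127.0.0.1" forces TCP on the default port 3306.
    startupProbe:
      exec:
        command: ["mysqladmin", "ping", "-h", "127.0.0.1"]
    livenessProbe:
      exec:
        command: ["mysqladmin", "ping", "-h", "127.0.0.1"]
    readinessProbe:
      exec:
        command: ["mysqladmin", "ping", "-h", "127.0.0.1"]
```

The in-template comment matters as much as the command: without it, `-h 127.0.0.1` looks like a stylistic variant of `-h localhost` and is easy to "clean up" back to the broken form.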

Why it slipped

Mutable image tag :prod + imagePullPolicy: IfNotPresent: cached clusters keep an older working image, but every fresh install pulling the current :prod is broken. Reproduced on a clean cluster (RAM and proxy/NO_PROXY both ruled out). Full root-cause writeup: tracebloc/backend#767.

Type

  • Bug fix (regression)

Test plan

  • helm template renders cleanly (1998 lines, valid YAML) with ci/bm-values.yaml; all three rendered probes show -h 127.0.0.1.
  • mysqladmin ping returns "mysqld is alive" (exit 0) over TCP without credentials — same auth behavior as the previous -h localhost, just TCP transport.
  • Verified live on an affected cluster via the equivalent runtime patch: mysql-client went 1/1 with 0 restarts and the cascade recovered.

Checklist

  • Targets develop
  • No customer identifiers
  • Reviewer to confirm CI (helm-ci, installer-tests) green

Follow-ups (tracked in backend#767, not in this PR)

  • Pin mysql-client by digest (images.mysqlClient.digest) — a mutable :prod tag is how this shipped silently.
  • A secrets-gated full-stack E2E (waits for mysql-client/jobs-manager Ready) under the install-journey epic #736 would have caught it.
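
The digest-pinning follow-up could look something like the values fragment below. Treat it as a sketch under assumptions: only the key path images.mysqlClient.digest is named in this PR, while the repository and tag keys, the registry path, and the way the template consumes the digest are hypothetical. The digest itself is deliberately left as a placeholder.

```yaml
# Hypothetical values.yaml fragment. Only images.mysqlClient.digest is
# named in the PR; the other keys and all values are placeholders.
images:
  mysqlClient:
    repository: registry.example.com/mysql-client   # placeholder registry path
    tag: prod                                        # mutable tag, kept for readability
    # When set, the template would render the image as
    # <repository>@<digest>. A digest reference always resolves to the
    # same image content, so a re-pushed :prod tag cannot change what a
    # fresh install pulls.
    digest: "sha256:<digest-of-a-known-good-image>"
```

With a digest reference, imagePullPolicy: IfNotPresent stops masking the problem, because cached and fresh clusters resolve to identical image content.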

The startup/liveness/readiness probes ran `mysqladmin ping -h localhost`,
which connects via the client-default unix socket /var/run/mysqld/mysqld.sock.
The mysql-client image actually writes its socket to /var/lib/mysql/mysql.sock,
so the probe can never reach a healthy mysqld: the startup probe exhausts
(~130s) and the kubelet kill-loops the database, cascading jobs-manager and
requests-proxy into CrashLoopBackOff. Switch all three probes to TCP
(-h 127.0.0.1, port 3306) — immune to where the image places the socket —
and document why so it is not reverted.

Reproduced on a fresh install; cached environments were masked by the
mutable :prod tag + imagePullPolicy: IfNotPresent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@LukasWodka


👋 Heads-up — Code review queue is at 9 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

…loop regression guard, backend#767)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@saadqbal saadqbal merged commit a2dae56 into develop Jun 9, 2026
14 checks passed
2 participants