fix(mysql): probe over TCP so a moved socket can't kill a healthy DB #225
Merged
The startup/liveness/readiness probes ran `mysqladmin ping -h localhost`, which connects via the client-default unix socket `/var/run/mysqld/mysqld.sock`. The mysql-client image actually writes its socket to `/var/lib/mysql/mysql.sock`, so the probe can never reach a healthy mysqld: the startup probe runs out of attempts (~130s) and the kubelet kill-loops the database, cascading jobs-manager and requests-proxy into CrashLoopBackOff.

Switch all three probes to TCP (`-h 127.0.0.1`, port 3306), which is immune to where the image places the socket, and document why so it is not reverted.

Reproduced on a fresh install; in cached environments the bug was masked by the mutable `:prod` tag + `imagePullPolicy: IfNotPresent`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
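As a rough illustration of the change described above, the three probes might look like the following after the switch. This is a sketch, not the chart's actual template: the timing fields and the explicit `--port=3306` are assumptions (3306 is the client default anyway), and only the `mysqladmin ping -h 127.0.0.1` command comes from the description.

```yaml
# Sketch of the three probes on the mysql-client container.
# Timing fields are illustrative assumptions, not the chart's real values.
startupProbe:
  exec:
    # -h 127.0.0.1 forces a TCP connection. With -h localhost the MySQL
    # client uses its default unix socket path instead, which is what broke
    # when the image wrote its socket elsewhere. Do not revert to localhost.
    command: ["mysqladmin", "ping", "-h", "127.0.0.1", "--port=3306"]
  periodSeconds: 10        # assumed
  failureThreshold: 12     # assumed
livenessProbe:
  exec:
    command: ["mysqladmin", "ping", "-h", "127.0.0.1", "--port=3306"]
  periodSeconds: 10        # assumed
readinessProbe:
  exec:
    command: ["mysqladmin", "ping", "-h", "127.0.0.1", "--port=3306"]
  periodSeconds: 5         # assumed
```

Keeping the comment next to the command is the point of the "document why" part: the `localhost` form looks equivalent at a glance but selects a different transport.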
👋 Heads-up: the Code review queue is at 9 / 8, above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)
This was referenced Jun 8, 2026
…loop regression guard, backend#767) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This was referenced Jun 8, 2026
saadqbal approved these changes Jun 9, 2026
This was referenced Jun 9, 2026
Summary

All three MySQL probes (`startupProbe`, `livenessProbe`, `readinessProbe`) ran `mysqladmin ping -h localhost`, which connects over the client-default unix socket `/var/run/mysqld/mysqld.sock`. The `mysql-client` image actually writes its socket to `/var/lib/mysql/mysql.sock`, so the probe can never reach a perfectly healthy mysqld: the `startupProbe` exhausts (~130s) and the kubelet kill-loops the database, cascading `jobs-manager` and `requests-proxy` into CrashLoopBackOff.

This switches all three probes to TCP (`mysqladmin ping -h 127.0.0.1`, port 3306), which is immune to wherever the image places its socket, and adds a comment so it isn't reverted.

Why it slipped

Mutable image tag `:prod` + `imagePullPolicy: IfNotPresent`: cached clusters keep an older working image, but every fresh install pulling the current `:prod` is broken. Reproduced on a clean cluster (RAM and proxy/`NO_PROXY` both ruled out). Full root-cause writeup: tracebloc/backend#767.

Type

Test plan

- `helm template` renders cleanly (1998 lines, valid YAML) with `ci/bm-values.yaml`; all three rendered probes show `-h 127.0.0.1`.
- `mysqladmin ping` returns "mysqld is alive" (exit 0) over TCP without credentials: same auth behavior as the previous `-h localhost`, just TCP transport.
- `mysql-client` went `1/1` with 0 restarts and the cascade recovered.

Checklist

- `develop`
- (`helm-ci`, `installer-tests`) green

Follow-ups (tracked in backend#767, not in this PR)

- Pin `mysql-client` by digest (`images.mysqlClient.digest`); a mutable `:prod` tag is how this shipped silently.
- … `mysql-client`/`jobs-manager` Ready) under the install-journey epic #736 would have caught it.
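
The digest-pinning follow-up could look roughly like this. It is a hypothetical sketch: only the key path `images.mysqlClient.digest` comes from the follow-up list, while the values layout, the container name, and the `<registry>`, `<repository>`, and `<verified-digest>` placeholders are assumptions to be replaced with real values.

```yaml
# values.yaml fragment (hypothetical layout; only the key path is from the PR)
images:
  mysqlClient:
    # Placeholder: the sha256 digest of an image verified on a fresh install.
    digest: "sha256:<verified-digest>"
---
# Rendered container spec (sketch). A digest reference is immutable, so a
# cached node and a fresh install resolve to exactly the same image.
containers:
  - name: mysql-client               # assumed container name
    image: "<registry>/<repository>@sha256:<verified-digest>"
    imagePullPolicy: IfNotPresent    # no longer masks changes once the reference is a digest
```

With a digest in place, changing the image requires an explicit values change that shows up in review, instead of a silent move of the `:prod` tag.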