fix(docker): supervise Java process in entrypoints instead of tail -f /dev/null#3051
fix(docker): supervise Java process in entrypoints instead of tail -f /dev/null#3051bitflicker64 wants to merge 2 commits into
Conversation
… /dev/null Problem: all three docker-entrypoint.sh files used tail -f /dev/null to keep the container alive. When Java crashed, tail kept running and the container stayed up with no Java inside — Docker restart policy never fired. PD and Store fix: replace daemon start + tail -f /dev/null with -d false, which triggers the foreground branch added in chunks 2-3. exec java replaces the shell with Java directly, so the entrypoint blocks and exits with Java's real exit code when Java exits. Server fix: keep daemon start (wait-partition.sh must run after Java is up). Replace tail -f /dev/null with tail --pid=$(cat ./bin/pid) -f /dev/null. When Java exits, tail --pid returns, entrypoint exits 1, restart policy fires. Note: server entrypoint always exits 1 (cannot get Java's real exit code since start-hugegraph.sh calls disown before returning). This is sufficient for --restart=on-failure. Verified locally: kill -9 Java inside running container -> container exits -> Docker restarts it automatically (tested 3 times for server, PD restart loop confirmed for pd). Related to: apache#3043
imbajin
left a comment
There was a problem hiding this comment.
Blocking: no. Summary: server Docker entrypoint still needs graceful shutdown forwarding. Evidence: static review of hugegraph-server/hugegraph-dist/docker/docker-entrypoint.sh; bash -n passed for all three changed entrypoints.
| tail -f /dev/null | ||
| PID=$(cat ./bin/pid 2>/dev/null || true) | ||
| if [[ -n "$PID" ]]; then | ||
| tail --pid="$PID" -f /dev/null |
There was a problem hiding this comment.
Evidence: PD and Store now execute their foreground scripts with -d false, but the server entrypoint still starts start-hugegraph.sh in daemon mode and then blocks on tail --pid="$PID" -f /dev/null at this line. Docker sends SIGTERM to PID 1, so this shell/tail path can exit without signaling the Java PID; the server then waits for Docker's forced kill instead of a graceful shutdown.
Please add a TERM/INT trap that kills and waits for $PID, or otherwise run the server foreground path after the post-start checks, so docker stop shuts HugeGraphServer down cleanly.
There was a problem hiding this comment.
Good catch added a TERM/INT trap before the tail --pid block in f20b8a4:
PID=$(cat ./bin/pid 2>/dev/null || true)
if [[ -n "$PID" ]]; then
trap 'kill -TERM "$PID" 2>/dev/null; while kill -0 "$PID" 2>/dev/null; do sleep 1; done; exit 0' TERM INT
tail --pid="$PID" -f /dev/null
exit 1
fiWhen dumb-init forwards SIGTERM from docker stop, the trap fires: sends SIGTERM to Java, polls with kill -0 until Java exits cleanly, then exits 0. The tail --pid crash-detection path still exits 1 so restart policy fires on a crash.
tail --pid does not forward signals to the supervised process. Add a TERM/INT trap that sends SIGTERM to Java and polls with kill -0 until Java exits cleanly before the entrypoint exits. Addresses review feedback from imbajin on apache#3049.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #3051 +/- ##
=============================================
+ Coverage 36.11% 93.25% +57.14%
+ Complexity 338 65 -273
=============================================
Files 803 9 -794
Lines 68234 267 -67967
Branches 8964 22 -8942
=============================================
- Hits 24640 249 -24391
+ Misses 40936 8 -40928
+ Partials 2658 10 -2648 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Purpose of the PR
All three
docker-entrypoint.shfiles usedtail -f /dev/nullto keep the container alive. When Java crashed,tailkept running and the container stayed up with no Java inside — Docker restart policy never fired.Main Changes
PD and Store: replace daemon start +
tail -f /dev/nullwith-d false, which triggers the foreground branch added in #3047.exec javareplaces the shell with Java directly, so the entrypoint blocks and exits with Java's real exit code when Java exits.Server: keep daemon start (
wait-partition.shmust run after Java is already up in the background). Replacetail -f /dev/nullwithtail --pid=$(cat ./bin/pid) -f /dev/null. When Java exits,tail --pidreturns, entrypoint exits 1, restart policy fires.Note: server entrypoint always exits 1 —
start-hugegraph.shcallsdisownbefore returning so Java is not a child of the entrypoint shell and its real exit code is not available. This is sufficient for--restart=on-failure.tail -f /dev/null-d false→exec javablockstail -f /dev/null-d false→exec javablockstail -f /dev/nulltail --pidsupervisionVerifying these changes
docker run --restart=on-failure→kill -9Java inside container → container exits and Docker restarts it automaticallyDoes this PR potentially affect the following parts?
Documentation Status
Doc - TODO— README/docs update for Docker deployment lifecycle will follow in a separate PR