Skip to content

fix(docker): supervise Java process in entrypoints instead of tail -f /dev/null#3051

Open
bitflicker64 wants to merge 2 commits into
apache:masterfrom
bitflicker64:fix/docker-entrypoints-supervision
Open

fix(docker): supervise Java process in entrypoints instead of tail -f /dev/null#3051
bitflicker64 wants to merge 2 commits into
apache:masterfrom
bitflicker64:fix/docker-entrypoints-supervision

Conversation

@bitflicker64
Copy link
Copy Markdown
Contributor

Purpose of the PR

All three docker-entrypoint.sh files used tail -f /dev/null to keep the container alive. When Java crashed, tail kept running and the container stayed up with no Java inside — Docker restart policy never fired.

Main Changes

PD and Store: replace daemon start + tail -f /dev/null with -d false, which triggers the foreground branch added in #3047. exec java replaces the shell with Java directly, so the entrypoint blocks and exits with Java's real exit code when Java exits.

Server: keep daemon start (wait-partition.sh must run after Java is already up in the background). Replace tail -f /dev/null with tail --pid=$(cat ./bin/pid) -f /dev/null. When Java exits, tail --pid returns, entrypoint exits 1, restart policy fires.

Note: server entrypoint always exits 1 — start-hugegraph.sh calls disown before returning so Java is not a child of the entrypoint shell and its real exit code is not available. This is sufficient for --restart=on-failure.

Component Before After
PD daemon start + tail -f /dev/null -d falseexec java blocks
Store daemon start + tail -f /dev/null -d falseexec java blocks
Server daemon start + tail -f /dev/null daemon start + tail --pid supervision

Verifying these changes

  • Need tests and can be verified as follows:
    • docker run --restart=on-failurekill -9 Java inside container → container exits and Docker restarts it automatically
    • Verified locally: 3 kill-9 tests on server container, PD restart loop confirmed

Does this PR potentially affect the following parts?

  • Dependencies
  • Modify configurations
  • The public API
  • Other affects
  • Nope

Documentation Status

  • Doc - TODO — README/docs update for Docker deployment lifecycle will follow in a separate PR

… /dev/null

Problem: all three docker-entrypoint.sh files used tail -f /dev/null to
keep the container alive. When Java crashed, tail kept running and the
container stayed up with no Java inside — Docker restart policy never fired.

PD and Store fix: replace daemon start + tail -f /dev/null with
-d false, which triggers the foreground branch added in chunks 2-3.
exec java replaces the shell with Java directly, so the entrypoint
blocks and exits with Java's real exit code when Java exits.

Server fix: keep daemon start (wait-partition.sh must run after Java is
up). Replace tail -f /dev/null with tail --pid=$(cat ./bin/pid) -f
/dev/null. When Java exits, tail --pid returns, entrypoint exits 1,
restart policy fires.

Note: server entrypoint always exits 1 (cannot get Java's real exit code
since start-hugegraph.sh calls disown before returning). This is
sufficient for --restart=on-failure.

Verified locally: kill -9 Java inside running container -> container
exits -> Docker restarts it automatically (tested 3 times for server,
PD restart loop confirmed for pd).

Related to: apache#3043
@dosubot dosubot Bot added size:S This PR changes 10-29 lines, ignoring generated files. ci-cd Build or deploy labels Jun 5, 2026
Copy link
Copy Markdown
Member

@imbajin imbajin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: no. Summary: server Docker entrypoint still needs graceful shutdown forwarding. Evidence: static review of hugegraph-server/hugegraph-dist/docker/docker-entrypoint.sh; bash -n passed for all three changed entrypoints.

tail -f /dev/null
PID=$(cat ./bin/pid 2>/dev/null || true)
if [[ -n "$PID" ]]; then
tail --pid="$PID" -f /dev/null
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Forward shutdown to HugeGraphServer

Evidence: PD and Store now execute their foreground scripts with -d false, but the server entrypoint still starts start-hugegraph.sh in daemon mode and then blocks on tail --pid="$PID" -f /dev/null at this line. Docker sends SIGTERM to PID 1, so this shell/tail path can exit without signaling the Java PID; the server then waits for Docker's forced kill instead of a graceful shutdown.

Please add a TERM/INT trap that kills and waits for $PID, or otherwise run the server foreground path after the post-start checks, so docker stop shuts HugeGraphServer down cleanly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch added a TERM/INT trap before the tail --pid block in f20b8a4:

PID=$(cat ./bin/pid 2>/dev/null || true)
if [[ -n "$PID" ]]; then
    trap 'kill -TERM "$PID" 2>/dev/null; while kill -0 "$PID" 2>/dev/null; do sleep 1; done; exit 0' TERM INT
    tail --pid="$PID" -f /dev/null
    exit 1
fi

When dumb-init forwards SIGTERM from docker stop, the trap fires: sends SIGTERM to Java, polls with kill -0 until Java exits cleanly, then exits 0. The tail --pid crash-detection path still exits 1 so restart policy fires on a crash.

tail --pid does not forward signals to the supervised process. Add a
TERM/INT trap that sends SIGTERM to Java and polls with kill -0 until
Java exits cleanly before the entrypoint exits.

Addresses review feedback from imbajin on apache#3049.
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.25%. Comparing base (1f61c48) to head (f20b8a4).

Additional details and impacted files
@@              Coverage Diff              @@
##             master    #3051       +/-   ##
=============================================
+ Coverage     36.11%   93.25%   +57.14%     
+ Complexity      338       65      -273     
=============================================
  Files           803        9      -794     
  Lines         68234      267    -67967     
  Branches       8964       22     -8942     
=============================================
- Hits          24640      249    -24391     
+ Misses        40936        8    -40928     
+ Partials       2658       10     -2648     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-cd Build or deploy size:S This PR changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants