feat(docker): add HEALTHCHECK and remove unused cron from Dockerfiles by bitflicker64 · Pull Request #3052 · apache/hugegraph

bitflicker64 · 2026-06-05T15:40:46Z

Purpose of the PR

relates to [Bug] Docker containers never restart when Java process crashes — broken cron monitor + tail -f /dev/null supervision #3043

docker ps always shows Up even when Java has crashed inside the container because all three Dockerfiles had no HEALTHCHECK. Operators cannot distinguish a healthy container from a zombie. Also, cron was installed in all three images but is never used in Docker deployments — the cron-based monitor (-m true) is for VM/bare-metal only.

Main Changes

Add HEALTHCHECK to hugegraph-server/Dockerfile, hugegraph-pd/Dockerfile, hugegraph-store/Dockerfile:
- Server: curl http://localhost:8080/versions
- PD: curl http://localhost:8620/v1/health
- Store: curl http://localhost:8520/v1/health
- Fallback: if HTTP is not yet up but Java is alive (kill -0 on pid file), report healthy — avoids false unhealthy during startup
- Endpoints match what is already used in docker/docker-compose.yml
Remove cron from apt-get install in all three Dockerfiles — shrinks image size and reduces attack surface

Verifying these changes

Need tests and can be verified as follows:
- docker run container → docker inspect --format='{{.State.Health.Status}}' → reports healthy once Java is up
- kill -9 Java inside container → status transitions to unhealthy within --interval * --retries seconds

Does this PR potentially affect the following parts?

Documentation Status

Doc - No Need

Without HEALTHCHECK, docker ps always shows 'Up' even when Java has crashed inside the container. Add HEALTHCHECK to all three Dockerfiles: - Server: curl http://localhost:8080/versions - PD: curl http://localhost:8620/v1/health - Store: curl http://localhost:8520/v1/health Fallback: if HTTP is not yet up but Java is alive (kill -0 on pid file), report healthy. Avoids false unhealthy during startup. Remove cron: Docker containers use foreground mode (-d false) after the entrypoint fix. The cron-based monitor is for VM/bare-metal only and is never started in Docker — removing it shrinks the image and reduces attack surface. Endpoints match what is already used in docker/docker-compose.yml. Related to: apache#3043

codecov · 2026-06-05T15:46:59Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 39.08%. Comparing base (1f61c48) to head (f6f7c14).

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #3052      +/-   ##
============================================
+ Coverage     36.11%   39.08%   +2.96%     
- Complexity      338      622     +284     
============================================
  Files           803      814      +11     
  Lines         68234    69385    +1151     
  Branches       8964     9161     +197     
============================================
+ Hits          24640    27116    +2476     
+ Misses        40936    39428    -1508     
- Partials       2658     2841     +183

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Extend docker-build-ci.yml to inspect image metadata after build and fail if HEALTHCHECK is not configured (except Dockerfile-hstore). Zero CI time cost — docker inspect runs on the already-built image. Prevents accidental removal of HEALTHCHECK in future PRs. Related to: apache#3043

Missed in the previous commit. Dockerfile-hstore has the same runtime stage as hugegraph-server/Dockerfile (same WORKDIR, same port 8080), so the same HEALTHCHECK applies. Also remove the hstore exclusion from docker-build-ci.yml — all four Dockerfiles now have HEALTHCHECK and are checked unconditionally. Related to: apache#3043

imbajin

Blocking: yes. Summary: The image healthchecks currently mark a live JVM as healthy even when the configured HTTP health endpoint is failing. Evidence: static review of hugegraph-server/Dockerfile:69-71 and existing compose healthchecks in docker/docker-compose.yml:49/77/99.

CI/status: codecov/project is failing on the latest head; please check 🔗 https://github.com/apache/hugegraph/runs/79772635318.

imbajin · 2026-06-06T20:49:22Z


+HEALTHCHECK --interval=15s --timeout=10s --start-period=90s --retries=3 \
+    CMD curl -fsS http://localhost:8080/versions >/dev/null \
+        || kill -0 "$(cat ./bin/pid 2>/dev/null)" 2>/dev/null


⚠️ Keep the healthcheck tied to the HTTP endpoint

Evidence: the existing compose healthchecks use curl ... || exit 1 for /versions and /v1/health, but the new Dockerfile healthchecks return success whenever ./bin/pid still points to a live JVM. This same fallback is added to the server, hstore, PD, and store images.

Impact: after the start period, a container can remain healthy even when the REST endpoint is down or wedged, which weakens the health signal this PR is adding. Please make the steady-state Dockerfile healthcheck fail on HTTP failure, or keep any process-only fallback outside the steady-state HEALTHCHECK path.

Good point --start-period=90s already covers the startup window, so the || kill -0 fallback is redundant and weakens the signal. Removing it from all four Dockerfiles so the HEALTHCHECK is purely HTTP-based. Pushing now.

The || kill -0 fallback allowed a container to stay healthy when the REST endpoint was down but Java was alive. --start-period=90s already covers the startup window, so the fallback is redundant and weakens the health signal. Addresses review feedback from imbajin on apache#3052.

Missed in 9b3f6c4 — same fix as the other three Dockerfiles. Addresses review feedback from imbajin on apache#3052.

dosubot Bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Jun 5, 2026

bitflicker64 mentioned this pull request Jun 5, 2026

[Bug] Docker containers never restart when Java process crashes — broken cron monitor + tail -f /dev/null supervision #3043

Open

10 tasks

bitflicker64 added 2 commits June 5, 2026 21:32

imbajin reviewed Jun 6, 2026

View reviewed changes

bitflicker64 added 2 commits June 7, 2026 11:59

fix(docker): remove kill-0 fallback from Dockerfile-hstore HEALTHCHECK

f6f7c14

Missed in 9b3f6c4 — same fix as the other three Dockerfiles. Addresses review feedback from imbajin on apache#3052.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(docker): add HEALTHCHECK and remove unused cron from Dockerfiles #3052

feat(docker): add HEALTHCHECK and remove unused cron from Dockerfiles #3052
bitflicker64 wants to merge 5 commits into
apache:masterfrom
bitflicker64:fix/docker-healthcheck

bitflicker64 commented Jun 5, 2026

Uh oh!

codecov Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

imbajin left a comment

Uh oh!

imbajin Jun 6, 2026

Uh oh!

bitflicker64 Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bitflicker64 commented Jun 5, 2026

Purpose of the PR

Main Changes

Verifying these changes

Does this PR potentially affect the following parts?

Documentation Status

Uh oh!

codecov Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

imbajin left a comment

Choose a reason for hiding this comment

Uh oh!

imbajin Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

bitflicker64 Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Jun 5, 2026 •

edited

Loading