Skip to content

feat(docker): add HEALTHCHECK and remove unused cron from Dockerfiles #3052

Open
bitflicker64 wants to merge 5 commits into
apache:masterfrom
bitflicker64:fix/docker-healthcheck
Open

feat(docker): add HEALTHCHECK and remove unused cron from Dockerfiles #3052
bitflicker64 wants to merge 5 commits into
apache:masterfrom
bitflicker64:fix/docker-healthcheck

Conversation

@bitflicker64
Copy link
Copy Markdown
Contributor

Purpose of the PR

docker ps always shows Up even when Java has crashed inside the container because all three Dockerfiles had no HEALTHCHECK. Operators cannot distinguish a healthy container from a zombie. Also, cron was installed in all three images but is never used in Docker deployments — the cron-based monitor (-m true) is for VM/bare-metal only.

Main Changes

  • Add HEALTHCHECK to hugegraph-server/Dockerfile, hugegraph-pd/Dockerfile, hugegraph-store/Dockerfile:
    • Server: curl http://localhost:8080/versions
    • PD: curl http://localhost:8620/v1/health
    • Store: curl http://localhost:8520/v1/health
    • Fallback: if HTTP is not yet up but Java is alive (kill -0 on pid file), report healthy — avoids false unhealthy during startup
    • Endpoints match what is already used in docker/docker-compose.yml
  • Remove cron from apt-get install in all three Dockerfiles — shrinks image size and reduces attack surface

Verifying these changes

  • Need tests and can be verified as follows:
    • docker run container → docker inspect --format='{{.State.Health.Status}}' → reports healthy once Java is up
    • kill -9 Java inside container → status transitions to unhealthy within --interval * --retries seconds

Does this PR potentially affect the following parts?

  • Dependencies
  • Modify configurations
  • The public API
  • Other affects
  • Nope

Documentation Status

  • Doc - No Need

Without HEALTHCHECK, docker ps always shows 'Up' even when Java has
crashed inside the container. Add HEALTHCHECK to all three Dockerfiles:
- Server: curl http://localhost:8080/versions
- PD:     curl http://localhost:8620/v1/health
- Store:  curl http://localhost:8520/v1/health

Fallback: if HTTP is not yet up but Java is alive (kill -0 on pid file),
report healthy. Avoids false unhealthy during startup.

Remove cron: Docker containers use foreground mode (-d false) after the
entrypoint fix. The cron-based monitor is for VM/bare-metal only and is
never started in Docker — removing it shrinks the image and reduces
attack surface.

Endpoints match what is already used in docker/docker-compose.yml.

Related to: apache#3043
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 39.08%. Comparing base (1f61c48) to head (f6f7c14).

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3052      +/-   ##
============================================
+ Coverage     36.11%   39.08%   +2.96%     
- Complexity      338      622     +284     
============================================
  Files           803      814      +11     
  Lines         68234    69385    +1151     
  Branches       8964     9161     +197     
============================================
+ Hits          24640    27116    +2476     
+ Misses        40936    39428    -1508     
- Partials       2658     2841     +183     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Extend docker-build-ci.yml to inspect image metadata after build and
fail if HEALTHCHECK is not configured (except Dockerfile-hstore).

Zero CI time cost — docker inspect runs on the already-built image.
Prevents accidental removal of HEALTHCHECK in future PRs.

Related to: apache#3043
Missed in the previous commit. Dockerfile-hstore has the same runtime
stage as hugegraph-server/Dockerfile (same WORKDIR, same port 8080),
so the same HEALTHCHECK applies.

Also remove the hstore exclusion from docker-build-ci.yml — all four
Dockerfiles now have HEALTHCHECK and are checked unconditionally.

Related to: apache#3043
Copy link
Copy Markdown
Member

@imbajin imbajin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: yes. Summary: The image healthchecks currently mark a live JVM as healthy even when the configured HTTP health endpoint is failing. Evidence: static review of hugegraph-server/Dockerfile:69-71 and existing compose healthchecks in docker/docker-compose.yml:49/77/99.


CI/status: codecov/project is failing on the latest head; please check 🔗 https://github.com/apache/hugegraph/runs/79772635318.

Comment thread hugegraph-server/Dockerfile Outdated

HEALTHCHECK --interval=15s --timeout=10s --start-period=90s --retries=3 \
CMD curl -fsS http://localhost:8080/versions >/dev/null \
|| kill -0 "$(cat ./bin/pid 2>/dev/null)" 2>/dev/null
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Keep the healthcheck tied to the HTTP endpoint

Evidence: the existing compose healthchecks use curl ... || exit 1 for /versions and /v1/health, but the new Dockerfile healthchecks return success whenever ./bin/pid still points to a live JVM. This same fallback is added to the server, hstore, PD, and store images.

Impact: after the start period, a container can remain healthy even when the REST endpoint is down or wedged, which weakens the health signal this PR is adding. Please make the steady-state Dockerfile healthcheck fail on HTTP failure, or keep any process-only fallback outside the steady-state HEALTHCHECK path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point --start-period=90s already covers the startup window, so the || kill -0 fallback is redundant and weakens the signal. Removing it from all four Dockerfiles so the HEALTHCHECK is purely HTTP-based. Pushing now.

The || kill -0 fallback allowed a container to stay healthy when the
REST endpoint was down but Java was alive. --start-period=90s already
covers the startup window, so the fallback is redundant and weakens
the health signal.

Addresses review feedback from imbajin on apache#3052.
Missed in 9b3f6c4 — same fix as the other three Dockerfiles.

Addresses review feedback from imbajin on apache#3052.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:S This PR changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants