feat(docker): add HEALTHCHECK and remove unused cron from Dockerfiles #3052
feat(docker): add HEALTHCHECK and remove unused cron from Dockerfiles #3052bitflicker64 wants to merge 5 commits into
Conversation
Without HEALTHCHECK, docker ps always shows 'Up' even when Java has crashed inside the container. Add HEALTHCHECK to all three Dockerfiles: - Server: curl http://localhost:8080/versions - PD: curl http://localhost:8620/v1/health - Store: curl http://localhost:8520/v1/health Fallback: if HTTP is not yet up but Java is alive (kill -0 on pid file), report healthy. Avoids false unhealthy during startup. Remove cron: Docker containers use foreground mode (-d false) after the entrypoint fix. The cron-based monitor is for VM/bare-metal only and is never started in Docker — removing it shrinks the image and reduces attack surface. Endpoints match what is already used in docker/docker-compose.yml. Related to: apache#3043
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #3052 +/- ##
============================================
+ Coverage 36.11% 39.08% +2.96%
- Complexity 338 622 +284
============================================
Files 803 814 +11
Lines 68234 69385 +1151
Branches 8964 9161 +197
============================================
+ Hits 24640 27116 +2476
+ Misses 40936 39428 -1508
- Partials 2658 2841 +183 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Extend docker-build-ci.yml to inspect image metadata after build and fail if HEALTHCHECK is not configured (except Dockerfile-hstore). Zero CI time cost — docker inspect runs on the already-built image. Prevents accidental removal of HEALTHCHECK in future PRs. Related to: apache#3043
Missed in the previous commit. Dockerfile-hstore has the same runtime stage as hugegraph-server/Dockerfile (same WORKDIR, same port 8080), so the same HEALTHCHECK applies. Also remove the hstore exclusion from docker-build-ci.yml — all four Dockerfiles now have HEALTHCHECK and are checked unconditionally. Related to: apache#3043
imbajin
left a comment
There was a problem hiding this comment.
Blocking: yes. Summary: The image healthchecks currently mark a live JVM as healthy even when the configured HTTP health endpoint is failing. Evidence: static review of hugegraph-server/Dockerfile:69-71 and existing compose healthchecks in docker/docker-compose.yml:49/77/99.
CI/status: codecov/project is failing on the latest head; please check 🔗 https://github.com/apache/hugegraph/runs/79772635318.
|
|
||
| HEALTHCHECK --interval=15s --timeout=10s --start-period=90s --retries=3 \ | ||
| CMD curl -fsS http://localhost:8080/versions >/dev/null \ | ||
| || kill -0 "$(cat ./bin/pid 2>/dev/null)" 2>/dev/null |
There was a problem hiding this comment.
Evidence: the existing compose healthchecks use curl ... || exit 1 for /versions and /v1/health, but the new Dockerfile healthchecks return success whenever ./bin/pid still points to a live JVM. This same fallback is added to the server, hstore, PD, and store images.
Impact: after the start period, a container can remain healthy even when the REST endpoint is down or wedged, which weakens the health signal this PR is adding. Please make the steady-state Dockerfile healthcheck fail on HTTP failure, or keep any process-only fallback outside the steady-state HEALTHCHECK path.
There was a problem hiding this comment.
Good point --start-period=90s already covers the startup window, so the || kill -0 fallback is redundant and weakens the signal. Removing it from all four Dockerfiles so the HEALTHCHECK is purely HTTP-based. Pushing now.
The || kill -0 fallback allowed a container to stay healthy when the REST endpoint was down but Java was alive. --start-period=90s already covers the startup window, so the fallback is redundant and weakens the health signal. Addresses review feedback from imbajin on apache#3052.
Missed in 9b3f6c4 — same fix as the other three Dockerfiles. Addresses review feedback from imbajin on apache#3052.
Purpose of the PR
docker psalways showsUpeven when Java has crashed inside the container because all three Dockerfiles had noHEALTHCHECK. Operators cannot distinguish a healthy container from a zombie. Also,cronwas installed in all three images but is never used in Docker deployments — the cron-based monitor (-m true) is for VM/bare-metal only.Main Changes
HEALTHCHECKtohugegraph-server/Dockerfile,hugegraph-pd/Dockerfile,hugegraph-store/Dockerfile:curl http://localhost:8080/versionscurl http://localhost:8620/v1/healthcurl http://localhost:8520/v1/healthkill -0on pid file), report healthy — avoids false unhealthy during startupdocker/docker-compose.ymlcronfromapt-get installin all three Dockerfiles — shrinks image size and reduces attack surfaceVerifying these changes
docker runcontainer →docker inspect --format='{{.State.Health.Status}}'→ reportshealthyonce Java is upkill -9Java inside container → status transitions tounhealthywithin--interval * --retriessecondsDoes this PR potentially affect the following parts?
Documentation Status
Doc - No Need