fix: continuous healthcheck for ClickHouse + HTTP status code for unavailability#2875
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughThis PR introduces ClickHouse health monitoring via exponential backoff polling. It adds a reusable poll-with-backoff utility, tracks client availability through a ping loop, maps unavailability through error handlers, and integrates health checks into the Fastify plugin lifecycle with proper shutdown cleanup. ChangesClickHouse Health Monitoring
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@controlplane/src/core/clickhouse/client/ClickHouseClient.ts`:
- Around line 44-47: The isAvailable getter currently relies solely on
pingFailedAttempts and causes per-request errors to be rewritten wrongly; update
request error handling to inspect the caught AxiosError (check
AxiosError.isAxiosError, error.code, error.request/response absence, and
network/ECONN* or ETIMEDOUT codes) and only map/throw ClickHouseUnavailableError
when the AxiosError indicates a transport/unreachable failure; treat
isAvailable/pingFailedAttempts as an advisory hint (used for
metrics/logs/retries) but do not override a legitimate ClickHouse HTTP/SQL error
returned in the Axios response. Change any code paths that currently replace all
exceptions with ClickHouseUnavailableError (including the logic near the
isAvailable getter and the other handling spots referencing ping state) to first
examine the caught error object and preserve/propagate non-transport Axios
errors unchanged. Ensure references to isAvailable, pingFailedAttempts,
AxiosError, and ClickHouseUnavailableError are used to locate and update the
affected handlers.
In `@controlplane/src/core/plugins/clickhouse.ts`:
- Around line 28-46: chHealthcheck is not idempotent: every call registers a new
'ping' event listener and launches another connection.ping() loop; make it
idempotent by tracking and reusing a single healthcheck instance (e.g. store a
boolean or the AbortController/Promise on fastify like
fastify.chHealthcheckStarted or fastify._chListenerController) so subsequent
calls return early if already started, or by aborting the previous
listenerController before creating a new one; ensure the logic around
connection.addEventListener('ping', ...), listenerController (signal),
connection.ping(), and fastify.onClose remains tied to that single controller so
listeners and ping loops are not duplicated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fd7a2377-a774-41bb-a65f-9e66562e0601
📒 Files selected for processing (6)
controlplane/src/core/clickhouse/client/ClickHouseClient.tscontrolplane/src/core/errors/errors.tscontrolplane/src/core/plugins/clickhouse.tscontrolplane/src/core/util.tscontrolplane/src/core/util/poll-with-backoff.tscontrolplane/test/poll-with-backoff.test.ts
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2875 +/- ##
==========================================
- Coverage 65.08% 64.47% -0.62%
==========================================
Files 275 319 +44
Lines 28513 45359 +16846
Branches 0 4927 +4927
==========================================
+ Hits 18559 29245 +10686
- Misses 8474 16089 +7615
+ Partials 1480 25 -1455
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
…n-case-of-clickhouse-outage
This PR is one of the changes that I intend to do to improve resiliency of Cosmo Studio. More PRs
will follow.
This change prepares front-end app to be able to explicitly detect when analytical (ClickHouse)
service is down.
Notable changes:
controlplanestarts up, it will boot even if ClickHouse is not availableand it'll keep polling instead
Before:
Generic error message:


500 status code:
Now:
Targeted error message:


503 status code:
The idea behind these changes is that we'll expose explicit enum status code on top of
the changes made here. We should be able to gracefully sent partial data to front-end
and tell user that they might be seeing just a subset of data.
After that is done, we can think about caching analytical data so at least stale results
are served, but even in that case we should indicate on front-end that the data is stale
and for that, we still need to detect ClickHouse is not available.
How to test?
Run
LOG_LEVEL=debug pnpm --filter controlplane dev, alongside with Studio (make start-studio).When the application is running, you can try disabling the service (via Orbstack or Docker CLI),
when the service is disabled, console output of
controlplaneshould emit error messages, alongwith attempt numbers.
Toggling the service back on should cause the errors to go away and instead a log with
healthcheck for ClickHouse to appear.
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Checklist
Open Source AI Manifesto
This project follows the principles of the Open Source AI Manifesto. Please ensure your contribution aligns with its principles.