fix: continuous healthcheck for ClickHouse + HTTP status code for unavailability #2875

Merged: comatory merged 7 commits into main from ondrej/eng-6980-improve-studio-reliability-in-case-of-clickhouse-outage on May 20, 2026

@comatory comatory commented May 19, 2026

This PR is one of several changes I intend to make to improve the resiliency of Cosmo Studio. More PRs
will follow.

This change prepares the front-end app to explicitly detect when the analytics (ClickHouse)
service is down.

Notable changes:

  • continuous ping to ClickHouse instead of checking service availability only at startup
  • the ping uses exponential back-off with jitter
  • change: controlplane now boots even if ClickHouse is unavailable
    and keeps polling instead
  • an error is no longer thrown; instead, the ClickHouse class emits ping results, which are logged
  • instead of the generic HTTP 500 status code, 503 is sent to indicate that a service is down
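
For readers unfamiliar with the back-off mentioned in the list above, here is a minimal sketch of an exponential delay with jitter. The option names (`baseDelayMs`, `maxDelayMs`, `factor`, `jitter`) and the symmetric jitter formula are assumptions for illustration; the actual `computeDelay` in this PR may use different options and a different jitter strategy.

```typescript
// Hypothetical option names; not taken from the PR's PollWithBackoffOptions.
interface BackoffOptions {
  baseDelayMs: number; // delay used for attempt 0
  maxDelayMs: number; // upper bound applied before jitter
  factor: number; // growth multiplier per attempt
  jitter: number; // fraction of the delay to randomize, between 0 and 1
}

function computeDelay(
  attempt: number,
  opts: BackoffOptions,
  random: () => number = Math.random,
): number {
  // Exponential growth: base * factor^attempt, capped at maxDelayMs.
  const exponential = opts.baseDelayMs * Math.pow(opts.factor, attempt);
  const capped = Math.min(exponential, opts.maxDelayMs);
  // Symmetric jitter: shift the delay by up to +/- (jitter * capped).
  const spread = capped * opts.jitter;
  const offset = (random() * 2 - 1) * spread;
  return Math.max(0, Math.round(capped + offset));
}
```

Passing `random` as a parameter keeps the function deterministic in tests; jitter itself spreads retries from several instances so they do not all hit ClickHouse at the same moment after an outage.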

Before:

Generic error message: [Screenshot 2026-05-19 at 14 06 17]
500 status code: [Screenshot 2026-05-19 at 14 06 14]

Now:

Targeted error message: [Screenshot 2026-05-19 at 14 04 16]
503 status code: [Screenshot 2026-05-19 at 14 04 25]

The idea behind these changes is that we'll expose an explicit enum status code on top of
them. We should be able to gracefully send partial data to the front-end
and tell users that they might be seeing only a subset of the data.
Once that is done, we can think about caching analytical data so that at least stale results
are served. Even in that case we should indicate on the front-end that the data is stale,
and for that we still need to detect that ClickHouse is unavailable.

How to test?

Run `LOG_LEVEL=debug pnpm --filter controlplane dev` alongside Studio (`make start-studio`).
While the application is running, disable the ClickHouse service (via OrbStack or the Docker CLI).
While the service is down, the controlplane console output should show error messages along
with attempt numbers.

Re-enabling the service should make the errors stop; a log entry reporting a successful
ClickHouse healthcheck should appear instead.

Summary by CodeRabbit

  • New Features

    • Automatic ClickHouse health monitoring with exponential backoff, event-based ping notifications, an availability indicator, and a graceful shutdown hook.
    • New specific error type to indicate ClickHouse is unreachable.
  • Bug Fixes

    • Transport failures during queries are surfaced consistently as an unavailable-backend error; healthcheck no longer blocks startup and logs outcomes asynchronously.
  • Tests

    • Added comprehensive tests for backoff/polling behavior and jitter/delay calculations.

Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.
  • Documentation has been updated on https://github.com/wundergraph/docs-website. (n/a)
  • I have read the Contributors Guide.

Open Source AI Manifesto

This project follows the principles of the Open Source AI Manifesto. Please ensure your contribution aligns with its principles.


coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c13f5bf5-621c-4f3f-95f3-8363c8159dc1

📥 Commits

Reviewing files that changed from the base of the PR and between 1872312 and 9e55731.

📒 Files selected for processing (1)
  • controlplane/src/core/clickhouse/client/ClickHouseClient.ts

Walkthrough

This PR introduces ClickHouse health monitoring via exponential backoff polling. It adds a reusable poll-with-backoff utility, tracks client availability through a ping loop, maps unavailability through error handlers, and integrates health checks into the Fastify plugin lifecycle with proper shutdown cleanup.

Changes

ClickHouse Health Monitoring

| Layer | File(s) | Summary |
| --- | --- | --- |
| Poll with Backoff Utility | controlplane/src/core/util/poll-with-backoff.ts, controlplane/test/poll-with-backoff.test.ts | Introduces PollWithBackoffOptions configuration, computeDelay for exponential growth with optional jitter, and pollWithBackoff async loop supporting AbortSignal-based cancellation, success/failure callbacks, and optional leading execution. Comprehensive tests validate delay computation, backoff growth, capping, jitter bounds, attempt tracking, abort behavior, and error normalization. |
| ClickHouse Error Type and Event Infrastructure | controlplane/src/core/errors/errors.ts, controlplane/src/core/clickhouse/client/ClickHouseClient.ts | Adds ClickHouseUnavailableError with optional cause and type guard. ClickHouseClient gains internal EventTarget emitter, typed ping payloads, ping state tracking via pingStopController and pingFailedAttempts, isAvailable getter derived from consecutive failures, and public typed event listener methods for ping events. |
| Ping Health Loop with Exponential Backoff | controlplane/src/core/clickhouse/client/ClickHouseClient.ts | Replaces single-shot ping with async method driving a pollWithBackoff loop; invokes private pingRequest helper for /ping GET requests, updates failure counters on success/failure, dispatches typed ping events with error/attempt details, and supports cancellation via close() method. |
| Request Error Mapping to Unavailability | controlplane/src/core/clickhouse/client/ClickHouseClient.ts, controlplane/src/core/util.ts | QueryPromise and insertPromise reject with ClickHouseUnavailableError when transport failures are detected, otherwise reject original error. Error handler utility detects ClickHouseUnavailableError and translates to a gRPC-compatible ConnectError with Code.Unavailable. |
| Fastify Plugin Lifecycle and Health Checks | controlplane/src/core/plugins/clickhouse.ts | Plugin switches to callback-style initialization; registers ping event listener for logging instead of try/catch, calls fastify.chHealthcheck() fire-and-forget on startup, and adds onClose hook to abort listener and call connection.close() on server shutdown. |
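
To illustrate the error-mapping step summarized above, here is a rough sketch of a dedicated error type, its type guard, and a handler that turns it into an "unavailable" result with HTTP 503 (the status named in the PR description). Everything here is a stand-in: the real class lives in controlplane/src/core/errors/errors.ts, the real handler returns a ConnectError with Code.Unavailable, and the `underlying` field, `MappedError` shape, and `mapError` name are hypothetical.

```typescript
// Hypothetical stand-in for the error type added by this PR.
class ClickHouseUnavailableError extends Error {
  // Holds the original failure; named `underlying` only for this sketch.
  readonly underlying?: unknown;

  constructor(message = 'ClickHouse is unavailable', underlying?: unknown) {
    super(message);
    this.name = 'ClickHouseUnavailableError';
    this.underlying = underlying;
  }
}

// Type guard so callers can narrow `unknown` errors safely.
function isClickHouseUnavailableError(e: unknown): e is ClickHouseUnavailableError {
  return e instanceof ClickHouseUnavailableError;
}

// Plain-object stand-in for what the real handler expresses as a ConnectError.
type MappedError = {
  code: 'unavailable' | 'internal';
  httpStatus: 503 | 500;
  message: string;
};

function mapError(e: unknown): MappedError {
  if (isClickHouseUnavailableError(e)) {
    // Dependency outage: report 503 rather than a generic 500.
    return { code: 'unavailable', httpStatus: 503, message: e.message };
  }
  const message = e instanceof Error ? e.message : String(e);
  return { code: 'internal', httpStatus: 500, message };
}
```

Keeping the outage case as its own error type lets the front-end distinguish "analytics backend is down" from an ordinary server fault without parsing error messages.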

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main changes: establishing continuous healthcheck polling for ClickHouse and properly reporting unavailability with HTTP 503 status codes instead of generic errors. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controlplane/src/core/clickhouse/client/ClickHouseClient.ts`:
- Around line 44-47: The isAvailable getter currently relies solely on
pingFailedAttempts and causes per-request errors to be rewritten wrongly; update
request error handling to inspect the caught AxiosError (check
AxiosError.isAxiosError, error.code, error.request/response absence, and
network/ECONN* or ETIMEDOUT codes) and only map/throw ClickHouseUnavailableError
when the AxiosError indicates a transport/unreachable failure; treat
isAvailable/pingFailedAttempts as an advisory hint (used for
metrics/logs/retries) but do not override a legitimate ClickHouse HTTP/SQL error
returned in the Axios response. Change any code paths that currently replace all
exceptions with ClickHouseUnavailableError (including the logic near the
isAvailable getter and the other handling spots referencing ping state) to first
examine the caught error object and preserve/propagate non-transport Axios
errors unchanged. Ensure references to isAvailable, pingFailedAttempts,
AxiosError, and ClickHouseUnavailableError are used to locate and update the
affected handlers.
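
A minimal sketch of the classification this comment asks for: only errors that look like transport failures count as "ClickHouse unreachable", while errors that carry a response are left alone. To stay self-contained it checks the error structurally instead of importing axios; real code would more likely call `axios.isAxiosError`. The helper name `isTransportFailure` and the exact list of codes are assumptions, not the PR's implementation.

```typescript
// Error codes treated as "could not reach the server" in this sketch.
const TRANSPORT_ERROR_CODES = new Set([
  'ECONNREFUSED',
  'ECONNRESET',
  'ECONNABORTED',
  'ETIMEDOUT',
  'ENOTFOUND',
  'EAI_AGAIN',
  'ERR_NETWORK',
]);

// Only the fields of an Axios error that this check needs.
interface AxiosLikeError {
  isAxiosError?: boolean;
  code?: string;
  response?: unknown;
}

function isTransportFailure(error: unknown): boolean {
  if (typeof error !== 'object' || error === null) {
    return false;
  }
  const e = error as AxiosLikeError;
  if (e.isAxiosError !== true) {
    return false;
  }
  // A response means ClickHouse answered, so this is an HTTP/SQL error
  // that should be propagated unchanged.
  if (e.response !== undefined) {
    return false;
  }
  return typeof e.code === 'string' && TRANSPORT_ERROR_CODES.has(e.code);
}
```

With a check like this, a rejected query would be wrapped in ClickHouseUnavailableError only when `isTransportFailure` returns true, regardless of what the ping state currently says.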

In `@controlplane/src/core/plugins/clickhouse.ts`:
- Around line 28-46: chHealthcheck is not idempotent: every call registers a new
'ping' event listener and launches another connection.ping() loop; make it
idempotent by tracking and reusing a single healthcheck instance (e.g. store a
boolean or the AbortController/Promise on fastify like
fastify.chHealthcheckStarted or fastify._chListenerController) so subsequent
calls return early if already started, or by aborting the previous
listenerController before creating a new one; ensure the logic around
connection.addEventListener('ping', ...), listenerController (signal),
connection.ping(), and fastify.onClose remains tied to that single controller so
listeners and ping loops are not duplicated.
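
One possible shape for the idempotency fix described in this comment, as a sketch only: a single AbortController is tracked, so a second start call returns early instead of registering another listener and launching another ping loop. `PingSource` and `createHealthcheck` are hypothetical names; the real plugin works with the Fastify instance and the ClickHouseClient connection.

```typescript
// Minimal view of the connection that this sketch relies on (hypothetical).
interface PingSource {
  addEventListener(
    type: 'ping',
    listener: () => void,
    options: { signal: AbortSignal },
  ): void;
  ping(): void;
}

function createHealthcheck(connection: PingSource, onPing: () => void) {
  // Present while a healthcheck is running, undefined otherwise.
  let controller: AbortController | undefined;

  return {
    // Returns true if this call started the healthcheck, false if one was already running.
    start(): boolean {
      if (controller) {
        return false;
      }
      controller = new AbortController();
      // The signal ties the listener's lifetime to this controller.
      connection.addEventListener('ping', onPing, { signal: controller.signal });
      connection.ping();
      return true;
    },
    // Intended for an onClose hook: detaches the listener and allows a later restart.
    stop(): void {
      controller?.abort();
      controller = undefined;
    },
  };
}
```

Calling `start()` twice then leaves exactly one listener and one ping loop in place, and `stop()` gives the shutdown hook a single controller to abort.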

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fd7a2377-a774-41bb-a65f-9e66562e0601

📥 Commits

Reviewing files that changed from the base of the PR and between 3d90c6c and 1872312.

📒 Files selected for processing (6)
  • controlplane/src/core/clickhouse/client/ClickHouseClient.ts
  • controlplane/src/core/errors/errors.ts
  • controlplane/src/core/plugins/clickhouse.ts
  • controlplane/src/core/util.ts
  • controlplane/src/core/util/poll-with-backoff.ts
  • controlplane/test/poll-with-backoff.test.ts

Comment thread controlplane/src/core/clickhouse/client/ClickHouseClient.ts Outdated
Comment thread controlplane/src/core/plugins/clickhouse.ts

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 39.87342% with 95 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.47%. Comparing base (3d90c6c) to head (b7641b9).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ane/src/core/clickhouse/client/ClickHouseClient.ts | 14.28% | 60 Missing ⚠️ |
| controlplane/src/core/plugins/clickhouse.ts | 4.54% | 21 Missing ⚠️ |
| controlplane/src/core/errors/errors.ts | 40.00% | 6 Missing ⚠️ |
| controlplane/src/core/util/poll-with-backoff.ts | 88.67% | 6 Missing ⚠️ |
| controlplane/src/core/util.ts | 33.33% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2875      +/-   ##
==========================================
- Coverage   65.08%   64.47%   -0.62%     
==========================================
  Files         275      319      +44     
  Lines       28513    45359   +16846     
  Branches        0     4927    +4927     
==========================================
+ Hits        18559    29245   +10686     
- Misses       8474    16089    +7615     
+ Partials     1480       25    -1455     

| Files with missing lines | Coverage Δ |
| --- | --- |
| controlplane/src/core/util.ts | 81.34% <33.33%> (ø) |
| controlplane/src/core/errors/errors.ts | 87.75% <40.00%> (ø) |
| controlplane/src/core/util/poll-with-backoff.ts | 88.67% <88.67%> (ø) |
| controlplane/src/core/plugins/clickhouse.ts | 8.57% <4.54%> (ø) |
| ...ane/src/core/clickhouse/client/ClickHouseClient.ts | 11.28% <14.28%> (ø) |

... and 589 files with indirect coverage changes

@comatory comatory marked this pull request as ready for review May 19, 2026 12:55

@claude claude Bot left a comment

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@comatory comatory enabled auto-merge (squash) May 20, 2026 07:25
@comatory comatory merged commit be9d015 into main May 20, 2026
10 checks passed
@comatory comatory deleted the ondrej/eng-6980-improve-studio-reliability-in-case-of-clickhouse-outage branch May 20, 2026 07:37