fix: continuous healthcheck for ClickHouse + HTTP status code for unavailability by comatory · Pull Request #2875 · wundergraph/cosmo

comatory · 2026-05-19T12:15:35Z

This PR is one of several changes I intend to make to improve the resiliency of Cosmo Studio.
More PRs will follow.
This change prepares the front-end app to explicitly detect when the analytical (ClickHouse)
service is down.

Notable changes:

- continuous pinging of ClickHouse instead of checking service availability only during startup
- the ping uses exponential back-off with jitter
- change: when controlplane starts up, it will boot even if ClickHouse is not available and will keep polling instead
- an error is no longer thrown; instead, the ClickHouse class emits results, which are logged
- instead of a generic 500 HTTP status code, 503 is sent to indicate that a service is down
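
The exponential back-off with jitter mentioned in the list can be sketched roughly as follows. This is a minimal illustration and not the PR's actual `computeDelay`; the function name `computeDelaySketch`, the option names (`baseMs`, `maxMs`, `factor`, `jitter`) and the injectable `random` parameter are all assumptions made for the example.

```typescript
// Hypothetical options; the names are illustrative and not taken from the PR.
interface BackoffOptions {
  baseMs: number; // delay before the first retry
  maxMs: number; // upper bound for the un-jittered delay
  factor: number; // growth factor applied per failed attempt
  jitter: number; // fraction in [0, 1]; 0.2 means a +/-20% random spread
}

// Returns the delay in milliseconds before retry number `attempt` (1-based).
// `random` is injectable so the function can be tested deterministically.
function computeDelaySketch(
  attempt: number,
  opts: BackoffOptions,
  random: () => number = Math.random,
): number {
  const exponential = opts.baseMs * Math.pow(opts.factor, Math.max(0, attempt - 1));
  const capped = Math.min(exponential, opts.maxMs);
  // Map random() from [0, 1) to [-jitter, +jitter) and scale the capped delay.
  const spread = (random() * 2 - 1) * opts.jitter;
  return Math.max(0, Math.round(capped * (1 + spread)));
}
```

With the assumed values `baseMs: 1000`, `factor: 2`, `maxMs: 30000` and the jitter disabled, the delays grow as 1 s, 2 s, 4 s and so on until they reach the 30 s cap. The jitter spreads retries out so that several clients recovering at once do not ping in lockstep.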

Before: generic error message with a 500 status code (screenshots omitted).

Now: targeted error message with a 503 status code (screenshots omitted).

The idea behind these changes is that we'll expose an explicit enum status code on top of
the changes made here. We should be able to gracefully send partial data to the front-end
and tell the user that they might be seeing just a subset of the data.
After that is done, we can think about caching analytical data so that at least stale results
are served, but even in that case we should indicate on the front-end that the data is stale,
and for that we still need to detect that ClickHouse is not available.

How to test?

Run `LOG_LEVEL=debug pnpm --filter controlplane dev` alongside Studio (`make start-studio`).
While the application is running, try disabling the service (via OrbStack or the Docker CLI).
Once the service is disabled, the controlplane console output should show error messages along
with attempt numbers.

Toggling the service back on should make the errors go away; a ClickHouse healthcheck log
should appear instead.

Summary by CodeRabbit

New Features
- Automatic ClickHouse health monitoring with exponential backoff, event-based ping notifications, an availability indicator, and a graceful shutdown hook.
- New specific error type to indicate ClickHouse is unreachable.
Bug Fixes
- Transport failures during queries are surfaced consistently as an unavailable-backend error; healthcheck no longer blocks startup and logs outcomes asynchronously.
Tests
- Added comprehensive tests for backoff/polling behavior and jitter/delay calculations.

Checklist

I have discussed my proposed changes in an issue and have received approval to proceed.
I have followed the coding standards of the project.
Tests or benchmarks have been added or updated.
Documentation has been updated on https://github.com/wundergraph/docs-website. (n/a)
I have read the Contributors Guide.

Open Source AI Manifesto

This project follows the principles of the Open Source AI Manifesto. Please ensure your contribution aligns with its principles.

coderabbitai · 2026-05-19T12:17:48Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c13f5bf5-621c-4f3f-95f3-8363c8159dc1

📥 Commits

Reviewing files that changed from the base of the PR and between 1872312 and 9e55731.

📒 Files selected for processing (1)

controlplane/src/core/clickhouse/client/ClickHouseClient.ts

Walkthrough

This PR introduces ClickHouse health monitoring via exponential backoff polling. It adds a reusable poll-with-backoff utility, tracks client availability through a ping loop, maps unavailability through error handlers, and integrates health checks into the Fastify plugin lifecycle with proper shutdown cleanup.

Changes

ClickHouse Health Monitoring

| Layer / File(s) | Summary |
| --- | --- |
| Poll with Backoff Utility `controlplane/src/core/util/poll-with-backoff.ts`, `controlplane/test/poll-with-backoff.test.ts` | Introduces `PollWithBackoffOptions` configuration, `computeDelay` for exponential growth with optional jitter, and `pollWithBackoff` async loop supporting AbortSignal-based cancellation, success/failure callbacks, and optional leading execution. Comprehensive tests validate delay computation, backoff growth, capping, jitter bounds, attempt tracking, abort behavior, and error normalization. |
| ClickHouse Error Type and Event Infrastructure `controlplane/src/core/errors/errors.ts`, `controlplane/src/core/clickhouse/client/ClickHouseClient.ts` | Adds `ClickHouseUnavailableError` with optional cause and type guard. ClickHouseClient gains internal `EventTarget` emitter, typed ping payloads, ping state tracking via `pingStopController` and `pingFailedAttempts`, `isAvailable` getter derived from consecutive failures, and public typed event listener methods for `ping` events. |
| Ping Health Loop with Exponential Backoff `controlplane/src/core/clickhouse/client/ClickHouseClient.ts` | Replaces single-shot ping with async method driving a `pollWithBackoff` loop; invokes private `pingRequest` helper for `/ping` GET requests, updates failure counters on success/failure, dispatches typed ping events with error/attempt details, and supports cancellation via `close()` method. |
| Request Error Mapping to Unavailability `controlplane/src/core/clickhouse/client/ClickHouseClient.ts`, `controlplane/src/core/util.ts` | QueryPromise and insertPromise reject with `ClickHouseUnavailableError` when transport failures are detected, otherwise reject original error. Error handler utility detects `ClickHouseUnavailableError` and translates to a gRPC-compatible `ConnectError` with `Code.Unavailable`. |
| Fastify Plugin Lifecycle and Health Checks `controlplane/src/core/plugins/clickhouse.ts` | Plugin switches to callback-style initialization; registers `ping` event listener for logging instead of try/catch, calls `fastify.chHealthcheck()` fire-and-forget on startup, and adds `onClose` hook to abort listener and call `connection.close()` on server shutdown. |
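
The event-based ping notifications and the `isAvailable` getter summarized in the table can be illustrated with the standard `EventTarget`. The sketch below is a simplified assumption of how such tracking might look, not the actual `ClickHouseClient`: the names `HealthTrackerSketch`, `PingEvent`, `onPing` and `recordPing` are hypothetical, and availability is assumed to mean "the most recent ping did not fail".

```typescript
// Result payload carried by each ping event (the shape is an assumption).
type PingResult =
  | { ok: true }
  | { ok: false; error: Error; attempt: number };

// Event subclass carrying a typed payload.
class PingEvent extends Event {
  constructor(public readonly detail: PingResult) {
    super('ping');
  }
}

class HealthTrackerSketch {
  private readonly emitter = new EventTarget();
  private failedAttempts = 0;

  // Considered available as long as the most recent ping did not fail.
  get isAvailable(): boolean {
    return this.failedAttempts === 0;
  }

  // Registers a typed listener; aborting `signal` removes it again.
  onPing(listener: (result: PingResult) => void, signal?: AbortSignal): void {
    this.emitter.addEventListener(
      'ping',
      (event) => listener((event as PingEvent).detail),
      signal ? { signal } : undefined,
    );
  }

  // Would be called by the polling loop after each ping attempt.
  recordPing(error?: Error): void {
    if (error) {
      this.failedAttempts += 1;
      this.emitter.dispatchEvent(
        new PingEvent({ ok: false, error, attempt: this.failedAttempts }),
      );
    } else {
      this.failedAttempts = 0;
      this.emitter.dispatchEvent(new PingEvent({ ok: true }));
    }
  }
}
```

A logging listener can then be attached with `tracker.onPing((result) => { ... }, controller.signal)` and detached during shutdown by aborting the controller, which mirrors the listener-plus-`onClose` arrangement described for the plugin.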

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main changes: establishing continuous healthcheck polling for ClickHouse and properly reporting unavailability with HTTP 503 status codes instead of generic errors. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controlplane/src/core/clickhouse/client/ClickHouseClient.ts`:
- Around line 44-47: The isAvailable getter currently relies solely on
pingFailedAttempts and causes per-request errors to be rewritten wrongly; update
request error handling to inspect the caught AxiosError (check
AxiosError.isAxiosError, error.code, error.request/response absence, and
network/ECONN* or ETIMEDOUT codes) and only map/throw ClickHouseUnavailableError
when the AxiosError indicates a transport/unreachable failure; treat
isAvailable/pingFailedAttempts as an advisory hint (used for
metrics/logs/retries) but do not override a legitimate ClickHouse HTTP/SQL error
returned in the Axios response. Change any code paths that currently replace all
exceptions with ClickHouseUnavailableError (including the logic near the
isAvailable getter and the other handling spots referencing ping state) to first
examine the caught error object and preserve/propagate non-transport Axios
errors unchanged. Ensure references to isAvailable, pingFailedAttempts,
AxiosError, and ClickHouseUnavailableError are used to locate and update the
affected handlers.
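
The distinction this comment asks for, mapping only transport failures to an unavailability error while leaving HTTP/SQL errors untouched, could look roughly like the sketch below. It uses duck typing on the `isAxiosError`, `code` and `response` fields so that it has no dependencies; real code would more likely call `axios.isAxiosError()`. The names `UnavailableErrorSketch`, `isTransportFailure` and `mapRequestError`, as well as the exact list of error codes, are assumptions and not the PR's implementation.

```typescript
// Minimal structural view of an Axios error, declared locally for the sketch.
interface AxiosLikeError {
  isAxiosError: true;
  code?: string;
  response?: unknown;
}

// Error codes treated as "ClickHouse is unreachable". The list is an assumption.
const TRANSPORT_CODES = new Set<string>([
  'ECONNREFUSED',
  'ECONNRESET',
  'ECONNABORTED',
  'ENOTFOUND',
  'EHOSTUNREACH',
  'ETIMEDOUT',
  'ERR_NETWORK',
]);

// Hypothetical stand-in for ClickHouseUnavailableError.
class UnavailableErrorSketch extends Error {
  constructor(public readonly originalError?: unknown) {
    super('ClickHouse is unavailable');
    this.name = 'UnavailableErrorSketch';
  }
}

function isAxiosLikeError(error: unknown): error is AxiosLikeError {
  return (
    typeof error === 'object' &&
    error !== null &&
    (error as { isAxiosError?: unknown }).isAxiosError === true
  );
}

function isTransportFailure(error: unknown): boolean {
  if (!isAxiosLikeError(error)) return false;
  // A response means ClickHouse answered, so this is an HTTP/SQL error.
  if (error.response !== undefined) return false;
  return error.code !== undefined && TRANSPORT_CODES.has(error.code);
}

// Only transport failures are rewritten; every other error passes through.
function mapRequestError(error: unknown): unknown {
  return isTransportFailure(error) ? new UnavailableErrorSketch(error) : error;
}
```

Because the decision is based on the caught error itself rather than on the ping state, a syntax error returned by ClickHouse during an outage window is still reported as the original error.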

In `@controlplane/src/core/plugins/clickhouse.ts`:
- Around line 28-46: chHealthcheck is not idempotent: every call registers a new
'ping' event listener and launches another connection.ping() loop; make it
idempotent by tracking and reusing a single healthcheck instance (e.g. store a
boolean or the AbortController/Promise on fastify like
fastify.chHealthcheckStarted or fastify._chListenerController) so subsequent
calls return early if already started, or by aborting the previous
listenerController before creating a new one; ensure the logic around
connection.addEventListener('ping', ...), listenerController (signal),
connection.ping(), and fastify.onClose remains tied to that single controller so
listeners and ping loops are not duplicated.
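
One way to make a healthcheck starter idempotent, as suggested in this comment, is to keep a single `AbortController` and return early while it is still active. The sketch below is an assumption-laden illustration: `PingableConnection`, `startPingLoop` and `createHealthcheckSketch` are hypothetical names, and the real plugin wires this into Fastify decorators and hooks instead.

```typescript
// Stand-in for the ClickHouse connection; only what the sketch needs.
interface PingableConnection {
  startPingLoop(signal: AbortSignal): void;
}

function createHealthcheckSketch(connection: PingableConnection) {
  let controller: AbortController | undefined;

  return {
    // Starts the loop once; repeated calls are no-ops while it is running.
    // Returns true only when a new loop was actually started.
    start(): boolean {
      if (controller && !controller.signal.aborted) {
        return false;
      }
      controller = new AbortController();
      connection.startPingLoop(controller.signal);
      return true;
    },
    // Intended to be called from a shutdown hook such as Fastify's onClose.
    stop(): void {
      controller?.abort();
      controller = undefined;
    },
  };
}
```

Tying both the event listener and the ping loop to the same signal means a single `stop()` call cleans up everything that `start()` created.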

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fd7a2377-a774-41bb-a65f-9e66562e0601

📥 Commits

Reviewing files that changed from the base of the PR and between 3d90c6c and 1872312.

📒 Files selected for processing (6)

controlplane/src/core/clickhouse/client/ClickHouseClient.ts
controlplane/src/core/errors/errors.ts
controlplane/src/core/plugins/clickhouse.ts
controlplane/src/core/util.ts
controlplane/src/core/util/poll-with-backoff.ts
controlplane/test/poll-with-backoff.test.ts

codecov · 2026-05-19T12:29:20Z

Codecov Report

❌ Patch coverage is 39.87342% with 95 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.47%. Comparing base (3d90c6c) to head (b7641b9).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ane/src/core/clickhouse/client/ClickHouseClient.ts | 14.28% | 60 Missing ⚠️ |
| controlplane/src/core/plugins/clickhouse.ts | 4.54% | 21 Missing ⚠️ |
| controlplane/src/core/errors/errors.ts | 40.00% | 6 Missing ⚠️ |
| controlplane/src/core/util/poll-with-backoff.ts | 88.67% | 6 Missing ⚠️ |
| controlplane/src/core/util.ts | 33.33% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2875      +/-   ##
==========================================
- Coverage   65.08%   64.47%   -0.62%     
==========================================
  Files         275      319      +44     
  Lines       28513    45359   +16846     
  Branches        0     4927    +4927     
==========================================
+ Hits        18559    29245   +10686     
- Misses       8474    16089    +7615     
+ Partials     1480       25    -1455

| Files with missing lines | Coverage Δ |
| --- | --- |
| controlplane/src/core/util.ts | `81.34% <33.33%> (ø)` |
| controlplane/src/core/errors/errors.ts | `87.75% <40.00%> (ø)` |
| controlplane/src/core/util/poll-with-backoff.ts | `88.67% <88.67%> (ø)` |
| controlplane/src/core/plugins/clickhouse.ts | `8.57% <4.54%> (ø)` |
| ...ane/src/core/clickhouse/client/ClickHouseClient.ts | `11.28% <14.28%> (ø)` |

... and 589 files with indirect coverage changes

claude

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

_{Tip: disable this comment in your organization's Code Review settings.}

comatory added 5 commits May 19, 2026 10:48

feat: ping periodically for ClickHouse healthcheck

7c99102

feat: proper shutdown for healthcheck

fbaa271

feat: exponential backoff for healthcheck

89c48b8

feat: add leading option to polling util to ping CH immediately

6e0bc79

feat: detect unavailable ClickHouse and send 503

1872312

github-actions Bot added the controlplane label May 19, 2026

coderabbitai Bot reviewed May 19, 2026

Comment thread controlplane/src/core/clickhouse/client/ClickHouseClient.ts Outdated

Comment thread controlplane/src/core/plugins/clickhouse.ts

fix: only throw ClickHouse connection error if issue is with transport

9e55731

comatory marked this pull request as ready for review May 19, 2026 12:55

comatory requested review from Aenimus, JivusAyrus, StarpTech, thisisnithin and wilsonrivera as code owners May 19, 2026 12:55

claude Bot reviewed May 19, 2026

thisisnithin approved these changes May 20, 2026

comatory enabled auto-merge (squash) May 20, 2026 07:25

Merge branch 'main' into ondrej/eng-6980-improve-studio-reliability-i…

b7641b9

…n-case-of-clickhouse-outage

comatory merged commit be9d015 into main May 20, 2026
10 checks passed

comatory deleted the ondrej/eng-6980-improve-studio-reliability-in-case-of-clickhouse-outage branch May 20, 2026 07:37

coderabbitai Bot mentioned this pull request May 20, 2026

feat: studio handles analytics downtime gracefully #2878

Open

5 tasks

comatory merged 7 commits into main from ondrej/eng-6980-improve-studio-reliability-in-case-of-clickhouse-outage
