Skip to content

feat: add findGrouped() to Adapter + ClickHouse implementation#119

Open
lohanidamodar wants to merge 2 commits into
mainfrom
feat-find-grouped
Open

feat: add findGrouped() to Adapter + ClickHouse implementation#119
lohanidamodar wants to merge 2 commits into
mainfrom
feat-find-grouped

Conversation

@lohanidamodar

Copy link
Copy Markdown
Contributor

Summary

Adds a new findGrouped(array $queries, string $groupBy, string $interval): array method to the audit Adapter abstraction and implements it in the ClickHouse adapter. It produces time-bucketed grouped counts for charts of the form events per <interval>, grouped by <attribute>.

Public API

/**
 * @param  array<\Utopia\Audit\Query>  $queries  Filter/limit/offset queries; callers should include a time BETWEEN filter
 * @param  string  $groupBy  One of {`event`, `userType`, `resourceType`}
 * @param  string  $interval One of {`hour`, `day`, `week`, `month`}
 * @return array<int, array{value: string, bucket: string, count: int}>
 */
abstract public function findGrouped(array $queries, string $groupBy, string $interval): array;

The Audit facade gains a matching passthrough.

Behavior

  • $groupBy and $interval are validated; invalid values throw.
  • Filter queries are passed through the existing parseQueries() pipeline so all existing Query::* filters work unchanged. getTenantFilter() is honored exactly like find() / count() so shared-table installs stay tenant-safe.
  • Query::limit() / Query::offset() are applied to the groups list. Default limit 25, hard max 100 (over-limit values are clamped). Top-N is selected by SUM(count) desc, ties by groupValue asc.
  • Every bucket between the smallest and largest bucket implied by the time filter is zero-filled for every returned group. Buckets are aligned to UTC via toStartOfInterval(time, INTERVAL 1 <unit>, 'UTC') and returned as ISO-8601 UTC strings (matching how find() formats time).
  • Result rows are ordered by value ASC, bucket ASC for predictable consumption.

SQL approach

Single round-trip using four CTEs:

  1. top_groups — top-N groups by SUM(count) desc.
  2. boundsMIN(time) / MAX(time) aligned to bucket starts.
  3. bucketsarrayJoin(arrayMap(i -> date_add(<unit>, i, bucket_from), range(0, ...))) produces the dense bucket axis.
  4. agg(value, bucket) -> COUNT(*) aggregate.

Then CROSS JOIN buckets + LEFT JOIN agg yields the dense (value, bucket, count) grid. All identifiers go through escapeIdentifier; all values go through ClickHouse bound parameters.

Database adapter

The Database adapter throws explicitly — grouped time-bucketed aggregations are designed for an analytical backend (ClickHouse).

Schema note

Added a _key_resource_type bloom-filter index to the ClickHouse adapter's getIndexes() so new installs cover the resourceType column natively. Existing installs need a one-off migration to add this index; that migration is intentionally not in this PR (separate concern).

Test plan

  • composer lint (pint --test) — pass
  • composer check (phpstan analyse --level max) — pass
  • Full PHPUnit suite (ClickHouse + Database adapters) — 99 tests, 942 assertions, all green
  • New tests cover: invalid groupBy, invalid interval, grouped counts by event, ordering invariant (value asc, bucket asc), zero-fill, limit clamping, offset paging, alternate groupBy=userType, empty time range, Database adapter "not supported"

Add a new method `findGrouped(array $queries, string $groupBy, string
$interval): array` to the audit Adapter abstraction and implement it in
the ClickHouse adapter. It produces time-bucketed grouped counts for
charts of the form "events per <interval>, grouped by <attribute>".

The method validates `$groupBy` against {event, userType, resourceType}
and `$interval` against {hour, day, week, month}, then reuses the
existing `parseQueries()` filter/param pipeline to apply both the query
filters and `getTenantFilter()` for tenant-safe scoping. `Query::limit()`
/ `Query::offset()` are applied to the group set (default 25, clamped
to 100), with top-N selection by `SUM(count)` desc and ties broken by
groupValue asc.

A single SQL round-trip drives the aggregation via CTEs: top-N groups,
time bounds, a zero-filled bucket axis built with `arrayJoin(range(...))`
+ `date_add(<unit>, i, bucket_from)`, and the per-bucket aggregate. A
final `CROSS JOIN` + `LEFT JOIN` produces the dense (groupValue, bucket,
count) result set, with buckets aligned via
`toStartOfInterval(time, INTERVAL 1 <unit>, 'UTC')` and formatted as
ISO-8601 UTC strings to match how `find()` returns `time`.

The Database adapter throws explicitly — grouped time-bucketed
aggregations are an analytical-store feature.

Schema: a `_key_resource_type` bloom-filter index has been added to the
ClickHouse `getIndexes()` definition so new installs get it. Existing
installs need a one-off migration to add it; left out of this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@greptile-apps

greptile-apps Bot commented May 17, 2026

Copy link
Copy Markdown

Greptile Summary

This PR adds a findGrouped(array $queries, string $groupBy, string $interval): array method to the audit Adapter abstraction and implements it in the ClickHouse adapter, producing time-bucketed event counts for charting. The Database adapter throws explicitly, and the Audit facade gains a matching passthrough.

  • The ClickHouse implementation uses a four-CTE single-round-trip query: top_groups (top-N by event count), bounds (min/max aligned to interval start), buckets (dense bucket axis via arrayJoin/arrayMap/range), and agg (per-group per-bucket counts) with a final CROSS JOIN + LEFT JOIN for zero-fill. Input validation whitelists groupBy and interval values; all user-supplied filter values go through the existing parseQueries() bound-parameter pipeline.
  • A new _key_resource_type bloom-filter index is added to getIndexes() for the resourceType column. New tests cover invalid args, bucketed counts, ordering, zero-fill, limit clamping, offset paging, alternate groupBy, empty range, and the Database "not supported" path.

Confidence Score: 5/5

Safe to merge — all user-supplied values are whitelist-validated or go through bound parameters, tenant isolation is preserved, and the test suite is thorough.

The new method is well-isolated, input validation is strict, SQL parameterisation follows the established pattern, and the test coverage is comprehensive. The two observations about the IS NOT NULL guard and the full-range bounds scan are both non-blocking: the first leaves the code behaviorally correct and the second is consistent with the documented design intent.

The buckets CTE logic in src/Audit/Adapter/ClickHouse.php around IS NOT NULL and the time-range bounds deserves a second look, but neither is a blocking concern.

Important Files Changed

Filename Overview
src/Audit/Adapter.php Adds abstract findGrouped() with clear contract docs; straightforward change, no issues.
src/Audit/Adapter/ClickHouse.php Implements findGrouped() via a four-CTE query; inputs are whitelist-validated and SQL-parametrized; subtle gap around IS NOT NULL guard on non-nullable DateTime columns.
src/Audit/Adapter/Database.php Throws Exception with a clear message; satisfies the abstract contract; no issues.
src/Audit/Audit.php Thin passthrough to adapter; docblock accurately mirrors the abstract; no issues.
tests/Audit/Adapter/ClickHouseTest.php Good coverage: invalid args, bucketed counts, ordering, zero-fill, limit/offset/clamp, alternate groupBy, empty range.
tests/Audit/Adapter/DatabaseTest.php Single test asserting the expected exception; correct and sufficient.

Reviews (2): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile

Comment on lines +1218 to +1222
FROM top_groups g
CROSS JOIN buckets b
LEFT JOIN agg a ON a.value = g.value AND a.bucket = b.bucket
WHERE g.value IN (SELECT value FROM top_groups)
ORDER BY value ASC, bucket ASC

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The outer WHERE g.value IN (SELECT value FROM top_groups) predicate is redundant — g is already derived directly from top_groups, so every row of g is guaranteed to satisfy this condition. The clause adds an extra correlated subquery that ClickHouse may or may not optimise away, and it obscures the intent of the JOIN.

Suggested change
FROM top_groups g
CROSS JOIN buckets b
LEFT JOIN agg a ON a.value = g.value AND a.bucket = b.bucket
WHERE g.value IN (SELECT value FROM top_groups)
ORDER BY value ASC, bucket ASC
FROM top_groups g
CROSS JOIN buckets b
LEFT JOIN agg a ON a.value = g.value AND a.bucket = b.bucket
ORDER BY value ASC, bucket ASC

Comment on lines +1206 to +1213
agg AS (
SELECT
{$escapedGroup} AS value,
toStartOfInterval({$escapedTime}, INTERVAL 1 {$unit}, 'UTC') AS bucket,
COUNT(*) AS count
FROM {$escapedTable}{$whereClause}
GROUP BY value, bucket
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The agg CTE aggregates every distinct group value that matches $whereClause, not just the top_groups. For high-cardinality columns like resourceType this can mean scanning and grouping many more rows than needed. Adding an IN-subquery filter restricts the aggregation to only the top-N groups already selected, at the cost of one extra read of the small top_groups result.

Suggested change
agg AS (
SELECT
{$escapedGroup} AS value,
toStartOfInterval({$escapedTime}, INTERVAL 1 {$unit}, 'UTC') AS bucket,
COUNT(*) AS count
FROM {$escapedTable}{$whereClause}
GROUP BY value, bucket
)
agg AS (
SELECT
{$escapedGroup} AS value,
toStartOfInterval({$escapedTime}, INTERVAL 1 {$unit}, 'UTC') AS bucket,
COUNT(*) AS count
FROM {$escapedTable}{$whereClause}
WHERE {$escapedGroup} IN (SELECT value FROM top_groups)
GROUP BY value, bucket
)

# Conflicts:
#	src/Audit/Adapter.php
#	src/Audit/Audit.php
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant