chore(ci): shard django tests by file to cut per-shard collection by frankh · Pull Request #64499 · PostHog/posthog

frankh · 2026-06-18T10:25:57Z

Problem

Every Django CI shard ran pytest posthog ee/ --splits N --group G, so pytest-split collected (imported) the entire ~26,600-test tree before deselecting the tests outside the shard's group. That full-tree import is the dominant cost of the pre-first-test setup span (posthog.setup_seconds), and it was paid by all 38 Core + 7 CorePOE + 7 Temporal shards — measured at ~88s locally and ~2.5x that on CI runners. DB setup (ClickHouse DDL + sqlx persons) is only ~3s, so collection is the lever.

Changes

File-level sharding. A new planner assigns whole test files to shards instead of letting every shard import everything:

.github/scripts/shard_django_files.py — discovers the file universe from the filesystem (authoritative, so new/untimed files are always covered), weights files by .test_durations, then hybrid-packs: whole files balanced via longest-processing-time bin-packing, and the few files heavier than one shard's budget split across dedicated --splits shards. Returns exactly shards specs and a full partition, so no group can silently drop tests. Falls back to the legacy whole-tree invocation when .test_durations is missing.
.github/workflows/ci-backend.yml — the Core, CorePOE, and Temporal full-run steps now source their pytest targets from the script. Compat shards and selected-mode are untouched. Added an inline legacy fallback for branches predating the script (the job checks out head.ref, so unrebased PRs may not have it yet) plus an empty-target guard, and registered the script in the backend paths-filter.
.github/scripts/test_shard_django_files.py — 37 tests covering the partition invariant, never-exceed-shards, file splitting, determinism, balance bounds, weight handling, discovery exclusions, and fallback.

Measured against the real .test_durations:

Segment	Coverage	Balance vs today	Collection / shard
Core (38)	exact, 0 missing/dupes	+9.4%	88.7s → 6.05s (verified)
CorePOE (7)	exact	+13.5%	full-tree → ~one file's worth
Temporal (7)	exact	+2.1%	—

Balance barely moves while collection drops ~15x — roughly 270s saved per shard on CI. It also unblocks a follow-up: the baked-in Core: 4*60 per-shard overhead in turbo-discover.js can be reduced so the matrix shards less.

How did you test this code?

I'm an agent (PostHog Code); I did not run the CI matrix. Locally I verified, against the real .test_durations and repo tree:

shard_django_files.py --check for all three segments: exact coverage (0 missing, 0 duplicates, whole-file partition holds), plan length equals shard count, deterministic output across repeated runs.
A real pytest --collect-only on a generated Core shard (group 7): 754 tests in 6.05s vs 88.7s for the full tree.
--store-durations works without --splits (so master timing collection is unaffected).
uv run pytest .github/scripts/test_shard_django_files.py — 37 passing.
ruff check + ruff format --check clean; bash -n on the rendered workflow blocks; yaml.safe_load parses the workflow.

This still needs a real CI smoke run to validate the GHA matrix wiring end-to-end, which I can't exercise locally.

Automatic notifications

Publish to changelog?
Alert Sales and Marketing teams?

Docs update

No docs changes.

🤖 Agent context

Autonomy: Human-driven (agent-assisted)

Built with PostHog Code (Claude Opus). The work started as an investigation into why the Django shard setup span runs ~400-500s. Diagnosis: collection duplication across shards (confirmed by local measurement and by turbo-discover.js already baking ~4min of "pytest collection" overhead into its Amdahl shard-sizing).

Key decisions:

Hybrid over pure file-level. Pure whole-file sharding costs +34.7% on the slowest Core shard (one 610-test file dominates). Splitting only the handful of files heavier than one shard's budget (5 for Core) recovers near-current balance — that trade-off was simulated against .test_durations before implementing.
A pytest-invocation constraint shaped the design: --splits/--group applies to the whole collected set, so a big-file slice can't share a shard with whole files. Oversized files therefore get dedicated split-shards; the planner guarantees exactly shards specs so no group is ever unassigned.
Filesystem is the coverage source of truth, durations are weights only — .test_durations carries stale keys for renamed/deleted files, and new files wouldn't be timed yet.
Backwards compatibility: the django job checks out head.ref, so the workflow degrades to the legacy whole-tree invocation when the script is absent (unrebased PRs) and on missing durations.

No repo skills were invoked.

Created with PostHog Code

greptile-apps · 2026-06-18T10:45:15Z

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
.github/scripts/test_shard_django_files.py:59-67
**Empty-bin invariant not validated**

`test_plan_never_exceeds_shards_even_with_many_oversized[shards=20]` actually produces an empty bin and passes anyway. With 10 equal-weight files and 20 shards, `ideal = 0.5`, so all 10 files are oversized. Files 0–8 consume 18 split-spec slots (≤ shards-1=19), leaving `whole_bins = 2`. File 9 goes into `whole`, fills bin 0, and bin 1 is empty. `pytest_args()` on that bin returns `""`, so the corresponding CI shard hits the `if [[ -z "$targets" ]]; then exit 1` guard.

The assertion `sorted(whole + split) == sorted(f for f, _ in files)` only checks file coverage; it doesn't catch that some shards have no work. Adding `assert all(s.files for s in plan if s.split_total is None)` to this test would surface the gap. The same invariant is also missing from the `--check` validation in `main()` (the `if missing or dupes or len(plan) != args.shards` guard doesn't include an empty-bin check), so `--check` would report "plan valid" for a plan that causes a CI failure.

### Issue 2 of 2
.github/scripts/shard_django_files.py:261-288
**`--check` does not validate empty bins**

The `if missing or dupes or len(plan) != args.shards` guard does not check for whole-file `ShardSpec` entries with no files. An empty bin is reachable when `whole_bins > len(whole)` (e.g., when many files are oversized and consume most split-spec slots), which produces a spec whose `pytest_args()` returns `""`. The CI guard catches it at runtime, but `--check` would exit 0 and print no error, giving false confidence when verifying the plan locally. Adding `empty_bins = [i+1 for i, s in enumerate(plan) if s.split_total is None and not s.files]` and including it in the failure condition would close the gap.

_{Reviews (1): Last reviewed commit: "chore(ci): shard django tests by file to..." | Re-trigger Greptile}

tests-posthog · 2026-06-18T10:57:34Z

Query snapshots: Backend query snapshots updated

Changes: 19 snapshots (19 modified, 0 added, 0 deleted)

What this means:

Query snapshots have been automatically updated to match current output
These changes reflect modifications to database queries or schema

Next steps:

Review the query changes to ensure they're intentional
If unexpected, investigate what caused the query to change

Review snapshot changes →

greptile-apps · 2026-06-18T13:00:16Z

_{Reviews (2): Last reviewed commit: "chore(temporal): keep SIGPIPE ignored in..." | Re-trigger Greptile}

Every Django shard ran `pytest posthog ee/ --splits N --group G`, so pytest-split collected (imported) the entire ~26k-test tree before deselecting the tests outside the shard's group. That full-tree import is the dominant cost of the pre-first-test "setup" span and was paid by all 38 Core + 7 CorePOE + 7 Temporal shards (~88s locally, ~2.5x on CI). shard_django_files.py assigns whole test files to shards (duration- balanced) so each shard only collects its own files; files heavier than one shard's budget are split across dedicated pytest-split shards. Coverage comes from the filesystem so untimed/new files are always assigned. Falls back to the legacy whole-tree invocation when .test_durations or the script is unavailable (unrebased PRs). Measured: collection 88.7s -> 6.05s per Core shard; balance within +9.4% (Core) / +13.5% (CorePOE) / +2.1% (Temporal) of today. Generated-By: PostHog Code Task-Id: a1816771-6c84-433d-a970-aa79851c8472

sse_streaming_response() calls close_old_connections(), a real DB operation. The tests didn't declare db access, so the pytest-django blocker tripped only when a prior test in the worker left a connection open — an order-dependent failure that file-level sharding (and pytest-randomly) can surface deterministically. Mark the class django_db so the call is always permitted. Generated-By: PostHog Code Task-Id: a1816771-6c84-433d-a970-aa79851c8472

The combined-metrics tests start mock HTTP servers whose handler threads can outlive the test and write to a client socket the timeout tests intentionally abandon. With SIGPIPE at its Python default (SIG_IGN) that write raises a catchable BrokenPipeError. But a Rust extension's runtime in an earlier shard test can leave SIGPIPE as SIG_DFL, so the write kills the whole pytest process (exit 141) — surfaced when file-level sharding co-located that test with this file. Re-assert SIG_IGN via an autouse fixture so the broken-pipe write can never take down the shard. Generated-By: PostHog Code Task-Id: a1816771-6c84-433d-a970-aa79851c8472

tests-posthog · 2026-06-18T14:34:14Z

⏭️ Skipped snapshot commit because branch advanced to 74b6f7e while workflow was testing d7e81bb.

The new commit will trigger its own snapshot update workflow.

If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:

Merge master into your branch, or
Push an empty commit: git commit --allow-empty -m 'trigger CI' && git push

tests-posthog · 2026-06-18T14:54:27Z

Query snapshots: Backend query snapshots updated

Changes: 1 snapshots (1 modified, 0 added, 0 deleted)

What this means:

Query snapshots have been automatically updated to match current output
These changes reflect modifications to database queries or schema

Next steps:

Review the query changes to ensure they're intentional
If unexpected, investigate what caused the query to change

Review snapshot changes →

gantoine

does this conflict with #64461?

frankh · 2026-06-18T19:22:24Z

ah annoyingly yes, probably. Maybe i'll revisit later

webjunkie · 2026-06-19T07:11:06Z

My PR now balances the shards almost perfectly 😊

Is the ~88s number how long it takes pytest to collect tests? That's not something I came across yet 😱
I'll validate this on my side.

webjunkie · 2026-06-19T07:44:12Z

Validated the collection-cost diagnosis locally (flox env), and it holds — collection is import-bound and dwarfs DB setup:

full-tree --collect-only: ~20s warm / ~42s cold. The 88s is just the cold end on a slower box — it's pure import cost, so it swings ~2x on .pyc/FS cache, which is exactly why CI runs the high end.
single file: ~3.3s — the fixed django-boot + conftest floor.
DB setup: ~3s, as you measured.

-X importtime breakdown of the 22s warm run (16.6s captured as import self-time, ~6s is pytest's own item-building):

first-party 7.6s / third-party 9.0s.
posthog.schema alone: 2.0s — one generated pydantic module, imported by ~everything.
ee.hogai 2.2s, then the data-science/DW stack (scipy/sklearn/pandas/databricks/dlt) — pulled in only by specific test files.

That last point splits the problem in two, and they need opposite fixes:

Per-shard tail (ee.hogai, databricks, dlt, scipy-stack): imported only because some file in the full tree pulls them. This is what file-level sharding removes — your ~15x is real.
Shared floor (posthog.schema ~2s, django boot ~2.5s): paid by every shard's own files regardless of slicing. File-sharding can't touch it; posthog.schema alone is ~2s × 52 shards ≈ ~100s aggregate. Separate surgical fix (split the generated file / defer the pydantic build) — tracking that on its own.

Plan, since this overlaps #64461: rather than a standalone GHA script, fold file-level sharding into the pytest-split fork as a granularity axis that composes with optimal_chunks:

--split-granularity=file (new) × --splitting-algorithm=optimal_chunks (existing) — the same makespan search runs over file weights, and the plugin uses pytest_ignore_collect to skip files outside the group before they're imported. pytest-split today deselects in pytest_collection_modifyitems, which is post-import — too late to save collection. Moving the cut to ignore_collect is the whole trick.
Keeps the workflow line as plain --splits/--group, retires shard_django_files.py, and is upstreamable to jerry-git as a general feature (not posthog-specific CI glue).
Oversized single file → falls back to item-level deselection within its dedicated shard (same edge you handled).

So #64499's idea is the right lever — just moving it down a layer into the splitter so it's one mechanism, default-off, and order-safe by construction. Starting on it now; will ping when there's something to look at.

frankh self-assigned this Jun 18, 2026

frankh added the run-ci-backend Force ci-backend's full test matrices to run even on a draft PR label Jun 18, 2026

greptile-apps Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread .github/scripts/test_shard_django_files.py

Comment thread .github/scripts/shard_django_files.py

frankh marked this pull request as ready for review June 18, 2026 12:54

assign-reviewers-posthog Bot requested a review from a team June 18, 2026 12:54

frankh requested a review from rnegron June 18, 2026 12:58

frankh force-pushed the posthog-code/django-file-level-sharding branch 2 times, most recently from 714bd89 to d7e81bb Compare June 18, 2026 14:03

frankh and others added 4 commits June 18, 2026 15:05

test(backend): update query snapshots

9032bf0

frankh force-pushed the posthog-code/django-file-level-sharding branch from d7e81bb to 74b6f7e Compare June 18, 2026 14:05

test(backend): update query snapshots

808b186

gantoine requested changes Jun 18, 2026

View reviewed changes

frankh closed this Jun 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(ci): shard django tests by file to cut per-shard collection#64499

chore(ci): shard django tests by file to cut per-shard collection#64499
frankh wants to merge 5 commits into
masterfrom
posthog-code/django-file-level-sharding

frankh commented Jun 18, 2026

Uh oh!

greptile-apps Bot commented Jun 18, 2026

Uh oh!

Uh oh!

Uh oh!

tests-posthog Bot commented Jun 18, 2026

Uh oh!

greptile-apps Bot commented Jun 18, 2026

Uh oh!

tests-posthog Bot commented Jun 18, 2026

Uh oh!

tests-posthog Bot commented Jun 18, 2026

Uh oh!

gantoine left a comment

Uh oh!

frankh commented Jun 18, 2026

Uh oh!

webjunkie commented Jun 19, 2026

Uh oh!

webjunkie commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

frankh commented Jun 18, 2026

Problem

Changes

How did you test this code?

Automatic notifications

Docs update

🤖 Agent context

Uh oh!

greptile-apps Bot commented Jun 18, 2026

Uh oh!

Uh oh!

Uh oh!

tests-posthog Bot commented Jun 18, 2026

Query snapshots: Backend query snapshots updated

Uh oh!

greptile-apps Bot commented Jun 18, 2026

Uh oh!

tests-posthog Bot commented Jun 18, 2026

Uh oh!

tests-posthog Bot commented Jun 18, 2026

Query snapshots: Backend query snapshots updated

Uh oh!

gantoine left a comment

Choose a reason for hiding this comment

Uh oh!

frankh commented Jun 18, 2026

Uh oh!

webjunkie commented Jun 19, 2026

Uh oh!

webjunkie commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants