chore(ci): shard django tests by file to cut per-shard collection#64499
chore(ci): shard django tests by file to cut per-shard collection#64499frankh wants to merge 5 commits into
Conversation
Prompt To Fix All With AIFix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
.github/scripts/test_shard_django_files.py:59-67
**Empty-bin invariant not validated**
`test_plan_never_exceeds_shards_even_with_many_oversized[shards=20]` actually produces an empty bin and passes anyway. With 10 equal-weight files and 20 shards, `ideal = 0.5`, so all 10 files are oversized. Files 0–8 consume 18 split-spec slots (≤ shards-1=19), leaving `whole_bins = 2`. File 9 goes into `whole`, fills bin 0, and bin 1 is empty. `pytest_args()` on that bin returns `""`, so the corresponding CI shard hits the `if [[ -z "$targets" ]]; then exit 1` guard.
The assertion `sorted(whole + split) == sorted(f for f, _ in files)` only checks file coverage; it doesn't catch that some shards have no work. Adding `assert all(s.files for s in plan if s.split_total is None)` to this test would surface the gap. The same invariant is also missing from the `--check` validation in `main()` (the `if missing or dupes or len(plan) != args.shards` guard doesn't include an empty-bin check), so `--check` would report "plan valid" for a plan that causes a CI failure.
### Issue 2 of 2
.github/scripts/shard_django_files.py:261-288
**`--check` does not validate empty bins**
The `if missing or dupes or len(plan) != args.shards` guard does not check for whole-file `ShardSpec` entries with no files. An empty bin is reachable when `whole_bins > len(whole)` (e.g., when many files are oversized and consume most split-spec slots), which produces a spec whose `pytest_args()` returns `""`. The CI guard catches it at runtime, but `--check` would exit 0 and print no error, giving false confidence when verifying the plan locally. Adding `empty_bins = [i+1 for i, s in enumerate(plan) if s.split_total is None and not s.files]` and including it in the failure condition would close the gap.
Reviews (1): Last reviewed commit: "chore(ci): shard django tests by file to..." | Re-trigger Greptile |
Query snapshots: Backend query snapshots updatedChanges: 19 snapshots (19 modified, 0 added, 0 deleted) What this means:
Next steps:
|
|
Reviews (2): Last reviewed commit: "chore(temporal): keep SIGPIPE ignored in..." | Re-trigger Greptile |
714bd89 to
d7e81bb
Compare
Every Django shard ran `pytest posthog ee/ --splits N --group G`, so pytest-split collected (imported) the entire ~26k-test tree before deselecting the tests outside the shard's group. That full-tree import is the dominant cost of the pre-first-test "setup" span and was paid by all 38 Core + 7 CorePOE + 7 Temporal shards (~88s locally, ~2.5x on CI). shard_django_files.py assigns whole test files to shards (duration- balanced) so each shard only collects its own files; files heavier than one shard's budget are split across dedicated pytest-split shards. Coverage comes from the filesystem so untimed/new files are always assigned. Falls back to the legacy whole-tree invocation when .test_durations or the script is unavailable (unrebased PRs). Measured: collection 88.7s -> 6.05s per Core shard; balance within +9.4% (Core) / +13.5% (CorePOE) / +2.1% (Temporal) of today. Generated-By: PostHog Code Task-Id: a1816771-6c84-433d-a970-aa79851c8472
sse_streaming_response() calls close_old_connections(), a real DB operation. The tests didn't declare db access, so the pytest-django blocker tripped only when a prior test in the worker left a connection open — an order-dependent failure that file-level sharding (and pytest-randomly) can surface deterministically. Mark the class django_db so the call is always permitted. Generated-By: PostHog Code Task-Id: a1816771-6c84-433d-a970-aa79851c8472
The combined-metrics tests start mock HTTP servers whose handler threads can outlive the test and write to a client socket the timeout tests intentionally abandon. With SIGPIPE at its Python default (SIG_IGN) that write raises a catchable BrokenPipeError. But a Rust extension's runtime in an earlier shard test can leave SIGPIPE as SIG_DFL, so the write kills the whole pytest process (exit 141) — surfaced when file-level sharding co-located that test with this file. Re-assert SIG_IGN via an autouse fixture so the broken-pipe write can never take down the shard. Generated-By: PostHog Code Task-Id: a1816771-6c84-433d-a970-aa79851c8472
d7e81bb to
74b6f7e
Compare
|
⏭️ Skipped snapshot commit because branch advanced to The new commit will trigger its own snapshot update workflow. If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:
|
Query snapshots: Backend query snapshots updatedChanges: 1 snapshots (1 modified, 0 added, 0 deleted) What this means:
Next steps:
|
|
ah annoyingly yes, probably. Maybe i'll revisit later |
|
My PR now balances the shards almost perfectly 😊 Is the |
|
Validated the collection-cost diagnosis locally (flox env), and it holds — collection is import-bound and dwarfs DB setup:
That last point splits the problem in two, and they need opposite fixes:
Plan, since this overlaps #64461: rather than a standalone GHA script, fold file-level sharding into the pytest-split fork as a granularity axis that composes with
So #64499's idea is the right lever — just moving it down a layer into the splitter so it's one mechanism, default-off, and order-safe by construction. Starting on it now; will ping when there's something to look at. |
Problem
Every Django CI shard ran
pytest posthog ee/ --splits N --group G, so pytest-split collected (imported) the entire ~26,600-test tree before deselecting the tests outside the shard's group. That full-tree import is the dominant cost of the pre-first-testsetupspan (posthog.setup_seconds), and it was paid by all 38 Core + 7 CorePOE + 7 Temporal shards — measured at ~88s locally and ~2.5x that on CI runners. DB setup (ClickHouse DDL + sqlx persons) is only ~3s, so collection is the lever.Changes
File-level sharding. A new planner assigns whole test files to shards instead of letting every shard import everything:
.github/scripts/shard_django_files.py— discovers the file universe from the filesystem (authoritative, so new/untimed files are always covered), weights files by.test_durations, then hybrid-packs: whole files balanced via longest-processing-time bin-packing, and the few files heavier than one shard's budget split across dedicated--splitsshards. Returns exactlyshardsspecs and a full partition, so no group can silently drop tests. Falls back to the legacy whole-tree invocation when.test_durationsis missing..github/workflows/ci-backend.yml— the Core, CorePOE, and Temporal full-run steps now source their pytest targets from the script. Compat shards and selected-mode are untouched. Added an inline legacy fallback for branches predating the script (the job checks outhead.ref, so unrebased PRs may not have it yet) plus an empty-target guard, and registered the script in the backend paths-filter..github/scripts/test_shard_django_files.py— 37 tests covering the partition invariant, never-exceed-shards, file splitting, determinism, balance bounds, weight handling, discovery exclusions, and fallback.Measured against the real
.test_durations:Balance barely moves while collection drops ~15x — roughly 270s saved per shard on CI. It also unblocks a follow-up: the baked-in
Core: 4*60per-shard overhead inturbo-discover.jscan be reduced so the matrix shards less.How did you test this code?
I'm an agent (PostHog Code); I did not run the CI matrix. Locally I verified, against the real
.test_durationsand repo tree:shard_django_files.py --checkfor all three segments: exact coverage (0 missing, 0 duplicates, whole-file partition holds), plan length equals shard count, deterministic output across repeated runs.pytest --collect-onlyon a generated Core shard (group 7): 754 tests in 6.05s vs 88.7s for the full tree.--store-durationsworks without--splits(so master timing collection is unaffected).uv run pytest .github/scripts/test_shard_django_files.py— 37 passing.ruff check+ruff format --checkclean;bash -non the rendered workflow blocks;yaml.safe_loadparses the workflow.This still needs a real CI smoke run to validate the GHA matrix wiring end-to-end, which I can't exercise locally.
Automatic notifications
Docs update
No docs changes.
🤖 Agent context
Autonomy: Human-driven (agent-assisted)
Built with PostHog Code (Claude Opus). The work started as an investigation into why the Django shard
setupspan runs ~400-500s. Diagnosis: collection duplication across shards (confirmed by local measurement and byturbo-discover.jsalready baking ~4min of "pytest collection" overhead into its Amdahl shard-sizing).Key decisions:
.test_durationsbefore implementing.--splits/--groupapplies to the whole collected set, so a big-file slice can't share a shard with whole files. Oversized files therefore get dedicated split-shards; the planner guarantees exactlyshardsspecs so no group is ever unassigned..test_durationscarries stale keys for renamed/deleted files, and new files wouldn't be timed yet.head.ref, so the workflow degrades to the legacy whole-tree invocation when the script is absent (unrebased PRs) and on missing durations.No repo skills were invoked.
Created with PostHog Code