Skip to content

chore(ci): shard django tests by file to cut per-shard collection#64499

Closed
frankh wants to merge 5 commits into
masterfrom
posthog-code/django-file-level-sharding
Closed

chore(ci): shard django tests by file to cut per-shard collection#64499
frankh wants to merge 5 commits into
masterfrom
posthog-code/django-file-level-sharding

Conversation

@frankh

@frankh frankh commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Problem

Every Django CI shard ran pytest posthog ee/ --splits N --group G, so pytest-split collected (imported) the entire ~26,600-test tree before deselecting the tests outside the shard's group. That full-tree import is the dominant cost of the pre-first-test setup span (posthog.setup_seconds), and it was paid by all 38 Core + 7 CorePOE + 7 Temporal shards — measured at ~88s locally and ~2.5x that on CI runners. DB setup (ClickHouse DDL + sqlx persons) is only ~3s, so collection is the lever.

Changes

File-level sharding. A new planner assigns whole test files to shards instead of letting every shard import everything:

  • .github/scripts/shard_django_files.py — discovers the file universe from the filesystem (authoritative, so new/untimed files are always covered), weights files by .test_durations, then hybrid-packs: whole files balanced via longest-processing-time bin-packing, and the few files heavier than one shard's budget split across dedicated --splits shards. Returns exactly shards specs and a full partition, so no group can silently drop tests. Falls back to the legacy whole-tree invocation when .test_durations is missing.
  • .github/workflows/ci-backend.yml — the Core, CorePOE, and Temporal full-run steps now source their pytest targets from the script. Compat shards and selected-mode are untouched. Added an inline legacy fallback for branches predating the script (the job checks out head.ref, so unrebased PRs may not have it yet) plus an empty-target guard, and registered the script in the backend paths-filter.
  • .github/scripts/test_shard_django_files.py — 37 tests covering the partition invariant, never-exceed-shards, file splitting, determinism, balance bounds, weight handling, discovery exclusions, and fallback.

Measured against the real .test_durations:

Segment Coverage Balance vs today Collection / shard
Core (38) exact, 0 missing/dupes +9.4% 88.7s → 6.05s (verified)
CorePOE (7) exact +13.5% full-tree → ~one file's worth
Temporal (7) exact +2.1%

Balance barely moves while collection drops ~15x — roughly 270s saved per shard on CI. It also unblocks a follow-up: the baked-in Core: 4*60 per-shard overhead in turbo-discover.js can be reduced so the matrix shards less.

How did you test this code?

I'm an agent (PostHog Code); I did not run the CI matrix. Locally I verified, against the real .test_durations and repo tree:

  • shard_django_files.py --check for all three segments: exact coverage (0 missing, 0 duplicates, whole-file partition holds), plan length equals shard count, deterministic output across repeated runs.
  • A real pytest --collect-only on a generated Core shard (group 7): 754 tests in 6.05s vs 88.7s for the full tree.
  • --store-durations works without --splits (so master timing collection is unaffected).
  • uv run pytest .github/scripts/test_shard_django_files.py — 37 passing.
  • ruff check + ruff format --check clean; bash -n on the rendered workflow blocks; yaml.safe_load parses the workflow.

This still needs a real CI smoke run to validate the GHA matrix wiring end-to-end, which I can't exercise locally.

Automatic notifications

  • Publish to changelog?
  • Alert Sales and Marketing teams?

Docs update

No docs changes.

🤖 Agent context

Autonomy: Human-driven (agent-assisted)

Built with PostHog Code (Claude Opus). The work started as an investigation into why the Django shard setup span runs ~400-500s. Diagnosis: collection duplication across shards (confirmed by local measurement and by turbo-discover.js already baking ~4min of "pytest collection" overhead into its Amdahl shard-sizing).

Key decisions:

  • Hybrid over pure file-level. Pure whole-file sharding costs +34.7% on the slowest Core shard (one 610-test file dominates). Splitting only the handful of files heavier than one shard's budget (5 for Core) recovers near-current balance — that trade-off was simulated against .test_durations before implementing.
  • A pytest-invocation constraint shaped the design: --splits/--group applies to the whole collected set, so a big-file slice can't share a shard with whole files. Oversized files therefore get dedicated split-shards; the planner guarantees exactly shards specs so no group is ever unassigned.
  • Filesystem is the coverage source of truth, durations are weights only — .test_durations carries stale keys for renamed/deleted files, and new files wouldn't be timed yet.
  • Backwards compatibility: the django job checks out head.ref, so the workflow degrades to the legacy whole-tree invocation when the script is absent (unrebased PRs) and on missing durations.

No repo skills were invoked.


Created with PostHog Code

@frankh frankh self-assigned this Jun 18, 2026
@frankh frankh added the run-ci-backend Force ci-backend's full test matrices to run even on a draft PR label Jun 18, 2026
@greptile-apps

greptile-apps Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
.github/scripts/test_shard_django_files.py:59-67
**Empty-bin invariant not validated**

`test_plan_never_exceeds_shards_even_with_many_oversized[shards=20]` actually produces an empty bin and passes anyway. With 10 equal-weight files and 20 shards, `ideal = 0.5`, so all 10 files are oversized. Files 0–8 consume 18 split-spec slots (≤ shards-1=19), leaving `whole_bins = 2`. File 9 goes into `whole`, fills bin 0, and bin 1 is empty. `pytest_args()` on that bin returns `""`, so the corresponding CI shard hits the `if [[ -z "$targets" ]]; then exit 1` guard.

The assertion `sorted(whole + split) == sorted(f for f, _ in files)` only checks file coverage; it doesn't catch that some shards have no work. Adding `assert all(s.files for s in plan if s.split_total is None)` to this test would surface the gap. The same invariant is also missing from the `--check` validation in `main()` (the `if missing or dupes or len(plan) != args.shards` guard doesn't include an empty-bin check), so `--check` would report "plan valid" for a plan that causes a CI failure.

### Issue 2 of 2
.github/scripts/shard_django_files.py:261-288
**`--check` does not validate empty bins**

The `if missing or dupes or len(plan) != args.shards` guard does not check for whole-file `ShardSpec` entries with no files. An empty bin is reachable when `whole_bins > len(whole)` (e.g., when many files are oversized and consume most split-spec slots), which produces a spec whose `pytest_args()` returns `""`. The CI guard catches it at runtime, but `--check` would exit 0 and print no error, giving false confidence when verifying the plan locally. Adding `empty_bins = [i+1 for i, s in enumerate(plan) if s.split_total is None and not s.files]` and including it in the failure condition would close the gap.

Reviews (1): Last reviewed commit: "chore(ci): shard django tests by file to..." | Re-trigger Greptile

Comment thread .github/scripts/test_shard_django_files.py
Comment thread .github/scripts/shard_django_files.py
@tests-posthog

tests-posthog Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Query snapshots: Backend query snapshots updated

Changes: 19 snapshots (19 modified, 0 added, 0 deleted)

What this means:

  • Query snapshots have been automatically updated to match current output
  • These changes reflect modifications to database queries or schema

Next steps:

  • Review the query changes to ensure they're intentional
  • If unexpected, investigate what caused the query to change

Review snapshot changes →

@frankh frankh marked this pull request as ready for review June 18, 2026 12:54
@assign-reviewers-posthog assign-reviewers-posthog Bot requested a review from a team June 18, 2026 12:54
@frankh frankh requested a review from rnegron June 18, 2026 12:58
@greptile-apps

greptile-apps Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Reviews (2): Last reviewed commit: "chore(temporal): keep SIGPIPE ignored in..." | Re-trigger Greptile

@frankh frankh force-pushed the posthog-code/django-file-level-sharding branch 2 times, most recently from 714bd89 to d7e81bb Compare June 18, 2026 14:03
frankh and others added 4 commits June 18, 2026 15:05
Every Django shard ran `pytest posthog ee/ --splits N --group G`, so
pytest-split collected (imported) the entire ~26k-test tree before
deselecting the tests outside the shard's group. That full-tree import
is the dominant cost of the pre-first-test "setup" span and was paid by
all 38 Core + 7 CorePOE + 7 Temporal shards (~88s locally, ~2.5x on CI).

shard_django_files.py assigns whole test files to shards (duration-
balanced) so each shard only collects its own files; files heavier than
one shard's budget are split across dedicated pytest-split shards.
Coverage comes from the filesystem so untimed/new files are always
assigned. Falls back to the legacy whole-tree invocation when
.test_durations or the script is unavailable (unrebased PRs).

Measured: collection 88.7s -> 6.05s per Core shard; balance within
+9.4% (Core) / +13.5% (CorePOE) / +2.1% (Temporal) of today.

Generated-By: PostHog Code
Task-Id: a1816771-6c84-433d-a970-aa79851c8472
sse_streaming_response() calls close_old_connections(), a real DB
operation. The tests didn't declare db access, so the pytest-django
blocker tripped only when a prior test in the worker left a connection
open — an order-dependent failure that file-level sharding (and
pytest-randomly) can surface deterministically. Mark the class
django_db so the call is always permitted.

Generated-By: PostHog Code
Task-Id: a1816771-6c84-433d-a970-aa79851c8472
The combined-metrics tests start mock HTTP servers whose handler threads
can outlive the test and write to a client socket the timeout tests
intentionally abandon. With SIGPIPE at its Python default (SIG_IGN) that
write raises a catchable BrokenPipeError. But a Rust extension's runtime
in an earlier shard test can leave SIGPIPE as SIG_DFL, so the write kills
the whole pytest process (exit 141) — surfaced when file-level sharding
co-located that test with this file. Re-assert SIG_IGN via an autouse
fixture so the broken-pipe write can never take down the shard.

Generated-By: PostHog Code
Task-Id: a1816771-6c84-433d-a970-aa79851c8472
@frankh frankh force-pushed the posthog-code/django-file-level-sharding branch from d7e81bb to 74b6f7e Compare June 18, 2026 14:05
@tests-posthog

tests-posthog Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

⏭️ Skipped snapshot commit because branch advanced to 74b6f7e while workflow was testing d7e81bb.

The new commit will trigger its own snapshot update workflow.

If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:

  • Merge master into your branch, or
  • Push an empty commit: git commit --allow-empty -m 'trigger CI' && git push

@tests-posthog

tests-posthog Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Query snapshots: Backend query snapshots updated

Changes: 1 snapshots (1 modified, 0 added, 0 deleted)

What this means:

  • Query snapshots have been automatically updated to match current output
  • These changes reflect modifications to database queries or schema

Next steps:

  • Review the query changes to ensure they're intentional
  • If unexpected, investigate what caused the query to change

Review snapshot changes →

@gantoine gantoine left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this conflict with #64461?

@frankh

frankh commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

ah annoyingly yes, probably. Maybe i'll revisit later

@frankh frankh closed this Jun 18, 2026
@webjunkie

Copy link
Copy Markdown
Contributor

My PR now balances the shards almost perfectly 😊

Is the ~88s number how long it takes pytest to collect tests? That's not something I came across yet 😱
I'll validate this on my side.

@webjunkie

Copy link
Copy Markdown
Contributor

Validated the collection-cost diagnosis locally (flox env), and it holds — collection is import-bound and dwarfs DB setup:

  • full-tree --collect-only: ~20s warm / ~42s cold. The 88s is just the cold end on a slower box — it's pure import cost, so it swings ~2x on .pyc/FS cache, which is exactly why CI runs the high end.
  • single file: ~3.3s — the fixed django-boot + conftest floor.
  • DB setup: ~3s, as you measured.

-X importtime breakdown of the 22s warm run (16.6s captured as import self-time, ~6s is pytest's own item-building):

  • first-party 7.6s / third-party 9.0s.
  • posthog.schema alone: 2.0s — one generated pydantic module, imported by ~everything.
  • ee.hogai 2.2s, then the data-science/DW stack (scipy/sklearn/pandas/databricks/dlt) — pulled in only by specific test files.

That last point splits the problem in two, and they need opposite fixes:

  1. Per-shard tail (ee.hogai, databricks, dlt, scipy-stack): imported only because some file in the full tree pulls them. This is what file-level sharding removes — your ~15x is real.
  2. Shared floor (posthog.schema ~2s, django boot ~2.5s): paid by every shard's own files regardless of slicing. File-sharding can't touch it; posthog.schema alone is ~2s × 52 shards ≈ ~100s aggregate. Separate surgical fix (split the generated file / defer the pydantic build) — tracking that on its own.

Plan, since this overlaps #64461: rather than a standalone GHA script, fold file-level sharding into the pytest-split fork as a granularity axis that composes with optimal_chunks:

  • --split-granularity=file (new) × --splitting-algorithm=optimal_chunks (existing) — the same makespan search runs over file weights, and the plugin uses pytest_ignore_collect to skip files outside the group before they're imported. pytest-split today deselects in pytest_collection_modifyitems, which is post-import — too late to save collection. Moving the cut to ignore_collect is the whole trick.
  • Keeps the workflow line as plain --splits/--group, retires shard_django_files.py, and is upstreamable to jerry-git as a general feature (not posthog-specific CI glue).
  • Oversized single file → falls back to item-level deselection within its dedicated shard (same edge you handled).

So #64499's idea is the right lever — just moving it down a layer into the splitter so it's one mechanism, default-off, and order-safe by construction. Starting on it now; will ping when there's something to look at.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-ci-backend Force ci-backend's full test matrices to run even on a draft PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants