Skip to content

feat: Multistep scheduler#37

Draft
define-null wants to merge 250 commits into
masterfrom
defnull/net-681-multi-step-scheduling-2
Draft

feat: Multistep scheduler#37
define-null wants to merge 250 commits into
masterfrom
defnull/net-681-multi-step-scheduling-2

Conversation

@define-null

@define-null define-null commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

What is this PR about?

Goal: make it possible to replace or correct data chunks in the network without breaking availability or consistency for portals reading them. Today the scheduler has no safe way to swap a chunk: a portal mid-query could see the old version vanish before the new one is ready. This branch builds the foundation for that — an MVCC (multi-version) chunk lifecycle — plus a smarter scheduling algorithm, and a large property-based test harness that validates both. Everything is behind the mvcc-chunks feature flag (off by default), so production builds are unaffected

MVCC chunk lifecycle (src/scheduler_storage/, migrations/)

A new storage layer that tracks every chunk through a versioned lifecycle instead of a single mutable assignment:

  • Two kinds of published state: the worker assignment (everything workers must hold) and portal assignments (point-in-time snapshots portals read from). Workers keep serving old snapshots for a grace window (the "M window") so slow portals never see data disappear under them.
  • A confirmation watermark: data is only dropped after a quorum of workers confirms it has applied the newer assignment. Until then, outgoing copies "drain" rather than vanish.
  • Chunk corrections: a 1-to-1 swap mechanism (register a replacement → it gets placed and confirmed → the swap fires atomically in a visibility cycle → the old chunk drains out). The old row is retained for audit.
  • Two interchangeable backends behind one trait: a fast in-memory implementation and a Postgres implementation (new migrations/, transactional cycles, advisory-lock guarded).

Multi-step (reconciliation) scheduling (src/multistep_scheduler.rs)

The current algorithm computes an ideal placement from scratch, blind to what workers already hold — which can demand more data movement than the fleet can absorb. The new algorithm reconciles instead: starting from the current placement, it produces a feasible step toward the ideal — held copies are free to keep, mandatory replication ("floor") copies preempt nice-to-have ("bonus") copies, and new chunks are placed all-or-nothing so nothing lands half-replicated. Standalone for now — not yet wired into the production path.

Simulation & property-based testing (src/multistep_scheduler/sim/)

A model-driven state-machine test harness that drives the full lifecycle the way the real network would — random walks of 100–300 steps mixing chunk additions, worker joins/departures/lagging fetches, clock jumps, corrections, and replication-factor changes — then checks safety/liveness oracles after every step (no portal ever routes to a worker that lacks the chunk; floors are eventually met; drains terminate). The same walks run against both the in-memory and Postgres backends (via testcontainers), with seed-based replay, captured regressions, statistics telemetry, and CI budget knobs (SIM_IN_MEMORY_CASES/SIM_PG_CASES, set to 16/2 on CI).

Supporting changes & docs

  • schedule_with_per_worker_allocations added to the existing scheduler so placement can account for per-worker occupied bytes; small extensions to weight.rs and test utilities; new reshuffling-cost scenario tests.
  • Six design docs under docs/ (mvcc-chunks.md, mvcc-corrections.md, mvcc-schema.md, capacity-aware-scheduling.md, chunk-reshuffling.md, mvcc-worker-mappings.md) — the durable references the code comments point to.

define-null and others added 30 commits May 27, 2026 16:46
Captures the design for closing the silent-overcommit gap: charge the
full per-worker footprint, credit held copies as free, and skip (not
spill or panic) when a new replica doesn't fit, converging over cycles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tep scheduler

The reconciliation scheduler in `scheduling_net_786_multistep.rs` distinguished
the current placement by status — `current_ideal` (established replicas) vs
`current_draining` (being-removed copies) — but the algorithm only needs to know
which copies are physically on disk: every held copy is free to keep and occupies
disk alike. The split only ever mattered for the floor add-back ordering preference.

Change the `schedule` interface to take one per-chunk holder list, `current`:
- `schedule`/`schedule_to_workers` take `current: &[Vec<PeerId>]` instead of two args.
- `Reconcile` holds a single `held` list (the membership set `held_sorted` and the
  footprint charge are unchanged — they were already the union).
- `add_back_candidates` now orders simply: held copies on an ideal position first,
  then the rest (dropping the established-before-draining sub-ordering).

The simulation keeps the ideal/stale split for its own judging (convergence oracle,
per-step safety) and merges the two into `current` only at the `schedule` call sites,
via a new `merge_current` helper. Module tests updated to the single-arg interface.

Also drop the `pub mod scheduling_net_786_improved;` declaration from lib.rs: that
module file is not tracked on the base branch, so a fresh worktree cannot compile
without it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ep scheduler

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ning cycle

Replace the per-peer stale_allocations parameter of SchedulingAlgorithm::schedule
with a per-chunk current_placement (ideal ∪ stale), which strictly supersedes it.
run_scheduling_cycle now returns Result so a scheduler shortage is surfaced to the
caller instead of panicking. Make InMemoryStorage pub(crate) so it can be driven as
a SUT. Dead stale_allocations_by_peer helper removed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
define-null and others added 30 commits June 16, 2026 18:51
Rename prop_corrections_cross_dataset -> prop_corrections_in_different_datasets_dont_block:
it tests per-dataset ordering independence (each correction intra-dataset),
not a cross-dataset old->new swap — which now has a distinct meaning.

Add prop_correction_succeeds_only_within_old_dataset: prop_oneof! draws the
replacement's dataset as the old chunk's (registers) or a foreign one
(rejected with DatasetMismatch), so the PBT examines both outcomes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the composite FK (which needed a surrogate UNIQUE(chunk_pk,
dataset_id) index on chunks, redundant since chunk_pk is already the PK)
with a BEFORE INSERT/UPDATE trigger on chunk_corrections that rejects any
row whose old and new chunks disagree with dataset_id. Same DB-level
guarantee for all clients, without the extra index.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Row-level check, O(1) per write, independent of chunks table size.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tandalone PBT

Remove the standalone prop_correction_succeeds_only_within_old_dataset and
instead vary the shared guided/churn correction generator: prop_oneof! mints
the replacement in the old chunk's dataset (registers) or, occasionally, a
foreign unregistered dataset (rejected as a legal no-op). Verified the churn
sim now reports both 'correction: registered' and 'correction: rejected'.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drop FK REFERENCES clauses and the self-replace CHECK from the doc's
chunk_corrections block (the migration is authoritative); replace the
composite-FK-vs-trigger explanation with a one-line same-dataset note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- migration: drop the self-replace CHECK comment and the composite-FK
  rationale on the trigger; one-line same-dataset note instead.
- sim/utils.rs: drop the obvious SIM_CHUNK_BLOCK_SPAN comment; make the
  register_correction strategy doc concise.
- in_memory register_correction: drop the composite-FK reference.
- postgres register_correction: correct the stale composite-FK comment;
  note dataset_id is stored on the row (the trigger only validates it).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d chunk

The correction row's dataset_id is read back from the replacement insert
(RETURNING dataset_id) instead of a separate SELECT on the old chunk. The
trigger already guarantees old and new share a dataset, so sourcing it
from the old chunk was redundant and backwards. The old chunk's existence
is still checked up front (SELECT EXISTS) for a friendly rejection.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…test)]

register_correction now writes only chunks + chunk_corrections. The
replacement's sched_chunk_metadata row is created by register_new_chunks
(the standard addition flow), like any new chunk — it's still held out of
the portal until the correction fires. Drop the metadata INSERT in both
backends; have the sim's do_register_correction and the pg test helper
schedule_all run register_new_chunks so the replacement is discovered.

Also gate the trait method and its Postgres impl behind #[cfg(test)]: it
stands in for operator-driven ingestion, not the scheduler cycle flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ws.rs

The block-range <-> (first_block, last_block_delta) conversion was
inlined at four sites (two encode, two decode), repeating the delta
arithmetic and the lossy casts. Extract block_range_columns /
block_range_from_columns into rows.rs (the row<->domain conversion
module), keyed on the BlockNumber domain alias, and use them from the
insert binds, chunk_from_row, and the inspect read.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Old-chunk existence (FK on old_chunk_pk) and one-correction-per-old-chunk
(PK on old_chunk_pk) are enforced by the chunk_corrections insert itself.
Remove the redundant application-level SELECT pre-checks; a violation now
surfaces as the database's own error (StorageError::Database). The only
remaining application guard is the old-being-removed check, which no DB
constraint covers. Tests for unknown/duplicate old_pk now assert the DB
rejection.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A bare Database(_) match passes for any DB error. Add a pg_db_error helper
that digs the sqlx DatabaseError out of the anyhow chain, and assert the
exact rejecting constraint: ForeignKeyViolation (unknown old_pk),
UniqueViolation (duplicate old_pk), and SQLSTATE P0001 (the same-dataset
trigger RAISE).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nt checks

Apply the same constraint-over-application-check principle to the rest of
mod.rs:
- insert_new_chunks: drop ON CONFLICT DO NOTHING and the extra
  SELECT EXISTS classification query. A duplicate (dataset_id, chunk_id)
  now surfaces as the UNIQUE violation; a no-row unambiguously means the
  dataset name didn't resolve.
- insert_new_datasets: drop ON CONFLICT; a duplicate name surfaces as the
  UNIQUE(name) violation.
- register_correction replacement insert: same — a duplicate replacement
  surfaces as the UNIQUE violation; None means the dataset is unknown.

The existing-replacement test now asserts the UniqueViolation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
insert_new_chunks now rejects a duplicate (dataset, id) with
ChunkAlreadyExists instead of silently no-oping, mirroring the Postgres
UNIQUE constraint so the two backends agree.

The sim's insert_and_register filters keys already present (and
intra-batch repeats from the random generator) before inserting, so a
re-add stays a no-op at the harness level — a real ingester never
re-inserts an existing chunk — without feeding a duplicate to storage.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The shared Postgres container is parked in a `static OnceLock`, which Rust
never drops — so testcontainers' own Drop-based cleanup never runs and the
container leaked after every test run.

Reap it at process exit via `#[dtor]` (expands to a libc `atexit`
registration). Enable testcontainers' `watchdog` feature too, so signal
termination (Ctrl-C, nextest timeouts) — which bypasses exit hooks — is also
covered.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… from claude/propagate-chunk-block-range into defnull/net-681-multi-step-scheduling-2

## Summary

Two related storage-layer changes for the MVCC scheduler. (Correction-ordering optimization — structural readiness — is deliberately **out of scope** and handled in a follow-up PR.)

### 1. Propagate the chunk block range
Block range is a scheduling input (weight strategy → replication factor) and is emitted into the published assignment, but the Postgres path neither stored nor loaded it — `chunk_from_row` hardcoded `blocks: 0..=0`, collapsing every chunk into the first weight segment. Now stored as **`first_block BIGINT` + `last_block_delta INT`** (delta saves 4 bytes/row vs a second BIGINT; a chunk span fits in 32 bits) and wired through both backends, the inspect surface (`ChunkView.blocks`), and the sim's chunk generator.

### 2. Enforce same-dataset corrections in the DB
A correction's replacement must live in the old chunk's dataset. Enforced by a `BEFORE INSERT/UPDATE` trigger on `chunk_corrections` that rejects any row whose old/new chunks disagree with `dataset_id` — so it holds for **every client**, not just `register_correction` (which no longer coerces the replacement's dataset; the in-memory oracle mirrors the rule with a `DatasetMismatch` rejection). A `CHECK (old_chunk_pk <> new_chunk_pk)` forbids self-correction.

Trigger chosen over a composite FK to avoid a redundant `UNIQUE(chunk_pk, dataset_id)` index on the (10M-row) chunks table. Validated on a real 10M-row table: the trigger's lookup is a `chunks_pkey` index probe (~0.05 ms), run only on the rare correction-registration write.

## Tests
- Block-range round-trip through both backends (incl. `ChunkView.blocks`).
- Cross-dataset rejection (both backends); self-correction rejected by the CHECK.
- Guided/churn sim now exercises **both** succeeding and rejected corrections (foreign-dataset replacements drawn via `prop_oneof!`; confirmed via the sim's `correction: registered` / `correction: rejected` statistics).
- Renamed `prop_corrections_cross_dataset` → `prop_corrections_in_different_datasets_dont_block` (it tests per-dataset ordering independence, not a cross-dataset swap).

## Follow-up (separate PR)
Replacing the over-conservative temporal correction ordering ("no earlier pending correction in the same dataset") with a **structural** readiness check ("my old chunk is not the new_chunk_pk of any pending correction"), so independent same-dataset corrections apply concurrently.

All Postgres + in-memory storage and sim tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
* Structural correction readiness instead of temporal ordering

A correction's readiness no longer depends on created_at order within a
dataset. It fires once its replacement is confirmed and no *pending*
correction still has to produce its old chunk (no pending X->old) — the
real dependency, which is a correction chain (B->C waits for A->B because
B is A->B's replacement). Independent corrections in the same dataset now
apply in the same visibility cycle instead of serializing.

- in-memory + postgres apply_ready_corrections: structural predicate.
  Postgres resolves the pending set with an order-independent Rust
  fixpoint, so created_at is no longer load-bearing (audit only) and
  chains still collapse in one pass.
- O3 oracle: dataset_ordering -> dependency_ordering (catches a chain
  link completing before its producer; allows independent concurrency).
- Drop the now-unused chunk_corrections_pending_by_dataset index; the
  structural lookup is served by chunk_corrections_pending_by_new_chunk.
- Tests: in-memory "blocked by earlier" reframed to independent-fires;
  postgres "held by earlier" renamed to chain-link-held; added
  independent-same-dataset-fire-together (both backends). Docs updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test: drop vestigial cross-dataset independence proptest

Under structural readiness the test's only inputs (lag_a/lag_b varying
created_at) no longer affect anything, and cross-dataset independence is a
trivial case of the same-dataset independence already covered by
correction_independent_same_dataset_fires_without_waiting. Left a note in
its place.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs/comments: trim temporal-ordering asides per review

Drop the "structural, not temporal / not by created_at" editorializing from
the correction docs and the apply_ready_corrections doc-comments (in-memory +
postgres), tighten the pending-index migration comment to state what the index
serves, and remove the orphaned cross-dataset-independence note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* postgres/visibility: phases take the cycle transaction, not a bare connection

The visibility cycle is one atomic state transition — apply_ready_corrections
stamps corrections, then promote/drop/activate depend on those writes. Typing
the phase fns as `&mut PgConnection` let the signature permit a non-atomic
call; `&mut Transaction` encodes that they must run inside the cycle's tx. Call
sites already pass `&mut tx`, so they're unchanged. Post-commit read helpers
stay on `&mut PgConnection`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test: assert chain-collapse portal holds exactly chunk C

The portal-state check spot-asserted membership of A/B/C; assert the exact
visible set is {C} instead, so an undropped A/B or a stray chunk fails the test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test: assert exact portal/worker chunk sets in correction tests

Tighten the correction visibility tests (both backends) to assert the exact
visible chunk set rather than spot-checking individual membership, so a stray
or undropped chunk fails the test. In-memory: replace the per-chunk
assert_portal_visible/assert_not_portal_visible helpers with one
assert_portal_chunks_exact, and assert the worker assignment holds exactly B
after tombstone. Postgres: assert the exact portal set per cycle.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test: share assert_portal_chunks_exact across both correction suites

Hoist the exact-portal-set assertion into test_harness so the in-memory and
Postgres correction suites use one helper. Postgres now calls it everywhere it
had inline HashSet comparisons.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
* docs: reorganize multistep-scheduler design docs

Merge the three overlapping MVCC docs (mvcc-chunks, mvcc-corrections,
mvcc-worker-mappings) into a single protocol spine, mvcc-storage.md, and
add a README.md status hub that is the single source of truth for what is
built vs design-only. Drop the stale mvcc-schema-diagram.html.

- mvcc-storage.md — the protocol: two assignments, invariants, the two-gate
  model, chunk lifecycle, corrections, and deferred removal.
- mvcc-schema.md — Postgres table/column reference, grouped by write
  ownership (shared ingestion tables vs scheduler-only sched_* tables);
  kept in sync with migrations/0001_sched_tables.sql.
- capacity-aware-scheduling.md — the placement algorithm at design altitude,
  stripped of backend-specific function names and file paths.
- README.md — reading order plus a status table (built / sim-only /
  design-only) and known limitations.

Scope each claim to what is actually enforced: PG does not prevent
cross-kind id confusion (both BIGSERIAL from 1) — the two-table split only
makes it structural; a correction's same-block-range is caller discipline,
not checked, though block ranges themselves are stored and consumed by the
weight strategy. Update doc-link references in source comments to point at
the merged docs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
* test(in-memory): add chain-link-held correction test for parity with Postgres

Adds correction_chain_link_held_until_producer_fires, the one correction
scenario that existed for the Postgres backend but not in-memory: in chain
A->B->C, with C confirmed but B unconfirmed, A->B is held (B unconfirmed)
which in turn holds B->C, so only A stays portal-visible. Exercises the
structural chain-link dependency rather than temporal same-dataset ordering.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(in-memory): descriptive names in chain-link-held test

Per review: hoist the dataset literal into a `dataset` variable and rename
`a`/`w`/`*_pk` to descriptive names (chunk_a, single_worker, pk_a/pk_b/pk_c),
aligning with the Postgres sibling test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
Reviewed-on: http://localhost:3000/defnull/network-scheduler/pulls/14
Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
* refactor: move MultistepAlgorithm to scheduler_storage::algorithm

The multistep SchedulingAlgorithm adapter is the production scheduler we are
building toward, not test machinery, but it lived as a pub(super) struct inside
the #[cfg(test)] sim subtree (multistep_scheduler/sim/sut/adapter.rs) where only
the sim could reach it. Relocate it verbatim into scheduler_storage/algorithm.rs
beside DefaultSchedulingAlgorithm — the production home of SchedulingAlgorithm
impls — so other (non-test) callers and the backend test suites can use it.

Pure move, no behavior change. The multistep ScheduledChunk/SchedulingConfig are
aliased on import to avoid colliding with the single-step crate::scheduling types
already used by DefaultSchedulingAlgorithm. The sim now imports it from its new
home.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: tighten MultistepAlgorithm doc comment

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
…er (#18)

* test(in-memory): drive correction feature tests with the real scheduler

The firing/visibility correction tests fed a StaticSchedulingAlgorithm a fixed
mapping, so they proved the storage state machine handled a hand-built placement
but never that the real multistep scheduler + state machine cooperate on
corrections — leaving that to the probabilistic sim oracle alone.

Migrate the "whole-feature" tests onto the real MultistepAlgorithm (1 worker,
floor 1, reliability ignored ⇒ deterministic placement, so the assertions stay
exact and confirmation/visibility drive the outcome):
- held_until_confirmed (now also subsumes the deleted atomic_swap)
- new_chunk_not_promoted_until_correction_fires (kept separate: it uniquely
  exercises the pending-correction promote-skip guard)
- chain_collapses_in_one_cycle and prop_correction_chain
- new correction_independent_corrections_fire_together

Add a UniformWeight WeightStrategy (weight 1 per chunk) — production
DatasetsConfig and the sim weight table both panic on unconfigured chunks.

Kept on the static stub, with doc notes on why (each needs selective scheduling
the real algorithm cannot express): the 8 registration guards, duplicate_completed,
audit_row_retained, old_chunk_removed_from_worker_after_m_ticks,
chain_link_held_until_producer_fires, the asymmetric independent test, and
prop_corrections_safety.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(in-memory): descriptive names, tighter comments in correction tests

Per review: name worker assignments after what they schedule (assignment_a,
assignment_ab, ...) instead of wa/wa1/wa2, use chunk_a/pk_a builders, and trim
the migrated tests' and stub-rationale comments to the non-obvious points.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(in-memory): hoist dataset literal into a variable

Per review: the migrated real-scheduler tests now bind `let dataset = "a"` and
pass it to `chunk(...)` rather than repeating the literal, matching the existing
chain-link-held test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(in-memory): split correction tests by driver

correction_tests.rs becomes a module root holding the shared helpers
(register, corrections_safety_ok, timing constants) plus two submodules:

- machinery: drives the correction state machine through the static
  scheduling stub, where the test controls which chunks land in a cycle
  (registration guards + the "kept on the static stub" selective-placement
  cases + prop_corrections_safety).
- multistep: drives the same machinery through the real MultistepAlgorithm,
  where the scheduler decides placement.

No behavior change; tests only regrouped along the seam already documented
in the file's comments.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(in-memory): route correction assertions through StorageInspect

Read assertions in machinery.rs reached into InMemoryStorage internals
(sched_chunk_metadata, chunk_corrections) directly. Rewrite them through
the backend-agnostic inspect API (metadata_for / get_corrections) so the
assertions no longer depend on the in-memory representation. The three
removal-state guards still poke fields for *setup* — the read-only inspect
API can't express that, so it stays.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(in-memory): consolidate storage ops onto the SchedulerStorage trait

InMemoryStorage exposed both inherent methods (bare returns, no `now`) and
the SchedulerStorage trait impl that forwarded to them. The duplicate public
surface meant `storage.foo(...)` resolved to the inherent method on a concrete
type but to the trait in a generic context — a silent shadowing footgun — and
kept the in-memory tests on a different API than the sim and Postgres.

Move the five operation bodies (register_new_chunks, update_worker_set,
run_scheduling_cycle, confirm_worker_assignment, run_visibility_cycle) into the
trait impl in adapter.rs and delete the inherent versions, so there is one
entry point per operation. The struct's private mechanics stay in mod.rs.
register_correction keeps its inherent typed-error version: it returns the
typed CorrectionRejected the trait deliberately flattens to a string.

Tests now call the trait API (Result + `now`); call sites updated accordingly.
No behavior change — full mvcc-chunks suite (incl. sim PBT) green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(in-memory): fold mark_for_removal into the trait impl too

The last redundant inherent forwarder: its only caller was the adapter
wrapper (no test or backfill path used it). Move the body into the
SchedulerStorage impl and drop the inherent version and its stale
"backfill path" doc, matching the rest of the consolidation. Behavior
unchanged; in-memory storage + sim suites green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(in-memory): inline the SchedulerStorage impl into mod.rs, drop adapter.rs

The adapter file added a...---------

Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
Reviewed-on: http://localhost:3000/defnull/network-scheduler/pulls/18
Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
…xtures (#20)

* test: hoist backend-agnostic correction helpers into test_harness::fixtures

The in-memory and Postgres correction suites each carried their own copies of
the same backend-agnostic test helpers (peer/worker/chunk builders, a static
SchedulingAlgorithm stub, pk + metadata lookups, a register_correction
shorthand). Hoist single shared versions into a new
test_harness::fixtures module so the two suites stay in lock-step:

- peer, worker(seed, version), dataset, chunk(name, id_seed, size)
- pk_of<S: StorageInspect>, metadata_for<S: StorageInspect> (by-value ChunkPk)
- register<S: SchedulerStorage>
- StaticSchedulingAlgorithm

Both suites import from the shared module; the Postgres suite's semantic
string chunk ids become numeric seeds (distinct per test; the existing
UNIQUE-violation and cross-dataset cases keep their intended collisions/
non-collisions). insert_and_register_chunk now takes an id_seed since its
body used the removed make_chunk.

Backend-specific helpers (state-seeding SQL, fresh_db construction, next_id,
register_correction_int) are deliberately left per-suite. No behavior change;
full --features mvcc-chunks suite green (187 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: address review — move lookups onto StorageInspect, drop register helper

PR #20 review feedback:
- pk_of and metadata_for move from test_harness::fixtures onto the
  StorageInspect trait as provided methods (available on both backends);
  metadata_for is renamed get_chunk_metadata_by_pk. Call sites become
  storage.pk_of(&chunk) / storage.get_chunk_metadata_by_pk(pk).
- The register() shorthand is dropped. The register-calling non-prop tests
  now return anyhow::Result<()> and use ? on all fallible storage calls;
  the two proptest bodies keep .unwrap() (? can't apply to StorageError
  inside a proptest body).

fixtures.rs now holds only peer/worker/dataset/chunk/StaticSchedulingAlgorithm.
No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: rename test_harness::fixtures to test_harness::utils

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: run multistep correction scenarios against both backends via a macro

The four real-scheduler correction scenarios are now written once, generic over
a `TestStorage` fixture (`fresh()` for in-memory default vs a fresh Postgres
database, mirroring the sim). A `backend_cases!` macro stamps each out as a
`#[test]` under `multistep::in_memory` and `multistep::pg`, so the suite covers
both backends with no hand-duplicated bodies.

- `run_real_cycle` now calls `register_new_chunks()` before scheduling, which
  Postgres requires to materialise new/replacement chunk metadata (in-memory
  tolerates it — it created metadata lazily).
- `prop_correction_chain` stays in-memory only (a fresh DB per proptest case
  would be prohibitively slow; the PG sim PBT already covers the chain there).
- Retired the five overlapping static-stub Postgres twins now covered by the
  real-scheduler `multistep::pg` variants. Kept the typed-rejection guards, the
  selective `chain_link_held` (needs the static stub), and prop_pg_corrections_safety.

Net-new: Postgres gains real-scheduler correction coverage it lacked. In-memory
suite green; multistep::pg + trimmed PG suite green against a live database.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(test): quality pass over the modules touched in this PR

Behavior-preserving cleanups (in-memory + PG suites green, clippy/fmt clean):

- machinery: run_cycle now delegates to run_cycle_multi; drop the single-use
  ideal_mapping helper.
- postgres: schedule_all builds the worker HashSet once instead of per chunk;
  use AssignmentId instead of an inline path / bare i64 in schedule_all/confirm.
- postgres: extract anchor_metadata_column; set_dropped_at_portal and
  set_tombstoned become thin wrappers (identical statements, binds, panics).
- in_memory/tests/mod.rs: bare HashMap for consistency with the file's other
  collection uses.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
…ts (#22)

* refactor(sim): dedup chunk/config/fetch scaffolding in simulation tests

Three reviewed duplications in the sim test scaffolding:

- new_chunk(): one tuple->NewChunk factory replaces the four hand-rolled
  `NewChunk { .. }` literals across the chunk strategies, and the 12 inlined
  literals in the heterogeneous-sizes regression collapse to a per-test
  `chunk(seed, size, weight, dataset)` closure (matching the heavy/light idiom).
- base_config(): a shared SimConfig baseline; each regression now overrides only
  the fields its property turns on via struct-update, instead of repeating the
  full 8-field literal.
- fetch_succeeds(success, miss): the Bernoulli draw behind every fetch action,
  so the 9:1 / 1:2 success:miss ratios live in one place.

Behavior-preserving; sim + regression tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
…tal_cmp (#23)

* refactor(tests): share Assignment chunk-holder inversion; use f64::total_cmp

The "invert worker_chunks into chunk -> holders" loop was hand-rolled in four
test sites (two shapes: ordered Vec and per-chunk set). Add a test-only
`Assignment::chunk_holders(n) -> Vec<BTreeSet<PeerId>>` and route the set-shaped
sites through it:

- multistep_scheduler/tests.rs: drop the `holders` helper; call sites use the method.
- tests/scenarios.rs: replace the owners_before/after BTreeMap builds.
- tests/chunks_shuffling.rs: replace the index-keyed before/after inversion.

The ordered-Vec `chunk_to_peers` (used as mutable current placement) is left as is.

Also swap the two `partial_cmp(..).unwrap()` float sorts in scenarios.rs for
`sort_unstable_by(f64::total_cmp)` — total order, no panic path.

Behavior-preserving; affected tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
…ation) (#21)

* fix(test): resolve churn under-replication panic (oracle misclassification)

The churn under-replication regression was an oracle artifact, not a scheduler
bug: a continuing chunk whose sole holder departs mid-cycle was misclassified as
a brand-new chunk and forced to meet the first-publication floor. Fix the
classification (key presence, not an empty holder set) and capture the regression.

Also in this PR:
- Gate-A visibility monotonicity oracle (#26) and the correction-safety oracles.
- placement_oracles: plain-English rewrite; floor/retention vs adequacy split.
- sut.rs cleanup: helpers below actions, with_step_safety wrapper, reschedule_frozen
  moved into its sole test, promoted_chunks -> visible_chunks.
- Per-step scheduler status (SchedulerPlaced/NotEnoughCapacity/NoSchedulerRun) in
  the sim trace.
- FIXME: a below-floor chunk should reclaim space from a draining surplus copy
  rather than wait out the grace period.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
Resolves the scheduler-sim panic surfaced by `churn_simulation` (case 88, while validating #20). The panic turned out to be a **test-oracle misclassification**, not a real publication-atomicity break — so this PR now both pins the reproduction *and* fixes the oracle, with the regression test un-ignored and green.

## The panic
`step_floors` (`placement_oracles.rs`) panicked with `new chunk … published under-replicated: 1 copies` on a churn replay: after `min_replication` is raised 1→2 and the sole holder of a saturated, already-portal-visible chunk departs.

## Root cause (oracle, not scheduler)
`held_before` is sampled *after* `do_worker_left` deactivates and GC-evicts the departing holder, so the chunk's **active**-holder set is empty. `step_floors` classified that empty set as a *newly published* chunk and demanded the first-publication floor (0 or ≥ `min_replication`). In reality the chunk **physically pre-existed** the cycle (it was already visible at one copy) — losing its last holder under saturation is a *tolerated shortfall*, which the retention branch (floor `min(floor, 0) = 0`) already permits.

## Fix (harness-only)
- `sut.rs`: `ideal_by_pk_active` → `held_before_by_pk`; emit an entry for every chunk that physically pre-existed (`ideal ∪ stale` non-empty), value = active-filtered ideal holders (possibly empty).
- `placement_oracles.rs`: `step_floors` classifies by **key presence** alone (drop the `is_empty` filter). Absent ⇒ genuinely new ⇒ atomic `0-or-≥floor` gate **preserved**; present-but-empty ⇒ continuing chunk whose holders all departed ⇒ retention floor 0.
- `regression.rs`: drop `#[ignore]`, rewrite the doc to the resolved cause.
- Two `step_floors` unit tests pin the new contract (present-but-empty passes; absent at the same copy count still fails).

Verified: full `multistep_scheduler::sim` suite green (61 tests, both backends incl. the `churn_simulation` proptest), 26 `placement_oracles` unit tests, clippy clean.

## Out of scope (separate follow-up)
While diagnosing this, a *plausible but unverified* production gap was flagged: confirmed routing (`sched_confirmed_chunk_workers`) is not scrubbed when a worker departs the registry, so portals could route an already-visible chunk to a departed/GC-evicted worker. That lives on the routing plane — invisible to this physical-presence oracle — and is **not** addressed here. Worth its own issue + a routing-plane assertion.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
* fix harness

---------

Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
Reaping the shared Postgres testcontainer at exit could abort the process.

The `#[dtor]` runs after `main`, when the main thread's TLS is already gone.
tokio's `block_on` parks via `std::thread::current()`, which panics there — and
a panic in a `#[dtor]` aborts. Fix: run `container.rm()` on a fresh thread (intact
TLS), still inside the tokio runtime since `rm()` needs the reactor.

Test-harness only, one file.

Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
* feat(scheduler): non-overlap enforcement + 1:1 same-range corrections (mvcc-chunks)

Guarantee a published portal assignment never contains two chunks with overlapping
block ranges within one dataset, and make corrections strict 1:1 same-range swaps.
Both backends (in-memory + Postgres) decide identically; the simulation cross-checks
them. Gated behind the `mvcc-chunks` feature.

Non-overlap, two layers (shared resolver; lower (first_block, chunk_pk) wins):
- Registration (primary): register_new_chunks refuses a new chunk whose range
  overlaps a live chunk in its dataset (or another in the same batch). The loser gets
  a terminal `rejected` row — never scheduled, replicated, or re-evaluated.
- Promotion (backstop): the visibility-cycle gate refuses to promote a chunk that
  would overlap the surviving-visible set. Should-never-fire once registration does
  its job.

Corrections (1:1 same-range):
- register_correction rejects a range-changing replacement, and a correction whose
  old chunk is rejected or already being removed — before any insert.
- A same-range replacement is exempt from the registration overlap check (it overlaps
  only the chunk it supersedes); register_new_chunks backstops the invariant.
- Both backends return typed rejections (CorrectionRejected / ChunkAlreadyExists) for
  the same inputs, classifying the duplicate / existing-replacement DB violations.

Storage / schema:
- sched_chunk_metadata.rejected column; chunks(dataset_id, first_block+last_block_delta)
  index backing the indexed Postgres overlap probe.
- registration_rejected / promotion_held_back metrics, per dataset (by name).

In-memory model:
- Lifecycle-state predicates on SchedulerChunkMetadata, split into threshold-reached
  vs current-state families and named after docs/mvcc-storage.md; they dedup the
  table-scan filters.

Simulation:
- Generators stay dumb: any chunk is a correction target and every add goes through
  the storage — the machinery rejects bad ones as legal no-ops (panicking only on a
  broken contract).
- Chunk block ranges are generated at transition time; corrections inherit the old
  chunk's range.

Docs: new nonoverlap-promotion-gate.md; mvcc-storage / mvcc-schema / README updated
for the rejected state and same-range corrections; consistent chunk-state terminology
across docs and code.

Full suite green (in-memory, Postgres, cross-backend churn/guided sims); clippy + fmt
clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
## What

Guarantees a published **portal assignment** never contains two chunks with overlapping block ranges
within one dataset. Enforced at **two layers** (both backends, behind `mvcc-chunks`):

1. **Registration (primary).** `register_new_chunks` refuses a *new* chunk whose `[first_block,
   last_block]` overlaps a live chunk in the same dataset (or another new chunk in the same batch).
   The loser gets a terminal `sched_chunk_metadata.rejected` marker → never scheduled, never
   replicated to workers, never re-evaluated. This is where "don't replicate two overlapping chunks"
   is enforced, *before* a doomed chunk costs a download.
2. **Promotion (backstop).** The visibility-cycle gate still refuses to promote a chunk that would
   overlap the dataset's surviving-visible set — the hard guarantee that the *published* assignment
   is non-overlapping, regardless of how a chunk reached promotion. A should-never-fire alarm given
   registration does its job.

Both layers run the **same shared resolver** (`overlap::select_non_overlapping`), so the two backends
decide identically (the simulation cross-checks them). Conflict resolution is deterministic: among
overlapping candidates the lower `(first_block, chunk_pk)` wins.

## Corrections

Corrections are **1-to-1 same-range** swaps; a same-range replacement can't introduce overlap, so the
existing atomic swap suffices and a correction's replacement is **exempt** from the registration
check (it overlaps the old chunk it supersedes by design). **Out of scope (documented):**
range-changing / re-partitioning corrections — the gate still refuses to publish overlap, but such a
correction won't complete cleanly (it can stall / leave a gap). A rejected chunk is terminal (no
self-heal).

## Behaviour (tested, both backends)

| Scenario | Outcome |
|---|---|
| Two overlapping new chunks (ingest) | lower wins; the other **rejected at registration** (terminal, never replicated) |
| Rejected duplicate, winner later removed | does **not** self-heal; freed range needs a fresh registration |
| Same-range correction `A→B` | `B` promotes, `A` drops atomically (`B` exempt at registration) |
| Chain `A→B→C` | collapses in one cycle, only `C` visible |
| Range-changing correction overlapping a neighbour | out of scope; gate refuses overlap (replacement held, gap, no overlap) |
| Draining chunk (M-tick window) | not in the comparison set |

## Implementation

- `sched_chunk_metadata.rejected` column (in-memory: a bool field).
- `overlap::select_non_overlapping` shared resolver (accepted + held-back, deterministic by
  `(first_block, chunk_pk)`, `O(log n)` neighbour probe; `chunks(dataset_id, first_block)` index).
- `register_new_chunks` rewritten in both backends; scheduling input excludes rejected chunks.
- `registration_rejected` + `promotion_held_back` metrics; rejections/held-backs are logged.
- Docs: `docs/nonoverlap-promotion-gate.md`, README, and the `register_correction` docstring.

## Status

Full suite green — **201 passed, 0 failed** (Postgres container + sim), `cargo clippy` + `cargo fmt`
clean. Rebased onto the current base; single commit.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
* feat(reshuffle-sim): add multistep scheduler + Postgres backend

Adds a `--scheduler multistep` mode to reshuffle-sim that drives the
placement-aware multistep scheduler over the Postgres-backed storage
lifecycle, alongside the existing stateless path (still the default).

With no `--database-url` it starts an ephemeral Postgres container;
otherwise it connects to (and migrates) the given database.

Library: expose `scheduler_storage::postgres` and un-gate the two
seeding methods (`insert_new_datasets`, `insert_new_chunks`) from
test-only to the `mvcc-chunks` feature so the tool can ingest chunks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(reshuffle-sim): plainer comments; avoid baseline chunk clone

Post-review cleanup: simplify wording in the multistep driver and CLI
help (drop internal jargon), and move the baseline chunk vec into the
insert instead of cloning it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(reshuffle-sim): use tracing for status logs with timestamps

Replace the status/progress eprintln! calls with tracing macros and
init a tracing-subscriber (timestamps on, level via RUST_LOG, default
info). The metrics table stays on stdout via println!, so logs and
report output don't interleave.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(reshuffle-sim): one cycle per step, confirm after each run

Drop the settle-to-fixed-point loop: each step now runs a single
scheduling cycle and confirms the assignment right after the run, so
the metrics capture the movement one cycle causes rather than a settled
end state. Removes the MAX_SETTLE cap and the assignment-equality check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(reshuffle-sim): share one step loop across both schedulers

Introduce a `StepScheduler` trait (`step` returns a common `StepPlacement`)
and a single `run_simulation` loop that owns chunk generation, diffing,
printing, and metric collection. The stateless and multistep paths now
differ only in their `step` implementation — the part that actually runs
a cycle — and in their per-path setup.

Also change `generate_new_chunks` to take `&mut [DatasetInfo]` (clears a
clippy ptr_arg warning) and drop the now-unused public helpers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(reshuffle-sim): one entry path; algorithm chosen by a match

Both schedulers now run through the same code in `main`: a match builds
a `Box<dyn StepScheduler>`, then a single `run_simulation` call drives
it. The separate `run_stateless` / `multistep::run` orchestrators are
gone.

To make this work the multistep scheduler now owns its `Backend`
(instead of borrowing one held by the caller), built via
`MultistepScheduler::build`. The `StepScheduler` trait gained
`initial_owners` and `total_capacity_bytes`, so the shared loop needs
nothing scheduler-specific passed in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(reshuffle-sim): trim comments, prose, and the smoke test

Remove the ignored Docker-only smoke test and tighten doc comments,
inline comments, and the README to cut the diff.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(scheduler): surface replication_by_weight from SchedulingAlgorithm

The multistep/stateless schedulers already compute replication_by_weight
on the Assignment, but the SchedulingAlgorithm adapter dropped it and the
reshuffle-sim tool re-derived it from the published ideal∪stale holder
counts — which double-counts draining copies and re-couples the tool to
prepare_chunks/weight defaults.

Have SchedulingAlgorithm::schedule return a ScheduleOutput { mapping,
replication_by_weight }, carry the map onto WorkerAssignment, and read it
in the tool. The reported factors are now the scheduler's chosen (ideal)
replication, matching the stateless path and excluding transient drains.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(scheduler): trim ScheduleOutput doc comment

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(reshuffle-sim): drop two comments per review

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(reshuffle-sim): split provision; shorten Backend type; drop comment

Split provision() into existing_database()/ephemeral_database() sharing a
connect_migrated() helper; the caller picks via a match. Import
ContainerAsync/Postgres so the Backend field type is short. Drop the
GC_TICKS comment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(reshuffle-sim): move StatelessScheduler to its own module

Mirror the multistep path: StatelessScheduler (and its scheduling /
assignment-diff helpers) now live in stateless.rs, leaving simulation.rs
with just the shared loop, metrics, and chunk generation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(scheduler): idiomatic cleanups (map_err, destructure, drop)

- m...---------

Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
## What

Adds a `--scheduler multistep` mode to `reshuffle-sim` that runs the placement-aware multistep scheduler over the Postgres-backed storage lifecycle. The existing stateless scheduler stays the default, so both can be run for comparison.

- With no `--database-url`, an ephemeral Postgres container is started (needs Docker).
- With `--database-url`, it connects to the given database and migrates it (the database must be empty).

## Why

The multistep scheduler and its Postgres storage existed only behind the `mvcc-chunks` feature and were driven only by test code. `reshuffle-sim` could measure data movement for the stateless scheduler but not for the multistep one. This wires it up so the reshuffle cost of the two can be compared on the same input.

## Changes

**Library (minimal exposure):**
- Make `scheduler_storage::postgres` public so `PostgresStorage` is reachable from the tool.
- Un-gate `insert_new_datasets` / `insert_new_chunks` from `#[cfg(test)]` to the `mvcc-chunks` feature (trait + Postgres impl) so the tool can seed chunks. `register_correction` stays test-only.

**Tool:**
- New `multistep.rs` driver: provisions Postgres, seeds datasets/workers/baseline chunks, then runs scheduling/visibility/confirmation cycles on a logical clock until each step reaches a drained fixed point, and diffs holder sets. Reuses the existing metrics/report code via a shared `assemble_metrics`.
- `main.rs`: `--scheduler {stateless,multistep}` and `--database-url`.
- An `#[ignore]`d end-to-end smoke test (needs Docker); confirms new chunks are added with no reshuffling.

## Notes

- The multistep baseline is the scheduler's own converged placement of the input chunks, not the input file's worker indexes, so per-step movement is internally consistent but not directly comparable in absolute terms to the stateless path's first step.
- Chunks are ingested one row per INSERT, so large inputs are slow against Postgres; prefer smaller `--chunks-per-step`/`--steps` when exploring this path.

## Testing

- `cargo build` (workspace), `cargo build --features mvcc-chunks --tests`, `cargo fmt --check` — all clean.
- `cargo test --features mvcc-chunks` in-memory (69) and algorithm tests pass.
- Multistep smoke test passes against an ephemeral Postgres.

Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
* test(sim): capture pre-existing PG-only worker overcommit

A rare pg_guided_simulation failure ("overcommit: worker holds 11534336 >
capacity 10485760" — 11 copies on a 10-copy worker). Captured the shrunk
sequence (SIM_CASE_SEED=86415892...) two ways:

- regression::pg_guided_overcommit_capture — in-memory replay, PASSES,
  confirming the bug is Postgres-specific (placement-input ordering), not
  the shared algorithm.
- pg_tests::pg_guided_overcommit_capture_pg — Postgres replay, reproduces it
  deterministically; #[ignore]d as it pins an unfixed bug.

Independent of the promotion-probe exemption: worker placement
(fetch_active_chunks) never reads applied_at_portal_assignment_id.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(sim): trim prose on the captured overcommit regressions

Per review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(sim): make the overcommit regression a PG replay in regression.rs

Per review: the in-memory replay isn't the actual regression (it passes), so
drop it and move the Postgres reproducer into regression.rs as
pg_guided_overcommit_capture_pg (driven via init_test, #[ignore]d). The
in-memory clean run is noted in the doc rather than kept as a test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(sim): drop redundant comment on the overcommit replay loop

Per review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
Captures a rare, **pre-existing** `pg_guided_simulation` failure as deterministic regression tests. Branched off the base; independent of the #2 promotion-probe PR.

## The bug
`step_safety` panics with `overcommit: worker WorkerPk(2) holds 11534336 > capacity 10485760` — the **Postgres** backend places 11 copies on a worker with room for 10. Rare under random seeds (~1 in several hundred cases), but the shrunk action sequence reproduces it deterministically (5/5).

## It is Postgres-specific, not the algorithm
- `regression::pg_guided_overcommit_capture` — replays the sequence on the **in-memory oracle**: it **passes**, so the shared placement algorithm is correct. The divergence is in the PG path (placement-input ordering).
- `pg_tests::pg_guided_overcommit_capture_pg` — replays on **Postgres**: reproduces the overcommit. `#[ignore]`d because it pins an **unfixed** bug (so CI stays green); run with `cargo test … -- --ignored pg_guided_overcommit_capture_pg`.

Verified on this branch (no #2 present): in-memory passes, PG reproduces — confirming it's independent of the promotion-probe exemption. (It also can't be: worker placement `fetch_active_chunks` never reads `applied_at_portal_assignment_id`.)

Seed: `SIM_CASE_SEED=86415892433a952109298d1aec73e1da062112c371412cf3f0d3f9f88151cf94`.

## Follow-up
The PG placement overcommit still needs a real fix (likely deterministic ordering / capacity-accounting parity with the in-memory backend). Until then, `pg_guided_simulation` remains rarely flaky on this seed class.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
* test(sim): add zero-quorum scheduler simulation model

A guided-style property test that runs the multistep scheduler at a 0%
confirmation quorum: no worker has to confirm an assignment, so the
confirmation watermark tracks the latest published assignment and portal
promotion (plus drain activation) lands the same cycle instead of waiting
on a fetch. Workers and the portal still poll — those fetches feed the
observation oracles — they just no longer gate confirmation.

ZeroQuorumModel reuses the guided walk verbatim, overriding only init_state
to pin confirm_threshold_pct=0. The SUT reacts to a 0% quorum in
refresh_confirmation (watermark -> latest assignment id), run_cycle (confirm
before the visibility pass), and lagging_worker_indexes (no designated
stragglers).

Portal consistency is now uniform across quorums, not special-cased for 0:
only a full (100%) quorum owes the hard guarantee that every routed chunk is
held, so the oracle is fatal only there. Below 100% — a 70-99% quorum lagging
a straggler, or the 0% extreme — the scheduler can route ahead of confirmation,
so a query can legitimately miss; those sub-quorum runs only measure how many
routings would miss (portal_consistency_misses) and never fail. Zero quorum is
simply where that count runs highest. Wired for in-memory and Postgres.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
## What

Adds a zero-quorum variant of the guided scheduler simulation: a property test that runs the multistep scheduler at confirm_threshold_pct = 0, i.e. no worker has to confirm an assignment. The confirmation watermark tracks the latest published assignment, so portal promotion (and drain activation) lands the same cycle instead of waiting on a fetch. Workers and the portal still poll on their own cadence — those fetches feed the observation oracles — they just no longer gate confirmation.

## How

- New model sim/sut/zero_quorum.rs (ZeroQuorumModel): reuses the guided walk verbatim, overriding only init_state to pin confirm_threshold_pct = 0.
- The behaviour change lives in the SUT reaction to a 0% quorum:
  - refresh_confirmation: watermark jumps straight to the latest published assignment id.
  - run_cycle: confirms before the visibility pass so promotion is eager. The observed fleet is deliberately NOT caught up — workers lag on their poll cadence.
  - lagging_worker_indexes: no designated stragglers at 0%.
- Because promotion outruns confirmation, the portal can route a freshly promoted chunk before any worker holds it. That is the documented cost of skipping confirmation, not a bug, so at a 0% quorum the portal-consistency oracle MEASURES these would-miss routings (placement_oracles::portal_consistency_misses) instead of failing. Every structural oracle — per-step safety (no overcommit, retention floor, atomic publication), published coverage, floor convergence, corrections — stays fatal.
- Wired zero_quorum_simulation + _case for both in-memory and Postgres backends.

## No production changes

Nothing in the scheduler algorithm or storage backends was touched — only the test harness. The regime degrades exactly the portal-routing-to-unsynced-workers property and nothing else; all structural invariants hold.

## Verification

- in-memory zero_quorum_simulation: green over 256 / 128 / repeated 64-case sweeps.
- Telemetry confirms worker-fetch (~18.6%) and portal-fetch (~18.4%) still happen.
- fmt + clippy clean.
- Postgres variants are wired but not run here (need Docker).

## Notes for review

1. The accountable predicate in assert_portal_consistency reads `last_applied >= watermark` (watermark-scoped), which also governs the existing guided/churn fatal path. Under that scoping the suppressed-miss count at 0% quorum is ~0 (a handful of residual edge cases); with the broader `last_applied > 0` scoping it was ~17% of checks (mean ~1, max 27). Say which scoping you want the metric to use.
2. Separately: I saw a one-off, non-reproducible portal-consistency failure in guided/churn at confirm_threshold_pct 82 (a normal quorum, not zero) during a full-suite run. Not reproducible from per-case or whole-run seed, did not recur in ~15 later runs, and independent of this change (which only touches the threshold == 0 paths). Looks like a rare pre-existing nondeterministic flake on this base — worth a separate look.

Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants