feat: Multistep scheduler by define-null · Pull Request #37 · subsquid/network-scheduler

define-null · 2026-06-12T22:58:09Z

What is this PR about?

Goal: make it possible to replace or correct data chunks in the network without breaking availability or consistency for portals reading them. Today the scheduler has no safe way to swap a chunk: a portal mid-query could see the old version vanish before the new one is ready. This branch builds the foundation for that, an MVCC (multi-version) chunk lifecycle, plus a smarter scheduling algorithm and a large property-based test harness that validates both. Everything is behind the mvcc-chunks feature flag (off by default), so production builds are unaffected.

MVCC chunk lifecycle (src/scheduler_storage/, migrations/)

A new storage layer that tracks every chunk through a versioned lifecycle instead of a single mutable assignment:

- Two kinds of published state: the worker assignment (everything workers must hold) and portal assignments (point-in-time snapshots portals read from). Workers keep serving old snapshots for a grace window (the "M window") so slow portals never see data disappear under them.
- A confirmation watermark: data is only dropped after a quorum of workers confirms it has applied the newer assignment. Until then, outgoing copies "drain" rather than vanish.
- Chunk corrections: a 1-to-1 swap mechanism (register a replacement → it gets placed and confirmed → the swap fires atomically in a visibility cycle → the old chunk drains out). The old row is retained for audit.
- Two interchangeable backends behind one trait: a fast in-memory implementation and a Postgres implementation (new migrations/, transactional cycles, advisory-lock guarded).
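The confirmation watermark can be pictured with a small sketch. It assumes each worker reports the newest assignment it has applied and that applying assignment N implies having applied everything older; `AssignmentId`, `confirmation_watermark`, and `may_drop` are illustrative names for this sketch, not the PR's API.

```rust
/// Identifier of a published worker assignment; higher means newer.
/// (Stand-in alias for this sketch.)
type AssignmentId = i64;

/// Returns the newest assignment that at least `quorum` workers have confirmed.
///
/// `confirmed` holds, per worker, the newest assignment that worker reports
/// having applied. Assumption: a worker that applied assignment N has also
/// applied every assignment older than N.
fn confirmation_watermark(confirmed: &[AssignmentId], quorum: usize) -> Option<AssignmentId> {
    if quorum == 0 || confirmed.len() < quorum {
        return None;
    }
    let mut sorted = confirmed.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // newest first
    // The quorum-th newest value is confirmed by at least `quorum` workers.
    Some(sorted[quorum - 1])
}

/// A copy superseded by assignment `superseded_by` may be dropped only once
/// the watermark has reached that assignment; until then it keeps draining.
fn may_drop(superseded_by: AssignmentId, watermark: Option<AssignmentId>) -> bool {
    matches!(watermark, Some(w) if w >= superseded_by)
}
```

With four workers reporting 5, 3, 4, 1 and a quorum of three, the watermark is 3, so a copy superseded by assignment 4 would still be draining.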

Multi-step (reconciliation) scheduling (src/multistep_scheduler.rs)

The current algorithm computes an ideal placement from scratch, blind to what workers already hold — which can demand more data movement than the fleet can absorb. The new algorithm reconciles instead: starting from the current placement, it produces a feasible step toward the ideal — held copies are free to keep, mandatory replication ("floor") copies preempt nice-to-have ("bonus") copies, and new chunks are placed all-or-nothing so nothing lands half-replicated. Standalone for now — not yet wired into the production path.
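A minimal sketch of the all-or-nothing rule for new chunks, under simplifying assumptions: `free_bytes` is what remains after held copies have been charged, and candidates are picked emptiest-first. The `Worker` struct, the function name, and the selection heuristic are hypothetical and only illustrate the idea; the real algorithm in src/multistep_scheduler.rs also handles floor versus bonus preemption.

```rust
/// Hypothetical per-worker view: bytes still free after charging held copies.
#[derive(Debug, Clone, PartialEq)]
struct Worker {
    id: u32,
    free_bytes: u64,
}

/// Tries to place `floor` copies of a new chunk of `size` bytes.
/// Either all `floor` copies are placed (and capacity is charged), or nothing
/// changes and `None` is returned so a later cycle can retry.
fn place_all_or_nothing(workers: &mut [Worker], size: u64, floor: usize) -> Option<Vec<u32>> {
    // Indices of workers that can fit the chunk.
    let mut candidates: Vec<usize> = (0..workers.len())
        .filter(|&i| workers[i].free_bytes >= size)
        .collect();
    if candidates.len() < floor {
        return None; // skip: never leave the chunk half-replicated
    }
    // Assumed heuristic: emptiest workers first, ties broken by id.
    candidates.sort_by(|&a, &b| {
        workers[b]
            .free_bytes
            .cmp(&workers[a].free_bytes)
            .then(workers[a].id.cmp(&workers[b].id))
    });
    let chosen: Vec<usize> = candidates.into_iter().take(floor).collect();
    for &i in &chosen {
        workers[i].free_bytes -= size;
    }
    Some(chosen.iter().map(|&i| workers[i].id).collect())
}
```

Returning `None` without touching any worker is the point of the sketch: a chunk that cannot reach its floor in this step is skipped and retried later rather than landing half-replicated.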

Simulation & property-based testing (src/multistep_scheduler/sim/)

A model-driven state-machine test harness that drives the full lifecycle the way the real network would — random walks of 100–300 steps mixing chunk additions, worker joins/departures/lagging fetches, clock jumps, corrections, and replication-factor changes — then checks safety/liveness oracles after every step (no portal ever routes to a worker that lacks the chunk; floors are eventually met; drains terminate). The same walks run against both the in-memory and Postgres backends (via testcontainers), with seed-based replay, captured regressions, statistics telemetry, and CI budget knobs (SIM_IN_MEMORY_CASES/SIM_PG_CASES, set to 16/2 on CI).
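One of the safety oracles above (no portal routes to a worker that lacks the chunk) could look roughly like the following check, run after every step. The map shapes and the `ChunkId`/`WorkerId` aliases are assumptions made for this sketch; the harness's actual model types are not shown in this PR description.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Stand-in identifiers for this sketch.
type ChunkId = u32;
type WorkerId = u32;

/// Safety oracle: every (chunk -> worker) route in a portal snapshot must
/// point at a worker that actually holds the chunk.
/// Returns the first violating route, if any.
fn first_unsafe_route(
    portal_routes: &BTreeMap<ChunkId, BTreeSet<WorkerId>>,
    held: &BTreeMap<WorkerId, BTreeSet<ChunkId>>,
) -> Option<(ChunkId, WorkerId)> {
    for (&chunk, workers) in portal_routes {
        for &worker in workers {
            // An unknown worker counts as not holding the chunk.
            let holds = held
                .get(&worker)
                .map_or(false, |chunks| chunks.contains(&chunk));
            if !holds {
                return Some((chunk, worker));
            }
        }
    }
    None
}
```

Returning the offending pair rather than a bare boolean makes a failing random walk easier to diagnose when it is replayed from its seed.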

Supporting changes & docs

schedule_with_per_worker_allocations added to the existing scheduler so placement can account for per-worker occupied bytes; small extensions to weight.rs and test utilities; new reshuffling-cost scenario tests.
Six design docs under docs/ (mvcc-chunks.md, mvcc-corrections.md, mvcc-schema.md, capacity-aware-scheduling.md, chunk-reshuffling.md, mvcc-worker-mappings.md) — the durable references the code comments point to.

…tics cases

Captures the design for closing the silent-overcommit gap: charge the full per-worker footprint, credit held copies as free, and skip (not spill or panic) when a new replica doesn't fit, converging over cycles. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…tep scheduler

The reconciliation scheduler in `scheduling_net_786_multistep.rs` distinguished the current placement by status, `current_ideal` (established replicas) vs `current_draining` (being-removed copies), but the algorithm only needs to know which copies are physically on disk: every held copy is free to keep and occupies disk alike. The split only ever mattered for the floor add-back ordering preference.

Change the `schedule` interface to take one per-chunk holder list, `current`:

- `schedule`/`schedule_to_workers` take `current: &[Vec<PeerId>]` instead of two args.
- `Reconcile` holds a single `held` list (the membership set `held_sorted` and the footprint charge are unchanged; they were already the union).
- `add_back_candidates` now orders simply: held copies on an ideal position first, then the rest (dropping the established-before-draining sub-ordering).

The simulation keeps the ideal/stale split for its own judging (convergence oracle, per-step safety) and merges the two into `current` only at the `schedule` call sites, via a new `merge_current` helper. Module tests updated to the single-arg interface.

Also drop the `pub mod scheduling_net_786_improved;` declaration from lib.rs: that module file is not tracked on the base branch, so a fresh worktree cannot compile without it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ep scheduler Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ning cycle Replace the per-peer stale_allocations parameter of SchedulingAlgorithm::schedule with a per-chunk current_placement (ideal ∪ stale), which strictly supersedes it. run_scheduling_cycle now returns Result so a scheduler shortage is surfaced to the caller instead of panicking. Make InMemoryStorage pub(crate) so it can be driven as a SUT. Dead stale_allocations_by_peer helper removed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Rename prop_corrections_cross_dataset -> prop_corrections_in_different_datasets_dont_block: it tests per-dataset ordering independence (each correction intra-dataset), not a cross-dataset old->new swap — which now has a distinct meaning. Add prop_correction_succeeds_only_within_old_dataset: prop_oneof! draws the replacement's dataset as the old chunk's (registers) or a foreign one (rejected with DatasetMismatch), so the PBT examines both outcomes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Replace the composite FK (which needed a surrogate UNIQUE(chunk_pk, dataset_id) index on chunks, redundant since chunk_pk is already the PK) with a BEFORE INSERT/UPDATE trigger on chunk_corrections that rejects any row whose old and new chunks disagree with dataset_id. Same DB-level guarantee for all clients, without the extra index. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Row-level check, O(1) per write, independent of chunks table size. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…tandalone PBT Remove the standalone prop_correction_succeeds_only_within_old_dataset and instead vary the shared guided/churn correction generator: prop_oneof! mints the replacement in the old chunk's dataset (registers) or, occasionally, a foreign unregistered dataset (rejected as a legal no-op). Verified the churn sim now reports both 'correction: registered' and 'correction: rejected'. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Drop FK REFERENCES clauses and the self-replace CHECK from the doc's chunk_corrections block (the migration is authoritative); replace the composite-FK-vs-trigger explanation with a one-line same-dataset note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- migration: drop the self-replace CHECK comment and the composite-FK rationale on the trigger; one-line same-dataset note instead. - sim/utils.rs: drop the obvious SIM_CHUNK_BLOCK_SPAN comment; make the register_correction strategy doc concise. - in_memory register_correction: drop the composite-FK reference. - postgres register_correction: correct the stale composite-FK comment; note dataset_id is stored on the row (the trigger only validates it). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…d chunk The correction row's dataset_id is read back from the replacement insert (RETURNING dataset_id) instead of a separate SELECT on the old chunk. The trigger already guarantees old and new share a dataset, so sourcing it from the old chunk was redundant and backwards. The old chunk's existence is still checked up front (SELECT EXISTS) for a friendly rejection. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…test)] register_correction now writes only chunks + chunk_corrections. The replacement's sched_chunk_metadata row is created by register_new_chunks (the standard addition flow), like any new chunk — it's still held out of the portal until the correction fires. Drop the metadata INSERT in both backends; have the sim's do_register_correction and the pg test helper schedule_all run register_new_chunks so the replacement is discovered. Also gate the trait method and its Postgres impl behind #[cfg(test)]: it stands in for operator-driven ingestion, not the scheduler cycle flow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ws.rs The block-range <-> (first_block, last_block_delta) conversion was inlined at four sites (two encode, two decode), repeating the delta arithmetic and the lossy casts. Extract block_range_columns / block_range_from_columns into rows.rs (the row<->domain conversion module), keyed on the BlockNumber domain alias, and use them from the insert binds, chunk_from_row, and the inspect read. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
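A possible shape for the two extracted helpers, shown as a sketch only. The function names come from the commit message, but the signatures, the `u64` width of `BlockNumber`, and the use of checked conversions (instead of the lossy casts the message mentions) are assumptions.

```rust
use std::ops::RangeInclusive;

// Assumed width of the domain alias; the real alias lives in the crate.
type BlockNumber = u64;

/// Encodes an inclusive block range as (first_block, last_block_delta).
/// Returns None if the range is empty or does not fit the column types
/// (BIGINT for the first block, INT for the delta).
fn block_range_columns(blocks: &RangeInclusive<BlockNumber>) -> Option<(i64, i32)> {
    let first = i64::try_from(*blocks.start()).ok()?;
    let delta = blocks.end().checked_sub(*blocks.start())?;
    Some((first, i32::try_from(delta).ok()?))
}

/// Decodes the two columns back into the inclusive block range.
/// Returns None for negative column values or on overflow.
fn block_range_from_columns(
    first_block: i64,
    last_block_delta: i32,
) -> Option<RangeInclusive<BlockNumber>> {
    let first = BlockNumber::try_from(first_block).ok()?;
    let delta = BlockNumber::try_from(last_block_delta).ok()?;
    Some(first..=first.checked_add(delta)?)
}
```

Keeping encode and decode side by side makes the round-trip property (decode after encode returns the original range) easy to test in one place.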

Old-chunk existence (FK on old_chunk_pk) and one-correction-per-old-chunk (PK on old_chunk_pk) are enforced by the chunk_corrections insert itself. Remove the redundant application-level SELECT pre-checks; a violation now surfaces as the database's own error (StorageError::Database). The only remaining application guard is the old-being-removed check, which no DB constraint covers. Tests for unknown/duplicate old_pk now assert the DB rejection. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A bare Database(_) match passes for any DB error. Add a pg_db_error helper that digs the sqlx DatabaseError out of the anyhow chain, and assert the exact rejecting constraint: ForeignKeyViolation (unknown old_pk), UniqueViolation (duplicate old_pk), and SQLSTATE P0001 (the same-dataset trigger RAISE). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
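The technique behind a helper like `pg_db_error` is walking an error's source chain and downcasting each link. The sketch below shows it with the standard library only; `DbError` and `Wrapped` are stand-ins invented for the example (the real helper works with the sqlx `DatabaseError` inside an anyhow chain, which is not reproduced here).

```rust
use std::error::Error;
use std::fmt;

/// Stand-in for a database driver error carrying a SQLSTATE code.
#[derive(Debug)]
struct DbError {
    sqlstate: &'static str,
}

impl fmt::Display for DbError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "database error, SQLSTATE {}", self.sqlstate)
    }
}

impl Error for DbError {}

/// Stand-in for an application-level error that wraps a lower-level cause.
#[derive(Debug)]
struct Wrapped {
    cause: Box<dyn Error + 'static>,
}

impl fmt::Display for Wrapped {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "storage operation failed")
    }
}

impl Error for Wrapped {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(self.cause.as_ref())
    }
}

/// Walks the source chain and returns the first `DbError` found, if any.
fn find_db_error<'a>(err: &'a (dyn Error + 'static)) -> Option<&'a DbError> {
    let mut current = Some(err);
    while let Some(e) = current {
        if let Some(db) = e.downcast_ref::<DbError>() {
            return Some(db);
        }
        current = e.source();
    }
    None
}
```

A test can then assert on the exact code, for example `23505` (unique violation) or `P0001` (a trigger `RAISE`), instead of accepting any database error.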

…nt checks Apply the same constraint-over-application-check principle to the rest of mod.rs: - insert_new_chunks: drop ON CONFLICT DO NOTHING and the extra SELECT EXISTS classification query. A duplicate (dataset_id, chunk_id) now surfaces as the UNIQUE violation; a no-row unambiguously means the dataset name didn't resolve. - insert_new_datasets: drop ON CONFLICT; a duplicate name surfaces as the UNIQUE(name) violation. - register_correction replacement insert: same — a duplicate replacement surfaces as the UNIQUE violation; None means the dataset is unknown. The existing-replacement test now asserts the UniqueViolation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

insert_new_chunks now rejects a duplicate (dataset, id) with ChunkAlreadyExists instead of silently no-oping, mirroring the Postgres UNIQUE constraint so the two backends agree. The sim's insert_and_register filters keys already present (and intra-batch repeats from the random generator) before inserting, so a re-add stays a no-op at the harness level — a real ingester never re-inserts an existing chunk — without feeding a duplicate to storage. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The shared Postgres container is parked in a `static OnceLock`, which Rust never drops — so testcontainers' own Drop-based cleanup never runs and the container leaked after every test run. Reap it at process exit via `#[dtor]` (expands to a libc `atexit` registration). Enable testcontainers' `watchdog` feature too, so signal termination (Ctrl-C, nextest timeouts) — which bypasses exit hooks — is also covered. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… from claude/propagate-chunk-block-range into defnull/net-681-multi-step-scheduling-2

## Summary

Two related storage-layer changes for the MVCC scheduler. (Correction-ordering optimization, i.e. structural readiness, is deliberately **out of scope** and handled in a follow-up PR.)

### 1. Propagate the chunk block range

Block range is a scheduling input (weight strategy → replication factor) and is emitted into the published assignment, but the Postgres path neither stored nor loaded it: `chunk_from_row` hardcoded `blocks: 0..=0`, collapsing every chunk into the first weight segment. Now stored as **`first_block BIGINT` + `last_block_delta INT`** (delta saves 4 bytes/row vs a second BIGINT; a chunk span fits in 32 bits) and wired through both backends, the inspect surface (`ChunkView.blocks`), and the sim's chunk generator.

### 2. Enforce same-dataset corrections in the DB

A correction's replacement must live in the old chunk's dataset. Enforced by a `BEFORE INSERT/UPDATE` trigger on `chunk_corrections` that rejects any row whose old/new chunks disagree with `dataset_id`, so it holds for **every client**, not just `register_correction` (which no longer coerces the replacement's dataset; the in-memory oracle mirrors the rule with a `DatasetMismatch` rejection). A `CHECK (old_chunk_pk <> new_chunk_pk)` forbids self-correction. Trigger chosen over a composite FK to avoid a redundant `UNIQUE(chunk_pk, dataset_id)` index on the (10M-row) chunks table. Validated on a real 10M-row table: the trigger's lookup is a `chunks_pkey` index probe (~0.05 ms), run only on the rare correction-registration write.

## Tests

- Block-range round-trip through both backends (incl. `ChunkView.blocks`).
- Cross-dataset rejection (both backends); self-correction rejected by the CHECK.
- Guided/churn sim now exercises **both** succeeding and rejected corrections (foreign-dataset replacements drawn via `prop_oneof!`; confirmed via the sim's `correction: registered` / `correction: rejected` statistics).
- Renamed `prop_corrections_cross_dataset` → `prop_corrections_in_different_datasets_dont_block` (it tests per-dataset ordering independence, not a cross-dataset swap).

## Follow-up (separate PR)

Replacing the over-conservative temporal correction ordering ("no earlier pending correction in the same dataset") with a **structural** readiness check ("my old chunk is not the new_chunk_pk of any pending correction"), so independent same-dataset corrections apply concurrently.

All Postgres + in-memory storage and sim tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

* Structural correction readiness instead of temporal ordering

A correction's readiness no longer depends on created_at order within a dataset. It fires once its replacement is confirmed and no *pending* correction still has to produce its old chunk (no pending X->old). That is the real dependency, which is a correction chain (B->C waits for A->B because B is A->B's replacement). Independent corrections in the same dataset now apply in the same visibility cycle instead of serializing.

- in-memory + postgres apply_ready_corrections: structural predicate. Postgres resolves the pending set with an order-independent Rust fixpoint, so created_at is no longer load-bearing (audit only) and chains still collapse in one pass.
- O3 oracle: dataset_ordering -> dependency_ordering (catches a chain link completing before its producer; allows independent concurrency).
- Drop the now-unused chunk_corrections_pending_by_dataset index; the structural lookup is served by chunk_corrections_pending_by_new_chunk.
- Tests: in-memory "blocked by earlier" reframed to independent-fires; postgres "held by earlier" renamed to chain-link-held; added independent-same-dataset-fire-together (both backends). Docs updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test: drop vestigial cross-dataset independence proptest

Under structural readiness the test's only inputs (lag_a/lag_b varying created_at) no longer affect anything, and cross-dataset independence is a trivial case of the same-dataset independence already covered by correction_independent_same_dataset_fires_without_waiting. Left a note in its place.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs/comments: trim temporal-ordering asides per review

Drop the "structural, not temporal / not by created_at" editorializing from the correction docs and the apply_ready_corrections doc-comments (in-memory + postgres), tighten the pending-index migration comment to state what the index serves, and remove the orphaned cross-dataset-independence note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* postgres/visibility: phases take the cycle transaction, not a bare connection

The visibility cycle is one atomic state transition: apply_ready_corrections stamps corrections, then promote/drop/activate depend on those writes. Typing the phase fns as `&mut PgConnection` let the signature permit a non-atomic call; `&mut Transaction` encodes that they must run inside the cycle's tx. Call sites already pass `&mut tx`, so they're unchanged. Post-commit read helpers stay on `&mut PgConnection`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test: assert chain-collapse portal holds exactly chunk C

The portal-state check spot-asserted membership of A/B/C; assert the exact visible set is {C} instead, so an undropped A/B or a stray chunk fails the test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test: assert exact portal/worker chunk sets in correction tests

Tighten the correction visibility tests (both backends) to assert the exact visible chunk set rather than spot-checking individual membership, so a stray or undropped chunk fails the test. In-memory: replace the per-chunk assert_portal_visible/assert_not_portal_visible helpers with one assert_portal_chunks_exact, and assert the worker assignment holds exactly B after tombstone. Postgres: assert the exact portal set per cycle.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test: share assert_portal_chunks_exact across both correction suites

Hoist the exact-portal-set assertion into test_harness so the in-memory and Postgres correction suites use one helper. Postgres now calls it everywhere it had inline HashSet comparisons.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
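The structural readiness rule from the first commit in this squash can be sketched as an order-independent fixpoint. This is an illustration under stated assumptions, not the code in either backend: `Correction`, `ChunkPk` as `i64`, and `ready_corrections` are names invented here, and "confirmed" is reduced to a set of replacement chunk keys.

```rust
use std::collections::BTreeSet;

// Stand-in key type for this sketch.
type ChunkPk = i64;

/// A pending correction: `old` is to be replaced by `new`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Correction {
    old: ChunkPk,
    new: ChunkPk,
}

/// Returns the corrections that fire in this cycle, in firing order.
///
/// A correction is ready when its replacement is confirmed and no
/// still-pending correction has yet to produce its old chunk. Repeating the
/// pass until nothing new fires lets a whole chain collapse in one cycle,
/// regardless of the order of `pending`.
fn ready_corrections(pending: &[Correction], confirmed: &BTreeSet<ChunkPk>) -> Vec<Correction> {
    let mut remaining: Vec<Correction> = pending.to_vec();
    let mut fired = Vec::new();
    loop {
        // Chunks that some still-pending correction has yet to produce.
        let awaited: BTreeSet<ChunkPk> = remaining.iter().map(|c| c.new).collect();
        let (ready, blocked): (Vec<Correction>, Vec<Correction>) = remaining
            .into_iter()
            .partition(|c| confirmed.contains(&c.new) && !awaited.contains(&c.old));
        remaining = blocked;
        if ready.is_empty() {
            return fired;
        }
        fired.extend(ready);
    }
}
```

For the chain A->B->C with both B and C confirmed, A->B fires in the first pass and B->C in the second, all within one call. With only C confirmed, nothing fires: A->B waits for B, and B->C waits for A->B.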

* docs: reorganize multistep-scheduler design docs Merge the three overlapping MVCC docs (mvcc-chunks, mvcc-corrections, mvcc-worker-mappings) into a single protocol spine, mvcc-storage.md, and add a README.md status hub that is the single source of truth for what is built vs design-only. Drop the stale mvcc-schema-diagram.html. - mvcc-storage.md — the protocol: two assignments, invariants, the two-gate model, chunk lifecycle, corrections, and deferred removal. - mvcc-schema.md — Postgres table/column reference, grouped by write ownership (shared ingestion tables vs scheduler-only sched_* tables); kept in sync with migrations/0001_sched_tables.sql. - capacity-aware-scheduling.md — the placement algorithm at design altitude, stripped of backend-specific function names and file paths. - README.md — reading order plus a status table (built / sim-only / design-only) and known limitations. Scope each claim to what is actually enforced: PG does not prevent cross-kind id confusion (both BIGSERIAL from 1) — the two-table split only makes it structural; a correction's same-block-range is caller discipline, not checked, though block ranges themselves are stored and consumed by the weight strategy. Update doc-link references in source comments to point at the merged docs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Defnull <879658+define-null@users.noreply.github.com> Co-authored-by: claude <claude@example.com> Co-committed-by: claude <claude@example.com>

* test(in-memory): add chain-link-held correction test for parity with Postgres Adds correction_chain_link_held_until_producer_fires, the one correction scenario that existed for the Postgres backend but not in-memory: in chain A->B->C, with C confirmed but B unconfirmed, A->B is held (B unconfirmed) which in turn holds B->C, so only A stays portal-visible. Exercises the structural chain-link dependency rather than temporal same-dataset ordering. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(in-memory): descriptive names in chain-link-held test Per review: hoist the dataset literal into a `dataset` variable and rename `a`/`w`/`*_pk` to descriptive names (chunk_a, single_worker, pk_a/pk_b/pk_c), aligning with the Postgres sibling test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Defnull <879658+define-null@users.noreply.github.com> Reviewed-on: http://localhost:3000/defnull/network-scheduler/pulls/14 Co-authored-by: claude <claude@example.com> Co-committed-by: claude <claude@example.com>

* refactor: move MultistepAlgorithm to scheduler_storage::algorithm The multistep SchedulingAlgorithm adapter is the production scheduler we are building toward, not test machinery, but it lived as a pub(super) struct inside the #[cfg(test)] sim subtree (multistep_scheduler/sim/sut/adapter.rs) where only the sim could reach it. Relocate it verbatim into scheduler_storage/algorithm.rs beside DefaultSchedulingAlgorithm — the production home of SchedulingAlgorithm impls — so other (non-test) callers and the backend test suites can use it. Pure move, no behavior change. The multistep ScheduledChunk/SchedulingConfig are aliased on import to avoid colliding with the single-step crate::scheduling types already used by DefaultSchedulingAlgorithm. The sim now imports it from its new home. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: tighten MultistepAlgorithm doc comment Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Defnull <879658+define-null@users.noreply.github.com> Co-authored-by: claude <claude@example.com> Co-committed-by: claude <claude@example.com>

…er (#18) * test(in-memory): drive correction feature tests with the real scheduler The firing/visibility correction tests fed a StaticSchedulingAlgorithm a fixed mapping, so they proved the storage state machine handled a hand-built placement but never that the real multistep scheduler + state machine cooperate on corrections — leaving that to the probabilistic sim oracle alone. Migrate the "whole-feature" tests onto the real MultistepAlgorithm (1 worker, floor 1, reliability ignored ⇒ deterministic placement, so the assertions stay exact and confirmation/visibility drive the outcome): - held_until_confirmed (now also subsumes the deleted atomic_swap) - new_chunk_not_promoted_until_correction_fires (kept separate: it uniquely exercises the pending-correction promote-skip guard) - chain_collapses_in_one_cycle and prop_correction_chain - new correction_independent_corrections_fire_together Add a UniformWeight WeightStrategy (weight 1 per chunk) — production DatasetsConfig and the sim weight table both panic on unconfigured chunks. Kept on the static stub, with doc notes on why (each needs selective scheduling the real algorithm cannot express): the 8 registration guards, duplicate_completed, audit_row_retained, old_chunk_removed_from_worker_after_m_ticks, chain_link_held_until_producer_fires, the asymmetric independent test, and prop_corrections_safety. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(in-memory): descriptive names, tighter comments in correction tests Per review: name worker assignments after what they schedule (assignment_a, assignment_ab, ...) instead of wa/wa1/wa2, use chunk_a/pk_a builders, and trim the migrated tests' and stub-rationale comments to the non-obvious points. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(in-memory): hoist dataset literal into a variable Per review: the migrated real-scheduler tests now bind `let dataset = "a"` and pass it to `chunk(...)` rather than repeating the literal, matching the existing chain-link-held test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(in-memory): split correction tests by driver correction_tests.rs becomes a module root holding the shared helpers (register, corrections_safety_ok, timing constants) plus two submodules: - machinery: drives the correction state machine through the static scheduling stub, where the test controls which chunks land in a cycle (registration guards + the "kept on the static stub" selective-placement cases + prop_corrections_safety). - multistep: drives the same machinery through the real MultistepAlgorithm, where the scheduler decides placement. No behavior change; tests only regrouped along the seam already documented in the file's comments. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(in-memory): route correction assertions through StorageInspect Read assertions in machinery.rs reached into InMemoryStorage internals (sched_chunk_metadata, chunk_corrections) directly. Rewrite them through the backend-agnostic inspect API (metadata_for / get_corrections) so the assertions no longer depend on the in-memory representation. The three removal-state guards still poke fields for *setup* — the read-only inspect API can't express that, so it stays. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(in-memory): consolidate storage ops onto the SchedulerStorage trait InMemoryStorage exposed both inherent methods (bare returns, no `now`) and the SchedulerStorage trait impl that forwarded to them. 
The duplicate public surface meant `storage.foo(...)` resolved to the inherent method on a concrete type but to the trait in a generic context — a silent shadowing footgun — and kept the in-memory tests on a different API than the sim and Postgres. Move the five operation bodies (register_new_chunks, update_worker_set, run_scheduling_cycle, confirm_worker_assignment, run_visibility_cycle) into the trait impl in adapter.rs and delete the inherent versions, so there is one entry point per operation. The struct's private mechanics stay in mod.rs. register_correction keeps its inherent typed-error version: it returns the typed CorrectionRejected the trait deliberately flattens to a string. Tests now call the trait API (Result + `now`); call sites updated accordingly. No behavior change — full mvcc-chunks suite (incl. sim PBT) green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(in-memory): fold mark_for_removal into the trait impl too The last redundant inherent forwarder: its only caller was the adapter wrapper (no test or backfill path used it). Move the body into the SchedulerStorage impl and drop the inherent version and its stale "backfill path" doc, matching the rest of the consolidation. Behavior unchanged; in-memory storage + sim suites green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(in-memory): inline the SchedulerStorage impl into mod.rs, drop adapter.rs The adapter file added a...--------- Co-authored-by: Defnull <879658+define-null@users.noreply.github.com> Reviewed-on: http://localhost:3000/defnull/network-scheduler/pulls/18 Co-authored-by: claude <claude@example.com> Co-committed-by: claude <claude@example.com>

…xtures (#20) * test: hoist backend-agnostic correction helpers into test_harness::fixtures The in-memory and Postgres correction suites each carried their own copies of the same backend-agnostic test helpers (peer/worker/chunk builders, a static SchedulingAlgorithm stub, pk + metadata lookups, a register_correction shorthand). Hoist single shared versions into a new test_harness::fixtures module so the two suites stay in lock-step: - peer, worker(seed, version), dataset, chunk(name, id_seed, size) - pk_of<S: StorageInspect>, metadata_for<S: StorageInspect> (by-value ChunkPk) - register<S: SchedulerStorage> - StaticSchedulingAlgorithm Both suites import from the shared module; the Postgres suite's semantic string chunk ids become numeric seeds (distinct per test; the existing UNIQUE-violation and cross-dataset cases keep their intended collisions/ non-collisions). insert_and_register_chunk now takes an id_seed since its body used the removed make_chunk. Backend-specific helpers (state-seeding SQL, fresh_db construction, next_id, register_correction_int) are deliberately left per-suite. No behavior change; full --features mvcc-chunks suite green (187 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: address review — move lookups onto StorageInspect, drop register helper PR #20 review feedback: - pk_of and metadata_for move from test_harness::fixtures onto the StorageInspect trait as provided methods (available on both backends); metadata_for is renamed get_chunk_metadata_by_pk. Call sites become storage.pk_of(&chunk) / storage.get_chunk_metadata_by_pk(pk). - The register() shorthand is dropped. The register-calling non-prop tests now return anyhow::Result<()> and use ? on all fallible storage calls; the two proptest bodies keep .unwrap() (? can't apply to StorageError inside a proptest body). fixtures.rs now holds only peer/worker/dataset/chunk/StaticSchedulingAlgorithm. No behavior change. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: rename test_harness::fixtures to test_harness::utils Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: run multistep correction scenarios against both backends via a macro The four real-scheduler correction scenarios are now written once, generic over a `TestStorage` fixture (`fresh()` for in-memory default vs a fresh Postgres database, mirroring the sim). A `backend_cases!` macro stamps each out as a `#[test]` under `multistep::in_memory` and `multistep::pg`, so the suite covers both backends with no hand-duplicated bodies. - `run_real_cycle` now calls `register_new_chunks()` before scheduling, which Postgres requires to materialise new/replacement chunk metadata (in-memory tolerates it — it created metadata lazily). - `prop_correction_chain` stays in-memory only (a fresh DB per proptest case would be prohibitively slow; the PG sim PBT already covers the chain there). - Retired the five overlapping static-stub Postgres twins now covered by the real-scheduler `multistep::pg` variants. Kept the typed-rejection guards, the selective `chain_link_held` (needs the static stub), and prop_pg_corrections_safety. Net-new: Postgres gains real-scheduler correction coverage it lacked. In-memory suite green; multistep::pg + trimmed PG suite green against a live database. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(test): quality pass over the modules touched in this PR Behavior-preserving cleanups (in-memory + PG suites green, clippy/fmt clean): - machinery: run_cycle now delegates to run_cycle_multi; drop the single-use ideal_mapping helper. - postgres: schedule_all builds the worker HashSet once instead of per chunk; use AssignmentId instead of an inline path / bare i64 in schedule_all/confirm. - postgres: extract anchor_metadata_column; set_dropped_at_portal and set_tombstoned become thin wrappers (identical statements, binds, panics). 
- in_memory/tests/mod.rs: bare HashMap for consistency with the file's other collection uses. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Defnull <879658+define-null@users.noreply.github.com> Co-authored-by: claude <claude@example.com> Co-committed-by: claude <claude@example.com>

…ts (#22) * refactor(sim): dedup chunk/config/fetch scaffolding in simulation tests Three reviewed duplications in the sim test scaffolding: - new_chunk(): one tuple->NewChunk factory replaces the four hand-rolled `NewChunk { .. }` literals across the chunk strategies, and the 12 inlined literals in the heterogeneous-sizes regression collapse to a per-test `chunk(seed, size, weight, dataset)` closure (matching the heavy/light idiom). - base_config(): a shared SimConfig baseline; each regression now overrides only the fields its property turns on via struct-update, instead of repeating the full 8-field literal. - fetch_succeeds(success, miss): the Bernoulli draw behind every fetch action, so the 9:1 / 1:2 success:miss ratios live in one place. Behavior-preserving; sim + regression tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Defnull <879658+define-null@users.noreply.github.com> Co-authored-by: claude <claude@example.com> Co-committed-by: claude <claude@example.com>

…tal_cmp (#23)

* refactor(tests): share Assignment chunk-holder inversion; use f64::total_cmp

The "invert worker_chunks into chunk -> holders" loop was hand-rolled in four test sites (two shapes: ordered Vec and per-chunk set). Add a test-only `Assignment::chunk_holders(n) -> Vec<BTreeSet<PeerId>>` and route the set-shaped sites through it:

- multistep_scheduler/tests.rs: drop the `holders` helper; call sites use the method.
- tests/scenarios.rs: replace the owners_before/after BTreeMap builds.
- tests/chunks_shuffling.rs: replace the index-keyed before/after inversion.

The ordered-Vec `chunk_to_peers` (used as mutable current placement) is left as is.

Also swap the two `partial_cmp(..).unwrap()` float sorts in scenarios.rs for `sort_unstable_by(f64::total_cmp)`: total order, no panic path.

Behavior-preserving; affected tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>
Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
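For readers unfamiliar with the two idioms named in this commit, here is a small sketch of each. The free-function form and the `u32` `PeerId` alias are assumptions made for illustration; in the PR the inversion is a test-only method on `Assignment`.

```rust
use std::collections::BTreeSet;

/// Stand-in for the real `PeerId`; any `Ord + Clone` type works here.
type PeerId = u32;

/// Invert "worker -> chunk indexes" into "chunk index -> set of holders".
/// `n` is the total chunk count, so a chunk nobody holds gets an empty set.
fn chunk_holders(worker_chunks: &[(PeerId, Vec<usize>)], n: usize) -> Vec<BTreeSet<PeerId>> {
    let mut holders = vec![BTreeSet::new(); n];
    for (peer, chunks) in worker_chunks {
        for &chunk in chunks {
            holders[chunk].insert(*peer);
        }
    }
    holders
}

/// Sort floats using the IEEE 754 total order. Unlike
/// `partial_cmp(..).unwrap()`, this has no panic path when a NaN is present.
fn sort_floats(values: &mut [f64]) {
    values.sort_unstable_by(f64::total_cmp);
}
```

`f64::total_cmp` can be passed directly because its signature, `fn(&f64, &f64) -> Ordering`, matches what `sort_unstable_by` expects.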

…ation) (#21)

* fix(test): resolve churn under-replication panic (oracle misclassification)

The churn under-replication regression was an oracle artifact, not a scheduler bug: a continuing chunk whose sole holder departs mid-cycle was misclassified as a brand-new chunk and forced to meet the first-publication floor. Fix the classification (key presence, not an empty holder set) and capture the regression.

Also in this PR:

- Gate-A visibility monotonicity oracle (#26) and the correction-safety oracles.
- placement_oracles: plain-English rewrite; floor/retention vs adequacy split.
- sut.rs cleanup: helpers below actions, with_step_safety wrapper, reschedule_frozen moved into its sole test, promoted_chunks -> visible_chunks.
- Per-step scheduler status (SchedulerPlaced/NotEnoughCapacity/NoSchedulerRun) in the sim trace.
- FIXME: a below-floor chunk should reclaim space from a draining surplus copy rather than wait out the grace period.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>

Resolves the scheduler-sim panic surfaced by `churn_simulation` (case 88, while validating #20). The panic turned out to be a **test-oracle misclassification**, not a real publication-atomicity break, so this PR now both pins the reproduction *and* fixes the oracle, with the regression test un-ignored and green.

## The panic

`step_floors` (`placement_oracles.rs`) panicked with `new chunk … published under-replicated: 1 copies` on a churn replay: after `min_replication` is raised 1→2 and the sole holder of a saturated, already-portal-visible chunk departs.

## Root cause (oracle, not scheduler)

`held_before` is sampled *after* `do_worker_left` deactivates and GC-evicts the departing holder, so the chunk's **active**-holder set is empty. `step_floors` classified that empty set as a *newly published* chunk and demanded the first-publication floor (0 or ≥ `min_replication`).

In reality the chunk **physically pre-existed** the cycle (it was already visible at one copy). Losing its last holder under saturation is a *tolerated shortfall*, which the retention branch (floor `min(floor, 0) = 0`) already permits.

## Fix (harness-only)

- `sut.rs`: `ideal_by_pk_active` → `held_before_by_pk`; emit an entry for every chunk that physically pre-existed (`ideal ∪ stale` non-empty), value = active-filtered ideal holders (possibly empty).
- `placement_oracles.rs`: `step_floors` classifies by **key presence** alone (drop the `is_empty` filter). Absent ⇒ genuinely new ⇒ atomic `0-or-≥floor` gate **preserved**; present-but-empty ⇒ continuing chunk whose holders all departed ⇒ retention floor 0.
- `regression.rs`: drop `#[ignore]`, rewrite the doc to the resolved cause.
- Two `step_floors` unit tests pin the new contract (present-but-empty passes; absent at the same copy count still fails).

Verified: full `multistep_scheduler::sim` suite green (61 tests, both backends incl. the `churn_simulation` proptest), 26 `placement_oracles` unit tests, clippy clean.

## Out of scope (separate follow-up)

While diagnosing this, a *plausible but unverified* production gap was flagged: confirmed routing (`sched_confirmed_chunk_workers`) is not scrubbed when a worker departs the registry, so portals could route an already-visible chunk to a departed/GC-evicted worker. That lives on the routing plane, invisible to this physical-presence oracle, and is **not** addressed here. Worth its own issue + a routing-plane assertion.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
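The key-presence rule from the fix can be sketched as a small classifier. This is an illustrative reading, not the code in `placement_oracles.rs`: the `check_floor` name, the `Verdict` type, and the use of a holder count as the map value are assumptions, as is generalising the retention floor to `min(floor, holders_before)` from the `min(floor, 0) = 0` case quoted above.

```rust
use std::collections::BTreeMap;

type ChunkPk = u64;

#[derive(Debug, PartialEq)]
enum Verdict {
    Ok,
    UnderReplicated,
}

/// `held_before` has an entry for every chunk that physically pre-existed the
/// cycle; the value is its active-holder count, which may be zero.
fn check_floor(
    held_before: &BTreeMap<ChunkPk, usize>,
    chunk: ChunkPk,
    copies_after: usize,
    floor: usize,
) -> Verdict {
    let ok = match held_before.get(&chunk) {
        // Absent key: a genuinely new chunk. Publication is atomic, so it
        // must have either no copies at all or at least `floor` copies.
        None => copies_after == 0 || copies_after >= floor,
        // Present key (even with zero holders): a continuing chunk. It only
        // has to retain what it had, capped at the floor (assumed rule).
        Some(&before) => copies_after >= floor.min(before),
    };
    if ok { Verdict::Ok } else { Verdict::UnderReplicated }
}
```

With `floor = 2`, a chunk whose key is present with zero holders passes at one copy, while a chunk with no key at the same copy count still fails, which is the contract the two new unit tests are described as pinning.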

* fix harness

---------

Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>

Reaping the shared Postgres testcontainer at exit could abort the process. The `#[dtor]` runs after `main`, when the main thread's TLS is already gone. tokio's `block_on` parks via `std::thread::current()`, which panics there, and a panic in a `#[dtor]` aborts.

Fix: run `container.rm()` on a fresh thread (intact TLS), still inside the tokio runtime since `rm()` needs the reactor.

Test-harness only, one file.

Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>

* feat(scheduler): non-overlap enforcement + 1:1 same-range corrections (mvcc-chunks)

Guarantee a published portal assignment never contains two chunks with overlapping block ranges within one dataset, and make corrections strict 1:1 same-range swaps. Both backends (in-memory + Postgres) decide identically; the simulation cross-checks them. Gated behind the `mvcc-chunks` feature.

Non-overlap, two layers (shared resolver; lower (first_block, chunk_pk) wins):

- Registration (primary): register_new_chunks refuses a new chunk whose range overlaps a live chunk in its dataset (or another in the same batch). The loser gets a terminal `rejected` row: never scheduled, replicated, or re-evaluated.
- Promotion (backstop): the visibility-cycle gate refuses to promote a chunk that would overlap the surviving-visible set. Should-never-fire once registration does its job.

Corrections (1:1 same-range):

- register_correction rejects a range-changing replacement, and a correction whose old chunk is rejected or already being removed, before any insert.
- A same-range replacement is exempt from the registration overlap check (it overlaps only the chunk it supersedes); register_new_chunks backstops the invariant.
- Both backends return typed rejections (CorrectionRejected / ChunkAlreadyExists) for the same inputs, classifying the duplicate / existing-replacement DB violations.

Storage / schema:

- sched_chunk_metadata.rejected column; chunks(dataset_id, first_block+last_block_delta) index backing the indexed Postgres overlap probe.
- registration_rejected / promotion_held_back metrics, per dataset (by name).

In-memory model:

- Lifecycle-state predicates on SchedulerChunkMetadata, split into threshold-reached vs current-state families and named after docs/mvcc-storage.md; they dedup the table-scan filters.

Simulation:

- Generators stay dumb: any chunk is a correction target and every add goes through the storage. The machinery rejects bad ones as legal no-ops (panicking only on a broken contract).
- Chunk block ranges are generated at transition time; corrections inherit the old chunk's range.

Docs: new nonoverlap-promotion-gate.md; mvcc-storage / mvcc-schema / README updated for the rejected state and same-range corrections; consistent chunk-state terminology across docs and code.

Full suite green (in-memory, Postgres, cross-backend churn/guided sims); clippy + fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>

## What

Guarantees a published **portal assignment** never contains two chunks with overlapping block ranges within one dataset. Enforced at **two layers** (both backends, behind `mvcc-chunks`):

1. **Registration (primary).** `register_new_chunks` refuses a *new* chunk whose `[first_block, last_block]` overlaps a live chunk in the same dataset (or another new chunk in the same batch). The loser gets a terminal `sched_chunk_metadata.rejected` marker → never scheduled, never replicated to workers, never re-evaluated. This is where "don't replicate two overlapping chunks" is enforced, *before* a doomed chunk costs a download.
2. **Promotion (backstop).** The visibility-cycle gate still refuses to promote a chunk that would overlap the dataset's surviving-visible set. This is the hard guarantee that the *published* assignment is non-overlapping, regardless of how a chunk reached promotion. A should-never-fire alarm given registration does its job.

Both layers run the **same shared resolver** (`overlap::select_non_overlapping`), so the two backends decide identically (the simulation cross-checks them). Conflict resolution is deterministic: among overlapping candidates the lower `(first_block, chunk_pk)` wins.

## Corrections

Corrections are **1-to-1 same-range** swaps; a same-range replacement can't introduce overlap, so the existing atomic swap suffices and a correction's replacement is **exempt** from the registration check (it overlaps the old chunk it supersedes by design).

**Out of scope (documented):** range-changing / re-partitioning corrections. The gate still refuses to publish overlap, but such a correction won't complete cleanly (it can stall / leave a gap). A rejected chunk is terminal (no self-heal).

## Behaviour (tested, both backends)

| Scenario | Outcome |
|---|---|
| Two overlapping new chunks (ingest) | lower wins; the other **rejected at registration** (terminal, never replicated) |
| Rejected duplicate, winner later removed | does **not** self-heal; freed range needs a fresh registration |
| Same-range correction `A→B` | `B` promotes, `A` drops atomically (`B` exempt at registration) |
| Chain `A→B→C` | collapses in one cycle, only `C` visible |
| Range-changing correction overlapping a neighbour | out of scope; gate refuses overlap (replacement held, gap, no overlap) |
| Draining chunk (M-tick window) | not in the comparison set |

## Implementation

- `sched_chunk_metadata.rejected` column (in-memory: a bool field).
- `overlap::select_non_overlapping` shared resolver (accepted + held-back, deterministic by `(first_block, chunk_pk)`, `O(log n)` neighbour probe; `chunks(dataset_id, first_block)` index).
- `register_new_chunks` rewritten in both backends; scheduling input excludes rejected chunks.
- `registration_rejected` + `promotion_held_back` metrics; rejections/held-backs are logged.
- Docs: `docs/nonoverlap-promotion-gate.md`, README, and the `register_correction` docstring.

## Status

Full suite green: **201 passed, 0 failed** (Postgres container + sim), `cargo clippy` + `cargo fmt` clean. Rebased onto the current base; single commit.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
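The deterministic "lower `(first_block, chunk_pk)` wins" rule can be illustrated with a greedy pass over one batch of candidates. This is a sketch under stated assumptions, not `overlap::select_non_overlapping` itself: the `Chunk` struct, the function name, and the greedy strategy are hypothetical, ranges are treated as inclusive `[first_block, last_block]`, and the check against already-live chunks (the indexed neighbour probe) is left out.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    chunk_pk: u64,
    first_block: u64,
    last_block: u64, // inclusive; assumed >= first_block
}

/// Split one dataset's candidates into (accepted, held_back) so that no two
/// accepted chunks overlap. Ties are broken by the lower (first_block, chunk_pk).
fn select_non_overlapping_sketch(mut candidates: Vec<Chunk>) -> (Vec<Chunk>, Vec<Chunk>) {
    candidates.sort_by_key(|c| (c.first_block, c.chunk_pk));
    let mut accepted: Vec<Chunk> = Vec::new();
    let mut held_back = Vec::new();
    for candidate in candidates {
        // Accepted chunks are sorted and disjoint, so the most recently
        // accepted one has the largest last_block: it is the only chunk
        // this candidate can overlap.
        let overlaps = accepted
            .last()
            .map_or(false, |prev| candidate.first_block <= prev.last_block);
        if overlaps {
            held_back.push(candidate);
        } else {
            accepted.push(candidate);
        }
    }
    (accepted, held_back)
}
```

Because the outcome depends only on the sorted order, two backends that feed the same candidates into such a function reach the same verdict regardless of the order in which rows were read.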

* feat(reshuffle-sim): add multistep scheduler + Postgres backend

Adds a `--scheduler multistep` mode to reshuffle-sim that drives the placement-aware multistep scheduler over the Postgres-backed storage lifecycle, alongside the existing stateless path (still the default). With no `--database-url` it starts an ephemeral Postgres container; otherwise it connects to (and migrates) the given database.

Library: expose `scheduler_storage::postgres` and un-gate the two seeding methods (`insert_new_datasets`, `insert_new_chunks`) from test-only to the `mvcc-chunks` feature so the tool can ingest chunks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(reshuffle-sim): plainer comments; avoid baseline chunk clone

Post-review cleanup: simplify wording in the multistep driver and CLI help (drop internal jargon), and move the baseline chunk vec into the insert instead of cloning it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(reshuffle-sim): use tracing for status logs with timestamps

Replace the status/progress eprintln! calls with tracing macros and init a tracing-subscriber (timestamps on, level via RUST_LOG, default info). The metrics table stays on stdout via println!, so logs and report output don't interleave.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(reshuffle-sim): one cycle per step, confirm after each run

Drop the settle-to-fixed-point loop: each step now runs a single scheduling cycle and confirms the assignment right after the run, so the metrics capture the movement one cycle causes rather than a settled end state. Removes the MAX_SETTLE cap and the assignment-equality check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(reshuffle-sim): share one step loop across both schedulers

Introduce a `StepScheduler` trait (`step` returns a common `StepPlacement`) and a single `run_simulation` loop that owns chunk generation, diffing, printing, and metric collection. The stateless and multistep paths now differ only in their `step` implementation (the part that actually runs a cycle) and in their per-path setup.

Also change `generate_new_chunks` to take `&mut [DatasetInfo]` (clears a clippy ptr_arg warning) and drop the now-unused public helpers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(reshuffle-sim): one entry path; algorithm chosen by a match

Both schedulers now run through the same code in `main`: a match builds a `Box<dyn StepScheduler>`, then a single `run_simulation` call drives it. The separate `run_stateless` / `multistep::run` orchestrators are gone.

To make this work the multistep scheduler now owns its `Backend` (instead of borrowing one held by the caller), built via `MultistepScheduler::build`. The `StepScheduler` trait gained `initial_owners` and `total_capacity_bytes`, so the shared loop needs nothing scheduler-specific passed in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(reshuffle-sim): trim comments, prose, and the smoke test

Remove the ignored Docker-only smoke test and tighten doc comments, inline comments, and the README to cut the diff.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(scheduler): surface replication_by_weight from SchedulingAlgorithm

The multistep/stateless schedulers already compute replication_by_weight on the Assignment, but the SchedulingAlgorithm adapter dropped it and the reshuffle-sim tool re-derived it from the published ideal∪stale holder counts, which double-counts draining copies and re-couples the tool to prepare_chunks/weight defaults.

Have SchedulingAlgorithm::schedule return a ScheduleOutput { mapping, replication_by_weight }, carry the map onto WorkerAssignment, and read it in the tool. The reported factors are now the scheduler's chosen (ideal) replication, matching the stateless path and excluding transient drains.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(scheduler): trim ScheduleOutput doc comment

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(reshuffle-sim): drop two comments per review

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(reshuffle-sim): split provision; shorten Backend type; drop comment

Split provision() into existing_database()/ephemeral_database() sharing a connect_migrated() helper; the caller picks via a match. Import ContainerAsync/Postgres so the Backend field type is short. Drop the GC_TICKS comment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(reshuffle-sim): move StatelessScheduler to its own module

Mirror the multistep path: StatelessScheduler (and its scheduling / assignment-diff helpers) now live in stateless.rs, leaving simulation.rs with just the shared loop, metrics, and chunk generation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(scheduler): idiomatic cleanups (map_err, destructure, drop)

- m...

---------

Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>

## What

Adds a `--scheduler multistep` mode to `reshuffle-sim` that runs the placement-aware multistep scheduler over the Postgres-backed storage lifecycle. The existing stateless scheduler stays the default, so both can be run for comparison.

- With no `--database-url`, an ephemeral Postgres container is started (needs Docker).
- With `--database-url`, it connects to the given database and migrates it (the database must be empty).

## Why

The multistep scheduler and its Postgres storage existed only behind the `mvcc-chunks` feature and were driven only by test code. `reshuffle-sim` could measure data movement for the stateless scheduler but not for the multistep one. This wires it up so the reshuffle cost of the two can be compared on the same input.

## Changes

**Library (minimal exposure):**

- Make `scheduler_storage::postgres` public so `PostgresStorage` is reachable from the tool.
- Un-gate `insert_new_datasets` / `insert_new_chunks` from `#[cfg(test)]` to the `mvcc-chunks` feature (trait + Postgres impl) so the tool can seed chunks. `register_correction` stays test-only.

**Tool:**

- New `multistep.rs` driver: provisions Postgres, seeds datasets/workers/baseline chunks, then runs scheduling/visibility/confirmation cycles on a logical clock until each step reaches a drained fixed point, and diffs holder sets. Reuses the existing metrics/report code via a shared `assemble_metrics`.
- `main.rs`: `--scheduler {stateless,multistep}` and `--database-url`.
- An `#[ignore]`d end-to-end smoke test (needs Docker); confirms new chunks are added with no reshuffling.

## Notes

- The multistep baseline is the scheduler's own converged placement of the input chunks, not the input file's worker indexes, so per-step movement is internally consistent but not directly comparable in absolute terms to the stateless path's first step.
- Chunks are ingested one row per INSERT, so large inputs are slow against Postgres; prefer smaller `--chunks-per-step`/`--steps` when exploring this path.

## Testing

- `cargo build` (workspace), `cargo build --features mvcc-chunks --tests`, `cargo fmt --check`: all clean.
- `cargo test --features mvcc-chunks` in-memory (69) and algorithm tests pass.
- Multistep smoke test passes against an ephemeral Postgres.

Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
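The shape of the shared loop described in the refactor commits (a `StepScheduler` trait, a `match` that builds a `Box<dyn StepScheduler>`, and one `run_simulation` driver) might look roughly like this. Only the trait name and its three method names come from the commit messages; the signatures, the `Owners` map, the two toy schedulers, and the download-count metric are assumptions made for illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

type ChunkId = u32;
type WorkerId = u32;
/// chunk -> workers holding a copy
type Owners = BTreeMap<ChunkId, BTreeSet<WorkerId>>;

/// Result of one scheduling cycle, reduced to holder sets for this sketch.
struct StepPlacement {
    owners: Owners,
}

trait StepScheduler {
    fn initial_owners(&self) -> Owners;
    fn total_capacity_bytes(&self) -> u64;
    fn step(&mut self, new_chunks: &[ChunkId]) -> StepPlacement;
}

/// Toy scheduler: chunk `c` lives on worker `c % workers`.
struct Modulo { workers: u32, chunks: Vec<ChunkId> }
/// Toy scheduler: every chunk lives on worker 0.
struct Pinned { chunks: Vec<ChunkId> }

fn place(chunks: &[ChunkId], pick: impl Fn(ChunkId) -> WorkerId) -> Owners {
    chunks.iter().map(|&c| (c, BTreeSet::from([pick(c)]))).collect()
}

impl StepScheduler for Modulo {
    fn initial_owners(&self) -> Owners { place(&self.chunks, |c| c % self.workers) }
    fn total_capacity_bytes(&self) -> u64 { u64::from(self.workers) * 1024 }
    fn step(&mut self, new_chunks: &[ChunkId]) -> StepPlacement {
        self.chunks.extend_from_slice(new_chunks);
        StepPlacement { owners: place(&self.chunks, |c| c % self.workers) }
    }
}

impl StepScheduler for Pinned {
    fn initial_owners(&self) -> Owners { place(&self.chunks, |_| 0) }
    fn total_capacity_bytes(&self) -> u64 { 1024 }
    fn step(&mut self, new_chunks: &[ChunkId]) -> StepPlacement {
        self.chunks.extend_from_slice(new_chunks);
        StepPlacement { owners: place(&self.chunks, |_| 0) }
    }
}

enum Kind { Modulo, Pinned }

/// The only scheduler-specific branch: pick an implementation.
fn build(kind: Kind, chunks: Vec<ChunkId>) -> Box<dyn StepScheduler> {
    match kind {
        Kind::Modulo => Box::new(Modulo { workers: 2, chunks }),
        Kind::Pinned => Box::new(Pinned { chunks }),
    }
}

/// Shared loop: count copies each step adds that the previous placement lacked.
fn run_simulation(scheduler: &mut dyn StepScheduler, steps: &[Vec<ChunkId>]) -> Vec<usize> {
    let mut prev = scheduler.initial_owners();
    let mut downloads_per_step = Vec::new();
    for new_chunks in steps {
        let next = scheduler.step(new_chunks).owners;
        let downloads: usize = next.iter().map(|(chunk, holders)| {
            let before = prev.get(chunk);
            holders.iter().filter(|w| before.map_or(true, |b| !b.contains(*w))).count()
        }).sum();
        downloads_per_step.push(downloads);
        prev = next;
    }
    downloads_per_step
}
```

The design point is that the driver never inspects which scheduler it holds: everything it needs arrives through the trait, so a new scheduler only adds one `match` arm.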

* test(sim): capture pre-existing PG-only worker overcommit

A rare pg_guided_simulation failure ("overcommit: worker holds 11534336 > capacity 10485760", i.e. 11 copies on a 10-copy worker). Captured the shrunk sequence (SIM_CASE_SEED=86415892...) two ways:

- regression::pg_guided_overcommit_capture: in-memory replay, PASSES, confirming the bug is Postgres-specific (placement-input ordering), not the shared algorithm.
- pg_tests::pg_guided_overcommit_capture_pg: Postgres replay, reproduces it deterministically; #[ignore]d as it pins an unfixed bug.

Independent of the promotion-probe exemption: worker placement (fetch_active_chunks) never reads applied_at_portal_assignment_id.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(sim): trim prose on the captured overcommit regressions

Per review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(sim): make the overcommit regression a PG replay in regression.rs

Per review: the in-memory replay isn't the actual regression (it passes), so drop it and move the Postgres reproducer into regression.rs as pg_guided_overcommit_capture_pg (driven via init_test, #[ignore]d). The in-memory clean run is noted in the doc rather than kept as a test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(sim): drop redundant comment on the overcommit replay loop

Per review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>

Captures a rare, **pre-existing** `pg_guided_simulation` failure as deterministic regression tests. Branched off the base; independent of the #2 promotion-probe PR.

## The bug

`step_safety` panics with `overcommit: worker WorkerPk(2) holds 11534336 > capacity 10485760`: the **Postgres** backend places 11 copies on a worker with room for 10. Rare under random seeds (~1 in several hundred cases), but the shrunk action sequence reproduces it deterministically (5/5).

## It is Postgres-specific, not the algorithm

- `regression::pg_guided_overcommit_capture` replays the sequence on the **in-memory oracle**: it **passes**, so the shared placement algorithm is correct. The divergence is in the PG path (placement-input ordering).
- `pg_tests::pg_guided_overcommit_capture_pg` replays on **Postgres**: reproduces the overcommit. `#[ignore]`d because it pins an **unfixed** bug (so CI stays green); run with `cargo test … -- --ignored pg_guided_overcommit_capture_pg`.

Verified on this branch (no #2 present): in-memory passes, PG reproduces, confirming it's independent of the promotion-probe exemption. (It also can't be: worker placement `fetch_active_chunks` never reads `applied_at_portal_assignment_id`.)

Seed: `SIM_CASE_SEED=86415892433a952109298d1aec73e1da062112c371412cf3f0d3f9f88151cf94`.

## Follow-up

The PG placement overcommit still needs a real fix (likely deterministic ordering / capacity-accounting parity with the in-memory backend). Until then, `pg_guided_simulation` remains rarely flaky on this seed class.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>

* test(sim): add zero-quorum scheduler simulation model

A guided-style property test that runs the multistep scheduler at a 0% confirmation quorum: no worker has to confirm an assignment, so the confirmation watermark tracks the latest published assignment and portal promotion (plus drain activation) lands the same cycle instead of waiting on a fetch. Workers and the portal still poll, and those fetches feed the observation oracles; they just no longer gate confirmation.

ZeroQuorumModel reuses the guided walk verbatim, overriding only init_state to pin confirm_threshold_pct=0. The SUT reacts to a 0% quorum in refresh_confirmation (watermark -> latest assignment id), run_cycle (confirm before the visibility pass), and lagging_worker_indexes (no designated stragglers).

Portal consistency is now uniform across quorums, not special-cased for 0: only a full (100%) quorum owes the hard guarantee that every routed chunk is held, so the oracle is fatal only there. Below 100% (a 70-99% quorum lagging a straggler, or the 0% extreme) the scheduler can route ahead of confirmation, so a query can legitimately miss; those sub-quorum runs only measure how many routings would miss (portal_consistency_misses) and never fail. Zero quorum is simply where that count runs highest.

Wired for in-memory and Postgres.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Defnull <879658+define-null@users.noreply.github.com>

## What

Adds a zero-quorum variant of the guided scheduler simulation: a property test that runs the multistep scheduler at confirm_threshold_pct = 0, i.e. no worker has to confirm an assignment. The confirmation watermark tracks the latest published assignment, so portal promotion (and drain activation) lands the same cycle instead of waiting on a fetch. Workers and the portal still poll on their own cadence, and those fetches feed the observation oracles; they just no longer gate confirmation.

## How

- New model sim/sut/zero_quorum.rs (ZeroQuorumModel): reuses the guided walk verbatim, overriding only init_state to pin confirm_threshold_pct = 0.
- The behaviour change lives in the SUT reaction to a 0% quorum:
  - refresh_confirmation: watermark jumps straight to the latest published assignment id.
  - run_cycle: confirms before the visibility pass so promotion is eager. The observed fleet is deliberately NOT caught up; workers lag on their poll cadence.
  - lagging_worker_indexes: no designated stragglers at 0%.
- Because promotion outruns confirmation, the portal can route a freshly promoted chunk before any worker holds it. That is the documented cost of skipping confirmation, not a bug, so at a 0% quorum the portal-consistency oracle MEASURES these would-miss routings (placement_oracles::portal_consistency_misses) instead of failing. Every structural oracle (per-step safety: no overcommit, retention floor, atomic publication; published coverage; floor convergence; corrections) stays fatal.
- Wired zero_quorum_simulation + _case for both in-memory and Postgres backends.

## No production changes

Nothing in the scheduler algorithm or storage backends was touched, only the test harness. The regime degrades exactly the portal-routing-to-unsynced-workers property and nothing else; all structural invariants hold.

## Verification

- in-memory zero_quorum_simulation: green over 256 / 128 / repeated 64-case sweeps.
- Telemetry confirms worker-fetch (~18.6%) and portal-fetch (~18.4%) still happen.
- fmt + clippy clean.
- Postgres variants are wired but not run here (need Docker).

## Notes for review

1. The accountable predicate in assert_portal_consistency reads `last_applied >= watermark` (watermark-scoped), which also governs the existing guided/churn fatal path. Under that scoping the suppressed-miss count at 0% quorum is ~0 (a handful of residual edge cases); with the broader `last_applied > 0` scoping it was ~17% of checks (mean ~1, max 27). Say which scoping you want the metric to use.
2. Separately: I saw a one-off, non-reproducible portal-consistency failure in guided/churn at confirm_threshold_pct 82 (a normal quorum, not zero) during a full-suite run. Not reproducible from per-case or whole-run seed, did not recur in ~15 later runs, and independent of this change (which only touches the threshold == 0 paths). Looks like a rare pre-existing nondeterministic flake on this base, worth a separate look.

Co-authored-by: claude <claude@example.com>
Co-committed-by: claude <claude@example.com>
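To make the quorum behaviour concrete, here is one way a confirmation watermark could be computed so that it degenerates to "latest published" at 0%. The source only states the 0% behaviour and that the portal-consistency oracle is fatal at a full quorum; the rounded-up percentage rule, both function names, and the assumption that applying assignment `k` implies having applied every earlier one are illustrative guesses, not the harness's actual logic.

```rust
/// Highest assignment id confirmed by the required share of workers.
/// `last_applied[i]` is the newest assignment id worker `i` has applied.
fn confirmation_watermark(
    last_applied: &[u64],
    latest_published: u64,
    confirm_threshold_pct: u32,
) -> u64 {
    let pct = confirm_threshold_pct.min(100) as usize;
    // Number of workers that must have applied an assignment (rounded up).
    let needed = (last_applied.len() * pct + 99) / 100;
    if needed == 0 {
        // 0% quorum (or no workers): nothing gates confirmation, so the
        // watermark tracks the latest published assignment.
        return latest_published;
    }
    let mut applied = last_applied.to_vec();
    applied.sort_unstable_by(|a, b| b.cmp(a)); // newest first
    // The `needed`-th newest value is reached by at least `needed` workers.
    applied[needed - 1].min(latest_published)
}

/// Per the commit message, only a full quorum owes the hard guarantee that
/// every routed chunk is held; below that, misses are counted, not fatal.
fn portal_miss_is_fatal(confirm_threshold_pct: u32) -> bool {
    confirm_threshold_pct >= 100
}
```

For three workers that have applied assignments 5, 3 and 4, with 5 the latest published, this yields a watermark of 3 at 100%, 4 at 50%, and 5 at 0%.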

define-null and others added 30 commits May 27, 2026 16:46

WIP on multi-version chunks

27d04e3

minor refactor

27edb0e

Add portal assignment

58cc0f6

Portal assignment added

56e0cad

Add tests, inspect trait, scheduling algorithm trait

a08400e

Switch to ByteSize, tests for basic api and mvcc base concept

5561556

Document current design limitation, add tests that shows the problema…

1c2d4b0

…tics cases

Fix both tests with diff history approach

68a1b26

Add another test for the full flow

bde40d5

Pass all existing tests

c75b68b

Drop reservation table

e6b1832

Added net-786 scheduler

25f621e

Property-based-tests are working

8ad57f5

found failure with random sizes

615920a

Fixed the regression

52cfd06

Restructure code

3553a79

introduced different models

8268907

minor improvements

6a73755

minor improvements

3f1bd93

Add regression and more aggressive assertion

8a52268

Comments cleanup

24b821a

Fix regression tests

40a47a3

Fix clippy

d4feb45

Add/remove workers tests

8a23137

minor rename

3fa3432

docs: Design for driving InMemoryStorage as proptest SUT over multist…

c1d535a

…ep scheduler Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs: Implementation plan for InMemoryStorage-as-SUT proptest

4c9c25a

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

define-null and others added 30 commits June 16, 2026 18:51

Forbid self-correction with a CHECK (old_chunk_pk <> new_chunk_pk)

ce4da10

Row-level check, O(1) per write, independent of chunks table size. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chore: add Gitea squash merge template

aa8be4d


feat: Multistep scheduler #37

define-null wants to merge 250 commits into master from defnull/net-681-multi-step-scheduling-2


What is this PR about?

MVCC chunk lifecycle (src/scheduler_storage/, migrations/)

Multi-step (reconciliation) scheduling (src/multistep_scheduler.rs)

Simulation & property-based testing (src/multistep_scheduler/sim/)

Supporting changes & docs
