feat(payload): positional-cursor resume for chunk vtab (prototype)#52

Open
andinux wants to merge 1 commit into codex/chunked-payloads-network from codex/positional-cursor-chunks

@andinux andinux commented Jun 27, 2026

Collaborator

What

Make the cloudsync_payload_chunks vtab resumable by position so a window can
be paged one chunk at a time from an arbitrary point — including boundaries that
fall inside a single committed db_version or inside a fragmented oversized
value.

  • New hidden inputs resume_db_version, resume_seq, resume_frag_offset start
    the scan at (db_version, seq) inclusive and re-enter a mid-value fragment at a
    byte offset.
  • New outputs next_db_version, next_seq, next_frag_offset, is_final report
    exactly where the emitted chunk stopped, so the next call resumes from there.
  • Fragment setup is factored into payload_chunks_begin_fragment so a streamed
    and a resumed fragment build identically.
  • New unit test Payload Chunks Positional Resume pages a window (a db_version
    split across chunks plus a value larger than the chunk budget) one chunk per
    call via the cursor and asserts byte-for-byte identity with a full-window
    scan, including a mid-fragment resume.
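
The paging contract described above can be sketched with a toy model. This is not the vtab's real chunk format: `next_chunk` and `page_window` are hypothetical names, a "chunk" here is just concatenated value bytes, and the rule that an oversized value is emitted alone in budget-sized fragments is an assumption made for illustration. Only the cursor fields mirror the new columns (`next_db_version`, `next_seq`, `next_frag_offset`, `is_final`).

```python
from bisect import bisect_left

def next_chunk(rows, budget, resume=(0, 0, 0)):
    """Toy model of one positional call.

    rows   : list of (db_version, seq, value_bytes), sorted by (db_version, seq)
    resume : (resume_db_version, resume_seq, resume_frag_offset)
    Returns (chunk_bytes, next_cursor_or_None, is_final).
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    db_version, seq, frag_offset = resume
    keys = [(r[0], r[1]) for r in rows]
    i = bisect_left(keys, (db_version, seq))  # inclusive seek to (db_version, seq)
    chunk = b""
    while i < len(rows):
        dbv, sq, value = rows[i]
        if len(value) > budget:
            # Oversized value: assumed to travel alone, one fragment per chunk.
            if chunk:
                return chunk, (dbv, sq, 0), False  # flush; cursor names this row
            piece = value[frag_offset:frag_offset + budget]
            end = frag_offset + len(piece)
            if end < len(value):
                return piece, (dbv, sq, end), False  # resume mid-value at byte `end`
            i += 1
            if i < len(rows):
                return piece, (rows[i][0], rows[i][1], 0), False
            return piece, None, True
        if len(chunk) + len(value) > budget:
            return chunk, (dbv, sq, 0), False  # cursor names the row that did not fit
        chunk += value
        i += 1
    return chunk, None, True

def page_window(rows, budget):
    """Page the whole window one chunk per call, carrying only the cursor."""
    chunks, cursor, final = [], (0, 0, 0), False
    while not final:
        chunk, cursor, final = next_chunk(rows, budget, cursor)
        chunks.append(chunk)
    return chunks

# db_version 1 is split across chunks; db_version 2 holds a value larger than the budget.
rows = [(1, 0, b"aaaa"), (1, 1, b"bbbb"), (1, 2, b"cccc"), (2, 0, b"X" * 25), (3, 0, b"dd")]
assert b"".join(page_window(rows, 10)) == b"".join(v for _, _, v in rows)
```

Each call is stateless: everything needed to continue is in the returned cursor, which is the property that lets the job page without a spool table.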

Why

Today the server stages a whole window into the cloudsync_payload_spool table so
the /check job can page it out one chunk per node round-trip. That staging step
issues CREATE TABLE under the tenant's readwrite apikey, which is not
authorized to create tables — surfacing as failures.check: database_permission_denied
and an endless /check poll (the original bench failure on this branch).

cloudsync_payload_chunks could only resume from since > db_version, so a chunk
boundary landing inside a single db_version (or inside a fragmented value) was not
addressable — which is the whole reason the spool table exists. A positional
cursor makes those boundaries addressable, so the job can page the node directly:

  • O(1) seek per chunk and O(N) per window (vs O(N^2) per window for
    replay-from-since), matching the spool's compute-once cost.
  • No spool table → no CREATE privilege → the permission bug is gone at the root
    (no provisioning, no migration, no dbadmin fallback, no SQLiteCloud-core change).
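
As a rough illustration of the cost gap, the sketch below counts rows visited under an idealized model. It assumes replay-from-since re-reads every row from the start of the window up to the end of the requested chunk, that each chunk holds a fixed number of rows, and it ignores the one-row look-ahead a positional call may need. `rows_visited` is a hypothetical helper, not part of the codebase.

```python
def rows_visited(n_rows, rows_per_chunk):
    """Return (replay_total, positional_total) rows read to page one window."""
    n_chunks = -(-n_rows // rows_per_chunk)  # ceiling division
    # Replay: chunk k re-reads everything up to and including chunk k.
    replay = sum(min(k * rows_per_chunk, n_rows) for k in range(1, n_chunks + 1))
    # Positional: each row is read once, by the call that emits it.
    positional = n_rows
    return replay, positional

print(rows_visited(1000, 10))  # (50500, 1000)
```

Under these assumptions the replay total grows as roughly N^2 / (2 * rows_per_chunk), while the positional total stays at N.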

Exact tiling, no idempotency reliance. (db_version, seq) is a unique total
order (changes rowid = (db_version << 30) | seq); next_* names the row that did
not fit or the exact byte already emitted, and the fragment plan is a deterministic
function of the row. So each change is emitted in exactly one chunk — verified
byte-for-byte by the new test. The only idempotency in play stays at the sqlite-sync
changes level.
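
A minimal sketch of why the packed rowid gives a total order over (db_version, seq). The shift width comes from the formula above; the helper names and the explicit range check on `seq` are illustrative assumptions.

```python
SEQ_BITS = 30
SEQ_MASK = (1 << SEQ_BITS) - 1

def pack_rowid(db_version, seq):
    """Pack (db_version, seq) as (db_version << 30) | seq."""
    if not 0 <= seq <= SEQ_MASK:
        raise ValueError("seq must fit in 30 bits for the packing to be unique")
    return (db_version << SEQ_BITS) | seq

def unpack_rowid(rowid):
    """Recover (db_version, seq) from a packed rowid."""
    return rowid >> SEQ_BITS, rowid & SEQ_MASK

pairs = [(2, 0), (1, 5), (1, 0), (3, 1), (1, 4)]
# Sorting by packed rowid matches lexicographic order on (db_version, seq).
assert sorted(pairs, key=lambda p: pack_rowid(*p)) == sorted(pairs)
# The packing round-trips, so every position maps to exactly one rowid.
assert all(unpack_rowid(pack_rowid(*p)) == p for p in pairs)
```

Because the mapping is one-to-one and order-preserving, a cursor expressed as (db_version, seq) identifies a single row, and "everything at or after the cursor" is unambiguous.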

Scope / compatibility

  • Send path and spool fill are unchanged. New columns are appended (existing
    positions 0..10 untouched); the positional branch only activates when
    resume_db_version is bound. cloudsync_network_send_changes (binds only
    since_db_version) and cloudsync_payload_spool_fill take the byte-identical
    legacy path. All existing payload/chunk/spool unit tests pass.
  • The client /check wire protocol does not change. The positional cursor is a
    server-internal (job ↔ node, SQL-level) mechanism that replaces the spool's chunk
    addressing. The job loops once over the window, fetching one chunk per round-trip
    (node memory stays at one chunk) and staging each chunk into the object store as
    the legacy path already does; the client downloads the staged chunks at its own
    pace with its existing cursor. See docs/internal/chunked-check-spool-vs-positional-cursor.md
    for the full two-layer analysis and the drain-start "exclusive-after" rule.

This is a prototype: the vtab primitive + its test. Wiring it into the live
/check job is tracked below.

TODO (follow-ups)

  • /check job (Go server): replace spool_fill + spool paging with a
    positional SELECT ... LIMIT 1 staging loop into the object store
    (job-internal layer-2 cursor + pinned until watermark; apply the drain-start
    exclusive-after rule for the first node fetch). Client wire protocol
    unchanged.
  • PostgreSQL parity: mirror the positional inputs/outputs on the
    cloudsync_payload_chunks SRF (cloudsync_postgresql.c / cloudsync*.sql) —
    this prototype is SQLite-only.
  • Retire cloudsync_payload_spool: drop spool_fill/spool_drop/the table
    and its SQL once the positional path is wired end-to-end and verified.
  • Benchmark the positional loop vs. the spool on a large window before
    removing the spool.
  • Once retired, the spool-provisioning work (eager-create in
    cloudsync_init_table, existence-guard in spool_fill, core migrate-on-open)
    becomes unnecessary.

🤖 Generated with Claude Code

cloudsync_payload_chunks could only resume a window from since>db_version,
so a chunk boundary that lands inside a single committed db_version (or
inside a fragmented oversized value) was not addressable — which is the
whole reason the server stages the stream into a cloudsync_payload_spool
table to page it out. This makes the vtab resumable by position instead.

Add an optional positional cursor: hidden inputs resume_db_version,
resume_seq, resume_frag_offset start the scan at (db_version, seq)
inclusive and re-enter a mid-value fragment at a byte offset; new outputs
next_db_version, next_seq, next_frag_offset and is_final report where the
emitted chunk stopped. A stateless /check can then page the whole window
with an O(1) seek per call (vs O(N^2) replay-from-since), no spool table,
no server-side state. Legacy since>db_version callers (send path, spool
fill) are unchanged: columns are appended and the positional branch only
activates when resume_db_version is bound.

Tiling is exact, not idempotent-overlap: (db_version, seq) is a unique
total order (changes rowid = (db_version<<30)|seq), next_* names the row
that did not fit or the exact byte already emitted, and the fragment plan
is a deterministic function of the row, so a resumed fragment tiles
identically. The drain-start cursor must seek exclusive-after the last
applied change (see docs/internal design note) so the protocol never
relies on changes-level idempotency to absorb a re-sent row.
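
One possible reading of the exclusive-after rule, given that the resume inputs seek inclusively: start at the successor of the last applied (db_version, seq) in the packed order. This is an inference from the rowid formula above, not a transcription of the design note, and `exclusive_after` is a hypothetical helper.

```python
SEQ_BITS = 30
SEQ_MASK = (1 << SEQ_BITS) - 1

def exclusive_after(last_db_version, last_seq):
    """Return an inclusive (db_version, seq, frag_offset) cursor that skips the last applied row."""
    rowid = ((last_db_version << SEQ_BITS) | last_seq) + 1  # successor in the total order
    return rowid >> SEQ_BITS, rowid & SEQ_MASK, 0

print(exclusive_after(7, 3))  # (7, 4, 0)
```

An inclusive seek at the returned position lands on the first change strictly after the last applied one, so that change is never re-sent.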

New unit test Payload Chunks Positional Resume drives a window mixing a
db_version split across chunks with a value larger than the chunk budget,
pages it one chunk per call via the cursor, and asserts byte-identity
with a full-window scan (including a mid-fragment resume).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>