40 changes: 40 additions & 0 deletions agents/completed.md
@@ -8,6 +8,46 @@

---

# #375: `yield ProducedFlavor` conversion contract is gone — from-scratch example (and training-endpoints quant endpoints) silently publish nothing

**Completed:** yes
**Status:** DONE (2026-07-04, Claude) — settled the ONE contract, option (a): producer endpoints (conversion/dataset/training) write files locally, call `cozy_convert.publish_flavors(ctx, flavors)` — one Tensorhub /commits commit per ProducedFlavor (path = file or dir) — and return a result struct; the executor stays contract-free. The dead yield-shape is deleted AND enforced: @endpoint rejects generator handlers for producer kinds at decoration time (TypeError pointing at publish_flavors), so the old shape fails loudly at import instead of silently streaming. examples/from-scratch rewritten to the contract (returns FromScratchResult w/ revision_ids) + discovery smoke test (kind=conversion, output_mode=single, result struct) + publish_flavors e2e against the fake /commits server. Docs: @endpoint docstring, endpoint-authoring.md (worked example), local-dev.md, produced.py/publish.py docstrings. training-endpoints quant/convert endpoints rewrite against this contract (their #37); live publish-then-consume e2e (tensorhub #542 scenario) deferred to the e2e suite.

## Metadata
- Category: bug / api-contract
- Status: planned
- Passes: false

## Tasks
- [x] Decide the ONE contract: (a) endpoints call `cozy_convert.publish_flavors(ctx, flavors)` explicitly and return a result struct, or (b) the executor collects `ProducedFlavor` yields from conversion-kind handlers and publishes them. Pick, document on @endpoint, delete the other shape.
- [x] Fix examples/from-scratch to the chosen contract and cover it (discovery+lock smoke at minimum; ideally the e2e publish-then-consume scenario of tensorhub #542).

## Acceptance
from-scratch runs against a live stack and its commit lands in the CAS; the dead shape no longer imports.
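
The decoration-time rejection described in the status line can be sketched roughly as below. This is an illustration only, not the gen_worker implementation: the `endpoint` decorator here is a hypothetical stand-in, and `PRODUCER_KINDS` and the `__endpoint_kind__` attribute are assumed names that do not come from the source.

```python
import inspect

# Assumed set of producer kinds, taken from the kinds named in the status line.
PRODUCER_KINDS = {"conversion", "training", "dataset"}


def endpoint(kind: str = "inference"):
    """Hypothetical stand-in for the real @endpoint decorator."""

    def decorate(cls):
        if kind in PRODUCER_KINDS:
            for name, fn in vars(cls).items():
                if name.startswith("_") or not inspect.isfunction(fn):
                    continue
                # Sync and async generators both count as the dead yield-shape.
                if inspect.isgeneratorfunction(fn) or inspect.isasyncgenfunction(fn):
                    raise TypeError(
                        f"{cls.__name__}.{name}: generator handlers are not "
                        f"allowed for kind={kind!r}; call "
                        "cozy_convert.publish_flavors(ctx, flavors) and return "
                        "a result struct instead"
                    )
        cls.__endpoint_kind__ = kind  # illustrative attribute name
        return cls

    return decorate
```

Because the check runs when the class body is decorated, a module that still uses the old shape raises at import rather than silently streaming.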

---

# #374: cozy_convert publish/clone robustness — retries, resume, ingest junk in published trees, silent empty publish

**Completed:** yes
**Status:** DONE (2026-07-04, Claude) — hub.py: bounded retries w/ backoff + Retry-After on commit POST, part PUTs, complete, and finalize polling (429/5xx/network; abort-DELETE on terminal failure unchanged). run_clone: persistent workdir keyed by sha256(provider|source|destination) under $COZY_CONVERT_WORKDIR (default <tmp>/cozy-convert) — retained on failure so a retry resumes the HF snapshot (hf local-dir metadata kept for exactly that reason), deleted on success; partial flavor trees are wiped per-retry. Junk filtering: files_from_tree AND _copy_non_weights skip `.cache/huggingface/**` so mirrors are byte-faithful (root dotfiles like a real `.gitignore` still publish). `if not result.published: raise` — publishing nothing can never read as success (empty `outputs` was already defaulted to bf16 by normalize_outputs; the guard now covers every path). Legacy `publish_repo_revision` (~450 LOC, legacy /publish route) deleted from _PublisherMixin along with its local-context stub, finalize-poll constants, and orphaned _helpers; checkpoint publishing is ONLY cozy_convert.publish_flavors (/commits). New tests: retry/junk in test_hub.py + run_clone lifecycle in test_publish.py (fake /commits server, real HTTP). Deliberate: publish_dataset_revision (datasets subsystem) untouched; civitai download resume is #373's scope.

## Metadata
- Category: bug / cozy_convert
- Status: planned
- Passes: false

## Tasks
- [x] Live probe: the published repo contains `.cache/huggingface/**` lock/metadata files and `.gitignore` — `ingest_huggingface` snapshots into the tree and `files_from_tree` (hub.py) uploads EVERYTHING. Filter HF-cache internals (or snapshot to a clean copy) so mirrors are byte-faithful to the source repo, not to huggingface_hub's cache layout.
- [x] HubClient has no retry/backoff/429 handling on part PUTs, commit POST, or complete (hub.py:113-156) — one transient S3 hiccup after a multi-GB download+convert fails the whole clone; run_clone `shutil.rmtree`s the workdir in `finally` (clone.py:552) so a retry is a full re-download. Add bounded retries + a keyed persistent workdir for resume.
- [x] `run_clone` with an empty `outputs` list publishes NOTHING and returns success (the specs loop simply doesn't run; the no-publish error only fires when failed_flavors is non-empty). Default to publish-as-is or make empty outputs an error.
- [x] Kill the second publish path: `gen_worker.publish_repo_revision` (request_context/__init__.py:995) still posts to the legacy `/repos/:tenant/:name/publish` route while cozy_convert uses `/commits` (#515 "the ONE publish path"). Convert remaining callers or delete.
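
The junk filtering from the first task can be sketched as a tree walk that skips anything under `.cache/huggingface/`. The prefix check mirrors the one this PR adds to `_copy_non_weights`; the function name `publishable_files` is hypothetical, and the real `files_from_tree` signature is not shown in this diff.

```python
from pathlib import Path
from typing import Iterator


def publishable_files(tree: Path) -> Iterator[Path]:
    """Yield repo-relative paths of files that belong in the published tree."""
    for f in tree.rglob("*"):
        if not f.is_file():
            continue
        rel = f.relative_to(tree)
        # huggingface_hub local-dir download metadata, not repo content.
        if rel.parts[:2] == (".cache", "huggingface"):
            continue
        yield rel
```

Matching on the first two path components keeps root dotfiles such as a real `.gitignore`, and any unrelated `.cache/` content, in the published tree.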

## Acceptance
A mirrored repo's tree equals the source repo's tree; transient network errors don't restart multi-GB work; publishing nothing cannot read as success.
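
The bounded-retry behavior (429/5xx/network errors, honoring `Retry-After`) might look something like the sketch below. It is not the code in hub.py: `call_with_retries`, `retry_delay`, the `(status, headers)` return shape, and the default attempt count and delays are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]


def retry_delay(attempt: int, headers: Mapping[str, str], *,
                base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before the next try (attempt is 0-based)."""
    # Only the delta-seconds form of Retry-After is handled here; the
    # HTTP-date form falls through to exponential backoff.
    retry_after = headers.get("Retry-After", "").strip()
    if retry_after.isdigit():
        return min(float(retry_after), cap)
    return min(base * (2 ** attempt), cap) * random.uniform(0.5, 1.0)


def call_with_retries(send: Callable[[], Response], *, max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() until it returns a non-retryable status or attempts run out."""
    last_error: Exception | None = None
    for attempt in range(max_attempts):
        headers: Mapping[str, str] = {}
        try:
            status, headers = send()
        except OSError as exc:  # connection resets, timeouts, DNS failures
            last_error = exc
        else:
            if status != 429 and not 500 <= status <= 599:
                return status, headers
            last_error = RuntimeError(f"retryable HTTP status {status}")
        if attempt + 1 < max_attempts:
            sleep(retry_delay(attempt, headers))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the loop testable without real waits; on the terminal failure the caller would still be responsible for any cleanup such as the abort-DELETE mentioned in the status line.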

---

# #372: transport hardening — auth-failure gating, HelloAck deadline, redirect hop reset, TLS on redirects

**Completed:** yes
38 changes: 0 additions & 38 deletions agents/progress.md
@@ -379,41 +379,3 @@ Tasks:
Alternating requests across two endpoints on one GPU show demote/promote transitions (no full reloads) and both run unoffloaded when resident alone.

---

# #374: cozy_convert publish/clone robustness — retries, resume, ingest junk in published trees, silent empty publish

**Completed:** no
**Status:** OPEN (2026-07-04, mirror-flow audit + live probe)

## Metadata
- Category: bug / cozy_convert
- Status: planned
- Passes: false

## Tasks
- [ ] Live probe: the published repo contains `.cache/huggingface/**` lock/metadata files and `.gitignore` — `ingest_huggingface` snapshots into the tree and `files_from_tree` (hub.py) uploads EVERYTHING. Filter HF-cache internals (or snapshot to a clean copy) so mirrors are byte-faithful to the source repo, not to huggingface_hub's cache layout.
- [ ] HubClient has no retry/backoff/429 handling on part PUTs, commit POST, or complete (hub.py:113-156) — one transient S3 hiccup after a multi-GB download+convert fails the whole clone; run_clone `shutil.rmtree`s the workdir in `finally` (clone.py:552) so a retry is a full re-download. Add bounded retries + a keyed persistent workdir for resume.
- [ ] `run_clone` with an empty `outputs` list publishes NOTHING and returns success (the specs loop simply doesn't run; the no-publish error only fires when failed_flavors is non-empty). Default to publish-as-is or make empty outputs an error.
- [ ] Kill the second publish path: `gen_worker.publish_repo_revision` (request_context/__init__.py:995) still posts to the legacy `/repos/:tenant/:name/publish` route while cozy_convert uses `/commits` (#515 "the ONE publish path"). Convert remaining callers or delete.

## Acceptance
A mirrored repo's tree equals the source repo's tree; transient network errors don't restart multi-GB work; publishing nothing cannot read as success.

---

# #375: `yield ProducedFlavor` conversion contract is gone — from-scratch example (and training-endpoints quant endpoints) silently publish nothing

**Completed:** no
**Status:** OPEN (2026-07-04) — examples/from-scratch yields `ProducedFlavor` from a `kind="conversion"` generator, but post-#367/#368 gen_worker contains ZERO ProducedFlavor handling: the executor treats the generator as a streaming handler and nothing publishes. training-endpoints' quant/convert endpoints are written against the same dead contract (their #37 depends on this decision).

## Metadata
- Category: bug / api-contract
- Status: planned
- Passes: false

## Tasks
- [ ] Decide the ONE contract: (a) endpoints call `cozy_convert.publish_flavors(ctx, flavors)` explicitly and return a result struct, or (b) the executor collects `ProducedFlavor` yields from conversion-kind handlers and publishes them. Pick, document on @endpoint, delete the other shape.
- [ ] Fix examples/from-scratch to the chosen contract and cover it (discovery+lock smoke at minimum; ideally the e2e publish-then-consume scenario of tensorhub #542).

## Acceptance
from-scratch runs against a live stack and its commit lands in the CAS; the dead shape no longer imports.
31 changes: 26 additions & 5 deletions docs/endpoint-authoring.md
@@ -100,14 +100,35 @@ Resources(gpu=True, vram_gb=24, compute_capability=8.0, libraries=("nunchaku",))
 ## Kinds

 `@endpoint(kind="conversion" | "training" | "dataset")` selects the context
-subclass the handler receives: `ConversionContext` adds
-`publish_repo_revision` / `save_checkpoint` / `mktemp` / `source` /
-`destination`; `DatasetContext` adds `publish_dataset_revision` /
-`resolve_dataset`; `TrainingContext` adds the repo-metadata RPCs.
+subclass the handler receives: `ConversionContext` adds `save_checkpoint` /
+`mktemp` / `source` / `destination`; `DatasetContext` adds
+`publish_dataset_revision` / `resolve_dataset`; `TrainingContext` adds the
+repo-metadata RPCs.
+
+Producer endpoints publish **explicitly**: write files locally, call
+`cozy_convert.publish_flavors(ctx, flavors)` — one Tensorhub commit per
+`ProducedFlavor` (path = file or directory) — and return a result struct:
+
+```python
+@endpoint(kind="conversion")
+class Convert:
+    def run(self, ctx: ConversionContext, p: In) -> Out:
+        out_dir = ctx.mktemp()
+        ...  # write model files under out_dir
+        commits = publish_flavors(
+            ctx, [ProducedFlavor(path=out_dir, flavor="bf16")],
+            destination_repo=p.destination_repo,
+        )
+        return Out(revision_ids=[c.revision_id for c in commits])
+```
+
+Generator handlers are rejected for producer kinds — yielding streams
+chunks, it never publishes.

 ## Streaming

-An async-generator handler streams; each yielded struct is one chunk:
+An async-generator handler streams (inference kinds only); each yielded
+struct is one chunk:

 ```python
 async def stream(self, ctx, p: In) -> AsyncIterator[Out]:
11 changes: 7 additions & 4 deletions docs/local-dev.md
@@ -173,10 +173,13 @@ The `.gen-worker-run/` directory is throwaway. Add it to `.gitignore` /

 ## Conversion / dataset endpoints

-`ConversionContext.publish_repo_revision` and `materialize_blob` are
-stubbed by default — they print the would-be call to stderr and return
-a fake response. Pass `--allow-publish` to call the real tensorhub APIs
-(useful for round-tripping against a dev tensorhub).
+Checkpoint publishing goes through `cozy_convert.publish_flavors(ctx,
+flavors)`, which talks to tensorhub directly using the worker capability
+token — with no token configured (plain local runs) it fails loudly
+instead of pretending to publish. `ConversionContext.materialize_blob`
+is stubbed against the local CAS by default; pass `--allow-publish` to
+call the real tensorhub API (useful for round-tripping against a dev
+tensorhub).

 ```bash
 gen-worker run --payload '{"source":{"ref":"..."},"specs":[...]}' --allow-publish
9 changes: 4 additions & 5 deletions examples/from-scratch/README.md
@@ -1,17 +1,16 @@
 # from-scratch

-A `@conversion` that emits an **orphan checkpoint** — a brand-new set of weights with no source repo, no parent lineage. The SDK accepts an empty lineage array and lands the checkpoint as a root node in the lineage DAG.
+An `@endpoint(kind="conversion")` that publishes an **orphan checkpoint** — a brand-new set of weights with no source repo, no parent lineage.

 ## What it demonstrates
-- The `from-scratch` training kind: `@conversion(kind="from-scratch", concurrency="sequential")` — declares to the runtime that this job genuinely has no upstream model to materialize.
-- **`ProducedFlavor`** as the return contract — the function generates weights, writes them to a path, returns `[ProducedFlavor(path=..., flavor=...)]`; the library handles upload + finalize + tag application.
-- Tenant code never touches tensorhub's upload API directly — the SDK owns the session lifecycle.
+- **The producer publish contract**: write files locally, call `cozy_convert.publish_flavors(ctx, flavors)` — one Tensorhub commit per `ProducedFlavor` — and return a result struct. Nothing publishes implicitly; generator handlers are rejected for producer kinds.
+- Tenant code never touches tensorhub's upload API directly — `publish_flavors` owns hashing, presigned part PUTs, dedup, and finalize.

 ## When to copy it
 - Generating random-init weights for a new architecture.
 - Producing a "blank" base model that downstream conversion/training jobs can build off.
 - Any job that produces checkpoints from nothing (synthesis, distillation from a non-Cozy source, etc.).

 ## Files
-- `from_scratch.py` — the function; uses `torch.manual_seed` for deterministic output.
+- `from_scratch.py` — the endpoint; uses `torch.manual_seed` for deterministic output.
 - `endpoint.toml` — declares CPU-only resources (this example doesn't need GPU).
35 changes: 25 additions & 10 deletions examples/from-scratch/from_scratch.py
@@ -1,18 +1,17 @@
"""from-scratch — example ``@endpoint(kind="conversion")`` that generates
 random weights and publishes them as an orphan checkpoint (no parent lineage).

-Tenant code never touches tensorhub's upload contract: it writes files
-locally, yields ``ProducedFlavor``s, and ``cozy_convert.publish_flavors``
-turns each into one Tensorhub commit.
+The publish contract: a conversion handler writes files locally, calls
+``cozy_convert.publish_flavors(ctx, flavors)`` — one Tensorhub commit per
+``ProducedFlavor`` — and returns a result struct. Nothing publishes
+implicitly; generator handlers are rejected for producer kinds.
 """

 from __future__ import annotations

-from typing import Iterator
-
 import msgspec

-from cozy_convert import ProducedFlavor
+from cozy_convert import ProducedFlavor, publish_flavors
 from gen_worker import ConversionContext, endpoint


@@ -22,15 +21,21 @@ class FromScratchInput(msgspec.Struct, forbid_unknown_fields=True):
     hidden_dim: int = 64


+class FromScratchResult(msgspec.Struct):
+    destination_repo: str
+    revision_ids: list[str]
+    published_files: int
+
+
 @endpoint(kind="conversion")
 class FromScratch:
-    """Generate random weights and emit them as an orphan checkpoint."""
+    """Generate random weights and publish them as an orphan checkpoint."""

     def generate(
         self,
         ctx: ConversionContext,
         payload: FromScratchInput,
-    ) -> Iterator[ProducedFlavor]:
+    ) -> FromScratchResult:
         import torch
         from safetensors.torch import save_file

@@ -43,7 +48,17 @@ def generate(
         }
         weights_path = ctx.mktemp() / "weights.safetensors"
         save_file(weights, str(weights_path))
-        yield ProducedFlavor(path=weights_path, flavor="fp32")
+
+        commits = publish_flavors(
+            ctx,
+            [ProducedFlavor(path=weights_path, flavor="fp32")],
+            destination_repo=payload.destination_repo,
+        )
+        return FromScratchResult(
+            destination_repo=payload.destination_repo,
+            revision_ids=[c.revision_id for c in commits],
+            published_files=sum(c.uploaded + c.deduped for c in commits),
+        )


-__all__ = ["FromScratchInput", "FromScratch"]
+__all__ = ["FromScratchInput", "FromScratchResult", "FromScratch"]
46 changes: 41 additions & 5 deletions packages/cozy_convert/src/cozy_convert/clone.py
@@ -11,6 +11,8 @@

 from __future__ import annotations

+import hashlib
+import logging
 import os
 import re
 import shutil
@@ -23,6 +25,8 @@
 from .ingest import IngestedSource, ingest_civitai, ingest_huggingface
 from .writer import MAX_SAFETENSORS_SHARD_BYTES, shard_safetensors_by_offset

+logger = logging.getLogger(__name__)
+
 _PUBLIC_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{0,127}$")
 _PUBLIC_TAG_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{0,62}$")

@@ -177,6 +181,8 @@ def _copy_non_weights(source_dir: Path, out_dir: Path, *, skip_components: set[s
         if not f.is_file():
             continue
         rel = f.relative_to(source_dir)
+        if rel.parts[:2] == (".cache", "huggingface"):
+            continue  # hf local-dir download metadata, not repo content
         comp = rel.parts[0] if len(rel.parts) > 1 else ""
         name = f.name
         is_weightish = _is_weight(f) or name.endswith(".safetensors.index.json")
@@ -326,6 +332,20 @@ def build_flavor_tree(
 # run_clone — ingest, convert, ONE finalize path
 # ---------------------------------------------------------------------------

+def _clone_workdir(provider: str, source_key: str, destination: str) -> Path:
+    """Persistent workdir keyed by (provider, source, destination): a failed
+    clone keeps its downloaded snapshot so a retry resumes instead of
+    re-downloading. Deleted on success. Base dir: ``$COZY_CONVERT_WORKDIR``
+    or ``<tmp>/cozy-convert``."""
+    base = Path(os.environ.get("COZY_CONVERT_WORKDIR", "").strip()
+                or Path(tempfile.gettempdir()) / "cozy-convert")
+    digest = hashlib.sha256(
+        f"{provider}|{source_key}|{destination}".encode("utf-8")).hexdigest()[:16]
+    workdir = base / f"clone-{digest}"
+    workdir.mkdir(parents=True, exist_ok=True)
+    return workdir
+
+
 def run_clone(
     ctx: Any,
     *,
@@ -359,7 +379,11 @@ def _progress(p: float, stage: str) -> None:
         except Exception:
             pass

-    workdir = Path(tempfile.mkdtemp(prefix=f"clone-{getattr(ctx, 'request_id', 'x')}-"))
+    source_key = source_ref if provider == "huggingface" else str(civitai_model_version_id or 0)
+    if source_revision:
+        source_key = f"{source_key}@{source_revision}"
+    workdir = _clone_workdir(provider, source_key, destination)
+    succeeded = False
     try:
         _progress(0.05, "clone.ingest")

@@ -423,8 +447,14 @@ def _dl_progress(done: int, total: Optional[int]) -> None:
                     attrs = dict(source.attrs)
                     flavor_label = source_dtype or spec.dtype
                 else:
+                    # Wipe any partial flavor tree from a prior failed run —
+                    # only the downloaded source is resumable.
+                    flavor_dir = workdir / f"flavor-{spec.label}"
+                    shutil.rmtree(flavor_dir, ignore_errors=True)
+                    shutil.rmtree(workdir / f"flavor-{spec.label}.__repack__",
+                                  ignore_errors=True)
                     tree, attrs = build_flavor_tree(
-                        source, spec, workdir / f"flavor-{spec.label}",
+                        source, spec, flavor_dir,
                         quantize_components=quantize_components,
                     )
             except InlineConversionNotPossible as exc:
@@ -490,18 +520,24 @@ def _dl_progress(done: int, total: Optional[int]) -> None:
                     "total_bytes": commit.total_bytes,
                 })

-        if not result.published and result.failed_flavors:
-            reasons = "; ".join(str(f.get("reason") or "") for f in result.failed_flavors)
+        if not result.published:
+            reasons = "; ".join(
+                str(f.get("reason") or "") for f in result.failed_flavors
+            ) or "no output spec produced anything"
             raise RuntimeError(f"clone produced no publishable flavor: {reasons}")

         result.metadata["destination_repo"] = destination
         result.metadata["published_count"] = str(len(result.published))
         if result.failed_flavors:
             result.metadata["failed_flavor_count"] = str(len(result.failed_flavors))
         _progress(1.0, "clone.completed")
+        succeeded = True
         return result
     finally:
-        shutil.rmtree(workdir, ignore_errors=True)
+        if succeeded:
+            shutil.rmtree(workdir, ignore_errors=True)
+        else:
+            logger.warning("clone failed; workdir retained for resume: %s", workdir)


 def from_huggingface(ctx: Any, payload: Any, *, hf_token: str | None = None) -> CloneResult: