feat(scripts): Docker-only end-to-end /e2e command for example-libpng #552
ret2libc wants to merge 10 commits
Addressed CI lint/static failures in 5210208 (shellcheck SC2015 in scripts/e2e.sh).
Adds scripts/e2e.sh, `make e2e`, and a .claude/commands/e2e.md slash command that bring the Buttercup stack up via dev/docker-compose (no Kubernetes), submit the example-libpng task, and monitor the scheduler / seed-gen / patcher logs through the milestones tracked by .github/workflows/system-integration.yml (fuzzer build, POV submit / pass, seed-gen, patch generate / approve / pass, bundle submit, and optionally SARIF). Defaults LITELLM_MAX_BUDGET to $3 so accidental runs are cheap; tears the stack down on exit unless --keep-up is set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The e2e driver now brings the stack up through the compose.prebuilt.yaml overlay and `docker compose pull` (tag configurable via --image-tag / BUTTERCUP_IMAGE_TAG, default "main") instead of `docker compose build`, so a run no longer depends on a working local image build (e.g. the cscope submodule / oss-fuzz base-runner build chain).
- dc() applies `-f compose.yaml -f compose.prebuilt.yaml` and exports BUTTERCUP_IMAGE_TAG for every compose subcommand (pull/up/logs/down).
- --no-build kept as a deprecated alias for the new --no-pull.
- Teardown hint and e2e.md updated for the overlay.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
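For illustration, a minimal sketch of what a dc() wrapper along these lines could look like. It assumes the current directory is dev/docker-compose/ and passes the tag through the command environment; the actual function in scripts/e2e.sh may be structured differently.

```shell
#!/usr/bin/env bash
# Sketch only: wrap every compose subcommand so it always sees both files
# and the image tag. Assumes the working directory holds both compose files.
dc() {
  BUTTERCUP_IMAGE_TAG="${BUTTERCUP_IMAGE_TAG:-main}" \
    docker compose -f compose.yaml -f compose.prebuilt.yaml "$@"
}

# Usage: dc pull; dc up -d; dc logs scheduler; dc down
```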
e2e.sh regenerates dev/docker-compose/.env from scratch every run, sourcing values only from environment variables. Variables not exported (notably LANGFUSE_HOST/PUBLIC_KEY/SECRET_KEY) were defaulted to empty and written back, clobbering values a user had set directly in .env. Add prev_env() and a 3-tier resolution: environment > existing .env > placeholder. Manually-set .env values (Langfuse creds, provider keys, litellm key) now survive subsequent runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
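The 3-tier resolution can be sketched as follows. The signatures of prev_env() and the resolve() helper are assumptions made for this example, not the ones in scripts/e2e.sh; the sketch also assumes simple unquoted KEY=value lines in .env.

```shell
#!/usr/bin/env bash
# prev_env KEY FILE: print the last KEY=value assignment found in FILE, if any.
prev_env() {
  local key=$1 file=$2
  [ -f "$file" ] || return 0
  sed -n "s/^${key}=//p" "$file" | tail -n 1
}

# resolve KEY PLACEHOLDER FILE: environment > existing .env > placeholder.
resolve() {
  local key=$1 placeholder=$2 file=$3 val
  # printenv only sees exported variables, i.e. the real environment.
  val=$(printenv "$key" || true)
  if [ -z "$val" ]; then
    val=$(prev_env "$key" "$file")
  fi
  printf '%s\n' "${val:-$placeholder}"
}
```

With this shape, a value exported in the shell wins, a value already present in .env is kept when nothing is exported, and the placeholder is used only when both are empty.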
Replace the `wait_for ... && record ok || record TIMEOUT` and `curl ... && record ok || record fail` constructs with explicit if-then-else blocks. shellcheck flagged these as SC2015 (A && B || C is not if-then-else), causing the "Lint shell scripts" step in the Static Checks workflow to fail. Behavior is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
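A small sketch of the rewrite pattern, with a stand-in record() and a hypothetical check_milestone() helper (neither is taken from the script):

```shell
#!/usr/bin/env bash
# record NAME STATUS: stand-in for the script's result recorder.
record() { printf '%s: %s\n' "$1" "$2"; }

# Before (flagged by SC2015): if the "ok" branch itself fails, the
# TIMEOUT branch runs as well.
#   wait_for ... && record fuzzer-build ok || record fuzzer-build TIMEOUT

# After: an explicit conditional, so exactly one branch runs.
check_milestone() {
  local name=$1
  shift
  if "$@"; then
    record "$name" ok
  else
    record "$name" TIMEOUT
  fi
}
```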
With `set -o pipefail`, `dc logs ... | grep -m1` makes the upstream `docker compose logs` die with SIGPIPE (rc 141) once grep matches the first line; pipefail then fails the whole pipeline, so milestones whose log line appears early in a high-volume stream (e.g. seed-gen's 'Copied N files to corpus') are never registered and wait_for spins until timeout even though the milestone occurred. Capture grep output with '|| true' and test for non-empty instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
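The fix can be sketched like this, with a hypothetical log_has() helper that takes an arbitrary log-producing command in place of `dc logs`:

```shell
#!/usr/bin/env bash
set -o pipefail

# log_has PATTERN CMD...: succeed if CMD's output contains a line matching PATTERN.
log_has() {
  local pattern=$1 match
  shift
  # grep -m1 exits after the first match, so the producer may die with
  # SIGPIPE (rc 141). '|| true' keeps pipefail from turning that into a
  # failure; the captured text, not the exit status, decides the result.
  match=$("$@" 2>/dev/null | grep -m1 -e "$pattern" || true)
  [ -n "$match" ]
}
```

For example, `log_has y yes` succeeds even though `yes` is killed by SIGPIPE as soon as grep exits.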
Drop --no-build, --keep-up, --skip-wait, --sarif, --task-json and the per-phase --*-timeout flags. The stack now always tears down on exit; milestone timeouts are internal constants.
Addresses PR #552 review:
- provider-key check moved below the .env fallback so keys saved to .env on a prior run are accepted (tip is now accurate)
- --task-json removed (was silently falling back to the libpng default)
- trigger_task response uses mktemp + on_exit cleanup instead of a predictable /tmp/e2e_task_resp.$$ leaked on SIGINT/SIGTERM
- --no-build phantom "deprecated alias" removed
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
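The mktemp-plus-cleanup idea from the third item can be sketched as below. The with_tempfile() helper is hypothetical; the script itself uses its own on_exit handler, which this does not reproduce.

```shell
#!/usr/bin/env bash
# with_tempfile CMD...: run CMD with a fresh temp-file path appended as its
# last argument, and remove the file on normal exit, SIGINT, or SIGTERM.
with_tempfile() {
  (
    resp=$(mktemp) || exit 1
    trap 'rm -f "$resp"' EXIT   # runs when this subshell exits
    trap 'exit 130' INT         # turn signals into exits so EXIT fires
    trap 'exit 143' TERM
    "$@" "$resp"
    exit $?
  )
}
```

Because the path comes from mktemp it is unpredictable, and because cleanup hangs off the EXIT trap it also runs when the run is interrupted.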
The local litellm master key is an internal detail of the docker-compose stack, not something the user should set. Remove it from the usage text and the env/.env resolution; e2e.sh now just writes the local default (sk-1234) into the generated .env. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e2e.sh regenerates dev/docker-compose/.env every run and was always writing LANGFUSE_HOST=/PUBLIC_KEY=/SECRET_KEY= even when unset. Since .env is loaded last in compose's env_file list, an empty value silently disabled Langfuse telemetry. Now resolved env -> existing .env, and the LANGFUSE_* lines are only written when non-empty, so values the user set in .env survive across runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
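One way to express "only written when non-empty" is a tiny emitter, sketched here with a hypothetical emit_if_set() helper:

```shell
#!/usr/bin/env bash
# emit_if_set KEY VALUE: print "KEY=VALUE" only when VALUE is non-empty,
# so an unset variable never lands in .env as an empty override.
emit_if_set() {
  if [ -n "$2" ]; then
    printf '%s=%s\n' "$1" "$2"
  fi
}

# Usage while generating the file (the variable holds the resolved value):
#   emit_if_set LANGFUSE_HOST "${LANGFUSE_HOST:-}" >> .env
```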
The pov-submit and bundle-submit waiters used "POV submission response: pov_id=" and "Bundle submission response: bundle_id=" which never match any rendered log line: the only "... submission response:" logs are logger.debug calls whose payload is an API object repr (no literal pov_id=/bundle_id=), while pov_id=/bundle_id= appear only in the separate structured summary line (logger.info) with a different prefix. Result: both milestones always timed out, so every run — including fully successful ones — wasted MILESTONE_TIMEOUT+BUNDLE_TIMEOUT and exited non-zero. Repoint both to the structured summary tokens (pov_id= / bundle_id=) and sync the marker list in .claude/commands/e2e.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ults
Three defects found while verifying the pipeline end-to-end:
1. Approval one-shot race: capture_line 'competition_patch_id=' ran once
right after the patch-generated milestone, but the scheduler logs that
id only minutes later (after it builds+verifies+submits the patch). The
capture always lost the race, so approval was always skipped and the
local stack never reached Patch passed / bundle. Replace with a
wait_capture() poll loop (mirrors wait_for) so approval actually fires.
2. Default --task-duration 1800 is self-defeating: build->POV->seed-gen->
patch exceeds 30 min on normal hardware, so the task expires mid-patch
("task expired/cancelled? Will discard") and never reaches patch/bundle.
Default to 7200 so the task outlives the pipeline.
3. Default --budget 3 cannot reach patch/bundle: a full run through patch
generation costs ~$10; $3 is exhausted around POV. Default to 10.
e2e.md updated to match (defaults, the cheap --budget 3 caveat, and the
poll-then-approve description).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
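The poll loop from item 1 could look roughly like this. The argument order and the use of an arbitrary log command are assumptions for the sketch; the wait_capture() in scripts/e2e.sh may differ.

```shell
#!/usr/bin/env bash
# wait_capture PATTERN TIMEOUT_SECS INTERVAL_SECS CMD...
# Re-run CMD until a line matching PATTERN appears, then print that line.
# Returns 1 if the deadline passes without a match.
wait_capture() {
  local pattern=$1 timeout=$2 interval=$3 deadline line
  shift 3
  deadline=$(( $(date +%s) + timeout ))
  while :; do
    line=$("$@" 2>/dev/null | grep -m1 -e "$pattern" || true)
    if [ -n "$line" ]; then
      printf '%s\n' "$line"
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}
```

Unlike a one-shot capture, this keeps checking until the scheduler has actually logged the id, so approval is not skipped just because the line shows up minutes later.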
hbrodin left a comment
LGTM. One inline note on TASK_ID parsing — fragility, not a current bug.
    if [[ -n "$PATCH_LINE" ]]; then
      PATCH_ID=$(printf '%s' "$PATCH_LINE" | sed -n 's/.*competition_patch_id=\([^ ]*\).*/\1/p')
      # Task id is inside the first [...] block, after the last ':'.
      TASK_ID=$(printf '%s' "$PATCH_LINE" | sed -n 's/.*\[\([^]]*\)\].*/\1/p' | sed 's/^[^:]*://')
The inline comment on the previous line says "Task id is inside the first [...] block, after the last ':'", but the code does the opposite of both: `s/.*\[\([^]]*\)\].*/\1/p` uses a greedy `.*` and so captures the last [...] block, and `s/^[^:]*://` strips only up to the first `:`.
It works today because log_entry() in orchestrator/src/buttercup/orchestrator/scheduler/submissions.py emits exactly one bracketed [{idx_msg}{task_id}] per line, and task_id contains no :. If anyone formats another bracketed token into the log message (e.g. appends bracketed context, or adds a field to MaxLengthFormatter's format string in common/logger.py), this silently captures the wrong block; the subsequent curl .../v1/task/${TASK_ID}/patch/${PATCH_ID}/approve would then return 404 → patch-approve: HTTP fail with no diagnostic pointing back to the parse.
Repro: printf '%s\n' '… [2:taskid] … extra [foo:bar]' | sed -n 's/.*\[\([^]]*\)\].*/\1/p' | sed 's/^[^:]*://' yields bar instead of taskid.
Either tighten the regex (e.g. match a single \[[^:]+:[^]]+\] rather than .*\[…\]) or update the comment to match what the code actually does.
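One possible tightening, as a sketch rather than a drop-in patch: anchor on the first [...] block and then strip through the last `:`, which is what the code comment describes. The extract_task_id() name is made up for this example, and it assumes no other bracketed token precedes the [idx:task_id] block on the line.

```shell
#!/usr/bin/env bash
# extract_task_id LINE: print the task id from the first [...] block,
# taking whatever follows the last ':' inside that block.
extract_task_id() {
  local line=$1 block
  # ^[^[]* cannot skip past a '[', so this always lands on the first block.
  block=$(printf '%s\n' "$line" | sed -n 's/^[^[]*\[\([^]]*\)\].*/\1/p')
  # Remove the longest prefix ending in ':'; unchanged if there is no ':'.
  printf '%s\n' "${block##*:}"
}
```

On the repro line above this prints taskid, since the trailing [foo:bar] is never considered.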
What this adds
A Docker-only end-to-end smoke test of the full Buttercup pipeline against example-libpng, with no Kubernetes/minikube. Mirrors the milestones in `.github/workflows/system-integration.yml` but tails `docker compose logs`.
- `scripts/e2e.sh`: brings the `dev/docker-compose/` stack up, submits the canned libpng trigger_task, and waits on the pipeline milestones (fuzzer build → POV submitted → POV accepted → seed-gen → patch generated/approved/passed → bundle submitted; optional SARIF).
- `make e2e` (and `make e2e E2E_ARGS=...`).
- `.claude/commands/e2e.md`: `/e2e` slash command wrapper.
Flags: `--budget` (LiteLLM per-user max budget, default $3), `--task-duration`, `--image-tag` / `BUTTERCUP_IMAGE_TAG`, `--no-pull`, `--keep-up`, `--skip-wait`, `--sarif`, per-phase timeout overrides.
Image source
By default the stack runs the prebuilt GHCR images via the `compose.prebuilt.yaml` overlay (nothing built locally). `--no-pull` skips the `docker compose pull` and uses already-present images (e.g. locally built and tagged `ghcr.io/trailofbits/buttercup/*:<tag>`).
.env handling
`e2e.sh` regenerates `dev/docker-compose/.env` each run. It resolves each value as environment → existing `.env` → placeholder, so manually-set values (e.g. `LANGFUSE_*`) are preserved across runs instead of being clobbered with empty or placeholder values.
Dependency / merge ordering
The prebuilt path invokes `docker compose -f compose.yaml -f compose.prebuilt.yaml`. The `compose.prebuilt.yaml` overlay is not in this PR; it lives on the separate compose-prebuilt branch/PR. This PR should land after or together with that one; on its own the overlay file must already be present in `dev/docker-compose/`.
Scope
e2e tooling only: `.claude/commands/e2e.md`, `Makefile`, `scripts/e2e.sh`. Independent of the three pipeline fixes surfaced while building this (buttercup-ui internal port, litellm budget enforcement, patcher task storage), which are their own separate PRs.
Validation
This tooling was used to drive the pipeline end-to-end during development:
fuzzer build → POV submitted → POV accepted, through seed-gen and patch
generation, with budget tracking and Langfuse tracing.
🤖 Generated with Claude Code