feat(scripts): Docker-only end-to-end /e2e command for example-libpng by ret2libc · Pull Request #552 · trailofbits/buttercup

ret2libc · 2026-05-15T13:15:33Z

What this adds

A Docker-only end-to-end smoke test of the full Buttercup pipeline against
example-libpng, with no Kubernetes/minikube. Mirrors the milestones in
.github/workflows/system-integration.yml but tails docker compose logs.

- scripts/e2e.sh: brings the dev/docker-compose/ stack up, submits the
  canned libpng trigger_task, and waits on the pipeline milestones
  (fuzzer build → POV submitted → POV accepted → seed-gen → patch
  generated/approved/passed → bundle submitted; optional SARIF).
- make e2e (and make e2e E2E_ARGS=...).
- .claude/commands/e2e.md: /e2e slash command wrapper.

Flags: --budget (LiteLLM per-user max budget, default $3),
--task-duration, --image-tag / BUTTERCUP_IMAGE_TAG, --no-pull,
--keep-up, --skip-wait, --sarif, per-phase timeout overrides.

Image source

By default the stack runs the prebuilt GHCR images via the
compose.prebuilt.yaml overlay (nothing built locally). --no-pull skips the
docker compose pull and uses already-present images (e.g. locally built and
tagged ghcr.io/trailofbits/buttercup/*:<tag>).

.env handling

e2e.sh regenerates dev/docker-compose/.env on each run. It resolves each
value as environment → existing .env → placeholder, so manually set
values (e.g. LANGFUSE_*) are preserved across runs instead of being
clobbered with empty or placeholder values.
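
The resolution order can be sketched in a few lines of bash. This is a minimal illustration, not the script's actual code: prev_env() is named in the commit history, but its body here, the resolve() helper, and the ENV_FILE variable are assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed location of the generated file; override via ENV_FILE.
ENV_FILE="${ENV_FILE:-dev/docker-compose/.env}"

# prev_env KEY: print KEY's value from the existing .env, or nothing.
prev_env() {
  local key="$1"
  if [[ ! -f "$ENV_FILE" ]]; then
    return 0
  fi
  # Last assignment wins; "|| true" keeps a missing key from tripping set -e.
  { grep -E "^${key}=" "$ENV_FILE" || true; } | tail -n 1 | cut -d= -f2-
}

# resolve KEY PLACEHOLDER: environment > existing .env > placeholder.
resolve() {
  local key="$1" placeholder="${2:-}" value
  value="${!key:-}"                 # 1. variable set in the environment
  if [[ -z "$value" ]]; then
    value="$(prev_env "$key")"      # 2. value already present in .env
  fi
  if [[ -z "$value" ]]; then
    value="$placeholder"            # 3. placeholder as the last resort
  fi
  printf '%s\n' "$value"
}
```

A caller would write each line as something like `printf 'KEY=%s\n' "$(resolve KEY placeholder)"`, so a value typed into .env by hand is read back before the file is overwritten.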

Dependency / merge ordering

The prebuilt path invokes
docker compose -f compose.yaml -f compose.prebuilt.yaml. The
compose.prebuilt.yaml overlay is not in this PR; it lives on the
separate compose-prebuilt branch/PR. This PR should land after or together
with that one: if it lands on its own, the overlay file must already be
present in dev/docker-compose/.

Scope

e2e tooling only — .claude/commands/e2e.md, Makefile, scripts/e2e.sh.
Independent of the three pipeline fixes surfaced while building this
(buttercup-ui internal port, litellm budget enforcement, patcher task
storage), which are their own separate PRs.

Validation

This tooling was used to drive the pipeline end-to-end during development:
fuzzer build → POV submitted → POV accepted, through seed-gen and patch
generation, with budget tracking and Langfuse tracing.

🤖 Generated with Claude Code

ret2libc · 2026-05-15T14:45:50Z

Addressed CI lint/static failures in 5210208 (shellcheck SC2015 in scripts/e2e.sh).

Adds scripts/e2e.sh, `make e2e`, and a .claude/commands/e2e.md slash command that bring the Buttercup stack up via dev/docker-compose (no Kubernetes), submit the example-libpng task, and monitor the scheduler / seed-gen / patcher logs through the milestones tracked by .github/workflows/system-integration.yml (fuzzer build, POV submit / pass, seed-gen, patch generate / approve / pass, bundle submit, and optionally SARIF). Defaults LITELLM_MAX_BUDGET to $3 so accidental runs are cheap; tears the stack down on exit unless --keep-up is set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The e2e driver now brings the stack up through the compose.prebuilt.yaml overlay and `docker compose pull` (tag configurable via --image-tag / BUTTERCUP_IMAGE_TAG, default "main") instead of `docker compose build`, so a run no longer depends on a working local image build (e.g. the cscope submodule / oss-fuzz base-runner build chain).

- dc() applies `-f compose.yaml -f compose.prebuilt.yaml` and exports BUTTERCUP_IMAGE_TAG for every compose subcommand (pull/up/logs/down).
- --no-build kept as a deprecated alias for the new --no-pull.
- Teardown hint and e2e.md updated for the overlay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
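
A wrapper along the lines of dc() might look like the sketch below. Only the two -f files, the BUTTERCUP_IMAGE_TAG export, and the "main" default come from the commit message; the COMPOSE_DIR variable and the way the paths are joined are assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed layout: both compose files live in dev/docker-compose/.
COMPOSE_DIR="${COMPOSE_DIR:-dev/docker-compose}"

# Exported once so every compose subcommand sees the same image tag.
export BUTTERCUP_IMAGE_TAG="${BUTTERCUP_IMAGE_TAG:-main}"

# dc ARGS...: run `docker compose` with the base file plus the prebuilt overlay.
dc() {
  docker compose \
    -f "$COMPOSE_DIR/compose.yaml" \
    -f "$COMPOSE_DIR/compose.prebuilt.yaml" \
    "$@"
}

# Typical sequence (not executed here):
#   dc pull
#   dc up -d
#   dc down
```

Routing every subcommand through one function keeps pull, up, logs, and down on the same file set, so the teardown cannot accidentally target a different project definition than the one that was started.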

e2e.sh regenerates dev/docker-compose/.env from scratch every run, sourcing values only from environment variables. Variables not exported (notably LANGFUSE_HOST/PUBLIC_KEY/SECRET_KEY) were defaulted to empty and written back, clobbering values a user had set directly in .env. Add prev_env() and a 3-tier resolution: environment > existing .env > placeholder. Manually-set .env values (Langfuse creds, provider keys, litellm key) now survive subsequent runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the `wait_for ... && record ok || record TIMEOUT` and `curl ... && record ok || record fail` constructs with explicit if-then-else blocks. shellcheck flagged these as SC2015 (A && B || C is not if-then-else), causing the "Lint shell scripts" step in the Static Checks workflow to fail. Behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
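
The shape of that rewrite, as a minimal sketch. The names wait_for and record come from the commit message, but their bodies here are stand-ins, and check_milestone is a hypothetical wrapper added for illustration.

```shell
#!/usr/bin/env bash
set -uo pipefail

RESULTS=()

# Stand-in recorder (assumption): collect "name=status" pairs.
record() { RESULTS+=("$1=$2"); }

# Stand-in waiter (assumption): succeeds only when told the marker is present.
wait_for() { [[ "$1" == "present" ]]; }

# Flagged form (SC2015): in `A && B || C`, C also runs when A succeeds
# but B itself returns non-zero, so it is not a true if-then-else.
#   wait_for "$state" && record "$name" ok || record "$name" TIMEOUT

# Explicit form: exactly one branch runs, decided only by wait_for.
check_milestone() {
  local name="$1" state="$2"
  if wait_for "$state"; then
    record "$name" ok
  else
    record "$name" TIMEOUT
  fi
}
```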

With `set -o pipefail`, `dc logs ... | grep -m1` makes the upstream `docker compose logs` die with SIGPIPE (rc 141) once grep matches the first line; pipefail then fails the whole pipeline, so milestones whose log line appears early in a high-volume stream (e.g. seed-gen's 'Copied N files to corpus') are never registered and wait_for spins until timeout even though the milestone occurred. Capture grep output with '|| true' and test for non-empty instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
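
The failure mode and the fix can be reproduced without Docker. In the sketch below, noisy_logs is a hypothetical stand-in for `dc logs ...`, and milestone_seen is an illustrative name rather than a function from the script.

```shell
#!/usr/bin/env bash
set -o pipefail

# Stand-in for `dc logs ...` (assumption): the marker comes first, followed by
# enough output that the producer is still writing after grep has exited.
noisy_logs() {
  printf 'Copied 3 files to corpus\n'
  seq 1 200000
}

# Fragile: once grep -m1 exits, the producer hits a closed pipe and pipefail
# reports the whole pipeline as failed even though a line matched.
#   if noisy_logs | grep -m1 -q -- "$pattern"; then ...

# Robust: capture the match, discard the pipeline status, test for non-empty.
milestone_seen() {
  local pattern="$1" line
  line="$(noisy_logs | grep -m1 -- "$pattern" || true)"
  [[ -n "$line" ]]
}
```

Testing the captured text rather than the exit status makes the result depend only on whether a line matched, not on how the upstream process happened to terminate.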

Drop --no-build, --keep-up, --skip-wait, --sarif, --task-json and the per-phase --*-timeout flags. The stack now always tears down on exit; milestone timeouts are internal constants.

Addresses PR #552 review:

- provider-key check moved below the .env fallback so keys saved to .env on a prior run are accepted (tip is now accurate)
- --task-json removed (was silently falling back to the libpng default)
- trigger_task response uses mktemp + on_exit cleanup instead of a predictable /tmp/e2e_task_resp.$$ leaked on SIGINT/SIGTERM
- --no-build phantom "deprecated alias" removed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
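
The mktemp-plus-cleanup pattern mentioned in the review list, as a minimal sketch. on_exit is named in the commit message; the TASK_RESP variable, the file name template, and the exit codes in the signal traps are assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

TASK_RESP=""

# on_exit: remove the temp file if one was created.
on_exit() {
  if [[ -n "${TASK_RESP:-}" ]]; then
    rm -f -- "$TASK_RESP"
  fi
}

# Run the cleanup on normal exit, and turn INT/TERM into an exit so the
# EXIT trap also fires when the run is interrupted.
trap on_exit EXIT
trap 'exit 130' INT
trap 'exit 143' TERM

# Unpredictable name created atomically, unlike /tmp/e2e_task_resp.$$.
TASK_RESP="$(mktemp "${TMPDIR:-/tmp}/e2e_task_resp.XXXXXX")"

# ... write the trigger_task response body into "$TASK_RESP" and inspect it ...
```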

The local litellm master key is an internal detail of the docker-compose stack, not something the user should set. Remove it from the usage text and the env/.env resolution; e2e.sh now just writes the local default (sk-1234) into the generated .env.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

e2e.sh regenerates dev/docker-compose/.env every run and was always writing LANGFUSE_HOST=/PUBLIC_KEY=/SECRET_KEY= even when unset. Since .env is loaded last in compose's env_file list, an empty value silently disabled Langfuse telemetry. Now resolved env -> existing .env, and the LANGFUSE_* lines are only written when non-empty, so values the user set in .env survive across runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
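
Writing a line only when it has a value could be sketched as follows. Both helper names are hypothetical, and for brevity this version reads the values straight from shell variables rather than also falling back to the existing .env.

```shell
#!/usr/bin/env bash
set -euo pipefail

# emit_if_set KEY VALUE: print "KEY=VALUE" only when VALUE is non-empty, so an
# unset variable never lands in the file as an overriding empty assignment.
emit_if_set() {
  local key="$1" value="${2:-}"
  if [[ -n "$value" ]]; then
    printf '%s=%s\n' "$key" "$value"
  fi
}

# write_langfuse_env FILE: append only the LANGFUSE_* lines that have a value.
write_langfuse_env() {
  local file="$1" key
  for key in LANGFUSE_HOST LANGFUSE_PUBLIC_KEY LANGFUSE_SECRET_KEY; do
    emit_if_set "$key" "${!key:-}" >> "$file"
  done
}
```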

The pov-submit and bundle-submit waiters used "POV submission response: pov_id=" and "Bundle submission response: bundle_id=" which never match any rendered log line: the only "... submission response:" logs are logger.debug calls whose payload is an API object repr (no literal pov_id=/bundle_id=), while pov_id=/bundle_id= appear only in the separate structured summary line (logger.info) with a different prefix. Result: both milestones always timed out, so every run, including fully successful ones, wasted MILESTONE_TIMEOUT+BUNDLE_TIMEOUT and exited non-zero. Repoint both to the structured summary tokens (pov_id= / bundle_id=) and sync the marker list in .claude/commands/e2e.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ults

Three defects found while verifying the pipeline end-to-end:

1. Approval one-shot race: capture_line 'competition_patch_id=' ran once right after the patch-generated milestone, but the scheduler logs that id only minutes later (after it builds+verifies+submits the patch). The capture always lost the race, so approval was always skipped and the local stack never reached Patch passed / bundle. Replace with a wait_capture() poll loop (mirrors wait_for) so approval actually fires.
2. Default --task-duration 1800 is self-defeating: build -> POV -> seed-gen -> patch exceeds 30 min on normal hardware, so the task expires mid-patch ("task expired/cancelled? Will discard") and never reaches patch/bundle. Default to 7200 so the task outlives the pipeline.
3. Default --budget 3 cannot reach patch/bundle: a full run through patch generation costs ~$10; $3 is exhausted around POV. Default to 10.

e2e.md updated to match (defaults, the cheap --budget 3 caveat, and the poll-then-approve description).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
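
A poll loop in the spirit of wait_capture() might look like this. The function name comes from the commit message; its signature, the fetch_logs stand-in, and the LOG_FILE variable are assumptions made so the sketch runs without a compose stack.

```shell
#!/usr/bin/env bash
set -uo pipefail

# Stand-in for the compose log stream (assumption): read a plain file.
LOG_FILE="${LOG_FILE:-/dev/null}"
fetch_logs() { cat -- "$LOG_FILE"; }

# wait_capture PATTERN TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Poll until a line matching PATTERN shows up, print that line, return 0.
# Return 1 if the deadline passes first.
wait_capture() {
  local pattern="$1" timeout="$2" interval="${3:-5}" deadline line
  deadline=$(( SECONDS + timeout ))
  while (( SECONDS < deadline )); do
    # "|| true" keeps a no-match or an early grep exit from failing the capture.
    line="$(fetch_logs | grep -m1 -- "$pattern" || true)"
    if [[ -n "$line" ]]; then
      printf '%s\n' "$line"
      return 0
    fi
    sleep "$interval"
  done
  return 1
}
```

Unlike a one-shot capture, the loop tolerates the gap between the patch-generated milestone and the later line that carries competition_patch_id=, at the cost of re-reading the logs on every poll.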

hbrodin

LGTM. One inline note on TASK_ID parsing — fragility, not a current bug.

hbrodin · 2026-06-16T08:51:05Z

+if [[ -n "$PATCH_LINE" ]]; then
+    PATCH_ID=$(printf '%s' "$PATCH_LINE" | sed -n 's/.*competition_patch_id=\([^ ]*\).*/\1/p')
+    # Task id is inside the first [...] block, after the last ':'.
+    TASK_ID=$(printf '%s' "$PATCH_LINE" | sed -n 's/.*\[\([^]]*\)\].*/\1/p' | sed 's/^[^:]*://')


The inline comment on the previous line says "Task id is inside the first [...] block, after the last :", but the regex does the opposite of both: s/.*\[\([^]]*\)\].*/\1/p uses a greedy .* and so captures the last [...] block, and s/^[^:]*:// strips up to the first :.

It works today because log_entry() in orchestrator/src/buttercup/orchestrator/scheduler/submissions.py emits exactly one bracketed [{idx_msg}{task_id}] per line, and task_id contains no :. If anyone formats another bracketed token into the log message (e.g. appends bracketed context, or adds a field to MaxLengthFormatter's format string in common/logger.py), this silently captures the wrong block; the subsequent curl .../v1/task/${TASK_ID}/patch/${PATCH_ID}/approve would then return 404 → patch-approve: HTTP fail with no diagnostic pointing back to the parse.

Repro: printf '%s\n' '… [2:taskid] … extra [foo:bar]' | sed -n 's/.*\[\([^]]*\)\].*/\1/p' | sed 's/^[^:]*://' yields bar instead of taskid.

Either tighten the regex (e.g. match a single \[[^:]+:[^]]+\] rather than .*\[…\]) or update the comment to match what the code actually does.
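
One way to make the code do what the comment says, as a sketch rather than the change that was actually committed: anchor the first substitution with ^[^[]* so it cannot skip past an earlier bracket, and use a greedy .*: in the second so it strips through the last colon. The function names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# parse_task_id LINE: task id from the FIRST [...] block, after the LAST ':'.
parse_task_id() {
  printf '%s\n' "$1" \
    | sed -n 's/^[^[]*\[\([^]]*\)\].*/\1/p' \
    | sed 's/^.*://'
}

# parse_patch_id LINE: value of competition_patch_id= up to the next space
# (same expression as in the diff above).
parse_patch_id() {
  printf '%s\n' "$1" | sed -n 's/.*competition_patch_id=\([^ ]*\).*/\1/p'
}
```

With the repro input, the first bracketed block [2:taskid] is selected even though [foo:bar] follows it, and a block with no colon at all is returned unchanged.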

ret2libc requested a review from hbrodin as a code owner May 15, 2026 13:15

hbrodin reviewed May 19, 2026 (four reviews, each with a now-outdated comment thread on scripts/e2e.sh)

ret2libc and others added 9 commits May 19, 2026 08:25

ret2libc force-pushed the e2e-commands branch from 66563c7 to dc77e02 Compare May 19, 2026 08:57

hbrodin approved these changes Jun 16, 2026

ret2libc wants to merge 10 commits into main from e2e-commands