Skip to content

chore(audit): weekly cloud audit — capacity-content guard + visible scanner failures#110

Merged
tzone85 merged 1 commit into
mainfrom
fix/audit-cloud-2026-06-29
Jun 29, 2026
Merged

chore(audit): weekly cloud audit — capacity-content guard + visible scanner failures#110
tzone85 merged 1 commit into
mainfrom
fix/audit-cloud-2026-06-29

Conversation

@tzone85

@tzone85 tzone85 commented Jun 29, 2026

Copy link
Copy Markdown
Owner

Weekly cloud audit — 2026-06-29

Fan-out adversarial audit (concurrency, silent failures, event-sourcing wiring, security) plus a focused second round on the newest code (security agent, completion gate, conflict-resolution JSON merge, integration build). Round 1 was dry across all four priority categories; round 2 on the freshest code surfaced the two verified defects below. Each was adversarially confirmed against the source before fixing and is pinned by a regression test.

Findings fixed

1. internal/engine/conflict_resolver.go · resolveFile — senior conflict path could write a capacity notice into a source file
The tech-lead path (resolveFileTechLead) guards against a session-limit / overloaded notice that some CLI versions return as successful content rather than an error envelope (if llm.ContainsCapacitySignature(resp.Content) { return 429 }). The senior fast path — used for the common single-file conflict — had no such guard. A Claude Max session-limit notice returned as content has no conflict markers and matches no chatter marker, so it passed every check, was returned as a "clean" resolution, and was written verbatim into the conflicted file via os.WriteFile while the rebase reported success — corrupting the file under a transient outage instead of pausing. This is exactly the failure mode the documented "429 / session-limit clean pause" design exists to prevent.
Fix: mirror the tech-lead guard — surface a 429 *llm.APIError (capacity, not errUnmergeable) so the caller aborts the rebase and the pipeline pauses-and-resumes after the limit resets.

2. internal/security/scanners.go · RunScanners — a failed scanner masqueraded as a clean run, with no log
A scanner was appended to ran before s.Run, and on error the failure was continued with a comment claiming "log handled by caller." The callers (ScanRepo / ReviewStory) receive only (findings, ran, skipped) — no error — so nothing was ever logged, and a scanner that crashed / timed out / failed to parse was indistinguishable from one that ran clean and found nothing. For the per-story pre-merge gate this means a build whose scan partly failed was reported as scanned-clean with no audit trail of the lost coverage.
Fix: failed scanners are now logged at the failure site and returned in a new Report.Failed bucket (and excluded from ScannersRun); FormatMarkdown renders "Failed (ran but errored — coverage lost)". ScanRepo/ReviewStory log when coverage is lost. The documented graceful-degradation policy (a scanner failure never blocks the merge) is preserved — the change restores visibility, not blocking.

Audit areas that came back clean (no changes)

  • Concurrency — no lock held across I/O, no unsynchronized shared-map writes, no new unbounded contexts.
  • Silent failures / safety gates — completion gate, premerge QA, reviewer, conflict errUnmergeable classification all surface failure correctly.
  • Event-sourcing wiring — all 60 declared event types are handled in Project(); TestProject_AllDeclaredEventsHandled is sound.
  • Security — shell/path/SQL injection, dashboard auth, secret redaction, and prompt-injection boundaries all remain guarded.

Verification

  • go build ./... — clean
  • go vet ./... — clean
  • golangci-lint run ./... — 0 issues
  • go test ./internal/security/ ./internal/engine/ -count=1 — pass (including the two new tests + extended report_test)
  • go test ./... -count=1 — pass except three pre-existing environment-only failures unrelated to this diff, reproduced on unmodified main: cli/TestApproveStory_* (the gh CLI binary is not installed in the audit container), engine/TestWriteMetadata_InvalidDir and repolearn/TestCountFilesInDir_NonexistentDir (the container runs as root, so writes to /nonexistent/... succeed where the tests assume an unwritable path; the repolearn case is collateral pollution from the engine one). All three pass under a normal non-root CI user with init.defaultBranch=main.
  • make doc-coverage (TestDocCoverage_*) — pass. No new CLI command or config field; Report.Failed is an additive omitempty JSON field on existing vxd security scan --json output, not covered by the doc-coverage gate.
  • make vuln (govulncheck) — could not run in this environment: the audit container's network policy blocks vuln.go.dev:443 (gateway returns 403), so the vulnerability DB cannot be fetched. No dependency changes were made in this PR.

🤖 Generated with Claude Code


Generated by Claude Code

…canner failures

Two verified defects found in the newest (security agent + conflict
resolution) code and fixed with regression tests.

1. conflict_resolver.go · resolveFile — the senior conflict-resolution
   path lacked the capacity-signature guard that resolveFileTechLead
   already has. A Claude Max session-limit notice returned as SUCCESSFUL
   content (some CLI versions do this instead of an error envelope) passed
   the marker/chatter checks and was written verbatim into the conflicted
   source file while the rebase "succeeded" — corrupting the file under a
   transient outage. Now surfaced as a 429 capacity error so the caller
   aborts and the pipeline pauses-and-resumes after the limit resets.

2. security/scanners.go · RunScanners — a scanner that ran but errored
   (exec crash, timeout, parse failure) was appended to `ran` and its
   error swallowed with a comment claiming the caller logs it; the callers
   never receive the error, so a failed scan was indistinguishable from a
   clean "ran · 0 findings" with no audit trail. Failed scanners are now
   logged and reported in a new Report.Failed bucket (and excluded from
   ScannersRun), restoring visibility of lost coverage. Graceful
   degradation (never blocks the merge) is preserved.

Tests: TestRebaseWithResolution_SeniorCapacityContentDoesNotCorruptFile,
TestSecurityGate_ScanRepo_SurfacesFailedScanners, extended report_test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VzEm2jAYUXJqRi9YhWXdXK
@tzone85 tzone85 merged commit 72eacb7 into main Jun 29, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants