feat(verifier): add verifier evaluator shell and types#2157
miguelg719 wants to merge 15 commits into
🦋 Changeset detected
Latest commit: 3d8c324
The changes in this PR will be included in the next version bump. This PR includes changesets to release 4 packages.
No issues found across 11 files
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
sequenceDiagram
participant UserCode as "User/CLI"
participant V3Eval as V3Evaluator
participant LegacyEval as LegacyV3Evaluator
participant TrajUtil as "trajectory.ts"
participant FileSys as "File System"
participant BackendEnv as STAGEHAND_EVALUATOR_BACKEND
participant LLM as LLM
Note over UserCode,BackendEnv: Verifier Facade Initialization
UserCode->>V3Eval: new V3Evaluator(v3Instance, opts)
V3Eval->>BackendEnv: Read env var
alt backend=verifier
V3Eval->>V3Eval: Store backend=verifier
else backend=legacy
V3Eval->>V3Eval: Store backend=legacy
end
Note over UserCode,FileSys: NEW: verify(trajectory, taskSpec)
UserCode->>V3Eval: verify(trajectory, taskSpec)
V3Eval->>V3Eval: assertVerifierInput()
alt backend=verifier
V3Eval-->>UserCode: Throw "backend not available"
else backend=legacy
V3Eval->>V3Eval: collectLegacyScreenshots(trajectory)
V3Eval->>V3Eval: renderLegacyAgentReasoning(trajectory)
alt no screenshots AND no finalAnswer
V3Eval-->>UserCode: legacyInsufficientEvidenceResult()
else
V3Eval->>LegacyEval: ask({question, screenshot, answer, agentReasoning})
LegacyEval-->>V3Eval: legacy result (YES/NO/INVALID)
V3Eval->>V3Eval: legacyEvaluationToResult()
V3Eval-->>UserCode: EvaluationResult with rawSteps.backend="legacy"
end
end
Note over UserCode,V3Eval: NEW: generateRubric(taskSpec)
UserCode->>V3Eval: generateRubric(taskSpec)
alt backend=verifier
V3Eval-->>UserCode: Throw "backend not available"
else backend=legacy
V3Eval->>V3Eval: Create single criterion rubric
V3Eval-->>UserCode: Rubric { items: [legacyTaskCompletionCriterion] }
end
Note over UserCode,FileSys: NEW: On-disk Trajectory Loading
UserCode->>TrajUtil: loadTrajectoryFromDisk(dir)
TrajUtil->>FileSys: readFile(trajectory.json)
FileSys-->>TrajUtil: raw JSON
TrajUtil->>TrajUtil: Parse JSON
loop each step
TrajUtil->>FileSys: readFile(screenshotPath) for probe
alt screenshot file exists
FileSys-->>TrajUtil: Buffer
TrajUtil->>TrajUtil: Set probeEvidence.screenshot
else file missing
TrajUtil->>TrajUtil: Leave screenshot unset
end
alt image modality with bytesBase64
TrajUtil->>TrajUtil: Decode base64 → Buffer
end
end
TrajUtil-->>UserCode: Hydrated Trajectory
Note over UserCode,FileSys: NEW: Path Security Check
TrajUtil->>TrajUtil: resolveWithinTrajectoryDir(candidate)
alt path escapes trajectory directory
TrajUtil-->>UserCode: Throw error
else safe
TrajUtil->>FileSys: readFile(resolved)
end
Note over UserCode,FileSys: Runtime: Legacy Evaluator with Answer
LegacyEval->>LegacyEval: _evaluateWithMultipleScreenshots()
rect over LegacyEval
Note over LegacyEval: CHANGED: included answer in prompt
end
LegacyEval->>LLM: prompt(text + image contents + "the answer is {answer}")
Note over LLM: NEW: answer appended to user message
LLM-->>LegacyEval: YES/NO + reasoning
LegacyEval-->>UserCode: LegacyEvaluationResult
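The "Path Security Check" step in the diagram could be sketched roughly as follows. This is a minimal illustration, not the PR's actual `resolveWithinTrajectoryDir`: the function name `resolveWithinDirSketch`, its signature, and the error message are assumptions, and it only checks lexical containment (it does not resolve symlinks).

```typescript
import * as path from "node:path";

// Hypothetical stand-in for the PR's resolveWithinTrajectoryDir; the real
// helper may differ in name, signature, and error text.
export function resolveWithinDirSketch(baseDir: string, candidate: string): string {
  const base = path.resolve(baseDir);
  // Relative candidates resolve against the trajectory directory;
  // absolute candidates replace it and are caught by the check below.
  const resolved = path.resolve(base, candidate);
  const rel = path.relative(base, resolved);
  const escapes =
    rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  if (escapes) {
    throw new Error(`Path escapes trajectory directory: ${candidate}`);
  }
  return resolved;
}
```

Comparing via `path.relative` rather than a plain string prefix avoids treating a sibling such as `/tmp/traj-other` as being inside `/tmp/traj`.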
const trajectoryPath = path.join(trajectoryDir, "trajectory.json");
const raw = await fs.readFile(trajectoryPath, "utf8");
const parsed = JSON.parse(raw) as Trajectory & {
This could be made more type-safe at runtime by using zod at the parsing boundary, e.g. TrajectorySchema.safeParse(JSON.parse(raw)).
You could then use z.infer to still get a Trajectory type (the array of trajectory steps could be part of the zod schema too).
Might be a nit, though!
Going to add an extra PR at the end to parse as much as possible. This one is tricky because downstream we use some precomputed rubrics, from webtailbench specifically, that don't match our schema (snake case).
 * return an EvaluationResult — they MUST NOT touch a live browser.
 */
export interface Verifier {
  verify(trajectory: Trajectory, taskSpec: TaskSpec): Promise<EvaluationResult>;
Since the trajectory already contains a task spec, can we remove the additional taskSpec that gets passed in here? That would make it simpler, but maybe I'm missing something.
 * Snake-case dataset fields are accepted here so serialized quirks do not leak
 * into the canonical rubric type.
 */
export function normalizeRubric(rubric: unknown): Rubric | undefined {
Also here: a rubric could just be a zod object plus a z.infer type, and this logic could then be built into the zod object (via superRefine or otherwise). It might be a bit simpler, but this is an optional nit!
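To make the snake-case handling under discussion concrete, here is a minimal hand-written sketch of normalizing at the boundary. It is not the PR's `normalizeRubric`: the item fields `criterion`, `maxPoints`, and `max_points` are hypothetical, and dropping unknown keys such as `earned_points` is an assumption about the intended behavior. Only the top-level `items` array is taken from the diagram above.

```typescript
// Hypothetical canonical shapes; the PR's real Rubric type may differ.
interface RubricItemSketch {
  criterion: string;
  maxPoints: number;
}
interface RubricSketch {
  items: RubricItemSketch[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Accepts camelCase or snake_case input and returns the canonical shape,
// or undefined when the input cannot be interpreted as a rubric.
function normalizeRubricSketch(input: unknown): RubricSketch | undefined {
  if (!isRecord(input)) return undefined;
  const rawItems = input["items"];
  if (!Array.isArray(rawItems)) return undefined;

  const items: RubricItemSketch[] = [];
  for (const raw of rawItems) {
    if (!isRecord(raw)) return undefined;
    const criterion = raw["criterion"];
    // Prefer the canonical key, fall back to the snake_case dataset key.
    const maxPoints = raw["maxPoints"] ?? raw["max_points"];
    if (typeof criterion !== "string" || typeof maxPoints !== "number") {
      return undefined;
    }
    // Only known fields are copied, so extra keys (e.g. earned_points) are dropped.
    items.push({ criterion, maxPoints });
  }
  return { items };
}
```

A schema library would express the same mapping declaratively; the hand-written version is shown only to spell out what the normalization has to do.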
Replacement for #2130, which was merged into the PR1 branch instead of main. This branch is rebased onto current main and contains the PR2 verifier evaluator shell/type changes.
Summary by cubic
Adds a verifier evaluator shell with public trajectory/rubric/result types and utilities, plus a V3Evaluator.verify(trajectory) facade that uses the legacy backend by default without breaking existing flows.
New Features
- v3/verifier: Trajectory, Rubric, EvaluationResult, Verifier, and more.
- normalizeRubric, loadTrajectoryFromDisk (rehydrates screenshots and image modalities), nextResultFilename.
- V3Evaluator supports verify() and generateRubric(). Backend selectable via STAGEHAND_EVALUATOR_BACKEND (legacy default; verifier is stubbed for future). Legacy path maps trajectory screenshots/final answer/reasoning to the old evaluator and returns an EvaluationResult.
- Stagehand.loadTrajectoryFromDisk, Stagehand.nextResultFilename, Stagehand.normalizeRubric.
Bug Fixes
- earned_points noise.
- screenshotPath to stay within the trajectory directory and decodes on-disk bytesBase64 to Buffer.
- Trajectory to prevent mismatched task specs and simplify verify().
Written for commit 3d8c324. Summary will update on new commits.
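The backend selection described in the summary (legacy by default, verifier stubbed) might look something like the sketch below. The function names, the trimming and lowercasing of the value, and the choice to throw on unrecognized values are all assumptions; only the variable name STAGEHAND_EVALUATOR_BACKEND, the two backend names, and the "backend not available" behavior come from this PR.

```typescript
type EvaluatorBackendSketch = "legacy" | "verifier";

// Hypothetical resolver: takes the environment as a parameter so it can be
// exercised without mutating process.env.
function resolveBackendSketch(
  env: Record<string, string | undefined>,
): EvaluatorBackendSketch {
  const raw = env["STAGEHAND_EVALUATOR_BACKEND"]?.trim().toLowerCase();
  // Unset or empty falls back to the legacy backend (the documented default).
  if (raw === undefined || raw === "" || raw === "legacy") return "legacy";
  if (raw === "verifier") return "verifier";
  // Assumption: unknown values are rejected rather than silently ignored.
  throw new Error(`Unknown evaluator backend: ${raw}`);
}

// Hypothetical guard mirroring the stubbed branch in the diagram.
function assertBackendAvailableSketch(backend: EvaluatorBackendSketch): void {
  if (backend === "verifier") {
    throw new Error("verifier backend not available");
  }
}
```

In real code the resolver would presumably be called with `process.env`.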