diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 025085e..55238c9 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -7,6 +7,19 @@ on: branches: [main] jobs: + lint-benchmark: + runs-on: ubuntu-latest + defaults: + run: + working-directory: benchmarks/snapshot-efficiency + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-python@v5 + with: + python-version: "3.11" + - run: pip install -r requirements-dev.txt + - run: make check + build-and-test: runs-on: ubuntu-latest steps: diff --git a/README.md b/README.md index ad29c73..886ec94 100644 --- a/README.md +++ b/README.md @@ -328,6 +328,10 @@ export OPERA_CLI_MCP_BIN=opera-devtools-mcp export OPERA_CLI_HEADED=1 ``` +## Benchmarks + +See [`benchmarks/snapshot-efficiency/`](benchmarks/snapshot-efficiency/README.md) — measures token cost and task-completion quality of compact snapshot output vs raw MCP and `chrome-devtools-axi`. + ## Development ```sh diff --git a/benchmarks/snapshot-efficiency/.flake8 b/benchmarks/snapshot-efficiency/.flake8 new file mode 100644 index 0000000..65cb60e --- /dev/null +++ b/benchmarks/snapshot-efficiency/.flake8 @@ -0,0 +1,5 @@ +[flake8] +max-line-length = 120 +# E203: whitespace before ':' — conflicts with black's slice formatting +# W503: line break before binary operator — conflicts with black +extend-ignore = E203, W503 diff --git a/benchmarks/snapshot-efficiency/CLAUDE.md b/benchmarks/snapshot-efficiency/CLAUDE.md new file mode 100644 index 0000000..83deec9 --- /dev/null +++ b/benchmarks/snapshot-efficiency/CLAUDE.md @@ -0,0 +1,65 @@ +# snapshot-efficiency benchmark — Claude guidance + +## File roles + +| File | Role | +|---|---| +| `src/run_benchmark.py` | Entry point. Loads all three config files, resolves CLI overrides, runs the outer condition × task × repeat loop, writes artifacts and JSONL. | +| `src/agent.py` | Browser agent loop. 
`run_agent()` drives the LLM turn loop; `AgentState` owns all mutable state accumulation; `AgentResult` is the immutable output. | +| `src/judge.py` | LLM-as-judge grading. `grade()` takes a trajectory and returns `{"pass": bool, "reason": str}`. | +| `src/tools.py` | `ToolSet` base class + `CLIToolSet` (subprocess) and `BridgeToolSet` (HTTP) subclasses. `make_tool_set(condition)` is the factory. | +| `src/llm.py` | Thin OpenAI Responses API wrapper. `Client.call()` returns a `Turn` dataclass. | +| `src/report.py` | Reads `results/*.jsonl`, prints and writes `results/report.md`. No external deps beyond stdlib + the results files. | +| `src/utils.py` | `snapshot_chars(text)` — counts characters in a snapshot result, returns 0 for empty/None. | +| `config/conditions.yaml` | Benchmark conditions: tool mode (`cli` or `bridge`), CLI binary path, bridge URL. | +| `config/tasks.yaml` | Task prompts and grading hints. | +| `config/models.yaml` | Agent and judge model names and reasoning effort. **The only place to change model defaults.** | + +## Data flow + +``` +run_benchmark.py + └── run_once() + ├── make_tool_set(condition) → ToolSet (CLIToolSet or BridgeToolSet) + ├── run_agent(prompt, tool_set, model, reasoning_effort) + │ └── loop: + │ client.call() → Turn + │ tool_set.dispatch() → result str (side effect: browser action) + │ state.update(turn, turn_index, tool_results) + │ └── state.to_result() → AgentResult + └── grade(prompt, trajectory, model, reasoning_effort, grading_hint) + └── Client.call() → {"pass": bool, "reason": str} +``` + +## Running checks + +```sh +# Install dev dependencies (once) +pip install -r requirements-dev.txt + +make format # apply black + isort (modifies files) +make lint # ruff + flake8 (read-only) +make typecheck # mypy (read-only) +make check # format-check + lint + typecheck — no modifications, matches CI +``` + +Config: `pyproject.toml` for black/isort/ruff/mypy; `.flake8` for flake8 (120-char line length throughout). 
+ +## Key design decisions + +### No hardcoded model defaults +`run_agent()` and `grade()` require `model` and `reasoning_effort` as positional parameters — there are no defaults in the function signatures. All defaults live in `config/models.yaml`. CLI flags `--model`, `--reasoning-effort`, `--judge-model`, `--judge-reasoning-effort` override them for a single run. + +### AgentState owns all state mutations +`AgentState.update(turn, turn_index, tool_results=None)` is the single place that mutates benchmark state: +- Always: accumulates `input_tokens` and `output_tokens` from the turn +- `tool_results=None` (final turn): sets `answer`, appends to `trajectory` +- `tool_results` provided (tool-call turn): increments `tool_call_count`, appends to `snapshot_chars` for snapshot tools, appends to `trajectory` + +`run_agent()` only handles control flow and I/O (LLM calls, tool dispatch, `inputs` buffer). + +### SNAPSHOT_TOOLS +`SNAPSHOT_TOOLS: frozenset[str]` in `agent.py` defines which tool names produce page snapshots worth measuring. Add a tool name here if it returns a snapshot. + +### ToolSet dispatch +Both `CLIToolSet` and `BridgeToolSet` use `match/case` in `dispatch()`. The shared tool schema lives in `_CLI_SCHEMA` (module-level constant in `tools.py`), evaluated once at import time. 
diff --git a/benchmarks/snapshot-efficiency/Makefile b/benchmarks/snapshot-efficiency/Makefile new file mode 100644 index 0000000..f78ced6 --- /dev/null +++ b/benchmarks/snapshot-efficiency/Makefile @@ -0,0 +1,23 @@ +SRC = src + +.PHONY: format format-check check lint typecheck + +# Apply formatting (local dev) +format: + black $(SRC)/ + isort $(SRC)/ + +# Check formatting without modifying (CI) +format-check: + black --check $(SRC)/ + isort --check-only $(SRC)/ + +lint: + ruff check $(SRC)/ + flake8 $(SRC)/ + +typecheck: + mypy $(SRC)/ + +# Full validation suite — no file modifications (used in CI) +check: format-check lint typecheck diff --git a/benchmarks/snapshot-efficiency/README.md b/benchmarks/snapshot-efficiency/README.md new file mode 100644 index 0000000..4daf87c --- /dev/null +++ b/benchmarks/snapshot-efficiency/README.md @@ -0,0 +1,185 @@ +# Snapshot Efficiency Benchmark + +Measures the token cost and task-completion quality of `opera-browser-cli`'s compact snapshot output against raw MCP output and alternative browser CLI tools. + +## What it measures + +Every browser agent task requires sending the current page as context to the LLM. This benchmark answers: + +- **Token savings** — how much does compact snapshot output reduce input token usage vs raw MCP output? +- **Quality** — does compression affect task-completion rate? +- **vs AXI** — how does `opera-browser-cli` compare to `chrome-devtools-axi`, an established browser CLI tool? 
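As a rough illustration of the token-savings question, the percentage can be computed as a relative reduction against the raw baseline, which is the same formula `src/report.py` applies to average total tokens for `opera-compact` vs `mcp-raw`. The helper name `percent_saved` and the token counts below are made up for illustration; they are not benchmark results.

```python
def percent_saved(baseline_tokens: float, compact_tokens: float) -> float:
    """Relative reduction of compact vs baseline, as a percentage."""
    if baseline_tokens <= 0:
        # Mirrors the report's guard: no baseline, no meaningful percentage.
        return 0.0
    return (baseline_tokens - compact_tokens) / baseline_tokens * 100


# Illustrative numbers only, not measured results.
print(f"{percent_saved(40_000, 10_000):.0f}% saved")
```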
+ +### Conditions + +| ID | Description | +|-----------------|-----------------------------------------------------------------------------------------| +| `opera-compact` | `opera-browser-cli` default — compact snapshots with URL compression (our tool) | +| `opera-raw` | `opera-browser-cli --raw` — uncompressed MCP output piped through our CLI | +| `mcp-raw` | Raw `take_snapshot` via bridge HTTP API — no compression at all (chrome-mcp equivalent) | +| `axi` | `chrome-devtools-axi` CLI — external comparison baseline | + +### Tasks + +7 browser tasks adapted from the [axi bench-browser benchmark](https://github.com/kunchenguid/axi/tree/main/bench-browser), covering single-step reads, multi-step navigation, and complex multi-page extraction: + +| ID | Category | Target | +|------------------------------|---------------|------------------------------------------| +| `read_static_page` | single-step | example.com | +| `wikipedia_fact_lookup` | single-step | Wikipedia — Moon infobox | +| `github_repo_stars` | single-step | github.com/torvalds/linux | +| `wikipedia_table_read` | single-step | Wikipedia — population table | +| `wikipedia_link_follow` | multi-step | Wikipedia Ada Lovelace → Charles Babbage | +| `wikipedia_deep_extraction` | investigation | Wikipedia Nobel Physics laureates | +| `github_issue_investigation` | investigation | github.com/facebook/react/issues | + +### Model + +Model defaults are set in [`config/models.yaml`](config/models.yaml): + +```yaml +agent: + model: gpt-5.5 + reasoning_effort: medium + +judge: + model: gpt-5.5 + reasoning_effort: low +``` + +Both use the OpenAI Responses API (`/v1/responses`). The judge runs at lower effort since pass/fail grading is simpler than browser navigation. To use a different model for a run, pass CLI flags (see [CLI reference](#cli-reference)) — these override the config file without changing it. + +## Setup + +Requirements: Python 3.11+, `opera-browser-cli` in PATH, Opera/Chrome browser open. 
+ +```sh +cd benchmarks/snapshot-efficiency +python -m venv .venv +source .venv/bin/activate # Windows: .venv\Scripts\activate +pip install -r requirements.txt +``` + +For the `axi` condition, also install: + +```sh +npm install -g chrome-devtools-axi +``` + +## Running + +All commands run from `benchmarks/snapshot-efficiency/` with the venv active. + +### Sanity check (1 run, 1 task) + +```sh +OPENAI_API_KEY= python src/run_benchmark.py \ + --conditions opera-compact \ + --tasks read_static_page \ + --repeats 1 +``` + +### Single condition + +```sh +OPENAI_API_KEY= python src/run_benchmark.py --conditions opera-compact --repeats 5 +``` + +### All conditions (skipping axi if not installed) + +```sh +OPENAI_API_KEY= python src/run_benchmark.py \ + --conditions opera-compact,opera-raw,mcp-raw \ + --repeats 5 +``` + +### Full matrix (requires chrome-devtools-axi) + +```sh +OPENAI_API_KEY= python src/run_benchmark.py --repeats 5 +``` + +### Generate report + +```sh +python src/report.py +# → results/report.md +``` + +## Linting & formatting + +Install dev tools (separate from benchmark runtime deps): + +```sh +pip install -r requirements-dev.txt +``` + +| Command | What it does | +|---|---| +| `make format` | Apply black + isort (local dev) | +| `make lint` | ruff + flake8 | +| `make typecheck` | mypy | +| `make check` | All of the above, read-only — same as CI | + +Config lives in `pyproject.toml` (black, isort, ruff, mypy) and `.flake8`. +All tools are configured for 120-char line length. 
+ +## Source layout + +``` +src/ +├── run_benchmark.py # entry point — CLI arg parsing, outer loop, artifact writing +├── agent.py # browser agent loop (AgentState, AgentResult, run_agent) +├── judge.py # LLM-as-judge pass/fail grading (grade) +├── tools.py # ToolSet subclasses (CLIToolSet, BridgeToolSet) + factory +├── llm.py # thin OpenAI Responses API wrapper (Client, Turn) +├── report.py # reads results/*.jsonl and writes results/report.md +└── utils.py # shared utilities (snapshot_chars) + +config/ +├── conditions.yaml # benchmark conditions (tool mode, CLI binary, bridge URL) +├── tasks.yaml # task prompts and grading hints +└── models.yaml # agent and judge model + reasoning_effort defaults +``` + +## CLI reference + +``` +python src/run_benchmark.py [options] + + --conditions Comma-separated condition IDs (default: all four) + --tasks Comma-separated task IDs (default: all seven) + --repeats Runs per condition × task (default: 5) + --model Agent model — overrides config/models.yaml + --reasoning-effort Agent reasoning effort: low / medium / high — overrides config/models.yaml + --judge-model Judge model — overrides config/models.yaml + --judge-reasoning-effort Judge reasoning effort: low / medium / high — overrides config/models.yaml +``` + +To permanently change the defaults, edit [`config/models.yaml`](config/models.yaml). 
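The override rule can be sketched as below: each flag defaults to `None`, and a missing flag falls back to the value from `config/models.yaml`. The agent-model line matches what `run_benchmark.py` does; applying the same `or` fallback to the other three settings is an assumption, and `MODELS_CFG` and `resolve_models` are hypothetical names standing in for the loaded YAML and the resolution step.

```python
import argparse

# Stand-in for the parsed contents of config/models.yaml.
MODELS_CFG = {
    "agent": {"model": "gpt-5.5", "reasoning_effort": "medium"},
    "judge": {"model": "gpt-5.5", "reasoning_effort": "low"},
}


def resolve_models(argv: list[str], cfg: dict) -> dict:
    """Return effective settings: CLI flag if given, else the config value."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default=None)
    parser.add_argument("--reasoning-effort", default=None, dest="reasoning_effort")
    parser.add_argument("--judge-model", default=None, dest="judge_model")
    parser.add_argument("--judge-reasoning-effort", default=None, dest="judge_reasoning_effort")
    args = parser.parse_args(argv)
    return {
        "agent_model": args.model or cfg["agent"]["model"],
        # Assumed to follow the same pattern as the agent model.
        "agent_effort": args.reasoning_effort or cfg["agent"]["reasoning_effort"],
        "judge_model": args.judge_model or cfg["judge"]["model"],
        "judge_effort": args.judge_reasoning_effort or cfg["judge"]["reasoning_effort"],
    }


# Overriding one flag leaves the other three at their config defaults.
print(resolve_models(["--judge-reasoning-effort", "high"], MODELS_CFG))
```

Because the fallback uses `or`, an empty-string flag value would also fall back to the config default.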
+ +## Results layout + +``` +results/ +├── opera-compact.jsonl # one record per run +├── opera-raw.jsonl +├── mcp-raw.jsonl +├── axi.jsonl +├── report.md # generated by report.py +└── {condition}/{task}/run{N}/ + ├── agent_output.json # full trajectory + per-turn token usage + ├── grade.json # pass/fail verdict + reason + └── result.json # merged record (same shape as the .jsonl row) +``` + +## Attribution + +This benchmark is based on the [axi browser benchmark](https://github.com/kunchenguid/axi/tree/main/bench-browser) by [@kunchenguid](https://github.com/kunchenguid): + +- **Task definitions** (`config/tasks.yaml`) — adapted directly from [`bench-browser/config/tasks.yaml`](https://github.com/kunchenguid/axi/blob/main/bench-browser/config/tasks.yaml) +- **LLM-as-judge grading approach** — adapted from [`bench-browser/src/grader.ts`](https://github.com/kunchenguid/axi/blob/main/bench-browser/src/grader.ts) +- **Benchmark methodology** (per-condition JSONL results, trajectory capture, usage metrics) — adapted from [`bench-browser/src/runner.ts`](https://github.com/kunchenguid/axi/blob/main/bench-browser/src/runner.ts) +- **`axi` condition** — uses [`chrome-devtools-axi`](https://github.com/kunchenguid/axi), the browser CLI tool the axi project benchmarks + +The original benchmark uses TypeScript + Claude Sonnet. This port uses Python + OpenAI GPT-5.5 with the Responses API. 
diff --git a/benchmarks/snapshot-efficiency/config/conditions.yaml b/benchmarks/snapshot-efficiency/config/conditions.yaml new file mode 100644 index 0000000..b7f21d7 --- /dev/null +++ b/benchmarks/snapshot-efficiency/config/conditions.yaml @@ -0,0 +1,25 @@ +conditions: + - id: opera-compact + description: opera-browser-cli default (compact snapshots, URL compression) + tool_mode: cli + cli_bin: opera-browser-cli + raw: false + + - id: opera-raw + description: opera-browser-cli with --raw flag (uncompressed MCP output) + tool_mode: cli + cli_bin: opera-browser-cli + raw: true + + - id: mcp-raw + description: Raw take_snapshot via bridge HTTP API, no compression layer + tool_mode: bridge + bridge_url: "http://localhost:9224" + + - id: axi + description: chrome-devtools-axi CLI (external comparison baseline) + tool_mode: cli + cli_bin: chrome-devtools-axi + raw: false + start: "chrome-devtools-axi start" + stop: "chrome-devtools-axi stop" diff --git a/benchmarks/snapshot-efficiency/config/models.yaml b/benchmarks/snapshot-efficiency/config/models.yaml new file mode 100644 index 0000000..271829d --- /dev/null +++ b/benchmarks/snapshot-efficiency/config/models.yaml @@ -0,0 +1,7 @@ +agent: + model: gpt-5.5 + reasoning_effort: medium + +judge: + model: gpt-5.5 + reasoning_effort: low diff --git a/benchmarks/snapshot-efficiency/config/tasks.yaml b/benchmarks/snapshot-efficiency/config/tasks.yaml new file mode 100644 index 0000000..bb3ca67 --- /dev/null +++ b/benchmarks/snapshot-efficiency/config/tasks.yaml @@ -0,0 +1,56 @@ +tasks: + read_static_page: + category: single_step + prompt: > + Navigate to https://example.com and report the main heading of the page. + grading: + grading_hint: "The main heading on example.com is 'Example Domain'." + + wikipedia_fact_lookup: + category: single_step + prompt: > + Navigate to the Wikipedia article for the Moon + (https://en.wikipedia.org/wiki/Moon) and report the Moon's average + orbital speed from the infobox. 
+ grading: + grading_hint: "The Moon's average orbital speed is 1.022 km/s (approximately 2,286 mph)." + + github_repo_stars: + category: single_step + prompt: > + Navigate to https://github.com/torvalds/linux and report the + approximate star count and the primary programming language. + grading: + grading_hint: "torvalds/linux has 190k+ stars and the primary language is C." + + wikipedia_table_read: + category: single_step + prompt: > + Navigate to the Wikipedia article 'List of countries and dependencies + by population' and report the top 3 countries by population. + grading: + grading_hint: "The top 3 countries by population are India, China, and the United States." + + wikipedia_link_follow: + category: multi_step + prompt: > + Navigate to the Wikipedia article for Ada Lovelace, click the link + to Charles Babbage, and report his birth date. + grading: + grading_hint: "Charles Babbage was born on 26 December 1791." + + wikipedia_deep_extraction: + category: investigation + prompt: > + Navigate to 'List of Nobel laureates in Physics' on Wikipedia and + report the winners for the 3 most recent years listed. + grading: + grading_hint: "2024: John Hopfield and Geoffrey Hinton; 2023: Pierre Agostini, Ferenc Krausz, Anne L'Huillier; 2022: Alain Aspect, John Clauser, Anton Zeilinger." + + github_issue_investigation: + category: investigation + prompt: > + Navigate to https://github.com/facebook/react/issues and report + the titles of the 5 most recent issues and the total open issue count. + grading: + grading_hint: "The agent must report 5 specific issue titles. The open issue count should be in the hundreds." 
diff --git a/benchmarks/snapshot-efficiency/pyproject.toml b/benchmarks/snapshot-efficiency/pyproject.toml new file mode 100644 index 0000000..ab7de2d --- /dev/null +++ b/benchmarks/snapshot-efficiency/pyproject.toml @@ -0,0 +1,17 @@ +[tool.black] +line-length = 120 + +[tool.isort] +profile = "black" +line_length = 120 + +[tool.ruff] +line-length = 120 + +[tool.ruff.lint] +select = ["E", "F"] + +[tool.mypy] +python_version = "3.11" +ignore_missing_imports = true +explicit_package_bases = true diff --git a/benchmarks/snapshot-efficiency/requirements-dev.txt b/benchmarks/snapshot-efficiency/requirements-dev.txt new file mode 100644 index 0000000..3a1089f --- /dev/null +++ b/benchmarks/snapshot-efficiency/requirements-dev.txt @@ -0,0 +1,5 @@ +black>=24.0 +flake8>=7.0 +isort>=5.13 +mypy>=1.10 +ruff>=0.4 diff --git a/benchmarks/snapshot-efficiency/requirements.txt b/benchmarks/snapshot-efficiency/requirements.txt new file mode 100644 index 0000000..32ebcb2 --- /dev/null +++ b/benchmarks/snapshot-efficiency/requirements.txt @@ -0,0 +1,3 @@ +openai>=1.30 +pyyaml>=6.0 +requests>=2.31 diff --git a/benchmarks/snapshot-efficiency/src/agent.py b/benchmarks/snapshot-efficiency/src/agent.py new file mode 100644 index 0000000..ddf5819 --- /dev/null +++ b/benchmarks/snapshot-efficiency/src/agent.py @@ -0,0 +1,128 @@ +import json +import time +from dataclasses import dataclass, field + +from llm import Client, Turn +from tools import ToolSet + +SYSTEM_PROMPT = """You are a browser automation agent. Use the provided tools to navigate the web and answer questions. + +Guidelines: +- Use `navigate` to open URLs +- Use `snapshot` to re-read the current page if needed +- Use `click` on element refs (e.g. 
@1.5) shown in snapshots to follow links +- Use `go_back` to return to the previous page +- When you have enough information, reply with your final answer directly (no tool call) +- Be concise and factual — only report what you observed in the page +""" + +MAX_TURNS = 20 +SNAPSHOT_TOOLS: frozenset[str] = frozenset({"navigate", "snapshot", "click", "go_back"}) + + +@dataclass +class AgentResult: + answer: str + input_tokens: int + output_tokens: int + trajectory: list[dict] + snapshot_chars: list[int] + tool_call_count: int + wall_clock_seconds: float + error: str | None = None + + @property + def total_tokens(self) -> int: + return self.input_tokens + self.output_tokens + + +@dataclass +class AgentState: + input_tokens: int = 0 + output_tokens: int = 0 + trajectory: list[dict] = field(default_factory=list) + snapshot_chars: list[int] = field(default_factory=list) + tool_call_count: int = 0 + start: float = field(default_factory=time.monotonic) + error: str | None = None + answer: str = "" + + def update(self, turn: Turn, turn_index: int, tool_results: dict | None = None) -> None: + self.input_tokens += turn.input_tokens + self.output_tokens += turn.output_tokens + + if tool_results is None: + self.answer = turn.text + self.trajectory.append({"turn": turn_index, "tool_calls": [], "text": turn.text}) + return + + self.tool_call_count += len(turn.tool_calls) + for tc in turn.tool_calls: + if tc.name in SNAPSHOT_TOOLS: + self.snapshot_chars.append(len(tool_results[tc.call_id])) + for tc in turn.tool_calls: + self.trajectory.append( + { + "turn": turn_index, + "tool_calls": [{"name": tc.name, "args": tc.arguments}], + "tool_result": tool_results.get(tc.call_id, ""), + "text": turn.text, + } + ) + + def to_result(self) -> AgentResult: + return AgentResult( + answer=self.answer, + input_tokens=self.input_tokens, + output_tokens=self.output_tokens, + trajectory=self.trajectory, + snapshot_chars=self.snapshot_chars, + tool_call_count=self.tool_call_count, + 
wall_clock_seconds=round(time.monotonic() - self.start, 1), + error=self.error, + ) + + +def run_agent( + task_prompt: str, + tool_set: ToolSet, + model: str, + reasoning_effort: str, +) -> AgentResult: + client = Client(model, reasoning_effort) + inputs: list = [ + {"role": "system", "content": SYSTEM_PROMPT}, + {"role": "user", "content": task_prompt}, + ] + + state = AgentState() + + try: + for _turn in range(MAX_TURNS): + turn = client.call(inputs, tools=tool_set.definitions) + inputs.extend(turn.output_items) + + if not turn.tool_calls: + state.update(turn, _turn) + break + + tool_results = {} + for tc in turn.tool_calls: + args = json.loads(tc.arguments) + tool_results[tc.call_id] = tool_set.dispatch(tc.name, args) + inputs.append( + { + "type": "function_call_output", + "call_id": tc.call_id, + "output": tool_results[tc.call_id], + } + ) + + state.update(turn, _turn, tool_results) + else: + state.error = f"Reached max turns ({MAX_TURNS}) without final answer" + + except Exception as e: + state.error = str(e) + + return state.to_result() diff --git a/benchmarks/snapshot-efficiency/src/judge.py b/benchmarks/snapshot-efficiency/src/judge.py new file mode 100644 index 0000000..6f4d7a1 --- /dev/null +++ b/benchmarks/snapshot-efficiency/src/judge.py @@ -0,0 +1,75 @@ +import json + +from llm import Client + +SYSTEM_PROMPT = "You are a benchmark grader evaluating whether an AI agent completed a browser automation task." 
+ +RULES = """ +- PASS if the agent navigated to the correct pages AND produced a correct, complete answer +- FAIL if the agent hallucinated data without actually browsing to the page +- FAIL if the agent browsed but misinterpreted the page content +- FAIL if the agent gave a partial answer when a complete one was requested +- For error recovery tasks, PASS if the agent correctly identified the error and then recovered +- For multi-step tasks, PASS only if all steps were completed + +Respond with exactly: {"pass": true, "reason": "..."} or {"pass": false, "reason": "..."} +""" + +TOOL_OUTPUT_CAP = 30_000 + + +def _format_trajectory(trajectory: list[dict]) -> str: + lines = [] + for turn in trajectory: + for tc in turn.get("tool_calls", []): + args = tc.get("args", "") + if isinstance(args, str): + try: + args = json.loads(args) + except json.JSONDecodeError: + pass + lines.append(f"[tool] {tc['name']}({json.dumps(args)})") + result = turn.get("tool_result", "") + if result: + if len(result) > TOOL_OUTPUT_CAP: + result = result[:TOOL_OUTPUT_CAP] + f"\n... 
(truncated, {len(result)} chars total)" + lines.append(f"[result] {result}") + if turn.get("text"): + lines.append(f"[agent] {turn['text']}") + return "\n".join(lines) + + +def _build_prompt(task_prompt: str, trajectory: list[dict], grading_hint: str | None) -> str: + parts = [f"TASK:\n{task_prompt.strip()}"] + if trajectory: + parts.append(f"AGENT TRAJECTORY:\n{_format_trajectory(trajectory)}") + if grading_hint: + parts.append(f"KNOWN FACTS:\n{grading_hint}") + parts.append(f"GRADING RULES:{RULES}") + return "\n\n".join(parts) + + +def grade( + task_prompt: str, + trajectory: list[dict], + model: str, + reasoning_effort: str, + grading_hint: str | None = None, +) -> dict: + prompt = _build_prompt(task_prompt, trajectory, grading_hint) + client = Client(model, reasoning_effort=reasoning_effort) + try: + turn = client.call( + [ + {"role": "system", "content": SYSTEM_PROMPT}, + {"role": "user", "content": prompt}, + ] + ) + raw = turn.text.strip() + if raw.startswith("```"): + raw = raw.split("```")[1].removeprefix("json") + return json.loads(raw) + except json.JSONDecodeError as e: + return {"pass": False, "reason": f"judge parse error: {e}"} + except Exception as e: + return {"pass": False, "reason": f"judge error: {e}"} diff --git a/benchmarks/snapshot-efficiency/src/llm.py b/benchmarks/snapshot-efficiency/src/llm.py new file mode 100644 index 0000000..2d86559 --- /dev/null +++ b/benchmarks/snapshot-efficiency/src/llm.py @@ -0,0 +1,52 @@ +from dataclasses import dataclass + +import openai + + +@dataclass +class Turn: + text: str + tool_calls: list # raw function_call items from response.output + output_items: list # model_dump'd, ready to extend next input + input_tokens: int + output_tokens: int + + +def _to_input_item(item) -> dict: + # status is an output-only field; the API rejects it when fed back as input + d = item.model_dump() + d.pop("status", None) + return d + + +class Client: + def __init__(self, model: str, reasoning_effort: str = "medium"): + 
self._api = openai.OpenAI() + self._model = model + self._reasoning_effort = reasoning_effort + + def call(self, input_items: list, tools: list | None = None) -> Turn: + response = self._api.responses.create( # type: ignore[call-overload] + model=self._model, + reasoning={"effort": self._reasoning_effort}, + input=input_items, + tools=tools or [], + ) + + text_parts: list[str] = [] + tool_calls: list = [] + for item in response.output: + if item.type == "function_call": + tool_calls.append(item) + elif item.type == "message": + for block in item.content: + if hasattr(block, "text"): + text_parts.append(block.text) + + return Turn( + text=" ".join(text_parts), + tool_calls=tool_calls, + output_items=[_to_input_item(item) for item in response.output], + input_tokens=response.usage.input_tokens, + output_tokens=response.usage.output_tokens, + ) diff --git a/benchmarks/snapshot-efficiency/src/report.py b/benchmarks/snapshot-efficiency/src/report.py new file mode 100644 index 0000000..87bd980 --- /dev/null +++ b/benchmarks/snapshot-efficiency/src/report.py @@ -0,0 +1,184 @@ +import json +import statistics +from pathlib import Path + +ROOT = Path(__file__).parent.parent # benchmarks/snapshot-efficiency/ +RESULTS_DIR = ROOT / "results" +REPORT_PATH = RESULTS_DIR / "report.md" + +CONDITION_ORDER = ["opera-compact", "opera-raw", "mcp-raw", "axi"] + + +def load_results() -> dict[str, list[dict]]: + results: dict[str, list[dict]] = {} + for f in sorted(RESULTS_DIR.glob("*.jsonl")): + cid = f.stem + records = [] + for line in f.read_text().splitlines(): + line = line.strip() + if line: + records.append(json.loads(line)) + results[cid] = records + return results + + +def summarize(records: list[dict]) -> dict: + if not records: + return {} + tasks = set(r["task"] for r in records) + passes = [r for r in records if r.get("pass")] + pass_rate = len(passes) / len(records) * 100 if records else 0 + input_tokens = [r["input_tokens"] for r in records] + output_tokens = 
[r["output_tokens"] for r in records] + total_tokens = [r["total_tokens"] for r in records] + snap_avg = [r["snapshot"]["avg_chars"] for r in records if r.get("snapshot", {}).get("avg_chars")] + snap_total = [r["snapshot"]["total_chars"] for r in records if r.get("snapshot", {}).get("total_chars")] + wall = [r["wall_clock_seconds"] for r in records] + tool_calls = [r["tool_call_count"] for r in records] + + def avg(xs: list) -> float: + return statistics.mean(xs) if xs else 0.0 + + return { + "runs": len(records), + "tasks": len(tasks), + "pass_rate": pass_rate, + "avg_input_tokens": avg(input_tokens), + "avg_output_tokens": avg(output_tokens), + "avg_total_tokens": avg(total_tokens), + "avg_snap_chars": avg(snap_avg), + "avg_snap_total_chars": avg(snap_total), + "avg_wall_seconds": avg(wall), + "avg_tool_calls": avg(tool_calls), + } + + +def per_task_summary(records: list[dict]) -> dict[str, dict]: + by_task: dict[str, list[dict]] = {} + for r in records: + by_task.setdefault(r["task"], []).append(r) + return {tid: summarize(recs) for tid, recs in sorted(by_task.items())} + + +def fmt_int(x: float) -> str: + return f"{int(x):,}" + + +def fmt_pct(x: float) -> str: + return f"{x:.0f}%" + + +def fmt_chars(x: float) -> str: + if x >= 1000: + return f"{x/1000:.1f}k" + return str(int(x)) + + +def main() -> None: + results = load_results() + if not results: + print(f"No results found in {RESULTS_DIR}/ — run run_benchmark.py first") + return + + lines: list[str] = ["# Snapshot Token Efficiency Benchmark\n"] + + # --- Summary table --- + lines.append("## Summary\n") + header = ( + "| Condition | Runs | Pass% | Avg input tok | Avg total tok | Avg snap chars | Avg wall (s) | Avg tool calls |" + ) + sep = ( + "|-----------|------|-------|---------------|---------------|----------------|--------------|----------------|" + ) + lines += [header, sep] + + ordered_cids = [c for c in CONDITION_ORDER if c in results] + [c for c in results if c not in CONDITION_ORDER] + summaries: 
dict[str, dict] = {} + for cid in ordered_cids: + s = summarize(results[cid]) + summaries[cid] = s + row = ( + f"| {cid} " + f"| {s['runs']} " + f"| {fmt_pct(s['pass_rate'])} " + f"| {fmt_int(s['avg_input_tokens'])} " + f"| {fmt_int(s['avg_total_tokens'])} " + f"| {fmt_chars(s['avg_snap_chars'])} " + f"| {s['avg_wall_seconds']:.1f} " + f"| {s['avg_tool_calls']:.1f} |" + ) + lines.append(row) + lines.append("") + + # --- Token savings vs mcp-raw --- + if "mcp-raw" in summaries and "opera-compact" in summaries: + baseline = summaries["mcp-raw"]["avg_total_tokens"] + compact = summaries["opera-compact"]["avg_total_tokens"] + if baseline > 0: + pct_saved = (baseline - compact) / baseline * 100 + lines.append(f"> opera-compact saves **{pct_saved:.0f}%** total tokens vs mcp-raw baseline.\n") + + # --- Per-task breakdown --- + all_tasks = sorted({r["task"] for records in results.values() for r in records}) + lines.append("## Per-task breakdown\n") + + for tid in all_tasks: + lines.append(f"### {tid}\n") + th = "| Condition | Pass% | Avg input tok | Avg snap chars |" + ts = "|-----------|-------|---------------|----------------|" + lines += [th, ts] + for cid in ordered_cids: + task_recs = [r for r in results[cid] if r["task"] == tid] + if not task_recs: + continue + s = summarize(task_recs) + row = ( + f"| {cid} " + f"| {fmt_pct(s['pass_rate'])} " + f"| {fmt_int(s['avg_input_tokens'])} " + f"| {fmt_chars(s['avg_snap_chars'])} |" + ) + lines.append(row) + lines.append("") + + # --- Snapshot size distribution --- + lines.append("## Snapshot size distribution (avg chars per snapshot call)\n") + dist_header = "| Condition | Min | Median | Max |" + dist_sep = "|-----------|-----|--------|-----|" + lines += [dist_header, dist_sep] + for cid in ordered_cids: + all_snap = [] + for r in results[cid]: + snap = r.get("snapshot", {}) + # reconstruct per-call from avg×count (rough; exact per-call in agent_output.json) + if snap.get("avg_chars") and snap.get("count"): + 
all_snap.append(snap["avg_chars"]) + if all_snap: + row = ( + f"| {cid} " + f"| {fmt_chars(min(all_snap))} " + f"| {fmt_chars(statistics.median(all_snap))} " + f"| {fmt_chars(max(all_snap))} |" + ) + lines.append(row) + lines.append("") + + # --- Failures --- + lines.append("## Failures\n") + for cid in ordered_cids: + fails = [r for r in results[cid] if not r.get("pass")] + if fails: + lines.append(f"### {cid} ({len(fails)} failures)\n") + for r in fails: + lines.append(f"- **{r['task']}** run{r['run']}: {r.get('grade_reason', '')}") + lines.append("") + + report = "\n".join(lines) + RESULTS_DIR.mkdir(parents=True, exist_ok=True) + REPORT_PATH.write_text(report) + print(report) + print(f"\nReport written to {REPORT_PATH}") + + +if __name__ == "__main__": + main() diff --git a/benchmarks/snapshot-efficiency/src/run_benchmark.py b/benchmarks/snapshot-efficiency/src/run_benchmark.py new file mode 100644 index 0000000..f9f2a63 --- /dev/null +++ b/benchmarks/snapshot-efficiency/src/run_benchmark.py @@ -0,0 +1,247 @@ +import argparse +import json +import os +import shlex +import subprocess +import sys +import time +from pathlib import Path + +import yaml + +from agent import run_agent +from judge import grade +from tools import make_tool_set + +ROOT = Path(__file__).parent.parent # benchmarks/snapshot-efficiency/ +RESULTS_DIR = ROOT / "results" + + +def load_config() -> tuple[dict, dict, dict]: + config = ROOT / "config" + with open(config / "tasks.yaml") as f: + tasks = yaml.safe_load(f)["tasks"] + with open(config / "conditions.yaml") as f: + conditions = {c["id"]: c for c in yaml.safe_load(f)["conditions"]} + with open(config / "models.yaml") as f: + models = yaml.safe_load(f) + return tasks, conditions, models + + +def artifact_dir(condition_id: str, task_id: str, run_n: int) -> Path: + d = RESULTS_DIR / condition_id / task_id / f"run{run_n}" + d.mkdir(parents=True, exist_ok=True) + return d + + +def next_run_index(condition_id: str, task_id: str) -> int: + base = 
RESULTS_DIR / condition_id / task_id + if not base.exists(): + return 0 + existing = [d for d in base.iterdir() if d.is_dir() and d.name.startswith("run")] + return len(existing) + + +def upsert_jsonl(condition_id: str, record: dict) -> None: + path = RESULTS_DIR / f"{condition_id}.jsonl" + with open(path, "a") as f: + f.write(json.dumps(record) + "\n") + + +def start_daemon(condition: dict) -> subprocess.Popen | None: + start_cmd = condition.get("start") + if not start_cmd: + return None + print(f" Starting daemon: {start_cmd}") + proc = subprocess.Popen(shlex.split(start_cmd), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) + time.sleep(2) + return proc + + +def stop_daemon(condition: dict, proc: subprocess.Popen | None) -> None: + stop_cmd = condition.get("stop") + if stop_cmd: + subprocess.run(shlex.split(stop_cmd), capture_output=True) + if proc: + proc.terminate() + + +def run_once( + condition: dict, + task_id: str, + task: dict, + run_n: int, + model: str, + reasoning_effort: str, + judge_model: str, + judge_reasoning_effort: str, +) -> dict: + tool_set = make_tool_set(condition) + result = run_agent( + task_prompt=task["prompt"], + tool_set=tool_set, + model=model, + reasoning_effort=reasoning_effort, + ) + grading_hint = task.get("grading", {}).get("grading_hint") + if tool_set.all_errored: + verdict = { + "pass": False, + "reason": "all tool calls errored — tool not installed or not running", + } + else: + verdict = grade( + task["prompt"], + result.trajectory, + judge_model, + judge_reasoning_effort, + grading_hint=grading_hint, + ) + + # per-snapshot stats + sc = result.snapshot_chars + snapshot_stats = { + "count": len(sc), + "total_chars": sum(sc), + "avg_chars": int(sum(sc) / len(sc)) if sc else 0, + "max_chars": max(sc) if sc else 0, + } + + record = { + "condition": condition["id"], + "task": task_id, + "run": run_n, + "pass": verdict.get("pass", False), + "grade_reason": verdict.get("reason", ""), + "answer": result.answer, + 
"input_tokens": result.input_tokens, + "output_tokens": result.output_tokens, + "total_tokens": result.total_tokens, + "tool_call_count": result.tool_call_count, + "wall_clock_seconds": round(result.wall_clock_seconds, 1), + "snapshot": snapshot_stats, + "error": result.error, + } + + adir = artifact_dir(condition["id"], task_id, run_n) + (adir / "agent_output.json").write_text( + json.dumps( + { + "trajectory": result.trajectory, + "input_tokens": result.input_tokens, + "output_tokens": result.output_tokens, + "snapshot_chars": result.snapshot_chars, + }, + indent=2, + ) + ) + (adir / "grade.json").write_text(json.dumps(verdict, indent=2)) + (adir / "result.json").write_text(json.dumps(record, indent=2)) + + return record + + +def main() -> None: + parser = argparse.ArgumentParser(description="Run snapshot benchmark") + parser.add_argument( + "--conditions", + default=None, + help="Comma-separated condition IDs (default: all)", + ) + parser.add_argument("--tasks", default=None, help="Comma-separated task IDs (default: all)") + parser.add_argument("--repeats", type=int, default=5, help="Runs per condition×task") + parser.add_argument("--model", default=None, help="Agent model (overrides config/models.yaml)") + parser.add_argument( + "--reasoning-effort", + default=None, + dest="reasoning_effort", + help="Agent reasoning effort low/medium/high (overrides config/models.yaml)", + ) + parser.add_argument( + "--judge-model", + default=None, + dest="judge_model", + help="Judge model (overrides config/models.yaml)", + ) + parser.add_argument( + "--judge-reasoning-effort", + default=None, + dest="judge_reasoning_effort", + help="Judge reasoning effort low/medium/high (overrides config/models.yaml)", + ) + args = parser.parse_args() + + if not os.environ.get("OPENAI_API_KEY"): + sys.exit("Error: OPENAI_API_KEY environment variable not set") + + all_tasks, all_conditions, models_cfg = load_config() + + agent_model = args.model or models_cfg["agent"]["model"] + agent_effort = 
args.reasoning_effort or models_cfg["agent"]["reasoning_effort"] + judge_model = args.judge_model or models_cfg["judge"]["model"] + judge_effort = args.judge_reasoning_effort or models_cfg["judge"]["reasoning_effort"] + + selected_conditions = args.conditions.split(",") if args.conditions else list(all_conditions.keys()) + selected_tasks = args.tasks.split(",") if args.tasks else list(all_tasks.keys()) + + # validate + for cid in selected_conditions: + if cid not in all_conditions: + sys.exit(f"Unknown condition: {cid}. Available: {', '.join(all_conditions)}") + for tid in selected_tasks: + if tid not in all_tasks: + sys.exit(f"Unknown task: {tid}. Available: {', '.join(all_tasks)}") + + RESULTS_DIR.mkdir(parents=True, exist_ok=True) + + total = len(selected_conditions) * len(selected_tasks) * args.repeats + done = 0 + + for cid in selected_conditions: + condition = all_conditions[cid] + print(f"\n{'='*60}") + print(f"Condition: {cid}") + print(f"{'='*60}") + + daemon = start_daemon(condition) + try: + for tid in selected_tasks: + task = all_tasks[tid] + for repeat in range(args.repeats): + run_n = next_run_index(cid, tid) + done += 1 + print(f"\n[{done}/{total}] {cid} / {tid} / run{run_n}") + try: + record = run_once( + condition=condition, + task_id=tid, + task=task, + run_n=run_n, + model=agent_model, + reasoning_effort=agent_effort, + judge_model=judge_model, + judge_reasoning_effort=judge_effort, + ) + status = "PASS" if record["pass"] else "FAIL" + tokens = record["total_tokens"] + avg_snap = record["snapshot"]["avg_chars"] + elapsed = record["wall_clock_seconds"] + print(f" {status} | {tokens} tokens | {avg_snap} avg snap chars | {elapsed}s") + if record["error"]: + print(f" Error: {record['error']}") + upsert_jsonl(cid, record) + except KeyboardInterrupt: + print("\nInterrupted.") + stop_daemon(condition, daemon) + sys.exit(0) + except Exception as e: + print(f" Run failed: {e}") + finally: + stop_daemon(condition, daemon) + + print(f"\nDone. 
Results in {RESULTS_DIR}/") + print("Run: python report.py") + + +if __name__ == "__main__": + main() diff --git a/benchmarks/snapshot-efficiency/src/tools.py b/benchmarks/snapshot-efficiency/src/tools.py new file mode 100644 index 0000000..6b9758c --- /dev/null +++ b/benchmarks/snapshot-efficiency/src/tools.py @@ -0,0 +1,222 @@ +import json +import subprocess +from dataclasses import dataclass, field + +import requests + +from utils import snapshot_chars + + +@dataclass +class ToolCallRecord: + tool_name: str + args: dict + result: str + snapshot_chars: int = 0 + error: str | None = None + + +@dataclass +class ToolSet: + condition_id: str + definitions: list[dict] # OpenAI tool schemas + records: list[ToolCallRecord] = field(default_factory=list) + + def dispatch(self, name: str, args: dict) -> str: + raise NotImplementedError + + @property + def all_errored(self) -> bool: + """True if every tool call returned an error — indicates the tool is not installed/running.""" + return bool(self.records) and all(r.result.startswith("[error:") for r in self.records) + + +# --------------------------------------------------------------------------- +# CLI-mode tool set (opera-compact, opera-raw, axi) +# --------------------------------------------------------------------------- + + +class CLIToolSet(ToolSet): + def __init__(self, condition_id: str, cli_bin: str, raw: bool = False): + self.cli_bin = cli_bin + self.raw = raw + super().__init__(condition_id=condition_id, definitions=_CLI_SCHEMA) + + def _run(self, *args: str, timeout: int = 60) -> str: + cmd = [self.cli_bin, *args] + try: + result = subprocess.run( + cmd, + capture_output=True, + text=True, + timeout=timeout, + ) + output = result.stdout + if result.returncode != 0 and not output: + output = result.stderr or f"[exit {result.returncode}]" + return output.strip() + except subprocess.TimeoutExpired: + return f"[timeout after {timeout}s]" + except FileNotFoundError: + return f"[error: {self.cli_bin} not found in 
PATH]" + + def dispatch(self, name: str, args: dict) -> str: + extra = ["--raw"] if self.raw and name in ("navigate", "snapshot", "click", "go_back") else [] + + match name: + case "navigate": + result = self._run("open", args.get("url", ""), *extra) + case "snapshot": + result = self._run("snapshot", *extra) + case "click": + result = self._run("click", args.get("ref", ""), *extra) + case "go_back": + result = self._run("back", *extra) + case _: + result = f"[unknown tool: {name}]" + + record = ToolCallRecord( + tool_name=name, + args=args, + result=result, + snapshot_chars=(snapshot_chars(result) if name in ("navigate", "snapshot", "click", "go_back") else 0), + ) + self.records.append(record) + return result + + +# --------------------------------------------------------------------------- +# Bridge-mode tool set (mcp-raw) +# --------------------------------------------------------------------------- + +# Default bridge URL — matches opera-browser-cli's default port (OPERA_CLI_PORT). +# Override via bridge_url in conditions.yaml. 
+DEFAULT_BRIDGE_URL = "http://localhost:9224" + + +class BridgeToolSet(ToolSet): + def __init__(self, condition_id: str, bridge_url: str = DEFAULT_BRIDGE_URL): + self.bridge_url = bridge_url.rstrip("/") + self.session = requests.Session() + super().__init__(condition_id=condition_id, definitions=_CLI_SCHEMA) + + def _call(self, tool_name: str, tool_args: dict) -> str: + try: + resp = self.session.post( + f"{self.bridge_url}/call", + json={"name": tool_name, "args": tool_args}, + timeout=60, + ) + resp.raise_for_status() + data = resp.json() + # Bridge result: {"result": "<text>"} (plain string); a list of MCP content items is also accepted + result = data.get("result", data) + if isinstance(result, list): + parts = [] + for item in result: + if isinstance(item, dict) and item.get("type") == "text": + parts.append(item["text"]) + elif isinstance(item, dict): + parts.append(json.dumps(item)) + else: + parts.append(str(item)) + return "\n".join(parts) + return result if isinstance(result, str) else json.dumps(result) + except requests.exceptions.ConnectionError: + return "[error: bridge not running — start with: opera-browser-cli start]" + except Exception as e: + return f"[error: {e}]" + + def dispatch(self, name: str, args: dict) -> str: + match name: + case "navigate": + result = self._call( + "navigate_page", + {"url": args.get("url", ""), "includeSnapshot": True}, + ) + case "snapshot": + result = self._call("take_snapshot", {}) + case "click": + result = self._call("click", {"uid": args.get("ref", ""), "includeSnapshot": True}) + case "go_back": + result = self._call("navigate_page", {"type": "back", "includeSnapshot": True}) + case _: + result = f"[unknown tool: {name}]" + + record = ToolCallRecord( + tool_name=name, + args=args, + result=result, + snapshot_chars=snapshot_chars(result), + ) + self.records.append(record) + return result + + +# --------------------------------------------------------------------------- +# Factory +# --------------------------------------------------------------------------- + + +def make_tool_set(condition: dict) -> 
ToolSet: + mode = condition["tool_mode"] + cid = condition["id"] + if mode == "cli": + return CLIToolSet( + condition_id=cid, + cli_bin=condition["cli_bin"], + raw=condition.get("raw", False), + ) + elif mode == "bridge": + return BridgeToolSet( + condition_id=cid, + bridge_url=condition.get("bridge_url", DEFAULT_BRIDGE_URL), + ) + else: + raise ValueError(f"Unknown tool_mode: {mode}") + + +# --------------------------------------------------------------------------- +# OpenAI tool schemas (same for all conditions) +# Responses API (/v1/responses) uses flat tool format — no nested "function" key +# --------------------------------------------------------------------------- + +_CLI_SCHEMA: list[dict] = [ + { + "type": "function", + "name": "navigate", + "description": "Navigate the browser to a URL and return the page snapshot.", + "parameters": { + "type": "object", + "properties": {"url": {"type": "string", "description": "Full URL to navigate to."}}, + "required": ["url"], + }, + }, + { + "type": "function", + "name": "snapshot", + "description": "Return the current page's accessibility snapshot without navigating.", + "parameters": {"type": "object", "properties": {}, "required": []}, + }, + { + "type": "function", + "name": "click", + "description": "Click an element on the current page by its reference ID (e.g. 
@1.5) and return the updated snapshot.", # noqa: E501 + "parameters": { + "type": "object", + "properties": { + "ref": { + "type": "string", + "description": "Element reference such as @1.5", + } + }, + "required": ["ref"], + }, + }, + { + "type": "function", + "name": "go_back", + "description": "Navigate back to the previous page and return the snapshot.", + "parameters": {"type": "object", "properties": {}, "required": []}, + }, +] diff --git a/benchmarks/snapshot-efficiency/src/utils.py b/benchmarks/snapshot-efficiency/src/utils.py new file mode 100644 index 0000000..6633501 --- /dev/null +++ b/benchmarks/snapshot-efficiency/src/utils.py @@ -0,0 +1,2 @@ +def snapshot_chars(text: str) -> int: + return len(text) if text else 0 diff --git a/src/bridge.ts b/src/bridge.ts index e24d8de..19c5040 100644 --- a/src/bridge.ts +++ b/src/bridge.ts @@ -3,9 +3,10 @@ * * Spawns opera-devtools-mcp as a child process and maintains a single * persistent MCP session. Exposes a simple HTTP API: - * POST /call { name, args } → { result } - * GET /tools → [{ name, description }] - * GET /health → { status: "ok" } + * POST /call { name, args } → { result } + * GET /tools → [{ name, description }] + * GET /health → { status: "ok" } + * GET /last-snapshot → { raw, pageUrl, capturedAt } | 404 * * Writes a PID file to ~/.opera-browser-cli/bridge.pid on startup. */ @@ -23,6 +24,7 @@ import { type ServerResponse, } from "node:http"; import { existsSync, mkdirSync, unlinkSync, writeFileSync } from "node:fs"; +import { extractPageOrigin } from "./snapshot.js"; import { createRequire } from "node:module"; import { dirname, join, resolve } from "node:path"; import { homedir } from "node:os"; @@ -42,6 +44,26 @@ const OPERA_AI_TOOLS = new Set([ "opera_make", ]); +export interface LastSnapshotCache { + raw: string; + pageUrl: string | null; + capturedAt: number; +} + +// The most recent raw snapshot text returned by take_snapshot. +// Shared across all concurrent HTTP requests; last write wins. 
+// Survives navigation — callers use pageUrl to detect drift if needed. +let lastSnapshot: LastSnapshotCache | null = null; + +export function getLastSnapshotCache(): LastSnapshotCache | null { + return lastSnapshot; +} + +/** Reset the snapshot cache — for use in tests only. */ +export function resetLastSnapshotCache(): void { + lastSnapshot = null; +} + export interface BridgeContentBlock { type: string; text?: string; @@ -233,13 +255,20 @@ async function handleCallRequest( return; } - // Non-streaming path (unchanged). + // Non-streaming path. try { const result = await client.callTool( { name: payload.name, arguments: payload.args }, undefined, ); const text = extractToolText(getToolContent(result)); + if (payload.name === "take_snapshot") { + lastSnapshot = { + raw: text, + pageUrl: extractPageOrigin(text), + capturedAt: Date.now(), + }; + } res.statusCode = 200; res.end(JSON.stringify({ result: text })); } catch (error) { @@ -265,6 +294,15 @@ export async function handleBridgeRequest( return; } + if (req.method === "GET" && req.url === "/last-snapshot") { + if (lastSnapshot === null) { + writeJson(res, 404, { error: "no snapshot cached" }); + } else { + writeJson(res, 200, lastSnapshot); + } + return; + } + try { if (req.method === "GET" && req.url === "/tools") { await handleToolsRequest(client, res); diff --git a/src/cli.ts b/src/cli.ts index dbf2d99..a421c50 100644 --- a/src/cli.ts +++ b/src/cli.ts @@ -13,6 +13,7 @@ import { getConfigFile, getLogFile, getSessionSnapshotIfRunning, + getLastSnapshot, loadConfig, parseConfigValue, stopBridge, @@ -23,6 +24,9 @@ import { extractTitle, truncateSnapshot, truncateText, + compactSnapshot, + applyUrlLut, + resolveUrl, } from "./snapshot.js"; import { getSuggestions } from "./suggestions.js"; @@ -90,7 +94,7 @@ tips: `; const COMMAND_HELP: Record = { - open: `usage: opera-browser-cli open [--full] + open: `usage: opera-browser-cli open [--full] [--raw] Navigate to a URL and capture an accessibility snapshot. 
args: @@ -98,6 +102,7 @@ args: flags: --full Show complete snapshot without truncation + --raw Show unprocessed MCP output (disables compact format) examples: opera-browser-cli open https://example.com @@ -119,15 +124,31 @@ examples: opera-browser-cli screenshot ./element.png --uid @3 opera-browser-cli screenshot ./full.png --full-page --format jpeg`, - snapshot: `usage: opera-browser-cli snapshot [--full] + snapshot: `usage: opera-browser-cli snapshot [--full] [--raw] Capture the current page accessibility snapshot. flags: --full Show complete snapshot without truncation + --raw Show unprocessed MCP output (disables compact format) examples: opera-browser-cli snapshot - opera-browser-cli snapshot --full`, + opera-browser-cli snapshot --full + opera-browser-cli snapshot --raw`, + + url: `usage: opera-browser-cli url <$uN | @ref> +Resolve a URL token or element ref from the last snapshot. + +args: + $uN URL token printed in the snapshot's urls: trailer (e.g. $u3) + @ref Element ref from the snapshot (e.g. @11.57) + +Tokens ($uN) are scoped to the last snapshot. If no snapshot is cached +the bridge takes a fresh one automatically. + +examples: + opera-browser-cli url \\$u3 + opera-browser-cli url @11.57`, click: `usage: opera-browser-cli click @ [--full] Click an interactive element by its ref from the snapshot. 
@@ -892,10 +913,11 @@ function readPackageVersion(): string { throw new Error("Could not determine opera-browser-cli package version"); } -function splitFullFlag(args: string[]): { args: string[]; full: boolean } { +function splitFullFlag(args: string[]): { args: string[]; full: boolean; raw: boolean } { return { - args: args.filter((arg) => arg !== "--full"), + args: args.filter((arg) => arg !== "--full" && arg !== "--raw"), full: args.includes("--full"), + raw: args.includes("--raw"), }; } @@ -976,15 +998,18 @@ function parseSnapshotFromResponse(response: string): string | null { : trimmed.slice(0, nextHeading).trimEnd(); } -/** Format page metadata (TOON) + raw snapshot + suggestions. */ +/** Format page metadata (TOON) + snapshot + suggestions. */ function formatPageOutput( snapshot: string, command: string, url?: string, full = false, + raw = false, ): string { - const title = extractTitle(snapshot); - const refs = countRefs(snapshot); + const tree = raw ? snapshot : compactSnapshot(snapshot); + + const title = extractTitle(tree); + const refs = countRefs(tree); const blocks: string[] = []; @@ -995,16 +1020,19 @@ function formatPageOutput( page.refs = refs; blocks.push(encode({ page })); - // Truncate snapshot - const tr = truncateSnapshot(snapshot, full); - let snapshotBlock = `snapshot:\n${tr.text.trimEnd()}`; + // Truncate snapshot, then apply URL LUT to the visible portion only. + // LUT runs after truncation so the trailer lists only URLs the agent can see. + const tr = truncateSnapshot(tree, full, raw ? 16000 : 12000); + const { body, trailer } = raw ? { body: tr.text, trailer: "" } : applyUrlLut(tr.text); + let snapshotBlock = `snapshot:\n${body.trimEnd()}`; + if (trailer) snapshotBlock += `\n${trailer}`; if (tr.truncated) { snapshotBlock += `\n ... 
(truncated, ${tr.totalLength} chars total)`; } blocks.push(snapshotBlock); // Contextual suggestions - const suggestions = getSuggestions({ command, url, snapshot }); + const suggestions = getSuggestions({ command, url, snapshot: tree }); if (tr.truncated) { suggestions.push( `Run \`opera-browser-cli ${command}${url ? " " + url : ""} --full\` to see complete snapshot`, @@ -1027,9 +1055,9 @@ function stripSnapshotHeader(text: string): string { return text.replace(/^[\s\S]*?##\s+Latest page snapshot\s*\n/, ""); } -/** Strip leading @ from uid ref. */ +/** Strip leading @ and normalise dot-form refs to underscore form for MCP ("@2.4" → "2_4"). */ function parseUid(arg: string): string { - return arg.startsWith("@") ? arg.slice(1) : arg; + return arg.replace(/^@/, "").replace(/\./g, "_"); } function isRecoverableOpenError(error: unknown): error is CdpError { @@ -1062,7 +1090,7 @@ const SCROLL_FUNCTIONS: Record = { bottom: "window.scrollTo(0, document.body.scrollHeight)", }; -async function handleOpen(args: string[], full: boolean): Promise { +async function handleOpen(args: string[], full: boolean, raw = false): Promise { const url = args[0]; if (!url) { throw new CdpError("Missing URL", "VALIDATION_ERROR", [ @@ -1079,12 +1107,12 @@ async function handleOpen(args: string[], full: boolean): Promise { await callTool("new_page", { url }); } const snapshot = stripSnapshotHeader(await callTool("take_snapshot")); - return formatPageOutput(snapshot, "open", url, full); + return formatPageOutput(snapshot, "open", url, full, raw); } -async function handleSnapshot(full: boolean): Promise { +async function handleSnapshot(full: boolean, raw = false): Promise { const snapshot = stripSnapshotHeader(await callTool("take_snapshot")); - return formatPageOutput(snapshot, "snapshot", undefined, full); + return formatPageOutput(snapshot, "snapshot", undefined, full, raw); } async function handleScreenshot(args: string[]): Promise { @@ -1120,7 +1148,7 @@ async function 
handleScreenshot(args: string[]): Promise { return formatScreenshotOutput(parsed.filePath); } -async function handleClick(args: string[], full: boolean): Promise { +async function handleClick(args: string[], full: boolean, raw = false): Promise { const uid = args[0]; if (!uid) { throw new CdpError("Missing element ref", "VALIDATION_ERROR", [ @@ -1129,10 +1157,10 @@ async function handleClick(args: string[], full: boolean): Promise { } const snapshot = await callWithSnapshot("click", { uid: parseUid(uid) }); - return formatPageOutput(snapshot, "click", undefined, full); + return formatPageOutput(snapshot, "click", undefined, full, raw); } -async function handleFill(args: string[], full: boolean): Promise { +async function handleFill(args: string[], full: boolean, raw = false): Promise { const uid = args[0]; const value = args.slice(1).join(" "); if (!uid) { @@ -1150,10 +1178,10 @@ async function handleFill(args: string[], full: boolean): Promise { uid: parseUid(uid), value, }); - return formatPageOutput(snapshot, "fill", undefined, full); + return formatPageOutput(snapshot, "fill", undefined, full, raw); } -async function handlePress(args: string[], full: boolean): Promise { +async function handlePress(args: string[], full: boolean, raw = false): Promise { const key = args[0]; if (!key) { throw new CdpError("Missing key name", "VALIDATION_ERROR", [ @@ -1162,10 +1190,10 @@ async function handlePress(args: string[], full: boolean): Promise { } const snapshot = await callWithSnapshot("press_key", { key }); - return formatPageOutput(snapshot, "press", undefined, full); + return formatPageOutput(snapshot, "press", undefined, full, raw); } -async function handleType(args: string[], full: boolean): Promise { +async function handleType(args: string[], full: boolean, raw = false): Promise { const text = args.join(" "); if (!text) { throw new CdpError("Missing text", "VALIDATION_ERROR", [ @@ -1175,10 +1203,10 @@ async function handleType(args: string[], full: boolean): 
Promise { await callTool("type_text", { text }); const snapshot = stripSnapshotHeader(await callTool("take_snapshot")); - return formatPageOutput(snapshot, "type", undefined, full); + return formatPageOutput(snapshot, "type", undefined, full, raw); } -async function handleScroll(args: string[], full: boolean): Promise { +async function handleScroll(args: string[], full: boolean, raw = false): Promise { const dir = (args[0] ?? "down").toLowerCase(); const fn = SCROLL_FUNCTIONS[dir]; if (!fn) { @@ -1189,13 +1217,13 @@ async function handleScroll(args: string[], full: boolean): Promise { await callTool("evaluate_script", { function: fn }); const snapshot = stripSnapshotHeader(await callTool("take_snapshot")); - return formatPageOutput(snapshot, "scroll", undefined, full); + return formatPageOutput(snapshot, "scroll", undefined, full, raw); } -async function handleBack(full: boolean): Promise { +async function handleBack(full: boolean, raw = false): Promise { await callTool("navigate_page", { type: "back" }); const snapshot = stripSnapshotHeader(await callTool("take_snapshot")); - return formatPageOutput(snapshot, "back", undefined, full); + return formatPageOutput(snapshot, "back", undefined, full, raw); } async function handleWait(args: string[]): Promise { @@ -1317,7 +1345,7 @@ async function handlePages(): Promise { return renderOutput(blocks); } -async function handleNewPage(args: string[], full: boolean): Promise { +async function handleNewPage(args: string[], full: boolean, raw = false): Promise { const url = args.filter((a) => !a.startsWith("--"))[0]; if (!url) { throw new CdpError("Missing URL", "VALIDATION_ERROR", [ @@ -1329,12 +1357,13 @@ async function handleNewPage(args: string[], full: boolean): Promise { if (background) toolArgs.background = true; await callTool("new_page", toolArgs); const snapshot = stripSnapshotHeader(await callTool("take_snapshot")); - return formatPageOutput(snapshot, "newpage", url, full); + return formatPageOutput(snapshot, 
"newpage", url, full, raw); } async function handleSelectPage( args: string[], full: boolean, + raw = false, ): Promise { const id = args[0]; if (!id) { @@ -1350,7 +1379,7 @@ async function handleSelectPage( } await callTool("select_page", { pageId }); const snapshot = stripSnapshotHeader(await callTool("take_snapshot")); - return formatPageOutput(snapshot, "selectpage", undefined, full); + return formatPageOutput(snapshot, "selectpage", undefined, full, raw); } async function handleClosePage(args: string[]): Promise { @@ -1412,7 +1441,7 @@ async function handleResize(args: string[]): Promise { // --- Interaction handlers --- -async function handleHover(args: string[], full: boolean): Promise { +async function handleHover(args: string[], full: boolean, raw = false): Promise { const uid = args[0]; if (!uid) { throw new CdpError("Missing element ref", "VALIDATION_ERROR", [ @@ -1420,10 +1449,10 @@ async function handleHover(args: string[], full: boolean): Promise { ]); } const snapshot = await callWithSnapshot("hover", { uid: parseUid(uid) }); - return formatPageOutput(snapshot, "hover", undefined, full); + return formatPageOutput(snapshot, "hover", undefined, full, raw); } -async function handleDrag(args: string[], full: boolean): Promise { +async function handleDrag(args: string[], full: boolean, raw = false): Promise { const from = args[0]; const to = args[1]; if (!from || !to) { @@ -1435,10 +1464,10 @@ async function handleDrag(args: string[], full: boolean): Promise { from_uid: parseUid(from), to_uid: parseUid(to), }); - return formatPageOutput(snapshot, "drag", undefined, full); + return formatPageOutput(snapshot, "drag", undefined, full, raw); } -async function handleFillForm(args: string[], full: boolean): Promise { +async function handleFillForm(args: string[], full: boolean, raw = false): Promise { const { entries } = parseFillFormArgs(args); if (entries.length === 0) { throw new CdpError("No valid field entries", "VALIDATION_ERROR", [ @@ -1446,7 +1475,7 @@ 
async function handleFillForm(args: string[], full: boolean): Promise { ]); } const snapshot = await callWithSnapshot("fill_form", { elements: entries }); - return formatPageOutput(snapshot, "fillform", undefined, full); + return formatPageOutput(snapshot, "fillform", undefined, full, raw); } async function handleDialog(args: string[]): Promise { @@ -1463,7 +1492,7 @@ async function handleDialog(args: string[]): Promise { return encode({ dialog: action }); } -async function handleUpload(args: string[], full: boolean): Promise { +async function handleUpload(args: string[], full: boolean, raw = false): Promise { const uid = args[0]; const filePath = args[1]; if (!uid) { @@ -1480,7 +1509,7 @@ async function handleUpload(args: string[], full: boolean): Promise { uid: parseUid(uid), filePath, }); - return formatPageOutput(snapshot, "upload", undefined, full); + return formatPageOutput(snapshot, "upload", undefined, full, raw); } // --- Emulation handler --- @@ -2285,7 +2314,7 @@ async function handleHome(_full: boolean): Promise { renderHelp(help), ]); } - const snapshot = stripSnapshotHeader(result); + const snapshot = compactSnapshot(stripSnapshotHeader(result)); const title = extractTitle(snapshot); const refs = countRefs(snapshot); const page: Record = {}; @@ -2299,14 +2328,48 @@ async function handleHome(_full: boolean): Promise { return renderOutput([encode({ page }), renderHelp(help)]); } +async function handleUrl(args: string[]): Promise { + const target = args[0]; + if (!target) { + throw new CdpError("Missing argument", "VALIDATION_ERROR", [ + "Run `opera-browser-cli url \\$u3` to resolve a URL token", + "Run `opera-browser-cli url @11.57` to resolve an element ref", + ]); + } + + // Prefer the bridge's cached snapshot to avoid an extra MCP round-trip. + // Fall back to a fresh snapshot if the cache is cold. 
+ let raw: string; + const cached = await getLastSnapshot(); + if (cached) { + raw = cached.raw; + } else { + await ensureBridge(); + raw = stripSnapshotHeader(await callTool("take_snapshot")); + } + + // Re-derive the full (non-truncated) URL map so tokens match what the agent + // saw, regardless of the truncation applied to the original output. + const compact = compactSnapshot(raw); + const { body, urlMap } = applyUrlLut(compact); + + const resolved = resolveUrl(body, urlMap, target); + if (resolved === null) { + process.stderr.write(`url: "${target}" not found in last snapshot\n`); + process.exitCode = 1; + return ""; + } + return resolved; +} + type CommandFn = (args: string[]) => Promise; function withFullFlag( - handler: (args: string[], full: boolean) => Promise, + handler: (args: string[], full: boolean, raw?: boolean) => Promise, ): CommandFn { return (args) => { const parsed = splitFullFlag(args); - return handler(parsed.args, parsed.full); + return handler(parsed.args, parsed.full, parsed.raw); }; } @@ -2318,14 +2381,15 @@ function withoutFullFlag( const COMMANDS: Record = { open: withFullFlag(handleOpen), - snapshot: async (args) => handleSnapshot(splitFullFlag(args).full), + snapshot: async (args) => { const f = splitFullFlag(args); return handleSnapshot(f.full, f.raw); }, + url: withoutFullFlag(handleUrl), screenshot: withoutFullFlag(handleScreenshot), click: withFullFlag(handleClick), fill: withFullFlag(handleFill), type: withFullFlag(handleType), press: withFullFlag(handlePress), scroll: withFullFlag(handleScroll), - back: async (args) => handleBack(splitFullFlag(args).full), + back: async (args) => { const f = splitFullFlag(args); return handleBack(f.full, f.raw); }, wait: withoutFullFlag(handleWait), eval: withFullFlag(handleEval), run: async () => handleRun(), diff --git a/src/client.ts b/src/client.ts index a9e9865..1cb21f9 100644 --- a/src/client.ts +++ b/src/client.ts @@ -409,6 +409,26 @@ export async function getBridgeStatus(): Promise { * 
Get the current page snapshot without starting the bridge. * Returns null if the bridge is not running or healthy. */ +export interface CachedSnapshot { + raw: string; + pageUrl: string | null; + capturedAt: number; +} + +/** Retrieve the most recent snapshot the bridge has cached, without triggering a new one. */ +export async function getLastSnapshot(): Promise { + const pidInfo = readPidFile(); + if (!pidInfo || !isProcessAlive(pidInfo.pid)) return null; + try { + const resp = await httpGet(pidInfo.port, "/last-snapshot", 2000); + const data = JSON.parse(resp) as { error?: string } & Partial; + if (data.error || !data.raw) return null; + return { raw: data.raw, pageUrl: data.pageUrl ?? null, capturedAt: data.capturedAt ?? 0 }; + } catch { + return null; + } +} + export async function getSessionSnapshotIfRunning(): Promise { const pidInfo = readPidFile(); if (!pidInfo || !isProcessAlive(pidInfo.pid)) { diff --git a/src/run.ts b/src/run.ts index 8e599fb..7adc678 100644 --- a/src/run.ts +++ b/src/run.ts @@ -9,6 +9,7 @@ import { mkdtempSync, writeFileSync, unlinkSync, rmdirSync } from "node:fs"; import { join } from "node:path"; import { tmpdir } from "node:os"; import { CdpError } from "./client.js"; +import { compactSnapshot } from "./snapshot.js"; type CallTool = ( name: string, @@ -47,9 +48,9 @@ function stripSnapshotHeader(text: string): string { return text.replace(/^[\s\S]*?##\s+Latest page snapshot\s*\n/, ""); } -/** Strip leading @ from uid ref string. */ +/** Strip leading @ and normalise dot-form refs to underscore form for MCP ("@2.4" → "2_4"). */ function parseUid(ref: string): string { - return ref.startsWith("@") ? ref.slice(1) : ref; + return ref.replace(/^@/, "").replace(/\./g, "_"); } /** Check if an open error is recoverable by falling back to new_page. 
*/ @@ -63,7 +64,7 @@ function isRecoverableOpenError(error: unknown): boolean { // --- Selector detection --- -const UID_RE = /^@?\d[\d_]*$/; +const UID_RE = /^@?\d[\d_.]*$/; /** Returns true when the string looks like a @uid ref (e.g. "@12", "26_181"). */ export function isUidRef(s: string): boolean { @@ -163,7 +164,7 @@ export function createPageHelper(callTool: CallTool): PageHelper { async snapshot(): Promise { const result = await callTool("take_snapshot"); - return stripSnapshotHeader(result); + return compactSnapshot(stripSnapshotHeader(result)); }, async click(refOrSelector: string): Promise { diff --git a/src/snapshot.ts b/src/snapshot.ts index 311a9ab..fcd96db 100644 --- a/src/snapshot.ts +++ b/src/snapshot.ts @@ -4,9 +4,19 @@ export interface RefInfo { type: string; } -/** Count interactive refs (uid=...) in snapshot text. */ +/** Convert a canonical MCP ref ("2_4") to display form ("2.4"). */ +export function refToDisplay(mcpRef: string): string { + return mcpRef.replace(/_/g, "."); +} + +/** Convert any ref form — "@2.4", "@2_4", "2.4", "2_4" — to MCP wire form "2_4". */ +export function refToMcp(ref: string): string { + return ref.replace(/^@/, "").replace(/\./g, "_"); +} + +/** Count interactive refs in snapshot text (accepts both uid= and compact @X.Y form). */ export function countRefs(snapshot: string): number { - const matches = snapshot.match(/\buid=\S+/g); + const matches = snapshot.match(/^\s*(?:uid=\S+|@\d[\d.]*)\b/gm); return matches ? 
matches.length : 0; } @@ -14,22 +24,317 @@ export function countRefs(snapshot: string): number { export function extractRefs(snapshot: string): RefInfo[] { const refs: RefInfo[] = []; for (const line of snapshot.split("\n")) { - const m = line.match(/\buid=(\S+)\s+(\w+)\s+"([^"]*)"/); + // Accept both uid=X_Y (raw MCP) and @X.Y (compact) forms; + // avoid \b before @ since @ is a non-word character + const m = line.match(/(?:uid=(\S+)|(?:^|[ \t])@([\d.]+))\s+([\w]+)\s+"([^"]*)"/); if (!m) continue; - refs.push({ ref: m[1], type: m[2], label: m[3] }); + const rawRef = m[1] ?? m[2]; + // Always return in display form so suggestion strings emit @X.Y refs + const ref = m[1] ? refToDisplay(rawRef) : rawRef; + refs.push({ ref, type: m[3], label: m[4] }); } return refs; } -/** Extract page title from snapshot (RootWebArea or first heading). */ +/** Extract page title from snapshot (RootWebArea/root root node or first heading). */ export function extractTitle(snapshot: string): string { - const rootMatch = snapshot.match(/RootWebArea\s+"([^"]+)"/); + const rootMatch = snapshot.match(/(?:RootWebArea|root)\s+"([^"]+)"/); if (rootMatch) return rootMatch[1]; + // Compact markdown heading after compactSnapshot: `@X.Y ## Title` + const mdMatch = snapshot.match(/^(?:@\S+\s+)?#{1,6}\s+(.+)$/m); + if (mdMatch) return mdMatch[1].trim(); const headingMatch = snapshot.match(/\bheading\s+"([^"]+)"/); if (headingMatch) return headingMatch[1]; return ""; } +// Query-string keys issued by external ad/analytics platforms that carry no functional +// meaning for the destination page — safe to drop on any site. 
+const NOISE_PARAM_EXACT = new Set([ + // Google Ads click IDs + "gclid", "gbraid", "wbraid", "dclid", "gad_source", + // Social / messaging platform click IDs + "fbclid", // Meta/Facebook + "msclkid", // Microsoft Ads + "yclid", // Yandex + "igshid", // Instagram + "ttclid", // TikTok + "twclid", // Twitter/X + "li_fat_id", // LinkedIn + "srsltid", // Google Shopping + "_ke", // Klaviyo +]); +// Prefix-matched families (all members are tracking-only) +const NOISE_PARAM_PREFIXES = [ + "utm_", // Google Analytics UTM parameters + "mc_", // Mailchimp +]; + +function isNoiseParam(key: string): boolean { + if (NOISE_PARAM_EXACT.has(key)) return true; + return NOISE_PARAM_PREFIXES.some((p) => key.startsWith(p)); +} + +/** + * Clean a URL value to reduce token bloat without losing addressability: + * - returns null for javascript: and data: URLs so the caller drops the attribute entirely + * - strips a matching page origin → relative path + * - removes cross-site tracking query params (utm_*, gclid, fbclid, etc.) + * + * Preserves fragment, parameter order, and percent-encoding of remaining values. + */ +export function cleanUrl(url: string, origin: string | null): string | null { + if (url.startsWith("javascript:") || url.startsWith("data:")) return null; + + let working = url; + if (origin && working.startsWith(origin)) { + working = working.slice(origin.length) || "/"; + } + + // Pull the fragment off first so query-param parsing can't accidentally consume it + let fragment = ""; + const hashIdx = working.indexOf("#"); + if (hashIdx >= 0) { + fragment = working.slice(hashIdx); + working = working.slice(0, hashIdx); + } + + const qIdx = working.indexOf("?"); + if (qIdx < 0) return working + fragment; + + const path = working.slice(0, qIdx); + const query = working.slice(qIdx + 1); + if (!query) return path + fragment; + + const kept = query.split("&").filter((part) => { + if (!part) return false; + const eq = part.indexOf("="); + const key = eq < 0 ? 
part : part.slice(0, eq); + return !isNoiseParam(key); + }); + + if (kept.length === 0) return path + fragment; + return `${path}?${kept.join("&")}${fragment}`; +} + +/** Extract scheme://host from the root node's url= attribute, if present. */ +export function extractPageOrigin(tree: string): string | null { + const m = tree.match( + /^\s*(?:uid=\S+|@\S+)\s+(?:RootWebArea|root)\b[^\n]*\burl="([^"]+)"/m, + ); + if (!m) return null; + try { + const u = new URL(m[1]); + return `${u.protocol}//${u.host}`; + } catch { + return null; + } +} + +// Repeat a description value this many times before we treat it as boilerplate worth deduping. +// Below this, the bytes saved by dropping repeats don't beat the risk of hiding meaningful copy. +const DESCRIPTION_DEDUP_THRESHOLD = 3; + +// Chrome a11y tree uses PascalCase for some internal role names; map them to compact lowercase. +const ROLE_RENAMES: Record = { + RootWebArea: "root", + StaticText: "text", + DisclosureTriangle: "disclosure", + ColorWell: "color", + InputTime: "time", + Date: "date", +}; + +/** + * Compact an accessibility snapshot tree to reduce token usage (~30% fewer tokens). + * Removes noise nodes, strips ARIA default attributes, normalises role names, + * de-quotes numeric attributes, converts headings to markdown, and rewrites + * refs to the @PAGE.ELEM display format. + * + * Operates on the raw tree text (after MCP preamble has been stripped). + */ +export function compactSnapshot(tree: string): string { + const lines = tree.split("\n"); + const out: string[] = []; + let dropDanglingQuote = false; + + // Pre-pass: find page origin (for relative-URL rewriting) and count description values + // so we know which ones cross the dedup threshold. 
+ const origin = extractPageOrigin(tree); + const descriptionCounts = new Map(); + for (const line of lines) { + const re = / description="([^"]*)"/g; + let m: RegExpExecArray | null; + while ((m = re.exec(line)) !== null) { + descriptionCounts.set(m[1], (descriptionCounts.get(m[1]) ?? 0) + 1); + } + } + const seenDescription = new Set(); + + for (const raw of lines) { + let line = raw; + + //
<br> elements appear as LineBreak nodes; they're never useful in the a11y tree. + // Their label is a literal newline, so splitting on \n leaves a dangling `"` on the + // next line — skip that too. + if (/^\s*uid=\S+ LineBreak "/.test(line)) { + dropDanglingQuote = true; + continue; + } + if (dropDanglingQuote) { + dropDanglingQuote = false; + if (/^\s*"\s*$/.test(line)) continue; + } + + // Whitespace-only text nodes between elements are structural artifacts, not content + if (/^\s*uid=\S+ StaticText "\s*"\s*$/.test(line)) continue; + + // StaticText children that just echo the parent's accessible name are redundant — + // links, headings, buttons etc. already carry the label on their own line + { + const m = line.match(/^(\s*)uid=\S+ StaticText "([^"]+)"\s*$/); + if (m) { + const childIndent = m[1].length; + const label = m[2]; + let drop = false; + for (let i = out.length - 1; i >= 0; i--) { + if (!out[i].trim()) continue; + // Previous lines may already be in compact @X.Y form (B1 runs per-line before push) + const pm = out[i].match(/^(\s*)(?:uid=\S+|@\S+) \w+ "([^"]+)"/); + if (pm && pm[1].length === childIndent - 2 && pm[2] === label) drop = true; + break; + } + if (drop) continue; + } + } + + // Empty valuetext is the same as having no valuetext + line = line.replace(/ valuetext=""/g, ""); + + // `disableable` is redundant when `disabled` is already present + if (/ disabled\b/.test(line)) line = line.replace(/ disableable\b/g, ""); + + // Every option and tab is selectable by definition; the attribute adds nothing + if (/ (?:option|tab) "/.test(line)) line = line.replace(/ selectable\b/g, ""); + + // `relevant="additions text"` is the ARIA default for live regions; omit it + line = line.replace(/ relevant="additions text"/g, ""); + + // `atomic` is implicit for alert and status by the ARIA spec + if (/ (?:alert|status) /.test(line)) line = line.replace(/ atomic\b/g, ""); + + // `live=` defaults are mandated by ARIA for these roles; no need to repeat them + if (/ 
status /.test(line)) line = line.replace(/ live="polite"/g, ""); + if (/ alert /.test(line)) line = line.replace(/ live="assertive"/g, ""); + + // combobox is always expandable with a popup; both attributes are implied by the role + if (/ combobox /.test(line)) { + line = line.replace(/ haspopup="(?:menu|listbox)"/g, ""); + line = line.replace(/ expandable\b/g, ""); + } + + // Horizontal is the default orientation for sliders and listboxes + line = line.replace(/ orientation="horizontal"/g, ""); + + // Autocomplete mode is an implementation detail rarely useful for navigation + line = line.replace(/ autocomplete="(?:both|list)"/g, ""); + + // Drop javascript: URLs entirely (no agent-actionable info), strip the page origin + // from same-site links, and remove tracking/encoding query params + line = line.replace(/ url="([^"]+)"/g, (_full, rawUrl) => { + const cleaned = cleanUrl(rawUrl, origin); + return cleaned == null ? "" : ` url="${cleaned}"`; + }); + + // Boilerplate descriptions (e.g. "use arrow keys to navigate" repeated on every link) + // are recoverable from the first occurrence; drop the rest + line = line.replace(/ description="([^"]*)"/g, (full, value) => { + if ((descriptionCounts.get(value) ?? 0) < DESCRIPTION_DEDUP_THRESHOLD) return full; + if (seenDescription.has(value)) return ""; + seenDescription.add(value); + return full; + }); + + // Normalise known PascalCase Chrome-internal role names to short lowercase forms. + // The uid= or @X.Y prefix is optional to handle simplified test fixtures. + line = line.replace( + /^(\s*(?:(?:uid=|@)\S+\s+)?)([A-Za-z][a-zA-Z]*)( )/, + (_, pre, role, post) => pre + (ROLE_RENAMES[role] ?? 
role) + post, + ); + + // Numeric attribute values don't need quotes — saves two tokens per attribute + line = line.replace(/(\w+)="(-?\d+)"/g, "$1=$2"); + + // `heading "Label" level=N` → `## Label` — markdown is shorter and familiar to models + { + const m = line.match(/^(\s*uid=\S+) heading "([^"]+)" level=(\d+)(.*)/); + if (m) { + const hashes = "#".repeat(parseInt(m[3], 10)); + const extra = m[4].trim(); + line = `${m[1]} ${hashes} ${m[2]}${extra ? " " + extra : ""}`; + } + } + + // Rewrite refs last so all earlier transforms still match the uid= form; + // dot separator tokenises better than underscore in BPE encodings + line = line.replace(/\buid=(\d+)_(\d+)\b/g, (_, page, elem) => `@${page}.${elem}`); + + out.push(line); + } + + return collapseTextRuns(out).join("\n"); +} + +/** + * Merge consecutive text nodes at the same indent into one, then re-apply + * the echo-dedup: if the merged label exactly matches the parent's label, + * the collapsed line is dropped entirely (parent already carries the content). + * + * Only runs when 2+ text nodes were actually merged; single text nodes that + * already survived the per-line echo-dedup are passed through unchanged. 
+ */ +function collapseTextRuns(lines: string[]): string[] { + const result: string[] = []; + + for (let i = 0; i < lines.length; i++) { + const m = lines[i].match(/^(\s*)(@\S+) text "([^"]*)"\s*$/); + if (!m) { + result.push(lines[i]); + continue; + } + + const [, indent, ref, firstLabel] = m; + let j = i + 1; + let merged = firstLabel; + while (j < lines.length) { + const next = lines[j].match(/^(\s*)@\S+ text "([^"]*)"\s*$/); + if (!next || next[1] !== indent) break; + merged += next[2]; + j++; + } + + if (j === i + 1) { + // Only one text node — pass through (already echo-deduped in main loop) + result.push(lines[i]); + continue; + } + + // Multiple nodes merged — advance past consumed lines and echo-dedup the result + i = j - 1; + const childIndent = indent.length; + let drop = false; + for (let k = result.length - 1; k >= 0; k--) { + if (!result[k].trim()) continue; + const pm = result[k].match(/^(\s*)(?:uid=\S+|@\S+) \w+ "([^"]+)"/); + if (pm && pm[1].length === childIndent - 2 && pm[2] === merged) drop = true; + break; + } + if (!drop) result.push(`${indent}${ref} text "${merged}"`); + } + + return result; +} + export interface TruncationResult { text: string; truncated: boolean; @@ -87,3 +392,105 @@ const INPUT_TYPES = ["textbox", "searchbox", "input", "combobox", "textarea"]; export function isInputType(type: string): boolean { return INPUT_TYPES.includes(type); } + +// --- URL LUT (Layer 2) --- + +const MIN_DEDUP_LEN = 15; +const WHALE_THRESHOLD = 200; +const WHALE_PREVIEW_CAP = 60; + +export interface UrlLutResult { + body: string; + trailer: string; // empty string when no tokens were assigned + urlMap: Map; // token ($u1) → full cleaned URL +} + +// Produce a short human-readable hint for a whale URL (no full value echoed). +// Relative paths are already concise; absolute URLs strip the scheme first. +function whalePreview(url: string): string { + const target = url.startsWith("/") ? 
url : url.replace(/^https?:\/\//, ""); + return target.length <= WHALE_PREVIEW_CAP + ? target + : target.slice(0, WHALE_PREVIEW_CAP - 1) + "…"; +} + +/** + * Apply a URL lookup table to a compacted, already-truncated snapshot. + * + * Two classes of URL are replaced with short $uN tokens: + * dedup — appears ≥2× and length ≥ MIN_DEDUP_LEN → full URL printed in trailer + * whale — length ≥ WHALE_THRESHOLD and not already a dedup URL + * → hidden in trailer with byte-size + path-stem preview only + * + * Must run AFTER truncation so the trailer only references URLs the agent can + * actually see in the body. Token IDs are assigned in tree-walk (top-down) + * order and are therefore deterministic for identical input. + */ +export function applyUrlLut(text: string): UrlLutResult { + // Count occurrences of each URL value (Layer 1 has already cleaned them) + const urlCounts = new Map(); + const scanRe = / url="([^"]+)"/g; + let m: RegExpExecArray | null; + while ((m = scanRe.exec(text)) !== null) { + urlCounts.set(m[1], (urlCounts.get(m[1]) ?? 0) + 1); + } + + const isDedup = (u: string) => (urlCounts.get(u) ?? 0) >= 2 && u.length >= MIN_DEDUP_LEN; + // Dedup wins when both conditions hold — URL gets full entry in trailer, not hidden. 
+ const isWhale = (u: string) => u.length >= WHALE_THRESHOLD && !isDedup(u); + + const urlToToken = new Map(); + const urlMap = new Map(); + let counter = 0; + + const body = text.replace(/ url="([^"]+)"/g, (_full, url: string) => { + if (!isDedup(url) && !isWhale(url)) return _full; + if (!urlToToken.has(url)) { + const token = `$u${++counter}`; + urlToToken.set(url, token); + urlMap.set(token, url); + } + return ` url=${urlToToken.get(url)!}`; + }); + + if (urlMap.size === 0) return { body, trailer: "", urlMap }; + + const trailerLines = ["urls:"]; + for (const [token, url] of urlMap) { + if (isWhale(url)) { + trailerLines.push(` ${token} [hidden ${url.length}b → ${whalePreview(url)}]`); + } else { + trailerLines.push(` ${token} ${url}`); + } + } + + return { body, trailer: trailerLines.join("\n"), urlMap }; +} + +/** + * Resolve a URL from a LUT-applied snapshot body. + * + * target is either "$u3" (a LUT token) or "11.57" / "@11.57" (an element ref). + * For ref resolution the body is searched for the element's url= attribute; + * if it was tokenised, the token is further resolved via urlMap. + * + * Returns the full URL string, or null if not found. + */ +export function resolveUrl( + body: string, + urlMap: Map, + target: string, +): string | null { + const normalised = target.replace(/^@/, ""); + if (normalised.startsWith("$u")) { + return urlMap.get(normalised) ?? null; + } + // ref → find line and extract url= (quoted plain value or unquoted token) + const escaped = normalised.replace(/\./g, "\\."); + const re = new RegExp(`@${escaped}\\b[^\\n]*? url=(?:"([^"]+)"|(\\$u\\d+))`); + const hit = body.match(re); + if (!hit) return null; + if (hit[1] !== undefined) return hit[1]; + if (hit[2] !== undefined) return urlMap.get(hit[2]) ?? 
null; + return null; +} diff --git a/test/bridge.test.ts b/test/bridge.test.ts index fdeb7ef..e68471b 100644 --- a/test/bridge.test.ts +++ b/test/bridge.test.ts @@ -6,10 +6,12 @@ import { buildTransportArgs, extractToolText, getErrorMessage, + getLastSnapshotCache, handleBridgeRequest, isBridgeClientConnected, parseBridgeCallPayload, resolveBridgeScript, + resetLastSnapshotCache, wrapTransportForIdCapture, type BridgeClient, } from "../src/bridge.js"; @@ -385,3 +387,99 @@ describe("handleBridgeRequest streaming", () => { expect(JSON.parse(mockB.endPayload)).toEqual({ result: "result-B" }); }); }); + +// --------------------------------------------------------------------------- +// Snapshot cache — lastSnapshot state + /last-snapshot endpoint +// --------------------------------------------------------------------------- + +describe("snapshot cache", () => { + beforeEach(() => resetLastSnapshotCache()); + + const snapshotClient: BridgeClient = { + listTools: async () => ({ tools: [] }), + callTool: async () => ({ + content: [{ type: "text", text: 'uid=1_0 RootWebArea "Page" url="https://example.com/"\n link "Home"' }], + }), + close: async () => {}, + }; + + it("cache is empty before any snapshot call", () => { + expect(getLastSnapshotCache()).toBeNull(); + }); + + it("GET /last-snapshot returns 404 when cache is cold", async () => { + const req = makeMockRequest("GET", "/last-snapshot"); + const mock = makeMockResponse(); + await handleBridgeRequest(snapshotClient, req, mock.res); + expect(mock.res.statusCode).toBe(404); + expect(JSON.parse(mock.endPayload)).toHaveProperty("error"); + }); + + it("take_snapshot call populates the cache", async () => { + const req = makeMockRequest("POST", "/call", JSON.stringify({ name: "take_snapshot", args: {} })); + const mock = makeMockResponse(); + await handleBridgeRequest(snapshotClient, req, mock.res); + + const cached = getLastSnapshotCache(); + expect(cached).not.toBeNull(); + expect(cached!.raw).toContain('RootWebArea 
"Page"'); + expect(cached!.pageUrl).toBe("https://example.com"); + expect(cached!.capturedAt).toBeGreaterThan(0); + }); + + it("GET /last-snapshot returns 200 with cached data after a snapshot", async () => { + // Populate cache + const postReq = makeMockRequest("POST", "/call", JSON.stringify({ name: "take_snapshot", args: {} })); + await handleBridgeRequest(snapshotClient, postReq, makeMockResponse().res); + + // Now fetch it + const getReq = makeMockRequest("GET", "/last-snapshot"); + const mock = makeMockResponse(); + await handleBridgeRequest(snapshotClient, getReq, mock.res); + + expect(mock.res.statusCode).toBe(200); + const data = JSON.parse(mock.endPayload); + expect(data.raw).toContain("RootWebArea"); + expect(data.pageUrl).toBe("https://example.com"); + expect(typeof data.capturedAt).toBe("number"); + }); + + it("a non-snapshot tool call does not overwrite the cache", async () => { + // Populate with snapshot + const snapReq = makeMockRequest("POST", "/call", JSON.stringify({ name: "take_snapshot", args: {} })); + await handleBridgeRequest(snapshotClient, snapReq, makeMockResponse().res); + const first = getLastSnapshotCache()!.raw; + + // Call a different tool + const clickClient: BridgeClient = { + listTools: async () => ({ tools: [] }), + callTool: async () => ({ content: [{ type: "text", text: "clicked" }] }), + close: async () => {}, + }; + const clickReq = makeMockRequest("POST", "/call", JSON.stringify({ name: "click", args: { uid: "1_1" } })); + await handleBridgeRequest(clickClient, clickReq, makeMockResponse().res); + + expect(getLastSnapshotCache()!.raw).toBe(first); + }); + + it("second take_snapshot overwrites the cache (last write wins)", async () => { + let callCount = 0; + const twoSnapshotClient: BridgeClient = { + listTools: async () => ({ tools: [] }), + callTool: async () => { + callCount++; + return { + content: [{ type: "text", text: `RootWebArea "Page ${callCount}"` }], + }; + }, + close: async () => {}, + }; + + const req1 = 
makeMockRequest("POST", "/call", JSON.stringify({ name: "take_snapshot", args: {} })); + const req2 = makeMockRequest("POST", "/call", JSON.stringify({ name: "take_snapshot", args: {} })); + await handleBridgeRequest(twoSnapshotClient, req1, makeMockResponse().res); + await handleBridgeRequest(twoSnapshotClient, req2, makeMockResponse().res); + + expect(getLastSnapshotCache()!.raw).toContain("Page 2"); + }); +}); diff --git a/test/fixtures/elements.html b/test/fixtures/elements.html new file mode 100644 index 0000000..60f66c3 --- /dev/null +++ b/test/fixtures/elements.html @@ -0,0 +1,288 @@ + + + + + Element Snapshot Test + + + + +

Element Snapshot Test Page

+ + +
+

Buttons

+ + + + + + +
+ + +
+

Links

+ External link + Internal link + Email link + Anchor link +
+ + +
+

Text Inputs

+
+
+ Basic inputs +
+
+
+
+
+
+
+
+
+
+
+
+ +
+ +
+ Textarea + +
+
+
+ + +
+

Checkboxes & Radios

+
+
+ Checkboxes +
+
+
+ +
+ +
+ Radio buttons +
+
+ +
+
+
+ + +
+

Select / Dropdowns

+
+ +

+ +

+ +
+
+ + +
+

Lists

+
    +
  • Unordered item 1
  • +
  • Unordered item 2 +
      +
    • Nested item A
    • +
    • Nested item B
    • +
    +
  • +
  • Unordered item 3
  • +
+ +
    +
  1. Ordered item 1
  2. +
  3. Ordered item 2
  4. +
  5. Ordered item 3
  6. +
+ +
+
Term A
Definition of term A
+
Term B
Definition of term B
+
+
+ + +
+

ARIA Widgets

+ +
+ + + +
+
Content for Tab One.
+ + + +
+
Brightness: 60%
+ +
+
45%
+ +
+
Dark mode: ON
+ +
+
3
+ +
+ +
+ + +
+

Details / Summary

+
+ Section One +

Content inside section one.

+
+
+ Section Two (open) +

Content inside section two, visible by default.

+
+
+ Section Three +

Content inside section three.

+
+
+ + +
+

Table

+ + + + + + + + + + + + + +
Sample Data Table
NameRoleStatus
AliceEngineerActive
BobDesignerAway
CarolManagerActive
3 members total
+
+ + +
+

Dialog

+ + +

Modal Dialog

+

This is a native dialog element.

+ + +
+
+ + +
+

Media

+ Placeholder image +

+
+ Square placeholder +
Figure with caption
+
+
+ + +
+

Live Regions

+
Ready.
+
No errors.
+
All systems operational.
+ +
+ + +
+

Navigation

+ + + +
+ +

Anchor target at bottom of page.

+ + + diff --git a/test/run.test.ts b/test/run.test.ts index 5518f22..faf6657 100644 --- a/test/run.test.ts +++ b/test/run.test.ts @@ -179,16 +179,19 @@ describe("createPageHelper", () => { expect(result).toBe(3); }); - it("page.snapshot strips header", async () => { + it("page.snapshot strips header and compacts output", async () => { callTool.mockResolvedValueOnce( - '## Latest page snapshot\nRootWebArea "Title"\n uid=1 link "Home"', + '## Latest page snapshot\nuid=1_0 RootWebArea "Title"\n uid=1_1 link "Home" url="/"', ); const page = createPageHelper(callTool); const snap = await page.snapshot(); expect(callTool).toHaveBeenCalledWith("take_snapshot"); - expect(snap).toContain("RootWebArea"); + // RootWebArea is renamed to `root` and uid= refs become @X.Y by compactSnapshot + expect(snap).toContain("root"); + expect(snap).not.toContain("RootWebArea"); + expect(snap).not.toContain("uid="); expect(snap).not.toContain("## Latest"); }); @@ -434,15 +437,18 @@ describe("page.open fallback", () => { // --- 8. page.snapshot strips wrapper headers --- describe("page.snapshot header stripping", () => { - it("strips MCP preamble from snapshot", async () => { + it("strips MCP preamble from snapshot and compacts output", async () => { callTool.mockResolvedValueOnce( - 'Page snapshot captured.\n\n## Latest page snapshot\n\nRootWebArea "Hi"\n uid=1 button "OK"', + 'Page snapshot captured.\n\n## Latest page snapshot\n\nuid=1_0 RootWebArea "Hi"\n uid=1_1 button "OK"', ); const page = createPageHelper(callTool); const snap = await page.snapshot(); - expect(snap).toMatch(/^RootWebArea/); + // RootWebArea is renamed to `root` and uid= refs become @X.Y by compactSnapshot + expect(snap).toMatch(/^@1\.0 root/); + expect(snap).not.toContain("RootWebArea"); + expect(snap).not.toContain("uid="); expect(snap).not.toContain("Latest page snapshot"); expect(snap).not.toContain("Page snapshot captured"); }); @@ -480,15 +486,22 @@ describe("page.eval variants", () => { // --- 10. 
isUidRef detection --- describe("isUidRef", () => { - it("recognizes @-prefixed numeric refs", () => { + it("recognizes @-prefixed numeric refs (underscore form)", () => { expect(isUidRef("@12")).toBe(true); expect(isUidRef("@1_3")).toBe(true); expect(isUidRef("@26_181")).toBe(true); }); + it("recognizes @-prefixed numeric refs (compact dot form)", () => { + expect(isUidRef("@2.4")).toBe(true); + expect(isUidRef("@12.181")).toBe(true); + expect(isUidRef("@1.0")).toBe(true); + }); + it("recognizes bare numeric refs", () => { expect(isUidRef("5")).toBe(true); expect(isUidRef("26_181")).toBe(true); + expect(isUidRef("2.4")).toBe(true); }); it("rejects CSS selectors", () => { diff --git a/test/snapshot.test.ts b/test/snapshot.test.ts index 08e27e7..82ad41b 100644 --- a/test/snapshot.test.ts +++ b/test/snapshot.test.ts @@ -6,10 +6,17 @@ import { isInputType, truncateSnapshot, truncateText, + compactSnapshot, + refToDisplay, + refToMcp, + cleanUrl, + extractPageOrigin, + applyUrlLut, + resolveUrl, } from "../src/snapshot.js"; describe("countRefs", () => { - it("counts uid= occurrences", () => { + it("counts uid= occurrences in raw form", () => { const snapshot = `RootWebArea "Example" uid=1 button "Submit" uid=2 textbox "Name" @@ -17,13 +24,20 @@ describe("countRefs", () => { expect(countRefs(snapshot)).toBe(3); }); + it("counts @X.Y refs in compact form", () => { + const snapshot = `@1.0 root "Example" + @1.1 button "Submit" + @1.2 textbox "Name"`; + expect(countRefs(snapshot)).toBe(3); + }); + it("returns 0 for no refs", () => { expect(countRefs('RootWebArea "Empty"')).toBe(0); }); }); describe("extractRefs", () => { - it("extracts ref info from snapshot lines", () => { + it("extracts ref info from raw uid= lines", () => { const snapshot = ` uid=1 button "Submit" uid=2 textbox "Name"`; const refs = extractRefs(snapshot); @@ -32,6 +46,21 @@ describe("extractRefs", () => { { ref: "2", type: "textbox", label: "Name" }, ]); }); + + it("extracts ref info from compact @X.Y 
lines and normalises to display form", () => { + const snapshot = ` @2.1 button "Submit" + @2.2 textbox "Name"`; + const refs = extractRefs(snapshot); + expect(refs).toEqual([ + { ref: "2.1", type: "button", label: "Submit" }, + { ref: "2.2", type: "textbox", label: "Name" }, + ]); + }); + + it("normalises uid=X_Y refs to display form", () => { + const refs = extractRefs(' uid=2_4 button "Go"'); + expect(refs[0].ref).toBe("2.4"); + }); }); describe("extractTitle", () => { @@ -39,6 +68,15 @@ describe("extractTitle", () => { expect(extractTitle('RootWebArea "My Page"')).toBe("My Page"); }); + it("extracts title from compact root", () => { + expect(extractTitle('@1.0 root "My Page" url="https://example.com"')).toBe("My Page"); + }); + + it("extracts title from compact markdown heading", () => { + expect(extractTitle("@1.1 # Welcome")).toBe("Welcome"); + expect(extractTitle("@1.2 ## Section")).toBe("Section"); + }); + it("falls back to heading", () => { expect(extractTitle(' heading "Welcome"')).toBe("Welcome"); }); @@ -151,3 +189,605 @@ describe("truncateText", () => { expect(result.totalLength).toBe(120); }); }); + +// --- refToDisplay / refToMcp --- + +describe("refToDisplay / refToMcp", () => { + it("converts MCP underscore refs to dot display form", () => { + expect(refToDisplay("2_4")).toBe("2.4"); + expect(refToDisplay("12_181")).toBe("12.181"); + expect(refToDisplay("1")).toBe("1"); + }); + + it("converts display refs back to MCP underscore form", () => { + expect(refToMcp("2.4")).toBe("2.4".replace(/\./g, "_")); + expect(refToMcp("12.181")).toBe("12_181"); + expect(refToMcp("@2.4")).toBe("2_4"); + expect(refToMcp("@2_4")).toBe("2_4"); + expect(refToMcp("2_4")).toBe("2_4"); + }); + + it("round-trips correctly", () => { + expect(refToMcp(refToDisplay("2_4"))).toBe("2_4"); + expect(refToMcp(refToDisplay("12_181"))).toBe("12_181"); + }); +}); + +// --- compactSnapshot --- + +describe("compactSnapshot", () => { + it("drops LineBreak nodes", () => { + const tree = 
`uid=1_0 root "Page"\n uid=1_1 button "OK"\n uid=1_2 LineBreak "\n"\n uid=1_3 link "Home"`; + const result = compactSnapshot(tree); + expect(result).not.toContain("LineBreak"); + expect(result).toContain("button"); + expect(result).toContain("link"); + }); + + it("drops whitespace-only StaticText nodes", () => { + const tree = `uid=1_0 root "Page"\n uid=1_1 StaticText " "\n uid=1_2 button "OK"`; + const result = compactSnapshot(tree); + expect(result).not.toMatch(/StaticText "\s+"/); + expect(result).toContain("button"); + }); + + it("drops StaticText children that duplicate the parent label", () => { + const tree = [ + `uid=1_0 root "Page"`, + ` uid=1_1 link "Home" url="/"`, + ` uid=1_2 StaticText "Home"`, + ` uid=1_3 button "Submit"`, + ].join("\n"); + const result = compactSnapshot(tree); + // StaticText "Home" should be gone; the link and button should remain + expect(result).not.toMatch(/text "Home"/); + expect(result).toContain('link "Home"'); + expect(result).toContain('button "Submit"'); + }); + + it("keeps StaticText children whose label differs from the parent", () => { + const tree = [ + `uid=1_0 root "Page"`, + ` uid=1_1 link "Click here" url="/"`, + ` uid=1_2 StaticText "go"`, + ].join("\n"); + const result = compactSnapshot(tree); + expect(result).toContain("go"); + }); + + it("collapses consecutive text siblings and drops when merged label echoes parent", () => { + const tree = [ + `uid=1_0 root "Page"`, + ` uid=1_1 link "[13]" url="/wiki/cite-13"`, + ` uid=1_2 StaticText "["`, + ` uid=1_3 StaticText "13"`, + ` uid=1_4 StaticText "]"`, + ].join("\n"); + const result = compactSnapshot(tree); + expect(result).toContain('link "[13]"'); + expect(result).not.toMatch(/text "\[/); + expect(result).not.toMatch(/text "13"/); + expect(result).not.toMatch(/text "\]"/); + expect(result).not.toMatch(/text "\[13\]"/); + }); + + it("collapses consecutive text siblings and keeps when merged label differs from parent", () => { + const tree = [ + `uid=1_0 root 
"Page"`, + ` uid=1_1 link "World" url="/"`, + ` uid=1_2 StaticText "Hel"`, + ` uid=1_3 StaticText "lo"`, + ].join("\n"); + const result = compactSnapshot(tree); + expect(result).toContain('link "World"'); + expect(result).toMatch(/text "Hello"/); + expect(result).not.toMatch(/@1\.3/); + }); + + it("does not collapse text nodes at different indent levels", () => { + const tree = [ + `uid=1_0 root "Page"`, + ` uid=1_1 StaticText "A"`, + ` uid=1_2 StaticText "B"`, + ].join("\n"); + const result = compactSnapshot(tree); + expect(result).toMatch(/text "A"/); + expect(result).toMatch(/text "B"/); + }); + + it("drops empty valuetext attribute", () => { + const tree = `uid=1_0 slider "Volume" value="50" valuemax="100" valuemin="0" valuetext=""`; + expect(compactSnapshot(tree)).not.toContain('valuetext=""'); + }); + + it("drops disableable when disabled is present", () => { + const tree = `uid=1_0 button "Go" disableable disabled`; + expect(compactSnapshot(tree)).not.toContain("disableable"); + expect(compactSnapshot(tree)).toContain("disabled"); + }); + + it("drops selectable on option and tab roles", () => { + const tree = [ + `uid=1_0 option "Alpha" selectable value="a"`, + `uid=1_1 tab "Home" selectable`, + ].join("\n"); + const result = compactSnapshot(tree); + expect(result).not.toContain("selectable"); + }); + + it("drops relevant='additions text'", () => { + const tree = `uid=1_0 status live="polite" relevant="additions text"`; + expect(compactSnapshot(tree)).not.toContain('relevant="additions text"'); + }); + + it("drops atomic and default live= on alert/status", () => { + const tree = [ + `uid=1_0 status atomic live="polite" relevant="additions text"`, + `uid=1_1 alert atomic live="assertive" relevant="additions text"`, + ].join("\n"); + const result = compactSnapshot(tree); + expect(result).not.toContain("atomic"); + expect(result).not.toContain('live="polite"'); + expect(result).not.toContain('live="assertive"'); + }); + + it("drops implied combobox attributes", 
() => { + const tree = `uid=1_0 combobox "Country" expandable haspopup="menu" value="Poland"`; + const result = compactSnapshot(tree); + expect(result).not.toContain("haspopup"); + expect(result).not.toContain("expandable"); + expect(result).toContain('combobox "Country"'); + }); + + it("drops orientation='horizontal'", () => { + const tree = `uid=1_0 slider "Volume" orientation="horizontal" value="50"`; + expect(compactSnapshot(tree)).not.toContain("orientation"); + }); + + it("drops autocomplete attribute", () => { + const tree = `uid=1_0 combobox "Search" autocomplete="both"`; + expect(compactSnapshot(tree)).not.toContain("autocomplete"); + }); + + it("renames PascalCase role names to compact lowercase forms", () => { + const tree = [ + `uid=1_0 RootWebArea "Page"`, + ` uid=1_1 StaticText "Hello"`, + ` uid=1_2 DisclosureTriangle "Details" expandable`, + ` uid=1_3 ColorWell "Colour" value="#ff0000"`, + ` uid=1_4 InputTime "Appt"`, + ].join("\n"); + const result = compactSnapshot(tree); + expect(result).toContain("root"); + expect(result).toContain("text"); + expect(result).toContain("disclosure"); + expect(result).toContain("color"); + expect(result).toContain("time"); + expect(result).not.toContain("RootWebArea"); + expect(result).not.toContain("StaticText"); + expect(result).not.toContain("DisclosureTriangle"); + expect(result).not.toContain("ColorWell"); + expect(result).not.toContain("InputTime"); + }); + + it("strips quotes from numeric attribute values", () => { + const tree = `uid=1_0 spinbutton "Qty" value="3" valuemin="1" valuemax="10"`; + const result = compactSnapshot(tree); + expect(result).toContain("value=3"); + expect(result).toContain("valuemin=1"); + expect(result).toContain("valuemax=10"); + expect(result).not.toContain('value="3"'); + }); + + it("converts headings to markdown style", () => { + const tree = [ + `uid=1_0 root "Page"`, + ` uid=1_1 heading "Section One" level="1"`, + ` uid=1_2 heading "Subsection" level="2"`, + ].join("\n"); + 
const result = compactSnapshot(tree); + expect(result).toContain("# Section One"); + expect(result).toContain("## Subsection"); + expect(result).not.toContain('heading "'); + expect(result).not.toContain("level="); + }); + + it("rewrites uid=PAGE_ELEM refs to @PAGE.ELEM display form", () => { + const tree = `uid=2_4 button "Submit"`; + const result = compactSnapshot(tree); + expect(result).toContain("@2.4"); + expect(result).not.toContain("uid="); + }); + + it("processes a realistic multi-element tree and is shorter than the original", () => { + const tree = [ + `uid=2_0 RootWebArea "Test Page" url="file:///test.html"`, + ` uid=2_1 heading "Test Page" level="1"`, + ` uid=2_2 region "Links"`, + ` uid=2_3 link "Home" url="/"`, + ` uid=2_4 StaticText "Home"`, + ` uid=2_5 StaticText " "`, + ` uid=2_6 LineBreak "\n"`, + ` uid=2_7 region "Form"`, + ` uid=2_8 combobox "Country" expandable haspopup="menu" value="Poland"`, + ` uid=2_9 option "Poland" selectable selected value="Poland"`, + ` uid=2_10 option "Germany" selectable value="Germany"`, + ` uid=2_11 status atomic live="polite" relevant="additions text"`, + ` uid=2_12 StaticText "Ready."`, + ].join("\n"); + + const result = compactSnapshot(tree); + + // Ref format + expect(result).not.toContain("uid="); + expect(result).toContain("@2.0"); + + // Role renames + expect(result).not.toContain("RootWebArea"); + expect(result).not.toContain("StaticText"); + expect(result).not.toContain("LineBreak"); + + // Noise removal + expect(result).not.toContain("selectable"); + expect(result).not.toContain("atomic"); + expect(result).not.toContain("expandable"); + expect(result).not.toContain('live="polite"'); + expect(result).not.toContain('relevant='); + + // Markdown headings + expect(result).toContain("# Test Page"); + + // Shorter overall + expect(result.length).toBeLessThan(tree.length); + }); +}); + +// --- cleanUrl --- + +describe("cleanUrl", () => { + it("returns null for javascript: URLs", () => { + 
expect(cleanUrl("javascript:void(0)", null)).toBeNull(); + expect(cleanUrl("javascript:doStuff()", "https://x.com")).toBeNull(); + }); + + it("returns null for data: URLs", () => { + expect(cleanUrl("data:image/png;base64,abc123", null)).toBeNull(); + expect(cleanUrl("data:text/html,hi", "https://x.com")).toBeNull(); + }); + + it("strips matching page origin", () => { + expect(cleanUrl("https://example.com/foo", "https://example.com")).toBe("/foo"); + }); + + it("returns / when URL is exactly the origin", () => { + expect(cleanUrl("https://example.com", "https://example.com")).toBe("/"); + }); + + it("does not strip a different origin", () => { + expect(cleanUrl("https://other.com/foo", "https://example.com")).toBe( + "https://other.com/foo", + ); + }); + + it("leaves absolute URL unchanged when origin is null", () => { + expect(cleanUrl("https://example.com/foo?q=bar", null)).toBe( + "https://example.com/foo?q=bar", + ); + }); + + it("drops Google Analytics UTM params", () => { + expect(cleanUrl("/p?id=1&utm_source=nl&utm_medium=email&utm_campaign=spring", null)).toBe( + "/p?id=1", + ); + }); + + it("drops Google Ads click IDs (gclid, gbraid, wbraid, dclid, gad_source)", () => { + expect(cleanUrl("/p?q=x&gclid=abc&gbraid=def&wbraid=ghi&dclid=jkl&gad_source=1", null)).toBe( + "/p?q=x", + ); + }); + + it("drops social platform click IDs (fbclid, msclkid, yclid, igshid, ttclid, twclid)", () => { + expect( + cleanUrl("/p?id=1&fbclid=a&msclkid=b&yclid=c&igshid=d&ttclid=e&twclid=f", null), + ).toBe("/p?id=1"); + }); + + it("drops LinkedIn, Google Shopping, and Klaviyo click IDs", () => { + expect(cleanUrl("/p?id=1&li_fat_id=a&srsltid=b&_ke=c", null)).toBe("/p?id=1"); + }); + + it("drops Mailchimp mc_ params", () => { + expect(cleanUrl("/p?id=1&mc_cid=abc&mc_eid=xyz", null)).toBe("/p?id=1"); + }); + + it("preserves functional params (q, id, node, page, etc.)", () => { + expect(cleanUrl("/search?q=keyboard&page=2&node=42", null)).toBe( + "/search?q=keyboard&page=2&node=42", + ); + }); + + it("preserves ie= and _encoding= (site-specific, not generic tracking)", () => { + expect(cleanUrl("/p?ie=UTF8&_encoding=UTF8&node=42", null)).toBe( + "/p?ie=UTF8&_encoding=UTF8&node=42", + ); + }); + + it("drops the ? 
entirely when all params are noise", () => { + expect(cleanUrl("/p?gclid=abc&utm_source=google", null)).toBe("/p"); + }); + + it("preserves the fragment", () => { + expect(cleanUrl("https://example.com/s?q=x&gclid=y#section", "https://example.com")).toBe( + "/s?q=x#section", + ); + }); + + it("preserves the fragment when there is no query", () => { + expect(cleanUrl("https://example.com/foo#bar", "https://example.com")).toBe("/foo#bar"); + }); + + it("preserves percent-encoded values in non-noise params", () => { + expect(cleanUrl("/p?q=hello%20world&gclid=x", null)).toBe("/p?q=hello%20world"); + }); +}); + +// --- extractPageOrigin --- + +describe("extractPageOrigin", () => { + it("returns origin from RootWebArea url= attribute", () => { + const tree = `uid=1_0 RootWebArea "Page" url="https://www.amazon.com/s?k=x"`; + expect(extractPageOrigin(tree)).toBe("https://www.amazon.com"); + }); + + it("returns origin from compact root + @ref form", () => { + const tree = `@1.0 root "Page" url="https://example.com:8080/foo"`; + expect(extractPageOrigin(tree)).toBe("https://example.com:8080"); + }); + + it("returns null when there is no root url=", () => { + expect(extractPageOrigin(`uid=1_0 RootWebArea "Page"`)).toBeNull(); + }); + + it("returns null for a tree without a root node", () => { + expect(extractPageOrigin(`uid=1_1 button "Click"`)).toBeNull(); + }); + + it("returns null for an unparseable URL", () => { + expect( + extractPageOrigin(`uid=1_0 RootWebArea "Page" url="not a url"`), + ).toBeNull(); + }); +}); + +// --- compactSnapshot Layer 1 (URL + description cleanup) --- + +describe("compactSnapshot URL cleanup", () => { + it("drops javascript: url= attributes but keeps the element", () => { + const tree = [ + `uid=1_0 root "Page" url="https://x.com/"`, + ` uid=1_1 link "Search" url="javascript:void(0)"`, + ].join("\n"); + const result = compactSnapshot(tree); + expect(result).not.toContain("javascript:"); + // Link line should have no url= attribute at all (root 
keeps its url= for origin lookup) + const linkLine = result.split("\n").find((l) => l.includes('link "Search"'))!; + expect(linkLine).not.toContain("url="); + expect(linkLine).toContain('link "Search"'); + }); + + it("strips the page origin from same-site URLs", () => { + const tree = [ + `uid=1_0 root "Page" url="https://www.amazon.com/s?k=x"`, + ` uid=1_1 link "Logo" url="https://www.amazon.com/ref_logo"`, + ` uid=1_2 link "Other" url="https://other.com/foo"`, + ].join("\n"); + const result = compactSnapshot(tree); + expect(result).toContain('url="/ref_logo"'); + expect(result).toContain('url="https://other.com/foo"'); + }); + + it("drops tracking query params from URLs", () => { + const tree = [ + `uid=1_0 root "Page" url="https://example.com/"`, + ` uid=1_1 link "News" url="https://example.com/news?utm_source=nl&utm_medium=email&gclid=abc"`, + ].join("\n"); + const result = compactSnapshot(tree); + expect(result).toContain('url="/news"'); + }); + + it("dedups boilerplate description repeated >= threshold times", () => { + const boilerplate = "use arrow keys to navigate"; + const tree = [ + `uid=1_0 root "Page" url="https://x.com/"`, + ` uid=1_1 link "A" description="${boilerplate}"`, + ` uid=1_2 link "B" description="${boilerplate}"`, + ` uid=1_3 link "C" description="${boilerplate}"`, + ` uid=1_4 link "D" description="${boilerplate}"`, + ].join("\n"); + const result = compactSnapshot(tree); + const matches = result.match(/description=/g) ?? []; + expect(matches.length).toBe(1); + expect(result).toContain(`description="${boilerplate}"`); + expect(result).toContain('link "D"'); + }); + + it("keeps descriptions that occur fewer times than the threshold", () => { + const tree = [ + `uid=1_0 root "Page" url="https://x.com/"`, + ` uid=1_1 link "A" description="hint one"`, + ` uid=1_2 link "B" description="hint one"`, + ].join("\n"); + const result = compactSnapshot(tree); + const matches = result.match(/description=/g) ?? 
[]; + expect(matches.length).toBe(2); + }); + + it("strips tracking params even when page origin is unknown", () => { + const tree = [ + `uid=1_0 root "Page"`, + ` uid=1_1 link "News" url="https://example.com/news?utm_source=nl&gclid=abc"`, + ].join("\n"); + const result = compactSnapshot(tree); + // No origin stripping (root has no url=), but tracking params are removed + expect(result).toContain('url="https://example.com/news"'); + }); +}); + +// --- applyUrlLut --- + +describe("applyUrlLut", () => { + it("returns body unchanged and empty trailer when no URLs are present", () => { + const text = `@1.0 root "Page"\n @1.1 button "Click"`; + const { body, trailer, urlMap } = applyUrlLut(text); + expect(body).toBe(text); + expect(trailer).toBe(""); + expect(urlMap.size).toBe(0); + }); + + it("leaves a short URL that appears once untouched", () => { + const text = `@1.0 root "Page"\n @1.1 link "Home" url="/home"`; + const { body, trailer } = applyUrlLut(text); + expect(body).toContain('url="/home"'); + expect(trailer).toBe(""); + }); + + it("tokenises a URL that appears 2+ times (dedup)", () => { + const repeated = "/s?k=rgb+mechanical+keyboards&category=electronics"; + const text = [ + `@1.0 root "Page" url="${repeated}"`, + ` @1.1 link "A" url="${repeated}"`, + ` @1.2 link "B" url="${repeated}"`, + ].join("\n"); + const { body, trailer, urlMap } = applyUrlLut(text); + expect(body).not.toContain(`url="${repeated}"`); + expect(body).toMatch(/url=\$u\d/); + expect(urlMap.size).toBe(1); + const [token, url] = [...urlMap.entries()][0]; + expect(url).toBe(repeated); + expect(trailer).toContain(`${token} ${repeated}`); + // Full URL in trailer — not hidden form + expect(trailer).not.toContain("[hidden"); + }); + + it("assigns tokens in tree-walk (first-occurrence) order", () => { + const urlA = "/page-a?x=1&y=2&z=3&lots=of&params=here"; + const urlB = "/page-b?x=1&y=2&z=3&lots=of&params=here"; + const text = [ + `@1.0 root "Page"`, + ` @1.1 link "A" url="${urlA}"`, + ` @1.2 link 
"B" url="${urlB}"`, + ` @1.3 link "A2" url="${urlA}"`, + ` @1.4 link "B2" url="${urlB}"`, + ].join("\n"); + const { urlMap } = applyUrlLut(text); + const tokens = [...urlMap.keys()]; + expect(tokens[0]).toBe("$u1"); + expect(urlMap.get("$u1")).toBe(urlA); + expect(tokens[1]).toBe("$u2"); + expect(urlMap.get("$u2")).toBe(urlB); + }); + + it("tokenises a long URL appearing once as a whale (hidden in trailer)", () => { + const whale = "/sspa/click?" + "x".repeat(200); + const text = `@1.0 root "Page"\n @1.1 link "Ad" url="${whale}"`; + const { body, trailer, urlMap } = applyUrlLut(text); + expect(body).toMatch(/url=\$u\d/); + expect(urlMap.get("$u1")).toBe(whale); + // Hidden form in trailer + expect(trailer).toContain("[hidden"); + expect(trailer).toContain(`${whale.length}b`); + expect(trailer).not.toContain(whale); + }); + + it("whale trailer includes a path-stem preview", () => { + const whale = "/sspa/click?spc=" + "A".repeat(200); + const text = `@1.0 root "Page"\n @1.1 link "Ad" url="${whale}"`; + const { trailer } = applyUrlLut(text); + expect(trailer).toContain("→ /sspa/click?spc="); + expect(trailer).toContain("…"); + }); + + it("cross-host whale includes host in the preview (no scheme)", () => { + const whale = "https://aax-us-east.amazon.com/x/c/" + "B".repeat(200); + const text = `@1.0 root "Page"\n @1.1 link "Ad" url="${whale}"`; + const { trailer } = applyUrlLut(text); + // Preview should start with host, not https:// + expect(trailer).toMatch(/→ aax-us-east\.amazon\.com/); + }); + + it("dedup wins over whale when URL is both long and repeated", () => { + const url = "/long?" 
+ "x".repeat(200); + const text = [ + `@1.0 root "Page"`, + ` @1.1 link "A" url="${url}"`, + ` @1.2 link "B" url="${url}"`, + ].join("\n"); + const { trailer } = applyUrlLut(text); + // Full URL printed in trailer — not the hidden form + expect(trailer).toContain(url); + expect(trailer).not.toContain("[hidden"); + }); + + it("body + trailer length does not exceed input length", () => { + const repeated = "/s?k=rgb+mechanical+keyboards"; + const text = [ + `@1.0 root "Page" url="${repeated}"`, + ` @1.1 link "A" url="${repeated}"`, + ` @1.2 link "B" url="${repeated}"`, + ` @1.3 link "C" url="https://other.com/` + "x".repeat(200) + `"`, + ].join("\n"); + const { body, trailer } = applyUrlLut(text); + expect(body.length + trailer.length).toBeLessThanOrEqual(text.length); + }); + + it("trailer only lists URLs visible in the supplied text (truncation interaction)", () => { + const urlInBody = "/visible?k=keyboard"; + const urlTruncated = "/hidden?k=mouse"; + // Simulate: body text was already truncated to contain only the first URL + const truncatedText = `@1.0 root "Page"\n @1.1 link "A" url="${urlInBody}"\n @1.2 link "A" url="${urlInBody}"`; + const { trailer } = applyUrlLut(truncatedText); + expect(trailer).toContain(urlInBody); + expect(trailer).not.toContain(urlTruncated); + }); +}); + +// --- resolveUrl --- + +describe("resolveUrl", () => { + it("resolves a $uN token via urlMap", () => { + const urlMap = new Map([["$u1", "https://example.com/foo"]]); + expect(resolveUrl("", urlMap, "$u1")).toBe("https://example.com/foo"); + }); + + it("resolves $uN with leading @ stripped", () => { + const urlMap = new Map([["$u2", "/bar"]]); + expect(resolveUrl("", urlMap, "$u2")).toBe("/bar"); + }); + + it("returns null for an unknown token", () => { + expect(resolveUrl("", new Map(), "$u99")).toBeNull(); + }); + + it("resolves a plain ref to its url= attribute in the body", () => { + const body = `@1.0 root "Page"\n @1.1 link "Home" url="/home"`; + expect(resolveUrl(body, new 
Map(), "1.1")).toBe("/home"); + expect(resolveUrl(body, new Map(), "@1.1")).toBe("/home"); + }); + + it("resolves a ref whose url= was tokenised, via urlMap", () => { + const urlMap = new Map([["$u1", "/the-real-url"]]); + const body = `@1.0 root "Page"\n @1.1 link "Ad" url=$u1`; + expect(resolveUrl(body, urlMap, "@1.1")).toBe("/the-real-url"); + }); + + it("returns null when the ref has no url= attribute", () => { + const body = `@1.0 root "Page"\n @1.1 button "Click"`; + expect(resolveUrl(body, new Map(), "@1.1")).toBeNull(); + }); + + it("returns null when the ref does not exist in the body", () => { + const body = `@1.0 root "Page"`; + expect(resolveUrl(body, new Map(), "@9.9")).toBeNull(); + }); +}); diff --git a/test/tasks/README.md b/test/tasks/README.md new file mode 100644 index 0000000..60e271c --- /dev/null +++ b/test/tasks/README.md @@ -0,0 +1,31 @@ +# Agent Task Cost Benchmarks + +These tasks measure the real token cost of agent-driven browser automation +under two snapshot formats — **compact** (default) and **raw** (MCP verbatim). + +## How to run a task + +1. Open a **fresh Claude Code session** (clear context or start new). +2. Paste the entire contents of a task file as your first message. +3. Let Claude complete the task. +4. Run `/cost` to record the session cost. +5. Repeat with the matching `*-raw` variant. + +Compare the `/cost` output between the two runs. The compact vs raw difference +shows the real-world token saving an agent gets on that page type. + +## Tasks + +| File | Scenario | Expected saving | +|---|---|---| +| `amazon-search-compact.md` / `*-raw.md` | Amazon product search → top 5 results | High (many long tracking URLs) | +| `hn-top-stories-compact.md` / `*-raw.md` | Hacker News front page → top 5 stories | Medium (link-heavy, clean URLs) | + +## Notes + +- Results will differ between runs (dynamic pages). That's fine — the goal is + cost comparison, not result comparison. 
+- Both variants do the same task; only the snapshot format changes. +- The raw variants explicitly pass `--raw` on every command so the agent sees + the uncompressed MCP output. +- Record your `/cost` output next to the task file for tracking over time. diff --git a/test/tasks/amazon-search-compact.md b/test/tasks/amazon-search-compact.md new file mode 100644 index 0000000..595cc88 --- /dev/null +++ b/test/tasks/amazon-search-compact.md @@ -0,0 +1,28 @@ +# Task: Amazon product search — compact mode + +**Mode:** compact (default opera-browser-cli output) + +Use `opera-browser-cli` to search Amazon for "rgb mechanical keyboards" and +return the top 5 results. For each result include: +- Product title +- Price (if visible) +- Product URL (resolve via `opera-browser-cli url @<ref>` if the URL is + tokenised as `$uN` in the snapshot) + +## Rules + +- Use `opera-browser-cli` for all browser interaction (it is available in PATH). +- Do NOT pass `--raw` to any command — use the default compact output. +- Skip sponsored/ad results; only list organic results. +- If you need to scroll to find more results, do so. +- When you are done, output the 5 results in this exact format and nothing else: + +``` +1. <title> | <price or "n/a"> | <url> +2. <title> | <price or "n/a"> | <url> +3. <title> | <price or "n/a"> | <url> +4. <title> | <price or "n/a"> | <url> +5. <title> | <price or "n/a"> | <url> +``` + +Start now. diff --git a/test/tasks/amazon-search-raw.md b/test/tasks/amazon-search-raw.md new file mode 100644 index 0000000..054d278 --- /dev/null +++ b/test/tasks/amazon-search-raw.md @@ -0,0 +1,31 @@ +# Task: Amazon product search — raw mode + +**Mode:** raw (uncompressed MCP output) + +Use `opera-browser-cli` to search Amazon for "rgb mechanical keyboards" and +return the top 5 results. For each result include: +- Product title +- Price (if visible) +- Product URL + +## Rules + +- Use `opera-browser-cli` for all browser interaction (it is available in PATH). 
+- Pass `--raw` on EVERY command that produces a snapshot, e.g.: + - `opera-browser-cli open <url> --raw` + - `opera-browser-cli snapshot --raw` + - `opera-browser-cli scroll down --raw` + - `opera-browser-cli click @<ref> --raw` +- Skip sponsored/ad results; only list organic results. +- If you need to scroll to find more results, do so. +- When you are done, output the 5 results in this exact format and nothing else: + +``` +1. <title> | <price or "n/a"> | <url> +2. <title> | <price or "n/a"> | <url> +3. <title> | <price or "n/a"> | <url> +4. <title> | <price or "n/a"> | <url> +5. <title> | <price or "n/a"> | <url> +``` + +Start now. diff --git a/test/tasks/hn-top-stories-compact.md b/test/tasks/hn-top-stories-compact.md new file mode 100644 index 0000000..63e3c4a --- /dev/null +++ b/test/tasks/hn-top-stories-compact.md @@ -0,0 +1,24 @@ +# Task: Hacker News top stories — compact mode + +**Mode:** compact (default opera-browser-cli output) + +Use `opera-browser-cli` to open Hacker News and return the top 5 stories. +For each story include: +- Story title +- Domain/source (e.g. "github.com") +- Points and comment count (if visible) +- URL of the story itself (not the HN comments page) + +## Rules + +- Use `opera-browser-cli` for all browser interaction (it is available in PATH). +- Do NOT pass `--raw` to any command — use the default compact output. +- If a URL is tokenised as `$uN`, resolve it with `opera-browser-cli url $uN`. +- When you are done, output the 5 results in this exact format and nothing else: + +``` +1. <title> | <domain> | <points> pts <comments> comments | <url> +2. ... +``` + +Start now. diff --git a/test/tasks/hn-top-stories-raw.md b/test/tasks/hn-top-stories-raw.md new file mode 100644 index 0000000..8774ce6 --- /dev/null +++ b/test/tasks/hn-top-stories-raw.md @@ -0,0 +1,27 @@ +# Task: Hacker News top stories — raw mode + +**Mode:** raw (uncompressed MCP output) + +Use `opera-browser-cli` to open Hacker News and return the top 5 stories. 
+For each story include: +- Story title +- Domain/source (e.g. "github.com") +- Points and comment count (if visible) +- URL of the story itself (not the HN comments page) + +## Rules + +- Use `opera-browser-cli` for all browser interaction (it is available in PATH). +- Pass `--raw` on EVERY command that produces a snapshot, e.g.: + - `opera-browser-cli open <url> --raw` + - `opera-browser-cli snapshot --raw` + - `opera-browser-cli scroll down --raw` + - `opera-browser-cli click @<ref> --raw` +- When you are done, output the 5 results in this exact format and nothing else: + +``` +1. <title> | <domain> | <points> pts <comments> comments | <url> +2. ... +``` + +Start now.