Agent API and MCP Developer Reference

This document is the durable reference for BenchLocal's local agent surface:

HTTP JSON commands for state reads and mutations
Server-Sent Events for live progress and state changes
MCP Streamable HTTP for standard agent tool calls

Keep this file updated whenever a UI feature becomes agent-controllable.

Design Contract

BenchLocal is a desktop benchmark app first. The agent surface must expose the same operations the UI exposes without creating a second execution path.

Core rules:

The Electron renderer keeps using IPC through window.benchlocal.
The Agent API and MCP call the same main-process controller as IPC.
Commands use HTTP JSON or MCP tools.
Live progress uses SSE events or MCP recent-event polling.
Long-running benchmark commands return quickly and continue in the UI.
Provider secrets are never returned by HTTP, SSE, MCP resources, or MCP tool results.
Agent access is explicit, token-protected, and local-first.
BenchLocal does not execute arbitrary shell commands for agents.

Implementation files:

app/src/main/controller.ts
  Shared main-process operations used by IPC, HTTP, and MCP.

app/src/main/agent-server.ts
  Local HTTP server, auth, SSE, OpenAPI, agent guide, and route adapters.

app/src/main/agent-mcp.ts
  MCP Streamable HTTP server, resources, prompt, and benchlocal_* tools.

packages/benchlocal-core/src/agent-protocol.ts
  Shared Agent API event and request/response types.

packages/benchlocal-core/src/config.ts
  Provider, model, and Agent Access config types.

packages/benchlocal-core/src/workspaces.ts
  Workspace, tab, model selection, execution mode, and sampling state.

Runtime Model

Agent Access can be enabled from Settings > Agent Access.

The server listens on:

127.0.0.1 when access is localhost
0.0.0.0 when access is local_network

The UI always shows a local client URL like:

http://127.0.0.1:<port>

Agents on another device must use the host machine's LAN IP when local_network is enabled.
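
The binding rule above can be sketched as a pair of small functions. This is only an illustration: the helper names `listenHostFor`, `localClientUrl`, and `remoteClientUrl` are hypothetical and are not taken from the BenchLocal codebase.

```typescript
// Hypothetical helpers; the names are illustrative, not BenchLocal APIs.
type AgentAccess = "localhost" | "local_network";

// Address the server binds to for each access mode.
function listenHostFor(access: AgentAccess): string {
  return access === "local_network" ? "0.0.0.0" : "127.0.0.1";
}

// URL the UI shows for clients on the same machine, whatever the access mode.
function localClientUrl(port: number): string {
  return `http://127.0.0.1:${port}`;
}

// URL an agent on another device would use; lanIp is the host machine's LAN address.
// Returns null when the server is only reachable from localhost.
function remoteClientUrl(access: AgentAccess, lanIp: string, port: number): string | null {
  return access === "local_network" ? `http://${lanIp}:${port}` : null;
}
```

An agent running on the same machine can always use the 127.0.0.1 URL; only remote agents need the LAN form.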

The port is either:

the configured port
an automatically assigned port when no port is configured

Environment overrides:

BENCHLOCAL_AGENT_API=1
BENCHLOCAL_AGENT_PORT=50060
BENCHLOCAL_AGENT_ACCESS=localhost
BENCHLOCAL_AGENT_ACCESS=local_network

Token storage:

~/.benchlocal/agent-session.json

The session file contains the bearer token and is written with owner-only permissions when created by BenchLocal.

Authentication

All endpoints except GET /v1/health require:

Authorization: Bearer <token>

The token is shown in Settings > Agent Access and can be regenerated there.

Unauthorized requests return:

{
  "error": {
    "message": "Unauthorized.",
    "statusCode": 401
  }
}

MCP requests also enforce an Origin guard. If an Origin header is present, its host must be one of:

localhost
127.0.0.1
::1
[::1]

This is intentionally stricter than normal HTTP routes because MCP clients may be browser-adjacent.
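
A minimal sketch of such an Origin guard follows. The function name `isAllowedMcpOrigin` and the regular expression are assumptions made for illustration; the real check in agent-mcp.ts may parse the header differently.

```typescript
// Hosts the document lists as acceptable. The unbracketed "::1" entry is kept to
// mirror that list even though an Origin header carries IPv6 hosts in brackets.
const LOCAL_ORIGIN_HOSTS = new Set(["localhost", "127.0.0.1", "::1", "[::1]"]);

// Returns true when the request may proceed: no Origin header, or a local one.
function isAllowedMcpOrigin(originHeader: string | undefined): boolean {
  if (originHeader === undefined) return true; // non-browser clients usually send no Origin
  // Expect scheme://host[:port]; the host is a bracketed IPv6 literal or a plain name.
  const match = /^[a-z][a-z0-9+.-]*:\/\/(\[[^\]]+\]|[^\/:]+)(:\d+)?$/i.exec(originHeader);
  if (match === null) return false; // malformed or opaque origins such as "null"
  const host = (match[1] ?? "").toLowerCase();
  return LOCAL_ORIGIN_HOSTS.has(host);
}
```

Comparing the whole host, rather than testing for a "localhost" prefix, keeps look-alike hosts such as localhost.example.com out.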

URL and JSON Rules

Path IDs must be URL-encoded. This is required for model IDs and provider IDs that contain :, /, spaces, or UUID-like provider prefixes.

Example:

MODEL_ID='huggingface:Qwen/Qwen3.5-9B'
curl "$BENCHLOCAL_AGENT_BASE_URL/v1/models/$(node -e 'console.log(encodeURIComponent(process.argv[1]))' "$MODEL_ID")" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN"
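
The same encoding can be done inside a TypeScript agent. The helper names `modelPath` and `providerPath` below are hypothetical; only `encodeURIComponent` and the route shapes come from this document.

```typescript
// Encode a single path segment so ":", "/", and spaces survive routing.
function modelPath(modelId: string): string {
  return `/v1/models/${encodeURIComponent(modelId)}`;
}

function providerPath(providerId: string): string {
  return `/v1/providers/${encodeURIComponent(providerId)}`;
}

// modelPath("huggingface:Qwen/Qwen3.5-9B")
//   -> "/v1/models/huggingface%3AQwen%2FQwen3.5-9B"
```

Encode each ID on its own rather than the whole URL, so the slashes that separate route segments stay intact.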

JSON write endpoints reject unknown fields. This is deliberate:

agents get fast feedback when they drift from the contract
accidental writes do not silently mutate future config
UI and API payloads stay aligned

Request bodies are limited to 1 MB.

Errors use:

{
  "error": {
    "message": "Human-readable error.",
    "statusCode": 400
  }
}
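
The unknown-field rule and the error envelope can be sketched together. This is not the server's `assertOnlyKeys`; the function names and the error message wording below are assumptions chosen for illustration.

```typescript
type ErrorEnvelope = { error: { message: string; statusCode: number } };

// List the keys in a parsed JSON body that the route does not allow.
function unknownKeys(body: Record<string, unknown>, allowed: readonly string[]): string[] {
  return Object.keys(body).filter((key) => !allowed.includes(key));
}

// Returns null when the body is acceptable, otherwise a 400 error envelope.
// The message text is hypothetical; only the envelope shape comes from this document.
function rejectUnknownFields(
  body: Record<string, unknown>,
  allowed: readonly string[]
): ErrorEnvelope | null {
  const extra = unknownKeys(body, allowed);
  if (extra.length === 0) return null;
  return { error: { message: `Unknown field(s): ${extra.join(", ")}.`, statusCode: 400 } };
}
```

An agent can run the same check before sending a request to catch contract drift without a round trip.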

Discovery Endpoints

`GET /v1/health`

No auth required.

Returns runtime status and documentation links.

Example response:

{
  "ok": true,
  "benchLocalVersion": "0.2.6",
  "agent": {
    "enabled": true,
    "running": true,
    "access": "localhost",
    "host": "127.0.0.1",
    "port": 50060,
    "baseUrl": "http://127.0.0.1:50060",
    "connectedClients": 0,
    "message": "Agent API is listening on http://127.0.0.1:50060.",
    "startedAt": "2026-05-18T00:00:00.000Z"
  },
  "docs": {
    "agentGuide": "/v1/agent-guide",
    "openapi": "/v1/openapi.json",
    "mcp": "/mcp"
  }
}

`GET /v1/agent-guide`

Auth required.

Returns agent-readable Markdown generated by the running app. This is intentionally shorter than this developer reference and is meant for runtime agent bootstrapping.

`GET /v1/openapi.json`

Auth required.

Returns the OpenAPI document generated by the running app.

The OpenAPI document is useful for endpoint discovery, but the schemas are intentionally lightweight today. This doc remains the source of implementation guidance.

`POST /mcp`

Auth required.

Standard MCP Streamable HTTP endpoint. See MCP Surface.

POST /v1/mcp is also accepted.

SSE Event Stream

`GET /v1/events`

Auth required.

Opens a Server-Sent Events stream.

curl -N \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
  "$BENCHLOCAL_AGENT_BASE_URL/v1/events"

Initial response includes a comment:

: BenchLocal agent event stream

Each event is sent as:

id: evt-...
event: benchpack.run.event
data: {"eventId":"evt-...","createdAt":"...","type":"benchpack.run.event","payload":{}}

Event envelope:

type BenchLocalAgentEvent<TPayload = unknown> = {
  eventId: string;
  createdAt: string;
  type: BenchLocalAgentEventType;
  payload: TPayload;
};
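
A client that reads the stream by hand has to split it into frames and decode the data line. The sketch below assumes the frame layout shown above (one JSON object per event); `splitFrames` and `parseSseFrame` are hypothetical names, and the envelope type is repeated with a plain string `type` field to keep the block self-contained.

```typescript
type AgentEventEnvelope = {
  eventId: string;
  createdAt: string;
  type: string;
  payload: unknown;
};

// Split buffered stream text into complete frames plus the unfinished remainder.
function splitFrames(buffer: string): { frames: string[]; rest: string } {
  const parts = buffer.split(/\r?\n\r?\n/);
  const rest = parts.pop() ?? "";
  return { frames: parts.filter((part) => part.length > 0), rest };
}

// Parse one frame into an event envelope.
// Returns null for comment-only frames such as ": BenchLocal agent event stream".
function parseSseFrame(frame: string): AgentEventEnvelope | null {
  const dataLines: string[] = [];
  for (const line of frame.split(/\r?\n/)) {
    if (line.startsWith(":")) continue; // SSE comment line
    if (line.startsWith("data:")) {
      // SSE strips one optional space after the colon.
      dataLines.push(line.slice(5).replace(/^ /, ""));
    }
  }
  if (dataLines.length === 0) return null;
  return JSON.parse(dataLines.join("\n")) as AgentEventEnvelope;
}
```

Keep `rest` and prepend it to the next chunk, since a network read can end in the middle of a frame.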

Current event types:

agent.state.updated
config.updated
workspace.updated
models.availability.updated
benchpack.run.started
benchpack.run.event
benchpack.run.finished
benchpack.run.error
verifier.event

Important payloads:

type BenchLocalAgentWorkspaceUpdatedPayload = {
  state: BenchLocalWorkspaceState;
};

type BenchLocalAgentConfigUpdatedPayload = {
  config: BenchLocalAgentSafeConfig;
};

type BenchLocalAgentModelAvailabilityPayload = {
  availability: ModelAvailability[];
};

type BenchLocalAgentRunEventPayload = {
  tabId: string;
  benchPackId: string;
  event: ProgressEvent;
};

benchpack.run.event wraps the Bench Pack host progress event. The nested event may include scenario start, model progress, scenario result, run finish, run error, and other Bench Pack progress messages.

SSE is read-only. Do not add command semantics to SSE.

Shared Types

Provider

type BenchLocalProviderKind =
  | "openrouter"
  | "huggingface"
  | "ollama"
  | "llamacpp"
  | "mlx"
  | "lmstudio"
  | "pico"
  | "openai_compatible";

type BenchLocalProviderConfig = {
  kind: BenchLocalProviderKind;
  name: string;
  enabled: boolean;
  base_url: string;
  api_key?: string;
  api_key_env?: string;
};

API and MCP provider reads redact api_key and expose:

type SafeProvider = Omit<BenchLocalProviderConfig, "api_key"> & {
  has_api_key: boolean;
  has_api_key_env: boolean;
};
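
A redaction helper consistent with these types might look like the sketch below. The name `toSafeProvider` is hypothetical, `kind` is simplified to a string, and treating an empty string as "not set" is an assumption rather than documented behavior.

```typescript
type ProviderConfig = {
  kind: string;
  name: string;
  enabled: boolean;
  base_url: string;
  api_key?: string;
  api_key_env?: string;
};

type SafeProviderView = Omit<ProviderConfig, "api_key"> & {
  has_api_key: boolean;
  has_api_key_env: boolean;
};

function toSafeProvider(provider: ProviderConfig): SafeProviderView {
  // Pull api_key out first so it can never be spread into the result.
  const { api_key, ...rest } = provider;
  return {
    ...rest,
    // Assumption: an empty string counts as "no key configured".
    has_api_key: typeof api_key === "string" && api_key.length > 0,
    has_api_key_env: typeof rest.api_key_env === "string" && rest.api_key_env.length > 0
  };
}
```

The environment variable name stays visible because it is a reference, not a secret; only the key value is dropped.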

Model

type BenchLocalModelConfig = {
  id: string;
  provider: string;
  model: string;
  label: string;
  group: string;
  enabled: boolean;
};

provider is the provider ID, not the provider display name.

Workspace Tab

type BenchLocalExecutionMode =
  | "serial"
  | "serial_by_model"
  | "parallel_by_model"
  | "parallel_by_test_case"
  | "full_parallel";

type BenchLocalWorkspaceTabModelSelection = {
  modelId: string;
  alias?: string;
};

type BenchLocalWorkspaceTab = {
  id: string;
  title: string;
  benchPackId: string | null;
  loadedRunId?: string | null;
  focusedScenarioId: string | null;
  modelSelections: BenchLocalWorkspaceTabModelSelection[];
  samplingOverrides?: GenerationRequest;
  executionMode: BenchLocalExecutionMode;
  runsPerTest: number;
  createdAt: string;
  updatedAt: string;
};

Generation

type GenerationRequest = {
  temperature?: number;
  top_p?: number;
  top_k?: number;
  min_p?: number;
  repetition_penalty?: number;
  presence_penalty?: number;
  request_timeout_seconds?: number;
};

Default request timeout is defined in core as DEFAULT_BENCHLOCAL_REQUEST_TIMEOUT_SECONDS.

HTTP API Reference

All examples assume:

export BENCHLOCAL_AGENT_BASE_URL="http://127.0.0.1:50060"
export BENCHLOCAL_AGENT_TOKEN="<token from Settings > Agent Access>"

Read State

`GET /v1/config`

Returns redacted BenchLocal config.

Response:

{
  "config": {
    "schema_version": 1,
    "ui": { "theme": "system" },
    "agent": { "enabled": true, "access": "localhost", "port": 50060 },
    "providers": {
      "huggingface": {
        "kind": "huggingface",
        "name": "Hugging Face",
        "enabled": true,
        "base_url": "https://router.huggingface.co/v1",
        "api_key_env": "HF_TOKEN",
        "has_api_key": false,
        "has_api_key_env": true
      }
    },
    "models": []
  }
}

`GET /v1/workspaces`

Returns the full workspace state:

{
  "path": "/Users/me/.benchlocal/state.json",
  "created": false,
  "state": {
    "schema_version": 1,
    "activeWorkspaceId": "workspace-main",
    "workspaceOrder": ["workspace-main"],
    "workspaces": {},
    "tabs": {}
  }
}

`GET /v1/benchpacks`

Returns installed Bench Packs and scenario metadata:

{
  "benchPacks": []
}

`GET /v1/benchpacks/registry`

Returns registry entries for installable Bench Packs:

{
  "registry": []
}

`GET /v1/providers`

Returns configured providers with secrets redacted:

{
  "providers": {
    "huggingface": {
      "kind": "huggingface",
      "name": "Hugging Face",
      "enabled": true,
      "base_url": "https://router.huggingface.co/v1",
      "api_key_env": "HF_TOKEN",
      "has_api_key": false,
      "has_api_key_env": true
    }
  }
}

`GET /v1/providers/:providerId`

Returns one redacted provider:

{
  "providerId": "huggingface",
  "provider": {
    "kind": "huggingface",
    "name": "Hugging Face",
    "enabled": true,
    "base_url": "https://router.huggingface.co/v1",
    "api_key_env": "HF_TOKEN",
    "has_api_key": false,
    "has_api_key_env": true
  }
}

`GET /v1/providers/:providerId/models/discover`

Discovers provider models when the provider supports browsing.

Response:

{
  "models": []
}

This may call an external provider API and can fail when credentials or network access are unavailable.

`GET /v1/models`

Returns configured models:

{
  "models": [
    {
      "id": "huggingface:Qwen/Qwen3.5-9B",
      "provider": "huggingface",
      "model": "Qwen/Qwen3.5-9B",
      "label": "Qwen3.5-9B",
      "group": "primary",
      "enabled": true
    }
  ]
}

`GET /v1/models/:modelId`

Returns one configured model:

{
  "model": {
    "id": "huggingface:Qwen/Qwen3.5-9B",
    "provider": "huggingface",
    "model": "Qwen/Qwen3.5-9B",
    "label": "Qwen3.5-9B",
    "group": "primary",
    "enabled": true
  }
}

`GET /v1/models/availability`

Checks model availability for all configured models.

Response:

{
  "availability": [
    {
      "modelId": "huggingface:Qwen/Qwen3.5-9B",
      "available": true,
      "checkedAt": "2026-05-18T00:00:00.000Z"
    }
  ]
}

Exact availability fields are defined by ModelAvailability in packages/benchlocal-core/src/protocol.ts.

`GET /v1/runs/active`

Returns active benchmark runs:

{
  "activeRuns": []
}

`GET /v1/verifiers`

Returns verifier runtime status:

{
  "verifiers": []
}

`GET /v1/benchpacks/:benchPackId/history`

Returns run history entries for a Bench Pack:

{
  "history": []
}

`GET /v1/benchpacks/:benchPackId/history/:runId`

Returns a saved run summary:

{
  "run": {}
}

Providers

`POST /v1/providers`

Creates a provider.

Allowed fields:

{
  id?: string;
  kind: BenchLocalProviderKind;
  name?: string;
  enabled?: boolean;
  base_url: string;
  api_key?: string;
  api_key_env?: string;
}

Example:

curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/providers" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
  -H "content-type: application/json" \
  -d '{
    "id": "huggingface",
    "kind": "huggingface",
    "name": "Hugging Face",
    "enabled": true,
    "base_url": "https://router.huggingface.co/v1",
    "api_key_env": "HF_TOKEN"
  }'

Returns 201.

`PATCH /v1/providers/:providerId`

Patches a provider.

Allowed fields:

{
  kind?: BenchLocalProviderKind;
  name?: string;
  enabled?: boolean;
  base_url?: string;
  api_key?: string | null;
  api_key_env?: string | null;
}

Use null for api_key or api_key_env to clear stored values when supported by the controller.
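
The three states of a patch field (omitted, null, string) can be sketched as follows. `applyProviderPatch` is a hypothetical helper, `kind` is left out for brevity, and whether the controller clears on null is subject to the "when supported" caveat above.

```typescript
type ProviderRecord = {
  name: string;
  enabled: boolean;
  base_url: string;
  api_key?: string;
  api_key_env?: string;
};

type ProviderPatch = {
  name?: string;
  enabled?: boolean;
  base_url?: string;
  api_key?: string | null;
  api_key_env?: string | null;
};

function applyProviderPatch(current: ProviderRecord, patch: ProviderPatch): ProviderRecord {
  const next: ProviderRecord = { ...current }; // copy so the stored record is not mutated
  if (patch.name !== undefined) next.name = patch.name;
  if (patch.enabled !== undefined) next.enabled = patch.enabled;
  if (patch.base_url !== undefined) next.base_url = patch.base_url;
  // null clears the stored value; a string replaces it; an omitted field leaves it alone.
  if (patch.api_key === null) delete next.api_key;
  else if (patch.api_key !== undefined) next.api_key = patch.api_key;
  if (patch.api_key_env === null) delete next.api_key_env;
  else if (patch.api_key_env !== undefined) next.api_key_env = patch.api_key_env;
  return next;
}
```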

`DELETE /v1/providers/:providerId`

Deletes a provider.

Important behavior:

deletes linked models
removes linked models from tab selections
broadcasts config/workspace updates through the controller
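
The cascade described above can be pictured as a pure function over models and tabs. This sketch uses trimmed-down record types and a hypothetical `deleteProviderCascade` name; the actual controller also persists state and broadcasts events, which are omitted here.

```typescript
type ModelRecord = { id: string; provider: string };
type TabRecord = {
  id: string;
  modelSelections: Array<{ modelId: string; alias?: string }>;
};

function deleteProviderCascade(
  providerId: string,
  models: ModelRecord[],
  tabs: TabRecord[]
): { models: ModelRecord[]; tabs: TabRecord[]; removedModelIds: string[] } {
  // Every model linked to the provider goes away with it.
  const removed = new Set(
    models.filter((model) => model.provider === providerId).map((model) => model.id)
  );
  return {
    models: models.filter((model) => !removed.has(model.id)),
    // Tabs keep their other selections; only the deleted models are dropped.
    tabs: tabs.map((tab) => ({
      ...tab,
      modelSelections: tab.modelSelections.filter((sel) => !removed.has(sel.modelId))
    })),
    removedModelIds: Array.from(removed)
  };
}
```

Because the deletion reaches into tab state, an agent should re-read workspaces after calling this endpoint rather than trusting a cached copy.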

`POST /v1/providers/:providerId/duplicate`

Duplicates one provider record.

Important behavior:

duplicates only the provider
does not duplicate linked models

Models

`POST /v1/models`

Creates a model.

Allowed fields:

{
  id?: string;
  provider: string;
  model: string;
  label?: string;
  group?: string;
  enabled?: boolean;
}

Example:

curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/models" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
  -H "content-type: application/json" \
  -d '{
    "id": "huggingface:Qwen/Qwen3.5-9B",
    "provider": "huggingface",
    "model": "Qwen/Qwen3.5-9B",
    "label": "Qwen3.5-9B",
    "group": "primary",
    "enabled": true
  }'

Returns 201.

`PATCH /v1/models/:modelId`

Patches a model.

Allowed fields:

{
  id?: string;
  provider?: string;
  model?: string;
  label?: string;
  group?: string;
  enabled?: boolean;
}

If the ID changes, the controller must preserve consistency with tab selections.

`DELETE /v1/models/:modelId`

Deletes one model and removes it from tab selections.

`POST /v1/models/:modelId/duplicate`

Duplicates one model record.

`POST /v1/models/availability/refresh`

Refreshes model availability globally or for selected model IDs.

Allowed fields:

{
  modelIds?: string[];
}

Example:

curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/models/availability/refresh" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
  -H "content-type: application/json" \
  -d '{"modelIds":["huggingface:Qwen/Qwen3.5-9B"]}'

Response:

{
  "availability": []
}

Also emits models.availability.updated when the controller broadcasts availability changes.

Workspace and Tabs

`POST /v1/workspaces/:workspaceId/tabs`

Creates a workspace tab.

Allowed fields:

{
  benchPackId?: string | null;
  title?: string;
  modelSelections?: Array<{ modelId: string; alias?: string }>;
}

Example:

curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/workspaces/$WORKSPACE_ID/tabs" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
  -H "content-type: application/json" \
  -d '{
    "benchPackId": "toolcall-15",
    "title": "ToolCall-15",
    "modelSelections": [
      { "modelId": "huggingface:Qwen/Qwen3.5-9B" }
    ]
  }'

Returns 201 and the updated workspace state.

`PATCH /v1/tabs/:tabId`

Patches a tab.

Allowed fields:

{
  title?: string;
  focusedScenarioId?: string | null;
  modelSelections?: Array<{ modelId: string; alias?: string }>;
  samplingOverrides?: GenerationRequest;
  executionMode?: BenchLocalExecutionMode;
  runsPerTest?: number;
}

Use this for compound updates. Prefer the more specific endpoints below for common UI actions because they document intent better.

`POST /v1/tabs/:tabId/select-benchpack`

Selects or clears a Bench Pack for a tab.

Allowed fields:

{
  benchPackId: string | null;
  title?: string;
}

`POST /v1/tabs/:tabId/select-models`

Selects models for a tab.

Allowed fields:

{
  modelIds?: string[];
  selections?: Array<{ modelId: string; alias?: string }>;
}

modelIds is the compact form. selections is the explicit form and supports aliases.

Example:

curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/tabs/$TAB_ID/select-models" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
  -H "content-type: application/json" \
  -d '{
    "modelIds": [
      "huggingface:Qwen/Qwen3.5-9B",
      "huggingface:qwen3.6:35b-a3b"
    ]
  }'
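
The relationship between the two forms can be sketched as a normalization step. `toSelections` is a hypothetical helper, and preferring `selections` when both fields are sent is an assumption; this document does not say how the server resolves that case.

```typescript
type ModelSelection = { modelId: string; alias?: string };
type SelectModelsBody = { modelIds?: string[]; selections?: ModelSelection[] };

// Turn either request form into the explicit selection list stored on a tab.
function toSelections(body: SelectModelsBody): ModelSelection[] {
  // Assumption: the explicit form wins when both are present.
  if (body.selections !== undefined) {
    return body.selections.map((sel) => ({ ...sel }));
  }
  // The compact form carries no aliases.
  return (body.modelIds ?? []).map((modelId) => ({ modelId }));
}
```

To avoid depending on that assumption, send only one of the two fields per request.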

`POST /v1/tabs/:tabId/sampling`

Sets tab sampling overrides.

Allowed fields:

{
  samplingOverrides: GenerationRequest;
}

Example:

{
  "samplingOverrides": {
    "temperature": 0,
    "top_p": 1,
    "request_timeout_seconds": 500
  }
}

`POST /v1/tabs/:tabId/execution-mode`

Sets tab execution mode and optionally runs-per-test.

Allowed fields:

{
  executionMode: BenchLocalExecutionMode;
  runsPerTest?: number;
}

Execution mode values:

serial
serial_by_model
parallel_by_model
parallel_by_test_case
full_parallel

`POST /v1/tabs/:tabId/runs-per-test`

Sets tab runs-per-test.

Allowed fields:

{
  runsPerTest: number;
}

`POST /v1/tabs/:tabId/models/availability/refresh`

Refreshes availability for a tab.

Allowed fields:

{
  modelIds?: string[];
}

If modelIds is omitted, the selected model IDs from the tab are used.

Runs

Run commands are asynchronous unless otherwise noted. They return quickly, while detailed progress is emitted through GET /v1/events and visible in the desktop UI.

`POST /v1/tabs/:tabId/runs`

Starts a run for a tab.

Allowed fields:

{
  benchPackId?: string;
  modelIds?: string[];
  executionMode?: BenchLocalExecutionMode;
  runsPerTest?: number;
  generation?: GenerationRequest;
}

Resolution behavior:

benchPackId defaults to the tab's selected Bench Pack
modelIds defaults to the tab's selected models
executionMode defaults to the tab's execution mode
runsPerTest defaults to the tab's runs-per-test
generation defaults to the tab's sampling overrides
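
These defaults amount to a field-by-field fallback from the request body to the tab. The sketch below uses reduced types and a hypothetical `resolveRun` name; replacing `generation` as a whole object, instead of merging it with the tab's overrides key by key, is an assumption.

```typescript
type ExecutionMode =
  | "serial"
  | "serial_by_model"
  | "parallel_by_model"
  | "parallel_by_test_case"
  | "full_parallel";

type Generation = { temperature?: number; top_p?: number; request_timeout_seconds?: number };

type TabDefaults = {
  benchPackId: string | null;
  modelSelections: Array<{ modelId: string }>;
  samplingOverrides?: Generation;
  executionMode: ExecutionMode;
  runsPerTest: number;
};

type StartRunBody = {
  benchPackId?: string;
  modelIds?: string[];
  executionMode?: ExecutionMode;
  runsPerTest?: number;
  generation?: Generation;
};

// Each request field falls back to the tab's stored value when omitted.
function resolveRun(tab: TabDefaults, body: StartRunBody) {
  return {
    benchPackId: body.benchPackId ?? tab.benchPackId,
    modelIds: body.modelIds ?? tab.modelSelections.map((sel) => sel.modelId),
    executionMode: body.executionMode ?? tab.executionMode,
    runsPerTest: body.runsPerTest ?? tab.runsPerTest,
    generation: body.generation ?? tab.samplingOverrides
  };
}
```

An empty body therefore runs exactly what the tab shows in the UI, which is usually the least surprising choice for an agent.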

Response:

{
  "accepted": true,
  "tabId": "tab-..."
}

Status code: 202.

The tab's loadedRunId is set after the run produces a summary.

`POST /v1/tabs/:tabId/runs/stop`

Stops the active run for a tab.

Response depends on controller state, but generally includes whether a run was stopped.

Unlike start/resume/retry, this is synchronous.

`POST /v1/tabs/:tabId/runs/:runId/resume`

Resumes a historical run.

Allowed fields:

{
  executionMode?: BenchLocalExecutionMode;
  runsPerTest?: number;
  generation?: GenerationRequest;
}

Response:

{
  "accepted": true,
  "tabId": "tab-...",
  "runId": "run-..."
}

Status code: 202.

`POST /v1/tabs/:tabId/runs/:runId/retry-scenario`

Retries one scenario/model cell from a saved run.

Allowed fields:

{
  scenarioId: string;
  modelId: string;
  runsPerTest?: number;
  generation?: GenerationRequest;
}

Response:

{
  "accepted": true,
  "tabId": "tab-...",
  "runId": "run-..."
}

Status code: 202.

`POST /v1/tabs/:tabId/runs/:runId/retry-provider-errors`

Retries provider-error cells from a saved run.

Allowed fields:

{
  runsPerTest?: number;
  generation?: GenerationRequest;
}

Response when there is work:

{
  "accepted": true,
  "tabId": "tab-...",
  "runId": "run-...",
  "kind": "provider_errors",
  "cellCount": 2,
  "groupCount": 1
}

Status code: 202.

Response when there is no eligible work:

{
  "accepted": false,
  "tabId": "tab-...",
  "runId": "run-...",
  "kind": "provider_errors",
  "cellCount": 0,
  "groupCount": 0
}

Status code: 200.
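
The two responses differ only in `accepted` and the status code, both of which follow from whether the retry plan contains any cells. A sketch, with `retryResponse` as a hypothetical name:

```typescript
type RetryPlan = {
  tabId: string;
  runId: string;
  kind: "provider_errors" | "failed_results";
  cellCount: number;
  groupCount: number;
};

// 202 when work was queued, 200 when there was nothing eligible to retry.
function retryResponse(plan: RetryPlan): {
  statusCode: 200 | 202;
  body: RetryPlan & { accepted: boolean };
} {
  const accepted = plan.cellCount > 0;
  return { statusCode: accepted ? 202 : 200, body: { accepted, ...plan } };
}
```

Agents should branch on `accepted` rather than on the status code alone, since both codes indicate a successful request.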

Provider-error classification must come from provider failure metadata and HTTP response status handling, not from scanning verifier failure summaries.

`POST /v1/tabs/:tabId/runs/:runId/retry-failed-results`

Retries non-provider failed cells from a saved run.

Allowed fields:

{
  runsPerTest?: number;
  generation?: GenerationRequest;
}

Response shape matches retry-provider-errors, with:

{
  "kind": "failed_results"
}

Recommended HTTP Workflow

This is the workflow agents should use for a live benchmark run:

GET /v1/health
GET /v1/workspaces
GET /v1/benchpacks
GET /v1/providers
GET /v1/models
Open GET /v1/events and keep it open.
Create or patch a tab.
Select Bench Pack and models.
Refresh model availability.
Set sampling, execution mode, and runs-per-test if needed.
Start the run.
Watch benchpack.run.event until a finished, cancelled, or error event appears.

Example:

curl "$BENCHLOCAL_AGENT_BASE_URL/v1/benchpacks" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN"

curl "$BENCHLOCAL_AGENT_BASE_URL/v1/workspaces" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN"

curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/workspaces/$WORKSPACE_ID/tabs" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
  -H "content-type: application/json" \
  -d '{"benchPackId":"toolcall-15","title":"ToolCall-15"}'

curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/tabs/$TAB_ID/select-models" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
  -H "content-type: application/json" \
  -d '{"modelIds":["huggingface:Qwen/Qwen3.5-9B"]}'

curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/tabs/$TAB_ID/models/availability/refresh" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
  -H "content-type: application/json" \
  -d '{}'

curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/tabs/$TAB_ID/runs" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
  -H "content-type: application/json" \
  -d '{"executionMode":"serial_by_model","runsPerTest":1}'
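
The same sequence can be prepared from TypeScript. The sketch below only builds request descriptors and sends nothing; `buildAgentRequest` and `runWorkflowRequests` are hypothetical helpers, and the routes and bodies are the ones used in the curl example above.

```typescript
type AgentRequest = {
  method: "GET" | "POST" | "PATCH" | "DELETE";
  url: string;
  headers: Record<string, string>;
  body?: string;
};

function buildAgentRequest(
  baseUrl: string,
  token: string,
  method: AgentRequest["method"],
  path: string,
  json?: unknown
): AgentRequest {
  const headers: Record<string, string> = {};
  // GET /v1/health is the only route that works without the bearer token.
  if (!(method === "GET" && path === "/v1/health")) {
    headers["Authorization"] = `Bearer ${token}`;
  }
  if (json !== undefined) headers["content-type"] = "application/json";
  const request: AgentRequest = { method, url: `${baseUrl.replace(/\/+$/, "")}${path}`, headers };
  if (json !== undefined) request.body = JSON.stringify(json);
  return request;
}

// Select models, refresh availability, then start the run for an existing tab.
function runWorkflowRequests(
  baseUrl: string,
  token: string,
  tabId: string,
  modelIds: string[]
): AgentRequest[] {
  const tab = `/v1/tabs/${encodeURIComponent(tabId)}`;
  return [
    buildAgentRequest(baseUrl, token, "POST", `${tab}/select-models`, { modelIds }),
    buildAgentRequest(baseUrl, token, "POST", `${tab}/models/availability/refresh`, {}),
    buildAgentRequest(baseUrl, token, "POST", `${tab}/runs`, {
      executionMode: "serial_by_model",
      runsPerTest: 1
    })
  ];
}
```

Each descriptor can be handed to an HTTP client in order, for example `fetch(request.url, request)`, waiting for each response before sending the next.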

MCP Surface

BenchLocal exposes MCP through:

POST /mcp
Authorization: Bearer <token>
Content-Type: application/json
Accept: application/json, text/event-stream

POST /v1/mcp is accepted as an alias.

Implementation details:

Uses @modelcontextprotocol/sdk.
Uses StreamableHTTPServerTransport.
sessionIdGenerator is disabled, so the server is stateless per request.
GET, DELETE, and any other non-POST requests return MCP method-not-allowed JSON-RPC errors.
Long-running tools return accepted: true; progress is obtained from the UI, recent events, or the SSE stream.

MCP Client Configuration

Use the bearer token as an Authorization header.

Generic MCP client shape:

{
  "mcpServers": {
    "benchlocal": {
      "type": "streamable-http",
      "url": "http://127.0.0.1:50060/mcp",
      "headers": {
        "Authorization": "Bearer <token>"
      }
    }
  }
}

Exact configuration keys vary by MCP client.

MCP Resources

Resource URI	Description
`benchlocal://agent/guide`	Runtime agent guide Markdown.
`benchlocal://agent/openapi`	Runtime OpenAPI JSON document.
`benchlocal://state/config`	Redacted BenchLocal config.
`benchlocal://state/workspaces`	Workspace and tab state.
`benchlocal://state/benchpacks`	Installed Bench Packs and scenario metadata.
`benchlocal://state/providers`	Configured providers with secrets redacted.
`benchlocal://state/models`	Configured models.
`benchlocal://state/runs/active`	Active benchmark runs.
`benchlocal://state/events/recent`	Recent Agent API events.

MCP Prompt

`benchlocal-run-benchpack`

Arguments:

{
  benchPackId: string;
  modelIds: string; // comma-separated model IDs
  workspaceId?: string;
}

Use this prompt to teach an MCP-capable agent the preferred run workflow:

inspect workspaces
choose or create a tab
select the Bench Pack
select models
refresh availability
start a run
poll recent events while the UI updates in real time

MCP Tools

All MCP tools return JSON as text content and, where possible, structured content.

Health and State

Tool	Read-only	Input	Result
`benchlocal_get_health`	yes	`{}`	Runtime compatibility and health.
`benchlocal_get_config`	yes	`{}`	Redacted config.
`benchlocal_list_workspaces`	yes	`{}`	Workspace and tab state.
`benchlocal_list_benchpacks`	yes	`{}`	Installed Bench Packs.
`benchlocal_list_benchpack_registry`	yes	`{}`	Registry entries.
`benchlocal_list_active_runs`	yes	`{}`	Active runs.
`benchlocal_list_verifiers`	yes	`{}`	Verifier runtime status.
`benchlocal_get_recent_events`	yes	`{ limit?: number }`	Recent events; only the newest `limit` events when `limit` is provided.

Providers

Tool	Input	Result
`benchlocal_list_providers`	`{}`	Redacted providers.
`benchlocal_get_provider`	`{ providerId: string }`	One redacted provider.
`benchlocal_create_provider`	`{ id?, kind, name?, enabled?, base_url, api_key?, api_key_env? }`	Created provider result.
`benchlocal_update_provider`	`{ providerId, kind?, name?, enabled?, base_url?, api_key?, api_key_env? }`	Updated provider result.
`benchlocal_delete_provider`	`{ providerId }`	Delete result. Destructive.
`benchlocal_duplicate_provider`	`{ providerId }`	Duplicate provider result.
`benchlocal_discover_provider_models`	`{ providerId }`	Provider model discovery result.

kind must be one of:

openrouter
huggingface
ollama
llamacpp
mlx
lmstudio
pico
openai_compatible

Models

Tool	Input	Result
`benchlocal_list_models`	`{}`	Configured models.
`benchlocal_get_model`	`{ modelId: string }`	One model.
`benchlocal_create_model`	`{ id?, provider, model, label?, group?, enabled? }`	Created model result.
`benchlocal_update_model`	`{ modelId, id?, provider?, model?, label?, group?, enabled? }`	Updated model result.
`benchlocal_delete_model`	`{ modelId }`	Delete result. Destructive.
`benchlocal_duplicate_model`	`{ modelId }`	Duplicate model result.
`benchlocal_check_model_availability`	`{ modelIds?: string[] }`	Availability result.
`benchlocal_refresh_model_availability`	`{ tabId?: string, modelIds?: string[] }`	Availability result.

benchlocal_refresh_model_availability uses selected tab models when tabId is provided and modelIds is omitted.

Tabs

Tool	Input	Result
`benchlocal_create_tab`	`{ workspaceId, benchPackId?, title?, modelSelections? }`	Updated workspace state.
`benchlocal_patch_tab`	`{ tabId, title?, focusedScenarioId?, modelSelections?, samplingOverrides?, executionMode?, runsPerTest? }`	Updated workspace state.
`benchlocal_select_benchpack`	`{ tabId, benchPackId, title? }`	Updated workspace state.
`benchlocal_select_models`	`{ tabId, modelIds?, selections? }`	Updated workspace state.
`benchlocal_set_sampling`	`{ tabId, samplingOverrides }`	Updated workspace state.
`benchlocal_set_execution_mode`	`{ tabId, executionMode, runsPerTest? }`	Updated workspace state.
`benchlocal_set_runs_per_test`	`{ tabId, runsPerTest }`	Updated workspace state.

modelSelections and selections use:

Array<{ modelId: string; alias?: string }>

Runs

Tool	Input	Result
`benchlocal_start_run`	`{ tabId, benchPackId?, modelIds?, executionMode?, runsPerTest?, generation? }`	`{ accepted: true, tabId }`
`benchlocal_resume_run`	`{ tabId, runId, executionMode?, runsPerTest?, generation? }`	`{ accepted: true, tabId, runId }`
`benchlocal_retry_scenario`	`{ tabId, runId, scenarioId, modelId, runsPerTest?, generation? }`	`{ accepted: true, tabId, runId }`
`benchlocal_retry_provider_errors`	`{ tabId, runId, runsPerTest?, generation? }`	Retry batch plan and accepted state.
`benchlocal_retry_failed_results`	`{ tabId, runId, runsPerTest?, generation? }`	Retry batch plan and accepted state.
`benchlocal_stop_run`	`{ tabId }`	Stop result.
`benchlocal_list_run_history`	`{ benchPackId }`	Run history.
`benchlocal_get_run_summary`	`{ benchPackId, runId }`	Saved run summary.

Run tools that start work return before the benchmark completes. Follow progress with any of:

benchlocal_get_recent_events
benchlocal://state/events/recent
GET /v1/events

MCP Recommended Workflow

For an agent controlling a local model benchmark:

Read benchlocal://state/workspaces.
Read benchlocal://state/benchpacks.
Call benchlocal_list_providers.
Call benchlocal_list_models.
Create or patch a tab with benchlocal_create_tab or benchlocal_patch_tab.
Select the Bench Pack with benchlocal_select_benchpack.
Select models with benchlocal_select_models.
Ask the user to start the external local model server, or start it outside BenchLocal if the agent has its own safe tool for that.
Call benchlocal_refresh_model_availability.
Call benchlocal_start_run.
Poll benchlocal_get_recent_events while the BenchLocal UI shows the run.
When the model server changes, refresh availability and resume or retry eligible results.
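
The tool-calling part of this workflow can be sketched as a list of JSON-RPC bodies for POST /mcp. The `tools/call` method with `name` and `arguments` params is standard MCP; the helper names `toolCall` and `mcpRunPlan`, the request IDs, and the `limit` of 20 are illustrative assumptions.

```typescript
type JsonRpcRequest = {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params: Record<string, unknown>;
};

// Build one MCP tools/call request body.
function toolCall(id: number, name: string, args: Record<string, unknown>): JsonRpcRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Bodies for: select pack, select models, refresh availability, start, poll events.
function mcpRunPlan(tabId: string, benchPackId: string, modelIds: string[]): JsonRpcRequest[] {
  return [
    toolCall(1, "benchlocal_select_benchpack", { tabId, benchPackId }),
    toolCall(2, "benchlocal_select_models", { tabId, modelIds }),
    // With tabId and no modelIds, the tab's selected models are refreshed.
    toolCall(3, "benchlocal_refresh_model_availability", { tabId }),
    toolCall(4, "benchlocal_start_run", { tabId }),
    toolCall(5, "benchlocal_get_recent_events", { limit: 20 })
  ];
}
```

The last call is meant to be repeated while the run is in progress, since `benchlocal_start_run` returns before the benchmark completes.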

Security and Safety

Required invariants:

GET /v1/health is the only unauthenticated route.
Every other HTTP route requires the bearer token.
MCP requires the bearer token and local Origin.
Config reads use getSafeConfig.
Provider reads use redacted provider helpers.
No route returns api_key.
Routes reject unknown JSON fields.
Routes do not allow arbitrary file reads or writes.
Routes do not allow arbitrary shell execution.
Destructive MCP tools are annotated with destructiveHint.

Local Network mode:

is intended only for trusted networks
exposes the server on 0.0.0.0
still requires the bearer token
should be treated like any local automation endpoint with write access to benchmark state

How To Extend The Agent Surface

When a new UI feature should be agent-controllable, update HTTP and MCP together.

Definition of done:

Add or reuse a controller method in app/src/main/controller.ts.
Add shared request/response/event types in packages/benchlocal-core/src/agent-protocol.ts when the payload is not trivial.
Add an IPC adapter only if the renderer needs a new direct operation.
Add an HTTP route in app/src/main/agent-server.ts.
Add strict JSON key validation with assertOnlyKeys.
Add auth and redaction rules before returning data.
Add or update OpenAPI output in createOpenApiDocument.
Add or update the runtime guide in createAgentGuide if agents need to learn the feature.
Add an MCP resource when the feature exposes durable readable state.
Add an MCP tool when the feature is an action.
Add MCP annotations:
- readOnlyHint: true for pure reads
- destructiveHint: true for deletes or irreversible changes
- openWorldHint: true when the tool may call external providers or start long-running benchmark work
Emit or reuse a controller event so the renderer, SSE clients, and recent-event MCP polling all see the same change.
Update this document.
Run typecheck and a manual local API smoke test.

When adding a UI feature that should be automatable, do not leave it UI-only without choosing one of:

expose it through HTTP and MCP now
explicitly mark it as UI-only in this document with a reason

Event Extension Rules

Add new event types only when existing events cannot represent the change.

Prefer:

workspace.updated when tab/workspace state changes
config.updated when config changes
models.availability.updated when availability changes
benchpack.run.event for benchmark progress
verifier.event for verifier lifecycle

When adding a new event type:

Add it to BenchLocalAgentEventType.
Define a payload type.
Emit it from the controller.
Broadcast it through the existing event bus.
Add it to this doc.
Include it in runtime guide text if agents need to react to it.

HTTP Route Extension Pattern

Use this shape in agent-server.ts:

if (request.method === "POST" && segments.length === 3 && segments[0] === "example") {
  const body = await readJsonRequest(request);
  assertOnlyKeys(body, ["allowedField"]);
  sendJson(response, 200, await this.controller.example(body as BenchLocalAgentExampleRequest));
  return;
}

For long-running commands:

void this.controller.longRunningOperation(input).catch((error) => {
  console.error("[benchlocal] agent-started operation failed", error);
});

sendJson(response, 202, { accepted: true, ...handle });

Use 202 when work has been accepted but not completed.

Use 200 when the command completed synchronously or when there was no eligible work.

MCP Tool Extension Pattern

Use this shape in agent-mcp.ts:

server.registerTool(
  "benchlocal_example_action",
  {
    title: "Example Action",
    description: "Do the same operation exposed by the UI and HTTP API.",
    inputSchema: {
      id: z.string()
    },
    annotations: { readOnlyHint: false, openWorldHint: false }
  },
  async ({ id }) => jsonToolResult(await controller.example(id))
);

For long-running tools, return an accepted result and rely on recent events:

void controller.longRunningOperation(input).catch((error) => {
  console.error("[benchlocal] mcp-started operation failed", error);
});

return jsonToolResult({ accepted: true, id });

Manual Smoke Tests

Use these after changing HTTP or MCP.

Health:

curl "$BENCHLOCAL_AGENT_BASE_URL/v1/health"

Auth failure:

curl "$BENCHLOCAL_AGENT_BASE_URL/v1/models"

Expected: 401.

List models:

curl "$BENCHLOCAL_AGENT_BASE_URL/v1/models" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN"

SSE:

curl -N "$BENCHLOCAL_AGENT_BASE_URL/v1/events" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN"

Create tab:

curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/workspaces/$WORKSPACE_ID/tabs" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
  -H "content-type: application/json" \
  -d '{"benchPackId":"toolcall-15","title":"ToolCall-15"}'

MCP initialize example:

curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/mcp" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
  -H "content-type: application/json" \
  -H "accept: application/json, text/event-stream" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
      "protocolVersion": "2025-03-26",
      "capabilities": {},
      "clientInfo": {
        "name": "curl-smoke",
        "version": "0.0.0"
      }
    }
  }'

MCP list tools example:

curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/mcp" \
  -H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
  -H "content-type: application/json" \
  -H "accept: application/json, text/event-stream" \
  -d '{
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/list",
    "params": {}
  }'
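Because the requests above send accept: application/json, text/event-stream, an MCP Streamable HTTP server may answer with either a plain JSON body or an SSE-framed body. The following sketch decodes both forms and extracts tool names from a tools/list reply. decodeMcpResponse and listToolNames are hypothetical helper names, and which content type BenchLocal actually returns is not stated in this section.

```typescript
interface JsonRpcResponse {
  jsonrpc: "2.0";
  id: number | string | null;
  result?: unknown;
  error?: { code: number; message: string };
}

// Decodes a POST /mcp response body that is either JSON or SSE-framed JSON.
function decodeMcpResponse(contentType: string, bodyText: string): JsonRpcResponse[] {
  if (contentType.toLowerCase().startsWith("text/event-stream")) {
    const messages: JsonRpcResponse[] = [];
    for (const block of bodyText.split(/\r?\n\r?\n/)) {
      const data = block
        .split(/\r?\n/)
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice("data:".length).replace(/^ /, ""))
        .join("\n");
      if (data !== "") messages.push(JSON.parse(data) as JsonRpcResponse);
    }
    return messages;
  }

  const parsed = JSON.parse(bodyText) as JsonRpcResponse | JsonRpcResponse[];
  return Array.isArray(parsed) ? parsed : [parsed];
}

// Returns the tool names from the tools/list reply matching requestId.
function listToolNames(messages: JsonRpcResponse[], requestId: number | string): string[] {
  const reply = messages.find((message) => message.id === requestId);
  if (!reply) throw new Error(`No JSON-RPC reply with id ${String(requestId)}`);
  if (reply.error) throw new Error(`MCP error ${reply.error.code}: ${reply.error.message}`);

  const result = reply.result as { tools?: Array<{ name: string }> } | undefined;
  return (result?.tools ?? []).map((tool) => tool.name);
}
```

A smoke script could assert that every returned name starts with benchlocal_. If the server issues an Mcp-Session-Id header in its initialize response, later requests would also need to send that header.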

Current Non-Goals

BenchLocal Agent API and MCP do not currently:

install or uninstall Bench Packs
start arbitrary local model servers
supervise Ollama, llama.cpp, MLX, LM Studio, Docker, or custom scripts
expose general file-system access
expose arbitrary shell execution
replace the desktop UI

For local model orchestration, the agent should manage external model servers using its own environment or ask the user to start/stop them. BenchLocal should expose availability, run, resume, retry, and stop controls so that coordination remains visible in the UI.
