This document is the durable reference for BenchLocal's local agent surface:
- HTTP JSON commands for state reads and mutations
- Server-Sent Events for live progress and state changes
- MCP Streamable HTTP for standard agent tool calls
Keep this file updated whenever a UI feature becomes agent-controllable.
BenchLocal is a desktop benchmark app first. The agent surface must expose the same operations the UI exposes without creating a second execution path.
Core rules:
- The Electron renderer keeps using IPC through window.benchlocal.
- The Agent API and MCP call the same main-process controller as IPC.
- Commands use HTTP JSON or MCP tools.
- Live progress uses SSE events or MCP recent-event polling.
- Long-running benchmark commands return quickly and continue in the UI.
- Provider secrets are never returned by HTTP, SSE, MCP resources, or MCP tool results.
- Agent access is explicit, token-protected, and local-first.
- BenchLocal does not execute arbitrary shell commands for agents.
Implementation files:
- app/src/main/controller.ts: Shared main-process operations used by IPC, HTTP, and MCP.
- app/src/main/agent-server.ts: Local HTTP server, auth, SSE, OpenAPI, agent guide, and route adapters.
- app/src/main/agent-mcp.ts: MCP Streamable HTTP server, resources, prompt, and benchlocal_* tools.
- packages/benchlocal-core/src/agent-protocol.ts: Shared Agent API event and request/response types.
- packages/benchlocal-core/src/config.ts: Provider, model, and Agent Access config types.
- packages/benchlocal-core/src/workspaces.ts: Workspace, tab, model selection, execution mode, and sampling state.
Agent Access can be enabled from Settings > Agent Access.
The server listens on:
- 127.0.0.1 when access is localhost
- 0.0.0.0 when access is local_network
The UI always shows a local client URL like:
http://127.0.0.1:<port>
Agents on another device must use the host machine's LAN IP when local_network is enabled.
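The host and URL rules above can be sketched as two small helpers. This is an illustrative sketch, not the code in agent-server.ts: resolveBindHost, clientBaseUrl, and the lanIp parameter are hypothetical names introduced here.

```typescript
// Access modes named in this document.
type AgentAccess = "localhost" | "local_network";

// Map the access mode to the interface the server binds to.
function resolveBindHost(access: AgentAccess): string {
  return access === "local_network" ? "0.0.0.0" : "127.0.0.1";
}

// Build the base URL a client should use.
// `lanIp` is a hypothetical parameter: the host machine's LAN address,
// which a remote agent has to supply when access is local_network.
function clientBaseUrl(port: number, lanIp?: string): string {
  const host = lanIp ?? "127.0.0.1";
  return `http://${host}:${port}`;
}
```

A same-machine agent calls clientBaseUrl(port) and gets the URL form shown in the UI. An agent on another device passes the host's LAN IP, because 0.0.0.0 is a bind address and not something a client can connect to.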
The port is either:
- the configured port
- an automatically assigned port when no port is configured
Environment overrides:
BENCHLOCAL_AGENT_API=1
BENCHLOCAL_AGENT_PORT=50060
BENCHLOCAL_AGENT_ACCESS=localhost
BENCHLOCAL_AGENT_ACCESS=local_network
Token storage:
~/.benchlocal/agent-session.json
The session file contains the bearer token and is written with owner-only permissions when created by BenchLocal.
All endpoints except GET /v1/health require:
Authorization: Bearer <token>
The token is shown in Settings > Agent Access and can be regenerated there.
Unauthorized requests return:
{
"error": {
"message": "Unauthorized.",
"statusCode": 401
}
}
MCP requests also enforce an Origin guard. If an Origin header is present, it must be localhost:
- localhost
- 127.0.0.1
- ::1
- [::1]
This is intentionally stricter than normal HTTP routes because MCP clients may be browser-adjacent.
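A minimal sketch of such an Origin check follows. The name isAllowedMcpOrigin is hypothetical, and allowing a missing Origin header is an assumption drawn from the "if an Origin header is present" wording. The real guard may differ.

```typescript
// Hostnames this document lists as acceptable for the MCP Origin guard.
const LOCAL_ORIGIN_HOSTS = new Set(["localhost", "127.0.0.1", "::1", "[::1]"]);

// Returns true when the request may proceed.
// Assumption: requests without an Origin header are allowed.
function isAllowedMcpOrigin(origin: string | undefined): boolean {
  if (origin === undefined) return true;
  try {
    // WHATWG URL parsing keeps the brackets on IPv6 hostnames ("[::1]").
    return LOCAL_ORIGIN_HOSTS.has(new URL(origin).hostname);
  } catch {
    // Unparseable values such as the literal "null" origin are rejected.
    return false;
  }
}
```

Matching on the parsed hostname rather than on a string prefix avoids accepting origins such as http://localhost.example.com.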
Path IDs must be URL-encoded. This is required for model IDs and provider IDs that contain :, /, spaces, or UUID-like provider prefixes.
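For agents written in TypeScript, the same encoding can be done with the standard encodeURIComponent function. The helper name modelUrl is hypothetical. The /v1/models/ path is the one used by the shell example in this section.

```typescript
// Build the URL for one configured model, encoding the ID as a single path segment.
function modelUrl(baseUrl: string, modelId: string): string {
  return `${baseUrl}/v1/models/${encodeURIComponent(modelId)}`;
}
```

With the model ID huggingface:Qwen/Qwen3.5-9B, the colon and slash are percent-encoded, so the ID stays inside one path segment.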
Example:
MODEL_ID='huggingface:Qwen/Qwen3.5-9B'
curl "$BENCHLOCAL_AGENT_BASE_URL/v1/models/$(node -e 'console.log(encodeURIComponent(process.argv[1]))' "$MODEL_ID")" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN"
JSON write endpoints reject unknown fields. This is deliberate:
- agents get fast feedback when they drift from the contract
- accidental writes do not silently mutate future config
- UI and API payloads stay aligned
Request bodies are limited to 1 MB.
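The strict-body rules can be illustrated with a small parser. This is a sketch under stated assumptions, not the server's implementation: parseStrictBody and AgentRequestError are hypothetical names, "1 MB" is read as 1,048,576 bytes, and the 413 status for oversized bodies is a guess.

```typescript
// Error that carries the HTTP status reported in the error envelope.
class AgentRequestError extends Error {
  statusCode: number;
  constructor(message: string, statusCode: number) {
    super(message);
    this.statusCode = statusCode;
  }
}

// Assumption: "1 MB" is read here as 1,048,576 bytes.
const MAX_BODY_BYTES = 1024 * 1024;

// Hypothetical helper: parse a raw JSON body and reject unknown top-level fields.
function parseStrictBody(raw: string, allowedKeys: string[]): Record<string, unknown> {
  if (new TextEncoder().encode(raw).length > MAX_BODY_BYTES) {
    // 413 is an assumption; this document only states the size limit.
    throw new AgentRequestError("Request body too large.", 413);
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new AgentRequestError("Invalid JSON body.", 400);
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new AgentRequestError("Request body must be a JSON object.", 400);
  }
  const body = parsed as Record<string, unknown>;
  const unknownKeys = Object.keys(body).filter((key) => !allowedKeys.includes(key));
  if (unknownKeys.length > 0) {
    throw new AgentRequestError(`Unknown field(s): ${unknownKeys.join(", ")}.`, 400);
  }
  return body;
}
```

Rejecting the whole request, rather than silently dropping extra keys, is what gives agents the fast feedback described above.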
Errors use:
{
"error": {
"message": "Human-readable error.",
"statusCode": 400
}
}
GET /v1/health requires no auth. It returns runtime status and documentation links.
Example response:
{
"ok": true,
"benchLocalVersion": "0.2.6",
"agent": {
"enabled": true,
"running": true,
"access": "localhost",
"host": "127.0.0.1",
"port": 50060,
"baseUrl": "http://127.0.0.1:50060",
"connectedClients": 0,
"message": "Agent API is listening on http://127.0.0.1:50060.",
"startedAt": "2026-05-18T00:00:00.000Z"
},
"docs": {
"agentGuide": "/v1/agent-guide",
"openapi": "/v1/openapi.json",
"mcp": "/mcp"
}
}
GET /v1/agent-guide requires auth. It returns agent-readable Markdown generated by the running app. This is intentionally shorter than this developer reference and is meant for runtime agent bootstrapping.
GET /v1/openapi.json requires auth. It returns the OpenAPI document generated by the running app.
The OpenAPI document is useful for endpoint discovery, but the schemas are intentionally lightweight today. This doc remains the source of implementation guidance.
POST /mcp requires auth. It is the standard MCP Streamable HTTP endpoint. See MCP Surface.
POST /v1/mcp is also accepted.
GET /v1/events requires auth. It opens a Server-Sent Events stream.
curl -N \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
"$BENCHLOCAL_AGENT_BASE_URL/v1/events"
Initial response includes a comment:
: BenchLocal agent event stream
Each event is sent as:
id: evt-...
event: benchpack.run.event
data: {"eventId":"evt-...","createdAt":"...","type":"benchpack.run.event","payload":{}}
Event envelope:
type BenchLocalAgentEvent<TPayload = unknown> = {
eventId: string;
createdAt: string;
type: BenchLocalAgentEventType;
payload: TPayload;
};
Current event types:
agent.state.updated
config.updated
workspace.updated
models.availability.updated
benchpack.run.started
benchpack.run.event
benchpack.run.finished
benchpack.run.error
verifier.event
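A client can turn each SSE frame back into the event envelope with a few lines of parsing. The sketch below is illustrative: parseSseFrame is a hypothetical helper, and the envelope's type field is widened to string so the block stays self-contained.

```typescript
// Envelope shape from this document, with `type` widened to string for brevity.
type BenchLocalAgentEvent<TPayload = unknown> = {
  eventId: string;
  createdAt: string;
  type: string;
  payload: TPayload;
};

// Parse one SSE frame (the text between blank lines) into an envelope.
// Returns null for frames without data, such as the initial comment line.
function parseSseFrame(frame: string): BenchLocalAgentEvent | null {
  const dataLines: string[] = [];
  for (const line of frame.split(/\r?\n/)) {
    if (line.startsWith(":")) continue; // SSE comment
    if (line.startsWith("data:")) {
      // Drop the field name and one optional leading space.
      dataLines.push(line.slice(5).replace(/^ /, ""));
    }
  }
  if (dataLines.length === 0) return null;
  return JSON.parse(dataLines.join("\n")) as BenchLocalAgentEvent;
}
```

The id and event fields are ignored here because the data payload already repeats eventId and type.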
Important payloads:
type BenchLocalAgentWorkspaceUpdatedPayload = {
state: BenchLocalWorkspaceState;
};
type BenchLocalAgentConfigUpdatedPayload = {
config: BenchLocalAgentSafeConfig;
};
type BenchLocalAgentModelAvailabilityPayload = {
availability: ModelAvailability[];
};
type BenchLocalAgentRunEventPayload = {
tabId: string;
benchPackId: string;
event: ProgressEvent;
};
benchpack.run.event wraps the Bench Pack host progress event. The nested event may include scenario start, model progress, scenario result, run finish, run error, and other Bench Pack progress messages.
SSE is read-only. Do not add command semantics to SSE.
type BenchLocalProviderKind =
| "openrouter"
| "huggingface"
| "ollama"
| "llamacpp"
| "mlx"
| "lmstudio"
| "pico"
| "openai_compatible";
type BenchLocalProviderConfig = {
kind: BenchLocalProviderKind;
name: string;
enabled: boolean;
base_url: string;
api_key?: string;
api_key_env?: string;
};
API and MCP provider reads redact api_key and expose:
type SafeProvider = Omit<BenchLocalProviderConfig, "api_key"> & {
has_api_key: boolean;
has_api_key_env: boolean;
};
type BenchLocalModelConfig = {
id: string;
provider: string;
model: string;
label: string;
group: string;
enabled: boolean;
};
provider is the provider ID, not the provider display name.
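The provider redaction described earlier (dropping api_key and exposing has_api_key and has_api_key_env) can be sketched as a pure function. toSafeProvider is a hypothetical name. Treating has_api_key_env as "an env var name is configured" is an assumption, since the real helper might instead check whether the variable is set.

```typescript
// Provider shape from this document, with `kind` widened to string for brevity.
type BenchLocalProviderConfig = {
  kind: string;
  name: string;
  enabled: boolean;
  base_url: string;
  api_key?: string;
  api_key_env?: string;
};

type SafeProvider = Omit<BenchLocalProviderConfig, "api_key"> & {
  has_api_key: boolean;
  has_api_key_env: boolean;
};

// Hypothetical redaction helper: never copies api_key into the result.
function toSafeProvider(provider: BenchLocalProviderConfig): SafeProvider {
  const { api_key, ...rest } = provider;
  return {
    ...rest,
    has_api_key: Boolean(api_key),
    // Assumption: true when an env var name is configured.
    has_api_key_env: Boolean(provider.api_key_env),
  };
}
```

Building the safe object from the non-secret remainder, instead of deleting a key from a copy, makes it harder to leak the secret by accident.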
type BenchLocalExecutionMode =
| "serial"
| "serial_by_model"
| "parallel_by_model"
| "parallel_by_test_case"
| "full_parallel";
type BenchLocalWorkspaceTabModelSelection = {
modelId: string;
alias?: string;
};
type BenchLocalWorkspaceTab = {
id: string;
title: string;
benchPackId: string | null;
loadedRunId?: string | null;
focusedScenarioId: string | null;
modelSelections: BenchLocalWorkspaceTabModelSelection[];
samplingOverrides?: GenerationRequest;
executionMode: BenchLocalExecutionMode;
runsPerTest: number;
createdAt: string;
updatedAt: string;
};
type GenerationRequest = {
temperature?: number;
top_p?: number;
top_k?: number;
min_p?: number;
repetition_penalty?: number;
presence_penalty?: number;
request_timeout_seconds?: number;
};
Default request timeout is defined in core as DEFAULT_BENCHLOCAL_REQUEST_TIMEOUT_SECONDS.
All examples assume:
export BENCHLOCAL_AGENT_BASE_URL="http://127.0.0.1:50060"
export BENCHLOCAL_AGENT_TOKEN="<token from Settings > Agent Access>"
Returns redacted BenchLocal config.
Response:
{
"config": {
"schema_version": 1,
"ui": { "theme": "system" },
"agent": { "enabled": true, "access": "localhost", "port": 50060 },
"providers": {
"huggingface": {
"kind": "huggingface",
"name": "Hugging Face",
"enabled": true,
"base_url": "https://router.huggingface.co/v1",
"api_key_env": "HF_TOKEN",
"has_api_key": false,
"has_api_key_env": true
}
},
"models": []
}
}
Returns the full workspace state:
{
"path": "/Users/me/.benchlocal/state.json",
"created": false,
"state": {
"schema_version": 1,
"activeWorkspaceId": "workspace-main",
"workspaceOrder": ["workspace-main"],
"workspaces": {},
"tabs": {}
}
}
Returns installed Bench Packs and scenario metadata:
{
"benchPacks": []
}
Returns registry entries for installable Bench Packs:
{
"registry": []
}
Returns configured providers with secrets redacted:
{
"providers": {
"huggingface": {
"kind": "huggingface",
"name": "Hugging Face",
"enabled": true,
"base_url": "https://router.huggingface.co/v1",
"api_key_env": "HF_TOKEN",
"has_api_key": false,
"has_api_key_env": true
}
}
}
Returns one redacted provider:
{
"providerId": "huggingface",
"provider": {
"kind": "huggingface",
"name": "Hugging Face",
"enabled": true,
"base_url": "https://router.huggingface.co/v1",
"api_key_env": "HF_TOKEN",
"has_api_key": false,
"has_api_key_env": true
}
}
Discovers provider models when the provider supports browsing.
Response:
{
"models": []
}
This may call an external provider API and can fail when credentials or network access are unavailable.
Returns configured models:
{
"models": [
{
"id": "huggingface:Qwen/Qwen3.5-9B",
"provider": "huggingface",
"model": "Qwen/Qwen3.5-9B",
"label": "Qwen3.5-9B",
"group": "primary",
"enabled": true
}
]
}
Returns one configured model:
{
"model": {
"id": "huggingface:Qwen/Qwen3.5-9B",
"provider": "huggingface",
"model": "Qwen/Qwen3.5-9B",
"label": "Qwen3.5-9B",
"group": "primary",
"enabled": true
}
}
Checks model availability for all configured models.
Response:
{
"availability": [
{
"modelId": "huggingface:Qwen/Qwen3.5-9B",
"available": true,
"checkedAt": "2026-05-18T00:00:00.000Z"
}
]
}
Exact availability fields are defined by ModelAvailability in packages/benchlocal-core/src/protocol.ts.
Returns active benchmark runs:
{
"activeRuns": []
}
Returns verifier runtime status:
{
"verifiers": []
}
Returns run history entries for a Bench Pack:
{
"history": []
}
Returns a saved run summary:
{
"run": {}
}
Creates a provider.
Allowed fields:
{
id?: string;
kind: BenchLocalProviderKind;
name?: string;
enabled?: boolean;
base_url: string;
api_key?: string;
api_key_env?: string;
}
Example:
curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/providers" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
-H "content-type: application/json" \
-d '{
"id": "huggingface",
"kind": "huggingface",
"name": "Hugging Face",
"enabled": true,
"base_url": "https://router.huggingface.co/v1",
"api_key_env": "HF_TOKEN"
}'
Returns 201.
Patches a provider.
Allowed fields:
{
kind?: BenchLocalProviderKind;
name?: string;
enabled?: boolean;
base_url?: string;
api_key?: string | null;
api_key_env?: string | null;
}
Use null for api_key or api_key_env to clear stored values when supported by the controller.
Deletes a provider.
Important behavior:
- deletes linked models
- removes linked models from tab selections
- broadcasts config/workspace updates through the controller
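The cascade can be pictured as a pure state transformation. This sketch uses simplified, hypothetical shapes (AppState, Model, Tab) that merge config and workspace state into one object. The real controller works on the full types and also broadcasts the resulting updates, which is not modeled here.

```typescript
// Simplified, hypothetical shapes; the real config and workspace types carry more fields.
type Model = { id: string; provider: string };
type Tab = { id: string; modelSelections: Array<{ modelId: string; alias?: string }> };
type AppState = {
  providers: Record<string, { name: string }>;
  models: Model[];
  tabs: Record<string, Tab>;
};

// Remove a provider, its linked models, and those models' tab selections.
function deleteProviderCascade(state: AppState, providerId: string): AppState {
  const removedModelIds = new Set(
    state.models.filter((m) => m.provider === providerId).map((m) => m.id)
  );
  const providers: AppState["providers"] = {};
  for (const [id, provider] of Object.entries(state.providers)) {
    if (id !== providerId) providers[id] = provider;
  }
  const tabs: AppState["tabs"] = {};
  for (const [id, tab] of Object.entries(state.tabs)) {
    tabs[id] = {
      ...tab,
      modelSelections: tab.modelSelections.filter((s) => !removedModelIds.has(s.modelId)),
    };
  }
  return {
    providers,
    models: state.models.filter((m) => !removedModelIds.has(m.id)),
    tabs,
  };
}
```

Agents should expect tab selections to shrink after a provider delete and should re-read workspace state before starting a run.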
Duplicates one provider record.
Important behavior:
- duplicates only the provider
- does not duplicate linked models
Creates a model.
Allowed fields:
{
id?: string;
provider: string;
model: string;
label?: string;
group?: string;
enabled?: boolean;
}
Example:
curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/models" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
-H "content-type: application/json" \
-d '{
"id": "huggingface:Qwen/Qwen3.5-9B",
"provider": "huggingface",
"model": "Qwen/Qwen3.5-9B",
"label": "Qwen3.5-9B",
"group": "primary",
"enabled": true
}'
Returns 201.
Patches a model.
Allowed fields:
{
id?: string;
provider?: string;
model?: string;
label?: string;
group?: string;
enabled?: boolean;
}
If the ID changes, the controller must preserve consistency with tab selections.
Deletes one model and removes it from tab selections.
Duplicates one model record.
Refreshes model availability globally or for selected model IDs.
Allowed fields:
{
modelIds?: string[];
}
Example:
curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/models/availability/refresh" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
-H "content-type: application/json" \
-d '{"modelIds":["huggingface:Qwen/Qwen3.5-9B"]}'
Response:
{
"availability": []
}
Also emits models.availability.updated when the controller broadcasts availability changes.
Creates a workspace tab.
Allowed fields:
{
benchPackId?: string | null;
title?: string;
modelSelections?: Array<{ modelId: string; alias?: string }>;
}
Example:
curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/workspaces/$WORKSPACE_ID/tabs" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
-H "content-type: application/json" \
-d '{
"benchPackId": "toolcall-15",
"title": "ToolCall-15",
"modelSelections": [
{ "modelId": "huggingface:Qwen/Qwen3.5-9B" }
]
}'
Returns 201 and the updated workspace state.
Patches a tab.
Allowed fields:
{
title?: string;
focusedScenarioId?: string | null;
modelSelections?: Array<{ modelId: string; alias?: string }>;
samplingOverrides?: GenerationRequest;
executionMode?: BenchLocalExecutionMode;
runsPerTest?: number;
}
Use this for compound updates. Prefer the more specific endpoints below for common UI actions because they document intent better.
Selects or clears a Bench Pack for a tab.
Allowed fields:
{
benchPackId: string | null;
title?: string;
}
Selects models for a tab.
Allowed fields:
{
modelIds?: string[];
selections?: Array<{ modelId: string; alias?: string }>;
}
modelIds is the compact form. selections is the explicit form and supports aliases.
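The relationship between the two forms can be sketched as a normalization step. normalizeSelections is a hypothetical helper. Letting selections take precedence when both fields are present is an assumption, since this document does not define that case.

```typescript
type Selection = { modelId: string; alias?: string };
type SelectModelsBody = { modelIds?: string[]; selections?: Selection[] };

// Convert either request form into the explicit selection list.
// Assumption: `selections` wins when both fields are provided.
function normalizeSelections(body: SelectModelsBody): Selection[] {
  if (body.selections) {
    return body.selections.map((selection) => ({ ...selection }));
  }
  return (body.modelIds ?? []).map((modelId) => ({ modelId }));
}
```

Sending only one of the two fields avoids depending on how the server resolves a conflict.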
Example:
curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/tabs/$TAB_ID/select-models" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
-H "content-type: application/json" \
-d '{
"modelIds": [
"huggingface:Qwen/Qwen3.5-9B",
"huggingface:qwen3.6:35b-a3b"
]
}'
Sets tab sampling overrides.
Allowed fields:
{
samplingOverrides: GenerationRequest;
}
Example:
{
"samplingOverrides": {
"temperature": 0,
"top_p": 1,
"request_timeout_seconds": 500
}
}
Sets tab execution mode and optionally runs-per-test.
Allowed fields:
{
executionMode: BenchLocalExecutionMode;
runsPerTest?: number;
}
Execution mode values:
serial
serial_by_model
parallel_by_model
parallel_by_test_case
full_parallel
Sets tab runs-per-test.
Allowed fields:
{
runsPerTest: number;
}
Refreshes availability for a tab.
Allowed fields:
{
modelIds?: string[];
}
If modelIds is omitted, the selected model IDs from the tab are used.
Run commands are asynchronous unless otherwise noted. They return quickly, while detailed progress is emitted through GET /v1/events and visible in the desktop UI.
Starts a run for a tab.
Allowed fields:
{
benchPackId?: string;
modelIds?: string[];
executionMode?: BenchLocalExecutionMode;
runsPerTest?: number;
generation?: GenerationRequest;
}
Resolution behavior:
- benchPackId defaults to the tab's selected Bench Pack
- modelIds defaults to the tab's selected models
- executionMode defaults to the tab's execution mode
- runsPerTest defaults to the tab's runs-per-test
- generation defaults to the tab's sampling overrides
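The defaulting rules above amount to "request field if present, otherwise the tab's value". A minimal sketch follows, with hypothetical names (resolveStartRun, TabDefaults, StartRunBody) and trimmed-down types. It assumes generation replaces the tab's sampling overrides as a whole rather than merging field by field.

```typescript
type GenerationRequest = { temperature?: number; top_p?: number; request_timeout_seconds?: number };

// Subset of the tab fields that act as run defaults.
type TabDefaults = {
  benchPackId: string | null;
  modelSelections: Array<{ modelId: string }>;
  executionMode: string;
  runsPerTest: number;
  samplingOverrides?: GenerationRequest;
};

type StartRunBody = {
  benchPackId?: string;
  modelIds?: string[];
  executionMode?: string;
  runsPerTest?: number;
  generation?: GenerationRequest;
};

// Resolve the effective run parameters from the request body and the tab.
function resolveStartRun(tab: TabDefaults, body: StartRunBody) {
  return {
    benchPackId: body.benchPackId ?? tab.benchPackId,
    modelIds: body.modelIds ?? tab.modelSelections.map((s) => s.modelId),
    executionMode: body.executionMode ?? tab.executionMode,
    runsPerTest: body.runsPerTest ?? tab.runsPerTest,
    // Assumption: a provided generation object replaces the overrides wholesale.
    generation: body.generation ?? tab.samplingOverrides,
  };
}
```

An empty request body therefore runs exactly what the tab is configured to run in the UI.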
Response:
{
"accepted": true,
"tabId": "tab-..."
}
Status code: 202.
The run will set loadedRunId on the tab after a summary is produced.
Stops the active run for a tab.
Response depends on controller state, but generally includes whether a run was stopped.
Unlike start/resume/retry, this is synchronous.
Resumes a historical run.
Allowed fields:
{
executionMode?: BenchLocalExecutionMode;
runsPerTest?: number;
generation?: GenerationRequest;
}
Response:
{
"accepted": true,
"tabId": "tab-...",
"runId": "run-..."
}
Status code: 202.
Retries one scenario/model cell from a saved run.
Allowed fields:
{
scenarioId: string;
modelId: string;
runsPerTest?: number;
generation?: GenerationRequest;
}
Response:
{
"accepted": true,
"tabId": "tab-...",
"runId": "run-..."
}
Status code: 202.
Retries provider-error cells from a saved run.
Allowed fields:
{
runsPerTest?: number;
generation?: GenerationRequest;
}
Response when there is work:
{
"accepted": true,
"tabId": "tab-...",
"runId": "run-...",
"kind": "provider_errors",
"cellCount": 2,
"groupCount": 1
}
Status code: 202.
Response when there is no eligible work:
{
"accepted": false,
"tabId": "tab-...",
"runId": "run-...",
"kind": "provider_errors",
"cellCount": 0,
"groupCount": 0
}
Status code: 200.
Provider-error classification must come from provider failure metadata and HTTP response status handling, not from scanning verifier failure summaries.
Retries non-provider failed cells from a saved run.
Allowed fields:
{
runsPerTest?: number;
generation?: GenerationRequest;
}
Response shape matches retry-provider-errors, with:
{
"kind": "failed_results"
}
This is the workflow agents should use for a live benchmark run:
- GET /v1/health
- GET /v1/workspaces
- GET /v1/benchpacks
- GET /v1/providers
- GET /v1/models
- Open GET /v1/events and keep it open.
- Create or patch a tab.
- Select Bench Pack and models.
- Refresh model availability.
- Set sampling, execution mode, and runs-per-test if needed.
- Start the run.
- Watch benchpack.run.event until a finished, cancelled, or error event appears.
Example:
curl "$BENCHLOCAL_AGENT_BASE_URL/v1/benchpacks" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN"
curl "$BENCHLOCAL_AGENT_BASE_URL/v1/workspaces" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN"
curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/workspaces/$WORKSPACE_ID/tabs" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
-H "content-type: application/json" \
-d '{"benchPackId":"toolcall-15","title":"ToolCall-15"}'
curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/tabs/$TAB_ID/select-models" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
-H "content-type: application/json" \
-d '{"modelIds":["huggingface:Qwen/Qwen3.5-9B"]}'
curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/tabs/$TAB_ID/models/availability/refresh" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
-H "content-type: application/json" \
-d '{}'
curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/tabs/$TAB_ID/runs" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
-H "content-type: application/json" \
-d '{"executionMode":"serial_by_model","runsPerTest":1}'
BenchLocal exposes MCP through:
POST /mcp
Authorization: Bearer <token>
Content-Type: application/json
Accept: application/json, text/event-stream
POST /v1/mcp is accepted as an alias.
Implementation details:
- Uses @modelcontextprotocol/sdk.
- Uses StreamableHTTPServerTransport.
- sessionIdGenerator is disabled, so the server is stateless per request.
- GET, DELETE, and other non-POST requests return MCP method-not-allowed JSON-RPC errors.
- Long-running tools return accepted: true; progress is obtained from the UI, recent events, or the SSE stream.
Use the bearer token as an Authorization header.
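For clients that talk to the endpoint directly rather than through an MCP SDK, the request shape can be sketched as a small builder. buildMcpRequest and McpRequest are hypothetical names. The headers mirror the ones listed for POST /mcp in this section, and the body is a plain JSON-RPC 2.0 request.

```typescript
type McpRequest = {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
};

// Build (but do not send) one JSON-RPC request for the MCP endpoint.
function buildMcpRequest(
  baseUrl: string,
  token: string,
  method: string,
  params: Record<string, unknown>,
  id: number
): McpRequest {
  return {
    url: `${baseUrl}/mcp`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
        Accept: "application/json, text/event-stream",
      },
      body: JSON.stringify({ jsonrpc: "2.0", id, method, params }),
    },
  };
}
```

The result can be passed to fetch(request.url, request.init). Because the server is stateless per request, no session header has to be carried between calls.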
Generic MCP client shape:
{
"mcpServers": {
"benchlocal": {
"type": "streamable-http",
"url": "http://127.0.0.1:50060/mcp",
"headers": {
"Authorization": "Bearer <token>"
}
}
}
}
Exact configuration keys vary by MCP client.
| Resource URI | Description |
|---|---|
| benchlocal://agent/guide | Runtime agent guide Markdown. |
| benchlocal://agent/openapi | Runtime OpenAPI JSON document. |
| benchlocal://state/config | Redacted BenchLocal config. |
| benchlocal://state/workspaces | Workspace and tab state. |
| benchlocal://state/benchpacks | Installed Bench Packs and scenario metadata. |
| benchlocal://state/providers | Configured providers with secrets redacted. |
| benchlocal://state/models | Configured models. |
| benchlocal://state/runs/active | Active benchmark runs. |
| benchlocal://state/events/recent | Recent Agent API events. |
Arguments:
{
benchPackId: string;
modelIds: string; // comma-separated model IDs
workspaceId?: string;
}
Use this prompt to teach an MCP-capable agent the preferred run workflow:
- inspect workspaces
- choose or create a tab
- select the Bench Pack
- select models
- refresh availability
- start a run
- poll recent events while the UI updates in real time
All MCP tools return JSON as text content and, where possible, structured content.
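One way to produce that dual result is sketched below. This is not the jsonToolResult defined in agent-mcp.ts, whose body is not shown in this document. It is a hypothetical version that assumes the MCP tool result fields content and structuredContent, and sets structuredContent only for plain objects.

```typescript
type ToolResult = {
  content: Array<{ type: "text"; text: string }>;
  structuredContent?: Record<string, unknown>;
};

// Hypothetical helper: wrap a value as JSON text plus structured content.
function jsonToolResult(value: unknown): ToolResult {
  const result: ToolResult = {
    content: [{ type: "text", text: JSON.stringify(value, null, 2) }],
  };
  // Structured content is only attached for plain objects, not arrays or primitives.
  if (typeof value === "object" && value !== null && !Array.isArray(value)) {
    result.structuredContent = value as Record<string, unknown>;
  }
  return result;
}
```

Clients that do not understand structured content can still parse the text item as JSON.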
| Tool | Read-only | Input | Result |
|---|---|---|---|
| benchlocal_get_health | yes | {} | Runtime compatibility and health. |
| benchlocal_get_config | yes | {} | Redacted config. |
| benchlocal_list_workspaces | yes | {} | Workspace and tab state. |
| benchlocal_list_benchpacks | yes | {} | Installed Bench Packs. |
| benchlocal_list_benchpack_registry | yes | {} | Registry entries. |
| benchlocal_list_active_runs | yes | {} | Active runs. |
| benchlocal_list_verifiers | yes | {} | Verifier runtime status. |
| benchlocal_get_recent_events | yes | { limit?: number } | Recent events, limited to the newest limit entries if limit is provided. |
| Tool | Input | Result |
|---|---|---|
| benchlocal_list_providers | {} | Redacted providers. |
| benchlocal_get_provider | { providerId: string } | One redacted provider. |
| benchlocal_create_provider | { id?, kind, name?, enabled?, base_url, api_key?, api_key_env? } | Created provider result. |
| benchlocal_update_provider | { providerId, kind?, name?, enabled?, base_url?, api_key?, api_key_env? } | Updated provider result. |
| benchlocal_delete_provider | { providerId } | Delete result. Destructive. |
| benchlocal_duplicate_provider | { providerId } | Duplicate provider result. |
| benchlocal_discover_provider_models | { providerId } | Provider model discovery result. |
kind must be one of:
openrouter
huggingface
ollama
llamacpp
mlx
lmstudio
pico
openai_compatible
| Tool | Input | Result |
|---|---|---|
| benchlocal_list_models | {} | Configured models. |
| benchlocal_get_model | { modelId: string } | One model. |
| benchlocal_create_model | { id?, provider, model, label?, group?, enabled? } | Created model result. |
| benchlocal_update_model | { modelId, id?, provider?, model?, label?, group?, enabled? } | Updated model result. |
| benchlocal_delete_model | { modelId } | Delete result. Destructive. |
| benchlocal_duplicate_model | { modelId } | Duplicate model result. |
| benchlocal_check_model_availability | { modelIds?: string[] } | Availability result. |
| benchlocal_refresh_model_availability | { tabId?: string, modelIds?: string[] } | Availability result. |
benchlocal_refresh_model_availability uses selected tab models when tabId is provided and modelIds is omitted.
| Tool | Input | Result |
|---|---|---|
| benchlocal_create_tab | { workspaceId, benchPackId?, title?, modelSelections? } | Updated workspace state. |
| benchlocal_patch_tab | { tabId, title?, focusedScenarioId?, modelSelections?, samplingOverrides?, executionMode?, runsPerTest? } | Updated workspace state. |
| benchlocal_select_benchpack | { tabId, benchPackId, title? } | Updated workspace state. |
| benchlocal_select_models | { tabId, modelIds?, selections? } | Updated workspace state. |
| benchlocal_set_sampling | { tabId, samplingOverrides } | Updated workspace state. |
| benchlocal_set_execution_mode | { tabId, executionMode, runsPerTest? } | Updated workspace state. |
| benchlocal_set_runs_per_test | { tabId, runsPerTest } | Updated workspace state. |
modelSelections and selections use:
Array<{ modelId: string; alias?: string }>
| Tool | Input | Result |
|---|---|---|
| benchlocal_start_run | { tabId, benchPackId?, modelIds?, executionMode?, runsPerTest?, generation? } | { accepted: true, tabId } |
| benchlocal_resume_run | { tabId, runId, executionMode?, runsPerTest?, generation? } | { accepted: true, tabId, runId } |
| benchlocal_retry_scenario | { tabId, runId, scenarioId, modelId, runsPerTest?, generation? } | { accepted: true, tabId, runId } |
| benchlocal_retry_provider_errors | { tabId, runId, runsPerTest?, generation? } | Retry batch plan and accepted state. |
| benchlocal_retry_failed_results | { tabId, runId, runsPerTest?, generation? } | Retry batch plan and accepted state. |
| benchlocal_stop_run | { tabId } | Stop result. |
| benchlocal_list_run_history | { benchPackId } | Run history. |
| benchlocal_get_run_summary | { benchPackId, runId } | Saved run summary. |
Run tools that start work return before the benchmark completes. Poll with:
benchlocal_get_recent_events
benchlocal://state/events/recent
GET /v1/events
For an agent controlling a local model benchmark:
- Read benchlocal://state/workspaces.
- Read benchlocal://state/benchpacks.
- Call benchlocal_list_providers.
- Call benchlocal_list_models.
- Create or patch a tab with benchlocal_create_tab or benchlocal_patch_tab.
- Select the Bench Pack with benchlocal_select_benchpack.
- Select models with benchlocal_select_models.
- Ask the user to start the external local model server, or start it outside BenchLocal if the agent has its own safe tool for that.
- Call benchlocal_refresh_model_availability.
- Call benchlocal_start_run.
- Poll benchlocal_get_recent_events while the BenchLocal UI shows the run.
- When the model server changes, refresh availability and resume or retry eligible results.
Required invariants:
- GET /v1/health is the only unauthenticated route.
- Every other HTTP route requires the bearer token.
- MCP requires the bearer token and local Origin.
- Config reads use getSafeConfig.
- Provider reads use redacted provider helpers.
- No route returns api_key.
- Routes reject unknown JSON fields.
- Routes do not allow arbitrary file reads or writes.
- Routes do not allow arbitrary shell execution.
- Destructive MCP tools are annotated with destructiveHint.
Local Network mode:
- is intended only for trusted networks
- exposes the server on 0.0.0.0
- still requires the bearer token
- should be treated like any local automation endpoint with write access to benchmark state
When a new UI feature should be agent-controllable, update HTTP and MCP together.
Definition of done:
- Add or reuse a controller method in app/src/main/controller.ts.
- Add shared request/response/event types in packages/benchlocal-core/src/agent-protocol.ts when the payload is not trivial.
- Add an IPC adapter only if the renderer needs a new direct operation.
- Add an HTTP route in app/src/main/agent-server.ts.
- Add strict JSON key validation with assertOnlyKeys.
- Add auth and redaction rules before returning data.
- Add or update OpenAPI output in createOpenApiDocument.
- Add or update the runtime guide in createAgentGuide if agents need to learn the feature.
- Add an MCP resource when the feature exposes durable readable state.
- Add an MCP tool when the feature is an action.
- Add MCP annotations:
  - readOnlyHint: true for pure reads
  - destructiveHint: true for deletes or irreversible changes
  - openWorldHint: true when the tool may call external providers or start long-running benchmark work
- Emit or reuse a controller event so the renderer, SSE clients, and recent-event MCP polling all see the same change.
- Update this document.
- Run typecheck and a manual local API smoke test.
Do not add a UI-only feature that should be automatable without also deciding one of:
- expose it through HTTP and MCP now
- explicitly mark it as UI-only in this document with a reason
Add new event types only when existing events cannot represent the change.
Prefer:
- workspace.updated when tab/workspace state changes
- config.updated when config changes
- models.availability.updated when availability changes
- benchpack.run.event for benchmark progress
- verifier.event for verifier lifecycle
When adding a new event type:
- Add it to BenchLocalAgentEventType.
- Define a payload type.
- Emit it from the controller.
- Broadcast it through the existing event bus.
- Add it to this doc.
- Include it in runtime guide text if agents need to react to it.
Use this shape in agent-server.ts:
if (request.method === "POST" && segments.length === 3 && segments[0] === "example") {
const body = await readJsonRequest(request);
assertOnlyKeys(body, ["allowedField"]);
sendJson(response, 200, await this.controller.example(body as BenchLocalAgentExampleRequest));
return;
}
For long-running commands:
void this.controller.longRunningOperation(input).catch((error) => {
console.error("[benchlocal] agent-started operation failed", error);
});
sendJson(response, 202, { accepted: true, ...handle });
Use 202 when work has been accepted but not completed.
Use 200 when the command completed synchronously or when there was no eligible work.
Use this shape in agent-mcp.ts:
server.registerTool(
"benchlocal_example_action",
{
title: "Example Action",
description: "Do the same operation exposed by the UI and HTTP API.",
inputSchema: {
id: z.string()
},
annotations: { readOnlyHint: false, openWorldHint: false }
},
async ({ id }) => jsonToolResult(await controller.example(id))
);
For long-running tools, return an accepted result and rely on recent events:
void controller.longRunningOperation(input).catch((error) => {
console.error("[benchlocal] mcp-started operation failed", error);
});
return jsonToolResult({ accepted: true, id });
Use these after changing HTTP or MCP.
Health:
curl "$BENCHLOCAL_AGENT_BASE_URL/v1/health"
Auth failure:
curl "$BENCHLOCAL_AGENT_BASE_URL/v1/models"
Expected: 401.
List models:
curl "$BENCHLOCAL_AGENT_BASE_URL/v1/models" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN"
SSE:
curl -N "$BENCHLOCAL_AGENT_BASE_URL/v1/events" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN"
Create tab:
curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/v1/workspaces/$WORKSPACE_ID/tabs" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
-H "content-type: application/json" \
-d '{"benchPackId":"toolcall-15","title":"ToolCall-15"}'
MCP initialize example:
curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/mcp" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
-H "content-type: application/json" \
-H "accept: application/json, text/event-stream" \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "initialize",
"params": {
"protocolVersion": "2025-03-26",
"capabilities": {},
"clientInfo": {
"name": "curl-smoke",
"version": "0.0.0"
}
}
}'
MCP list tools example:
curl -X POST "$BENCHLOCAL_AGENT_BASE_URL/mcp" \
-H "Authorization: Bearer $BENCHLOCAL_AGENT_TOKEN" \
-H "content-type: application/json" \
-H "accept: application/json, text/event-stream" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/list",
"params": {}
}'
BenchLocal Agent API and MCP do not currently:
- install or uninstall Bench Packs
- start arbitrary local model servers
- supervise Ollama, llama.cpp, MLX, LM Studio, Docker, or custom scripts
- expose general file-system access
- expose arbitrary shell execution
- replace the desktop UI
For local model orchestration, the agent should manage external model servers using its own environment or ask the user to start/stop them. BenchLocal should expose availability, run, resume, retry, and stop controls so that coordination remains visible in the UI.