Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
115 changes: 115 additions & 0 deletions docs/mkdocs/en/model.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ Models in tRPC-Agent have the following core features:
- **Multi-protocol support**: Provides OpenAIModel, AnthropicModel, LiteLLMModel, etc., compatible with most OpenAI-like and Anthropic interfaces both internally and externally
- **Streaming response support**: Supports streaming output for real-time interactive experiences
- **Multimodal capabilities**: Supports multimodal content processing including text, images, etc. (e.g., Hunyuan multimodal models)
- **Prompt Cache support**: Provides unified prompt cache configuration across OpenAI, Anthropic, and LiteLLM routes to reduce repeated input cost for long prompts and multi-turn conversations
- **Extensible configuration**: Supports custom configuration options such as GenerateContentConfig, HttpOptions, client_args to meet various scenario requirements

## Quick Start
Expand Down Expand Up @@ -398,6 +399,120 @@ LlmAgent(
)
```

### Prompt Cache

Prompt Cache is useful when system prompts are long, tool definitions are large, or multi-turn conversations share a stable prefix. Many providers, including OpenAI-compatible serving stacks such as `openai/sglang`, already support automatic prefix caching on the server side. `tRPC-Agent` does not replace the provider's cache implementation; instead, it provides unified management hints and normalized observability for prompt cache behavior.

`tRPC-Agent` exposes these capabilities through `PromptCacheConfig`, which currently applies to `OpenAIModel`, `AnthropicModel`, and provider-prefixed `LiteLLMModel`. Because providers expose different cache controls and usage fields, the SDK maps management options and cache usage metrics to each provider protocol on a best-effort basis:

| Provider | SDK Capability | Typical Usage Fields |
|----------|----------------|----------------------|
| Anthropic | Manages explicit `cache_control` breakpoints according to `breakpoints` | `cache_read_input_tokens`, `cache_creation_input_tokens` |
| OpenAI / OpenAI-compatible endpoints | Passes cache hints such as `prompt_cache_key` / `prompt_cache_retention` when supported; provider-side automatic prefix caching still owns cache creation and lookup | Usually only `cache_read_input_tokens` |
| LiteLLM | Chooses the Anthropic-style or OpenAI-style cache management path according to the `provider/model` prefix, while preserving provider-native automatic caching such as `openai/sglang` | Depends on the final provider route |

#### Model-level configuration

Model-level configuration becomes the default prompt-cache management and observability configuration for the model instance. Use it when all requests can share the same cache hints:

```python
from trpc_agent_sdk.configs import PromptCacheConfig
from trpc_agent_sdk.models import OpenAIModel

model = OpenAIModel(
model_name="gpt-4o",
api_key="your-api-key",
prompt_cache_config=PromptCacheConfig(
enabled=True,
ttl="24h",
prompt_cache_key="weather-concierge-v1",
),
)
```

#### Per-run override

You can also override prompt-cache settings for a single `runner.run_async()` call through `RunConfig.prompt_cache`. The per-run config overrides model-level settings field by field, which is useful when setting different cache hints by user, tenant, or business scenario:

```python
from trpc_agent_sdk.configs import PromptCacheConfig
from trpc_agent_sdk.configs import RunConfig

async for event in runner.run_async(
user_id=user_id,
session_id=session_id,
new_message=user_content,
run_config=RunConfig(
prompt_cache=PromptCacheConfig(
enabled=True,
prompt_cache_key="weather-concierge-user-42",
),
),
):
...
```

#### Anthropic breakpoints

Anthropic-style caching requires selecting breakpoint locations. `breakpoints` supports the following values:

- `"system"`: cache the system prompt, suitable for long instructions
- `"tools"`: cache the last tool definition, suitable when tools are numerous or tool schemas are large
- `"messages"`: cache the most recent assistant message, suitable for growing stable history prefixes in multi-turn conversations

```python
from trpc_agent_sdk.configs import PromptCacheConfig
from trpc_agent_sdk.models import AnthropicModel

model = AnthropicModel(
model_name="claude-3-5-sonnet-20241022",
api_key="your-api-key",
prompt_cache_config=PromptCacheConfig(
enabled=True,
ttl="1h",
breakpoints=["tools", "system", "messages"],
),
)
```

A good starting point is `["tools", "system"]`; add `"messages"` when long multi-turn conversations need to cache a growing history prefix. Some Anthropic proxies or Bedrock routes require a minimum cache block size, so short prompts may not create cache entries.

#### LiteLLM routes

When using `LiteLLMModel`, the model name should include a `provider/model` prefix. The SDK uses that provider prefix to select the appropriate cache-management mapping. For example:

```python
from trpc_agent_sdk.configs import PromptCacheConfig
from trpc_agent_sdk.models import LiteLLMModel

model = LiteLLMModel(
model_name="openai/gpt-4o",
api_key="your-api-key",
prompt_cache_config=PromptCacheConfig(
enabled=True,
prompt_cache_key="shared-prefix-v1",
),
)
```

If the model name does not include a provider prefix, the SDK cannot determine which cache-management protocol to use, so SDK-managed cache hints may not take effect.

#### Reading cache usage

The model response's `usage_metadata` normalizes cache usage fields where possible:

```python
async for event in runner.run_async(...):
usage = getattr(event, "usage_metadata", None)
if usage:
print(usage.cache_read_input_tokens) # Input tokens read from cache
print(usage.cache_creation_input_tokens) # Input tokens written to cache, usually only reported by Anthropic
print(usage.prompt_token_count) # Total input tokens
```

Different model services report different fields. OpenAI-compatible endpoints usually report cache reads but not cache writes. With load-balanced proxies, each backend instance may have its own KV cache and may not be warmed up at the same time, so cache hit rates can fluctuate during the first few runs.

For a complete runnable example, see [examples/llmagent_with_prompt_cache](../../../examples/llmagent_with_prompt_cache/README.md).

### Custom HTTP Headers

Expand Down
115 changes: 115 additions & 0 deletions docs/mkdocs/zh/model.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ tRPC-Agent 内的模型具有以下核心特性:
- **多协议支持**:提供 OpenAIModel、AnthropicModel、LiteLLMModel 等,兼容公司内外多数 OpenAI-like 及 Anthropic 接口
- **流式响应支持**:支持流式输出,实现实时交互体验
- **多模态能力**:支持文本、图像等多模态内容处理(如 hunyuan 多模态模型)
- **Prompt Cache 支持**:支持跨 OpenAI、Anthropic 与 LiteLLM 路由的统一 prompt cache 配置,降低长提示词和多轮会话的重复输入成本
- **可扩展配置**:支持 GenerateContentConfig、HttpOptions、client_args 等自定义配置项,满足不同场景需求

## 快速上手
Expand Down Expand Up @@ -398,6 +399,120 @@ LlmAgent(
)
```

### Prompt Cache

Prompt Cache 适用于系统提示词较长、工具定义较多或多轮会话前缀高度稳定的场景。很多 provider(包括 `openai/sglang` 这类 OpenAI 兼容推理服务)本身已经支持服务端自动前缀缓存。`tRPC-Agent` 并不替代 provider 的缓存实现,而是提供统一的缓存管理提示与缓存观测能力。

`tRPC-Agent` 通过 `PromptCacheConfig` 暴露这些能力,目前可用于 `OpenAIModel`、`AnthropicModel` 以及带 provider 前缀的 `LiteLLMModel`。不同供应商对缓存控制和统计字段的支持不完全相同,SDK 会尽量将管理选项和缓存用量指标映射到对应协议:

| Provider | SDK 能力 | 典型统计字段 |
|----------|----------|--------------|
| Anthropic | 根据 `breakpoints` 管理显式 `cache_control` 断点 | `cache_read_input_tokens`、`cache_creation_input_tokens` |
| OpenAI / OpenAI 兼容端点 | 在支持时传递 `prompt_cache_key` / `prompt_cache_retention` 等缓存提示;缓存创建和命中仍由 provider 侧自动前缀缓存负责 | 通常只有 `cache_read_input_tokens` |
| LiteLLM | 根据 `provider/model` 前缀选择 Anthropic 风格或 OpenAI 风格的缓存管理路径 | 取决于最终路由的 provider |

#### 模型级配置

模型级配置会作为该模型实例默认的 prompt cache 管理与观测配置,适合在所有请求中复用同一套缓存提示:

```python
from trpc_agent_sdk.configs import PromptCacheConfig
from trpc_agent_sdk.models import OpenAIModel

model = OpenAIModel(
model_name="gpt-4o",
api_key="your-api-key",
prompt_cache_config=PromptCacheConfig(
enabled=True,
ttl="24h",
prompt_cache_key="weather-concierge-v1",
),
)
```

#### 单次运行覆盖

也可以通过 `RunConfig.prompt_cache` 对单次 `runner.run_async()` 覆盖 prompt cache 配置。单次运行配置会按字段覆盖模型级配置,适合按用户、租户或业务场景设置不同的缓存提示:

```python
from trpc_agent_sdk.configs import PromptCacheConfig
from trpc_agent_sdk.configs import RunConfig

async for event in runner.run_async(
user_id=user_id,
session_id=session_id,
new_message=user_content,
run_config=RunConfig(
prompt_cache=PromptCacheConfig(
enabled=True,
prompt_cache_key="weather-concierge-user-42",
),
),
):
...
```

#### Anthropic 断点配置

Anthropic 风格的缓存需要选择断点位置。`breakpoints` 支持以下值:

- `"system"`:缓存系统提示词,适合长 instruction 场景
- `"tools"`:缓存最后一个工具定义,适合工具较多或工具 schema 较大的场景
- `"messages"`:缓存最近一条 assistant 消息,适合多轮会话中不断增长的稳定历史前缀

```python
from trpc_agent_sdk.configs import PromptCacheConfig
from trpc_agent_sdk.models import AnthropicModel

model = AnthropicModel(
model_name="claude-3-5-sonnet-20241022",
api_key="your-api-key",
prompt_cache_config=PromptCacheConfig(
enabled=True,
ttl="1h",
breakpoints=["tools", "system", "messages"],
),
)
```

建议从 `["tools", "system"]` 开始;当长多轮会话需要缓存不断增长的历史前缀时,再加入 `"messages"`。部分 Anthropic 代理或 Bedrock 路由对最小缓存块大小有要求,如果提示词过短,可能不会产生缓存写入。

#### LiteLLM 路由

使用 `LiteLLMModel` 时,模型名需要带 `provider/model` 前缀。SDK 会根据 provider 前缀选择对应的缓存管理映射,例如:

```python
from trpc_agent_sdk.configs import PromptCacheConfig
from trpc_agent_sdk.models import LiteLLMModel

model = LiteLLMModel(
model_name="openai/gpt-4o",
api_key="your-api-key",
prompt_cache_config=PromptCacheConfig(
enabled=True,
prompt_cache_key="shared-prefix-v1",
),
)
```

如果模型名缺少 provider 前缀,SDK 无法判断应使用哪类缓存管理协议,因此 SDK 管理的缓存提示可能不会生效。

#### 读取缓存统计

模型响应的 `usage_metadata` 中会尽量归一化缓存统计字段:

```python
async for event in runner.run_async(...):
usage = getattr(event, "usage_metadata", None)
if usage:
print(usage.cache_read_input_tokens) # 从缓存读取的输入 token 数
print(usage.cache_creation_input_tokens) # 写入缓存的输入 token 数,通常仅 Anthropic 上报
print(usage.prompt_token_count) # 总输入 token 数
```

不同模型服务上报的字段并不完全一致。OpenAI 兼容端点通常只上报缓存读取,不上报缓存写入;负载均衡代理场景下,不同后端实例的 KV 缓存可能尚未全部预热,因此命中率可能在前几次运行中波动。

完整可运行示例见 [examples/llmagent_with_prompt_cache](../../../examples/llmagent_with_prompt_cache/README.md)。

### 自定义 HTTP Header

Expand Down
4 changes: 4 additions & 0 deletions examples/llmagent_with_prompt_cache/.env
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# Set TRPC_AGENT_API_KEY、TRPC_AGENT_BASE_URL、TRPC_AGENT_MODEL_NAME
TRPC_AGENT_API_KEY=your-api-key
TRPC_AGENT_BASE_URL=your-base-url
TRPC_AGENT_MODEL_NAME=your-model-name
64 changes: 64 additions & 0 deletions examples/llmagent_with_prompt_cache/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
# Prompt Cache 示例

本示例演示如何在 OpenAI、Anthropic 上,以及其他经 LiteLLM 接入的兼容端点上,使用 SDK 统一的 prompt cache。
所有场景使用同一个「天气管家」Agent,区别仅在于所选的模型类和缓存配置。

运行这个例子后,在支持 prompt cache 的 API 上,期望能够看到较高的 prompt cache 命中率,以及随轮次增长的 TTFT 改善(Turn 2 起缓存命中后响应明显变快)。本示例中的 TTFT 指从请求开始到第一个有效生成 token 出现的耗时;无论该 token 属于普通 message 还是 tool call 都计入。

---

## 目录结构

```
llmagent_with_prompt_cache/
├── agent/
│ ├── agent.py ← 三个工厂函数 + 自动探测 helper
│ ├── config.py ← 环境变量 helper
│ ├── prompts.py ← 长系统提示词(约 4 900 token)
│ └── tools.py ← 模拟天气工具
├── run_agent.py ← 根据环境变量自动探测 provider 并运行 demo 循环
└── .env ← 环境变量配置(三个 provider 均在此注释分段)
```

---

## 环境与运行

### 环境要求

- Python 3.12

### 安装步骤

```bash
git clone https://github.com/trpc-group/trpc-agent-python.git
cd trpc-agent-python
python3 -m venv .venv
source .venv/bin/activate
pip3 install -e .
```

### 环境变量要求

在 [examples/llmagent_with_prompt_cache/.env](./.env) 中填入凭证:

- `TRPC_AGENT_API_KEY`
- `TRPC_AGENT_BASE_URL`
- `TRPC_AGENT_MODEL_NAME`

### 运行命令

```bash
cd examples/llmagent_with_prompt_cache
python3 run_agent.py # 根据 .env 的模型名字自动选择 provider
```

---

## FQA
### 缓存命中不稳定(命中后又未命中又命中)

在负载均衡的代理部署下属于正常现象。每个后端实例都有独立的 KV 缓存。无论其他实例
预热了多少,落到冷实例上的请求总会显示未命中。把脚本多跑几次即可提高命中率。
5 changes: 5 additions & 0 deletions examples/llmagent_with_prompt_cache/agent/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
# Tencent is pleased to support the open source community by making tRPC-Agent-Python available.
#
# Copyright (C) 2026 Tencent. All rights reserved.
#
# tRPC-Agent-Python is licensed under Apache-2.0.
Loading
Loading