Commit 55000eb

JOYclaude committed:

docs: add two-layer billing model to API Gateway product plan

Document INTERNAL_API_KEY bypass for internal services vs dos_sk_xxx token-based billing for external consumers. Update Phase 0 checklist.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 2286e4b commit 55000eb

1 file changed: API_GATEWAY_PRODUCT_PLAN.md (345 additions, 0 deletions)

# DOS.AI API Gateway — Product Plan

Updated: 2026-03-14
Status: Draft
Owner: DOS.AI
Related: `INFERENCESENSE_LIKE_ALPHA_MVP.md`, `API_CONTRACT_V1.md`

## Vision

A unified LLM API gateway at `api.dos.ai` — users call one endpoint, the system
routes to the cheapest/fastest available backend. Like OpenRouter, but with a
self-hosted GPU fleet as first-class capacity (100% margin).

```
User app
    │
    ▼
api.dos.ai (Cloudflare Worker — auth, routing, metering)
    │
    ├─► Self-hosted vLLM fleet (cost $0, margin 100%)
    │     ├── joy-pc: RTX Pro 6000 — Qwen3.5-35B-A3B
    │     ├── node-2: future GPU — model TBD
    │     └── (InferenceSense nodes — spare capacity)
    │
    ├─► Alibaba Cloud (Qwen3.5-flash $0.10/$0.40 per 1M tokens)
    ├─► Google AI (Gemini 3 Flash $0.50/$3.00)
    ├─► OpenAI (GPT-4o-mini $0.15/$0.60)
    └─► Any OpenAI-compatible provider
```

## Why This Works

1. **Self-hosted = unfair advantage**: When the vLLM fleet has capacity, every
   request is pure profit. OpenRouter can't compete on margin for models we
   self-host.
2. **Paid providers = infinite scale**: When self-hosted is down or full,
   requests overflow to paid APIs. Users never see downtime.
3. **Already built the hard parts**: vLLM infra, Cloudflare Worker at
   api.dos.ai, DOS.Me auth + credits system, InferenceSense node agent design.
4. **Vietnamese market gap**: No local LLM API gateway exists. Vietnamese devs
   use OpenRouter (USD, no local payment) or direct provider APIs (fragmented).

## Relationship to InferenceSense

InferenceSense (existing docs) defines how to manage a **fleet of self-hosted
GPU nodes** with spare capacity routing. The API Gateway is the **commercial
layer on top**:

```
┌─────────────────────────────────────────────┐
│ API Gateway (this doc)                      │
│ - User-facing API, auth, billing, routing   │
│ - Model catalog, pricing, rate limits       │
├─────────────────────────────────────────────┤
│ InferenceSense (existing docs)              │
│ - Node agent, heartbeat, spare capacity     │
│ - GPU fleet management, draining            │
├─────────────────────────────────────────────┤
│ Paid Provider Backends                      │
│ - Alibaba, Google, OpenAI, Anthropic        │
│ - OpenAI-compatible API forwarding          │
└─────────────────────────────────────────────┘
```

InferenceSense becomes one of several backends. The API Gateway doesn't replace
it — it wraps it and adds paid fallback + billing.

## Core Architecture

### 1. Gateway (Cloudflare Worker at api.dos.ai)

Already exists. Currently proxies to vLLM. Extend to:

- **Auth**: Validate API key → look up user, tier, credits
- **Route**: Pick backend based on model + availability + cost
- **Meter**: Log tokens in/out, cost, latency, provider used
- **Bill**: Deduct credits from user's DOS.Me account
- **Stream**: Pass through SSE streaming from any backend

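A minimal sketch of the extended entrypoint, assuming hypothetical helpers
(`lookupUser`, `selectProviders`) and an `Env` binding shape that is
illustrative, not final:

```ts
// Sketch only: helper names and Env bindings are assumptions, not final APIs.
// Provider rows come from the registry in section 2; ordering from section 4.
interface Env { DB: D1Database; INTERNAL_API_KEY: string }
type Provider = { base_url: string; api_key: string };
declare function lookupUser(env: Env, apiKey: string): Promise<{ id: string } | null>;
declare function selectProviders(model: string): Provider[];

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Auth: resolve the API key to a user (D1 lookup).
    const key = request.headers.get("Authorization")?.replace("Bearer ", "") ?? "";
    const user = await lookupUser(env, key);
    if (!user) return new Response("Unauthorized", { status: 401 });

    // Route: try candidates in fallback order (section 4).
    const body = (await request.json()) as { model: string };
    for (const p of selectProviders(body.model)) {
      const upstream = await fetch(`${p.base_url}/v1/chat/completions`, {
        method: "POST",
        headers: { "Content-Type": "application/json", Authorization: `Bearer ${p.api_key}` },
        body: JSON.stringify(body),
      });
      if (!upstream.ok) continue; // dead backend → next candidate

      // Meter + bill once the response completes (sections 5 and 7).
      return new Response(upstream.body, { headers: upstream.headers });
    }
    return new Response("No backend available", { status: 503 });
  },
};
```
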
### 2. Provider Registry (D1 or KV)

```
providers:
  - id: self-hosted
    base_url: https://inference.dos.ai (or direct vLLM IPs)
    api_key: (internal)
    models: [dos-ai]          # served-model-name in vLLM
    priority: 1               # try first
    cost_input: 0
    cost_output: 0
    max_concurrent: 50
    health_check: GET /v1/models

  - id: alibaba
    base_url: https://dashscope-intl.aliyuncs.com/compatible-mode
    api_key: sk-...
    models: [qwen3.5-flash, qwen3.5-plus, qwen3.5-35b-a3b]
    priority: 2
    cost_input: 0.10          # per 1M tokens
    cost_output: 0.40
    max_concurrent: 100

  - id: google
    base_url: https://generativelanguage.googleapis.com
    api_key: AIza...
    models: [gemini-2.5-flash, gemini-3-flash-preview]
    priority: 3
    cost_input: 0.30
    cost_output: 2.50
    max_concurrent: 100
```

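One possible row shape and loader for the D1-backed variant, a sketch assuming
a `providers` table that mirrors the fields above (table and column names are
illustrative):

```ts
// Illustrative shape for registry rows; mirrors the config fields above.
interface ProviderRow {
  id: string;
  base_url: string;
  api_key: string;
  models: string[];       // stored as JSON text in D1, parsed on load
  priority: number;       // 1 = try first
  cost_input: number;     // USD per 1M tokens
  cost_output: number;
  max_concurrent: number;
}

// Load the registry from a hypothetical D1 `providers` table.
async function loadProviders(db: D1Database): Promise<ProviderRow[]> {
  const { results } = await db.prepare("SELECT * FROM providers").all();
  return results.map((r: any) => ({ ...r, models: JSON.parse(r.models) }));
}
```
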
### 3. Model Catalog (user-facing)

Users request models by alias; the gateway maps each alias to the best
available backend:

```
model aliases:
  "dos-ai"       → self-hosted Qwen3.5-35B → alibaba qwen3.5-flash
  "qwen-fast"    → alibaba qwen3.5-flash (always paid, fastest)
  "qwen-plus"    → alibaba qwen3.5-plus
  "gemini-flash" → google gemini-2.5-flash
  "auto"         → cheapest available model matching request constraints
```

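In code, the catalog can be a static map from alias to an ordered fallback
chain; a sketch (the `ModelTarget` shape is an assumption):

```ts
// Alias → ordered (provider, model) fallback chain. First entry is preferred.
type ModelTarget = { provider: string; model: string };

const CATALOG: Record<string, ModelTarget[]> = {
  "dos-ai": [
    { provider: "self-hosted", model: "dos-ai" },
    { provider: "alibaba", model: "qwen3.5-flash" }, // paid fallback
  ],
  "qwen-fast": [{ provider: "alibaba", model: "qwen3.5-flash" }],
  "qwen-plus": [{ provider: "alibaba", model: "qwen3.5-plus" }],
  "gemini-flash": [{ provider: "google", model: "gemini-2.5-flash" }],
  // "auto" is resolved dynamically: cheapest healthy model (see section 4).
};
```
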
### 4. Routing Logic

```ts
// `providers`: in-memory registry rows (section 2) plus runtime health fields.
declare const providers: Candidate[];
type Candidate = ProviderRow & { healthy: boolean; current_load: number; avg_latency_ms: number };

function selectProviders(model: string): Candidate[] {
  const candidates = providers.filter(p =>
    p.models.includes(model) && p.healthy && p.current_load < p.max_concurrent
  );
  // Sort by: priority (self-hosted first), then cost, then latency.
  candidates.sort((a, b) =>
    a.priority - b.priority || a.cost_input - b.cost_input || a.avg_latency_ms - b.avg_latency_ms
  );
  return candidates; // [0] is tried first; the rest are the explicit fallback chain
}
```

### 5. Billing Model

```
user_cost = (input_tokens / 1M) * sell_price_input
          + (output_tokens / 1M) * sell_price_output

our_cost  = (input_tokens / 1M) * provider_cost_input
          + (output_tokens / 1M) * provider_cost_output

margin = user_cost - our_cost
```

Sell prices (proposed, ~2-3x markup on paid, 100% on self-hosted):

| Model alias  | Sell input/1M | Sell output/1M | Self-hosted margin | Paid margin |
|--------------|---------------|----------------|--------------------|-------------|
| dos-ai       | $0.15         | $0.60          | 100%               | ~33%*       |
| qwen-fast    | $0.20         | $0.80          | -                  | 50%         |
| qwen-plus    | $0.50         | $3.00          | -                  | 48%         |
| gemini-flash | $0.50         | $4.00          | -                  | 37%         |

*When self-hosted is down, dos-ai falls back to alibaba qwen3.5-flash at cost.

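For concreteness, the formulas above in code, with a worked qwen-fast example
using the prices from the table (the function just restates the arithmetic):

```ts
// Prices are USD per 1M tokens, as in the table above.
function requestMargin(
  inputTokens: number, outputTokens: number,
  sell: { input: number; output: number },
  cost: { input: number; output: number },
) {
  const userCost = (inputTokens / 1e6) * sell.input + (outputTokens / 1e6) * sell.output;
  const ourCost  = (inputTokens / 1e6) * cost.input + (outputTokens / 1e6) * cost.output;
  return { userCost, ourCost, margin: userCost - ourCost };
}

// qwen-fast, 1M in / 1M out: user pays $0.20 + $0.80 = $1.00, we pay
// $0.10 + $0.40 = $0.50 → margin $0.50, i.e. the 50% in the table.
requestMargin(1e6, 1e6, { input: 0.20, output: 0.80 }, { input: 0.10, output: 0.40 });
```
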
### 6. Auth & Credits — Two-Layer Billing Model

The gateway serves two distinct consumer types with different billing models:

```
┌──────────────────────────────────────────────────────────┐
│ Application Layer (request-based billing)                │
│                                                          │
│   DOSafe Telegram bot → consume_quota() per request      │
│   DOSafe web app      → dosafe_usage per request         │
│   DOS.AI app features → per-feature billing              │
│                                                          │
│   Auth: INTERNAL_API_KEY (bypass gateway billing)        │
│   Why:  billing already handled upstream per user/tier   │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│ api.dos.ai Gateway (token-based billing)                 │
│                                                          │
│   INTERNAL_API_KEY → skip billing + rate limit           │
│   dos_sk_xxx       → deductBalance per token             │
│                                                          │
│   Why: external consumers call gateway directly,         │
│        no application layer to handle billing for them   │
└──────────────────────────────────────────────────────────┘
```

**Rule:** If a product has its own user-facing billing (Telegram quota, web app quota), it uses `INTERNAL_API_KEY` to skip gateway billing. If a consumer calls `api.dos.ai` directly (developers, partners), the gateway handles token-based billing with `dos_sk_xxx` keys.
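
A minimal sketch of that branch at the top of the Worker; the `sha256` helper
and the `api_keys` lookup are assumptions about names, not final code:

```ts
// Hypothetical key check: the internal key bypasses billing; dos_sk_ keys are
// hashed and looked up for token-based billing. Names are illustrative.
declare function sha256(s: string): Promise<string>;

async function classifyKey(env: Env, apiKey: string): Promise<"internal" | "external" | null> {
  if (apiKey === env.INTERNAL_API_KEY) return "internal"; // skip billing + rate limit
  if (apiKey.startsWith("dos_sk_")) {
    const row = await env.DB
      .prepare("SELECT id FROM api_keys WHERE key_hash = ?")
      .bind(await sha256(apiKey))
      .first();
    return row ? "external" : null; // external → per-token billing applies
  }
  return null; // unknown key format
}
```
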
**LLM fallback cost tracking:** When self-hosted vLLM is unavailable, application-layer services fall back to paid providers (Alibaba Cloud). Fallback usage is logged as structured JSON (`event: llm_fallback_used`) with token count and estimated cost, for internal cost monitoring only — user-facing billing stays request-based.
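
For illustration, one such log line might look like the following; the field
names beyond `event` are assumptions:

```json
{"event": "llm_fallback_used", "provider": "alibaba", "model": "qwen3.5-flash",
 "input_tokens": 412, "output_tokens": 186, "est_cost_usd": 0.000116}
```
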
Leverage existing DOS.Me infrastructure:

- **API keys**: `dosai.api_keys` table (key_hash, user_id, tier, rate_limit)
- **Credits**: DOS.Me credit system (credit_transactions table)
- **Tiers**: Free (rate limited, dos-ai only), Pro (all models, higher limits)
- **Top-up**: VNPay, Stripe (via DOS.Me billing)

### 7. Usage Tracking

```sql
-- dosai.usage_log
CREATE TABLE dosai.usage_log (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  api_key_id    UUID REFERENCES dosai.api_keys(id),
  model         TEXT NOT NULL,
  provider_id   TEXT NOT NULL,          -- which backend served it
  input_tokens  INT NOT NULL,
  output_tokens INT NOT NULL,
  latency_ms    INT,
  cost_usd      NUMERIC(10,6),          -- our cost
  revenue_usd   NUMERIC(10,6),          -- what we charged the user
  status        TEXT DEFAULT 'success', -- success/error
  created_at    TIMESTAMPTZ DEFAULT now()
);
```

## Phased Rollout

### Phase 0 — Already Done (2026-03-14)

- [x] api.dos.ai Cloudflare Worker (proxy to vLLM)
- [x] Self-hosted vLLM with Qwen3.5-35B-A3B
- [x] DOS.Me auth + credits infrastructure
- [x] Fallback chain in DOSafe entity-check (vLLM → Alibaba Cloud qwen3.5-flash)
- [x] Alibaba Cloud API key provisioned (DashScope International)
- [x] INTERNAL_API_KEY bypass for DOS internal services (skip billing/rate limit)
- [x] Two-layer billing model: application-layer (request-based) + gateway (token-based)
- [x] Fallback usage logging (structured JSON with token count + cost estimate)

### Phase 1 — Internal Gateway (1-2 weeks)

Turn the existing Cloudflare Worker into a proper gateway:

- [ ] Provider registry in D1 (self-hosted + alibaba)
- [ ] Routing logic: try self-hosted → fall back to paid
- [ ] Token counting (tiktoken-compatible, estimate from char count for speed)
- [ ] Usage logging to D1
- [ ] API key auth (INTERNAL_API_KEY for internal, dos_sk_xxx for external)
- [ ] Health check endpoint: `GET /health` returns provider status

Deliverable: DOSafe entity-check uses the api.dos.ai gateway instead of direct
vLLM + hardcoded fallback. Same functionality, centralized routing.

### Phase 2 — Multi-Model + Billing (2-3 weeks)

- [ ] Model catalog endpoint: `GET /v1/models` returns available models + pricing
- [ ] Multiple model aliases (dos-ai, qwen-fast, qwen-plus, gemini-flash)
- [ ] Per-request billing: deduct credits from DOS.Me account
- [ ] Usage dashboard in app.dos.ai (tokens used, cost, by model)
- [ ] Rate limiting per tier (free: 10 RPM, pro: 100 RPM)
- [ ] Streaming support (SSE passthrough)

Deliverable: External users can sign up, get an API key, call multiple models,
and pay with credits.

### Phase 3 — InferenceSense Integration (3-4 weeks)

- [ ] Node agent from InferenceSense docs → register self-hosted nodes
- [ ] Multi-node routing (multiple GPUs, spare capacity)
- [ ] Dynamic model loading (node reports which model it's serving)
- [ ] Draining support (operator reclaim without dropping requests)
- [ ] Node health in `/v1/models` response

Deliverable: Multiple self-hosted GPU nodes contribute capacity. Gateway
routes across fleet + paid providers seamlessly.

### Phase 4 — Public Launch

- [ ] Pricing page on dos.ai
- [ ] Self-serve API key creation
- [ ] Documentation (OpenAI SDK compatible — just change base_url; see the sketch below)
- [ ] VNPay top-up for Vietnamese market
- [ ] Analytics dashboard (for users)
- [ ] Admin dashboard (for us — revenue, costs, margins)

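What "just change base_url" means for a consumer, sketched with the OpenAI
Node SDK (the `dos_sk_...` key is a placeholder):

```ts
import OpenAI from "openai";

// Point the stock OpenAI SDK at the gateway; everything else stays the same.
const client = new OpenAI({
  baseURL: "https://api.dos.ai/v1",
  apiKey: "dos_sk_...", // placeholder, issued via self-serve key creation
});

const res = await client.chat.completions.create({
  model: "dos-ai",
  messages: [{ role: "user", content: "Xin chào!" }],
});
console.log(res.choices[0].message.content);
```
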
## Key Design Decisions

### Why Cloudflare Worker (not FastAPI)?

- Already deployed at api.dos.ai
- Edge network = low latency globally
- D1 for structured data, KV for hot config
- Generous free tier (100k requests/day)
- Streaming via ReadableStream works well
- Can always add a FastAPI origin for complex logic later

### Why not just use OpenRouter?

- OpenRouter charges ~15-30% markup
- We have self-hosted GPUs = $0 cost for significant traffic
- Vietnamese market needs local payment (VNPay)
- We control model selection and can optimize for our use cases
- Revenue stays in the ecosystem (DOS.Me credits)

### Token Counting Strategy

For Phase 1, estimate tokens from character count (chars / 4 for English,
chars / 2 for Vietnamese/CJK). Accurate enough for billing at our scale.
Switch to tiktoken if precision matters later.

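A rough estimator along those lines; the non-ASCII check for detecting dense
scripts is an assumption, not a spec:

```ts
// Phase 1 heuristic: ~4 chars/token for Latin text, ~2 for Vietnamese/CJK.
// Treating any non-ASCII character as "dense" is a crude, illustrative proxy.
function estimateTokens(text: string): number {
  const dense = (text.match(/[^\x00-\x7F]/g) ?? []).length;
  const latin = text.length - dense;
  return Math.ceil(latin / 4 + dense / 2);
}
```
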
### Pricing Philosophy

- **Self-hosted models**: Price at ~50% of the cheapest paid alternative.
  Users get a discount, we get 100% margin. Win-win.
- **Paid models**: 2-3x markup. Competitive with OpenRouter.
- **Free tier**: Rate-limited access to the dos-ai model only. Acquisition funnel.

## Technical Notes

### Existing Infrastructure to Reuse

| Component         | Location                   | Purpose                          |
|-------------------|----------------------------|----------------------------------|
| Cloudflare Worker | `api.dos.ai` (DOS-AI repo) | Gateway, already proxies to vLLM |
| D1 Database       | Cloudflare                 | Request logging, API keys        |
| vLLM              | joy-pc RTX Pro 6000        | Self-hosted inference            |
| DOS.Me API        | `api-v2.dos.me`            | User auth, credits, billing      |
| Supabase          | `gulptwduchsjcsbndmua`     | Usage data (dosai schema)        |

### Request Flow (Phase 1)

```
1. User → POST api.dos.ai/v1/chat/completions
2. Worker validates API key (D1 lookup)
3. Worker checks provider health (KV cache, 30s TTL)
4. Worker selects provider (priority order)
5. Worker forwards request to provider
6. Provider responds (streaming or batch)
7. Worker logs usage to D1
8. Worker returns response to user
```

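A sketch of the health cache in step 3. Cloudflare KV's `expirationTtl` has a
60-second minimum, so the 30s freshness window is enforced by a timestamp in
the stored value; the key format and `probe` helper are assumptions:

```ts
// Health entries live in KV; freshness is checked app-side to get a 30s TTL.
interface HealthEntry { healthy: boolean; checkedAt: number }
declare function probe(providerId: string): Promise<boolean>; // e.g. GET /v1/models

async function isHealthy(kv: KVNamespace, providerId: string): Promise<boolean> {
  const cached = await kv.get<HealthEntry>(`health:${providerId}`, "json");
  if (cached && Date.now() - cached.checkedAt < 30_000) return cached.healthy;

  // Stale or missing → re-probe and cache the result.
  const healthy = await probe(providerId);
  await kv.put(`health:${providerId}`, JSON.stringify({ healthy, checkedAt: Date.now() }));
  return healthy;
}
```
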
### Streaming Architecture

```
User ←SSE── Worker ←SSE── Provider
              └── (buffer last chunk for token count, then log)
```

Count tokens from the final `usage` field in the provider response. If the
provider doesn't return usage (some don't for streaming), estimate from the
accumulated content length.

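A passthrough-with-metering sketch using `ReadableStream.tee()`; the logging
shape is illustrative, and in practice the metering branch would feed the
Phase 1 token estimator:

```ts
// Tee the upstream SSE body: one branch streams to the user, the other is
// drained to measure content length for a token estimate.
async function streamAndMeter(upstream: Response, ctx: ExecutionContext): Promise<Response> {
  const [toUser, toMeter] = upstream.body!.tee();

  ctx.waitUntil((async () => {
    let bytes = 0;
    const reader = toMeter.getReader();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      bytes += value.byteLength; // rough proxy for accumulated content length
    }
    console.log(JSON.stringify({ event: "usage_estimate", est_tokens: Math.ceil(bytes / 4) }));
  })());

  return new Response(toUser, { headers: upstream.headers });
}
```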
