# DOS.AI API Gateway — Product Plan

Updated: 2026-03-14
Status: Draft
Owner: DOS.AI
Related: `INFERENCESENSE_LIKE_ALPHA_MVP.md`, `API_CONTRACT_V1.md`

## Vision

A unified LLM API gateway at `api.dos.ai` — users call one endpoint, the system
routes to the cheapest/fastest available backend. Like OpenRouter, but with a
self-hosted GPU fleet as first-class capacity (100% margin).

```
User app
   │
   ▼
api.dos.ai (Cloudflare Worker — auth, routing, metering)
   │
   ├─► Self-hosted vLLM fleet (cost $0, margin 100%)
   │     ├── joy-pc: RTX Pro 6000 — Qwen3.5-35B-A3B
   │     ├── node-2: future GPU — model TBD
   │     └── (InferenceSense nodes — spare capacity)
   │
   ├─► Alibaba Cloud (Qwen3.5-flash $0.10/$0.40 per 1M tokens)
   ├─► Google AI (Gemini 3 Flash $0.50/$3.00)
   ├─► OpenAI (GPT-4o-mini $0.15/$0.60)
   └─► Any OpenAI-compatible provider
```

## Why This Works

1. **Self-hosted = unfair advantage**: When the vLLM fleet has capacity, every request
   is pure profit. OpenRouter can't compete on margin for models we self-host.
2. **Paid providers = infinite scale**: When self-hosted is down or full, overflow to
   paid APIs. Users never see downtime.
3. **Already built the hard parts**: vLLM infra, Cloudflare Worker at api.dos.ai,
   DOS.Me auth + credits system, InferenceSense node agent design.
4. **Vietnamese market gap**: No local LLM API gateway. Vietnamese devs use
   OpenRouter (USD, no local payment) or direct provider APIs (fragmented).

## Relationship to InferenceSense

InferenceSense (existing docs) defines how to manage a **fleet of self-hosted GPU
nodes** with spare capacity routing. The API Gateway is the **commercial layer on
top**:

```
┌─────────────────────────────────────────────┐
│ API Gateway (this doc)                      │
│ - User-facing API, auth, billing, routing   │
│ - Model catalog, pricing, rate limits       │
├─────────────────────────────────────────────┤
│ InferenceSense (existing docs)              │
│ - Node agent, heartbeat, spare capacity     │
│ - GPU fleet management, draining            │
├─────────────────────────────────────────────┤
│ Paid Provider Backends                      │
│ - Alibaba, Google, OpenAI, Anthropic        │
│ - OpenAI-compatible API forwarding          │
└─────────────────────────────────────────────┘
```

InferenceSense becomes one of several backends. The API Gateway doesn't replace
it — it wraps it and adds paid fallback + billing.

## Core Architecture

### 1. Gateway (Cloudflare Worker at api.dos.ai)

Already exists. Currently proxies to vLLM. Extend to:

- **Auth**: Validate API key → look up user, tier, credits
- **Route**: Pick backend based on model + availability + cost
- **Meter**: Log tokens in/out, cost, latency, provider used
- **Bill**: Deduct credits from user's DOS.Me account
- **Stream**: Pass through SSE streaming from any backend

### 2. Provider Registry (D1 or KV)

```
providers:
  - id: self-hosted
    base_url: https://inference.dos.ai (or direct vLLM IPs)
    api_key: (internal)
    models: [dos-ai]       # served-model-name in vLLM
    priority: 1            # try first
    cost_input: 0
    cost_output: 0
    max_concurrent: 50
    health_check: GET /v1/models

  - id: alibaba
    base_url: https://dashscope-intl.aliyuncs.com/compatible-mode
    api_key: sk-...
    models: [qwen3.5-flash, qwen3.5-plus, qwen3.5-35b-a3b]
    priority: 2
    cost_input: 0.10       # per 1M tokens
    cost_output: 0.40
    max_concurrent: 100

  - id: google
    base_url: https://generativelanguage.googleapis.com
    api_key: AIza...
    models: [gemini-2.5-flash, gemini-3-flash-preview]
    priority: 3
    cost_input: 0.30
    cost_output: 2.50
    max_concurrent: 100
```
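
In the Worker, a registry entry could be typed roughly as below. This is a sketch: the field names mirror the config above, while the runtime fields (`healthy`, `currentLoad`, `avgLatencyMs`) are assumed to be populated by health checks rather than stored in D1/KV.

```typescript
// Illustrative shape for one registry entry; names mirror the config above.
interface Provider {
  id: string;
  baseUrl: string;
  apiKey: string;
  models: string[];
  priority: number;      // 1 = try first (self-hosted)
  costInput: number;     // USD per 1M input tokens
  costOutput: number;    // USD per 1M output tokens
  maxConcurrent: number;
  // Runtime state, refreshed by health checks (assumption, not stored config):
  healthy: boolean;
  currentLoad: number;
  avgLatencyMs: number;
}

const selfHosted: Provider = {
  id: "self-hosted",
  baseUrl: "https://inference.dos.ai",
  apiKey: "(internal)",
  models: ["dos-ai"],
  priority: 1,
  costInput: 0,
  costOutput: 0,
  maxConcurrent: 50,
  healthy: true,
  currentLoad: 0,
  avgLatencyMs: 0,
};
```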

### 3. Model Catalog (user-facing)

Users request models by alias. Gateway maps to best available backend:

```
model aliases:
  "dos-ai"       → self-hosted Qwen3.5-35B → alibaba qwen3.5-flash
  "qwen-fast"    → alibaba qwen3.5-flash (always paid, fastest)
  "qwen-plus"    → alibaba qwen3.5-plus
  "gemini-flash" → google gemini-2.5-flash
  "auto"         → cheapest available model matching request constraints
```
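
Alias resolution with an ordered fallback chain can be sketched as follows. The table shape and helper names are assumptions, not a final API; in practice the chains would live in KV/D1 alongside the registry.

```typescript
// Each alias maps to an ordered fallback chain of [providerId, providerModel].
const ALIASES: Record<string, Array<[string, string]>> = {
  "dos-ai":       [["self-hosted", "dos-ai"], ["alibaba", "qwen3.5-flash"]],
  "qwen-fast":    [["alibaba", "qwen3.5-flash"]],
  "qwen-plus":    [["alibaba", "qwen3.5-plus"]],
  "gemini-flash": [["google", "gemini-2.5-flash"]],
};

// Resolve an alias to the first entry in its chain whose provider is healthy.
function resolveAlias(
  alias: string,
  healthy: Set<string>, // provider ids currently passing health checks
): [string, string] | undefined {
  return (ALIASES[alias] ?? []).find(([providerId]) => healthy.has(providerId));
}
```

With `self-hosted` down, `"dos-ai"` resolves to the paid `qwen3.5-flash` backend, which is exactly the fallback behavior the catalog describes.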

### 4. Routing Logic

```typescript
// p fields: models, healthy, currentLoad, maxConcurrent, priority, costInput,
// avgLatencyMs — see the Provider Registry above.
function selectProvider(model: string, providers: Provider[]): Provider | undefined {
  const candidates = providers.filter(p =>
    p.models.includes(model) &&
    p.healthy &&
    p.currentLoad < p.maxConcurrent
  );
  // Sort by: priority (self-hosted first), then cost, then latency
  candidates.sort((a, b) =>
    a.priority - b.priority ||
    a.costInput - b.costInput ||
    a.avgLatencyMs - b.avgLatencyMs
  );
  // Try candidates[0]; on failure the caller retries with the next candidate,
  // so the fallback chain is the sorted candidate list itself.
  return candidates[0];
}
```

### 5. Billing Model

```
user_cost = (input_tokens / 1M) * sell_price_input
          + (output_tokens / 1M) * sell_price_output

our_cost  = (input_tokens / 1M) * provider_cost_input
          + (output_tokens / 1M) * provider_cost_output

margin = user_cost - our_cost
```

Sell prices (proposed; ~2-3x markup on paid backends, 100% margin on self-hosted):

| Model alias  | Sell input/1M | Sell output/1M | Self-hosted margin | Paid margin |
|--------------|---------------|----------------|--------------------|-------------|
| dos-ai       | $0.15         | $0.60          | 100%               | ~33%*       |
| qwen-fast    | $0.20         | $0.80          | -                  | 50%         |
| qwen-plus    | $0.50         | $3.00          | -                  | 48%         |
| gemini-flash | $0.50         | $4.00          | -                  | 37%         |

*When self-hosted is down, dos-ai falls back to alibaba qwen3.5-flash at cost.
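
As a worked example of the formulas above (request sizes are hypothetical): a `dos-ai` call with 10,000 input and 2,000 output tokens at the proposed sell prices.

```typescript
// Per-request cost: tokens billed per 1M at the given input/output rates.
function cost(inputTokens: number, outputTokens: number,
              pricePerMIn: number, pricePerMOut: number): number {
  return (inputTokens / 1_000_000) * pricePerMIn
       + (outputTokens / 1_000_000) * pricePerMOut;
}

// dos-ai at sell price ($0.15/$0.60): user pays ≈ $0.0027.
const userCost = cost(10_000, 2_000, 0.15, 0.60);
// Served self-hosted: our cost is 0, so margin = userCost (100%).
const ourCostSelfHosted = cost(10_000, 2_000, 0, 0);
// Same request on the alibaba fallback (qwen3.5-flash $0.10/$0.40): ≈ $0.0018,
// leaving roughly a one-third margin — the "~33%*" row in the table above.
const ourCostFallback = cost(10_000, 2_000, 0.10, 0.40);
```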

### 6. Auth & Credits — Two-Layer Billing Model

The gateway serves two distinct consumer types with different billing models:

```
┌──────────────────────────────────────────────────────────┐
│ Application Layer (request-based billing)                │
│                                                          │
│ DOSafe Telegram bot  → consume_quota() per request       │
│ DOSafe web app       → dosafe_usage per request          │
│ DOS.AI app features  → per-feature billing               │
│                                                          │
│ Auth: INTERNAL_API_KEY (bypass gateway billing)          │
│ Why:  billing already handled upstream per user/tier     │
└──────────────────────┬───────────────────────────────────┘
                       ↓
┌──────────────────────────────────────────────────────────┐
│ api.dos.ai Gateway (token-based billing)                 │
│                                                          │
│ INTERNAL_API_KEY → skip billing + rate limit             │
│ dos_sk_xxx       → deductBalance per token               │
│                                                          │
│ Why: external consumers call gateway directly,           │
│      no application layer to handle billing for them     │
└──────────────────────────────────────────────────────────┘
```

**Rule:** If a product has its own user-facing billing (Telegram quota, web app
quota), it uses `INTERNAL_API_KEY` to skip gateway billing. If a consumer calls
`api.dos.ai` directly (developers, partners), the gateway handles token-based
billing with `dos_sk_xxx` keys.
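
The key dispatch implied by this rule can be sketched as follows. It is a minimal sketch: the `INTERNAL_API_KEY` comparison and the `dos_sk_` prefix mirror the plan, but the function name and the exact lookup/storage are assumptions.

```typescript
type BillingMode = "skip" | "token-billed" | "reject";

// Decide how to bill a request from the presented API key, per the rule above.
function billingMode(apiKey: string, internalKey: string): BillingMode {
  if (apiKey === internalKey) return "skip";                // app layer bills upstream
  if (apiKey.startsWith("dos_sk_")) return "token-billed";  // gateway meters tokens
  return "reject";                                          // unknown key → 401
}
```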

**LLM fallback cost tracking:** When self-hosted vLLM is unavailable,
application-layer services fall back to paid providers (Alibaba Cloud). Fallback
usage is logged as structured JSON (`event: llm_fallback_used`) with token count
and estimated cost, for internal cost monitoring only — user-facing billing stays
request-based.

Leverage DOS.Me existing infrastructure:

- **API keys**: `dosai.api_keys` table (key_hash, user_id, tier, rate_limit)
- **Credits**: DOS.Me credit system (credit_transactions table)
- **Tiers**: Free (rate limited, dos-ai only), Pro (all models, higher limits)
- **Top-up**: VNPay, Stripe (via DOS.Me billing)

### 7. Usage Tracking

```sql
-- dosai.usage_log
CREATE TABLE dosai.usage_log (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    api_key_id UUID REFERENCES dosai.api_keys(id),
    model TEXT NOT NULL,
    provider_id TEXT NOT NULL,        -- which backend served it
    input_tokens INT NOT NULL,
    output_tokens INT NOT NULL,
    latency_ms INT,
    cost_usd NUMERIC(10,6),           -- our cost
    revenue_usd NUMERIC(10,6),        -- what we charged the user
    status TEXT DEFAULT 'success',    -- success/error
    created_at TIMESTAMPTZ DEFAULT now()
);
```

## Phased Rollout

### Phase 0 — Already Done (2026-03-14)

- [x] api.dos.ai Cloudflare Worker (proxy to vLLM)
- [x] Self-hosted vLLM with Qwen3.5-35B-A3B
- [x] DOS.Me auth + credits infrastructure
- [x] Fallback chain in DOSafe entity-check (vLLM → Alibaba Cloud qwen3.5-flash)
- [x] Alibaba Cloud API key provisioned (DashScope International)
- [x] INTERNAL_API_KEY bypass for DOS internal services (skip billing/rate limit)
- [x] Two-layer billing model: application-layer (request-based) + gateway (token-based)
- [x] Fallback usage logging (structured JSON with token count + cost estimate)

### Phase 1 — Internal Gateway (1-2 weeks)

Turn the existing Cloudflare Worker into a proper gateway:

- [ ] Provider registry in D1 (self-hosted + alibaba)
- [ ] Routing logic: try self-hosted → fall back to paid
- [ ] Token counting (char-count estimate for speed; tiktoken-compatible counting later)
- [ ] Usage logging to D1
- [ ] API key auth (INTERNAL_API_KEY for internal, dos_sk_xxx for external)
- [ ] Health check endpoint: `GET /health` returns provider status

Deliverable: DOSafe entity-check uses the api.dos.ai gateway instead of direct
vLLM + hardcoded fallback. Same functionality, centralized routing.

### Phase 2 — Multi-Model + Billing (2-3 weeks)

- [ ] Model catalog endpoint: `GET /v1/models` returns available models + pricing
- [ ] Multiple model aliases (dos-ai, qwen-fast, qwen-plus, gemini-flash)
- [ ] Per-request billing: deduct credits from DOS.Me account
- [ ] Usage dashboard in app.dos.ai (tokens used, cost, by model)
- [ ] Rate limiting per tier (free: 10 RPM, pro: 100 RPM)
- [ ] Streaming support (SSE passthrough)

Deliverable: External users can sign up, get an API key, call multiple models,
and pay with credits.

### Phase 3 — InferenceSense Integration (3-4 weeks)

- [ ] Node agent from InferenceSense docs → register self-hosted nodes
- [ ] Multi-node routing (multiple GPUs, spare capacity)
- [ ] Dynamic model loading (node reports which model it's serving)
- [ ] Draining support (operator reclaim without dropping requests)
- [ ] Node health in `/v1/models` response

Deliverable: Multiple self-hosted GPU nodes contribute capacity. Gateway
routes across the fleet + paid providers seamlessly.

### Phase 4 — Public Launch

- [ ] Pricing page on dos.ai
- [ ] Self-serve API key creation
- [ ] Documentation (OpenAI SDK compatible — just change base_url)
- [ ] VNPay top-up for the Vietnamese market
- [ ] Analytics dashboard (for users)
- [ ] Admin dashboard (for us — revenue, costs, margins)

## Key Design Decisions

### Why Cloudflare Worker (not FastAPI)?

- Already deployed at api.dos.ai
- Edge network = low latency globally
- D1 for structured data, KV for hot config
- Generous free tier (100k requests/day)
- Streaming via ReadableStream works well
- Can always add a FastAPI origin for complex logic later

### Why not just use OpenRouter?

- OpenRouter charges ~15-30% markup
- We have self-hosted GPUs = $0 marginal cost for significant traffic
- Vietnamese market needs local payment (VNPay)
- We control model selection and can optimize for our use cases
- Revenue stays in the ecosystem (DOS.Me credits)

### Token Counting Strategy

For Phase 1, estimate tokens from character count (chars / 4 for English,
chars / 2 for Vietnamese/CJK). Accurate enough for billing at our scale.
Switch to tiktoken if precision matters later.
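
A minimal sketch of that estimate (the non-ASCII heuristic for detecting Vietnamese/CJK-heavy text is an assumption; real tokenizers vary by model):

```typescript
// Rough token estimate from character count: ~4 chars/token for ASCII-heavy
// text, ~2 chars/token for Vietnamese/CJK-heavy text. Billing heuristic only.
function estimateTokens(text: string): number {
  let nonAscii = 0;
  for (const ch of text) {
    // Count code points outside basic ASCII as a proxy for Vietnamese/CJK.
    if (ch.codePointAt(0)! > 0x7f) nonAscii++;
  }
  const total = [...text].length; // code points, not UTF-16 units
  const divisor = nonAscii > total / 2 ? 2 : 4;
  return Math.ceil(total / divisor);
}
```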

### Pricing Philosophy

- **Self-hosted models**: Price at ~50% of the cheapest paid alternative.
  Users get a discount, we get 100% margin. Win-win.
- **Paid models**: 2-3x markup. Competitive with OpenRouter.
- **Free tier**: Rate-limited access to the dos-ai model only. Acquisition funnel.

## Technical Notes

### Existing Infrastructure to Reuse

| Component | Location | Purpose |
|-----------|----------|---------|
| Cloudflare Worker | `api.dos.ai` (DOS-AI repo) | Gateway, already proxies to vLLM |
| D1 Database | Cloudflare | Request logging, API keys |
| vLLM | joy-pc RTX Pro 6000 | Self-hosted inference |
| DOS.Me API | `api-v2.dos.me` | User auth, credits, billing |
| Supabase | `gulptwduchsjcsbndmua` | Usage data (dosai schema) |

### Request Flow (Phase 1)

```
1. User → POST api.dos.ai/v1/chat/completions
2. Worker validates API key (D1 lookup)
3. Worker checks provider health (KV cache, 30s TTL)
4. Worker selects provider (priority order)
5. Worker forwards request to provider
6. Provider responds (streaming or batch)
7. Worker logs usage to D1
8. Worker returns response to user
```

### Streaming Architecture

```
User ←SSE── Worker ←SSE── Provider
              │
              └── (buffer last chunk for token count, then log)
```

Count tokens from the final `usage` field in the provider response.
If the provider doesn't return usage (some don't for streaming), estimate
from accumulated content length.
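
Pulling that final `usage` object out of the buffered stream can be sketched as follows. This assumes OpenAI-style SSE framing (`data: {...}` lines, terminated by `data: [DONE]`); anything beyond that framing is an assumption.

```typescript
// Scan buffered SSE text for the last data: chunk carrying a usage object
// (OpenAI-style streams put usage in the final chunk when requested).
function extractUsage(
  sseText: string,
): { prompt_tokens: number; completion_tokens: number } | null {
  let usage: { prompt_tokens: number; completion_tokens: number } | null = null;
  for (const line of sseText.split("\n")) {
    if (!line.startsWith("data: ") || line === "data: [DONE]") continue;
    try {
      const chunk = JSON.parse(line.slice("data: ".length));
      if (chunk.usage) usage = chunk.usage;
    } catch {
      // Ignore partial or non-JSON keep-alive lines.
    }
  }
  return usage; // null → caller falls back to the char-count estimate
}
```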