Commit b3e191a

JOY and claude committed
docs: add entity_scores materialized cache design + implementation plan
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a13906a commit b3e191a

2 files changed

Lines changed: 773 additions & 0 deletions

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
# Entity Scores Materialized Cache

**Date:** 2026-03-10
**Status:** Approved

## Problem

Every API request (url-check, entity-check) re-computes risk scores from scratch:
- Queries threat_intel DB (free, fast)
- Calls 5+ external APIs: Safe Browsing, WHOIS, on-chain EAS, DOS.Me, web search (paid/slow)
- Runs LLM analysis on web results (compute-heavy)

Total latency: up to 30s. External APIs cost money per call. Checking the same entity multiple times wastes both time and money.

## Solution

Add a `dosafe.entity_scores` table as a materialized cache layer. It stores the computed score alongside the external API results, with a separate TTL for each source. Subsequent lookups return the cached score in < 5ms, and data is refreshed only when it is stale.

## Design Principles

1. **Internal DB queries are free**: always query threat_intel fresh
2. **External APIs cost money**: cache aggressively with long TTLs
3. **Different sources change at different rates**: separate TTLs per source
4. **Extension needs < 50ms response**: single-row DB lookup, no external calls
5. **Score quality must stay high**: re-compute when underlying data changes

## Database Schema

```sql
CREATE TABLE dosafe.entity_scores (
  entity_type       TEXT NOT NULL,
  entity_hash       TEXT NOT NULL,
  entity_id         TEXT NOT NULL,

  -- Computed score
  risk_score        SMALLINT NOT NULL,
  risk_level        TEXT NOT NULL,      -- safe/low/medium/high/critical
  confidence        TEXT NOT NULL,      -- low/medium/high
  signals           TEXT[] NOT NULL DEFAULT '{}',

  -- Expensive cached results
  llm_summary       TEXT,
  web_analysis      JSONB,              -- { webResults, llmAnalysis }
  external_results  JSONB NOT NULL DEFAULT '{}',
  -- {
  --   safeBrowsing: { threats, checkedAt },
  --   whois: { ageDays, registrar, checkedAt },
  --   onChain: { attestationCount, latestRiskScore, checkedAt },
  --   dosMe: { found, trustScore, isFlagged, checkedAt },
  --   webSearch: { resultCount, checkedAt }
  -- }

  -- Change detection
  threat_intel_hash TEXT,               -- MD5 of sorted entry hashes+scores

  -- Timestamps
  computed_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  expires_at        TIMESTAMPTZ NOT NULL,

  PRIMARY KEY (entity_type, entity_hash)
);

CREATE INDEX idx_entity_scores_expires ON dosafe.entity_scores (expires_at);
```

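On the application side, the row could be mirrored as a TypeScript type. The following is a minimal sketch, not the actual type from the codebase: the camelCase field names, the ISO-8601 string format for `checkedAt`, the `string[]` shape of `threats`, and the `isRiskLevel` helper are all assumptions made for illustration.

```typescript
// Hypothetical TypeScript mirror of dosafe.entity_scores (all names are assumptions).
const RISK_LEVELS = ["safe", "low", "medium", "high", "critical"] as const;
type RiskLevel = (typeof RISK_LEVELS)[number];
type Confidence = "low" | "medium" | "high";

// Each external source records when it was last checked (assumed ISO-8601 string).
interface ExternalResults {
  safeBrowsing?: { threats: string[]; checkedAt: string };
  whois?: { ageDays: number; registrar: string; checkedAt: string };
  onChain?: { attestationCount: number; latestRiskScore: number; checkedAt: string };
  dosMe?: { found: boolean; trustScore: number; isFlagged: boolean; checkedAt: string };
  webSearch?: { resultCount: number; checkedAt: string };
}

interface EntityScoreRow {
  entityType: string;
  entityHash: string;
  entityId: string;
  riskScore: number;
  riskLevel: RiskLevel;
  confidence: Confidence;
  signals: string[];
  llmSummary: string | null;
  webAnalysis: Record<string, unknown> | null;
  externalResults: ExternalResults;
  threatIntelHash: string | null;
  computedAt: Date;
  expiresAt: Date;
}

// Runtime guard for the free-text risk_level column.
function isRiskLevel(value: string): value is RiskLevel {
  return (RISK_LEVELS as readonly string[]).indexOf(value) !== -1;
}
```

A guard like `isRiskLevel` is useful because `risk_level` is stored as plain `TEXT`, so the database itself does not reject unexpected values.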
## External API TTLs

| Source | TTL | Rationale |
|--------|-----|-----------|
| Safe Browsing | 7 days | Rarely changes for same URL |
| WHOIS | 30 days | Domain age is static |
| On-chain EAS | 24h | New attestations are rare |
| DOS.Me Identity | 24h | Profile changes infrequently |
| Web search | 7 days | Paid API (Serper/SerpApi); future: free via SearXNG |
| LLM analysis | 24h | Re-analyze when new threat data arrives |

Score TTL: **6 hours**. After that, the score is re-computed from fresh DB data plus cached external results.

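The TTL table lends itself to a small lookup structure plus a helper that reports which sources need re-fetching. This is a sketch under assumptions: the constant and function names are hypothetical, `checkedAt` values are assumed to be ISO-8601 strings, and tracking a `checkedAt` for LLM analysis is assumed (the schema comment lists it only for the five external sources).

```typescript
// Per-source TTLs from the table above, in milliseconds (names are hypothetical).
const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

const SOURCE_TTL_MS = {
  safeBrowsing: 7 * DAY_MS,
  whois: 30 * DAY_MS,
  onChain: 24 * HOUR_MS,
  dosMe: 24 * HOUR_MS,
  webSearch: 7 * DAY_MS,
  llmAnalysis: 24 * HOUR_MS,
} as const;

// TTL of the computed score itself.
const SCORE_TTL_MS = 6 * HOUR_MS;

type Source = keyof typeof SOURCE_TTL_MS;
// Last check time per source, as ISO-8601 strings (assumed format).
type CheckedAtBySource = Partial<Record<Source, string>>;

// Returns the sources whose cached result is missing or at least as old as its TTL.
function expiredSources(checkedAt: CheckedAtBySource, now: Date): Source[] {
  return (Object.keys(SOURCE_TTL_MS) as Source[]).filter((source) => {
    const last = checkedAt[source];
    if (last === undefined) return true; // never fetched
    return now.getTime() - Date.parse(last) >= SOURCE_TTL_MS[source];
  });
}
```

Keeping the TTLs in one object means a future change, such as lowering the web search TTL, touches a single line.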
## Cache Lookup Flow

```
Request
  │
  ├─ 1. Query entity_scores WHERE entity_type + entity_hash
  │     ├─ HIT + expires_at > NOW() → return cached (< 5ms)
  │     ├─ HIT + expired → step 2 (smart refresh)
  │     └─ MISS → step 3 (full compute)
  │
  ├─ 2. Smart Refresh
  │     ├─ Query threat_intel → compute threat_intel_hash
  │     ├─ Compare hash with cached
  │     │     ├─ SAME + all external TTLs fresh
  │     │     │     → bump expires_at +6h, return same score
  │     │     ├─ DIFFERENT (DB changed) + external fresh
  │     │     │     → re-compute score from new DB + cached external_results
  │     │     └─ External TTL expired
  │     │           → call ONLY expired external APIs, re-compute
  │     └─ UPSERT entity_scores
  │
  └─ 3. Full Compute (first time ever)
        ├─ Query threat_intel (free)
        ├─ Parallel: SB + WHOIS + on-chain + DOS.Me + web search
        ├─ Sequential: LLM analysis (after web results)
        ├─ computeRiskScoreV2()
        ├─ INSERT entity_scores
        └─ return score
```

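The branching in the flow above can be expressed as one pure decision function, which makes each branch easy to unit test. This is a sketch only: `decideCacheAction`, `CacheAction`, and `CachedState` are hypothetical names, and the current hash is passed as a thunk on the assumption that threat_intel should be queried only once a cached row has expired.

```typescript
// Possible outcomes of a cache lookup (hypothetical names).
type CacheAction =
  | { kind: "return_cached" }                  // step 1: fresh hit
  | { kind: "extend_ttl" }                     // step 2: nothing changed, bump expires_at
  | { kind: "recompute"; refetch: string[] }   // step 2: re-score, refetching only expired sources
  | { kind: "full_compute" };                  // step 3: no cached row

interface CachedState {
  expiresAt: Date;
  threatIntelHash: string | null;
}

function decideCacheAction(
  cached: CachedState | null,
  currentThreatIntelHash: () => string, // evaluated only when the cached row has expired
  expiredExternalSources: string[],
  now: Date,
): CacheAction {
  // MISS: nothing cached, compute everything.
  if (cached === null) return { kind: "full_compute" };

  // HIT and still fresh: return the cached score as-is.
  if (cached.expiresAt.getTime() > now.getTime()) return { kind: "return_cached" };

  // HIT but expired: smart refresh.
  const hashUnchanged = cached.threatIntelHash === currentThreatIntelHash();
  if (hashUnchanged && expiredExternalSources.length === 0) {
    return { kind: "extend_ttl" };
  }
  // Either the DB changed or some external TTL lapsed; refetch only what expired.
  return { kind: "recompute", refetch: expiredExternalSources };
}
```

When the hash differs but every external result is still fresh, `refetch` is empty and the score is re-computed purely from new DB data and cached external results.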
## Consumer Integration

### Extension Fast Path
- Query `entity_scores` only (1 row)
- If MISS → fallback: query threat_intel + computeRiskScoreV2() (no external APIs)
- NEVER calls external APIs

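A sketch of the fast path with its two data sources injected as functions, so that the absence of external calls is visible in the signature. The names `extensionFastPath`, `lookupCached`, and `scoreFromThreatIntel` are hypothetical. The design does not say whether an expired row counts as a HIT on this path; the sketch assumes any existing row is returned.

```typescript
// Dependencies are injected; neither is allowed to call an external API.
interface FastPathDeps<Score> {
  // Single-row lookup in entity_scores; resolves to null on MISS.
  lookupCached: (entityType: string, entityHash: string) => Promise<Score | null>;
  // Fallback: query threat_intel and score it locally (e.g. via computeRiskScoreV2()).
  scoreFromThreatIntel: (entityType: string, entityHash: string) => Promise<Score>;
}

async function extensionFastPath<Score>(
  deps: FastPathDeps<Score>,
  entityType: string,
  entityHash: string,
): Promise<{ score: Score; source: "cache" | "threat_intel_only" }> {
  const cached = await deps.lookupCached(entityType, entityHash);
  if (cached !== null) {
    return { score: cached, source: "cache" };
  }
  // MISS: score from internal data only, never from external APIs.
  const score = await deps.scoreFromThreatIntel(entityType, entityHash);
  return { score, source: "threat_intel_only" };
}
```

Returning the `source` alongside the score lets a caller tell a full cached score apart from a threat_intel-only fallback.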
### Full Path (bot/app/web)
- Query `entity_scores`
- HIT + fresh → return cached
- Stale/MISS → smart refresh (only call expired externals)

### Bulk Endpoint
- Batch query `entity_scores` for up to 50 entities
- Return cached results
- Queue MISSes for background refresh (phase 2)

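The bulk behaviour amounts to splitting the requested entities into cached rows and MISSes. A minimal sketch, assuming the batch query has already run and returned `cachedRows`; `partitionBulk`, `EntityKey`, and `MAX_BULK_ENTITIES` are hypothetical names, and rejecting oversized batches with an error is one possible policy, not something the design specifies.

```typescript
const MAX_BULK_ENTITIES = 50; // limit taken from the list above

interface EntityKey {
  entityType: string;
  entityHash: string;
}

// Builds an unambiguous map key from the composite primary key.
function keyOf(key: EntityKey): string {
  return JSON.stringify([key.entityType, key.entityHash]);
}

// Splits a request into rows found in the cache and keys to queue for refresh.
function partitionBulk<Row extends EntityKey>(
  requested: EntityKey[],
  cachedRows: Row[],
): { hits: Row[]; misses: EntityKey[] } {
  if (requested.length > MAX_BULK_ENTITIES) {
    throw new RangeError(`at most ${MAX_BULK_ENTITIES} entities per request`);
  }
  const byKey = new Map<string, Row>();
  cachedRows.forEach((row) => byKey.set(keyOf(row), row));

  const hits: Row[] = [];
  const misses: EntityKey[] = [];
  requested.forEach((key) => {
    const row = byKey.get(keyOf(key));
    if (row !== undefined) hits.push(row);
    else misses.push(key);
  });
  return { hits, misses };
}
```

The `misses` array is what would be handed to the phase 2 background refresh queue.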
## Change Detection

`threat_intel_hash` = MD5 of sorted `(entry_id, risk_score)` pairs for this entity.

When scrapers add or update entries (every 6h), the hash changes. The next request that finds the cached score expired detects the mismatch during smart refresh and re-computes the score. External API results that are still within their TTL are reused rather than re-fetched.

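One way to compute such a hash in Node.js is sketched below. The serialization format (`entryId:riskScore` pairs joined with `|`), the `ThreatEntry` shape, and the function name are assumptions; the design only states that the hash is an MD5 over sorted pairs.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of one threat_intel entry for an entity.
interface ThreatEntry {
  entryId: string;
  riskScore: number;
}

// Order-independent fingerprint of an entity's threat_intel entries.
function computeThreatIntelHash(entries: ThreatEntry[]): string {
  const canonical = entries
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => {
      if (a.entryId !== b.entryId) return a.entryId < b.entryId ? -1 : 1;
      return a.riskScore - b.riskScore; // tie-break keeps the order deterministic
    })
    .map((entry) => `${entry.entryId}:${entry.riskScore}`)
    .join("|");
  return createHash("md5").update(canonical).digest("hex");
}
```

Sorting before hashing matters because the DB may return entries in any order; without it, an unchanged entity could appear changed. MD5 is adequate here since the hash is used only for change detection, not for security.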
## Cleanup

```sql
-- pg_cron: remove scores that expired more than 30 days ago
DELETE FROM dosafe.entity_scores
WHERE expires_at < NOW() - INTERVAL '30 days';
```

## Future: Background Refresh (Phase 2)

- pg_cron job every 6h: refresh top 1000 most-queried entities
- Ensures extension always gets a HIT
- Can track query_count on entity_scores for popularity ranking

## Future: Self-hosted Search (SearXNG)

Replace Serper/SerpApi with self-hosted SearXNG meta-search engine:
- Docker container, zero API cost
- Aggregates Google/Bing/DuckDuckGo
- Web search TTL can drop from 7 days to 24h once free
- Full pipeline: SearXNG → LLM analyze → cached

## Implementation Files

- `supabase/migrations/YYYYMMDD_entity_scores.sql`: table + RPC functions
- `apps/web/src/lib/entity-score-cache.ts`: cache lookup, smart refresh, TTL logic
- `apps/web/src/app/api/url-check/route.ts`: integrate cache layer
- `apps/web/src/app/api/entity-check/route.ts`: integrate cache layer
- `apps/web/src/lib/entity-scoring.ts`: add threat_intel_hash computation
