| 1 | +# Entity Scores Materialized Cache |
| 2 | + |
| 3 | +**Date:** 2026-03-10 |
| 4 | +**Status:** Approved |
| 5 | + |
| 6 | +## Problem |
| 7 | + |
| 8 | +Every API request (url-check, entity-check) re-computes risk scores from scratch: |
| 9 | +- Queries threat_intel DB (free, fast) |
| 10 | +- Calls 5+ external APIs: Safe Browsing, WHOIS, on-chain EAS, DOS.Me, web search (paid/slow) |
| 11 | +- Runs LLM analysis on web results (compute-heavy) |
| 12 | + |
| 13 | +Total latency can reach 30s, and every external API call costs money. Re-checking the same entity repeats all of this work, wasting both time and money. |
| 14 | + |
| 15 | +## Solution |
| 16 | + |
| 17 | +Add a `dosafe.entity_scores` table as a materialized cache layer. It stores the computed score together with the external API results, and each external source has its own TTL. Subsequent lookups return the cached score in < 5ms, and a refresh runs only when data is stale. |
| 18 | + |
| 19 | +## Design Principles |
| 20 | + |
| 21 | +1. **Internal DB queries are free** — always query threat_intel fresh |
| 22 | +2. **External APIs cost money** — cache aggressively with long TTLs |
| 23 | +3. **Different sources change at different rates** — separate TTLs per source |
| 24 | +4. **Extension needs < 50ms response** — single-row DB lookup, no external calls |
| 25 | +5. **Score quality must stay high** — re-compute when underlying data changes |
| 26 | + |
| 27 | +## Database Schema |
| 28 | + |
| 29 | +```sql |
| 30 | +CREATE TABLE dosafe.entity_scores ( |
| 31 | + entity_type TEXT NOT NULL, |
| 32 | + entity_hash TEXT NOT NULL, |
| 33 | + entity_id TEXT NOT NULL, |
| 34 | + |
| 35 | + -- Computed score |
| 36 | + risk_score SMALLINT NOT NULL, |
| 37 | + risk_level TEXT NOT NULL, -- safe/low/medium/high/critical |
| 38 | + confidence TEXT NOT NULL, -- low/medium/high |
| 39 | + signals TEXT[] NOT NULL DEFAULT '{}', |
| 40 | + |
| 41 | + -- Expensive cached results |
| 42 | + llm_summary TEXT, |
| 43 | + web_analysis JSONB, -- { webResults, llmAnalysis } |
| 44 | + external_results JSONB NOT NULL DEFAULT '{}', |
| 45 | + -- { |
| 46 | + -- safeBrowsing: { threats, checkedAt }, |
| 47 | + -- whois: { ageDays, registrar, checkedAt }, |
| 48 | + -- onChain: { attestationCount, latestRiskScore, checkedAt }, |
| 49 | + -- dosMe: { found, trustScore, isFlagged, checkedAt }, |
| 50 | + -- webSearch: { resultCount, checkedAt } |
| 51 | + -- } |
| 52 | + |
| 53 | + -- Change detection |
| 54 | + threat_intel_hash TEXT, -- MD5 of sorted (entry_id, risk_score) pairs |
| 55 | + |
| 56 | + -- Timestamps |
| 57 | + computed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), |
| 58 | + expires_at TIMESTAMPTZ NOT NULL, |
| 59 | + |
| 60 | + PRIMARY KEY (entity_type, entity_hash) |
| 61 | +); |
| 62 | + |
| 63 | +CREATE INDEX idx_entity_scores_expires ON dosafe.entity_scores (expires_at); |
| 64 | +``` |
| 65 | + |
| 66 | +## External API TTLs |
| 67 | + |
| 68 | +| Source | TTL | Rationale | |
| 69 | +|--------|-----|-----------| |
| 70 | +| Safe Browsing | 7 days | Rarely changes for same URL | |
| 71 | +| WHOIS | 30 days | Registration date is fixed, so domain age can be derived without re-querying | |
| 72 | +| On-chain EAS | 24h | New attestations are rare | |
| 73 | +| DOS.Me Identity | 24h | Profile changes infrequently | |
| 74 | +| Web search | 7 days | Paid API (Serper/SerpApi) — future: free via SearXNG | |
| 75 | +| LLM analysis | 24h | Re-analyze when new threat data arrives | |
| 76 | + |
| 77 | +Score TTL: **6 hours** — re-compute from fresh DB data + cached external results. |
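
The per-source TTLs above could be expressed as a small lookup table plus a staleness check. This is a minimal sketch, not the actual implementation: the names `EXTERNAL_TTL_MS`, `expiredSources`, and `nextExpiry` are hypothetical, and it assumes every entry in `external_results` carries an ISO-8601 `checkedAt` string as the schema comment suggests. The LLM analysis TTL is left out because the source does not say where its timestamp is stored.

```typescript
// Source keys mirror the external_results comment in the schema.
type ExternalSource = "safeBrowsing" | "whois" | "onChain" | "dosMe" | "webSearch";

const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// TTL values copied from the table above; the constant name is hypothetical.
const EXTERNAL_TTL_MS: Record<ExternalSource, number> = {
  safeBrowsing: 7 * DAY_MS,
  whois: 30 * DAY_MS,
  onChain: DAY_MS,
  dosMe: DAY_MS,
  webSearch: 7 * DAY_MS,
};

const SCORE_TTL_MS = 6 * HOUR_MS;

// Assumed shape: only checkedAt matters for the staleness decision.
type ExternalResults = Partial<Record<ExternalSource, { checkedAt: string }>>;

/** Returns the sources that were never fetched or are older than their TTL. */
function expiredSources(results: ExternalResults, now: Date = new Date()): ExternalSource[] {
  return (Object.keys(EXTERNAL_TTL_MS) as ExternalSource[]).filter((source) => {
    const entry = results[source];
    if (!entry) return true; // never fetched
    const checkedAt = new Date(entry.checkedAt).getTime();
    if (Number.isNaN(checkedAt)) return true; // unparseable timestamp counts as stale
    return now.getTime() - checkedAt >= EXTERNAL_TTL_MS[source];
  });
}

/** Computes the expires_at value for a score (re)computed at `now`. */
function nextExpiry(now: Date = new Date()): Date {
  return new Date(now.getTime() + SCORE_TTL_MS);
}
```

Keeping the TTLs in one table means a future change, such as lowering the web search TTL, touches a single line.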
| 78 | + |
| 79 | +## Cache Lookup Flow |
| 80 | + |
| 81 | +``` |
| 82 | +Request |
| 83 | +│ |
| 84 | +├─ 1. Query entity_scores WHERE entity_type + entity_hash |
| 85 | +│ ├─ HIT + expires_at > NOW() → return cached (< 5ms) |
| 86 | +│ ├─ HIT + expired → step 2 (smart refresh) |
| 87 | +│ └─ MISS → step 3 (full compute) |
| 88 | +│ |
| 89 | +├─ 2. Smart Refresh |
| 90 | +│ ├─ Query threat_intel → compute threat_intel_hash |
| 91 | +│ ├─ Compare hash with cached |
| 92 | +│ │ ├─ SAME + all external TTLs fresh |
| 93 | +│ │ │ → bump expires_at +6h, return same score |
| 94 | +│ │ ├─ DIFFERENT (DB changed) + external fresh |
| 95 | +│ │ │ → re-compute score from new DB + cached external_results |
| 96 | +│ │ └─ External TTL expired |
| 97 | +│ │ → call ONLY expired external APIs, re-compute |
| 98 | +│ └─ UPSERT entity_scores |
| 99 | +│ |
| 100 | +└─ 3. Full Compute (first time ever) |
| 101 | + ├─ Query threat_intel (free) |
| 102 | + ├─ Parallel: SB + WHOIS + on-chain + DOS.Me + web search |
| 103 | + ├─ Sequential: LLM analysis (after web results) |
| 104 | + ├─ computeRiskScoreV2() |
| 105 | + ├─ INSERT entity_scores |
| 106 | + └─ return score |
| 107 | +``` |
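
The branching in the flow above can be captured as a pure decision function, which keeps it easy to unit-test. The sketch below is illustrative only: `decideAction`, `CachedScore`, and `RefreshAction` are hypothetical names, and the fresh threat_intel hash is passed in as a plain value for simplicity, whereas the flow only queries threat_intel once the cached row has expired.

```typescript
// Hypothetical action type: one variant per outcome in the flow diagram.
type RefreshAction =
  | { kind: "return_cached" }
  | { kind: "bump_expiry" }
  | { kind: "recompute"; callExternal: string[] }
  | { kind: "full_compute" };

// Only the cached-row fields the decision depends on.
interface CachedScore {
  expiresAt: Date;
  threatIntelHash: string | null;
}

interface DecisionInput {
  cached: CachedScore | null;          // null means cache MISS
  now: Date;
  freshThreatIntelHash: string;        // hash of the current threat_intel rows
  expiredExternal: readonly string[];  // sources whose TTL has lapsed
}

function decideAction(input: DecisionInput): RefreshAction {
  const { cached, now, freshThreatIntelHash, expiredExternal } = input;

  // Step 3: nothing cached yet, so run the full pipeline.
  if (cached === null) return { kind: "full_compute" };

  // Step 1: fresh hit, return the cached row untouched.
  if (cached.expiresAt.getTime() > now.getTime()) return { kind: "return_cached" };

  // Step 2: smart refresh. Expired externals always force a re-compute,
  // but only those sources are called again.
  if (expiredExternal.length > 0) {
    return { kind: "recompute", callExternal: [...expiredExternal] };
  }

  // Externals are fresh: an unchanged hash only needs a new expires_at.
  if (cached.threatIntelHash === freshThreatIntelHash) return { kind: "bump_expiry" };

  // threat_intel changed: re-compute from new DB data plus cached externals.
  return { kind: "recompute", callExternal: [] };
}
```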
| 108 | + |
| 109 | +## Consumer Integration |
| 110 | + |
| 111 | +### Extension Fast Path |
| 112 | +- Query `entity_scores` only (1 row) |
| 113 | +- If MISS → fallback: query threat_intel + computeRiskScoreV2() (no external APIs) |
| 114 | +- NEVER calls external APIs |
| 115 | + |
| 116 | +### Full Path (bot/app/web) |
| 117 | +- Query `entity_scores` |
| 118 | +- HIT + fresh → return cached |
| 119 | +- Stale/MISS → smart refresh (only call expired externals) |
| 120 | + |
| 121 | +### Bulk Endpoint |
| 122 | +- Batch query `entity_scores` for up to 50 entities |
| 123 | +- Return cached results |
| 124 | +- Queue MISSes for background refresh (phase 2) |
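
For the bulk endpoint, splitting a request into cached hits and misses might look like the following. This is a sketch under assumptions: `partitionBulk`, `EntityKey`, and `CachedRow` are hypothetical names, the rows are assumed to come from a single batch query on `entity_scores`, and any cached row counts as a hit regardless of `expires_at` because the source does not say how the bulk endpoint treats expired rows.

```typescript
const MAX_BULK = 50; // limit stated for the bulk endpoint

interface EntityKey {
  entityType: string;
  entityHash: string;
}

// Hypothetical subset of an entity_scores row.
interface CachedRow extends EntityKey {
  riskScore: number;
}

// Composite key matching the table's primary key (entity_type, entity_hash).
const keyOf = (e: EntityKey): string => JSON.stringify([e.entityType, e.entityHash]);

function partitionBulk(
  requested: readonly EntityKey[],
  rows: readonly CachedRow[],
): { hits: CachedRow[]; misses: EntityKey[] } {
  if (requested.length > MAX_BULK) {
    throw new RangeError(`bulk lookup accepts at most ${MAX_BULK} entities`);
  }
  const byKey = new Map<string, CachedRow>();
  for (const row of rows) byKey.set(keyOf(row), row);

  const hits: CachedRow[] = [];
  const misses: EntityKey[] = [];
  for (const entity of requested) {
    const row = byKey.get(keyOf(entity));
    if (row) hits.push(row);
    else misses.push(entity); // candidates for the phase 2 background refresh
  }
  return { hits, misses };
}
```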
| 125 | + |
| 126 | +## Change Detection |
| 127 | + |
| 128 | +`threat_intel_hash` = MD5 of sorted `(entry_id, risk_score)` pairs for this entity. |
| 129 | + |
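| 130 | +When scrapers add or update entries (every 6h), the hash changes. Fresh cache hits skip the hash check, so the change triggers score re-computation on the first request after the cached score expires. External API results are preserved as long as they are still within their TTL. |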
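
One way to compute such a hash in Node.js is sketched below. The function name `computeThreatIntelHash`, the `ThreatIntelEntry` shape, and the `:` and `|` delimiters are assumptions made for illustration; the source only specifies MD5 over sorted `(entry_id, risk_score)` pairs. Entry IDs that contain a delimiter character would need escaping to rule out collisions.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of a threat_intel row as seen by the hasher.
interface ThreatIntelEntry {
  entryId: string;
  riskScore: number;
}

function computeThreatIntelHash(entries: readonly ThreatIntelEntry[]): string {
  // Sort the serialized pairs so the hash does not depend on row order.
  const canonical = entries
    .map((e) => `${e.entryId}:${e.riskScore}`)
    .sort()
    .join("|");
  // MD5 serves as a change fingerprint here, not as a security measure.
  return createHash("md5").update(canonical).digest("hex");
}
```

Because the input is sorted first, the same set of entries yields the same hash no matter what order the query returns them in, while any added entry or changed score produces a different hash.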
| 131 | + |
| 132 | +## Cleanup |
| 133 | + |
| 134 | +```sql |
| 135 | +-- pg_cron: remove rows whose score expired more than 30 days ago |
| 136 | +DELETE FROM dosafe.entity_scores |
| 137 | +WHERE expires_at < NOW() - INTERVAL '30 days'; |
| 138 | +``` |
| 139 | + |
| 140 | +## Future: Background Refresh (Phase 2) |
| 141 | + |
| 142 | +- pg_cron job every 6h: refresh top 1000 most-queried entities |
| 143 | +- Ensures extension always gets a HIT |
| 144 | +- Can track query_count on entity_scores for popularity ranking |
| 145 | + |
| 146 | +## Future: Self-hosted Search (SearXNG) |
| 147 | + |
| 148 | +Replace Serper/SerpApi with self-hosted SearXNG meta-search engine: |
| 149 | +- Docker container, zero API cost |
| 150 | +- Aggregates Google/Bing/DuckDuckGo |
| 151 | +- Web search TTL can drop from 7 days to 24h once free |
| 152 | +- Full pipeline: SearXNG → LLM analyze → cached |
| 153 | + |
| 154 | +## Implementation Files |
| 155 | + |
| 156 | +- `supabase/migrations/YYYYMMDD_entity_scores.sql` — table + RPC functions |
| 157 | +- `apps/web/src/lib/entity-score-cache.ts` — cache lookup, smart refresh, TTL logic |
| 158 | +- `apps/web/src/app/api/url-check/route.ts` — integrate cache layer |
| 159 | +- `apps/web/src/app/api/entity-check/route.ts` — integrate cache layer |
| 160 | +- `apps/web/src/lib/entity-scoring.ts` — add threat_intel_hash computation |