**Implementation ownership:** Claude is the primary coding agent for architecture changes; this document is the handoff/reference source for Claude-first implementation.
Hybrid DB-first + runtime fallback check. See [threat-intel.md](threat-intel.md) for full details.
+Hybrid DB-first + runtime check pipeline. Uses the V2 scoring engine with source tiers, freshness decay, and a corroboration bonus. See [threat-intel.md](threat-intel.md) for DB details.
- **Entity types:** domain, url, wallet, phone, bank_account, facebook, email, name
- **Sync schedules (pg_cron):**
@@ -466,6 +497,189 @@ dosafe.threat_intel
---

+## Risk Scoring Engine V2
+
+**Implementation:** `src/lib/entity-scoring.ts`, the shared engine used by both `/api/url-check` and `/api/entity-check`.
+
+### Concept
+
+The V1 scoring system used flat signal weights: every source contributed the same amount regardless of quality, freshness, or corroboration. This led to:
+
+- **False positives**: Crowdsourced Vietnamese sources (scam.vn, checkscam.vn) flagged major platforms such as google.com
+- **Stale data inflation**: Year-old reports from a single unverified source scored as high-risk
+- **No confidence signal**: Users couldn't tell whether a "medium risk" was backed by 5 authoritative sources or 1 crowdsourced report
+
+V2 addresses these with 5 mechanisms: **source tiers**, **freshness decay**, **corroboration bonus**, **confidence levels**, and **typosquatting detection**.
+
+### Design Principles
+
+1. **No single source determines the verdict alone**: the base score is 15 (low), and signals add to or subtract from it
+2. **Higher-quality sources have more influence**: Google Safe Browsing (tier 1) carries more weight than runtime_cache (tier 4)
+3. **Recent reports weigh more**: a phishing report from yesterday is more relevant than one from 2 years ago
+4. **Multiple independent sources increase confidence**: 3 sources saying "scam" are more trustworthy than 1
+5. **Trusted domains bypass domain-level DB lookup**: this prevents false positives from crowdsourced reports on major platforms
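
The first principle can be sketched as follows. This is a minimal illustration, not the code in `src/lib/entity-scoring.ts`: the `Signal` shape, the `combineSignals` name, the example signal names, and the clamp to 0-100 are all assumptions. Only the base score of 15 comes from the text above.

```typescript
// Hypothetical signal shape: positive weights raise risk, negative weights lower it.
interface Signal {
  name: string;
  weight: number;
}

// From the design principles: every entity starts at a low base score.
const BASE_SCORE = 15;

// Sum all signal weights on top of the base score.
// The 0-100 clamp and the rounding are assumptions, not stated in the document.
function combineSignals(signals: Signal[]): number {
  const raw = signals.reduce((sum, s) => sum + s.weight, BASE_SCORE);
  return Math.min(100, Math.max(0, Math.round(raw)));
}

// Usage: one risk signal and one mitigating signal (names are made up).
const exampleScore = combineSignals([
  { name: "phishing_report", weight: 40 },
  { name: "long_lived_domain", weight: -10 },
]);
```

Because the base score is fixed and low, an entity with no signals stays at 15, and no verdict is reached without at least one contributing signal.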
+
+### Source Tiers
+
+Each threat intel source is classified into a quality tier. The tier multiplier scales the signal's base weight.
+
+For non-DB signals (on-chain, DOS.Me, web search, URL-specific), the base weight is used directly.
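
A minimal sketch of how a tier multiplier might scale a base weight. The multiplier values below are placeholders chosen only for illustration and are not the project's actual values; `ThreatSignal`, `TierMultipliers`, and `effectiveWeight` are hypothetical names.

```typescript
type Tier = 1 | 2 | 3 | 4;
type TierMultipliers = Record<Tier, number>;

interface ThreatSignal {
  baseWeight: number;
  // DB-backed signals carry a tier; non-DB signals (on-chain, DOS.Me,
  // web search, URL-specific) leave it undefined.
  tier?: Tier;
}

// Placeholder values for illustration only; NOT the project's real multipliers.
const EXAMPLE_MULTIPLIERS: TierMultipliers = { 1: 1.0, 2: 0.75, 3: 0.5, 4: 0.25 };

// DB signals are scaled by their tier multiplier; non-DB signals pass through.
function effectiveWeight(signal: ThreatSignal, multipliers: TierMultipliers): number {
  if (signal.tier === undefined) return signal.baseWeight;
  return signal.baseWeight * multipliers[signal.tier];
}
```

Passing the multiplier table as a parameter keeps the sketch independent of whatever values the real engine uses.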
+
+### Freshness Decay
+
+Reports decay in influence as they age, measured by `last_seen_at`:
+
+| Age | Factor | Rationale |
+|-----|--------|-----------|
+| ≤ 30 days | 1.0 | Fresh, highly relevant |
+| ≤ 90 days | 0.85 | Recent, still relevant |
+| ≤ 365 days | 0.70 | Aging, may no longer be active |
+| ≤ 2 years | 0.50 | Old, likely inactive but retained for historical context |
+| > 2 years | 0.30 | Very old, minimal influence; scam sites are almost certainly dead |
+
+If `last_seen_at` is null: factor = 0.70 (treat as aging).
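
The decay table above translates directly into a lookup function. This is a sketch under stated assumptions: the `freshnessFactor` name is hypothetical, and "2 years" is taken as 730 days.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Maps a report's age (derived from last_seen_at) to its decay factor.
// Assumption: "2 years" means 730 days.
function freshnessFactor(lastSeenAt: Date | null, now: Date = new Date()): number {
  if (lastSeenAt === null) return 0.7; // unknown age: treat as aging
  const ageDays = (now.getTime() - lastSeenAt.getTime()) / DAY_MS;
  if (ageDays <= 30) return 1.0;
  if (ageDays <= 90) return 0.85;
  if (ageDays <= 365) return 0.7;
  if (ageDays <= 730) return 0.5;
  return 0.3;
}

// Usage: a report last seen 200 days before the reference date.
const referenceDate = new Date("2025-01-01T00:00:00Z");
const agingFactor = freshnessFactor(
  new Date(referenceDate.getTime() - 200 * DAY_MS),
  referenceDate,
);
```

Taking `now` as a parameter keeps the function deterministic in tests.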
+
+### Corroboration Bonus
+
+Multiple independent risk sources increase the score and confidence:
+
+| Unique risk sources | Bonus |
+|---------------------|-------|
+| 4+ | +20 |
+| 3 | +15 |
+| 2 | +10 |
+| 1 | 0 |
+
+Sources counted: all DB sources except `runtime_cache` and "clean" `google_safe_browsing` results, plus `onchain` and `google_safe_browsing` when they report actual threats.
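
A sketch of the counting rule and the bonus table above. The `SourceHit` shape and its `reportsThreat` field are hypothetical, as are the function names; the source identifiers in the usage example are illustrative strings.

```typescript
// Hypothetical shape of one source result; field names are illustrative.
interface SourceHit {
  source: string;
  reportsThreat: boolean;
}

// Sources that count only when they report an actual threat.
const THREAT_ONLY_SOURCES = new Set(["google_safe_browsing", "onchain"]);

// Counts unique risk sources, applying the exclusions described above.
function countRiskSources(hits: SourceHit[]): number {
  const unique = new Set<string>();
  for (const hit of hits) {
    if (hit.source === "runtime_cache") continue; // never counted
    if (THREAT_ONLY_SOURCES.has(hit.source) && !hit.reportsThreat) continue;
    unique.add(hit.source);
  }
  return unique.size;
}

// Bonus table: 4+ sources -> 20, 3 -> 15, 2 -> 10, otherwise 0.
function corroborationBonus(uniqueRiskSources: number): number {
  if (uniqueRiskSources >= 4) return 20;
  if (uniqueRiskSources === 3) return 15;
  if (uniqueRiskSources === 2) return 10;
  return 0;
}

// Usage: a duplicate, a clean Safe Browsing result, and a cache entry are ignored.
const exampleHits: SourceHit[] = [
  { source: "scam.vn", reportsThreat: true },
  { source: "scam.vn", reportsThreat: true },
  { source: "checkscam.vn", reportsThreat: true },
  { source: "google_safe_browsing", reportsThreat: false },
  { source: "runtime_cache", reportsThreat: true },
];
const exampleBonus = corroborationBonus(countRiskSources(exampleHits));
```

Deduplicating by source name is what makes the bonus reward independent corroboration rather than repeated reports from one feed.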
+
+### Confidence Levels
+
+Returned alongside `riskScore` to help UIs and downstream consumers calibrate trust:
+
+| Level | Criteria |
+|-------|----------|
+| **high** | 3+ unique risk sources, OR tier 1 source + 2+ total sources |
+└── Filters LLM signals to valid set: web_identified_scam, web_scam_reports, etc.
+```
+
+**Performance:** Web search runs in parallel with Phase 1 (DB/on-chain/DOS.Me). Only the LLM analysis waits for Phase 1 results. Total added latency: ~3–5s for the full path (skipped on the extension fast path).
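
The ordering described in the performance note can be sketched with `Promise.all`. The dependency interface, the `Phase1Result` shape, and all function names here are hypothetical; the sketch shows only the concurrency pattern, not the real pipeline.

```typescript
// Hypothetical result and dependency shapes; the real types are not shown here.
interface Phase1Result {
  riskSources: string[];
}

interface PipelineDeps {
  webSearch: (entity: string) => Promise<string[]>;
  runPhase1: (entity: string) => Promise<Phase1Result>;
  analyzeWithLlm: (snippets: string[], phase1: Phase1Result) => Promise<string[]>;
}

async function runFullPath(entity: string, deps: PipelineDeps): Promise<string[]> {
  // Both calls start before either is awaited, so they overlap in time.
  const [snippets, phase1] = await Promise.all([
    deps.webSearch(entity),
    deps.runPhase1(entity),
  ]);
  // The LLM step is the only one that needs the Phase 1 output.
  return deps.analyzeWithLlm(snippets, phase1);
}
```

With this shape, the added latency is roughly the slower of web search and Phase 1, plus the LLM call, rather than the sum of all three.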
+
+---
+
## Database Layout

### Supabase Schemas
@@ -480,7 +694,7 @@ dosafe.threat_intel
| Table | Schema | Purpose |
|-------|--------|---------|
-| `threat_intel` | dosafe | Unified threat data (1.2M+ entries, all sources) |
+| `threat_intel` | dosafe | Unified threat data (1.52M+ entries, all sources) |
| `threat_clusters` | dosafe | Scammer group linking (89k+ clusters) |
| `raw_imports` | dosafe | Staging for scraped reports |