| 1 | +# DOSafe Scraper Pipeline Design |
| 2 | + |
| 3 | +## Goal |
| 4 | + |
| 5 | +Extend the existing threat intel pipeline (636k entries from 5 automated sources) with |
| 6 | +a **3-layer scraper architecture** that ingests Vietnamese scam databases (admin.vn, |
| 7 | +checkscam.vn) and future manual/crowdsourced sources. Raw data is staged, normalized |
| 8 | +into the existing `threat_intel` table, and linked via `threat_clusters` for |
| 9 | +cross-entity correlation. |
| 10 | + |
| 11 | +## Context |
| 12 | + |
| 13 | +### What exists today |
| 14 | + |
| 15 | +| Layer | Table | Purpose | |
| 16 | +|-------|-------|---------| |
| 17 | +| Lookup | `dosafe.threat_intel` | Flat entity lookup (domain, url, wallet) — 636k rows | |
| 18 | +| Clustering | `dosafe.threat_clusters` | Group related entities (empty, unused so far) | |
| 19 | +| Logging | `dosafe.sync_log` | Per-source sync run tracking | |
| 20 | + |
| 21 | +Current sync flow: `pg_cron` → Edge Function `sync-threats` → `bulk_upsert_threats()` RPC. |
| 22 | +Sources are **structured feeds** (JSON arrays, text lists) that map 1:1 to `threat_intel` rows. |
| 23 | + |
| 24 | +### What's new |
| 25 | + |
| 26 | +Vietnamese scam databases have **multi-field reports** (name + phone + bank account + |
| 27 | +Facebook + evidence) that don't fit the 1-entity-per-row model. One report produces |
| 28 | +**multiple `threat_intel` rows** that need to be linked back to the same scammer. |
| 29 | + |
| 30 | +## Architecture: 3-Layer ELT |
| 31 | + |
| 32 | +``` |
| 33 | +┌──────────────┐ ┌──────────────┐ ┌──────────────┐ |
| 34 | +│ Scrapers │────▶│ raw_imports │────▶│ threat_intel │ |
| 35 | +│ │ │ (staging) │ │ (lookup) │ |
| 36 | +│ admin.vn │ │ │ │ │ |
| 37 | +│ checkscam.vn │ │ JSON blob per │ │ 1 row per │ |
| 38 | +│ future... │ │ report, as-is │ │ entity+source │ |
| 39 | +└──────────────┘ └───────┬───────┘ └───────┬───────┘ |
| 40 | + │ │ |
| 41 | + │ cluster_id FK │ |
| 42 | + │ ┌───────────┘ |
| 43 | + │ ▼ |
| 44 | + │ ┌──────────────┐ |
| 45 | + └─▶│threat_clusters│ |
| 46 | + │ (grouping) │ |
| 47 | + │ │ |
| 48 | + │ 1 cluster per │ |
| 49 | + │ scammer/group │ |
| 50 | + └──────────────┘ |
| 51 | +``` |
| 52 | + |
| 53 | +### Layer 1: `dosafe.raw_imports` (NEW table) |
| 54 | + |
| 55 | +Stores the **original scraped data verbatim** as a JSON blob. Enables re-processing |
| 56 | +if normalization logic changes. One row = one scraped report/page entry. |
| 57 | + |
| 58 | +```sql |
| 59 | +create table dosafe.raw_imports ( |
| 60 | + id uuid primary key default gen_random_uuid(), |
| 61 | + source text not null, -- 'admin_vn', 'checkscam_vn' |
| 62 | + source_id text, -- original ID/slug from source (dedup key) |
| 63 | + raw_data jsonb not null, -- full scraped JSON |
| 64 | + status text not null default 'pending' |
| 65 | + check (status in ('pending', 'processed', 'failed', 'skipped')), |
| 66 | + error text, |
| 67 | + scraped_at timestamptz not null default now(), |
| 68 | + processed_at timestamptz, |
| 69 | + created_at timestamptz not null default now() |
| 70 | +); |
| 71 | + |
| 72 | +create unique index on dosafe.raw_imports (source, source_id) |
| 73 | + where source_id is not null; |
| 74 | +``` |
| 75 | + |
| 76 | +**Key decisions:** |
| 77 | +- `source_id` = slug from admin.vn or post ID from checkscam.vn → natural dedup |
| 78 | +- `status` tracks processing state → idempotent re-runs |
| 79 | +- Partial UNIQUE index on `(source, source_id)` prevents duplicate imports for rows that have a `source_id`; rows with a NULL `source_id` are not deduplicated |
| 80 | +- No hash needed here — dedup is by source_id, not content |
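As a rough illustration of the dedup key, a scraper could shape each report as below before upserting. This is only a sketch: `RawImportRow`, `toRawImportRow`, and `dedupeBatch` are hypothetical names, and only the column names come from the table definition above.

```typescript
// Hypothetical shape of a row headed for dosafe.raw_imports.
// Column names mirror the CREATE TABLE above; everything else is illustrative.
type Source = "admin_vn" | "checkscam_vn";

interface RawImportRow {
  source: Source;
  source_id: string | null; // slug or post ID; null means the row cannot be deduplicated
  raw_data: Record<string, unknown>;
  scraped_at: string; // ISO 8601 timestamp
}

function toRawImportRow(
  source: Source,
  sourceId: string | null,
  report: Record<string, unknown>,
  now: Date = new Date(),
): RawImportRow {
  const trimmed = sourceId === null ? "" : sourceId.trim();
  return {
    source,
    source_id: trimmed === "" ? null : trimmed,
    raw_data: report,
    scraped_at: now.toISOString(),
  };
}

// Collapse duplicates of (source, source_id) inside one batch, keeping the
// last occurrence, so a single upsert statement never hits the same
// conflict target twice.
function dedupeBatch(rows: RawImportRow[]): RawImportRow[] {
  const byKey = new Map<string, RawImportRow>();
  const withoutKey: RawImportRow[] = [];
  for (const row of rows) {
    if (row.source_id === null) withoutKey.push(row);
    else byKey.set(`${row.source}:${row.source_id}`, row);
  }
  return Array.from(byKey.values()).concat(withoutKey);
}
```

Deduplicating inside the batch matters because PostgreSQL rejects an `ON CONFLICT DO UPDATE` statement that would update the same row twice.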
| 81 | + |
| 82 | +### Layer 2: `dosafe.threat_intel` (EXISTING — no schema change) |
| 83 | + |
| 84 | +Existing table stays as-is. New entity types added: |
| 85 | +- `phone` — Vietnamese phone numbers (0xxx) |
| 86 | +- `bank_account` — Bank account numbers (STK) |
| 87 | +- `facebook` — Facebook profile URLs/IDs |
| 88 | + |
| 89 | +Each `raw_imports` row produces **multiple** `threat_intel` rows (one per entity |
| 90 | +found in the report). All rows from the same report share the same `cluster_id`. |
| 91 | + |
| 92 | +### Layer 3: `dosafe.threat_clusters` (EXISTING — no schema change) |
| 93 | + |
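| 94 | +One cluster per scammer or scam group. A report that matches an existing cluster (see matching logic below) joins it instead of creating a new one. Fields: |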
| 95 | +- `name` — Scammer name from report |
| 96 | +- `description` — Summary (amount, bank, category) |
| 97 | +- `total_reports` — Count of `raw_imports` rows in this cluster |
| 98 | +- `max_risk_score` — Highest risk_score among member entities |
| 99 | + |
| 100 | +**Cluster matching logic** (for dedup across sources): |
| 101 | +1. Exact match: same phone OR same bank_account across sources → merge into same cluster |
| 102 | +2. Future: fuzzy name matching, LLM-assisted entity resolution |
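The exact-match rule can be sketched with an in-memory index. This is a simplified illustration, not the planned implementation: `ClusterIndex` and the `cluster-N` IDs are hypothetical, and a real version would query `threat_intel` instead of a `Map`.

```typescript
interface Entity {
  type: "phone" | "bank_account" | "facebook";
  value: string; // assumed to be normalized already
}

class ClusterIndex {
  // "phone:0943241522" -> cluster id
  private byKey = new Map<string, string>();
  private nextId = 1;

  // Only phone and bank_account take part in exact matching (rule 1).
  private static matchKeys(entities: Entity[]): string[] {
    return entities
      .filter((e) => e.type === "phone" || e.type === "bank_account")
      .map((e) => `${e.type}:${e.value}`);
  }

  findOrCreate(entities: Entity[]): { clusterId: string; created: boolean } {
    const keys = ClusterIndex.matchKeys(entities);
    let existing: string | undefined;
    for (const key of keys) {
      existing = this.byKey.get(key);
      if (existing !== undefined) break;
    }
    const created = existing === undefined;
    const clusterId = existing ?? `cluster-${this.nextId++}`;
    // Register any new identifiers so later reports can match on them too.
    for (const key of keys) {
      if (!this.byKey.has(key)) this.byKey.set(key, clusterId);
    }
    return { clusterId, created };
  }
}
```

One case this sketch deliberately ignores: a report whose phone belongs to one existing cluster and whose bank account belongs to another. It simply reuses the first match; merging the two clusters would need an explicit rule.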
| 103 | + |
| 104 | +## Data Sources |
| 105 | + |
| 106 | +### admin.vn (Priority 1) |
| 107 | + |
| 108 | +| Property | Value | |
| 109 | +|-------|---------| |
| 110 | +| URL pattern | `https://admin.vn/scams?page={N}` (81 pages) | |
| 111 | +| Entries | 35,596 STK/SĐT + 4,284 Facebook | |
| 112 | +| Tech | Custom PHP, server-rendered HTML, Cloudflare | |
| 113 | +| Fields | name, phone, bank_account, bank_name, amount, category, date, evidence | |
| 114 | +| Detail page | `/scams/{slug}.html` — complaint text, images, reporter info | |
| 115 | +| Update freq | Active, new reports daily | |
| 116 | +| Rate limit | Cloudflare-protected, need polite scraping (1-2 req/s) | |
| 117 | + |
| 118 | +**Scrape strategy:** |
| 119 | +- **Initial bulk**: Scrape all 81 pages of `/scams?page=N`, extract table rows |
| 120 | +- **Incremental**: Scrape page 1 only, stop when hitting known `source_id` (slug) |
| 121 | +- **Detail pages**: Optional phase 2 — scrape `/scams/{slug}.html` for evidence/description |
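A minimal sketch of the incremental stop condition and request pacing follows. It assumes the listing is ordered newest-first, which the source does not state explicitly; `listingUrl`, `takeUntilKnown`, and `delayMs` are hypothetical helper names.

```typescript
// Listing URL pattern taken from the table above.
function listingUrl(page: number): string {
  return `https://admin.vn/scams?page=${page}`;
}

// Return the slugs that appear before the first already-imported slug.
// Assumes the page lists reports newest-first.
function takeUntilKnown(pageSlugs: string[], known: ReadonlySet<string>): string[] {
  const fresh: string[] = [];
  for (const slug of pageSlugs) {
    if (known.has(slug)) break;
    fresh.push(slug);
  }
  return fresh;
}

// Pause between requests needed to stay at or under a target request rate.
function delayMs(requestsPerSecond: number): number {
  if (requestsPerSecond <= 0) {
    throw new RangeError("requestsPerSecond must be positive");
  }
  return Math.ceil(1000 / requestsPerSecond);
}
```

If page 1 contains no known slug at all, more than one page of reports arrived since the last run, and the caller would probably want to continue to page 2 rather than stop.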
| 122 | + |
| 123 | +**Entity extraction per row:** |
| 124 | +``` |
| 125 | +1 admin.vn row → up to 3 threat_intel rows: |
| 126 | + - entity_type: 'phone', entity_value: '0943241522' |
| 127 | + - entity_type: 'bank_account', entity_value: '0943241522' |
| 128 | + - entity_type: 'facebook', entity_value: 'https://fb.com/...' |
| 129 | +All 3 rows share the same cluster_id → linked to 1 threat_clusters row |
| 130 | +``` |
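The fan-out from one scraped row to several `threat_intel` rows might look like the sketch below. The `AdminVnRow` field names and the `source` column are assumptions for illustration; `entity_type`, `entity_value`, and `cluster_id` come from this document.

```typescript
// Hypothetical field names for a parsed admin.vn table row.
interface AdminVnRow {
  slug: string;
  phone?: string;
  bank_account?: string;
  facebook?: string;
}

type EntityType = "phone" | "bank_account" | "facebook";

interface ThreatIntelRow {
  entity_type: EntityType;
  entity_value: string;
  source: string; // assumed column name
  cluster_id: string;
}

function toThreatIntelRows(row: AdminVnRow, clusterId: string): ThreatIntelRow[] {
  const candidates: Array<[EntityType, string | undefined]> = [
    ["phone", row.phone],
    ["bank_account", row.bank_account],
    ["facebook", row.facebook],
  ];
  const rows: ThreatIntelRow[] = [];
  for (const [entityType, value] of candidates) {
    const trimmed = value === undefined ? "" : value.trim();
    if (trimmed === "") continue; // field absent from this report
    rows.push({
      entity_type: entityType,
      entity_value: trimmed,
      source: "admin_vn",
      cluster_id: clusterId,
    });
  }
  return rows;
}
```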
| 131 | + |
| 132 | +### checkscam.vn (Priority 2) |
| 133 | + |
| 134 | +| Property | Value | |
| 135 | +|-------|---------| |
| 136 | +| URL pattern | `https://checkscam.vn/page/{N}` or WP REST API | |
| 137 | +| Entries | ~62,000 posts | |
| 138 | +| Tech | WordPress, WP REST API exposed | |
| 139 | +| Fields | title (scammer name/phone), content (HTML, often empty via API) | |
| 140 | +| Update freq | Active | |
| 141 | +| Rate limit | Standard WordPress, no Cloudflare | |
| 142 | + |
| 143 | +**Scrape strategy:** |
| 144 | +- **WP REST API** for metadata: `GET /wp-json/wp/v2/posts?per_page=100&page=N` |
| 145 | + - Returns title, date, slug, categories — but `content.rendered` is empty |
| 146 | +- **HTML scrape** for content: `GET /checkscam/{slug}/` for full post body |
| 147 | +- **Initial bulk**: Paginate WP API for all 62k post metadata, then HTML scrape |
| 148 | +- **Incremental**: WP API `after=YYYY-MM-DDTHH:mm:ss` for new posts since last sync |
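For illustration, the paginated and incremental API requests could be built like this. `postsUrl` and `pagesNeeded` are hypothetical helpers; the query parameters are the ones listed above. The sketch formats the `after` value in UTC, while WordPress may interpret a timestamp without an offset in the site's own timezone, so treat that detail as an assumption to verify.

```typescript
const POSTS_ENDPOINT = "https://checkscam.vn/wp-json/wp/v2/posts";
const PER_PAGE = 100;

// Build the URL for one page of post metadata, optionally limited to posts
// published after the last successful sync.
function postsUrl(page: number, after?: Date): string {
  const params = [`per_page=${PER_PAGE}`, `page=${page}`];
  if (after !== undefined) {
    // Drop milliseconds and the trailing "Z" to get YYYY-MM-DDTHH:mm:ss.
    const stamp = after.toISOString().slice(0, 19);
    params.push(`after=${encodeURIComponent(stamp)}`);
  }
  return `${POSTS_ENDPOINT}?${params.join("&")}`;
}

// Number of API pages needed to cover a known post count.
function pagesNeeded(totalPosts: number, perPage: number = PER_PAGE): number {
  return Math.ceil(totalPosts / perPage);
}
```

At 100 posts per page, the roughly 62,000 posts mentioned above work out to 620 metadata requests for the initial bulk pass.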
| 149 | + |
| 150 | +**Entity extraction:** |
| 151 | +- Parse title for phone numbers (regex `0\d{9,10}`) |
| 152 | +- Parse title for bank account numbers |
| 153 | +- Parse HTML content for structured scam details |
| 154 | +- Less structured than admin.vn → lower confidence, lower risk_score |
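The title regex can be applied as in the sketch below. `extractPhones` is a hypothetical helper; it wraps the `0\d{9,10}` pattern with digit boundaries so that a phone-like substring inside a longer bank account number is not picked up.

```typescript
// Extract distinct phone-like numbers (0 followed by 9 or 10 digits) from a
// post title. The surrounding boundaries reject matches embedded in longer
// digit runs such as bank account numbers.
function extractPhones(title: string): string[] {
  const pattern = /(?:^|\D)(0\d{9,10})(?!\d)/g;
  const seen = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(title)) !== null) {
    const phone = match[1];
    if (phone !== undefined) seen.add(phone);
  }
  return Array.from(seen);
}
```

This version does not handle numbers written with separators (for example dots or spaces between digit groups); those would need to be normalized before matching.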
| 155 | + |
| 156 | +## Processing Pipeline |
| 157 | + |
| 158 | +### Step 1: Scrape → `raw_imports` |
| 159 | + |
| 160 | +``` |
| 161 | +scrape_source(source_name) |
| 162 | + for each page/entry: |
| 163 | + INSERT INTO raw_imports (source, source_id, raw_data) |
| 164 | + ON CONFLICT (source, source_id) WHERE source_id IS NOT NULL DO UPDATE SET raw_data = EXCLUDED.raw_data, scraped_at = now() |
| 165 | +``` |
| 166 | + |
| 167 | +### Step 2: Process `raw_imports` → `threat_intel` + `threat_clusters` |
| 168 | + |
| 169 | +``` |
| 170 | +process_pending_imports() |
| 171 | + SELECT * FROM raw_imports WHERE status = 'pending' |
| 172 | + for each row: |
| 173 | + 1. Extract entities from raw_data (phone, STK, FB, ...) |
| 174 | + 2. Find or create threat_cluster: |
| 175 | + - Search existing clusters by phone/STK match |
| 176 | + - If found → reuse cluster_id, increment total_reports |
| 177 | + - If not → create new cluster |
| 178 | + 3. Upsert each entity into threat_intel with cluster_id |
| 179 | + 4. Calculate risk_score (rule-based): |
| 180 | + - 1 report → 60 |
| 181 | + - 2-3 reports → 70 |
| 182 | + - 4+ reports or multi-source → 80 |
| 183 | + - Confirmed + on-chain attested → 90 |
| 184 | + 5. UPDATE raw_imports SET status = 'processed' |
| 185 | +``` |
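The status handling that makes re-runs idempotent can be sketched in memory as follows. `processPending` and the `handle` callback are hypothetical; `handle` stands in for steps 1-4, and setting `processed_at` only on success is an assumption of this sketch.

```typescript
type Status = "pending" | "processed" | "failed" | "skipped";

interface RawImport {
  id: string;
  raw_data: Record<string, unknown>;
  status: Status;
  error: string | null;
  processed_at: string | null;
}

// Process every pending row once. Rows that are already processed, failed,
// or skipped are left untouched, so calling this twice is harmless.
function processPending(
  imports: RawImport[],
  handle: (row: RawImport) => void, // stands in for extract/cluster/upsert/score
  now: () => Date = () => new Date(),
): { processed: number; failed: number } {
  let processed = 0;
  let failed = 0;
  for (const row of imports) {
    if (row.status !== "pending") continue;
    try {
      handle(row);
      row.status = "processed";
      row.error = null;
      row.processed_at = now().toISOString();
      processed++;
    } catch (err) {
      // Keep the failure reason so the row can be inspected and retried.
      row.status = "failed";
      row.error = err instanceof Error ? err.message : String(err);
      failed++;
    }
  }
  return { processed, failed };
}
```

Retrying a failed row would then be a matter of resetting its `status` to `'pending'`.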
| 186 | + |
| 187 | +### Step 3: Sync to DOS.Me (future) |
| 188 | + |
| 189 | +``` |
| 190 | +POST /trust/flags |
| 191 | + entityId: normalized entity value |
| 192 | + sourceSystem: 'dosafe' |
| 193 | + externalId: raw_imports.id ← links back to original report |
| 194 | +``` |
| 195 | + |
| 196 | +## Execution Model |
| 197 | + |
| 198 | +### Initial Bulk Import (one-time) |
| 199 | + |
| 200 | +**Local Deno script** (`scripts/scrape-admin-vn.ts`): |
| 201 | +- Runs on dev machine (not Edge Function — too much memory/time for 81 pages) |
| 202 | +- Scrapes all pages → writes to `raw_imports` via Supabase REST API |
| 203 | +- Then triggers processing via `process_pending_imports()` RPC |
| 204 | +- Estimated: ~5-10 minutes for admin.vn, ~30-60 min for checkscam.vn |
| 205 | + |
| 206 | +### Incremental Sync (daily) |
| 207 | + |
| 208 | +**Edge Function** `sync-scraped-sources`: |
| 209 | +- Runs via `pg_cron` (daily, offset from existing `sync-threats` schedule) |
| 210 | +- Scrapes page 1 of admin.vn, recent WP API posts from checkscam.vn |
| 211 | +- Inserts new entries into `raw_imports` |
| 212 | +- Calls `process_pending_imports()` to normalize + link |
| 213 | + |
| 214 | +### Why separate from `sync-threats`? |
| 215 | + |
| 216 | +| | `sync-threats` (existing) | `sync-scraped-sources` (new) | |
| 217 | +|--|---------------------------|------------------------------| |
| 218 | +| Sources | Structured feeds (JSON, text) | HTML scraping | |
| 219 | +| Processing | Direct → threat_intel | raw_imports → threat_intel | |
| 220 | +| Runtime | ~109s for 636k entries | ~30-60s for incremental | |
| 221 | +| Schedule | Every 6h | Daily | |
| 222 | +| Failure mode | Source down = skip | Cloudflare block = retry | |
| 223 | + |
| 224 | +## Rule-Based Scoring |
| 225 | + |
| 226 | +No LLM in phase 1. Scoring is deterministic: |
| 227 | + |
| 228 | +| Condition | risk_score | |
| 229 | +|-----------|-----------| |
| 230 | +| Single report from 1 Vietnamese source | 60 | |
| 231 | +| 2-3 reports from same source | 70 | |
| 232 | +| Reports from 2+ different sources | 80 | |
| 233 | +| Confirmed by on-chain attestation | 90 | |
| 234 | +| Multiple on-chain attestations | 95 | |
| 235 | + |
| 236 | +Score stored on `threat_intel.risk_score`. Cluster-level score stored on |
| 237 | +`threat_clusters.max_risk_score` (max of all member entities). |
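As a sketch, the scoring table translates into a small pure function. `ReportStats`, `riskScore`, and `maxRiskScore` are hypothetical names. The table does not say how to score four or more reports from a single source; this sketch treats that case as 80, which is an assumption.

```typescript
interface ReportStats {
  reportCount: number;      // total reports naming this entity
  sourceCount: number;      // distinct sources among those reports
  attestationCount: number; // on-chain attestations
}

// Deterministic score, checked from the strongest signal down.
function riskScore(stats: ReportStats): number {
  if (stats.attestationCount >= 2) return 95;
  if (stats.attestationCount === 1) return 90;
  // Assumption: 4+ same-source reports are scored like multi-source reports.
  if (stats.sourceCount >= 2 || stats.reportCount >= 4) return 80;
  if (stats.reportCount >= 2) return 70;
  return 60;
}

// Cluster-level score: the maximum over member entities, or null when the
// cluster has no members yet.
function maxRiskScore(memberScores: number[]): number | null {
  return memberScores.length === 0 ? null : Math.max(...memberScores);
}
```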
| 238 | + |
| 239 | +## DOS.Me Trust API Integration |
| 240 | + |
| 241 | +When syncing to DOS.Me: |
| 242 | +- `externalId` = `raw_imports.id` (not `threat_intel.id`) |
| 243 | +- This allows DOS.Me to link back to the **original report**, not individual entities |
| 244 | +- One report with 3 entities → 3 `POST /trust/flags` calls, same `externalId` |
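A sketch of the payload fan-out, using only the three fields named in this document. `TrustFlag` and `toTrustFlags` are hypothetical names, and the real `POST /trust/flags` body may require more fields than shown here.

```typescript
// Fields taken from the Step 3 outline; anything beyond these is unknown.
interface TrustFlag {
  entityId: string;
  sourceSystem: "dosafe";
  externalId: string;
}

// One payload per distinct entity, all pointing back to the same report.
function toTrustFlags(rawImportId: string, entityValues: string[]): TrustFlag[] {
  const distinct = Array.from(new Set(entityValues));
  return distinct.map((value): TrustFlag => ({
    entityId: value,
    sourceSystem: "dosafe",
    externalId: rawImportId,
  }));
}
```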
| 245 | + |
| 246 | +## Entity ID Normalization |
| 247 | + |
| 248 | +Following DOS Chain EAS Schema 6 conventions: |
| 249 | + |
| 250 | +| Entity Type | Normalization | Example | |
| 251 | +|-------------|--------------|---------| |
| 252 | +| `phone` | Remove spaces/dashes, keep the national format with leading 0 | `0943241522` | |
| 253 | +| `bank_account` | Remove spaces, uppercase | `0943241522` | |
| 254 | +| `facebook` | Extract numeric ID or username, lowercase | `100012345678` | |
| 255 | +| `domain` | Lowercase, strip protocol/path | `evil-site.com` | |
| 256 | +| `url` | Lowercase protocol+host, preserve path | `https://evil.com/phish` | |
| 257 | +| `wallet` | Lowercase (EVM) or as-is (non-EVM) | `0xabc...def` | |
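The normalization rules in the table could be implemented roughly as follows. This is a sketch under assumptions: `normalizeEntity` is a hypothetical name, the Facebook handling only covers `profile.php?id=...` links, plain profile URLs, and bare IDs, and the EVM check is a simple `0x` plus 40 hex characters test.

```typescript
type EntityType = "phone" | "bank_account" | "facebook" | "domain" | "url" | "wallet";

function normalizeEntity(type: EntityType, raw: string): string {
  const value = raw.trim();
  switch (type) {
    case "phone":
      return value.replace(/[\s-]/g, "");
    case "bank_account":
      return value.replace(/\s/g, "").toUpperCase();
    case "facebook": {
      // Prefer the numeric id from profile.php?id=..., else the first path segment.
      const id = value.match(/[?&]id=(\d+)/)?.[1];
      if (id !== undefined) return id;
      const path = value.replace(/^(?:[a-z]+:\/\/)?(?:[^/?#]+\/)?/i, "");
      return (path.split(/[/?#]/)[0] ?? "").toLowerCase();
    }
    case "domain": {
      const noScheme = value.toLowerCase().replace(/^[a-z][a-z0-9+.-]*:\/\//, "");
      return noScheme.split(/[/?#]/)[0] ?? "";
    }
    case "url": {
      // Lowercase scheme and host only; path and query keep their case.
      const parts = value.match(/^([a-z][a-z0-9+.-]*:\/\/[^/?#]*)(.*)$/i);
      if (parts === null) return value;
      return (parts[1] ?? "").toLowerCase() + (parts[2] ?? "");
    }
    case "wallet":
      return /^0x[0-9a-fA-F]{40}$/.test(value) ? value.toLowerCase() : value;
  }
}
```

Running every entity through one function before the upsert keeps lookups and cluster matching consistent, since both compare normalized values.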
| 258 | + |
| 259 | +## File Structure |
| 260 | + |
| 261 | +``` |
| 262 | +d:/Projects/DOSafe/ |
| 263 | +├── scripts/ |
| 264 | +│ ├── scrape-admin-vn.ts # One-time bulk import |
| 265 | +│ └── scrape-checkscam-vn.ts # One-time bulk import |
| 266 | +├── supabase/ |
| 267 | +│ ├── migrations/ |
| 268 | +│ │ └── 20260304000001_raw_imports.sql # New migration |
| 269 | +│ └── functions/ |
| 270 | +│ ├── sync-threats/ # Existing (structured feeds) |
| 271 | +│ └── sync-scraped-sources/ # New (HTML scraper sources) |
| 272 | +│ └── index.ts |
| 273 | +└── docs/ |
| 274 | + └── plans/ |
| 275 | + └── 2026-03-04-scraper-pipeline-design.md # This doc |
| 276 | +``` |
| 277 | + |
| 278 | +## Open Questions (resolved) |
| 279 | + |
| 280 | +| Question | Decision | |
| 281 | +|----------|----------| |
| 282 | +| 1 table vs multi-table for entities? | 1 table (`threat_intel`) + `cluster_id` FK | |
| 283 | +| Separate `entity_links` table? | No — use `cluster_id` on `threat_intel` (DOS.Me recommendation) | |
| 284 | +| LLM scoring? | Rule-based phase 1, LLM phase 2 | |
| 285 | +| Staging table? | Yes — `raw_imports` (ELT pattern) | |
| 286 | +| Scraper runtime? | Local scripts for bulk, Edge Function for incremental | |
| 287 | +| `externalId` for DOS.Me sync? | `raw_imports.id` | |
| 288 | +| admin.vn vs checkscam.vn priority? | admin.vn first (structured data) | |