Commit 0a742ce

docs: add scraper pipeline design and implementation plan
1 parent 5c86a97 commit 0a742ce

2 files changed

Lines changed: 810 additions & 0 deletions

DOSafe-Scraper-Pipeline-Design.md

Lines changed: 288 additions & 0 deletions
@@ -0,0 +1,288 @@
# DOSafe Scraper Pipeline Design

## Goal

Extend the existing threat intel pipeline (636k entries from 5 automated sources) with
a **3-layer scraper architecture** that ingests Vietnamese scam databases (admin.vn,
checkscam.vn) and future manual/crowdsourced sources. Raw data is staged, normalized
into the existing `threat_intel` table, and linked via `threat_clusters` for
cross-entity correlation.

## Context

### What exists today

| Layer | Table | Purpose |
|-------|-------|---------|
| Lookup | `dosafe.threat_intel` | Flat entity lookup (domain, url, wallet), 636k rows |
| Clustering | `dosafe.threat_clusters` | Group related entities (empty, unused so far) |
| Logging | `dosafe.sync_log` | Per-source sync run tracking |

Current sync flow: `pg_cron` → Edge Function `sync-threats` → `bulk_upsert_threats()` RPC.
Sources are **structured feeds** (JSON arrays, text lists) that map 1:1 to `threat_intel` rows.

### What's new

Vietnamese scam databases have **multi-field reports** (name + phone + bank account +
Facebook + evidence) that don't fit the 1-entity-per-row model. One report produces
**multiple `threat_intel` rows** that need to be linked back to the same scammer.

## Architecture: 3-Layer ELT

```
┌───────────────┐     ┌────────────────┐     ┌────────────────┐
│ Scrapers      │────▶│ raw_imports    │────▶│ threat_intel   │
│               │     │ (staging)      │     │ (lookup)       │
│ admin.vn      │     │                │     │                │
│ checkscam.vn  │     │ JSON blob per  │     │ 1 row per      │
│ future...     │     │ report, as-is  │     │ entity+source  │
└───────────────┘     └───────┬────────┘     └───────┬────────┘
                              │                      │
                              │        cluster_id FK │
                              │          ┌───────────┘
                              │          ▼
                              │  ┌────────────────┐
                              └─▶│ threat_clusters│
                                 │ (grouping)     │
                                 │                │
                                 │ 1 cluster per  │
                                 │ scammer/group  │
                                 └────────────────┘
```

### Layer 1: `dosafe.raw_imports` (NEW table)

Stores the **original scraped data verbatim** as a JSON blob. Enables re-processing
if normalization logic changes. One row = one scraped report/page entry.

```sql
create table dosafe.raw_imports (
  id            uuid primary key default gen_random_uuid(),
  source        text not null,              -- 'admin_vn', 'checkscam_vn'
  source_id     text,                       -- original ID/slug from source (dedup key)
  raw_data      jsonb not null,             -- full scraped JSON
  status        text not null default 'pending'
                check (status in ('pending', 'processed', 'failed', 'skipped')),
  error         text,
  scraped_at    timestamptz not null default now(),
  processed_at  timestamptz,
  created_at    timestamptz not null default now()
);

create unique index on dosafe.raw_imports (source, source_id)
  where source_id is not null;
```

**Key decisions:**
- `source_id` = slug from admin.vn or post ID from checkscam.vn → natural dedup
- `status` tracks processing state → idempotent re-runs
- UNIQUE index on `(source, source_id)` prevents duplicate imports
- No hash needed here: dedup is by source_id, not content

### Layer 2: `dosafe.threat_intel` (EXISTING, no schema change)

Existing table stays as-is. New entity types added:
- `phone`: Vietnamese phone numbers (0xxx)
- `bank_account`: Bank account numbers (STK)
- `facebook`: Facebook profile URLs/IDs

Each `raw_imports` row produces **multiple** `threat_intel` rows (one per entity
found in the report). All rows from the same report share the same `cluster_id`.

### Layer 3: `dosafe.threat_clusters` (EXISTING, no schema change)

One cluster per scammer or scammer group; a cluster can aggregate several reports. Fields:
- `name`: Scammer name from report
- `description`: Summary (amount, bank, category)
- `total_reports`: Count of `raw_imports` rows in this cluster
- `max_risk_score`: Highest risk_score among member entities

**Cluster matching logic** (for dedup across sources):
1. Exact match: same phone OR same bank_account across sources → merge into same cluster
2. Future: fuzzy name matching, LLM-assisted entity resolution
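
The exact-match rule above can be sketched as a lookup over an in-memory index. This is only an illustration, not the pipeline's implementation: `ClusterIndex`, `findOrCreateCluster`, the `type:value` key format, and the generated cluster IDs are hypothetical, and the sketch does not handle a report that bridges two existing clusters.

```typescript
// Hypothetical in-memory sketch of exact-match cluster resolution.
// A real implementation would query dosafe.threat_intel instead of a Map.
type EntityType = "phone" | "bank_account" | "facebook";

interface Entity {
  type: EntityType;
  value: string; // assumed to be normalized already
}

class ClusterIndex {
  // Maps "type:value" to the cluster that owns that entity.
  private byKey = new Map<string, string>();
  private nextId = 1;

  findOrCreateCluster(entities: Entity[]): string {
    // Only phone and bank_account take part in exact matching.
    const keys = entities
      .filter((e) => e.type === "phone" || e.type === "bank_account")
      .map((e) => `${e.type}:${e.value}`);

    // Reuse the first cluster that already owns one of these keys.
    let found: string | undefined;
    for (const key of keys) {
      const existing = this.byKey.get(key);
      if (existing !== undefined) {
        found = existing;
        break;
      }
    }
    const clusterId: string =
      found !== undefined ? found : `cluster-${this.nextId++}`;

    // Register any keys not seen before under the chosen cluster.
    for (const key of keys) {
      if (!this.byKey.has(key)) {
        this.byKey.set(key, clusterId);
      }
    }
    return clusterId;
  }
}
```

Two reports that share a bank account resolve to the same cluster even when their phone numbers differ, which is the merge behaviour rule 1 describes.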

## Data Sources

### admin.vn (Priority 1)

| Property | Value |
|----------|-------|
| URL pattern | `https://admin.vn/scams?page={N}` (81 pages) |
| Entries | 35,596 bank account (STK) / phone (SĐT) entries + 4,284 Facebook |
| Tech | Custom PHP, server-rendered HTML, Cloudflare |
| Fields | name, phone, bank_account, bank_name, amount, category, date, evidence |
| Detail page | `/scams/{slug}.html` with complaint text, images, reporter info |
| Update freq | Active, new reports daily |
| Rate limit | Cloudflare-protected, need polite scraping (1-2 req/s) |

**Scrape strategy:**
- **Initial bulk**: Scrape all 81 pages of `/scams?page=N`, extract table rows
- **Incremental**: Scrape page 1 only, stop when hitting known `source_id` (slug)
- **Detail pages**: Optional phase 2, scrape `/scams/{slug}.html` for evidence/description
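
The incremental rule (stop at the first known slug) and the polite request rate can be sketched as pure helpers. This is a sketch under assumptions: it presumes the listing is ordered newest-first, and `adminVnPageUrl`, `takeUntilKnown`, and `delayMsForRate` are hypothetical names, not existing project code.

```typescript
// Builds the listing URL from the pattern given in the table above.
function adminVnPageUrl(page: number): string {
  return `https://admin.vn/scams?page=${page}`;
}

// Assumption: slugs arrive newest-first, so the first known slug marks the
// boundary between new and already-imported reports.
function takeUntilKnown(
  pageSlugs: string[],
  isKnown: (slug: string) => boolean,
): string[] {
  const fresh: string[] = [];
  for (const slug of pageSlugs) {
    if (isKnown(slug)) break;
    fresh.push(slug);
  }
  return fresh;
}

// Delay between requests for a target rate; 1-2 req/s means 500-1000 ms.
function delayMsForRate(requestsPerSecond: number): number {
  return Math.ceil(1000 / requestsPerSecond);
}
```

If every slug on page 1 turns out to be new, more than one page of reports arrived since the last sync, so a scraper built this way would probably want to continue to page 2 rather than stop.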

**Entity extraction per row:**
```
1 admin.vn row → up to 3 threat_intel rows:
  - entity_type: 'phone',        entity_value: '0943241522'
  - entity_type: 'bank_account', entity_value: '0943241522'
  - entity_type: 'facebook',     entity_value: 'https://fb.com/...'
All 3 rows share the same cluster_id → linked to 1 threat_clusters row
```
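
A minimal sketch of that fan-out, assuming a parsed row object. The `AdminVnRow` field names are hypothetical, and the `ThreatIntelRow` shape only includes columns named in this document (`entity_type`, `entity_value`, `cluster_id`) plus an assumed `source` column.

```typescript
type EntityType = "phone" | "bank_account" | "facebook";

// Hypothetical shape of one parsed admin.vn table row.
interface AdminVnRow {
  slug: string;
  phone?: string;
  bankAccount?: string;
  facebook?: string;
}

// Assumed subset of threat_intel columns.
interface ThreatIntelRow {
  entity_type: EntityType;
  entity_value: string;
  source: string;
  cluster_id: string;
}

function toThreatIntelRows(row: AdminVnRow, clusterId: string): ThreatIntelRow[] {
  const candidates: Array<[EntityType, string | undefined]> = [
    ["phone", row.phone],
    ["bank_account", row.bankAccount],
    ["facebook", row.facebook],
  ];
  const rows: ThreatIntelRow[] = [];
  for (const [type, value] of candidates) {
    // Skip fields the report left empty, so a row yields 0 to 3 entities.
    if (value !== undefined && value.trim() !== "") {
      rows.push({
        entity_type: type,
        entity_value: value.trim(),
        source: "admin_vn",
        cluster_id: clusterId,
      });
    }
  }
  return rows;
}
```

Every emitted row carries the same `cluster_id`, which is what links the phone, bank account, and Facebook entries back to one `threat_clusters` row.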

### checkscam.vn (Priority 2)

| Property | Value |
|----------|-------|
| URL pattern | `https://checkscam.vn/page/{N}` or WP REST API |
| Entries | ~62,000 posts |
| Tech | WordPress, WP REST API exposed |
| Fields | title (scammer name/phone), content (HTML, often empty via API) |
| Update freq | Active |
| Rate limit | Standard WordPress, no Cloudflare |

**Scrape strategy:**
- **WP REST API** for metadata: `GET /wp-json/wp/v2/posts?per_page=100&page=N`
  - Returns title, date, slug, categories, but `content.rendered` is often empty
- **HTML scrape** for content: `GET /checkscam/{slug}/` for full post body
- **Initial bulk**: Paginate WP API for all 62k post metadata, then HTML scrape
- **Incremental**: WP API `after=YYYY-MM-DDTHH:mm:ss` for new posts since last sync
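
The two API modes differ only in the query string, which a small helper can make explicit. This is a sketch: `checkscamPostsUrl` and `pagesNeeded` are hypothetical names, and the host plus `/wp-json/wp/v2/posts` path are taken from the strategy above.

```typescript
// Builds the WP REST API listing URL. Omitting `after` gives the bulk form;
// passing an ISO 8601 timestamp gives the incremental form.
function checkscamPostsUrl(page: number, after?: string): string {
  let url = `https://checkscam.vn/wp-json/wp/v2/posts?per_page=100&page=${page}`;
  if (after !== undefined) {
    // The colons in the timestamp must be percent-encoded.
    url += `&after=${encodeURIComponent(after)}`;
  }
  return url;
}

// Number of API pages needed to list `totalPosts` posts.
function pagesNeeded(totalPosts: number, perPage: number): number {
  return Math.ceil(totalPosts / perPage);
}
```

At 100 posts per page, the ~62,000 posts work out to about 620 metadata requests for the initial bulk import, before any HTML scraping.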

**Entity extraction:**
- Parse title for phone numbers (regex `0\d{9,10}`)
- Parse title for bank account numbers
- Parse HTML content for structured scam details
- Less structured than admin.vn → lower confidence, lower risk_score
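
The phone regex from the list above can be applied to a post title as follows. This is a sketch, with `extractPhones` as a hypothetical helper name. The pattern has no digit boundaries, so it can also match inside a longer digit run such as a bank account number, which is one reason these matches deserve lower confidence.

```typescript
// Regex from the list above: a leading 0 followed by 9 or 10 digits.
const PHONE_PATTERN = /0\d{9,10}/g;

function extractPhones(title: string): string[] {
  const matches: string[] = title.match(PHONE_PATTERN) || [];
  // Deduplicate while preserving first-seen order.
  return matches.filter((value, index) => matches.indexOf(value) === index);
}
```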

## Processing Pipeline

### Step 1: Scrape → `raw_imports`

```
scrape_source(source_name)
  for each page/entry:
    INSERT INTO raw_imports (source, source_id, raw_data)
    ON CONFLICT (source, source_id) WHERE source_id IS NOT NULL
      DO UPDATE SET raw_data = EXCLUDED.raw_data, scraped_at = now()
```

### Step 2: Process `raw_imports` → `threat_intel` + `threat_clusters`

```
process_pending_imports()
  SELECT * FROM raw_imports WHERE status = 'pending'
  for each row:
    1. Extract entities from raw_data (phone, STK, FB, ...)
    2. Find or create threat_cluster:
       - Search existing clusters by phone/STK match
       - If found → reuse cluster_id, increment total_reports
       - If not → create new cluster
    3. Upsert each entity into threat_intel with cluster_id
    4. Calculate risk_score (rule-based):
       - 1 source report → 60
       - 2-3 reports → 70
       - 4+ reports or multi-source → 80
       - Confirmed + on-chain attested → 90
    5. UPDATE raw_imports SET status = 'processed'
```

### Step 3: Sync to DOS.Me (future)

```
POST /trust/flags
  entityId: normalized entity value
  sourceSystem: 'dosafe'
  externalId: raw_imports.id   ← links back to original report
```

## Execution Model

### Initial Bulk Import (one-time)

**Local Deno script** (`scripts/scrape-admin-vn.ts`):
- Runs on dev machine (not an Edge Function: too much memory/time for 81 pages)
- Scrapes all pages → writes to `raw_imports` via Supabase REST API
- Then triggers processing via `process_pending_imports()` RPC
- Estimated: ~5-10 minutes for admin.vn, ~30-60 min for checkscam.vn

### Incremental Sync (daily)

**Edge Function** `sync-scraped-sources`:
- Runs via `pg_cron` (daily, offset from existing `sync-threats` schedule)
- Scrapes page 1 of admin.vn, recent WP API posts from checkscam.vn
- Inserts new entries into `raw_imports`
- Calls `process_pending_imports()` to normalize + link

### Why separate from `sync-threats`?

| | `sync-threats` (existing) | `sync-scraped-sources` (new) |
|--|---------------------------|------------------------------|
| Sources | Structured feeds (JSON, text) | HTML scraping |
| Processing | Direct → threat_intel | raw_imports → threat_intel |
| Runtime | ~109s for 636k entries | ~30-60s for incremental |
| Schedule | Every 6h | Daily |
| Failure mode | Source down = skip | Cloudflare block = retry |

## Rule-Based Scoring

No LLM in phase 1. Scoring is deterministic:

| Condition | risk_score |
|-----------|-----------|
| Single report from 1 Vietnamese source | 60 |
| 2-3 reports from same source | 70 |
| Reports from 2+ different sources | 80 |
| Confirmed by on-chain attestation | 90 |
| Multiple on-chain attestations | 95 |

Score stored on `threat_intel.risk_score`. Cluster-level score stored on
`threat_clusters.max_risk_score` (max of all member entities).
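
The table can be read as a first-match rule list, evaluated from the strongest signal down. The sketch below is one possible reading, not the pipeline's actual scoring code: `ScoreInput`, `riskScore`, and `clusterMaxRiskScore` are hypothetical names. Two behaviours are assumptions the table does not cover: four or more reports from a single source score 80, and an empty cluster scores 0.

```typescript
// Hypothetical inputs: counts gathered for one entity.
interface ScoreInput {
  reportCount: number;      // reports mentioning the entity
  sourceCount: number;      // distinct sources among those reports
  attestationCount: number; // on-chain attestations
}

function riskScore(input: ScoreInput): number {
  if (input.attestationCount >= 2) return 95;
  if (input.attestationCount === 1) return 90;
  if (input.sourceCount >= 2) return 80;
  if (input.reportCount >= 4) return 80; // assumption: not covered by the table
  if (input.reportCount >= 2) return 70;
  return 60;
}

// Cluster-level score: the maximum over member entities.
function clusterMaxRiskScore(memberScores: number[]): number {
  // Assumption: an empty cluster scores 0.
  return memberScores.length === 0 ? 0 : Math.max(...memberScores);
}
```

Checking attestations first keeps the result deterministic when several conditions hold at once, for example an attested entity that was also reported by two sources.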

## DOS.Me Trust API Integration

When syncing to DOS.Me:
- `externalId` = `raw_imports.id` (not `threat_intel.id`)
- This allows DOS.Me to link back to the **original report**, not individual entities
- One report with 3 entities → 3 `POST /trust/flags` calls, same `externalId`
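
A sketch of how one report fans out into flag payloads. Only the three fields named in this document (`entityId`, `sourceSystem`, `externalId`) are included; `TrustFlag` and `buildTrustFlags` are hypothetical names, and the real API may require more fields.

```typescript
// Payload fields as listed in this document; anything else is unknown here.
interface TrustFlag {
  entityId: string;
  sourceSystem: "dosafe";
  externalId: string;
}

// One payload per entity, all pointing back to the same raw_imports row.
function buildTrustFlags(rawImportId: string, entityIds: string[]): TrustFlag[] {
  return entityIds.map((entityId) => ({
    entityId,
    sourceSystem: "dosafe" as const,
    externalId: rawImportId,
  }));
}
```

Each returned object would be the body of one `POST /trust/flags` call.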

## Entity ID Normalization

Following DOS Chain EAS Schema 6 conventions:

| Entity Type | Normalization | Example |
|-------------|--------------|---------|
| `phone` | Remove spaces/dashes, keep national format (leading 0) | `0943241522` |
| `bank_account` | Remove spaces, uppercase | `0943241522` |
| `facebook` | Extract numeric ID or username, lowercase | `100012345678` |
| `domain` | Lowercase, strip protocol/path | `evil-site.com` |
| `url` | Lowercase protocol+host, preserve path | `https://evil.com/phish` |
| `wallet` | Lowercase (EVM) or as-is (non-EVM) | `0xabc...def` |
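
Four of these rules are simple enough to sketch as pure string functions. The function names are hypothetical. Treating a wallet as EVM when it is `0x` followed by 40 hex characters is an assumption of this sketch. The `facebook` and `url` rules need URL parsing and are left out.

```typescript
// phone: remove spaces and dashes, keep the digits as written.
function normalizePhone(raw: string): string {
  return raw.replace(/[\s-]/g, "");
}

// bank_account: remove whitespace, uppercase any letters.
function normalizeBankAccount(raw: string): string {
  return raw.replace(/\s/g, "").toUpperCase();
}

// domain: lowercase, strip a leading scheme and anything from the first slash.
function normalizeDomain(raw: string): string {
  return raw
    .trim()
    .toLowerCase()
    .replace(/^[a-z][a-z0-9+.-]*:\/\//, "")
    .replace(/\/.*$/, "");
}

// wallet: lowercase EVM addresses, leave other formats untouched.
// Assumption: EVM means "0x" followed by exactly 40 hex characters.
function normalizeWallet(raw: string): string {
  const trimmed = raw.trim();
  return /^0x[0-9a-fA-F]{40}$/.test(trimmed) ? trimmed.toLowerCase() : trimmed;
}
```

Normalizing before the upsert matters for dedup: `0943 241-522` and `0943241522` should land on the same `threat_intel` row and therefore the same cluster.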

## File Structure

```
d:/Projects/DOSafe/
├── scripts/
│   ├── scrape-admin-vn.ts                       # One-time bulk import
│   └── scrape-checkscam-vn.ts                   # One-time bulk import
├── supabase/
│   ├── migrations/
│   │   └── 20260304000001_raw_imports.sql       # New migration
│   └── functions/
│       ├── sync-threats/                        # Existing (structured feeds)
│       └── sync-scraped-sources/                # New (HTML scraper sources)
│           └── index.ts
└── docs/
    └── plans/
        └── 2026-03-04-scraper-pipeline-design.md  # This doc
```

## Open Questions (resolved)

| Question | Decision |
|----------|----------|
| 1 table vs multi-table for entities? | 1 table (`threat_intel`) + `cluster_id` FK |
| Separate `entity_links` table? | No, use `cluster_id` on `threat_intel` (DOS.Me recommendation) |
| LLM scoring? | Rule-based phase 1, LLM phase 2 |
| Staging table? | Yes, `raw_imports` (ELT pattern) |
| Scraper runtime? | Local scripts for bulk, Edge Function for incremental |
| `externalId` for DOS.Me sync? | `raw_imports.id` |
| admin.vn vs checkscam.vn priority? | admin.vn first (structured data) |
