Finnish vocabulary builder that collects articles from YLE sources, extracts words, and builds a personal vocabulary with AI-powered enrichment.
YLE Teletext / Selkouutiset
↓ collect
SQLite DB (articles)
↓ add (CSV/JSON word list)
parse → enrich (AI) → dedup (Voikko) → store
↓ export
vocabulary.json → web viewer
- Collect Finnish news articles from YLE Teletext and Selkouutiset
- Add words from CSV/JSON files with automatic AI enrichment (translations, pronunciations, example sentences)
- Deduplicate using Voikko morphological analysis (exact match, lemma match, Levenshtein distance; sketched after this list)
- Export to a vocabulary JSON file served by a built-in web viewer
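The dedup stage applies those checks in order, cheapest first. A minimal sketch of the decision logic, assuming hypothetical `lemmatize` and `levenshtein` helpers standing in for the Voikko base-form lookup and an edit-distance function; the real logic lives in src/pipeline/dedup.ts:

```ts
// Illustrative only — not the actual implementation in src/pipeline/dedup.ts.
type DedupResult = "exact" | "lemma" | "near-duplicate" | "new";

function classify(
  candidate: string,
  existing: string[],
  lemmatize: (word: string) => string,       // hypothetical Voikko wrapper
  levenshtein: (a: string, b: string) => number, // hypothetical edit distance
  maxDistance = 1,
): DedupResult {
  // 1. Exact match against stored surface forms
  if (existing.includes(candidate)) return "exact";

  // 2. Lemma match: same Voikko base form (e.g. "talossa" → "talo")
  const lemma = lemmatize(candidate);
  if (existing.some((word) => lemmatize(word) === lemma)) return "lemma";

  // 3. Near-duplicate: small Levenshtein distance catches typos and variants
  if (existing.some((word) => levenshtein(candidate, word) <= maxDistance)) {
    return "near-duplicate";
  }

  return "new";
}
```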
# Prerequisites: Bun (https://bun.sh)
bun install
# Configure API keys
cp .env.example .env
# Edit .env with your YLE API credentials and Anthropic API key

| Variable | Required | Description |
|---|---|---|
| YLE_APP_ID | For `fetch` | YLE Teletext API app ID |
| YLE_APP_KEY | For `fetch` | YLE Teletext API app key |
| ANTHROPIC_API_KEY | For `add --enrich` | AI enrichment (Claude) |
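A filled-in `.env` might look like this (placeholder values):

```
YLE_APP_ID=your-app-id
YLE_APP_KEY=your-app-key
ANTHROPIC_API_KEY=your-anthropic-key
```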
# Fetch YLE Teletext news pages
bun run dev -- fetch
bun run dev -- fetch --pages 103-112
# Fetch Selkouutiset (simplified Finnish news)
bun run dev -- selko
bun run dev -- selko --count 10

# List all collected articles
bun run dev -- list
bun run dev -- list --source selko --json
# Read an article
bun run dev -- read 103
bun run dev -- read --id 42

# Add words from a file (full pipeline: parse → enrich → dedup → store → export)
bun run dev -- add words.csv
bun run dev -- add words.json --no-enrich --source "lesson 5"
# Look up a word
bun run dev -- lookup talo kissa
# List words by frequency
bun run dev -- words --limit 20 --min-freq 2

# Export vocabulary to JSON
bun run dev -- export
bun run dev -- export --dry-run
# Audit for internal duplicates
bun run dev -- audit
# Serve the vocabulary web viewer
bun run dev -- serve --port 3000

CSV (header auto-detected):
fi,en,ko,pos
talo,house,집,noun
kissa,cat,고양이,noun

JSON:
[
{ "fi": "talo", "en": "house", "ko": "집", "pos": "noun" },
{ "fi": "kissa", "en": "cat", "ko": "고양이", "pos": "noun" }
]

Only the `fi` field is required. Missing fields (`en`, `ko`, `pos`, `example`, etc.) are filled automatically by AI enrichment.
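Internally, each row maps onto a word entry roughly like the one below, and enrichment asks Claude to fill in whatever is missing. This is a sketch only: the real types live in src/types.ts, and the prompt, model, and error handling in src/pipeline/enrich.ts may differ.

```ts
import Anthropic from "@anthropic-ai/sdk";

// Approximate shape of an imported word entry; only `fi` is mandatory.
interface WordEntry {
  fi: string;        // Finnish word (required)
  en?: string;       // English translation
  ko?: string;       // Korean translation
  pos?: string;      // part of speech
  example?: string;  // example sentence
}

// Sketch: one plausible way to fill missing fields with the Anthropic SDK.
async function enrich(entry: WordEntry): Promise<WordEntry> {
  const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
  const message = await client.messages.create({
    model: "claude-sonnet-4-20250514", // placeholder model id
    max_tokens: 512,
    messages: [
      {
        role: "user",
        content:
          "Complete this Finnish vocabulary entry as JSON with keys " +
          `fi, en, ko, pos, example: ${JSON.stringify(entry)}`,
      },
    ],
  });
  const text = message.content[0].type === "text" ? message.content[0].text : "{}";
  return { ...JSON.parse(text), ...entry }; // user-provided fields take precedence
}
```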
src/
├── index.ts # CLI entry point (declarative command registry)
├── types.ts # Shared domain types
├── commands/ # Command handlers grouped by concern
│ ├── collect.ts # Article collection (source-agnostic)
│ ├── add.ts # Word import pipeline orchestration
│ ├── query.ts # list, read, words, lookup
│ └── manage.ts # export, audit, serve
├── sources/ # Extensible article sources
│ ├── types.ts # ArticleSource interface
│ ├── index.ts # Source registry
│ ├── teletext.ts # YLE Teletext
│ └── selko.ts # Selkouutiset
├── pipeline/ # Data transformation stages
│ ├── parse.ts # CSV/JSON input parsing
│ ├── enrich.ts # AI enrichment (Anthropic)
│ └── dedup.ts # Voikko-based deduplication
└── store/ # Persistence layer
├── schema.ts # Drizzle ORM table definitions
├── db.ts # Database queries
└── export.ts # vocabulary.json export
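The CLI entry point wires commands up declaratively; a minimal sketch of what such a registry might look like (names and structure are illustrative assumptions, not the actual src/index.ts):

```ts
// Illustrative sketch of a declarative command registry.
interface Command {
  description: string;
  run: (args: string[]) => Promise<void>;
}

const commands: Record<string, Command> = {
  fetch: {
    description: "Fetch YLE Teletext news pages",
    run: async (args) => {
      console.log("fetch", args); // would delegate to commands/collect.ts
    },
  },
  add: {
    description: "Import words from a CSV/JSON file",
    run: async (args) => {
      console.log("add", args); // would delegate to commands/add.ts
    },
  },
};

async function main(argv: string[]): Promise<void> {
  const [name, ...rest] = argv;
  const command = name ? commands[name] : undefined;
  if (!command) {
    console.error(`Unknown command: ${name ?? "(none)"}`);
    console.error(`Available: ${Object.keys(commands).join(", ")}`);
    process.exit(1);
  }
  await command.run(rest);
}

await main(process.argv.slice(2));
```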
Implement the ArticleSource interface and register it:
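The interface is roughly shaped like this (a sketch; the authoritative definition is in src/sources/types.ts, and the CollectedArticle fields shown here are assumptions):

```ts
// Sketch of src/sources/types.ts — field names are illustrative.
export interface CollectedArticle {
  source: string;   // e.g. "teletext", "selko"
  title: string;
  body: string;
  fetchedAt: Date;
}

export interface ArticleSource {
  name: string;
  collect(options?: Record<string, unknown>): AsyncGenerator<CollectedArticle>;
}
```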
// src/sources/my-source.ts
import type { ArticleSource } from "./types.js";
export const mySource: ArticleSource = {
name: "my-source",
async *collect(options) {
// yield CollectedArticle objects
},
};

// src/sources/index.ts — add to registry
import { mySource } from "./my-source.js";
const sources = { teletext, selko, "my-source": mySource };

bun run dev -- <command> # Run CLI
bun test # Run tests (vitest)
bun run lint # Lint (oxlint)
bun run fmt # Format (oxfmt)
bun run check # Full check (lint + format + typecheck)

ISC