yongsk0066/suomi-sanavihko

suomi-sanavihko

Finnish vocabulary builder that collects articles from YLE sources, extracts words, and builds a personal vocabulary with AI-powered enrichment.

What it does

YLE Teletext / Selkouutiset
        ↓ collect
    SQLite DB (articles)
        ↓ add (CSV/JSON word list)
    parse → enrich (AI) → dedup (Voikko) → store
        ↓ export
    vocabulary.json → web viewer

  1. Collect Finnish news articles from YLE Teletext and Selkouutiset
  2. Add words from CSV/JSON files with automatic AI enrichment (translations, pronunciations, example sentences)
  3. Deduplicate using Voikko morphological analysis (exact match, lemma match, Levenshtein distance)
  4. Export to a vocabulary JSON file served by a built-in web viewer
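The three checks named in step 3 can be sketched as below. This is a minimal illustration under stated assumptions, not the project's implementation: the `Lemmatize` callback stands in for a Voikko-backed base-form lookup, `findDuplicate` and `toyLemmatize` are hypothetical names, and the `maxDistance` default of 1 is an assumed threshold.

```typescript
// Hypothetical stand-in for a Voikko base-form lookup (assumption).
type Lemmatize = (word: string) => string;

type Match =
  | { kind: "exact"; existing: string }
  | { kind: "lemma"; existing: string }
  | { kind: "fuzzy"; existing: string; distance: number }
  | { kind: "none" };

// Dynamic-programming Levenshtein distance, keeping only two rows.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Checks run from strictest to loosest; `existing` is assumed lowercase.
function findDuplicate(
  word: string,
  existing: string[],
  lemmatize: Lemmatize,
  maxDistance = 1, // assumed threshold
): Match {
  const needle = word.trim().toLowerCase();
  // 1. Exact match on the normalized surface form.
  for (const e of existing) {
    if (e === needle) return { kind: "exact", existing: e };
  }
  // 2. Lemma match: both words reduce to the same base form.
  const lemma = lemmatize(needle);
  for (const e of existing) {
    if (lemmatize(e) === lemma) return { kind: "lemma", existing: e };
  }
  // 3. Fuzzy match: small edit distance, e.g. a typo.
  for (const e of existing) {
    const distance = levenshtein(needle, e);
    if (distance <= maxDistance) return { kind: "fuzzy", existing: e, distance };
  }
  return { kind: "none" };
}

// Toy lemmatizer for demonstration only; it knows a single inflected form.
const toyLemmatize: Lemmatize = (w) => (w === "talossa" ? "talo" : w);
console.log(findDuplicate("talossa", ["talo", "kissa"], toyLemmatize));
```

Running the strict checks first means a fuzzy hit is only reported when neither the surface form nor the base form already matches, which keeps typo detection from hiding a true lemma duplicate.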

Setup

# Prerequisites: Bun (https://bun.sh)
bun install

# Configure API keys
cp .env.example .env
# Edit .env with your YLE API credentials and Anthropic API key

Environment variables

| Variable | Required | Description |
| --- | --- | --- |
| YLE_APP_ID | For fetch | YLE Teletext API app ID |
| YLE_APP_KEY | For fetch | YLE Teletext API app key |
| ANTHROPIC_API_KEY | For add --enrich | AI enrichment (Claude) |
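Since each variable is only needed by particular commands, a CLI could check them per command rather than at startup. The sketch below is illustrative only: `missingEnv` is a hypothetical helper, and it assumes enrichment is on by default for `add` (as the `--no-enrich` flag under Usage suggests).

```typescript
// Hypothetical helper: returns the names of variables a command needs
// but that are unset or empty in the given environment.
function missingEnv(
  command: string,
  env: Record<string, string | undefined>,
  options: { enrich?: boolean } = {},
): string[] {
  const needed: string[] = [];
  // The YLE Teletext API credentials are only required by `fetch`.
  if (command === "fetch") needed.push("YLE_APP_ID", "YLE_APP_KEY");
  // Assumption: `add` enriches unless explicitly disabled.
  if (command === "add" && options.enrich !== false) {
    needed.push("ANTHROPIC_API_KEY");
  }
  return needed.filter((name) => !env[name]);
}

// Example with an explicit environment object instead of process.env.
console.log(missingEnv("fetch", { YLE_APP_ID: "example-id" }));
```

Passing the environment in as an argument keeps the check easy to test; a real CLI would typically pass `process.env`.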

Usage

Collect articles

# Fetch YLE Teletext news pages
bun run dev -- fetch
bun run dev -- fetch --pages 103-112

# Fetch Selkouutiset (simplified Finnish news)
bun run dev -- selko
bun run dev -- selko --count 10

Browse articles

# List all collected articles
bun run dev -- list
bun run dev -- list --source selko --json

# Read an article
bun run dev -- read 103
bun run dev -- read --id 42

Build vocabulary

# Add words from a file (full pipeline: parse → enrich → dedup → store → export)
bun run dev -- add words.csv
bun run dev -- add words.json --no-enrich --source "lesson 5"

# Look up a word
bun run dev -- lookup talo kissa

# List words by frequency
bun run dev -- words --limit 20 --min-freq 2

Manage

# Export vocabulary to JSON
bun run dev -- export
bun run dev -- export --dry-run

# Audit for internal duplicates
bun run dev -- audit

# Serve the vocabulary web viewer
bun run dev -- serve --port 3000

Input file formats

CSV (header auto-detected):

fi,en,ko,pos
talo,house,,noun
kissa,cat,고양이,noun

JSON:

[
  { "fi": "talo", "en": "house", "ko": "", "pos": "noun" },
  { "fi": "kissa", "en": "cat", "ko": "고양이", "pos": "noun" }
]

Only the fi field is required. Missing fields (en, ko, pos, example, etc.) are filled automatically by AI enrichment.
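To make the "only fi is required" rule concrete, here is a rough sketch of a CSV parser for the format above. It is not the project's parser: `WordEntry`, `parseCsv`, and `DEFAULT_COLUMNS` are hypothetical names, the header is assumed to be detected by the presence of an `fi` cell in the first row, and quoted fields containing commas are not handled.

```typescript
// Hypothetical entry shape; only `fi` is mandatory.
interface WordEntry {
  fi: string;
  en?: string;
  ko?: string;
  pos?: string;
}

// Assumed column order when the file has no header row.
const DEFAULT_COLUMNS = ["fi", "en", "ko", "pos"];

function parseCsv(text: string): WordEntry[] {
  const rows = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => line.split(",").map((cell) => cell.trim()));
  if (rows.length === 0) return [];

  // Heuristic header detection (assumption): the first row names an "fi" column.
  const hasHeader = rows[0].includes("fi");
  const columns = hasHeader ? rows[0] : DEFAULT_COLUMNS;
  const dataRows = hasHeader ? rows.slice(1) : rows;

  const entries: WordEntry[] = [];
  for (const cells of dataRows) {
    const record: Record<string, string | undefined> = {};
    columns.forEach((name, i) => {
      // Empty cells stay undefined so a later enrichment step can fill them.
      if (cells[i]) record[name] = cells[i];
    });
    const fi = record["fi"];
    if (!fi) continue; // skip rows without the required field
    entries.push({ fi, en: record["en"], ko: record["ko"], pos: record["pos"] });
  }
  return entries;
}

const sample = "fi,en,ko,pos\ntalo,house,,noun\nkissa,cat,고양이,noun";
console.log(parseCsv(sample));
```

Leaving missing fields as `undefined` rather than empty strings makes it straightforward for a later stage to tell which fields still need to be filled.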

Architecture

src/
├── index.ts              # CLI entry point (declarative command registry)
├── types.ts              # Shared domain types
├── commands/             # Command handlers grouped by concern
│   ├── collect.ts        #   Article collection (source-agnostic)
│   ├── add.ts            #   Word import pipeline orchestration
│   ├── query.ts          #   list, read, words, lookup
│   └── manage.ts         #   export, audit, serve
├── sources/              # Extensible article sources
│   ├── types.ts          #   ArticleSource interface
│   ├── index.ts          #   Source registry
│   ├── teletext.ts       #   YLE Teletext
│   └── selko.ts          #   Selkouutiset
├── pipeline/             # Data transformation stages
│   ├── parse.ts          #   CSV/JSON input parsing
│   ├── enrich.ts         #   AI enrichment (Anthropic)
│   └── dedup.ts          #   Voikko-based deduplication
└── store/                # Persistence layer
    ├── schema.ts         #   Drizzle ORM table definitions
    ├── db.ts             #   Database queries
    └── export.ts         #   vocabulary.json export
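
The "declarative command registry" mentioned for src/index.ts could look roughly like the following. This is a generic sketch of the pattern, not the actual entry point: `CommandSpec`, `commands`, `dispatch`, and the handler bodies are all hypothetical.

```typescript
// Hypothetical command description: metadata plus a handler.
interface CommandSpec {
  description: string;
  run: (args: string[]) => string;
}

// The registry is plain data, so adding a command means adding one entry.
const commands = new Map<string, CommandSpec>([
  ["list", { description: "List collected articles", run: () => "listing articles" }],
  [
    "lookup",
    {
      description: "Look up words",
      run: (args) => `looking up: ${args.join(", ")}`,
    },
  ],
]);

// Resolve the first argument to a registered command and run it.
function dispatch(argv: string[]): string {
  const [name, ...rest] = argv;
  const spec = name === undefined ? undefined : commands.get(name);
  if (!spec) {
    return `Unknown command. Available: ${[...commands.keys()].join(", ")}`;
  }
  return spec.run(rest);
}

console.log(dispatch(["lookup", "talo", "kissa"]));
```

Keeping the registry as data separates argument routing from the handlers in commands/, which is the usual motivation for this layout.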

Adding a new source

Implement the ArticleSource interface and register it:

// src/sources/my-source.ts
import type { ArticleSource } from "./types.js";

export const mySource: ArticleSource = {
  name: "my-source",
  async *collect(options) {
    // yield CollectedArticle objects
  },
};

// src/sources/index.ts: add to registry
// (teletext and selko are the sources already imported in this file)
import { mySource } from "./my-source.js";
const sources = { teletext, selko, "my-source": mySource };
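
For a self-contained picture of how an async-generator source is consumed, consider the sketch below. The interface shapes are assumptions made for illustration (the real definitions live in src/sources/types.ts and may differ), and `demoSource` and `collectAll` are hypothetical names.

```typescript
// Assumed shapes, for illustration only.
interface CollectedArticle {
  title: string;
  body: string;
}

interface CollectOptions {
  count?: number;
}

interface ArticleSource {
  name: string;
  collect(options: CollectOptions): AsyncGenerator<CollectedArticle>;
}

// A toy source that yields fixed articles instead of calling a network API.
const demoSource: ArticleSource = {
  name: "demo",
  async *collect(options) {
    const limit = options.count ?? 2;
    for (let i = 1; i <= limit; i++) {
      yield { title: `Article ${i}`, body: "Hyvää päivää." };
    }
  },
};

// Drain a source into an array; a real collector might store each
// article as it arrives instead of buffering them all.
async function collectAll(
  source: ArticleSource,
  options: CollectOptions = {},
): Promise<CollectedArticle[]> {
  const articles: CollectedArticle[] = [];
  for await (const article of source.collect(options)) {
    articles.push(article);
  }
  return articles;
}

collectAll(demoSource, { count: 3 }).then((articles) => {
  console.log(articles.map((a) => a.title));
});
```

Because `collect` is an async generator, the caller can process articles one at a time and stop early without the source having to fetch everything up front.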

Development

bun run dev -- <command>    # Run CLI
bun test                    # Run tests (vitest)
bun run lint                # Lint (oxlint)
bun run fmt                 # Format (oxfmt)
bun run check               # Full check (lint + format + typecheck)

License

ISC
