suomi-sanavihko

Finnish vocabulary builder that collects articles from YLE sources, extracts words, and builds a personal vocabulary with AI-powered enrichment.

What it does

YLE Teletext / Selkouutiset
        ↓ collect
    SQLite DB (articles)
        ↓ add (CSV/JSON word list)
    parse → enrich (AI) → dedup (Voikko) → store
        ↓ export
    vocabulary.json → web viewer

- Collect Finnish news articles from YLE Teletext and Selkouutiset
- Add words from CSV/JSON files with automatic AI enrichment (translations, pronunciations, example sentences)
- Deduplicate using Voikko morphological analysis (exact match, lemma match, Levenshtein distance)
- Export to a vocabulary JSON file served by a built-in web viewer
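
The three dedup checks named above (exact match, lemma match, Levenshtein distance) could be chained roughly as follows. This is a minimal sketch, not the project's actual code: `lemmaOf` is a hypothetical callback standing in for a Voikko base-form lookup, and the distance threshold of 1 is an assumed value.

```typescript
// Sketch of a three-stage duplicate check: exact, lemma, then edit distance.
// `lemmaOf` is a hypothetical stand-in for a Voikko base-form lookup.
type DedupResult =
  | { kind: "exact" | "lemma" | "near"; match: string }
  | { kind: "new" };

// Dynamic-programming Levenshtein distance (insert / delete / substitute).
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

function findDuplicate(
  word: string,
  known: string[],
  lemmaOf: (w: string) => string,
  maxDistance = 1, // assumed threshold
): DedupResult {
  const w = word.trim().toLowerCase();
  // 1. Exact match against the stored vocabulary.
  if (known.includes(w)) return { kind: "exact", match: w };
  // 2. Match on the base form (e.g. "talossa" -> "talo").
  const lemma = lemmaOf(w);
  if (known.includes(lemma)) return { kind: "lemma", match: lemma };
  // 3. Near match within the edit-distance threshold.
  const near = known.find((k) => levenshtein(k, w) <= maxDistance);
  return near ? { kind: "near", match: near } : { kind: "new" };
}
```

Running the cheap checks first means the edit-distance scan, which touches every stored word, only happens for words that are not already caught by an exact or lemma match.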

Setup

# Prerequisites: Bun (https://bun.sh)
bun install

# Configure API keys
cp .env.example .env
# Edit .env with your YLE API credentials and Anthropic API key

Environment variables

Variable	Required	Description
`YLE_APP_ID`	For `fetch`	YLE Teletext API app ID
`YLE_APP_KEY`	For `fetch`	YLE Teletext API app key
`ANTHROPIC_API_KEY`	For `add` (unless `--no-enrich`)	AI enrichment (Claude)
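
A command could check for its credentials before doing any work. The sketch below mirrors the table above; the `missingEnv` helper and the `REQUIRED_ENV` mapping are illustrative assumptions, not the project's code.

```typescript
// Which variables each command needs, per the table above.
// The helper name and the mapping shape are hypothetical.
const REQUIRED_ENV: Record<string, string[]> = {
  fetch: ["YLE_APP_ID", "YLE_APP_KEY"],
  add: ["ANTHROPIC_API_KEY"], // only when enrichment is enabled
};

// Returns the names of required variables that are unset or empty.
function missingEnv(
  command: string,
  env: Record<string, string | undefined>,
): string[] {
  return (REQUIRED_ENV[command] ?? []).filter((name) => !env[name]);
}

// Usage: pass process.env at startup.
// const missing = missingEnv("fetch", process.env);
// if (missing.length > 0) throw new Error(`Missing: ${missing.join(", ")}`);
```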

Usage

Collect articles

# Fetch YLE Teletext news pages
bun run dev -- fetch
bun run dev -- fetch --pages 103-112

# Fetch Selkouutiset (simplified Finnish news)
bun run dev -- selko
bun run dev -- selko --count 10

Browse articles

# List all collected articles
bun run dev -- list
bun run dev -- list --source selko --json

# Read an article
bun run dev -- read 103
bun run dev -- read --id 42

Build vocabulary

# Add words from a file (full pipeline: parse → enrich → dedup → store → export)
bun run dev -- add words.csv
bun run dev -- add words.json --no-enrich --source "lesson 5"

# Look up a word
bun run dev -- lookup talo kissa

# List words by frequency
bun run dev -- words --limit 20 --min-freq 2

Manage

# Export vocabulary to JSON
bun run dev -- export
bun run dev -- export --dry-run

# Audit for internal duplicates
bun run dev -- audit

# Serve the vocabulary web viewer
bun run dev -- serve --port 3000

Input file formats

CSV (header auto-detected):

fi,en,ko,pos
talo,house,집,noun
kissa,cat,고양이,noun

JSON:

[
  { "fi": "talo", "en": "house", "ko": "집", "pos": "noun" },
  { "fi": "kissa", "en": "cat", "ko": "고양이", "pos": "noun" }
]

Only the fi field is required. AI enrichment fills in any missing fields (en, ko, pos, example, etc.) unless you pass --no-enrich.
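
To make the two formats concrete, here is a rough sketch of a parser that accepts either one and enforces only fi. The `WordInput` type, the `parseWords` and `parseCsv` functions, the default column order, and the header heuristic (treat the first CSV row as a header if it contains a `fi` cell) are all assumptions for illustration. The project's actual parser may differ, and this sketch does not handle quoted CSV fields.

```typescript
// Hypothetical input shape: only `fi` is mandatory.
interface WordInput {
  fi: string;
  [field: string]: string | undefined;
}

type Row = Record<string, string | undefined>;

const DEFAULT_COLUMNS = ["fi", "en", "ko", "pos"]; // assumed column order

function parseWords(text: string, format: "csv" | "json"): WordInput[] {
  const rows: Row[] = format === "json" ? JSON.parse(text) : parseCsv(text);
  return rows.map((row, i) => {
    const fi = row.fi?.trim();
    if (!fi) throw new Error(`Entry ${i + 1}: missing required "fi" field`);
    return { ...row, fi };
  });
}

// Naive CSV split: no support for quoting or escaped commas.
function parseCsv(text: string): Row[] {
  const lines = text.split(/\r?\n/).filter((l) => l.trim() !== "");
  if (lines.length === 0) return [];
  const first = lines[0].split(",").map((c) => c.trim());
  // Assumed heuristic: a row containing "fi" as a cell is a header row.
  const hasHeader = first.includes("fi");
  const columns = hasHeader ? first : DEFAULT_COLUMNS;
  return lines.slice(hasHeader ? 1 : 0).map((line) => {
    const cells = line.split(",").map((c) => c.trim());
    const row: Row = {};
    columns.forEach((col, j) => {
      if (cells[j]) row[col] = cells[j]; // leave empty cells unset
    });
    return row;
  });
}
```

Leaving empty cells unset, rather than storing empty strings, lets a later enrichment step tell "not provided" apart from "provided".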

Architecture

src/
├── index.ts              # CLI entry point (declarative command registry)
├── types.ts              # Shared domain types
├── commands/             # Command handlers grouped by concern
│   ├── collect.ts        #   Article collection (source-agnostic)
│   ├── add.ts            #   Word import pipeline orchestration
│   ├── query.ts          #   list, read, words, lookup
│   └── manage.ts         #   export, audit, serve
├── sources/              # Extensible article sources
│   ├── types.ts          #   ArticleSource interface
│   ├── index.ts          #   Source registry
│   ├── teletext.ts       #   YLE Teletext
│   └── selko.ts          #   Selkouutiset
├── pipeline/             # Data transformation stages
│   ├── parse.ts          #   CSV/JSON input parsing
│   ├── enrich.ts         #   AI enrichment (Anthropic)
│   └── dedup.ts          #   Voikko-based deduplication
└── store/                # Persistence layer
    ├── schema.ts         #   Drizzle ORM table definitions
    ├── db.ts             #   Database queries
    └── export.ts         #   vocabulary.json export

Adding a new source

Implement the ArticleSource interface and register it:

// src/sources/my-source.ts
import type { ArticleSource } from "./types.js";

export const mySource: ArticleSource = {
  name: "my-source",
  async *collect(options) {
    // yield CollectedArticle objects
  },
};

// src/sources/index.ts — add to registry
import { mySource } from "./my-source.js";
const sources = { teletext, selko, "my-source": mySource };
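
For readers without the repository at hand, the sketch below spells out one possible shape for the pieces involved. The `CollectedArticle` fields, the `CollectOptions` type, and the `getSource` and `collectAll` helpers are guesses made for illustration; check `src/sources/types.ts` for the real definitions.

```typescript
// Assumed shapes; the real definitions live in src/sources/types.ts.
interface CollectedArticle {
  source: string;
  title: string;
  body: string;
}
interface CollectOptions {
  count?: number;
}
interface ArticleSource {
  name: string;
  collect(options: CollectOptions): AsyncGenerator<CollectedArticle>;
}

const mySource: ArticleSource = {
  name: "my-source",
  async *collect(options) {
    const count = options.count ?? 1;
    for (let i = 1; i <= count; i++) {
      // A real source would fetch and parse remote content here.
      yield { source: "my-source", title: `Article ${i}`, body: "placeholder" };
    }
  },
};

const sources: Record<string, ArticleSource> = { "my-source": mySource };

// Hypothetical lookup that fails loudly on an unregistered name.
function getSource(name: string): ArticleSource {
  const source = sources[name];
  if (!source) throw new Error(`Unknown source: ${name}`);
  return source;
}

// Drain a source into an array (handy in tests).
async function collectAll(
  source: ArticleSource,
  options: CollectOptions = {},
): Promise<CollectedArticle[]> {
  const articles: CollectedArticle[] = [];
  for await (const article of source.collect(options)) articles.push(article);
  return articles;
}
```

An async generator lets the caller store each article as soon as it arrives and stop early, instead of waiting for the whole batch to download.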

Development

bun run dev -- <command>    # Run CLI
bun test                    # Run tests (vitest)
bun run lint                # Lint (oxlint)
bun run fmt                 # Format (oxfmt)
bun run check               # Full check (lint + format + typecheck)

License

ISC
