Skip to content

H-6504: vec2slug: URL slug generation from text embeddings#142

Open
indietyp wants to merge 35 commits into
mainfrom
bm/h-6504-vec2slug-url-slug-generation-from-text-embeddings
Open

H-6504: vec2slug: URL slug generation from text embeddings#142
indietyp wants to merge 35 commits into
mainfrom
bm/h-6504-vec2slug-url-slug-generation-from-text-embeddings

Conversation

@indietyp
Copy link
Copy Markdown
Member

vec2slug: URL slug generation from text embeddings

Adds libs/vec2slug, a research project that generates URL slugs directly from pooled sentence embeddings using a tiny transformer decoder, without re-feeding source text through a language model.

The core claim is that embeddings are a reusable substrate for cheap auxiliary outputs. Slug generation is the proof of concept: if a system already has embeddings for search or deduplication, it can produce human-readable slugs for ~$0 marginal cost (CPU time only) instead of making a Haiku-class LLM call ($0.001/slug).

What's in the PR

Full training pipeline (69 files, ~15.5k lines):

  • Corpus preparation: URL-extracted slugs from FineWeb-Edu (2.3M samples) and a smaller distilled corpus (10k, Haiku-generated labels)
  • Embedding: OpenAI text-embedding-3-small via OpenRouter or Batch API, with checkpointing and resumability
  • Cluster-based train/val/test split to prevent near-duplicate leakage
  • Two model architectures: MLP multi-label classifier (baseline, fails) and prefix-conditioned transformer decoder (seq2seq, works)
  • BPE tokenizer (5,000 subwords, hyphen-aware) and legacy KMeans vocabulary compression
  • Seven evaluation metrics: validity, exact match, token F1, ROUGE-1/L, BERTScore, distinctiveness, vocab diversity
  • ONNX export for browser/edge deployment
  • Attention analysis tooling (hyphen-routing discovery)

Two trained models:

Model Params Token F1 BERTScore Inference (VPS)
d=384, L=4 11.5M (46 MiB) 0.298 0.869 ~89ms
d=512, L=6 24.8M (99 MiB) 0.306 0.872 ~160ms

Doubling parameters adds +0.008 Token F1, within the ±0.008 confidence interval. The smaller model is recommended for deployment.

HuggingFace publishing pipeline: model card template, eval extraction, ONNX bundling, and upload script targeting hashintel/vec2slug-v1-openai-{small,large}.

Standalone inference script (hf/inference.py): zero-dependency ONNX inference with beam search, also supports PyTorch backend. Runs with uv run directly.

Key findings

  1. Bag-of-tokens classifiers (MLP) collapse to high-frequency function words. The failure is architectural, not a training deficiency.
  2. BPE vocabulary was the single largest quality improvement (+0.072 Token F1 over KMeans). Vocabulary strategy imposes a hard ceiling.
  3. Three calibration artifacts compounded: training-data truncation, position-uniform EOS loss, and standard beam search early-stop. Each fix was small; cumulatively they moved output from "topically correct but truncated" to "topically correct at appropriate length."
  4. Hyphens serve as learned embedding-routing nodes: 4/8 attention heads at layer 1 allocate 96 to 99% of their attention from hyphens to the prefix embedding.
  5. Parameter scaling produced no statistically convincing gains. Data quality and embedding information content are more likely bottlenecks.

Companion

Blog post at hash.dev/blog/vec2slug (separate).

@vercel
Copy link
Copy Markdown

vercel Bot commented May 25, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
petrinaut-hazel Ready Ready Preview, Comment May 25, 2026 2:55pm

@augmentcode
Copy link
Copy Markdown

augmentcode Bot commented May 25, 2026

This pull request is abnormally large and would use a significant amount of tokens to review. If you still wish to review it, comment "augment review" and we will review it.

@cursor
Copy link
Copy Markdown

cursor Bot commented May 25, 2026

PR Summary

Low Risk
Additive library under libs/ with gitignored data and env-based API keys; no changes to core application auth or runtime paths beyond submodules and ignore rules.

Overview
Introduces libs/vec2slug, a self-contained research package that turns pooled sentence embeddings (OpenAI text-embedding-3-small) into kebab-case URL slugs via a small prefix-conditioned transformer decoder, plus an MLP baseline for comparison.

Pipeline & tooling: Adds workspace-based data prep (FineWeb URL slugs and a smaller Haiku-distilled corpus), embedding (OpenRouter, local Harrier, OpenAI Batch), cluster splits, BPE and KMeans vocab paths, slug-train-* / slug-predict / slug-eval CLIs, ONNX export, attention analysis, and helper scripts (benchmarks, demo JSON, HF publish). Evaluation is a composable transform pipeline (validity, token F1, ROUGE, BERTScore, distinctiveness, etc.). Deployment artifacts: hf/inference.py (ONNX/PyTorch beam search), model card template, publish_hf.py, and documented canonical checkpoints (~0.30 macro token F1 on held-out test).

Repo wiring: .gitignore entries for vec2slug venv/data and Python caches; git submodules for vendor/evaluate, vendor/datatrove, and vendor/datasets. Research writeups in README.md and CONCLUSION.md.

Reviewed by Cursor Bugbot for commit f13e4a3. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown
Member Author

indietyp commented May 25, 2026

This stack of pull requests is managed by Graphite. Learn more about stacking.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 25, 2026

Dependency Review

The following issues were found:

  • ❌ 2 vulnerable package(s)
  • ✅ 0 package(s) with incompatible licenses
  • ✅ 0 package(s) with invalid SPDX license definitions
  • ⚠️ 50 package(s) with unknown licenses.
  • ⚠️ 1 packages with OpenSSF Scorecard issues.

View full job summary

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 444794f. Configure here.

Comment thread libs/vec2slug/scripts/benchmark_inference.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant