H-6504: vec2slug: URL slug generation from text embeddings by indietyp · Pull Request #142 · hashintel/labs

indietyp · 2026-05-25T14:54:18Z

vec2slug: URL slug generation from text embeddings

Adds libs/vec2slug, a research project that generates URL slugs directly from pooled sentence embeddings using a tiny transformer decoder, without re-feeding source text through a language model.

The core claim is that embeddings are a reusable substrate for cheap auxiliary outputs. Slug generation is the proof of concept: a system that already has embeddings for search or deduplication can produce human-readable slugs at ~$0 marginal cost (CPU time only) instead of paying for a Haiku-class LLM call ($0.001/slug).

What's in the PR

Full training pipeline (69 files, ~15.5k lines):

Corpus preparation: URL-extracted slugs from FineWeb-Edu (2.3M samples) and a smaller distilled corpus (10k, Haiku-generated labels)
Embedding: OpenAI text-embedding-3-small via OpenRouter or Batch API, with checkpointing and resumability
Cluster-based train/val/test split to prevent near-duplicate leakage
Two model architectures: MLP multi-label classifier (baseline, fails) and prefix-conditioned transformer decoder (seq2seq, works)
BPE tokenizer (5,000 subwords, hyphen-aware) and legacy KMeans vocabulary compression
Seven evaluation metrics: validity, exact match, token F1, ROUGE-1/L, BERTScore, distinctiveness, vocab diversity
ONNX export for browser/edge deployment
Attention analysis tooling (hyphen-routing discovery)
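
To illustrate the cluster-based split listed above, here is a minimal sketch. It assumes each sample already carries a cluster id (for example from clustering its embedding). The function name, the hash-based assignment, and the 80/10/10 default ratios are illustrative assumptions, not taken from the PR's code.

```python
import hashlib
from collections import defaultdict


def split_by_cluster(cluster_ids, val_frac=0.1, test_frac=0.1):
    """Assign samples to train/val/test so that no cluster spans two splits.

    cluster_ids: sequence where cluster_ids[i] is the cluster of sample i.
    Returns a dict mapping split name -> list of sample indices.
    (Hypothetical helper; the ratios apply to clusters, so sample
    proportions are only approximate.)
    """
    splits = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        # Hash the cluster id (not the sample) so all members land together.
        digest = hashlib.sha256(str(cluster).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
        if bucket < test_frac:
            splits["test"].append(index)
        elif bucket < test_frac + val_frac:
            splits["val"].append(index)
        else:
            splits["train"].append(index)
    return dict(splits)
```

Because the split is decided per cluster rather than per sample, near-duplicate documents that fall into the same cluster cannot leak from training into validation or test.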

Two trained models:

Model	Params	Token F1	BERTScore	Inference (VPS)
d=384, L=4	11.5M (46 MiB)	0.298	0.869	~89ms
d=512, L=6	24.8M (99 MiB)	0.306	0.872	~160ms

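Roughly doubling the parameter count (11.5M to 24.8M) raises Token F1 by only 0.008 (0.298 to 0.306), which sits within the ±0.008 confidence interval and is therefore not a statistically convincing gain. The smaller model is recommended for deployment.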
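
For readers unfamiliar with the Token F1 column in the table above, the following sketch shows one common way such a metric is computed. The PR description does not spell out the exact formulation, so this assumes tokens are the hyphen-separated parts of a slug, overlap is counted as a multiset, and the reported number is a macro average over samples. Both function names are hypothetical.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """F1 over hyphen-separated tokens, counting duplicates (multiset overlap)."""
    pred_tokens = [t for t in predicted.split("-") if t]
    ref_tokens = [t for t in reference.split("-") if t]
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def macro_token_f1(pairs) -> float:
    """Unweighted mean of per-sample token F1 over (predicted, reference) pairs."""
    scores = [token_f1(p, r) for p, r in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition a slug that shares one of two tokens with its reference scores 0.5, which gives a sense of how demanding a corpus-level score around 0.30 is when references are noisy URL-extracted slugs.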

HuggingFace publishing pipeline: model card template, eval extraction, ONNX bundling, and upload script targeting hashintel/vec2slug-v1-openai-{small,large}.

Standalone inference script (hf/inference.py): zero-dependency ONNX inference with beam search, with an optional PyTorch backend. It runs directly with uv run.
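
The beam search mentioned above can be sketched generically as follows. This is not the code in hf/inference.py: `step_fn` is a hypothetical stand-in for one decoder call that returns next-token log-probabilities, and the length normalization and the choice to keep collecting finished hypotheses instead of stopping at the first end-of-sequence token are assumptions made for illustration.

```python
def beam_search(step_fn, eos, beam_width=4, max_len=16):
    """Return the token sequence with the best length-normalized log-probability.

    step_fn(prefix) -> dict mapping next token -> log-probability
    (hypothetical interface standing in for a decoder forward pass).
    """
    beams = [((), 0.0)]   # (tokens so far, summed log-probability)
    finished = []         # hypotheses that emitted EOS (EOS itself stripped)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in step_fn(tokens).items():
                candidates.append((tokens + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((tokens[:-1], score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Fall back to unfinished beams if nothing emitted EOS within max_len.
    pool = finished or beams
    if not pool:
        return []
    # Divide by length (+1 covers the EOS step and avoids division by zero)
    # so that short hypotheses do not win merely by having fewer terms.
    best = max(pool, key=lambda c: c[1] / (len(c[0]) + 1))
    return list(best[0])
```

Comparing hypotheses by average rather than summed log-probability is one common way to avoid a bias toward short outputs. Whether the project uses this exact scoring is not stated in the PR description.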

Key findings

Bag-of-tokens classifiers (MLP) collapse to high-frequency function words. The failure is architectural, not a training deficiency.
BPE vocabulary was the single largest quality improvement (+0.072 Token F1 over KMeans). Vocabulary strategy imposes a hard ceiling.
Three calibration artifacts compounded: training-data truncation, position-uniform EOS loss, and standard beam search early-stop. Each fix was small; cumulatively they moved output from "topically correct but truncated" to "topically correct at appropriate length."
Hyphens serve as learned embedding-routing nodes: 4/8 attention heads at layer 1 allocate 96 to 99% of their attention from hyphens to the prefix embedding.
Parameter scaling produced no statistically convincing gains. Data quality and embedding information content are more likely bottlenecks.
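
The hyphen-routing statistic in the findings above (the share of attention that hyphen positions direct at the prefix embedding) could be measured along these lines. This is a sketch under stated assumptions: the attention matrix is for a single head, rows are already softmax-normalized, and the embedding prefix occupies exactly one position (index 0). The function name and signature are hypothetical and do not come from the PR's analysis tooling.

```python
def hyphen_to_prefix_attention(attention, tokens, prefix_index=0, hyphen="-"):
    """Mean attention mass that hyphen query positions place on the prefix slot.

    attention: square matrix (list of rows) for ONE head; attention[q][k] is
        the weight query position q assigns to key position k.
    tokens: label for each position; position `prefix_index` is assumed to
        hold the embedding prefix.
    Returns None if the sequence contains no hyphen.
    """
    hyphen_rows = [
        q for q, tok in enumerate(tokens) if tok == hyphen and q != prefix_index
    ]
    if not hyphen_rows:
        return None
    return sum(attention[q][prefix_index] for q in hyphen_rows) / len(hyphen_rows)
```

Averaging this value per head over a validation set would yield a per-head figure comparable to the 96 to 99% range quoted in the findings.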

Companion

Blog post at hash.dev/blog/vec2slug (separate).

vercel · 2026-05-25T14:54:22Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
petrinaut-hazel	Ready	Preview, Comment	May 25, 2026 2:55pm

augmentcode · 2026-05-25T14:54:27Z

This pull request is abnormally large and would use a significant amount of tokens to review. If you still wish to review it, comment "augment review" and we will review it.

cursor · 2026-05-25T14:54:27Z

PR Summary

Low Risk
Additive library under libs/ with gitignored data and env-based API keys; no changes to core application auth or runtime paths beyond submodules and ignore rules.

Overview
Introduces libs/vec2slug, a self-contained research package that turns pooled sentence embeddings (OpenAI text-embedding-3-small) into kebab-case URL slugs via a small prefix-conditioned transformer decoder, plus an MLP baseline for comparison.

Pipeline & tooling: Adds workspace-based data prep (FineWeb URL slugs and a smaller Haiku-distilled corpus), embedding (OpenRouter, local Harrier, OpenAI Batch), cluster splits, BPE and KMeans vocab paths, slug-train-* / slug-predict / slug-eval CLIs, ONNX export, attention analysis, and helper scripts (benchmarks, demo JSON, HF publish). Evaluation is a composable transform pipeline (validity, token F1, ROUGE, BERTScore, distinctiveness, etc.). Deployment artifacts: hf/inference.py (ONNX/PyTorch beam search), model card template, publish_hf.py, and documented canonical checkpoints (~0.30 macro token F1 on held-out test).

Repo wiring: .gitignore entries for vec2slug venv/data and Python caches; git submodules for vendor/evaluate, vendor/datatrove, and vendor/datasets. Research writeups in README.md and CONCLUSION.md.

Reviewed by Cursor Bugbot for commit f13e4a3. Bugbot is set up for automated code reviews on this repo.

indietyp · 2026-05-25T14:54:36Z

H-6504: vec2slug: URL slug generation from text embeddings #142 👈 (View in Graphite)
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

github-actions · 2026-05-25T14:54:43Z

Dependency Review

The following issues were found:

❌ 2 vulnerable package(s)
✅ 0 package(s) with incompatible licenses
✅ 0 package(s) with invalid SPDX license definitions
⚠️ 50 package(s) with unknown licenses.
⚠️ 1 packages with OpenSSF Scorecard issues.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 444794f.

indietyp added 30 commits May 15, 2026 11:10

feat: slug-from-embeddings

4010335

feat: slug from embedding

8d4b190

feat: slug from embedding

424f28b

feat: slug from embedding

e6e1502

feat: remove embedded git repo

8d4c827

feat: evaluation

aa97722

feat: mlp based training

ec2f1bd

feat: evaluation

faef4f1

feat: mlp based training

4a507ff

feat: embed corpus

35c03bc

feat: embed corpus

bdccae8

feat: move to workspace abstraction

8b83920

feat: compress vocab

cf89da3

feat: experimentation around MLP

1a84d8d

feat: experimentation around S2S

197fa96

feat: compress vocab

ce9c2a9

feat: retraint

6dbcaf8

feat: viz polish

0c14fc7

feat: position aware loss

66b33b3

feat: evaluation

9e0c19c

feat: analyze attention

e08318f

feat: checkpoint

5b1592e

feat: benchmarks

8b19bdb

feat: move scripts into scripts

7c06897

feat: slug_from_embedding -> vec2slug

54a4eef

feat: checkpoint

1ed1979

feat: checkpoint

6c5f11d

feat: cached predict

c123186

fix: remove cache

4fa93d5

feat: build new examples and verify inference parity

c9cc82a

indietyp added 4 commits May 24, 2026 16:21

fix: ruff

7f4a884

chore: minor changes

cef39c2

fix: HF model card

3495388

chore: remove unneeded files

f13e4a3

vercel Bot had a problem deploying to Preview – hcore May 25, 2026 14:54 Failure

vercel Bot deployed to Preview – petrinaut-hazel May 25, 2026 14:54 View deployment

chore: remove vendored libraries

444794f

vercel Bot had a problem deploying to Preview – hcore May 25, 2026 14:55 Failure

vercel Bot deployed to Preview – petrinaut-hazel May 25, 2026 14:55 View deployment

cursor Bot reviewed May 25, 2026

Comment thread libs/vec2slug/scripts/benchmark_inference.py

indietyp wants to merge 35 commits into main from bm/h-6504-vec2slug-url-slug-generation-from-text-embeddings
