CleanBook — Smart Bookmark Cleaning & Classification

Rules-first · ML-assisted · LLM-optional · Offline-ready

CleanBook is an open-source, offline-first bookmark cleaning and classification tool. It turns a chaotic browser bookmark export into an organized, categorized library using a hybrid approach: rules first, machine learning to assist, and an optional LLM layer.

Your bookmarks stay on your machine. No cloud uploads, no privacy concerns.

Why CleanBook?

Problem	CleanBook Solution
🔍 Can't find bookmarks in a messy collection of hundreds or thousands	Smart classification into categories you define, with 91.4% accuracy
⏱️ Manual organizing is tedious and hard to maintain	Fully automated batch processing—point it at your export, get organized results
🔒 Privacy concerns with cloud-based bookmark managers	Offline processing by default. Your data stays on your device unless you point the optional LLM layer at a remote endpoint
⚙️ One-size-fits-all tools don't match your workflow	Configuration-driven: customize categories, rules, and thresholds via JSON/YAML

🚀 Quick Start

Install

# Via pipx (recommended - isolated environment)
pipx install cleanbook

# Via pip
pip install cleanbook

# From source
git clone https://github.com/LessUp/bookmarks-cleaner.git
cd bookmarks-cleaner && pip install .

Run

# Process your bookmarks
cleanbook -i bookmarks.html -o output/

# Interactive wizard mode
cleanbook-wizard

Example Output

✓ Loaded 1,247 bookmarks from bookmarks.html
✓ Removed 23 duplicates (1.8%)
✓ Classified 1,224 bookmarks (91.4% accuracy)
✓ Generated:
    output/bookmarks_clean.html    # Import to browser
    output/bookmarks_data.json     # Structured data
    output/report.md               # Classification report
✓ Done in 2.34s

🏗️ How It Works

                    ┌─────────────────────────────────────┐
  bookmarks.html ──▶│  1. Parse & Extract                 │
                    │     URLs, titles, metadata          │
                    └─────────────┬───────────────────────┘
                                  ▼
                    ┌─────────────────────────────────────┐
                    │  2. Smart Deduplication             │
                    │     URL normalization, similarity   │
                    └─────────────┬───────────────────────┘
                                  ▼
┌───────────────────┬─────────────────────────────────────┬───────────────────┐
│                   │  3. Multi-Layer Classification      │                   │
│  High Priority    │     ┌─────────────────────────┐     │                   │
│  ═════════════    │     │ Rule Engine  (30%)      │◀────┤ Domain, keyword   │
│                   │     │ ML Classifier (25%)     │◀────┤ TF-IDF + Ensemble │
│  Automatic        │     │ Semantic (20%)          │◀────┤ Word vectors      │
│  Fallback ────────┼────▶│ User Profile (10%)      │     │                   │
│                   │     │ LLM (15%, optional)     │◀────┤ OpenAI-compatible │
│                   │     └───────────┬─────────────┘     │   (if configured) │
└───────────────────┴─────────────────┼───────────────────┴───────────────────┘
                                      ▼
                    ┌─────────────────────────────────────┐
                    │  4. Weighted Voting Fusion          │
                    │     Combine results, confidence calc│
                    └─────────────┬───────────────────────┘
                                  ▼
                    ┌─────────────────────────────────────┐
                    │  5. Multi-Format Export             │
                    │     HTML | JSON | Markdown          │
                    └─────────────────────────────────────┘
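
As a rough illustration of step 1, the sketch below pulls URLs, titles, and folder paths out of a Netscape-format export. It uses only the standard library's `html.parser` so that it is self-contained (the project itself lists BeautifulSoup among its dependencies), and the class name, function name, and output field names are hypothetical, not CleanBook's actual API.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collects url, title, and folder path from a Netscape bookmark export."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._stack = []      # one entry per open <DL>: folder name or None
        self._pending = None  # folder name seen in <H3>, waiting for its <DL>
        self._tag = None      # "a" or "h3" while inside that element
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "h3"):
            self._tag, self._text = tag, []
            self._href = dict(attrs).get("href")
        elif tag == "dl":
            self._stack.append(self._pending)
            self._pending = None

    def handle_data(self, data):
        if self._tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "dl" and self._stack:
            self._stack.pop()
        elif tag == self._tag:
            text = "".join(self._text).strip()
            if tag == "h3":
                self._pending = text
            elif self._href:
                folder = "/".join(f for f in self._stack if f)
                self.bookmarks.append(
                    {"url": self._href, "title": text, "folder": folder}
                )
            self._tag = None


def parse_bookmarks(html_text):
    """Return a list of bookmark dicts found in html_text."""
    parser = BookmarkParser()
    parser.feed(html_text)
    parser.close()
    return parser.bookmarks
```

Tracking every open `<DL>` on a stack, including the unnamed top-level one, keeps folder paths correct when folders are nested.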

Key Design: Each layer provides confidence scores. If ML or LLM is unavailable, the system automatically redistributes weights to other layers—classification always completes.
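
One way to picture the weight redistribution is the minimal sketch below. The weights are taken from the diagram above; the layer keys, the `fuse` function, and the "Uncategorized" fallback label are assumptions made for illustration, and the real fusion logic may differ.

```python
from collections import defaultdict

# Layer weights from the diagram; the key names are hypothetical.
DEFAULT_WEIGHTS = {
    "rules": 0.30,
    "ml": 0.25,
    "semantic": 0.20,
    "user_profile": 0.10,
    "llm": 0.15,
}


def fuse(predictions, weights=DEFAULT_WEIGHTS):
    """Combine per-layer (category, confidence) votes into one result.

    `predictions` maps a layer name to a (category, confidence) tuple, or to
    None when that layer is unavailable. The weights of unavailable layers
    are redistributed proportionally across the layers that did answer.
    """
    active = {name: vote for name, vote in predictions.items() if vote is not None}
    if not active:
        return "Uncategorized", 0.0  # hypothetical fallback label

    total = sum(weights[name] for name in active)
    scores = defaultdict(float)
    for name, (category, confidence) in active.items():
        scores[category] += (weights[name] / total) * confidence

    best = max(scores, key=scores.get)
    return best, round(scores[best], 4)


votes = {
    "rules": ("Technology/AI", 0.9),
    "ml": ("Technology/AI", 0.7),
    "semantic": ("Science", 0.6),
    "user_profile": None,  # layer unavailable
    "llm": None,           # not configured
}
print(fuse(votes))
```

Dividing each weight by the sum of the active weights keeps the final score on a 0 to 1 scale no matter how many layers are missing.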

✨ Features

🚀 Offline-First Design

The complete pipeline runs locally without any cloud services. The rule engine responds in under a millisecond. Perfect for:

Air-gapped environments
Privacy-sensitive users
Batch processing large collections

🤖 Hybrid Classification (91.4% Accuracy)

Multi-layer approach with automatic fallback:

Layer	Priority	Speed	Fallback
Rule Engine	High	0.1ms	Never fails
ML Classifier	Medium	~5ms	Rules
Semantic Analysis	Medium	~3ms	Rules
LLM (optional)	Low	~500ms	All above

⚙️ Configuration-Driven

Customize everything via config.json—no code changes required:

{
  "category_rules": {
    "Technology/AI": {
      "rules": [
        { "match": "domain", "keywords": ["openai.com", "huggingface.co"], "weight": 15 },
        { "match": "title", "keywords": ["GPT", "LLM", "neural network"], "weight": 10 }
      ]
    }
  }
}
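
To make the rule format more concrete, here is a minimal sketch of how such a config could be evaluated. The README does not spell out the matching semantics, so this assumes that a "domain" rule matches the host or any subdomain, that a "title" rule is a case-insensitive substring match, and that the weights of matching rules are summed. `score_bookmark` is a hypothetical helper, not part of CleanBook.

```python
import json
from urllib.parse import urlparse

CONFIG = json.loads("""
{
  "category_rules": {
    "Technology/AI": {
      "rules": [
        { "match": "domain", "keywords": ["openai.com", "huggingface.co"], "weight": 15 },
        { "match": "title", "keywords": ["GPT", "LLM", "neural network"], "weight": 10 }
      ]
    }
  }
}
""")


def score_bookmark(url, title, config=CONFIG):
    """Return {category: score}, summing the weight of every matching rule."""
    host = (urlparse(url).hostname or "").lower()
    title_lower = title.lower()
    scores = {}
    for category, spec in config["category_rules"].items():
        total = 0
        for rule in spec["rules"]:
            keywords = [k.lower() for k in rule["keywords"]]
            if rule["match"] == "domain":
                # Assumed semantics: exact host or any subdomain of the keyword.
                hit = any(host == k or host.endswith("." + k) for k in keywords)
            elif rule["match"] == "title":
                # Assumed semantics: case-insensitive substring match.
                hit = any(k in title_lower for k in keywords)
            else:
                hit = False  # unknown match types are ignored in this sketch
            if hit:
                total += rule["weight"]
        if total:
            scores[category] = total
    return scores


print(score_bookmark("https://platform.openai.com/docs", "GPT guide"))
```

Plain substring matching is easy to reason about but can over-match short keywords, so a real rule engine may prefer word-boundary matching.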

📦 Multi-Format Export

Format	Use Case	Browser Support
HTML (Netscape)	Re-import to browser	Chrome, Firefox, Safari, Edge
JSON	Data analysis, further processing	Universal
Markdown	Knowledge base, documentation	Notion, Obsidian, GitHub
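
For orientation, the sketch below renders a categorized collection in two of these formats. The Netscape layout follows the widely used bookmark-file structure that browsers import; the function names, the input shape, and the Markdown layout are assumptions for illustration and may not match the files CleanBook actually writes.

```python
import html


def to_netscape_html(by_category):
    """Render {category: [(title, url), ...]} as a Netscape bookmark file."""
    lines = [
        "<!DOCTYPE NETSCAPE-Bookmark-file-1>",
        '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">',
        "<TITLE>Bookmarks</TITLE>",
        "<H1>Bookmarks</H1>",
        "<DL><p>",
    ]
    for category, items in by_category.items():
        # Each category becomes a folder: an <H3> followed by a nested <DL>.
        lines.append(f"    <DT><H3>{html.escape(category)}</H3>")
        lines.append("    <DL><p>")
        for title, url in items:
            href = html.escape(url, quote=True)
            lines.append(f'        <DT><A HREF="{href}">{html.escape(title)}</A>')
        lines.append("    </DL><p>")
    lines.append("</DL><p>")
    return "\n".join(lines) + "\n"


def to_markdown(by_category):
    """Render the same structure as Markdown link lists grouped by category."""
    out = []
    for category, items in by_category.items():
        out.append(f"## {category}")
        out.extend(f"- [{title}]({url})" for title, url in items)
        out.append("")
    return "\n".join(out)
```

Escaping titles and URLs matters here because bookmark titles often contain `&` or `<`, which would otherwise corrupt the file on re-import.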

🎯 Smart Deduplication

URL normalization (HTTP → HTTPS, www removal, trailing slashes)
Multi-dimensional similarity detection (SimHash, Levenshtein distance)
Preserves the most complete metadata when merging duplicates
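
The sketch below illustrates the three normalization steps listed above, a plain Levenshtein distance, and one possible reading of "most complete metadata" (the record with the most non-empty fields). All function names are hypothetical, SimHash is left out for brevity, and stripping trailing slashes and default ports is an assumption about what the normalization does.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    scheme = "https" if parts.scheme in ("http", "https") else parts.scheme
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"  # keep only non-default ports
    path = parts.path.rstrip("/")
    return urlunsplit((scheme, host, path, parts.query, parts.fragment))


def levenshtein(a, b):
    """Edit distance between two strings, using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def dedupe(bookmarks):
    """Keep one record per normalized URL, preferring the most filled-in one."""
    best = {}
    for bm in bookmarks:
        key = normalize_url(bm["url"])
        filled = sum(1 for value in bm.values() if value)
        if key not in best or filled > best[key][0]:
            best[key] = (filled, bm)
    return [bm for _, bm in best.values()]
```

Exact matching on the normalized URL catches the cheap cases first; an edit-distance check on titles or URLs can then be reserved for the remaining near-duplicates.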

💾 Performance Optimized

LRU caching for repeated operations
Parallel processing with configurable workers
Lazy initialization of ML components
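
These three techniques can be sketched with the standard library alone. The names below (`LazyModel`, `domain_of`, `process_all`) and the cache size are illustrative assumptions, not CleanBook internals.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from urllib.parse import urlsplit


class LazyModel:
    """Defers an expensive load until the first prediction is requested."""

    def __init__(self, loader):
        self._loader = loader
        self._model = None

    def predict(self, text):
        if self._model is None:  # load once, on first use
            self._model = self._loader()
        return self._model(text)


@lru_cache(maxsize=4096)
def domain_of(url):
    """Cached host extraction; repeated URLs skip re-parsing."""
    return (urlsplit(url).hostname or "").lower()


def process_all(urls, workers=4):
    """Map domain_of over urls with a configurable pool size, keeping order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(domain_of, urls))
```

A thread pool suits I/O-bound steps; for CPU-heavy classification a process pool may scale better, since CPython threads share one interpreter lock.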

🎯 Target Users

User	Use Case	Recommended Setup
Individual Users	Personal bookmark maintenance	`pipx install cleanbook`, customize categories in config
Team Maintainers	Unified team bookmark standards	Share config.json + taxonomy YAML files, CI pipeline
Developers	Study bookmark processing pipelines	Fork repo, explore `/specs`, extend classifier plugins

🔬 Performance

┌───────────────────────┬────────────┐
│ Metric                │ Value      │
├───────────────────────┼────────────┤
│ Classification Acc.   │ 91.4%      │
│ Processing Speed      │ ~50+ /sec  │
│ Cache Hit Rate        │ 87-92%     │
│ Memory (baseline)     │ ~45MB      │
│ Memory (1K bookmarks) │ ~125MB     │
└───────────────────────┴────────────┘

Benchmarked on: Intel i7-1165G7, Python 3.11, scikit-learn 1.4.2

📚 Documentation

Resource	Link
Homepage	lessup.github.io/bookmarks-cleaner
Quick Start	/en/quickstart
Best Practices	/en/guide/best-practices
Architecture	/en/design/architecture
LLM Templates	/en/reference/llm-templates

🛠️ Development

# Clone repository
git clone https://github.com/LessUp/bookmarks-cleaner.git
cd bookmarks-cleaner

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Run tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

See Development Guide for details.

🤝 Contributing

Contributions are welcome! Please read our Contributing Guide for details.

This project follows Spec-Driven Development (SDD). Before writing any code, review the specification documents in the /specs directory. See AGENTS.md for the complete AI agent workflow.

📝 License

This project is licensed under the MIT License.

🙏 Acknowledgments

Inspired by the need for efficient personal knowledge management
Built with scikit-learn, BeautifulSoup, and Rich

Made with ❤️ by LessUp
