Rules-first · ML-assisted · LLM-optional · Offline-ready
简体中文 | Documentation | Releases
CleanBook is an open-source, offline-first bookmark cleaning and classification tool. It turns chaotic browser bookmark collections into well-organized, categorized libraries using a hybrid approach: deterministic rules first, machine learning to improve coverage, and optional LLM assistance on top.
Your bookmarks stay on your machine. No cloud uploads, no privacy concerns.
- Why CleanBook?
- Quick Start
- How It Works
- Features
- Target Users
- Performance
- Documentation
- Development
- Contributing
| Problem | CleanBook Solution |
|---|---|
| 🔍 Can't find bookmarks in a messy collection of hundreds or thousands | Smart classification into categories you define, with 91.4% accuracy |
| ⏱️ Manual organizing is tedious and hard to maintain | Fully automated batch processing—point it at your export, get organized results |
| 🔒 Privacy concerns with cloud-based bookmark managers | 100% offline processing. Your data never leaves your device |
| ⚙️ One-size-fits-all tools don't match your workflow | Configuration-driven: customize categories, rules, and thresholds via JSON/YAML |
```bash
# Via pipx (recommended - isolated environment)
pipx install cleanbook

# Via pip
pip install cleanbook

# From source
git clone https://github.com/LessUp/bookmarks-cleaner.git
cd bookmarks-cleaner && pip install .
```

```bash
# Process your bookmarks
cleanbook -i bookmarks.html -o output/

# Interactive wizard mode
cleanbook-wizard
```

```text
✓ Loaded 1,247 bookmarks from bookmarks.html
✓ Removed 23 duplicates (1.8%)
✓ Classified 1,224 bookmarks (91.4% accuracy)
✓ Generated:
    output/bookmarks_clean.html   # Import to browser
    output/bookmarks_data.json    # Structured data
    output/report.md              # Classification report
✓ Done in 2.34s
```
```text
                  ┌─────────────────────────────────────┐
bookmarks.html ──▶│  1. Parse & Extract                 │
                  │     URLs, titles, metadata          │
                  └─────────────┬───────────────────────┘
                                ▼
                  ┌─────────────────────────────────────┐
                  │  2. Smart Deduplication             │
                  │     URL normalization, similarity   │
                  └─────────────┬───────────────────────┘
                                ▼
┌───────────────────┬─────────────────────────────────────┬───────────────────┐
│                   │    3. Multi-Layer Classification    │                   │
│  High Priority    │     ┌─────────────────────────┐     │                   │
│  ═════════════    │     │ Rule Engine (30%)       │◀────┤ Domain, keyword   │
│                   │     │ ML Classifier (25%)     │◀────┤ TF-IDF + Ensemble │
│  Automatic        │     │ Semantic (20%)          │◀────┤ Word vectors      │
│  Fallback ────────┼────▶│ User Profile (10%)      │     │                   │
│                   │     │ LLM (15%, optional)     │◀────┤ OpenAI-compatible │
│                   │     └───────────┬─────────────┘     │ (if configured)   │
└───────────────────┴─────────────────┼───────────────────┴───────────────────┘
                                      ▼
                     ┌─────────────────────────────────────┐
                     │  4. Weighted Voting Fusion          │
                     │     Combine results, confidence calc│
                     └─────────────┬───────────────────────┘
                                   ▼
                     ┌─────────────────────────────────────┐
                     │  5. Multi-Format Export             │
                     │     HTML | JSON | Markdown          │
                     └─────────────────────────────────────┘
```
Key Design: Each layer provides confidence scores. If ML or LLM is unavailable, the system automatically redistributes weights to other layers—classification always completes.
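The weight-redistribution idea can be sketched in a few lines of Python. This is illustrative only: the function name, layer names, and vote shapes are assumptions, not CleanBook's actual API; the nominal weights follow the diagram above.

```python
def fuse_votes(layer_votes, weights):
    """Combine per-layer votes with weighted voting.

    layer_votes: {layer: (category, confidence)} -- only layers that
    actually produced a result are present.
    weights: nominal weight per layer, summing to 1.0 when all run.
    """
    # Redistribute weight from unavailable layers across the active ones
    active = {name: w for name, w in weights.items() if name in layer_votes}
    total = sum(active.values())
    norm = {name: w / total for name, w in active.items()}

    # Accumulate weighted confidence per candidate category
    scores = {}
    for name, (category, confidence) in layer_votes.items():
        scores[category] = scores.get(category, 0.0) + norm[name] * confidence

    best = max(scores, key=scores.get)
    return best, scores[best]

# ML and LLM layers unavailable: their weight flows to the remaining layers
votes = {"rules": ("Technology/AI", 0.9), "semantic": ("Technology/AI", 0.6)}
weights = {"rules": 0.30, "ml": 0.25, "semantic": 0.20,
           "profile": 0.10, "llm": 0.15}
category, score = fuse_votes(votes, weights)  # classification still completes
```

Because the active weights are renormalized to sum to 1, confidence scores stay comparable regardless of which layers ran.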
🚀 Offline-First Design
The complete pipeline runs locally with no cloud services, and the rule engine responds in under a millisecond. Perfect for:
- Air-gapped environments
- Privacy-sensitive users
- Batch processing large collections
🤖 Hybrid Classification (91.4% Accuracy)
Multi-layer approach with automatic fallback:
| Layer | Priority | Speed | Fallback |
|---|---|---|---|
| Rule Engine | High | 0.1ms | Never fails |
| ML Classifier | Medium | ~5ms | Rules |
| Semantic Analysis | Medium | ~3ms | Rules |
| LLM (optional) | Low | ~500ms | All above |
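The rule layer's behavior can be approximated with a simple keyword matcher over the `{"match": ..., "keywords": ..., "weight": ...}` rule shape shown in the configuration example below. A sketch under that assumption; the function and bookmark fields are hypothetical, not CleanBook's internals:

```python
def score_bookmark(bookmark, category_rules):
    """Sum rule weights for every rule whose keywords appear in the
    bookmark field named by the rule's "match" key."""
    scores = {}
    for category, spec in category_rules.items():
        total = 0
        for rule in spec["rules"]:
            field = bookmark.get(rule["match"], "").lower()
            if any(kw.lower() in field for kw in rule["keywords"]):
                total += rule["weight"]
        if total:
            scores[category] = total
    return scores

rules = {
    "Technology/AI": {"rules": [
        {"match": "domain", "keywords": ["openai.com"], "weight": 15},
        {"match": "title", "keywords": ["GPT", "LLM"], "weight": 10},
    ]}
}
bm = {"domain": "openai.com", "title": "GPT-4 announcement"}
scores = score_bookmark(bm, rules)  # {"Technology/AI": 25}
```

Pure string matching like this is why the rule layer is fast and can never fail: there is no model to load and no network call to time out.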
⚙️ Configuration-Driven
Customize everything via config.json—no code changes required:
```json
{
  "category_rules": {
    "Technology/AI": {
      "rules": [
        { "match": "domain", "keywords": ["openai.com", "huggingface.co"], "weight": 15 },
        { "match": "title", "keywords": ["GPT", "LLM", "neural network"], "weight": 10 }
      ]
    }
  }
}
```

📦 Multi-Format Export
| Format | Use Case | Browser Support |
|---|---|---|
| HTML (Netscape) | Re-import to browser | Chrome, Firefox, Safari, Edge |
| JSON | Data analysis, further processing | Universal |
| Markdown | Knowledge base, documentation | Notion, Obsidian, GitHub |
🎯 Smart Deduplication
- URL normalization (HTTP → HTTPS, www removal, trailing slashes)
- Multi-dimensional similarity detection (SimHash, Levenshtein distance)
- Preserves the most complete metadata when merging duplicates
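URL normalization along those lines can be sketched with the standard library. This is a minimal illustration of the listed steps (HTTP upgrade, `www.` removal, trailing-slash stripping); CleanBook's actual normalization rules may differ.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Canonicalize a URL for duplicate detection."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    if scheme == "http":
        scheme = "https"                 # upgrade HTTP -> HTTPS
    netloc = netloc.lower()
    if netloc.startswith("www."):
        netloc = netloc[4:]              # drop leading www.
    path = path.rstrip("/")              # drop trailing slashes
    return urlunsplit((scheme, netloc, path, query, ""))  # drop fragment

# These variants all collapse to the same canonical form
assert normalize_url("http://www.example.com/docs/") == \
       normalize_url("https://example.com/docs")
```

Exact-match deduplication then runs on the canonical form, while near-duplicates (different titles, tracking parameters) are left to the similarity pass.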
💾 Performance Optimized
- LRU caching for repeated operations
- Parallel processing with configurable workers
- Lazy initialization of ML components
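The caching and lazy-initialization points above correspond to standard Python idioms; a minimal sketch (not CleanBook's actual code, names are illustrative):

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def extract_domain(url: str) -> str:
    """Cached helper: repeated domains across a collection hit the cache."""
    return url.split("//", 1)[-1].split("/", 1)[0].lower()

class Classifier:
    """Lazy initialization: the ML model loads on first use, so
    rule-only runs never pay the model's startup cost."""
    def __init__(self):
        self._model = None

    @property
    def model(self):
        if self._model is None:
            self._model = self._load_model()  # expensive, deferred
        return self._model

    def _load_model(self):
        return object()  # placeholder for a scikit-learn pipeline

extract_domain("https://openai.com/blog")   # first call: cache miss
extract_domain("https://openai.com/blog")   # second call: cache hit
stats = extract_domain.cache_info()         # hits/misses counters
```

`lru_cache` explains the high cache-hit rates reported below: bookmark collections reuse a small set of domains heavily.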
| User | Use Case | Recommended Setup |
|---|---|---|
| Individual Users | Personal bookmark maintenance | `pipx install cleanbook`, customize categories in config |
| Team Maintainers | Unified team bookmark standards | Share config.json + taxonomy YAML files, CI pipeline |
| Developers | Study bookmark processing pipelines | Fork repo, explore /specs, extend classifier plugins |
| Metric | Value |
|---|---|
| Classification Accuracy | 91.4% |
| Processing Speed | ~50+ bookmarks/sec |
| Cache Hit Rate | 87-92% |
| Memory (baseline) | ~45 MB |
| Memory (1K bookmarks) | ~125 MB |
Benchmarked on: Intel i7-1165G7, Python 3.11, scikit-learn 1.4.2
| Resource | Link |
|---|---|
| Homepage | lessup.github.io/bookmarks-cleaner |
| Quick Start | /en/quickstart |
| Best Practices | /en/guide/best-practices |
| Architecture | /en/design/architecture |
| LLM Templates | /en/reference/llm-templates |
```bash
# Clone repository
git clone https://github.com/LessUp/bookmarks-cleaner.git
cd bookmarks-cleaner

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Run tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html
```

See Development Guide for details.
Contributions are welcome! Please read our Contributing Guide for details.
This project follows Spec-Driven Development (SDD). Before writing any code, review the specification documents in the /specs directory. See AGENTS.md for the complete AI agent workflow.
This project is licensed under the MIT License.
- Inspired by the need for efficient personal knowledge management
- Built with scikit-learn, BeautifulSoup, and Rich
Made with ❤️ by LessUp