CleanBook — Smart Bookmark Cleaning & Classification

Python 3.10+ · License: MIT

Rules-first · ML-assisted · LLM-optional · Offline-ready

简体中文 | Documentation | Releases


CleanBook is an open-source, offline-first bookmark cleaning and classification tool. It transforms chaotic browser bookmark collections into well-organized, categorized libraries using a hybrid approach that prioritizes rules, enhances with machine learning, and optionally leverages LLM capabilities.

Your bookmarks stay on your machine. No cloud uploads, no privacy concerns.


Why CleanBook?

| Problem | CleanBook Solution |
| --- | --- |
| 🔍 Can't find bookmarks in a messy collection of hundreds or thousands | Smart classification into categories you define, with 91.4% accuracy |
| ⏱️ Manual organizing is tedious and hard to maintain | Fully automated batch processing: point it at your export and get organized results |
| 🔒 Privacy concerns with cloud-based bookmark managers | 100% offline processing. Your data never leaves your device |
| ⚙️ One-size-fits-all tools don't match your workflow | Configuration-driven: customize categories, rules, and thresholds via JSON/YAML |

🚀 Quick Start

Install

# Via pipx (recommended - isolated environment)
pipx install cleanbook

# Via pip
pip install cleanbook

# From source
git clone https://github.com/LessUp/bookmarks-cleaner.git
cd bookmarks-cleaner && pip install .

Run

# Process your bookmarks
cleanbook -i bookmarks.html -o output/

# Interactive wizard mode
cleanbook-wizard

Example Output

✓ Loaded 1,247 bookmarks from bookmarks.html
✓ Removed 23 duplicates (1.8%)
✓ Classified 1,224 bookmarks (91.4% accuracy)
✓ Generated:
    output/bookmarks_clean.html    # Import to browser
    output/bookmarks_data.json     # Structured data
    output/report.md               # Classification report
✓ Done in 2.34s

🏗️ How It Works

                    ┌─────────────────────────────────────┐
  bookmarks.html ──▶│  1. Parse & Extract                 │
                    │     URLs, titles, metadata          │
                    └─────────────┬───────────────────────┘
                                  ▼
                    ┌─────────────────────────────────────┐
                    │  2. Smart Deduplication             │
                    │     URL normalization, similarity   │
                    └─────────────┬───────────────────────┘
                                  ▼
┌───────────────────┬─────────────────────────────────────┬───────────────────┐
│                   │  3. Multi-Layer Classification      │                   │
│  High Priority    │     ┌─────────────────────────┐     │                   │
│  ═════════════    │     │ Rule Engine  (30%)      │◀────┤ Domain, keyword   │
│                   │     │ ML Classifier (25%)     │◀────┤ TF-IDF + Ensemble │
│  Automatic        │     │ Semantic (20%)          │◀────┤ Word vectors      │
│  Fallback ────────┼────▶│ User Profile (10%)      │     │                   │
│                   │     │ LLM (15%, optional)     │◀────┤ OpenAI-compatible │
│                   │     └───────────┬─────────────┘     │   (if configured) │
└───────────────────┴─────────────────┼───────────────────┴───────────────────┘
                                      ▼
                    ┌─────────────────────────────────────┐
                    │  4. Weighted Voting Fusion          │
                    │     Combine results, confidence calc│
                    └─────────────┬───────────────────────┘
                                  ▼
                    ┌─────────────────────────────────────┐
                    │  5. Multi-Format Export             │
                    │     HTML | JSON | Markdown          │
                    └─────────────────────────────────────┘

Key Design: Each layer reports a confidence score. If the ML or LLM layer is unavailable, its weight is automatically redistributed across the remaining layers, so classification always completes.
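The weighted voting and weight redistribution described above can be sketched roughly as follows. This is a minimal illustration, not CleanBook's actual implementation: the base weights are taken from the diagram, while the layer names (`rules`, `ml`, ...), the function names, and the proportional rescaling strategy are assumptions made for the example.

```python
from collections import defaultdict

# Base weights copied from the diagram; the layer keys are hypothetical names.
BASE_WEIGHTS = {"rules": 0.30, "ml": 0.25, "semantic": 0.20, "profile": 0.10, "llm": 0.15}


def redistribute(weights, available):
    """Drop unavailable layers and rescale the rest so they sum to 1 (assumed strategy)."""
    active = {name: w for name, w in weights.items() if name in available}
    total = sum(active.values())
    if total == 0:
        raise ValueError("no classification layer available")
    return {name: w / total for name, w in active.items()}


def fuse(votes, weights=BASE_WEIGHTS):
    """Combine per-layer votes; votes maps layer name -> (category, confidence in [0, 1])."""
    active = redistribute(weights, votes.keys())
    scores = defaultdict(float)
    for layer, weight in active.items():
        category, confidence = votes[layer]
        scores[category] += weight * confidence
    best = max(scores, key=scores.get)
    return best, scores[best]


# ML and LLM layers are missing here, so their weight moves to the other three.
votes = {
    "rules": ("Technology/AI", 0.9),
    "semantic": ("Technology/AI", 0.6),
    "profile": ("News", 0.5),
}
category, confidence = fuse(votes)
print(category, round(confidence, 3))
```

With only three layers voting, their weights are rescaled from 0.30/0.20/0.10 to 0.50/0.33/0.17, so the rule engine's vote still dominates and a result is produced even without ML or LLM.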


✨ Features

🚀 Offline-First Design

The complete pipeline runs locally without any cloud services, and the rule engine responds in under a millisecond. Well suited to:

  • Air-gapped environments
  • Privacy-sensitive users
  • Batch processing large collections

🤖 Hybrid Classification (91.4% Accuracy)

Multi-layer approach with automatic fallback:

| Layer | Priority | Speed | Fallback |
| --- | --- | --- | --- |
| Rule Engine | High | 0.1ms | Never fails |
| ML Classifier | Medium | ~5ms | Rules |
| Semantic Analysis | Medium | ~3ms | Rules |
| LLM (optional) | Low | ~500ms | All above |

⚙️ Configuration-Driven

Customize categories, rules, and thresholds via config.json; no code changes are required:

{
  "category_rules": {
    "Technology/AI": {
      "rules": [
        { "match": "domain", "keywords": ["openai.com", "huggingface.co"], "weight": 15 },
        { "match": "title", "keywords": ["GPT", "LLM", "neural network"], "weight": 10 }
      ]
    }
  }
}
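
To make the rule format concrete, here is a small sketch of how such a config could be evaluated. The matching semantics are assumptions for illustration only (a `domain` rule matches the URL host or any of its subdomains, a `title` rule is a case-insensitive substring match, and matching weights are summed); the helper names `rule_matches` and `score_categories` are hypothetical and not part of CleanBook's API.

```python
import json
from urllib.parse import urlsplit

# The same config as shown above, loaded from a string for the example.
CONFIG = json.loads("""
{
  "category_rules": {
    "Technology/AI": {
      "rules": [
        { "match": "domain", "keywords": ["openai.com", "huggingface.co"], "weight": 15 },
        { "match": "title", "keywords": ["GPT", "LLM", "neural network"], "weight": 10 }
      ]
    }
  }
}
""")


def rule_matches(rule, url, title):
    """Assumed semantics: 'domain' checks the URL host, 'title' is a case-insensitive substring test."""
    keywords = [k.lower() for k in rule["keywords"]]
    if rule["match"] == "domain":
        host = (urlsplit(url).hostname or "").lower()
        return any(host == k or host.endswith("." + k) for k in keywords)
    if rule["match"] == "title":
        return any(k in title.lower() for k in keywords)
    return False  # unknown match types are ignored in this sketch


def score_categories(config, url, title):
    """Sum the weights of all matching rules for each category."""
    scores = {}
    for category, spec in config["category_rules"].items():
        total = sum(r["weight"] for r in spec["rules"] if rule_matches(r, url, title))
        if total:
            scores[category] = total
    return scores


print(score_categories(CONFIG, "https://platform.openai.com/docs", "GPT API reference"))
```

In this sketch both rules fire for the sample bookmark, so `Technology/AI` receives 15 + 10 = 25 points; a bookmark matching neither rule gets no entry at all.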

📦 Multi-Format Export

| Format | Use Case | Browser Support |
| --- | --- | --- |
| HTML (Netscape) | Re-import to browser | Chrome, Firefox, Safari, Edge |
| JSON | Data analysis, further processing | Universal |
| Markdown | Knowledge base, documentation | Notion, Obsidian, GitHub |

🎯 Smart Deduplication
  • URL normalization (HTTP → HTTPS, www removal, trailing slashes)
  • Multi-dimensional similarity detection (SimHash, Levenshtein distance)
  • Preserves the most complete metadata when merging duplicates
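
The URL normalization step listed above might look something like the sketch below. It covers only the three normalizations named in the list (HTTP → HTTPS, `www` removal, trailing slashes) and leaves out the SimHash/Levenshtein similarity stage. The bookmark dict shape and the function names are assumptions, and for brevity the sketch keeps the most complete record instead of merging fields.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Canonical form: https scheme, no leading 'www.', no trailing slash."""
    parts = urlsplit(url.strip())
    scheme = "https" if parts.scheme in ("http", "https") else parts.scheme
    host = (parts.hostname or "").lower()  # note: any user:password@ prefix is dropped
    if host.startswith("www."):
        host = host[4:]
    if parts.port:
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/")
    return urlunsplit((scheme, host, path, parts.query, parts.fragment))


def completeness(bookmark):
    """Count non-empty metadata fields; used to decide which duplicate to keep."""
    return sum(1 for value in bookmark.values() if value)


def deduplicate(bookmarks):
    """Keep one bookmark per normalized URL, preferring the most complete record."""
    best = {}
    for bm in bookmarks:
        key = normalize_url(bm["url"])
        if key not in best or completeness(bm) > completeness(best[key]):
            best[key] = bm
    return list(best.values())


bookmarks = [
    {"url": "http://www.example.com/docs/", "title": "Docs", "tags": []},
    {"url": "https://example.com/docs", "title": "Docs", "tags": ["reference"]},
    {"url": "https://example.com/blog", "title": "Blog", "tags": []},
]
unique = deduplicate(bookmarks)
print(len(unique), [bm["url"] for bm in unique])
```

The first two entries normalize to the same key, and the tagged one wins because it carries more non-empty fields.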

💾 Performance Optimized
  • LRU caching for repeated operations
  • Parallel processing with configurable workers
  • Lazy initialization of ML components

🎯 Target Users

| User | Use Case | Recommended Setup |
| --- | --- | --- |
| Individual Users | Personal bookmark maintenance | pipx install cleanbook, customize categories in config |
| Team Maintainers | Unified team bookmark standards | Share config.json + taxonomy YAML files, CI pipeline |
| Developers | Study bookmark processing pipelines | Fork repo, explore /specs, extend classifier plugins |

🔬 Performance

┌─────────────────────────┬───────────────────┐
│ Metric                  │ Value             │
├─────────────────────────┼───────────────────┤
│ Classification accuracy │ 91.4%             │
│ Processing speed        │ 50+ bookmarks/sec │
│ Cache hit rate          │ 87-92%            │
│ Memory (baseline)       │ ~45MB             │
│ Memory (1K bookmarks)   │ ~125MB            │
└─────────────────────────┴───────────────────┘

Benchmarked on: Intel i7-1165G7, Python 3.11, scikit-learn 1.4.2


📚 Documentation

| Resource | Link |
| --- | --- |
| Homepage | lessup.github.io/bookmarks-cleaner |
| Quick Start | /en/quickstart |
| Best Practices | /en/guide/best-practices |
| Architecture | /en/design/architecture |
| LLM Templates | /en/reference/llm-templates |

🛠️ Development

# Clone repository
git clone https://github.com/LessUp/bookmarks-cleaner.git
cd bookmarks-cleaner

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Run tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

See Development Guide for details.


🤝 Contributing

Contributions are welcome! Please read our Contributing Guide for details.

This project follows Spec-Driven Development (SDD). Before writing any code, review the specification documents in the /specs directory. See AGENTS.md for the complete AI agent workflow.


📝 License

This project is licensed under the MIT License.


Made with ❤️ by LessUp
