tg-archiver

Self-hosted Telegram channel archiver with web interface.

Archive messages from Telegram channels with media storage, translation support, full-text search, and RSS feed generation.

Features

  • Real-time Archiving: Monitor channels via Telegram folders - add channels by drag-and-drop
  • Media Storage: Photos, videos, documents with SHA-256 deduplication
  • Translation: Auto-translate non-English content (Google Translate free tier, optional DeepL)
  • Full-Text Search: PostgreSQL tsvector-powered search across content and translations
  • RSS Feeds: Generate RSS/Atom/JSON feeds for archived channels
  • Social Graph: Track forwards, replies, comments, and engagement metrics
  • Forward Chain Tracking: Auto-discover channels via forwards, fetch original message context
  • Topic Classification: Admin-defined topic taxonomy for message categorization
  • Backfill: Automatically fetch historical messages when adding new channels
  • Self-Hosted: Complete data sovereignty - your data stays on your infrastructure
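
One way to picture the SHA-256 deduplication listed above is content-addressed storage: hash the file bytes, use the digest as the object key, and skip the write when that key already exists. The sketch below is illustrative only; `MediaStore` and its in-memory dictionary are hypothetical stand-ins for the MinIO bucket, not the project's actual code.

```python
import hashlib


class MediaStore:
    """Hypothetical in-memory stand-in for an S3-style bucket."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put_if_absent(self, data: bytes) -> tuple[str, bool]:
        """Store data under its SHA-256 digest.

        Returns (digest, stored); stored is False when the content
        was already present, so nothing is written twice.
        """
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._objects:
            return digest, False  # identical content already archived
        self._objects[digest] = data
        return digest, True


store = MediaStore()
first_key, first_stored = store.put_if_absent(b"same photo bytes")
second_key, second_stored = store.put_if_absent(b"same photo bytes")
```

Both calls return the same key, and only the first one stores anything, which is why a photo forwarded across many channels would occupy storage once.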

Architecture

┌─────────────┐     ┌──────────┐     ┌───────────┐     ┌───────────┐
│  Telegram   │────▶│ Listener │────▶│   Redis   │────▶│ Processor │
│   Folders   │     │ Service  │     │  Streams  │     │  Service  │
└─────────────┘     └──────────┘     └───────────┘     └───────────┘
                                                             │
                    ┌──────────┐     ┌───────────┐           │
                    │ Frontend │◀────│    API    │◀──────────┤
                    │ Next.js  │     │  FastAPI  │           │
                    └──────────┘     └───────────┘           ▼
                                                      ┌──────────────┐
                                                      │  PostgreSQL  │
                                                      │    MinIO     │
                                                      └──────────────┘

| Service | Purpose |
|---------|---------|
| Listener | Connects to Telegram, monitors folder-based channels |
| Processor | Processes messages, downloads media, stores to DB |
| API | FastAPI REST API with search, RSS, and admin endpoints |
| Frontend | Next.js web interface for browsing and search |
| PostgreSQL | Message storage with full-text search |
| Redis | Message queue using Redis Streams |
| MinIO | S3-compatible media storage |
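
The Listener-to-Processor hand-off through Redis Streams could look roughly like the sketch below. It is not the project's code: the stream name, group name, and `payload` field are hypothetical, and `client` is assumed to be a redis-py client created with `decode_responses=True` (any object exposing `xadd`, `xreadgroup`, and `xack` with redis-py's signatures would do).

```python
import json

STREAM = "telegram:messages"  # hypothetical stream name
GROUP = "processors"          # hypothetical consumer group


def enqueue(client, message: dict) -> str:
    """Listener side: append one message to the stream, return its entry ID."""
    return client.xadd(STREAM, {"payload": json.dumps(message)})


def process_batch(client, consumer: str, handle, count: int = 10) -> int:
    """Processor side: read new entries, handle each, then acknowledge it."""
    # ">" asks for entries not yet delivered to any consumer in the group.
    response = client.xreadgroup(GROUP, consumer, {STREAM: ">"}, count=count)
    handled = 0
    for _stream, entries in response or []:
        for entry_id, fields in entries:
            handle(json.loads(fields["payload"]))
            # Acknowledge only after the handler succeeded, so a crash
            # leaves the entry pending instead of losing it.
            client.xack(STREAM, GROUP, entry_id)
            handled += 1
    return handled
```

The consumer group has to exist before `xreadgroup` is called, for example via redis-py's `xgroup_create(STREAM, GROUP, mkstream=True)`.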

Quick Start

Prerequisites

  • Docker and Docker Compose (v2.0+)
  • Python 3.11+ (for initial Telegram authentication only)
  • Telegram API credentials from my.telegram.org/apps

Step 1: Clone and Configure

git clone https://github.com/osintukraine/tg-archiver.git
cd tg-archiver

# Copy example environment file
cp .env.example .env

Edit .env with your Telegram credentials:

# Required - get from https://my.telegram.org/apps
TELEGRAM_API_ID=your_api_id_here
TELEGRAM_API_HASH=your_api_hash_here

# Change these in production!
POSTGRES_PASSWORD=your_secure_password
JWT_SECRET_KEY=generate_a_64_char_secret_key
JWT_ADMIN_PASSWORD=your_admin_password

Step 2: Create Telegram Session

Before starting Docker, you need to authenticate with Telegram to create a session file:

# Install dependencies for auth script
pip install telethon python-dotenv

# Run the authentication script
python3 scripts/telegram_auth.py

The script will:

  1. Ask for your phone number (with country code, e.g., +1234567890)
  2. Send a verification code to your Telegram app
  3. Ask for 2FA password if enabled
  4. Create sessions/listener.session file
  5. Show your Telegram folders for verification

Example output:

🔐 tg-archiver Telegram Authentication
============================================================
API ID: 12345678
Session file: /path/to/tg-archiver/sessions/listener.session
============================================================

📱 Phone number authentication required

Enter your phone number (with country code, e.g., +1234567890): +1234567890

📤 Sending verification code to +1234567890...

🔑 Enter the verification code you received: 12345

🔐 Signing in...

============================================================
✅ Authentication successful!
============================================================
Logged in as: John Doe
Username: @johndoe
Phone: +1234567890
User ID: 123456789

📁 Telegram Folders on this account:
----------------------------------------
  1. [All Chats]
  2. Personal (15 chats)
  3. Work (8 chats)
----------------------------------------
⚠️  Target folder 'tg-archiver' NOT FOUND
   Create a folder named 'tg-archiver' in your Telegram app
   and add channels to archive.

Step 3: Create Archive Folder in Telegram

  1. Open Telegram (mobile or desktop)
  2. Go to Settings → Folders → Create New Folder
  3. Name it exactly: tg-archiver (or match your FOLDER_ARCHIVE_ALL_PATTERN in .env)
  4. Add channels you want to archive to this folder

Step 4: Build and Start the Platform

# Build all container images (first time or after updates)
docker-compose build

# Start all services
docker-compose up -d

# Watch logs to verify startup
docker-compose logs -f

Note: The build step builds the Python service images and compiles the Next.js frontend. The first build takes 3-5 minutes.

Step 5: Access the Interface

| Service | URL | Credentials |
|---------|-----|-------------|
| Frontend | http://localhost:3000 | JWT_ADMIN_EMAIL / JWT_ADMIN_PASSWORD |
| API Docs | http://localhost:8000/docs | - |
| MinIO Console | http://localhost:9001 | MINIO_ACCESS_KEY / MINIO_SECRET_KEY |

Channel Management

tg-archiver uses Telegram's native folder feature for channel management - no admin panel needed!

Adding Channels (Folder Method)

  1. Find a channel in Telegram
  2. Long-press (mobile) or right-click (desktop) → Add to Folder
  3. Select your tg-archiver folder
  4. Done! The listener detects changes within 5 minutes

Removing Channels

  1. Remove the channel from your tg-archiver folder
  2. The listener stops monitoring (existing messages remain archived)

Folder Naming

The default folder name is tg-archiver. You can customize it in .env:

FOLDER_ARCHIVE_ALL_PATTERN=my-archive

Note: Telegram folder names are limited to 12 characters.
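
A quick check before editing .env can catch a folder name that is too long. This is a minimal sketch based only on the 12-character limit stated above; `check_folder_name` is a hypothetical helper, and `len()` counts Unicode code points, which may not match how Telegram counts emoji.

```python
MAX_FOLDER_NAME_LENGTH = 12  # limit stated in the note above


def check_folder_name(name: str) -> str:
    """Return the trimmed name, or raise ValueError if it cannot be used."""
    cleaned = name.strip()
    if not cleaned:
        raise ValueError("folder name is empty")
    if len(cleaned) > MAX_FOLDER_NAME_LENGTH:
        raise ValueError(
            f"{cleaned!r} has {len(cleaned)} characters; "
            f"the limit is {MAX_FOLDER_NAME_LENGTH}"
        )
    return cleaned


default_name = check_folder_name("tg-archiver")  # 11 characters, accepted
```

A name such as `my-archive-2024` (15 characters) would be rejected by this check.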


Bulk Channel Import (CSV)

For importing many channels at once, use the CSV import feature in the admin panel.

How to Import

  1. Go to Admin → Import Channels (/admin/import)
  2. Prepare a CSV file with your channels
  3. Drag-and-drop or click to upload
  4. Review the validation results
  5. Select channels and target folders
  6. Click Start Import

CSV Format

channel_id,channel_username,target_folder,notes
-1001234567890,channelname,tg-archiver,Optional notes
,anotherchannel,tg-archiver,Username only (ID will be resolved)
-1009876543210,,tg-archiver,ID only

| Column | Required | Description |
|--------|----------|-------------|
| channel_id | One of ID or username | Telegram channel ID (negative number) |
| channel_username | One of ID or username | Channel username (without @) |
| target_folder | Yes | Target Telegram folder name |
| notes | No | Optional notes for reference |
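
To check a file against these column rules before uploading it, something like the following could be used. It is a local pre-check sketch, not the validation the admin panel performs; `parse_import_csv` is a hypothetical helper that only applies the rules in the table above.

```python
import csv
import io

HEADER_COLUMNS = ("channel_id", "channel_username", "target_folder")
ALL_COLUMNS = HEADER_COLUMNS + ("notes",)


def parse_import_csv(text: str) -> tuple[list[dict], list[str]]:
    """Return (valid_rows, errors) for a channel-import CSV."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in HEADER_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [], [f"missing columns: {', '.join(missing)}"]

    rows, errors = [], []
    for raw in reader:
        line = reader.line_num  # line of the row just read
        row = {key: (raw.get(key) or "").strip() for key in ALL_COLUMNS}
        if not row["channel_id"] and not row["channel_username"]:
            errors.append(f"line {line}: need channel_id or channel_username")
            continue
        if row["channel_id"]:
            try:
                channel_id = int(row["channel_id"])
            except ValueError:
                errors.append(f"line {line}: channel_id is not an integer")
                continue
            if channel_id >= 0:
                errors.append(f"line {line}: channel_id should be negative")
                continue
        if not row["target_folder"]:
            errors.append(f"line {line}: target_folder is required")
            continue
        # The username column is documented as "without @".
        row["channel_username"] = row["channel_username"].lstrip("@")
        rows.append(row)
    return rows, errors
```

Called on the example file above, this returns three valid rows and no errors; a row with neither an ID nor a username is reported with its line number.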

Import Process

  1. Validation - Each channel is validated (exists, accessible, not duplicate)
  2. Folder Creation - Target folders are created in Telegram if they don't exist
  3. Channel Addition - Channels are added to the specified folders
  4. Monitoring Starts - Listener automatically picks up new channels

Import Status

| Status | Meaning |
|--------|---------|
| pending | Waiting to be processed |
| validating | Checking channel accessibility |
| valid | Ready to import |
| invalid | Cannot be imported (see error) |
| importing | Being added to folder |
| completed | Successfully imported |
| failed | Import failed (see error) |
| skipped | Skipped (already monitored or deselected) |

Forward Chain Tracking

tg-archiver automatically discovers new channels through message forwards and enriches the social graph with original message context.

How It Works

Your Monitored Channel          Discovered Channel (auto-joined)
┌──────────────────────┐        ┌──────────────────────────────┐
│ Message A            │        │ Original Message             │
│ "Forwarded from X"   │───────▶│ - Content cached             │
│ - forward_from_id    │        │ - Reactions fetched          │
│ - propagation time   │        │ - Comments fetched           │
└──────────────────────┘        └──────────────────────────────┘

When a forwarded message arrives in your monitored channels:

  1. Discovery: The source channel is recorded in discovered_channels
  2. Auto-Join: Background worker joins the channel (for social data access only)
  3. Social Fetch: Original message content, reactions, and comments are retrieved
  4. Admin Review: Discovered channels appear in /admin/channels → Discovered tab

Key Benefits

  • No archiving of discovered channels - Only social context is fetched
  • Propagation timing - Track how fast content spreads (seconds from original to forward)
  • Complete social graph - Reactions and comments from the original post
  • Admin control - Promote interesting channels to full archiving, or ignore them
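
The propagation timing mentioned above comes down to the difference between two timestamps. A minimal sketch, assuming both timestamps are timezone-aware datetimes; `propagation_seconds` is a hypothetical helper, not a function from the codebase.

```python
from datetime import datetime, timezone


def propagation_seconds(original_at: datetime, forwarded_at: datetime) -> float:
    """Seconds between the original post and its forward."""
    if original_at.tzinfo is None or forwarded_at.tzinfo is None:
        # Mixing naive and aware datetimes would raise or mislead, so refuse.
        raise ValueError("timestamps must be timezone-aware")
    return (forwarded_at - original_at).total_seconds()


original = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
forward = datetime(2024, 1, 1, 12, 1, 30, tzinfo=timezone.utc)
delay = propagation_seconds(original, forward)  # 90.0
```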

Admin Interface

Navigate to Admin → Channels → Discovered tab to:

| Action | Description |
|--------|-------------|
| Promote | Add channel to full archiving (select category, folder, rule) |
| Ignore | Hide from suggestions, stop social fetching |
| Retry | Retry joining a failed/private channel |

Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| CHANNEL_JOIN_ENABLED | true | Enable auto-joining discovered channels |
| CHANNEL_JOIN_INTERVAL_SECONDS | 60 | Seconds between join attempts |
| CHANNEL_JOIN_BATCH_SIZE | 5 | Max channels to join per cycle |
| CHANNEL_JOIN_MAX_RETRIES | 3 | Retries before marking as failed |
| CHANNEL_JOIN_RETRY_DELAY_HOURS | 24 | Hours between retry attempts |

Social data fetching uses the existing SOCIAL_FETCH_* settings.
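
To see how the join settings interact, here is one possible reading of them as a selection rule for each cycle. This is a sketch under assumptions: `JoinSettings` and `pick_batch` are hypothetical, and treating `CHANNEL_JOIN_MAX_RETRIES` as a cap on total attempts is an interpretation, not confirmed behavior.

```python
import os
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class JoinSettings:
    enabled: bool
    interval_seconds: int
    batch_size: int
    max_retries: int
    retry_delay_hours: int

    @classmethod
    def from_env(cls, env=os.environ) -> "JoinSettings":
        """Read the CHANNEL_JOIN_* variables, using the documented defaults."""
        return cls(
            enabled=env.get("CHANNEL_JOIN_ENABLED", "true").lower() == "true",
            interval_seconds=int(env.get("CHANNEL_JOIN_INTERVAL_SECONDS", "60")),
            batch_size=int(env.get("CHANNEL_JOIN_BATCH_SIZE", "5")),
            max_retries=int(env.get("CHANNEL_JOIN_MAX_RETRIES", "3")),
            retry_delay_hours=int(env.get("CHANNEL_JOIN_RETRY_DELAY_HOURS", "24")),
        )


def pick_batch(candidates, settings: JoinSettings, now: datetime) -> list:
    """Choose channel IDs to try this cycle.

    candidates: iterable of (channel_id, attempts, last_attempt_or_None).
    """
    if not settings.enabled:
        return []
    delay = timedelta(hours=settings.retry_delay_hours)
    due = [
        channel_id
        for channel_id, attempts, last_attempt in candidates
        if attempts < settings.max_retries
        and (last_attempt is None or now - last_attempt >= delay)
    ]
    return due[: settings.batch_size]
```

With the defaults, a worker following this rule would wake every 60 seconds, try at most 5 channels, and leave a channel alone for 24 hours after a failed attempt.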

Database Tables

| Table | Purpose |
|-------|---------|
| discovered_channels | Channels found via forwards (status, metadata, admin actions) |
| message_forwards | Links archived messages to their original sources |
| original_messages | Cached content of original messages (graph leaf nodes) |
| forward_reactions | Reactions on original messages |
| forward_comments | Comments on original messages |

API Endpoints

GET  /api/admin/discovered           # List discovered channels
GET  /api/admin/discovered/stats     # Statistics
GET  /api/admin/discovered/{id}      # Channel details + recent forwards
POST /api/admin/discovered/{id}/promote  # Promote to full archiving
POST /api/admin/discovered/{id}/ignore   # Mark as ignored
POST /api/admin/discovered/{id}/retry    # Retry failed join
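
For scripting against these endpoints, the requests can be built with the standard library alone. The sketch below only constructs the requests; the base URL comes from the Quick Start table, while sending the JWT as a Bearer token is an assumption, and the promote endpoint's request body is not documented here, so none is included.

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # API address from the Quick Start table


def discovered_request(
    path: str = "", method: str = "GET", token: str = ""
) -> urllib.request.Request:
    """Build (but do not send) a request for /api/admin/discovered{path}."""
    request = urllib.request.Request(
        f"{BASE_URL}/api/admin/discovered{path}", method=method
    )
    if token:
        # Assumption: the API accepts the JWT as a Bearer token.
        request.add_header("Authorization", f"Bearer {token}")
    return request


list_req = discovered_request()
stats_req = discovered_request("/stats")
ignore_req = discovered_request("/42/ignore", method="POST", token="example-token")
# Send with urllib.request.urlopen(list_req) once the stack is running.
```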

Configuration Reference

Required Settings

| Variable | Description |
|----------|-------------|
| TELEGRAM_API_ID | Telegram API ID from my.telegram.org |
| TELEGRAM_API_HASH | Telegram API Hash from my.telegram.org |

Database

| Variable | Default | Description |
|----------|---------|-------------|
| POSTGRES_HOST | postgres | PostgreSQL hostname |
| POSTGRES_PORT | 5432 | PostgreSQL port |
| POSTGRES_DB | tg_archiver | Database name |
| POSTGRES_USER | archiver | Database user |
| POSTGRES_PASSWORD | - | Database password (change in production) |

Authentication

| Variable | Default | Description |
|----------|---------|-------------|
| AUTH_PROVIDER | jwt | jwt for local auth, none to disable |
| JWT_SECRET_KEY | - | Secret for JWT signing (change in production) |
| JWT_ADMIN_EMAIL | admin@tg-archiver.local | Admin login email |
| JWT_ADMIN_PASSWORD | - | Admin login password |
| JWT_EXPIRATION_MINUTES | 60 | Token expiration time |

Storage (MinIO)

| Variable | Default | Description |
|----------|---------|-------------|
| MINIO_ENDPOINT | minio:9000 | MinIO server address |
| MINIO_ACCESS_KEY | minioadmin | MinIO access key |
| MINIO_SECRET_KEY | minioadmin | MinIO secret key (change in production) |
| MINIO_BUCKET_NAME | tg-archive-media | Bucket for media files |

Translation

| Variable | Default | Description |
|----------|---------|-------------|
| TRANSLATION_ENABLED | true | Enable auto-translation |
| DEEPL_API_KEY | - | DeepL API key (uses Google Translate if not set) |

Backfill

| Variable | Default | Description |
|----------|---------|-------------|
| BACKFILL_ENABLED | true | Enable historical message backfill |
| BACKFILL_START_DATE | 2024-01-01 | How far back to fetch |
| BACKFILL_MODE | on_discovery | manual, on_discovery, or scheduled |
| BACKFILL_BATCH_SIZE | 100 | Messages per batch |
| BACKFILL_DELAY_MS | 1000 | Delay between batches (rate limiting) |
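
The batch size and delay settings suggest a fetch-pause-fetch loop. The sketch below shows that pattern in isolation; `fetch_batch(offset, limit)` is a hypothetical callable standing in for the real Telegram history request, and the loop is an illustration rather than the project's backfill code.

```python
import time
from typing import Callable, Iterator


def backfill(
    fetch_batch: Callable[[int, int], list],
    batch_size: int = 100,  # BACKFILL_BATCH_SIZE
    delay_ms: int = 1000,   # BACKFILL_DELAY_MS
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[list]:
    """Yield batches until the source runs dry, pausing between requests."""
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            return
        yield batch
        offset += len(batch)
        if len(batch) < batch_size:
            return  # short batch: nothing more to fetch
        sleep(delay_ms / 1000)  # rate limiting between full batches


history = list(range(250))  # pretend channel history
batches = list(
    backfill(lambda offset, limit: history[offset:offset + limit], sleep=lambda s: None)
)
```

With the defaults, 250 messages arrive as batches of 100, 100, and 50, with a one-second pause after each full batch.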

Frontend

| Variable | Default | Description |
|----------|---------|-------------|
| NEXT_PUBLIC_API_URL | (empty) | API URL for browser requests. Empty = relative URLs (use when behind proxy) |

Operations

Viewing Logs

# All services
docker-compose logs -f

# Specific service
docker-compose logs -f listener
docker-compose logs -f processor
docker-compose logs -f api

Restarting Services

# Restart everything
docker-compose restart

# Restart specific service
docker-compose restart listener

Stopping the Platform

docker-compose down

Updating

git pull
docker-compose build
docker-compose up -d

Re-authenticating Telegram

If your session expires or you need to switch accounts:

# Remove old session
rm sessions/listener.session

# Re-run authentication
python3 scripts/telegram_auth.py

# Restart listener
docker-compose restart listener

Troubleshooting

"Session file not found" Error

The listener can't find the Telegram session file.

# Check if session exists
ls -la sessions/

# If missing, create it
python3 scripts/telegram_auth.py

"Target folder not found" Warning

The listener can't find your archive folder in Telegram.

  1. Verify folder name matches FOLDER_ARCHIVE_ALL_PATTERN in .env
  2. The comparison ignores case, but the name must otherwise match character for character
  3. Re-run python3 scripts/telegram_auth.py to see current folders

CORS Errors in Browser

If you see CORS errors when accessing the frontend:

  1. Ensure NEXT_PUBLIC_API_URL is empty (for proxied setup) or set correctly
  2. Check that Caddy/nginx is properly routing /api/* to the API service

Messages Not Appearing

  1. Check listener logs: docker-compose logs -f listener
  2. Verify the channel is in your archive folder
  3. Check processor logs: docker-compose logs -f processor
  4. Verify Redis is running: docker-compose logs redis

FloodWait Errors

FloodWait errors mean Telegram is rate limiting the account. The listener handles this automatically by waiting for the period Telegram requests.

# Check listener logs for wait time
docker-compose logs listener | grep -i flood
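
The wait-and-retry behavior described above can be sketched as follows. This is not the listener's implementation: `FloodWait` is a stand-in exception defined here so the example runs without Telegram libraries (Telethon's `FloodWaitError` similarly exposes a `seconds` attribute), and `call_with_flood_wait` is a hypothetical helper.

```python
import time


class FloodWait(Exception):
    """Stand-in for a rate-limit error that carries a wait time in seconds."""

    def __init__(self, seconds: int) -> None:
        super().__init__(f"wait {seconds}s")
        self.seconds = seconds


def call_with_flood_wait(action, max_attempts: int = 3, sleep=time.sleep):
    """Run action(); on FloodWait, sleep for the requested time and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except FloodWait as error:
            if attempt == max_attempts:
                raise  # give up and let the caller decide
            sleep(error.seconds + 1)  # small margin on top of the requested wait
```

Retrying sooner than the requested wait would only extend the rate limit, which is why the sleep uses the server-provided value rather than a fixed backoff.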

Development

Directory Structure

tg-archiver/
├── services/
│   ├── listener/       # Telegram monitoring service
│   ├── processor/      # Message processing service
│   ├── api/            # FastAPI backend
│   └── frontend/       # Next.js frontend
├── shared/python/      # Shared Python modules (models, utils)
├── infrastructure/
│   └── postgres/       # Database schema (init.sql)
├── scripts/
│   └── telegram_auth.py  # Telegram authentication script
├── sessions/           # Telegram session files (gitignored)
├── docker-compose.yml
├── .env.example
└── .env               # Your configuration (gitignored)

Running Services Locally

# Start infrastructure only
docker-compose up -d postgres redis minio minio-init

# Install Python dependencies
cd services/listener
pip install -r requirements.txt

# Run listener locally (for debugging)
POSTGRES_HOST=localhost python -m src.main

Running Frontend Locally

cd services/frontend
npm install
npm run dev

Production Deployment

Security Checklist

Before deploying to production, ensure you've configured proper security:

1. Generate Secure Secrets

# Generate JWT secret (64 characters minimum)
openssl rand -hex 64

# Generate Redis password
openssl rand -base64 32

# Generate PostgreSQL password
openssl rand -base64 24
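
If openssl is not available, Python's standard `secrets` module can produce comparable values. This is an alternative sketch, not part of the project's tooling; the variable names below are illustrative. The URL-safe variants avoid `+` and `/`, which can be awkward inside connection strings.

```python
import secrets

# 64 random bytes as 128 hex characters, comparable to `openssl rand -hex 64`.
jwt_secret = secrets.token_hex(64)

# 32 and 24 random bytes, encoded as URL-safe base64 without padding.
redis_password = secrets.token_urlsafe(32)
postgres_password = secrets.token_urlsafe(24)

print(f"JWT_SECRET_KEY={jwt_secret}")
print(f"REDIS_PASSWORD={redis_password}")
print(f"POSTGRES_PASSWORD={postgres_password}")
```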

2. Configure Production Environment

Create or update your .env file:

# Environment mode
ENVIRONMENT=production

# Strong passwords (use generated values above)
POSTGRES_PASSWORD=your_generated_postgres_password
JWT_SECRET_KEY=your_generated_64_char_jwt_secret
JWT_ADMIN_PASSWORD=your_strong_admin_password
REDIS_PASSWORD=your_generated_redis_password
MINIO_SECRET_KEY=your_minio_secret_at_least_32_chars

# Security features
CSRF_ENABLED=true
TOKEN_BLACKLIST_FAIL_MODE=closed

3. Enable HTTPS

Use the production Caddyfile with automatic HTTPS:

# Set your domain
export DOMAIN=archive.yourdomain.com
export ACME_EMAIL=admin@yourdomain.com

# Use production Caddy config
cp infrastructure/caddy/Caddyfile.production infrastructure/caddy/Caddyfile

4. Deploy

# Build with production settings
docker-compose build

# Start services
docker-compose up -d

# Verify all services are healthy
docker-compose ps

Security Features

| Feature | Description | Config Variable |
|---------|-------------|-----------------|
| HTTPS | TLS 1.2+ with automatic Let's Encrypt | Caddyfile.production |
| Rate Limiting | 5 login attempts per minute | Built-in |
| CSRF Protection | Double-submit cookie pattern | CSRF_ENABLED=true |
| Token Invalidation | Logout blacklists JWT tokens | Built-in |
| Secure Cookies | HttpOnly, SameSite=Strict, Secure | ENVIRONMENT=production |
| Password Policy | Minimum 8 characters | Built-in |
| Non-root Containers | All services run as non-root | Built-in |
| Redis Auth | Password-protected Redis | REDIS_PASSWORD |

Production Architecture

Internet → Caddy (HTTPS:443) → Internal Docker Network
                │
                ├── /api/*     → api:8000
                ├── /media/*   → minio:9000
                └── /*         → frontend:3000

Only Caddy is exposed to the internet. All other services communicate internally.
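
The routing table in the diagram amounts to a prefix match with a default. The snippet below restates it as a lookup so the precedence is explicit; it is a model of the diagram only, not Caddy configuration, and `upstream_for` is a hypothetical helper.

```python
ROUTES = [
    ("/api/", "api:8000"),
    ("/media/", "minio:9000"),
]
DEFAULT_UPSTREAM = "frontend:3000"


def upstream_for(path: str) -> str:
    """Return the internal service that would receive a request path."""
    for prefix, upstream in ROUTES:
        if path.startswith(prefix):
            return upstream
    return DEFAULT_UPSTREAM  # everything else goes to the frontend
```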

Backup Recommendations

# Database backup (-T disables TTY allocation so the redirected dump stays clean)
docker-compose exec -T postgres pg_dump -U archiver tg_archiver > backup.sql

# Media backup (MinIO data)
docker run --rm -v tg-archiver_minio_data:/data -v "$(pwd)":/backup \
  alpine tar czf /backup/minio-backup.tar.gz /data

License

AGPL-3.0 License - See LICENSE file for details.
