feat: Horizontal scaling with message queue (Redis/NATS) for enterprise deployments #47

@initcron

Summary

For enterprise-scale deployments (thousands of repositories, multiple orgs, high webhook throughput), AOF needs native horizontal scaling built on a message queue architecture.

Current Architecture

GitHub Webhook → AOF Daemon (single process) → Execute Agent/Fleet/Flow

Limitations:

  • Single process handles all events
  • Synchronous webhook processing
  • No built-in queue for backpressure handling
  • Memory grows with trigger count
  • Single point of failure

Proposed Architecture

                    ┌─────────────────┐
                    │   Ingress/LB    │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
     ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
     │ AOF Gateway │ │ AOF Gateway │ │ AOF Gateway │
     │ (stateless) │ │ (stateless) │ │ (stateless) │
     └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
            │               │               │
            └───────────────┼───────────────┘
                            ▼
                   ┌─────────────────┐
                   │   Redis/NATS    │
                   │  (message queue)│
                   └────────┬────────┘
                            │
              ┌─────────────┼─────────────┐
              ▼             ▼             ▼
     ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
     │ AOF Worker  │ │ AOF Worker  │ │ AOF Worker  │
     │ (executor)  │ │ (executor)  │ │ (executor)  │
     └─────────────┘ └─────────────┘ └─────────────┘

Components

  1. AOF Gateway (stateless)

    • Receives webhooks
    • Validates signatures
    • Matches to trigger
    • Publishes to queue
    • Returns 202 Accepted immediately
  2. Message Queue (Redis Streams or NATS JetStream)

    • Durable message storage
    • Consumer groups for load distribution
    • Dead letter queue for failed events
    • Retry with exponential backoff
  3. AOF Worker (stateless)

    • Consumes from queue
    • Executes agents/fleets/flows
    • Reports results back to queue
    • Horizontally scalable
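
The gateway steps listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Webhook`, `QueuedEvent`, and `handle_webhook` are hypothetical names, signature verification and publishing are injected as closures rather than tied to a real HMAC or queue client, and the 204 and 503 responses for unmatched events and queue failures are assumptions that the proposal does not specify.

```rust
use std::collections::HashMap;

/// Incoming webhook (hypothetical shape, not AOF's real type).
pub struct Webhook {
    pub event_type: String, // e.g. "pull_request"
    pub signature: String,  // signature header value as received
    pub body: String,       // raw request body
}

/// Message the gateway hands to the queue for a worker to execute.
#[derive(Debug, Clone, PartialEq)]
pub struct QueuedEvent {
    pub trigger: String,
    pub body: String,
}

/// Runs the gateway steps and returns an HTTP status code.
pub fn handle_webhook(
    hook: &Webhook,
    verify: impl Fn(&Webhook) -> bool,
    triggers: &HashMap<String, String>,
    mut publish: impl FnMut(QueuedEvent) -> Result<(), String>,
) -> u16 {
    // 1. Validate the signature before doing any other work.
    if !verify(hook) {
        return 401;
    }
    // 2. Match the event to a trigger. Unmatched events are acknowledged
    //    but not queued (assumption: the proposal does not cover this case).
    let trigger = match triggers.get(&hook.event_type) {
        Some(name) => name.clone(),
        None => return 204,
    };
    // 3. Publish to the queue; no agent runs inside the gateway.
    let event = QueuedEvent { trigger, body: hook.body.clone() };
    match publish(event) {
        // 4. Acknowledge immediately; a worker picks the event up later.
        Ok(()) => 202,
        // Queue unavailable: surface the failure instead of dropping the event.
        Err(_) => 503,
    }
}
```

Keeping the handler free of execution logic is what lets the gateway stay stateless: any replica behind the load balancer can accept any webhook.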

Configuration

apiVersion: aof.dev/v1
kind: DaemonConfig
metadata:
  name: aof-gateway

spec:
  mode: gateway  # New: gateway | worker | standalone (default)
  
  queue:
    type: redis  # redis | nats
    url: redis://redis-cluster:6379
    # Or for NATS:
    # type: nats
    # url: nats://nats-cluster:4222
    
    # Queue settings
    stream: aof-events
    consumer_group: aof-workers
    max_retries: 3
    retry_delay_ms: 1000  # Base delay for exponential backoff
    dead_letter_queue: aof-dlq
    
  # Gateway-specific settings
  gateway:
    ack_timeout_ms: 5000  # Return 202 within 5s (GitHub expects a response within 10s)
    
  # Worker-specific settings  
  worker:
    concurrency: 10  # Parallel event processing
    prefetch: 5      # Events to prefetch

Implementation Plan

Phase 1: Queue Abstraction

  • Define MessageQueue trait
  • Implement Redis Streams backend
  • Implement NATS JetStream backend
  • Add queue configuration to DaemonConfig
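
As a starting point for discussion, the `MessageQueue` trait might look something like the sketch below, shown with an in-memory backend that is handy for tests. The method set (`publish`, `consume`, `ack`, `nack`) and the synchronous, infallible signatures are assumptions; real Redis Streams or NATS JetStream backends would most likely be async and return `Result`.

```rust
use std::collections::{HashMap, VecDeque};

pub type MessageId = u64;

/// Backend-agnostic queue interface (hypothetical method set).
pub trait MessageQueue {
    /// Appends a payload and returns its ID.
    fn publish(&mut self, payload: String) -> MessageId;
    /// Hands out the next ready message and marks it pending until acked.
    fn consume(&mut self) -> Option<(MessageId, String)>;
    /// Confirms successful processing; the message is dropped.
    fn ack(&mut self, id: MessageId);
    /// Returns a pending message to the queue for redelivery.
    fn nack(&mut self, id: MessageId);
}

/// In-memory stand-in, useful for unit tests and standalone mode.
#[derive(Default)]
pub struct InMemoryQueue {
    next_id: MessageId,
    ready: VecDeque<(MessageId, String)>,
    pending: HashMap<MessageId, String>,
}

impl MessageQueue for InMemoryQueue {
    fn publish(&mut self, payload: String) -> MessageId {
        let id = self.next_id;
        self.next_id += 1;
        self.ready.push_back((id, payload));
        id
    }

    fn consume(&mut self) -> Option<(MessageId, String)> {
        let (id, payload) = self.ready.pop_front()?;
        // Keep a copy so the message can be redelivered on nack.
        self.pending.insert(id, payload.clone());
        Some((id, payload))
    }

    fn ack(&mut self, id: MessageId) {
        self.pending.remove(&id);
    }

    fn nack(&mut self, id: MessageId) {
        if let Some(payload) = self.pending.remove(&id) {
            self.ready.push_back((id, payload));
        }
    }
}
```

The explicit pending set loosely models what consumer groups provide in both backends: a message that was delivered but never acknowledged is not lost and can be handed out again.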

Phase 2: Gateway Mode

  • Add mode: gateway option
  • Separate webhook handling from execution
  • Publish events to queue
  • Return 202 Accepted immediately

Phase 3: Worker Mode

  • Add mode: worker option
  • Consume from queue
  • Execute agents/fleets/flows
  • Handle failures and retries
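
A worker's consume, execute, retry loop could look roughly like this. It is a simplified sketch: a `VecDeque` stands in for the stream, `Event` and `run_worker` are hypothetical names, and the backoff delay between retries is left out to keep the control flow visible.

```rust
use std::collections::VecDeque;

/// Event as seen by a worker (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub id: u64,
    pub payload: String,
    pub attempts: u32, // how many times execution has been tried
}

/// Drains `stream`, running `execute` on each event.
/// A failed event is re-queued until `max_retries` retries are used up,
/// then moved to `dlq`. Returns the number of events that succeeded.
pub fn run_worker(
    stream: &mut VecDeque<Event>,
    dlq: &mut Vec<Event>,
    max_retries: u32,
    mut execute: impl FnMut(&Event) -> Result<(), String>,
) -> usize {
    let mut succeeded = 0;
    while let Some(mut event) = stream.pop_front() {
        event.attempts += 1;
        let result = execute(&event);
        match result {
            // Success: the event is done (a real backend would ack here).
            Ok(()) => succeeded += 1,
            // Initial attempt plus `max_retries` retries all failed.
            Err(_) if event.attempts > max_retries => dlq.push(event),
            // Still has retries left: put it back for another attempt.
            Err(_) => stream.push_back(event),
        }
    }
    succeeded
}
```

Because the loop keeps no state between events, several workers can run the same code against one consumer group, which is what makes the worker tier horizontally scalable.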

Phase 4: Observability

  • Queue depth metrics
  • Processing latency metrics
  • Dead letter queue alerting
  • Distributed tracing (OpenTelemetry)

Phase 5: Advanced Features

  • Priority queues (critical events first)
  • Rate limiting per org/repo
  • Event deduplication
  • Graceful shutdown with drain
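
For event deduplication, one simple approach is a bounded window of recently seen delivery IDs, sketched below. GitHub sends a unique ID per delivery in the `X-GitHub-Delivery` header, which would be a natural key. The `Deduplicator` type and its eviction policy are assumptions; with multiple gateways or workers the window would need to live in a shared store rather than in process memory.

```rust
use std::collections::{HashSet, VecDeque};

/// Remembers the last `capacity` delivery IDs (hypothetical helper).
pub struct Deduplicator {
    capacity: usize,
    seen: HashSet<String>,
    order: VecDeque<String>, // insertion order, oldest at the front
}

impl Deduplicator {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Deduplicator {
            capacity,
            seen: HashSet::new(),
            order: VecDeque::new(),
        }
    }

    /// Returns true if `delivery_id` has not been seen within the window,
    /// and records it. Returns false for a duplicate.
    pub fn first_seen(&mut self, delivery_id: &str) -> bool {
        if self.seen.contains(delivery_id) {
            return false;
        }
        // Evict the oldest ID once the window is full.
        if self.order.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.seen.insert(delivery_id.to_string());
        self.order.push_back(delivery_id.to_string());
        true
    }
}
```

Deduplication matters more once a queue is involved, since retries and redelivery give at-least-once rather than exactly-once processing.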

Kubernetes Deployment

# Gateway Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aof-gateway
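spec:
  replicas: 3
  selector:
    matchLabels:
      app: aof-gateway
  template:
    metadata:
      labels:
        app: aof-gateway
    spec:
      containers:
        - name: aof
          image: aof:latest  # Placeholder; substitute the real image
          args: [serve, --mode=gateway]
          resources:
            requests:
              cpu: 100m
              memory: 128Mi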
---
# Worker Deployment (auto-scaling)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aof-worker
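spec:
  replicas: 5
  selector:
    matchLabels:
      app: aof-worker
  template:
    metadata:
      labels:
        app: aof-worker
    spec:
      containers:
        - name: aof
          image: aof:latest  # Placeholder; substitute the real image
          args: [serve, --mode=worker]
          resources:
            requests:
              cpu: 500m
              memory: 512Mi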
---
# HPA for workers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aof-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aof-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    # External metrics require a metrics adapter that exposes redis_stream_lag
    - type: External
      external:
        metric:
          name: redis_stream_lag
        target:
          type: AverageValue
          averageValue: 100
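
To get a feel for how the HPA above would size the worker pool, the arithmetic can be approximated as below. With an `AverageValue` target, the autoscaler aims for roughly `ceil(total metric / target per pod)` replicas, clamped to `minReplicas` and `maxReplicas`. This sketch ignores the HPA's tolerance band and stabilization behavior, and `desired_replicas` is a hypothetical helper, not part of Kubernetes or AOF.

```rust
/// Approximates the replica count an HPA with an AverageValue target
/// would converge on for a given total queue lag.
pub fn desired_replicas(
    total_lag: u64,
    target_per_pod: u64,
    min_replicas: u64,
    max_replicas: u64,
) -> u64 {
    assert!(target_per_pod > 0, "target must be positive");
    // Ceiling division without risking overflow.
    let raw = total_lag / target_per_pod + u64::from(total_lag % target_per_pod != 0);
    // Keep the result inside the configured bounds.
    raw.clamp(min_replicas, max_replicas)
}
```

For example, with the manifest's values (target 100, min 2, max 20), a lag of 750 pending events works out to 8 workers, while an idle queue settles at the minimum of 2.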

Benefits

  • Scalability: Add workers to handle more load
  • Reliability: Events persisted in queue, survive restarts
  • Backpressure: Queue absorbs traffic spikes
  • Isolation: Workers can be specialized (frontend, backend, infra)
  • Observability: Queue metrics for capacity planning

Related Issues

Acceptance Criteria

  • Queue abstraction with Redis and NATS backends
  • Gateway mode for webhook ingestion
  • Worker mode for event processing
  • Kubernetes manifests for horizontal deployment
  • Helm chart with scaling options
  • Documentation for enterprise deployment
  • Benchmark showing 10x throughput improvement over the current single-process daemon

Labels: enhancement