feat: Horizontal scaling with message queue (Redis/NATS) for enterprise deployments

## Summary

For enterprise-scale deployments (thousands of repositories, multiple organizations, high webhook throughput), AOF needs native horizontal scaling built on a message queue architecture.

## Current Architecture

```
GitHub Webhook → AOF Daemon (single process) → Execute Agent/Fleet/Flow
```

**Limitations:**
- Single process handles all events
- Synchronous webhook processing
- No built-in queue for backpressure handling
- Memory grows with trigger count
- Single point of failure

## Proposed Architecture

```
                    ┌─────────────────┐
                    │   Ingress/LB    │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
     ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
     │ AOF Gateway │ │ AOF Gateway │ │ AOF Gateway │
     │ (stateless) │ │ (stateless) │ │ (stateless) │
     └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
            │               │               │
            └───────────────┼───────────────┘
                            ▼
                   ┌─────────────────┐
                   │   Redis/NATS    │
                   │  (message queue)│
                   └────────┬────────┘
                            │
              ┌─────────────┼─────────────┐
              ▼             ▼             ▼
     ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
     │ AOF Worker  │ │ AOF Worker  │ │ AOF Worker  │
     │ (executor)  │ │ (executor)  │ │ (executor)  │
     └─────────────┘ └─────────────┘ └─────────────┘
```

### Components

1. **AOF Gateway** (stateless)
   - Receives webhooks
   - Validates signatures
   - Matches to trigger
   - Publishes to queue
   - Returns 202 Accepted immediately

2. **Message Queue** (Redis Streams or NATS JetStream)
   - Durable message storage
   - Consumer groups for load distribution
   - Dead letter queue for failed events
   - Retry with exponential backoff

3. **AOF Worker** (stateless)
   - Consumes from queue
   - Executes agents/fleets/flows
   - Reports results back to queue
   - Horizontally scalable

## Configuration

```yaml
apiVersion: aof.dev/v1
kind: DaemonConfig
metadata:
  name: aof-gateway

spec:
  mode: gateway  # New: gateway | worker | standalone (default)
  
  queue:
    type: redis  # redis | nats
    url: redis://redis-cluster:6379
    # Or for NATS:
    # type: nats
    # url: nats://nats-cluster:4222
    
    # Queue settings
    stream: aof-events
    consumer_group: aof-workers
    max_retries: 3
    retry_delay_ms: 1000  # Base delay; grows with exponential backoff
    dead_letter_queue: aof-dlq
    
  # Gateway-specific settings
  gateway:
    ack_timeout_ms: 5000  # Return 202 within 5s
    
  # Worker-specific settings  
  worker:
    concurrency: 10  # Parallel event processing
    prefetch: 5      # Events to prefetch
```
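To make the retry settings concrete, here is a minimal sketch of how `max_retries` and `retry_delay_ms` could combine into an exponential backoff schedule. It assumes `retry_delay_ms` is the base delay and that the delay doubles on each retry; the function name `retry_delay` is hypothetical and not part of AOF today.

```rust
use std::time::Duration;

/// Delay to wait before retry number `attempt` (1-based), or `None` once the
/// retry budget is spent and the event should move to the dead letter queue.
///
/// Assumption: the delay doubles on every retry, starting at `retry_delay_ms`.
pub fn retry_delay(attempt: u32, max_retries: u32, retry_delay_ms: u64) -> Option<Duration> {
    if attempt == 0 || attempt > max_retries {
        return None;
    }
    // 2^(attempt - 1); falls back to u64::MAX if the shift would overflow.
    let factor = 1u64.checked_shl(attempt - 1).unwrap_or(u64::MAX);
    Some(Duration::from_millis(retry_delay_ms.saturating_mul(factor)))
}
```

With the values above (`max_retries: 3`, `retry_delay_ms: 1000`) this yields waits of 1 s, 2 s and 4 s, after which the event would be routed to `aof-dlq`. A real implementation would probably also add jitter and a maximum delay.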

## Implementation Plan

### Phase 1: Queue Abstraction
- [ ] Define `MessageQueue` trait
- [ ] Implement Redis Streams backend
- [ ] Implement NATS JetStream backend
- [ ] Add queue configuration to DaemonConfig
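As a starting point for discussion, one possible shape for the `MessageQueue` trait is sketched below, together with an in-memory stand-in that could back unit tests. Everything except the trait name is an assumption: the `Event` struct, the method names and the synchronous, infallible signatures are illustrative only. Real Redis Streams and NATS JetStream backends would almost certainly need async methods that return `Result`.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical event envelope; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub id: u64,
    pub payload: String,
}

/// One possible shape for the queue abstraction.
pub trait MessageQueue {
    /// Append an event to the stream and return its assigned ID.
    fn publish(&mut self, payload: &str) -> u64;
    /// Hand out the next undelivered event; it stays pending until acked.
    fn consume(&mut self) -> Option<Event>;
    /// Acknowledge a delivered event. Returns false if the ID is not pending.
    fn ack(&mut self, id: u64) -> bool;
    /// Number of events still waiting to be delivered.
    fn depth(&self) -> usize;
}

/// In-memory stand-in for tests. Not durable and not shared across processes.
#[derive(Default)]
pub struct InMemoryQueue {
    next_id: u64,
    ready: VecDeque<Event>,
    pending: HashMap<u64, Event>,
}

impl MessageQueue for InMemoryQueue {
    fn publish(&mut self, payload: &str) -> u64 {
        self.next_id += 1;
        let id = self.next_id;
        self.ready.push_back(Event { id, payload: payload.to_string() });
        id
    }

    fn consume(&mut self) -> Option<Event> {
        let event = self.ready.pop_front()?;
        // Keep a copy so an unacked event could be redelivered later.
        self.pending.insert(event.id, event.clone());
        Some(event)
    }

    fn ack(&mut self, id: u64) -> bool {
        self.pending.remove(&id).is_some()
    }

    fn depth(&self) -> usize {
        self.ready.len()
    }
}
```

Keeping delivered-but-unacked events in a separate `pending` map mirrors how consumer groups track in-flight messages, which is what makes redelivery after a worker crash possible.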

### Phase 2: Gateway Mode
- [ ] Add `mode: gateway` option
- [ ] Separate webhook handling from execution
- [ ] Publish events to queue
- [ ] Return 202 Accepted immediately

### Phase 3: Worker Mode
- [ ] Add `mode: worker` option
- [ ] Consume from queue
- [ ] Execute agents/fleets/flows
- [ ] Handle failures and retries
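The failure-handling step could look roughly like the sketch below: the worker runs the handler once, retries up to `max_retries` more times, and reports the event as dead-lettered if every attempt fails. The `Outcome` enum and `process_with_retries` function are hypothetical names, and the sleep between attempts is omitted to keep the example deterministic.

```rust
/// What happened to an event after the worker finished with it.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The handler succeeded on the given attempt (1 = first try).
    Completed { attempts: u32 },
    /// Every attempt failed; the event belongs in the dead letter queue.
    DeadLettered { last_error: String },
}

/// Run `handler` once, then retry up to `max_retries` more times on failure.
pub fn process_with_retries<F>(payload: &str, max_retries: u32, mut handler: F) -> Outcome
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut last_error = String::new();
    // One initial attempt plus `max_retries` retries.
    for attempt in 1..=max_retries.saturating_add(1) {
        match handler(payload) {
            Ok(()) => return Outcome::Completed { attempts: attempt },
            Err(e) => last_error = e,
        }
    }
    Outcome::DeadLettered { last_error }
}
```

In a real worker the `DeadLettered` branch would publish the event to the configured `dead_letter_queue` and acknowledge the original message so it is not redelivered indefinitely.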

### Phase 4: Observability
- [ ] Queue depth metrics
- [ ] Processing latency metrics
- [ ] Dead letter queue alerting
- [ ] Distributed tracing (OpenTelemetry)

### Phase 5: Advanced Features
- [ ] Priority queues (critical events first)
- [ ] Rate limiting per org/repo
- [ ] Event deduplication
- [ ] Graceful shutdown with drain
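For event deduplication, a minimal sketch is a bounded set of recently seen delivery IDs. This assumes every event carries a stable delivery ID (GitHub's `X-GitHub-Delivery` header is one candidate); the `Deduplicator` type is hypothetical. With several gateway replicas, the set would have to live in shared storage such as Redis rather than in process memory.

```rust
use std::collections::{HashSet, VecDeque};

/// Remembers the most recent `capacity` delivery IDs and flags repeats.
pub struct Deduplicator {
    capacity: usize,
    seen: HashSet<String>,
    order: VecDeque<String>, // oldest ID at the front, used for eviction
}

impl Deduplicator {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity: capacity.max(1), // a zero capacity would never evict
            seen: HashSet::new(),
            order: VecDeque::new(),
        }
    }

    /// Returns true the first time an ID is seen, false for a duplicate.
    pub fn first_seen(&mut self, delivery_id: &str) -> bool {
        if self.seen.contains(delivery_id) {
            return false;
        }
        // Evict the oldest ID once the window is full.
        if self.order.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.seen.insert(delivery_id.to_string());
        self.order.push_back(delivery_id.to_string());
        true
    }
}
```

The bounded window trades completeness for memory: a duplicate that arrives after its ID has been evicted is processed again, so handlers should still be idempotent where possible.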

## Kubernetes Deployment

```yaml
# Gateway Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aof-gateway
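spec:
  replicas: 3
  selector:
    matchLabels:
      app: aof-gateway  # label name is illustrative
  template:
    metadata:
      labels:
        app: aof-gateway
    spec:
      containers:
        - name: aof
          image: <aof-image>  # placeholder: image is not specified in this proposal
          args: [serve, --mode=gateway]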
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
---
# Worker Deployment (auto-scaling)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aof-worker
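spec:
  replicas: 5
  selector:
    matchLabels:
      app: aof-worker  # label name is illustrative
  template:
    metadata:
      labels:
        app: aof-worker
    spec:
      containers:
        - name: aof
          image: <aof-image>  # placeholder: image is not specified in this proposal
          args: [serve, --mode=worker]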
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
---
# HPA for workers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aof-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aof-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
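    # redis_stream_lag must be served by an external metrics adapter
    # (for example KEDA or the Prometheus Adapter); it is not built in.
    - type: External
      external:
        metric:
          name: redis_stream_lag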
        target:
          type: AverageValue
          averageValue: 100
```
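To show what the HPA target means in practice: for an External metric with an `AverageValue` target, the autoscaler aims for roughly `ceil(total metric / target)` replicas, clamped to `minReplicas` and `maxReplicas`. The sketch below reproduces only that arithmetic. It ignores the HPA's tolerance band, stabilization windows and scaling policies, and the function name `desired_replicas` is hypothetical.

```rust
/// Approximate replica count the HPA would aim for with an External metric
/// and an AverageValue target: ceil(total / target), clamped to [min, max].
///
/// Panics if `target_average` is zero or if `min > max`.
pub fn desired_replicas(total_lag: u64, target_average: u64, min: u32, max: u32) -> u32 {
    assert!(target_average > 0, "target must be positive");
    // Ceiling division without risking overflow on the numerator.
    let raw = total_lag / target_average + u64::from(total_lag % target_average != 0);
    let raw = u32::try_from(raw).unwrap_or(u32::MAX);
    raw.clamp(min, max)
}
```

With the manifest above (`averageValue: 100`, 2 to 20 replicas), a stream lag of 1500 would call for 15 workers, a lag of 5000 would be capped at 20, and a lag of 50 would stay at the minimum of 2.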

## Benefits

- **Scalability**: Add workers to handle more load
- **Reliability**: Events persisted in queue, survive restarts
- **Backpressure**: Queue absorbs traffic spikes
- **Isolation**: Workers can be specialized (frontend, backend, infra)
- **Observability**: Queue metrics for capacity planning

## Related Issues

- #45 - Team/role-based authorization
- #46 - Multi-organization support

## Acceptance Criteria

- [ ] Queue abstraction with Redis and NATS backends
- [ ] Gateway mode for webhook ingestion
- [ ] Worker mode for event processing
- [ ] Kubernetes manifests for horizontal deployment
- [ ] Helm chart with scaling options
- [ ] Documentation for enterprise deployment
- [ ] Benchmark showing 10x throughput improvement over the current single-process daemon
