AutoIndexer

AutoIndexer is a Kubernetes operator that automates the indexing of documents from various data sources into KAITO RAGEngine services. It streamlines the process of building knowledge bases for Retrieval Augmented Generation (RAG) workflows by automatically ingesting content from Git repositories, static URLs, and databases into vector indexes.

The operator handles the complete lifecycle of document indexing including:

Automated data ingestion from multiple source types
Scheduled indexing with cron-based scheduling
Drift detection and remediation to maintain data consistency
Multi-format support for text and PDF documents
Authentication management with secrets and workload identity
Integration with KAITO RAGEngine for vector storage

Architecture

AutoIndexer follows the Kubernetes Custom Resource Definition (CRD)/controller design pattern. Users manage an AutoIndexer custom resource that describes the data source configuration, indexing requirements, and scheduling preferences. The AutoIndexer controller automates the indexing workflow by reconciling the AutoIndexer custom resource.

The architecture consists of:

AutoIndexer Controller: Reconciles AutoIndexer custom resources, manages indexing jobs, handles scheduling, and performs drift detection
Indexing Jobs: Python-based jobs that handle the actual document processing, content extraction, and communication with RAGEngine services
Data Source Handlers: Pluggable handlers for different data source types (Git, Static URLs, Databases)
Content Handlers: Process different document formats (text, PDF) and extract indexable content
RAGEngine Integration: Seamless integration with KAITO RAGEngine for document storage and retrieval

Features

Automated Document Indexing

Automatic discovery and processing of documents from configured data sources
Support for incremental indexing to handle only new or changed content
Batch processing for efficient large-scale document ingestion

Multiple Data Sources

Git Repositories: Index code, documentation, and text files from public or private Git repositories
Static URLs: Process individual documents from web URLs (text and PDF formats)
Databases: Query and index content from databases using languages like Kusto (KQL)

Flexible Scheduling

Cron-based scheduling for automated, recurring indexing
Manual trigger support for on-demand indexing
Suspend/resume functionality for maintenance windows

Drift Detection & Remediation

Automatic detection of data inconsistencies between source and index
Configurable remediation strategies: Auto, Manual, or Ignore
Comprehensive monitoring and alerting for drift events

Secure Authentication

Kubernetes Secret-based credentials for private repositories and APIs
Azure Workload Identity integration for cloud-native authentication
Flexible credential provider system for extensibility

Observability & Monitoring

Comprehensive status reporting with phase tracking
Detailed metrics on indexing success, failure, and performance
OpenTelemetry integration for distributed tracing and logging

Installation

Prerequisites

Kubernetes cluster (1.24+)
KAITO RAGEngine deployed and accessible
kubectl configured for cluster access

Quick Start

After installing AutoIndexer, you can create your first indexing workflow:

1. Create a RAGEngine (if not already deployed)

apiVersion: kaito.sh/v1alpha1
kind: RAGEngine
metadata:
  name: my-ragengine
spec:
  inference:
    preset:
      name: text-embedding-ada-002

2. Create an AutoIndexer for Git Repository

apiVersion: autoindexer.kaito.sh/v1alpha1
kind: AutoIndexer
metadata:
  name: docs-indexer
spec:
  ragEngine: my-ragengine
  indexName: documentation-index
  schedule: "0 */6 * * *"  # Index every 6 hours
  dataSource:
    type: Git
    git:
      repository: "https://github.com/kubernetes/website.git"
      branch: "main"
      paths:
        - "content/en/docs/"
      excludePaths:
        - "*.png"
        - "*.jpg"

3. Monitor the Indexing Progress

# Check AutoIndexer status
kubectl get autoindexers

# Watch indexing progress
kubectl get autoindexers docs-indexer -o yaml

# View indexing job logs
kubectl logs -l autoindexer.kaito.sh/name=docs-indexer

Data Sources

Git Repositories

Index documentation, code, and text files from Git repositories:

dataSource:
  type: Git
  git:
    repository: "https://github.com/example/repo.git"
    branch: "main"
    commit: "abc123"  # Optional: specific commit
    paths:
      - "docs/"
      - "*.md"
      - "src/**/*.py"
    excludePaths:
      - "docs/archive/"
      - "*.test.py"

Features:

Support for public and private repositories
Branch and specific commit targeting
Flexible path inclusion/exclusion with glob patterns
Incremental indexing based on commit history

Static URLs

Process individual documents from web endpoints:

dataSource:
  type: Static
  static:
    urls:
      - "https://example.com/document1.pdf"
      - "https://example.com/document2.txt"
      - "https://api.example.com/content"

Features:

Support for text and PDF documents
UTF-8, UTF-8-SIG, Latin1 encoding detection
Custom headers and authentication via credentials

Database Sources

Query and index structured data from databases:

dataSource:
  type: Database
  database:
    language: Kusto
    initialQuery: |
      MyDatabase
      | where Timestamp > ago(30d)
      | project Content, Title, Metadata
    incrementalQuery: |
      MyDatabase  
      | where Timestamp > datetime({last_run})
      | project Content, Title, Metadata

Features:

Support for Kusto Query Language (KQL)
Incremental query support for large datasets
Extensible architecture for additional database types

Authentication

Kubernetes Secrets

Store credentials in Kubernetes secrets:

# Secret containing authentication token
apiVersion: v1
kind: Secret
metadata:
  name: git-credentials
type: Opaque
stringData:
  token: "ghp_your_github_token_here"
---
# AutoIndexer using the secret
spec:
  credentials:
    type: SecretRef
    secretRef:
      name: git-credentials
      key: token

Azure Workload Identity

Use Azure Workload Identity for cloud-native authentication:

spec:
  credentials:
    type: WorkloadIdentity
    workloadIdentity:
      cloudProvider: Azure
      azureWorkloadIdentity:
        serviceAccountName: autoindexer-sa
        scope: "https://storage.azure.com/.default"
        clientID: "12345678-1234-1234-1234-123456789012"
        tenantID: "87654321-4321-4321-4321-210987654321"

Scheduling and Drift Detection

Cron-based Scheduling

Schedule regular indexing with standard cron syntax:

spec:
  schedule: "0 2 * * *"    # Daily at 2 AM
  # schedule: "*/15 * * * *" # Every 15 minutes
  # schedule: "@daily"       # Daily at midnight
  # schedule: "@weekly"      # Weekly on Sunday at midnight

Drift Detection

AutoIndexer automatically detects when the actual document count in an index differs from the expected count:

spec:
  driftRemediationPolicy:
    strategy: Auto  # Options: Auto, Manual, Ignore

Remediation Strategies:

Auto: Automatically delete and recreate the index when drift is detected
Manual: Require manual intervention and approval for remediation
Ignore: Disable drift detection for this AutoIndexer

Configuration

Environment Variables

Configure the AutoIndexer controller using environment variables:

Variable	Description	Default
`DRIFT_DETECTION_ENABLED`	Enable/disable drift detection	`true`
`DRIFT_DETECTION_INTERVAL`	Interval between drift checks	`1h`
`INDEXER_JOB_IMAGE`	Container image for indexing jobs	`autoindexer:latest`
`INDEXER_JOB_TIMEOUT`	Timeout for indexing jobs	`3600s`

API Reference

AutoIndexer Spec Fields

Field	Type	Description
`ragEngine`	string	Name of the RAGEngine resource to use
`indexName`	string	Name of the index for document storage
`dataSource`	DataSourceSpec	Configuration for the data source
`credentials`	CredentialsSpec	Authentication configuration
`schedule`	string	Cron schedule for automatic indexing
`suspend`	bool	Suspend scheduled indexing
`driftRemediationPolicy`	DriftRemediationPolicy	Drift detection configuration

AutoIndexer Status Fields

Field	Type	Description
`indexingPhase`	string	Current phase: Pending, Running, Completed, Failed, etc.
`lastIndexingTimestamp`	Time	Timestamp of last successful indexing
`numOfDocumentInIndex`	int32	Number of documents currently in the index
`successfulIndexingCount`	int32	Count of successful indexing runs
`errorIndexingCount`	int32	Count of failed indexing runs
`nextScheduledIndexing`	Time	Next scheduled indexing time

Troubleshooting

Common Issues

AutoIndexer stuck in Pending phase:

# Check for missing RAGEngine
kubectl get ragengine

# Verify credentials are properly configured
kubectl get secrets

Indexing jobs failing:

# Check job logs
kubectl logs -l autoindexer.kaito.sh/name=your-autoindexer

# Verify data source accessibility
kubectl exec -it debug-pod -- curl -I https://your-data-source-url

Drift detection false positives:

# Temporarily disable drift detection
kubectl patch autoindexer your-autoindexer --type='merge' \
  -p='{"spec":{"driftRemediationPolicy":{"strategy":"Ignore"}}}'

AutoIndexer is part of the KAITO project - Kubernetes AI Toolchain Operator.

Name		Name	Last commit message	Last commit date
Latest commit History 113 Commits
.github		.github
api/v1alpha1		api/v1alpha1
charts/kaito/autoindexer		charts/kaito/autoindexer
cmd		cmd
config		config
docker/autoindexer		docker/autoindexer
docs/images		docs/images
examples/autoindexer		examples/autoindexer
hack		hack
jobs/autoindexer		jobs/autoindexer
pkg		pkg
test/e2e		test/e2e
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
go.mod		go.mod
go.sum		go.sum

Folders and files

Latest commit

History

Repository files navigation

AutoIndexer

Architecture

Features

Automated Document Indexing

Multiple Data Sources

Flexible Scheduling

Drift Detection & Remediation

Secure Authentication

Observability & Monitoring

Installation

Prerequisites

Quick Start

1. Create a RAGEngine (if not already deployed)

2. Create an AutoIndexer for Git Repository

3. Monitor the Indexing Progress

Data Sources

Git Repositories

Static URLs

Database Sources

Authentication

Kubernetes Secrets

Azure Workload Identity

Scheduling and Drift Detection

Cron-based Scheduling

Drift Detection

Configuration

Environment Variables

API Reference

AutoIndexer Spec Fields

AutoIndexer Status Fields

Troubleshooting

Common Issues

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages