AIOpsLab Automated Deployment with Terraform + Ansible

Fully automated deployment of production-ready Kubernetes clusters on Azure

Tested on WSL2 (Ubuntu 22.04) + Windows 11 with Azure VMs (Ubuntu 22.04 LTS, amd64). Auto-install of tools (kubectl, helm, poetry) targets Linux/amd64.

Quick Start

Mode B (AIOpsLab on your laptop, remote kubectl):

python3 deploy.py --apply --resource-group <your-rg> --workers 2 --mode B

Mode A (AIOpsLab on the controller VM):

python3 deploy.py --apply --resource-group <your-rg> --workers 2 --mode A
# With --dev to rsync local code instead of git clone:
python3 deploy.py --apply --resource-group <your-rg> --workers 2 --mode A --dev

Destroy when done:

python3 deploy.py --destroy --resource-group <your-rg>

The script handles VM provisioning, K8s cluster setup, and AIOpsLab configuration.

Tip: Add --allowed-ips CorpNetPublic (or a CIDR) to restrict SSH/K8s API access. Default is open to all (*).


✨ What's New (v2.0)

  • Fully Automated: One command deploys everything
  • Dynamic Scaling: Support for 1-N worker nodes
  • Ansible Integration: Production-ready K8s setup
  • Smart Inventory: Auto-generates Ansible inventory from Terraform
  • SSH Verification: Waits for connectivity before proceeding
  • Graceful Destroy: Safe teardown with confirmation
  • Better Outputs: Structured VM information for automation

📋 Prerequisites

1. Software Requirements

Tool        Version   Installation
Python      3.11+     python.org
Terraform   1.6+      Install
Ansible     Latest    Install
Azure CLI   Latest    Install

Quick Install (Ubuntu/Debian)

# Ansible
sudo apt update
sudo apt install software-properties-common
sudo add-apt-repository --yes --update ppa:ansible/ansible
sudo apt install ansible -y

# Python dependencies
pip install pyyaml

2. Azure Setup

# Login to Azure
az login

# Set subscription
az account set --subscription "<subscription-id>"

# Create resource group (if needed)
az group create --name aiopslab-rg --location eastus

# Generate SSH key (if needed)
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa

🎯 Usage

Option 1: Automated Deployment with deploy.py

deploy.py lives in scripts/terraform/; run the commands below from that directory. Deploy with default settings (2 workers, Standard_B2s):

python3 deploy.py --apply

Custom Deployment

Specify worker count and VM size:

python3 deploy.py --apply \
    --workers 5 \
    --vm-size Standard_D8s_v3 \
    --resource-group my-rg \
    --prefix myaiops \
    --ssh-key ~/.ssh/custom_key.pub

Available Options

--plan                          Dry-run: show what would be created
--apply                         Deploy infrastructure and setup cluster
--destroy                       Destroy all infrastructure
--setup-only                    Re-run AIOpsLab setup without reprovisioning (uses
                                existing Terraform state). Useful for iterating on
                                code or config changes.
--workers N                     Number of worker nodes (default: 2)
--vm-size SIZE                  Azure VM size (default: Standard_B2s)
--resource-group RG             Azure resource group (default: aiopslab-rg)
--prefix PREFIX                 Resource name prefix (default: aiopslab)
--ssh-key PATH                  SSH public key path (default: ~/.ssh/id_rsa.pub)
--allowed-ips ADDR              NSG source address for SSH + K8s API. Use '*' for
                                open (default), a CIDR, or an Azure service tag
                                like 'CorpNetPublic'.
--mode {A,B}                    A: AIOpsLab on controller VM. B: AIOpsLab on
                                laptop with remote kubectl (default: B).
--dev                           Mode A only: rsync local repo to controller
                                instead of git clone.
--debug                         Enable debug logging
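
The three accepted forms of --allowed-ips ('*', a CIDR, or an Azure service tag) can be told apart before any Terraform run. The sketch below is illustrative only: `classify_allowed_ips` is a hypothetical helper, not a function in deploy.py, and the service-tag pattern is a loose assumption rather than Azure's actual naming rule.

```python
import ipaddress
import re

# Assumption: service tags are letters, digits and dots (e.g. "CorpNetPublic").
_SERVICE_TAG = re.compile(r"^[A-Za-z][A-Za-z0-9.]*$")


def classify_allowed_ips(value: str) -> str:
    """Return 'any', 'cidr', or 'service_tag'; raise ValueError otherwise."""
    value = value.strip()
    if value == "*":
        return "any"
    try:
        # strict=False also accepts a host address with a prefix, e.g. 203.0.113.7/24
        ipaddress.ip_network(value, strict=False)
        return "cidr"
    except ValueError:
        pass
    if _SERVICE_TAG.match(value):
        return "service_tag"
    raise ValueError(f"not '*', a CIDR, or a service tag: {value!r}")


for candidate in ("*", "203.0.113.0/24", "CorpNetPublic"):
    print(candidate, "->", classify_allowed_ips(candidate))
```

Rejecting malformed values such as 10.0.0.0/33 up front avoids discovering the typo only after Terraform has started creating resources.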

Destroy Infrastructure

python3 deploy.py --destroy \
    --resource-group aiopslab-rg \
    --ssh-key ~/.ssh/id_rsa.pub

You'll be prompted to confirm before deletion.


Option 2: Manual Step-by-Step Deployment

For more control or debugging, you can run each step manually:

Step 1: Provision Azure VMs with Terraform

cd scripts/terraform
terraform init
terraform plan -var="resource_group_name=<your-rg>" -var="worker_vm_count=3"
terraform apply -var="resource_group_name=<your-rg>" -var="worker_vm_count=3"

Step 2: Generate Ansible Inventory

python3 generate_inventory.py
# This creates ../ansible/inventory.yml with VM IPs and SSH config
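
For orientation, here is a minimal sketch of how Terraform outputs could be mapped to an Ansible inventory. It is not the code in generate_inventory.py: the output names `controller_public_ip` and `worker_public_ips` and the group names `controller` and `workers` are assumptions made for illustration. Only the `{"value": ...}` wrapper of `terraform output -json` and the `azureuser` login come from Terraform and this README.

```python
import json


def build_inventory(tf_outputs: dict, ssh_user: str = "azureuser") -> dict:
    """Turn `terraform output -json` data into an Ansible-style inventory dict."""
    # Terraform wraps each output as {"value": ..., "type": ..., "sensitive": ...}.
    controller_ip = tf_outputs["controller_public_ip"]["value"]  # hypothetical name
    worker_ips = tf_outputs["worker_public_ips"]["value"]        # hypothetical name
    workers = {
        f"worker-{i}": {"ansible_host": ip}
        for i, ip in enumerate(worker_ips, start=1)
    }
    return {
        "all": {
            "vars": {"ansible_user": ssh_user},
            "children": {
                "controller": {
                    "hosts": {"controller": {"ansible_host": controller_ip}}
                },
                "workers": {"hosts": workers},
            },
        }
    }


# Sample data using documentation-only IP addresses.
sample = json.loads("""
{
  "controller_public_ip": {"value": "203.0.113.10", "type": "string", "sensitive": false},
  "worker_public_ips": {"value": ["203.0.113.11", "203.0.113.12"],
                        "type": ["list", "string"], "sensitive": false}
}
""")
inventory = build_inventory(sample)
print(json.dumps(inventory, indent=2))
```

With pyyaml installed (see Prerequisites), `yaml.safe_dump(inventory)` would serialize the same dict as YAML.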

Step 3: Run Ansible Playbooks

cd ../ansible

# Install Docker, Kubernetes packages on all nodes
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml setup_common.yml

# Initialize K8s cluster and join workers
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml remote_setup_controller_worker.yml

Step 4: Verify Cluster

# The playbook copies kubeconfig to your ~/.kube/config automatically
kubectl get nodes

Destroy Manually

cd scripts/terraform
terraform destroy -var="resource_group_name=<your-rg>"

🖥️ Mode A vs Mode B Deployment

Mode A: AIOpsLab Inside Cluster (Recommended for full functionality)

Run AIOpsLab directly on the controller VM. The setup is fully automated:

# Clone mode (default): git clones the repo on the controller
python3 deploy.py --apply --resource-group <your-rg> --workers 2 --mode A

# Dev mode: rsync your local repo to the controller instead of cloning
python3 deploy.py --apply --resource-group <your-rg> --workers 2 --mode A --dev

The --mode A setup (scripts/ansible/setup_aiopslab.yml) automatically:

  • Installs Python 3.11, Poetry, Helm, and git on the controller
  • Adds the user to the docker group (required by VirtualizationFaultInjector)
  • Clones the repo with submodules (clone mode) or rsyncs local code (dev mode)
  • Generates aiopslab/config.yml with k8s_host: localhost
  • Runs poetry env use python3.11 && poetry install
  • Verifies cluster access with kubectl get nodes

After deploy, SSH to the controller to run experiments:

ssh -i ~/.ssh/id_rsa azureuser@<controller-ip>
cd ~/AIOpsLab
eval $(poetry env activate)
python3 cli.py

To iterate on code changes without reprovisioning VMs:

python3 deploy.py --setup-only --mode A --dev

Pros: All fault injectors work (Docker is on the same machine), no Docker required locally

Cons: Must SSH to controller to run experiments

Mode B: AIOpsLab on Your Laptop (Convenient for development)

deploy.py --mode B (the default) handles everything automatically:

python3 deploy.py --apply --resource-group <your-rg> --workers 2 --mode B

This automatically:

  • Installs kubectl, helm, and poetry if missing
  • Verifies kubeconfig and cluster connectivity
  • Generates aiopslab/config.yml with the correct controller IP
  • Runs poetry env use python3.11 && poetry install
  • Prints a summary table showing what succeeded and what needs manual action

After deploy, just:

eval $(poetry env activate)
python3 cli.py

Pros: Use local IDE, no SSH needed for running experiments

Cons: Some fault injectors (e.g., VirtualizationFaultInjector) require local Docker

Note: If you see Docker connection errors in Mode B, either install Docker on your laptop or switch to Mode A.

Note: If using a git worktree in WSL, git submodule update may fail due to cross-platform path issues. Run it from Git Bash instead.


📊 VM Sizing Guide

VM Size            vCPUs   RAM     Use Case      Cost/Month*
Standard_B2s       2       4 GB    Dev/Test      ~$30
Standard_D4s_v3    4       16 GB   Small Prod    ~$120
Standard_D8s_v3    8       32 GB   Medium Prod   ~$240
Standard_D16s_v3   16      64 GB   Large Prod    ~$480

*Approximate costs for East US region


🔧 What Gets Deployed

Infrastructure (Terraform)

  • 1 Controller VM (Kubernetes control plane)
  • N Worker VMs (configurable, default 2)
  • Virtual Network & Subnet (10.0.0.0/16)
  • Network Security Group (SSH and K8s API access)
  • Public IPs for all VMs
  • Network Interfaces

Software Stack (Ansible)

  • Docker CE + cri-dockerd
  • Kubernetes v1.31 (kubeadm, kubelet, kubectl)
  • Flannel CNI plugin
  • Fully configured K8s cluster

Deployment Workflow

1. Terraform Init      → Initialize providers
2. Terraform Plan      → Create execution plan
3. Terraform Apply     → Provision VMs on Azure
4. Get Outputs         → Retrieve VM IPs and config
5. Generate Inventory  → Create Ansible inventory.yml
6. Wait for SSH        → Ensure VMs are accessible
7. Run Ansible         → Install Docker, K8s packages
8. Setup Cluster       → Initialize K8s, join workers
9. AIOpsLab Setup      → Mode-dependent:
     Mode A            →   Run setup_aiopslab.yml on controller
     Mode B            →   Install tools locally, generate config.yml

Total Time: 15-25 minutes
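
The workflow above is sequential: a failed step makes the later ones pointless. The sketch below shows one way to run such a pipeline and report a per-step summary. `run_steps` and `print_summary` are hypothetical names and the demo steps are placeholders; this does not reproduce how deploy.py is structured.

```python
from typing import Callable

Step = tuple[str, Callable[[], None]]


def run_steps(steps: list[Step]) -> list[tuple[str, str]]:
    """Run steps in order; after the first failure, mark the rest as skipped."""
    results: list[tuple[str, str]] = []
    failed = False
    for name, action in steps:
        if failed:
            results.append((name, "skipped"))
            continue
        try:
            action()
            results.append((name, "ok"))
        except Exception as exc:  # any failure is reported in the summary
            failed = True
            results.append((name, f"failed: {exc}"))
    return results


def print_summary(results: list[tuple[str, str]]) -> None:
    """Print one aligned line per step."""
    width = max((len(name) for name, _ in results), default=0)
    for name, status in results:
        print(f"{name.ljust(width)}  {status}")


def _fail_ssh() -> None:
    # Placeholder for a step that times out.
    raise TimeoutError("no SSH after 300s")


demo = run_steps([
    ("Terraform Apply", lambda: None),  # placeholder: pretend it succeeded
    ("Wait for SSH", _fail_ssh),
    ("Run Ansible", lambda: None),      # never runs because the previous step failed
])
print_summary(demo)
```

Marking later steps as skipped rather than aborting keeps the summary complete, so it is clear which stage to retry.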


Post-Deployment

Both modes print a summary table at the end showing what succeeded and what needs manual action.

Mode A

AIOpsLab is already installed on the controller. SSH in and start:

ssh -i ~/.ssh/id_rsa azureuser@<controller-ip>
cd ~/AIOpsLab
eval $(poetry env activate)
python3 cli.py

Mode B

AIOpsLab is configured locally. Start directly:

eval $(poetry env activate)
python3 cli.py

Verify Cluster

kubectl get nodes
# NAME                STATUS   ROLES           AGE   VERSION
# aiopslab-controller Ready    control-plane   5m    v1.31.x
# aiopslab-worker-1   Ready    <none>          3m    v1.31.x
# aiopslab-worker-2   Ready    <none>          3m    v1.31.x
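
To script this check instead of reading the table by eye, the JSON form of the same command can be inspected. A minimal sketch, assuming the standard Kubernetes NodeList layout in which each node carries a `Ready` condition under `status.conditions`; `not_ready_nodes` is a hypothetical helper and the sample data is made up.

```python
import json


def not_ready_nodes(node_list: dict) -> list[str]:
    """Return the names of nodes whose Ready condition is not 'True'."""
    pending = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            pending.append(node["metadata"]["name"])
    return pending


# Trimmed, made-up sample of `kubectl get nodes -o json` output.
sample_nodes = json.loads("""
{"items": [
  {"metadata": {"name": "aiopslab-controller"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
  {"metadata": {"name": "aiopslab-worker-1"},
   "status": {"conditions": [{"type": "Ready", "status": "False"}]}}
]}
""")
print(not_ready_nodes(sample_nodes))
```

Against a live cluster, the input could come from `subprocess.run(["kubectl", "get", "nodes", "-o", "json"], capture_output=True, text=True, check=True).stdout` passed through `json.loads`.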

🐛 Troubleshooting

SSH Connection Timeout

Symptoms: Deployment hangs at "Waiting for SSH"

Solutions:

  1. Check Network Security Group allows your IP
  2. Verify SSH key path is correct
  3. Wait longer (VMs may be slow to boot)
# Test SSH manually
ssh -i ~/.ssh/id_rsa -v azureuser@<vm-ip>
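
A plain TCP probe separates "port 22 unreachable" (NSG rule, VM still booting) from "SSH reachable but authentication fails" (wrong key). The sketch below polls the port until it accepts a connection. `wait_for_port` is a hypothetical helper and its timeout values are arbitrary; it is not the wait logic used by deploy.py.

```python
import socket
import time


def wait_for_port(host: str, port: int = 22, timeout: float = 300.0,
                  interval: float = 5.0, connect_timeout: float = 3.0) -> bool:
    """Poll host:port until a TCP connection succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError on refusal, timeout or DNS failure.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            # Give up if another sleep would overrun the deadline.
            if time.monotonic() + interval > deadline:
                return False
            time.sleep(interval)


# Example (not executed here): wait_for_port("<vm-ip>", 22, timeout=120)
```

If the probe succeeds but ssh still fails, the key path or username is the more likely cause than the NSG.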

Ansible Playbook Fails

Solution: Re-run the Ansible playbooks manually:

cd scripts/ansible

# Run common setup
ansible-playbook -i inventory.yml setup_common.yml

# Run cluster setup
ansible-playbook -i inventory.yml remote_setup_controller_worker.yml

Nodes Not Ready

Solution: Check Flannel CNI:

kubectl get pods -n kube-system | grep flannel

# If not running, reapply:
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

kubeadm init fails with "conntrack not found"

Cause: Missing conntrack package (required for kube-proxy)

Solution: The setup_common.yml playbook should install this. If running manually:

sudo apt install conntrack -y

kubectl from laptop shows certificate error

Symptom: Unable to connect to the server: x509: certificate is valid for X, not Y

Cause: K8s API server certificate doesn't include the public IP

Solution: The Ansible playbook automatically adds --apiserver-cert-extra-sans with the public IP. If you need to reinitialize:

# On controller, reset and reinit with SANs
sudo kubeadm reset -f
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --cri-socket unix:///var/run/cri-dockerd.sock \
  --apiserver-advertise-address=<private-ip> \
  --apiserver-cert-extra-sans=<public-ip>,<private-ip>

Helm chart not found error

Symptom: FileNotFoundError: Helm chart not found at: ...

Solution: Clone with submodules:

git submodule update --init --recursive

Docker connection error in Mode B

Symptom: Error while fetching server API version: HTTPConnection.request() got an unexpected keyword argument 'chunked'

Cause: Some fault injectors try to connect to local Docker daemon

Solution:

  1. Install Docker Desktop on your laptop, OR
  2. Use Mode A (run AIOpsLab on controller VM)

🔐 Security Notes

Default Behavior (Secure)

The deployment script handles SSH host keys securely by default (NSG rules are open by default; see the checklist below):

  • Automatically adds SSH host keys via ssh-keyscan before running Ansible
  • Host key verification is always enabled (no option to disable)

Quick Security Checklist

  • NSG Rules: SSH and the K8s API are open to all sources (*) by default. Restrict them!

    # Use --allowed-ips to restrict access (e.g. Microsoft CorpNet)
    python3 deploy.py --apply --workers 2 --allowed-ips CorpNetPublic
    
    # Or add custom IP after deployment
    az network nsg rule create -g aiopslab-rg --nsg-name aiopslab-nsg \
        --name SSH-MyIP --priority 100 --protocol TCP \
        --source-address-prefixes "YOUR_IP/32" --destination-port-ranges 22
  • SSH Keys: Use 4096-bit RSA or Ed25519 with passphrases

  • Production: Consider Azure Bastion for secure access

  • Environments: Use separate resource groups for prod/dev/test


💰 Cost Management

Estimated Costs

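Each estimate counts the controller plus the workers at the per-VM rate from the VM Sizing Guide:

  • Small Dev Setup (2 workers, B2s): ~$90/month
  • Medium Prod (3 workers, D4s_v3): ~$480/month
  • Large Prod (5 workers, D8s_v3): ~$1,440/month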
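
The estimates follow from the per-VM figures in the VM Sizing Guide multiplied by the number of VMs (workers plus one controller). A small sketch of that arithmetic, assuming the controller uses the same VM size as the workers and ignoring disks, public IPs, and egress; `estimate_monthly_cost` is a hypothetical helper.

```python
# Approximate per-VM monthly prices (USD, East US) from the VM Sizing Guide.
PRICE_PER_VM = {
    "Standard_B2s": 30,
    "Standard_D4s_v3": 120,
    "Standard_D8s_v3": 240,
    "Standard_D16s_v3": 480,
}


def estimate_monthly_cost(vm_size: str, workers: int) -> int:
    """Approximate monthly cost: one controller plus `workers` VMs of one size."""
    if workers < 0:
        raise ValueError("workers must be non-negative")
    return PRICE_PER_VM[vm_size] * (workers + 1)


for label, size, workers in [
    ("Small Dev Setup", "Standard_B2s", 2),
    ("Medium Prod", "Standard_D4s_v3", 3),
    ("Large Prod", "Standard_D8s_v3", 5),
]:
    print(f"{label}: ~${estimate_monthly_cost(size, workers):,}/month")
```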

Save Money

  1. Destroy when not in use: python3 deploy.py --destroy
  2. Use B-series VMs for dev/test
  3. Deallocate VMs instead of deleting (compute billing stops; disks and static public IPs still incur charges):
    az vm deallocate --resource-group aiopslab-rg --name aiopslab-controller

Files

Directory            Contents
scripts/terraform/   deploy.py (main entry point), Terraform configs (main.tf, variables.tf, etc.), generate_inventory.py
scripts/ansible/     Playbooks for K8s setup (setup_common.yml, remote_setup_controller_worker.yml) and Mode A AIOpsLab setup (setup_aiopslab.yml), Jinja2 templates, inventory

🤝 Contributing

Found a bug or have a suggestion? Please open an issue or submit a PR!

Areas for improvement:

  • Support for AWS, GCP
  • Automated monitoring setup
  • Cost optimization features
  • Integration tests

📄 License

Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.


Need Help? Open an issue on GitHub.