Fully automated deployment of production-ready Kubernetes clusters on Azure
Tested on WSL2 (Ubuntu 22.04) + Windows 11 with Azure VMs (Ubuntu 22.04 LTS, amd64). Auto-install of tools (kubectl, helm, poetry) targets Linux/amd64.
Mode B (AIOpsLab on your laptop, remote kubectl):

```bash
python3 deploy.py --apply --resource-group <your-rg> --workers 2 --mode B
```

Mode A (AIOpsLab on the controller VM):

```bash
python3 deploy.py --apply --resource-group <your-rg> --workers 2 --mode A

# With --dev to rsync local code instead of git clone:
python3 deploy.py --apply --resource-group <your-rg> --workers 2 --mode A --dev
```

Destroy when done:

```bash
python3 deploy.py --destroy --resource-group <your-rg>
```

The script handles VM provisioning, K8s cluster setup, and AIOpsLab configuration.

Tip: Add `--allowed-ips CorpNetPublic` (or a CIDR) to restrict SSH/K8s API access. The default is open to all (`*`).
- ✅ Fully Automated: One command deploys everything
- ✅ Dynamic Scaling: Support for 1-N worker nodes
- ✅ Ansible Integration: Production-ready K8s setup
- ✅ Smart Inventory: Auto-generates Ansible inventory from Terraform
- ✅ SSH Verification: Waits for connectivity before proceeding
- ✅ Graceful Destroy: Safe teardown with confirmation
- ✅ Better Outputs: Structured VM information for automation
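As a sketch of how the Terraform-to-Ansible inventory hand-off can work: read `terraform output -json`, then map the outputs into an inventory structure. The output names (`controller_public_ip`, `worker_public_ips`) and the function below are illustrative assumptions, not the actual `generate_inventory.py` API:

```python
def build_inventory(tf_outputs: dict) -> dict:
    """Map `terraform output -json` data to an Ansible inventory dict.

    The output names ('controller_public_ip', 'worker_public_ips') are
    assumptions; the real generate_inventory.py may use different names.
    """
    controller_ip = tf_outputs["controller_public_ip"]["value"]
    worker_ips = tf_outputs["worker_public_ips"]["value"]
    workers = {
        f"worker-{i + 1}": {"ansible_host": ip}
        for i, ip in enumerate(worker_ips)
    }
    return {
        "all": {
            "vars": {
                "ansible_user": "azureuser",
                "ansible_ssh_private_key_file": "~/.ssh/id_rsa",
            },
            "children": {
                "controller": {"hosts": {"controller": {"ansible_host": controller_ip}}},
                "workers": {"hosts": workers},
            },
        }
    }

# Typical use (serialize with pyyaml):
#   import json, subprocess, yaml
#   raw = subprocess.check_output(["terraform", "output", "-json"], text=True)
#   with open("../ansible/inventory.yml", "w") as f:
#       yaml.safe_dump(build_inventory(json.loads(raw)), f, sort_keys=False)
```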
| Tool | Version | Installation |
|---|---|---|
| Python | 3.11+ | python.org |
| Terraform | 1.6+ | Install |
| Ansible | Latest | Install |
| Azure CLI | Latest | Install |
```bash
# Ansible
sudo apt update
sudo apt install software-properties-common
sudo add-apt-repository --yes --update ppa:ansible/ansible
sudo apt install ansible -y

# Python dependencies
pip install pyyaml
```

```bash
# Login to Azure
az login

# Set subscription
az account set --subscription "<subscription-id>"

# Create resource group (if needed)
az group create --name aiopslab-rg --location eastus

# Generate SSH key (if needed)
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa
```

Deploy with default settings (2 workers, Standard_B2s):
```bash
python deploy.py --apply
```

Specify worker count and VM size:

```bash
python deploy.py --apply \
  --workers 5 \
  --vm-size Standard_D8s_v3 \
  --resource-group my-rg \
  --prefix myaiops \
  --ssh-key ~/.ssh/custom_key.pub
```

```
--plan               Dry-run: show what would be created
--apply              Deploy infrastructure and set up the cluster
--destroy            Destroy all infrastructure
--setup-only         Re-run AIOpsLab setup without reprovisioning (uses
                     existing Terraform state). Useful for iterating on
                     code or config changes.
--workers N          Number of worker nodes (default: 2)
--vm-size SIZE       Azure VM size (default: Standard_B2s)
--resource-group RG  Azure resource group (default: aiopslab-rg)
--prefix PREFIX      Resource name prefix (default: aiopslab)
--ssh-key PATH       SSH public key path (default: ~/.ssh/id_rsa.pub)
--allowed-ips ADDR   NSG source address for SSH + K8s API. Use '*' for
                     open (default), a CIDR, or an Azure service tag
                     like 'CorpNetPublic'.
--mode {A,B}         A: AIOpsLab on controller VM. B: AIOpsLab on
                     laptop with remote kubectl (default: B).
--dev                Mode A only: rsync local repo to controller
                     instead of git clone.
--debug              Enable debug logging
```
```bash
python deploy.py --destroy \
  --resource-group aiopslab-rg \
  --ssh-key ~/.ssh/id_rsa.pub
```

You'll be prompted to confirm before deletion.
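The confirmation gate can be as simple as requiring the user to type the resource group name back. This is an illustrative sketch, not the actual deploy.py prompt (the `confirm_destroy` helper is an assumption):

```python
def confirm_destroy(resource_group: str, ask=input) -> bool:
    """Return True only if the user types the resource group name exactly.

    `ask` is injectable so the prompt can be tested without a real TTY.
    """
    answer = ask(f"Type '{resource_group}' to confirm destruction: ")
    return answer.strip() == resource_group
```

Typing anything other than the exact name aborts the teardown, which guards against destroying the wrong environment.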
For more control or debugging, you can run each step manually:

```bash
cd scripts/terraform
terraform init
terraform plan -var="resource_group_name=<your-rg>" -var="worker_vm_count=3"
terraform apply -var="resource_group_name=<your-rg>" -var="worker_vm_count=3"
```

```bash
python generate_inventory.py
# This creates ../ansible/inventory.yml with VM IPs and SSH config
```

```bash
cd ../ansible

# Install Docker and Kubernetes packages on all nodes
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml setup_common.yml

# Initialize the K8s cluster and join workers
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml remote_setup_controller_worker.yml
```

```bash
# The playbook copies the kubeconfig to your ~/.kube/config automatically
kubectl get nodes
```

```bash
cd scripts/terraform
terraform destroy -var="resource_group_name=<your-rg>"
```

Run AIOpsLab directly on the controller VM. The setup is fully automated:
```bash
# Clone mode (default): git clones the repo on the controller
python3 deploy.py --apply --resource-group <your-rg> --workers 2 --mode A

# Dev mode: rsync your local repo to the controller instead of cloning
python3 deploy.py --apply --resource-group <your-rg> --workers 2 --mode A --dev
```

The --mode A setup (`scripts/ansible/setup_aiopslab.yml`) automatically:

- Installs Python 3.11, Poetry, Helm, and git on the controller
- Adds the user to the `docker` group (required by VirtualizationFaultInjector)
- Clones the repo with submodules (clone mode) or rsyncs local code (dev mode)
- Generates `aiopslab/config.yml` with `k8s_host: localhost`
- Runs `poetry env use python3.11 && poetry install`
- Verifies cluster access with `kubectl get nodes`
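For reference, the generated `aiopslab/config.yml` in Mode A might look like the fragment below. Only `k8s_host: localhost` is confirmed by this setup; any other fields in the real file are not shown here:

```yaml
# aiopslab/config.yml (Mode A: the cluster is local to the controller VM)
k8s_host: localhost
```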
After deploy, SSH to the controller to run experiments:

```bash
ssh -i ~/.ssh/id_rsa azureuser@<controller-ip>
cd ~/AIOpsLab
eval $(poetry env activate)
python3 cli.py
```

To iterate on code changes without reprovisioning VMs:

```bash
python3 deploy.py --setup-only --mode A --dev
```

Pros: All fault injectors work (Docker is on the same machine); no local Docker required.
Cons: You must SSH to the controller to run experiments.
deploy.py --mode B (the default) handles everything automatically:

```bash
python3 deploy.py --apply --resource-group <your-rg> --workers 2 --mode B
```

This automatically:

- Installs kubectl, helm, and poetry if missing
- Verifies the kubeconfig and cluster connectivity
- Generates `aiopslab/config.yml` with the correct controller IP
- Runs `poetry env use python3.11 && poetry install`
- Prints a summary table showing what succeeded and what needs manual action

After deploy, just:

```bash
eval $(poetry env activate)
python3 cli.py
```

Pros: Use your local IDE; no SSH needed to run experiments.
Cons: Some fault injectors (e.g., VirtualizationFaultInjector) require local Docker.

Note: If you see Docker connection errors in Mode B, either install Docker on your laptop or switch to Mode A.
Note: If using a git worktree in WSL, `git submodule update` may fail due to cross-platform path issues. Run it from Git Bash instead.
| VM Size | vCPUs | RAM | Use Case | Cost/Month* |
|---|---|---|---|---|
| Standard_B2s | 2 | 4 GB | Dev/Test | ~$30 |
| Standard_D4s_v3 | 4 | 16 GB | Small Prod | ~$120 |
| Standard_D8s_v3 | 8 | 32 GB | Medium Prod | ~$240 |
| Standard_D16s_v3 | 16 | 64 GB | Large Prod | ~$480 |
*Approximate costs for East US region
- 1 Controller VM (Kubernetes control plane)
- N Worker VMs (configurable, default 2)
- Virtual Network & Subnet (10.0.0.0/16)
- Network Security Group (SSH access)
- Public IPs for all VMs
- Network Interfaces
- Docker CE + cri-dockerd
- Kubernetes v1.31 (kubeadm, kubelet, kubectl)
- Flannel CNI plugin
- Fully configured K8s cluster
1. Terraform Init → Initialize providers
2. Terraform Plan → Create execution plan
3. Terraform Apply → Provision VMs on Azure
4. Get Outputs → Retrieve VM IPs and config
5. Generate Inventory → Create Ansible inventory.yml
6. Wait for SSH → Ensure VMs are accessible
7. Run Ansible → Install Docker, K8s packages
8. Setup Cluster → Initialize K8s, join workers
9. AIOpsLab Setup → Mode-dependent:
Mode A → Run setup_aiopslab.yml on controller
Mode B → Install tools locally, generate config.yml
Total Time: 15-25 minutes
Both modes print a summary table at the end showing what succeeded and what needs manual action.
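The flow above can be sketched as a thin wrapper that builds and runs each command in sequence. This is an illustrative outline, not the actual deploy.py (the function names and the `-auto-approve` flag are assumptions):

```python
import subprocess


def terraform_cmds(resource_group: str, workers: int) -> list:
    """Build the Terraform commands for steps 1-3 of the pipeline."""
    var_flags = [
        f"-var=resource_group_name={resource_group}",
        f"-var=worker_vm_count={workers}",
    ]
    return [
        ["terraform", "init"],
        ["terraform", "plan", *var_flags],
        ["terraform", "apply", "-auto-approve", *var_flags],
    ]


def ansible_cmds() -> list:
    """Steps 7-8: common node setup, then cluster init + worker join."""
    return [
        ["ansible-playbook", "-i", "inventory.yml", "setup_common.yml"],
        ["ansible-playbook", "-i", "inventory.yml", "remote_setup_controller_worker.yml"],
    ]


def run_all(cmds, cwd):
    """Execute each command, failing fast if any step returns non-zero."""
    for cmd in cmds:
        subprocess.run(cmd, cwd=cwd, check=True)

# e.g. run_all(terraform_cmds("aiopslab-rg", 2), cwd="scripts/terraform")
```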
AIOpsLab is already installed on the controller. SSH in and start:

```bash
ssh -i ~/.ssh/id_rsa azureuser@<controller-ip>
cd ~/AIOpsLab
eval $(poetry env activate)
python3 cli.py
```

AIOpsLab is configured locally. Start directly:

```bash
eval $(poetry env activate)
python3 cli.py
```

```bash
kubectl get nodes
# NAME                  STATUS   ROLES           AGE   VERSION
# aiopslab-controller   Ready    control-plane   5m    v1.31.x
# aiopslab-worker-1     Ready    <none>          3m    v1.31.x
# aiopslab-worker-2     Ready    <none>          3m    v1.31.x
```

Symptoms: Deployment hangs at "Waiting for SSH"
Solutions:
- Check Network Security Group allows your IP
- Verify SSH key path is correct
- Wait longer (VMs may be slow to boot)
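If you want to reproduce the wait logic yourself, the check amounts to polling TCP port 22 until it accepts a connection. A minimal sketch of that idea (illustrative, not deploy.py's actual code):

```python
import socket
import time


def wait_for_ssh(host: str, port: int = 22,
                 timeout: float = 120.0, interval: float = 5.0) -> bool:
    """Poll until host:port accepts a TCP connection or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means sshd (or at least the port) is up
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```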
```bash
# Test SSH manually
ssh -i ~/.ssh/id_rsa -v azureuser@<vm-ip>
```

Solution: Re-run Ansible manually:

```bash
cd scripts/ansible

# Run common setup
ansible-playbook -i inventory.yml setup_common.yml

# Run cluster setup
ansible-playbook -i inventory.yml remote_setup_controller_worker.yml
```

Solution: Check the Flannel CNI:
```bash
kubectl get pods -n kube-system | grep flannel

# If not running, reapply:
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
```

Cause: Missing conntrack package (required by kube-proxy)

Solution: The setup_common.yml playbook should install this. If running manually:

```bash
sudo apt install conntrack -y
```

Symptom: `Unable to connect to the server: x509: certificate is valid for X, not Y`
Cause: K8s API server certificate doesn't include the public IP
Solution: The Ansible playbook automatically adds `--apiserver-cert-extra-sans` with the public IP. If you need to reinitialize:

```bash
# On the controller, reset and reinitialize with SANs
sudo kubeadm reset -f
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --cri-socket unix:///var/run/cri-dockerd.sock \
  --apiserver-advertise-address=<private-ip> \
  --apiserver-cert-extra-sans=<public-ip>,<private-ip>
```

Symptom: `FileNotFoundError: Helm chart not found at: ...`

Solution: Clone with submodules:

```bash
git submodule update --init --recursive
```

Symptom: `Error while fetching server API version: HTTPConnection.request() got an unexpected keyword argument 'chunked'`
Cause: Some fault injectors try to connect to the local Docker daemon
Solution:
- Install Docker Desktop on your laptop, OR
- Use Mode A (run AIOpsLab on controller VM)
The deployment script is secure by default:
- Automatically adds SSH host keys via `ssh-keyscan` before running Ansible
- Host key verification is always enabled (no option to disable it)
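The ssh-keyscan step can be approximated as: scan the host, then append only new key lines to `known_hosts`. The `merge_known_hosts` helper below is an assumption for illustration, not deploy.py's exact code:

```python
def merge_known_hosts(existing: str, scanned: str) -> str:
    """Append ssh-keyscan output to known_hosts content, skipping
    comment lines and entries that are already present."""
    if existing and not existing.endswith("\n"):
        existing += "\n"
    seen = set(existing.splitlines())
    fresh = [ln for ln in scanned.splitlines()
             if ln and not ln.startswith("#") and ln not in seen]
    return existing + "".join(ln + "\n" for ln in fresh)

# Typical use:
#   import os, subprocess
#   scanned = subprocess.check_output(["ssh-keyscan", "-H", vm_ip], text=True)
#   path = os.path.expanduser("~/.ssh/known_hosts")
#   content = open(path).read() if os.path.exists(path) else ""
#   open(path, "w").write(merge_known_hosts(content, scanned))
```

Deduplicating makes the merge idempotent, so re-running the deploy does not grow `known_hosts` with repeated entries.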
- NSG Rules: SSH is open to 0.0.0.0/0 by default - restrict it!

```bash
# Use --allowed-ips to restrict access (e.g. Microsoft CorpNet)
python deploy.py --apply --workers 2 --allowed-ips CorpNetPublic

# Or add a custom IP rule after deployment
az network nsg rule create -g aiopslab-rg --nsg-name aiopslab-nsg \
  --name SSH-MyIP --priority 100 --protocol TCP \
  --source-address-prefixes "YOUR_IP/32" --destination-port-ranges 22
```

- SSH Keys: Use 4096-bit RSA or Ed25519 keys with passphrases
- Production: Consider Azure Bastion for secure access
- Environments: Use separate resource groups for prod/dev/test
- Small Dev Setup (2 workers, B2s): ~$90/month
- Medium Prod (3 workers, D4s_v3): ~$480/month
- Large Prod (5 workers, D8s_v3): ~$1,440/month
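These totals are simply (1 controller + N workers) × per-VM monthly cost. A quick estimator using the approximate East US rates from the table above (illustrative figures, not authoritative Azure pricing):

```python
# Approximate USD/month per VM, East US (from the sizing table above)
APPROX_MONTHLY_COST = {
    "Standard_B2s": 30,
    "Standard_D4s_v3": 120,
    "Standard_D8s_v3": 240,
    "Standard_D16s_v3": 480,
}


def estimate_monthly_cost(workers: int, vm_size: str = "Standard_B2s") -> int:
    """Controller + N workers, all the same size (the deploy.py default)."""
    return (1 + workers) * APPROX_MONTHLY_COST[vm_size]
```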
- Destroy when not in use: `python deploy.py --destroy`
- Use B-series VMs for dev/test
- Deallocate VMs instead of deleting:

```bash
az vm deallocate --resource-group aiopslab-rg --name aiopslab-controller
```
| Directory | Contents |
|---|---|
| `scripts/terraform/` | `deploy.py` (main entry point), Terraform configs (`main.tf`, `variables.tf`, etc.), `generate_inventory.py` |
| `scripts/ansible/` | Playbooks for K8s setup (`setup_common.yml`, `remote_setup_controller_worker.yml`) and Mode A AIOpsLab setup (`setup_aiopslab.yml`), Jinja2 templates, inventory |
Found a bug or have a suggestion? Please open an issue or submit a PR!
Areas for improvement:
- Support for AWS, GCP
- Automated monitoring setup
- Cost optimization features
- Integration tests
Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.
Need Help? Open an issue on GitHub.