
Deployment Guide

This guide provides step-by-step instructions for deploying the HA RKE2 Kubernetes cluster on AWS.

Prerequisites

Required Tools

Tool         Version     Installation
Terraform    >= 1.5.0    Download
AWS CLI      >= 2.0      Install Guide
kubectl      >= 1.28     Install Guide
SSH client   any         Built-in (Linux/Mac) or PuTTY (Windows)

Verify Installations

# Check Terraform
terraform version
# Expected: Terraform v1.5.0 or higher

# Check AWS CLI
aws --version
# Expected: aws-cli/2.x.x

# Check kubectl
kubectl version --client
# Expected: Client Version: v1.28.x or higher

AWS Requirements

  • AWS Account with administrative access
  • IAM Permissions for:
    • VPC (create, modify, delete)
    • EC2 (instances, security groups, key pairs)
    • ELB (load balancers, target groups)
    • IAM (read-only for current user)

Verify AWS Permissions

# Check current identity
aws sts get-caller-identity

# Expected output:
{
    "UserId": "AIDAXXXXXXXXXXXXXXXXX",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/your-username"
}
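If you are scripting around this check, the AWS CLI can return just the account ID with `--query Account --output text`; the ID can also be parsed out of the ARN. A small offline sketch using the example ARN above:

```shell
# Direct route on a real session: aws sts get-caller-identity --query Account --output text
# Offline illustration: the account ID is the fifth colon-separated ARN field.
arn="arn:aws:iam::123456789012:user/your-username"
account_id=$(echo "$arn" | cut -d: -f5)
echo "$account_id"
```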

Step 1: Clone the Repository

# Clone the repository
git clone https://github.com/deviant101/ha-rke2-kubernetes-cluster.git
cd ha-rke2-kubernetes-cluster

# Verify structure
ls -la
# Expected:
# README.md
# docs/
# terraform/

Step 2: Configure AWS Credentials

Option A: Environment Variables (Recommended for CI/CD)

export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"

Option B: AWS CLI Profile (Recommended for Development)

# Configure default profile
aws configure

# You'll be prompted for:
AWS Access Key ID [None]: your-access-key
AWS Secret Access Key [None]: your-secret-key
Default region name [None]: us-east-1
Default output format [None]: json

Option C: Named Profile

# Configure named profile
aws configure --profile rke2-cluster

# Use the profile
export AWS_PROFILE=rke2-cluster

Verify AWS Configuration

# Test access
aws ec2 describe-regions --query 'Regions[].RegionName' --output table

Step 3: Generate SSH Key Pair

Create New SSH Key

# Generate SSH key pair
ssh-keygen -t ed25519 -C "rke2-cluster" -f ~/.ssh/rke2-cluster-key

# Set correct permissions
chmod 600 ~/.ssh/rke2-cluster-key
chmod 644 ~/.ssh/rke2-cluster-key.pub

# View the public key (you'll reference this path later)
cat ~/.ssh/rke2-cluster-key.pub

Use Existing SSH Key

If you have an existing key:

# Verify key exists
ls -la ~/.ssh/id_ed25519.pub
# or
ls -la ~/.ssh/id_rsa.pub

Step 4: Configure Terraform Variables

Create terraform.tfvars

cd terraform

# Copy example file
cp terraform.tfvars.example terraform.tfvars

# Edit the configuration
nano terraform.tfvars  # or vim, code, etc.

Minimal Configuration

# terraform.tfvars

# Required
ssh_public_key_path = "~/.ssh/rke2-cluster-key.pub"

# Optional - customize as needed
aws_region   = "us-east-1"
cluster_name = "my-rke2-cluster"
environment  = "dev"

Full Configuration Example

# terraform.tfvars

# AWS Configuration
aws_region = "us-east-1"

# Cluster Identity
cluster_name = "production-rke2"
environment  = "production"

# Network Configuration
vpc_cidr             = "10.0.0.0/16"
public_subnet_cidrs  = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
pod_cidr             = "10.42.0.0/16"
service_cidr         = "10.43.0.0/16"

# Node Configuration
control_plane_count         = 3
worker_count                = 3
control_plane_instance_type = "t3.large"
worker_instance_type        = "t3.large"
root_volume_size            = 100

# RKE2 Version
rke2_version = "v1.34.6+rke2r1"

# SSH Access
ssh_public_key_path = "~/.ssh/rke2-cluster-key.pub"

# Security - Restrict SSH access to your IP
admin_cidr_blocks = ["YOUR.PUBLIC.IP.ADDRESS/32"]

# Optional - Provide your own token (or let Terraform generate one)
# rke2_token = "your-secure-token-here"
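If you prefer to set `rke2_token` yourself rather than let Terraform generate one, any long random string works; one way to produce a strong token locally, assuming `openssl` is installed:

```shell
# Generate a 64-character hex token and print it in tfvars form
# (openssl is assumed to be available on your workstation)
token=$(openssl rand -hex 32)
echo "rke2_token = \"${token}\""
```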

Find Your Public IP

# Get your current public IP
curl -s ifconfig.me
# or
curl -s ipinfo.io/ip
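To avoid transcription mistakes, the `admin_cidr_blocks` line can be generated from the lookup above. The sketch below uses a placeholder documentation address so it runs offline; in practice, replace the assignment with `MY_IP=$(curl -s ifconfig.me)`:

```shell
# Placeholder from the TEST-NET-3 documentation range; for real use:
#   MY_IP=$(curl -s ifconfig.me)
MY_IP="203.0.113.10"
CIDR_LINE="admin_cidr_blocks = [\"${MY_IP}/32\"]"
echo "$CIDR_LINE"   # paste this line into terraform.tfvars
```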

Step 5: Initialize Terraform

cd terraform

# Initialize Terraform (downloads providers)
terraform init

Expected Output

Initializing the backend...
Initializing provider plugins...
- Finding hashicorp/aws versions matching "~> 5.0"...
- Finding hashicorp/random versions matching "~> 3.0"...
- Installing hashicorp/aws v5.x.x...
- Installing hashicorp/random v3.x.x...

Terraform has been successfully initialized!

Step 6: Plan the Deployment

# Generate and review the execution plan
terraform plan

Review the Plan

The plan will show:

  • Resources to create (VPC, subnets, EC2 instances, NLB, etc.)
  • Configuration details (instance types, CIDRs, etc.)
  • No resources to destroy (new deployment)

Plan: 25 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + cluster_name              = "my-rke2-cluster"
  + control_plane_private_ips = (known after apply)
  + control_plane_public_ips  = (known after apply)
  + kubernetes_api_endpoint   = (known after apply)
  + nlb_dns_name              = (known after apply)
  ...

Save the Plan (Optional)

# Save plan for later apply
terraform plan -out=tfplan

# Later, apply the saved plan
terraform apply tfplan

Step 7: Apply the Configuration

# Deploy the cluster
terraform apply

Confirm the Apply

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

Deployment Timeline

┌─────────────────────────────────────────────────────────────────────────┐
│                    DEPLOYMENT TIMELINE                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  T+0:00   Terraform starts creating resources                           │
│           - VPC, Subnets, Internet Gateway                              │
│           - Security Groups                                             │
│           - Network Load Balancer                                       │
│                                                                         │
│  T+2:00   EC2 instances launch                                          │
│           - Control Plane nodes start user-data scripts                 │
│           - Workers wait for control plane                              │
│                                                                         │
│  T+5:00   First control plane initializes                               │
│           - RKE2 server starts                                          │
│           - etcd cluster initializes                                    │
│           - API server starts                                           │
│                                                                         │
│  T+8:00   Additional control planes join                                │
│           - CP-2 and CP-3 join via NLB                                  │
│           - etcd achieves quorum                                        │
│                                                                         │
│  T+12:00  Workers join cluster                                          │
│           - RKE2 agents connect                                         │
│           - Cilium configured                                           │
│                                                                         │
│  T+15:00  Cluster fully operational                                     │
│           - All nodes Ready                                             │
│           - System pods running                                         │
│                                                                         │
│  TOTAL:   ~15-20 minutes                                                │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Terraform Output

After successful apply:

Apply complete! Resources: 25 added, 0 changed, 0 destroyed.

Outputs:

cluster_name = "my-rke2-cluster"
control_plane_private_ips = [
  "10.0.1.100",
  "10.0.2.101",
  "10.0.3.102",
]
control_plane_public_ips = [
  "54.123.45.67",
  "54.123.45.68",
  "54.123.45.69",
]
kubernetes_api_endpoint = "https://my-rke2-cluster-nlb-1234567890.elb.us-east-1.amazonaws.com:6443"
kubeconfig_command = "ssh -i ~/.ssh/rke2-cluster-key ubuntu@54.123.45.67 'sudo cat /etc/rancher/rke2/rke2.yaml' | sed 's/127.0.0.1/my-rke2-cluster-nlb-1234567890.elb.us-east-1.amazonaws.com/g' > kubeconfig.yaml"
nlb_dns_name = "my-rke2-cluster-nlb-1234567890.elb.us-east-1.amazonaws.com"
...

Step 8: Verify Deployment

Check AWS Resources

# List EC2 instances
aws ec2 describe-instances \
  --filters "Name=tag:Cluster,Values=my-rke2-cluster" \
  --query 'Reservations[].Instances[].[InstanceId,State.Name,Tags[?Key==`Name`].Value|[0]]' \
  --output table

SSH to Control Plane

# Get SSH command from Terraform output
terraform output -raw ssh_control_plane_commands

# SSH to first control plane
ssh -i ~/.ssh/rke2-cluster-key ubuntu@<CONTROL_PLANE_PUBLIC_IP>

Verify RKE2 Status on Node

# On the control plane node:

# Check RKE2 service status
sudo systemctl status rke2-server

# Check node status
sudo /var/lib/rancher/rke2/bin/kubectl \
  --kubeconfig /etc/rancher/rke2/rke2.yaml \
  get nodes

# Check all pods
sudo /var/lib/rancher/rke2/bin/kubectl \
  --kubeconfig /etc/rancher/rke2/rke2.yaml \
  get pods -A

Step 9: Access the Cluster

Download kubeconfig

# Use the command from Terraform output (eval is required because the
# command contains a pipeline and an output redirect)
eval "$(terraform output -raw kubeconfig_command)"

# Verify the file was created
cat kubeconfig.yaml

Configure kubectl

# Option A: Set KUBECONFIG environment variable
export KUBECONFIG=$(pwd)/kubeconfig.yaml

# Option B: Copy to default location
mkdir -p ~/.kube
cp kubeconfig.yaml ~/.kube/config
chmod 600 ~/.kube/config

Verify Cluster Access

# Check cluster info
kubectl cluster-info

# Expected output:
Kubernetes control plane is running at https://my-rke2-cluster-nlb-xxx.elb.us-east-1.amazonaws.com:6443
CoreDNS is running at https://...

# List nodes
kubectl get nodes

# Expected output:
NAME                          STATUS   ROLES                       AGE   VERSION
ip-10-0-1-100.ec2.internal   Ready    control-plane,etcd,master   10m   v1.34.6+rke2r1
ip-10-0-2-101.ec2.internal   Ready    control-plane,etcd,master   8m    v1.34.6+rke2r1
ip-10-0-3-102.ec2.internal   Ready    control-plane,etcd,master   6m    v1.34.6+rke2r1
ip-10-0-1-200.ec2.internal   Ready    <none>                      4m    v1.34.6+rke2r1
ip-10-0-2-201.ec2.internal   Ready    <none>                      4m    v1.34.6+rke2r1
ip-10-0-3-202.ec2.internal   Ready    <none>                      4m    v1.34.6+rke2r1
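For scripting, the same listing can be reduced to a Ready count. The sketch below runs offline against a sample of the output above; on the live cluster, replace the here-document with `kubectl get nodes --no-headers`:

```shell
# Count Ready nodes (sample data below); on the live cluster:
#   kubectl get nodes --no-headers | awk '$2 == "Ready"' | wc -l
ready_count=$(awk '$2 == "Ready"' <<'EOF' | wc -l
ip-10-0-1-100.ec2.internal   Ready    control-plane,etcd,master   10m   v1.34.6+rke2r1
ip-10-0-2-101.ec2.internal   Ready    control-plane,etcd,master   8m    v1.34.6+rke2r1
ip-10-0-1-200.ec2.internal   Ready    <none>                      4m    v1.34.6+rke2r1
EOF
)
echo "$ready_count"
```

With all six nodes joined, the count should equal `control_plane_count + worker_count`.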

Check System Pods

kubectl get pods -n kube-system

# Expected pods:
NAME                                      READY   STATUS    RESTARTS   AGE
cilium-xxxxx                              1/1     Running   0          10m
cilium-operator-xxxxx                     1/1     Running   0          10m
coredns-xxxxx                             1/1     Running   0          10m
rke2-coredns-rke2-coredns-xxxxx           1/1     Running   0          10m
...

Step 10: Deploy Sample Application

Create Namespace

kubectl create namespace demo

Deploy nginx

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo
  namespace: demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
  namespace: demo
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
    nodePort: 30080
EOF

Verify Deployment

# Check pods
kubectl get pods -n demo

# Check service
kubectl get svc -n demo

# Test the application (NodePort 30080 is exposed on every node)
curl http://<WORKER_PUBLIC_IP>:30080

Cleanup

Destroy All Resources

cd terraform

# Preview destruction
terraform plan -destroy

# Destroy resources
terraform destroy

Manual Cleanup (if needed)

# Delete EC2 instances
aws ec2 terminate-instances --instance-ids <INSTANCE_IDS>

# Delete load balancer
aws elbv2 delete-load-balancer --load-balancer-arn <LB_ARN>

# Delete VPC (after all resources are removed)
aws ec2 delete-vpc --vpc-id <VPC_ID>

Customization Options

Different Regions

# terraform.tfvars
aws_region = "eu-west-1"  # Ireland
# or
aws_region = "ap-southeast-1"  # Singapore

Larger Instance Types

# terraform.tfvars
control_plane_instance_type = "m5.xlarge"
worker_instance_type = "m5.2xlarge"

More Workers

# terraform.tfvars
worker_count = 5  # or any number

Different RKE2 Version

# terraform.tfvars
rke2_version = "v1.30.0+rke2r1"

Find available versions: https://github.com/rancher/rke2/releases

Custom Network CIDRs

# terraform.tfvars
vpc_cidr             = "172.16.0.0/16"
public_subnet_cidrs  = ["172.16.1.0/24", "172.16.2.0/24", "172.16.3.0/24"]
pod_cidr             = "172.20.0.0/16"
service_cidr         = "172.21.0.0/16"
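Whatever ranges you choose, `vpc_cidr`, `pod_cidr`, and `service_cidr` must not overlap. A quick offline sanity check before `terraform apply`, assuming `python3` is installed:

```shell
# Verify vpc_cidr, pod_cidr, and service_cidr are pairwise disjoint
# (python3 assumed; edit the list to match your terraform.tfvars)
result=$(python3 - <<'EOF'
import ipaddress
cidrs = ["172.16.0.0/16", "172.20.0.0/16", "172.21.0.0/16"]
nets = [ipaddress.ip_network(c) for c in cidrs]
ok = all(not a.overlaps(b) for i, a in enumerate(nets) for b in nets[i+1:])
print("no overlap" if ok else "OVERLAP DETECTED")
EOF
)
echo "$result"
```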

Next Steps

After successful deployment:

  1. Install Ingress Controller (nginx-ingress or traefik)
  2. Configure DNS for your applications
  3. Set up monitoring (Prometheus + Grafana)
  4. Configure backup for etcd snapshots
  5. Implement GitOps (ArgoCD or Flux)
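For item 4, RKE2 can take scheduled etcd snapshots on its own; a hedged example of the server configuration (the schedule and retention values here are illustrative, not defaults from this repository):

```yaml
# /etc/rancher/rke2/config.yaml on each control plane node
etcd-snapshot-schedule-cron: "0 */6 * * *"   # snapshot every 6 hours
etcd-snapshot-retention: 10                  # keep the last 10 snapshots
```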

Back to Main README | Previous: HA RKE2 Guide | Next: Flow Diagrams