diff --git a/.gitignore b/.gitignore index 68038b13d..407320a7d 100644 --- a/.gitignore +++ b/.gitignore @@ -5,4 +5,5 @@ manifest.json ami-id Pulumi*.yaml /tools/bin/** -!/tools/bin/.gitkeep \ No newline at end of file +!/tools/bin/.gitkeep +.claude \ No newline at end of file diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 000000000..f3221d786 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1 @@ +@specs/project-context.md diff --git a/specs/api/aws/allocation.md b/specs/api/aws/allocation.md new file mode 100644 index 000000000..6fc020fbc --- /dev/null +++ b/specs/api/aws/allocation.md @@ -0,0 +1,98 @@ +# API: Allocation (AWS) + +> Concept: [specs/api/concepts/allocation.md](../concepts/allocation.md) + +**Package:** `github.com/redhat-developer/mapt/pkg/provider/aws/modules/allocation` + +Single entry point for resolving where and on what instance type a target will run. +All AWS EC2 action `Create()` functions call this before any Pulumi stack is touched. + +--- + +## Types + +### `AllocationArgs` + +> `ComputeRequestArgs` and `SpotArgs` are cross-provider types — see `specs/api/provider-interfaces.md`. + +```go +type AllocationArgs struct { + ComputeRequest *cr.ComputeRequestArgs // required: hardware constraints + Prefix *string // required: used to name the spot stack + AMIProductDescription *string // optional: e.g. 
"Linux/UNIX" — used for spot price queries + AMIName *string // optional: scopes spot search to AMI availability + Spot *spotTypes.SpotArgs // nil = on-demand; non-nil = spot evaluation +} +``` + +### `AllocationResult` + +```go +type AllocationResult struct { + Region *string // AWS region to deploy into + AZ *string // availability zone within that region + SpotPrice *float64 // nil when on-demand; set when spot was selected + InstanceTypes []string // one or more compatible instance type strings +} +``` + +--- + +## Functions + +### `Allocation` + +```go +func Allocation(mCtx *mc.Context, args *AllocationArgs) (*AllocationResult, error) +``` + +**Spot path** (`args.Spot != nil && args.Spot.Spot == true`): +- Creates or reuses a `spotOption-` Pulumi stack +- Queries spot prices across eligible regions; selects best region/AZ/price +- Idempotent: if the stack already exists, returns its saved outputs without re-querying +- Returns `AllocationResult` with all four fields set + +**On-demand path** (`args.Spot == nil` or `args.Spot.Spot == false`): +- Uses `mCtx.TargetHostingPlace()` as the region (set from provider default) +- Iterates AZs until one supports the required instance types +- Returns `AllocationResult` with `SpotPrice == nil` + +**Error:** returns `ErrNoSupportedInstanceTypes` if no AZ in the region supports the requested types. 
+ +--- + +## Usage Pattern + +```go +// In every AWS action Create(): +r.allocationData, err = allocation.Allocation(mCtx, &allocation.AllocationArgs{ + Prefix: &args.Prefix, + ComputeRequest: args.ComputeRequest, + AMIProductDescription: &amiProduct, // constant in the action's constants.go + Spot: args.Spot, +}) + +// Then pass results into the deploy function: +// r.allocationData.Region → NetworkArgs.Region, ComputeRequest credential region +// r.allocationData.AZ → NetworkArgs.AZ +// r.allocationData.InstanceTypes → ComputeRequest.InstaceTypes +// r.allocationData.SpotPrice → ComputeRequest.SpotPrice (when non-nil) +``` + +--- + +## Known Gaps + +- `spot.Destroy()` uses `aws.DefaultCredentials` (not region-scoped); verify this is correct + when the selected spot region differs from the default AWS region +- No re-evaluation of spot selection when the persisted region becomes significantly more expensive + between runs (by design — idempotency wins; worth documenting in user docs) + +--- + +## When to Extend This API + +Open a spec under `specs/features/aws/` and update this file when: +- Adding a new allocation strategy (e.g. reserved instances, on-demand with fallback to spot) +- Adding a new field to `AllocationArgs` that all targets would benefit from +- Changing the idempotency behaviour of the spot stack diff --git a/specs/api/aws/bastion.md b/specs/api/aws/bastion.md new file mode 100644 index 000000000..20573fa32 --- /dev/null +++ b/specs/api/aws/bastion.md @@ -0,0 +1,113 @@ +# API: Bastion + +**Package:** `github.com/redhat-developer/mapt/pkg/provider/aws/modules/bastion` + +Creates a bastion host in the public subnet of an airgap network. Called automatically by +`network.Create()` when `Airgap=true` — action code never calls bastion directly during deploy. + +Action code calls `bastion.WriteOutputs()` in `manageResults()` when airgap is enabled. 
+ +--- + +## Types + +### `BastionArgs` + +```go +type BastionArgs struct { + Prefix string + VPC *ec2.Vpc + Subnet *ec2.Subnet // must be the PUBLIC subnet, not the target subnet +} +``` + +### `BastionResult` + +```go +type BastionResult struct { + Instance *ec2.Instance + PrivateKey *tls.PrivateKey + Usarname string // note: typo in source — "Usarname" not "Username" + Port int // always 22 +} +``` + +--- + +## Functions + +### `Create` + +```go +func Create(ctx *pulumi.Context, mCtx *mc.Context, args *BastionArgs) (*BastionResult, error) +``` + +Called internally by `network.Create()`. Not called directly from action code. + +Creates: +- Amazon Linux 2 `t2.small` instance in the public subnet +- Keypair for SSH access +- Security group allowing SSH ingress from `0.0.0.0/0` + +Exports to Pulumi stack: +- `-bastion_id_rsa` +- `-bastion_username` +- `-bastion_host` + +### `WriteOutputs` + +```go +func WriteOutputs(stackResult auto.UpResult, prefix string, destinationFolder string) error +``` + +Writes the three bastion stack outputs to files in `destinationFolder`: + +| Stack output key | Output filename | +|---|---| +| `-bastion_id_rsa` | `bastion_id_rsa` | +| `-bastion_username` | `bastion_username` | +| `-bastion_host` | `bastion_host` | + +--- + +## Usage Pattern + +```go +// In deploy(): bastion is returned as part of NetworkResult — no direct call needed +nw, err := network.Create(ctx, mCtx, &network.NetworkArgs{Airgap: true, ...}) +// nw.Bastion is populated automatically + +// Pass to Readiness() so SSH goes through the bastion: +c.Readiness(ctx, cmd, prefix, id, privateKey, username, nw.Bastion, deps) + +// In manageResults(): write bastion files alongside target files +func manageResults(mCtx *mc.Context, stackResult auto.UpResult, prefix *string, airgap *bool) error { + if *airgap { + if err := bastion.WriteOutputs(stackResult, *prefix, mCtx.GetResultsOutputPath()); err != nil { + return err + } + } + return output.Write(stackResult, 
mCtx.GetResultsOutputPath(), results) +} +``` + +--- + +## Bastion Instance Spec (fixed, not configurable) + +| Property | Value | +|---|---| +| AMI | Amazon Linux 2 (`amzn2-ami-hvm-*-x86_64-ebs`) | +| Instance type | `t2.small` | +| Disk | 100 GiB | +| SSH user | `ec2-user` | +| SSH port | 22 | + +--- + +## When to Extend This API + +Open a spec under `specs/features/aws/` and update this file when: +- Making bastion instance type or disk size configurable +- Adding bastion support to Azure targets +- Adding support for Session Manager as an alternative to bastion SSH diff --git a/specs/api/aws/compute.md b/specs/api/aws/compute.md new file mode 100644 index 000000000..50f5adee8 --- /dev/null +++ b/specs/api/aws/compute.md @@ -0,0 +1,170 @@ +# API: Compute (AWS EC2) + +> Concept: [specs/api/concepts/compute.md](../concepts/compute.md) + +**Package:** `github.com/redhat-developer/mapt/pkg/provider/aws/modules/ec2/compute` + +Creates the EC2 instance (on-demand) or Auto Scaling Group (spot). Always the last Pulumi +resource created in a `deploy()` function, after networking, keypair, and security groups. + +--- + +## Types + +### `ComputeRequest` + +```go +type ComputeRequest struct { + MCtx *mc.Context + Prefix string + ID string // component ID — used in resource naming + VPC *ec2.Vpc // from network.NetworkResult.Vpc + Subnet *ec2.Subnet // from network.NetworkResult.Subnet + Eip *ec2.Eip // from network.NetworkResult.Eip + LB *lb.LoadBalancer // from network.NetworkResult.LoadBalancer; nil = on-demand + LBTargetGroups []int // TCP ports to register as LB target groups (e.g. 
[]int{22, 3389}) + AMI *ec2.LookupAmiResult + KeyResources *keypair.KeyPairResources + SecurityGroups pulumi.StringArray + InstaceTypes []string // from AllocationResult.InstanceTypes + InstanceProfile *iam.InstanceProfile // optional — required by SNC for SSM access + DiskSize *int // nil uses the module default (200 GiB) + Airgap bool + Spot bool // true when AllocationResult.SpotPrice != nil + SpotPrice float64 // only read when Spot=true + UserDataAsBase64 pulumi.StringPtrInput // cloud-init or PowerShell userdata + DependsOn []pulumi.Resource // explicit Pulumi dependencies +} +``` + +### `Compute` + +```go +type Compute struct { + Instance *ec2.Instance // set when Spot=false + AutoscalingGroup *autoscaling.Group // set when Spot=true + Eip *ec2.Eip + LB *lb.LoadBalancer + Dependencies []pulumi.Resource // pass to Readiness() and RunCommand() +} +``` + +--- + +## Functions + +### `NewCompute` + +```go +func (r *ComputeRequest) NewCompute(ctx *pulumi.Context) (*Compute, error) +``` + +- `Spot=false`: creates `ec2.Instance` with direct EIP association +- `Spot=true`: creates `ec2.LaunchTemplate` + `autoscaling.Group` with mixed instances policy, forced spot, capacity-optimized allocation strategy; registers LB target groups + +### `Readiness` + +```go +func (c *Compute) Readiness( + ctx *pulumi.Context, + cmd string, // command.CommandCloudInitWait or command.CommandPing + prefix, id string, + mk *tls.PrivateKey, + username string, + b *bastion.BastionResult, // nil when not airgap + dependencies []pulumi.Resource, +) error +``` + +Runs `cmd` over SSH on the instance. Blocks Pulumi until it succeeds (timeout: 40 minutes). +Pass `c.Dependencies` as `dependencies`. 
+ +### `RunCommand` + +```go +func (c *Compute) RunCommand( + ctx *pulumi.Context, + cmd string, + loggingCmdStd bool, // compute.LoggingCmdStd or compute.NoLoggingCmdStd + prefix, id string, + mk *tls.PrivateKey, + username string, + b *bastion.BastionResult, + dependencies []pulumi.Resource, +) (*remote.Command, error) +``` + +Like `Readiness` but returns the command resource for use as a dependency in subsequent steps. +Used by SNC to chain SSH → cluster ready → CA rotated → fetch kubeconfig. + +### `GetHostDnsName` + +```go +func (c *Compute) GetHostDnsName(public bool) pulumi.StringInput +``` + +Returns `LB.DnsName` when LB is set, otherwise `Eip.PublicDns` (public=true) or `Eip.PrivateDns` (public=false). +Export this as `-host`. + +### `GetHostIP` + +```go +func (c *Compute) GetHostIP(public bool) pulumi.StringOutput +``` + +Returns `Eip.PublicIp` or `Eip.PrivateIp`. Used by SNC (needs IP not DNS for kubeconfig replacement). + +--- + +## Readiness Commands + +| Constant | Value | When to use | +|---|---|---| +| `command.CommandCloudInitWait` | `sudo cloud-init status --long --wait \|\| [[ $? -eq 2 \|\| $? 
-eq 0 ]]` | Linux targets with cloud-init | +| `command.CommandPing` | `echo ping` | Windows targets (no cloud-init) | + +--- + +## Usage Pattern + +```go +cr := compute.ComputeRequest{ + MCtx: r.mCtx, + Prefix: *r.prefix, + ID: awsTargetID, + VPC: nw.Vpc, + Subnet: nw.Subnet, + Eip: nw.Eip, + LB: nw.LoadBalancer, + LBTargetGroups: []int{22}, // add 3389 for Windows + AMI: ami, + KeyResources: keyResources, + SecurityGroups: securityGroups, + InstaceTypes: r.allocationData.InstanceTypes, + DiskSize: &diskSize, // constant in constants.go + Airgap: *r.airgap, + UserDataAsBase64: udB64, +} +if r.allocationData.SpotPrice != nil { + cr.Spot = true + cr.SpotPrice = *r.allocationData.SpotPrice +} +c, err := cr.NewCompute(ctx) + +ctx.Export(fmt.Sprintf("%s-%s", *r.prefix, outputHost), c.GetHostDnsName(!*r.airgap)) + +return c.Readiness(ctx, command.CommandCloudInitWait, + *r.prefix, awsTargetID, + keyResources.PrivateKey, amiUserDefault, + nw.Bastion, c.Dependencies) +``` + +--- + +## When to Extend This API + +Open a spec under `specs/features/aws/` and update this file when: +- Adding support for additional storage volumes +- Adding support for instance store (NVMe) configuration +- Exposing health check grace period as configurable (currently hardcoded at 1200s) +- Adding on-demand with spot fallback (noted as TODO in source) diff --git a/specs/api/aws/network.md b/specs/api/aws/network.md new file mode 100644 index 000000000..4fadf3d20 --- /dev/null +++ b/specs/api/aws/network.md @@ -0,0 +1,115 @@ +# API: Network (AWS) + +> Concept: [specs/api/concepts/network.md](../concepts/network.md) + +**Package:** `github.com/redhat-developer/mapt/pkg/provider/aws/modules/network` + +Creates the VPC, subnet, internet gateway, optional load balancer, and optional airgap bastion +for any AWS EC2 target. Always the first Pulumi resource created in a `deploy()` function. 
+ +--- + +## Types + +### `NetworkArgs` + +```go +type NetworkArgs struct { + Prefix string // resource name prefix + ID string // component ID (e.g. "aws-rhel") — used in resource naming + Region string // from AllocationResult.Region + AZ string // from AllocationResult.AZ + CreateLoadBalancer bool // true when spot is used (LB fronts the ASG) + Airgap bool // true for airgap topology + AirgapPhaseConnectivity Connectivity // ON (with NAT) or OFF (without NAT) + // Optional VPC endpoints to create in the public subnet. + // Empty (default) = no endpoints. Accepted: "s3", "ecr", "ssm". + // Interface endpoints ("ecr", "ssm") share a security group (TCP 443 from VPC CIDR). + // See specs/features/aws/vpc-endpoints.md + Endpoints []string +} + +type Connectivity int +const ( + ON Connectivity = iota // NAT gateway present — machine has internet egress + OFF // NAT gateway absent — machine is isolated +) +``` + +### `NetworkResult` + +```go +type NetworkResult struct { + Vpc *ec2.Vpc + Subnet *ec2.Subnet // target subnet (public or private) + SubnetRouteTableAssociation *ec2.RouteTableAssociation // only set in airgap + Eip *ec2.Eip // always created; used for LB or direct instance + LoadBalancer *lb.LoadBalancer // nil when CreateLoadBalancer=false + Bastion *bastion.BastionResult // nil when Airgap=false +} +``` + +--- + +## Functions + +### `Create` + +```go +func Create(ctx *pulumi.Context, mCtx *mc.Context, args *NetworkArgs) (*NetworkResult, error) +``` + +**Standard path** (`Airgap=false`): +- VPC (`10.0.0.0/16`) with one public subnet (`10.0.2.0/24`) and internet gateway +- No NAT gateway +- EIP always created +- Load balancer created if `CreateLoadBalancer=true`, attached to EIP + +**Airgap path** (`Airgap=true`): +- VPC with public subnet (`10.0.2.0/24`) and private (target) subnet (`10.0.101.0/24`) +- Phase ON: public subnet gets NAT gateway → private subnet has internet egress +- Phase OFF: NAT gateway removed → private subnet is isolated +- Bastion host 
created in public subnet (see `specs/api/aws/bastion.md`) +- Load balancer when `CreateLoadBalancer=true` is internal-facing (private IP) + +--- + +## CIDRs (fixed, not configurable) + +| Range | Value | +|---|---| +| VPC | `10.0.0.0/16` | +| Public subnet | `10.0.2.0/24` | +| Private (airgap target) subnet | `10.0.101.0/24` | + +--- + +## Usage Pattern + +```go +nw, err := network.Create(ctx, r.mCtx, &network.NetworkArgs{ + Prefix: *r.prefix, + ID: awsTargetID, // constant from constants.go + Region: *r.allocationData.Region, + AZ: *r.allocationData.AZ, + CreateLoadBalancer: r.allocationData.SpotPrice != nil, + Airgap: *r.airgap, + AirgapPhaseConnectivity: r.airgapPhaseConnectivity, +}) + +// Pass results to compute: +// nw.Vpc → ComputeRequest.VPC, securityGroup.SGRequest.VPC +// nw.Subnet → ComputeRequest.Subnet +// nw.Eip → ComputeRequest.Eip +// nw.LoadBalancer → ComputeRequest.LB +// nw.Bastion → ComputeRequest.Readiness() bastion arg +``` + +--- + +## When to Extend This API + +Open a spec under `specs/features/aws/` and update this file when: +- Adding support for IPv6 +- Making CIDRs configurable +- Adding a new topology (e.g. multi-AZ, private-only without bastion) diff --git a/specs/api/aws/security-group.md b/specs/api/aws/security-group.md new file mode 100644 index 000000000..10a44e189 --- /dev/null +++ b/specs/api/aws/security-group.md @@ -0,0 +1,103 @@ +# API: Security Group (AWS) + +> Concept: [specs/api/concepts/security-group.md](../concepts/security-group.md) + +**Package:** `github.com/redhat-developer/mapt/pkg/provider/aws/services/ec2/security-group` + +Creates an EC2 security group with ingress rules. Called from every AWS action `deploy()` +and from the bastion module internally. 
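The pre-defined rules (`SSH_TCP`, `RDP_TCP`) are plain value types, which is why this spec says to copy them before setting `CidrBlocks`. A self-contained demonstration with a simplified local mirror of `IngressRules`:

```go
package main

import "fmt"

// Simplified mirror of the AWS IngressRules type (illustrative).
type IngressRules struct {
	Description string
	FromPort    int
	ToPort      int
	Protocol    string
	CidrBlocks  string
}

var SSH_TCP = IngressRules{Description: "SSH", FromPort: 22, ToPort: 22, Protocol: "tcp"}

func main() {
	sshRule := SSH_TCP // value copy — safe to customise
	sshRule.CidrBlocks = "0.0.0.0/0"

	fmt.Println(SSH_TCP.CidrBlocks == "") // true: package default untouched
	fmt.Println(sshRule.CidrBlocks)       // 0.0.0.0/0
}
```

Assigning directly to `securityGroup.SSH_TCP.CidrBlocks` would instead mutate the shared default for every subsequent caller in the process.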
+ +--- + +## Types + +### `SGRequest` + +```go +type SGRequest struct { + Name string // resourcesUtil.GetResourceName(prefix, id, "sg") + Description string + IngressRules []IngressRules + VPC *ec2.Vpc // from network.NetworkResult.Vpc +} +``` + +### `IngressRules` + +```go +type IngressRules struct { + Description string + FromPort int + ToPort int + Protocol string // "tcp", "udp", "icmp", "-1" (all) + CidrBlocks string // CIDR string; empty = 0.0.0.0/0; mutually exclusive with SG + SG *ec2.SecurityGroup // source SG; mutually exclusive with CidrBlocks +} +``` + +### `SGResources` + +```go +type SGResources struct { + SG *ec2.SecurityGroup +} +``` + +--- + +## Functions + +### `Create` + +```go +func (r SGRequest) Create(ctx *pulumi.Context, mCtx *mc.Context) (*SGResources, error) +``` + +Creates the security group with all ingress rules and a permissive egress (all traffic allowed). + +--- + +## Pre-defined Rules + +```go +// Defined in security-group/defaults.go — copy and set CidrBlocks before use +var SSH_TCP = IngressRules{Description: "SSH", FromPort: 22, ToPort: 22, Protocol: "tcp"} +var RDP_TCP = IngressRules{Description: "RDP", FromPort: 3389, ToPort: 3389, Protocol: "tcp"} + +// Port constants +const SSH_PORT = 22 +const HTTPS_PORT = 443 +``` + +**Important:** `SSH_TCP` and `RDP_TCP` are value types — copy them before setting `CidrBlocks`: +```go +sshRule := securityGroup.SSH_TCP +sshRule.CidrBlocks = infra.NETWORKING_CIDR_ANY_IPV4 // "0.0.0.0/0" +``` + +--- + +## Usage Pattern + +```go +sg, err := securityGroup.SGRequest{ + Name: resourcesUtil.GetResourceName(*prefix, awsTargetID, "sg"), + VPC: nw.Vpc, + Description: fmt.Sprintf("sg for %s", awsTargetID), + IngressRules: []securityGroup.IngressRules{sshRule}, +}.Create(ctx, mCtx) + +// Convert to StringArray for ComputeRequest: +sgs := util.ArrayConvert([]*ec2.SecurityGroup{sg.SG}, + func(sg *ec2.SecurityGroup) pulumi.StringInput { return sg.ID() }) +return pulumi.StringArray(sgs[:]), nil +``` + +--- 
+ +## When to Extend This API + +Open a spec under `specs/features/aws/` and update this file when: +- Adding new pre-defined rule constants (e.g. WinRM, HTTPS) +- Adding IPv6 CIDR support +- Adding support for egress rule customisation (currently always allow-all egress) diff --git a/specs/api/azure/allocation.md b/specs/api/azure/allocation.md new file mode 100644 index 000000000..bf74319ec --- /dev/null +++ b/specs/api/azure/allocation.md @@ -0,0 +1,122 @@ +# API: Allocation (Azure) + +**Package:** `github.com/redhat-developer/mapt/pkg/provider/azure/modules/allocation` + +Single entry point for resolving which Azure location, VM size, and image to use. +All Azure action `Create()` functions call this before any Pulumi stack is touched. + +> Concept: [specs/api/concepts/allocation.md](../concepts/allocation.md) + +--- + +## Types + +### `AllocationArgs` + +> `ComputeRequestArgs` and `SpotArgs` are cross-provider types — see `specs/api/provider-interfaces.md`. + +```go +type AllocationArgs struct { + ComputeRequest *cr.ComputeRequestArgs // required: hardware constraints + OSType string // e.g. "Linux", "Windows" — used for spot queries + ImageRef *data.ImageReference // optional: scopes spot search to image availability + Location *string // required for on-demand; ignored when spot selects location + Spot *spotTypes.SpotArgs // nil = on-demand; non-nil = spot evaluation +} +``` + +### `AllocationResult` + +```go +type AllocationResult struct { + Location *string // Azure region (e.g. 
"eastus") + Price *float64 // nil when on-demand; set when spot was selected + ComputeSizes []string // one or more compatible VM size strings + ImageRef *data.ImageReference // passed through from args +} +``` + +--- + +## Functions + +### `Allocation` + +```go +func Allocation(mCtx *mc.Context, args *AllocationArgs) (*AllocationResult, error) +``` + +**Spot path** (`args.Spot != nil && args.Spot.Spot == true`): +- Queries spot prices across eligible Azure locations +- Scores by price × availability; selects best location/VM size +- No separate Pulumi stack (unlike AWS) — result is not persisted between runs +- Returns `AllocationResult` with all fields set + +**On-demand path** (`args.Spot == nil` or `args.Spot.Spot == false`): +- Uses `args.Location` as the target location +- Filters `ComputeRequest.ComputeSizes` to those available in the location +- Returns `AllocationResult` with `Price == nil` + +--- + +## Related Types + +### `ImageReference` +**Package:** `github.com/redhat-developer/mapt/pkg/provider/azure/data` + +```go +type ImageReference struct { + // Marketplace image + Publisher string + Offer string + Sku string + // Azure Community Gallery + CommunityImageID string + // Azure Shared Gallery (private or cross-tenant) + SharedImageID string +} +``` + +Exactly one of the three variants should be populated. Use `data.GetImageRef()` to build +a reference from OS type, arch, and version: + +```go +func GetImageRef(osTarget OSType, arch string, version string) (*ImageReference, error) +``` + +Supported `OSType` values: `data.Ubuntu`, `data.RHEL`, `data.Fedora` + +### `SpotArgs` +**Package:** `github.com/redhat-developer/mapt/pkg/provider/api/spot` + +Cross-provider type — see `specs/api/concepts/allocation.md` for field descriptions. 
+ +--- + +## Usage Pattern + +```go +// In every Azure action Create(): +r.allocationData, err = allocation.Allocation(mCtx, &allocation.AllocationArgs{ + ComputeRequest: args.ComputeRequest, + OSType: "Linux", // or "Windows" + ImageRef: imageRef, // from data.GetImageRef() + Location: &defaultLocation, // provider default, ignored if spot + Spot: args.Spot, +}) + +// Then pass results into the deploy function: +// r.allocationData.Location → NetworkArgs.Location, VM location +// r.allocationData.ComputeSizes → pick one for VirtualMachineArgs.VMSize +// r.allocationData.Price → VirtualMachineArgs.SpotPrice (when non-nil) +// r.allocationData.ImageRef → VirtualMachineArgs.Image +``` + +--- + +## When to Extend This API + +Open a spec under `specs/features/azure/` and update this file when: +- Persisting Azure spot allocation to a Pulumi stack (for idempotency, matching AWS behaviour) +- Adding new `OSType` values to `data.GetImageRef()` +- Adding `ExcludedLocations` filtering to on-demand path diff --git a/specs/api/azure/network.md b/specs/api/azure/network.md new file mode 100644 index 000000000..2e2f652c4 --- /dev/null +++ b/specs/api/azure/network.md @@ -0,0 +1,107 @@ +# API: Network (Azure) + +> Concept: [specs/api/concepts/network.md](../concepts/network.md) + +**Package:** `github.com/redhat-developer/mapt/pkg/provider/azure/modules/network` + +Creates the VNet, subnet, public IP, and network interface for any Azure VM target. +Called after the resource group and security group are created in a `deploy()` function. 
+ +--- + +## Types + +### `NetworkArgs` + +```go +type NetworkArgs struct { + Prefix string + ComponentID string + ResourceGroup *resources.ResourceGroup // must be created before calling network.Create() + Location *string // from AllocationResult.Location + SecurityGroup securityGroup.SecurityGroup // must be created before calling network.Create() +} +``` + +Note: unlike AWS, the security group is passed **in** to `network.Create()` rather than +being created after. Creation order in `deploy()` is therefore: +**resource group → security group → network → VM** + +### `Network` + +```go +type Network struct { + Network *network.VirtualNetwork + PublicSubnet *network.Subnet + NetworkInterface *network.NetworkInterface // pass to VirtualMachineArgs.NetworkInterface + PublicIP *network.PublicIPAddress // export as -host +} +``` + +--- + +## Functions + +### `Create` + +```go +func Create(ctx *pulumi.Context, mCtx *mc.Context, args *NetworkArgs) (*Network, error) +``` + +Creates in sequence: +1. VNet (`10.0.0.0/16`) with RunID as name +2. Subnet (`10.0.2.0/24`) +3. Static Standard-SKU public IP +4. NIC attached to subnet + public IP + security group + +All resources are tagged via `mCtx.ResourceTags()`. + +--- + +## CIDRs (fixed, not configurable) + +| Range | Value | +|---|---| +| VNet | `10.0.0.0/16` | +| Subnet | `10.0.2.0/24` | + +--- + +## Usage Pattern + +```go +// 1. Create resource group (outside network module) +rg, err := resources.NewResourceGroup(ctx, ..., &resources.ResourceGroupArgs{ + Location: pulumi.String(*r.allocationData.Location), +}) + +// 2. Create security group (before network) +sg, err := securityGroup.Create(ctx, mCtx, &securityGroup.SecurityGroupArgs{ + Name: resourcesUtil.GetResourceName(*r.prefix, azureTargetID, "sg"), + RG: rg, + Location: r.allocationData.Location, + IngressRules: []securityGroup.IngressRules{securityGroup.SSH_TCP}, +}) + +// 3. 
Create network (takes sg as input) +n, err := network.Create(ctx, mCtx, &network.NetworkArgs{ + Prefix: *r.prefix, + ComponentID: azureTargetID, + ResourceGroup: rg, + Location: r.allocationData.Location, + SecurityGroup: sg, +}) + +// 4. Pass to VM: +// n.NetworkInterface → VirtualMachineArgs.NetworkInteface +// n.PublicIP.IpAddress → export as -host +``` + +--- + +## When to Extend This API + +Open a spec under `specs/features/azure/` and update this file when: +- Adding airgap support for Azure (bastion + private subnet pattern) +- Adding load balancer support for spot VM scenarios +- Making CIDRs configurable diff --git a/specs/api/azure/security-group.md b/specs/api/azure/security-group.md new file mode 100644 index 000000000..9995af4bd --- /dev/null +++ b/specs/api/azure/security-group.md @@ -0,0 +1,104 @@ +# API: Security Group (Azure) + +> Concept: [specs/api/concepts/security-group.md](../concepts/security-group.md) + +**Package:** `github.com/redhat-developer/mapt/pkg/provider/azure/services/network/security-group` + +Creates an Azure Network Security Group (NSG). The NSG is created **before** the network +module is called, because `network.Create()` takes the NSG as an input argument. +See `specs/api/azure/network.md`. + +--- + +## Types + +### `SecurityGroupArgs` + +```go +type SecurityGroupArgs struct { + Name string // resourcesUtil.GetResourceName(prefix, id, "sg") + RG *resources.ResourceGroup // resource group the NSG belongs to + Location *string // from AllocationResult.Location + IngressRules []IngressRules +} +``` + +### `IngressRules` + +```go +type IngressRules struct { + Description string + FromPort int + ToPort int + Protocol string // "tcp", "udp", "*" (all) + CidrBlocks string // source CIDR; empty = allow any source ("*") +} +``` + +### `SecurityGroup` + +```go +type SecurityGroup = *network.NetworkSecurityGroup +``` + +A type alias — the raw Pulumi Azure NSG resource. Passed directly into `NetworkArgs.SecurityGroup`. 
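Because `Create` auto-assigns rule priorities from 1001 upward, the order of `IngressRules` is significant. A toy sketch of that numbering (simplified type, hypothetical helper):

```go
package main

import "fmt"

// Simplified mirror of the Azure IngressRules type (illustrative).
type IngressRules struct{ Description string }

// priorities mimics the auto-assignment described under Create:
// 1001, 1002, ... in declaration order.
func priorities(rules []IngressRules) map[string]int {
	out := make(map[string]int, len(rules))
	for i, r := range rules {
		out[r.Description] = 1001 + i
	}
	return out
}

func main() {
	p := priorities([]IngressRules{{"SSH"}, {"RDP"}})
	fmt.Println(p["SSH"], p["RDP"]) // 1001 1002
}
```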
+ +--- + +## Functions + +### `Create` + +```go +func Create(ctx *pulumi.Context, mCtx *mc.Context, args *SecurityGroupArgs) (SecurityGroup, error) +``` + +Creates the NSG with inbound allow rules. Priorities are auto-assigned starting at 1001, +incrementing by 1 per rule. Egress is unrestricted (Azure default). + +--- + +## Pre-defined Rules + +```go +// Defined in security-group/defaults.go — safe to use directly (not value copies like AWS) +var SSH_TCP = IngressRules{Description: "SSH", FromPort: 22, ToPort: 22, Protocol: "tcp"} +var RDP_TCP = IngressRules{Description: "RDP", FromPort: 3389, ToPort: 3389, Protocol: "tcp"} + +var SSH_PORT int = 22 +var RDP_PORT int = 3389 +``` + +Unlike the AWS equivalent, Azure `IngressRules` do not have a source SG field — only CIDR. +Empty `CidrBlocks` allows from any source (`*`), which is the default for SSH and RDP rules. + +--- + +## Usage Pattern + +```go +sg, err := securityGroup.Create(ctx, mCtx, &securityGroup.SecurityGroupArgs{ + Name: resourcesUtil.GetResourceName(*r.prefix, azureTargetID, "sg"), + RG: rg, + Location: r.allocationData.Location, + IngressRules: []securityGroup.IngressRules{ + securityGroup.SSH_TCP, + // securityGroup.RDP_TCP, // add for Windows targets + }, +}) + +// Pass directly into network: +n, err := network.Create(ctx, mCtx, &network.NetworkArgs{ + SecurityGroup: sg, + ... 
+}) +``` + +--- + +## When to Extend This API + +Open a spec under `specs/features/azure/` and update this file when: +- Adding source NSG reference support (intra-VNet rules) +- Adding egress rule customisation +- Adding new pre-defined rule constants diff --git a/specs/api/azure/virtual-machine.md b/specs/api/azure/virtual-machine.md new file mode 100644 index 000000000..8e8e83498 --- /dev/null +++ b/specs/api/azure/virtual-machine.md @@ -0,0 +1,113 @@ +# API: Virtual Machine (Azure) + +> Concept: [specs/api/concepts/compute.md](../concepts/compute.md) + +**Package:** `github.com/redhat-developer/mapt/pkg/provider/azure/modules/virtual-machine` + +Creates an Azure VM. The Azure equivalent of `specs/api/aws/compute.md`. +Always the last Pulumi resource created in an Azure `deploy()` function. + +--- + +## Types + +### `VirtualMachineArgs` + +```go +type VirtualMachineArgs struct { + Prefix string + ComponentID string + ResourceGroup *resources.ResourceGroup + NetworkInteface *network.NetworkInterface // note: typo in source — "Inteface" not "Interface" + VMSize string // pick one from AllocationResult.ComputeSizes + + SpotPrice *float64 // nil = on-demand; non-nil = spot (sets Priority="Spot") + + Image *data.ImageReference // from AllocationResult.ImageRef + + // Linux: provide PrivateKey (password auth disabled) + PrivateKey *tls.PrivateKey + // Windows: provide AdminPasswd (password auth) + AdminPasswd *random.RandomPassword + + AdminUsername string + UserDataAsBase64 pulumi.StringPtrInput // cloud-init or custom script (base64) + Location string // from AllocationResult.Location +} +``` + +### `VirtualMachine` + +```go +type VirtualMachine = *compute.VirtualMachine +``` + +The returned value is the raw Pulumi Azure VM resource. +Access the public IP via `Network.PublicIP.IpAddress` (not from the VM itself). 
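In practice the credential fields are mutually exclusive — Linux takes `PrivateKey`, Windows takes `AdminPasswd`. A hypothetical validation sketch with simplified types (the real args hold Pulumi resources, and the module does not necessarily enforce this check itself):

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-in for the credential fields of VirtualMachineArgs.
type vmArgs struct {
	PrivateKey  *string // Linux: SSH key, password auth disabled
	AdminPasswd *string // Windows: password auth
}

// authMode picks the auth scheme, requiring exactly one credential.
func authMode(a vmArgs) (string, error) {
	switch {
	case a.PrivateKey != nil && a.AdminPasswd == nil:
		return "ssh-key", nil
	case a.AdminPasswd != nil && a.PrivateKey == nil:
		return "password", nil
	default:
		return "", errors.New("provide exactly one of PrivateKey or AdminPasswd")
	}
}

func main() {
	key := "-----BEGIN OPENSSH PRIVATE KEY-----"
	mode, _ := authMode(vmArgs{PrivateKey: &key})
	fmt.Println(mode) // ssh-key
}
```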
+ +--- + +## Functions + +### `Create` + +```go +func Create(ctx *pulumi.Context, mCtx *mc.Context, args *VirtualMachineArgs) (VirtualMachine, error) +``` + +- **Linux VMs**: sets `LinuxConfiguration` with SSH public key; disables password authentication +- **Windows VMs**: sets `AdminPassword`; no SSH configuration +- **Spot**: sets `Priority = "Spot"` and `BillingProfile.MaxPrice = *SpotPrice` +- **On-demand**: no priority or billing profile set +- Disk: 200 GiB Standard_LRS, created from image +- Boot diagnostics disabled (improves provisioning time) +- Image resolution: handles Marketplace, Community Gallery, and Shared Gallery variants automatically + +--- + +## Image Resolution (internal) + +`convertImageRef()` resolves the `ImageReference` to a Pulumi `ImageReferenceArgs`: + +| ImageReference field set | Azure resource used | +|---|---| +| `CommunityImageID` | Community Gallery (`communityGalleryImageId`) | +| `SharedImageID` (own subscription) | Direct resource ID | +| `SharedImageID` (other subscription) | Shared Gallery (`sharedGalleryImageId`) | +| `Publisher` + `Offer` + `Sku` | Marketplace image; SKU upgraded to Gen2 if available | + +Self-owned detection uses `AZURE_SUBSCRIPTION_ID` env var against the image resource path. 
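That detection reduces to matching the `/subscriptions/<id>/` segment of a standard Azure resource ID. An illustrative sketch — the path layout is the standard Azure convention; the exact comparison in the source may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// isSelfOwned reports whether an image resource ID belongs to the
// given subscription, per the /subscriptions/<id>/ segment of a
// standard Azure resource ID.
func isSelfOwned(imageID, subscriptionID string) bool {
	return strings.Contains(imageID, "/subscriptions/"+subscriptionID+"/")
}

func main() {
	id := "/subscriptions/1111/resourceGroups/rg/providers/Microsoft.Compute/galleries/g/images/i"
	fmt.Println(isSelfOwned(id, "1111")) // true
	fmt.Println(isSelfOwned(id, "2222")) // false
}
```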
+ +--- + +## Usage Pattern + +```go +vm, err := virtualmachine.Create(ctx, mCtx, &virtualmachine.VirtualMachineArgs{ + Prefix: *r.prefix, + ComponentID: azureTargetID, + ResourceGroup: rg, + NetworkInteface: n.NetworkInterface, + VMSize: r.allocationData.ComputeSizes[0], + SpotPrice: r.allocationData.Price, // nil if on-demand + Image: r.allocationData.ImageRef, + AdminUsername: amiUserDefault, + PrivateKey: privateKey, // Linux + // AdminPasswd: password, // Windows instead + UserDataAsBase64: udB64, + Location: *r.allocationData.Location, +}) + +// Export host from the network public IP (not from the VM): +ctx.Export(fmt.Sprintf("%s-%s", *r.prefix, outputHost), n.PublicIP.IpAddress) +``` + +--- + +## When to Extend This API + +Open a spec under `specs/features/azure/` and update this file when: +- Making disk size configurable +- Adding data disk support +- Adding support for VM extensions (currently Windows uses custom script extension directly in some actions) +- Adding `RunCommand` / `Readiness` methods equivalent to `specs/api/aws/compute.md` diff --git a/specs/api/concepts/allocation.md b/specs/api/concepts/allocation.md new file mode 100644 index 000000000..4572e444e --- /dev/null +++ b/specs/api/concepts/allocation.md @@ -0,0 +1,70 @@ +# Concept: Allocation + +Allocation is the pre-stack step that resolves **where** a target will run and **on what hardware**, +before any Pulumi resource is created. Every provider action `Create()` calls its allocation +function first and stores the result on the action struct. + +--- + +## Provider-Agnostic Contract + +1. Accept hardware constraints (`ComputeRequestArgs`) and an optional spot preference (`SpotArgs`). +2. On the **spot path**: query cloud pricing across eligible regions/locations; select best price. +3. On the **on-demand path**: use the provider default region/location; filter to available sizes. +4. Return a result struct that downstream modules consume directly — no re-querying. 
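The four-step contract above can be pictured as a small interface. This is purely illustrative — mapt does not define such an interface; the AWS and Azure allocation functions are free functions with provider-specific argument and result types:

```go
package main

import "fmt"

// Trimmed-down mirrors of the cross-provider types (illustrative).
type ComputeRequestArgs struct{ CPUs, MemoryGib int32 }
type SpotArgs struct{ Spot bool }

// AllocationResult generalises the AWS (Region/AZ) and Azure
// (Location) results into one hosting-place field for this sketch.
type AllocationResult struct {
	HostingPlace string
	Price        *float64 // nil on the on-demand path
	Sizes        []string
}

type Allocator interface {
	Allocate(cr *ComputeRequestArgs, spot *SpotArgs) (*AllocationResult, error)
}

// onDemand is a toy Allocator covering contract step 3 only:
// default hosting place, no price, placeholder sizes.
type onDemand struct{ defaultPlace string }

func (o onDemand) Allocate(cr *ComputeRequestArgs, spot *SpotArgs) (*AllocationResult, error) {
	return &AllocationResult{HostingPlace: o.defaultPlace, Sizes: []string{"size-a"}}, nil
}

func main() {
	var a Allocator = onDemand{defaultPlace: "us-east-1"}
	res, _ := a.Allocate(&ComputeRequestArgs{CPUs: 4, MemoryGib: 16}, nil)
	fmt.Println(res.HostingPlace, res.Price == nil) // us-east-1 true
}
```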
+ +--- + +## Cross-Provider Types + +These types are defined in the shared provider API and used by both AWS and Azure allocation. + +### `ComputeRequestArgs` +**Package:** `github.com/redhat-developer/mapt/pkg/provider/api/compute-request` + +```go +type ComputeRequestArgs struct { + CPUs int32 + GPUs int32 + GPUManufacturer string + GPUModel string + MemoryGib int32 + Arch Arch // Amd64 | Arm64 + NestedVirt bool // true when a profile requires nested virtualisation + ComputeSizes []string // skip selector — use these exact instance types/sizes +} +``` + +When `ComputeSizes` is set, the instance selector is skipped entirely. + +### `SpotArgs` +**Package:** `github.com/redhat-developer/mapt/pkg/provider/api/spot` + +```go +type SpotArgs struct { + Spot bool + Tolerance Tolerance // Lowest | Low | Medium | High | Highest + IncreaseRate int // % above current price for bid (default 30) + ExcludedHostingPlaces []string // regions/locations to skip +} +``` + +--- + +## Provider Comparison + +| | AWS (`specs/api/aws/allocation.md`) | Azure (`specs/api/azure/allocation.md`) | +|---|---|---| +| Location key | Region + AZ (two fields in result) | Location (one field in result) | +| Spot persistence | Separate `spotOption` Pulumi stack — idempotent across runs | No stack — re-evaluated each run | +| Instance selector | `aws/data.NewComputeSelector()` | `azure/data.NewComputeSelector()` | +| Extra input | `AMIName`, `AMIProductDescription` | `OSType`, `ImageRef` | +| Extra output | `AZ *string` | `ImageRef *data.ImageReference` | + +--- + +## Implementation References + +- AWS: `specs/api/aws/allocation.md` +- Azure: `specs/api/azure/allocation.md` +- Shared types: `specs/api/provider-interfaces.md` diff --git a/specs/api/concepts/compute.md b/specs/api/concepts/compute.md new file mode 100644 index 000000000..a4c1ae052 --- /dev/null +++ b/specs/api/concepts/compute.md @@ -0,0 +1,64 @@ +# Concept: Compute + +The compute module is always the **last Pulumi resource created** 
in a `deploy()` function. +It creates the VM or instance, wires it to the network and security group, and runs a +readiness check before the stack is considered complete. + +--- + +## Provider-Agnostic Contract + +1. Accept network outputs (subnet, public IP), credentials (keypair or password), security groups, + instance types/sizes, and userdata from the action. +2. On the **spot path**: use a spot-aware resource (AWS ASG / Azure VM priority). +3. On the **on-demand path**: use a standard instance/VM with direct IP assignment. +4. Run a **readiness check** — a remote command that blocks Pulumi until the machine is ready. +5. Export the host address as `-host`. + +--- + +## Spot Mechanism + +| | AWS (`specs/api/aws/compute.md`) | Azure (`specs/api/azure/virtual-machine.md`) | +|---|---|---| +| Spot resource | `ec2.LaunchTemplate` + `autoscaling.Group` (ASG) | Single VM with `Priority="Spot"` + `MaxPrice` | +| Load balancer | Required — ASG registers target groups | Not applicable | +| Selection source | `AllocationResult.SpotPrice != nil` → `Spot=true` | `AllocationResult.Price != nil` → non-nil `SpotPrice` | + +--- + +## Readiness Check + +| | AWS | Azure | +|---|---|---| +| Method | `Compute.Readiness()` — built into the module | Remote command run directly in the action | +| Linux command | `sudo cloud-init status --long --wait` | Same command, called differently | +| Windows command | `echo ping` | Equivalent inline | +| Timeout | 40 minutes | Varies by action | + +--- + +## Host Address + +| | AWS | Azure | +|---|---|---| +| DNS/IP source | `Compute.GetHostDnsName()` — returns LB DNS or EIP public DNS | `Network.PublicIP.IpAddress` — from the network module, not the VM | +| Export key | `-host` | `-host` | + +--- + +## Provider Comparison + +| | AWS (`specs/api/aws/compute.md`) | Azure (`specs/api/azure/virtual-machine.md`) | +|---|---|---| +| Disk size | Configurable via `DiskSize *int` | Fixed at 200 GiB | +| LB support | Yes (for spot ASG) | No | +| Airgap 
| Yes — bastion passed to `Readiness()` | No | +| Readiness helper | `Compute.Readiness()` + `RunCommand()` | No equivalent yet | + +--- + +## Implementation References + +- AWS: `specs/api/aws/compute.md` +- Azure: `specs/api/azure/virtual-machine.md` diff --git a/specs/api/concepts/network.md b/specs/api/concepts/network.md new file mode 100644 index 000000000..a70afaa27 --- /dev/null +++ b/specs/api/concepts/network.md @@ -0,0 +1,46 @@ +# Concept: Network + +The network module is always the **first Pulumi resource created** in a `deploy()` function. +It establishes the virtual network, subnet, and public IP that all subsequent resources depend on. + +--- + +## Provider-Agnostic Contract + +1. Accept a prefix, component ID, and location/region from `AllocationResult`. +2. Create a virtual network + subnet with fixed CIDRs (`10.0.0.0/16` / `10.0.2.0/24`). +3. Produce a public IP (or EIP) and a subnet reference consumed by the compute module. +4. Return a result struct — downstream modules must not re-query network state. + +--- + +## Creation Order in `deploy()` + +``` +network.Create() ← first +securityGroup.Create() ← depends on network (AWS only; Azure reverses this) +keypair / password +compute.NewCompute() ← last +``` + +Azure is the exception: the security group is created **before** `network.Create()` because +`NetworkArgs.SecurityGroup` is a required input. See `specs/api/concepts/security-group.md`. 
+ +--- + +## Provider Comparison + +| | AWS (`specs/api/aws/network.md`) | Azure (`specs/api/azure/network.md`) | +|---|---|---| +| Airgap support | Yes — two-phase NAT removal, private subnet, bastion | No | +| Load balancer | Optional, created internally when spot is used | Not managed by this module | +| Security group | Created after network; passed to compute | Created before network; passed in as input | +| Public address output | EIP (`NetworkResult.Eip`) or LB DNS | `Network.PublicIP.IpAddress` | +| Bastion | Automatic when `Airgap=true` | Not available | + +--- + +## Implementation References + +- AWS: `specs/api/aws/network.md` +- Azure: `specs/api/azure/network.md` diff --git a/specs/api/concepts/security-group.md b/specs/api/concepts/security-group.md new file mode 100644 index 000000000..57076b794 --- /dev/null +++ b/specs/api/concepts/security-group.md @@ -0,0 +1,57 @@ +# Concept: Security Group + +A security group (or network security group) restricts inbound traffic to the VM/instance. +Both providers create one per target with explicit ingress rules and permissive egress (allow all). + +--- + +## Provider-Agnostic Contract + +1. Accept a list of ingress rules (port range, protocol, source CIDR). +2. Deny all inbound traffic not matched by a rule. +3. Allow all outbound traffic (permissive egress — not configurable today). +4. Return a resource reference consumed by the network or compute module. + +--- + +## Creation Order + +This is the key structural difference between providers: + +| Provider | When created | Passed to | +|---|---|---| +| AWS | After `network.Create()` | `compute.ComputeRequest.SecurityGroups` | +| Azure | Before `network.Create()` | `network.NetworkArgs.SecurityGroup` (required input) | + +The Azure network module attaches the NSG to the NIC internally, so the VM does not receive +the security group directly. + +--- + +## Pre-defined Rules + +Both providers export `SSH_TCP` and `RDP_TCP` rule constants. 
Usage differs: + +| | AWS | Azure | +|---|---|---| +| Type | Value type — **must copy** before setting `CidrBlocks` | Reference — safe to use directly | +| Source SG | Supported via `IngressRules.SG` | Not supported (CIDR only) | +| Protocol wildcard | `"-1"` (all traffic) | `"*"` | +| Priority | Not applicable | Auto-assigned from 1001 upward | + +--- + +## Provider Comparison + +| | AWS (`specs/api/aws/security-group.md`) | Azure (`specs/api/azure/security-group.md`) | +|---|---|---| +| Return type | `*SGResources{SG *ec2.SecurityGroup}` | `SecurityGroup` (alias for `*network.NetworkSecurityGroup`) | +| Source SG in rules | Yes | No | +| VPC/RG binding | Bound to VPC (`SGRequest.VPC`) | Bound to resource group (`SecurityGroupArgs.RG`) | + +--- + +## Implementation References + +- AWS: `specs/api/aws/security-group.md` +- Azure: `specs/api/azure/security-group.md` diff --git a/specs/api/output-contract.md b/specs/api/output-contract.md new file mode 100644 index 000000000..9c0e8d8c0 --- /dev/null +++ b/specs/api/output-contract.md @@ -0,0 +1,101 @@ +# API: Output Contract + +**Package:** `github.com/redhat-developer/mapt/pkg/provider/util/output` + +Defines the files written to `ResultsOutput` after a successful `create`. These files are +the interface between mapt and the CI systems that consume it (Tekton tasks, GitHub workflows, +shell scripts). Changing a filename is a breaking change for all consumers. + +--- + +## Function + +### `output.Write` + +```go +func Write(stackResult auto.UpResult, destinationFolder string, results map[string]string) error +``` + +- `results` maps a Pulumi stack output key → destination filename +- Writes each value as a plain text file with permissions `0600` +- Silently skips outputs that are not strings (logs a debug message) +- No-op when `destinationFolder` is empty + +--- + +## Standard Output Files + +These filenames are stable across all targets that produce them. +CI consumers depend on these exact names. 
+ +| Filename | Content | Targets | +|---|---|---| +| `host` | Hostname or IP to SSH/RDP to | All | +| `username` | OS login username | All | +| `id_rsa` | PEM-encoded SSH private key | All Linux targets, Windows (SSH) | +| `userpassword` | Administrator password (plaintext) | Windows targets | +| `kubeconfig` | kubectl-compatible kubeconfig YAML | SNC, EKS, Kind | +| `kubeadmin-password` | OCP kubeadmin password | SNC only | +| `developer-password` | OCP developer password | SNC only | + +### Airgap Additional Files (written by `bastion.WriteOutputs`) + +| Filename | Content | +|---|---| +| `bastion_host` | Bastion public IP | +| `bastion_username` | Bastion SSH username (`ec2-user`) | +| `bastion_id_rsa` | Bastion SSH private key | + +--- + +## Pulumi Stack Export Keys + +Stack output keys follow the pattern `-`. The `prefix` defaults to `"main"` +when not explicitly set by the caller. + +| Stack output key | → | Filename | +|---|---|---| +| `-host` | | `host` | +| `-username` | | `username` | +| `-id_rsa` | | `id_rsa` | +| `-userpassword` | | `userpassword` | +| `-kubeconfig` | | `kubeconfig` | +| `-kubeadmin-password` | | `kubeadmin-password` | +| `-developer-password` | | `developer-password` | +| `-bastion_id_rsa` | | `bastion_id_rsa` | +| `-bastion_username` | | `bastion_username` | +| `-bastion_host` | | `bastion_host` | + +--- + +## Usage Pattern in `manageResults()` + +```go +func manageResults(mCtx *mc.Context, stackResult auto.UpResult, prefix *string, airgap *bool) error { + results := map[string]string{ + fmt.Sprintf("%s-%s", *prefix, outputUsername): "username", + fmt.Sprintf("%s-%s", *prefix, outputUserPrivateKey): "id_rsa", + fmt.Sprintf("%s-%s", *prefix, outputHost): "host", + } + if *airgap { + if err := bastion.WriteOutputs(stackResult, *prefix, mCtx.GetResultsOutputPath()); err != nil { + return err + } + } + return output.Write(stackResult, mCtx.GetResultsOutputPath(), results) +} +``` + +Output key constants (`outputHost`, `outputUsername`, 
etc.) are defined in the action's +`constants.go` and must match the `ctx.Export(...)` calls in `deploy()`. + +--- + +## When to Change This Contract + +Any change to filenames is **breaking** — update this spec and notify consumers: +- Tekton task definitions that read the files (`tkn/template/`) +- GitHub workflow files that reference the output directory +- Any external documentation or user guides + +New output files can be added without breaking existing consumers. diff --git a/specs/api/provider-interfaces.md b/specs/api/provider-interfaces.md new file mode 100644 index 000000000..9c1adf5f5 --- /dev/null +++ b/specs/api/provider-interfaces.md @@ -0,0 +1,200 @@ +# API: Provider Interfaces (Cross-Cloud) + +**Package:** `github.com/redhat-developer/mapt/pkg/provider/api` + +Defines the hardware-constraint and spot-selection types that are **shared across all cloud +providers**. Both AWS and Azure allocations are driven by the same input structs; each provider +supplies its own implementation of the selector interfaces. + +This layer sits *below* `specs/api/aws/allocation.md` and `specs/api/azure/allocation.md` — +those allocation modules call these selectors internally. Action code interacts with these +types directly (passing `ComputeRequestArgs` and `SpotArgs` into `AllocationArgs`), but +never calls the selector interfaces itself. 
+ +--- + +## Package: `compute-request` + +**Full path:** `github.com/redhat-developer/mapt/pkg/provider/api/compute-request` + +### Types + +```go +type Arch int + +const ( + Amd64 Arch = iota + 1 + Arm64 + MaxResults = 20 // max VM types returned per selector call +) + +type ComputeRequestArgs struct { + CPUs int32 + GPUs int32 + GPUManufacturer string + GPUModel string + MemoryGib int32 + Arch Arch + NestedVirt bool + // Override: skip selector entirely, use these sizes directly + ComputeSizes []string +} + +type ComputeSelector interface { + Select(args *ComputeRequestArgs) ([]string, error) +} +``` + +`ComputeRequestArgs` is embedded in every `AllocationArgs` on both clouds. +If `ComputeSizes` is pre-populated, the selector is skipped — useful when +a specific VM type is required rather than capacity-matched selection. + +### Functions + +```go +func Validate(cpus, memory int32, arch Arch) error +func (a Arch) String() string // "x64" | "Arm64" +``` + +### Provider Implementations + +| Provider | Type | Package | +|---|---|---| +| AWS | `data.ComputeSelector` | `pkg/provider/aws/data` | +| Azure | `data.ComputeSelector` | `pkg/provider/azure/data` | + +**AWS** uses the `amazon-ec2-instance-selector` library to filter by vCPUs, memory, and arch +across all available instance types. + +**Azure** queries the ARM Resource SKUs API, then filters by vCPUs, memory, arch, HyperV Gen2 +support, nested virt eligibility, PremiumIO, and `AcceleratedNetworkingEnabled`. Results are +sorted by vCPU count ascending. Azure also exposes `FilterComputeSizesByLocation()` as a +standalone helper used by the on-demand allocation path. 
+ +--- + +## Package: `spot` + +**Full path:** `github.com/redhat-developer/mapt/pkg/provider/api/spot` + +### Types + +```go +type Tolerance int + +const ( + Lowest Tolerance = iota // eviction rate 0–5% (AWS: placement score ≥ 7) + Low // eviction rate 5–10% + Medium // eviction rate 10–15% + High // eviction rate 15–20% + Highest // eviction rate 20%+ (AWS: placement score ≥ 1) +) + +var DefaultTolerance = Lowest + +type SpotArgs struct { + Spot bool + Tolerance Tolerance + IncreaseRate int // bid price = base × (1 + IncreaseRate/100); default 30% + ExcludedHostingPlaces []string // regions/locations to skip +} + +type SpotRequestArgs struct { + ComputeRequest *cr.ComputeRequestArgs + OS *string // "linux", "windows", "RHEL", "fedora" — affects product filter + ImageName *string // AWS: scopes region search to AMI availability + SpotParams *SpotArgs +} + +type SpotResults struct { + ComputeType []string // AWS: multiple types for ASG; Azure: single type + Price float64 // bid price (already inflated by SafePrice) + HostingPlace string // AWS: region; Azure: location + AvailabilityZone string // AWS only; empty on Azure + ChanceLevel int // not yet populated (TODO in source) +} + +type SpotSelector interface { + Select(mCtx *mc.Context, args *SpotRequestArgs) (*SpotResults, error) +} +``` + +### Functions + +```go +func ParseTolerance(str string) (Tolerance, bool) +// "lowest"|"low"|"medium"|"high"|"highest" → Tolerance + +func SafePrice(basePrice float64, spotPriceIncreaseRate *int) float64 +// Returns basePrice × (1 + rate/100). Default rate = 30%. +// Called by both provider SpotInfo() implementations before returning results. 
+``` + +### Provider Implementations + +| Provider | Type | Selection strategy | +|---|---|---| +| AWS | `data.SpotSelector` | Placement scores × spot price history across all regions | +| Azure | `data.SpotSelector` | Eviction rates × spot price (via Azure Resource Graph) | + +**AWS**: Queries placement scores (API requires an opt-in region as API endpoint) and +spot price history in parallel across all regions. Filters regions where the AMI is +available. Returns up to 8 instance types for the winning AZ (used by the ASG mixed-instances +policy). + +**Azure**: Queries eviction rates and spot prices via Azure Resource Graph KQL. Crosses eviction +rate buckets against allowed tolerance, then picks the lowest-price / lowest-eviction-rate +location. Falls back to price-only ranking if eviction-rate data is unavailable. Returns a +single compute size. + +--- + +## Package: `config/userdata` + +**Full path:** `github.com/redhat-developer/mapt/pkg/provider/api/config/userdata` + +```go +type CloudConfig interface { + CloudConfig() (*string, error) +} +``` + +Implemented by cloud-init / cloud-config builder packages used to generate the +`UserData` / `UserDataAsBase64` field on compute resources. Every target that +injects software at boot implements this interface. 
+ +--- + +## Architecture Summary + +``` +pkg/provider/api/ ← provider-agnostic types & interfaces + compute-request/ + ComputeRequestArgs used in AllocationArgs (both clouds) + ComputeSelector interface + spot/ + SpotArgs, SpotResults used in AllocationArgs (both clouds) + SpotSelector interface + SafePrice() shared bid-price calculation + config/userdata/ + CloudConfig interface for cloud-init builders + +pkg/provider/aws/data/ ← AWS implementations + ComputeSelector ec2-instance-selector + SpotSelector placement scores + price history + +pkg/provider/azure/data/ ← Azure implementations + ComputeSelector ARM Resource SKUs API + SpotSelector Azure Resource Graph (eviction + price) +``` + +--- + +## When to Extend This API + +Open a spec under `specs/features/aws/` or `specs/features/azure/` and update this file when: +- Adding a third cloud provider (implement both interfaces in the new `data` package) +- Adding GPU-based compute selection (currently fields exist but filtering is partial) +- Making `CPUsRange` / `MemoryRange` filters active (currently commented out) +- Populating `SpotResults.ChanceLevel` (currently a TODO in both implementations) +- Adding `ExcludedRegions` to AWS spot path (field exists in `SpotInfoArgs` but not wired into `SpotRequestArgs`) diff --git a/specs/cicd/code-build.md b/specs/cicd/code-build.md new file mode 100644 index 000000000..9b71b25fb --- /dev/null +++ b/specs/cicd/code-build.md @@ -0,0 +1,48 @@ +# Spec: Go Code Build and Test + +## Status +Implemented + +## Context +Runs static analysis, build, and unit tests on every PR and push to `main`. +This is the primary gate for Go code correctness. + +Relevant files: +- `.github/workflows/build-go.yaml` +- `Makefile` — `check`, `build`, `test`, `lint`, `fmt` targets + +## Problem +This feature is implemented. This spec documents the current behaviour. 
+ +## Requirements +- [x] Run on every `pull_request` targeting `main` and every `push` to `main` or a tag +- [x] Run `make check` (build + test + lint + renovate-check) on `ubuntu-24.04` +- [x] Pin Go version (`1.26`) +- [x] Free disk space before build to avoid Docker layer cache exhaustion + +## Out of Scope +- OCI image build (see `oci-build.md`) +- Integration tests on non-Linux hosts (see `hosted-runner-test.md`) + +## Must Reuse +- `make check` — runs `make build`, `make test`, `make lint`, `make renovate-check` in sequence +- `endersonmenezes/free-disk-space@v3` — frees Android/dotnet/Haskell toolchains before build + +## Must Create +- `.github/workflows/build-go.yaml` + +## API Changes +- none + +## Acceptance Criteria + +### Unit + +- Workflow YAML is syntactically valid +- `make check` passes locally on a clean checkout + +### Integration + +- PR to `main` triggers the workflow and it passes +- Push to `main` triggers the workflow and it passes +- A PR introducing a lint error causes the workflow to fail diff --git a/specs/cicd/hosted-runner-test.md b/specs/cicd/hosted-runner-test.md new file mode 100644 index 000000000..4ce695f8f --- /dev/null +++ b/specs/cicd/hosted-runner-test.md @@ -0,0 +1,96 @@ +# Spec: Windows Integration Test via Self-Hosted Runner + +## Status +Implemented + +## Context +Runs the full Go test suite on a real Windows host provisioned by mapt itself. This is +mapt's only integration-level CI gate: the tool provisions its own test environment and +then runs tests inside it. Triggered only when the GitHub Actions runner integration code +changes (`pkg/integrations/github/`). 
+ +Relevant files: +- `.github/workflows/build-img-ghrunner-test.yaml` — builds a dedicated OCI image for the test +- `.github/workflows/build-on-hosted-runner.yaml` — orchestrates provision → test → destroy +- `.github/workflows/provision-hosted-runner.yaml` — reusable: fetches runner token, runs mapt create +- `.github/workflows/destroy-hosted-runner.yaml` — reusable: runs mapt destroy + +## Problem +This feature is implemented. This spec documents the current behaviour, the four-workflow +design, and the always-destroy guarantee. + +## Requirements +- [x] Trigger only on PRs to `main` that change `pkg/integrations/github/*.go` or + `.github/workflows/build-img-ghrunner-test.yaml` (path filter) +- [x] Build a dedicated OCI image tagged `:pr-` for the test run (separate from + the standard PR image) +- [x] Fetch a GitHub Actions runner registration token via the GitHub API before provisioning +- [x] Provision an Azure Windows VM with the GitHub runner pre-installed using mapt, + authenticated via ARM_* secrets and Azure Blob Storage for Pulumi state +- [x] Wait 120 seconds after provisioning for the runner to register with GitHub +- [x] Run `go test -v ./...` on the self-hosted Windows runner +- [x] Destroy the Azure VM after the test regardless of outcome (`if: always()`) +- [x] Pulumi state backend: `azblob://mapt-gh-runner-mapt-state/-` + +## Out of Scope +- Integration tests for AWS targets (not currently automated) +- Integration tests on Linux self-hosted runners +- Testing non-GitHub runner integrations (Cirrus, GitLab) in CI + +## Design +Four workflows are needed because GitHub Actions does not allow a single job to both +provision a self-hosted runner and then use it (the runner must be registered before +the job that runs on it is dispatched). 
The split is: + +``` +build-img-ghrunner-test builds OCI image → artifact + │ (workflow_run trigger) +build-on-hosted-runner orchestrates: + ├── provision-hosted-runner (reusable) + │ fetch token → download artifact → mapt create → sleep 120s + ├── test_run_selfhosted_runner [self-hosted, x64, Windows] + │ go test -v ./... + └── destroy-hosted-runner (reusable) if: always() + download artifact → mapt destroy +``` + +The `destroy-hosted-runner` job runs with `if: always()` and depends on both +`hosted_runner_provision` and `test_run_selfhosted_runner`, ensuring the VM is +destroyed even when tests fail or the provision job partially succeeds. + +## Must Reuse +- `mapt azure windows create` — provisions the Azure Windows VM with `--install-ghactions-runner` +- `mapt azure windows destroy` — tears down the VM and cleans up Pulumi state +- `make oci-build-amd64` / `make oci-save-amd64` — builds and saves the test image +- GitHub runner registration token API: `POST /repos/{owner}/{repo}/actions/runners/registration-token` + +## Must Create +- `.github/workflows/build-img-ghrunner-test.yaml` — path-gated build; uploads artifact +- `.github/workflows/build-on-hosted-runner.yaml` — orchestration workflow +- `.github/workflows/provision-hosted-runner.yaml` — reusable provision workflow +- `.github/workflows/destroy-hosted-runner.yaml` — reusable destroy workflow + +## API Changes +- none + +## Known Gaps +- The 120s sleep is a fixed wait; there is no poll-until-ready mechanism +- Only `amd64` is tested; `arm64` Windows is not covered +- Path filter means changes to non-integration Go code do not trigger Windows tests +- `destroy-hosted-runner` downloads artifact by name `mapt` (singular) but + `build-img-ghrunner-test` uploads as `mapt-` — verify names are consistent + +## Acceptance Criteria + +### Unit + +- All four workflow YAML files are syntactically valid +- The `if: always()` condition on the destroy job is present + +### Integration + +- A PR changing 
`pkg/integrations/github/*.go` triggers the full pipeline +- The self-hosted Windows runner appears in the repository runner list during the test +- `go test -v ./...` passes on the Windows runner +- The Azure VM is destroyed after the run (both on success and failure) +- A PR not touching `pkg/integrations/github/` does not trigger this pipeline diff --git a/specs/cicd/oci-build.md b/specs/cicd/oci-build.md new file mode 100644 index 000000000..9826f2744 --- /dev/null +++ b/specs/cicd/oci-build.md @@ -0,0 +1,59 @@ +# Spec: OCI Image Build and Publish + +## Status +Implemented + +## Context +Builds the `mapt` container image for both `amd64` and `arm64` on every PR and push. +On PR, publishes a multi-arch manifest to `ghcr.io` tagged `:pr-` for downstream +integration testing. On push to `main` or a tag, publishes to `quay.io`. + +Relevant files: +- `.github/workflows/build-oci.yaml` — matrix build + push on merge +- `.github/workflows/push-oci-pr.yml` — combines artifacts and publishes PR image to ghcr.io +- `Makefile` — `oci-build-amd64`, `oci-build-arm64`, `oci-save-*`, `oci-load`, `oci-push` +- `oci/Containerfile` — the container image definition + +## Problem +This feature is implemented. This spec documents the current behaviour and the two-workflow +design needed to produce a multi-arch manifest from a matrix build. 
+ +## Requirements +- [x] Build `amd64` image on `ubuntu-24.04` and `arm64` image on `ubuntu-24.04-arm` in parallel +- [x] Save each image as a `.tar` artifact (`mapt-amd64`, `mapt-arm64`) +- [x] On PR: publish a multi-arch manifest to `ghcr.io/redhat-developer/mapt:pr-` +- [x] On push to `main` or a tag: push both arch images and a multi-arch manifest to `quay.io` +- [x] Install `podman` explicitly on `arm64` runner (not pre-installed) +- [x] PR image publication runs in a separate workflow triggered by `oci-builds` completion + to work around GitHub Actions artifact cross-workflow access restrictions + +## Out of Scope +- Go code build and test (see `code-build.md`) +- Tekton task bundle (see `tkn-bundle.md`) + +## Must Reuse +- `make oci-build-amd64` / `make oci-build-arm64` — builds arch-specific image +- `make oci-save-amd64` / `make oci-save-arm64` — saves image to `.tar` +- `make oci-load` — loads both arch tars back into podman +- `make oci-push` — pushes multi-arch manifest to registry +- `redhat-actions/podman-login@v1` — authenticates to quay.io / ghcr.io + +## Must Create +- `.github/workflows/build-oci.yaml` — matrix build; push job on `push` events +- `.github/workflows/push-oci-pr.yml` — triggered by `oci-builds` completion; publishes PR image + +## API Changes +- none + +## Acceptance Criteria + +### Unit + +- Both workflow YAML files are syntactically valid +- `make oci-build-amd64` completes successfully on an amd64 host + +### Integration + +- PR to `main` produces `ghcr.io/redhat-developer/mapt:pr-` as a multi-arch manifest +- Push to `main` updates `quay.io/redhat-developer/mapt:main` (amd64 + arm64) +- A semver tag push produces a versioned image on quay.io diff --git a/specs/cicd/spec-driven-pr-workflow.md b/specs/cicd/spec-driven-pr-workflow.md new file mode 100644 index 000000000..578c9a039 --- /dev/null +++ b/specs/cicd/spec-driven-pr-workflow.md @@ -0,0 +1,148 @@ +# Spec: Spec-Driven PR Workflow + +## Status +Draft + +## Jira + + +## 
Context +mapt is adopting a spec-anchored development approach (see `specs/project-context.md`). +Today, PRs mix spec and implementation in the same review round, or skip the spec entirely. +There is no CI gate that enforces a spec exists before code is merged, and no structured +signal to trigger an AI agent to implement from a spec. + +Current CI workflows for reference: `specs/features/cicd/code-build.md`, +`specs/features/cicd/oci-build.md`. + +## Problem +- Spec review and code review happen in the same PR, collapsing the gate that makes + spec-first valuable. Reviewers must context-switch between architectural intent and + implementation detail in a single pass. +- An AI agent implementing from a spec has no well-defined trigger or input convention. +- There is no CI check that prevents a malformed or `Draft` spec from being merged as + if it were accepted. + +## Requirements +- [ ] A developer (or agent) opens a **Draft PR** containing only a spec file under + `specs/features/` with `Status: Accepted` +- [ ] CI runs `spec-lint` on the PR: validates that every changed spec file has all + required sections and that `Status` is not `Draft` +- [ ] `spec-lint` fails the PR if any required section is missing or `Status == Draft` +- [ ] A reviewer approves the spec by posting a `/implement` comment on the PR +- [ ] The `/implement` comment triggers a GitHub Actions workflow that runs a Claude Code + agent, which reads the spec and adds implementation commit(s) to the same branch +- [ ] The agent commits with message `feat(): implement from specs/` +- [ ] After the agent commit, `code-build` re-runs automatically; the PR is promoted from + Draft to Ready for Review +- [ ] A second review round covers only the implementation (spec already approved) +- [ ] On merge, existing workflows (`code-build`, `oci-builds`, `tkn-bundle`) run unchanged + +## Out of Scope +- Changes to `code-build.md`, `oci-build.md`, or `tkn-bundle.md` workflows +- Jira auto-creation of spec 
stubs from issues (follow-on)
+- Two-PR flow (spec merged separately before implementation PR) — possible future evolution
+- Automated integration tests triggered by merge (separate spec)
+
+## Design
+
+### PR Lifecycle
+
+```
+1. Dev opens Draft PR
+   branch: feat/aws-xyz-host
+   commit: "spec: aws xyz host" ← specs/features/aws/xyz-host.md (Status: Accepted)
+
+2. CI: spec-lint
+   ✓ required sections present
+   ✓ Status == Accepted (not Draft)
+   ✓ Must Reuse references valid specs/api/ paths
+   ✗ fails → PR blocked, dev fixes spec
+
+3. Reviewer reads spec only
+   → posts /implement comment
+
+4. CI: implement workflow triggers
+   → Claude Code agent runs with:
+     constitution: specs/project-context.md
+     api context: all specs/api/ files referenced in Must Reuse
+     task: implement all files in Must Create, calling Must Reuse modules
+   → agent pushes commit(s) to the PR branch
+
+5. CI re-runs: make build && make test
+   PR auto-promoted Draft → Ready for Review
+
+6. Reviewer does code review (implementation only)
+   → merge
+```
+
+### spec-lint Rules
+
+| Rule | Check |
+|---|---|
+| Required sections | `## Status`, `## Context`, `## Problem`, `## Requirements`, `## Must Reuse`, `## Must Create`, `## Acceptance Criteria` all present |
+| Status not Draft | Value is `Accepted`, `Implemented`, or `Deprecated` |
+| Must Reuse not empty | Section body has at least one bullet point |
+| Must Create not empty | Section body has at least one file path |
+
+### `/implement` Trigger
+
+An `issue_comment` GitHub Actions event listens for comments containing `/implement` on PRs.
+Before dispatching the agent, the workflow verifies the commenter has `write` permission on
+the repository. Either an `/implement` comment or a `spec-approved` label can trigger the agent.
+
+### Agent Context
+
+The agent receives:
+1. The changed spec file (from the PR diff)
+2. `specs/project-context.md` (mandatory module sequences, naming rules)
+3. All `specs/api/` files referenced in Must Reuse
+4. 
Read-only access to the existing codebase + +The agent is constrained to create only the files listed in Must Create, call only the +modules listed in Must Reuse in the documented order, and verify `make build` passes +before committing. + +## Must Reuse + +Existing workflows that must **not** be modified: +- `.github/workflows/build-go.yaml` — `code-build` +- `.github/workflows/build-oci.yaml` — `oci-builds` +- `.github/workflows/tkn-bundle.yaml` — `tkn-bundle` + +## Must Create + +| File | Purpose | +|---|---| +| `.github/workflows/spec-lint.yaml` | Runs `scripts/spec-lint.sh` on PR; blocks merge if spec is malformed or Draft | +| `.github/workflows/spec-implement.yaml` | Listens for `/implement` comment; verifies write access; dispatches agent | +| `scripts/spec-lint.sh` | Shell script: checks required sections, Status value, non-empty Must Reuse/Must Create | + +## API Changes +- none + +## Tasks +- [ ] Write `scripts/spec-lint.sh` — section presence, Status != Draft, non-empty sections +- [ ] Test `spec-lint.sh` locally against all existing specs (all should pass) +- [ ] Write `.github/workflows/spec-lint.yaml` — triggers on PR, runs lint against changed `specs/features/**/*.md` +- [ ] Write `.github/workflows/spec-implement.yaml` — `issue_comment` trigger, write-access guard, agent dispatch +- [ ] Define agent invocation: model selection, system prompt assembly, output commit convention +- [ ] Add `spec-approved` label to the GitHub repository +- [ ] Update `specs/features/cicd/` with the two new workflow specs once implemented +- [ ] `make build && make test` passes (no Go changes expected) + +## Acceptance Criteria + +### Unit + +- `scripts/spec-lint.sh specs/features/aws/rhel-host.md` exits 0 +- `scripts/spec-lint.sh specs/features/000-template.md` exits non-zero (Status is `Draft`) +- `scripts/spec-lint.sh` exits non-zero on a spec missing the `## Must Reuse` section + +### Integration + +- A Draft PR with `Status: Draft` causes `spec-lint` to fail and 
block merge +- A Draft PR with `Status: Accepted` and all required sections causes `spec-lint` to pass +- Posting `/implement` on a passing-spec PR triggers the agent workflow +- Agent commit appears on the branch; `make build && make test` passes +- PR is promoted from Draft to Ready for Review automatically after the agent commit diff --git a/specs/cicd/tkn-bundle.md b/specs/cicd/tkn-bundle.md new file mode 100644 index 000000000..4010fad40 --- /dev/null +++ b/specs/cicd/tkn-bundle.md @@ -0,0 +1,53 @@ +# Spec: Tekton Task Bundle Validation and Publish + +## Status +Implemented + +## Context +Validates the generated Tekton task YAML files (`tkn/*.yaml`) against a real Tekton installation +on every PR and push. On push to `main` or a tag, publishes the task bundle to `quay.io` for +downstream consumers (Konflux, RHTAP pipelines). + +Relevant files: +- `.github/workflows/tkn-bundle.yaml` +- `tkn/` — generated Tekton task YAML files (generated by `make tkn-update`) +- `Makefile` — `tkn-update`, `tkn-push` targets + +## Problem +This feature is implemented. This spec documents the current behaviour. 
+ +## Requirements +- [x] On every PR and push: spin up a Kind cluster, deploy Tekton `v0.44.5` (minimum + supported version), and apply all `tkn/*.yaml` — fails the workflow if any task fails + to apply +- [x] On push to `main` or a tag only: push the Tekton task bundle to `quay.io` via + `make tkn-push` (gated on `tkn-check` success) +- [x] `tkn-build` (push step) only runs after `tkn-check` passes + +## Out of Scope +- Generating the Tekton YAML files — that is done locally by `make tkn-update` before commit +- OCI image build (see `oci-build.md`) + +## Must Reuse +- `helm/kind-action@v1` — creates the Kind cluster +- `make tkn-push` — pushes the bundle to quay.io +- `redhat-actions/podman-login@v1` — authenticates to quay.io + +## Must Create +- `.github/workflows/tkn-bundle.yaml` + +## API Changes +- none + +## Acceptance Criteria + +### Unit + +- Workflow YAML is syntactically valid +- `kubectl apply -f tkn` succeeds against a local Kind + Tekton cluster + +### Integration + +- PR to `main` runs `tkn-check` and passes when `tkn/*.yaml` are valid +- A malformed task YAML causes `tkn-check` to fail and blocks the PR +- Push to `main` runs `tkn-build` and publishes the bundle to quay.io diff --git a/specs/cmd/azure-params.md b/specs/cmd/azure-params.md new file mode 100644 index 000000000..505079c2c --- /dev/null +++ b/specs/cmd/azure-params.md @@ -0,0 +1,42 @@ +# CLI Params: Azure Shared + +**Package:** `github.com/redhat-developer/mapt/cmd/mapt/cmd/azure/params` +**File:** `cmd/mapt/cmd/azure/params/params.go` + +Azure-provider-specific shared params, used alongside the cross-provider params in +`specs/cmd/params.md`. Every Azure `create` command that accepts a location registers +this flag. + +--- + +## Location + +```go +const ( + Location = "location" + LocationDesc = "If spot is passed location will be calculated based on spot results. Otherwise location will be used to create resources." 
+ LocationDefault = "westeurope" +) +``` + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--location` | string | `westeurope` | Azure region; ignored when `--spot` is set (spot selects the location) | + +No `Add*Flags` helper — each cmd registers it directly: + +```go +flagSet.StringP(azureParams.Location, "", azureParams.LocationDefault, azureParams.LocationDesc) +``` + +Mapped to `AllocationArgs.Location` inside the action. When spot is active the allocation +module ignores this value and picks the best-priced region automatically. + +See `specs/api/azure/allocation.md`. + +--- + +## When to Extend + +Update this file when adding new Azure-wide shared params (e.g. resource group prefix, +subscription override). diff --git a/specs/cmd/params.md b/specs/cmd/params.md new file mode 100644 index 000000000..b8c259327 --- /dev/null +++ b/specs/cmd/params.md @@ -0,0 +1,273 @@ +# CLI Params Layer + +**Package:** `github.com/redhat-developer/mapt/cmd/mapt/cmd/params` +**File:** `cmd/mapt/cmd/params/params.go` + +Central registry for all reusable CLI flags. Every flag that appears on more than one +`create` command is defined here, not in the individual cmd files. Individual cmd files +only define flags that are unique to that target. + +--- + +## The Three-Part Pattern + +Every flag group follows the same structure: + +### 1. Constants + +```go +// Exported: used by cmd files to read values via viper +const FlagName string = "flag-name" +const FlagNameDesc string = "human readable description" +const FlagNameDefault string = "default-value" // optional + +// Unexported: only used within params.go +const internalFlag string = "internal-flag-name" +``` + +Use **exported** constants when the cmd file needs to call `viper.GetX(params.FlagName)` +directly. Use **unexported** when the value is only read inside a `*Args()` helper in +this package. + +### 2. `Add*Flags(fs *pflag.FlagSet)` + +Registers flags on the flagset passed in. 
Called once per `create` command that needs +this group: + +```go +func AddSpotFlags(fs *pflag.FlagSet) { + fs.Bool(spot, false, spotDesc) + fs.StringP(spotTolerance, "", spotToleranceDefault, spotToleranceDesc) + fs.StringSliceP(spotExcludedHostedZones, "", []string{}, spotExcludedHostedZonesDesc) +} +``` + +### 3. `*Args() *SomeType` + +Reads values from viper and returns a populated struct (or `nil` if the feature is not +enabled). Called inside the cmd's `RunE` when building the action args: + +```go +func SpotArgs() *spotTypes.SpotArgs { + if viper.IsSet(spot) { + return &spotTypes.SpotArgs{ ... } + } + return nil // nil = feature not requested +} +``` + +Returning `nil` is the canonical "not configured" signal — action code checks for nil +before using the result. + +--- + +## How Viper Binding Works + +Each `create` command binds its flagset to viper at the start of `RunE`: + +```go +RunE: func(cmd *cobra.Command, args []string) error { + if err := viper.BindPFlags(cmd.Flags()); err != nil { + return err + } + // now viper.GetX(flagName) works for all registered flags + ... +} +``` + +After binding, all flag values are accessible via `viper.GetString`, `viper.GetBool`, +`viper.GetInt32`, `viper.GetStringSlice`, `viper.IsSet`, etc. + +--- + +## Existing Flag Groups + +### Common (every command) + +```go +func AddCommonFlags(fs *pflag.FlagSet) +``` + +| Flag | Type | Description | +|---|---|---| +| `project-name` | string | Pulumi project name | +| `backed-url` | string | State backend URL (`file://`, `s3://`, `azblob://`) | + +Added to the parent command's `PersistentFlags` so it applies to all subcommands. 
+ +--- + +### Debug + +```go +func AddDebugFlags(fs *pflag.FlagSet) +``` + +| Flag | Type | Default | Description | +|---|---|---|---| +| `debug` | bool | false | Enable debug traces | +| `debug-level` | uint | 3 | Verbosity 1–9 | + +--- + +### Compute Request + +```go +func AddComputeRequestFlags(fs *pflag.FlagSet) +func ComputeRequestArgs() *cr.ComputeRequestArgs +``` + +| Flag | Type | Default | Description | +|---|---|---|---| +| `cpus` | int32 | 8 | vCPU count | +| `memory` | int32 | 64 | RAM in GiB | +| `gpus` | int32 | 0 | GPU count | +| `gpu-manufacturer` | string | — | e.g. `NVIDIA` | +| `nested-virt` | bool | false | Require nested virtualisation support | +| `compute-sizes` | []string | — | Override selector; comma-separated instance types | +| `arch` | string | `x86_64` | `x86_64` or `arm64` | + +`ComputeRequestArgs()` maps `arch` to `cr.Amd64` / `cr.Arm64`. When `--snc` is set, +`NestedVirt` is forced true regardless of `--nested-virt`. + +See `specs/api/provider-interfaces.md` for `ComputeRequestArgs` type. + +--- + +### Spot + +```go +func AddSpotFlags(fs *pflag.FlagSet) +func SpotArgs() *spotTypes.SpotArgs // returns nil when --spot not set +``` + +| Flag | Type | Default | Description | +|---|---|---|---| +| `spot` | bool | false | Enable spot selection | +| `spot-eviction-tolerance` | string | `lowest` | `lowest`/`low`/`medium`/`high`/`highest` | +| `spot-increase-rate` | int | 30 | Bid price % above current price | +| `spot-excluded-regions` | []string | — | Regions to skip | + +Returns `nil` when `--spot` is not set — this signals on-demand to allocation. + +See `specs/api/provider-interfaces.md` for `SpotArgs` type. 
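For intuition on `spot-increase-rate`: the flag expresses a percentage markup over the observed spot price. The exact formula lives in the spot/allocation modules — the helper below is an assumption for illustration only, not the verified mapt computation:

```go
package main

import "fmt"

// bidPrice is a hypothetical helper showing the intent of
// --spot-increase-rate: bid N percent above the current spot price.
// (Illustrative only — not the verified mapt formula.)
func bidPrice(current float64, increaseRatePct int) float64 {
	return current * float64(100+increaseRatePct) / 100
}

func main() {
	// 30% above a $0.50/h observed spot price.
	fmt.Println(bidPrice(0.50, 30))
}
```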
+ +--- + +### Network (to be added — `specs/features/aws/vpc-endpoints.md`) + +```go +func AddNetworkFlags(fs *pflag.FlagSet) +func NetworkEndpoints() []string +``` + +| Flag | Type | Default | Description | +|---|---|---|---| +| `endpoints` | []string | — | VPC endpoints to create: `s3`, `ecr`, `ssm` | + +--- + +### GitHub Actions Runner + +```go +func AddGHActionsFlags(fs *pflag.FlagSet) +func GithubRunnerArgs() *github.GithubRunnerArgs // returns nil when token not set +``` + +| Flag | Type | Description | +|---|---|---| +| `ghactions-runner-token` | string | Registration token | +| `ghactions-runner-repo` | string | Repository or org URL | +| `ghactions-runner-labels` | []string | Runner labels | + +Returns `nil` when `--ghactions-runner-token` is not set. +Platform and arch are derived from `--arch`; not user-configurable at CLI level. + +--- + +### Cirrus CI Persistent Worker + +```go +func AddCirrusFlags(fs *pflag.FlagSet) +func CirrusPersistentWorkerArgs() *cirrus.PersistentWorkerArgs // returns nil when token not set +``` + +| Flag | Type | Description | +|---|---|---| +| `it-cirrus-pw-token` | string | Cirrus registration token | +| `it-cirrus-pw-labels` | map[string]string | Labels as `key=value` pairs | + +Returns `nil` when `--it-cirrus-pw-token` is not set. + +--- + +### GitLab Runner + +```go +func AddGitLabRunnerFlags(fs *pflag.FlagSet) +func GitLabRunnerArgs() *gitlab.GitLabRunnerArgs // returns nil when token not set +``` + +| Flag | Type | Default | Description | +|---|---|---|---| +| `glrunner-token` | string | — | GitLab Personal Access Token | +| `glrunner-project-id` | string | — | Project ID (mutually exclusive with group ID) | +| `glrunner-group-id` | string | — | Group ID (mutually exclusive with project ID) | +| `glrunner-url` | string | `https://gitlab.com` | GitLab instance URL | +| `glrunner-tags` | []string | — | Runner tags | + +Returns `nil` when `--glrunner-token` is not set. 
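The project-ID/group-ID mutual exclusion noted above reduces to a small validation step. A sketch with a hypothetical helper (the real check may live in the cmd's `RunE` or in the action — this is illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// validateGitLabScope rejects the case where both --glrunner-project-id
// and --glrunner-group-id are provided. Hypothetical helper name.
func validateGitLabScope(projectID, groupID string) error {
	if projectID != "" && groupID != "" {
		return errors.New("glrunner-project-id and glrunner-group-id are mutually exclusive")
	}
	return nil
}

func main() {
	fmt.Println(validateGitLabScope("42", ""))       // project-scoped runner: ok
	fmt.Println(validateGitLabScope("42", "devops")) // both set: error
}
```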
+ +--- + +### Serverless / Destroy + +```go +// No Add* helper — these are registered directly in each command +``` + +| Flag | Type | Description | Command | +|---|---|---|---| +| `timeout` | string | Go duration string — schedules self-destruct | create | +| `serverless` | bool | Use role-based credentials (ECS context) | destroy | +| `force-destroy` | bool | Destroy even if locked | destroy | +| `keep-state` | bool | Keep Pulumi state in S3 after destroy | destroy | + +--- + +## Arch Conversion Helpers + +Each integration has its own `Platform`/`Arch` type. Params provides conversion helpers — unexported for the Linux-host flows, exported for the MAC commands: + +```go +func linuxArchAsGithubActionsArch(arch string) *github.Arch // "x86_64" → &Amd64 +func linuxArchAsCirrusArch(arch string) *cirrus.Arch +func linuxArchAsGitLabArch(arch string) *gitlab.Arch + +// Exported variants for MAC commands (different arch string convention): +func MACArchAsCirrusArch(arch string) *cirrus.Arch // "x86" → &Amd64 +func MACArchAsGitLabArch(arch string) *gitlab.Arch +``` + +--- + +## How to Add a New Flag Group + +1. **Add constants** in the `const` block — unexported flag name, exported description +2. **Add `Add*Flags(fs *pflag.FlagSet)`** — register each flag with the appropriate type + (`Bool`, `StringP`, `StringSliceP`, `Int32P`, `StringToStringP`) +3. **Add `*Args() *SomeType`** — read from viper and return a populated struct or `nil` +4. **Call `Add*Flags`** in each cmd create function that needs the group +5. **Call `*Args()`** in the `RunE` body when building the action args struct + +For a single-target flag (not shared), define it with a local constant in the target's +cmd file instead, and read it directly with `viper.GetX(localConst)`. + +--- + +## When to Extend This File + +Update this spec when: +- Adding a new shared flag group (e.g. 
`AddNetworkFlags` for VPC endpoints) +- Adding flags to an existing group +- Adding a new arch conversion helper for a new integration diff --git a/specs/features/000-template.md b/specs/features/000-template.md new file mode 100644 index 000000000..613053b08 --- /dev/null +++ b/specs/features/000-template.md @@ -0,0 +1,85 @@ +# Spec: [Title] + +## Status + +Draft + +## Context +Brief background. What area of the codebase this touches. Links to related existing files. + +## Problem +What is missing, broken, or needs improvement. + +## Requirements +- [ ] Concrete, testable requirement +- [ ] Another requirement + +## Out of Scope +Explicit list of what this spec does NOT cover. + +## Design + + +## Must Reuse +Existing modules and functions that MUST be called. Do not reimplement this logic. +Reference the API spec for each module's full type signatures. + + + + +## Must Create +New files to write. Everything not listed under Must Reuse. + +- `pkg/provider/<provider>/action/<target>/<target>.go` +- `pkg/provider/<provider>/action/<target>/constants.go` +- `pkg/target/host/<target>/` or `pkg/target/service/<target>/` +- `cmd/mapt/cmd/<provider>/hosts/<target>.go` +- `tkn/template/infra-<provider>-<target>.yaml` + +## API Changes +List any `specs/api/` files that need updating alongside this feature. + +- none + +## Tasks + +- [ ] Create `constants.go` — stackName, componentID, AMI regex, ports, disk size +- [ ] Create `<target>.go` — Args struct, `Create()`, `Destroy()`, `deploy()`, `manageResults()`, `securityGroups()` +- [ ] Create cloud-config / userdata builder in `pkg/target/` +- [ ] Create Cobra command in `cmd/` +- [ ] Create Tekton template in `tkn/template/` +- [ ] Verify all Must Reuse calls are present and in the mandatory order +- [ ] Update any `specs/api/` files listed in API Changes +- [ ] `make build && make test` passes + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- Specific observable outcome (command runs, output file exists, SSH works, etc.) 
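The required-sections rule that `spec-lint` applies to documents following this template can be sketched in a few lines. The real linter is `scripts/spec-lint.sh`, a shell script — this Go version is illustrative only:

```go
package main

import (
	"fmt"
	"strings"
)

// requiredSections mirrors the spec-lint rules table.
var requiredSections = []string{
	"## Status", "## Context", "## Problem", "## Requirements",
	"## Must Reuse", "## Must Create", "## Acceptance Criteria",
}

// missingSections returns the required headings absent from a spec body.
func missingSections(spec string) []string {
	var missing []string
	for _, s := range requiredSections {
		if !strings.Contains(spec, s) {
			missing = append(missing, s)
		}
	}
	return missing
}

func main() {
	spec := "# Spec: X\n## Status\nAccepted\n"
	fmt.Println(missingSections(spec)) // everything except ## Status is missing
}
```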
diff --git a/specs/features/aws/airgap-network.md b/specs/features/aws/airgap-network.md new file mode 100644 index 000000000..d43a3a9d5 --- /dev/null +++ b/specs/features/aws/airgap-network.md @@ -0,0 +1,71 @@ +# Spec: Airgap Network Topology + +## Status +Implemented + +## Context +An optional network topology that isolates the target instance from the public internet while +still allowing SSH access via a bastion host. Implemented as a two-phase Pulumi stack update. + +Key files: +- `pkg/provider/aws/modules/network/airgap/airgap.go` — VPC/subnet creation +- `pkg/provider/aws/modules/network/network.go` — dispatcher (standard vs airgap) +- `pkg/provider/aws/modules/bastion/bastion.go` — bastion host resource + +The same Pulumi stack is applied twice: +1. Phase 1 (`connectivity = ON`): NAT gateway present → machine can reach internet for bootstrapping +2. Phase 2 (`connectivity = OFF`): NAT gateway removed → machine loses egress, bastion still accessible + +## Problem +This feature is implemented for AWS RHEL and Windows. This spec documents the design and gaps. 
+ +## Requirements +- [x] Create a VPC with a public subnet (has internet gateway + NAT gateway in phase 1) and a private subnet (target) +- [x] Phase 1: private subnet has route to NAT gateway; cloud-init runs and machine is bootstrapped +- [x] Phase 2: NAT gateway is removed; private subnet loses egress; machine is isolated +- [x] Bastion host in the public subnet provides SSH proxy access throughout both phases +- [x] Write bastion output files alongside target files (`bastion-host`, `bastion-username`, `bastion-id_rsa`) +- [x] Targets using airgap: RHEL, Windows Server (AWS); extensible to other targets + +## Out of Scope +- Azure airgap (not currently implemented) +- Egress filtering via security groups or NACLs (only NAT removal is used) + +## Affected Areas +- `pkg/provider/aws/modules/network/` — standard and airgap network implementations +- `pkg/provider/aws/modules/bastion/` — bastion host and output writing +- `pkg/provider/aws/action/rhel/rhel.go` — `createAirgapMachine()` orchestration +- `pkg/provider/aws/action/windows/windows.go` — same + +## Known Gaps / Improvement Ideas +- The error from phase 1 of `createAirgapMachine()` is swallowed in both rhel and windows actions + (`return nil` instead of `return err`) — this is a bug; phase 2 should not run if phase 1 fails +- No validation that `Airgap=true` requires a remote BackedURL (unlike serverless timeout which does validate) + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- `mapt aws rhel create --airgap ...` provisions an instance accessible only through the bastion +- Direct SSH to the target host's public IP fails; SSH via bastion succeeds +- Phase 2 is confirmed complete by checking the target cannot reach an external host + +--- + +## Command + +This is a cross-cutting feature, not a standalone command. It is activated via the +`--airgap` flag on individual target create commands: + +``` +mapt aws rhel create --airgap ... 
mapt aws windows create --airgap ... +``` + +The `--airgap` flag is defined locally in each host cmd file (not in shared params). +No additional flags are specific to the airgap feature itself — the two-phase +connectivity behaviour is controlled internally by the action. diff --git a/specs/features/aws/eks.md b/specs/features/aws/eks.md new file mode 100644 index 000000000..71783f7e4 --- /dev/null +++ b/specs/features/aws/eks.md @@ -0,0 +1,85 @@ +# Spec: AWS EKS (Elastic Kubernetes Service) + +## Status +Implemented + +## Context +Provisions a managed EKS cluster on AWS. Entry point: `pkg/provider/aws/action/eks/`. +CLI: `cmd/mapt/cmd/aws/services/eks.go`. + +Unlike the SNC target, EKS uses the AWS-managed control plane and worker node groups +rather than a self-managed cluster on a single EC2 instance. + +## Problem +This feature is implemented. This spec documents the current behaviour. + +## Requirements +- [x] Provision an EKS cluster with a managed node group +- [x] Support configurable Kubernetes version +- [x] Support spot instances for worker nodes +- [x] Write kubeconfig output file +- [x] `destroy` cleans up all cluster resources and S3 state + +## Out of Scope +- OpenShift SNC (see `openshift-snc.md`) +- Azure AKS (see `012-azure-aks.md`) +- AWS Kind (see `kind.md`) + +## Affected Areas +- `pkg/provider/aws/action/eks/` — orchestration +- `cmd/mapt/cmd/aws/services/eks.go` +- `tkn/template/infra-aws-kind.yaml` (verify — may share template) + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- `mapt aws eks create ...` provisions a functioning EKS cluster +- Exported kubeconfig allows `kubectl get nodes` to return Ready nodes +- `mapt aws eks destroy ...` removes all resources + +--- + +## Command + +``` +mapt aws eks create [flags] +mapt aws eks destroy [flags] +``` + +### Shared flag groups (`specs/cmd/params.md`) + +| Group | Flags added | +|---|---| +| Common | `--project-name`, `--backed-url` | +| 
Compute Request | `--cpus`, `--memory`, `--arch`, `--nested-virt`, `--compute-sizes` | +| Spot | `--spot`, `--spot-eviction-tolerance`, `--spot-increase-rate`, `--spot-excluded-regions` | + +Note: no integration flags, no timeout (EKS cluster lifecycle is not self-destructed). + +### Target-specific flags (create only) + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--version` | string | `1.31` | Kubernetes version | +| `--workers-desired` | int | `1` | Worker node group desired size | +| `--workers-max` | int | `3` | Worker node group maximum size | +| `--workers-min` | int | `1` | Worker node group minimum size | +| `--addons` | []string | — | EKS managed addons to install (comma-separated) | +| `--load-balancer-controller` | bool | false | Install AWS Load Balancer Controller | +| `--excluded-zone-ids` | []string | — | AZ IDs to exclude from node placement | +| `--arch` | string | `x86_64` | Worker node architecture | +| `--conn-details-output` | string | — | Path to write kubeconfig | +| `--tags` | map | — | Resource tags | + +### Destroy flags + +`--force-destroy`, `--keep-state` (no `--serverless`) + +### Action args struct populated + +`eks.EKSArgs` → `pkg/provider/aws/action/eks/eks.go` diff --git a/specs/features/aws/fedora-host.md b/specs/features/aws/fedora-host.md new file mode 100644 index 000000000..b4bdfff54 --- /dev/null +++ b/specs/features/aws/fedora-host.md @@ -0,0 +1,80 @@ +# Spec: AWS Fedora Host + +## Status +Implemented + +## Context +Provisions a Fedora EC2 instance on AWS. Entry point: `pkg/provider/aws/action/fedora/`. +Cloud-config: `pkg/target/host/fedora/`. CLI: `cmd/mapt/cmd/aws/hosts/fedora.go`. + +Fedora on AWS is used for Fedora-specific testing. The instance uses a cloud-init config +with the Fedora cloud image. + +## Problem +This feature is implemented. This spec documents the current behaviour. 
+ +## Requirements +- [x] Provision a Fedora EC2 instance (latest or specified version) +- [x] Support spot instance allocation +- [x] Support optional CI integrations (GitHub runner, Cirrus worker, GitLab runner) +- [x] Write output files: `host`, `username`, `id_rsa` +- [x] `destroy` cleans up stack, spot stack, S3 state + +## Out of Scope +- Azure Fedora (see docs/azure/fedora.md — currently Azure Linux target) +- RHEL (subscription-managed — see `rhel-host.md`) + +## Affected Areas +- `pkg/provider/aws/action/fedora/` +- `pkg/target/host/fedora/` — cloud-config +- `cmd/mapt/cmd/aws/hosts/fedora.go` +- `tkn/template/infra-aws-fedora.yaml` + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- `mapt aws fedora create ...` provisions an accessible Fedora instance +- SSH access works with the output key +- `mapt aws fedora destroy ...` removes all resources + +--- + +## Command + +``` +mapt aws fedora create [flags] +mapt aws fedora destroy [flags] +``` + +### Shared flag groups (`specs/cmd/params.md`) + +| Group | Flags added | +|---|---| +| Common | `--project-name`, `--backed-url` | +| Compute Request | `--cpus`, `--memory`, `--arch`, `--nested-virt`, `--compute-sizes` | +| Spot | `--spot`, `--spot-eviction-tolerance`, `--spot-increase-rate`, `--spot-excluded-regions` | +| Integrations | `--ghactions-runner-*`, `--it-cirrus-pw-*`, `--glrunner-*` | + +### Target-specific flags (create only) + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--version` | string | `41` | Fedora Cloud major version | +| `--arch` | string | `x86_64` | `x86_64` or `arm64` | +| `--airgap` | bool | false | Provision as airgap machine | +| `--timeout` | string | — | Self-destruct duration | +| `--conn-details-output` | string | — | Path to write connection files | +| `--tags` | map | — | Resource tags | + +### Destroy flags + +`--serverless`, `--force-destroy`, `--keep-state` + +### Action args struct populated + 
+`fedora.FedoraArgs` → `pkg/provider/aws/action/fedora/fedora.go` diff --git a/specs/features/aws/kind.md b/specs/features/aws/kind.md new file mode 100644 index 000000000..accbb727c --- /dev/null +++ b/specs/features/aws/kind.md @@ -0,0 +1,83 @@ +# Spec: AWS Kind Cluster + +## Status +Implemented + +## Context +Provisions a Kind (Kubernetes-in-Docker) cluster on an EC2 instance. +Entry point: `pkg/provider/aws/action/kind/`. Cloud-config: `pkg/target/service/kind/`. +CLI: `cmd/mapt/cmd/aws/services/kind.go`. + +Kind is a lighter-weight alternative to EKS/SNC for CI pipelines that need a disposable +Kubernetes cluster without managed-service overhead. + +## Problem +This feature is implemented. This spec documents the current behaviour. + +## Requirements +- [x] Provision an EC2 instance and install Kind + Docker via cloud-init +- [x] Create a Kind cluster during cloud-init; export kubeconfig +- [x] Support configurable Kubernetes version (via Kind node image) +- [x] Support spot instance allocation +- [x] Write output files: `host`, `username`, `id_rsa`, `kubeconfig` +- [x] `destroy` cleans up stack and S3 state + +## Out of Scope +- Azure Kind (see `014-azure-kind.md`) +- EKS managed clusters (see `eks.md`) + +## Affected Areas +- `pkg/provider/aws/action/kind/` +- `pkg/target/service/kind/` — cloud-config generation and test +- `cmd/mapt/cmd/aws/services/kind.go` +- `tkn/template/infra-aws-kind.yaml` + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- `mapt aws kind create ...` produces a working kubeconfig +- `kubectl get nodes` returns a Ready node +- `mapt aws kind destroy ...` removes all resources + +--- + +## Command + +``` +mapt aws kind create [flags] +mapt aws kind destroy [flags] +``` + +### Shared flag groups (`specs/cmd/params.md`) + +| Group | Flags added | +|---|---| +| Common | `--project-name`, `--backed-url` | +| Compute Request | `--cpus`, `--memory`, `--arch`, `--nested-virt`, `--compute-sizes` | +| 
Spot | `--spot`, `--spot-eviction-tolerance`, `--spot-increase-rate`, `--spot-excluded-regions` | + +Note: no integration flags. + +### Target-specific flags (create only) + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--version` | string | `v1.34` | Kubernetes version for Kind | +| `--arch` | string | `x86_64` | `x86_64` or `arm64` | +| `--extra-port-mappings` | string | — | JSON array of `{containerPort, hostPort, protocol}` objects | +| `--timeout` | string | — | Self-destruct duration | +| `--conn-details-output` | string | — | Path to write kubeconfig | +| `--tags` | map | — | Resource tags | + +### Destroy flags + +`--serverless`, `--force-destroy`, `--keep-state` + +### Action args struct populated + +`kind.KindArgs` → `pkg/provider/aws/action/kind/kind.go` diff --git a/specs/features/aws/mac-host.md b/specs/features/aws/mac-host.md new file mode 100644 index 000000000..79e9f2047 --- /dev/null +++ b/specs/features/aws/mac-host.md @@ -0,0 +1,92 @@ +# Spec: AWS Mac Host (Single) + +## Status +Implemented + +## Context +Provisions a single macOS instance on an AWS Dedicated Host. Entry point: +`pkg/provider/aws/action/mac/`. Modules: `pkg/provider/aws/modules/mac/`. +CLI: `cmd/mapt/cmd/aws/hosts/mac.go`. + +AWS Dedicated Hosts for Mac have a hard constraint: minimum 24-hour tenancy before release. +The mac module handles host allocation, machine setup (via root-volume replacement), and +graceful release respecting the 24h window. + +## Problem +This feature is implemented. This spec documents behaviour and the 24h constraint implications. 
+ +## Requirements +- [x] Allocate an AWS Dedicated Host for macOS (x86_64 or arm64/Apple Silicon) +- [x] Deploy a macOS machine via root-volume replacement (not standard AMI boot) +- [x] Support optional CI integration: GitHub Actions runner, Cirrus persistent worker, GitLab runner +- [x] Optionally fix the dedicated host to a specific region/AZ (`FixedLocation`) +- [x] Enforce the 24-hour minimum tenancy: do not attempt to release a host allocated < 24h ago +- [x] Write output files: `host`, `username`, `id_rsa` +- [x] `destroy` handles the 24h wait or errors clearly if host is not yet releasable + +## Out of Scope +- Mac Pool service (managed pool of mac hosts — see `mac-pool-service.md`) +- Windows or Linux hosts + +## Affected Areas +- `pkg/provider/aws/action/mac/` — orchestration +- `pkg/provider/aws/modules/mac/host/` — dedicated host allocation +- `pkg/provider/aws/modules/mac/machine/` — machine setup via volume replacement +- `cmd/mapt/cmd/aws/hosts/mac.go` +- `tkn/template/infra-aws-mac.yaml` + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- `mapt aws mac create ...` exits 0 and writes `host`, `username`, `id_rsa` +- SSH access to the macOS host works +- `mapt aws mac destroy ...` either releases the host (if >= 24h old) or fails with a clear error + +--- + +## Command + +``` +mapt aws mac create [flags] +mapt aws mac destroy [flags] +mapt aws mac request [flags] # borrow a machine from the pool +mapt aws mac release [flags] # return a machine to the pool +``` + +Note: `request` and `release` operate on the mac-pool (see `specs/features/aws/mac-pool-service.md`). +A standalone `create` provisions a dedicated host directly without a pool. + +### Shared flag groups (`specs/cmd/params.md`) + +| Group | Flags added | +|---|---| +| Common | `--project-name`, `--backed-url` | + +Note: no compute-request, no spot, no timeout, no integration flags. 
+Mac hardware is allocated as a dedicated host — instance type is fixed by arch+version. + +### Target-specific flags (create) + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--arch` | string | `m1` | MAC architecture: `x86`, `m1`, `m2` | +| `--version` | string | *(per arch)* | macOS version: 11/12 on x86; 13/14/15 on all | +| `--fixed-location` | bool | false | Force creation in `AWS_DEFAULT_REGION` only | +| `--airgap` | bool | false | Provision as airgap machine | +| `--conn-details-output` | string | — | Path to write connection files | +| `--tags` | map | — | Resource tags | + +### Destroy / request / release flags + +`--dedicated-host-id` — required for `request`, `release`, and `destroy` to identify the host + +`--force-destroy`, `--keep-state` on destroy. + +### Action args struct populated + +`mac.MacArgs` → `pkg/provider/aws/action/mac/mac.go` diff --git a/specs/features/aws/mac-pool-service.md b/specs/features/aws/mac-pool-service.md new file mode 100644 index 000000000..1f8c0c25a --- /dev/null +++ b/specs/features/aws/mac-pool-service.md @@ -0,0 +1,100 @@ +# Spec: AWS Mac Pool Service + +## Status +Implemented + +## Context +A managed pool of macOS dedicated hosts providing request/release semantics to CI pipelines. +Entry point: `pkg/provider/aws/action/mac-pool/mac-pool.go`. +CLI: `cmd/mapt/cmd/aws/services/mac-pool.go`. + +The pool runs a serverless HouseKeeper on a recurring schedule (ECS Fargate) that maintains +the desired offered capacity by adding or removing machines while respecting AWS's 24h minimum +host tenancy. State is stored per-machine in separate Pulumi stacks under a shared S3 prefix. + +## Problem +This feature is implemented. This spec documents the architecture and known gaps. 
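The HouseKeeper's capacity maintenance is essentially a reconcile loop. A sketch of the decision logic (hypothetical signature; in practice each removal candidate is additionally gated on the 24h tenancy check):

```go
package main

import "fmt"

// reconcile returns how many machines the HouseKeeper should add
// (positive) or remove (negative) to reach the desired offered
// capacity without exceeding maxSize. Hypothetical sketch.
func reconcile(offered, desired, maxSize int) int {
	switch {
	case offered < desired:
		add := desired - offered
		if offered+add > maxSize {
			add = maxSize - offered // never grow past the pool's max size
		}
		return add
	case offered > desired:
		// Each removal candidate must also be >24h old (destroyable).
		return -(offered - desired)
	}
	return 0
}

func main() {
	fmt.Println(reconcile(1, 3, 4)) // below capacity: add machines
	fmt.Println(reconcile(3, 3, 4)) // steady state: nothing to do
	fmt.Println(reconcile(4, 2, 4)) // over capacity: remove machines
}
```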
+ +## Requirements +- [x] `create`: provision N machines (OfferedCapacity) up to MaxSize; start the HouseKeeper scheduler +- [x] `create`: generate a least-privilege IAM user/key pair for request/release operations (`requestReleaserAccount`) +- [x] `housekeeper`: add machines if current offered capacity < desired and pool size < max +- [x] `housekeeper`: remove machines if current offered capacity > desired and machines are > 24h old (destroyable) +- [x] `request`: lock the next available (non-locked) machine and write its connection details +- [x] `release`: unlock a machine identified by host ID, resetting it for the next user +- [x] Reject local `file://` BackedURL — pool requires remote S3 state +- [x] `destroy`: remove IAM resources, serverless scheduler, and S3 state + +## Out of Scope +- Single mac host (see `mac-host.md`) +- Integration-mode selection on `request` (currently hardcoded; TODO in code) + +## Affected Areas +- `pkg/provider/aws/action/mac-pool/` — orchestration +- `pkg/provider/aws/modules/mac/` — host, machine, util sub-packages +- `pkg/provider/aws/modules/serverless/` — HouseKeeper recurring task +- `pkg/provider/aws/modules/iam/` — request/releaser IAM account +- `cmd/mapt/cmd/aws/services/mac-pool.go` + +## Known Gaps / Improvement Ideas +- `Request` integration-mode is hardcoded (TODO comment at `mac-pool.go:138`) +- `destroyCapacity` has a TODO about allocation time ordering +- `getNextMachineForRequest` picks the newest machine; could be optimized (e.g. 
LRU) +- No explicit handling when all machines in the pool are locked and none available + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- Pool creates N dedicated hosts and writes IAM credentials +- `housekeeper` invocation adds a machine when pool is below capacity +- `request` writes `host`, `username`, `id_rsa` for a locked machine +- `release` makes the machine available again for the next request + +--- + +## Command + +``` +mapt aws mac-pool create [flags] # create the pool of dedicated hosts +mapt aws mac-pool destroy [flags] +mapt aws mac-pool request [flags] # borrow a machine from the pool +mapt aws mac-pool release [flags] # return a machine to the pool +``` + +### Shared flag groups (`specs/cmd/params.md`) + +| Group | Flags added | +|---|---| +| Common | `--project-name`, `--backed-url` | + +No compute-request, spot, timeout, or integration flags. + +### Target-specific flags (create) + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--name` | string | — | Pool name (used to identify the resource group) | +| `--arch` | string | `m1` | MAC architecture: `x86`, `m1`, `m2` | +| `--version` | string | *(per arch)* | macOS version | +| `--offered-capacity` | int | *(default in action)* | Number of machines kept available in the pool | +| `--max-size` | int | *(default in action)* | Maximum number of dedicated hosts in the pool | +| `--fixed-location` | bool | false | Force creation in `AWS_DEFAULT_REGION` only | +| `--conn-details-output` | string | — | Path to write IAM credentials | +| `--tags` | map | — | Resource tags | + +### Request / release flags + +`--project-name`, `--backed-url` (from common) + +### Destroy flags + +`--force-destroy`, `--keep-state` + +### Action args struct populated + +`mac.MacPoolArgs` → `pkg/provider/aws/action/mac-pool/mac-pool.go` diff --git a/specs/features/aws/openshift-snc.md b/specs/features/aws/openshift-snc.md new file mode 100644 index 000000000..def590955 --- 
/dev/null +++ b/specs/features/aws/openshift-snc.md @@ -0,0 +1,106 @@ +# Spec: AWS OpenShift Single Node Cluster (SNC) + +## Status +Implemented + +## Context +Provisions a single-node OpenShift cluster (CRC/SNC) on an EC2 instance using a pre-baked AMI. +Entry point: `pkg/provider/aws/action/snc/`. Profile system: `pkg/target/service/snc/profile/`. +CLI: `cmd/mapt/cmd/aws/services/snc.go`. + +The cluster setup runs inside cloud-init on boot. Sensitive values (pull secret, kubeadmin +password, developer password) are managed via AWS SSM Parameter Store. Readiness is verified +by SSH-checking the kubeconfig availability and CA rotation completion. + +## Problem +This feature is implemented. This spec documents behaviour, the profile system, and gaps. + +## Requirements +- [x] Provision an EC2 instance using the SNC pre-baked AMI (looked up by version + arch) +- [x] Fail early with a clear error if the AMI does not exist in the target region +- [x] Store pull secret, kubeadmin password, and developer password in SSM; inject via cloud-init +- [x] Verify cluster readiness: SSH up → kubeconfig exists → CA rotation complete +- [x] Export kubeconfig (with public IP replacing internal API endpoint) as a secret output +- [x] Support optional profiles deployed post-cluster-ready via the Kubernetes Pulumi provider: + - `virtualization` — enables nested virtualisation on the compute instance + - `serverless-serving` — installs Knative Serving + - `serverless-eventing` — installs Knative Eventing + - `serverless` — installs both Knative Serving and Eventing + - `servicemesh` — installs OpenShift Service Mesh 3 +- [x] Validate profile names before provisioning begins +- [x] Support spot allocation and serverless self-destruct timeout +- [x] Write output files: `host`, `username`, `id_rsa`, `kubeconfig`, `kubeadmin-password`, `developer-password` +- [x] `destroy` cleans up main stack, spot stack, S3 state + +## Out of Scope +- Multi-node OCP (full IPI/UPI install) +- EKS (see 
`006-aws-eks.md`) + +## Affected Areas +- `pkg/provider/aws/action/snc/` — orchestration, kubeconfig extraction +- `pkg/target/service/snc/` — cloud-config, SSM management, readiness commands +- `pkg/target/service/snc/profile/` — profile registry and deployment +- `cmd/mapt/cmd/aws/services/snc.go` +- `tkn/template/infra-aws-ocp-snc.yaml` + +## Known Gaps / Improvement Ideas +- Profile deployment failures are logged as warnings, not errors (`snc.go:279`) + — consider making this configurable (fail-fast vs warn-and-continue) +- `disableClusterReadiness` flag skips the readiness wait entirely; useful for debugging + but not documented in the Tekton task +- The `--version` flag accepts a free-form string; no validation against available AMIs beyond + the early existence check + +## Acceptance Criteria + +### Unit + +- `make build` succeeds +- Unknown profile names are rejected before any stack is created + +### Integration + +- Cluster is reachable via the exported kubeconfig +- `oc get nodes` shows one Ready node +- Profiles deploy successfully when specified +- `mapt aws openshift-snc destroy` removes all resources and state + +--- + +## Command + +``` +mapt aws openshift-snc create [flags] +mapt aws openshift-snc destroy [flags] +``` + +### Shared flag groups (`specs/cmd/params.md`) + +| Group | Flags added | +|---|---| +| Common | `--project-name`, `--backed-url` | +| Compute Request | `--cpus`, `--memory`, `--arch`, `--nested-virt`, `--compute-sizes` | +| Spot | `--spot`, `--spot-eviction-tolerance`, `--spot-increase-rate`, `--spot-excluded-regions` | + +Note: no integration flags. 
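
Profile validation happens before any stack is created (see Requirements and the Unit criteria above). A minimal, hypothetical sketch of that check — the real registry lives in `pkg/target/service/snc/profile/`; the map below lists the five accepted profile names, everything else is illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// knownProfiles mirrors the five profile names listed in Requirements;
// the real registry lives in pkg/target/service/snc/profile/.
var knownProfiles = map[string]bool{
	"virtualization":      true,
	"serverless-serving":  true,
	"serverless-eventing": true,
	"serverless":          true,
	"servicemesh":         true,
}

// validateProfiles rejects unknown names up front, before any stack is created.
func validateProfiles(names []string) error {
	var unknown []string
	for _, n := range names {
		if !knownProfiles[n] {
			unknown = append(unknown, n)
		}
	}
	if len(unknown) > 0 {
		return fmt.Errorf("unknown snc profile(s): %s", strings.Join(unknown, ", "))
	}
	return nil
}

func main() {
	fmt.Println(validateProfiles([]string{"serverless", "servicemesh"})) // <nil>
	fmt.Println(validateProfiles([]string{"meshy"}))                     // non-nil error
}
```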
+ +### Target-specific flags (create only) + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--version` | string | `4.21.0` | OpenShift version | +| `--arch` | string | `x86_64` | `x86_64` or `arm64` | +| `--pull-secret-file` | string | — | Path to Red Hat pull secret JSON file (required) | +| `--snc` | []string | — | SNC profiles to apply (comma-separated) | +| `--disable-cluster-readiness` | bool | false | Skip cluster readiness check after provision | +| `--timeout` | string | — | Self-destruct duration | +| `--conn-details-output` | string | — | Path to write kubeconfig | +| `--tags` | map | — | Resource tags | + +### Destroy flags + +`--serverless`, `--force-destroy`, `--keep-state` + +### Action args struct populated + +`snc.SNCArgs` → `pkg/provider/aws/action/snc/snc.go` diff --git a/specs/features/aws/rhel-ai.md b/specs/features/aws/rhel-ai.md new file mode 100644 index 000000000..70956939d --- /dev/null +++ b/specs/features/aws/rhel-ai.md @@ -0,0 +1,83 @@ +# Spec: AWS RHEL AI Host + +## Status +Implemented + +## Context +Provisions a RHEL AI instance on AWS, designed for AI/ML workloads. Entry point: +`pkg/provider/aws/action/rhel-ai/`. API: `pkg/target/host/rhelai/`. +CLI: `cmd/mapt/cmd/aws/hosts/rhelai.go`. + +RHEL AI differs from standard RHEL in that it uses specialised GPU-capable instance types +and a RHEL AI-specific AMI. + +## Problem +This feature is implemented. This spec documents the current behaviour. + +## Requirements +- [x] Provision a RHEL AI instance using the RHEL AI AMI +- [x] Target GPU-capable instance types (e.g. 
g4dn, p3 families) +- [x] Support spot allocation +- [x] Write output files: `host`, `username`, `id_rsa` +- [x] `destroy` cleans up all resources and state + +## Out of Scope +- Standard RHEL (see `001-aws-rhel-host.md`) +- Azure RHEL AI (see `015-azure-rhel-ai.md`) + +## Affected Areas +- `pkg/provider/aws/action/rhel-ai/` +- `pkg/target/host/rhelai/` +- `cmd/mapt/cmd/aws/hosts/rhelai.go` +- `tkn/template/infra-aws-rhel-ai.yaml` +- `Pulumi.rhelai.yaml` — stack configuration for the rhelai Pulumi stack + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- `mapt aws rhel-ai create ...` provisions an accessible RHEL AI instance +- SSH access works +- `mapt aws rhel-ai destroy ...` removes all resources + +--- + +## Command + +``` +mapt aws rhel-ai create [flags] +mapt aws rhel-ai destroy [flags] +``` + +### Shared flag groups (`specs/cmd/params.md`) + +| Group | Flags added | +|---|---| +| Common | `--project-name`, `--backed-url` | +| Compute Request | `--cpus`, `--memory`, `--arch`, `--nested-virt`, `--compute-sizes` | +| Spot | `--spot`, `--spot-eviction-tolerance`, `--spot-increase-rate`, `--spot-excluded-regions` | + +Note: no integration flags. 
+ +### Target-specific flags (create only) + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--version` | string | `3.0.0` | RHEL AI version | +| `--accelerator` | string | `cuda` | GPU accelerator type: `cuda` or `rocm` | +| `--custom-ami` | string | — | Override with a custom AMI ID | +| `--timeout` | string | — | Self-destruct duration | +| `--conn-details-output` | string | — | Path to write connection files | +| `--tags` | map | — | Resource tags | + +### Destroy flags + +`--serverless`, `--force-destroy`, `--keep-state` + +### Action args struct populated + +`rhelai.RHELAIArgs` → `pkg/provider/aws/action/rhelai/rhelai.go` diff --git a/specs/features/aws/rhel-host.md b/specs/features/aws/rhel-host.md new file mode 100644 index 000000000..6004bc56b --- /dev/null +++ b/specs/features/aws/rhel-host.md @@ -0,0 +1,129 @@ +# Spec: AWS RHEL Host + +## Status +Implemented + +## Context +Provisions a RHEL EC2 instance on AWS. This is the reference implementation of the AWS EC2 host +pattern — all other AWS EC2 host targets follow the same structure. + +Relevant existing files: +- `pkg/provider/aws/action/rhel/` — orchestration (reference implementation) +- `pkg/target/host/rhel/cloud-config.go` — cloud-config builder +- `cmd/mapt/cmd/aws/hosts/rhel.go` — CLI + +## Problem +This feature is fully implemented. This spec documents current behaviour, the mandatory module +sequence, and known gaps. Use it as the template when adding a new AWS EC2 host target. 
+ +## Requirements +- [x] Provision a RHEL EC2 instance (versions: 9.x, 8.x) for x86_64 or arm64 +- [x] Register with Red Hat Subscription Manager using `SubsUsername` / `SubsPassword` via cloud-init +- [x] Support spot instance allocation with cross-region best-bid selection +- [x] Support on-demand allocation using the default AWS region +- [x] Support airgap topology: two-phase stack update (connectivity ON then OFF) +- [x] Optionally apply the `profileSNC` cloud-config variant to pre-install SNC dependencies +- [x] Optionally schedule serverless self-destruct after a given timeout (requires remote BackedURL) +- [x] Write output files: `host`, `username`, `id_rsa` (and bastion files when airgap) +- [x] `destroy` cleans up main stack, spot stack (if exists), and S3 state + +## Out of Scope +- RHEL AI variant (see `009-aws-rhel-ai.md`) +- Azure RHEL (see `010-azure-rhel-host.md`) + +## Must Reuse + +**In `Create()`:** +- `mc.Init(mCtxArgs, aws.Provider())` — context initialisation +- `allocation.Allocation(mCtx, &AllocationArgs{Prefix, ComputeRequest, AMIProductDescription, Spot})` — resolves region/AZ/instance types for spot or on-demand + +**In `deploy()`, in this order:** +- `amiSVC.GetAMIByName(ctx, amiRegex, nil, map[string]string{"architecture": arch})` — finds the RHEL AMI +- `network.Create(ctx, mCtx, &NetworkArgs{Prefix, ID, Region, AZ, CreateLoadBalancer, Airgap, AirgapPhaseConnectivity})` — VPC/subnet/IGW/LB +- `keypair.KeyPairRequest{Name: resourcesUtil.GetResourceName(...)}.Create(ctx, mCtx)` — SSH keypair +- `securityGroup.SGRequest{...}.Create(ctx, mCtx)` — security group (SSH/22 ingress) +- `rhelApi.CloudConfigArgs{...}.GenerateCloudConfig(ctx, mCtx.RunID())` — RHEL cloud-config with subscription and optional SNC profile +- `compute.ComputeRequest{...}.NewCompute(ctx)` — EC2 instance +- `serverless.OneTimeDelayedTask(...)` — only when `Timeout != ""` +- `c.Readiness(ctx, command.CommandCloudInitWait, ...)` — waits for cloud-init to complete + +**In 
`Destroy()`:** +- `aws.DestroyStack(mCtx, DestroyStackRequest{Stackname: stackName})` +- `spot.Destroy(mCtx)` guarded by `spot.Exist(mCtx)` +- `aws.CleanupState(mCtx)` + +**In `manageResults()`:** +- `bastion.WriteOutputs(stackResult, prefix, resultsPath)` — only when `airgap=true` +- `output.Write(stackResult, resultsPath, results)` — writes `host`, `username`, `id_rsa` + +**Naming:** +- All resource names via `resourcesUtil.GetResourceName(prefix, awsRHELDedicatedID, suffix)` +- Stack name via `mCtx.StackNameByProject(stackName)` + +## Must Create +- `pkg/provider/aws/action/rhel/rhel.go` — `RHELArgs`, `Create()`, `Destroy()`, `deploy()`, `manageResults()`, `securityGroups()` +- `pkg/provider/aws/action/rhel/constants.go` — `stackName`, `awsRHELDedicatedID`, `amiRegex`, `diskSize`, `amiProduct`, `amiUserDefault`, output key constants +- `pkg/target/host/rhel/cloud-config.go` — `CloudConfigArgs`, `GenerateCloudConfig()` +- `pkg/target/host/rhel/cloud-config-base` — base cloud-config template file +- `pkg/target/host/rhel/cloud-config-snc` — SNC-variant cloud-config template file +- `cmd/mapt/cmd/aws/hosts/rhel.go` — Cobra `create` and `destroy` subcommands +- `tkn/template/infra-aws-rhel.yaml` — Tekton task template + +## Known Gaps +- `createAirgapMachine()` swallows the phase-1 error: returns `nil` instead of `err` at `rhel.go:167` + — phase 2 must not run if phase 1 fails +- No validation that `SubsUsername`/`SubsPassword` are non-empty when `profileSNC=true` +- `diskSize` is a hardcoded constant; not exposed as a CLI flag + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- `mapt aws rhel create --backed-url s3://... --project-name test --version 9 --arch x86_64 --subs-username u --subs-user-pass p` exits 0 +- Output directory contains `host`, `username`, `id_rsa` +- SSH access to the provisioned host succeeds +- `mapt aws rhel destroy --backed-url s3://... 
--project-name test` exits 0 and removes state + +--- + +## Command + +``` +mapt aws rhel create [flags] +mapt aws rhel destroy [flags] +``` + +### Shared flag groups (`specs/cmd/params.md`) + +| Group | Flags added | +|---|---| +| Common | `--project-name`, `--backed-url` | +| Compute Request | `--cpus`, `--memory`, `--arch`, `--nested-virt`, `--compute-sizes` | +| Spot | `--spot`, `--spot-eviction-tolerance`, `--spot-increase-rate`, `--spot-excluded-regions` | +| Integrations | `--ghactions-runner-*`, `--it-cirrus-pw-*`, `--glrunner-*` | + +### Target-specific flags (create only) + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--version` | string | `9.4` | RHEL major.minor version | +| `--arch` | string | `x86_64` | `x86_64` or `arm64` | +| `--rh-subscription-username` | string | — | Red Hat subscription username | +| `--rh-subscription-password` | string | — | Red Hat subscription password | +| `--snc` | bool | false | Apply SNC profile (sets `nested-virt=true`) | +| `--airgap` | bool | false | Provision as airgap machine (bastion access only) | +| `--timeout` | string | — | Self-destruct duration e.g. `4h` (requires remote `--backed-url`) | +| `--conn-details-output` | string | — | Path to write connection files | +| `--tags` | map | — | Resource tags `name=value,...` | + +### Destroy flags + +`--serverless`, `--force-destroy`, `--keep-state` + +### Action args struct populated + +`rhel.RHELArgs` → `pkg/provider/aws/action/rhel/rhel.go` diff --git a/specs/features/aws/serverless-self-destruct.md b/specs/features/aws/serverless-self-destruct.md new file mode 100644 index 000000000..b9b514a46 --- /dev/null +++ b/specs/features/aws/serverless-self-destruct.md @@ -0,0 +1,76 @@ +# Spec: Serverless Self-Destruct (Timeout Mode) + +## Status +Implemented + +## Context +Any provisioned host or service can optionally schedule its own destruction after a given duration. +This prevents cost overruns when a CI pipeline fails to call `destroy` explicitly. 
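
A sketch of the input contract — the timeout is a Go `time.Duration` string and the state backend must be remotely reachable by the Fargate task; the function name and error wording here are illustrative, not the actual implementation:

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// validateTimeout mirrors the two constraints on this feature: the timeout
// must parse as a Go duration, and file:// state must be rejected because the
// scheduled Fargate task cannot reach a local backend.
func validateTimeout(timeout, backedURL string) (time.Duration, error) {
	d, err := time.ParseDuration(timeout) // e.g. "4h", "30m"
	if err != nil {
		return 0, fmt.Errorf("invalid timeout %q: %w", timeout, err)
	}
	if strings.HasPrefix(backedURL, "file://") {
		return 0, fmt.Errorf("timeout requires a remote backed-url, got %s", backedURL)
	}
	return d, nil
}

func main() {
	d, err := validateTimeout("4h", "s3://mapt-state/project")
	fmt.Println(d, err) // 4h0m0s <nil>
}
```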
+ +Implementation: `pkg/provider/aws/modules/serverless/`. + +Mechanism: +1. An ECS Fargate task definition is created with the `mapt` OCI image +2. An AWS EventBridge Scheduler one-time schedule fires at `now + timeout` +3. The scheduled task runs `mapt destroy --project-name ... --backed-url ... --serverless` +4. A shared ECS cluster and IAM roles are created once per region and retained (`RetainOnDelete(true)`) + +## Problem +This feature is implemented. This spec documents the design and constraints. + +## Requirements +- [x] Accept a timeout duration string (Go `time.Duration` format, e.g. `"4h"`, `"30m"`) +- [x] Reject timeout when BackedURL is `file://` (state must be remotely accessible by Fargate) +- [x] Create/reuse a named ECS cluster (`mapt-serverless-cluster`) retained on delete +- [x] Create/reuse task execution and scheduler IAM roles, retained on delete +- [x] Create a one-time EventBridge Schedule at `now + timeout` in the region's local timezone +- [x] The Fargate task image is the `mapt` OCI image baked in at compile time via linker flag (`-X ...context.OCI`) +- [x] Support `--serverless` flag on destroy to use role-based credentials (no static key/secret needed inside ECS) +- [x] Clean up the EventBridge schedule and task definition on destroy (these are not retained) + +## Out of Scope +- Recurring schedules (used internally by mac-pool HouseKeeper via `serverless.Create()` with `Repeat` type) +- Azure self-destruct (not implemented) + +## Affected Areas +- `pkg/provider/aws/modules/serverless/serverless.go` — core implementation +- `pkg/provider/aws/modules/serverless/types.go` — schedule types +- `pkg/manager/context/context.go` — `OCI` variable set by linker +- Any action that calls `serverless.OneTimeDelayedTask()` (rhel, windows, snc, fedora, kind, eks) +- `oci/Containerfile` — the container image being scheduled + +## Known Gaps / Improvement Ideas +- IAM policy for the task role is very broad (`ec2:*`, `s3:*`, `cloudformation:*`, 
`ssm:*`, `scheduler:*`) + — could be scoped down to only what destroy needs +- There is no mechanism to cancel the scheduled self-destruct once set (other than manually deleting + the EventBridge schedule from the AWS console) +- The OCI image tag used by the Fargate task is baked in at build time; if a newer binary is deployed + via a different image tag, old scheduled tasks still run the old image + +## Acceptance Criteria + +### Unit + +- `make build` succeeds +- `--timeout` with a `file://` BackedURL returns an error before any stack is created + +### Integration + +- `mapt aws rhel create --timeout 1h ...` creates a visible EventBridge schedule +- After the timeout, the Fargate task fires and the stack is destroyed + +--- + +## Command + +This is a cross-cutting feature, not a standalone command. It is activated via the +`--timeout` flag on individual target create commands, and the `--serverless` flag +on destroy commands: + +``` +mapt aws rhel create --timeout 4h ... +mapt aws rhel destroy --serverless ... +``` + +Both flags are defined in shared params (`specs/cmd/params.md` — Serverless / Destroy group). +No additional flags are specific to the self-destruct feature itself. 
diff --git a/specs/features/aws/vpc-endpoints.md b/specs/features/aws/vpc-endpoints.md new file mode 100644 index 000000000..a067134b0 --- /dev/null +++ b/specs/features/aws/vpc-endpoints.md @@ -0,0 +1,192 @@ +# Feature: Optional VPC Endpoints + +## Status +Implemented + +## Context + +Every public subnet created by mapt unconditionally creates three VPC endpoints inside +`PublicSubnetRequest.Create()` in `pkg/provider/aws/services/vpc/subnet/public.go`: + +| Name | Service | Type | +|---|---|---| +| `s3` | `com.amazonaws.{region}.s3` | Gateway | +| `ecr` | `com.amazonaws.{region}.ecr.dkr` | Interface | +| `ssm` | `com.amazonaws.{region}.ssm` | Interface | + +Interface endpoints (ECR, SSM) also create a shared security group allowing TCP 443 +inbound from the VPC CIDR — this group is also created unconditionally today. + +Targets that do not need these endpoints pay for them unnecessarily. Targets that need +other endpoints cannot add them without code changes. + +--- + +## Requirements + +- [x] Accept a `Endpoints []string` field on `NetworkArgs` — each entry is a short name + (`"s3"`, `"ecr"`, `"ssm"`) identifying the endpoint to create +- [x] Empty slice (default) = **no endpoints created** — breaking change from current + behaviour; callers that need endpoints must opt in explicitly +- [x] Propagate through the full call chain: + `cmd params` → action `*Args` → `NetworkArgs` → `NetworkRequest` → `PublicSubnetRequest` → `endpoints()` +- [x] `endpoints()` creates only the endpoints present in the list; unknown names return an + error before any AWS resource is created +- [x] The Interface-endpoint security group is only created when at least one Interface + endpoint (`ecr`, `ssm`) is in the list +- [x] Targets that currently depend on specific endpoints (verify EKS, SNC) must pass the + required endpoint names explicitly in their action args + +--- + +## Out of Scope + +- Adding new endpoint types beyond the existing three +- Azure (no equivalent mechanism) +- 
Airgap path — endpoints are only created for public subnets (`standard/`) + +--- + +## Must Reuse + +- `network.Create()` — `specs/api/aws/network.md` — extend `NetworkArgs` with `Endpoints []string` +- `standard.NetworkRequest.CreateNetwork()` — pass `Endpoints` down to `PublicSubnetRequest` +- `PublicSubnetRequest.Create()` — pass `Endpoints` down to `endpoints()` + +--- + +## Must Create + +No new files. All changes are within existing files: + +### 1. Shared CLI params — `cmd/mapt/cmd/params/params.go` + +Follow the three-part pattern described in `specs/cmd/params.md`. Add the Network group: + +```go +const ( + Endpoints = "endpoints" + EndpointsDesc = "Comma-separated list of VPC endpoints to create. " + + "Accepted values: s3, ecr, ssm. Empty = no endpoints." +) + +func AddNetworkFlags(fs *pflag.FlagSet) { + fs.StringSliceP(Endpoints, "", []string{}, EndpointsDesc) +} + +func NetworkEndpoints() []string { + return viper.GetStringSlice(Endpoints) +} +``` + +`StringSliceP` + `viper.GetStringSlice` handle comma-separated input automatically — +the same mechanism used by `--compute-sizes` and `--spot-excluded-regions`. + +### 2. Action args structs — one per target that uses network + +Add `Endpoints []string` to each action's public args struct and wire it into +`NetworkArgs` inside `deploy()`: + +| Action args struct | File | +|---|---| +| `rhel.RHELArgs` | `pkg/provider/aws/action/rhel/rhel.go` | +| `windows.WindowsArgs` | `pkg/provider/aws/action/windows/windows.go` | +| `fedora.FedoraArgs` | `pkg/provider/aws/action/fedora/fedora.go` | +| `kind.KindArgs` | `pkg/provider/aws/action/kind/kind.go` | +| `snc.SNCArgs` | `pkg/provider/aws/action/snc/snc.go` | +| `eks.EKSArgs` | `pkg/provider/aws/action/eks/eks.go` | + +In each action's `deploy()`, pass the field to `NetworkArgs`: + +```go +nw, err := network.Create(ctx, r.mCtx, &network.NetworkArgs{ + ... + Endpoints: r.endpoints, // new field +}) +``` + +### 3. 
cmd create files — one per target + +Call `params.AddNetworkFlags(flagSet)` and pass `params.NetworkEndpoints()` to the +action args. Pattern (shown for RHEL, identical for all others): + +```go +// in getRHELCreate() flagSet block: +params.AddNetworkFlags(flagSet) + +// in RHELArgs construction: +&rhel.RHELArgs{ + ... + Endpoints: params.NetworkEndpoints(), +} +``` + +Affected cmd files: + +| File | +|---| +| `cmd/mapt/cmd/aws/hosts/rhel.go` | +| `cmd/mapt/cmd/aws/hosts/windows.go` | +| `cmd/mapt/cmd/aws/hosts/fedora.go` | +| `cmd/mapt/cmd/aws/hosts/rhelai.go` | +| `cmd/mapt/cmd/aws/services/kind.go` | +| `cmd/mapt/cmd/aws/services/snc.go` | +| `cmd/mapt/cmd/aws/services/eks.go` | + +### 4. Network module — `pkg/provider/aws/modules/network/network.go` + +Add `Endpoints []string` to `NetworkArgs`; pass to `NetworkRequest`. + +### 5. Standard network — `pkg/provider/aws/modules/network/standard/standard.go` + +Add `Endpoints []string` to `NetworkRequest`; pass to `PublicSubnetRequest`. + +### 6. Public subnet — `pkg/provider/aws/services/vpc/subnet/public.go` + +Add `Endpoints []string` to `PublicSubnetRequest`. + +Refactor `endpoints()`: +- Accept the list; iterate and create only matching entries +- Unknown names: return error immediately +- Create the security group only when at least one Interface endpoint (`ecr`, `ssm`) is present +- Return without creating anything when the list is empty + +--- + +## Endpoint Identifiers + +| Name | AWS service name | Type | Needs security group | +|---|---|---|---| +| `s3` | `com.amazonaws.{region}.s3` | Gateway | No | +| `ecr` | `com.amazonaws.{region}.ecr.dkr` | Interface | Yes | +| `ssm` | `com.amazonaws.{region}.ssm` | Interface | Yes | + +The security group (TCP 443 ingress from VPC CIDR) is shared by all Interface endpoints +in the subnet. Created once if any Interface endpoint is in the list; omitted otherwise. 
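
A hedged sketch of the refactored `endpoints()` entry logic described above — validate every requested name before touching AWS, and derive whether the shared Interface-endpoint security group is needed. Names and shapes here are illustrative, not the actual implementation:

```go
package main

import "fmt"

// endpointSpec mirrors the identifier table above; the real implementation
// lives in pkg/provider/aws/services/vpc/subnet/public.go.
type endpointSpec struct {
	service   string // com.amazonaws.{region}.<service>
	isGateway bool   // Gateway endpoints need no security group
}

var knownEndpoints = map[string]endpointSpec{
	"s3":  {"s3", true},
	"ecr": {"ecr.dkr", false},
	"ssm": {"ssm", false},
}

// planEndpoints validates the requested names up front and reports whether
// the shared Interface-endpoint security group must be created.
func planEndpoints(names []string) (needsSG bool, err error) {
	for _, n := range names {
		spec, ok := knownEndpoints[n]
		if !ok {
			return false, fmt.Errorf("unknown vpc endpoint %q (accepted: s3, ecr, ssm)", n)
		}
		if !spec.isGateway {
			needsSG = true // at least one Interface endpoint present
		}
	}
	return needsSG, nil // empty input: no endpoints, no security group
}

func main() {
	sg, err := planEndpoints([]string{"s3", "ssm"})
	fmt.Println(sg, err) // true <nil>
}
```

Note the empty-list case falls straight through: no endpoints and no security group, matching the opt-in default.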
+ +--- + +## API Changes + +Update `specs/api/aws/network.md`: +- Add `Endpoints []string` to `NetworkArgs` type block +- Document the accepted names and the security group behaviour + +--- + +## Acceptance Criteria + +### Unit + +- `make build` succeeds +- `endpoints()` called with an unknown name returns an error without creating any AWS resource + +### Integration + +- [ ] `mapt aws rhel create` with no `--endpoints` provisions a VPC with zero endpoints +- [ ] `mapt aws rhel create --endpoints s3,ssm` creates only S3 (Gateway) and SSM (Interface); + ECR is absent; security group is present +- [ ] `mapt aws rhel create --endpoints s3` creates only S3; no security group is created +- [ ] `mapt aws rhel create --endpoints foo` returns an error before any stack is touched +- [ ] Targets that depended on endpoints before this change (verify EKS, SNC) pass their + required endpoint names explicitly and continue to work diff --git a/specs/features/aws/windows-server-host.md b/specs/features/aws/windows-server-host.md new file mode 100644 index 000000000..355ae2ca4 --- /dev/null +++ b/specs/features/aws/windows-server-host.md @@ -0,0 +1,133 @@ +# Spec: AWS Windows Server Host + +## Status +Implemented + +## Context +Provisions a Windows Server EC2 instance on AWS. Follows the standard AWS EC2 host pattern +(see `001-aws-rhel-host.md`) with two additions: AMI cross-region copy and Fast Launch. + +Relevant existing files: +- `pkg/provider/aws/action/windows/` — orchestration +- `pkg/provider/aws/modules/ami/` — AMI copy + fast-launch (reused here, not in other targets) +- `pkg/target/host/windows-server/` — PowerShell userdata builder + +## Problem +This feature is fully implemented. This spec documents the standard and Windows-specific +module usage, and known gaps. 
+ +## Requirements +- [x] Provision Windows Server 2019 (English or non-English variant) EC2 instance +- [x] Accept a custom AMI name/owner/user; fall back to well-known defaults +- [x] Copy the AMI to the target region when not natively available; optionally keep the copy +- [x] Enable Fast Launch on copied AMI with configurable parallelism +- [x] Support spot instance allocation with cross-region best-bid selection +- [x] Support airgap topology (two-phase: connectivity ON → OFF) +- [x] Generate a random administrator password; export as `userpassword` +- [x] Open security group rules for SSH (22) and RDP (3389) +- [x] Optionally schedule serverless self-destruct after timeout +- [x] Write output files: `host`, `username`, `userpassword`, `id_rsa` (and bastion files when airgap) +- [x] `destroy` cleans up main stack, AMI-copy stack (if exists), spot stack (if exists), S3 state + +## Out of Scope +- Azure Windows Desktop (see `011-azure-windows-desktop.md`) +- Non-server Windows editions + +## Must Reuse + +**In `Create()` — standard:** +- `mc.Init(mCtxArgs, aws.Provider())` +- `allocation.Allocation(mCtx, &AllocationArgs{...})` — spot or on-demand + +**In `Create()` — Windows-specific addition before `createMachine()`:** +- `data.IsAMIOffered(ctx, ImageRequest{Name, Region})` — check if AMI exists in the target region +- `amiCopy.CopyAMIRequest{..., FastLaunch: true, MaxParallel: N}.Create()` — copy AMI to region when not offered; this creates its own Pulumi stack + +**In `deploy()`, in this order — same as standard pattern:** +- `amiSVC.GetAMIByName(ctx, amiName+"*", []string{amiOwner}, nil)` +- `network.Create(ctx, mCtx, &NetworkArgs{..., CreateLoadBalancer: r.spot})` +- `keypair.KeyPairRequest{Name: resourcesUtil.GetResourceName(...)}.Create(ctx, mCtx)` +- `securityGroup.SGRequest{..., IngressRules: [SSH_TCP, RDP_TCP]}.Create(ctx, mCtx)` +- `security.CreatePassword(ctx, resourcesUtil.GetResourceName(...))` — random admin password +- 
`cloudConfigWindowsServer.GenerateUserdata(ctx, user, password, keyResources, runID)` — PowerShell userdata +- `compute.ComputeRequest{..., LBTargetGroups: []int{22, 3389}}.NewCompute(ctx)` +- `serverless.OneTimeDelayedTask(...)` — only when `Timeout != ""` +- `c.Readiness(ctx, command.CommandPing, ...)` — ICMP ping readiness (not cloud-init wait) + +**In `Destroy()` — Windows-specific additions:** +- `aws.DestroyStack(mCtx, DestroyStackRequest{Stackname: stackName})` +- `amiCopy.Destroy(mCtx)` guarded by `amiCopy.Exist(mCtx)` — additional step vs standard pattern +- `spot.Destroy(mCtx)` guarded by `spot.Exist(mCtx)` +- `aws.CleanupState(mCtx)` + +**In `manageResults()` — standard:** +- `bastion.WriteOutputs(...)` when airgap +- `output.Write(stackResult, resultsPath, results)` — writes `host`, `username`, `userpassword`, `id_rsa` + +**Naming:** +- All resource names via `resourcesUtil.GetResourceName(prefix, awsWindowsDedicatedID, suffix)` + +## Must Create +- `pkg/provider/aws/action/windows/windows.go` — `WindowsServerArgs`, `Create()`, `Destroy()`, `deploy()`, `manageResults()`, `securityGroups()` +- `pkg/provider/aws/action/windows/constants.go` — stack name, component ID, AMI defaults, disk size, fast-launch config +- `pkg/target/host/windows-server/windows-server.go` — `GenerateUserdata()` +- `pkg/target/host/windows-server/bootstrap.ps1` — embedded PowerShell bootstrap script +- `cmd/mapt/cmd/aws/hosts/windows.go` — Cobra `create` and `destroy` subcommands +- `tkn/template/infra-aws-windows-server.yaml` — Tekton task template + +## Known Gaps +- `createAirgapMachine()` swallows the phase-1 error: `return nil` instead of `return err` at `windows.go:214` +- RDP through the bastion is unfinished — TODO comment at bottom of `windows.go` +- Readiness uses `CommandPing` (ICMP) not `CommandCloudInitWait`; cloud-init completion is not explicitly verified + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- `mapt aws windows 
create ...` provisions an accessible Windows instance +- RDP port 3389 and SSH port 22 are reachable +- Output directory contains `host`, `username`, `userpassword`, `id_rsa` +- `mapt aws windows destroy ...` removes all stacks and S3 state + +--- + +## Command + +``` +mapt aws windows create [flags] +mapt aws windows destroy [flags] +``` + +### Shared flag groups (`specs/cmd/params.md`) + +| Group | Flags added | +|---|---| +| Common | `--project-name`, `--backed-url` | +| Spot | `--spot`, `--spot-eviction-tolerance`, `--spot-increase-rate`, `--spot-excluded-regions` | + +Note: no compute-request flags — Windows uses a fixed AMI-based workflow, not hardware-spec selection. No integration flags. + +### Target-specific flags (create only) + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--ami-name` | string | `Windows_Server-2019-English-Full-Base*` | AMI name pattern to search | +| `--ami-username` | string | `ec2-user` | Default username on the AMI | +| `--ami-region` | string | — | Source region for cross-region AMI copy | +| `--ami-keep-copy` | bool | false | Retain the copied AMI after destroy | +| `--airgap` | bool | false | Provision as airgap machine | +| `--timeout` | string | — | Self-destruct duration | +| `--conn-details-output` | string | — | Path to write connection files | +| `--tags` | map | — | Resource tags | + +### Destroy flags + +`--serverless`, `--force-destroy`, `--keep-state` + +### Action args struct populated + +`windows.WindowsArgs` → `pkg/provider/aws/action/windows/windows.go` diff --git a/specs/features/azure/aks.md b/specs/features/azure/aks.md new file mode 100644 index 000000000..2168c7682 --- /dev/null +++ b/specs/features/azure/aks.md @@ -0,0 +1,78 @@ +# Spec: Azure AKS (Azure Kubernetes Service) + +## Status +Implemented + +## Context +Provisions a managed AKS cluster on Azure. Entry point: `pkg/provider/azure/action/aks/`. +CLI: `cmd/mapt/cmd/azure/services/aks.go`. + +## Problem +This feature is implemented. 
This spec documents the current behaviour. + +## Requirements +- [x] Provision an AKS cluster with a configurable node pool +- [x] Support configurable Kubernetes version +- [x] Support spot node pools (Azure spot VMs) +- [x] Write kubeconfig output file +- [x] `destroy` cleans up all resources and state + +## Out of Scope +- AWS EKS (see `006-aws-eks.md`) +- Azure Kind (see `014-azure-kind.md`) + +## Affected Areas +- `pkg/provider/azure/action/aks/` +- `cmd/mapt/cmd/azure/services/aks.go` +- `tkn/template/infra-azure-aks.yaml` + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- `mapt azure aks create ...` provisions a functioning AKS cluster +- Exported kubeconfig allows `kubectl get nodes` to return Ready nodes +- `mapt azure aks destroy ...` removes all resources + +--- + +## Command + +``` +mapt azure aks create [flags] +mapt azure aks destroy [flags] +``` + +### Shared flag groups + +| Group | Source | Flags added | +|---|---|---| +| Common | `specs/cmd/params.md` | `--project-name`, `--backed-url` | +| Spot | `specs/cmd/params.md` | `--spot`, `--spot-eviction-tolerance`, `--spot-increase-rate`, `--spot-excluded-regions` | + +Note: no compute-request (VM size is explicit), no integrations, no timeout. +AKS uses its own `--location` rather than the shared azure-params one (different default: `West US`). 
+ +### Target-specific flags (create only) + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--location` | string | `West US` | Azure region (ignored when spot is set) | +| `--vmsize` | string | *(default in action)* | Explicit VM size for node pool | +| `--version` | string | `1.31` | Kubernetes version | +| `--only-system-pool` | bool | false | Create system node pool only (no user pool) | +| `--enable-app-routing` | bool | false | Enable AKS App Routing add-on | +| `--conn-details-output` | string | — | Path to write kubeconfig | +| `--tags` | map | — | Resource tags | + +### Destroy flags + +*(none beyond common)* + +### Action args struct populated + +`aks.AKSArgs` → `pkg/provider/azure/action/aks/aks.go` diff --git a/specs/features/azure/kind.md b/specs/features/azure/kind.md new file mode 100644 index 000000000..548922367 --- /dev/null +++ b/specs/features/azure/kind.md @@ -0,0 +1,79 @@ +# Spec: Azure Kind Cluster + +## Status +Implemented + +## Context +Provisions a Kind (Kubernetes-in-Docker) cluster on an Azure VM. +Entry point: `pkg/provider/azure/action/kind/`. CLI: `cmd/mapt/cmd/azure/services/kind.go`. + +Mirrors the AWS Kind target but runs on Azure infrastructure. + +## Problem +This feature is implemented. This spec documents the current behaviour. 
+ +## Requirements +- [x] Provision an Azure VM and install Kind + Docker via cloud-init +- [x] Create a Kind cluster; export kubeconfig +- [x] Support configurable Kubernetes version +- [x] Support spot (low-priority) VMs +- [x] Write output files: `host`, `username`, `id_rsa`, `kubeconfig` +- [x] `destroy` cleans up all resources and state + +## Out of Scope +- AWS Kind (see `007-aws-kind.md`) +- Azure AKS managed clusters (see `012-azure-aks.md`) + +## Affected Areas +- `pkg/provider/azure/action/kind/` +- `cmd/mapt/cmd/azure/services/kind.go` + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- `mapt azure kind create ...` produces a working kubeconfig +- `kubectl get nodes` returns a Ready node +- `mapt azure kind destroy ...` removes all resources + +--- + +## Command + +``` +mapt azure kind create [flags] +mapt azure kind destroy [flags] +``` + +### Shared flag groups + +| Group | Source | Flags added | +|---|---|---| +| Common | `specs/cmd/params.md` | `--project-name`, `--backed-url` | +| Compute Request | `specs/cmd/params.md` | `--cpus`, `--memory`, `--arch`, `--nested-virt`, `--compute-sizes` | +| Spot | `specs/cmd/params.md` | `--spot`, `--spot-eviction-tolerance`, `--spot-increase-rate`, `--spot-excluded-regions` | +| Location | `specs/cmd/azure-params.md` | `--location` (default: `westeurope`) | + +Note: no integration flags. 
+ +### Target-specific flags (create only) + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--version` | string | `v1.34` | Kubernetes version for Kind | +| `--arch` | string | `x86_64` | `x86_64` or `arm64` | +| `--extra-port-mappings` | string | — | JSON array of `{containerPort, hostPort, protocol}` | +| `--conn-details-output` | string | — | Path to write kubeconfig | +| `--tags` | map | — | Resource tags | + +### Destroy flags + +`--serverless`, `--force-destroy` + +### Action args struct populated + +`kind.KindArgs` → `pkg/provider/azure/action/kind/kind.go` diff --git a/specs/features/azure/linux-host.md b/specs/features/azure/linux-host.md new file mode 100644 index 000000000..57513082d --- /dev/null +++ b/specs/features/azure/linux-host.md @@ -0,0 +1,80 @@ +# Spec: Azure Linux Host (Fedora / Ubuntu) + +## Status +Implemented + +## Context +Provisions a generic Linux VM on Azure (Fedora or Ubuntu). Entry point: +`pkg/provider/azure/action/linux/`. CLI: `cmd/mapt/cmd/azure/hosts/linux.go`. +Also referenced as separate Fedora/Ubuntu targets in docs (`docs/azure/fedora.md`, `docs/azure/ubuntu.md`). + +This is a general-purpose Linux provisioner for Azure that accepts a configurable image reference. + +## Problem +This feature is implemented. This spec documents the current behaviour. + +## Requirements +- [x] Provision a Linux VM on Azure with a configurable Marketplace image (Fedora, Ubuntu, etc.) 
+- [x] Support spot (low-priority) VMs +- [x] Support optional CI integrations (GitHub runner, Cirrus worker, GitLab runner) +- [x] Write output files: `host`, `username`, `id_rsa` +- [x] `destroy` cleans up all resources and state + +## Out of Scope +- Azure RHEL (subscription-managed — see `010-azure-rhel-host.md`) +- AWS Fedora (see `008-aws-fedora-host.md`) + +## Affected Areas +- `pkg/provider/azure/action/linux/` +- `pkg/provider/azure/data/` — image reference lookup +- `cmd/mapt/cmd/azure/hosts/linux.go` +- `tkn/template/infra-azure-fedora.yaml` + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- `mapt azure linux create ...` provisions an accessible Linux VM +- SSH access works +- `mapt azure linux destroy ...` removes all resources + +--- + +## Command + +``` +mapt azure linux create [flags] # Ubuntu default; reused for Fedora with different version +mapt azure linux destroy [flags] +``` + +### Shared flag groups + +| Group | Source | Flags added | +|---|---|---| +| Common | `specs/cmd/params.md` | `--project-name`, `--backed-url` | +| Compute Request | `specs/cmd/params.md` | `--cpus`, `--memory`, `--arch`, `--nested-virt`, `--compute-sizes` | +| Spot | `specs/cmd/params.md` | `--spot`, `--spot-eviction-tolerance`, `--spot-increase-rate`, `--spot-excluded-regions` | +| Integrations | `specs/cmd/params.md` | `--ghactions-runner-*`, `--it-cirrus-pw-*`, `--glrunner-*` | +| Location | `specs/cmd/azure-params.md` | `--location` (default: `westeurope`) | + +### Target-specific flags (create only) + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--version` | string | `24.04` | OS version (Ubuntu format; `42` for Fedora) | +| `--arch` | string | `x86_64` | `x86_64` or `arm64` | +| `--username` | string | `rhqp` | OS username for SSH access | +| `--conn-details-output` | string | — | Path to write connection files | +| `--tags` | map | — | Resource tags | + +### Destroy flags + +*(none beyond common)* + +### 
Action args struct populated + +`linux.LinuxArgs` → `pkg/provider/azure/action/linux/linux.go` diff --git a/specs/features/azure/rhel-ai.md b/specs/features/azure/rhel-ai.md new file mode 100644 index 000000000..20c6b0f2e --- /dev/null +++ b/specs/features/azure/rhel-ai.md @@ -0,0 +1,80 @@ +# Spec: Azure RHEL AI Host + +## Status +Implemented + +## Context +Provisions a RHEL AI VM on Azure for AI/ML workloads. Entry point: +`pkg/provider/azure/action/rhel-ai/`. CLI: `cmd/mapt/cmd/azure/hosts/rhelai.go`. + +Mirrors the AWS RHEL AI target on Azure infrastructure, using GPU-capable VM sizes +and the RHEL AI Marketplace image. + +## Problem +This feature is implemented. This spec documents the current behaviour. + +## Requirements +- [x] Provision a RHEL AI VM on Azure using the Marketplace image +- [x] Target GPU-capable Azure VM sizes +- [x] Support spot (low-priority) VMs +- [x] Write output files: `host`, `username`, `id_rsa` +- [x] `destroy` cleans up all Azure resources and state + +## Out of Scope +- AWS RHEL AI (see `009-aws-rhel-ai.md`) +- Standard Azure RHEL (see `010-azure-rhel-host.md`) + +## Affected Areas +- `pkg/provider/azure/action/rhel-ai/` +- `cmd/mapt/cmd/azure/hosts/rhelai.go` +- `tkn/template/infra-azure-rhel-ai.yaml` + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- `mapt azure rhel-ai create ...` provisions an accessible RHEL AI VM +- SSH access works +- `mapt azure rhel-ai destroy ...` removes all resources + +--- + +## Command + +``` +mapt azure rhel-ai create [flags] +mapt azure rhel-ai destroy [flags] +``` + +### Shared flag groups + +| Group | Source | Flags added | +|---|---|---| +| Common | `specs/cmd/params.md` | `--project-name`, `--backed-url` | +| Compute Request | `specs/cmd/params.md` | `--cpus`, `--memory`, `--arch`, `--nested-virt`, `--compute-sizes` | +| Spot | `specs/cmd/params.md` | `--spot`, `--spot-eviction-tolerance`, `--spot-increase-rate`, `--spot-excluded-regions` | +| Location | 
`specs/cmd/azure-params.md` | `--location` (default: `westeurope`) | + +Note: no integration flags. + +### Target-specific flags (create only) + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--version` | string | `3.0.0` | RHEL AI version | +| `--accelerator` | string | `cuda` | GPU accelerator: `cuda` or `rocm` | +| `--custom-ami` | string | — | Custom image override | +| `--conn-details-output` | string | — | Path to write connection files | +| `--tags` | map | — | Resource tags | + +### Destroy flags + +`--serverless`, `--force-destroy`, `--keep-state` + +### Action args struct populated + +`rhelai.RHELAIArgs` → `pkg/provider/azure/action/rhelai/rhelai.go` diff --git a/specs/features/azure/rhel-host.md b/specs/features/azure/rhel-host.md new file mode 100644 index 000000000..6f2186a3d --- /dev/null +++ b/specs/features/azure/rhel-host.md @@ -0,0 +1,85 @@ +# Spec: Azure RHEL Host + +## Status +Implemented + +## Context +Provisions a RHEL VM on Azure. Entry point: `pkg/provider/azure/action/rhel/`. +CLI: `cmd/mapt/cmd/azure/hosts/rhel.go`. + +Azure RHEL uses Azure Marketplace images. Root disk expansion is handled via a shell script +(`expand-root-disk.sh`) run during cloud-init since Azure RHEL images often ship with a small root partition. + +## Problem +This feature is implemented. This spec documents the current behaviour. 
+ +## Requirements +- [x] Provision a RHEL VM on Azure using the Marketplace image +- [x] Expand the root disk during cloud-init to use the full allocated disk size +- [x] Support spot (Azure low-priority / spot VMs) via `azure/modules/allocation/` +- [x] Support optional CI integrations +- [x] Write output files: `host`, `username`, `id_rsa` +- [x] `destroy` cleans up all Azure resources and state + +## Out of Scope +- AWS RHEL (see `001-aws-rhel-host.md`) +- Azure RHEL AI (see `015-azure-rhel-ai.md`) + +## Affected Areas +- `pkg/provider/azure/action/rhel/` — including `expand-root-disk.sh` +- `pkg/provider/azure/modules/` — network, virtual-machine, allocation +- `cmd/mapt/cmd/azure/hosts/rhel.go` +- `tkn/template/infra-azure-rhel.yaml` + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- `mapt azure rhel create ...` provisions an accessible RHEL VM +- Root disk is expanded to the configured size +- SSH access works +- `mapt azure rhel destroy ...` removes all resources + +--- + +## Command + +``` +mapt azure rhel create [flags] +mapt azure rhel destroy [flags] +``` + +### Shared flag groups + +| Group | Source | Flags added | +|---|---|---| +| Common | `specs/cmd/params.md` | `--project-name`, `--backed-url` | +| Compute Request | `specs/cmd/params.md` | `--cpus`, `--memory`, `--arch`, `--nested-virt`, `--compute-sizes` | +| Spot | `specs/cmd/params.md` | `--spot`, `--spot-eviction-tolerance`, `--spot-increase-rate`, `--spot-excluded-regions` | +| Integrations | `specs/cmd/params.md` | `--ghactions-runner-*`, `--it-cirrus-pw-*`, `--glrunner-*` | +| Location | `specs/cmd/azure-params.md` | `--location` (default: `westeurope`) | + +### Target-specific flags (create only) + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--version` | string | `9.7` | RHEL major.minor version | +| `--arch` | string | `x86_64` | `x86_64` or `arm64` | +| `--username` | string | `rhqp` | OS username for SSH access | +| 
`--rh-subscription-username` | string | — | Red Hat subscription username | +| `--rh-subscription-password` | string | — | Red Hat subscription password | +| `--snc` | bool | false | Apply SNC profile | +| `--conn-details-output` | string | — | Path to write connection files | +| `--tags` | map | — | Resource tags | + +### Destroy flags + +*(none beyond common)* + +### Action args struct populated + +`rhel.RhelArgs` → `pkg/provider/azure/action/rhel/rhel.go` diff --git a/specs/features/azure/windows-desktop.md b/specs/features/azure/windows-desktop.md new file mode 100644 index 000000000..32e142a13 --- /dev/null +++ b/specs/features/azure/windows-desktop.md @@ -0,0 +1,83 @@ +# Spec: Azure Windows Desktop Host + +## Status +Implemented + +## Context +Provisions a Windows Desktop VM on Azure. Entry point: `pkg/provider/azure/action/windows/`. +CLI: `cmd/mapt/cmd/azure/hosts/windows.go`. + +This differs from the AWS Windows Server target: it targets Windows Desktop editions on Azure +and includes CI-specific setup scripts (`rhqp-ci-setup.ps1`). + +## Problem +This feature is implemented. This spec documents the current behaviour. 
+ +## Requirements +- [x] Provision a Windows Desktop VM on Azure using the specified Marketplace image +- [x] Run CI setup PowerShell scripts via custom script extension or userdata +- [x] Support optional spot (low-priority) VMs +- [x] Open security group rules for RDP (3389) and WinRM/SSH as needed +- [x] Write output files: `host`, `username`, `userpassword` +- [x] `destroy` cleans up all Azure resources and state + +## Out of Scope +- AWS Windows Server (see `002-aws-windows-server-host.md`) +- Azure RHEL or Linux (see `010-azure-rhel-host.md`, `013-azure-linux-host.md`) + +## Affected Areas +- `pkg/provider/azure/action/windows/` — including `rhqp-ci-setup.ps1` +- `cmd/mapt/cmd/azure/hosts/windows.go` +- `tkn/template/infra-azure-windows-desktop.yaml` + +## Acceptance Criteria + +### Unit + +- `make build` succeeds + +### Integration + +- `mapt azure windows create ...` provisions an accessible Windows VM +- RDP connection works with the output credentials +- `mapt azure windows destroy ...` removes all resources + +--- + +## Command + +``` +mapt azure windows create [flags] +mapt azure windows destroy [flags] +``` + +### Shared flag groups + +| Group | Source | Flags added | +|---|---|---| +| Common | `specs/cmd/params.md` | `--project-name`, `--backed-url` | +| Compute Request | `specs/cmd/params.md` | `--cpus`, `--memory`, `--arch`, `--nested-virt`, `--compute-sizes` | +| Spot | `specs/cmd/params.md` | `--spot`, `--spot-eviction-tolerance`, `--spot-increase-rate`, `--spot-excluded-regions` | +| Location | `specs/cmd/azure-params.md` | `--location` (default: `westeurope`) | + +Note: no integration flags. 
+ +### Target-specific flags (create only) + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--windows-version` | string | `11` | Windows major version | +| `--feature` | string | — | Windows feature/edition variant | +| `--username` | string | `rhqp` | Username for SSH access | +| `--admin-username` | string | `rhqpadmin` | Admin username for RDP access | +| `--profile` | []string | — | Setup profiles to apply (comma-separated) | +| `--conn-details-output` | string | — | Path to write connection files | +| `--tags` | map | — | Resource tags | + +### Destroy flags + +*(none beyond common)* + +### Action args struct populated + +`windows.WindowsArgs` → `pkg/provider/azure/action/windows/windows.go` diff --git a/specs/integrations/cirrus-ci.md b/specs/integrations/cirrus-ci.md new file mode 100644 index 000000000..403c9a0c6 --- /dev/null +++ b/specs/integrations/cirrus-ci.md @@ -0,0 +1,109 @@ +# Integration: Cirrus CI Persistent Worker + +**Package:** `github.com/redhat-developer/mapt/pkg/integrations/cirrus` + +Registers the provisioned machine as a Cirrus CI persistent worker at boot. +The cirrus-cli binary is downloaded and configured as a long-running service. + +See `specs/integrations/overview.md` for the shared interface and config flow. 
+ +--- + +## Type + +```go +type PersistentWorkerArgs struct { + Name string // Worker name — set to mCtx.RunID() by the action + Token string // Cirrus CI registration token (required) + Platform *Platform // Target OS: Linux | Darwin | Windows + Arch *Arch // Target arch: Amd64 | Arm64 + Labels map[string]string // Worker labels as key=value pairs +} +``` + +### Platform / Arch constants + +```go +var ( + Windows Platform = "windows" + Linux Platform = "linux" + Darwin Platform = "darwin" + + Arm64 Arch = "arm64" + Amd64 Arch = "amd64" +) +``` + +--- + +## Persistent Worker Version + +```go +var version = "v0.135.0" // overridden at build time via linker flag +``` + +Makefile variable: `CIRRUS_CLI` +Linker target: `pkg/integrations/cirrus.version` + +--- + +## Download URL Pattern + +``` +https://github.com/cirruslabs/cirrus-cli/releases/download/{version}/cirrus-{platform}-{arch} +https://github.com/cirruslabs/cirrus-cli/releases/download/{version}/cirrus-{platform}-{arch}.exe (Windows) +``` + +--- + +## Listen Port + +```go +var cirrusPort = "3010" +``` + +The worker listens on port `3010`. This port must be opened in the security group when +Cirrus integration is enabled — callers use `cirrus.CirrusPort()` to conditionally add +the ingress rule: + +```go +func CirrusPort() (*int, error) // returns nil, nil if Cirrus not configured +``` + +This is the only integration that requires an additional inbound port. 
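The download URL construction implied by the pattern above can be sketched as follows. This is a minimal illustration of the documented pattern only — the real `downloadURL()` helper inside `pkg/integrations/cirrus` may differ in shape:

```go
package main

import "fmt"

// cirrusDownloadURL builds the release asset URL per the documented
// pattern; the Windows binary carries an .exe suffix.
func cirrusDownloadURL(version, platform, arch string) string {
	url := fmt.Sprintf(
		"https://github.com/cirruslabs/cirrus-cli/releases/download/%s/cirrus-%s-%s",
		version, platform, arch)
	if platform == "windows" {
		url += ".exe"
	}
	return url
}

func main() {
	// With the compiled-in default version:
	fmt.Println(cirrusDownloadURL("v0.135.0", "linux", "amd64"))
	fmt.Println(cirrusDownloadURL("v0.135.0", "windows", "amd64"))
}
```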
+ +--- + +## Functions + +```go +func Init(args *PersistentWorkerArgs) // stores args as package-level state +func GetRunnerArgs() *PersistentWorkerArgs // returns nil if not configured +func GetToken() string // returns token or "" if not configured +func CirrusPort() (*int, error) // returns port int or nil if not configured +``` + +--- + +## UserDataValues populated + +| Field | Source | +|---|---| +| `CliURL` | `downloadURL()` — version + platform + arch | +| `Name` | `PersistentWorkerArgs.Name` | +| `Token` | `PersistentWorkerArgs.Token` | +| `Labels` | Map entries formatted as `key=value`, joined with `,` | +| `Port` | `"3010"` (fixed) | +| `User` | Set by `GetIntegrationSnippet` from `username` arg | +| `RepoURL`, `Executor` | Not used | + +--- + +## Script Templates + +Embedded at compile time: +- `snippet-linux.sh` — downloads binary, installs as systemd service +- `snippet-darwin.sh` — same flow for macOS +- `snippet-windows.ps1` — downloads `.exe`, installs as Windows service + +Template selection is based on `PersistentWorkerArgs.Platform`. diff --git a/specs/integrations/github-actions.md b/specs/integrations/github-actions.md new file mode 100644 index 000000000..bcacb1b72 --- /dev/null +++ b/specs/integrations/github-actions.md @@ -0,0 +1,98 @@ +# Integration: GitHub Actions Self-Hosted Runner + +**Package:** `github.com/redhat-developer/mapt/pkg/integrations/github` + +Registers the provisioned machine as a GitHub Actions self-hosted runner at boot. +The runner binary is downloaded and installed by the injected setup script. + +See `specs/integrations/overview.md` for the shared interface and config flow. 
+ +--- + +## Type + +```go +type GithubRunnerArgs struct { + Token string // GitHub runner registration token (required) + RepoURL string // Repository or organisation URL to register against (required) + Name string // Runner name — set to mCtx.RunID() by the action + Platform *Platform // Target OS: Linux | Darwin | Windows + Arch *Arch // Target arch: Amd64 | Arm64 | Arm + Labels []string // Runner labels, comma-joined before injection + User string // OS user to run as (set by cloud-config builder) +} +``` + +### Platform / Arch constants + +```go +var ( + Windows Platform = "win" + Linux Platform = "linux" + Darwin Platform = "osx" + + Arm64 Arch = "arm64" + Amd64 Arch = "x64" + Arm Arch = "arm" +) +``` + +--- + +## Runner Version + +```go +var runnerVersion = "2.317.0" // overridden at build time via linker flag +``` + +Makefile variable: `GITHUB_RUNNER` +Linker target: `pkg/integrations/github.runnerVersion` + +--- + +## Download URL Pattern + +``` +https://github.com/actions/runner/releases/download/v{version}/actions-runner-{platform}-{arch}-{version}.tar.gz +https://github.com/actions/runner/releases/download/v{version}/actions-runner-{platform}-{arch}-{version}.zip (Windows) +``` + +The URL is built by `downloadURL()` and injected as `UserDataValues.CliURL`. + +--- + +## Functions + +```go +func Init(args *GithubRunnerArgs) // stores args as package-level state +func GetRunnerArgs() *GithubRunnerArgs // returns nil if not configured +func GetToken() string // returns token or "" if not configured +``` + +`GetRunnerArgs()` implements `IntegrationConfig` (via pointer receiver methods on +`*GithubRunnerArgs`) — pass directly to `GetIntegrationSnippet`. 
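The URL pattern documented above can be sketched as a small helper. Note the version appears twice and the platform key for Windows is `win`; this is illustrative only — the real `downloadURL()` is internal to `pkg/integrations/github`:

```go
package main

import "fmt"

// runnerDownloadURL builds the release asset URL per the documented
// pattern: .tar.gz archives everywhere except Windows, which uses .zip.
func runnerDownloadURL(version, platform, arch string) string {
	ext := "tar.gz"
	if platform == "win" { // GitHub's Windows platform key is "win", not "windows"
		ext = "zip"
	}
	return fmt.Sprintf(
		"https://github.com/actions/runner/releases/download/v%s/actions-runner-%s-%s-%s.%s",
		version, platform, arch, version, ext)
}

func main() {
	fmt.Println(runnerDownloadURL("2.317.0", "linux", "x64"))
	fmt.Println(runnerDownloadURL("2.317.0", "win", "x64"))
}
```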
+ +--- + +## UserDataValues populated + +| Field | Source | +|---|---| +| `CliURL` | `downloadURL()` — version + platform + arch | +| `Name` | `GithubRunnerArgs.Name` | +| `Token` | `GithubRunnerArgs.Token` | +| `Labels` | `GithubRunnerArgs.Labels` joined with `,` | +| `RepoURL` | `GithubRunnerArgs.RepoURL` | +| `User` | Set by `GetIntegrationSnippet` from `username` arg | +| `Port`, `Executor` | Not used | + +--- + +## Script Templates + +Embedded at compile time: +- `snippet-linux.sh` — downloads `.tar.gz`, extracts, configures, starts as systemd service +- `snippet-darwin.sh` — same flow for macOS +- `snippet-windows.ps1` — downloads `.zip`, extracts, registers as Windows service + +Template selection is based on `GithubRunnerArgs.Platform`. diff --git a/specs/integrations/gitlab.md b/specs/integrations/gitlab.md new file mode 100644 index 000000000..9a3e154f7 --- /dev/null +++ b/specs/integrations/gitlab.md @@ -0,0 +1,146 @@ +# Integration: GitLab Runner + +**Package:** `github.com/redhat-developer/mapt/pkg/integrations/gitlab` + +Registers the provisioned machine as a GitLab runner. Unlike GitHub Actions and Cirrus CI, +GitLab registration requires creating a runner resource in GitLab itself to obtain an auth +token. mapt uses the Pulumi GitLab provider to create the runner as a Pulumi resource +inside the deploy stack — the auth token is resolved at provision time and injected into +the setup script. + +See `specs/integrations/overview.md` for the shared interface and config flow. + +--- + +## Type + +```go +type GitLabRunnerArgs struct { + GitLabPAT string // Personal Access Token for the Pulumi GitLab provider + ProjectID string // GitLab project ID — mutually exclusive with GroupID + GroupID string // GitLab group ID — mutually exclusive with ProjectID + URL string // GitLab instance URL (e.g. 
"https://gitlab.com") + Tags []string // Runner tags for job routing; empty = accepts untagged jobs + Name string // Runner description — set to mCtx.RunID() by the action + Platform *Platform // Target OS: Linux | Darwin | Windows + Arch *Arch // Target arch: Amd64 | Arm64 | Arm + User string // OS user to run as + AuthToken string // Set by Pulumi after CreateRunner(); not caller-supplied +} +``` + +### Platform / Arch constants + +```go +var ( + Windows Platform = "windows" + Linux Platform = "linux" + Darwin Platform = "darwin" + + Arm64 Arch = "arm64" + Amd64 Arch = "amd64" + Arm Arch = "arm" +) +``` + +--- + +## Runner Version + +```go +var version = "18.8.0" // overridden at build time via linker flag +``` + +Makefile variable: `GITLAB_RUNNER` +Linker target: `pkg/integrations/gitlab.version` + +--- + +## Download URL Pattern + +``` +https://gitlab-runner-downloads.s3.amazonaws.com/v{version}/binaries/gitlab-runner-{platform}-{arch} +https://gitlab-runner-downloads.s3.amazonaws.com/v{version}/binaries/gitlab-runner-{platform}-{arch}.exe (Windows) +``` + +--- + +## Pulumi Registration (key difference from other integrations) + +GitLab runners must be registered in GitLab before deployment. mapt handles this inside the +Pulumi deploy stack by calling `CreateRunner()`: + +```go +func CreateRunner(ctx *pulumi.Context, args *GitLabRunnerArgs) (pulumi.StringOutput, error) +``` + +This creates a `gitlab.UserRunner` Pulumi resource via the `pulumi-gitlab` provider, +authenticated with `GitLabPAT`. The resource returns an `AuthToken` as a `pulumi.StringOutput`. 
+ +The returned token is then wired via `ApplyT` into the userdata generation so it is available +when the cloud-init script is rendered: + +```go +token, err := gitlab.CreateRunner(ctx, glArgs) +// token is a pulumi.StringOutput resolved during stack apply +token.ApplyT(func(t string) string { + gitlab.SetAuthToken(t) + // generate userdata here using GetIntegrationSnippet + return t +}) +``` + +Exports added to the stack: `gitlab-runner-id`, `gitlab-runner-type`. + +### Project vs Group runner + +Exactly one of `ProjectID` or `GroupID` must be set — `CreateRunner` returns an error if +both or neither are provided: + +| Field set | Runner type | GitLab API | +|---|---|---| +| `ProjectID` | `project_type` | Scoped to a single project | +| `GroupID` | `group_type` | Shared across all projects in the group | + +--- + +## Functions + +```go +func Init(args *GitLabRunnerArgs) // stores args as package-level state +func GetRunnerArgs() *GitLabRunnerArgs // returns nil if not configured +func GetToken() string // returns AuthToken or "" if not configured +func SetAuthToken(token string) // called inside ApplyT after CreateRunner +func CreateRunner(ctx *pulumi.Context, args *GitLabRunnerArgs) (pulumi.StringOutput, error) +``` + +--- + +## UserDataValues populated + +| Field | Source | +|---|---| +| `CliURL` | `downloadURL()` — version + platform + arch | +| `Name` | `GitLabRunnerArgs.Name` | +| `Token` | `GitLabRunnerArgs.AuthToken` — set by Pulumi, not caller | +| `RepoURL` | `GitLabRunnerArgs.URL` | +| `User` | Set by `GetIntegrationSnippet` from `username` arg | +| `Labels`, `Port`, `Executor` | Not used | + +--- + +## Script Templates + +Embedded at compile time: +- `snippet-linux.sh` — downloads binary, registers runner, starts as systemd service +- `snippet-darwin.sh` — same flow for macOS +- `snippet-windows.ps1` — downloads `.exe`, installs as Windows service + +Template selection is based on `GitLabRunnerArgs.Platform`. 
+ +--- + +## Known Gaps + +- No Tekton task template includes the GitLab runner flags (verify and add) +- Tags are not surfaced in the setup script — only the Pulumi resource carries them diff --git a/specs/integrations/overview.md b/specs/integrations/overview.md new file mode 100644 index 000000000..50988f113 --- /dev/null +++ b/specs/integrations/overview.md @@ -0,0 +1,129 @@ +# Integrations: Overview + +Integrations allow any provisioned mapt target to register itself as a CI system agent +at boot, without manual setup. The integration is injected as a shell or PowerShell script +into the cloud-init `write_files` section. + +Three services are supported — each has its own spec: +- `specs/integrations/github-actions.md` — GitHub Actions self-hosted runner +- `specs/integrations/cirrus-ci.md` — Cirrus CI persistent worker +- `specs/integrations/gitlab.md` — GitLab runner (uses Pulumi for registration) + +--- + +## Shared Interface + +**Package:** `github.com/redhat-developer/mapt/pkg/integrations` + +### `IntegrationConfig` + +```go +type IntegrationConfig interface { + GetUserDataValues() *UserDataValues // nil = integration disabled + GetSetupScriptTemplate() string // embedded shell/PS1 template string +} +``` + +Every service implementation implements this interface. Returning `nil` from +`GetUserDataValues()` is the zero-value — it means the integration was not configured +and `GetIntegrationSnippet` returns an empty string. 
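The nil-means-disabled contract can be shown with a minimal, self-contained sketch — a trimmed `UserDataValues`, toy implementations, and a simplified snippet renderer. Names follow the spec but the bodies are illustrative, not the actual mapt implementation:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Trimmed version of the spec's UserDataValues for illustration.
type UserDataValues struct{ CliURL, User, Name, Token string }

type IntegrationConfig interface {
	GetUserDataValues() *UserDataValues // nil = integration disabled
	GetSetupScriptTemplate() string
}

// GetIntegrationSnippet renders the service's template; a nil
// UserDataValues yields an empty snippet, not an error.
func GetIntegrationSnippet(cfg IntegrationConfig, username string) (string, error) {
	v := cfg.GetUserDataValues()
	if v == nil {
		return "", nil
	}
	v.User = username // User is set from the caller-supplied username
	tmpl, err := template.New("snippet").Parse(cfg.GetSetupScriptTemplate())
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, v); err != nil {
		return "", err
	}
	return buf.String(), nil
}

type disabled struct{}

func (disabled) GetUserDataValues() *UserDataValues { return nil }
func (disabled) GetSetupScriptTemplate() string     { return "" }

type enabled struct{}

func (enabled) GetUserDataValues() *UserDataValues {
	return &UserDataValues{CliURL: "https://example.com/cli", Name: "run-1", Token: "t0k"}
}
func (enabled) GetSetupScriptTemplate() string {
	return "curl -o cli {{.CliURL}} && ./cli register --name {{.Name}} --user {{.User}}"
}

func main() {
	s, _ := GetIntegrationSnippet(disabled{}, "rhqp")
	fmt.Printf("%q\n", s) // ""
	s, _ = GetIntegrationSnippet(enabled{}, "rhqp")
	fmt.Println(s)
}
```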
+ +### `UserDataValues` + +```go +type UserDataValues struct { + CliURL string // download URL for the runner binary + User string // OS username — set automatically by GetIntegrationSnippet + Name string // runner/worker name (set to mCtx.RunID()) + Token string // registration/auth token + Labels string // comma-separated labels or key=value pairs + Port string // listen port (Cirrus only) + RepoURL string // repository or GitLab instance URL + Executor string // executor type (GitLab only) +} +``` + +Not all fields are used by every service — see the per-service spec for which fields +are populated. + +--- + +## Shared Functions + +### `GetIntegrationSnippet` + +```go +func GetIntegrationSnippet(intCfg IntegrationConfig, username string) (*string, error) +``` + +Renders the service's embedded script template with `UserDataValues`. Sets `User` from +`username` before rendering. Returns an empty string (not an error) when +`GetUserDataValues()` returns nil. + +### `GetIntegrationSnippetAsCloudInitWritableFile` + +```go +func GetIntegrationSnippetAsCloudInitWritableFile(intCfg IntegrationConfig, username string) (*string, error) +``` + +Same as `GetIntegrationSnippet` but indents every line by 6 spaces, ready to embed as +a `write_files` entry in a cloud-init YAML: + +```yaml +write_files: + - content: | + #!/bin/bash + # rendered snippet here — each line indented 6 spaces +``` + +--- + +## Config Flow + +Integration args enter via `ContextArgs` at `mc.Init()` time, which calls each package's +`Init()` to store them as package-level state: + +```go +// Caller sets one of (mutually exclusive in practice, but not validated): +mCtxArgs.GHRunnerArgs = &github.GithubRunnerArgs{...} +mCtxArgs.CirrusPWArgs = &cirrus.PersistentWorkerArgs{...} +mCtxArgs.GLRunnerArgs = &gitlab.GitLabRunnerArgs{...} + +// mc.Init() calls: +github.Init(ca.GHRunnerArgs) // nil-safe; sets package-level runnerArgs +cirrus.Init(ca.CirrusPWArgs) +gitlab.Init(ca.GLRunnerArgs) +``` + +Cloud-config 
builders then retrieve via `.GetRunnerArgs()` or +`.GetIntegrationConfig()` and pass the result to `GetIntegrationSnippet`. + +--- + +## Usage Pattern in a Cloud-Config Builder + +```go +// In pkg/target/host//.go: +snippet, err := integrations.GetIntegrationSnippetAsCloudInitWritableFile( + github.GetRunnerArgs(), // returns nil if not configured → empty snippet + username, +) +// Embed snippet into the cloud-init write_files section +``` + +--- + +## Known Gaps + +- No validation that at most one integration is configured (multiple could be set simultaneously) +- Runner versions are compile-time constants; upgrading requires a full rebuild and release +- The GitLab runner integration does not appear in the Tekton task templates (verify) + +--- + +## When to Extend + +Add a new file under `specs/integrations/` when: +- Adding a new CI system (e.g. Jenkins, TeamCity) +- Making runner versions runtime-configurable instead of compile-time +- Adding support for runner groups or additional registration parameters diff --git a/specs/integrations/tekton-tasks.md b/specs/integrations/tekton-tasks.md new file mode 100644 index 000000000..a98fe5c1c --- /dev/null +++ b/specs/integrations/tekton-tasks.md @@ -0,0 +1,61 @@ +# Spec: Tekton Task Bundles + +## Context +mapt ships a set of Tekton Task definitions for use in Tekton Pipelines. These allow CI pipelines +running on OpenShift/Kubernetes to dynamically provision and destroy remote targets as pipeline steps. + +Key files: +- `tkn/template/*.yaml` — source templates with `` and `` placeholders +- `tkn/*.yaml` — rendered task files (generated by `make tkn-update`) +- `Makefile` targets: `tkn-update`, `tkn-push` + +The bundle is published to `quay.io/redhat-developer/mapt:-tkn` as an OCI artifact +using the `tkn bundle push` command. + +## Problem +This feature is implemented. This spec documents the generation process and current task coverage. 
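The placeholder substitution that `make tkn-update` performs on the templates can be illustrated with a small sketch. The token names below are invented for the example — the real templates in `tkn/template/` define their own, and the actual substitution is done with `sed` in the Makefile:

```go
package main

import (
	"fmt"
	"strings"
)

// render stands in for the sed invocation: replace the image and
// version placeholders in a template with concrete values.
func render(tmpl, image, version string) string {
	return strings.NewReplacer(
		"__IMAGE__", image, // hypothetical token names
		"__VERSION__", version,
	).Replace(tmpl)
}

func main() {
	tmpl := "image: __IMAGE__\nversion: \"__VERSION__\""
	fmt.Println(render(tmpl, "quay.io/redhat-developer/mapt:v1.0.0", "1.0.0"))
}
```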
+ +## Requirements +- [ ] Template files in `tkn/template/` define tasks parametrically (``, ``) +- [ ] `make tkn-update IMG=... VERSION=...` renders all templates to `tkn/` using `sed` +- [ ] `make tkn-push` bundles all rendered tasks and pushes to the OCI registry +- [ ] Each target has a corresponding Tekton task with `create` and `destroy` steps +- [ ] Task parameters mirror the CLI flags for the corresponding `mapt` subcommand +- [ ] Tasks use the mapt container image and pass `--serverless` flag for role-based credentials + +## Current Task Coverage +| Task file | Target | +|-----------|--------| +| `infra-aws-rhel.yaml` | AWS RHEL host | +| `infra-aws-rhel-ai.yaml` | AWS RHEL AI host | +| `infra-aws-windows-server.yaml` | AWS Windows Server host | +| `infra-aws-fedora.yaml` | AWS Fedora host | +| `infra-aws-mac.yaml` | AWS Mac host | +| `infra-aws-kind.yaml` | AWS Kind cluster | +| `infra-aws-ocp-snc.yaml` | AWS OpenShift SNC | +| `infra-azure-aks.yaml` | Azure AKS | +| `infra-azure-rhel.yaml` | Azure RHEL host | +| `infra-azure-rhel-ai.yaml` | Azure RHEL AI host | +| `infra-azure-fedora.yaml` | Azure Linux/Fedora host | +| `infra-azure-windows-desktop.yaml` | Azure Windows Desktop | + +## Out of Scope +- GitHub Actions workflow files (`.github/workflows/`) — those are for mapt's own CI, not for consumers +- Direct CLI usage (that is the primary usage documented in `docs/`) + +## Affected Areas +- `tkn/template/` — source templates +- `tkn/` — generated (do not edit directly) +- `Makefile` — `tkn-update` and `tkn-push` targets +- `.github/workflows/tkn-bundle.yaml` — CI workflow that runs `tkn-push` + +## Known Gaps / Improvement Ideas +- Azure Kind task is missing from the bundle (no `infra-azure-kind.yaml` template) +- AWS Mac Pool service has no Tekton task +- Task parameters are not validated beyond what Tekton's type system offers (string/array) +- No Tekton task for `aws mac-pool request` / `release` operations + +## Acceptance Criteria +- `make 
tkn-update IMG=quay.io/redhat-developer/mapt:v1.0.0 VERSION=1.0.0` regenerates `tkn/*.yaml` +- `make tkn-push` successfully pushes the bundle to the registry +- A Tekton Pipeline can reference the bundled tasks and successfully provision/destroy a target diff --git a/specs/project-context.md b/specs/project-context.md new file mode 100644 index 000000000..69eb0dcfe --- /dev/null +++ b/specs/project-context.md @@ -0,0 +1,363 @@ +# mapt — Project Context + +## What This Project Is + +mapt (Multi Architecture Provisioning Tool) is a Go CLI that provisions ephemeral compute environments +across AWS and Azure using the Pulumi Automation API. It is used primarily by CI/CD pipelines that +need on-demand remote machines of specific OS/arch combinations. + +Key design goals: +- **Cost savings**: prefer spot instances with cross-region best-bid selection +- **Speed**: use AMI fast-launch, root volume replacement, pre-baked images +- **Safety**: self-destruct via serverless scheduled tasks (timeout mode) +- **Integration**: emit connection details (host, username, key/password) as output files consumed by CI systems + +## Repository Layout + +``` +cmd/mapt/cmd/ CLI commands (Cobra), one file per target + params/ Shared flag definitions, Add*Flags helpers, *Args() readers + — see specs/cmd/params.md + aws/hosts/ AWS host subcommands (rhel, windows, fedora, mac, rhelai) + aws/services/ AWS service subcommands (eks, kind, mac-pool, snc) + azure/hosts/ Azure host subcommands (rhel, windows, linux, rhelai) + azure/services/ Azure service subcommands (aks, kind) + +pkg/manager/ Pulumi Automation API wrapper + context/ Context type — carries project/run metadata, integrations + credentials/ Provider credential helpers + +pkg/provider/ + api/ Shared API types and interfaces (ComputeRequest, SpotArgs, SpotSelector, ComputeSelector, CloudConfig) + — see specs/api/provider-interfaces.md + aws/ + action/ Entry points per target: Create(), Destroy() orchestrate stacks + modules/ Reusable 
Pulumi stack components + allocation/ Spot vs on-demand region/AZ selection + ami/ AMI copy + fast-launch + bastion/ Bastion host for airgap scenarios + ec2/compute/ EC2 instance resource + iam/ IAM roles/policies + mac/ Mac dedicated host + machine lifecycle + network/ Standard and airgap VPC/subnet/LB + serverless/ ECS Fargate scheduled self-destruct + spot/ Best-spot-option Pulumi stack + data/ AWS SDK read-only queries (AMI, AZ, spot price, etc.) + services/ Low-level Pulumi resource wrappers (keypair, SG, S3, SSM, VPC) + azure/ + action/ Entry points per target + modules/ Azure network, VM, allocation + data/ Azure SDK queries + services/ Azure Pulumi resource wrappers + util/ Shared: command readiness, output writing, security, windows helpers + +pkg/integrations/ CI system integration snippets + github/ GitHub Actions self-hosted runner + cirrus/ Cirrus CI persistent worker + gitlab/ GitLab runner + +pkg/target/ Cloud-init / userdata builders per OS target + host/rhel/ RHEL cloud-config (base + SNC variant) + host/fedora/ Fedora cloud-config + host/rhelai/ RHEL AI API wrapper + host/windows-server/ Windows PowerShell userdata + service/kind/ Kind cloud-config + service/snc/ OpenShift SNC cloud-config + profile deployment + profile/ SNC profiles: virtualization, serverless, servicemesh + +pkg/util/ Generic utilities (cache, cloud-init, file, logging, maps, network, slices) + +tkn/ Tekton Task YAML files (generated from tkn/template/ by make tkn-update) +docs/ User-facing documentation per target +specs/ Developer/contributor artifacts + project-context.md Project knowledge base (this file) + features/ Feature specifications (one file per user-facing capability) + aws/ AWS provisioning targets + azure/ Azure provisioning targets + 000-template.md Spec template — use this for every new feature + api/ Module interface contracts (types, signatures) — see specs/api/ + cmd/ CLI parameter definitions and shared flag groups — see specs/cmd/params.md + integrations/ 
CI system integrations that mapt provisions into (GitHub, Cirrus, GitLab, Tekton) + cicd/ mapt's own build/test/release pipeline specs +``` + +## Key Types + +```go +// manager/context.ContextArgs — input to every action Create()/Destroy() +type ContextArgs struct { + ProjectName string + BackedURL string // "s3://bucket/path" or "file:///local/path" + ResultsOutput string // directory where output files are written + Serverless bool // use role-based credentials (ECS task context) + ForceDestroy bool + KeepState bool + Tags map[string]string + GHRunnerArgs *github.GithubRunnerArgs // optional integration + CirrusPWArgs *cirrus.PersistentWorkerArgs + GLRunnerArgs *gitlab.GitLabRunnerArgs +} + +// manager.Stack — describes a Pulumi stack to run +type Stack struct { + ProjectName string + StackName string + BackedURL string + DeployFunc pulumi.RunFunc + ProviderCredentials credentials.ProviderCredentials +} + +// provider/aws/modules/allocation.AllocationResult — result of spot/on-demand selection +type AllocationResult struct { + Region *string + AZ *string + SpotPrice *float64 // nil if on-demand + InstanceTypes []string +} +``` + +## Module Reuse Contract + +**This is the most important architectural rule in mapt.** + +Logic that exists in a module MUST be reused, never reimplemented. The layers are: + +- `modules/` — reusable Pulumi stack components. Always call these; never inline their logic into an action. +- `services/` — low-level Pulumi resource wrappers. Always use these; never call Pulumi provider resources directly from an action. +- `data/` — read-only cloud API queries. Always use these; never call AWS/Azure SDKs directly from an action. +- `action/` — the only layer allowed to contain orchestration logic specific to a single target. + +When writing a spec or implementing a feature, explicitly list which existing modules are called +(Must Reuse) separately from which new files are created (Must Create). 
This is the distinction +the spec template enforces. + +### AWS EC2 Host — Mandatory Module Sequence + +Every AWS EC2-based host target calls these modules in this order. Deviation requires justification. + +**`Create()` function:** +``` +mc.Init(mCtxArgs, aws.Provider()) +allocation.Allocation(mCtx, &AllocationArgs{...}) // spot or on-demand +r.createMachine() | r.createAirgapMachine() +``` + +**`deploy()` Pulumi RunFunc — always in this order:** +``` +amiSVC.GetAMIByName() // AMI lookup +network.Create() // VPC, subnet, IGW, optional LB, optional airgap +keypair.KeyPairRequest.Create() // TLS keypair → export -id_rsa +securityGroup.SGRequest.Create() // security group with ingress rules +.Generate() // cloud-init / userdata +compute.ComputeRequest.NewCompute() // EC2 instance +serverless.OneTimeDelayedTask() // only when Timeout != "" +c.Readiness() // remote command readiness check +``` + +**`Destroy()` function — always in this order:** +``` +aws.DestroyStack() +spot.Destroy() guarded by spot.Exist() // only if spot was used +amiCopy.Destroy() guarded by amiCopy.Exist() // only if AMI copy was needed (Windows) +aws.CleanupState() +``` + +**`manageResults()` function:** +``` +bastion.WriteOutputs() // only when airgap=true +output.Write() // always — writes host/username/key files +``` + +**Naming — non-negotiable:** +``` +resourcesUtil.GetResourceName(prefix, componentID, suffix) // all resource names +mCtx.StackNameByProject(stackName) // all Pulumi stack names +``` + +### AWS EC2 Host — Files to Create (only these) + +For each new AWS EC2 target, exactly these files are created — everything else is reused: + +``` +pkg/provider/aws/action//.go // Args struct, Create, Destroy, deploy, manageResults, securityGroups +pkg/provider/aws/action//constants.go // stackName, componentID, AMI regex, disk size, ports +pkg/target/host// // cloud-config or userdata builder +cmd/mapt/cmd/aws/hosts/.go // Cobra create/destroy subcommands +tkn/template/infra-aws-.yaml // 
Tekton task template +``` + +### Azure VM Host — Mandatory Module Sequence + +**`Create()` function:** +``` +mc.Init(mCtxArgs, azure.Provider()) +allocation.Allocation(mCtx, &AllocationArgs{...}) // azure spot or on-demand +``` + +**`deploy()` Pulumi RunFunc:** +``` +azure resource group +azure/modules/network.Create() // VNet, subnet, NIC, optional public IP +keypair or password generation +azure/services/network/security-group.SGRequest.Create() +virtualmachine.NewVM() // Azure VM resource +readiness check via remote command +``` + +**`Destroy()` function:** +``` +azure.DestroyStack() +azure.CleanupState() +``` + +### Adding a New AWS Host Target + +1. **Args struct** in `pkg/provider/aws/action//.go` + - Embed `*cr.ComputeRequestArgs`, `*spotTypes.SpotArgs` + - Include `Prefix`, `Airgap bool`, `Timeout string` + +2. **`Create()`**: `mc.Init` → `allocation.Allocation` → `createMachine` or `createAirgapMachine` + +3. **`deploy()`**: follow the mandatory module sequence above exactly + +4. **`Destroy()`**: follow the mandatory destroy sequence above exactly + +5. **`manageResults()`**: `bastion.WriteOutputs` (if airgap) then `output.Write` + +6. **Cobra command** in `cmd/mapt/cmd/aws/hosts/.go` + - Subcommands: `create`, `destroy`; bind all flags + +7. **Tekton template** in `tkn/template/infra-aws-.yaml` + +### Airgap Orchestration + +Two-phase stack update on the same stack: +1. `airgapPhaseConnectivity = network.ON` — creates NAT gateway, bootstraps machine +2. `airgapPhaseConnectivity = network.OFF` — removes NAT gateway, machine loses egress + +### Spot vs On-Demand (Allocation Module) + +`allocation.Allocation()` is the single entry point. 
It: +- If `Spot.Spot == true`: creates/reuses a `spotOption` Pulumi stack that selects best region + AZ + price +- If on-demand: uses the provider's default region, iterates AZs until instance types are available + +The spot stack is idempotent — if it already exists, outputs are reused (region stays stable across re-creates). + +### Serverless Self-Destruct + +`serverless.OneTimeDelayedTask()` creates an AWS EventBridge Scheduler + Fargate task that runs +`mapt destroy` at `now + timeout`. Requires a remote BackedURL (not `file://`). + +### Integration Snippets + +Each integration (`github`, `cirrus`, `gitlab`) implements `IntegrationConfig`: +- `GetUserDataValues()` returns token, repo URL, labels, etc. +- `GetSetupScriptTemplate()` returns an embedded shell/PowerShell script template +- Called from cloud-config / userdata builders in `pkg/target/` + +### SNC Profiles + +Profiles are registered in `pkg/target/service/snc/profile/profile.go`: +- `virtualization` — enables nested virt on the compute instance +- `serverless-serving`, `serverless-eventing`, `serverless` — Knative +- `servicemesh` — OpenShift Service Mesh 3 + +`profile.RequireNestedVirt()` gates the instance type selection. +`profile.Deploy()` installs operators/CRDs via the Pulumi Kubernetes provider post-cluster-ready. + +## Spec-Driven Development + +All features are spec-anchored: a spec file must exist and be `Accepted` before implementation +is merged. Specs are the source of truth for both human reviewers and AI agents. + +### Spec Status Lifecycle + +``` +Draft → Accepted → Implemented → Deprecated +``` + +- **Draft** — written but not yet reviewed; cannot be implemented +- **Accepted** — reviewed and approved; ready for implementation (triggers `/implement`) +- **Implemented** — all Must Create files exist; all Requirements checked `[x]` +- **Deprecated** — superseded or removed; kept for history + +### Spec Template + +Every feature spec uses `specs/features/000-template.md`. 
Required sections: + +| Section | Purpose | +|---|---| +| `## Status` | Lifecycle state (see above) | +| `## Context` | Background, affected files, links to related specs | +| `## Problem` | What is missing or broken | +| `## Requirements` | Checkbox list — `[x]` when implemented | +| `## Out of Scope` | Explicit exclusions | +| `## Must Reuse` | Existing modules that MUST be called; never reimplement | +| `## Must Create` | New files to write; nothing else should be created | +| `## Tasks` | Ordered implementation checklist for Draft/Accepted specs; deleted when Implemented | +| `## Acceptance Criteria` | Split into `### Unit` (no cloud needed) and `### Integration` (manual/nightly) | + +Optional: `## Design` for non-trivial data flow or error handling decisions. +Optional: `## Jira` for issue tracking link. + +### Spec Directory Conventions + +- `specs/features//.md` — user-facing provisioning capabilities +- `specs/features/cicd/.md` — CI/CD pipeline workflows +- `specs/api//.md` — module interface contracts (types, function signatures) +- `specs/cmd/params.md` — shared CLI flag groups +- `specs/integrations/.md` — CI system integrations mapt provisions into +- `specs/cicd/.md` — mapt's own build/test pipeline + +### PR Workflow (Current) + +New features follow a two-stage review within a single Draft PR: + +1. Open **Draft PR** with only the spec file (`Status: Accepted`) +2. CI runs `spec-lint` — validates required sections, blocks if `Status: Draft` +3. Reviewer approves spec → posts `/implement` comment +4. Agent implements all files in Must Create, calls Must Reuse in mandatory order +5. CI re-runs `make build && make test` → PR promoted to Ready for Review +6. Second review covers implementation only; merge + +See `specs/features/cicd/spec-driven-pr-workflow.md` for the full workflow spec. + +## Build & Test Commands + +```bash +make build # compile to out/mapt +make install # go install to $GOPATH/bin +make test # go test -race ./pkg/... ./cmd/... 
+make lint # golangci-lint +make fmt # gofmt +make check # build + test + lint + renovate-check +make oci-build # container image (amd64 + arm64) +make tkn-update # regenerate tkn/*.yaml from templates +make tkn-push # push Tekton bundle +``` + +## Naming Conventions + +- Resource names: `resourcesUtil.GetResourceName(prefix, componentID, suffix)` + e.g. `GetResourceName("main", "aws-rhel", "sg")` → `"main-aws-rhel-sg"` +- Stack names: `mCtx.StackNameByProject(stackName)` → `"-"` +- Output keys: `"-host"`, `"-username"`, `"-id_rsa"`, `"-userpassword"` +- Constants: defined in `constants.go` / `contants.go` next to the action file + +## State Backend + +Pulumi state is stored at `BackedURL`: +- Remote: `s3://bucket/prefix` (required for serverless timeout and mac pool) +- Local: `file:///path/to/dir` (dev/testing only; incompatible with timeout) + +After `Destroy`, `aws.CleanupState()` removes the S3 state files unless `KeepState` is set. + +## Dependencies + +- **Pulumi Automation API** (`github.com/pulumi/pulumi/sdk/v3/go/auto`) — all infra is managed via inline stacks +- **AWS SDK v2** — read-only queries (spot prices, AMI lookup, AZ enumeration) +- **Azure SDK for Go** — read-only queries (VM sizes, image refs, locations) +- **Cobra + Viper** — CLI parsing +- **go-playground/validator** — struct validation before stack creation +- **logrus** — structured logging +- **freecache** — in-process caching for expensive cloud API calls
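
As a concrete illustration of the naming conventions listed above, here is a minimal stdlib-only sketch of the documented join behavior. This is a hypothetical re-implementation for illustration — mapt's real helpers live in its `resourcesUtil` and manager context packages and may differ in detail; the `"myproj"` project name and `stackNameByProject` signature are assumptions based on the `"<project>-<stack>"` pattern described in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// getResourceName mimics the documented behavior of
// resourcesUtil.GetResourceName: join prefix, componentID, and
// suffix with hyphens. Illustrative only, not mapt's actual code.
func getResourceName(prefix, componentID, suffix string) string {
	return strings.Join([]string{prefix, componentID, suffix}, "-")
}

// stackNameByProject mimics mCtx.StackNameByProject: prefix the
// stack name with the project name (assumption based on the
// "<project>-<stack>" pattern above).
func stackNameByProject(projectName, stackName string) string {
	return projectName + "-" + stackName
}

func main() {
	// Example taken directly from the Naming Conventions section.
	fmt.Println(getResourceName("main", "aws-rhel", "sg")) // main-aws-rhel-sg
	// Hypothetical project name for illustration.
	fmt.Println(stackNameByProject("myproj", "stackRhel"))
}
```

Following these two helpers everywhere (rather than hand-formatting names in each action) is what keeps resource and stack names collision-free across targets sharing a backend.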