fix: don't create empty-zone offerings for zone-restricted SKUs #1613

Closed
k24dizzle wants to merge 2 commits into Azure:main from k24dizzle:fix/zone-restricted-sku-empty-zone-offering

@k24dizzle commented Apr 2, 2026

Problem

Fixes #1384

When using TopologySpreadConstraints with whenUnsatisfiable: DoNotSchedule, Karpenter refuses to provision new nodes even when valid zones are available.

The root cause is that some VM SKUs have NotAvailableForSubscription zone restrictions covering all zones in a region. skewer correctly strips those blocked zones, leaving an empty zone set. instanceTypeZones() then falls through to sets.New("") — a fallback originally intended for truly non-zonal regions — causing these SKUs to appear as available with zone="".

Karpenter core treats "" as a valid topology domain, so instead of seeing zone counts [2, 1, 1] across 3 real zones, the scheduler sees [0, 2, 1, 1] across 4 domains. With maxSkew: 1, this makes the constraint unsatisfiable even though valid zones exist.
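The effect of the extra domain can be checked with a small sketch. It assumes the usual topology spread rule, where a pod may land in a domain only if `count[domain] + 1 - min(counts) <= maxSkew`. This is a simplified model, not Karpenter's scheduler code, and the zone names and the `admissibleDomains` helper are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// admissibleDomains returns the domains that can take one more matching pod
// without exceeding maxSkew, using count[domain] + 1 - min(counts) <= maxSkew.
// Simplified model for illustration; not Karpenter's actual implementation.
func admissibleDomains(counts map[string]int, maxSkew int) []string {
	lowest, first := 0, true
	for _, c := range counts {
		if first || c < lowest {
			lowest, first = c, false
		}
	}
	var out []string
	for domain, c := range counts {
		if c+1-lowest <= maxSkew {
			out = append(out, domain)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Three real zones with counts [2, 1, 1] (zone names are illustrative).
	realZones := map[string]int{"zone-1": 2, "zone-2": 1, "zone-3": 1}
	fmt.Printf("%q\n", admissibleDomains(realZones, 1)) // ["zone-2" "zone-3"]

	// The phantom "" domain adds a count of 0 and lowers the minimum.
	withEmpty := map[string]int{"": 0, "zone-1": 2, "zone-2": 1, "zone-3": 1}
	fmt.Printf("%q\n", admissibleDomains(withEmpty, 1)) // [""]
}
```

With the phantom domain present, no real zone passes the check; only `""` does, and it does not correspond to a zone where a node can actually be created.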

See kubernetes-sigs/karpenter#2864 for an attempt to fix this upstream in Karpenter core. This PR instead fixes it here in the Azure provider by not offering these instances in the first place.

Fix

Add hasRawZonesInRegion() to check locationInfo directly (before restriction filtering). If a SKU has zones defined in the region but none are available after restrictions are applied, return an empty set instead of sets.New(""). This causes the SKU to produce no offerings and be dropped from the instance type list. The sets.New("") path for truly non-zonal regions is preserved.
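A minimal sketch of the decision described above, using only the standard library. The `sku` struct, the function signatures, and the map-based set are assumptions made for illustration; the actual change operates on the provider's SKU type and `sets.New("")`, and may differ in detail.

```go
package main

import "fmt"

// sku is a hypothetical stand-in for the provider's SKU type.
type sku struct {
	rawZones       []string // zones listed in locationInfo, before restriction filtering
	availableZones []string // zones remaining after restrictions are applied
}

// hasRawZonesInRegion reports whether locationInfo lists any zone at all.
// The name comes from the PR description; this signature is assumed.
func hasRawZonesInRegion(s sku) bool {
	return len(s.rawZones) > 0
}

// instanceTypeZones returns the set of zones to create offerings for.
func instanceTypeZones(s sku) map[string]struct{} {
	zones := map[string]struct{}{}
	for _, z := range s.availableZones {
		zones[z] = struct{}{}
	}
	if len(zones) > 0 {
		return zones
	}
	if hasRawZonesInRegion(s) {
		// Zonal SKU with every zone restricted: return an empty set so the
		// SKU produces no offerings.
		return zones
	}
	// Truly non-zonal region: keep the "" fallback.
	zones[""] = struct{}{}
	return zones
}

func main() {
	restricted := sku{rawZones: []string{"1", "2", "3"}}
	nonZonal := sku{}
	fmt.Println(len(instanceTypeZones(restricted))) // 0
	fmt.Println(len(instanceTypeZones(nonZonal)))   // 1
}
```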

Verification

In affected subscriptions, SKUs like Standard_L48s_v3 report zone support in locationInfo but carry NotAvailableForSubscription restrictions covering all zones. After the fix, instance type count in westus2 drops from 507 → 496, zone="" offerings drop to 0, and TSC scheduling with whenUnsatisfiable: DoNotSchedule works correctly.

Inspecting a zone-restricted SKU with az vm list-skus
az vm list-skus \
  --location westus2 \
  --resource-type virtualMachines \
  --query "[?name=='Standard_L48s_v3'].{name:name,locationInfo:locationInfo,restrictions:restrictions}" \
  --output json
{
  "locationInfo": [{ "zones": ["1", "2", "3"] }],
  "restrictions": [{
    "reasonCode": "NotAvailableForSubscription",
    "type": "Zone",
    "restrictionInfo": { "zones": ["1", "2", "3"] }
  }]
}
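The same condition can be checked programmatically. The sketch below parses a single SKU object with the fields shown above and reports whether every zone listed in `locationInfo` is covered by a `NotAvailableForSubscription` zone restriction. The `skuInfo` type and `allZonesRestricted` helper are hypothetical, and real `az vm list-skus` output is an array of such objects with more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// skuInfo mirrors only the fields shown in the output above.
type skuInfo struct {
	LocationInfo []struct {
		Zones []string `json:"zones"`
	} `json:"locationInfo"`
	Restrictions []struct {
		ReasonCode      string `json:"reasonCode"`
		Type            string `json:"type"`
		RestrictionInfo struct {
			Zones []string `json:"zones"`
		} `json:"restrictionInfo"`
	} `json:"restrictions"`
}

// allZonesRestricted reports whether the SKU lists at least one zone and
// every listed zone is blocked by a NotAvailableForSubscription restriction.
func allZonesRestricted(raw []byte) (bool, error) {
	var s skuInfo
	if err := json.Unmarshal(raw, &s); err != nil {
		return false, err
	}
	blocked := map[string]bool{}
	for _, r := range s.Restrictions {
		if r.Type == "Zone" && r.ReasonCode == "NotAvailableForSubscription" {
			for _, z := range r.RestrictionInfo.Zones {
				blocked[z] = true
			}
		}
	}
	listed := 0
	for _, li := range s.LocationInfo {
		for _, z := range li.Zones {
			listed++
			if !blocked[z] {
				return false, nil
			}
		}
	}
	return listed > 0, nil
}

func main() {
	doc := []byte(`{
	  "locationInfo": [{ "zones": ["1", "2", "3"] }],
	  "restrictions": [{
	    "reasonCode": "NotAvailableForSubscription",
	    "type": "Zone",
	    "restrictionInfo": { "zones": ["1", "2", "3"] }
	  }]
	}`)
	restricted, err := allZonesRestricted(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(restricted) // true
}
```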

@k24dizzle force-pushed the fix/zone-restricted-sku-empty-zone-offering branch from 940437c to 856e6e6 on April 2, 2026 03:17
@k24dizzle (Author)

@microsoft-github-policy-service agree

@tallaxes (Collaborator)

Closing in favor of #1615

@tallaxes closed this Apr 21, 2026

Development

Successfully merging this pull request may close these issues.

Topology spread constraints fail with empty zone domain in counts