fix: don't create empty-zone offerings for zone-restricted SKUs#1613
Closed
k24dizzle wants to merge 2 commits intoAzure:mainfrom
Closed
fix: don't create empty-zone offerings for zone-restricted SKUs#1613k24dizzle wants to merge 2 commits intoAzure:mainfrom
k24dizzle wants to merge 2 commits intoAzure:mainfrom
Conversation
940437c to
856e6e6
Compare
Author
|
@microsoft-github-policy-service agree |
Collaborator
|
Closing in favor of #1615 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Fixes #1384
When using
TopologySpreadConstraintswithwhenUnsatisfiable: DoNotSchedule, Karpenter refuses to provision new nodes even when valid zones are available.The root cause is that some VM SKUs have
NotAvailableForSubscriptionzone restrictions covering all zones in a region.skewercorrectly strips those blocked zones, leaving an empty zone set.instanceTypeZones()then falls through tosets.New("")— a fallback originally intended for truly non-zonal regions — causing these SKUs to appear as available withzone="".Karpenter core treats
""as a valid topology domain, so instead of seeing zone counts[2, 1, 1]across 3 real zones, the scheduler sees[0, 2, 1, 1]across 4 domains. WithmaxSkew: 1, this makes the constraint unsatisfiable even though valid zones exist.See kubernetes-sigs/karpenter#2864 for an attempt to fix this in the upstream karpenter, this PR is an attempt to fix this here in the azure provider by not offering these instances in the first place.
Fix
Add
hasRawZonesInRegion()to checklocationInfodirectly (before restriction filtering). If a SKU has zones defined in the region but none are available after restrictions are applied, return an empty set instead ofsets.New(""). This causes the SKU to produce no offerings and be dropped from the instance type list. Thesets.New("")path for truly non-zonal regions is preserved.Verification
In affected subscriptions, SKUs like
Standard_L48s_v3report zone support inlocationInfobut carryNotAvailableForSubscriptionrestrictions covering all zones. After the fix, instance type count in westus2 drops from 507 → 496,zone=""offerings drop to 0, and TSC scheduling withwhenUnsatisfiable: DoNotScheduleworks correctly.Inspecting a zone-restricted SKU with
az vm list-skusaz vm list-skus \ --location westus2 \ --resource-type virtualMachines \ --query "[?name=='Standard_L48s_v3'].{name:name,locationInfo:locationInfo,restrictions:restrictions}" \ --output json{ "locationInfo": [{ "zones": ["1", "2", "3"] }], "restrictions": [{ "reasonCode": "NotAvailableForSubscription", "type": "Zone", "restrictionInfo": { "zones": ["1", "2", "3"] } }] }