{AKS} az aks create/update: add --enable/--disable-control-plane-metrics #33537

Merged
yanzhudd merged 6 commits into Azure:dev from bragi92:kadubey/aks-control-plane-metrics on Jun 16, 2026

Conversation

@bragi92

@bragi92 bragi92 commented Jun 11, 2026

Member

Surface azureMonitorProfile.metrics.controlPlane.enabled so users can opt clusters in/out of Azure Monitor managed Prometheus control-plane metrics (controlplane-apiserver, controlplane-etcd) via the first-class API property — replaces the AFEC-gated preview.

New flags:

  • az aks create: --enable-control-plane-metrics (--enable-cp-metrics)
  • az aks update: --enable-control-plane-metrics (--enable-cp-metrics)
  • az aks update: --disable-control-plane-metrics (--disable-cp-metrics)
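
As a rough illustration of how a long flag and its short alias can share one destination, here is a minimal sketch using the standard library's `argparse`. This is an assumption for illustration only: azure-cli registers arguments through its own framework in `_params.py`, and `build_update_parser` is a hypothetical name that does not appear in the PR.

```python
import argparse


def build_update_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the `az aks update` flags (argparse, not azure-cli)."""
    parser = argparse.ArgumentParser(prog="aks-update-sketch")
    # The long flag and its short alias both write to the same destination.
    parser.add_argument(
        "--enable-control-plane-metrics", "--enable-cp-metrics",
        dest="enable_control_plane_metrics", action="store_true",
    )
    parser.add_argument(
        "--disable-control-plane-metrics", "--disable-cp-metrics",
        dest="disable_control_plane_metrics", action="store_true",
    )
    return parser


args = build_update_parser().parse_args(["--enable-cp-metrics"])
print(args.enable_control_plane_metrics, args.disable_control_plane_metrics)
# prints: True False
```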

Enable requires Azure Monitor metrics to already be on or to be enabled in the same command via --enable-azure-monitor-metrics. Enable + disable in the same command, or enable-CP + --disable-azure-monitor-metrics, are rejected client-side with MutuallyExclusiveArgumentError.

Greenfield race fix:
On aks create, metrics.controlPlane.enabled=true is intentionally NOT set on the initial cluster PUT. Otherwise the RP would schedule the control-plane-metrics collection (CCP) pod before the DCRA is created in postprocessing (link_azure_monitor_profile_artifacts), causing the CCP pod to crash-loop with "DCRA not found" until reconciliation. The flip is deferred to the existing post-DCRA addon_put PUT, so the CCP pod is scheduled only after its DCRA exists. The update path is unchanged — brownfield updates target a cluster whose DCRA already exists, so there is no race.
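
The ordering described above can be sketched with an in-memory stand-in for the resource provider. Everything here (`FakeResourceProvider`, `create_cluster`, the call labels) is hypothetical and only models the sequence of calls; it assumes Azure Monitor metrics are being enabled in the same create.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResourceProvider:
    """Hypothetical in-memory stand-in for the AKS RP; it only records call order."""
    calls: list = field(default_factory=list)
    dcra_exists: bool = False

    def put_cluster(self, control_plane_enabled: bool) -> None:
        if control_plane_enabled and not self.dcra_exists:
            # Models the failure mode: the CCP pod is scheduled before its DCRA exists.
            raise RuntimeError("DCRA not found")
        self.calls.append(f"PUT controlPlane.enabled={control_plane_enabled}")

    def create_dcra(self) -> None:
        self.dcra_exists = True
        self.calls.append("create DCRA")


def create_cluster(rp: FakeResourceProvider, enable_cp_metrics: bool) -> list:
    # 1. Initial PUT: the control-plane flag is deliberately held back.
    rp.put_cluster(control_plane_enabled=False)
    # 2. Postprocessing provisions the DCRA.
    rp.create_dcra()
    # 3. Only now is the flag flipped, with a second PUT.
    if enable_cp_metrics:
        rp.put_cluster(control_plane_enabled=True)
    return rp.calls


print(create_cluster(FakeResourceProvider(), enable_cp_metrics=True))
```

Setting the flag on the first PUT in this model raises `RuntimeError("DCRA not found")`, which is the race the deferred flip avoids.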

Related command

  • az aks create
  • az aks update

Description

This PR plumbs the new azureMonitorProfile.metrics.controlPlane.enabled API property end-to-end through the AKS CLI.

What's added:

  • New --enable-control-plane-metrics / --enable-cp-metrics flag on az aks create and az aks update.
  • New --disable-control-plane-metrics / --disable-cp-metrics flag on az aks update.
  • Argument registration in _params.py and help text in _help.py.
  • Client-side validation in _validators.py:
    • Enable requires azureMonitorProfile.metrics.enabled=true on the cluster (or --enable-azure-monitor-metrics in the same command).
    • --enable-control-plane-metrics + --disable-control-plane-metrics → MutuallyExclusiveArgumentError.
    • --enable-control-plane-metrics + --disable-azure-monitor-metrics → MutuallyExclusiveArgumentError.
  • AKSManagedClusterContext getters in managed_cluster_decorator.py and routing in custom.py so create/update both populate the correct subfield of ManagedClusterAzureMonitorProfileMetrics.
  • Greenfield race fix (see above): on aks create the CP-metrics flip is held back from the initial PUT and applied after link_azure_monitor_profile_artifacts provisions the DCRA, then re-PUT via the existing addon_put path. Brownfield aks update is unchanged.

Scenario coverage:

| # | Command | Cluster state | Result |
|---|---------|---------------|--------|
| 1 | aks create --enable-azure-monitor-metrics --enable-control-plane-metrics | n/a | Cluster created, AMW linked, DCRA created, then CP-metrics flipped on |
| 2 | aks create --enable-control-plane-metrics (no AMW flag) | n/a | Rejected: AMW required |
| 3 | aks update --enable-control-plane-metrics | AMW already on | CP-metrics enabled |
| 4 | aks update --enable-control-plane-metrics | AMW off | Rejected: AMW required |
| 5 | aks update --enable-azure-monitor-metrics --enable-control-plane-metrics … | AMW off | AMW enabled + CP-metrics enabled in one call |
| 6 | aks update --disable-control-plane-metrics | AMW on, CP on | CP-metrics disabled, AMW left intact |
| 7 | aks update --enable-control-plane-metrics --disable-control-plane-metrics | any | Rejected: mutually exclusive |
| 8 | aks update --enable-control-plane-metrics --disable-azure-monitor-metrics | any | Rejected: mutually exclusive |
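
The rejection rules behind scenarios 3, 4, 7 and 8 can be sketched as a single check. This is a minimal sketch and not the code in `_validators.py`: the exception classes are local stand-ins, `PrerequisiteError` is a hypothetical name (the PR does not say which error type the "AMW required" rejection raises), and `validate` with its parameters is invented for illustration.

```python
class MutuallyExclusiveArgumentError(Exception):
    """Local stand-in for the azure-cli error class of the same name."""


class PrerequisiteError(Exception):
    """Hypothetical error type for the 'AMW required' rejections."""


def validate(*, enable_cp=False, disable_cp=False, enable_amm=False,
             disable_amm=False, metrics_already_on=False) -> None:
    # Scenario 7: enable and disable in the same command.
    if enable_cp and disable_cp:
        raise MutuallyExclusiveArgumentError("enable + disable control-plane metrics")
    # Scenario 8: enabling CP metrics while disabling the parent feature.
    if enable_cp and disable_amm:
        raise MutuallyExclusiveArgumentError("enable CP metrics + disable Azure Monitor metrics")
    # Scenarios 2 and 4: the parent feature must be on, or enabled in the same command.
    if enable_cp and not (metrics_already_on or enable_amm):
        raise PrerequisiteError("Azure Monitor metrics must be enabled")


scenarios = {
    3: dict(enable_cp=True, metrics_already_on=True),
    4: dict(enable_cp=True),
    7: dict(enable_cp=True, disable_cp=True),
    8: dict(enable_cp=True, disable_amm=True),
}
for number, flags in scenarios.items():
    try:
        validate(**flags)
        print(number, "accepted")
    except (MutuallyExclusiveArgumentError, PrerequisiteError) as err:
        print(number, "rejected:", type(err).__name__)
```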

Files changed:

  • src/azure-cli/azure/cli/command_modules/acs/_help.py
  • src/azure-cli/azure/cli/command_modules/acs/_params.py
  • src/azure-cli/azure/cli/command_modules/acs/_validators.py
  • src/azure-cli/azure/cli/command_modules/acs/custom.py
  • src/azure-cli/azure/cli/command_modules/acs/managed_cluster_decorator.py
  • src/azure-cli/azure/cli/command_modules/acs/tests/latest/test_managed_cluster_decorator.py
  • src/azure-cli/azure/cli/command_modules/acs/tests/latest/recordings/… (recorded test cassettes)

Pairs with the matching aks-preview change: Azure/azure-cli-extensions#9931.

Testing Guide

Unit tests:

azdev test acs --discover
azdev test acs --series --pytest-args "-k control_plane_metrics"

Live validation against a real AKS cluster + Azure Monitor workspace:

RG=ccp-test-rg
LOC=eastus
AMW=/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Monitor/accounts/<amw>
GRAFANA=/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Dashboard/grafana/<g>

# Greenfield: enable CP-metrics at create time
az aks create -g $RG -n green-cp --location $LOC \
  --enable-azure-monitor-metrics --azure-monitor-workspace-resource-id $AMW \
  --grafana-resource-id $GRAFANA \
  --enable-control-plane-metrics
az aks show -g $RG -n green-cp --query "azureMonitorProfile.metrics" -o jsonc

# Brownfield: enable, then disable, then re-enable
az aks create -g $RG -n brown-cp --location $LOC \
  --enable-azure-monitor-metrics --azure-monitor-workspace-resource-id $AMW \
  --grafana-resource-id $GRAFANA
az aks update -g $RG -n brown-cp --enable-control-plane-metrics
az aks show -g $RG -n brown-cp --query "azureMonitorProfile.metrics" -o jsonc
az aks update -g $RG -n brown-cp --disable-control-plane-metrics
az aks show -g $RG -n brown-cp --query "azureMonitorProfile.metrics" -o jsonc
az aks update -g $RG -n brown-cp --enable-control-plane-metrics
az aks show -g $RG -n brown-cp --query "azureMonitorProfile.metrics" -o jsonc

# Negative cases
az aks update -g $RG -n brown-cp --enable-control-plane-metrics --disable-control-plane-metrics
az aks update -g $RG -n brown-cp --enable-control-plane-metrics --disable-azure-monitor-metrics
az aks create -g $RG -n bad-cp --enable-control-plane-metrics   # no AMW => rejected

Validation in Azure Monitor workspace after each enable: default CCP metric families flow within ~5–10 min (apiserver_request_total, apiserver_request_duration_seconds_*, etcd_server_has_leader, etcd_mvcc_db_total_size_in_bytes, process_start_time_seconds). After disable, allow ~15 min for the previous deployment's metrics to age out before re-asserting.
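
One way to make the post-enable check repeatable is to compare the metric names seen in the workspace against the families listed above. The sketch below assumes the observed names have already been fetched by some other means (how they are queried is not covered here), and `missing_families` is a hypothetical helper, not part of the PR.

```python
from fnmatch import fnmatchcase

# Default CCP metric families named in the testing guide; `*` matches any suffix.
EXPECTED_FAMILIES = [
    "apiserver_request_total",
    "apiserver_request_duration_seconds_*",
    "etcd_server_has_leader",
    "etcd_mvcc_db_total_size_in_bytes",
    "process_start_time_seconds",
]


def missing_families(observed_names) -> list:
    """Return the expected patterns that no observed metric name matches."""
    return [
        pattern for pattern in EXPECTED_FAMILIES
        if not any(fnmatchcase(name, pattern) for name in observed_names)
    ]


observed = {
    "apiserver_request_total",
    "apiserver_request_duration_seconds_bucket",
    "etcd_server_has_leader",
}
print(missing_families(observed))
# prints: ['etcd_mvcc_db_total_size_in_bytes', 'process_start_time_seconds']
```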

History Notes

[AKS] az aks create: Add --enable-control-plane-metrics/--enable-cp-metrics to opt new clusters into Azure Monitor managed Prometheus control-plane metrics
[AKS] az aks update: Add --enable-control-plane-metrics/--enable-cp-metrics and --disable-control-plane-metrics/--disable-cp-metrics to toggle Azure Monitor managed Prometheus control-plane metrics on existing clusters


Surface `azureMonitorProfile.metrics.controlPlane.enabled` so users can
opt clusters in/out of Azure Monitor managed Prometheus control-plane
metrics (kube-apiserver, etcd, scheduler, controller-manager) via the
first-class API property — replaces the AFEC-gated preview.

New flags:
  az aks create:  --enable-control-plane-metrics  (--enable-cp-metrics)
  az aks update:  --enable-control-plane-metrics  (--enable-cp-metrics)
                  --disable-control-plane-metrics (--disable-cp-metrics)

Enable requires Azure Monitor metrics to already be on or to be enabled
in the same command via --enable-azure-monitor-metrics. Enable + disable
in the same command, or enable-CP + --disable-azure-monitor-metrics,
are rejected client-side with MutuallyExclusiveArgumentError.

Greenfield race fix:
On `aks create`, `metrics.controlPlane.enabled=true` is intentionally
NOT set on the initial cluster PUT. Otherwise the RP would schedule the
control-plane-metrics collection (CCP) pod before the DCRA is created
in postprocessing (link_azure_monitor_profile_artifacts), causing the
CCP pod to crash-loop with "DCRA not found" until reconciliation. The
flip is deferred to the existing post-DCRA addon_put PUT, so the CCP
pod is scheduled only after its DCRA exists. The update path is
unchanged — brownfield updates target a cluster whose DCRA already
exists, so there is no race.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@azure-client-tools-bot-prd

azure-client-tools-bot-prd Bot commented Jun 11, 2026

✔️ AzureCLI-FullTest

All 66 modules passed on the latest profile with Python 3.12 and 3.14: acr, acs, advisor, ams, apim, appconfig, appservice, aro, backup, batch, batchai, billing, botservice, cdn, cloud, cognitiveservices, compute_recommender, computefleet, config, configure, consumption, container, containerapp, core, cosmosdb, databoxedge, dls, dms, eventgrid, eventhubs, feedback, find, hdinsight, identity, iot, keyvault, lab, managedservices, maps, marketplaceordering, monitor, mysql, netappfiles, network, policyinsights, postgresql, privatedns, profile, rdbms, redis, relay, resource, role, search, security, servicebus, serviceconnector, servicefabric, signalr, sql, sqlvm, storage, synapse, telemetry, util, vm.

@azure-client-tools-bot-prd

azure-client-tools-bot-prd Bot commented Jun 11, 2026

⚠️ AzureCLI-BreakingChangeTest

⚠️ acs

| rule | cmd_name | rule_message | suggest_message |
|------|----------|--------------|-----------------|
| ⚠️ 1006 - ParaAdd | aks create | cmd aks create added parameter enable_control_plane_metrics | |
| ⚠️ 1006 - ParaAdd | aks update | cmd aks update added parameter disable_control_plane_metrics | |
| ⚠️ 1006 - ParaAdd | aks update | cmd aks update added parameter enable_control_plane_metrics | |

@yonzhan

yonzhan commented Jun 11, 2026

Collaborator

AKS

@bragi92 bragi92 marked this pull request as ready for review June 11, 2026 23:41
Copilot AI review requested due to automatic review settings June 11, 2026 23:41

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds support for enabling/disabling Azure Monitor managed Prometheus control plane metrics for AKS clusters, with validation and create-flow handling to avoid a DCRA/CCP race.

Changes:

  • Introduces new CLI flags (--enable-control-plane-metrics/--enable-cp-metrics, --disable-control-plane-metrics/--disable-cp-metrics) and wires them into create/update flows.
  • Adds validation to reject invalid flag combinations and require Azure Monitor metrics as a prerequisite.
  • Updates postprocessing to defer enabling control plane metrics on create until after DCRA creation, and adds unit + live tests.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| src/azure-cli/azure/cli/command_modules/acs/managed_cluster_decorator.py | Adds getters/validation for control plane metrics flags; defers control plane enablement on create; allows toggling on update. |
| src/azure-cli/azure/cli/command_modules/acs/azuremonitormetrics/azuremonitorprofile.py | Adds a create-flow addon PUT variant that flips metrics.controlPlane.enabled after DCRA creation. |
| src/azure-cli/azure/cli/command_modules/acs/custom.py | Plumbs new parameters into aks_create / aks_update entrypoints. |
| src/azure-cli/azure/cli/command_modules/acs/_params.py | Registers new CLI arguments and help text. |
| src/azure-cli/azure/cli/command_modules/acs/_help.py | Documents the new flags in command help. |
| src/azure-cli/azure/cli/command_modules/acs/tests/latest/test_managed_cluster_decorator.py | Adds unit tests for create/update validation and payload shaping. |
| src/azure-cli/azure/cli/command_modules/acs/tests/latest/test_aks_commands.py | Adds live tests covering create/update/negative cases for the new flags. |

Comment thread src/azure-cli/azure/cli/command_modules/acs/managed_cluster_decorator.py Outdated
Comment thread src/azure-cli/azure/cli/command_modules/acs/managed_cluster_decorator.py Outdated
- Wait on the LRO in _addon_put_with_control_plane via poller.result(). This is the
  only place controlPlane.enabled is set during the greenfield create flow, so the
  CP flip must be durably persisted before the create command returns. Without the
  wait, callers and tests that read the cluster immediately could observe the
  pre-flip state. (The sibling addon_put intentionally remains fire-and-forget
  because metrics.enabled was already persisted on the initial cluster PUT.)
- Replace raise UnknownError(e) with raise UnknownError(str(e)) from e so the
  message is readable and the original traceback is preserved.
- Coerce _get_enable_control_plane_metrics / _get_disable_control_plane_metrics
  return values to bool() to match the declared -> bool return type when the
  parameter dict omits the key.
- Make the live test_aks_create_with_control_plane_metrics assertion robust:
  the controlPlane.enabled check is moved out of the immediate create response
  into an explicit aks show after aks wait, since the flip is intentionally
  deferred to post-DCRA postprocessing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
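
Two of the fixes in the commit message above, exception chaining and `bool()` coercion, can be illustrated in isolation. This is a minimal sketch: `UnknownError` here is a local stand-in for the azure-cli class of the same name, and `flip_control_plane` and its `put_and_wait` callback are hypothetical names, not the PR's code.

```python
class UnknownError(Exception):
    """Local stand-in for the azure-cli error class of the same name."""


def get_enable_control_plane_metrics(raw_parameters: dict) -> bool:
    # dict.get returns None when the key is absent; bool() keeps the -> bool contract.
    return bool(raw_parameters.get("enable_control_plane_metrics"))


def flip_control_plane(put_and_wait) -> None:
    """Run a caller-supplied PUT-and-wait callable and translate any failure."""
    try:
        put_and_wait()
    except Exception as e:
        # str(e) keeps the message readable; `from e` preserves the original traceback.
        raise UnknownError(str(e)) from e


def failing_put():
    raise ValueError("DCRA not found")


try:
    flip_control_plane(failing_put)
except UnknownError as err:
    print(err, "| caused by", type(err.__cause__).__name__)
# prints: DCRA not found | caused by ValueError
```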
@FumingZhang

Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@FumingZhang FumingZhang left a comment

Member

I'm fine with approving the change since the test result was positive, but I have to admit that the monitoring addon configuration has become quite confusing for me. It's difficult to understand all the options and the control flow between the v1 and v2 versions. I believe I've mentioned this before—do you and your team have any plans to refactor this part of the code?

# Trigger control-plane-metrics validation even if the parent metrics flag was
# not specified, so users get a clear error instead of silent ignore when they
# pass --enable-control-plane-metrics on its own.
self.context.get_enable_control_plane_metrics()

Member

Perhaps you could separate the validation logic from the getter and call the validator here instead, as it's unusual to retrieve a value only to discard it.

Member Author

updated

mc.azure_monitor_profile.metrics.kube_state_metrics = self.models.ManagedClusterAzureMonitorProfileKubeStateMetrics( # pylint:disable=line-too-long
metric_labels_allowlist=str(ksm_metric_labels_allow_list),
metric_annotations_allow_list=str(ksm_metric_annotations_allow_list))
# NOTE: control_plane.enabled is intentionally NOT set here on the create flow.

Member

This approach is somewhat unconventional from a client perspective. I've encountered some CRIs who reported that the monitoring addon wasn't functioning as expected when deploying with an ARM template or similar methods. Would it be possible to handle these types of tasks on the server side instead? This allows users on different clients to have a consistent experience.

Member Author

Yes, agreed. We've had this in our backlog for a while. The logs and Prometheus addon team has been looking into it this semester, and Rashmi is currently working on the design for that.

Once the logic is moved to the RP, I'll create a PR to remove it from the CLI.

@bragi92

bragi92 commented Jun 12, 2026

Member Author

> I'm fine with approving the change since the test result was positive, but I have to admit that the monitoring addon configuration has become quite confusing for me. It's difficult to understand all the options and the control flow between the v1 and v2 versions. I believe I've mentioned this before—do you and your team have any plans to refactor this part of the code?

Yup, Rashmi has been working on moving the creation logic for most of the dependencies to the AKS-RP. This should alleviate the pain of maintaining it in the CLI and UX, and make our ARM, Bicep, etc. templates simpler.

Per FumingZhang review feedback on PR Azure#33537: calling get_enable_control_plane_metrics() purely to trigger validation and discarding the return value is a confusing pattern. Extract the validation block into a new private _validate_control_plane_metrics_params method, expose a public validate_control_plane_metrics_params, and have the getters delegate to it when enable_validation=True (preserves existing API). The set_up_azure_monitor_profile call site now calls the validator directly instead of discarding a getter result.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
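
The shape of that refactor, a public validator that the getter delegates to and that the call site invokes directly, might look roughly like this. `Context` is a hypothetical, trimmed-down stand-in for `AKSManagedClusterContext`, `ValueError` stands in for the real error type, and only the enable/disable conflict is checked to keep the sketch short.

```python
class Context:
    """Hypothetical, trimmed-down stand-in for AKSManagedClusterContext."""

    def __init__(self, raw_parameters: dict):
        self.raw_param = raw_parameters

    def validate_control_plane_metrics_params(self) -> None:
        enable = bool(self.raw_param.get("enable_control_plane_metrics"))
        disable = bool(self.raw_param.get("disable_control_plane_metrics"))
        if enable and disable:
            # ValueError is a stand-in for the real mutually-exclusive error type.
            raise ValueError("enable and disable are mutually exclusive")

    def get_enable_control_plane_metrics(self, enable_validation: bool = True) -> bool:
        # The getter still validates by default, so existing callers are unaffected.
        if enable_validation:
            self.validate_control_plane_metrics_params()
        return bool(self.raw_param.get("enable_control_plane_metrics"))


def set_up_azure_monitor_profile(ctx: Context) -> None:
    # Call the validator directly rather than discarding a getter's return value.
    ctx.validate_control_plane_metrics_params()


ctx = Context({"enable_control_plane_metrics": True})
set_up_azure_monitor_profile(ctx)
print(ctx.get_enable_control_plane_metrics())
# prints: True
```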
bragi92 added a commit to bragi92/azure-cli-extensions that referenced this pull request Jun 12, 2026
Per FumingZhang review feedback on Azure/azure-cli#33537: calling get_enable_control_plane_metrics() purely to trigger validation and discarding the return value is a confusing pattern. Extract the validation block into a new private _validate_control_plane_metrics_params method, expose a public validate_control_plane_metrics_params, and have the getters delegate to it when enable_validation=True (preserves existing API). The two _setup_azure_monitor_profile call sites now call the validator directly instead of discarding a getter result.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bragi92 and others added 3 commits June 12, 2026 08:31
Other validators in this file (e.g. validate_byo_hosted_system_subnets) are a single public def validate_xxx(self) -> None — no private companion. Collapse the extra _validate_control_plane_metrics_params indirection so the new validator matches the file's convention. Tests + behavior unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The aka.ms/aks/controlplane-metrics shortlink does not resolve. Drop the trailing reference from the four help strings (create + update, both _help.py and _params.py). The remaining help text already explains the flag and its prerequisite, matching the sibling --enable-azure-monitor-metrics line which has no docs URL.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace 'kube-apiserver, etcd, etc' with the actual default Prometheus scrape job names: controlplane-apiserver and controlplane-etcd. These are the targets users see in AMW and what the AKS docs reference. The 'etc' was also misleading since scheduler / controller-manager / NAP targets are opt-in via MinimalIngestionProfile and are not flipped on by this flag.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bragi92

bragi92 commented Jun 12, 2026

Member Author

/azp run

@azure-pipelines

Commenter does not have sufficient privileges for PR 33537 in repo Azure/azure-cli

@FumingZhang

Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@FumingZhang FumingZhang changed the title {AKS} az aks create/update: add --enable/--disable-control-plane-metrics {AKS} az aks create/update: add --enable/--disable-control-plane-metrics Jun 15, 2026
@FumingZhang

Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@yanzhudd yanzhudd merged commit 1d68d09 into Azure:dev Jun 16, 2026
50 checks passed
5 participants