{AKS} az aks create/update: add --enable/--disable-control-plane-metrics#33537
Surface `azureMonitorProfile.metrics.controlPlane.enabled` so users can
opt clusters in/out of Azure Monitor managed Prometheus control-plane
metrics (kube-apiserver, etcd, scheduler, controller-manager) via the
first-class API property — replaces the AFEC-gated preview.
New flags:
az aks create: --enable-control-plane-metrics (--enable-cp-metrics)
az aks update: --enable-control-plane-metrics (--enable-cp-metrics)
--disable-control-plane-metrics (--disable-cp-metrics)
Enable requires Azure Monitor metrics to already be on or to be enabled
in the same command via --enable-azure-monitor-metrics. Enable + disable
in the same command, or enable-CP + --disable-azure-monitor-metrics,
are rejected client-side with MutuallyExclusiveArgumentError.
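The flag rules above can be sketched as a small standalone check. This is only an illustration under assumptions: the function name and its keyword arguments are hypothetical, `MutuallyExclusiveArgumentError` is a local stand-in for the azure-cli class of the same name, and `MissingPrerequisiteError` is invented here because the text does not say which error the prerequisite check raises.

```python
class MutuallyExclusiveArgumentError(Exception):
    """Local stand-in for the azure-cli error class of the same name."""


class MissingPrerequisiteError(Exception):
    """Hypothetical error type; the source does not name the one actually raised."""


def validate_control_plane_metrics_flags(
    enable_cp=False,
    disable_cp=False,
    enable_metrics=False,
    disable_metrics=False,
    metrics_already_enabled=False,
):
    """Reject the flag combinations described above (illustrative only)."""
    if enable_cp and disable_cp:
        raise MutuallyExclusiveArgumentError(
            "Cannot specify --enable-control-plane-metrics and "
            "--disable-control-plane-metrics at the same time."
        )
    if enable_cp and disable_metrics:
        raise MutuallyExclusiveArgumentError(
            "Cannot specify --enable-control-plane-metrics and "
            "--disable-azure-monitor-metrics at the same time."
        )
    # Enabling control-plane metrics needs Azure Monitor metrics to be on
    # already, or to be turned on by the same command.
    if enable_cp and not (enable_metrics or metrics_already_enabled):
        raise MissingPrerequisiteError(
            "--enable-control-plane-metrics requires Azure Monitor metrics; "
            "pass --enable-azure-monitor-metrics or enable it on the cluster first."
        )
```

Ordering the mutual-exclusion checks before the prerequisite check means a conflicting command line is reported as a conflict even when the prerequisite is also missing.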
Greenfield race fix:
On `aks create`, `metrics.controlPlane.enabled=true` is intentionally
NOT set on the initial cluster PUT. Otherwise the RP would schedule the
control-plane-metrics collection (CCP) pod before the DCRA is created
in postprocessing (link_azure_monitor_profile_artifacts), causing the
CCP pod to crash-loop with "DCRA not found" until reconciliation. The
flip is deferred to the existing post-DCRA addon_put PUT, so the CCP
pod is scheduled only after its DCRA exists. The update path is
unchanged — brownfield updates target a cluster whose DCRA already
exists, so there is no race.
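A minimal sketch of the ordering described above, assuming the cluster body is a plain dict and that `put` and `create_dcra` are hypothetical callables standing in for the cluster PUT and the DCRA postprocessing step. It is not the decorator's actual code; it only shows that the control-plane flag is withheld from the first PUT and sent after the DCRA exists.

```python
from copy import deepcopy


def create_with_deferred_cp_flip(put, create_dcra, body, enable_cp):
    """Illustrative create flow: PUT without the flag, create the DCRA, then flip.

    `put` and `create_dcra` are hypothetical callables supplied by the caller.
    `body` must already contain body["azureMonitorProfile"]["metrics"].
    """
    first = deepcopy(body)
    # Withhold controlPlane from the initial PUT so nothing depending on the
    # DCRA is scheduled before the DCRA exists.
    first["azureMonitorProfile"]["metrics"].pop("controlPlane", None)
    cluster = put(first)

    # Postprocessing: create the DCRA for the new cluster.
    create_dcra(cluster)

    if enable_cp:
        # Second PUT, issued only after the DCRA is in place.
        second = deepcopy(first)
        second["azureMonitorProfile"]["metrics"]["controlPlane"] = {"enabled": True}
        cluster = put(second)
    return cluster
```

On the update path the DCRA already exists, so a single PUT carrying the flag would be enough.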
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
️✔️AzureCLI-FullTest

| rule | cmd_name | rule_message | suggest_message |
|---|---|---|---|
| | aks create | cmd `aks create` added parameter `enable_control_plane_metrics` | |
| | aks update | cmd `aks update` added parameter `disable_control_plane_metrics` | |
| | aks update | cmd `aks update` added parameter `enable_control_plane_metrics` | |

AKS
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds support for enabling/disabling Azure Monitor managed Prometheus control plane metrics for AKS clusters, with validation and create-flow handling to avoid a DCRA/CCP race.
Changes:
- Introduces new CLI flags (`--enable-control-plane-metrics`/`--enable-cp-metrics`, `--disable-control-plane-metrics`/`--disable-cp-metrics`) and wires them into create/update flows.
- Adds validation to reject invalid flag combinations and require Azure Monitor metrics as a prerequisite.
- Updates postprocessing to defer enabling control plane metrics on create until after DCRA creation, and adds unit + live tests.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/azure-cli/azure/cli/command_modules/acs/managed_cluster_decorator.py` | Adds getters/validation for control plane metrics flags; defers control plane enablement on create; allows toggling on update. |
| `src/azure-cli/azure/cli/command_modules/acs/azuremonitormetrics/azuremonitorprofile.py` | Adds a create-flow addon PUT variant that flips `metrics.controlPlane.enabled` after DCRA creation. |
| `src/azure-cli/azure/cli/command_modules/acs/custom.py` | Plumbs new parameters into `aks_create` / `aks_update` entrypoints. |
| `src/azure-cli/azure/cli/command_modules/acs/_params.py` | Registers new CLI arguments and help text. |
| `src/azure-cli/azure/cli/command_modules/acs/_help.py` | Documents the new flags in command help. |
| `src/azure-cli/azure/cli/command_modules/acs/tests/latest/test_managed_cluster_decorator.py` | Adds unit tests for create/update validation and payload shaping. |
| `src/azure-cli/azure/cli/command_modules/acs/tests/latest/test_aks_commands.py` | Adds live tests covering create/update/negative cases for the new flags. |
- Wait on the LRO in `_addon_put_with_control_plane` via `poller.result()`. This is the only place `controlPlane.enabled` is set during the greenfield create flow, so the CP flip must be durably persisted before the create command returns. Without the wait, callers and tests that read the cluster immediately could observe the pre-flip state. (The sibling `addon_put` intentionally remains fire-and-forget because `metrics.enabled` was already persisted on the initial cluster PUT.)
- Replace `raise UnknownError(e)` with `raise UnknownError(str(e)) from e` so the message is readable and the original traceback is preserved.
- Coerce `_get_enable_control_plane_metrics` / `_get_disable_control_plane_metrics` return values to `bool()` to match the declared `-> bool` return type when the parameter dict omits the key.
- Make the live `test_aks_create_with_control_plane_metrics` assertion robust: the `controlPlane.enabled` check is moved out of the immediate create response into an explicit `aks show` after `aks wait`, since the flip is intentionally deferred to post-DCRA postprocessing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
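The first three changes in this commit can be illustrated with a short, self-contained sketch. The names here are stand-ins rather than the module's real code: `UnknownError` is a local placeholder for the azure-cli class, `begin_put` is a hypothetical callable that returns a poller-like object with a `result()` method, `get_bool_flag` is an invented helper, and catching a broad `Exception` is an assumption made only to keep the example small.

```python
class UnknownError(Exception):
    """Local stand-in for the azure-cli error class of the same name."""


def addon_put_with_control_plane(begin_put, body):
    """Start the PUT and block until the long-running operation completes."""
    try:
        poller = begin_put(body)
        # Waiting here means the flipped state is persisted before returning.
        return poller.result()
    except Exception as e:  # assumption: the real code may catch narrower types
        # str(e) keeps the message readable; `from e` preserves the traceback chain.
        raise UnknownError(str(e)) from e


def get_bool_flag(raw_parameters, key):
    """Return a real bool even when the key is absent (dict.get would give None)."""
    return bool(raw_parameters.get(key))
```

With `raise ... from e`, the original exception remains available as `__cause__`, which is what keeps the underlying traceback visible in logs.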
/azp run
Azure Pipelines successfully started running 3 pipeline(s).
FumingZhang
left a comment
I'm fine with approving the change since the test result was positive, but I have to admit that the monitoring addon configuration has become quite confusing for me. It's difficult to understand all the options and the control flow between the v1 and v2 versions. I believe I've mentioned this before—do you and your team have any plans to refactor this part of the code?
# Trigger control-plane-metrics validation even if the parent metrics flag was
# not specified, so users get a clear error instead of silent ignore when they
# pass --enable-control-plane-metrics on its own.
self.context.get_enable_control_plane_metrics()
Perhaps you could separate the validation logic from the getter and call the validator here instead, as it's unusual to retrieve a value only to discard it.
mc.azure_monitor_profile.metrics.kube_state_metrics = self.models.ManagedClusterAzureMonitorProfileKubeStateMetrics(  # pylint:disable=line-too-long
    metric_labels_allowlist=str(ksm_metric_labels_allow_list),
    metric_annotations_allow_list=str(ksm_metric_annotations_allow_list))
# NOTE: control_plane.enabled is intentionally NOT set here on the create flow.
This approach is somewhat unconventional from a client perspective. I've seen some CRIs reporting that the monitoring addon wasn't functioning as expected when deployed with an ARM template or similar methods. Would it be possible to handle these tasks on the server side instead? That would give users on different clients a consistent experience.
Yes, agreed. We've had this in our backlog for a while. The logs and Prometheus addon team has been looking into it this semester, and Rashmi is currently working on the design.
Once the logic is moved to the RP, I'll create a PR to remove it from the CLI.
Yup, Rashmi has been working on moving the creation logic for most of the dependencies to the AKS-RP. This should alleviate the pain of maintaining it in the CLI and UX, and make our ARM, Bicep, etc. templates simpler.
Per FumingZhang review feedback on PR Azure#33537: calling get_enable_control_plane_metrics() purely to trigger validation and discarding the return value is a confusing pattern. Extract the validation block into a new private _validate_control_plane_metrics_params method, expose a public validate_control_plane_metrics_params, and have the getters delegate to it when enable_validation=True (preserves existing API). The set_up_azure_monitor_profile call site now calls the validator directly instead of discarding a getter result. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per FumingZhang review feedback on Azure/azure-cli#33537: calling get_enable_control_plane_metrics() purely to trigger validation and discarding the return value is a confusing pattern. Extract the validation block into a new private _validate_control_plane_metrics_params method, expose a public validate_control_plane_metrics_params, and have the getters delegate to it when enable_validation=True (preserves existing API). The two _setup_azure_monitor_profile call sites now call the validator directly instead of discarding a getter result. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Other validators in this file (e.g. validate_byo_hosted_system_subnets) are a single public def validate_xxx(self) -> None — no private companion. Collapse the extra _validate_control_plane_metrics_params indirection so the new validator matches the file's convention. Tests + behavior unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
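The shape these three commits converge on (one public validator, with getters delegating to it when validation is enabled) could look roughly like the sketch below. The class is a simplified stand-in for the real context object: it reads flags from a plain dict, checks only the enable/disable conflict, and uses a local placeholder for `MutuallyExclusiveArgumentError`.

```python
class MutuallyExclusiveArgumentError(Exception):
    """Local stand-in for the azure-cli error class of the same name."""


class ContextSketch:
    """Simplified, hypothetical stand-in for the decorator's context class."""

    def __init__(self, raw_parameters):
        self.raw_parameters = raw_parameters

    def validate_control_plane_metrics_params(self) -> None:
        """Single public validator, callable directly by setup code."""
        enable = bool(self.raw_parameters.get("enable_control_plane_metrics"))
        disable = bool(self.raw_parameters.get("disable_control_plane_metrics"))
        if enable and disable:
            raise MutuallyExclusiveArgumentError(
                "Cannot specify --enable-control-plane-metrics and "
                "--disable-control-plane-metrics at the same time."
            )

    def get_enable_control_plane_metrics(self, enable_validation: bool = True) -> bool:
        """Getter that delegates to the validator, preserving the old call pattern."""
        if enable_validation:
            self.validate_control_plane_metrics_params()
        return bool(self.raw_parameters.get("enable_control_plane_metrics"))
```

A call site that only needs the check can then call `validate_control_plane_metrics_params()` on its own instead of calling a getter and discarding the result.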
The aka.ms/aks/controlplane-metrics shortlink does not resolve. Drop the trailing reference from the four help strings (create + update, both _help.py and _params.py). The remaining help text already explains the flag and its prerequisite, matching the sibling --enable-azure-monitor-metrics line which has no docs URL. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace 'kube-apiserver, etcd, etc' with the actual default Prometheus scrape job names: controlplane-apiserver and controlplane-etcd. These are the targets users see in AMW and what the AKS docs reference. The 'etc' was also misleading since scheduler / controller-manager / NAP targets are opt-in via MinimalIngestionProfile and are not flipped on by this flag. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/azp run
Commenter does not have sufficient privileges for PR 33537 in repo Azure/azure-cli
/azp run
Azure Pipelines successfully started running 3 pipeline(s).
az aks create/update: add --enable/--disable-control-plane-metrics
/azp run
Azure Pipelines successfully started running 3 pipeline(s).
Surface `azureMonitorProfile.metrics.controlPlane.enabled` so users can opt clusters in/out of Azure Monitor managed Prometheus control-plane metrics (controlplane-apiserver, controlplane-etcd) via the first-class API property; this replaces the AFEC-gated preview.

New flags:
- `az aks create`: `--enable-control-plane-metrics` (`--enable-cp-metrics`)
- `az aks update`: `--enable-control-plane-metrics` (`--enable-cp-metrics`)
- `az aks update`: `--disable-control-plane-metrics` (`--disable-cp-metrics`)

Enable requires Azure Monitor metrics to already be on or to be enabled in the same command via `--enable-azure-monitor-metrics`. Enable + disable in the same command, or enable-CP + `--disable-azure-monitor-metrics`, are rejected client-side with `MutuallyExclusiveArgumentError`.

Greenfield race fix:
On `aks create`, `metrics.controlPlane.enabled=true` is intentionally NOT set on the initial cluster PUT. Otherwise the RP would schedule the control-plane-metrics collection (CCP) pod before the DCRA is created in postprocessing (`link_azure_monitor_profile_artifacts`), causing the CCP pod to crash-loop with "DCRA not found" until reconciliation. The flip is deferred to the existing post-DCRA `addon_put` PUT, so the CCP pod is scheduled only after its DCRA exists. The update path is unchanged: brownfield updates target a cluster whose DCRA already exists, so there is no race.

Related command
`az aks create`
`az aks update`

Description
This PR plumbs the new `azureMonitorProfile.metrics.controlPlane.enabled` API property end-to-end through the AKS CLI.

What's added:
- `--enable-control-plane-metrics` / `--enable-cp-metrics` flag on `az aks create` and `az aks update`.
- `--disable-control-plane-metrics` / `--disable-cp-metrics` flag on `az aks update`.
- Argument registration in `_params.py` and help text in `_help.py`.
- `_validators.py`:
  - Enable requires `azureMonitorProfile.metrics.enabled=true` on the cluster (or `--enable-azure-monitor-metrics` in the same command).
  - `--enable-control-plane-metrics` + `--disable-control-plane-metrics` → `MutuallyExclusiveArgumentError`.
  - `--enable-control-plane-metrics` + `--disable-azure-monitor-metrics` → `MutuallyExclusiveArgumentError`.
- `AKSManagedClusterContext` getters in `managed_cluster_decorator.py` and routing in `custom.py` so create/update both populate the correct subfield of `ManagedClusterAzureMonitorProfileMetrics`.
- On `aks create` the CP-metrics flip is held back from the initial PUT and applied after `link_azure_monitor_profile_artifacts` provisions the DCRA, then re-PUT via the existing `addon_put` path. Brownfield `aks update` is unchanged.

Scenario coverage:
- `aks create --enable-azure-monitor-metrics --enable-control-plane-metrics`
- `aks create --enable-control-plane-metrics` (no AMW flag)
- `aks update --enable-control-plane-metrics`
- `aks update --enable-control-plane-metrics`
- `aks update --enable-azure-monitor-metrics --enable-control-plane-metrics …`
- `aks update --disable-control-plane-metrics`
- `aks update --enable-control-plane-metrics --disable-control-plane-metrics`
- `aks update --enable-control-plane-metrics --disable-azure-monitor-metrics`

Files changed:
- `src/azure-cli/azure/cli/command_modules/acs/_help.py`
- `src/azure-cli/azure/cli/command_modules/acs/_params.py`
- `src/azure-cli/azure/cli/command_modules/acs/_validators.py`
- `src/azure-cli/azure/cli/command_modules/acs/custom.py`
- `src/azure-cli/azure/cli/command_modules/acs/managed_cluster_decorator.py`
- `src/azure-cli/azure/cli/command_modules/acs/tests/latest/test_managed_cluster_decorator.py`
- `src/azure-cli/azure/cli/command_modules/acs/tests/latest/recordings/…` (recorded test cassettes)

Pairs with the matching aks-preview change: Azure/azure-cli-extensions#9931.

Testing Guide
Unit tests:
Live validation against a real AKS cluster + Azure Monitor workspace:
Validation in Azure Monitor workspace after each enable: default CCP metric families flow within ~5–10 min (`apiserver_request_total`, `apiserver_request_duration_seconds_*`, `etcd_server_has_leader`, `etcd_mvcc_db_total_size_in_bytes`, `process_start_time_seconds`). After disable, allow ~15 min for the previous deployment's metrics to age out before re-asserting.

History Notes
[AKS] `az aks create`: Add `--enable-control-plane-metrics`/`--enable-cp-metrics` to opt new clusters into Azure Monitor managed Prometheus control-plane metrics
[AKS] `az aks update`: Add `--enable-control-plane-metrics`/`--enable-cp-metrics` and `--disable-control-plane-metrics`/`--disable-cp-metrics` to toggle Azure Monitor managed Prometheus control-plane metrics on existing clusters

This checklist is used to make sure that common guidelines for a pull request are followed.