
Stop doing LoadBalancer tests in gce-master-scale-correctness #36720

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from danwinship:lb-5000-flake
Apr 13, 2026

Conversation

@danwinship
Contributor

gce-master-scale-correctness periodically starts flaking because the LoadBalancer tests do not reliably pass under GCE at that scale, at least in the way this test configures the cluster.

Discussion with @bowei and @serathius at KubeCon led to the conclusion that we should just drop the LB tests from this particular job.

(Which is to say, we should change the skip rule from "skip any tests with [Feature:...] tags that aren't [Feature:LoadBalancer]" to "skip any tests with [Feature:...] tags".)

Fixes kubernetes/kubernetes#131863
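The quoted skip-rule change can be sketched with ordinary regular expressions. This is a hypothetical illustration only: the real job config uses ginkgo skip patterns (RE2 syntax, which has no lookahead), so the actual expressions in the YAML are written differently.

```python
import re

# Hypothetical sketch of the skip-rule change described above; the real
# ginkgo skip patterns in the scalability job config are expressed differently
# (RE2, as used by ginkgo, does not support negative lookahead).

# Old rule: skip any [Feature:...]-tagged test EXCEPT [Feature:LoadBalancer].
old_skip = re.compile(r"\[Feature:(?!LoadBalancer\])[^\]]+\]")

# New rule: skip any [Feature:...]-tagged test, LoadBalancer included.
new_skip = re.compile(r"\[Feature:[^\]]+\]")

tests = [
    "[sig-network] LoadBalancers [Feature:LoadBalancer] should work",
    "[sig-storage] Volumes [Feature:Volumes] should be mountable",
    "[sig-network] Services should serve a basic endpoint",  # no Feature tag
]

for name in tests:
    print(f"old skips: {bool(old_skip.search(name))}  "
          f"new skips: {bool(new_skip.search(name))}  {name}")
```

Under the old rule the LoadBalancer tests still ran in this job (and flaked); under the new rule they are skipped along with every other Feature-tagged test, while untagged tests continue to run in both cases.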

The tests do not pass reliably at that scale.
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 25, 2026
@k8s-ci-robot k8s-ci-robot requested review from mborsz and mm4tt March 25, 2026 13:38
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. area/config Issues or PRs related to code in /config area/jobs sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Mar 25, 2026
@serathius
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 26, 2026
@danwinship
Contributor Author

/assign @wojtek-t

@wojtek-t
Member

Discussion with @bowei and @serathius at KubeCon led to the conclusion that we should just drop the LB tests from this particular job.

I'm actually pretty worried about that.
The reason why we added them in the first place to this job is that certain load-balancing was initially unusable at large scale. And at least some of these problems were in our components.

If we remove those tests completely from large scale tests, we will be on a slippery slope to new degradations in this area.
I acknowledge the fact that flakiness isn't good, but... isn't it still better than zero visibility?

@danwinship
Contributor Author

The reason why we added them in the first place to this job is that certain load-balancing was initially unusable at large scale.

OK, but note that load balancing is expected to be unusable in a 5000-node cluster on GCP, when using load balancers configured in the way k8s configures them by default. We are testing a configuration that Google does not support (and then sending out release-informing alerts when it turns out that the configuration that wasn't expected to work actually doesn't work).

If we remove those tests completely from large scale tests

IIRC @bowei said our load balancer configuration should be fully supported on 1000-node clusters.

And at least some of these problems were in our components.

Kube-proxy doesn't do anything that would make LoadBalancer Services scale any differently than ClusterIP and NodePort Services, so we shouldn't need the LB tests to catch kube-proxy scaling problems. And if there are cloud-provider-gcp scaling problems, then those should trigger cloud-provider-gcp alerts, not k/k alerts.

I acknowledge the fact that flakiness isn't good, but... isn't it still better than zero visibility?

But right now it's the E2E Test That Cried "Wolf!". We ignore the failures anyway, because they're always GCP's fault...

@wojtek-t
Member

wojtek-t commented Apr 1, 2026

And if there are cloud-provider-gcp scaling problems, then those should trigger cloud-provider-gcp alerts, not k/k alerts.

OK - fair. That I agree with.
The problem is that I'm 99% sure we don't have cloud-provider-gcp scale tests...
@bowei @serathius

But right now it's the E2E Test That Cried "Wolf!". We ignore the failures anyway, because they're always GCP's fault...

We ignore them because they are flakes.
But if they broke permanently, we would catch that.
I guess this is my main point - we can currently catch the "it broke completely" state. So even if we don't look at the flakes and ignore them, that state is arguably better than having zero visibility into whether it works at all...

@wojtek-t
Member

wojtek-t commented Apr 1, 2026

But I guess the argument that "it's not a k/k issue anyway" is a valid one and kind of outweighs all my arguments.
So I'm going to approve and hold, in case we want to continue the discussion here.
If not, can we open an issue to have those tests covered separately on a different setup elsewhere?

/approve
/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 1, 2026
@aojea
Member

aojea commented Apr 8, 2026

/assign

@danwinship
Contributor Author

can we open an issue to have those tests covered separately on a different setup elsewhere then?

#36773

@aojea
Member

aojea commented Apr 13, 2026

But I guess the argument that "it's not a k/k issue anyway" is a valid one and kind of outweighs all my arguments. So I'm going to approve and hold, in case we want to continue the discussion here. If not, can we open an issue to have those tests covered separately on a different setup elsewhere?

/approve /hold

Ok, so the problem here is not the job itself, the tests, or the cloud-provider angle - there are multiple jobs that are cloud-provider-specific or feature-specific. The problem is that the job is release-informing, and that carries expectations of stability, since the "ci-signal" group reviews these jobs and we cannot special-case an informing job to say "well, it can flake, just check whether it always fails". So the right way to proceed IMHO is to keep the informing job stable (remove the LoadBalancer tests) and, to keep the current coverage, find owners to maintain #36773

/hold cancel

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 13, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, danwinship, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 6b0415f into kubernetes:master Apr 13, 2026
6 checks passed
@k8s-ci-robot
Contributor

@danwinship: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key sig-scalability-release-blocking-jobs.yaml using file config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml
Details

In response to this:

gce-master-scale-correctness periodically starts flaking because the LoadBalancer tests do not reliably pass under GCE at that scale, at least in the way this test configures the cluster.

Discussion with @bowei and @serathius at KubeCon led to the conclusion that we should just drop the LB tests from this particular job.

(Which is to say, we should change the skip rule from "skip any tests with [Feature:...] tags that aren't [Feature:LoadBalancer]" to "skip any tests with [Feature:...] tags".)

Fixes kubernetes/kubernetes#131863

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@danwinship danwinship deleted the lb-5000-flake branch April 13, 2026 10:32


Development

Successfully merging this pull request may close these issues.

[Flaking Test] [sig-network] LoadBalancers tests fail with timeout waiting for service "xxx" to have a load balancer: context deadline exceeded
