
Stop doing LoadBalancer tests in gce-master-scale-correctness #36720

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from danwinship:lb-5000-flake
Apr 13, 2026

Conversation

@danwinship
Contributor

gce-master-scale-correctness periodically starts flaking because the LoadBalancer tests do not reliably pass under GCE at that scale, at least in the way this test configures the cluster.

Discussion with @bowei and @serathius at KubeCon led to the conclusion that we should just drop the LB tests from this particular job.

(Which is to say, we should change the skip rule from "skip any tests with [Feature:...] tags that aren't [Feature:LoadBalancer]" to "skip any tests with [Feature:...] tags".)

Fixes kubernetes/kubernetes#131863
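The quoted skip-rule change can be sketched with ordinary regular expressions. This is a hypothetical illustration only: the real job config uses ginkgo skip patterns (RE2 syntax, which has no lookahead), so the actual expressions in the YAML are written differently.

```python
import re

# Hypothetical sketch of the skip-rule change described above; the real
# ginkgo skip patterns in the scalability job config are expressed differently
# (RE2, as used by ginkgo, does not support negative lookahead).

# Old rule: skip any [Feature:...]-tagged test EXCEPT [Feature:LoadBalancer].
old_skip = re.compile(r"\[Feature:(?!LoadBalancer\])[^\]]+\]")

# New rule: skip any [Feature:...]-tagged test, LoadBalancer included.
new_skip = re.compile(r"\[Feature:[^\]]+\]")

tests = [
    "[sig-network] LoadBalancers [Feature:LoadBalancer] should work",
    "[sig-storage] Volumes [Feature:Volumes] should be mountable",
    "[sig-network] Services should serve a basic endpoint",  # no Feature tag
]

for name in tests:
    print(f"old skips: {bool(old_skip.search(name))}  "
          f"new skips: {bool(new_skip.search(name))}  {name}")
```

Under the old rule the LoadBalancer tests still ran in this job (and flaked); under the new rule they are skipped along with every other Feature-tagged test, while untagged tests continue to run in both cases.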

The tests do not pass reliably at that scale.
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 25, 2026
@k8s-ci-robot k8s-ci-robot requested review from mborsz and mm4tt March 25, 2026 13:38
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. area/config Issues or PRs related to code in /config area/jobs sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Mar 25, 2026
@serathius
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 26, 2026
@danwinship
Contributor Author

/assign @wojtek-t

@wojtek-t
Member

Discussion with @bowei and @serathius at KubeCon led to the conclusion that we should just drop the LB tests from this particular job.

I'm actually pretty worried about that.
The reason why we added them in the first place to this job is that certain load-balancing was initially unusable at large scale. And at least some of these problems were in our components.

If we remove those tests completely from large scale tests, we will be on a slippery slope to new degradations in this area.
I acknowledge the fact that flakiness isn't good, but... isn't it still better than zero visibility?

@danwinship
Contributor Author

The reason why we added them in the first place to this job is that certain load-balancing was initially unusable at large scale.

OK, but note that load balancing is expected to be unusable in a 5000-node cluster on GCP, when using load balancers configured in the way k8s configures them by default. We are testing a configuration that Google does not support (and then sending out release-informing alerts when it turns out that the configuration that wasn't expected to work actually doesn't work).

If we remove those tests completely from large scale tests

IIRC @bowei said our load balancer configuration should be fully supported on 1000-node clusters.

And at least some of these problems were in our components.

Kube-proxy doesn't do anything that would make LoadBalancer Services scale any differently than ClusterIP and NodePort Services, so we shouldn't need the LB tests to catch kube-proxy scaling problems. And if there are cloud-provider-gcp scaling problems, then those should trigger cloud-provider-gcp alerts, not k/k alerts.

I acknowledge the fact that flakiness isn't good, but... isn't it still better than zero visibility?

But right now it's the E2E Test That Cried "Wolf!". We ignore the failures anyway, because they're always GCP's fault...

@wojtek-t
Member

wojtek-t commented Apr 1, 2026

And if there are cloud-provider-gcp scaling problems, then those should trigger cloud-provider-gcp alerts, not k/k alerts.

OK - fair. That I agree with.
The problem is that I'm 99% sure we don't have cloud-provider-gcp scale tests...
@bowei @serathius

But right now it's the E2E Test That Cried "Wolf!". We ignore the failures anyway, because they're always GCP's fault...

We ignore them because they are flakes.
But if they broke permanently, we would catch that.
I guess this is my main point - we can currently catch the "it broke completely" state. So even if we don't look at the flakes and ignore them, that state is arguably better than having zero visibility into whether it works at all...

@wojtek-t
Member

wojtek-t commented Apr 1, 2026

But I guess the argument that "it's not a k/k issue anyway" is a valid one and kind of outweighs all my arguments.
So I'm going to approve and hold, in case we want to continue the discussion here.
If not, can we open an issue to have those tests covered separately on a different setup elsewhere?

/approve
/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 1, 2026
@aojea
Member

aojea commented Apr 8, 2026

/assign

@danwinship
Contributor Author

can we open an issue to have those tests covered separately on a different setup elsewhere then?

#36773

@aojea
Member

aojea commented Apr 13, 2026

But I guess the argument that "it's not a k/k issue anyway" is a valid one and kind of outweighs all my arguments. So I'm going to approve and hold, in case we want to continue the discussion here. If not, can we open an issue to have those tests covered separately on a different setup elsewhere?

/approve /hold

Ok, so the problem here is not the job itself, the tests, or the cloud-provider angle - there are multiple jobs that are cloud-provider-specific or feature-specific. The problem is that the job is release-informing, and that carries expectations of stability, since the "ci-signal" group reviews these jobs and we cannot special-case an informing job to say "well, it can flake, just check whether it always fails". So the right way to proceed IMHO is to keep the informing job stable (remove the LoadBalancer tests) and, to keep the current coverage, find owners to maintain #36773

/hold cancel

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 13, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, danwinship, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 6b0415f into kubernetes:master Apr 13, 2026
6 checks passed
@k8s-ci-robot
Contributor

@danwinship: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key sig-scalability-release-blocking-jobs.yaml using file config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml
Details

In response to this:

gce-master-scale-correctness periodically starts flaking because the LoadBalancer tests do not reliably pass under GCE at that scale, at least in the way this test configures the cluster.

Discussion with @bowei and @serathius at KubeCon led to the conclusion that we should just drop the LB tests from this particular job.

(Which is to say, we should change the skip rule from "skip any tests with [Feature:...] tags that aren't [Feature:LoadBalancer]" to "skip any tests with [Feature:...] tags".)

Fixes kubernetes/kubernetes#131863

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@danwinship danwinship deleted the lb-5000-flake branch April 13, 2026 10:32


Development

Successfully merging this pull request may close these issues.

[Flaking Test] [sig-network] LoadBalancers tests fail with timeout waiting for service "xxx" to have a load balancer: context deadline exceeded
