Fix stale kind cluster setup for certs and valkey #146

Open
Nina Polshakova (npolshakova) wants to merge 1 commit into
agent-substrate:mainfrom
npolshakova:fix-stale-kind-cluster-setup

@npolshakova

Fixes #<issue_number_goes_here>

It's a good idea to open an issue first for discussion.

  • Tests pass
  • Appropriate changes to documentation are included in the PR

I had a kind cluster up for a couple of days and ran into these issues:

  1. The certs expire, and the ./hack/install-ate-kind.sh --deploy-ate-system setup script won't refresh them. Listing actors then fails with:
  while listing actors in db: while iterating all redis master: tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2026-06-02T14:26:03Z is after 2026-05-30T18:46:15Z

This adds a new --refresh-local-certs flag that recreates the local CA roots used by the podcertificate controller, the local session identity JWT/CA pools, and the CA bundle used by the API server to trust Valkey TLS, then restarts the pods that mount the generated certs.
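
The expiry reported in the error above can be checked by hand. This is a minimal sketch, assuming GNU date (for `date -d`); the helper names seconds_past_expiry and cert_is_expired are hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not code from this PR.
# Assumption: GNU date is available (BSD/macOS date lacks `-d`).

# Prints how many seconds `now` is past `not_after` (negative if still valid).
seconds_past_expiry() {
  local not_after="$1" now="$2"
  local not_after_s now_s
  not_after_s="$(date -u -d "$not_after" +%s)"
  now_s="$(date -u -d "$now" +%s)"
  echo $(( now_s - not_after_s ))
}

# Succeeds (exit 0) when the certificate has already expired at `now`.
cert_is_expired() {
  [ "$(seconds_past_expiry "$1" "$2")" -gt 0 ]
}

# Timestamps copied from the x509 error message above.
seconds_past_expiry "2026-05-30T18:46:15Z" "2026-06-02T14:26:03Z"
```

With the timestamps from the error message this prints 243588, so the cert had been expired for roughly 2 days and 19 hours when the API server tried to use it.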

  2. Valkey remembers old pod IPs after the StatefulSet restarts. The ate-api-server-deployment kept timing out because the old IPs were still being used.

The new --reset-local-valkey flag deletes job/valkey-cluster-init and the Valkey PVCs labeled app=valkey-cluster so the Valkey cluster can be initialized from scratch.
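
As a rough sketch of the cleanup described above (not the script's actual implementation), the two deletions map onto plain kubectl delete calls. The function name reset_local_valkey, the KUBECTL override, and the ate-system namespace are assumptions for illustration; only the job name and the app=valkey-cluster label come from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the code in hack/install-ate.sh.

# Assumption: KUBECTL can be overridden (e.g. KUBECTL=echo for a dry run).
KUBECTL="${KUBECTL:-kubectl}"
# Assumption: placeholder namespace; the PR description does not state it.
NAMESPACE="${NAMESPACE:-ate-system}"

reset_local_valkey() {
  # Delete the init job named in the PR description.
  "$KUBECTL" delete job valkey-cluster-init -n "$NAMESPACE" --ignore-not-found
  # Delete the Valkey PVCs selected by label so the cluster can be
  # initialized from scratch.
  "$KUBECTL" delete pvc -l app=valkey-cluster -n "$NAMESPACE" --ignore-not-found
}
```

Running `KUBECTL=echo reset_local_valkey` prints the arguments that would be passed to kubectl instead of executing them. A PVC that is still mounted by a running pod stays in Terminating until that pod goes away, so a real script would likely also need to handle the StatefulSet pods.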

Signed-off-by: npolshakova <nina.polshakova@solo.io>
@google-cla

google-cla Bot commented Jun 2, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@npolshakova

Author

Benjamin Elder (@BenTheElder), Dmitry Berkovich (@dberkov): I saw you had some recent changes in the setup scripts. Any thoughts about adding this for those of us with long-lived kind clusters? 😅

Collaborator

Heh, I think we're all iterating so much with clean installs every time.

I've also been wondering if we should drop the number of replicas etc., at least on kind, to make startup faster and use fewer resources ... but there's a tradeoff, since that's our main source of continuous test coverage right now (and we expect "real" clusters to need to deal with shards, which affects ate-apiserver).

Comment thread hack/install-ate.sh
echo " --deploy-ate-system Deploy core system (CRDs, atelet, apiserver)"
echo " --delete-ate-system Delete core system"
echo " --delete-all Delete core system and all registered demos"
echo " --refresh-local-certs Regenerate local cert/JWT prerequisites and restart cert consumers"

Collaborator

We should probably figure out how to fix this generally so it doesn't require manual intervention, kind aside. cc Grant McCloskey (@MushuEE), Julian Gutierrez Oschmann (@juli4n). (Anyone is welcome to look at this, please; I just know these two are looking at Valkey.)

@BenTheElder

Collaborator

x-ref #225

Comment thread README.md
./hack/install-ate-kind.sh --delete-all
```

For local `kind` clusters that have been running for several days, generated

Collaborator

There is a chance the issue exists for the non-kind version too. Yesterday I tried to re-deploy ate-api and it failed, complaining about an expired token at startup. Deleting everything and installing from scratch solved the issue.
