Skip to content

fix(controller): skip DB migration when DB schema is ahead of binary#1963

Open
onematchfox wants to merge 1 commit into
kagent-dev:mainfrom
onematchfox:fix/migrations-compat-mode
Open

fix(controller): skip DB migration when DB schema is ahead of binary#1963
onematchfox wants to merge 1 commit into
kagent-dev:mainfrom
onematchfox:fix/migrations-compat-mode

Conversation

@onematchfox

Copy link
Copy Markdown
Contributor

When an older binary starts against a database that a newer binary has already migrated, calling mg.Up() fails because golang-migrate tries to read the down file for the current DB version, which does not exist in the older binary's embedded FS. This results in the controller crash looping (issue #1881).

Instead, detect when the database version exceeds the binary's highest known migration and return early. This relies on the expand-then-contract policy documented in database-migrations.md, which guarantees each release's code is compatible with the schema applied by the previous release.

Adds TestApplyDir_SucceedsWhenDBVersionAhead to cover this path.

Copilot AI review requested due to automatic review settings June 4, 2026 14:55
@github-actions github-actions Bot added the bug Something isn't working label Jun 4, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR makes the migration runner tolerant of running an older binary against a database schema migrated by a newer binary, avoiding crash-loops by skipping migration application when the DB version is ahead of the binary’s embedded migrations.

Changes:

  • Add a “compatibility mode” early-return in applyDir when DB schema version exceeds the binary’s max embedded migration version.
  • Introduce maxEmbeddedVersion to compute the highest embedded migration version from the filesystem.
  • Add a regression test ensuring applyDir succeeds (and does not modify schema version) when the DB version is ahead.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
go/core/pkg/migrations/runner.go Adds compatibility-mode logic and a helper to scan embedded migrations for the max version.
go/core/pkg/migrations/runner_test.go Adds a test proving older binaries don’t error or roll back when DB schema is ahead.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread go/core/pkg/migrations/runner.go
Comment thread go/core/pkg/migrations/runner.go
Comment thread go/core/pkg/migrations/runner.go Outdated
Comment thread go/core/pkg/migrations/runner_test.go
@onematchfox onematchfox force-pushed the fix/migrations-compat-mode branch from 40aad83 to 9382b86 Compare June 4, 2026 14:58
@onematchfox onematchfox requested a review from Copilot June 4, 2026 14:59

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread go/core/pkg/migrations/runner.go
Comment thread go/core/pkg/migrations/runner.go
Comment thread go/core/pkg/migrations/runner_test.go
@onematchfox onematchfox force-pushed the fix/migrations-compat-mode branch from 9382b86 to f39b229 Compare June 4, 2026 15:06
@onematchfox onematchfox requested a review from Copilot June 4, 2026 15:06

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread go/core/pkg/migrations/runner.go
Comment thread go/core/pkg/migrations/runner.go
Comment thread go/core/pkg/migrations/runner_test.go
When an older binary starts against a database that a newer binary has
already migrated, calling mg.Up() fails because golang-migrate tries to
read the down file for the current DB version, which does not exist in
the older binary's embedded FS. This caused a crash loop (issue kagent-dev#1881).

Instead, detect when the database version exceeds the binary's highest
known migration and return early. This relies on the expand-then-contract
policy documented in database-migrations.md, which guarantees each
release's code is compatible with the schema applied by the previous
release.

Adds TestApplyDir_SucceedsWhenDBVersionAhead to cover this path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
@onematchfox onematchfox force-pushed the fix/migrations-compat-mode branch from f39b229 to aa49e3e Compare June 5, 2026 10:11
Comment thread go/core/pkg/migrations/runner.go
iplay88keys
iplay88keys previously approved these changes Jun 5, 2026

@iplay88keys iplay88keys left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be changed when we update the db migration/rollback logic, but this is a good incremental improvement in the meantime.

@iplay88keys iplay88keys self-requested a review June 5, 2026 21:06
@iplay88keys iplay88keys dismissed their stale review June 5, 2026 21:06

Thinking more about this so we don't introduce a footgun here.

@iplay88keys

Copy link
Copy Markdown
Contributor

Sorry for un-approving this PR. I want to make sure the implications of this change are well thought through.

Let's start with aligning on a few things:

  • We only support n+1 upgrades and n-1 downgrades.
  • We also don't currently have a guarantee that a downgrade doesn't require manual steps, but we try to avoid them on upgrades.

The current db migrations are set up such that the application fails to start if there are more migrations applied to the db than the binary knows about. This is by design to prevent data corruption.

There is no way for the binary to know what migrations are in n+1 vs n+2 without going back and editing the old artifacts or making this incompatible with airgapped environments.

This PR removes the aforementioned check (comparing known migrations vs what's migrated), effectively introducing a requirement that Kagent stays compatible with future database schema shape changes and queries. It also leaves a subtle footgun as the information regarding the "compatibility mode" is only surfaced in a log line.

Even with this change, there is still a manual database rollback requirement in order to not be in an inconsistent state. So I'm not sure that it's worth taking on these additional support burdens in that case.

@onematchfox

Copy link
Copy Markdown
Contributor Author

We also don't currently have a guarantee that a downgrade doesn't require manual steps, but we try to avoid them on upgrades.

That's a bit of a understatement 😅; the only way to downgrade is through manual intervention at present. Even though the docs seem to indicate that this shouldn't ever be the case?

... and by automatic rollback on migration failure

The current db migrations are set up such that the application fails to start if there are more migrations applied to the db than the binary knows about. This is by design to prevent data corruption.

This is the issue. Any database schema update ultimately prevents any form of automated rollback. This design choice will also cause the existing version of the controller to crash-loop if it scales out during the rollout of a new version - this despite the fact that it is documented that:

During a rolling deploy, old pods will be reading and writing a schema that has already been upgraded. Every migration must be backward-compatible with the previous version's code.

There is no way for the binary to know what migrations are in n+1 vs n+2

Yes, unfortunately there is no alignment between the DB schema versions and the application version. Hence, I did not try to implement a check on whether a user and rather opted for a warning that would be visible if the controller crashed due to an incompatible schema. Worth noting here, is that if the documented process is followed then this would only ever happen when rolling back N+2 minor versions. E.g. rolling back 0.10.x->0.8.x. And even then, incompatible/breaking schema changes (dropping tables, columns, etc) are far less likely to happen than compatible/additive ones surely? Whilst there's isn't muchhistory in the current set of migrations, I don't see any backwards incompatible changes so far so we could theoretically be running 0.7.x against the schema in 0.9.x. Whereas in reality we can't even run 0.9.2 against the schema introduced by 0.9.3 (a patch release).

effectively introducing a requirement that Kagent stays compatible with future database schema shape changes and queries.

I don't expect KAgent to be compatible with all future database schema changes. But I do expect to be able to rollback a patch release, or a single minor version without manual intervention.

It also leaves a subtle footgun as the information regarding the "compatibility mode" is only surfaced in a log line.

IMO, this is a lot better than the current behaviour. If the controller crash loops, an operator will most likely look at logs and see the warning 🤞 .

Even with this change, there is still a manual database rollback requirement in order to not be in an inconsistent state

I disagree. If the expand/contract pattern is followed as documented there should not need to be a manual database rollback requirement when rolling back from N+1->N.

@iplay88keys

Copy link
Copy Markdown
Contributor

by automatic rollback on migration failure

At this point, the design around this is focused on temporary migration upgrade failures, with the goal of retrying db migrations. Full rollback of the application after a successful upgrade was not in scope for that design.

Yes, unfortunately there is no alignment between the DB schema versions and the application version.

This, coupled by the lack of preventing multiple version downgrades is still my biggest concern around this at the moment.

If the expand/contract pattern is followed as documented there should not need to be a manual database rollback requirement when rolling back from N+1->N.

Whether or not the application runs does not remove the fact that the db is no longer in the state expected by the migration system for the version of Kagent running. That may or may not be an issue. Another risk I see here is data being orphaned as a result of the upgrade process. The expand and contract model should cover this if implemented correctly, though.

IMO, this is a lot better than the current behaviour

I definitely see your point on this. I am actively looking into what db rollbacks looks like for Kagent and am including this as an option to fill in that feature gap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants