Skip to content

fix(gitops-update): refuse semver downgrade unless explicitly allowed#396

Draft
alexgarzao wants to merge 1 commit into
mainfrom
fix/gitops-semver-guard
Draft

fix(gitops-update): refuse semver downgrade unless explicitly allowed#396
alexgarzao wants to merge 1 commit into
mainfrom
fix/gitops-semver-guard

Conversation

@alexgarzao
Copy link
Copy Markdown

⚠️ Opened as draft — DevOps team review required. Behavior change affects every release pipeline that uses this reusable workflow.

Summary

gitops-update.yml currently writes image tags into values.yaml whenever CURRENT_TAG != TAG, regardless of semver precedence. When a production tag fires, the env loop is dev stg prd sandbox — production releases overwrite dev paths indiscriminately. This caused a multi-day incident (details below).

This PR adds a semver-correct precedence guard with an explicit override.

Before After
1.1.1 overwrites 1.2.0-beta.12 silently 1.1.11.2.0-beta.12 is refused with ::warning::; commit is skipped
No way to express "downgrade allowed" intent New input allow_downgrade: bool (default false); set true for intentional rollbacks

Root incident (the bug this prevents)

On 2026-05-31 00:49 UTC, the auto-bump commit LerianStudio/midaz-firmino-gitops@2e46be40 (ci(flowker): update image tags (production)) downgraded 4 dev paths (firmino, anacleto, benedita, clotilde) from 1.2.0-beta.12 to 1.1.1. The audit_db schema in dev had been migrated to v2 by the beta deploys; image :1.1.1 only ships migration 001. Result: 3 of 4 dev pods stuck in CrashLoopBackOff for 2+ days (firmino-flowker-dev reached 270 restarts before the manual fix). Out-of-band fix: LerianStudio/midaz-firmino-gitops#814.

Changes

  • New input allow_downgrade (boolean, default false).
  • Pure-bash semver_gt function — no install step, no external dependency.
    • sort -V was tried first and rejected: GNU coreutils does not implement semver §11 precedence for prereleases (it sorts 1.2.0 < 1.2.0-beta.12, which is backwards).
  • Guard applied at 3 update sites:
    • helmfile values.yaml — image tag mappings
    • helmfile values.yaml — configmap key mappings
    • kustomization.yaml — kustomize edit set image
  • Empty/non-semver current value → bypass guard with a ::warning:: (first installs and exotic schemes are not blocked).

Local test matrix (12 cases, all passing)

PASS  1.2.0              vs 1.2.0-beta.12       -> GT   (release > prerelease)
PASS  1.2.0-beta.12      vs 1.2.0               -> LT
PASS  1.2.0              vs 1.1.99              -> GT
PASS  1.1.1              vs 1.2.0-beta.12       -> LT   ← THE INCIDENT
PASS  1.2.0-beta.13      vs 1.2.0-beta.12       -> GT
PASS  1.2.0-rc.1         vs 1.2.0-beta.99       -> GT   (rc > beta)
PASS  1.2.0-alpha        vs 1.2.0-alpha.1       -> LT   (shorter prerelease < longer)
PASS  1.1.1              vs 1.1.1               -> EQ
PASS  2.0.0              vs 1.99.99             -> GT
PASS  v1.2.3             vs 1.2.3               -> EQ   (v-prefix tolerated)
PASS  1.2.0-beta.10      vs 1.2.0-beta.9        -> GT   (numeric ordering)
PASS  1.2.0-beta.12      vs 1.2.0-beta.13       -> LT

Risk / backward compatibility

  • Default behavior changes for callers: a release that previously silently downgraded will now refuse with a warning. This is the intended behavior; it surfaces a class of bug that was previously invisible.
  • Callers that genuinely need rollback set allow_downgrade: true at the call site.
  • No new dependencies, no new install steps, runs in the same bash shell.
  • No changes to the argocd_sync, commit, or slack-notify jobs.

Test plan

  • Local unit tests pass (12/12 — see matrix above).
  • Trigger a synthetic release in a sandbox caller (self-pr-validation.yml?) with 1.1.1 over a dev path on 1.2.0-beta.X — confirm the workflow log shows the refusal warning and the commit step is a no-op.
  • Same scenario with allow_downgrade: true — confirm the downgrade is applied (backward-compatible escape hatch).
  • Upgrade scenario (1.2.0-beta.13 over 1.2.0-beta.12) — confirm it still works (no regression).
  • Equal scenario — confirm it still no-ops as before.

Follow-up not in this PR

The IS_PRODUCTION env loop is dev stg prd sandbox. Even with the semver guard, this still writes to dev (it just no-ops on downgrades). A stricter policy might be: production releases never touch dev. That is a behavioral change that deserves its own discussion with the DevOps team and is out of scope here.

cc @LerianStudio/g_github_devops

The release pipeline currently overwrites image tags by simple inequality
("$CURRENT_TAG" != "$TAG"). When a production release fires (IS_PRODUCTION
loop = "dev stg prd sandbox"), it writes the production tag into every
env — including dev paths that were already on a higher pre-release.

Recent incident: on 2026-05-31 a release of flowker v1.1.1 downgraded
4 dev paths (firmino, anacleto, benedita, clotilde) from 1.2.0-beta.12
back to 1.1.1. The audit_db schema in dev had been migrated to v2 by
the beta deploys; the older image only shipped migration 001, so the
app panicked with "source directory missing or empty" on startup. Two
dev environments stayed in CrashLoopBackOff for 2+ days. Fix shipped
out-of-band via LerianStudio/midaz-firmino-gitops#814 by manually
restoring tag to 1.2.0-beta.12.

This change:

- Adds new boolean input `allow_downgrade` (default false). Callers who
  legitimately need to roll back (e.g. emergency revert) opt in.
- Adds a semver-correct `semver_gt` bash function (no external deps).
  Pure bash so the step keeps zero install cost; sort -V was rejected
  because GNU coreutils does not implement prerelease precedence per
  semver.org#spec-item-11 (it sorts "1.2.0" before "1.2.0-beta.X").
- Wraps the three update sites with the guard:
    * helmfile values.yaml — image tag mappings
    * helmfile values.yaml — configmap key mappings
    * kustomization.yaml — `kustomize edit set image`
- Treats empty/non-semver current values as "skip the check, allow
  write" with a warning, so first installs and exotic tag schemes
  (e.g. branch SHAs) are not blocked.

Behavior:

- 1.1.1 over 1.2.0-beta.12 → REFUSED with ::warning::, exit clean,
  no commit. Caller sees the warning and decides next step.
- 1.2.0 over 1.2.0-beta.12 → ALLOWED (release > prerelease, semver).
- 1.2.0-beta.13 over 1.2.0-beta.12 → ALLOWED.
- Equal values → no-op as before.
- allow_downgrade: true → previous behavior preserved verbatim.

Tested locally against 12 precedence cases including the failure that
triggered this fix; all pass.

Follow-up (not in this PR): the IS_PRODUCTION env loop "dev stg prd
sandbox" is overly broad — production releases should arguably never
touch dev. That is a wider conversation; this PR is the minimal safety
net that catches the failure without changing the semantics callers
already rely on.

Refs: LerianStudio/midaz-firmino-gitops#814
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 3, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 12ebf88f-459c-490a-ada9-89079ab091ff

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/gitops-semver-guard

Comment @coderabbitai help to get the list of available commands and usage tips.

@lerian-studio
Copy link
Copy Markdown
Contributor

🔍 Lint Analysis

Check Files Scanned Status
YAML Lint 1 file(s) ✅ success
Action Lint 1 file(s) ✅ success
Pinned Actions 1 file(s) ✅ success
Markdown Link Check no changes ⏭️ skipped
Spelling Check 1 file(s) ✅ success
Shell Check 1 file(s) ✅ success
README Check 1 file(s) ✅ success
Composite Schema no changes ⏭️ skipped
Deployment Matrix no changes ⏭️ skipped

🔍 View full scan logs

@lerian-studio
Copy link
Copy Markdown
Contributor

🛡️ CodeQL Analysis Results

Languages analyzed: actions

Found 2 issue(s): 2 Medium

Severity Rule File Message
🟡 Medium actions/untrusted-checkout/medium .github/workflows/gitops-update.yml:147 Potential unsafe checkout of untrusted pull request on privileged workflow.
🟡 Medium actions/untrusted-checkout/medium .github/workflows/gitops-update.yml:155 Potential unsafe checkout of untrusted pull request on privileged workflow.

🔍 View full scan logs | 🛡️ Security tab

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants