feat(decoredirect): auto-heal failed certs and add retry-cert API route#19
Merged
Conversation
When a Certificate enters cert-manager's exponential backoff (Issuing=False,
Reason=Failed), the controller now automatically detects it and checks whether
the domain DNS is correctly pointing to Deco's redirect infrastructure. If both
an HTTP check (X-Redirect-By: deco) and an AAAA check (no GCP 2600:1901::/32
range) pass, the Certificate is deleted so cert-manager retries without backoff.
Also adds POST /redirects/{domain}/retry-cert API route that performs the same
DNS checks and forces an immediate retry for operators who don't want to wait
for the next 30s reconcile cycle.
Root cause addressed: domains migrating from Deno Deploy sometimes retain AAAA
records in GCP range (2600:1901::/32). cert-manager's self-check uses IPv4 and
passes, but Let's Encrypt validates via IPv6, hits Deno Deploy, and fails —
leaving the Certificate stuck in multi-hour backoff.
…injectable Tests cover: - cert Failed + DNS ready → cert deleted (healed) - cert Failed + DNS wrong → cert untouched - cert Issuing=True → cert untouched (noop) - cert Ready=True → cert untouched (noop) - cert doesn't exist → no error Also skips healing when cert has DeletionTimestamp to avoid acting on a cert that is already being deleted.
…ct-blocked-ipv6 Removes the hardcoded GCP/Deno Deploy IPv6 range (2600:1901::/32) and replaces it with a configurable list of blocked CIDRs. When empty (default), no AAAA check is performed. Configure for Deco's deployment with: --redirect-blocked-ipv6=2600:1901::/32 Also accepts REDIRECT_BLOCKED_IPV6 env var.
…--redirect-blocked-ipv6" This reverts commit f626842.
…ct-blocked-ipv6 Removes hardcoded GCP range. Configure blocked CIDRs via: - --redirect-blocked-ipv6=2600:1901::/32 (flag) - REDIRECT_BLOCKED_IPV6=2600:1901::/32 (env) - redirect.blockedIPv6CIDRs in Helm values Default is empty (no AAAA check).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Certificateis stuck in cert-manager's exponential backoff (Issuing=False, Reason=Failed), the reconciler now checks DNS readiness every 30s. If both checks pass, it deletes the Certificate so cert-manager retries immediately without backoff.POST /redirects/{domain}/retry-cert: new API route for operators to force an immediate retry without waiting for the next reconcile cycle.GET http://<domain>/must returnX-Redirect-By: deco(confirms IPv4 traffic reaches nginx)2600:1901::/32(prevents Let's Encrypt from validating via Deno Deploy's IPv6)Root cause
During migration from Deno Deploy, domains sometimes retain AAAA records pointing to GCP (
2600:1901::/32). cert-manager's self-check uses IPv4 and passes, but Let's Encrypt validates via IPv6, hits Deno Deploy, and fails — leaving the Certificate stuck in multi-hour exponential backoff (1h → 2h → 4h → 8h).Behavior
Issuing=True(challenge pending)Ready=TrueIssuing=False, Reason=Failed+ DNS wrongIssuing=False, Reason=Failed+ DNS correctTest plan
2600:1901::/32AAAA record — confirm Certificate staysFaileduntil AAAA is removed, then auto-heals within 30sPOST /redirects/{domain}/retry-certbefore fixing DNS — confirm 422 with AAAA error messagePOST /redirects/{domain}/retry-certafter fixing DNS — confirm cert is reset and issuedCertReady=Trueare unaffected🤖 Generated with Claude Code
Summary by cubic
Auto-heals stuck redirect TLS certificates by deleting failed
Certificates once DNS is correct. Blocked IPv6 AAAA ranges are configurable, and thePOST /redirects/{domain}/retry-certendpoint is removed (healing runs in reconcile).New Features
Issuing=False, Reason=Failedand DNS checks pass, delete theCertificateso cert-manager retries without backoff.X-Redirect-By: deco; optional AAAA check blocks issuance only when an address falls in configured IPv6 CIDRs (default empty = no AAAA check).--redirect-blocked-ipv6,REDIRECT_BLOCKED_IPV6, or Helmredirect.blockedIPv6CIDRs.Bug Fixes
Certificates withDeletionTimestampto avoid update conflicts._ = resp.Body.Close().--redirect-blocked-ipv6whenredirect.blockedIPv6CIDRsis set.Written for commit 34d997b. Summary will update on new commits.