Skip to content

feat(decoredirect): auto-heal failed certs and add retry-cert API route#19

Merged
igoramf merged 9 commits into
mainfrom
feat/validate-redirect-cert
Jun 5, 2026
Merged

feat(decoredirect): auto-heal failed certs and add retry-cert API route#19
igoramf merged 9 commits into
mainfrom
feat/validate-redirect-cert

Conversation

@igoramf

@igoramf igoramf commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Controller auto-healing: when a Certificate is stuck in cert-manager's exponential backoff (Issuing=False, Reason=Failed), the reconciler now checks DNS readiness every 30s. If both checks pass, it deletes the Certificate so cert-manager retries immediately without backoff.
  • POST /redirects/{domain}/retry-cert: new API route for operators to force an immediate retry without waiting for the next reconcile cycle.
  • DNS readiness checks (shared by both):
    1. HTTP check — GET http://<domain>/ must return X-Redirect-By: deco (confirms IPv4 traffic reaches nginx)
    2. AAAA check — no IPv6 record in GCP range 2600:1901::/32 (prevents Let's Encrypt from validating via Deno Deploy's IPv6)

Root cause

During migration from Deno Deploy, domains sometimes retain AAAA records pointing to GCP (2600:1901::/32). cert-manager's self-check uses IPv4 and passes, but Let's Encrypt validates via IPv6, hits Deno Deploy, and fails — leaving the Certificate stuck in multi-hour exponential backoff (1h → 2h → 4h → 8h).

Behavior

State Auto-heal retry-cert
Issuing=True (challenge pending) no-op no-op
Ready=True no-op no-op
Issuing=False, Reason=Failed + DNS wrong waits returns 422 with reason
Issuing=False, Reason=Failed + DNS correct deletes cert → recreates deletes cert → recreates

Test plan

  • Deploy to staging and create a DecoRedirect with a domain that has a 2600:1901::/32 AAAA record — confirm Certificate stays Failed until AAAA is removed, then auto-heals within 30s
  • Call POST /redirects/{domain}/retry-cert before fixing DNS — confirm 422 with AAAA error message
  • Call POST /redirects/{domain}/retry-cert after fixing DNS — confirm cert is reset and issued
  • Existing redirects with CertReady=True are unaffected

🤖 Generated with Claude Code


Summary by cubic

Auto-heals stuck redirect TLS certificates by deleting failed Certificates once DNS is correct. Blocked IPv6 AAAA ranges are configurable, and the POST /redirects/{domain}/retry-cert endpoint is removed (healing runs in reconcile).

  • New Features

    • Controller auto-heal: if Issuing=False, Reason=Failed and DNS checks pass, delete the Certificate so cert-manager retries without backoff.
    • DNS readiness: HTTP must return X-Redirect-By: deco; optional AAAA check blocks issuance only when an address falls in configured IPv6 CIDRs (default empty = no AAAA check).
    • Config: set blocked ranges via --redirect-blocked-ipv6, REDIRECT_BLOCKED_IPV6, or Helm redirect.blockedIPv6CIDRs.
  • Bug Fixes

    • Skip acting on Certificates with DeletionTimestamp to avoid update conflicts.
    • Close HTTP response bodies with _ = resp.Body.Close().
    • Helm generator: include --redirect-blocked-ipv6 when redirect.blockedIPv6CIDRs is set.

Written for commit 34d997b. Summary will update on new commits.

Review in cubic

igoramf added 9 commits June 3, 2026 15:54
When a Certificate enters cert-manager's exponential backoff (Issuing=False,
Reason=Failed), the controller now automatically detects it and checks whether
the domain DNS is correctly pointing to Deco's redirect infrastructure. If both
an HTTP check (X-Redirect-By: deco) and an AAAA check (no GCP 2600:1901::/32
range) pass, the Certificate is deleted so cert-manager retries without backoff.

Also adds POST /redirects/{domain}/retry-cert API route that performs the same
DNS checks and forces an immediate retry for operators who don't want to wait
for the next 30s reconcile cycle.

Root cause addressed: domains migrating from Deno Deploy sometimes retain AAAA
records in GCP range (2600:1901::/32). cert-manager's self-check uses IPv4 and
passes, but Let's Encrypt validates via IPv6, hits Deno Deploy, and fails —
leaving the Certificate stuck in multi-hour backoff.
…injectable

Tests cover:
- cert Failed + DNS ready → cert deleted (healed)
- cert Failed + DNS wrong → cert untouched
- cert Issuing=True → cert untouched (noop)
- cert Ready=True → cert untouched (noop)
- cert doesn't exist → no error

Also skips healing when cert has DeletionTimestamp to avoid
acting on a cert that is already being deleted.
…ct-blocked-ipv6

Removes the hardcoded GCP/Deno Deploy IPv6 range (2600:1901::/32) and replaces it
with a configurable list of blocked CIDRs. When empty (default), no AAAA check is
performed. Configure for Deco's deployment with:

  --redirect-blocked-ipv6=2600:1901::/32

Also accepts REDIRECT_BLOCKED_IPV6 env var.
…ct-blocked-ipv6

Removes hardcoded GCP range. Configure blocked CIDRs via:
- --redirect-blocked-ipv6=2600:1901::/32 (flag)
- REDIRECT_BLOCKED_IPV6=2600:1901::/32 (env)
- redirect.blockedIPv6CIDRs in Helm values

Default is empty (no AAAA check).
@igoramf igoramf merged commit 68f58ea into main Jun 5, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant