Handle GitHub rate limits with specialized error and Prometheus metrics#6306
Handle GitHub rate limits with specialized error and Prometheus metrics#6306evankanderson merged 7 commits intomindersec:mainfrom
Conversation
evankanderson
left a comment
There was a problem hiding this comment.
A few small comments, but this is looking reasonable, thank you!
| if resp != nil && (resp.StatusCode == http.StatusNotFound || resp.StatusCode == http.StatusForbidden) { | ||
| // If the hook is not found, we can ignore the | ||
| // error, user might have deleted it manually. | ||
| // We also ignore deleting webhooks that we're not | ||
| // allowed to touch. This is usually the case | ||
| // with repository transfer. | ||
| return nil | ||
| } |
There was a problem hiding this comment.
I'm not sure these are in the "err is set" case. (They weren't in such a block before.) Are you sure that these set err?
There was a problem hiding this comment.
Still wondering about these being in the err != nil case, since they weren't previously in such a block.
| // or if it's an abuse rate limit. | ||
| isThrottled := false | ||
| if remainingHeader != "" { | ||
| if remaining, err := strconv.ParseInt(remainingHeader, 10, 64); err == nil && remaining == 0 { |
There was a problem hiding this comment.
I'd prefer to parse the remainingHeader once rather than re-processing it here. Maybe use a sentinel value like math.MinInt64 or the like to detect that the value couldn't be found?
| if resp.StatusCode == http.StatusForbidden || resp.StatusCode == http.StatusTooManyRequests { | ||
| // Verify it's actually a rate limit error by checking the remaining quota (if 0) | ||
| // or if it's an abuse rate limit. | ||
| isThrottled := false |
There was a problem hiding this comment.
You can simplify this with the case on 78-80:
| isThrottled := false | |
| isThrottled := resp.StatusCode == http.StatusTooManyRequests | |
| // then add the case where we don't get a 429, but we have 0 quota remaining |
- common.go: use errors.AsType[T] (Go 1.26) in isRateLimitError - go.mod: bump Go version to 1.26 - telemetry.go: parse remainingHeader once using math.MinInt64 sentinel, simplify isThrottled initialization to resp.StatusCode == 429 - telemetry_test.go: add t.Parallel() to test function and subtest loop
|
@evankanderson PTAL |
- datasources/rest/handler.go: drain and close response body before retrying on 429 to fix gosec G104 unhandled error warnings - security.yml: add helm dependency update step before Helm scan so the common chart dependency is present for Trivy to render the chart
evankanderson
left a comment
There was a problem hiding this comment.
Looks like you have a merge conflict in internal/datasources/rest/handler.go (just discard your changes), and one question in common.go where you moved a block inside an if err != nil block where it wasn't before.
| Msg("rate limited, retrying") | ||
| retryCount++ | ||
| // Drain and close the body before retrying to allow connection reuse. | ||
| _, _ = io.Copy(io.Discard, resp.Body) |
There was a problem hiding this comment.
I think these were handled in #6312 -- maybe sync / discard these changes?
(Sorry, between a few CI flakes and about 5-6 PRs merging each day, things have gotten less clean than I'd like.)
| if resp != nil && (resp.StatusCode == http.StatusNotFound || resp.StatusCode == http.StatusForbidden) { | ||
| // If the hook is not found, we can ignore the | ||
| // error, user might have deleted it manually. | ||
| // We also ignore deleting webhooks that we're not | ||
| // allowed to touch. This is usually the case | ||
| // with repository transfer. | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Still wondering about these being in the err != nil case, since they weren't previously in such a block.
|
I have addressed both items: internal/datasources/rest/handler.go: Discarded the changes related to body draining as they are indeed handled in #6312. Will be fixing the tests now |
evankanderson
left a comment
There was a problem hiding this comment.
One lint error, and you probably need to revert the change to security.yml.
| - name: Fetch Helm chart dependencies | ||
| if: always() | ||
| run: helm dependency update ./deployment/helm | ||
|
|
There was a problem hiding this comment.
This is probably wrong -- it will end up updating the helm dependencies for the security scan but NOT for the actual built components, so we'd end up shipping vulnerabilities that our scans don't find.
There was a problem hiding this comment.
Probably the correct fix here (in a separate PR!) is to add helm to the .github/dependabot.yml configuration if you want to do so.
| } | ||
| // If error is 403 then it means we are missing permissions | ||
| if resp != nil && resp.StatusCode == 403 { | ||
| return nil, fmt.Errorf("missing permissions: check") |
There was a problem hiding this comment.
You have a lint error here (you return, then more code follows it -- in this case, another return).
|
@evankanderson PTAL |
Summary
This PR implements robust handling for GitHub API throttling and exposes relevant Prometheus metrics to monitor rate limit usage, resolving Issue #4616.
Key changes:
pkg/engine/errorsthat wraps GitHub rate limit information (limit, remaining, reset time), giving upper layers structured context for throttling events.internal/providers/telemetryto automatically parse GitHub-specificX-RateLimit-*headers from every response.github.http.ratelimit.remaining(Histogram): Records the available quota to monitor usage patterns.github.http.throttled.count(Counter): Tracks the frequency of 403/429 responses specifically caused by rate limiting.CreateHook,DeleteHook,UpdateCheckRun, etc.) to consistently use the new error type and the existing rate-limit wait logic.pkg/engine/errorsto match the latest upstream repository organization.Fixes #4616
Testing
internal/providers/telemetry/telemetry_test.gousing OpenTelemetry'sManualReaderto verify accurate header parsing and metric recording across normal and throttled scenarios.internal/providers/github/common_test.goto ensure correct propagation of the new specialized error type.make buildand a full test run.