Skip to content

Handle GitHub rate limits with specialized error and Prometheus metrics#6306

Merged
evankanderson merged 7 commits intomindersec:mainfrom
krrish175-byte:fix/issue-4616
Apr 13, 2026
Merged

Handle GitHub rate limits with specialized error and Prometheus metrics#6306
evankanderson merged 7 commits intomindersec:mainfrom
krrish175-byte:fix/issue-4616

Conversation

@krrish175-byte
Copy link
Copy Markdown
Contributor

Summary

This PR implements robust handling for GitHub API throttling and exposes relevant Prometheus metrics to monitor rate limit usage, resolving Issue #4616.

Key changes:

  • Specialized RateLimitError: Introduced a new error type in pkg/engine/errors that wraps GitHub rate limit information (limit, remaining, reset time), giving upper layers structured context for throttling events.
  • Telemetry Instrumentation: Extended the HTTP client instrumentation in internal/providers/telemetry to automatically parse GitHub-specific X-RateLimit-* headers from every response.
  • Observability Metrics:
    • github.http.ratelimit.remaining (Histogram): Records the available quota to monitor usage patterns.
    • github.http.throttled.count (Counter): Tracks the frequency of 403/429 responses specifically caused by rate limiting.
  • Provider Consistency: Updated core GitHub provider methods (CreateHook, DeleteHook, UpdateCheckRun, etc.) to consistently use the new error type and the existing rate-limit wait logic.
  • Structural Alignment: Migrated the engine errors package to pkg/engine/errors to match the latest upstream repository organization.

Fixes #4616

Testing

  • Automated Verification: Created a new test suite in internal/providers/telemetry/telemetry_test.go using OpenTelemetry's ManualReader to verify accurate header parsing and metric recording across normal and throttled scenarios.
  • Integration Stability: Updated and verified existing tests in internal/providers/github/common_test.go to ensure correct propagation of the new specialized error type.
  • Full Build: Verified project stability with a clean make build and a full test run.

@krrish175-byte krrish175-byte requested a review from a team as a code owner April 8, 2026 17:07
@coveralls
Copy link
Copy Markdown

coveralls commented Apr 8, 2026

Coverage Status

coverage: 59.539% (+0.2%) from 59.327% — krrish175-byte:fix/issue-4616 into mindersec:main

Copy link
Copy Markdown
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small comments, but this is looking reasonable, thank you!

Comment thread internal/providers/github/common.go Outdated
Comment on lines +595 to +602
if resp != nil && (resp.StatusCode == http.StatusNotFound || resp.StatusCode == http.StatusForbidden) {
// If the hook is not found, we can ignore the
// error, user might have deleted it manually.
// We also ignore deleting webhooks that we're not
// allowed to touch. This is usually the case
// with repository transfer.
return nil
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure these are in the "err is set" case. (They weren't in such a block before.) Are you sure that these set err?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still wondering about these being in the err != nil case, since they weren't previously in such a block.

Comment thread internal/providers/github/common.go
// or if it's an abuse rate limit.
isThrottled := false
if remainingHeader != "" {
if remaining, err := strconv.ParseInt(remainingHeader, 10, 64); err == nil && remaining == 0 {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to parse the remainingHeader once rather than re-processing it here. Maybe use a sentinel value like math.MinInt64 or the like to detect that the value couldn't be found?

if resp.StatusCode == http.StatusForbidden || resp.StatusCode == http.StatusTooManyRequests {
// Verify it's actually a rate limit error by checking the remaining quota (if 0)
// or if it's an abuse rate limit.
isThrottled := false
Copy link
Copy Markdown
Member

@evankanderson evankanderson Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify this with the case on 78-80:

Suggested change
isThrottled := false
isThrottled := resp.StatusCode == http.StatusTooManyRequests
// then add the case where we don't get a 429, but we have 0 quota remaining

Comment thread internal/providers/telemetry/telemetry_test.go
Comment thread internal/providers/telemetry/telemetry_test.go
- common.go: use errors.AsType[T] (Go 1.26) in isRateLimitError
- go.mod: bump Go version to 1.26
- telemetry.go: parse remainingHeader once using math.MinInt64 sentinel,
  simplify isThrottled initialization to resp.StatusCode == 429
- telemetry_test.go: add t.Parallel() to test function and subtest loop
@krrish175-byte
Copy link
Copy Markdown
Contributor Author

@evankanderson PTAL

- datasources/rest/handler.go: drain and close response body before
  retrying on 429 to fix gosec G104 unhandled error warnings
- security.yml: add helm dependency update step before Helm scan so
  the common chart dependency is present for Trivy to render the chart
Copy link
Copy Markdown
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you have a merge conflict in internal/datasources/rest/handler.go (just discard your changes), and one question in common.go where you moved a block inside an if err != nil block where it wasn't before.

Comment thread internal/datasources/rest/handler.go Outdated
Msg("rate limited, retrying")
retryCount++
// Drain and close the body before retrying to allow connection reuse.
_, _ = io.Copy(io.Discard, resp.Body)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these were handled in #6312 -- maybe sync / discard these changes?

(Sorry, between a few CI flakes and about 5-6 PRs merging each day, things have gotten less clean than I'd like.)

Comment thread internal/providers/github/common.go Outdated
Comment on lines +595 to +602
if resp != nil && (resp.StatusCode == http.StatusNotFound || resp.StatusCode == http.StatusForbidden) {
// If the hook is not found, we can ignore the
// error, user might have deleted it manually.
// We also ignore deleting webhooks that we're not
// allowed to touch. This is usually the case
// with repository transfer.
return nil
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still wondering about these being in the err != nil case, since they weren't previously in such a block.

@krrish175-byte
Copy link
Copy Markdown
Contributor Author

I have addressed both items:

internal/datasources/rest/handler.go: Discarded the changes related to body draining as they are indeed handled in #6312.
internal/providers/github/common.go: Refactored DeleteHook to move the status code checks (404/403) outside of the explicit if err != nil block, restoring the original logic structure.

Will be fixing the tests now

Copy link
Copy Markdown
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One lint error, and you probably need to revert the change to security.yml.

Comment thread .github/workflows/security.yml Outdated
Comment on lines +39 to +42
- name: Fetch Helm chart dependencies
if: always()
run: helm dependency update ./deployment/helm

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably wrong -- it will end up updating the helm dependencies for the security scan but NOT for the actual built components, so we'd end up shipping vulnerabilities that our scans don't find.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the correct fix here (in a separate PR!) is to add helm to the .github/dependabot.yml configuration if you want to do so.

Comment thread internal/providers/github/common.go Outdated
}
// If error is 403 then it means we are missing permissions
if resp != nil && resp.StatusCode == 403 {
return nil, fmt.Errorf("missing permissions: check")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a lint error here (you return, then more code follows it -- in this case, another return).

@krrish175-byte
Copy link
Copy Markdown
Contributor Author

@evankanderson PTAL

@evankanderson evankanderson merged commit d547963 into mindersec:main Apr 13, 2026
26 checks passed
@krrish175-byte krrish175-byte deleted the fix/issue-4616 branch April 14, 2026 04:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Handle GitHub throttling errors properly

3 participants