Add reproducing test for scrape failure with >70k metrics #1003

Closed
jaqx0r wants to merge 629 commits into google:main from jaqx0r:issue-903-metrics-scrape

Contributor

@jaqx0r jaqx0r commented May 26, 2026

Added two table-driven tests and a benchmark to internal/exporter/prometheus_test.go that exercise the exporter at high metric cardinality:

  • TestWritePrometheusManyLabelValues — single metric with N label values (100, 1000, 10000), verifying the exporter can encode all values.
  • TestWritePrometheusManyMetrics — N separate metrics each with one datum, exercising the goroutine-per-metric pattern in Exporter.Collect (prometheus.go:46) that slows down at 70k+ scale.
  • BenchmarkWritePrometheus — latency measurements for both patterns at 100/1k/10k.

Benchmarks on this machine show ~40–50ms for 10k items, extrapolating to ~270–330ms for 70k, which is fast enough that pure serialisation isn't the bottleneck. The failure likely stems from the /metrics handler WriteTimeout (5s) combined with the full reg.Gather() path (Go/process/expvar collectors + user metrics) and the time to stream ~5.6MB of response text.

Fixes #903

github-actions Bot and others added 30 commits October 23, 2025 23:10
…bazelbuild/rules_go-0.58.0

build(deps): bump github.com/bazelbuild/rules_go from 0.57.0 to 0.58.0
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 5.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v4...v5)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github.com/bazelbuild/rules_go](https://github.com/bazelbuild/rules_go) from 0.58.0 to 0.58.1.
- [Release notes](https://github.com/bazelbuild/rules_go/releases)
- [Commits](bazel-contrib/rules_go@v0.58.0...v0.58.1)

---
updated-dependencies:
- dependency-name: github.com/bazelbuild/rules_go
  dependency-version: 0.58.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
…bazelbuild/rules_go-0.58.1

build(deps): bump github.com/bazelbuild/rules_go from 0.58.0 to 0.58.1
…/upload-artifact-5

build(deps): bump actions/upload-artifact from 4 to 5
Bumps [github.com/bazelbuild/rules_go](https://github.com/bazelbuild/rules_go) from 0.58.1 to 0.58.2.
- [Release notes](https://github.com/bazelbuild/rules_go/releases)
- [Commits](bazel-contrib/rules_go@v0.58.1...v0.58.2)

---
updated-dependencies:
- dependency-name: github.com/bazelbuild/rules_go
  dependency-version: 0.58.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
…bazelbuild/rules_go-0.58.2

build(deps): bump github.com/bazelbuild/rules_go from 0.58.1 to 0.58.2
Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.67.1 to 0.67.2.
- [Release notes](https://github.com/prometheus/common/releases)
- [Changelog](https://github.com/prometheus/common/blob/main/CHANGELOG.md)
- [Commits](prometheus/common@v0.67.1...v0.67.2)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-version: 0.67.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
…prometheus/common-0.67.2

build(deps): bump github.com/prometheus/common from 0.67.1 to 0.67.2
Bumps [github.com/bazelbuild/rules_go](https://github.com/bazelbuild/rules_go) from 0.58.2 to 0.58.3.
- [Release notes](https://github.com/bazelbuild/rules_go/releases)
- [Commits](bazel-contrib/rules_go@v0.58.2...v0.58.3)

---
updated-dependencies:
- dependency-name: github.com/bazelbuild/rules_go
  dependency-version: 0.58.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
…bazelbuild/rules_go-0.58.3

build(deps): bump github.com/bazelbuild/rules_go from 0.58.2 to 0.58.3
Bumps [github.com/bazelbuild/rules_go](https://github.com/bazelbuild/rules_go) from 0.58.3 to 0.59.0.
- [Release notes](https://github.com/bazelbuild/rules_go/releases)
- [Commits](bazel-contrib/rules_go@v0.58.3...v0.59.0)

---
updated-dependencies:
- dependency-name: github.com/bazelbuild/rules_go
  dependency-version: 0.59.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
…bazelbuild/rules_go-0.59.0

build(deps): bump github.com/bazelbuild/rules_go from 0.58.3 to 0.59.0
Bumps [golang.org/x/sys](https://github.com/golang/sys) from 0.37.0 to 0.38.0.
- [Commits](golang/sys@v0.37.0...v0.38.0)

---
updated-dependencies:
- dependency-name: golang.org/x/sys
  dependency-version: 0.38.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
…x/sys-0.38.0

build(deps): bump golang.org/x/sys from 0.37.0 to 0.38.0
Bumps [golang.org/x/tools](https://github.com/golang/tools) from 0.38.0 to 0.39.0.
- [Release notes](https://github.com/golang/tools/releases)
- [Commits](golang/tools@v0.38.0...v0.39.0)

---
updated-dependencies:
- dependency-name: golang.org/x/tools
  dependency-version: 0.39.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
…x/tools-0.39.0

build(deps): bump golang.org/x/tools from 0.38.0 to 0.39.0
chore(deps): update dependency aspect_bazel_lib to v2.21.2
chore(deps): update distroless_base docker digest to 9e9b50d
chore(deps): update dependency gazelle to v0.47.0
…tions

chore(deps): update github artifact actions (major)
chore(deps): update dependency rules_go to v0.59.0
Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.67.2 to 0.67.3.
- [Release notes](https://github.com/prometheus/common/releases)
- [Changelog](https://github.com/prometheus/common/blob/main/CHANGELOG.md)
- [Commits](prometheus/common@v0.67.2...v0.67.3)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-version: 0.67.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
…prometheus/common-0.67.3

build(deps): bump github.com/prometheus/common from 0.67.2 to 0.67.3
renovate Bot and others added 25 commits May 18, 2026 01:14
chore(deps): update distroless_base docker digest to f2df870
fix(deps): update module golang.org/x/sys to v0.45.0
chore(deps): update docker/login-action action to v4.2.0
Pointers to the input `LogLine` and the thread state were being held
in memory between executions of the `ProcessLogLine` function.

This change clears both pointers at the end of the program run to ensure
the GC cleans them up.

This may be related to Issue: #390 but I'm not fully convinced yet.
`v.t` is the same pointer and gets reset each time.
Clone a key when popping from the stack instead of using it directly; at
this moment the programme is running and the LogLine is still live, but
this is the last moment when the capture group reference should be
shared with the LogLine.  After this moment, we must use new memory to
store the datum keys to avoid pinning potentially large log lines in
memory permanently.

Issue: #390
fix: Repair two memory leaks in the VM.
build: Add initial AGENTS instructions.
test: Skip timezone test when the tz db is not available.
test: Handle IPv6 and fallback to IPv4 addresses in testutil.FreePort.
chore: Update MODULE.bazel.lock
The custom list is underspecified, and causes `mtail` to crash before
`main`. This change allows all the normal Go runtime system calls to
execute.

Also remove duplicate LockPersonality option.

Fixes: #175
fix: Use the default SystemCallFilter setting for systemd hardening.
Show how to simulate production load and profile `mtail` memory.

Issue: #390
For every log line we allocated and then released memory for the thread
struct in the VM.  This is costly in CPU because of the GC churn it
causes.  Instead, allocate from a pool, which reuses memory that has
already been allocated.

The per-log-line allocation of threads is now eliminated.

| Metric | Before (no pool) | After (sync.Pool) | Change |
|---|---|---|---|
| **Total cumulative alloc_space** | ~60 MB | **~16–19 MB** | **~3× reduction** |
| `VM.ProcessLogLine` **flat** | 17.0 MB (28%) | **0 MB** | eliminated |
| `VM.execute` **flat** | 7.5 MB (13.7%) | **0.5 MB** (3%) | ~15× reduction |
| Pool init (`New.func1`) | — | 2.5 MB | one-time cost |

Issue: #390
refactor: Use a memory pool to avoid reallocating thread memory.
This avoids about 2MB of result allocations per regexp match.

`regexp.FindStringSubmatch` returns `[]string` — a new slice **plus N+1 new string allocations** (one for the full match, one per capture group). Each string copies its data from the original log line onto the heap. On a typical dhcpd log line with 7 capture groups, that's 8 separate heap-allocated strings per match.

`regexp.FindStringSubmatchIndex` returns `[]int` — a single flat slice of byte-position pairs. No string data is copied. The original text is kept alive by a `matchResult.text` reference, and capture groups are read on demand via `text[indices[2*n]:indices[2*n+1]]`, which creates only a lightweight string header (no data copy).

An alternation like `(group_a|group_b)` produces index pair `{-1, -1}` for the unmatched branch. With the old `FindStringSubmatch` this came back as `""` (a valid empty string). The index-based code would panic on a negative slice bound, so the new `captureGroup` method checks for `-1` and returns `""` explicitly.

Issue: #390
refactor: Store the indices of the capture groups instead of strings.
Two test functions and a benchmark exercise the exporter at scale:
- TestWritePrometheusManyLabelValues: single metric, many label values
- TestWritePrometheusManyMetrics: many separate metrics (exercises the
  goroutine-per-metric pattern in Exporter.Collect)
- BenchmarkWritePrometheus: latency measurements for both patterns

Also adds issue-903.md documenting the test plan and benchmark results.

Development

Successfully merging this pull request may close these issues.

mtail fails to scrape with > 70k metrics presented.
