SWIP-13: Live Debugger for MAL / LAL / OAL + admin-server module #13864
…ule integration.
Implements [SWIP-13](docs/en/swip/SWIP-13.md) end-to-end and lands two
supporting infrastructure changes the operator-facing API surface depends on:
1. New `admin-server` module — shared host for admin / on-demand write APIs.
Runs on TWO ports: an HTTP REST surface (default 17128) and an
admin-internal gRPC bus (default 17129) carrying peer-to-peer cluster
RPCs (runtime-rule Suspend / Resume / Forward; DSL debug install /
collect / stop / stopByClientId). The admin-internal bus is a dedicated
transport separate from the public agent / cluster gRPC port
(`core.gRPCPort`, default 11800), so privileged admin RPCs stay out of
the agent network's blast radius. Disabled by default (enable with
`SW_ADMIN_SERVER=default`). It has no built-in authentication and must
be gateway-protected.
2. Runtime rule hot-update — REST routes (addOrUpdate / inactivate /
delete / get / list / dump) now mount on `admin-server`'s HTTP host.
All admin writes serialize on a single deterministic main OAP per
cluster (sorted-first peer, no leader election); non-main nodes
transparently forward over the admin-internal gRPC bus, so an L7 LB in
front of the admin port can route any operator request to any OAP.
Cluster convergence on the periodic refresh tick is configurable via
`receiver-runtime-rule.refreshRulesPeriod` (default 30 s). The
runtime-rule plugin's old HTTP-server keys
(`receiver-runtime-rule.restHost`/`restPort`/...) are removed; host-
level knobs move under the new `admin-server` block.
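
The "sorted-first peer, no leader election" routing in item 2 can be sketched as below. This is a minimal illustration, not the PR's code: `MainNodeSelector`, `selectMain`, and `routeWrite` are hypothetical names, and sorting peers lexicographically by address string is an assumption (the text only says the main is the sorted-first peer).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: class and method names are illustrative, not taken from the PR.
final class MainNodeSelector {
    private MainNodeSelector() {
    }

    // Every node sorts the same peer list, so all nodes agree on the main
    // without running an election. Assumption: lexicographic order on the address.
    static String selectMain(List<String> peerAddresses) {
        if (peerAddresses.isEmpty()) {
            throw new IllegalArgumentException("peer list must not be empty");
        }
        List<String> sorted = new ArrayList<>(peerAddresses);
        Collections.sort(sorted);
        return sorted.get(0);
    }

    // The main applies an admin write locally; any other node forwards it to the main.
    static String routeWrite(String self, List<String> peerAddresses) {
        String main = selectMain(peerAddresses);
        return main.equals(self) ? "apply-locally" : "forward-to:" + main;
    }
}
```

Because the choice depends only on the shared peer list, a load balancer can hand a write to any node and the result is the same: exactly one node applies it.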
3. DSL Debug API (SWIP-13) — new `dsl-debugging` module. Sample-based
runtime debugger that captures per-stage inputs/outputs as MAL/LAL/OAL
process live ingest. Generated bytecode carries one volatile-bool gate
per probe call site; the idle path is a single volatile load that the JIT
eliminates after warm-up. Active sessions fan out to every cluster peer over the
admin-internal gRPC bus so each peer captures its own slice.
- LB-safe routing: any node can serve any verb. POST mints `sessionId`
on the receiving node, broadcasts install to peers, returns
`404 rule_not_found` only when no node owns the rule (404 body
carries `peers[]` so operators see which OAP rejected what). Failed
installs broadcast Stop to clean up any silent-success peer whose
ack timed out. GET returns `404 session_not_found` only when every
node disowns the id.
- Capture surface bounded on two dimensions only — there is NO
per-session byte cap and no structural sub-caps. Hard ceilings:
`MAX_ACTIVE_SESSIONS=200` per node (429 too_many_sessions when full),
`recordCap` ≤ 10000 + `retentionMillis` ≤ 1 h per session
(400 invalid_limits on out-of-range). Total bytes are reported on
every GET response so operators can verify their heap budget.
- LAL granularity flag: `block` (default — parser/extractor/sink) or
`statement` (one `line` entry per extractor statement, with verbatim
DSL slice + source line).
- MAL captures render the full surviving SampleFamily map
(`{"families": N, "items": [...]}`) so multi-metric expressions show
cross-family filter narrowing.
- `sourceText` carries the verbatim ANTLR slice byte-for-byte; `dsl`
and the structured `rule` envelope are carried per record so hot-updates
mid-session don't make captures ambiguous.
- MAL hand-written probe sites null-check the GateHolder so a build
with `injectionEnabled=false` runs MAL analysis without NPEs.
- OAL terminal `appendEmit` produces a TYPE_OUTPUT sample carrying the
L1-ready Metrics snapshot (matches the wire contract).
- OAL debug-source sidecar (.java alongside .class for IDE source-
attach) compiles cleanly: WithMetadata FQCN corrected, final fields
get `null` initializers, stub methods get bodies.
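
The two capture bounds listed above (per-node session count, per-session `recordCap` / `retentionMillis`) can be sketched as follows. Only the constants and the `400 invalid_limits` / `429 too_many_sessions` outcomes come from the text; the `SessionLimits` class, its methods, and the rejection of non-positive values are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the two bounds; names and structure are assumed.
final class SessionLimits {
    static final int MAX_ACTIVE_SESSIONS = 200;
    static final int MAX_RECORD_CAP = 10_000;
    static final long MAX_RETENTION_MILLIS = 60L * 60L * 1000L; // 1 h

    private final AtomicInteger active = new AtomicInteger();

    // Returns an HTTP-style status: 200 on success, 400 for out-of-range
    // limits, 429 when this node already holds the maximum number of sessions.
    int tryOpen(int recordCap, long retentionMillis) {
        // Assumption: zero and negative values are also treated as out of range.
        if (recordCap <= 0 || recordCap > MAX_RECORD_CAP
                || retentionMillis <= 0 || retentionMillis > MAX_RETENTION_MILLIS) {
            return 400; // invalid_limits
        }
        // CAS loop: concurrent callers can never push the count past the cap.
        while (true) {
            int current = active.get();
            if (current >= MAX_ACTIVE_SESSIONS) {
                return 429; // too_many_sessions
            }
            if (active.compareAndSet(current, current + 1)) {
                return 200;
            }
        }
    }

    // Releases one slot; never drops below zero.
    void close() {
        active.updateAndGet(v -> Math.max(0, v - 1));
    }

    int activeCount() {
        return active.get();
    }
}
```

Validating limits before touching the counter means a rejected request never consumes a session slot.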
Disabled by default; enabling it requires both `SW_DSL_DEBUGGING=default`
and `SW_ADMIN_SERVER=default`. `injectionEnabled` defaults to `true` once the
module is enabled. Per-DSL operator references are under
`docs/en/setup/backend/admin-api/dsl-debugging-{mal,lal,oal}.md`.
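
The gate pattern behind the probe call sites can be sketched as below: one volatile boolean guards each capture, and a null holder stands in for a build with `injectionEnabled=false`. `GateHolder` is named in the text, but its shape here is assumed, and `ProbeSite` and `stage` are hypothetical stand-ins for the generated or hand-written probe code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the field layout of GateHolder is an assumption.
final class GateHolder {
    volatile boolean enabled; // flipped when a debug session is installed or stopped
}

final class ProbeSite {
    private final GateHolder gate; // null models injection being disabled
    // Single-threaded sketch; a real capture buffer would need to be thread-safe.
    private final List<String> captured = new ArrayList<>();

    ProbeSite(GateHolder gate) {
        this.gate = gate;
    }

    // Stand-in for one pipeline stage with a probe at its exit.
    int stage(int input) {
        int output = input * 2;
        // Idle path: a null check plus one volatile read, no allocation.
        if (gate != null && gate.enabled) {
            captured.add(input + "->" + output);
        }
        return output;
    }

    List<String> captured() {
        return captured;
    }
}
```

The stage result is identical whether or not the gate is open, so toggling a session changes only what is recorded, never what the pipeline computes.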
End-to-end coverage: four new e2e cases under
`test/e2e-v2/cases/dsl-debugging/{mal,lal-block,lal-statement,oal}` drive
the full session lifecycle against real telemetry through BanyanDB +
runtime-rule and assert the captured wire shape per DSL.
Operators upgrading to 10.5.0 with `SW_RECEIVER_RUNTIME_RULE=default`
must also set `SW_ADMIN_SERVER=default`; OAP fails fast at startup
otherwise.
Summary
- `admin-server` module hosts admin / on-demand write APIs on a shared HTTP port (default `17128`) plus a dedicated admin-internal gRPC bus (default `17129`) for peer-to-peer cluster RPCs (runtime-rule Suspend/Resume/Forward, DSL debug install/collect/stop). Privileged admin RPCs stay isolated from the agent network.
- Runtime-rule hot-update REST routes now mount on `admin-server`'s HTTP host. All admin writes serialize on a deterministic single-main OAP per cluster; non-main nodes transparently forward over the admin-internal gRPC bus, so an L7 LB in front of the admin port can route any operator request to any OAP.
- `dsl-debugging` module implements SWIP-13: a sample-based runtime debugger that captures per-stage inputs/outputs as MAL/LAL/OAL process live ingest. Idle path is one volatile-bool read per probe call site (JIT-eliminable); active sessions fan out to every cluster peer, with LB-safe routing, hard caps (`MAX_ACTIVE_SESSIONS=200`, `recordCap` ≤ 10000, `retentionMillis` ≤ 1 h), and verbatim ANTLR-slice `sourceText` per sample.

Operators upgrading to 10.5.0 with `SW_RECEIVER_RUNTIME_RULE=default` must also set `SW_ADMIN_SERVER=default`; OAP fails fast at startup otherwise.

Per-DSL operator references: MAL / LAL / OAL. Common ground: DSL Debug API. Admin host: Admin API readme.
Test plan
- (`dsl-debugging`, `runtime-rule`, `admin-server`, `server-core`, `meter-analyzer`, `log-analyzer`, `oal-rt`)
- `MAX_ACTIVE_SESSIONS` is never exceeded under contention
- `LALScriptExecutionTest` 48/48 green (regression-tested after a related getter revert)
- `test/e2e-v2/cases/dsl-debugging/{mal,lal-block,lal-statement,oal}/` drive the full session lifecycle against real telemetry through BanyanDB + runtime-rule and assert the captured wire shape per DSL