Add E2E test verifying OTEL tracing across Zenko services#2378
Conversation
Hello delthas,My role is to assist you with the merge of this Available options
Available commands
Status report is not available. |
Waiting for approvalThe following approvals are needed before I can proceed with the merge:
|
c362986 to
9a3a251
Compare
9a3a251 to
f32a16b
Compare
f32a16b to
a7b196c
Compare
Waiting for approvalThe following approvals are needed before I can proceed with the merge:
|
a7b196c to
c9ae52e
Compare
c9ae52e to
79d656d
Compare
francoisferrand
left a comment
There was a problem hiding this comment.
overall not sure where to stand here: there is a fundamental compromise: do we want to test the "production" case first and foremost (i.e. without tracing), or is it safe-enough to enable Otel all the time (i.e. not testing production anymore, which could hide crashes or introduce subtle delays...)
SylvainSenechal
left a comment
There was a problem hiding this comment.
side question, here the jaeger server will be killed at the end the ci run, but would it be possible to have the traces data retained, so that maybe we could export these traces data in the artifact, and spin up a server to explore them with a ui ?
79d656d to
648091c
Compare
8f05574 to
5160774
Compare
|
Moved to a ratio of 0% (no traces, but enabled); and injecting a forcefully enabled trace using a This should bring no performance overhead, no overlay reconfiguration, and nicely isolate our test trace. |
5160774 to
4a072e5
Compare
3b4b0fa to
dcf5457
Compare
dcf5457 to
69afb02
Compare
Includes OTEL tracing support needed by the E2E test in this PR. Issue: ZENKO-5258
69afb02 to
43b1e51
Compare
Waiting for approvalThe following approvals are needed before I can proceed with the merge:
|
| dashboard: cloudserver/cloudserver-dashboards | ||
| image: cloudserver | ||
| tag: 9.3.9 | ||
| tag: 9.4.0-preview.1 |
There was a problem hiding this comment.
9.4.0-preview.1 is a pre-release tag. deps.yaml defines the component versions for the solution — the review criteria require pinning to a concrete released tag or content hash. If this PR is meant to land only after cloudserver 9.4.0 ships, consider gating the merge on that release and updating the tag then. If the preview tag is intentional for CI validation, document that constraint so it isn't accidentally shipped.
43b1e51 to
dc9fb9c
Compare
| dashboard: cloudserver/cloudserver-dashboards | ||
| image: cloudserver | ||
| tag: 9.3.9 | ||
| image: playground/delthas/cloudserver |
There was a problem hiding this comment.
image: playground/delthas/cloudserver points to a personal playground namespace, not the official cloudserver image. deps.yaml is the source of truth for solution images — merging this as-is ships a non-release, non-official image. The previous review flagged 9.4.0-preview.1; the current state is more severe since the image path itself changed. Restore image: cloudserver and pin to a released tag before merge (or gate merge on the cloudserver release).
| image: playground/delthas/cloudserver | |
| image: cloudserver |
Temporary bump to ghcr.io/scality/playground/delthas/cloudserver:9.4.1-h20bb7d391a725278da84ec1e7bad1d99cc9e94a1 to verify CI works against this build. To be reverted / replaced before merge. Issue: ZENKO-5258
dc9fb9c to
fa02bea
Compare
| dashboard: cloudserver/cloudserver-dashboards | ||
| image: cloudserver | ||
| tag: 9.3.9 | ||
| dashboard: playground/delthas/cloudserver/cloudserver-dashboards |
There was a problem hiding this comment.
Same playground namespace issue as the image field below — this also needs to be restored to the official path before merge.
| dashboard: playground/delthas/cloudserver/cloudserver-dashboards | |
| dashboard: cloudserver/cloudserver-dashboards |
fa02bea to
b936735
Compare
Temporary bump to ghcr.io/scality/playground/delthas/backbeat:9.4.1-h736f50680341e71275532cc3b7ef2777a7768bec (with matching dashboards + policies) to verify CI works against this build. To be reverted / replaced before merge. Issue: ZENKO-5258
- Deploy Jaeger all-in-one (memory-only, OTLP-enabled) in the kind CI cluster alongside existing dependencies - Patch the Zenko CR with `spec.otel` (enabled, sampling ratio 1.0) so every request is traced during CI — also acts as a smoke test that OTEL doesn't break existing @premerge tests - Add a new @premerge CTST scenario that puts an S3 object and then polls the Jaeger query API to assert a trace exists with spans from both cloudserver and vault Issue: ZENKO-5258
b936735 to
1245e78
Compare
| policy: backbeat/backbeat-policies | ||
| tag: 9.4.0 | ||
| dashboard: playground/delthas/backbeat/backbeat-dashboards | ||
| image: playground/delthas/backbeat |
There was a problem hiding this comment.
Same playground namespace issue as cloudserver (already flagged): image, dashboard, and policy all point to playground/delthas/backbeat/.... Restore to official paths before merge.
| image: playground/delthas/backbeat | |
| image: backbeat |
|
CI failing for unrelated reasons; OTEL test succeeding. |
|
|
||
| const { traceparent, traceId } = generateTraceContext(); | ||
| const client = this.createS3Client(); | ||
| injectTraceparent(client, traceparent); |
There was a problem hiding this comment.
how come CTST can inject a trace (and esp. cloudserver/... accepts it) ?
CTST is not running k8s in-cluster AFAIK : it runs on the "host" now, while zenko runs in a "kind" k8s cluster
→ is the "trusted hosts" protection not working? I would expect we need to explicitely fill some field in zenko CR to end up trusting ctst?
There was a problem hiding this comment.
Wait so, the CR is configured with samplingRatio = 0, so no traces are generated by the services.
Then you inject the trace yourself by manipulating the header of the request 🤔
Wouldn't we want to test it with a non zero samplingRatio ?
Or I guess you did it because you don't wanna enable tracing for every tests, and you don't want to patch the cr during the tests ?
SylvainSenechal
left a comment
There was a problem hiding this comment.
Just 1 question + update the /playground deps
Summary
1.76.0)in the kind CI cluster alongside existing dependencies (Keycloak,
Prometheus, etc.)
spec.tracing, ratio"0") so thecluster runs in production-mode (no ambient traces). The test sends
a single PUT through the existing S3 ingress with a W3C
traceparentheader injected via SDK middleware. Cloudserver's parent-based sampler
honors the header → only that one request is traced.
@PreMergeCTST scenario that asserts the trace appears inJaeger and contains spans from both connector-cloudserver and
connector-vault
What changed
solution/deps.yamlvault8.11.6 → 8.11.7(OTEL support) andcloudserver(OTEL support, parent-based sampler)configs/zenko.yamlspec.tracing(enabled, ratio"0", jaeger endpoint)configs/jaeger.yaml1.76.0, memory-only, requests/limits)install-kind-dependencies.shkubectl apply -f configs/jaeger.yamlsetup-e2e-env.shJaegerQueryEndpointin world paramsworld/Zenko.tsJaegerQueryEndpointworld parameterfeatures/otel-tracing.feature@PreMergescenariosteps/otel-tracing.tstraceparent, adds an SDK middleware that injects the header on the CTST S3 client, pollsGET /api/traces/<id>until the trace appears, asserts cloudserver + vault spansWhy
Parent ticket OS-1072
tracks adding OpenTelemetry tracing across the Zenko stack. This PR
adds CI coverage to verify that traces actually propagate end-to-end
(cloudserver → vault) once tracing is enabled.
Note on trust boundary: production runs metalk8s nginx which strips
incoming
traceparentheaders; Zenko CI uses upstreamkubernetes/ingress-nginxwhich forwards them. So in CI the testinjects via the regular ingress; the production trust-boundary
(external
traceparentis dropped at the edge) is metalk8s-specificand out of scope for this CI test.
Issue: ZENKO-5258