feat: add health system api for admins #82
Conversation
…lth-api # Conflicts: # pyproject.toml
vtnphan
left a comment
Looks good to me. Added 1 comment about the SEQERA_API health check to include workspace and token.
I do think it's worth adding one more component that checks Tower Agent health by creating a compute env with the same credential ID, to make sure the Tower agent is alive (the compute env status cannot show whether the Tower agent is online).
marius-mather
left a comment
Looks good, but I'm not sure we need to tie the system status API to the dashboard; it seems like something we want available regardless of whether the dashboard is enabled.
As per face-to-face discussion with @marius-mather, will extend the
@marius-mather please see the updated related variables in this commit; they are also deployed in Secrets Manager. We should be able to see the status of the Tower agent on the dev admin dashboard once this is merged (and built).
marius-mather
left a comment
good to go, let's test it out
Pull Request
Summary
SBP-406: Adds an admin-only system status surface so SBP admins can see whether workflow submission is healthy, specifically Seqera API reachability and the Gadi-backed compute environment (the Seqera Tower agent connection state). Also adds the admin endpoint + dashboard view from the Workflow Submission Resilience investigation.
Changes
- `seqera_api` → authenticated `GET {SEQERA_API_URL}/user-info` (5s timeout): a 2xx confirms reachability and a valid `SEQERA_ACCESS_TOKEN`; 401/403 is reported as a credential problem. `seqera_compute_env` now also flags a 403/404 as a `WORK_SPACE`/`COMPUTE_ID`/token misconfiguration in the message.
- `seqera_compute_env` → `GET /compute-envs/{COMPUTE_ID}?workspaceId={WORK_SPACE}`, mapping `computeEnv.status` (the Tower agent / Gadi proxy): `AVAILABLE` → healthy, `CREATING` → degraded, `ERRORED`/`OFFLINE`/`INVALID` → unhealthy, else degraded.
- `overallStatus`, results cached in a `cachetools.TTLCache` (30s, configurable) with `asyncio.Lock` stampede protection; coarse + verbose projections.
- `app/schemas/health.py`: `ComponentStatus`, `ComponentStatusDetail`, `SystemStatusAdminResponse`.
- `GET /admin/api/system-status` (`app/db/admin.py`): admin-only, returns per-component status, latency, last-error body, full compute-env JSON, and an optional CloudWatch log-group link.
- `README.md` with new "System Status" section
How to Test
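For a quick local check of the compute-env status mapping listed under Changes, something like the following sketch may be useful. The `Status` enum and `map_compute_env_status` function are hypothetical stand-ins written for illustration; the PR's actual `ComponentStatus` schema and mapping code are not shown here and may differ.

```python
from enum import Enum
from typing import Optional


class Status(str, Enum):
    # Hypothetical stand-in for the PR's ComponentStatus schema.
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


# Mapping as described in the Changes list.
_COMPUTE_ENV_STATUS = {
    "AVAILABLE": Status.HEALTHY,
    "CREATING": Status.DEGRADED,
    "ERRORED": Status.UNHEALTHY,
    "OFFLINE": Status.UNHEALTHY,
    "INVALID": Status.UNHEALTHY,
}


def map_compute_env_status(raw: Optional[str]) -> Status:
    """Map a computeEnv.status string to a component status.

    Anything not listed above (including a missing value) is treated
    as degraded, matching the "else degraded" rule.
    """
    return _COMPUTE_ENV_STATUS.get(raw or "", Status.DEGRADED)
```

Calling `map_compute_env_status("OFFLINE")` returns `Status.UNHEALTHY`, while an unrecognised value such as `"PENDING"` falls through to `Status.DEGRADED`.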
CI shall pass!
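The stampede protection mentioned under Changes could also be exercised locally. The sketch below is not the PR's code: it swaps `cachetools.TTLCache` for a hand-rolled single-entry TTL so that it runs on the standard library alone, and `CachedProbe`, `slow_probe`, and `demo` are hypothetical names. It only illustrates the lock-then-recheck pattern, where concurrent callers trigger at most one upstream probe.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Optional, Tuple


class CachedProbe:
    """Cache one async probe result for `ttl` seconds.

    Stdlib stand-in for a TTLCache plus asyncio.Lock (an assumption,
    not the PR's implementation).
    """

    def __init__(self, probe: Callable[[], Awaitable[Any]], ttl: float = 30.0):
        self._probe = probe
        self._ttl = ttl
        self._lock = asyncio.Lock()
        self._entry: Optional[Tuple[float, Any]] = None  # (expires_at, value)

    def _fresh(self) -> Optional[Tuple[float, Any]]:
        entry = self._entry
        if entry is not None and entry[0] > time.monotonic():
            return entry
        return None

    async def get(self) -> Any:
        entry = self._fresh()
        if entry is not None:
            return entry[1]
        async with self._lock:
            # Re-check: another caller may have refreshed while we waited.
            entry = self._fresh()
            if entry is not None:
                return entry[1]
            value = await self._probe()
            self._entry = (time.monotonic() + self._ttl, value)
            return value


async def demo() -> int:
    """Fire 10 concurrent reads and return how many probes actually ran."""
    calls = 0

    async def slow_probe() -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.05)  # simulate a slow upstream call
        return "healthy"

    cached = CachedProbe(slow_probe, ttl=30.0)
    results = await asyncio.gather(*(cached.get() for _ in range(10)))
    assert set(results) == {"healthy"}
    return calls
```

Running `asyncio.run(demo())` should report a single probe call: the first caller holds the lock while probing, and the other nine find a fresh entry when they re-check after acquiring it.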
Type of change
Checklist