feat: add health system api for admins#82

Merged
amandazhuyilan merged 5 commits into main from SBP-406-add-admin-health-api
Jun 23, 2026

Conversation

@amandazhuyilan amandazhuyilan commented Jun 19, 2026

Collaborator

Pull Request

Summary

SBP-406: Adds an admin-only system status surface so SBP admins can see whether workflow submission is healthy, specifically Seqera API reachability and the state of the Gadi-backed compute environment (the Seqera Tower agent connection state). This PR also adds the admin endpoint and dashboard view from the Workflow Submission Resilience investigation.

Changes

  • Health probe service (app/services/health.py) — two probes via the Seqera API:
    • seqera_api → authenticated GET {SEQERA_API_URL}/user-info (5s timeout): a 2xx confirms reachability and a valid SEQERA_ACCESS_TOKEN; 401/403 is reported as a credential problem.
    • seqera_compute_env → GET /compute-envs/{COMPUTE_ID}?workspaceId={WORK_SPACE}, mapping computeEnv.status (the Tower agent / Gadi proxy): AVAILABLE → healthy, CREATING → degraded, ERRORED/OFFLINE/INVALID → unhealthy, anything else → degraded. A 403/404 is also flagged in the message as a WORK_SPACE/COMPUTE_ID/token misconfiguration.
    • Probe results are aggregated into overallStatus and cached in a cachetools.TTLCache (30s, configurable) with asyncio.Lock stampede protection; both coarse and verbose projections are provided.
  • Response schemas (app/schemas/health.py) — ComponentStatus, ComponentStatusDetail, SystemStatusAdminResponse.
  • Admin endpoint GET /admin/api/system-status (app/db/admin.py) — admin-only, returns per-component status, latency, last-error body, full compute-env JSON, and an optional CloudWatch log-group link.
  • Admin dashboard view — new Starlette Admin CustomView "System Status" (see screenshot)
  • Updated README.md with a new "System Status" section.
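
As an illustration of the status mapping and aggregation listed above, here is a minimal sketch. It is not the code in app/services/health.py: the names Status, map_compute_env_status, and aggregate are hypothetical, and the "worst component wins" rule for overallStatus is an assumption, since the summary does not state how the aggregate is computed.

```python
from enum import Enum


class Status(str, Enum):
    """Hypothetical status values; the real schema may differ."""
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


# Mapping taken from the Changes list: computeEnv.status -> component status.
_COMPUTE_ENV_STATUS = {
    "AVAILABLE": Status.HEALTHY,
    "CREATING": Status.DEGRADED,
    "ERRORED": Status.UNHEALTHY,
    "OFFLINE": Status.UNHEALTHY,
    "INVALID": Status.UNHEALTHY,
}

# Assumption: severity ordering used to pick the overall status.
_SEVERITY = {Status.HEALTHY: 0, Status.DEGRADED: 1, Status.UNHEALTHY: 2}


def map_compute_env_status(raw):
    """Map a raw computeEnv.status string; unknown or missing values are degraded."""
    return _COMPUTE_ENV_STATUS.get((raw or "").upper(), Status.DEGRADED)


def aggregate(statuses):
    """Return the most severe status from a non-empty iterable (assumed rule)."""
    return max(statuses, key=_SEVERITY.__getitem__)
```

With this sketch, `aggregate([Status.HEALTHY, map_compute_env_status("CREATING")])` evaluates to `Status.DEGRADED`, so a compute environment that is still being created would degrade the overall status without marking it unhealthy.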

How to Test

CI shall pass!

[Image: Screenshot 2026-06-23 at 1.54.05 pm]

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • I have added or updated documentation where necessary
  • I have run linting and unit tests locally
  • The code follows the project's style guidelines

@vtnphan vtnphan left a comment

Collaborator

Looks good to me. I added one comment about having the SEQERA_API health check include the workspace and token.
I do think it's worth adding one more component that checks Tower Agent health by creating a compute env with the same credential ID, to make sure the Tower agent is alive (the compute env status cannot tell us whether the Tower agent is online).

Comment thread app/services/health.py

@marius-mather marius-mather left a comment

Collaborator

Looks good, but I'm not sure we need to tie the system status API to the dashboard; it seems like something we want available regardless of whether the dashboard is enabled.

Comment thread app/db/admin.py Outdated
@amandazhuyilan

Collaborator Author

As per a face-to-face discussion with @marius-mather, I will extend HEALTH_CACHE_TTL_SECONDS to 60 seconds and add the required attributes to AWS Secrets Manager before merging this pull request.
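
For reference, the caching behaviour this setting controls (a TTL cache guarded by an asyncio.Lock, per the Changes list) can be sketched roughly as follows. This is a stdlib-only stand-in, not the code in app/services/health.py: the PR uses cachetools.TTLCache, whereas this sketch tracks expiry with time.monotonic so it runs without third-party packages. CachedHealth, fake_probe, and demo are hypothetical names.

```python
import asyncio
import time

HEALTH_CACHE_TTL_SECONDS = 60  # value agreed in the discussion above


class CachedHealth:
    """Caches one probe result for `ttl` seconds (hypothetical helper)."""

    def __init__(self, probe, ttl=HEALTH_CACHE_TTL_SECONDS):
        self._probe = probe  # async callable returning the status payload
        self._ttl = ttl
        self._lock = asyncio.Lock()
        self._value = None
        self._expires_at = 0.0

    def _fresh(self):
        return self._value is not None and time.monotonic() < self._expires_at

    async def get(self):
        if self._fresh():
            return self._value
        async with self._lock:
            # Re-check after acquiring the lock: another coroutine may have
            # refreshed the cache while this one was waiting.
            if not self._fresh():
                self._value = await self._probe()
                self._expires_at = time.monotonic() + self._ttl
            return self._value


async def demo():
    calls = 0

    async def fake_probe():
        # Stand-in for the real Seqera probes; counts how often it runs.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return {"overallStatus": "healthy"}

    cache = CachedHealth(fake_probe)  # created inside the running event loop
    results = await asyncio.gather(*(cache.get() for _ in range(10)))
    return calls, results
```

Running `asyncio.run(demo())` issues ten concurrent requests against a cold cache, but only the first one runs the probe; the rest wait on the lock and then reuse the cached payload. That is the stampede protection in miniature, and a longer TTL simply widens the window in which admins see the cached result.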

amandazhuyilan commented Jun 23, 2026

Collaborator Author

@marius-mather please see the updated variables in this commit, which are now also deployed in Secrets Manager. We should be able to see the status of the Tower agent on the dev admin dashboard once this is merged (and built).

@marius-mather marius-mather left a comment

Collaborator

good to go, let's test it out

@amandazhuyilan amandazhuyilan merged commit 0abf6ce into main Jun 23, 2026
3 checks passed

3 participants