Fix/system monitor #875

Open
jaagut wants to merge 8 commits into main from fix/system_monitor
@jaagut jaagut commented May 24, 2026

Summary

Proposed changes

Related issues

Checklist

  • Run pixi run build
  • Write documentation
  • Test on your machine
  • Test on the robot
  • Create issues for future work
  • Triage this PR and label it

Enhance GPU monitoring by integrating NVIDIA and AMD detection, updating collection methods, and adding support for nvidia-ml-py package
@jaagut jaagut force-pushed the fix/system_monitor branch from b051716 to 33366a8 Compare May 24, 2026 19:12
@jaagut jaagut marked this pull request as ready for review May 24, 2026 20:02
@jaagut jaagut moved this from 🆕 New to 📋 Backlog in Software May 24, 2026
@jaagut jaagut moved this from 📋 Backlog to 👀 In review in Software May 24, 2026
@jaagut jaagut requested review from ChlukasX, Flova, MegaIng and Copilot May 24, 2026 20:02
Copilot AI left a comment

Pull request overview

Improves the ROS2 system_monitor package to collect and publish more robust system workload metrics (notably GPU stats) across different hardware backends, and ensures the required Workload message is generated in bitbots_msgs.

Changes:

  • Add Workload.msg to bitbots_msgs interface generation.
  • Refactor GPU monitoring to auto-detect NVIDIA (NVML), Jetson (sysfs), and AMD (pyamdgpuinfo) backends; tighten type consistency in collectors.
  • Adjust sampling behavior (CPU smoothing + lower default update frequency) and add nvidia-ml-py to the Pixi environment.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/bitbots_msgs/CMakeLists.txt | Adds Workload.msg to rosidl-generated interfaces so downstream nodes can publish/subscribe it. |
| src/bitbots_misc/system_monitor/system_monitor/network_interfaces.py | Adds return type annotations for interface collection helpers. |
| src/bitbots_misc/system_monitor/system_monitor/monitor.py | Updates GPU collector call signature and aligns default “disabled” tuple types; minor comment grammar fix. |
| src/bitbots_misc/system_monitor/system_monitor/memory.py | Adds a typed return annotation for memory stats collection. |
| src/bitbots_misc/system_monitor/system_monitor/gpu.py | Replaces single-backend AMD logic with auto-detected NVIDIA/Jetson/AMD backends and improved error handling/logging. |
| src/bitbots_misc/system_monitor/system_monitor/cpus.py | Adds EMA smoothing for CPU usage values and updates return/type annotations. |
| src/bitbots_misc/system_monitor/config/config.yaml | Lowers default update frequency from 10 Hz to 2 Hz. |
| pixi.toml | Adds nvidia-ml-py dependency for NVML-based monitoring. |
| pixi.lock | Locks nvidia-ml-py into all environments. |
| .vscode/settings.json | Adds dictionary words related to new GPU monitoring terms. |

Comment on lines +79 to +85
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
load = float(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
vram_used = mem_info.used
vram_total = mem_info.total
temperature = float(pynvml.nvmlDeviceGetTemperature(handle, 0))
return (load, vram_used, vram_total, temperature)
if raw_load is None:
    continue
# Jetson reports GPU load in permille on current L4T kernels.
load = raw_load / 10.0
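The two hunks above report load on different scales: NVML's `utilization.gpu` is a percentage, while the Jetson value is permille. One way to keep all backends consistent is to convert at the edge to a single unit. The sketch below assumes the target is a fraction in [0.0, 1.0], as the follow-up comment in this thread describes; `normalize_load`, `PERCENT`, and `PERMILLE` are hypothetical names.

```python
PERCENT = 100.0    # e.g. NVML utilization.gpu
PERMILLE = 1000.0  # e.g. the Jetson sysfs load value


def normalize_load(raw_load: float, full_scale: float) -> float:
    """Convert a backend-specific load reading to a fraction in [0.0, 1.0]."""
    if full_scale <= 0:
        raise ValueError("full_scale must be positive")
    # Clamp so that a slightly out-of-range reading cannot leave the interval.
    return min(max(raw_load / full_scale, 0.0), 1.0)
```

With this helper, a Jetson reading of 375 and an NVML reading of 37.5 both map to `0.375`.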
Comment on lines +76 to +82
# smooth short-term sampling noise with exponential moving average
prev = _prev_usage[cpu_num]
if prev == 0.0:
    smoothed = float(round(raw_usage, 2))
else:
    smoothed = float(round((raw_usage * _EMA_ALPHA) + (prev * (1.0 - _EMA_ALPHA)), 2))
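In the hunk above, `prev == 0.0` doubles as "no previous sample", so a core that was genuinely idle resets the filter on the next reading. A minimal sketch of an EMA that tracks "no sample yet" separately is shown below. The class name `EmaSmoother` and the default `alpha` are assumptions; the PR's actual `_EMA_ALPHA` value is not shown in this thread.

```python
class EmaSmoother:
    """Per-CPU exponential moving average without a 0.0 sentinel."""

    def __init__(self, alpha: float = 0.3) -> None:  # alpha value is an assumption
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self._alpha = alpha
        # A missing key means "no sample yet"; 0.0 stays a valid reading.
        self._prev: dict[int, float] = {}

    def update(self, cpu_num: int, raw_usage: float) -> float:
        prev = self._prev.get(cpu_num)
        if prev is None:
            smoothed = raw_usage  # first sample seeds the filter
        else:
            smoothed = raw_usage * self._alpha + prev * (1.0 - self._alpha)
        # Store the unrounded value so rounding error does not accumulate.
        self._prev[cpu_num] = smoothed
        return round(smoothed, 2)
```

With `alpha=0.5`, an idle reading of 0.0 followed by 100.0 yields 50.0, whereas the sentinel-based version would jump straight to 100.0.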

Comment on lines +72 to +90
def _collect_nvidia(node: Node) -> tuple[float, int, int, float]:
    """Collect GPU metrics from NVIDIA GPU using pynvml."""
    try:
        import pynvml

        pynvml.nvmlInit()
        try:
            handle = pynvml.nvmlDeviceGetHandleByIndex(0)
            load = float(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            vram_used = mem_info.used
            vram_total = mem_info.total
            temperature = float(pynvml.nvmlDeviceGetTemperature(handle, 0))
            return (load, vram_used, vram_total, temperature)
        finally:
            try:
                pynvml.nvmlShutdown()
            except Exception:
                pass
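The function above initialises and shuts down NVML on every sample. A possible alternative, sketched here under assumptions, is to initialise once, cache the device handle, and shut down only when the node stops. `NvmlSampler` is a hypothetical name; the NVML module is passed in as a parameter so the class can be exercised without a GPU, and the load is returned as a fraction, which is an assumption about the intended units.

```python
from typing import Any


class NvmlSampler:
    """Initialise NVML once and reuse the device handle for every sample."""

    def __init__(self, nvml: Any, index: int = 0) -> None:
        self._nvml = nvml  # the imported pynvml module (or a test double)
        self._index = index
        self._initialised = False
        self._handle: Any = None

    def _ensure_handle(self) -> Any:
        if not self._initialised:
            self._nvml.nvmlInit()
            self._initialised = True
        if self._handle is None:
            self._handle = self._nvml.nvmlDeviceGetHandleByIndex(self._index)
        return self._handle

    def sample(self) -> tuple[float, int, int, float]:
        """Return (load fraction, VRAM used bytes, VRAM total bytes, temperature in deg C)."""
        nvml = self._nvml
        handle = self._ensure_handle()
        load = float(nvml.nvmlDeviceGetUtilizationRates(handle).gpu) / 100.0
        mem_info = nvml.nvmlDeviceGetMemoryInfo(handle)
        temperature = float(nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU))
        return (load, int(mem_info.used), int(mem_info.total), temperature)

    def close(self) -> None:
        """Shut NVML down once; safe to call repeatedly."""
        if self._initialised:
            self._initialised = False
            self._handle = None
            self._nvml.nvmlShutdown()
```

In the node this would be constructed as `NvmlSampler(pynvml)` after a successful `import pynvml`, with `close()` called from the node's shutdown path. Using `NVML_TEMPERATURE_GPU` instead of the literal `0` also makes the sensor selection explicit.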
Comment on lines +186 to +188
If `node` is provided the ROS node's logger will be used for messages.

node: ROS node for logging (required for backend detection and error logging)
jaagut commented May 25, 2026

@copilot please implement the suggestions.

Copilot AI commented May 25, 2026

> @copilot please implement the suggestions.

Implemented the requested suggestions in bf629a3 (plus preceding commits in this update): GPU load is now consistently normalized to fraction semantics across backends, NVML init/handle are cached instead of per-sample init/shutdown, CPU EMA initialization no longer relies on 0.0 as a sentinel, and the GPU collector docstrings were corrected.

Copilot finished work on behalf of jaagut May 25, 2026 06:26