Fix/system monitor#875
Enhance GPU monitoring by integrating NVIDIA and AMD detection, updating collection methods, and adding support for the nvidia-ml-py package
b051716 to 33366a8
Pull request overview
Improves the ROS2 system_monitor package to collect and publish more robust system workload metrics (notably GPU stats) across different hardware backends, and ensures the required Workload message is generated in bitbots_msgs.
Changes:
- Add `Workload.msg` to `bitbots_msgs` interface generation.
- Refactor GPU monitoring to auto-detect NVIDIA (NVML), Jetson (sysfs), and AMD (pyamdgpuinfo) backends; tighten type consistency in collectors.
- Adjust sampling behavior (CPU smoothing + lower default update frequency) and add `nvidia-ml-py` to the Pixi environment.
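
To make the auto-detection item above concrete, here is a minimal sketch of a priority-ordered backend probe. It is not the code in `gpu.py`: `detect_gpu_backend`, the three `*_available` probes, and `JETSON_LOAD_PATH` are hypothetical names, and the sysfs path in particular is an assumption that differs between L4T releases. Since `nvidia-ml-py` is locked into all environments, the NVIDIA probe checks that `nvmlInit()` actually succeeds rather than only that the module imports.

```python
from pathlib import Path
from typing import Callable, Optional, Sequence, Tuple

# Assumed sysfs node; the real location depends on the Jetson/L4T release.
JETSON_LOAD_PATH = Path("/sys/devices/gpu.0/load")

Probe = Tuple[str, Callable[[], bool]]


def nvml_available() -> bool:
    """True if NVML can be initialised (module present and driver loaded)."""
    import pynvml  # provided by the nvidia-ml-py package

    pynvml.nvmlInit()
    pynvml.nvmlShutdown()
    return True


def jetson_available() -> bool:
    """True if the assumed Jetson load file exists."""
    return JETSON_LOAD_PATH.exists()


def amd_available() -> bool:
    """True if pyamdgpuinfo reports at least one GPU."""
    import pyamdgpuinfo

    return pyamdgpuinfo.detect_gpus() > 0


DEFAULT_PROBES: Sequence[Probe] = (
    ("nvidia", nvml_available),
    ("jetson", jetson_available),
    ("amd", amd_available),
)


def detect_gpu_backend(probes: Sequence[Probe] = DEFAULT_PROBES) -> Optional[str]:
    """Return the name of the first backend whose probe succeeds, else None."""
    for name, is_available in probes:
        try:
            if is_available():
                return name
        except Exception:
            # Missing module or driver simply means "not this backend".
            continue
    return None
```

Running detection once at node start-up and storing the result would avoid repeating failed imports on every sample.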
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/bitbots_msgs/CMakeLists.txt | Adds Workload.msg to rosidl-generated interfaces so downstream nodes can publish/subscribe it. |
| src/bitbots_misc/system_monitor/system_monitor/network_interfaces.py | Adds return type annotations for interface collection helpers. |
| src/bitbots_misc/system_monitor/system_monitor/monitor.py | Updates GPU collector call signature and aligns default “disabled” tuple types; minor comment grammar fix. |
| src/bitbots_misc/system_monitor/system_monitor/memory.py | Adds a typed return annotation for memory stats collection. |
| src/bitbots_misc/system_monitor/system_monitor/gpu.py | Replaces single-backend AMD logic with auto-detected NVIDIA/Jetson/AMD backends and improved error handling/logging. |
| src/bitbots_misc/system_monitor/system_monitor/cpus.py | Adds EMA smoothing for CPU usage values and updates return/type annotations. |
| src/bitbots_misc/system_monitor/config/config.yaml | Lowers default update frequency from 10 Hz to 2 Hz. |
| pixi.toml | Adds nvidia-ml-py dependency for NVML-based monitoring. |
| pixi.lock | Locks nvidia-ml-py into all environments. |
| .vscode/settings.json | Adds dictionary words related to new GPU monitoring terms. |

    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    load = float(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
    mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    vram_used = mem_info.used
    vram_total = mem_info.total
    temperature = float(pynvml.nvmlDeviceGetTemperature(handle, 0))
    return (load, vram_used, vram_total, temperature)


    if raw_load is None:
        continue
    # Jetson reports GPU load in permille on current L4T kernels.
    load = raw_load / 10.0

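
The permille-to-percent conversion in the snippet above can be isolated into a small helper, sketched here under assumptions: `parse_jetson_load` is a hypothetical name, the input is the raw text read from the sysfs load file, and the clamping is an added safety measure that is not shown in the PR.

```python
from typing import Optional


def parse_jetson_load(raw_text: str) -> Optional[float]:
    """Convert a permille reading (0..1000) into percent (0.0..100.0).

    Returns None when the text is not an integer, mirroring the
    `raw_load is None` skip in the reviewed snippet.
    """
    try:
        raw_load = int(raw_text.strip())
    except ValueError:
        return None
    # Clamp so an out-of-range reading cannot produce a load above 100 %.
    permille = min(max(raw_load, 0), 1000)
    return permille / 10.0
```

If the other backends report a fraction instead of a percentage, dividing by `1000.0` rather than `10.0` would keep the units consistent.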

    # smooth short-term sampling noise with exponential moving average
    prev = _prev_usage[cpu_num]
    if prev == 0.0:
        smoothed = float(round(raw_usage, 2))
    else:
        smoothed = float(round((raw_usage * _EMA_ALPHA) + (prev * (1.0 - _EMA_ALPHA)), 2))

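
In the snippet above, `prev == 0.0` doubles as the "no previous sample" marker, so a core that was genuinely idle restarts its average on the next sample. A minimal sketch of one alternative keeps "no sample yet" separate from "0 % usage" by using a missing dictionary key. `CpuUsageSmoother` and the default `EMA_ALPHA = 0.5` are hypothetical; the PR's actual `_EMA_ALPHA` value is not shown here.

```python
from typing import Dict

EMA_ALPHA = 0.5  # assumed smoothing factor, for illustration only


class CpuUsageSmoother:
    """Per-core exponential moving average with explicit first-sample handling."""

    def __init__(self, alpha: float = EMA_ALPHA) -> None:
        self._alpha = alpha
        self._prev: Dict[int, float] = {}  # a missing key means "no sample yet"

    def update(self, cpu_num: int, raw_usage: float) -> float:
        prev = self._prev.get(cpu_num)
        if prev is None:
            # First sample for this core: seed the average with the raw value.
            smoothed = raw_usage
        else:
            smoothed = self._alpha * raw_usage + (1.0 - self._alpha) * prev
        # Store the unrounded value so rounding error does not accumulate.
        self._prev[cpu_num] = smoothed
        return round(smoothed, 2)
```

For example, with `alpha=0.5` a core that reports 0.0 and then 10.0 twice yields 0.0, 5.0, 7.5, whereas the sentinel version would jump straight to 10.0 on the second sample.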

    def _collect_nvidia(node: Node) -> tuple[float, int, int, float]:
        """Collect GPU metrics from NVIDIA GPU using pynvml."""
        try:
            import pynvml

        pynvml.nvmlInit()
        try:
            handle = pynvml.nvmlDeviceGetHandleByIndex(0)
            load = float(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            vram_used = mem_info.used
            vram_total = mem_info.total
            temperature = float(pynvml.nvmlDeviceGetTemperature(handle, 0))
            return (load, vram_used, vram_total, temperature)
        finally:
            try:
                pynvml.nvmlShutdown()
            except Exception:
                pass

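
The function above calls `nvmlInit()` and `nvmlShutdown()` on every sample. One way to avoid that cost is to initialise once and reuse the device handle, sketched below under assumptions: `NvmlReader` is a hypothetical class, and the `nvml` argument is expected to be the imported `pynvml` module (passed in so the class can be exercised without a GPU). The load is returned in percent as NVML reports it; any normalisation is left to the caller.

```python
from typing import Any, Optional, Tuple


class NvmlReader:
    """Initialise NVML lazily, cache the device handle, shut down once."""

    def __init__(self, nvml: Any, index: int = 0) -> None:
        self._nvml = nvml  # module-like object, e.g. the imported pynvml
        self._index = index
        self._handle: Optional[Any] = None

    def sample(self) -> Tuple[float, int, int, float]:
        """Return (load percent, VRAM used bytes, VRAM total bytes, temperature in C)."""
        if self._handle is None:
            # First call only: initialise the library and look up the device.
            self._nvml.nvmlInit()
            self._handle = self._nvml.nvmlDeviceGetHandleByIndex(self._index)
        nvml, handle = self._nvml, self._handle
        load = float(nvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        mem_info = nvml.nvmlDeviceGetMemoryInfo(handle)
        temperature = float(
            nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
        )
        return (load, int(mem_info.used), int(mem_info.total), temperature)

    def close(self) -> None:
        """Release NVML; safe to call more than once."""
        if self._handle is not None:
            self._handle = None
            self._nvml.nvmlShutdown()
```

In a ROS node, `close()` would typically be called from the shutdown path so the library is released exactly once.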

    If `node` is provided the ROS node's logger will be used for messages.

    node: ROS node for logging (required for backend detection and error logging)

@copilot please implement the suggestions.
Agent-Logs-Url: https://github.com/bit-bots/bitbots_main/sessions/797bd202-b15d-4688-b2a5-48352421a742 Co-authored-by: jaagut <34797331+jaagut@users.noreply.github.com>
Implemented the requested suggestions in bf629a3 (plus preceding commits in this update): GPU load is now consistently normalized to fraction semantics across backends, NVML init/handle are cached instead of per-sample init/shutdown, CPU EMA initialization no longer relies on
Summary
Proposed changes
Related issues
Checklist
pixi run build