fix(visualserver): contain visual worker failures#1347
Conversation
There was a problem hiding this comment.
Code Review
This pull request hardens the visual server to prevent hangs and resource exhaustion by introducing timeouts for visual inference and cache operations, handling per-image preprocessing failures gracefully, and adding a periodic watchdog thread to monitor worker liveness. The review feedback highlights critical thread-safety issues with concurrent RPyC calls on shared connections (such as get_items_embed and set_items_embed in the cache executor, and remote_infer_images on remote connections) and suggests using locks to serialize these operations. Additionally, the reviewer identifies a potential bug in the watchdog's sentinel cleanup where calling os.kill with a non-positive PID could inadvertently signal entire process groups.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| self._cache_executor = concurrent.futures.ThreadPoolExecutor( | ||
| max_workers=2, thread_name_prefix="visual_cache_rpc" | ||
| ) |
There was a problem hiding this comment.
RPyC connections are not thread-safe. Since self.cache_client is called from threads in self._cache_executor (which has max_workers=2), concurrent requests can call get_items_embed concurrently, leading to protocol corruption or connection hangs. We should initialize a lock to serialize all RPyC calls on self.cache_client.
| self._cache_executor = concurrent.futures.ThreadPoolExecutor( | |
| max_workers=2, thread_name_prefix="visual_cache_rpc" | |
| ) | |
| self._cache_executor = concurrent.futures.ThreadPoolExecutor( | |
| max_workers=2, thread_name_prefix='visual_cache_rpc' | |
| ) | |
| self._cache_lock = threading.Lock() |
| def get_need_infer_images(self, group_req_indexes: GroupReqIndexes) -> List[ImageItem]: | ||
| """同步路径: 检查 shm req 是否 aborted, 必要时调用 embed cache 询问哪些 image 已就绪。 | ||
|
|
||
| ``cache_client.root.get_items_embed`` 是同步 RPyC 调用; 调用方需要包一层 | ||
| ``asyncio.to_thread`` 并加超时, 避免 embed cache 卡死时阻塞 asyncio 事件循环 | ||
| (2026-05-09 incident)。 | ||
| """ |
There was a problem hiding this comment.
To ensure thread safety when calling self.cache_client.root.get_items_embed concurrently from multiple executor threads, please wrap the RPyC call inside get_need_infer_images with the newly introduced self._cache_lock:
with self._cache_lock:
ready_image = obtain(self.cache_client.root.get_items_embed(img_uuids))| # set_items_embed 也走 cache executor + 受 cache_client 的 sync_request_timeout 兜底 | ||
| self.cache_client.root.set_items_embed([image.uuid for image in images_need_infer]) |
There was a problem hiding this comment.
RPyC connections are not thread-safe. Since _load_to_cpu_cache runs in self._cache_executor (which has max_workers=2), concurrent requests can call set_items_embed concurrently on self.cache_client, leading to protocol corruption. We should wrap this call with self._cache_lock to serialize access.
| # set_items_embed 也走 cache executor + 受 cache_client 的 sync_request_timeout 兜底 | |
| self.cache_client.root.set_items_embed([image.uuid for image in images_need_infer]) | |
| # set_items_embed 也走 cache executor + 受 cache_client 的 sync_request_timeout 兜底 | |
| with self._cache_lock: | |
| self.cache_client.root.set_items_embed([image.uuid for image in images_need_infer]) |
| if self.args.detail_log: | ||
| start = time.time() | ||
| logger.info(f"Start to remote infer images {[image.md5 for image in images]}") | ||
| conn.root.remote_infer_images(images, event) |
There was a problem hiding this comment.
RPyC connections are not thread-safe. If multiple concurrent requests select the same remote visual server connection from self.id_to_rpyc_conn, they will call conn.root.remote_infer_images concurrently on the same connection, which can corrupt the RPyC protocol. We should serialize calls to remote_infer_images on the same connection, for example by using a lock associated with the connection or a global lock.
| stale = False | ||
| if pid is None: | ||
| # Pre-PID file format from an older crash — treat as stale. | ||
| stale = True |
There was a problem hiding this comment.
In _sweep_stale_sentinels, if pid is parsed as 0 or a negative integer (which can happen if the sentinel file is corrupted or has invalid content), calling os.kill(pid, 0) can have unintended side effects. Specifically, in Unix systems, os.kill(0, 0) sends the signal to all processes in the current process group, and negative PIDs signal process groups. We should explicitly check that pid > 0 before calling os.kill to avoid signaling process groups or raising unexpected errors.
| stale = False | |
| if pid is None: | |
| # Pre-PID file format from an older crash — treat as stale. | |
| stale = True | |
| stale = False | |
| if pid is None or pid <= 0: | |
| # Pre-PID or invalid file format from an older crash — treat as stale. | |
| stale = True |
Backport visualserver failure containment from upstream/qwen35. Keep worker loops alive after per-batch failures, propagate per-image failure status back to the manager, bound visual waits and cache RPCs, harden proxy/visual paths, and expose worker death through watchdog sentinels checked by /health. Adapted for main: dropped the qwen35 image/dependency update and qwen35-only visual token-budget flags. (cherry picked from commit 2b19512)
c69bb61 to
50d6a0a
Compare
Summary
2b195128visualserver failure containment to currentmain./healthwatchdog sentinels.main: excluded the qwen35 Dockerfile/dependency update and qwen35-only visual token-budget flags.Verification
git diff --check origin/main..HEADpython -m py_compile lightllm/models/qwen3_vl/qwen3_visual.py lightllm/server/api_cli.py lightllm/server/core/objs/start_args_type.py lightllm/server/visualserver/manager.py lightllm/server/visualserver/model_infer/model_rpc.py lightllm/server/visualserver/model_infer/model_rpc_client.py lightllm/server/visualserver/objs.py lightllm/server/visualserver/proxy_manager.py lightllm/server/visualserver/visual_only_manager.py lightllm/utils/health_check.pyrg -n "<<<<<<<|>>>>>>>|\|\|\|\|\|\|\||visual_batch_max_tokens|visual_image_max_tokens|Dockerfile\.qwen35" .returned no matches