fix(visualserver): contain visual worker failures by sufubao · Pull Request #1347 · ModelTC/LightLLM

sufubao · 2026-06-13T10:47:52Z

Summary

Backport 2b195128 visualserver failure containment to current main.
Add per-image visual inference result signaling so worker failures/timeouts abort requests instead of forwarding missing embeddings.
Bound visual infer waits and embed-cache RPyC calls, keep visual workers alive after per-batch failures, and expose visual worker death through /health watchdog sentinels.
Adapted for main: excluded the qwen35 Dockerfile/dependency update and qwen35-only visual token-budget flags.

Verification

git diff --check origin/main..HEAD
python -m py_compile lightllm/models/qwen3_vl/qwen3_visual.py lightllm/server/api_cli.py lightllm/server/core/objs/start_args_type.py lightllm/server/visualserver/manager.py lightllm/server/visualserver/model_infer/model_rpc.py lightllm/server/visualserver/model_infer/model_rpc_client.py lightllm/server/visualserver/objs.py lightllm/server/visualserver/proxy_manager.py lightllm/server/visualserver/visual_only_manager.py lightllm/utils/health_check.py
rg -n "<<<<<<<|>>>>>>>|\|\|\|\|\|\|\||visual_batch_max_tokens|visual_image_max_tokens|Dockerfile\.qwen35" . returned no matches

gemini-code-assist

Code Review

This pull request hardens the visual server to prevent hangs and resource exhaustion by introducing timeouts for visual inference and cache operations, handling per-image preprocessing failures gracefully, and adding a periodic watchdog thread to monitor worker liveness. The review feedback highlights critical thread-safety issues with concurrent RPyC calls on shared connections (such as get_items_embed and set_items_embed in the cache executor, and remote_infer_images on remote connections) and suggests using locks to serialize these operations. Additionally, the reviewer identifies a potential bug in the watchdog's sentinel cleanup where calling os.kill with a non-positive PID could inadvertently signal entire process groups.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-13T10:51:39Z

+        self._cache_executor = concurrent.futures.ThreadPoolExecutor(
+            max_workers=2, thread_name_prefix="visual_cache_rpc"
+        )


RPyC connections are not thread-safe. Since self.cache_client is called from threads in self._cache_executor (which has max_workers=2), concurrent requests can call get_items_embed concurrently, leading to protocol corruption or connection hangs. We should initialize a lock to serialize all RPyC calls on self.cache_client.

Suggested change

self._cache_executor = concurrent.futures.ThreadPoolExecutor(

max_workers=2, thread_name_prefix="visual_cache_rpc"

)

self._cache_executor = concurrent.futures.ThreadPoolExecutor(

max_workers=2, thread_name_prefix='visual_cache_rpc'

)

self._cache_lock = threading.Lock()

gemini-code-assist · 2026-06-13T10:51:39Z

    def get_need_infer_images(self, group_req_indexes: GroupReqIndexes) -> List[ImageItem]:
+        """同步路径: 检查 shm req 是否 aborted, 必要时调用 embed cache 询问哪些 image 已就绪。
+
+        ``cache_client.root.get_items_embed`` 是同步 RPyC 调用; 调用方需要包一层
+        ``asyncio.to_thread`` 并加超时, 避免 embed cache 卡死时阻塞 asyncio 事件循环
+        (2026-05-09 incident)。
+        """


To ensure thread safety when calling self.cache_client.root.get_items_embed concurrently from multiple executor threads, please wrap the RPyC call inside get_need_infer_images with the newly introduced self._cache_lock:

with self._cache_lock: ready_image = obtain(self.cache_client.root.get_items_embed(img_uuids))

gemini-code-assist · 2026-06-13T10:51:39Z

+                # set_items_embed 也走 cache executor + 受 cache_client 的 sync_request_timeout 兜底
                self.cache_client.root.set_items_embed([image.uuid for image in images_need_infer])


RPyC connections are not thread-safe. Since _load_to_cpu_cache runs in self._cache_executor (which has max_workers=2), concurrent requests can call set_items_embed concurrently on self.cache_client, leading to protocol corruption. We should wrap this call with self._cache_lock to serialize access.

Suggested change

# set_items_embed 也走 cache executor + 受 cache_client 的 sync_request_timeout 兜底

self.cache_client.root.set_items_embed([image.uuid for image in images_need_infer])

# set_items_embed 也走 cache executor + 受 cache_client 的 sync_request_timeout 兜底

with self._cache_lock:

self.cache_client.root.set_items_embed([image.uuid for image in images_need_infer])

gemini-code-assist · 2026-06-13T10:51:40Z

        if self.args.detail_log:
-            start = time.time()
            logger.info(f"Start to remote infer images {[image.md5 for image in images]}")
        conn.root.remote_infer_images(images, event)


RPyC connections are not thread-safe. If multiple concurrent requests select the same remote visual server connection from self.id_to_rpyc_conn, they will call conn.root.remote_infer_images concurrently on the same connection, which can corrupt the RPyC protocol. We should serialize calls to remote_infer_images on the same connection, for example by using a lock associated with the connection or a global lock.

gemini-code-assist · 2026-06-13T10:51:40Z

+                stale = False
+                if pid is None:
+                    # Pre-PID file format from an older crash — treat as stale.
+                    stale = True


In _sweep_stale_sentinels, if pid is parsed as 0 or a negative integer (which can happen if the sentinel file is corrupted or has invalid content), calling os.kill(pid, 0) can have unintended side effects. Specifically, in Unix systems, os.kill(0, 0) sends the signal to all processes in the current process group, and negative PIDs signal process groups. We should explicitly check that pid > 0 before calling os.kill to avoid signaling process groups or raising unexpected errors.

Suggested change

stale = False

if pid is None:

# Pre-PID file format from an older crash — treat as stale.

stale = True

stale = False

if pid is None or pid <= 0:

# Pre-PID or invalid file format from an older crash — treat as stale.

stale = True

Backport visualserver failure containment from upstream/qwen35. Keep worker loops alive after per-batch failures, propagate per-image failure status back to the manager, bound visual waits and cache RPCs, harden proxy/visual paths, and expose worker death through watchdog sentinels checked by /health. Adapted for main: dropped the qwen35 image/dependency update and qwen35-only visual token-budget flags. (cherry picked from commit 2b19512)

gemini-code-assist Bot reviewed Jun 13, 2026

View reviewed changes

sufubao force-pushed the backport-visual-worker-failures branch from c69bb61 to 50d6a0a Compare June 13, 2026 11:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(visualserver): contain visual worker failures#1347

fix(visualserver): contain visual worker failures#1347
sufubao wants to merge 1 commit into
ModelTC:mainfrom
sufubao:backport-visual-worker-failures

sufubao commented Jun 13, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		# set_items_embed 也走 cache executor + 受 cache_client 的 sync_request_timeout 兜底
		self.cache_client.root.set_items_embed([image.uuid for image in images_need_infer])

Conversation

sufubao commented Jun 13, 2026

Summary

Verification

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant