[Fix]: Handle None scales in generate_zero_point for mixed-format layers#4505

Closed
lingyezhixing wants to merge 1 commit into InternLM:main from lingyezhixing:lingyezhixing/fix-none-scales-zero-point

Conversation

lingyezhixing commented Apr 7, 2026

Motivation

Fix crash when loading compressed-tensors quantized Qwen3.5 models (e.g., from llm-compressor) in TurboMind backend.

Qwen3.5 mixes linear_attention (24 layers) and full_attention (8 layers). For linear_attention layers that lack self_attn weights, the reader returns None for scales. When compressed_tensors=True and has_zero_point=False (symmetric quantization), generate_zero_point(scales) is called unconditionally, crashing on None.

Models with standard AWQ format (quant_method="awq") are unaffected because they take a different code path that never calls generate_zero_point.

Modification

Guard generate_zero_point(scales) with a None check in lmdeploy/turbomind/deploy/parameter.py:

if self.compressed_tensors and not self.has_zero_point:
-    zeros = generate_zero_point(scales)
+    if scales is not None and all(s is not None for s in scales):
+        zeros = generate_zero_point(scales)
+    else:
+        zeros = scales
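The effect of the guard can be sketched in isolation. In the minimal stand-alone example below, generate_zero_point and make_zeros are simplified stand-ins for the real functions in lmdeploy/turbomind/deploy/parameter.py, which operate on tensors rather than lists:

```python
def generate_zero_point(scales):
    # Hypothetical stand-in: symmetric quantization derives a constant
    # zero point per scale entry. Iterating a None input raises
    # TypeError, mirroring the reported crash.
    return [8 for _ in scales]

def make_zeros(scales, compressed_tensors=True, has_zero_point=False):
    # Guarded branch from the patch: only generate zero points when
    # every scale entry is present; otherwise propagate scales (None)
    # so the downstream all-None skip logic can take over.
    if compressed_tensors and not has_zero_point:
        if scales is not None and all(s is not None for s in scales):
            return generate_zero_point(scales)
        return scales
    return None

# Full-attention layer: real scales produce zero points.
zeros = make_zeros([0.5, 0.25, 0.125])

# Linear-attention layer without self_attn weights: None passes through
# instead of crashing inside generate_zero_point.
skipped = make_zeros(None)
```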

BC-breaking (Optional)

No.

Use cases (Optional)

Crash reproduction (before fix):

lmdeploy chat cyankiwi/Qwen3.5-4B-AWQ-4bit --backend turbomind

Works correctly (standard AWQ, unaffected):

lmdeploy chat QuantTrio/Qwen3.5-4B-AWQ --backend turbomind

Qwen3.5-AWQ has mixed-format attention layers (fp16 QKV + AWQ O projection).
The reader returns (None, None, None, None) for quant params to signal skip,
but QuantWeightOnly.__call__ passed these Nones directly to generate_zero_point()
which crashed on None.shape. Guard the call so Nones propagate to _export's
existing all-None skip logic instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Fixes a TurboMind export-time crash when converting/loading compressed-tensors symmetric-int4 weights for Qwen3.5 models that mix linear_attention and full_attention, where some layers may not have self_attn weights and thus produce None scale entries.

Changes:

  • Add a None-aware guard around generate_zero_point(scales) for compressed-tensors symmetric quantization.
  • Fall back to passing through scales as zeros when scales are missing (None) to avoid crashing.


Comment on lines +100 to +103
if scales is not None and all(s is not None for s in scales):
    zeros = generate_zero_point(scales)
else:
    zeros = scales

Copilot AI Apr 8, 2026


The new branch that skips generate_zero_point when scales (or any element within it) is None isn’t covered by the existing compressed-tensors tests. Please add a unit test that exercises QuantWeightOnly with compressed-tensors keys where weight_scale is a tuple containing None entries (e.g., all None for a missing self_attn layer) and asserts the call does not crash and that zeros is passed through consistently.
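A sketch of such a test, exercising only the guard logic in isolation (guarded_zeros is a hypothetical stand-in for the patched branch of QuantWeightOnly.__call__; an in-repo test would instead construct a real QuantWeightOnly with compressed-tensors keys):

```python
def guarded_zeros(scales):
    # Hypothetical stand-in for the patched branch: generate zero points
    # only when every scale entry is present, else pass scales through.
    if scales is not None and all(s is not None for s in scales):
        return [8 for _ in scales]  # stands in for generate_zero_point
    return scales

def test_all_none_scales_pass_through():
    # A missing self_attn layer signals "skip" with all-None quant params.
    assert guarded_zeros(None) is None
    assert guarded_zeros((None, None, None, None)) == (None, None, None, None)

def test_real_scales_generate_zeros():
    assert guarded_zeros([0.5, 0.25, 0.125]) == [8, 8, 8]

test_all_none_scales_pass_through()
test_real_scales_generate_zeros()
```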

@lvhan028 lvhan028 self-requested a review April 15, 2026 06:26
@lvhan028
Collaborator

Hi, @lingyezhixing
I tried the following code with latest main. And it worked well

from lmdeploy import pipeline, TurbomindEngineConfig


model_path = 'cyankiwi/Qwen3.5-4B-AWQ-4bit'

backend_config = TurbomindEngineConfig(
    tp=1,
    cache_max_entry_count=0.8,
)
pipe = pipeline(model_path, backend_config=backend_config, log_level='INFO')
response = pipe(['Hi, pls intro yourself'])
print(response)

@43758726
Collaborator

Hi, @lingyezhixing
I also tried the following code with the latest main and it worked well.

from lmdeploy import pipeline, TurbomindEngineConfig

model_path = 'cyankiwi/Qwen3.5-4B-AWQ-4bit'

backend_config = TurbomindEngineConfig()
pipe = pipeline(model_path, backend_config=backend_config)
response = pipe(['Hi, pls intro yourself'])
print(response)

@lingyezhixing
Author

Confirmed fixed in the latest main; closing this PR.

@lingyezhixing lingyezhixing deleted the lingyezhixing/fix-none-scales-zero-point branch April 16, 2026 08:41