[Fix]: Handle None scales in generate_zero_point for mixed-format layers#4505

Closed
lingyezhixing wants to merge 1 commit into InternLM:main from lingyezhixing:lingyezhixing/fix-none-scales-zero-point

Conversation

lingyezhixing commented Apr 7, 2026

Motivation

Fix crash when loading compressed-tensors quantized Qwen3.5 models (e.g., from llm-compressor) in TurboMind backend.

Qwen3.5 mixes linear_attention (24 layers) and full_attention (8 layers). For linear_attention layers that lack self_attn weights, the reader returns None for scales. When compressed_tensors=True and has_zero_point=False (symmetric quantization), generate_zero_point(scales) is called unconditionally, crashing on None.

Models with standard AWQ format (quant_method="awq") are unaffected because they take a different code path that never calls generate_zero_point.

Modification

Guard generate_zero_point(scales) with a None check in lmdeploy/turbomind/deploy/parameter.py:

if self.compressed_tensors and not self.has_zero_point:
-    zeros = generate_zero_point(scales)
+    if scales is not None and all(s is not None for s in scales):
+        zeros = generate_zero_point(scales)
+    else:
+        zeros = scales
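The effect of the guard can be sketched in isolation. In the minimal stand-alone example below, generate_zero_point and make_zeros are simplified stand-ins for the real functions in lmdeploy/turbomind/deploy/parameter.py, which operate on tensors rather than lists:

```python
def generate_zero_point(scales):
    # Hypothetical stand-in: symmetric quantization derives a constant
    # zero point per scale entry. Iterating a None input raises
    # TypeError, mirroring the reported crash.
    return [8 for _ in scales]

def make_zeros(scales, compressed_tensors=True, has_zero_point=False):
    # Guarded branch from the patch: only generate zero points when
    # every scale entry is present; otherwise propagate scales (None)
    # so the downstream all-None skip logic can take over.
    if compressed_tensors and not has_zero_point:
        if scales is not None and all(s is not None for s in scales):
            return generate_zero_point(scales)
        return scales
    return None

# Full-attention layer: real scales produce zero points.
zeros = make_zeros([0.5, 0.25, 0.125])

# Linear-attention layer without self_attn weights: None passes through
# instead of crashing inside generate_zero_point.
skipped = make_zeros(None)
```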

BC-breaking (Optional)

No.

Use cases (Optional)

Crash reproduction (before fix):

lmdeploy chat cyankiwi/Qwen3.5-4B-AWQ-4bit --backend turbomind

Works correctly (standard AWQ, unaffected):

lmdeploy chat QuantTrio/Qwen3.5-4B-AWQ --backend turbomind

Qwen3.5-AWQ has mixed-format attention layers (fp16 QKV + AWQ O projection).
The reader returns (None, None, None, None) for quant params to signal skip,
but QuantWeightOnly.__call__ passed these Nones directly to generate_zero_point()
which crashed on None.shape. Guard the call so Nones propagate to _export's
existing all-None skip logic instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Fixes a TurboMind export-time crash when converting/loading compressed-tensors symmetric-int4 weights for Qwen3.5 models that mix linear_attention and full_attention, where some layers may not have self_attn weights and thus produce None scale entries.

Changes:

  • Add a None-aware guard around generate_zero_point(scales) for compressed-tensors symmetric quantization.
  • Fall back to passing through scales as zeros when scales are missing (None) to avoid crashing.


Comment on lines +100 to +103
if scales is not None and all(s is not None for s in scales):
    zeros = generate_zero_point(scales)
else:
    zeros = scales

Copilot AI Apr 8, 2026


The new branch that skips generate_zero_point when scales (or any element within it) is None isn’t covered by the existing compressed-tensors tests. Please add a unit test that exercises QuantWeightOnly with compressed-tensors keys where weight_scale is a tuple containing None entries (e.g., all None for a missing self_attn layer) and asserts the call does not crash and that zeros is passed through consistently.
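A sketch of such a test, exercising only the guard logic in isolation (guarded_zeros is a hypothetical stand-in for the patched branch of QuantWeightOnly.__call__; an in-repo test would instead construct a real QuantWeightOnly with compressed-tensors keys):

```python
def guarded_zeros(scales):
    # Hypothetical stand-in for the patched branch: generate zero points
    # only when every scale entry is present, else pass scales through.
    if scales is not None and all(s is not None for s in scales):
        return [8 for _ in scales]  # stands in for generate_zero_point
    return scales

def test_all_none_scales_pass_through():
    # A missing self_attn layer signals "skip" with all-None quant params.
    assert guarded_zeros(None) is None
    assert guarded_zeros((None, None, None, None)) == (None, None, None, None)

def test_real_scales_generate_zeros():
    assert guarded_zeros([0.5, 0.25, 0.125]) == [8, 8, 8]

test_all_none_scales_pass_through()
test_real_scales_generate_zeros()
```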

@lvhan028 lvhan028 self-requested a review April 15, 2026 06:26
@lvhan028
Collaborator

Hi, @lingyezhixing
I tried the following code with latest main. And it worked well

from lmdeploy import pipeline, TurbomindEngineConfig


model_path = 'cyankiwi/Qwen3.5-4B-AWQ-4bit'

backend_config = TurbomindEngineConfig(
    tp=1,
    cache_max_entry_count=0.8,
)
pipe = pipeline(model_path, backend_config=backend_config, log_level='INFO')
response = pipe(['Hi, pls intro yourself'])
print(response)

@43758726
Collaborator

Hi, @lingyezhixing
I also tried the following code with the latest main and it worked well.

from lmdeploy import pipeline, TurbomindEngineConfig

model_path = 'cyankiwi/Qwen3.5-4B-AWQ-4bit'

backend_config = TurbomindEngineConfig()
pipe = pipeline(model_path, backend_config=backend_config)
response = pipe(['Hi, pls intro yourself'])
print(response)

@lingyezhixing
Author

Confirmed fixed in the latest main; closing this PR.

@lingyezhixing lingyezhixing deleted the lingyezhixing/fix-none-scales-zero-point branch April 16, 2026 08:41