Report a float compute dtype for quantized GGMLTensor #456

Open
fxd0h wants to merge 1 commit into city96:main from fxd0h:quantized-dtype-mps

fxd0h commented Jun 7, 2026

A quantized GGMLTensor is backed by packed uint8 storage, so weight.dtype returns torch.uint8. Several models cast activations to the weight dtype before a linear layer with x.to(weight.dtype): Ideogram-4 (ldm/ideogram4/model.py), Gemma-4, Qwen3.5, CogVideo and some VAEs. With a quantized weight, that cast converts the activations to uint8, which corrupts them and breaks F.linear on MPS, where floating-point inputs are required: RuntimeError: MPS device does not support linear for non-float inputs.
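
The corruption half of the problem can be reproduced on CPU with plain PyTorch, without GGUF or MPS. The sketch below is illustrative only: `packed_weight` is a hypothetical stand-in for the packed storage of a quantized weight, and the activations are non-negative so that the result of the integer cast is well defined.

```python
import torch

# Hypothetical stand-in for a quantized weight: raw packed bytes,
# so the reported dtype is uint8 rather than a float type.
packed_weight = torch.zeros(4, 4, dtype=torch.uint8)

# Example activations (non-negative, so the uint8 cast is well defined).
x = torch.tensor([0.25, 1.75, 3.5], dtype=torch.float32)

# The pattern used by the affected models: cast to the weight's dtype.
x_cast = x.to(packed_weight.dtype)

cast_dtype = x_cast.dtype                # torch.uint8
cast_values = x_cast.tolist()            # fractional parts are truncated
is_float = x_cast.is_floating_point()    # False

print(cast_dtype, cast_values, is_float)
```

The activations lose their fractional part before the linear layer ever runs, and the tensor is no longer floating-point, which is the condition the MPS error message above complains about.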

Change

GGMLTensor.dtype now returns a floating-point compute dtype (bfloat16) for quantized tensors; F16, F32 and unquantized tensors keep their real storage dtype. A compute_dtype set by the loader takes precedence over both.

  • bfloat16 is used rather than float16 to avoid a downcast, and the saturation it can cause, on GGML-BF16 models (WAN, FLUX), which run natively in bfloat16.
  • The dequantization paths are unaffected, as they read storage through .data rather than .dtype.
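
The precedence described in this section can be sketched with a plain-Python stand-in. This is not the code in the diff: `TensorInfo`, its fields, and the use of strings for dtype names are assumptions made so the example runs without PyTorch, and treating every type label other than F16 and F32 as quantized is likewise an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption for this sketch: None means "not a GGML-typed tensor";
# F16 and F32 keep their storage dtype; any other label counts as quantized.
KEEP_STORAGE_DTYPE = {None, "F16", "F32"}


@dataclass
class TensorInfo:
    """Hypothetical stand-in for the metadata a GGMLTensor carries."""
    storage_dtype: str                   # dtype of the underlying storage
    tensor_type: Optional[str] = None    # GGML type label, e.g. "Q8_0"
    compute_dtype: Optional[str] = None  # optional override set by the loader

    @property
    def dtype(self) -> str:
        # 1. A loader-set compute dtype takes precedence.
        if self.compute_dtype is not None:
            return self.compute_dtype
        # 2. F16, F32 and unquantized tensors report their real storage dtype.
        if self.tensor_type in KEEP_STORAGE_DTYPE:
            return self.storage_dtype
        # 3. Quantized tensors report a float compute dtype, not uint8.
        return "bfloat16"


q8 = TensorInfo("uint8", "Q8_0")
f16 = TensorInfo("float16", "F16")
plain = TensorInfo("float32")
override = TensorInfo("uint8", "Q8_0", compute_dtype="float16")

print(q8.dtype, f16.dtype, plain.dtype, override.dtype)
```

Because the reported dtype is derived from metadata and the storage dtype is held separately, code that reads the packed bytes directly is not affected by what `dtype` returns, which mirrors the point about `.data` above.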

Testing

  • Fixes Ideogram-4 GGUF generation on MPS, which previously failed in the DiT forward pass.
  • Regression-tested by loading WAN 2.2 I2V A14B Q8_0 (1095 tensors) on MPS: dequantized shapes and values are correct, the forward pass runs end to end, and the dtype property does not recurse.

The change mainly benefits MPS, where the non-float cast is fatal; CUDA tolerated the previous behaviour.

Part of #454.

Developed with AI assistance; the diff and test results above were verified on-device before submission.

A quantized GGMLTensor is backed by packed uint8 storage, so
weight.dtype returns torch.uint8. Models that do x.to(weight.dtype)
before a linear (Ideogram-4, Gemma-4, Qwen3.5, CogVideo, some VAEs)
cast activations to uint8, corrupting them and breaking F.linear on
MPS, which requires floating-point inputs.

GGMLTensor.dtype now returns bfloat16 for quantized tensors; F16/F32
and unquantized tensors keep their real storage dtype. A loader-set
compute_dtype takes precedence. Dequant paths read storage via .data
and are unaffected.

Regression-tested loading WAN 2.2 I2V A14B Q8_0 (1095 tensors) on MPS.