Report a float compute dtype for quantized GGMLTensor by fxd0h · Pull Request #456 · city96/ComfyUI-GGUF

fxd0h · 2026-06-07T22:39:01Z

A quantized GGMLTensor is backed by packed uint8 storage, so weight.dtype returns torch.uint8. Models that cast activations to the weight dtype before a linear layer with x.to(weight.dtype) (Ideogram-4 in ldm/ideogram4/model.py, Gemma-4, Qwen3.5, CogVideo, some VAEs) therefore convert their activations to uint8. This corrupts the activations and breaks F.linear on MPS, which requires floating-point inputs: RuntimeError: MPS device does not support linear for non-float inputs.

Change

GGMLTensor.dtype now returns a floating-point compute dtype (bfloat16) for quantized tensors; F16, F32 and unquantized tensors keep their real storage dtype. A compute_dtype set by the loader takes precedence.
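The resolution order described above can be modelled with a small stand-alone sketch. It is not the diff in this PR: the class name, field names and the use of plain strings in place of torch dtypes and GGML type tags are all hypothetical, and checking compute_dtype first for every tensor is an assumption about how "takes precedence" is applied.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins: strings replace torch dtypes and GGML type tags.
FLOAT_STORAGE_TYPES = {"F16", "F32"}


@dataclass
class FakeGGMLTensor:
    storage_dtype: str                   # real dtype of the backing storage, e.g. "uint8"
    tensor_type: Optional[str]           # GGML type tag such as "Q8_0"; None if not a GGML tensor
    compute_dtype: Optional[str] = None  # optionally set by the loader

    @property
    def dtype(self) -> str:
        # Assumption: a loader-set compute dtype wins over everything else.
        if self.compute_dtype is not None:
            return self.compute_dtype
        # F16/F32 and unquantized tensors report their real storage dtype.
        if self.tensor_type is None or self.tensor_type in FLOAT_STORAGE_TYPES:
            return self.storage_dtype
        # Quantized tensors report a float compute dtype instead of uint8.
        return "bfloat16"


print(FakeGGMLTensor("uint8", "Q8_0").dtype)
print(FakeGGMLTensor("float16", "F16").dtype)
print(FakeGGMLTensor("uint8", "Q8_0", compute_dtype="float16").dtype)
```

With this ordering, a caller doing x.to(weight.dtype) never sees uint8 for a quantized weight, while code that inspects the packed storage directly is untouched.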

bfloat16 rather than float16 avoids downcast and saturation on GGML-BF16 models, which run bfloat16 natively (WAN, FLUX).
The dequantization paths are unaffected, as they read storage through .data rather than .dtype.
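The range argument for bfloat16 over float16 can be illustrated without torch. The sketch below uses only the standard library: struct's "e" format is IEEE 754 half precision, and bfloat16 is simulated by keeping the top 16 bits of the float32 encoding. The helper names and the sample value are hypothetical, and truncation is used instead of round-to-nearest for simplicity. Note that struct raises OverflowError for out-of-range values, whereas a torch cast to float16 would produce inf.

```python
import struct


def fits_float16(x: float) -> bool:
    """Return True if x lies within float16's finite range (maximum 65504)."""
    try:
        struct.pack("<e", x)  # "e" is IEEE 754 half precision
        return True
    except OverflowError:
        return False


def to_bfloat16(x: float) -> float:
    """Simulate bfloat16 by truncating the float32 encoding to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


value = 1e5  # hypothetical magnitude above float16's maximum
print(fits_float16(value))  # whether float16 can hold the value at all
print(to_bfloat16(value))   # bfloat16 keeps the magnitude, at reduced precision
```

bfloat16 shares float32's 8-bit exponent, so it trades mantissa precision for range; float16 has more mantissa bits but overflows above 65504.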

Testing

Fixes Ideogram-4 GGUF generation on MPS, which previously failed in the DiT forward pass.
Regression-tested by loading WAN 2.2 I2V A14B Q8_0 (1095 tensors) on MPS: dequantization shapes and values are correct, the forward pass runs end to end, and there is no recursion in the property.

The change mainly benefits MPS, where the non-float cast is fatal; CUDA tolerated the previous behaviour.

Part of #454.

Developed with AI assistance; the diff and test results above were verified on-device before submission.

fxd0h mentioned this pull request Jun 7, 2026

Please support qwen3vl_8b gguf (Ideogram-4) text encoder #452

Open

fxd0h wants to merge 1 commit into city96:main from fxd0h:quantized-dtype-mps
