Report a float compute dtype for quantized GGMLTensor #456
Open
fxd0h wants to merge 1 commit into
A quantized GGMLTensor is backed by packed uint8 storage, so weight.dtype returns torch.uint8. Models that do x.to(weight.dtype) before a linear (Ideogram-4, Gemma-4, Qwen3.5, CogVideo, some VAEs) cast activations to uint8, corrupting them and breaking F.linear on MPS, which requires floating-point inputs. GGMLTensor.dtype now returns bfloat16 for quantized tensors; F16/F32 and unquantized tensors keep their real storage dtype. A loader-set compute_dtype takes precedence. Dequant paths read storage via .data and are unaffected. Regression-tested loading WAN 2.2 I2V A14B Q8_0 (1095 tensors) on MPS.
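The precedence described above can be sketched without PyTorch. This is only a hypothetical model of the dtype logic, not the code in this PR: the class name `FakeGGMLTensor`, its fields, and the use of strings in place of torch dtypes are all assumptions made for illustration. It also assumes the loader override applies only to quantized tensors, which the description does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional

# Float dtype reported for quantized tensors when the loader sets no override.
QUANTIZED_FALLBACK = "bfloat16"


@dataclass
class FakeGGMLTensor:
    """Hypothetical stand-in that models only the dtype resolution."""

    storage_dtype: str                   # dtype of the underlying buffer
    quantized: bool                      # True for packed formats such as Q8_0
    compute_dtype: Optional[str] = None  # optional override set by the loader

    @property
    def dtype(self) -> str:
        if not self.quantized:
            # F16/F32 and unquantized tensors keep their real storage dtype.
            return self.storage_dtype
        if self.compute_dtype is not None:
            # Assumption: the override is consulted for quantized tensors only.
            return self.compute_dtype
        # Quantized tensors never expose the packed uint8 storage dtype.
        return QUANTIZED_FALLBACK


if __name__ == "__main__":
    print(FakeGGMLTensor("uint8", quantized=True).dtype)
    print(FakeGGMLTensor("uint8", quantized=True, compute_dtype="float16").dtype)
    print(FakeGGMLTensor("float32", quantized=False).dtype)
```

With this ordering, a caller that does `x.to(weight.dtype)` always receives a floating-point dtype for a quantized weight, while code that needs the packed bytes has to read the storage directly.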
A quantized GGMLTensor is backed by packed uint8 storage, so weight.dtype returns torch.uint8. Models that cast activations to the weight dtype before a linear (Ideogram-4 ldm/ideogram4/model.py, Gemma-4, Qwen3.5, CogVideo, some VAEs do x.to(weight.dtype)) therefore convert their activations to uint8. This corrupts them and breaks F.linear on MPS, which requires floating-point inputs:
RuntimeError: MPS device does not support linear for non-float inputs.
Change
GGMLTensor.dtype now returns a floating-point compute dtype (bfloat16) for quantized tensors; F16, F32 and unquantized tensors keep their real storage dtype. A compute_dtype set by the loader takes precedence. bfloat16 rather than float16 avoids downcast and saturation on GGML-BF16 models, which run bfloat16 natively (WAN, FLUX). Dequant paths read storage via .data rather than .dtype and are unaffected.
Testing
The change mainly benefits MPS, where the non-float cast is fatal; CUDA tolerated the previous behaviour.
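The corruption caused by the non-float cast can be reproduced on CPU with plain tensors. This is a minimal sketch, assuming PyTorch is installed; it does not use GGMLTensor or MPS, and the variable names are illustrative only.

```python
import torch

# Float32 activations, as a model would hold them before a linear layer.
x = torch.tensor([0.25, 0.5, 1.9])

# The dtype a quantized weight reported before this change.
storage_dtype = torch.uint8

# Casting to the storage dtype truncates the fractional part of each value.
corrupted = x.to(storage_dtype)

# Casting to a float compute dtype keeps the activations usable.
preserved = x.to(torch.bfloat16)

print(corrupted.tolist())
print(corrupted.is_floating_point(), preserved.is_floating_point())
```

The uint8 result is no longer a floating-point tensor, which is the condition the MPS linear kernel rejects.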
Part of #454.
Developed with AI assistance; the diff and test results above were verified on-device before submission.