F16 broken

#1
by ramendik - opened

Hello,

The F16 version outputs ?????? - I tested on both CPU and CUDA.

The same happens when I roll my own GGUF, which (as I now realize) is in F16 by default.

BF16 works fine.
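
For anyone else rolling their own GGUF, here is a minimal conversion sketch that forces BF16 output instead of the F16 default (assuming llama.cpp's convert_hf_to_gguf.py script and a local copy of the HF checkpoint; the paths and output filename are placeholders):

# placeholder paths; --outtype bf16 overrides the F16 default
python convert_hf_to_gguf.py ~/models/ibm-granite/granite-4.0-1b --outtype bf16 --outfile granite-4.0-1b-BF16.gguf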

IBM Granite org

Thanks for raising this! This is a thorny issue that has to do with F16 being the default precision type in llama.cpp. We chose to post the F16 GGUF because there are some configurations where it infers correctly, but it's definitely bordering on unusable due to these precision issues. I was able to make it run on my M3 MacBook Pro with the following:

./bin/llama-cli -m ~/models/ibm-granite/granite-4.0-1b/granite-4.0-1B-F16.gguf -p "You are a helpful assistant" -ngl 36

The key here is the -ngl 36 setting, which keeps the last 4 layers on the CPU (it works for me with any value <= 36).
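
If you want to check how many layers a model has before picking an -ngl value, one option is the gguf-dump script that ships with llama.cpp's gguf Python package (assuming it is installed; the path below is a placeholder), which prints the model's block_count metadata:

# placeholder path; the layer count appears as the <arch>.block_count key
gguf-dump ~/models/ibm-granite/granite-4.0-1b/granite-4.0-1B-F16.gguf | grep block_count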

I think the right course of action is to add a note about this to the model card for the GGUFs, indicating that the F16 version should not be used unless none of the other versions work for your situation.

IBM Granite org

Ok, I've added the disclaimer to the model card: https://huggingface.co/ibm-granite/granite-4.0-1b-GGUF

gabegoodhart changed discussion status to closed
