F16 broken

#1
by ramendik - opened

Hello,

The F16 version outputs ?????? - I tested on both CPU and CUDA.

The same happens when I roll my own GGUF, which (as I now realize) is in F16 by default.

BF16 works fine.
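
For anyone else rolling their own GGUF, here is a minimal conversion sketch that forces BF16 output instead of the F16 default (assuming llama.cpp's convert_hf_to_gguf.py script and a local copy of the HF checkpoint; the paths and output filename are placeholders):

# placeholder paths; --outtype bf16 overrides the F16 default
python convert_hf_to_gguf.py ~/models/ibm-granite/granite-4.0-1b --outtype bf16 --outfile granite-4.0-1b-BF16.gguf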

IBM Granite org

Thanks for raising this! This is a thorny issue that has to do with F16 being the default precision type in llama.cpp. We chose to post the F16 GGUF because there are some configurations where it infers correctly, but it's definitely bordering on unusable due to these precision issues. I was able to make it run on my M3 MacBook Pro with the following:

./bin/llama-cli -m ~/models/ibm-granite/granite-4.0-1b/granite-4.0-1B-F16.gguf -p "You are a helpful assistant" -ngl 36

The key here is the -ngl 36 setting, which keeps the last 4 layers on the CPU (it works for me with any value <= 36).
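
If you want to check how many layers a model has before picking an -ngl value, one option is the gguf-dump script that ships with llama.cpp's gguf Python package (assuming it is installed; the path below is a placeholder), which prints the model's block_count metadata:

# placeholder path; the layer count appears as the <arch>.block_count key
gguf-dump ~/models/ibm-granite/granite-4.0-1b/granite-4.0-1B-F16.gguf | grep block_count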

I think the right course of action is to add a note about this to the model card for the GGUFs, indicating that the F16 version should not be used unless none of the other versions work for your situation.

IBM Granite org

Ok, I've added the disclaimer to the model card: https://huggingface.co/ibm-granite/granite-4.0-1b-GGUF

gabegoodhart changed discussion status to closed
