# The ELK Meets the Falcon

## Falcon H1R 7B NVFP4

Made in Abu Dhabi, Accelerated in Dubai
Up to 1.8x Faster • Nearly 3x Smaller • 6x Hardware Savings • <1% Accuracy Loss
## The Speed of Light, The Wisdom of the Desert
When Abu Dhabi's Technology Innovation Institute created the majestic Falcon H1R, they built a hybrid beast - part Transformer, part Mamba2 - capable of soaring through 65,536 tokens of context.
But even falcons need to fly faster.
Mutaz Al Awamleh took this desert marvel and supercharged it with NVIDIA's cutting-edge NVFP4 quantization: 4-bit floating point with two-level scaling. The result? A model that runs up to 1.8x faster while using nearly 3x less memory.
## Performance That Speaks for Itself
Benchmarked on NVIDIA DGX Spark GB10 with 1000 concurrent requests:
| Metric | NVFP4 | BF16 | Improvement |
|---|---|---|---|
| Throughput | 879 TPS | 681 TPS | 1.3x faster |
| Time to First Token | 45.7s | 60.4s | 1.3x faster |
| End-to-End Latency | 71.1s | 93.8s | 1.3x faster |
| Memory Usage | 5.1 GiB | ~14 GiB | 2.8x smaller |
| Accuracy Loss (vs. BF16) | <1% | baseline | - |
### Long Context Performance (4K tokens @ 1000 concurrent)
| Metric | NVFP4 | BF16 | Improvement |
|---|---|---|---|
| Throughput | 585 TPS | 336 TPS | 1.7x faster |
| Time to First Token | 77.2s | 135.7s | 1.8x faster |
| End-to-End Latency | 110.5s | 202.5s | 1.8x faster |
## One Command to Rule Them All
```bash
docker run --gpus all -p 8000:8000 mutazai/falcon-h1r-7b-nvfp4:latest
```
That's it. No configuration. No tuning. No headaches.
Everything is baked in:
- NVFP4 quantization with FlashInfer-CUTLASS backend
- FP8 KV cache for reduced memory bandwidth
- Prefix caching enabled
- CUDA graphs with full piecewise compilation
- 65K context window support
- 256 max concurrent sequences
## OpenAI-Compatible API
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="falcon-h1r-7b-nvfp4",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=500,
)

print(response.choices[0].message.content)
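For long answers you can also stream tokens as they are generated. A minimal sketch reusing the `client` and model name from the example above; `stream=True` is standard in the OpenAI Python client and supported by vLLM's OpenAI-compatible server:

```python
# Stream the response chunk-by-chunk instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="falcon-h1r-7b-nvfp4",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=500,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```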
## Model Details
| Attribute | Value |
|---|---|
| Base Model | tiiuae/Falcon-H1R-7B |
| Architecture | Hybrid Transformers + Mamba2 |
| Parameters | 7.5B |
| Context Length | 65,536 tokens |
| Quantization | NVFP4 (4-bit floating point) |
| Compression | 2.8x (16 GB → 5.1 GB) |
| Target Hardware | NVIDIA Blackwell (GB10, GB100, GB200) |
## Technical Specifications
### Quantization Details
- Algorithm: NVFP4 with two-level scaling (per-block + per-tensor)
- Weight Format: 4-bit floating point (E2M1 with dynamic scaling)
- KV Cache: FP8 for reduced memory bandwidth
- Excluded Layers: `embed_tokens`, `lm_head` (kept in BF16)
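To make the two-level scaling described above concrete, here is a rough NumPy sketch of the idea. It is an illustration only, not the ModelOpt kernel: it assumes 16-element blocks, the E2M1 value grid, and a simple max-based choice of scales.

```python
import numpy as np

# Non-negative magnitudes representable in E2M1 (4-bit float); the sign is a separate bit.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK_SIZE = 16  # NVFP4 scales weights per small block as well as per tensor

def fake_quantize_nvfp4(w: np.ndarray) -> np.ndarray:
    """Quantize-dequantize a 1-D weight vector with two-level scaling."""
    blocks = w.reshape(-1, BLOCK_SIZE)
    tensor_scale = np.abs(blocks).max() / 6.0                      # level 1: per-tensor (FP32)
    block_scales = np.abs(blocks).max(axis=1, keepdims=True) / (6.0 * tensor_scale)
    block_scales = np.where(block_scales == 0, 1.0, block_scales)  # level 2: per-block
    scaled = blocks / (block_scales * tensor_scale)                # values now fall in [-6, 6]
    # Snap each value to the nearest representable E2M1 magnitude, keeping its sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    fp4 = np.sign(scaled) * E2M1_GRID[idx]
    # (In the real format the per-block scales are stored in FP8 E4M3; here they stay in FP32.)
    return (fp4 * block_scales * tensor_scale).reshape(w.shape)

w = np.random.randn(64).astype(np.float32)
print("max abs error:", np.abs(w - fake_quantize_nvfp4(w)).max())
```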
### Optimization Stack

- Base Image: `mutazai/optimized-vllm-sota-cuda13:3.0`
- vLLM Version: 0.13.0 with FlashInfer-CUTLASS
- CUDA: 13.0 (Blackwell optimized)
- PyTorch: 2.10
### Runtime Configuration (Baked In)

```
--quantization modelopt_fp4
--kv-cache-dtype fp8
--enable-prefix-caching
--max-model-len 65536
--max-num-seqs 256
--gpu-memory-utilization 0.88
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'
```
## Benchmark Methodology
All benchmarks performed with:
- Unique prompts per request (no cache hits from repeated prompts)
- Prefix caching enabled (KV cache reuse for common prefixes)
- FP8 KV cache (reduced memory bandwidth)
- Concurrency levels: 100, 250, 500, 1000
- Prompt sizes: 100, 1K, 4K, 16K tokens
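For reproduction, a harness along these lines can drive the server at a given concurrency level. This is an assumed sketch using the async OpenAI client against the endpoint above, not the exact script that produced the tables:

```python
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def one_request(i: int) -> tuple[float, float]:
    """Send one unique prompt; return (time-to-first-token, end-to-end latency) in seconds."""
    start = time.perf_counter()
    ttft = None
    stream = await client.chat.completions.create(
        model="falcon-h1r-7b-nvfp4",
        messages=[{"role": "user", "content": f"Request {i}: summarize NVFP4 quantization."}],
        max_tokens=128,
        stream=True,
    )
    async for _chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first streamed chunk marks TTFT
    end = time.perf_counter() - start
    return (ttft if ttft is not None else end), end

async def main(concurrency: int = 100) -> None:
    results = await asyncio.gather(*(one_request(i) for i in range(concurrency)))
    ttfts, e2es = zip(*results)
    print(f"concurrency={concurrency}  mean TTFT={sum(ttfts)/len(ttfts):.2f}s  "
          f"mean E2E={sum(e2es)/len(e2es):.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
```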
## Docker Hub
Pull the image:

```bash
docker pull mutazai/falcon-h1r-7b-nvfp4:latest
```

Run with custom port:

```bash
docker run --gpus all -p 9000:8000 mutazai/falcon-h1r-7b-nvfp4:latest
```

Check health:

```bash
curl http://localhost:8000/health
```
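Beyond the health endpoint, you can confirm the served model name via the standard OpenAI-compatible model listing. A quick sketch, assuming the default port mapping:

```python
import requests

# List the models exposed by the running container via the OpenAI-compatible API.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # expect "falcon-h1r-7b-nvfp4"
```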
## The Quantization Journey
Quantizing Falcon H1R required solving several challenges unique to hybrid Mamba-Transformer architectures:
### Challenges Overcome
| Challenge | Solution |
|---|---|
| vLLM Falcon H1R loader missing NVFP4 support | Patched `falcon_h1.py` with NVFP4 weight loading (weight unpacking from the packed FP4 format) |
| W4A4 activation quantization failures | Switched to W4A16 (weight-only quantization); Mamba SSM layers produce zero-scale activations |
| `config.json` format incompatible | Fixed `quant_method: "modelopt"` → `"nvfp4"`, added `format: "nvfp4"` field |
| `hf_quant_config.json` wrong algo | Changed `kv_cache_quant_algo: "INT8"` → `"FP8"` to match the KV cache dtype |
| transformers 4.55+ incompatible | Pinned to `transformers==4.48.0` for `HybridMambaAttentionDynamicCache` compatibility |
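As an example, the `config.json` fix in the table above amounts to rewriting two fields of the checkpoint's quantization config. A hedged sketch; the exact nesting of the quantization block in the exported checkpoint and the local path are assumptions:

```python
import json
from pathlib import Path

cfg_path = Path("falcon-h1r-7b-nvfp4/config.json")  # hypothetical local checkpoint directory
cfg = json.loads(cfg_path.read_text())

# Point vLLM at the NVFP4 scheme (assumed key layout under "quantization_config").
quant = cfg.setdefault("quantization_config", {})
quant["quant_method"] = "nvfp4"   # was "modelopt"
quant["format"] = "nvfp4"         # added so the format is recognized
cfg_path.write_text(json.dumps(cfg, indent=2))
```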
### Key Code Changes
Weight Loader Patch (`falcon_h1.py`):

- Added NVFP4 weight unpacking for linear layers
- Handled packed FP4 tensors with two-level scaling
- Excluded Mamba conv1d and SSM layers from quantization
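As a rough illustration of what unpacking packed FP4 weights involves, here is a hypothetical NumPy sketch. The real patch operates on torch tensors inside vLLM's loader and handles layout details not shown here; the nibble order and scale layout below are assumptions:

```python
import numpy as np

# E2M1 decode table indexed by the 4-bit code: bit 3 is the sign, bits 0-2 select the magnitude.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                        -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def unpack_fp4_weights(packed: np.ndarray, block_scales: np.ndarray,
                       tensor_scale: float, block_size: int = 16) -> np.ndarray:
    """Decode uint8 storage (two FP4 codes per byte) back to float weights."""
    low = packed & 0x0F            # assumed: first element lives in the low nibble
    high = (packed >> 4) & 0x0F
    codes = np.stack([low, high], axis=-1).reshape(-1)
    values = E2M1_VALUES[codes]
    # Apply the two-level scaling: one scale per block of weights, one per tensor.
    scales = np.repeat(block_scales, block_size)
    return values * scales * tensor_scale

packed = np.frombuffer(np.random.bytes(16), dtype=np.uint8)   # 16 bytes -> 32 FP4 weights
weights = unpack_fp4_weights(packed, block_scales=np.array([0.7, 1.3]), tensor_scale=0.05)
print(weights.shape)  # (32,)
```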
Quantization Config:

```python
# Used W4A16 weight-only (not W4A4 full quantization)
WEIGHT_ONLY_CONFIG = {
    "*weight_quantizer": {"enable": True, "num_bits": (2, 1)},
    "*input_quantizer": {"enable": False},  # Critical for Mamba
    "*output_quantizer": {"enable": False},
    "*lm_head*": {"enable": False},
    "*embed_tokens*": {"enable": False},
}
```
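For orientation, a config like this is normally handed to ModelOpt's post-training quantization entry point together with a calibration pass. The wrapping dict, the tiny calibration prompts, and the omitted checkpoint-export step below are assumptions, not the exact script used for this model:

```python
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-H1R-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="cuda"
)

# Assumed wrapping of the quantizer rules from the snippet above into a full ModelOpt config.
quant_config = {"quant_cfg": WEIGHT_ONLY_CONFIG, "algorithm": "max"}

def forward_loop(m):
    # Tiny calibration pass; a real run would use a few hundred representative prompts.
    for prompt in ["Explain NVFP4 quantization.", "Write a haiku about falcons."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

model = mtq.quantize(model, quant_config, forward_loop)  # weights become fake-quantized NVFP4
```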
vLLM Runtime Flags:

- `--quantization modelopt_fp4` (not `modelopt`)
- `--kv-cache-dtype fp8` (must match the config)
- `--trust-remote-code` (for Mamba2 support)
### Lessons Learned
- Mamba hybrid models need W4A16: Full W4A4 (activation quantization) causes zero-scale errors in SSM layers
- Config format matters: vLLM is very strict about the `quant_method` and `format` fields
- transformers version: the hybrid cache implementation changed in 4.55+, breaking Mamba-Transformer models
## Credits
- Original Model: Technology Innovation Institute (TII), Abu Dhabi, UAE
- Optimization & Quantization: Mutaz Al Awamleh, Dubai, UAE
- Docker Hub: `mutazai/falcon-h1r-7b-nvfp4`
- Quantization Tool: NVIDIA ModelOpt with NVFP4
- Serving Stack: vLLM 0.13.0 with FlashInfer-CUTLASS backend
## License
Apache 2.0 - Same as the original Falcon H1R model.