# The ELK Meets the Falcon

## Falcon H1R 7B NVFP4

Made in Abu Dhabi, Accelerated in Dubai
Up to 1.8x Faster • Nearly 3x Smaller • 6x Hardware Savings • <1% Accuracy Loss
## The Speed of Light, The Wisdom of the Desert
When Abu Dhabi's Technology Innovation Institute created the majestic Falcon H1R, they built a hybrid beast - part Transformer, part Mamba2 - capable of soaring through 65,536 tokens of context.
But even falcons need to fly faster.
Mutaz Al Awamleh took this desert marvel and supercharged it with NVIDIA's cutting-edge NVFP4 quantization: 4-bit floating point with two-level scaling. The result? A model that runs up to 1.8x faster while using nearly 3x less memory.
## Performance That Speaks for Itself
Benchmarked on NVIDIA DGX Spark GB10 with 1000 concurrent requests:
| Metric | NVFP4 | BF16 | Improvement |
|---|---|---|---|
| Throughput | 879 TPS | 681 TPS | 1.3x faster |
| Time to First Token | 45.7s | 60.4s | 1.3x faster |
| End-to-End Latency | 71.1s | 93.8s | 1.3x faster |
| Memory Usage | 5.1 GiB | ~14 GiB | 2.8x smaller |
| Accuracy Loss (vs. BF16) | <1% | baseline | - |
### Long Context Performance (4K tokens @ 1000 concurrent)
| Metric | NVFP4 | BF16 | Improvement |
|---|---|---|---|
| Throughput | 585 TPS | 336 TPS | 1.7x faster |
| Time to First Token | 77.2s | 135.7s | 1.8x faster |
| End-to-End Latency | 110.5s | 202.5s | 1.8x faster |
## One Command to Rule Them All
```bash
docker run --gpus all -p 8000:8000 mutazai/falcon-h1r-7b-nvfp4:latest
```
That's it. No configuration. No tuning. No headaches.
Everything is baked in:
- NVFP4 quantization with FlashInfer-CUTLASS backend
- FP8 KV cache for reduced memory bandwidth
- Prefix caching enabled
- CUDA graphs with full piecewise compilation
- 65K context window support
- 256 max concurrent sequences
## OpenAI-Compatible API
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="falcon-h1r-7b-nvfp4",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=500,
)

print(response.choices[0].message.content)
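For long answers you can also stream tokens as they are generated. A minimal sketch reusing the `client` and model name from the example above; `stream=True` is standard in the OpenAI Python client and supported by vLLM's OpenAI-compatible server:

```python
# Stream the response chunk-by-chunk instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="falcon-h1r-7b-nvfp4",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=500,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```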
## Model Details
| Attribute | Value |
|---|---|
| Base Model | tiiuae/Falcon-H1R-7B |
| Architecture | Hybrid Transformers + Mamba2 |
| Parameters | 7.5B |
| Context Length | 65,536 tokens |
| Quantization | NVFP4 (4-bit floating point) |
| Compression | 2.8x (16 GB → 5.1 GB) |
| Target Hardware | NVIDIA Blackwell (GB10, GB100, GB200) |
## Technical Specifications
### Quantization Details
- Algorithm: NVFP4 with two-level scaling (per-block + per-tensor)
- Weight Format: 4-bit floating point (E2M1 with dynamic scaling)
- KV Cache: FP8 for reduced memory bandwidth
- Excluded Layers: `embed_tokens`, `lm_head` (kept in BF16)
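To make the two-level scaling described above concrete, here is a rough NumPy sketch of the idea. It is an illustration only, not the ModelOpt kernel: it assumes 16-element blocks, the E2M1 value grid, and a simple max-based choice of scales.

```python
import numpy as np

# Non-negative magnitudes representable in E2M1 (4-bit float); the sign is a separate bit.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK_SIZE = 16  # NVFP4 scales weights per small block as well as per tensor

def fake_quantize_nvfp4(w: np.ndarray) -> np.ndarray:
    """Quantize-dequantize a 1-D weight vector with two-level scaling."""
    blocks = w.reshape(-1, BLOCK_SIZE)
    tensor_scale = np.abs(blocks).max() / 6.0                      # level 1: per-tensor (FP32)
    block_scales = np.abs(blocks).max(axis=1, keepdims=True) / (6.0 * tensor_scale)
    block_scales = np.where(block_scales == 0, 1.0, block_scales)  # level 2: per-block
    scaled = blocks / (block_scales * tensor_scale)                # values now fall in [-6, 6]
    # Snap each value to the nearest representable E2M1 magnitude, keeping its sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    fp4 = np.sign(scaled) * E2M1_GRID[idx]
    # (In the real format the per-block scales are stored in FP8 E4M3; here they stay in FP32.)
    return (fp4 * block_scales * tensor_scale).reshape(w.shape)

w = np.random.randn(64).astype(np.float32)
print("max abs error:", np.abs(w - fake_quantize_nvfp4(w)).max())
```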
### Optimization Stack

- Base Image: `mutazai/optimized-vllm-sota-cuda13:3.0`
- vLLM Version: 0.13.0 with FlashInfer-CUTLASS
- CUDA: 13.0 (Blackwell optimized)
- PyTorch: 2.10
### Runtime Configuration (Baked In)

```
--quantization modelopt_fp4
--kv-cache-dtype fp8
--enable-prefix-caching
--max-model-len 65536
--max-num-seqs 256
--gpu-memory-utilization 0.88
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'
```
## Benchmark Methodology
All benchmarks performed with:
- Unique prompts per request (no cache hits from repeated prompts)
- Prefix caching enabled (KV cache reuse for common prefixes)
- FP8 KV cache (reduced memory bandwidth)
- Concurrency levels: 100, 250, 500, 1000
- Prompt sizes: 100, 1K, 4K, 16K tokens
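For reproduction, a harness along these lines can drive the server at a given concurrency level. This is an assumed sketch using the async OpenAI client against the endpoint above, not the exact script that produced the tables:

```python
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def one_request(i: int) -> tuple[float, float]:
    """Send one unique prompt; return (time-to-first-token, end-to-end latency) in seconds."""
    start = time.perf_counter()
    ttft = None
    stream = await client.chat.completions.create(
        model="falcon-h1r-7b-nvfp4",
        messages=[{"role": "user", "content": f"Request {i}: summarize NVFP4 quantization."}],
        max_tokens=128,
        stream=True,
    )
    async for _chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first streamed chunk marks TTFT
    end = time.perf_counter() - start
    return (ttft if ttft is not None else end), end

async def main(concurrency: int = 100) -> None:
    results = await asyncio.gather(*(one_request(i) for i in range(concurrency)))
    ttfts, e2es = zip(*results)
    print(f"concurrency={concurrency}  mean TTFT={sum(ttfts)/len(ttfts):.2f}s  "
          f"mean E2E={sum(e2es)/len(e2es):.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
```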
## Docker Hub
Pull the image:

```bash
docker pull mutazai/falcon-h1r-7b-nvfp4:latest
```

Run with custom port:

```bash
docker run --gpus all -p 9000:8000 mutazai/falcon-h1r-7b-nvfp4:latest
```

Check health:

```bash
curl http://localhost:8000/health
```
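Beyond the health endpoint, you can confirm the served model name via the standard OpenAI-compatible model listing. A quick sketch, assuming the default port mapping:

```python
import requests

# List the models exposed by the running container via the OpenAI-compatible API.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # expect "falcon-h1r-7b-nvfp4"
```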
## The Quantization Journey
Quantizing Falcon H1R required solving several challenges unique to hybrid Mamba-Transformer architectures:
### Challenges Overcome
| Challenge | Solution |
|---|---|
| vLLM Falcon H1R loader missing NVFP4 support | Patched `falcon_h1.py` with NVFP4 weight loading (weight unpacking from the packed FP4 format) |
| W4A4 activation quantization failures | Switched to W4A16 (weight-only quantization); Mamba SSM layers produce zero-scale activations |
| `config.json` format incompatible | Fixed `quant_method: "modelopt"` → `"nvfp4"`, added `format: "nvfp4"` field |
| `hf_quant_config.json` wrong algo | Changed `kv_cache_quant_algo: "INT8"` → `"FP8"` to match the KV cache dtype |
| transformers 4.55+ incompatible | Pinned to `transformers==4.48.0` for `HybridMambaAttentionDynamicCache` compatibility |
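As an example, the `config.json` fix in the table above amounts to rewriting two fields of the checkpoint's quantization config. A hedged sketch; the exact nesting of the quantization block in the exported checkpoint and the local path are assumptions:

```python
import json
from pathlib import Path

cfg_path = Path("falcon-h1r-7b-nvfp4/config.json")  # hypothetical local checkpoint directory
cfg = json.loads(cfg_path.read_text())

# Point vLLM at the NVFP4 scheme (assumed key layout under "quantization_config").
quant = cfg.setdefault("quantization_config", {})
quant["quant_method"] = "nvfp4"   # was "modelopt"
quant["format"] = "nvfp4"         # added so the format is recognized
cfg_path.write_text(json.dumps(cfg, indent=2))
```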
### Key Code Changes
Weight Loader Patch (`falcon_h1.py`):

- Added NVFP4 weight unpacking for linear layers
- Handled packed FP4 tensors with two-level scaling
- Excluded Mamba conv1d and SSM layers from quantization
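As a rough illustration of what unpacking packed FP4 weights involves, here is a hypothetical NumPy sketch. The real patch operates on torch tensors inside vLLM's loader and handles layout details not shown here; the nibble order and scale layout below are assumptions:

```python
import numpy as np

# E2M1 decode table indexed by the 4-bit code: bit 3 is the sign, bits 0-2 select the magnitude.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                        -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def unpack_fp4_weights(packed: np.ndarray, block_scales: np.ndarray,
                       tensor_scale: float, block_size: int = 16) -> np.ndarray:
    """Decode uint8 storage (two FP4 codes per byte) back to float weights."""
    low = packed & 0x0F            # assumed: first element lives in the low nibble
    high = (packed >> 4) & 0x0F
    codes = np.stack([low, high], axis=-1).reshape(-1)
    values = E2M1_VALUES[codes]
    # Apply the two-level scaling: one scale per block of weights, one per tensor.
    scales = np.repeat(block_scales, block_size)
    return values * scales * tensor_scale

packed = np.frombuffer(np.random.bytes(16), dtype=np.uint8)   # 16 bytes -> 32 FP4 weights
weights = unpack_fp4_weights(packed, block_scales=np.array([0.7, 1.3]), tensor_scale=0.05)
print(weights.shape)  # (32,)
```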
Quantization Config:

```python
# Used W4A16 weight-only (not W4A4 full quantization)
WEIGHT_ONLY_CONFIG = {
    "*weight_quantizer": {"enable": True, "num_bits": (2, 1)},
    "*input_quantizer": {"enable": False},  # Critical for Mamba
    "*output_quantizer": {"enable": False},
    "*lm_head*": {"enable": False},
    "*embed_tokens*": {"enable": False},
}
```
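For orientation, a config like this is normally handed to ModelOpt's post-training quantization entry point together with a calibration pass. The wrapping dict, the tiny calibration prompts, and the omitted checkpoint-export step below are assumptions, not the exact script used for this model:

```python
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-H1R-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="cuda"
)

# Assumed wrapping of the quantizer rules from the snippet above into a full ModelOpt config.
quant_config = {"quant_cfg": WEIGHT_ONLY_CONFIG, "algorithm": "max"}

def forward_loop(m):
    # Tiny calibration pass; a real run would use a few hundred representative prompts.
    for prompt in ["Explain NVFP4 quantization.", "Write a haiku about falcons."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

model = mtq.quantize(model, quant_config, forward_loop)  # weights become fake-quantized NVFP4
```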
vLLM Runtime Flags:

- `--quantization modelopt_fp4` (not `modelopt`)
- `--kv-cache-dtype fp8` (must match the config)
- `--trust-remote-code` (for Mamba2 support)
### Lessons Learned
- Mamba hybrid models need W4A16: Full W4A4 (activation quantization) causes zero-scale errors in SSM layers
- Config format matters: vLLM is very strict about the `quant_method` and `format` fields
- transformers version: the hybrid cache implementation changed in 4.55+, breaking Mamba-Transformer models
## Credits
- Original Model: Technology Innovation Institute (TII), Abu Dhabi, UAE
- Optimization & Quantization: Mutaz Al Awamleh, Dubai, UAE
- Docker Hub: `mutazai/falcon-h1r-7b-nvfp4`
- Quantization Tool: NVIDIA ModelOpt with NVFP4
- Serving Stack: vLLM 0.13.0 with FlashInfer-CUTLASS backend
## License
Apache 2.0 - Same as the original Falcon H1R model.