🦌 The ELK Meets the Falcon πŸ¦…

Falcon H1R 7B NVFP4

✨ Made in Abu Dhabi, Accelerated in Dubai ✨


πŸš€ 2x Faster β€’ πŸ“¦ 3x Smaller β€’ πŸ’° 6x Hardware Savings β€’ 🎯 <1% Accuracy Loss




🏜️ The Speed of Light, The Wisdom of the Desert

When Abu Dhabi's Technology Innovation Institute created the majestic Falcon H1R, they built a hybrid beast - part Transformer, part Mamba2 - capable of soaring through 65,536 tokens of context.

But even falcons need to fly faster.

Mutaz Al Awamleh took this desert marvel and supercharged it with NVIDIA's cutting-edge NVFP4 quantization: 4-bit floating point with two-level scaling. The result? A model that runs up to nearly 2x faster under heavy concurrency while using almost 3x less memory.


πŸ“Š Performance That Speaks for Itself

Benchmarked on NVIDIA DGX Spark GB10 with 1000 concurrent requests:

| Metric | NVFP4 | BF16 | Improvement |
|---|---|---|---|
| Throughput | 879 TPS | 681 TPS | 1.3x faster |
| Time to First Token | 45.7s | 60.4s | 1.3x faster |
| End-to-End Latency | 71.1s | 93.8s | 1.3x faster |
| Memory Usage | 5.1 GiB | ~14 GiB | 2.8x smaller |
| Accuracy Loss | - | - | <1% |

Long Context Performance (4K tokens @ 1000 concurrent)

| Metric | NVFP4 | BF16 | Improvement |
|---|---|---|---|
| Throughput | 585 TPS | 336 TPS | 1.7x faster |
| Time to First Token | 77.2s | 135.7s | 1.8x faster |
| End-to-End Latency | 110.5s | 202.5s | 1.8x faster |

⚑ One Command to Rule Them All

docker run --gpus all -p 8000:8000 mutazai/falcon-h1r-7b-nvfp4:latest

That's it. No configuration. No tuning. No headaches.

Everything is baked in:

  • βœ… NVFP4 quantization with FlashInfer-CUTLASS backend
  • βœ… FP8 KV cache for reduced memory bandwidth
  • βœ… Prefix caching enabled
  • βœ… CUDA graphs with full piecewise compilation
  • βœ… 65K context window support
  • βœ… 256 max concurrent sequences
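
Once the container is up, a quick way to confirm the defaults took effect is to list the served models over the OpenAI-compatible API. A minimal smoke-test sketch, assuming the default port mapping and that the served model name matches the one used in the API example below:

```python
import requests

# List the models served by the container's OpenAI-compatible endpoint.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
models = [m["id"] for m in resp.json()["data"]]
print("Served models:", models)  # expect "falcon-h1r-7b-nvfp4" (assumed served name)
```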

πŸ”Œ OpenAI-Compatible API

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="falcon-h1r-7b-nvfp4",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=500
)

print(response.choices[0].message.content)
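
Streaming works through the same client; a minimal variation of the example above (same assumed base URL and model name):

```python
# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="falcon-h1r-7b-nvfp4",
    messages=[
        {"role": "user", "content": "Summarize the Mamba2 architecture in two sentences."}
    ],
    max_tokens=200,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```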

πŸ¦… Model Details

| Attribute | Value |
|---|---|
| Base Model | tiiuae/Falcon-H1R-7B |
| Architecture | Hybrid Transformer + Mamba2 |
| Parameters | 7.5B |
| Context Length | 65,536 tokens |
| Quantization | NVFP4 (4-bit floating point) |
| Compression | 2.8x (16 GB β†’ 5.1 GB) |
| Target Hardware | NVIDIA Blackwell (GB10, GB100, GB200) |

πŸ”§ Technical Specifications

Quantization Details

  • Algorithm: NVFP4 with two-level scaling (per-block + per-tensor)
  • Weight Format: 4-bit floating point (E2M1 with dynamic scaling)
  • KV Cache: FP8 for reduced memory bandwidth
  • Excluded Layers: embed_tokens, lm_head (kept in BF16)
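
Conceptually, two-level scaling means every stored 4-bit value is recovered by multiplying it with a small per-block scale and a single per-tensor scale. The sketch below is illustrative only; the block size of 16 and the scale dtypes follow NVIDIA's published NVFP4 format and are not code from this repository:

```python
import numpy as np

# Illustrative NVFP4 dequantization: w β‰ˆ q * s_block * s_tensor
#   q        : decoded 4-bit E2M1 magnitudes (grid: 0, 0.5, 1, 1.5, 2, 3, 4, 6)
#   s_block  : one scale per block of 16 weights (stored as FP8 E4M3 on device)
#   s_tensor : a single FP32 scale for the whole tensor
BLOCK = 16  # assumed NVFP4 micro-block size

def dequantize_nvfp4(q, s_block, s_tensor):
    q = np.asarray(q, dtype=np.float32).reshape(-1, BLOCK)
    return (q * s_block[:, None].astype(np.float32) * np.float32(s_tensor)).reshape(-1)

# Toy example: 32 decoded FP4 values, two block scales, one tensor scale.
q = np.random.choice([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], size=32)
s_block = np.array([0.02, 0.05])
print(dequantize_nvfp4(q, s_block, s_tensor=1.3)[:4])
```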

Optimization Stack

  • Base Image: mutazai/optimized-vllm-sota-cuda13:3.0
  • vLLM Version: 0.13.0 with FlashInfer-CUTLASS
  • CUDA: 13.0 (Blackwell optimized)
  • PyTorch: 2.10

Runtime Configuration (Baked In)

--quantization modelopt_fp4
--kv-cache-dtype fp8
--enable-prefix-caching
--max-model-len 65536
--max-num-seqs 256
--gpu-memory-utilization 0.88
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'
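
For reference, roughly the same setup can be expressed through vLLM's offline Python API. This is a hedged sketch rather than the container's actual entrypoint; the keyword arguments mirror the flags above, and the CUDA-graph compilation config is omitted because its plumbing varies between vLLM versions:

```python
from vllm import LLM, SamplingParams

# Approximate offline equivalent of the baked-in server flags above.
llm = LLM(
    model="cybermotaz/Falcon-H1R-7B-NVFP4",  # Hugging Face repo for this quantized model
    quantization="modelopt_fp4",
    kv_cache_dtype="fp8",
    enable_prefix_caching=True,
    max_model_len=65536,
    max_num_seqs=256,
    gpu_memory_utilization=0.88,
    trust_remote_code=True,  # needed for the hybrid Mamba2 layers
)

out = llm.generate(["Explain NVFP4 in one sentence."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```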

πŸ“ˆ Benchmark Methodology

All benchmarks performed with:

  • Unique prompts per request (no cache hits from repeated prompts)
  • Prefix caching enabled (KV cache reuse for common prefixes)
  • FP8 KV cache (reduced memory bandwidth)
  • Concurrency levels: 100, 250, 500, 1000
  • Prompt sizes: 100, 1K, 4K, 16K tokens
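
The exact harness is not published here, but the methodology above can be approximated with a short asyncio script against the OpenAI-compatible endpoint. A minimal sketch; the prompt generator, concurrency level, and the chunks-as-tokens approximation are placeholders, not the original benchmark code:

```python
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
CONCURRENCY = 100  # the published sweeps used 100, 250, 500, and 1000

async def one_request(i: int):
    # A unique prompt per request, so repeated text never hits the prefix cache.
    prompt = f"[request {i}] Describe use case number {i} for long-context models."
    start = time.perf_counter()
    ttft, chunks = None, 0
    stream = await client.chat.completions.create(
        model="falcon-h1r-7b-nvfp4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first token
            chunks += 1
    return ttft, time.perf_counter() - start, chunks

async def main():
    wall_start = time.perf_counter()
    results = await asyncio.gather(*(one_request(i) for i in range(CONCURRENCY)))
    wall = time.perf_counter() - wall_start
    ttfts = [t for t, _, _ in results if t is not None]
    print(f"mean TTFT : {sum(ttfts) / len(ttfts):.1f} s")
    print(f"mean E2E  : {sum(e for _, e, _ in results) / len(results):.1f} s")
    print(f"approx TPS: {sum(c for _, _, c in results) / wall:.0f} (stream chunks as a token proxy)")

asyncio.run(main())
```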

🐳 Docker Hub

Pull the image:

docker pull mutazai/falcon-h1r-7b-nvfp4:latest

Run with custom port:

docker run --gpus all -p 9000:8000 mutazai/falcon-h1r-7b-nvfp4:latest

Check health:

curl http://localhost:8000/health
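
Model weights take a while to load on first start, so it helps to poll the health endpoint before sending traffic. A minimal readiness loop, assuming the default port mapping:

```python
import time
import requests

# Poll /health until vLLM has finished loading the model (up to ~10 minutes).
for _ in range(120):
    try:
        if requests.get("http://localhost:8000/health", timeout=5).status_code == 200:
            print("Server is ready.")
            break
    except requests.RequestException:
        pass  # container still starting up
    time.sleep(5)
else:
    raise RuntimeError("Server did not become healthy in time.")
```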

πŸ—ΊοΈ The Quantization Journey

Quantizing Falcon H1R required solving several challenges unique to hybrid Mamba-Transformer architectures:

Challenges Overcome

| Challenge | Solution |
|---|---|
| vLLM's Falcon H1R loader lacked NVFP4 support | Patched falcon_h1.py with NVFP4 weight loading (unpacking weights from the packed FP4 format) |
| W4A4 activation quantization failures | Switched to W4A16 (weight-only quantization); the Mamba SSM layers produce zero-scale activations |
| Incompatible config.json format | Fixed quant_method: "modelopt" β†’ "nvfp4" and added the format: "nvfp4" field |
| Wrong KV-cache algorithm in hf_quant_config.json | Changed kv_cache_quant_algo: "INT8" β†’ "FP8" to match the KV cache dtype |
| Incompatibility with transformers 4.55+ | Pinned transformers==4.48.0 for HybridMambaAttentionDynamicCache compatibility |
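
The two config fixes in the table above are plain JSON edits. A hedged sketch of how they could be applied; the field names come from the table, while the local directory name and the exact nesting inside hf_quant_config.json are assumptions:

```python
import json
from pathlib import Path

model_dir = Path("Falcon-H1R-7B-NVFP4")  # assumed local checkpoint directory

# config.json: vLLM expects quant_method "nvfp4" plus an explicit format field.
cfg_path = model_dir / "config.json"
cfg = json.loads(cfg_path.read_text())
qc = cfg.get("quantization_config", {})
qc["quant_method"] = "nvfp4"  # was "modelopt"
qc["format"] = "nvfp4"
cfg["quantization_config"] = qc
cfg_path.write_text(json.dumps(cfg, indent=2))

# hf_quant_config.json: KV-cache algorithm must match --kv-cache-dtype fp8.
hq_path = model_dir / "hf_quant_config.json"
hq = json.loads(hq_path.read_text())
hq["quantization"]["kv_cache_quant_algo"] = "FP8"  # was "INT8"; nesting assumed
hq_path.write_text(json.dumps(hq, indent=2))
```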

Key Code Changes

  1. Weight Loader Patch (falcon_h1.py):

    • Added NVFP4 weight unpacking for linear layers
    • Handled packed FP4 tensors with two-level scaling
    • Excluded Mamba conv1d and SSM layers from quantization
  2. Quantization Config:

    # Used W4A16 weight-only (not W4A4 full quantization)
    WEIGHT_ONLY_CONFIG = {
        "*weight_quantizer": {"enable": True, "num_bits": (2, 1)},
        "*input_quantizer": {"enable": False},  # Critical for Mamba
        "*output_quantizer": {"enable": False},
        "*lm_head*": {"enable": False},
        "*embed_tokens*": {"enable": False}
    }
    
  3. vLLM Runtime Flags:

    • --quantization modelopt_fp4 (not modelopt)
    • --kv-cache-dtype fp8 (must match config)
    • --trust-remote-code (for Mamba2 support)

Lessons Learned

  • Mamba hybrid models need W4A16: Full W4A4 (activation quantization) causes zero-scale errors in SSM layers
  • Config format matters: vLLM is very strict about quant_method and format fields
  • transformers version: Hybrid cache implementation changed in 4.55+, breaking Mamba-Transformer models

πŸ™ Credits

  • πŸ¦… Original Model: Technology Innovation Institute (TII) - Abu Dhabi, UAE πŸ‡¦πŸ‡ͺ
  • 🦌 Optimization & Quantization: Mutaz Al Awamleh - Dubai, UAE πŸ‡¦πŸ‡ͺ
  • 🐳 Docker Hub: mutazai/falcon-h1r-7b-nvfp4
  • πŸ”§ Quantization Tool: NVIDIA ModelOpt with NVFP4
  • ⚑ Serving Stack: vLLM 0.13.0 with FlashInfer-CUTLASS backend

πŸ“œ License

Apache 2.0 - Same as the original Falcon H1R model.


🏜️ From the sands of Abu Dhabi to the clouds of Dubai ☁️

πŸ¦… The Falcon flies faster than ever before ⚑


   🦌 + πŸ¦… = πŸš€

   ELK + FALCON = SPEED

🐳 mutazai/falcon-h1r-7b-nvfp4
