Hugging Face
Felix
numb3r3
12 followers · 25 following
AI & ML interests
None yet
Recent Activity
reacted to JonnaMat's post about 24 hours ago
⚡ FlashHead: Fast LM Head Inference - Now a Simple vLLM Plugin

flash-head replaces the dense LM head with a two-stage retrieval pipeline - up to 2x inference speedup, training-free. Previously required custom Docker images; now it's just:

```
pip install flash-head
vllm serve embedl/Qwen3-1.7B-FlashHead-W4A16
```

✨ The plugin activates automatically via vLLM's `vllm.general_plugins` entry point. No source patches, no custom imports.

🧩 Supported models (full collection): Qwen3 (https://huggingface.co/Qwen), Llama3 (https://huggingface.co/meta-llama), Gemma3 (https://huggingface.co/google), Cosmos-Reason2 (https://huggingface.co/nvidia) - BF16 and W4A16 variants.
https://huggingface.co/collections/embedl/flashhead
https://huggingface.co/spaces/embedl/Edge-Inference-Benchmarks

Benchmark it yourself:

```
vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1

# Baseline comparison
FLASHHEAD_ENABLED=0 vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1
```

FlashHead shines at low batch sizes; the typical real-time / on-device use case.
Organizations
models
0
None public yet
datasets
1
numb3r3/embeddings
Updated Dec 6, 2023 • 3