Open to Work

Sean Li PRO

Hellohal2064

AI & ML interests

AI Infrastructure Engineer | Dual DGX Sparks (230GB VRAM) | 5-node Docker Swarm | Building AI Coworker systems

Recent Activity

posted an update about 23 hours ago

🚀 vLLM Docker Image for NVIDIA DGX Spark (GB10/SM121) Just released a pre-built vLLM Docker image optimized for DGX Spark's ARM64 + Blackwell SM121 GPU. **Why this exists:** Standard vLLM images don't support SM121 - you get "SM121 not supported" errors. This image includes patches for full GB10 compatibility. **What's included:** - vLLM 0.15.0 + SM121 patches - PyTorch 2.11 + CUDA 13.0 - ARM64 (aarch64) native - Pre-configured for FlashInfer attention **Verified models:** - Qwen3-Next-80B-A3B-FP8 (1M context!) - Qwen3-Embedding-8B (4096-dim embeddings) - Qwen3-VL-30B (vision) docker pull https://hub.docker.com/r/hellohal2064/vllm-dgx-spark-gb10

reacted to their post with 🔥 30 days ago

🚀 Excited to share: The vLLM container for NVIDIA DGX Spark! I've been working on getting vLLM to run natively on the new DGX Spark with its GB10 Blackwell GPU (SM121 architecture). The results? 2.5x faster inference compared to llama.cpp! 📊 Performance Highlights: • Qwen3-Coder-30B: 44 tok/s (vs 21 tok/s with llama.cpp) • Qwen3-Next-80B: 45 tok/s (vs 18 tok/s with llama.cpp) 🔧 Technical Challenges Solved: • Built PyTorch nightly with CUDA 13.1 + SM121 support • Patched vLLM for Blackwell architecture • Created custom MoE expert configs for GB10 • Implemented TRITON_ATTN backend workaround 📦 Available now: • Docker Hub: docker pull hellohal2064/vllm-dgx-spark-gb10:latest • HuggingFace: huggingface.co/Hellohal2064/vllm-dgx-spark-gb10 The DGX Spark's 119GB unified memory opens up possibilities for running massive models locally. Happy to connect with others working on the DGX Spark Blackwell!

replied to their post 30 days ago

View all activity

Organizations

posted an update about 23 hours ago

Post

161

🚀 vLLM Docker Image for NVIDIA DGX Spark (GB10/SM121)

Just released a pre-built vLLM Docker image optimized for DGX Spark's ARM64 + Blackwell SM121 GPU.

**Why this exists:**
Standard vLLM images don't support SM121 - you get "SM121 not supported" errors. This image includes patches for full GB10 compatibility.

**What's included:**
- vLLM 0.15.0 + SM121 patches
- PyTorch 2.11 + CUDA 13.0
- ARM64 (aarch64) native
- Pre-configured for FlashInfer attention

**Verified models:**
- Qwen3-Next-80B-A3B-FP8 (1M context!)
- Qwen3-Embedding-8B (4096-dim embeddings)
- Qwen3-VL-30B (vision)

docker pull
https://hub.docker.com/r/hellohal2064/vllm-dgx-spark-gb10

reacted to their post with 🔥 30 days ago

Post

1611

🚀 Excited to share: The vLLM container for NVIDIA DGX Spark!

I've been working on getting vLLM to run natively on the new DGX Spark with its GB10 Blackwell GPU (SM121 architecture). The results? 2.5x faster inference compared to llama.cpp!

📊 Performance Highlights:
• Qwen3-Coder-30B: 44 tok/s (vs 21 tok/s with llama.cpp)
• Qwen3-Next-80B: 45 tok/s (vs 18 tok/s with llama.cpp)

🔧 Technical Challenges Solved:
• Built PyTorch nightly with CUDA 13.1 + SM121 support
• Patched vLLM for Blackwell architecture
• Created custom MoE expert configs for GB10
• Implemented TRITON_ATTN backend workaround

📦 Available now:
• Docker Hub: docker pull hellohal2064/vllm-dgx-spark-gb10:latest
• HuggingFace: huggingface.co/Hellohal2064/vllm-dgx-spark-gb10

The DGX Spark's 119GB unified memory opens up possibilities for running massive models locally. Happy to connect with others working on the DGX Spark Blackwell!

4 replies

replied to their post 30 days ago

I am US CST time. you can reach out to me at 971-708-9761. My AI system will ask you to share what your calling about, just say DGX spark AI.

replied to their post 30 days ago

Please try it out :)let me know if you run into any problems. I will most likely be uploading a new image sometime this week. Working on some other improvements around the qwen-next Models

posted an update about 1 month ago

Post

1611

🚀 Excited to share: The vLLM container for NVIDIA DGX Spark!

I've been working on getting vLLM to run natively on the new DGX Spark with its GB10 Blackwell GPU (SM121 architecture). The results? 2.5x faster inference compared to llama.cpp!

📊 Performance Highlights:
• Qwen3-Coder-30B: 44 tok/s (vs 21 tok/s with llama.cpp)
• Qwen3-Next-80B: 45 tok/s (vs 18 tok/s with llama.cpp)

🔧 Technical Challenges Solved:
• Built PyTorch nightly with CUDA 13.1 + SM121 support
• Patched vLLM for Blackwell architecture
• Created custom MoE expert configs for GB10
• Implemented TRITON_ATTN backend workaround

📦 Available now:
• Docker Hub: docker pull hellohal2064/vllm-dgx-spark-gb10:latest
• HuggingFace: huggingface.co/Hellohal2064/vllm-dgx-spark-gb10

The DGX Spark's 119GB unified memory opens up possibilities for running massive models locally. Happy to connect with others working on the DGX Spark Blackwell!

4 replies

Sean Li PRO

AI & ML interests

Recent Activity

Organizations

Hellohal2064's activity