Instructions to use StableQuant/Qwen-Templates-Rebuild-Project with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use StableQuant/Qwen-Templates-Rebuild-Project with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="StableQuant/Qwen-Templates-Rebuild-Project")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("StableQuant/Qwen-Templates-Rebuild-Project", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use StableQuant/Qwen-Templates-Rebuild-Project with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "StableQuant/Qwen-Templates-Rebuild-Project" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "StableQuant/Qwen-Templates-Rebuild-Project", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/StableQuant/Qwen-Templates-Rebuild-Project
- SGLang
How to use StableQuant/Qwen-Templates-Rebuild-Project with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "StableQuant/Qwen-Templates-Rebuild-Project" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "StableQuant/Qwen-Templates-Rebuild-Project", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "StableQuant/Qwen-Templates-Rebuild-Project" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "StableQuant/Qwen-Templates-Rebuild-Project", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use StableQuant/Qwen-Templates-Rebuild-Project with Docker Model Runner:
docker model run hf.co/StableQuant/Qwen-Templates-Rebuild-Project
Hermes Tool Loops in v1.1.5
This is the discussion about the Tool Loops that occur in v1.1.5 when using Hermes. OpenHands and OpenCode seem to be unaffected. I think its a Harnessspecific thing. v.1.1.5 uses a robust tool-call error reccovery logic, it shouldnt happen but since it still does, I will look into this manually (installing Hermes myself and test)
So far, reported was tool calling loops and also cron tool call loops.
I was testing in hermes. With 1.1.5 hermes cannot write code to local file - timeouts 100% (15 tries out of 15). It can generate and display code, but probably cannot use some specific tools like write file or so.
Streaming, reasoning does not help (in logs - LLM finish task in 12s, hermes timeouts after 300s so clearly response lost somewhere)
I am using this with Opencode and latest VLLM and Qwen3.6 27b and noticing that sometimes it stops abruptly.
When it does from the text it looks its about to call some tool but it never does and just stops instead ...
@ABLomas
Did you noticed this on specific code or on random files happening?
I ask because I discovered an error yesterday myself, it coded fine for hours in OpenHands but when using a certain code part it becames stale. Probably the same reason.
Could you tell me if you used other templates sucessful managing this part, like froggeric v16 for example or did it happens with every template?
Ok, i spun up my own Hermes instance now. The tool calling is indeed totally broken with current template version, independed of editing code files. Im working on this now. No further information needed (but you can still post it if you want).
My setup:--host 0.0.0.0 -fa 1 --fit-ctx 262144 --min-p 0.0 --fit 1 -b 2048 -ub 512 --no-mmap -ctk q8_0 -ctv q8_0 --jinja -m Qwen3.6-35B-A3B-IQ4_XS-4.15bpw.gguf --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 0.0 --repeat-penalty 1.0 --chat-template-kwargs "{\"preserve_thinking\":true}" --no-mmproj -np 1 --alias Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --reasoning-budget 4096 --metrics --reasoning-budget-message "[SYSTEM ALERT: Reasoning budget exceeded. I am stuck in a loop or overcomplicating. I must stop IMMEDIATELY and use the ask_followup_question tool to notify the user and ask for guidance.]" --chat-template-file qwen3.6_chat_template.txt -to 900
Model - https://huggingface.co/byteshape/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-IQ4_XS-4.15bpw.gguf
Hermes - v0.14.0 (2026.5.16)
Yesterday I tried installing version v1.1.5 on the Pi coding agent, tool seemed to be calling incorrectly. I don't know why. 😁
I had to revert to froggeric's v19
@StableQuant i just wanna say that froggeric's v19 template is the best for Hermes, v16 has loops as well.
@StableQuant I found a template that works with hermes and opencode without any issues so far - https://gist.github.com/fakezeta/9e8e039c60332fcb143c6e805558afe0
Maybe, it can help you to enhance your template.
Hi @szwedek thank you for posting it.
From the first look it seems indeed to be a clean template.
About my current process:
I came to the tentatively conclusion that this whole problem is not one a simple chat template can solve.
Its about language incompatibility basically.
Qwen was trained on normal text and heavily on XML based structured text.
Current Tool like Hermes expect it so use the JSON based OpenAI standard for tool calling.
The problem with Qwen is, its not trained on that. Its trained on XML. It will work, to some degree if you tell it explicitely to use JSON (like Hermes system prompt does) but as soon as you use thinking(Alibaba stated structured output in thinking mode is not supported) or experience high context load it falls back into its trained behaviour to output XML.
There are approaches to pack JSON tool calls into XML Tags which seem to be sucessfull to some point, but doesnt seem to fix it completly since JSON is hard for LLMs(complex brittle structure) to generate unlike XML(easy simple structure), even more when they are trained on mostly XML.
The template you linked tells the llm to make its tool calls XML based, which will work with vllm, which has its own "translation logic" built in when using a certain qwen3 xml parsing switch.
But for llama.cpp I expect it to become unstable as well under high context load.
My current thinking process goes away from a simple chat template solution but more into a chat template + middleware solution. If Qwen wants to speak XML but Tools want JSON, then why not just give both of them what they want?!
So, its more of a ecosystem incompatibility and also a geograpic/political dimension between Qwen and OpenAI when you think about it.
My current approach goes into a middleware + a clean Qwen Chat Template with XML that you can host for example with docker and does the translation process in miliseconds, not noticably to the user.
This would solve it but im still in early experimentation phases.
For the chat template you linked I expect it to also become unstable at some point since Hermes uses tool call IDs which Qwen natively doesnt uses and dont understands and begins to hallucinate them later at some point which confuses Hermes then. So the template might be stable to some low load usecases but probably become unstable as soon as you put high load work on it. But I might be wrong, keep me updated what your mileage is.
So to finalize my post: This whole template thing is a true rabbithole and its more than just a simple "non-deterministic to deterministic programmatic" question but rather about ecosystem compatibility.
For me the opposite is true. vllm and sglang have a worse chat template implementation.
I never was able to make vllm work properly with qwen 2.6, no matter the template. Either the thinking gets broken, or its stops abruptly or tool calls are wrong. Various different chat template issues.
With latest llama and this config Qwen3.6 27B MTP works perfectly for me:
/app/llama-server \
--hf-repo $MODEL_REPO \
--hf-file $MODEL_FILE \
--port 8000 \
--alias Qwen3.6-27B \
--jinja \
--ctx-size 262144 \
-ngl 99 \
--flash-attn on \
-ctk q8_0 -ctv q8_0 \
--cont-batching \
--parallel 3 \
--batch-size 4096 \
--metrics \
--threads 4 \
--mlock \
--no-mmap \
--spec-type draft-mtp \
--spec-draft-n-max 3
@meualsan
I see. But did you tried the newest corresponding qwen parser flags with vllm and sglang?
Also to note, even if you use these flags, it doenst fixes dev role handling, tool calls inside thinking tags etc which is a flaw in the original Qwen template.
So you would need flags(to do XML to JSON translation) aswell as a fixed Qwen template with vllm and sglang.
For vllm for example there is the --tool-call-parser hermes flag for hermes.
Also sglang and vllm use guided decoding in their backend. Means: once a tool call is requested they force the model to output valid json via a predefined "library", unfitting tokens get rejected.
Which works better then just to tell the model to produce valid json, but also only until you reach high context load. It becomes unstable then aswell.
For llama.cpp there is currently no such thing at all.
llama.cpp seems to be faster with new integrations, for example I use a turboquant fork which expands KV-cache up to 8x vs Q4 cache, since two weeks, its a dream.
vllm doenst has this yet. From my understanding llama.cpp is more community driven and vllm and sglang is industry.
llama.cpp is more fluid and faster but vllm and sglang is business.
People get paid to include fixes there to make Qwen etc available to run in Hyperscaler and AI Clouds for stable business appliance.
With SGLang for example you could do multiuser usage on a single RTX3090 with a small Qwen modell and get combined decoding speeds in the 2,5k tokens/s. vs llama.cpp is mostly single user.
Up to 4 user its fine but any more gets slow really quickly. Total throughput a few hundred token/s vs 2-3k in Sglang with 16 users the same time.
Unfortunately for us VRAM poor xD thats not good news currently.
But It gets shared anyway. New stuff gets exchanged in both ways with time.