Hermes Tool Loops in v1.1.5

by StableQuant - opened 6 days ago

Discussion

StableQuant

Owner 6 days ago

This is the discussion about the tool loops that occur in v1.1.5 when using Hermes. OpenHands and OpenCode seem to be unaffected, so I think it's a harness-specific thing. v1.1.5 uses robust tool-call error recovery logic, so it shouldn't happen, but since it still does, I will look into this manually (installing Hermes myself and testing).

So far, tool-calling loops and cron tool-call loops have been reported.

ABLomas

5 days ago

I was testing in Hermes. With 1.1.5, Hermes cannot write code to a local file: it times out 100% of the time (15 tries out of 15). It can generate and display code, but probably cannot use some specific tools such as write file.
Streaming and reasoning settings do not help (in the logs, the LLM finishes the task in 12 s, while Hermes times out after 300 s, so the response is clearly lost somewhere).

meualsan

5 days ago • edited 5 days ago

I am using this with OpenCode, the latest vLLM, and Qwen3.6 27B, and I'm noticing that it sometimes stops abruptly.
When it does, the text looks like it is about to call some tool, but it never does and just stops instead...

StableQuant

Owner 5 days ago

@ABLomas
Did you notice this on specific code, or does it happen on random files?
I ask because I discovered an error myself yesterday: it coded fine for hours in OpenHands, but on a certain part of the code it went stale. Probably the same reason.
Could you tell me whether other templates, such as froggeric v16, handled this part successfully, or does it happen with every template?

StableQuant

Owner 5 days ago • edited 5 days ago

OK, I have spun up my own Hermes instance now. Tool calling is indeed totally broken with the current template version, independent of editing code files. I'm working on this now. No further information is needed (but you can still post it if you want).

szwedek

5 days ago

My setup:
--host 0.0.0.0 -fa 1 --fit-ctx 262144 --min-p 0.0 --fit 1 -b 2048 -ub 512 --no-mmap -ctk q8_0 -ctv q8_0 --jinja -m Qwen3.6-35B-A3B-IQ4_XS-4.15bpw.gguf --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 0.0 --repeat-penalty 1.0 --chat-template-kwargs "{\"preserve_thinking\":true}" --no-mmproj -np 1 --alias Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --reasoning-budget 4096 --metrics --reasoning-budget-message "[SYSTEM ALERT: Reasoning budget exceeded. I am stuck in a loop or overcomplicating. I must stop IMMEDIATELY and use the ask_followup_question tool to notify the user and ask for guidance.]" --chat-template-file qwen3.6_chat_template.txt -to 900

Model - https://huggingface.co/byteshape/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-IQ4_XS-4.15bpw.gguf

Hermes - v0.14.0 (2026.5.16)

tooltd

4 days ago

Yesterday I tried installing v1.1.5 on the Pi coding agent, and tools seemed to be called incorrectly. I don't know why. 😁
I had to revert to froggeric's v19.

szwedek

4 days ago

@StableQuant I just want to say that froggeric's v19 template is the best for Hermes; v16 has loops as well.

szwedek

1 day ago

@StableQuant I found a template that works with Hermes and OpenCode without any issues so far: https://gist.github.com/fakezeta/9e8e039c60332fcb143c6e805558afe0
Maybe it can help you improve your template.

herstrabol

1 day ago • edited 1 day ago

Hi @szwedek, thank you for posting it.
At first glance it does indeed seem to be a clean template.

About my current process:
I have tentatively come to the conclusion that this whole problem is not one a simple chat template can solve.
It's basically about language incompatibility.

Qwen was trained on normal text and heavily on XML-based structured text.
Current tools like Hermes expect it to use the JSON-based OpenAI standard for tool calling.

The problem with Qwen is that it's not trained on that. It's trained on XML. It will work to some degree if you explicitly tell it to use JSON (as the Hermes system prompt does), but as soon as you use thinking (Alibaba stated that structured output is not supported in thinking mode) or run into high context load, it falls back to its trained behaviour of outputting XML.

There are approaches that pack JSON tool calls into XML tags, which seem to be successful up to a point but don't seem to fix it completely, since JSON is hard for LLMs to generate (complex, brittle structure), unlike XML (easy, simple structure), even more so when they are trained mostly on XML.

The template you linked tells the LLM to make its tool calls XML-based, which will work with vLLM, which has its own "translation logic" built in when you use a certain Qwen3 XML parsing switch.

But for llama.cpp I expect it to become unstable as well under high context load.

My current thinking is moving away from a simple chat template solution and towards a chat template + middleware solution. If Qwen wants to speak XML but tools want JSON, then why not just give both of them what they want?!
So it's more of an ecosystem incompatibility, and also a geographic/political dimension between Qwen and OpenAI when you think about it.

My current approach is a middleware plus a clean Qwen chat template with XML, which you can host with Docker, for example, and which does the translation in milliseconds, not noticeable to the user.
This would solve it, but I'm still in the early experimentation phase.
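To make the translation idea concrete, here is a minimal sketch of what such a middleware step could look like. It is not the actual middleware discussed here. The XML layout (`<tool_call>`, `<function=NAME>`, `<parameter=KEY>`) is an assumption about how a Qwen-style template might emit tool calls, and the function name `xml_to_openai_tool_calls` is hypothetical.

```python
import json
import re
import uuid

# ASSUMPTION: the model emits tool calls in this Qwen-style layout:
# <tool_call><function=NAME><parameter=KEY>VALUE</parameter>...</function></tool_call>
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def xml_to_openai_tool_calls(text):
    """Translate XML-style tool calls found in `text` into OpenAI-style tool_calls."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        # Every parameter value is kept as a stripped string in this sketch.
        args = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({
            "id": "call_" + uuid.uuid4().hex[:12],  # ID issued by the middleware
            "type": "function",
            "function": {"name": name, "arguments": json.dumps(args)},
        })
    return calls


sample = (
    "<tool_call>\n<function=write_file>\n"
    "<parameter=path>\nnotes.txt\n</parameter>\n"
    "<parameter=content>\nhello\n</parameter>\n"
    "</function>\n</tool_call>"
)
calls = xml_to_openai_tool_calls(sample)
```

A real middleware would also need to coerce argument types against each tool's JSON schema and handle streamed, partially emitted tags, which this sketch leaves out.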

For the chat template you linked, I expect it to also become unstable at some point, since Hermes uses tool call IDs, which Qwen natively doesn't use and doesn't understand; it begins to hallucinate them at some point, which then confuses Hermes. So the template might be stable for some low-load use cases but will probably become unstable as soon as you put high-load work on it. But I might be wrong; keep me updated on your mileage.
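One possible way around hallucinated IDs, sketched under assumptions: a layer between the model and the harness issues the IDs itself and discards whatever the model wrote. The class name `ToolCallIdRegistry` and the `call_0001` ID format are hypothetical and are not taken from Hermes or from any template mentioned in this thread.

```python
import itertools


class ToolCallIdRegistry:
    """Issues tool-call IDs outside the model so the model never has to produce them."""

    def __init__(self, prefix="call_"):
        self._prefix = prefix
        self._counter = itertools.count(1)
        self._issued = {}  # issued ID -> function name

    def assign(self, tool_call):
        """Return a copy of `tool_call` with a fresh ID, ignoring any model-written ID."""
        call_id = f"{self._prefix}{next(self._counter):04d}"
        self._issued[call_id] = tool_call["function"]["name"]
        return {**tool_call, "id": call_id}

    def is_known(self, call_id):
        """True only for IDs issued by this registry, not for made-up ones."""
        return call_id in self._issued


registry = ToolCallIdRegistry()
# The "id" below stands in for an ID the model invented.
model_call = {
    "id": "made_up_123",
    "type": "function",
    "function": {"name": "read_file", "arguments": '{"path": "a.txt"}'},
}
fixed = registry.assign(model_call)
```

The same registry could then reject a tool result that refers to an unknown ID before it ever reaches the harness.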

So to finalize my post: this whole template thing is a true rabbit hole, and it's more than just a simple "non-deterministic to deterministic programmatic" question; it's about ecosystem compatibility.

meualsan

about 20 hours ago

For me the opposite is true. vLLM and SGLang have a worse chat template implementation.
I was never able to make vLLM work properly with Qwen3.6, no matter the template. Either the thinking gets broken, or it stops abruptly, or the tool calls are wrong. Various different chat template issues.
With the latest llama.cpp and this config, Qwen3.6 27B MTP works perfectly for me:

  /app/llama-server \
      --hf-repo $MODEL_REPO \
      --hf-file $MODEL_FILE \
      --port 8000 \
      --alias Qwen3.6-27B \
      --jinja \
      --ctx-size 262144 \
      -ngl 99 \
      --flash-attn on \
      -ctk q8_0 -ctv q8_0 \
      --cont-batching \
      --parallel 3 \
      --batch-size 4096 \
      --metrics \
      --threads 4 \
      --mlock \
      --no-mmap \
      --spec-type draft-mtp \
      --spec-draft-n-max 3

StableQuant

Owner about 20 hours ago

@meualsan
I see. But did you try the newest corresponding Qwen parser flags with vLLM and SGLang?
Also note that even with these flags, it doesn't fix dev role handling, tool calls inside thinking tags, etc., which are flaws in the original Qwen template.
So you would need flags (to do the XML-to-JSON translation) as well as a fixed Qwen template with vLLM and SGLang.
For vLLM, for example, there is the --tool-call-parser hermes flag for Hermes.

Also, SGLang and vLLM use guided decoding in their backend. This means that once a tool call is requested, they force the model to output valid JSON via a predefined "library"; tokens that don't fit get rejected.
That works better than just telling the model to produce valid JSON, but also only until you reach high context load. Then it becomes unstable as well.
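The token-rejection idea can be illustrated with a toy decoder. This is only a sketch of the principle and not how vLLM or SGLang implement it: real engines mask logits against a grammar or schema, while here the "grammar" is simply an assumed list of allowed complete outputs, and `guided_decode` is a hypothetical name.

```python
def guided_decode(step_candidates, allowed_outputs):
    """Toy greedy decoder: at each step take the best-ranked token that keeps
    the text a prefix of at least one allowed output, and reject the rest."""
    text = ""
    for candidates in step_candidates:  # each list is ordered best-first
        for token in candidates:
            if any(out.startswith(text + token) for out in allowed_outputs):
                text += token
                break
        else:
            raise ValueError(f"no candidate token fits after {text!r}")
    return text


# ASSUMPTION: a stand-in "grammar" made of two valid JSON tool calls.
allowed = ['{"name": "read_file"}', '{"name": "write_file"}']
# At step 1 the model's top choice is XML-like and gets rejected.
steps = [
    ["<tool_call>", '{"name": "'],
    ["write", "read"],
    ["_file", "_dir"],
    ['"}', "</tool_call>"],
]
result = guided_decode(steps, allowed)
```

Here the decoder skips the XML-style opening tag the model prefers and continues with the JSON token instead, which is the behaviour described above.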

For llama.cpp there is currently no such thing at all.

StableQuant

Owner about 19 hours ago • edited about 19 hours ago

llama.cpp seems to be faster with new integrations. For example, for two weeks I have been using a TurboQuant fork that expands the KV cache up to 8x vs. Q4 cache, and it's a dream.
vLLM doesn't have this yet. From my understanding, llama.cpp is more community-driven, while vLLM and SGLang are industry.
llama.cpp is more fluid and faster, but vLLM and SGLang are business.
People get paid to include fixes there to make Qwen etc. available to run in hyperscaler and AI clouds for stable business use.
With SGLang, for example, you could serve multiple users on a single RTX 3090 with a small Qwen model and get combined decoding speeds of around 2.5k tokens/s, whereas llama.cpp is mostly single-user.
Up to 4 users it's fine, but any more gets slow really quickly: total throughput of a few hundred tokens/s vs. 2-3k in SGLang with 16 users at the same time.

Unfortunately for us VRAM-poor xD that's not good news currently.
But it gets shared anyway. New stuff gets exchanged in both directions over time.
