Exactly what I have been looking for.

#1
by leonletto - opened

I wrote this because coder models always use some form of sparse attention and therefore fail in complicated long-context situations.
https://leonletto.github.io/Blog/Why-Your-Long-Context-AI-Keeps-Forgetting.html

Your finetune solves this perfectly: it gives me the agentic enhancements I wanted without compromising on long context, because full attention stays intact. Amazing work!

Thank you.

Nebius org

Thanks so much!

Quick clarification, though: did you perhaps test an older version of Qwen3-Coder-30B-A3B? About 22 days ago its config was updated to max_window_layers=48, enabling full attention across all layers.
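
If it helps to verify which snapshot you were running, here is a minimal sketch for inspecting the relevant config fields with Transformers. The repo id is an assumption and the field names follow the Qwen2/Qwen3-style config; adjust both to the exact checkpoint and revision you tested:

```python
from transformers import AutoConfig

# Assumed repo id and revision; point these at the exact Coder checkpoint you tested.
cfg = AutoConfig.from_pretrained("Qwen/Qwen3-Coder-30B-A3B-Instruct", revision="main")

print("num_hidden_layers :", cfg.num_hidden_layers)
print("max_window_layers :", getattr(cfg, "max_window_layers", None))
print("use_sliding_window:", getattr(cfg, "use_sliding_window", None))

# With max_window_layers equal to num_hidden_layers (48) or sliding window
# disabled, no layer should fall back to sliding-window attention at inference.
if getattr(cfg, "max_window_layers", 0) >= cfg.num_hidden_layers:
    print("Config reports full attention on every layer.")
```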

Even before that update, use_sliding_window=false would suggest the model may have been trained with full attention on all layers, which should ensure proper long-context handling at inference in vLLM and Transformers. Regarding this point from your blog:

> Important note: This isn't just a configuration setting you can change. The max_window_layers parameter defines the model's architecture during training—the weights are trained specifically for this layer configuration. You can't edit config.json to "fix" a code model for long-context tasks; you'd need to retrain the model with a different architecture. Research from NVIDIA's SWAN-GPT paper demonstrates that different layer types (full attention vs sliding window) learn fundamentally different representations during training, and converting between architectures requires significant continued pre-training.
>
> Note: You might see use_sliding_window: false in some config files—this controls runtime behavior in specific loaders (vLLM, HF Transformers), but the architectural layer configuration is baked into the weights regardless of this flag.

If use_sliding_window=false indicated that the model was trained with full attention from the start, then the weights would likely already have the appropriate architecture; the config flag would simply tell the inference engine to match the training configuration rather than change the architecture itself.
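
To illustrate that last point (a sketch only, with an assumed repo id): overriding the flag at load time in Transformers only changes how the loader handles attention at runtime, while the weights on disk stay exactly as they were trained:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Assumed repo id, for illustration only; use whichever checkpoint you actually run.
repo = "Qwen/Qwen3-Coder-30B-A3B-Instruct"

# Keyword overrides here only affect runtime attention handling in the loader;
# they do not (and cannot) change what the weights were trained with.
cfg = AutoConfig.from_pretrained(repo, use_sliding_window=False)
model = AutoModelForCausalLM.from_pretrained(repo, config=cfg)
```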

That said, you might still benefit from our finetune: we built on the base Qwen3-30B-A3B-Instruct-2507 (not the Coder variant), so it should stay closer to the base model you rely on while adding the agentic enhancements you want.
