Qwen-Image fast tokenizer (tokenizer.json)

A derived artifact for running Qwen/Qwen-Image on the native-Rust/MLX mlx-gen engine (SceneWorks).

Why this exists

Qwen/Qwen-Image ships its Qwen2 BPE tokenizer as vocab.json + merges.txt only; there is no fast tokenizer.json in the upstream repo (the Python fork builds the fast tokenizer at runtime via transformers). The Rust engine's tokenizer loader (mlx_gen::TextTokenizer, consumed by the qwen-image provider's load_tokenizer) reads the HF tokenizers fast serialization, so it needs a tokenizer.json.

This repo hosts that derived tokenizer.json so SceneWorks model-install can overlay it onto the upstream Qwen-Image snapshot instead of running a Python vocab.json + merges.txt → fast conversion at install time on every machine (the desktop Mac bundle ships no Python). See SceneWorks sc-6570; this mirrors the Kolors fast-tokenizer overlay (sc-4764).
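The overlay step described above amounts to placing one extra file into the downloaded snapshot directory. The following is a minimal sketch of that idea, not the SceneWorks installer itself: the function name overlay_tokenizer and both path arguments are hypothetical, and the real model-install logic may differ.

```python
import filecmp
import shutil
from pathlib import Path


def overlay_tokenizer(overlay_file: Path, snapshot_dir: Path) -> bool:
    """Copy the derived tokenizer.json into a snapshot directory.

    Hypothetical helper for illustration. Returns True if a file was
    written, False if an identical file was already in place.
    """
    target = snapshot_dir / "tokenizer.json"
    # Skip the copy when the same bytes are already there, so repeated
    # installs are idempotent.
    if target.exists() and filecmp.cmp(overlay_file, target, shallow=False):
        return False
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(overlay_file, target)
    return True
```

Because the copy is skipped when the contents already match, running the overlay twice leaves the snapshot unchanged the second time.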

Note: Qwen/Qwen-Image-Edit-2511 already ships its own tokenizer.json upstream, so only the base text-to-image Qwen/Qwen-Image repo needs this overlay.

How it was built

Materialized by tools/build_qwen_tokenizer.py (mlx-gen): the script loads the Qwen2 tokenizer with transformers.AutoTokenizer.from_pretrained (the fast path) and writes it out with backend_tokenizer.save(...). The result is byte-identical to the fast tokenizer the fork builds at runtime: same vocab, merges, NFC + ByteLevel pipeline, and special tokens.
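The two calls named above can be sketched as follows. This is an illustration under stated assumptions, not the contents of tools/build_qwen_tokenizer.py: the function name build_fast_tokenizer and its arguments are hypothetical, and source is assumed to be a local directory (or repo id) that AutoTokenizer can resolve to the Qwen2 tokenizer files.

```python
from pathlib import Path


def build_fast_tokenizer(source: str, out_path: Path) -> Path:
    """Load a tokenizer via the fast path and save its tokenizers serialization.

    Hypothetical helper; `source` is assumed to point at the directory
    holding vocab.json, merges.txt and tokenizer_config.json.
    """
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained(source, use_fast=True)
    if not tok.is_fast:
        # backend_tokenizer only exists on fast tokenizers.
        raise RuntimeError("expected a fast tokenizer, got the slow implementation")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Writes the single-file JSON format that the tokenizers library reads.
    tok.backend_tokenizer.save(str(out_path))
    return out_path
```

A call such as build_fast_tokenizer("path/to/tokenizer-dir", Path("out/tokenizer.json")) would then produce the file; the paths here are placeholders.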

Validation: the fast-tokenizer ids match those of the fork's runtime transformers tokenizer across a battery of EN, long EN, CN, mixed CN/EN/numeric/punctuation, and empty (negative-prompt) inputs, with 0 mismatches. vocab_size is 151665; the pad token id is 151643 (<|endoftext|>).
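A parity check of this kind can be sketched as a small harness that runs two encoders over the same prompts and collects any differences. The prompts below are illustrative stand-ins for the categories named above, not the battery actually used, and find_mismatches is a hypothetical name.

```python
from typing import Callable, Iterable, List, Tuple

Encoder = Callable[[str], List[int]]

# Illustrative prompts only; one per category mentioned in the validation note.
BATTERY = [
    "a photo of a cat",                                          # EN
    "a highly detailed oil painting of a lighthouse at dusk, " * 8,  # long EN
    "一只在雨中奔跑的狗",                                          # CN
    "霓虹灯 neon sign, 2024年, 50% off!",                         # mixed
    "",                                                          # empty negative prompt
]


def find_mismatches(
    reference: Encoder, candidate: Encoder, prompts: Iterable[str]
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (prompt, reference_ids, candidate_ids) for every prompt that differs."""
    mismatches = []
    for prompt in prompts:
        ref_ids = list(reference(prompt))
        cand_ids = list(candidate(prompt))
        if ref_ids != cand_ids:
            mismatches.append((prompt, ref_ids, cand_ids))
    return mismatches
```

In practice the reference encoder would wrap the runtime transformers tokenizer and the candidate would wrap the saved tokenizer.json; an empty result corresponds to the "0 mismatches" outcome.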

Files

  • tokenizer.json: the derived fast tokenizer (the file the Rust engine needs).
  • vocab.json, merges.txt, tokenizer_config.json, added_tokens.json, special_tokens_map.json: the upstream slow-tokenizer source files (provenance / reproducibility).
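For a quick sanity check of tokenizer.json without loading any tokenizer library, the file can be inspected as plain JSON. The sketch below assumes the usual tokenizers layout (a top-level "model" object with "type" and "vocab", and an "added_tokens" list whose entries carry "id" and "content"); summarize_tokenizer is a hypothetical name, and whether its distinct-id count equals the vocab_size quoted in this README is an assumption to verify rather than a guarantee.

```python
import json
from pathlib import Path


def summarize_tokenizer(path: Path) -> dict:
    """Report the model type, the number of distinct ids, and the <|endoftext|> id."""
    data = json.loads(path.read_text(encoding="utf-8"))
    model = data.get("model", {})
    added = data.get("added_tokens", [])
    # Union of base-vocab ids and added-token ids (they may overlap).
    ids = set(model.get("vocab", {}).values()) | {t["id"] for t in added}
    # None if the file defines no <|endoftext|> added token.
    endoftext_id = next(
        (t["id"] for t in added if t["content"] == "<|endoftext|>"), None
    )
    return {
        "model_type": model.get("type"),
        "distinct_ids": len(ids),
        "endoftext_id": endoftext_id,
    }
```

Running it on the hosted file and comparing endoftext_id against the pad token id stated above is one way to confirm the right file was installed.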

License & provenance

Derived from the Qwen2 tokenizer shipped with Qwen/Qwen-Image (Apache-2.0). This repo redistributes only the tokenizer (no model weights) for engine interoperability.
