TinyStories Mixtral 2M Top-2 MoE GQA Japanese Validation Suite (tinymoeja2m)

This repository provides a very small, Japanese-specialized Mixtral variant with about 2.05M total parameters, of which about 1.14M are active per token. It is trained on the 320k stories of the TinyStories dataset translated into Japanese via Gemma 4.

The model uses a 2,048-token context window (2k) and the standard RoPE base frequency (rope_theta) of 10,000.0, so it can serve as a plain baseline, free of long-context tricks, for validating runtime implementations.
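
As a rough illustration of what the standard base implies, the sketch below computes the RoPE inverse frequencies for rope_theta = 10,000.0. The head dimension of 32 is an assumption derived from the hidden size (128) and attention head count (4) given in the specifications, and `rope_inv_freq` and `rope_angle` are hypothetical helper names, not functions from any library.

```python
def rope_inv_freq(head_dim: int, theta: float = 10000.0) -> list[float]:
    """Standard RoPE inverse frequencies: theta^(-2i/d) for i in [0, d/2)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def rope_angle(position: int, inv_freq: list[float]) -> list[float]:
    """Rotation angle in radians applied to each dimension pair at a position."""
    return [position * f for f in inv_freq]


# Assumption: head_dim = hidden_size / num_attention_heads = 128 / 4 = 32
inv_freq = rope_inv_freq(head_dim=32, theta=10000.0)

print(len(inv_freq))   # 16 dimension pairs per head
print(inv_freq[0])     # 1.0, the fastest-rotating pair
print(inv_freq[-1])    # the slowest-rotating pair, 10000 ** (-30 / 32)
```

An engine under test can compare its own frequency table against these values before looking at attention outputs.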

It is designed specifically for debugging custom inference engines that must handle Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE) in the same model.


📊 Comparison: tinymoeja2m vs Other Variants

The table below compares this model with the other variants in the verification suite:

| Feature / Metric | tiny1m (Standard) | tinygemma1m (Gemma 2) | tinymoe2m (English 4k) | tinymoeja2m (This Repository) |
|---|---|---|---|---|
| Language | English | English | English | Japanese |
| Base Architecture | Llama 2 | Gemma 2 | Llama 2 (Mixtral) | Llama 2 (Mixtral Format) |
| FFN Structure | Single FFN (Dense) | Single FFN (Dense) | Mixture-of-Experts | Mixture-of-Experts (MoE) |
| Attention Mechanism | MHA (Multi-Head) | GQA (Grouped-Query) | MHA (Multi-Head) | GQA (Grouped-Query) |
| Total / Selected Experts | 1 / - | 1 / - | 4 Experts / Top-2 | 4 Experts / Top-2 |
| GQA Head Ratio (Q:KV) | 1:1 (MHA) | 4:1 (GQA) | 1:1 (MHA) | 4:1 (Query: 4, KV: 1) |
| Max Position Embeddings | - | - | 4,096 | 2,048 (2k Context) |
| RoPE Base (rope_theta) | - | - | 15,000.0 | 10,000.0 |
| Total / Active Params | ~1.2M / ~1.2M | ~1.0M / ~1.0M | ~1.95M / ~1.14M | ~2.05M / ~1.14M |
| Primary Debug Target | Core matrix mult | Advanced graph | Scatter/Gather loops | GQA Broadcast & Byte Fallback |
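
To make the "4 Experts / Top-2" entry concrete, here is a minimal sketch of top-2 routing for a single token. It normalizes over the two selected router logits only, which is mathematically the same as a softmax over all experts followed by renormalizing the top two. The function names, logits, and toy expert outputs are illustrative assumptions, not values taken from this model.

```python
import math


def top2_route(router_logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Select the k highest-scoring experts and softmax-normalize their logits."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in ranked)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]


def combine(expert_outputs: list[list[float]],
            routes: list[tuple[int, float]]) -> list[float]:
    """Weighted sum of the selected experts' output vectors."""
    mixed = [0.0] * len(expert_outputs[0])
    for idx, weight in routes:
        for d, value in enumerate(expert_outputs[idx]):
            mixed[d] += weight * value
    return mixed


# Hypothetical router logits for one token across 4 experts
routes = top2_route([0.5, 2.0, -1.0, 1.0])
print([idx for idx, _ in routes])  # [1, 3]

# Toy 2-dimensional expert outputs; only experts 1 and 3 contribute
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 1.0]]
mixed = combine(expert_outputs, routes)
```

Unselected experts (0 and 2 here) never need to be evaluated, which is why the active parameter count is lower than the total.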

📂 Repository Structure & File Descriptions

1. GGUF Formats (Root Directory ./)

Binary files for llama.cpp and compatible low-level inference engines. Upstream parsers recognize the Mixtral-style MoE layout automatically from the GGUF metadata.

| Filename | Type | Target / Validation Focus |
|---|---|---|
| tinymoeja2m.F32.gguf | F32 | Baseline Test. Eliminates quantization noise to isolate and verify raw probability mathematics. |
| tinymoeja2m.F16.gguf, tinymoeja2m.BF16.gguf | F16, BF16 | Half-Precision Test. Evaluates 16-bit floating-point unpacking routines and parallelized accumulation layers. |
| tinymoeja2m.Q8_0.gguf | Q8_0 | Standard Quantization. Verifies block-based uniform scaling across decentralized MoE structures. |
| tinymoeja2m.Q4_0.gguf, tinymoeja2m.Q4_1.gguf | Q4_0, Q4_1 | Classic 4-bit Quantization. Tests basic linear scaling and unpacking logic across multiple discontinuous expert weight matrices. |
| tinymoeja2m.Q2_K.gguf | Q2_K | Standard K-Quant (2-bit). Evaluates mixed super-block dequantization loops feeding sparse FFN routines. |
| tinymoeja2m.Q3_K_M.gguf | Q3_K_M | Standard K-Quant (3-bit). Tests sub-variant multi-block layouts handling dynamic routing vectors. |
| tinymoeja2m.Q4_K_M.gguf | Q4_K_M | Standard K-Quant (4-bit). Target for modern 4-bit super-block logic coupled with sparse MoE graphs. |
| tinymoeja2m.Q5_K_M.gguf | Q5_K_M | Standard K-Quant (5-bit). Validates high-fidelity mixed 5-bit precision layouts. |
| tinymoeja2m.Q6_K.gguf | Q6_K | Standard K-Quant (6-bit). Validates 6-bit high-fidelity super-block dequantization. |

2. Hugging Face Native Format (./hf/)

Unquantized files that load directly with the Hugging Face transformers library (PyTorch):

  • hf/model.safetensors: Unquantized weights, including all 4 expert sub-networks, the GQA projection matrices, and the router tensors.
  • hf/config.json: Architecture settings based on MixtralConfig. It sets num_attention_heads: 4, num_key_value_heads: 1, max_position_embeddings: 2048, and rope_theta: 10000.0.
  • hf/generation_config.json: Standard generation defaults.
  • hf/tokenizer.model: The custom SentencePiece BPE model with a 1,024-token vocabulary, trained on a clean Japanese text subset with byte_fallback enabled.
  • hf/tokenizer.json: JSON-serialized tokenizer definition for fast tokenizer backends.
  • hf/tokenizer_config.json: Metadata that selects the LlamaTokenizer class so that prefix spacing and automatic <s> (BOS) insertion are handled correctly.
  • hf/special_tokens_map.json: Map of special tokens (<s>=1, </s>=2, <unk>=0, <pad>=2; padding shares the </s> id).
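
The head counts in config.json determine how query heads map onto the shared key/value head and what shapes the attention projections take. The sketch below works this out under two assumptions: hidden_size is 128 (as listed in the model specifications), and weights are stored as (out_features, in_features). The helper names are hypothetical.

```python
def kv_head_for_query_head(q_head: int,
                           num_attention_heads: int = 4,
                           num_key_value_heads: int = 1) -> int:
    """Index of the KV head that a given query head reads from under GQA."""
    group_size = num_attention_heads // num_key_value_heads
    return q_head // group_size


def projection_shapes(hidden_size: int = 128,
                      num_attention_heads: int = 4,
                      num_key_value_heads: int = 1) -> dict[str, tuple[int, int]]:
    """Attention weight shapes as (out_features, in_features)."""
    head_dim = hidden_size // num_attention_heads
    return {
        "q_proj": (num_attention_heads * head_dim, hidden_size),
        "k_proj": (num_key_value_heads * head_dim, hidden_size),
        "v_proj": (num_key_value_heads * head_dim, hidden_size),
        "o_proj": (hidden_size, num_attention_heads * head_dim),
    }


# With 4 query heads and 1 KV head, every query head shares KV head 0
mapping = [kv_head_for_query_head(h) for h in range(4)]
print(mapping)  # [0, 0, 0, 0]

shapes = projection_shapes()
print(shapes["q_proj"], shapes["k_proj"])  # (128, 128) (32, 128)
```

The K and V projections are a quarter the size of the Q projection, so an engine that assumes square attention weights will misread them.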

🎯 Purpose & Design Philosophy (Verification Targets)

This checkpoint is a deterministic validation asset for inference runtimes; it is not intended for practical language tasks.

Because of the compact parameter count (~2.05M) and the small vocabulary (1,024 tokens), the network spends nearly all of its capacity on Japanese phrase continuation and basic syntax in an autoregressive setting.

Critical Debugging Capabilities for Custom Engines:

  1. GQA Broadcast Matrix Multiplication: The 4:1 Grouped-Query Attention structure requires the execution kernels to share a single Key/Value cache block across 4 independent Query heads. This makes it a good testbed for checking memory stride offsets and tensor broadcasting alignment in parallel compute shaders.
  2. Multi-Byte UTF-8 Byte Fallback Validation: With the vocabulary limited to 1,024 tokens, any kanji or other character outside the primary training subset triggers the byte_fallback mechanism, which breaks the character into raw sequential UTF-8 byte tokens (typically 3 tokens per character, since most Japanese characters occupy 3 bytes in UTF-8). This stress-tests the engine's streaming decoder, which must reassemble byte tokens that arrive one at a time into valid Japanese text without truncation or corruption.
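
The byte fallback point can be sketched with Python's incremental UTF-8 decoder. The example assumes byte pieces use the SentencePiece `<0xNN>` spelling and that ordinary pieces are already plain text; `to_byte_pieces` and `stream_decode` are hypothetical helpers, and the handling of the SentencePiece whitespace marker is left out for brevity.

```python
import codecs
import re

BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def to_byte_pieces(char: str) -> list[str]:
    """Spell a character as byte-fallback pieces, one per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]


def stream_decode(pieces):
    """Yield text only once complete UTF-8 characters are available."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for piece in pieces:
        match = BYTE_PIECE.match(piece)
        if match:
            # Buffer the byte; the decoder returns "" until a character completes
            text = decoder.decode(bytes([int(match.group(1), 16)]))
        else:
            # Ordinary piece: flush any dangling bytes, then emit the piece
            text = decoder.decode(b"", final=True) + piece
            decoder.reset()
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # flush an incomplete trailing sequence
    if tail:
        yield tail


print(to_byte_pieces("猫"))  # ['<0xE7>', '<0x8C>', '<0xAB>']

pieces = ["ト", "ム"] + to_byte_pieces("猫")
print(list(stream_decode(pieces)))  # ['ト', 'ム', '猫']
```

Nothing is emitted for the first two byte pieces of the kanji, so a streaming UI never shows a partial character.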

🚀 Usage Examples

A. Running GGUF via llama.cpp

To run the GQA MoE execution graph and observe expert routing from the shell (greedy decoding, 64 new tokens):

./llama-cli -m tinymoeja2m.Q4_K_M.gguf -p "トムとリリーは" -n 64 --temp 0.0

B. Loading Hugging Face Formats via Python

import torch
import sentencepiece as spm
from transformers import MixtralForCausalLM
from huggingface_hub import hf_hub_download

# Define target repository identity
repo_id = "shibatch/tinymoeja2m"

print("Downloading and caching specialized tokenizer layer...")
# Fetch tokenizer.model file automatically from Hugging Face Hub
tokenizer_file = hf_hub_download(repo_id=repo_id, subfolder="hf", filename="tokenizer.model")

sp = spm.SentencePieceProcessor()
sp.Load(tokenizer_file)

print("Downloading and loading Mixtral-based 2M MoE model weights...")
model = MixtralForCausalLM.from_pretrained(repo_id, subfolder="hf")

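# Prefer CUDA, then Intel XPU (guarded, since older PyTorch builds lack torch.xpu), then CPU
device = "cuda" if torch.cuda.is_available() else ("xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu")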
model = model.to(device)
model.eval()

# Prompt text utilizing vocabulary subsets
prompt = "トムとリリーは"
input_ids = [1] + sp.EncodeAsIds(prompt) # Explicitly prepend BOS (1)
input_tensor = torch.tensor([input_ids]).to(device)

print("Executing text generation loop (Validating 4:1 GQA & Top-2 MoE Kernels)...")
with torch.no_grad():
    output_ids = model.generate(
        input_tensor,
        max_length=64,
        do_sample=False,
        pad_token_id=2,
        bos_token_id=1,
        eos_token_id=2
    )

generated_ids = output_ids[0].cpu().tolist()
generated_text = sp.DecodeIds(generated_ids)

print("\n--- Inference Test Result ---")
print("Prompt   :", prompt)
print("Generated:", generated_text)

Model Specifications

  • Architecture: Mixtral (MixtralForCausalLM)
  • Dataset: TinyStories Japanese Translation Corpus (320k stories)
  • Total Parameters (num_local_experts = 4): ~2.05M
  • Active Parameters (num_experts_per_tok = 2): ~1.14M
  • Vocabulary Size (vocab_size): 1,024 (Custom SentencePiece BPE with byte_fallback enabled)
  • Hidden Size (hidden_size): 128
  • Number of Hidden Layers (num_hidden_layers): 3
  • Number of Attention Heads (num_attention_heads / num_key_value_heads): 4 / 1 (4:1 GQA layout)
  • Individual Expert Internal Dimension (intermediate_size): 352 (SwiGLU structure)
  • Max Position Embeddings (max_position_embeddings): 2,048
  • RoPE Base Frequency (rope_theta): 10,000.0
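
The listed hyperparameters can be turned into a back-of-the-envelope parameter count. This sketch assumes a Mixtral-style layout with no bias terms, one RMSNorm weight vector before attention and one before the MoE block, a final norm, and untied input and output embeddings; none of these details are stated above, so treat the result as an estimate. `mixtral_param_count` is a hypothetical helper.

```python
def mixtral_param_count(vocab: int = 1024, hidden: int = 128, layers: int = 3,
                        heads: int = 4, kv_heads: int = 1, intermediate: int = 352,
                        experts: int = 4, experts_per_tok: int = 2,
                        tied_embeddings: bool = False) -> tuple[int, int]:
    """Return (total, active-per-token) parameter counts under the stated assumptions."""
    head_dim = hidden // heads
    # q_proj + o_proj are hidden x hidden; k_proj and v_proj use the KV head count
    attn = 2 * hidden * (heads * head_dim) + 2 * hidden * (kv_heads * head_dim)
    router = hidden * experts
    expert = 3 * hidden * intermediate  # SwiGLU: gate, up, and down projections
    norms = 2 * hidden
    embed = vocab * hidden * (1 if tied_embeddings else 2)
    total = embed + layers * (attn + router + experts * expert + norms) + hidden
    active = embed + layers * (attn + router + experts_per_tok * expert + norms) + hidden
    return total, active


total, active = mixtral_param_count()
print(total)   # 2009472
print(active)  # 1198464
```

Under these assumptions the total comes to roughly 2.01M and the active count to roughly 1.20M, close to but not exactly the rounded figures quoted above; the gap may come from a counting convention that is not stated here.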

📜 License

  • License: MIT License. You may copy, modify, distribute, and use these assets for commercial, personal, or educational purposes under its terms.