TinyStories Mixtral 2M Top-2 MoE GQA Japanese Validation Suite (tinymoeja2m)

This repository provides an ultra-lightweight, Japanese-specialized Mixtral variant scaled down to about 2.05M total parameters, of which about 1.14M are active per token. It is trained on 320k stories from the TinyStories dataset that were translated into Japanese via Gemma 4.

This asset is configured with a 2,048 token context window (2k) and a standard RoPE base frequency (rope_theta) of 10,000.0 to act as a clean, trick-free baseline validation asset for runtime implementations.
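As a rough illustration of what the RoPE base frequency controls, the sketch below computes the per-pair rotation frequencies for a single attention head. The head dimension of 32 is an assumption derived from a hidden size of 128 split across 4 query heads, and the function names are illustrative rather than part of any library.

```python
def rope_inv_freq(head_dim: int, theta: float = 10000.0) -> list[float]:
    """Inverse frequencies theta^(-2i/d) for i = 0 .. d/2 - 1."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def rope_angles(position: int, head_dim: int, theta: float = 10000.0) -> list[float]:
    """Rotation angle (radians) applied to each dimension pair at a position."""
    return [position * f for f in rope_inv_freq(head_dim, theta)]


# Assumed head_dim = hidden_size / num_attention_heads = 128 / 4
inv_freq = rope_inv_freq(head_dim=32)
print(len(inv_freq))                 # one frequency per dimension pair
print(inv_freq[0], inv_freq[-1])     # fastest and slowest rotating pairs
print(rope_angles(2047, 32)[-1])     # slowest pair at the last context position
```

An engine that uses a different base (for example 15,000.0, as in the English variant) will produce angles that drift apart from these as the position grows, which is one way a wrong `rope_theta` shows up in long prompts.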

It is designed specifically for debugging custom inference engines that must handle Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE) together.
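For readers implementing the MoE side, here is a minimal pure-Python sketch of Top-2 routing for one token. It assumes the common Mixtral-style scheme (softmax over the router logits, keep the two largest, renormalise their weights); the toy experts and all function names are hypothetical stand-ins, not code from this repository.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts.

    Weights are softmax probabilities renormalised over the selected pair.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, router_logits, experts):
    """Weighted sum of the two selected experts' outputs for one token."""
    out = [0.0] * len(x)
    for idx, weight in top2_route(router_logits):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Four toy "experts" that only scale the input (stand-ins for SwiGLU FFNs).
experts = [lambda v, k=k: [k * vi for vi in v] for k in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]
routes = top2_route(logits)
print(routes)
print(moe_forward([1.0, -1.0], logits, experts))
```

Only the two selected experts are evaluated, which is why the active parameter count is lower than the total.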

📊 Comparison: `tinymoeja2m` vs Other Variants

To help track feature coverage across the verification suite, the updated structural layouts are outlined below:

| Feature / Metric | `tiny1m` (Standard) | `tinygemma1m` (Gemma 2) | `tinymoe2m` (English 4k) | `tinymoeja2m` (This Repository) |
|---|---|---|---|---|
| Language | English | English | English | Japanese |
| Base Architecture | Llama 2 | Gemma 2 | Llama 2 (Mixtral) | Llama 2 (Mixtral Format) |
| FFN Structure | Single FFN (Dense) | Single FFN (Dense) | Mixture-of-Experts | Mixture-of-Experts (MoE) |
| Attention Mechanism | MHA (Multi-Head) | GQA (Grouped-Query) | MHA (Multi-Head) | GQA (Grouped-Query) |
| Total / Selected Experts | 1 / - | 1 / - | 4 Experts / Top-2 | 4 Experts / Top-2 |
| GQA Head Ratio (Q:KV) | 1:1 (MHA) | 4:1 (GQA) | 1:1 (MHA) | 4:1 (Query: 4, KV: 1) |
| Max Position Embeddings | - | - | 4,096 | 2,048 (2k Context) |
| RoPE Base (`rope_theta`) | - | - | 15,000.0 | 10,000.0 |
| Total / Active Params | ~1.2M / ~1.2M | ~1.0M / ~1.0M | ~1.95M / ~1.14M | ~2.05M / ~1.14M |
| Primary Debug Target | Core matrix mult | Advanced graph | Scatter/Gather loops | GQA Broadcast & Byte Fallback |
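The 4:1 query-to-KV head ratio in the table means every query head reads the same cached keys and values. The sketch below shows that head mapping with a tiny pure-Python attention routine; the shapes, the helper names, and the nested-list cache layout are assumptions chosen for readability, not the layout of any particular engine.

```python
import math

N_Q_HEADS, N_KV_HEADS = 4, 1


def kv_head_for(q_head: int) -> int:
    """Map a query head index to the KV head it shares."""
    group_size = N_Q_HEADS // N_KV_HEADS
    return q_head // group_size


def attend(q, keys, values):
    """Scaled dot-product attention for one query over the cached positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def gqa(q_heads, k_cache, v_cache):
    """q_heads: one vector per query head; caches: [kv_head][position][dim]."""
    return [attend(q, k_cache[kv_head_for(h)], v_cache[kv_head_for(h)])
            for h, q in enumerate(q_heads)]


k_cache = [[[1.0, 0.0], [0.0, 1.0]]]   # one KV head, two cached positions
v_cache = [[[1.0, 2.0], [3.0, 4.0]]]
q_heads = [[0.0, 0.0]] * N_Q_HEADS     # zero queries give uniform attention
head_map = [kv_head_for(h) for h in range(N_Q_HEADS)]
result = gqa(q_heads, k_cache, v_cache)
print(head_map)
print(result)
```

An engine that instead indexes the cache by the query head number reads past the single KV head, which is the kind of stride bug this model is meant to expose.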

📂 Repository Structure & File Descriptions

1. GGUF Formats (Root Directory `./`)

Binary files intended for execution via llama.cpp or compatible lower-level inference engines. Upstream parsers automatically recognize them as a Mixtral-style (MoE) model.

| Filename | Type | Target / Validation Focus |
|---|---|---|
| `tinymoeja2m.F32.gguf` | `F32` | Baseline Test. Eliminates quantization noise to isolate and verify raw probability mathematics. |
| `tinymoeja2m.F16.gguf`, `tinymoeja2m.BF16.gguf` | `F16`, `BF16` | Half-Precision Test. Evaluates 16-bit floating-point unpacking routines and parallelized accumulation layers. |
| `tinymoeja2m.Q8_0.gguf` | `Q8_0` | Standard Quantization. Verifies block-based uniform scaling across decentralized MoE structures. |
| `tinymoeja2m.Q4_0.gguf`, `tinymoeja2m.Q4_1.gguf` | `Q4_0`, `Q4_1` | Classic 4-bit Quantization. Tests basic linear scaling and unpacking logic across multiple discontinuous expert weight matrices. |
| `tinymoeja2m.Q2_K.gguf` | `Q2_K` | Standard K-Quant (2-bit). Evaluates mixed super-block dequantization loops feeding sparse FFN routines. |
| `tinymoeja2m.Q3_K_M.gguf` | `Q3_K_M` | Standard K-Quant (3-bit). Tests sub-variant multi-block layouts handling dynamic routing vectors. |
| `tinymoeja2m.Q4_K_M.gguf` | `Q4_K_M` | Standard K-Quant (4-bit). Target for modern 4-bit super-block logic coupled with sparse MoE graphs. |
| `tinymoeja2m.Q5_K_M.gguf` | `Q5_K_M` | Standard K-Quant (5-bit). Validates high-fidelity mixed 5-bit precision layouts. |
| `tinymoeja2m.Q6_K.gguf` | `Q6_K` | Standard K-Quant (6-bit). Validates 6-bit high-fidelity super-block dequantization. |
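To make "block-based uniform scaling" concrete, the following sketch dequantizes `Q8_0` data as it is commonly described for ggml: blocks of 32 signed 8-bit values preceded by one little-endian fp16 scale. Treat the layout as an assumption to check against the ggml headers you build against; the function name is illustrative.

```python
import struct

QK8_0 = 32                                 # elements per block
BLOCK_FMT = "<e32b"                        # fp16 scale, then 32 signed bytes
BLOCK_SIZE = struct.calcsize(BLOCK_FMT)    # bytes per block


def dequantize_q8_0(raw: bytes) -> list[float]:
    """Expand a buffer of Q8_0 blocks into floats: value = scale * quant."""
    if len(raw) % BLOCK_SIZE:
        raise ValueError("buffer is not a whole number of Q8_0 blocks")
    out = []
    for offset in range(0, len(raw), BLOCK_SIZE):
        scale, *quants = struct.unpack_from(BLOCK_FMT, raw, offset)
        out.extend(scale * q for q in quants)
    return out


# Build one synthetic block: scale 0.5, quants -16 .. 15
block = struct.pack(BLOCK_FMT, 0.5, *range(-16, 16))
values = dequantize_q8_0(block)
print(BLOCK_SIZE, len(values), values[0], values[-1])
```

Comparing an engine's dequantized tensors against the `F32` file is a direct way to separate quantization bugs from routing or attention bugs.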

2. Hugging Face Native Format (`./hf/`)

Unquantized components formatted for direct instantiation inside the PyTorch transformers library ecosystem:

hf/model.safetensors: Unquantized weights, including all 4 expert sub-networks, the GQA projection matrices, and the router weights.
hf/config.json: Architecture settings based on MixtralConfig, with num_attention_heads: 4, num_key_value_heads: 1, max_position_embeddings: 2048, and rope_theta: 10000.0.
hf/generation_config.json: Standard generation defaults.
hf/tokenizer.model: The custom SentencePiece BPE model with a 1,024-token vocabulary, trained on a clean Japanese text subset with byte_fallback enabled.
hf/tokenizer.json: JSON-serialized token maps for fast tokenizer backends.
hf/tokenizer_config.json: Metadata selecting the LlamaTokenizer classes so that prefix spacing and automatic <s> (BOS) injection are handled correctly.
hf/special_tokens_map.json: Map of special tokens (<s>=1, </s>=2, <unk>=0, <pad>=2).
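When wiring a custom loader, it can help to check the parsed config against the values listed above before running any kernels. The snippet below is a small sketch of such a check; the `EXPECTED` table and `config_mismatches` helper are hypothetical, and the inline JSON string stands in for the contents of `hf/config.json`.

```python
import json

# Values taken from the file descriptions above.
EXPECTED = {
    "num_attention_heads": 4,
    "num_key_value_heads": 1,
    "max_position_embeddings": 2048,
    "rope_theta": 10000.0,
}


def config_mismatches(cfg: dict) -> dict:
    """Return {key: (expected, actual)} for every field that differs."""
    return {k: (v, cfg.get(k)) for k, v in EXPECTED.items() if cfg.get(k) != v}


# Stand-in for json.load(open("hf/config.json")); only the checked keys are shown.
sample = json.loads(
    '{"num_attention_heads": 4, "num_key_value_heads": 1,'
    ' "max_position_embeddings": 2048, "rope_theta": 10000.0}'
)
ok = config_mismatches(sample)
bad = config_mismatches({**sample, "num_key_value_heads": 4})
print(ok)
print(bad)
```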

🎯 Purpose & Design Philosophy (Verification Targets)

This checkpoint is specifically engineered as a deterministic validation test asset for runtime computing backends and is not designed for practical semantic tasks.

Due to the compact parameter size (~2.05M) and ultra-focused vocabulary layout (1,024 tokens), the network concentrates its capacity entirely on mastering Japanese phrase continuations and basic syntax under an autoregressive framework.

Critical Debugging Capabilities for Custom Engines:

GQA Broadcast Matrix Multiplication: The 4:1 Grouped-Query Attention structure requires the execution kernels to correctly share a single Key/Value cache block across 4 independent Query heads. This makes it a good testbed for tracking memory stride offsets and tensor broadcasting alignment in parallel compute shaders.
Multi-Byte UTF-8 Byte Fallback Validation: With the vocabulary limited to 1,024 tokens, any kanji or other character outside the primary training subset triggers the byte_fallback mechanism, which breaks the character down into raw sequential UTF-8 byte tokens (typically 3 tokens per character, since most Japanese characters occupy 3 bytes in UTF-8). This stress-tests the engine's streaming decoder, which must stitch unaligned byte streams back into correct Japanese text without truncation or corruption.
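The byte-fallback case can be sketched as a streaming decoder that buffers incomplete UTF-8 sequences. This example assumes the usual SentencePiece convention of byte pieces written as `<0xNN>` and `▁` as the word-boundary marker; the function name and the sample pieces are illustrative and were not produced by this model's tokenizer.

```python
import codecs
import re

BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def stream_decode(pieces):
    """Yield text chunks from token pieces, holding back partial UTF-8 bytes."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for piece in pieces:
        match = BYTE_TOKEN.match(piece)
        if match:
            # Feed one raw byte; returns "" until the character is complete.
            text = decoder.decode(bytes([int(match.group(1), 16)]))
        else:
            # A normal piece ends any pending byte run, so flush it first.
            text = decoder.decode(b"", final=True) + piece.replace("\u2581", " ")
        if text:
            yield text
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# "猫" encodes to the three bytes E7 8C AB in UTF-8.
pieces = ["トム", "は", "<0xE7>", "<0x8C>", "<0xAB>", "です"]
chunks = list(stream_decode(pieces))
print(chunks)
```

A decoder that converts each byte token to text on its own would emit replacement characters here, which is the corruption this check is designed to catch.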

🚀 Usage Examples

A. Running GGUF via llama.cpp

To run the GQA MoE execution graph and evaluate dynamic expert routing directly from your shell:

./llama-cli -m tinymoeja2m.Q4_K_M.gguf -p "トムとリリーは" -n 64 --temp 0.0

B. Loading Hugging Face Formats via Python

import torch
import sentencepiece as spm
from transformers import MixtralForCausalLM
from huggingface_hub import hf_hub_download

# Define target repository identity
repo_id = "shibatch/tinymoeja2m"

print("Downloading and caching specialized tokenizer layer...")
# Fetch tokenizer.model file automatically from Hugging Face Hub
tokenizer_file = hf_hub_download(repo_id=repo_id, subfolder="hf", filename="tokenizer.model")

sp = spm.SentencePieceProcessor()
sp.Load(tokenizer_file)

print("Downloading and loading Mixtral-based 2M MoE model weights...")
model = MixtralForCausalLM.from_pretrained(repo_id, subfolder="hf")

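# torch.xpu is missing on older PyTorch builds, so guard the attribute lookup
has_xpu = hasattr(torch, "xpu") and torch.xpu.is_available()
device = "cuda" if torch.cuda.is_available() else ("xpu" if has_xpu else "cpu")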
model = model.to(device)
model.eval()

# Prompt text utilizing vocabulary subsets
prompt = "トムとリリーは"
input_ids = [1] + sp.EncodeAsIds(prompt) # Explicitly prepend BOS (1)
input_tensor = torch.tensor([input_ids]).to(device)

print("Executing text generation loop (Validating 4:1 GQA & Top-2 MoE Kernels)...")
with torch.no_grad():
    output_ids = model.generate(
        input_tensor,
        max_length=64,
        do_sample=False,
        pad_token_id=2,
        bos_token_id=1,
        eos_token_id=2
    )

generated_ids = output_ids[0].cpu().tolist()
generated_text = sp.DecodeIds(generated_ids)

print("\n--- Inference Test Result ---")
print("Prompt   :", prompt)
print("Generated:", generated_text)

📝 Model Specifications

Architecture: Mixtral (MixtralForCausalLM)
Dataset: TinyStories Japanese Translation Corpus (320k stories)
Total Parameters (num_local_experts = 4): ~2.05M
Active Parameters (num_experts_per_tok = 2): ~1.14M
Vocabulary Size (vocab_size): 1,024 (Custom SentencePiece BPE with byte_fallback enabled)
Hidden Size (hidden_size): 128
Number of Hidden Layers (num_hidden_layers): 3
Number of Attention Heads (num_attention_heads / num_key_value_heads): 4 / 1 (4:1 GQA layout)
Individual Expert Internal Dimension (intermediate_size): 352 (SwiGLU structure)
Max Position Embeddings (max_position_embeddings): 2,048
RoPE Base Frequency (rope_theta): 10,000.0
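The specification values above can be turned into a back-of-the-envelope parameter count. The sketch below assumes a standard Mixtral layout with no bias terms, an untied output head, two RMSNorm weights per layer plus a final norm, and three projection matrices per SwiGLU expert; these are assumptions, so the result may differ somewhat from the rounded figures quoted above depending on what is counted. The function name is hypothetical.

```python
def mixtral_param_count(vocab=1024, hidden=128, layers=3, q_heads=4, kv_heads=1,
                        ffn=352, experts=4, experts_per_tok=2,
                        tied_embeddings=False):
    """Return (total, active) parameter counts under the stated assumptions."""
    head_dim = hidden // q_heads
    embed = vocab * hidden
    lm_head = 0 if tied_embeddings else vocab * hidden
    # q and o projections are hidden x hidden; k and v shrink with kv_heads
    attn = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim
    expert = 3 * hidden * ffn          # gate, up and down projections
    router = hidden * experts
    norms = 2 * hidden                 # two RMSNorm weights per layer

    def total_with(n_experts):
        per_layer = attn + n_experts * expert + router + norms
        return embed + lm_head + layers * per_layer + hidden  # + final norm

    return total_with(experts), total_with(experts_per_tok)


total, active = mixtral_param_count()
print(f"total={total:,} active={active:,}")
```

Changing `experts_per_tok` or `tied_embeddings` shows how sensitive the "active" figure is to the counting convention.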

📜 License

License: MIT License. You are completely free to duplicate, modify, distribute, and utilize these assets across any commercial, personal, or educational environments.
