TinyStories Mixtral 2M Top-2 MoE GQA Japanese Validation Suite (tinymoeja2m)
This repository provides an ultra-lightweight, Japanese-specialized Mixtral variant scaled down to about 2.05M total parameters, of which about 1.14M are active per token. It is trained on the 320k stories of the TinyStories dataset, translated into Japanese with Gemma 4.
This asset is configured with a 2,048 token context window (2k) and a standard RoPE base frequency (rope_theta) of 10,000.0 to act as a clean, trick-free baseline validation asset for runtime implementations.
It is designed specifically for debugging custom inference engines against the combination of Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE) topologies.
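As a reference point for the RoPE settings above, here is a minimal sketch of how a base of 10,000.0 turns into rotation frequencies. It assumes a head dimension of 32 (hidden size 128 divided by 4 attention heads, not read from `config.json`) and the half-split pairing used by the Hugging Face implementation; other runtimes may pair dimensions differently. The helper names are hypothetical.

```python
import math

ROPE_THETA = 10_000.0
HEAD_DIM = 128 // 4  # assumed: hidden_size / num_attention_heads


def rope_inv_freq(head_dim: int, theta: float) -> list[float]:
    """Inverse frequencies theta^(-2i/d) for i in [0, d/2)."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rope_rotate(vec, position, theta=ROPE_THETA):
    """Rotate one head vector, pairing element i with element i + d/2."""
    half = len(vec) // 2
    out = list(vec)
    for i, freq in enumerate(rope_inv_freq(len(vec), theta)):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


freqs = rope_inv_freq(HEAD_DIM, ROPE_THETA)
sample = [float(i + 1) for i in range(HEAD_DIM)]
rotated = rope_rotate(sample, position=7)
```

Two quick sanity checks for a custom kernel: position 0 must leave the vector unchanged, and every rotation must preserve the vector's norm.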
Comparison: tinymoeja2m vs Other Variants
To help track feature coverage across the verification suite, the updated structural layouts are outlined below:
| Feature / Metric | tiny1m (Standard) | tinygemma1m (Gemma 2) | tinymoe2m (English 4k) | tinymoeja2m (This Repository) |
|---|---|---|---|---|
| Language | English | English | English | Japanese |
| Base Architecture | Llama 2 | Gemma 2 | Llama 2 (Mixtral) | Llama 2 (Mixtral Format) |
| FFN Structure | Single FFN (Dense) | Single FFN (Dense) | Mixture-of-Experts | Mixture-of-Experts (MoE) |
| Attention Mechanism | MHA (Multi-Head) | GQA (Grouped-Query) | MHA (Multi-Head) | GQA (Grouped-Query) |
| Total / Selected Experts | 1 / - | 1 / - | 4 Experts / Top-2 | 4 Experts / Top-2 |
| GQA Head Ratio (Q:KV) | 1:1 (MHA) | 4:1 (GQA) | 1:1 (MHA) | 4:1 (Query: 4, KV: 1) |
| Max Position Embeddings | - | - | 4,096 | 2,048 (2k Context) |
| RoPE Base (rope_theta) | - | - | 15,000.0 | 10,000.0 |
| Total / Active Params | ~1.2M / ~1.2M | ~1.0M / ~1.0M | ~1.95M / ~1.14M | ~2.05M / ~1.14M |
| Primary Debug Target | Core matrix mult | Advanced graph | Scatter/Gather loops | GQA Broadcast & Byte Fallback |
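To make the 4:1 Q:KV ratio in the table concrete, the following pure-Python sketch shows four query heads reading from one stored KV head. The head dimension is shrunk to 2 and all values are made up for illustration; the function names are hypothetical and this is not the kernel layout of any particular runtime.

```python
import math

N_HEADS, N_KV_HEADS = 4, 1        # 4:1 GQA layout
GROUP = N_HEADS // N_KV_HEADS     # query heads sharing one KV head


def kv_head_for(q_head: int) -> int:
    """Index of the KV head that a given query head reads from."""
    return q_head // GROUP


def attend(q, keys, values):
    """Single-query scaled dot-product attention over cached positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * v[d] for w, v in zip(weights, values))
        for d in range(len(values[0]))
    ]


# KV cache indexed as [kv_head][position][dim]: only one head is stored.
k_cache = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
v_cache = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
# One query vector per query head for the current token (illustrative values).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]

outputs = [
    attend(q, k_cache[kv_head_for(h)], v_cache[kv_head_for(h)])
    for h, q in enumerate(queries)
]
```

The result should be identical to first copying the KV head four times and then running ordinary multi-head attention, which gives a simple reference for checking a broadcast kernel.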
Repository Structure & File Descriptions
1. GGUF Formats (Root Directory ./)
Binary files optimized for execution via llama.cpp or compatible lower-level inference engines. Upstream parsers automatically recognize this under the mixed (Mixtral) descriptor.
| Filename | Type | Target / Validation Focus |
|---|---|---|
| `tinymoeja2m.F32.gguf` | F32 | Baseline Test. Eliminates quantization noise to isolate and verify raw probability mathematics. |
| `tinymoeja2m.F16.gguf`, `tinymoeja2m.BF16.gguf` | F16, BF16 | Half-Precision Test. Evaluates 16-bit floating-point unpacking routines and parallelized accumulation layers. |
| `tinymoeja2m.Q8_0.gguf` | Q8_0 | Standard Quantization. Verifies block-based uniform scaling across decentralized MoE structures. |
| `tinymoeja2m.Q4_0.gguf`, `tinymoeja2m.Q4_1.gguf` | Q4_0, Q4_1 | Classic 4-bit Quantization. Tests basic linear scaling and unpacking logic across multiple discontinuous expert weight matrices. |
| `tinymoeja2m.Q2_K.gguf` | Q2_K | Standard K-Quant (2-bit). Evaluates mixed super-block dequantization loops feeding sparse FFN routines. |
| `tinymoeja2m.Q3_K_M.gguf` | Q3_K_M | Standard K-Quant (3-bit). Tests sub-variant multi-block layouts handling dynamic routing vectors. |
| `tinymoeja2m.Q4_K_M.gguf` | Q4_K_M | Standard K-Quant (4-bit). Target for modern 4-bit super-block logic coupled with sparse MoE graphs. |
| `tinymoeja2m.Q5_K_M.gguf` | Q5_K_M | Standard K-Quant (5-bit). Validates high-fidelity mixed 5-bit precision layouts. |
| `tinymoeja2m.Q6_K.gguf` | Q6_K | Standard K-Quant (6-bit). Validates 6-bit high-fidelity super-block dequantization. |
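The "block-based uniform scaling" checked by the Q8_0 file can be sketched as below. This is a simplified illustration of the Q8_0 idea (blocks of 32 values, one scale per block, signed 8-bit quants), not a byte-exact reimplementation: the scale is kept as a Python float rather than a 16-bit float, and Python's `round` breaks ties differently from C's `roundf`. The function names and sample weights are hypothetical.

```python
import math

BLOCK_SIZE = 32  # values per Q8_0 block


def quantize_q8_0(values):
    """Split values into blocks of 32 and return (scale, quants) pairs."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of the block size")
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        chunk = values[start:start + BLOCK_SIZE]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0                    # largest value maps to +/-127
        inv = 1.0 / scale if scale else 0.0     # all-zero block: avoid dividing by 0
        blocks.append((scale, [round(v * inv) for v in chunk]))
    return blocks


def dequantize_q8_0(blocks):
    """Reconstruct approximate floats as scale * quant, block by block."""
    return [scale * q for scale, quants in blocks for q in quants]


# Illustrative weights only; real tensors come from the GGUF file.
weights = [math.sin(0.37 * i) for i in range(64)]
blocks = quantize_q8_0(weights)
restored = dequantize_q8_0(blocks)
```

Comparing `restored` against the F32 baseline bounds the error a correct Q8_0 path should introduce: at most half a quantization step per value.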
2. Hugging Face Native Format (./hf/)
Unquantized components formatted for direct instantiation inside the PyTorch transformers library ecosystem:
- `hf/model.safetensors`: Raw unquantized matrix parameters containing all 4 expert sub-networks, GQA projection matrices, and the master router tensor.
- `hf/config.json`: Architectural specifications built around `MixtralConfig`. Fully configured to enforce `num_attention_heads: 4`, `num_key_value_heads: 1`, `max_position_embeddings: 2048`, and `rope_theta: 10000.0`.
- `hf/generation_config.json`: Standard generation defaults.
- `hf/tokenizer.model`: The custom 1,024-vocabulary size SentencePiece BPE master binary trained on a clean Japanese text subset with `byte_fallback` enabled.
- `hf/tokenizer.json`: Evaluated JSON-serialized token maps for high-speed interoperability across native tokenization backends.
- `hf/tokenizer_config.json`: Enforced metadata linking `LlamaTokenizer` classes to guarantee correct handling of prefix spacing and automatic `<s>` (BOS) injection.
- `hf/special_tokens_map.json`: Structural map linking special tokens (`<s>`=1, `</s>`=2, `<unk>`=0, `<pad>`=2).
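When porting the model to a custom engine, it can help to assert the structural fields listed for `hf/config.json` before loading any weights. The sketch below checks a parsed config dictionary against those four values; the helper name and the inline JSON sample are hypothetical stand-ins for a real `json.load` of the downloaded file.

```python
import json

# Values as listed in the file description for hf/config.json.
EXPECTED = {
    "num_attention_heads": 4,
    "num_key_value_heads": 1,
    "max_position_embeddings": 2048,
    "rope_theta": 10000.0,
}


def config_mismatches(config: dict) -> dict:
    """Return {field: (expected, actual)} for every field that differs."""
    return {
        key: (want, config.get(key))
        for key, want in EXPECTED.items()
        if config.get(key) != want
    }


# Stand-in for: json.load(open("hf/config.json"))
sample = json.loads(
    '{"num_attention_heads": 4, "num_key_value_heads": 1,'
    ' "max_position_embeddings": 2048, "rope_theta": 10000.0}'
)
problems = config_mismatches(sample)
```

An empty result means the four fields match; any entry names the field along with the expected and actual values.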
Purpose & Design Philosophy (Verification Targets)
This checkpoint is specifically engineered as a deterministic validation test asset for runtime computing backends and is not designed for practical semantic tasks.
Due to the compact parameter size (~2.05M) and ultra-focused vocabulary layout (1,024 tokens), the network concentrates its capacity entirely on mastering Japanese phrase continuations and basic syntax under an autoregressive framework.
Critical Debugging Capabilities for Custom Engines:
- GQA Broadcast Matrix Multiplication: The 4:1 Grouped-Query Attention structure requires execution kernels to share a single Key/Value cache block across 4 independent Query heads. This makes it a useful testbed for tracking memory stride offsets and tensor broadcasting alignment in parallel compute shaders.
- Multi-Byte UTF-8 Byte Fallback Validation: With the vocabulary limited to 1,024 tokens, any kanji or other character outside the primary training subset triggers the `byte_fallback` mechanism, which breaks the character down into raw sequential UTF-8 byte tokens (typically 3 tokens per character, since most Japanese characters occupy 3 bytes in UTF-8). This stress-tests the engine's streaming decoder, which must stitch unaligned byte streams back into valid Japanese text without truncation or corruption.
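The byte-stitching requirement can be sketched with Python's incremental UTF-8 decoder. The example assumes byte-fallback pieces are spelled `<0xNN>` and that U+2581 marks a space, following the usual SentencePiece conventions; the class name is hypothetical and leading-space trimming is left out for brevity.

```python
import codecs
import re

BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


class StreamingDetokenizer:
    """Turns a stream of token pieces into text without splitting characters."""

    def __init__(self):
        self._utf8 = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def feed(self, piece: str) -> str:
        match = BYTE_PIECE.match(piece)
        if match:
            # Buffer the raw byte; text is emitted only once a character completes.
            return self._utf8.decode(bytes([int(match.group(1), 16)]))
        # A regular piece ends any pending byte run, so flush it first.
        return self.finish() + piece.replace("\u2581", " ")

    def finish(self) -> str:
        """Flush leftover bytes; an incomplete sequence becomes U+FFFD."""
        tail = self._utf8.decode(b"", final=True)
        self._utf8.reset()
        return tail


# A kanji delivered as three byte-fallback tokens between two regular pieces.
byte_pieces = [f"<0x{b:02X}>" for b in "漢".encode("utf-8")]
decoder = StreamingDetokenizer()
chunks = [decoder.feed(p) for p in ["トム", *byte_pieces, "は"]]
text = "".join(chunks) + decoder.finish()
```

The first two byte tokens produce empty strings, and the character appears only when the third byte arrives. That is the behavior a streaming UI needs to avoid printing broken glyphs.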
Usage Examples
A. Running GGUF via llama.cpp
To run the GQA MoE execution graph and evaluate dynamic expert routing directly from your shell:
./llama-cli -m tinymoeja2m.Q4_K_M.gguf -p "トムとリリーは" -n 64 --temp 0.0
B. Loading Hugging Face Formats via Python
import torch
import sentencepiece as spm
from transformers import MixtralForCausalLM
from huggingface_hub import hf_hub_download
# Define target repository identity
repo_id = "shibatch/tinymoeja2m"
print("Downloading and caching specialized tokenizer layer...")
# Fetch tokenizer.model file automatically from Hugging Face Hub
tokenizer_file = hf_hub_download(repo_id=repo_id, subfolder="hf", filename="tokenizer.model")
sp = spm.SentencePieceProcessor()
sp.Load(tokenizer_file)
print("Downloading and loading Mixtral-based 2M MoE model weights...")
model = MixtralForCausalLM.from_pretrained(repo_id, subfolder="hf")
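# Prefer CUDA, then Intel XPU (only present in recent PyTorch builds), then CPU
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    device = "xpu"
else:
    device = "cpu"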
model = model.to(device)
model.eval()
# Prompt text utilizing vocabulary subsets
prompt = "トムとリリーは"
input_ids = [1] + sp.EncodeAsIds(prompt) # Explicitly prepend BOS (1)
input_tensor = torch.tensor([input_ids]).to(device)
print("Executing text generation loop (Validating 4:1 GQA & Top-2 MoE Kernels)...")
with torch.no_grad():
    output_ids = model.generate(
        input_tensor,
        max_length=64,
        do_sample=False,
        pad_token_id=2,
        bos_token_id=1,
        eos_token_id=2
    )
generated_ids = output_ids[0].cpu().tolist()
generated_text = sp.DecodeIds(generated_ids)
print("\n--- Inference Test Result ---")
print("Prompt :", prompt)
print("Generated:", generated_text)
Model Specifications
- Architecture: Mixtral (`MixtralForCausalLM`)
- Dataset: TinyStories Japanese Translation Corpus (320k stories)
- Total Parameters (`num_local_experts` = 4): ~2.05M
- Active Parameters (`num_experts_per_tok` = 2): ~1.14M
- Vocabulary Size (`vocab_size`): 1,024 (Custom SentencePiece BPE with `byte_fallback` enabled)
- Hidden Size (`hidden_size`): 128
- Number of Hidden Layers (`num_hidden_layers`): 3
- Number of Attention Heads (`num_attention_heads` / `num_key_value_heads`): 4 / 1 (4:1 GQA layout)
- Individual Expert Internal Dimension (`intermediate_size`): 352 (SwiGLU structure)
- Max Position Embeddings (`max_position_embeddings`): 2,048
- RoPE Base Frequency (`rope_theta`): 10,000.0
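The `num_local_experts` = 4 and `num_experts_per_tok` = 2 settings correspond to the routing step sketched below: softmax over the router logits, keep the two largest, renormalize, and mix the chosen experts' outputs. The experts here are simple stand-in functions rather than real SwiGLU networks, and all names and values are hypothetical.

```python
import math

NUM_EXPERTS, TOP_K = 4, 2


def route_top_k(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight)] for the top_k experts; weights sum to 1."""
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, router_logits, experts):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    out = [0.0] * len(x)
    for index, weight in route_top_k(router_logits):
        y = experts[index](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out


# Stand-in experts: expert i just scales its input by (i + 1).
experts = [lambda x, k=i + 1: [k * v for v in x] for i in range(NUM_EXPERTS)]
logits = [0.1, 2.0, -1.0, 1.0]   # illustrative router output for one token
routing = route_top_k(logits)
mixed = moe_forward([1.0, 2.0], logits, experts)
```

With these logits, experts 1 and 3 are selected and the other two contribute nothing. This is why only part of the total parameter count is active for any single token.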
License
- License: MIT License. You are completely free to duplicate, modify, distribute, and utilize these assets across any commercial, personal, or educational environments.