TinyStories Mixtral 2M Top-2 MoE (tinymoe2m) GGUF & HF Validation Suite
This repository provides an ultra-lightweight Mixtral model variant (a Mixture-of-Experts architecture built on the Llama 2 layer layout) scaled down to about 1.95M total parameters, of which about 1.14M are active per token. It is trained on the TinyStories dataset and intended as a validation asset rather than a general-purpose language model.
It is designed specifically for debugging custom inference engines and native tensor compilers against MoE-specific runtime features: gating-network weight allocation, token distribution and gathering (scatter/gather loops), and the weighted sum that combines the outputs of multiple independent experts.
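The routing step described above can be sketched for a single token in plain Python. This is a minimal illustration, not code from this repository: the helper names `softmax` and `top2_moe` are hypothetical, and it assumes the common Mixtral-style order of operations (softmax over all router logits, keep the two largest, renormalize, then take the weighted sum). For simplicity every expert output is passed in precomputed, whereas a real engine would only evaluate the two selected experts.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top2_moe(router_logits, expert_outputs):
    """Combine the two highest-scoring experts for one token.

    router_logits: one float per expert (output of the gating network).
    expert_outputs: one output vector per expert (precomputed here for clarity).
    """
    probs = softmax(router_logits)
    # Indices of the two experts with the highest routing probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    # Renormalize so the two selected weights sum to 1.
    norm = sum(probs[i] for i in top)
    weights = {i: probs[i] / norm for i in top}
    dim = len(expert_outputs[0])
    # Weighted sum of the selected experts' outputs.
    mixed = [sum(weights[i] * expert_outputs[i][d] for i in top) for d in range(dim)]
    return top, weights, mixed

top, weights, mixed = top2_moe(
    [0.0, 2.0, 1.0, -1.0],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]],
)
print(top, weights, mixed)
```

Comparing the selected indices and renormalized weights against a reference implementation for the same logits is a quick way to localize routing bugs before looking at the expert FFNs themselves.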
Comparison: tinymoe2m vs Other 1M Variants
To help track feature coverage across the 1M/2M verification suite, the core structural layouts are outlined below:
| Feature / Metric | tiny1m (Standard) | tinybpe1m (BPE Variant) | tinygemma1m (Gemma 2 Variant) | tinymoe2m (This Repository) |
|---|---|---|---|---|
| Base Architecture | Llama 2 | Llama 2 | Gemma 2 | Llama 2 (Mixtral Format) |
| FFN Structure | Single FFN (Dense) | Single FFN (Dense) | Single FFN (Dense) | Mixture-of-Experts (MoE) |
| Attention Mechanism | MHA (Multi-Head) | MHA (Multi-Head) | GQA (Grouped-Query) | MHA (Multi-Head) |
| Total Experts | 1 (Non-MoE) | 1 (Non-MoE) | 1 (Non-MoE) | 4 Experts |
| Selected Experts | - | - | - | Top-2 Experts |
| Expert FFN Dim (`intermediate_size`) | 564 | 352 | 352 | 352 (Shared across all experts) |
| Total Parameters | ~1.2M | ~1.0M | ~1.0M | ~1.95M (1.95M Total) |
| Active Parameters | ~1.2M | ~1.0M | ~1.0M | ~1.14M (1.14M Active) |
| Primary Debug Target | Core matrix mult & layout | `byte_fallback` decode | Gemma 2 advanced graph | Dynamic Routing & Scatter/Gather |
Compute Cost vs Capacity Optimization
With a total parameter count of approximately 1.95M, this model has roughly twice the capacity of the standard 1M dense variants, which helps it keep a stable command of the grammar and phrasing found in the TinyStories corpus. Because only the top-2 experts fire per token, the active parameter count is capped at ~1.14M. This reproduces, at small scale, the basic benefit of MoE architectures: the model's total capacity grows by about 2x while the added floating-point operation (FLOP) cost stays at roughly 1.1x to 1.2x that of a 1M dense counterpart.
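The two figures can be reproduced from the dimensions listed under Model Specifications. The sketch below is an estimate under stated assumptions rather than a dump of the actual checkpoint: it assumes untied input and output embeddings, no bias terms, one weight vector per RMSNorm, and three projection matrices per SwiGLU expert. The function name `mixtral_param_counts` is hypothetical.

```python
def mixtral_param_counts(vocab=512, hidden=128, layers=3,
                         inter=352, experts=4, top_k=2):
    """Estimate (total, active) parameter counts for a Mixtral-style model.

    Assumes untied embeddings, no biases, and MHA with as many KV heads
    as query heads, so Q, K, V and O are all hidden x hidden.
    """
    attn = 4 * hidden * hidden      # Q, K, V, O projections
    router = hidden * experts       # gating network
    expert = 3 * hidden * inter     # gate, up and down projections (SwiGLU)
    norms = 2 * hidden              # two RMSNorm weight vectors per layer
    # Parameters every token touches regardless of routing:
    # per-layer attention, router and norms, the final norm,
    # and the input embedding plus output head.
    shared = layers * (attn + router + norms) + hidden + 2 * vocab * hidden
    total = shared + layers * experts * expert
    active = shared + layers * top_k * expert
    return total, active

total, active = mixtral_param_counts()
print(total, active)  # 1952128 1141120
```

Under these assumptions the totals come to 1,952,128 and 1,141,120, which agree with the ~1.95M and ~1.14M quoted above.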
Repository Structure & File Descriptions
1. GGUF Formats (Root Directory ./)
Binary files for execution via llama.cpp or compatible lower-level inference engines. Upstream parsers should recognize this architecture automatically under the Mixtral type descriptor.
| Filename | Type | Size | Target / Validation Focus |
|---|---|---|---|
| `tinymoe2m.F32.gguf` | F32 | ~8.0 MB | Baseline Test. Eliminates quantization noise to isolate and verify the raw probability mathematics of the Gating network and expert tensor synthesis. |
| `tinymoe2m.F16.gguf`, `tinymoe2m.BF16.gguf` | F16, BF16 | ~4.0 MB | Half-Precision Test. Evaluates 16-bit floating-point unpacking routines and stability under parallelized accumulation layers. |
| `tinymoe2m.Q8_0.gguf` | Q8_0 | ~2.2 MB | Standard Quantization. Verifies block-based uniform scaling (32-element blocks) across decentralized MoE structures. |
| `tinymoe2m.Q4_0.gguf`, `tinymoe2m.Q4_1.gguf` | Q4_0, Q4_1 | ~1.4 MB | Classic Quantization. Tests 4-bit linear scaling and unpacking logic across multiple discontinuous expert weight matrices. |
| `tinymoe2m.Q2_K.gguf` | Q2_K | ~1.1 MB | Standard K-Quant (2-bit). Evaluates mixed super-block dequantization loops feeding sparse FFN routines. |
| `tinymoe2m.Q3_K_M.gguf` | Q3_K_M | ~1.2 MB | Standard K-Quant (3-bit). Tests sub-variant multi-block layouts handling dynamic routing vectors. |
| `tinymoe2m.Q4_K_M.gguf` | Q4_K_M | ~1.4 MB | Standard K-Quant (4-bit). The baseline testing target for modern 4-bit super-block logic coupled with MoE paths. |
| `tinymoe2m.Q5_K_M.gguf` | Q5_K_M | ~1.5 MB | Standard K-Quant (5-bit). Validates high-fidelity mixed 5-bit precision layouts. |
| `tinymoe2m.Q6_K.gguf` | Q6_K | ~1.7 MB | Standard K-Quant (6-bit). Validates 6-bit high-fidelity super-block dequantization. |
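To make the "block-based uniform scaling (32-element blocks)" of Q8_0 concrete, here is a simplified round trip in plain Python. It is a sketch of the idea, not the ggml implementation: the function names are hypothetical, the per-block scale is kept as a Python float instead of being stored as a 16-bit float, and Python's `round` breaks ties differently from C's `roundf`, so results may differ from a real engine in the last bit.

```python
BLOCK_SIZE = 32  # elements that share one scale in a Q8_0 block

def quantize_q8_0(values):
    """Quantize floats into a list of (scale, int8 values) blocks."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of 32")
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        amax = max(abs(v) for v in block)
        scale = amax / 127.0  # largest magnitude maps to +/-127
        if scale == 0.0:
            quants = [0] * BLOCK_SIZE
        else:
            # Round to nearest integer and clamp into the int8 range used.
            quants = [max(-127, min(127, round(v / scale))) for v in block]
        blocks.append((scale, quants))
    return blocks

def dequantize_q8_0(blocks):
    """Reverse the mapping: each value is scale * quantized integer."""
    out = []
    for scale, quants in blocks:
        out.extend(scale * q for q in quants)
    return out

weights = [i / 10 - 1.6 for i in range(32)]
restored = dequantize_q8_0(quantize_q8_0(weights))
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

Because each block carries its own scale, the per-element error is bounded by half of that block's scale, which is a convenient tolerance when comparing a custom dequantizer against the F32 file.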
2. Hugging Face Native Format (./hf/)
Unquantized components formatted for direct instantiation inside the PyTorch transformers library ecosystem:
- `hf/model.safetensors`: Raw unquantized matrix parameters containing all 4 expert sub-networks alongside the master router tensor.
- `hf/config.json`: Architectural specifications built around `MixtralConfig` criteria (layer depth, head maps, absolute expert counts, and top-k selection targets).
- `hf/generation_config.json`: Standard generation defaults.
- `hf/tokenizer.model`: The custom 512-vocabulary size SentencePiece BPE master binary.
- `hf/tokenizer_config.json`: Metadata linking `LlamaTokenizer` classes to guarantee correct handling of prefix spacing and manage automatic `<s>` (BOS) injection properly on the Hugging Face backend.
- `hf/special_tokens_map.json`: Structural map linking token strings (`<s>`=1, `</s>`=2) back to internal index bounds.
Usage Examples
A. Running GGUF via llama.cpp
To run the MoE execution graph and exercise dynamic expert routing from your shell:

    ./llama-cli -m tinymoe2m.Q4_K_M.gguf -p "Tom and Jerry are " -n 64 --temp 0.0
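When a custom engine rejects one of these files, a first sanity check is whether the fixed-size GGUF header parses at all. The sketch below reads only that header. It assumes a little-endian file in GGUF version 2 or later, where the header is the 4-byte magic `GGUF`, a 32-bit version, a 64-bit tensor count and a 64-bit metadata key/value count; the function name `read_gguf_header` is hypothetical, and parsing the metadata itself is out of scope here.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensors) + 8 (metadata KVs)

def read_gguf_header(path):
    """Return version, tensor count and metadata KV count of a GGUF file."""
    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE or raw[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not start with a GGUF header")
    # "<" selects little-endian with no padding: uint32, uint64, uint64.
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:HEADER_SIZE])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

A call such as `read_gguf_header("tinymoe2m.Q4_K_M.gguf")` should return the same counts for every quantization of the model, since only the tensor encodings differ between files.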
B. Loading Hugging Face Formats via Python
Because the model configuration matches the custom vocabulary, the standard `Auto*` loaders work as-is and no custom wrapper code is needed.

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    repo_id = "shibatch/tinymoe2m"

    # The Hugging Face files live in the hf/ subfolder of the repository.
    print("Loading MoE configuration and tokenizer layers...")
    tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="hf")
    model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    model.eval()

    prompt = "Tom and Jerry are "
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Greedy decoding keeps the output deterministic for comparison runs.
    print("Running inference loop (Validating Top-2 sparse routing matrices)...")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=64,
            do_sample=False,
        )

    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("\n--- Inference Test Result ---")
    print("Prompt :", prompt)
    print("Generated:", generated_text)
Model Specifications
- Architecture: Mixtral (`MixtralForCausalLM`)
- Dataset: TinyStories
- Total Parameters (`num_local_experts` = 4): ~1.95M
- Active Parameters (`num_experts_per_tok` = 2): ~1.14M
- Vocabulary Size (`vocab_size`): 512 (Custom SentencePiece BPE with `byte_fallback` enabled)
- Hidden Size (`hidden_size`): 128
- Number of Hidden Layers (`num_hidden_layers`): 3
- Number of Attention Heads (`num_heads` / `num_kv_heads`): 2 / 2 (MHA layout)
- Individual Expert Internal Dimension (`intermediate_size`): 352 (SwiGLU structure)
- Max Position Embeddings (`max_position_embeddings`): 256
License
- License: MIT License. You are free to copy, modify, distribute, and use these assets in commercial, personal, or educational settings.