TinyStories Llama2 1M MQA (tinymqa1m) GGUF & HF Validation Suite

This repository provides an ultra-lightweight Llama 2 variant that combines a custom BPE tokenizer with a strict MQA (Multi-Query Attention) layout. It is trained on the TinyStories dataset and is intended as a test asset for validating compilers, runtimes, and hardware kernels.


📊 Comparison: tinymqa1m vs Previous Variants

To help you choose the right test asset for your engine debugging goals, the table below summarizes the architectural differences across the 1M-parameter suite:

| Feature / Metric | tiny1m (Standard) | tinybpe1m (BPE Variant) | tinymqa1m (This Repository) |
|---|---|---|---|
| Attention Mechanism | MHA (Multi-Head Attention) | MHA (Multi-Head Attention) | MQA (Multi-Query Attention) |
| Attention Heads ($N_{\text{heads}} / N_{\text{kv\_heads}}$) | 2 Heads / 2 KV Heads | 2 Heads / 2 KV Heads | 4 Heads / 1 KV Head (Asymmetric) |
| Tokenizer Type | Simple Character-level | SentencePiece BPE | SentencePiece BPE |
| Byte Fallback Support | No | Yes (byte_fallback=True) | Yes (byte_fallback=True) |
| llama2.c Compatibility | Fully Compatible (run.c) | Incompatible (Corrupts text) | Incompatible (Crashes/Corrupts) |
| Primary Debug Target | Core matrix multiplication & layout | byte_fallback decoder loop | KV-cache alignment & broadcast |

Why test with tinymqa1m?

Modern architectures such as Llama 3, Gemma, and Mistral rely on GQA (Grouped-Query Attention) or MQA to reduce memory-bandwidth pressure. Implementing these attention patterns in custom inference engines (C/C++, Vulkan, etc.) frequently introduces boundary bugs in KV-cache tensor indexing. This model lets you validate KV-cache broadcasting logic at a 1M-parameter scale, with negligible memory overhead.
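As a rough illustration of the indexing involved, the sketch below maps each query head to the KV head it reads from and computes a flat cache offset. The function names and the [position, kv_head, head_dim] cache layout are assumptions made for this example, not something this repository prescribes; a real engine may order its cache differently.

```python
def kv_head_for(q_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Map a query head index to the KV head it shares (GQA/MQA grouping)."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads  # query heads per KV head
    return q_head // group_size


def kv_cache_offset(pos: int, kv_head: int, n_kv_heads: int, head_dim: int) -> int:
    """Flat offset of one cached K (or V) vector.

    Assumes a contiguous [position, kv_head, head_dim] buffer (hypothetical layout).
    """
    return (pos * n_kv_heads + kv_head) * head_dim


# tinymqa1m values: 4 query heads, 1 KV head, head_dim 32
N_HEADS, N_KV_HEADS, HEAD_DIM = 4, 1, 32

mapping = [kv_head_for(h, N_HEADS, N_KV_HEADS) for h in range(N_HEADS)]
print(mapping)  # [0, 0, 0, 0]: every query head reads KV head 0

# The per-position stride is n_kv_heads * head_dim (32), not n_heads * head_dim (128).
print(kv_cache_offset(3, 0, N_KV_HEADS, HEAD_DIM))  # 96
```

A typical bug this model can expose is sizing or striding the cache with the query head count; with 4 query heads and 1 KV head, that mistake reads past the end of each cached row almost immediately.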


📂 Repository Structure & File Descriptions

1. GGUF Formats (Root Directory ./)

A complete suite compiled for llama.cpp and compatible modern custom runtimes. The structural MQA hyper-parameters and specialized token layouts are fully baked into each GGUF binary:

| Filename(s) / Wildcard Pattern | Type | Size | Purpose / Validation Target |
|---|---|---|---|
| tinymqa1m.F32.gguf | F32 | ~4.0 MB | Baseline Test. Validates GGUF parsing, MQA tensor layout, matrix dimensions, and RoPE indexing without dequantization factors. |
| tinymqa1m.F16.gguf, tinymqa1m.BF16.gguf | F16, BF16 | ~2.0 MB | Half-Precision Test. Validates 16-bit float loading, tensor broadcasting, and structural inference stability. |
| tinymqa1m.Q8_0.gguf | Q8_0 | ~1.1 MB | Quantization Level 1. Validates block-based uniform scaling with 32 elements under MQA dimensions. |
| tinymqa1m.Q4_0.gguf, tinymqa1m.Q4_1.gguf | Q4_0, Q4_1 | ~0.7 MB | Quantization Level 2. Validates classic 4-bit linear quantization and bit-unpacking logic. |
| tinymqa1m.Q2_K.gguf | Q2_K | ~0.5 MB | Standard K-Quant (2-bit). Validates 2-bit super-block quantization parsing. |
| tinymqa1m.Q3_K_*.gguf (↳ tinymqa1m.Q3_K_S.gguf, tinymqa1m.Q3_K_M.gguf, tinymqa1m.Q3_K_L.gguf) | Q3_K | ~0.6 MB | Standard K-Quant (3-bit). Validates Small, Medium, and Large sub-variants of 3-bit multi-block structures. |
| tinymqa1m.Q4_K_*.gguf (↳ tinymqa1m.Q4_K_S.gguf, tinymqa1m.Q4_K_M.gguf) | Q4_K | ~0.7 MB | Standard K-Quant (4-bit). Validates Small and Medium sub-variants of modern 4-bit super-block structural parsing. |
| tinymqa1m.Q5_K_*.gguf (↳ tinymqa1m.Q5_K_S.gguf, tinymqa1m.Q5_K_M.gguf) | Q5_K | ~0.8 MB | Standard K-Quant (5-bit). Validates Small and Medium sub-variants of 5-bit mixed precision super-blocks. |
| tinymqa1m.Q6_K.gguf | Q6_K | ~0.9 MB | Standard K-Quant (6-bit). Validates 6-bit high-fidelity super-block quantization. |
| tinymqa1m.IQ3_*.gguf (↳ tinymqa1m.IQ3_XXS.gguf, tinymqa1m.IQ3_S.gguf) | I-Quants | ~0.5 MB | Importance Quants (3-bit). Non-linear 3-bit importance quantization targeting lookup table (codebook) decoding logic. |
| tinymqa1m.IQ4_*.gguf (↳ tinymqa1m.IQ4_NL.gguf, tinymqa1m.IQ4_XS.gguf) | I-Quants | ~0.6 MB | Importance Quants (4-bit). Non-linear 4-bit importance quantization variants (Non-Linear and Extra Small). |
| tinymqa1m.TQ1_0.gguf, tinymqa1m.TQ2_0.gguf | Ternary | ~0.4 MB | Experimental. Ternary (-1, 0, 1) state quantization for cutting-edge engine testing. |
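To make the "block-based scaling" and "bit-unpacking" targets concrete, the sketch below decodes one Q8_0 block and one Q4_0 block. It assumes the ggml reference layout as commonly described: a little-endian fp16 scale followed by 32 signed bytes for Q8_0, or by 16 packed bytes for Q4_0, where low nibbles fill the first half of the block and high nibbles the second. Check these details against the ggml source before relying on them; the function names are illustrative only.

```python
import struct

QK = 32  # elements per block for both Q8_0 and Q4_0


def dequantize_q8_0(block: bytes) -> list:
    """Decode one 34-byte Q8_0 block: fp16 scale, then 32 int8 values."""
    (d,) = struct.unpack_from("<e", block, 0)
    qs = struct.unpack_from("<32b", block, 2)
    return [d * q for q in qs]


def dequantize_q4_0(block: bytes) -> list:
    """Decode one 18-byte Q4_0 block: fp16 scale, then 16 bytes of packed nibbles."""
    (d,) = struct.unpack_from("<e", block, 0)
    out = [0.0] * QK
    for j, byte in enumerate(block[2:18]):
        out[j] = ((byte & 0x0F) - 8) * d          # low nibble -> element j
        out[j + QK // 2] = ((byte >> 4) - 8) * d  # high nibble -> element j + 16
    return out


# Synthetic blocks built by hand (not read from any file in this repository).
q8_block = struct.pack("<e32b", 0.5, *range(-16, 16))
q8 = dequantize_q8_0(q8_block)
print(q8[0], q8[31])  # -8.0 7.5

q4_block = struct.pack("<e", 1.0) + bytes([0xF0] * 16)
q4 = dequantize_q4_0(q4_block)
print(q4[0], q4[16])  # -8.0 7.0
```

Comparing a decoder like this against the F32 file's weights, tensor by tensor, is one way to separate quantization bugs from attention bugs.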

2. Hugging Face Native Format (./hf/)

Standard configuration, tokenizer, and weight files used by the Hugging Face transformers library (PyTorch):

  • hf/model.safetensors: Unquantized native model parameters using explicit MQA structures.
  • hf/config.json: Architectural settings specifying the asymmetrical head layout (num_attention_heads: 4, num_key_value_heads: 1).
  • hf/generation_config.json: Default generation settings.
  • hf/tokenizer_config.json: Tokenizer behavior configuration, enabling automatic <s> (BOS) injection and defining padding behavior.
  • hf/special_tokens_map.json: Maps special-token roles (such as BOS and EOS) to the tokens the tokenizer uses internally.
  • hf/tokenizer.model: The master 512-vocab SentencePiece tokenizer binary file.
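Because the tokenizer enables byte_fallback, a custom decoder has to merge byte pieces back into UTF-8 text. The sketch below assumes the usual SentencePiece conventions (byte pieces spelled <0xNN>, and U+2581 standing in for a space); the function name and the sample pieces are hypothetical and are not taken from this model's vocabulary.

```python
import re

# SentencePiece-style byte piece, e.g. "<0xE3>" (assumed convention)
BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def decode_pieces(pieces) -> str:
    """Join pieces into text, merging <0xNN> byte pieces into UTF-8 characters."""
    buf = bytearray()
    for piece in pieces:
        match = BYTE_PIECE.match(piece)
        if match:
            # Raw byte: append it and let the final decode assemble the character.
            buf.append(int(match.group(1), 16))
        else:
            buf.extend(piece.replace("\u2581", " ").encode("utf-8"))
    # Decode once at the end so multi-byte characters split across pieces survive.
    return buf.decode("utf-8", errors="replace")


sample = ["\u2581Tom", "\u2581says", "\u2581", "<0xE3>", "<0x81>", "<0x82>"]
print(decode_pieces(sample))
```

Decoding each byte piece on its own is the typical failure mode here: a three-byte character then turns into three replacement characters. This sketch also keeps the leading space that a full SentencePiece decoder would normally strip.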

🚀 Usage Examples

A. Running GGUF via llama.cpp

To check that your local runtime executes the model correctly, or to inspect token generation under the MQA configuration:

./llama-cli -m tinymqa1m.Q4_K_M.gguf -p "Tom and Jerry are " -n 64 --temp 0.0
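When pointing a custom runtime at one of these files, it can help to sanity-check the fixed-size header first. The sketch below assumes the GGUF v2/v3 header layout (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key-value count, all little-endian); the function name is hypothetical and the demo values are made up rather than read from this repository's files.

```python
import struct

GGUF_HEADER = struct.Struct("<4sIQQ")  # magic, version, tensor_count, metadata_kv_count


def read_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (v2/v3 layout assumed)."""
    if len(data) < GGUF_HEADER.size:
        raise ValueError(f"need at least {GGUF_HEADER.size} bytes")
    magic, version, tensor_count, kv_count = GGUF_HEADER.unpack_from(data, 0)
    if magic != b"GGUF":
        raise ValueError(f"bad magic: {magic!r}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for demonstration; the counts are placeholders.
demo = GGUF_HEADER.pack(b"GGUF", 3, 10, 20)
print(read_gguf_header(demo))
```

For a real file, pass the first 24 bytes, for example `read_gguf_header(open("tinymqa1m.F32.gguf", "rb").read(24))`. The MQA hyper-parameters themselves live in the metadata key-value section that follows this header, which the sketch does not parse.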

B. Loading Hugging Face Formats via Python

Because the runtime metadata (tokenizer_config.json / special_tokens_map.json) is fully populated, you can load the model directly with standard Hugging Face classes; no custom wrapper code is needed.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "shibatch/tinymqa1m"

print("Loading tokenizer and MQA model configuration...")
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="hf")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

prompt = "Tom and Jerry are "
# Formatting and <s> (BOS) insertion are handled automatically via configuration metadata
inputs = tokenizer(prompt, return_tensors="pt").to(device)

print("Executing text generation loop (Validating MQA projection tensors)...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=64,
        do_sample=False
    )
    
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("\n--- Inference Test Result ---")
print("Prompt   :", prompt)
print("Generated:", generated_text)

Model Specifications

The attention layers pair 4 query heads with a single shared key-value head, so any implementation must broadcast that one KV head across all query heads.

  • Architecture: Llama 2 with Multi-Query Attention (MQA)
  • Dataset: TinyStories
  • Total Parameters: ~1M (Exactly 896,256 parameters)
  • Vocabulary Size: 512 (Custom SentencePiece BPE with byte_fallback enabled)
  • Hidden Size (hidden_size): 128
  • Number of Hidden Layers (num_hidden_layers): 4
  • Number of Attention Heads (num_attention_heads): 4 (head_dim = 32)
  • Number of Key-Value Heads (num_key_value_heads): 1 (strict MQA: all 4 query heads share one KV head)
  • Intermediate Size (intermediate_size): 352
  • Max Position Embeddings (max_position_embeddings): 256
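The listed dimensions can be cross-checked with a short weight count. The sketch below assumes a standard Llama layout with no bias terms, two RMSNorm weight vectors per layer plus a final norm, and a separate (untied) output head; the function name is hypothetical. Under these assumptions the total comes to 836,736, which differs from the exact figure quoted in the list, so that figure may count additional tensors or reflect a different configuration.

```python
def llama_param_count(vocab, hidden, layers, n_heads, n_kv_heads,
                      intermediate, tied_embeddings=False):
    """Count weights in a bias-free Llama-style decoder (assumed layout)."""
    head_dim = hidden // n_heads
    kv_dim = n_kv_heads * head_dim            # 32 for MQA here, 128 for full MHA

    attn = 2 * hidden * hidden                # q_proj and o_proj
    attn += 2 * hidden * kv_dim               # k_proj and v_proj (shrunk by MQA)
    mlp = 3 * hidden * intermediate           # gate, up, and down projections
    norms = 2 * hidden                        # two RMSNorm vectors per layer

    total = vocab * hidden                    # token embeddings
    total += layers * (attn + mlp + norms)
    total += hidden                           # final RMSNorm
    if not tied_embeddings:
        total += vocab * hidden               # separate output head
    return total


mqa = llama_param_count(512, 128, 4, n_heads=4, n_kv_heads=1, intermediate=352)
mha = llama_param_count(512, 128, 4, n_heads=4, n_kv_heads=4, intermediate=352)
print(mqa)        # 836736
print(mha - mqa)  # 98304 weights saved by sharing one KV head across 4 layers
```

The saving comes entirely from k_proj and v_proj shrinking from 128x128 to 128x32 in each layer, which is also why the K and V tensors in these files have a different shape from the Q tensor.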

📜 Acknowledgments & License

  • Original Implementation: Inspired by Andrej Karpathy's llama2.c project.
  • Dataset: TinyStories dataset.
  • License: MIT License. You are free to use, modify, and distribute these assets for any purpose.