- ogma-small · 8.6M efficient text embedding model · MTEB 56.32
- Why the name Ogma?
- Use cases
- Highlights
- Performance
- Architecture
- Usage
- Model Family
- Training Details
- Limitations
- Licence & Attribution
- Citation
ogma-small · 8.6M efficient text embedding model · MTEB 56.32
Efficient English text embedding model for semantic search, RAG, vector search, retrieval, clustering, classification, STS, and agent memory – MTEB 56.32, 8.6M parameters, 1024-token context
Ogma Small is the flagship efficiency model in the family. At 8.6M parameters it scores an MTEB average of 56.32 in our canonical 66-task Ogma paper results, while using only 38% of MiniLM-L6-v2's parameters, running 1.75× faster on CPU, and handling inputs 4× longer (1024 vs 256 tokens). It is purpose-built as a drop-in replacement for every place you currently reach for MiniLM.
Why the name Ogma?
Ogma is named after Ogma (also written Oghma), the Irish god associated with eloquence and credited in myth with inventing Ogham, an early alphabet for encoding language into symbols. That is the core job of an embedding model: turn language into compact vectors that machines can search, compare, cluster, and reason over.
Use cases
ogma-small is the default efficiency model for semantic search, RAG retrieval, agent memory, vector databases, document retrieval, text classification, clustering, STS / sentence similarity, and lightweight reranking pipelines. It is aimed at teams looking for a small, fast MiniLM-style embedding model with longer context and strong MTEB quality.
Good fits:
- On-device or local-first applications where MiniLM-class quality is useful but model size, CPU latency, and privacy matter.
- Production RAG systems that need affordable embeddings for documents, chunks, tickets, chats, and internal knowledge bases.
- Agent memory and tool-use systems where frequent embedding calls should stay cheap and local when possible.
- Vector search at scale where smaller models and Matryoshka sub-dimensions can reduce index size and query cost.
- Classification and clustering features for safety filters, routing, topic grouping, deduplication, and analytics.
Choose ogma-small when you want the best balance of quality, speed, size, and deployability across edge, local, and server workloads.
Highlights
- MTEB avg 56.32 – canonical Ogma paper result over 66/66 MTEB English tasks
- 1.75× faster than MiniLM on single-threaded CPU inference (92.9 vs 53.1 docs/s)
- 1024-token context – 4× longer than all-MiniLM-L6-v2 (256 tokens)
- Symmetric routing via task tokens – encode everything with `[SYM]`, or use `[QRY]`/`[QRY]` for retrieval (queries and documents both encoded with `task="qry"`); benchmark both routes on your task
- Matryoshka dims: [256, 128, 64, 32] – one model, any precision
- +4.0% F1 on prompt injection detection vs MiniLM (same architecture series)
Performance
MTEB English – 66/66 tasks (category-averaged)
Benchmarked with MTEB v2.10.7 on the standard 66-task English benchmark using category averaging (same methodology as the MTEB leaderboard).
| Category | ogma-small | all-MiniLM-L6-v2 | Δ vs MiniLM |
|---|---|---|---|
| Classification | 66.49 | 62.62 | +3.87 |
| Clustering | 40.69 | 41.94 | -1.25 |
| PairClassification | 82.91 | 82.37 | +0.54 |
| Reranking | 50.51 | 58.04 | -7.53 |
| Retrieval | 42.05 | 41.95 | +0.10 |
| STS | 82.00 | 78.90 | +3.10 |
| Summarization | 29.59 | 30.81 | -1.22 |
| Overall | 56.32 | 56.09 | +0.23 |
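Category averaging means the overall figure is simply the unweighted mean of the seven category scores, so it can be verified by hand from the ogma-small column above:

```python
# Overall MTEB score = unweighted mean of the seven category averages
# (ogma-small column from the table above).
categories = {
    "Classification": 66.49,
    "Clustering": 40.69,
    "PairClassification": 82.91,
    "Reranking": 50.51,
    "Retrieval": 42.05,
    "STS": 82.00,
    "Summarization": 29.59,
}
print(f"{sum(categories.values()) / len(categories):.2f}")  # 56.32
```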
Why choose Ogma Small?
ogma-small is the default recommendation for most use cases. It is MiniLM-class quality while being faster, smaller, and context-aware. Use ogma-base when you need the extra quality margin; use ogma-mini when you need to go sub-4M parameters.
Safety – Toxicity & Prompt Injection Detection
Evaluated using embeddings from the Ogma transformer architecture (same family). Embeddings are extracted and then fed to a logistic regression (LR) or MLP classifier head – the embedding model itself is not fine-tuned. all-MiniLM-L6-v2 serves as the baseline.
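A minimal sketch of this protocol, assuming `model.embed` returns torch tensors as in the Usage section below; the `task="sym"` choice follows the card's default for classification, scikit-learn stands in for the classifier heads, and `train_texts`/`test_texts` with their labels are placeholders for the dataset under evaluation:

```python
# Frozen-embedding protocol sketch: Ogma is a feature extractor only;
# just the classifier head is trained. Dataset loading is omitted.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("axiotic/ogma-small", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("axiotic/ogma-small", trust_remote_code=True)

X_train = model.embed(train_texts, task="sym", tokenizer=tok).cpu().numpy()
X_test = model.embed(test_texts, task="sym", tokenizer=tok).cpu().numpy()

head = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
probs = head.predict_proba(X_test)[:, 1]
print("F1:     ", f1_score(test_labels, (probs >= 0.5).astype(int)))
print("AUC-ROC:", roc_auc_score(test_labels, probs))
```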
1. Jigsaw Toxic Comment Classification
Dataset: Arsive/toxicity_classification_jigsaw – Binary toxicity classification
Train: 25,960 · Test: 6,490
| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| Ogma | LogReg | 89.12% | 88.26% | 89.09% | 87.44% | 95.74% |
| Ogma | MLP | 88.91% | 87.98% | 89.14% | 86.85% | 95.92% |
| MiniLM | LogReg | 87.32% | 86.25% | 87.46% | 85.07% | 94.96% |
| MiniLM | MLP | 91.71% | 91.24% | 90.13% | 92.39% | 97.16% |
Ogma (LR) leads MiniLM (LR) by +2.01% F1. MiniLM (MLP) leads on this dataset – the additional training data (25K samples) allows the MLP to compensate for MiniLM's slightly weaker base representations.
2. Prompt Injection Detection – deepset/prompt-injections
Dataset: deepset/prompt-injections – Binary injection detection
Train: 546 · Test: 116 (low-data regime)
| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| Ogma | LogReg | 86.21% | 84.62% | 100.00% | 73.33% | 97.77% |
| Ogma | MLP | 90.52% | 90.27% | 96.23% | 85.00% | 98.10% |
| MiniLM | LogReg | 82.76% | 80.39% | 97.62% | 68.33% | 94.52% |
| MiniLM | MLP | 87.07% | 86.24% | 95.92% | 78.33% | 93.96% |
Ogma leads across both classifiers: +4.03% F1 (MLP), +4.23% F1 (LogReg). Ogma's representations are better separated in the low-data regime – it achieves 100% precision with LogReg, meaning zero false positives.
3. Prompt Injection Detection – neuralchemy/Prompt-injection-dataset
Dataset: neuralchemy/Prompt-injection-dataset – Binary injection detection
Train: 4,391 · Test: 942
| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| Ogma | LogReg | 95.22% | 95.93% | 95.84% | 96.01% | 99.30% |
| Ogma | MLP | 95.44% | 96.16% | 94.89% | 97.46% | 99.37% |
| MiniLM | LogReg | 94.59% | 95.38% | 95.46% | 95.29% | 98.92% |
| MiniLM | MLP | 93.95% | 94.85% | 94.59% | 95.11% | 98.92% |
Ogma leads across all metrics: its best head (MLP) beats MiniLM's best (LR) by +0.78% F1, and the LR-to-LR gap is +0.55% F1. Both models perform well at scale; Ogma maintains its edge and achieves higher AUC-ROC (99.37% vs 98.92%).
Summary
| Task | Ogma best F1 | MiniLM best F1 | ฮ |
|---|---|---|---|
| Jigsaw Toxicity | 88.26% (LR) | 91.24% (MLP) | −2.98% |
| deepset Injection | 90.27% (MLP) | 86.24% (MLP) | +4.03% |
| neuralchemy Injection | 96.16% (MLP) | 95.38% (LR) | +0.78% |
Ogma is a stronger feature extractor for prompt injection detection – the safety-critical task for agent pipelines. MiniLM edges ahead on toxicity when given sufficient labelled data and a more powerful classifier head. For agentic use cases where detecting adversarial instructions is the priority, Ogma representations are the better choice.
Architecture
| Property | Value |
|---|---|
| Architecture | Custom Transformer |
| Internal dim (d_model) | 256 |
| Output dim (d_output) | 256 |
| Transformer layers | 6 |
| Attention heads | 4 |
| Vocabulary | 30,000 (SentencePiece / AlbertTokenizer) |
| Max sequence length | 1,024 tokens |
| Pooling | Mean pooling |
| Task tokens | [QRY] (query), [DOC] (document), [SYM] (symmetric) |
| Matryoshka dims | [32, 64, 128, 256] |
| Output normalisation | L2 (unit sphere) |
| Parameters | 8.6M |
| Model file | model.safetensors (33 MB) |
Key design choices:
- Task token prepend: A learnable task token (`[QRY]`, `[DOC]`, or `[SYM]`) is prepended to the input sequence before the transformer. Recommended inference route: `[QRY]`/`[QRY]` – encode both queries and documents with `[QRY]`; this benchmarked highest on MTEB. `[SYM]` everywhere is the next-best symmetric alternative. We do not recommend `[DOC]` at inference time – it is exposed for downstream fine-tuning, not as an asymmetric query/document route.
- Matryoshka training: The model is trained with Matryoshka Representation Learning, meaning embeddings truncated to any supported sub-dimension remain well-calibrated without retraining.
- Mean pooling: The average of all token outputs (excluding padding) produces the sentence embedding, which consistently outperforms CLS-token pooling in the Ogma architecture family.
- L2 normalisation: All outputs are unit-normalised, so cosine similarity equals the dot product and Euclidean distance is a monotonic function of both, simplifying downstream usage (see the sketch after this list).
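A minimal sketch of the pooling and normalisation steps together (standard masked mean pooling; the function name and signature are illustrative, not Ogma's internal API):

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token outputs, excluding padding, then L2-normalise."""
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq, 1)
    summed = (hidden_states * mask).sum(dim=1)    # (batch, d_model)
    counts = mask.sum(dim=1).clamp(min=1e-9)      # real tokens per sequence
    return F.normalize(summed / counts, dim=-1)   # unit-norm embeddings
```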
Usage
Installation
```bash
pip install torch tokenizers transformers huggingface_hub
```
Basic Encoding
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("axiotic/ogma-small", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("axiotic/ogma-small", trust_remote_code=True)

sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn vulpine leaps over an idle canine",
    "The capital of France is Paris",
]

emb = model.embed(sentences, task="sym", tokenizer=tok)
# emb: one L2-normalised 256d vector per sentence

sim = (emb[0] @ emb[1]).item()  # cosine sim == dot product (L2-normalised)
print(f"paraphrase: {sim:.4f}")
```
task="sym" is a safe default for all similarity tasks (STS, clustering,
classification) and for retrieval. Ogma is trained for symmetric routing โ
queries and documents are always encoded with the same task token. The two
recommended routes are:
[SYM]for everything (the safe default above), or[QRY]/[QRY]โ encode both queries and documents withtask="qry".
Try both on your downstream task; either can win depending on the data, and
[QRY]/[QRY] is the natural starting point when fine-tuning a classifier or
retrieval head on top of the embeddings.
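A quick way to score both routes on your own examples (illustrative; uses the same `embed` call as above, with a made-up sentence pair):

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("axiotic/ogma-small", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("axiotic/ogma-small", trust_remote_code=True)

pair = ("What is knowledge distillation?",
        "Knowledge distillation trains a smaller student to mimic a teacher.")

# Score the same pair under both symmetric routes; keep whichever separates
# positives from negatives better on a held-out sample of your data.
for task in ("sym", "qry"):
    a = model.embed([pair[0]], task=task, tokenizer=tok)
    b = model.embed([pair[1]], task=task, tokenizer=tok)
    print(task, f"{(a @ b.T).item():.4f}")  # cosine similarity (unit-norm)
```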
Retrieval
Encode queries and documents with the same task token. Below we show the `[QRY]`/`[QRY]` route – both calls use `task="qry"`. This is intentional (Ogma is symmetric, not asymmetric); swap in `task="sym"` to compare the SYM route on your data.
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("axiotic/ogma-small", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("axiotic/ogma-small", trust_remote_code=True)

queries = ["What is knowledge distillation?"]
docs = [
    "Knowledge distillation trains a smaller student model to mimic a larger teacher.",
    "The Eiffel Tower is in Paris, France.",
]

q = model.embed(queries, task="qry", tokenizer=tok)  # one 256d vector per query; symmetric: both sides use qry
d = model.embed(docs, task="qry", tokenizer=tok)     # one 256d vector per doc; not a typo, Ogma is symmetric

scores = (q @ d.T).squeeze(0)  # cosine sim (L2-normalised, dot == cosine)
print(scores.tolist())  # [higher, lower]; first doc is relevant
```
Matryoshka – Flexible Dimensionality
Ogma is trained with Matryoshka Representation Learning. Slice and re-normalise to any supported sub-dimension with no retraining:
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("axiotic/ogma-small", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("axiotic/ogma-small", trust_remote_code=True)

emb = model.embed(["hello world"], task="sym", tokenizer=tok)  # full 256d

for d in model.config.matryoshka_dims:
    sub = F.normalize(emb[:, :d], dim=-1)  # slice, then re-normalise to unit length
    print(f"{d}d norm={sub.norm(dim=-1).item():.4f}")
```
Model Family
| Model | Params | Size | MTEB Avg | Class | Clust | PairClass | Rerank | Ret | STS | Summ | d_out | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ogma-large | 32.4M | 124 MB | 57.41 | 68.6 | 41.6 | 84.0 | 53.1 | 43.7 | 83.7 | 30.9 | 256 | 1024 |
| ogma-base | 13.3M | 51 MB | 57.02 | 67.74 | 41.49 | 83.73 | 51.25 | 42.36 | 82.84 | 29.73 | 256 | 1024 |
| ogma-small | 8.6M | 33 MB | 56.32 | 66.49 | 40.69 | 82.91 | 50.51 | 42.05 | 82.00 | 29.59 | 256 | 1024 |
| ogma-mini | 3.5M | 14 MB | 53.06 | 61.77 | 37.38 | 79.66 | 47.39 | 36.21 | 77.71 | 31.33 | 256 | 1024 |
| ogma-micro | 2.3M | 8.9 MB | 52.18 | 59.53 | 36.88 | 78.62 | 49.74 | 33.09 | 75.63 | 31.77 | 128 | 1024 |
| all-MiniLM-L6-v2 | 22.7M | 87 MB | 56.09 | 62.62 | 41.94 | 82.37 | 58.04 | 41.95 | 78.90 | 30.81 | 384 | 256 |
| potion-base-32M | 32.0M | 123 MB | 51.22 | 66.0 | 39.2 | 78.2 | 50.9 | 32.2 | 73.9 | 29.8 | 256 | inf |
| potion-base-8M | 7.6M | 29 MB | 50.03 | 64.44 | 32.93 | 76.62 | 49.73 | 31.71 | 73.24 | 29.28 | 256 | inf |
All Ogma: MTEB 2.10.7, 66-task standard English set, category-averaged. MiniLM/Potion: published scores from the Model2Vec results page.
Training Details
| Property | Value |
|---|---|
| Teacher model | jinaai/jina-embeddings-v5-text-small (CC-BY-NC-4.0) |
| Training paradigm | Knowledge distillation from cached teacher embeddings |
| Training data | ~7M curated English sentence pairs |
| Tokenizer | AlbertTokenizer (SentencePiece, vocab=30,000) |
| Embedding initialisation | PCA of teacher embeddings (128d) projected to d_model |
| Loss | Distillation + contrastive (balanced schedule) |
| Evaluation framework | MTEB 2.10.7 |
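The loss weighting and schedule are not published. As a rough sketch only, a balanced distillation-plus-contrastive objective over cached teacher embeddings (hyperparameters illustrative; teacher embeddings assumed projected to the student's output dimension) could look like:

```python
import torch
import torch.nn.functional as F

def distill_contrastive_loss(student: torch.Tensor,      # (batch, d) student embeddings
                             teacher: torch.Tensor,      # (batch, d) cached teacher embeddings
                             temperature: float = 0.05,  # illustrative, not a published value
                             alpha: float = 0.5) -> torch.Tensor:  # "balanced" mix, assumed
    student = F.normalize(student, dim=-1)
    teacher = F.normalize(teacher, dim=-1)
    # Distillation term: pull each student embedding toward its teacher embedding.
    distill = 1.0 - (student * teacher).sum(dim=-1).mean()
    # In-batch contrastive term (InfoNCE): each student row should match its
    # own teacher row against all other teacher rows in the batch.
    logits = student @ teacher.T / temperature
    labels = torch.arange(student.size(0), device=student.device)
    contrastive = F.cross_entropy(logits, labels)
    return alpha * distill + (1.0 - alpha) * contrastive
```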
Limitations
- No text generation. Ogma is an encoder-only embedding model.
- English only. Training data and evaluation are English-only.
- Slower than static models. Transformer inference is 40-100× slower than static models (Potion, Model2Vec) on CPU. The trade-off: contextual understanding and 4× longer sequences.
- Non-commercial licence. Due to distillation from a CC-BY-NC-4.0 teacher, Ogma inherits the NonCommercial restriction. Commercial use requires a separate Jina AI licence or retraining with a permissive teacher (Apache 2.0 compatible models like BGE or E5 can substitute at the cost of a full retraining run).
- Reranking gap. Ogma lags behind MiniLM-L6-v2 on reranking tasks (category avg delta: -7.5). This is an architectural characteristic: the model optimises for semantic similarity and classification over pairwise ranking.
Licence & Attribution
This model is released under CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0 International).
Required attribution (must be included in all uses):
This model was trained via knowledge distillation from jina-embeddings-v5-text-small (https://huggingface.co/jinaai/jina-embeddings-v5-text-small) by Jina AI, licensed under CC-BY-NC-4.0.
Citation
```bibtex
@misc{ogma2026,
  title  = {Ogma: Efficient Dense Retrieval via Structured Embeddings},
  author = {Axiotic AI},
  year   = {2026},
  url    = {https://huggingface.co/axiotic/ogma-small},
}
```