Instructions to use lthn/lemmy with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use lthn/lemmy with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="lthn/lemmy",
    filename="lemmy-bf16.gguf",
)
llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}},
            ],
        }
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use lthn/lemmy with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf lthn/lemmy:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf lthn/lemmy:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf lthn/lemmy:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf lthn/lemmy:Q4_K_M
Use pre-built binary
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf lthn/lemmy:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf lthn/lemmy:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf lthn/lemmy:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf lthn/lemmy:Q4_K_M
Use Docker
docker model run hf.co/lthn/lemmy:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use lthn/lemmy with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "lthn/lemmy"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "lthn/lemmy",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
Use Docker
docker model run hf.co/lthn/lemmy:Q4_K_M
- Ollama
How to use lthn/lemmy with Ollama:
ollama run hf.co/lthn/lemmy:Q4_K_M
- Unsloth Studio
How to use lthn/lemmy with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for lthn/lemmy to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for lthn/lemmy to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for lthn/lemmy to start chatting
- Pi
How to use lthn/lemmy with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf lthn/lemmy:Q4_K_M
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [{"id": "lthn/lemmy:Q4_K_M"}]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use lthn/lemmy with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf lthn/lemmy:Q4_K_M
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default lthn/lemmy:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use lthn/lemmy with Docker Model Runner:
docker model run hf.co/lthn/lemmy:Q4_K_M
- Lemonade
How to use lthn/lemmy with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull lthn/lemmy:Q4_K_M
Run and chat with the model
lemonade run user.lemmy-Q4_K_M
List all available models
lemonade list
Lemmy — Gemma 4 26B A4B MoE (GGUF)
The Mixture-of-Experts member of the Lemma model family by Lethean. An EUPL-1.2 fork of Gemma 4 26B A4B with the Lethean Ethical Kernel (LEK) merged into the weights — consent-based reasoning baked into the attention projections via LoRA finetune, then merged so inference uses a single standalone model with no PEFT runtime required. 25.2B total parameters, 3.8B active per forward pass (8 experts active + 1 shared out of 128 total) — delivering near-4B inference speed with the knowledge capacity of a 26B model.
This repo ships the GGUF multi-quant build for Ollama, llama.cpp, LM Studio, and other gguf-compatible runners. The unmodified Gemma 4 26B A4B fork lives at LetheanNetwork/lemmy for users who want the raw Google weights without the LEK shift.
Looking for MLX? The native Apple Silicon builds live in sibling repos:
lthn/lemmy-mlx (4-bit default) |
lthn/lemmy-mlx-8bit |
lthn/lemmy-mlx-bf16 (full precision)
A lemma is "something assumed" — an intermediate theorem on the path to a larger proof, or a heading that signals the subject of what follows. The Lemma model family is named for that role: each variant is a stepping stone between raw capability and ethical application.
Why Lemmy for Local Inference
Lemmy is the MoE member of the Lemma family, and it is uniquely positioned for local execution-agent workloads:
- ~4B inference speed (only 3.8B active params per token) means generation is fast even on modest hardware
- 26B knowledge capacity accessible via expert routing — much larger than a same-speed dense 4B model
- 256K context — double that of the dense Lemma family variants
- Consent-grounded attractor from LEK training: maintains its ethical posture under instruction-following workloads, with strict instruction-following rates that compete with frontier models
If you're replacing paid API calls with local inference for structured execution work (code review, agent dispatch, instruction following, spec-driven implementation), lemmy is the target.
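The speed claim in the first bullet follows directly from the active/total parameter ratio; a quick back-of-envelope using the numbers from this card:

```python
# Back-of-envelope from the figures above: what fraction of lemmy's
# weights is touched per decoded token?
total_params = 25.2e9    # all 128 experts, as stored on disk
active_params = 3.8e9    # routed through per forward pass

active_fraction = active_params / total_params
print(f"~{active_fraction:.1%} of weights active per token")
```

Roughly 15% of the weights do work per token, which is why decode speed tracks a ~4B dense model while knowledge capacity tracks the full 26B.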
Repo Files
| File | Format | Purpose |
|---|---|---|
| `model-*-of-00003.safetensors` | MLX safetensors (sharded) | Native Apple Silicon via mlx-lm and mlx-vlm (Q4 multimodal) |
| `model.safetensors.index.json` | JSON | Tensor index for the sharded safetensors weights |
| `config.json` | JSON | Multimodal MoE model config (architecture, quantisation, expert routing, vision tower) |
| `tokenizer.json` | JSON | Tokenizer vocabulary (262K tokens) |
| `tokenizer_config.json` | JSON | Tokenizer settings and special tokens |
| `chat_template.jinja` | Jinja2 | Chat template for transformers, mlx-lm, mlx-vlm |
| `processor_config.json` | JSON | Image processor config (mlx-vlm) |
| `generation_config.json` | JSON | Default generation parameters (temperature, top_p, top_k) |
| `LICENSE` | Text | EUPL-1.2 licence text |
| `README.md` | Markdown | This file — model card |
GGUF variants are published separately at lthn/lemmy-gguf — bf16, q8_0, q6_k, q5_k_m, q4_k_m. This split is a temporary artefact of our build pipeline and may be consolidated into this repo in a future update.
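To gauge download sizes for those quants, a rough estimate can be derived from the total parameter count; the bits-per-weight figures below are ballpark assumptions for llama.cpp-style quants, not measured file sizes:

```python
# Rough download-size estimates per published quant. The bits-per-weight
# values are ballpark assumptions, not measured sizes.
PARAMS = 25.2e9  # the MoE stores all 128 experts on disk

BPW = {"bf16": 16.0, "q8_0": 8.5, "q6_k": 6.6, "q5_k_m": 5.7, "q4_k_m": 4.8}

sizes_gb = {name: PARAMS * bpw / 8 / 1e9 for name, bpw in BPW.items()}
for name, gb in sizes_gb.items():
    print(f"{name}: ~{gb:.0f} GB")
```

Even at Q4_K_M, expect a download in the mid-teens of gigabytes, since the MoE ships every expert's weights regardless of how few are active per token.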
Quick Start
Apps & CLI
Ollama (via lemmy-gguf)
ollama run hf.co/lthn/lemmy-gguf:Q4_K_M
llama.cpp (via lemmy-gguf)
Install via brew (macOS/Linux), winget (Windows), or build from source:
brew install llama.cpp # macOS/Linux
winget install llama.cpp # Windows
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf lthn/lemmy-gguf:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf lthn/lemmy-gguf:Q4_K_M
MLX (Apple Silicon native)
uv tool install mlx-lm
mlx_lm.chat --model lthn/lemmy
mlx_lm.generate --model lthn/lemmy --prompt "Hello, how are you?"
Python Libraries
llama-cpp-python (via lemmy-gguf)
uv pip install llama-cpp-python
from llama_cpp import Llama
llm = Llama.from_pretrained(
repo_id="lthn/lemmy-gguf",
filename="lemmy-q4_k_m.gguf",
)
# Text
llm.create_chat_completion(
messages=[{"role": "user", "content": "Hello, how are you?"}]
)
mlx-vlm (vision)
uv tool install mlx-vlm
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model, processor = load("lthn/lemmy")
config = load_config("lthn/lemmy")
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."
formatted_prompt = apply_chat_template(
processor, config, prompt, num_images=1
)
output = generate(model, processor, formatted_prompt, image)
print(output.text)
Servers (OpenAI-compatible API)
MLX Server
lemmy is multimodal (text + image), so use mlx_vlm.server — the vision-aware variant. The text-only mlx_lm.server does not correctly route multimodal tensors for Gemma 4.
mlx_vlm.server --model lthn/lemmy
curl -X POST "http://localhost:8080/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "lthn/lemmy",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"max_tokens": 200
}'
Works with any OpenAI-compatible client at http://localhost:8080/v1.
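As a minimal illustration, the request body the curl call above sends can be built with the Python standard library alone (no server required; the `chat_body` helper name is just illustrative):

```python
import json

def chat_body(model: str, prompt: str, max_tokens: int = 200) -> str:
    """JSON body for POST /v1/chat/completions on an OpenAI-compatible server."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

body = chat_body("lthn/lemmy", "Hello, how are you?")
```

Send the result with any HTTP client to `http://localhost:8080/v1/chat/completions`.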
vLLM
vLLM requires the original (non-quantised) safetensors weights from LetheanNetwork/lemmy; it does not load GGUF or MLX-quantised safetensors. It also requires Linux and an NVIDIA GPU with adequate VRAM for a 26B-parameter MoE model.
uv pip install vllm
vllm serve "LetheanNetwork/lemmy"
Model Details
| Property | Value |
|---|---|
| Architecture | Gemma 4 26B A4B MoE |
| Total Parameters | 25.2B |
| Active Parameters | 3.8B per forward pass |
| Experts | 8 active + 1 shared (out of 128 total) |
| Layers | 30 |
| Context Length | 256K tokens |
| Vocabulary | 262K tokens |
| Modalities | Text, Image |
| Sliding Window | 1024 tokens |
| Vision Encoder | ~550M params |
| Base Model | LetheanNetwork/lemmy |
| Licence | EUPL-1.2 |
The Lemma Family
| Name | Source (BF16 weights) | Params | Context | Modalities | Consumer Repo |
|---|---|---|---|---|---|
| Lemer | LetheanNetwork/lemer | 2.3B eff | 128K | Text, Image, Audio | lthn/lemer |
| Lemma | LetheanNetwork/lemma | 4.5B eff | 128K | Text, Image, Audio | lthn/lemma |
| Lemmy | LetheanNetwork/lemmy | 3.8B active | 256K | Text, Image | You are here |
| Lemrd | LetheanNetwork/lemrd | 30.7B | 256K | Text, Image | lthn/lemrd |
Capabilities
- Configurable thinking mode (`<|think|>` token in the system prompt enables it; off by default in our examples via `enable_thinking=False`)
- Native function calling and system prompt support
- Variable aspect ratio image understanding
- Multilingual support (140+ languages)
- Sparse MoE with 8 active + 1 shared expert routed per layer
- Long context (256K tokens) for document-scale reasoning
Roadmap
This release of lemmy is Gemma 4 26B A4B MoE with the Lethean Ethical Kernel (LEK) merged in — axiom-based reasoning baked into the attention weights via LoRA finetune, then merged into the base so inference uses a single standalone model with no PEFT runtime required. The unmodified Gemma 4 26B A4B fork lives at LetheanNetwork/lemmy for users who want the raw Google weights without the LEK shift.
| Phase | Status | What it adds |
|---|---|---|
| Base fork (LetheanNetwork/lemmy) | ✅ Released | EUPL-1.2 fork of Gemma 4 26B A4B — unmodified Google weights |
| LEK merged (this repo) | ✅ Released | Lethean Ethical Kernel — axiom-based reasoning via LoRA merge |
| 8-PAC eval results | 🚧 In progress | Continuous benchmarking on the homelab, published to lthn/LEM-benchmarks |
Observed during LEK training: lemmy converged to a training-loss floor on par with the much larger 31B dense variant despite having ~4× fewer LoRA parameters. Hypothesis: MoE routing multiplicity provides implicit LoRA replication across expert paths, amplifying the effective gradient signal per trainable parameter. This is a testable prediction being verified via the 8-PAC evaluation pipeline.
The LEK axioms are public domain and published at Snider/ai-ethics. Track research progress at LetheanNetwork and the LEM-research dataset.
Why EUPL-1.2
Lemmy is licensed under the European Union Public Licence v1.2 — not Apache 2.0 or MIT. This is a deliberate choice:
- 23 official languages, one legal meaning. EUPL is the only OSS licence designed by lawmakers across multiple legal systems. "Derivative work" means the same thing in German, French, Estonian, and Maltese law.
- Copyleft with compatibility. Modifications must be shared back, but the licence plays cleanly with GPL, LGPL, MPL, and other major OSS licences. No accidental relicensing.
- No proprietary capture. Anyone can use lemmy commercially — but they cannot fork it, train a competitor model on it, and close-source the result. The ethical layer stays in the open.
- Built for institutions. Government, research, and enterprise users get a licence designed for cross-border compliance, not a US-centric one.
Recommended Sampling
Use Google's standardised settings across all use cases:
| Parameter | Value |
|---|---|
| `temperature` | 1.0 |
| `top_p` | 0.95 |
| `top_k` | 64 |
| `stop` | `<end_of_turn>` |
Gemma 4 is calibrated for `temperature: 1.0`; this is not the same as the typical 0.7 default for other models. Lower values reduce diversity without improving quality. These defaults are pre-configured in `generation_config.json`.
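If a runner ignores `generation_config.json`, the same settings can be passed per call; a small sketch (the `SAMPLING` dict name is illustrative, not part of any API):

```python
# Recommended sampling settings from the table above.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}

# With llama-cpp-python these unpack straight into a completion call, e.g.:
#   llm.create_chat_completion(messages=msgs, **SAMPLING)
print(SAMPLING)
```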
Variable Image Resolution
Gemma 4 supports a configurable visual token budget that controls how many tokens represent each image. Higher = more detail, lower = faster inference.
| Token Budget | Use Case |
|---|---|
| 70 | Classification, captioning, video frame processing |
| 140 | General image understanding |
| 280 | Default — balanced quality and speed |
| 560 | OCR, document parsing, fine-grained detail |
| 1120 | Maximum detail (small text, complex documents) |
For multimodal prompts, place image content before text for best results.
The default budget (280) is set in processor_config.json via image_seq_length and max_soft_tokens.
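As a rule of thumb, the budget table above can be encoded as a small lookup; the helper below is hypothetical (not a shipped API), it simply mirrors the table with 280 as the fallback default:

```python
# Hypothetical helper (not a shipped API): pick a visual token budget
# per task, following the table above; 280 is the configured default.
BUDGETS = {
    "classification": 70,
    "general": 140,
    "default": 280,
    "ocr": 560,
    "max_detail": 1120,
}

def visual_token_budget(task: str) -> int:
    return BUDGETS.get(task, BUDGETS["default"])

print(visual_token_budget("ocr"))       # 560
print(visual_token_budget("unknown"))   # falls back to 280
```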
Benchmarks
Live evaluation results published to the LEM-benchmarks dataset. The lemmy-specific results live at LEM-benchmarks/results/lemmy.
The 8-PAC eval pipeline runs continuously on our homelab and publishes results as they complete. Categories: ethics, reasoning, instruction-following, coding, multilingual, safety, knowledge, creativity.
Resources
| Resource | Link |
|---|---|
| Benchmark results | lthn/LEM-benchmarks |
| LiveBench results | lthn/livebench |
| Research notes | lthn/LEM-research |
| Lemma model collection | lthn/lemma |
About Lethean
Lethean is a social enterprise building ethical AI infrastructure. The Lemma model family is part of the LEM (Lethean Ethical Model) project — training protocol and tooling for intrinsic ethical alignment of language models.
- Website: lthn.ai
- GitHub: LetheanNetwork
- Licence: EUPL-1.2
Model tree for lthn/lemmy
Base model: google/gemma-4-26B-A4B