Sarden1: Multilingual PII Detection & Redaction Model

Model Description

Sarden1 is a high-performance token classification model built from scratch for personally identifiable information (PII) detection and redaction. It identifies and labels sensitive entity spans in text across 15 locales, making it suitable for GDPR/HIPAA compliance pipelines, log scrubbing, and document redaction at production scale.

  • Developed by: Surpem
  • Model Type: Token Classifier (BIO tagging)
  • Architecture: Custom Decoder-style Transformer
  • Base Model: Trained from scratch — no pretrained base
  • License: Apache 2.0
  • Languages: en, de, fr, it, es, nl, pt, pl, cs, da, fi, sv (+ en_GB, en_CA, en_AU)

Architecture

Component             Detail
---------             ------
Parameters            ~300M
Layers                18 transformer layers
Hidden size           1024
Attention             Grouped Query Attention (16 query / 4 KV heads)
FFN                   SwiGLU (intermediate size 2730)
Positional encoding   RoPE (θ = 500,000)
Normalisation         RMSNorm (no bias)
Tokeniser             GPT-2 BPE (vocab 50,257)
Precision             bfloat16
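The "rebuild model from architecture" step in Get Started needs a concrete module. Below is a minimal PyTorch sketch of the table above — the class and parameter names are illustrative guesses, not the checkpoint's actual state-dict keys, so renaming may be needed before `load_state_dict` succeeds, and the unmasked (bidirectional) attention is an assumption, since the card does not state whether the "decoder-style" stack is causal:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm without bias, per the table."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

def rope(x, theta=500_000.0):
    """Rotary position embedding over (batch, heads, seq, head_dim)."""
    *_, s, d = x.shape
    freqs = 1.0 / theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = torch.arange(s, dtype=torch.float32)[:, None] * freqs
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class Block(nn.Module):
    def __init__(self, dim=1024, n_q=16, n_kv=4, ffn=2730):
        super().__init__()
        self.n_q, self.n_kv, self.hd = n_q, n_kv, dim // n_q
        self.wq = nn.Linear(dim, n_q * self.hd, bias=False)
        self.wk = nn.Linear(dim, n_kv * self.hd, bias=False)
        self.wv = nn.Linear(dim, n_kv * self.hd, bias=False)
        self.wo = nn.Linear(n_q * self.hd, dim, bias=False)
        self.gate = nn.Linear(dim, ffn, bias=False)  # SwiGLU gate projection
        self.up   = nn.Linear(dim, ffn, bias=False)
        self.down = nn.Linear(ffn, dim, bias=False)
        self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)

    def forward(self, x):
        b, s, _ = x.shape
        h = self.norm1(x)
        q = rope(self.wq(h).view(b, s, self.n_q, self.hd).transpose(1, 2))
        k = rope(self.wk(h).view(b, s, self.n_kv, self.hd).transpose(1, 2))
        v = self.wv(h).view(b, s, self.n_kv, self.hd).transpose(1, 2)
        # GQA: repeat the 4 KV heads to cover all 16 query heads.
        # No causal mask is applied — an assumption, see the lead-in above.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        a = F.scaled_dot_product_attention(q, k, v)
        x = x + self.wo(a.transpose(1, 2).reshape(b, s, -1))
        h = self.norm2(x)
        return x + self.down(F.silu(self.gate(h)) * self.up(h))  # SwiGLU FFN

class Sarden1(nn.Module):
    # n_labels=25 assumes O plus B-/I- pairs for the 12 entity types.
    def __init__(self, vocab=50_257, dim=1024, layers=18, n_labels=25):
        super().__init__()
        self.embed  = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(Block(dim) for _ in range(layers))
        self.norm   = RMSNorm(dim)
        self.head   = nn.Linear(dim, n_labels, bias=False)  # per-token BIO logits

    def forward(self, ids):
        x = self.embed(ids)
        for blk in self.blocks:
            x = blk(x)
        return {"logits": self.head(self.norm(x))}
```

Instantiated with the defaults above, this comes out near the stated ~300M parameters; the smaller dimensions in `config.json` should be treated as authoritative where they differ.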

Entity Types

Sarden1 detects 12 PII entity types using BIO span labelling:

Category         Entity Types
--------         ------------
Identity         PERSON, USERNAME, DATE
Contact          EMAIL, PHONE, ADDRESS
Financial        CREDITCARD, SSN
Documents        PASSPORT, DRIVERSLICENSE
Technical        IP
Organisational   ORG
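Under BIO tagging, each of the 12 entity types contributes a B- (begin) and an I- (inside) label, plus a single O label for non-PII tokens — 25 labels in total. A sketch of that label set (the actual id ordering comes from `id2label` in `config.json`, so treat this layout as an assumption):

```python
ENTITY_TYPES = [
    "PERSON", "USERNAME", "DATE",       # identity
    "EMAIL", "PHONE", "ADDRESS",        # contact
    "CREDITCARD", "SSN",                # financial
    "PASSPORT", "DRIVERSLICENSE",       # documents
    "IP",                               # technical
    "ORG",                              # organisational
]

# BIO scheme: O for non-PII, B-X opens a span of type X, I-X continues it.
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]
print(len(LABELS))  # 25 = 1 + 2 × 12
```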

Get Started

import json, torch
from safetensors.torch import load_file
from transformers import AutoTokenizer

# Load weights and config
sd       = load_file("model.safetensors")
cfg      = json.load(open("config.json"))
id2label = {int(k): v for k, v in cfg["id2label"].items()}

# Load tokeniser
tok = AutoTokenizer.from_pretrained(".")

# (Rebuild model from architecture, then:)
model.load_state_dict(sd)
model.eval()

# Inference
text    = "Hi, I'm Jane Smith. Reach me at jane@example.com or 555-1234."
enc     = tok(text, return_offsets_mapping=True, return_tensors="pt")
with torch.no_grad():
    logits = model(enc["input_ids"])["logits"]

preds   = logits.argmax(-1)[0].tolist()
offsets = enc["offset_mapping"][0].tolist()

# Merge contiguous B-/I- tokens into whole entity spans before printing,
# so multi-token entities like "Jane Smith" come out as one span
spans = []
for pred, (cs, ce) in zip(preds, offsets):
    label = id2label.get(pred, "O")
    if cs == ce or label == "O":          # skip special tokens and non-PII
        continue
    prefix, _, etype = label.partition("-")
    if prefix == "I" and spans and spans[-1][0] == etype:
        spans[-1][2] = ce                 # continuation: extend the open span
    else:
        spans.append([etype, cs, ce])     # B- tag (or stray I-): new span

for etype, cs, ce in spans:
    print(f"{etype:<20} {text[cs:ce]!r}")

Example output:

PERSON               'Jane Smith'
EMAIL                'jane@example.com'
PHONE                '555-1234'
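Because the offsets are character-level, the same spans can drive redaction: replace each entity with a placeholder, working right to left so earlier offsets are not shifted by the substitutions. A minimal sketch — the `spans` list here is hand-written to match the example output above, whereas in practice it would come from the merging loop in the snippet:

```python
def redact(text, spans):
    """Replace each (label, char_start, char_end) span with [LABEL].

    Spans are processed right-to-left so earlier character offsets
    remain valid after each substitution.
    """
    for label, cs, ce in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:cs] + f"[{label}]" + text[ce:]
    return text

text  = "Hi, I'm Jane Smith. Reach me at jane@example.com or 555-1234."
spans = [("PERSON", 8, 18), ("EMAIL", 32, 48), ("PHONE", 52, 60)]
print(redact(text, spans))
# Hi, I'm [PERSON]. Reach me at [EMAIL] or [PHONE].
```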

Citation

@misc{surpem2026sarden1,
      title  = {Sarden1-300M: Multilingual PII Detection \& Redaction Model},
      author = {Surpem},
      year   = {2026},
      url    = {https://huggingface.co/surpem/sarden1-300m},
}