Sarden1
Sarden1 is a high-performance token classification model built from scratch for personally identifiable information (PII) detection and redaction. It identifies and labels sensitive entity spans in text across 15 locales, making it suitable for GDPR/HIPAA compliance pipelines, log scrubbing, and document redaction at production scale.
| Component | Detail |
|---|---|
| Parameters | ~300M |
| Layers | 18 transformer layers |
| Hidden size | 1024 |
| Attention | Grouped Query Attention (16Q / 4KV heads) |
| FFN | SwiGLU (2730 intermediate) |
| Positional encoding | RoPE (θ = 500,000) |
| Normalisation | RMSNorm (no bias) |
| Tokeniser | GPT-2 BPE (vocab 50,257) |
| Precision | bfloat16 |
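The table implies a few derived quantities that are useful when rebuilding the model. The sketch below collects the stated values in a small config object and derives the per-head dimension, the grouped-query sharing ratio, and the RoPE inverse frequencies. The class name `SardenArch`, its field names, and the assumption that the head dimension equals hidden size divided by the number of query heads are illustrative and not taken from the model repository. The FFN width of 2730 matches floor(8/3 × 1024), a common SwiGLU sizing, though the card does not say how it was chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SardenArch:
    """Hypothetical container for the values in the table above."""
    hidden_size: int = 1024
    num_layers: int = 18
    num_q_heads: int = 16
    num_kv_heads: int = 4
    ffn_intermediate: int = 2730
    rope_theta: float = 500_000.0
    vocab_size: int = 50_257

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden_size / num_q_heads (not stated in the card).
        return self.hidden_size // self.num_q_heads

    @property
    def q_heads_per_kv_head(self) -> int:
        # In grouped query attention, several query heads share one K/V head.
        return self.num_q_heads // self.num_kv_heads

    @property
    def kv_proj_width(self) -> int:
        # Output width of the K (or V) projection under the head_dim assumption.
        return self.num_kv_heads * self.head_dim


def rope_inv_freq(head_dim: int, theta: float) -> list[float]:
    """Standard RoPE inverse frequencies: theta^(-2i/d) for i in [0, d/2)."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


arch = SardenArch()
inv_freq = rope_inv_freq(arch.head_dim, arch.rope_theta)
print(arch.head_dim, arch.q_heads_per_kv_head, arch.kv_proj_width, len(inv_freq))
```

Under these assumptions each key/value head serves four query heads, so the K and V projections are a quarter of the width of the Q projection.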
Sarden1 detects 12 PII entity types using BIO span labelling:
| Category | Entity Types |
|---|---|
| Identity | PERSON, USERNAME, DATE |
| Contact | EMAIL, PHONE, ADDRESS |
| Financial | CREDITCARD, SSN |
| Documents | PASSPORT, DRIVERSLICENSE |
| Technical | IP |
| Organisational | ORG |
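With BIO labelling, each entity type gets a "begin" and an "inside" tag, plus one shared "outside" tag, so 12 types give 25 token-level classes. The sketch below builds such a label set. The `B-`/`I-` tag spelling and the index order are assumptions for illustration; the authoritative mapping is the `id2label` entry in the model's `config.json`.

```python
# Entity types as listed in the table above.
ENTITY_TYPES = [
    "PERSON", "USERNAME", "DATE",
    "EMAIL", "PHONE", "ADDRESS",
    "CREDITCARD", "SSN",
    "PASSPORT", "DRIVERSLICENSE",
    "IP",
    "ORG",
]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types.

    The ordering here is hypothetical; use config.json for the real ids.
    """
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


labels = build_bio_labels(ENTITY_TYPES)
example_id2label = dict(enumerate(labels))
print(len(labels))
```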
import json, torch
from safetensors.torch import load_file
from transformers import AutoTokenizer
# Load weights and config
sd = load_file("model.safetensors")
with open("config.json") as f:
    cfg = json.load(f)
id2label = {int(k): v for k, v in cfg["id2label"].items()}
# Load tokeniser
tok = AutoTokenizer.from_pretrained(".")
# Rebuild the model from the architecture above before this point.
# `model` is not defined in this snippet; its class is not shown here.
model.load_state_dict(sd)
model.eval()
# Inference
text = "Hi, I'm Jane Smith. Reach me at jane@example.com or 555-1234."
enc = tok(text, return_offsets_mapping=True, return_tensors="pt")
with torch.no_grad():
    logits = model(enc["input_ids"])["logits"]
preds = logits.argmax(-1)[0].tolist()
offsets = enc["offset_mapping"][0].tolist()
# Print every token whose predicted label is not "O", skipping empty offsets
for pred, (cs, ce) in zip(preds, offsets):
    if cs != ce and id2label.get(pred, "O") != "O":
        print(f"{id2label[pred]:<20} {repr(text[cs:ce])}")
Example output (entities shown as merged spans; the loop above prints one line per labelled token):
PERSON 'Jane Smith'
EMAIL 'jane@example.com'
PHONE '555-1234'
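To get from per-token predictions to whole entities like those shown above, consecutive `B-`/`I-` tags of the same type can be merged into character spans, which can then be replaced with placeholders for redaction. The following is a minimal sketch, not the authors' post-processing: `merge_bio_spans` and `redact` are hypothetical helpers, the `B-`/`I-` tag spelling is assumed, and the tags and offsets in the example are written by hand rather than produced by the model.

```python
def merge_bio_spans(tags, offsets):
    """Merge per-token BIO tags into (label, start, end) character spans."""
    spans = []
    current = None  # [label, start, end] of the span being built
    for tag, (cs, ce) in zip(tags, offsets):
        if cs == ce:  # empty offsets (e.g. special tokens) carry no text
            continue
        prefix, _, label = tag.partition("-")
        if tag == "O" or not label:
            if current:
                spans.append(tuple(current))
                current = None
        elif prefix == "I" and current and current[0] == label:
            current[2] = ce  # extend the open span
        else:
            # "B-" tag, or an "I-" tag with no matching open span: start anew
            if current:
                spans.append(tuple(current))
            current = [label, cs, ce]
    if current:
        spans.append(tuple(current))
    return spans


def redact(text, spans):
    """Replace each span with a [LABEL] placeholder."""
    out, last = [], 0
    for label, cs, ce in sorted(spans, key=lambda s: s[1]):
        out.append(text[last:cs])
        out.append(f"[{label}]")
        last = ce
    out.append(text[last:])
    return "".join(out)


text = "Hi, I'm Jane Smith. Reach me at jane@example.com or 555-1234."
# Hand-written pieces and tags for illustration only (not real model output).
pieces = [
    ("Hi", "O"), ("Jane", "B-PERSON"), ("Smith", "I-PERSON"),
    ("Reach", "O"), ("jane@example.com", "B-EMAIL"), ("or", "O"),
    ("555", "B-PHONE"), ("-1234", "I-PHONE"),
]
tags = [tag for _, tag in pieces]
offsets = [(text.index(p), text.index(p) + len(p)) for p, _ in pieces]

spans = merge_bio_spans(tags, offsets)
for label, cs, ce in spans:
    print(f"{label:<20} {text[cs:ce]!r}")
redacted = redact(text, spans)
print(redacted)
```

With real model output, `tags` would be `[id2label[p] for p in preds]` and `offsets` the tokeniser's offset mapping. A stray `I-` tag with no open span is treated as the start of a new span here, which is one of several reasonable policies.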
@misc{surpem2026sarden1,
title = {Sarden1-300M: Multilingual PII Detection \& Redaction Model},
author = {Surpem},
year = {2026},
url = {https://huggingface.co/surpem/sarden1-300m},
}