Tachelhit — Wikilangs Models
Open-source tokenizers, n-gram & Markov language models, vocabulary stats, and word embeddings trained on Tachelhit Wikipedia by Wikilangs.
🌐 Language Page · 🎮 Playground · 📊 Full Research Report
Language Samples
Example sentences drawn from the Tachelhit Wikipedia corpus:
11 yan d mraw ( s Taɛrabt احدى عشر ) ( s Tafṛensist onze ) iga yan izwl Msmun awal n SGSM : Msmun awal amatay asnmalay n tmaziɣt (MMSM) tisaɣulin
12 sin d mraw ( s Taɛrabt اثنى عشرة ) ( s Tafṛensist douze ) iga yan izwl Msmun awal n SGSM : Msmun awal amatay asnmalay n tmaziɣt (MMSM) tisaɣulin
13 kṛaḍ d merraw ( s Taɛrabt ثلاثة عشرة ) ( s Tafṛensist treize ) iga yan izwl Msmun awal n SGSM : Msmun awal amatay asnmalay n tmaziɣt (MMSM) tisaɣulin
Acfud iga yat tasklut mẓẓin, ilan isnnann. Tiwlafin Assaɣ Tasnalɣa (morphologie) Anzwi Tisaɣulin
Acnyal nɣ Aknyal agdudan aṣbnyuli, ɣ tgzzumt tiss 4.1 n tmnḍawt taṣbnyulit yuma kṛaḍ ikʷlan: aẓggaɣ d uwraɣ d uẓggaɣ daɣ. Tisaɣulin
Quick Start
Load the Tokenizer
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("shi_tokenizer_32k.model")

text = "Sstekk iga yan ugḍiḍ imẓẓin. Assaɣ Tuzduɣt Tasnalɣa (morphologie) Tisaɣulin Msmu"
tokens = sp.EncodeAsPieces(text)
ids = sp.EncodeAsIds(text)

print(tokens)  # subword pieces
print(ids)     # integer ids

# Decode back
print(sp.DecodeIds(ids))
```
Tokenization examples
Sample 1: Sstekk iga yan ugḍiḍ imẓẓin. Assaɣ Tuzduɣt Tasnalɣa (morphologie) Tisaɣulin Msmu…
| Vocab | Tokens | Count |
|---|---|---|
| 8k | ▁s ste kk ▁iga ▁yan ▁ugḍiḍ ▁imẓẓin . ▁assaɣ ▁tuzduɣt … (+19 more) | 29 |
| 16k | ▁s ste kk ▁iga ▁yan ▁ugḍiḍ ▁imẓẓin . ▁assaɣ ▁tuzduɣt … (+19 more) | 29 |
| 32k | ▁s stekk ▁iga ▁yan ▁ugḍiḍ ▁imẓẓin . ▁assaɣ ▁tuzduɣt ▁tasnalɣa … (+18 more) | 28 |
| 64k | ▁sstekk ▁iga ▁yan ▁ugḍiḍ ▁imẓẓin . ▁assaɣ ▁tuzduɣt ▁tasnalɣa ▁( … (+17 more) | 27 |
Sample 2: Asimwas iga ass wiss Smmus g ussan n imalass. Tisaɣulin
| Vocab | Tokens | Count |
|---|---|---|
| 8k | ▁as im was ▁iga ▁ass ▁wiss ▁smmus ▁g ▁ussan ▁n … (+3 more) | 13 |
| 16k | ▁as imwas ▁iga ▁ass ▁wiss ▁smmus ▁g ▁ussan ▁n ▁imalass … (+2 more) | 12 |
| 32k | ▁asimwas ▁iga ▁ass ▁wiss ▁smmus ▁g ▁ussan ▁n ▁imalass . … (+1 more) | 11 |
| 64k | ▁asimwas ▁iga ▁ass ▁wiss ▁smmus ▁g ▁ussan ▁n ▁imalass . … (+1 more) | 11 |
Sample 3: Turdut (S turdut: اردو ) tga tutlayt nna s sawaln ayt Bakistan d Lhnd. Isuɣal
| Vocab | Tokens | Count |
|---|---|---|
| 8k | ▁tur dut ▁( s ▁tur dut : ▁ا ر دو … (+14 more) | 24 |
| 16k | ▁tur dut ▁( s ▁tur dut : ▁ار دو ▁) … (+13 more) | 23 |
| 32k | ▁turdut ▁( s ▁turdut : ▁اردو ▁) ▁tga ▁tutlayt ▁nna … (+9 more) | 19 |
| 64k | ▁turdut ▁( s ▁turdut : ▁اردو ▁) ▁tga ▁tutlayt ▁nna … (+8 more) | 18 |
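The per-vocabulary counts above can be reproduced directly with the tokenizers. A minimal sketch, assuming the other vocab sizes follow the same `shi_tokenizer_{size}.model` naming pattern as the 32k model shown in Quick Start (only the 32k filename is confirmed here):

```python
import sentencepiece as spm

sample = "Asimwas iga ass wiss Smmus g ussan n imalass. Tisaɣulin"

for size in ("8k", "16k", "32k", "64k"):
    sp = spm.SentencePieceProcessor()
    # Filenames for 8k/16k/64k are assumed from the 32k naming pattern.
    sp.Load(f"shi_tokenizer_{size}.model")
    pieces = sp.EncodeAsPieces(sample)
    print(f"{size:>4}: {len(pieces):>2} tokens")
```

Larger vocabularies merge more characters into each piece, which is why the 64k model emits the fewest tokens on every sample.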
Load Word Embeddings
```python
from gensim.models import KeyedVectors

# Aligned embeddings (cross-lingual, mapped to English vector space)
wv = KeyedVectors.load("shi_embeddings_128d_aligned.kv")

# "word" is a placeholder; replace it with an in-vocabulary entry
similar = wv.most_similar("word", topn=5)
for word, score in similar:
    print(f"  {word}: {score:.3f}")
```
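Since `"word"` is only a placeholder, a safe way to explore the vectors is to start from keys the model actually contains:

```python
# List a few in-vocabulary entries instead of guessing a headword.
print(wv.index_to_key[:10])   # most frequent entries first

probe = wv.index_to_key[0]    # any real key works
vec = wv[probe]               # 128-dimensional numpy vector
print(probe, vec.shape)
print(wv.most_similar(probe, topn=5))
```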
Load N-gram Model
```python
import pyarrow.parquet as pq

df = pq.read_table("shi_3gram_word.parquet").to_pandas()
print(df.head())
```
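The parquet schema is not documented here, so the column names below (`ngram`, `count`) are assumptions; check `df.columns` against the real file first. A sketch of looking up continuations of a two-word context in the 3-gram table:

```python
# Hypothetical schema: "ngram" holds the space-joined n-gram,
# "count" its corpus frequency. Verify with print(df.columns).
context = "iga yan"  # two-word history for a 3-gram lookup

hits = df[df["ngram"].str.startswith(context + " ")]
probs = hits.set_index("ngram")["count"] / hits["count"].sum()
print(probs.sort_values(ascending=False).head(5))
```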
Models Overview
| Category | Assets |
|---|---|
| Tokenizers | BPE at 8k, 16k, 32k, 64k vocab sizes |
| N-gram models | 2 / 3 / 4 / 5-gram (word & subword) |
| Markov chains | Context 1–5 (word & subword) |
| Embeddings | 32d, 64d, 128d — mono & aligned |
| Vocabulary | Full frequency list + Zipf analysis |
| Statistics | Corpus & model statistics JSON |
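The Zipf R² reported in the metrics below measures how well log-frequency falls off linearly with log-rank in the vocabulary. A minimal sketch of that fit; the file name, format, and `count` column are guesses, not documented names:

```python
import numpy as np
import pandas as pd

# Hypothetical file and column: adjust to the actual vocabulary asset.
vocab = pd.read_parquet("shi_vocabulary.parquet")
freqs = np.sort(vocab["count"].to_numpy())[::-1]

# Zipf's law: log(freq) ~ slope * log(rank) + intercept
ranks = np.arange(1, len(freqs) + 1)
log_r, log_f = np.log(ranks), np.log(freqs)
slope, intercept = np.polyfit(log_r, log_f, 1)
pred = slope * log_r + intercept
r2 = 1 - ((log_f - pred) ** 2).sum() / ((log_f - log_f.mean()) ** 2).sum()
print(f"slope={slope:.2f}, R^2={r2:.4f}")
```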
Metrics Summary
| Component | Model | Key Metric | Value |
|---|---|---|---|
| Tokenizer | 8k BPE | Compression | 3.02x |
| Tokenizer | 16k BPE | Compression | 3.30x |
| Tokenizer | 32k BPE | Compression | 3.56x |
| Tokenizer | 64k BPE | Compression | 3.82x 🏆 |
| N-gram | 2-gram (subword) | Perplexity | 255 🏆 |
| N-gram | 2-gram (word) | Perplexity | 1,027 |
| N-gram | 3-gram (subword) | Perplexity | 1,284 |
| N-gram | 3-gram (word) | Perplexity | 1,698 |
| N-gram | 4-gram (subword) | Perplexity | 3,345 |
| N-gram | 4-gram (word) | Perplexity | 3,109 |
| N-gram | 5-gram (subword) | Perplexity | 5,689 |
| N-gram | 5-gram (word) | Perplexity | 3,900 |
| Markov | ctx-1 (subword) | Predictability | 0.0% |
| Markov | ctx-1 (word) | Predictability | 36.7% |
| Markov | ctx-2 (subword) | Predictability | 0.0% |
| Markov | ctx-2 (word) | Predictability | 74.0% |
| Markov | ctx-3 (subword) | Predictability | 17.0% |
| Markov | ctx-3 (word) | Predictability | 91.6% |
| Markov | ctx-4 (subword) | Predictability | 43.6% |
| Markov | ctx-4 (word) | Predictability | 95.2% 🏆 |
| Vocabulary | full | Size | 31,623 |
| Vocabulary | full | Zipf R² | 0.9880 |
| Embeddings | mono_32d | Isotropy | 0.6948 |
| Embeddings | mono_64d | Isotropy | 0.5226 |
| Embeddings | mono_128d | Isotropy | 0.2352 |
| Embeddings | aligned_32d | Isotropy | 0.6948 🏆 |
| Embeddings | aligned_64d | Isotropy | 0.5226 |
| Embeddings | aligned_128d | Isotropy | 0.2352 |
| Alignment | aligned_32d | R@1 / R@5 / R@10 | 0.6% / 2.0% / 5.4% |
| Alignment | aligned_64d | R@1 / R@5 / R@10 | 2.4% / 8.0% / 12.8% |
| Alignment | aligned_128d | R@1 / R@5 / R@10 | 3.6% / 11.2% / 17.8% 🏆 |
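For reference, the tokenizer compression figures above are plausibly raw characters per emitted token; that definition is an assumption about how the table was computed, not a documented formula. A quick check against any of the tokenizers:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("shi_tokenizer_32k.model")

text = "Asimwas iga ass wiss Smmus g ussan n imalass."
ratio = len(text) / len(sp.EncodeAsIds(text))  # chars per token (assumed metric)
print(f"compression ~ {ratio:.2f}x")
```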
📊 Full ablation study, per-model breakdowns, and interpretation guide →
About
Trained on wikipedia-monthly — monthly snapshots of 300+ Wikipedia languages.
A project by Wikilangs · Maintainer: Omar Kamali · Omneity Labs
Citation
```bibtex
@misc{wikilangs2025,
  author      = {Kamali, Omar},
  title       = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year        = {2025},
  doi         = {10.5281/zenodo.18073153},
  publisher   = {Zenodo},
  url         = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
}
```
Links
- 🌐 wikilangs.org
- 🌍 Language page
- 🎮 Playground
- 🤗 HuggingFace models
- 📊 wikipedia-monthly dataset
- 👤 Omar Kamali
- 🤝 Sponsor: Featherless AI
License: MIT — free for academic and commercial use.
Generated by Wikilangs Pipeline · 2026-03-02 12:00:32
