Tachelhit — Wikilangs Models

Open-source tokenizers, n-gram & Markov language models, vocabulary stats, and word embeddings trained on Tachelhit Wikipedia by Wikilangs.

🌐 Language Page · 🎮 Playground · 📊 Full Research Report

Language Samples

Example sentences drawn from the Tachelhit Wikipedia corpus:

11 yan d mraw ( s Taɛrabt احدى عشر ) ( s Tafṛensist onze ) iga yan izwl Msmun awal n SGSM : Msmun awal amatay asnmalay n tmaziɣt (MMSM) tisaɣulin

12 sin d mraw ( s Taɛrabt اثنى عشرة ) ( s Tafṛensist douze ) iga yan izwl Msmun awal n SGSM : Msmun awal amatay asnmalay n tmaziɣt (MMSM) tisaɣulin

13 kṛaḍ d merraw ( s Taɛrabt ثلاثة عشرة ) ( s Tafṛensist treize ) iga yan izwl Msmun awal n SGSM : Msmun awal amatay asnmalay n tmaziɣt (MMSM) tisaɣulin

Acfud iga yat tasklut mẓẓin, ilan isnnann. Tiwlafin Assaɣ Tasnalɣa (morphologie) Anzwi Tisaɣulin

Acnyal nɣ Aknyal agdudan aṣbnyuli, ɣ tgzzumt tiss 4.1 n tmnḍawt taṣbnyulit yuma kṛaḍ ikʷlan: aẓggaɣ d uwraɣ d uẓggaɣ daɣ. Tisaɣulin

Quick Start

Load the Tokenizer

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("shi_tokenizer_32k.model")

text = "Sstekk iga yan ugḍiḍ imẓẓin. Assaɣ Tuzduɣt Tasnalɣa (morphologie) Tisaɣulin Msmu"
tokens = sp.EncodeAsPieces(text)
ids    = sp.EncodeAsIds(text)

print(tokens)  # subword pieces
print(ids)     # integer ids

# Decode back to the original text
print(sp.DecodeIds(ids))
```
Tokenization examples

Sample 1: Sstekk iga yan ugḍiḍ imẓẓin. Assaɣ Tuzduɣt Tasnalɣa (morphologie) Tisaɣulin Msmu…

| Vocab | Tokens | Count |
|---|---|---|
| 8k | ▁s ste kk ▁iga ▁yan ▁ugḍiḍ ▁imẓẓin . ▁assaɣ ▁tuzduɣt … (+19 more) | 29 |
| 16k | ▁s ste kk ▁iga ▁yan ▁ugḍiḍ ▁imẓẓin . ▁assaɣ ▁tuzduɣt … (+19 more) | 29 |
| 32k | ▁s stekk ▁iga ▁yan ▁ugḍiḍ ▁imẓẓin . ▁assaɣ ▁tuzduɣt ▁tasnalɣa … (+18 more) | 28 |
| 64k | ▁sstekk ▁iga ▁yan ▁ugḍiḍ ▁imẓẓin . ▁assaɣ ▁tuzduɣt ▁tasnalɣa ▁( … (+17 more) | 27 |

Sample 2: Asimwas iga ass wiss Smmus g ussan n imalass. Tisaɣulin

| Vocab | Tokens | Count |
|---|---|---|
| 8k | ▁as im was ▁iga ▁ass ▁wiss ▁smmus ▁g ▁ussan ▁n … (+3 more) | 13 |
| 16k | ▁as imwas ▁iga ▁ass ▁wiss ▁smmus ▁g ▁ussan ▁n ▁imalass … (+2 more) | 12 |
| 32k | ▁asimwas ▁iga ▁ass ▁wiss ▁smmus ▁g ▁ussan ▁n ▁imalass . … (+1 more) | 11 |
| 64k | ▁asimwas ▁iga ▁ass ▁wiss ▁smmus ▁g ▁ussan ▁n ▁imalass . … (+1 more) | 11 |

Sample 3: Turdut (S turdut: اردو ) tga tutlayt nna s sawaln ayt Bakistan d Lhnd. Isuɣal

| Vocab | Tokens | Count |
|---|---|---|
| 8k | ▁tur dut ▁( s ▁tur dut : ▁ا ر دو … (+14 more) | 24 |
| 16k | ▁tur dut ▁( s ▁tur dut : ▁ار دو ▁) … (+13 more) | 23 |
| 32k | ▁turdut ▁( s ▁turdut : ▁اردو ▁) ▁tga ▁tutlayt ▁nna … (+9 more) | 19 |
| 64k | ▁turdut ▁( s ▁turdut : ▁اردو ▁) ▁tga ▁tutlayt ▁nna … (+8 more) | 18 |
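The token counts above translate directly into compression: larger vocabularies merge more characters into each piece. A minimal sketch computing chars-per-token for Sample 2, using the counts from its table; whether the "Compression" metric reported in the Metrics Summary is computed in exactly this way is an assumption.

```python
# Rough chars-per-token for Sample 2 at each vocab size,
# using token counts taken from the table above.
text = "Asimwas iga ass wiss Smmus g ussan n imalass. Tisaɣulin"
counts = {"8k": 13, "16k": 12, "32k": 11, "64k": 11}

n_chars = len(text)
for vocab, n_tokens in counts.items():
    print(f"{vocab}: {n_chars / n_tokens:.2f} chars/token")
```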

Load Word Embeddings

```python
from gensim.models import KeyedVectors

# Aligned embeddings (cross-lingual, mapped to English vector space)
wv = KeyedVectors.load("shi_embeddings_128d_aligned.kv")

# Replace "word" with any Tachelhit word present in the vocabulary
similar = wv.most_similar("word", topn=5)
for word, score in similar:
    print(f"  {word}: {score:.3f}")
```
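Under the hood, `most_similar` ranks neighbors by cosine similarity between embedding rows. A self-contained illustration on synthetic 128-d vectors (standing in for the actual shi embedding rows, which are not loaded here):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: the score most_similar sorts neighbors by
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
a = rng.normal(size=128)
b = a + 0.1 * rng.normal(size=128)  # a near-duplicate of a
c = rng.normal(size=128)            # an unrelated vector

print(cosine(a, b))  # close to 1.0
print(cosine(a, c))  # near 0.0 for independent random vectors
```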

Load N-gram Model

```python
import pyarrow.parquet as pq

df = pq.read_table("shi_3gram_word.parquet").to_pandas()
print(df.head())
```
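Once loaded, an n-gram count table can back a simple conditional next-word distribution. The column names below (`w1`, `w2`, `w3`, `count`) are hypothetical and the data is synthetic; inspect `df.head()` for the actual schema of the shipped parquet file.

```python
import pandas as pd

# Synthetic stand-in for a 3-gram count table (schema is an assumption)
df = pd.DataFrame({
    "w1":    ["iga", "iga", "iga"],
    "w2":    ["yan", "yan", "yan"],
    "w3":    ["izwl", "ugḍiḍ", "ass"],
    "count": [40, 35, 25],
})

# Maximum-likelihood P(w3 | w1, w2): count divided by the context total
ctx = df[(df["w1"] == "iga") & (df["w2"] == "yan")]
probs = ctx.set_index("w3")["count"] / ctx["count"].sum()
print(probs)
```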

Models Overview

Performance Dashboard

| Category | Assets |
|---|---|
| Tokenizers | BPE at 8k, 16k, 32k, 64k vocab sizes |
| N-gram models | 2 / 3 / 4 / 5-gram (word & subword) |
| Markov chains | Context 1–5 (word & subword) |
| Embeddings | 32d, 64d, 128d — mono & aligned |
| Vocabulary | Full frequency list + Zipf analysis |
| Statistics | Corpus & model statistics JSON |
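The "Predictability" metric reported for the Markov chains is presumably the fraction of positions where the chain's most likely next token, given the context, is the correct one; the pipeline's exact definition is an assumption. A self-contained sketch on a toy token sequence:

```python
from collections import Counter, defaultdict

def predictability(tokens, k):
    """Fraction of positions where the argmax next-token of a
    context-k Markov chain (trained on the same sequence) is correct."""
    nxt = defaultdict(Counter)
    for i in range(len(tokens) - k):
        nxt[tuple(tokens[i:i + k])][tokens[i + k]] += 1
    hits = total = 0
    for ctx, counter in nxt.items():
        best = counter.most_common(1)[0][0]  # argmax prediction for this context
        for tok, c in counter.items():
            total += c
            if tok == best:
                hits += c
    return hits / total

toy = "yan d mraw iga yan izwl yan d mraw iga yan ass".split()
print(predictability(toy, 1))
print(predictability(toy, 2))  # longer context, fewer surprises
```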

Metrics Summary

| Component | Model | Key Metric | Value |
|---|---|---|---|
| Tokenizer | 8k BPE | Compression | 3.02x |
| Tokenizer | 16k BPE | Compression | 3.30x |
| Tokenizer | 32k BPE | Compression | 3.56x |
| Tokenizer | 64k BPE | Compression | 3.82x 🏆 |
| N-gram | 2-gram (subword) | Perplexity | 255 🏆 |
| N-gram | 2-gram (word) | Perplexity | 1,027 |
| N-gram | 3-gram (subword) | Perplexity | 1,284 |
| N-gram | 3-gram (word) | Perplexity | 1,698 |
| N-gram | 4-gram (subword) | Perplexity | 3,345 |
| N-gram | 4-gram (word) | Perplexity | 3,109 |
| N-gram | 5-gram (subword) | Perplexity | 5,689 |
| N-gram | 5-gram (word) | Perplexity | 3,900 |
| Markov | ctx-1 (subword) | Predictability | 0.0% |
| Markov | ctx-1 (word) | Predictability | 36.7% |
| Markov | ctx-2 (subword) | Predictability | 0.0% |
| Markov | ctx-2 (word) | Predictability | 74.0% |
| Markov | ctx-3 (subword) | Predictability | 17.0% |
| Markov | ctx-3 (word) | Predictability | 91.6% |
| Markov | ctx-4 (subword) | Predictability | 43.6% |
| Markov | ctx-4 (word) | Predictability | 95.2% 🏆 |
| Vocabulary | full | Size | 31,623 |
| Vocabulary | full | Zipf R² | 0.9880 |
| Embeddings | mono_32d | Isotropy | 0.6948 |
| Embeddings | mono_64d | Isotropy | 0.5226 |
| Embeddings | mono_128d | Isotropy | 0.2352 |
| Embeddings | aligned_32d | Isotropy | 0.6948 🏆 |
| Embeddings | aligned_64d | Isotropy | 0.5226 |
| Embeddings | aligned_128d | Isotropy | 0.2352 |
| Alignment | aligned_32d | R@1 / R@5 / R@10 | 0.6% / 2.0% / 5.4% |
| Alignment | aligned_64d | R@1 / R@5 / R@10 | 2.4% / 8.0% / 12.8% |
| Alignment | aligned_128d | R@1 / R@5 / R@10 | 3.6% / 11.2% / 17.8% 🏆 |
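The Zipf R² in the table measures how well log-frequency falls on a straight line against log-rank. A sketch of that fit on a synthetic Zipf-like frequency list (the format of the actual vocabulary file is not assumed here):

```python
import numpy as np

# Synthetic Zipf-like frequencies: f(r) ∝ 1/r, with mild log-normal noise
rng = np.random.default_rng(1)
ranks = np.arange(1, 1001)
freqs = 1e6 / ranks * np.exp(rng.normal(0, 0.05, size=ranks.size))

# Least-squares line in log-log space; R² of this fit is the reported stat
x, y = np.log(ranks), np.log(freqs)
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
r2 = 1 - resid.var() / y.var()
print(f"slope ≈ {slope:.2f}, R² ≈ {r2:.4f}")  # slope near -1 for Zipfian data
```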

📊 Full ablation study, per-model breakdowns, and interpretation guide →


About

Trained on wikipedia-monthly — monthly snapshots of 300+ Wikipedia languages.

A project by Wikilangs · Maintainer: Omar Kamali · Omneity Labs

Citation

```bibtex
@misc{wikilangs2025,
  author    = {Kamali, Omar},
  title     = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year      = {2025},
  doi       = {10.5281/zenodo.18073153},
  publisher = {Zenodo},
  url       = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
}
```

License

MIT — free for academic and commercial use.


Generated by Wikilangs Pipeline · 2026-03-02 12:00:32
