Cebuano — Wikilangs Models

Open-source tokenizers, n-gram & Markov language models, vocabulary stats, and word embeddings trained on Cebuano Wikipedia by Wikilangs.

🌐 Language Page · 🎮 Playground · 📊 Full Research Report

Language Samples

Example sentences drawn from the Cebuano Wikipedia corpus:

Kining maong panid gitagana alang sa lista sa mga tawo nga nahimong mayor sa lalawigan sa Sugbo. Alkalde sa Lalawigan sa Sugbo Alkalde

Ang sekswalidad puyde mopasabot sa: Sekswalidad sa tawo Sekswalidad sa tanom Sekswalidad (oryentasyon) Sekswalidad sa mananap

Katawhan ug Kultura Ekonomiya Heyograpiya Politikal Mga lungsod Dakbayan Mga dakbayan Pisikal Kaagi Mga sumpay sa gawas

Kining maong panid gitagana alang sa lista sa mga tawo nga nahimong gobernador sa lalawigan sa Samar. Mga Gobernador Antonio Bolastig Milagrosa T. Tan Gobernador Gobernador sa Samar

Kining maong panid gitagana alang sa lista sa mga tawo nga nahimong gobernador sa lalawigan sa Biliran. Mga Gobernador (gikan Wayne Jaro Rogelio J. Espina Gobernador Gobernador sa Biliran

Quick Start

Load the Tokenizer

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("ceb_tokenizer_32k.model")

text = "Ang (MDCCL) mao ang usa ka tuig sa kalendaryong Gregoryano. Ang maoy usa ka tuig"
tokens = sp.EncodeAsPieces(text)
ids    = sp.EncodeAsIds(text)

print(tokens)  # subword pieces
print(ids)     # integer ids

# Decode back
print(sp.DecodeIds(ids))

Tokenization examples

Sample 1: Ang (MDCCL) mao ang usa ka tuig sa kalendaryong Gregoryano. Ang maoy usa ka tuig…

| Vocab | Tokens | Count |
|---|---|---|
| 8k | ▁ang ▁( m d c cl ) ▁mao ▁ang ▁usa … (+27 more) | 37 |
| 16k | ▁ang ▁( m d c cl ) ▁mao ▁ang ▁usa … (+24 more) | 34 |
| 32k | ▁ang ▁( m d c cl ) ▁mao ▁ang ▁usa … (+22 more) | 32 |
| 64k | ▁ang ▁( md c cl ) ▁mao ▁ang ▁usa ▁ka … (+21 more) | 31 |

Sample 2: Vilnius - Ulohan, Lyetuwanya. lungsod ug dakbayan sa Uropa

| Vocab | Tokens | Count |
|---|---|---|
| 8k | ▁v il n ius ▁- ▁ulo han , ▁ly et … (+9 more) | 19 |
| 16k | ▁vil n ius ▁- ▁ulohan , ▁ly et uw an … (+7 more) | 17 |
| 32k | ▁vil n ius ▁- ▁ulohan , ▁ly et uw an … (+7 more) | 17 |
| 64k | ▁vil n ius ▁- ▁ulohan , ▁lyetuwanya . ▁lungsod ▁ug … (+3 more) | 13 |

Sample 3: Ang manunuwat usa ka tawo nga naay propesyon sa pagsulat.

| Vocab | Tokens | Count |
|---|---|---|
| 8k | ▁ang ▁man un u wat ▁usa ▁ka ▁tawo ▁nga ▁na … (+9 more) | 19 |
| 16k | ▁ang ▁man un u wat ▁usa ▁ka ▁tawo ▁nga ▁na … (+8 more) | 18 |
| 32k | ▁ang ▁man un u wat ▁usa ▁ka ▁tawo ▁nga ▁naay … (+6 more) | 16 |
| 64k | ▁ang ▁man un uwat ▁usa ▁ka ▁tawo ▁nga ▁naay ▁propes … (+4 more) | 14 |
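
The token counts for Sample 3 can be turned into a rough characters-per-token figure, since that sample's text is shown in full. The sketch below assumes that "compression" means characters per token, which this card does not define; `chars_per_token` is a name introduced here for illustration only.

```python
# Sample 3 text and its token counts, copied from the example above.
sample_3 = "Ang manunuwat usa ka tawo nga naay propesyon sa pagsulat."
token_counts = {"8k": 19, "16k": 18, "32k": 16, "64k": 14}


def chars_per_token(text: str, n_tokens: int) -> float:
    """Characters per token (an assumed definition of compression)."""
    return len(text) / n_tokens


# One ratio per vocabulary size.
ratios = {vocab: chars_per_token(sample_3, n) for vocab, n in token_counts.items()}

for vocab, ratio in ratios.items():
    print(f"{vocab}: {ratio:.2f} chars/token")
```

For this one sentence the ratio rises with vocabulary size, because a larger vocabulary covers the same text with fewer pieces. A single sentence is not expected to reproduce corpus-level figures.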

Load Word Embeddings

from gensim.models import KeyedVectors

# Aligned embeddings (cross-lingual, mapped to English vector space)
wv = KeyedVectors.load("ceb_embeddings_128d_aligned.kv")

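# Replace "tawo" with any token in the embedding vocabulary (wv.key_to_index);
# most_similar raises KeyError for tokens that are not in the vocabulary.
query = "tawo"
if query in wv.key_to_index:
    for word, score in wv.most_similar(query, topn=5):
        print(f"  {word}: {score:.3f}")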
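
`most_similar` ranks neighbours by cosine similarity between vectors. A minimal pure-Python sketch of that ranking follows. It uses made-up 3-dimensional toy vectors rather than the real embeddings, and `cosine` and `top_n` are hypothetical helpers written for this illustration, not gensim functions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (invented values, for illustration only).
toy_vectors = {
    "lungsod": [0.9, 0.1, 0.0],
    "dakbayan": [0.8, 0.2, 0.1],
    "tuig": [0.0, 0.1, 0.9],
}


def top_n(query, vectors, n=2):
    """Return the n words closest to `query`, highest cosine first."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


neighbours = top_n("lungsod", toy_vectors)
for word, score in neighbours:
    print(f"  {word}: {score:.3f}")
```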

Load N-gram Model

import pyarrow.parquet as pq

df = pq.read_table("ceb_3gram_word.parquet").to_pandas()
print(df.head())
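
The column layout of the parquet file is not documented in this card, so the sketch below does not read it. It only illustrates what a word 3-gram table holds: counts of three-word sequences, built here from two of the sample sentences above. Lowercasing and whitespace splitting are assumptions (the released models may tokenize differently), and `trigram_counts` and `next_words` are hypothetical names.

```python
from collections import Counter

# Two sample sentences from the Language Samples section.
sentences = [
    "Kining maong panid gitagana alang sa lista sa mga tawo nga nahimong mayor sa lalawigan sa Sugbo.",
    "Kining maong panid gitagana alang sa lista sa mga tawo nga nahimong gobernador sa lalawigan sa Samar.",
]


def trigram_counts(texts):
    """Count word trigrams using lowercased whitespace tokens."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(zip(words, words[1:], words[2:]))
    return counts


def next_words(counts, w1, w2):
    """Words observed after the two-word context (w1, w2), with counts."""
    followers = Counter()
    for (a, b, c), n in counts.items():
        if (a, b) == (w1, w2):
            followers[c] += n
    return followers.most_common()


counts = trigram_counts(sentences)
print(counts[("kining", "maong", "panid")])
print(next_words(counts, "nga", "nahimong"))
```

The same lookup is what a context-2 word Markov chain does when it proposes the next word for a given two-word context.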

Models Overview

Performance Dashboard

| Category | Assets |
|---|---|
| Tokenizers | BPE at 8k, 16k, 32k, 64k vocab sizes |
| N-gram models | 2 / 3 / 4 / 5-gram (word & subword) |
| Markov chains | Context 1–5 (word & subword) |
| Embeddings | 32d, 64d, 128d — mono & aligned |
| Vocabulary | Full frequency list + Zipf analysis |
| Statistics | Corpus & model statistics JSON |

Metrics Summary

| Component | Model | Key Metric | Value |
|---|---|---|---|
| Tokenizer | 8k BPE | Compression | 3.20x |
| Tokenizer | 16k BPE | Compression | 3.59x |
| Tokenizer | 32k BPE | Compression | 3.89x |
| Tokenizer | 64k BPE | Compression | 4.16x 🏆 |
| N-gram | 2-gram (subword) | Perplexity | 244 🏆 |
| N-gram | 2-gram (word) | Perplexity | 1,490 |
| N-gram | 3-gram (subword) | Perplexity | 1,343 |
| N-gram | 3-gram (word) | Perplexity | 2,538 |
| N-gram | 4-gram (subword) | Perplexity | 3,750 |
| N-gram | 4-gram (word) | Perplexity | 4,059 |
| N-gram | 5-gram (subword) | Perplexity | 6,751 |
| N-gram | 5-gram (word) | Perplexity | 5,049 |
| Markov | ctx-1 (subword) | Predictability | 13.0% |
| Markov | ctx-1 (word) | Predictability | 0.0% |
| Markov | ctx-2 (subword) | Predictability | 32.8% |
| Markov | ctx-2 (word) | Predictability | 66.0% |
| Markov | ctx-3 (subword) | Predictability | 28.5% |
| Markov | ctx-3 (word) | Predictability | 83.0% |
| Markov | ctx-4 (subword) | Predictability | 31.1% |
| Markov | ctx-4 (word) | Predictability | 94.4% 🏆 |
| Vocabulary | full | Size | 208,251 |
| Vocabulary | full | Zipf R² | 0.9938 |
| Embeddings | mono_32d | Isotropy | 0.8551 |
| Embeddings | mono_64d | Isotropy | 0.8254 |
| Embeddings | mono_128d | Isotropy | 0.7631 |
| Embeddings | aligned_32d | Isotropy | 0.8551 🏆 |
| Embeddings | aligned_64d | Isotropy | 0.8254 |
| Embeddings | aligned_128d | Isotropy | 0.7631 |
| Alignment | aligned_32d | R@1 / R@5 / R@10 | 5.8% / 18.8% / 31.4% |
| Alignment | aligned_64d | R@1 / R@5 / R@10 | 11.2% / 32.6% / 46.4% |
| Alignment | aligned_128d | R@1 / R@5 / R@10 | 23.8% / 47.0% / 59.2% 🏆 |

📊 Full ablation study, per-model breakdowns, and interpretation guide →
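
The alignment rows report R@1 / R@5 / R@10. Assuming these are standard retrieval recall figures (the share of source words whose reference translation appears among the top k nearest neighbours in the target space), they could be computed along the lines below. The ranked lists and gold pairs are invented toy data, and `recall_at_k` is a hypothetical helper, not part of the Wikilangs pipeline.

```python
def recall_at_k(ranked, gold, k):
    """Fraction of source words whose gold target is in the top-k candidates."""
    hits = sum(1 for src, tgt in gold.items() if tgt in ranked.get(src, [])[:k])
    return hits / len(gold)


# Toy nearest-neighbour lists per source word, best candidate first (invented).
ranked = {
    "tawo": ["person", "people", "man"],
    "tuig": ["month", "year", "day"],
    "lungsod": ["village", "province", "town"],
    "dakbayan": ["region", "river", "island"],
}

# Toy reference translations (invented evaluation dictionary).
gold = {"tawo": "person", "tuig": "year", "lungsod": "town", "dakbayan": "city"}

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(ranked, gold, k):.0%}")
```

Recall can only stay flat or grow as k increases, which matches the pattern in each alignment row above.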


About

Trained on wikipedia-monthly — monthly snapshots of 300+ Wikipedia languages.

A project by Wikilangs · Maintainer: Omar Kamali · Omneity Labs

Citation

@misc{wikilangs2025,
  author    = {Kamali, Omar},
  title     = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year      = {2025},
  doi       = {10.5281/zenodo.18073153},
  publisher = {Zenodo},
  url       = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
}

Links

License: MIT — free for academic and commercial use.


Generated by Wikilangs Pipeline · 2026-03-04 08:49:55
