Tags: token-classification, safetensors, tatar, bert, morphology, mbert

Multilingual BERT (mBERT) fine-tuned for Tatar Morphological Analysis

This model is a fine-tuned version of bert-base-multilingual-cased for morphological analysis of the Tatar language. It was trained on a subset of 80,000 sentences from the Tatar Morphological Corpus. The model predicts fine-grained morphological tags (e.g., N+Sg+Nom, V+PRES(Й)+3SG).

Performance on Test Set

| Metric         | Value  | 95% CI            |
|----------------|--------|-------------------|
| Token Accuracy | 0.9905 | [0.9898, 0.9913]  |
| Micro F1       | 0.9905 | [0.9897, 0.9913]  |
| Macro F1       | 0.5563 | [0.5954, 0.6387]* |

*Note: the macro F1 confidence interval is reproduced as reported in the paper; it does not bracket the point estimate.
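For single-label token classification, micro-averaged F1 pooled over all classes equals plain token accuracy, which is why the first two rows agree to four decimal places. A minimal pure-Python check on toy tag sequences (illustrative data, not model output):

```python
# Toy gold/predicted tags, one label per token (illustrative only).
gold = ["N+Sg+Nom", "V+PRES(Й)+3SG", "Adv", "N+Sg+Nom", "PUNCT"]
pred = ["N+Sg+Nom", "V+PRES(Й)+3SG", "Adj", "N+Sg+Nom", "PUNCT"]

# Token accuracy: fraction of tokens whose predicted tag matches gold.
tp = sum(g == p for g, p in zip(gold, pred))
accuracy = tp / len(gold)

# Micro F1 pools TP/FP/FN over all classes. With exactly one label per
# token, every error counts as one FP (for the predicted class) and one
# FN (for the gold class), so precision = recall = accuracy.
fp = len(pred) - tp
fn = len(gold) - tp
precision = tp / (tp + fp)
recall = tp / (tp + fn)
micro_f1 = 2 * precision * recall / (precision + recall)

assert abs(micro_f1 - accuracy) < 1e-9
print(accuracy)  # 0.8
```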

Accuracy by Part of Speech (Top 10)

| POS   | Accuracy |
|-------|----------|
| PUNCT | 1.0000   |
| NOUN  | 0.9905   |
| VERB  | 0.9718   |
| ADJ   | 0.9718   |
| PRON  | 0.9918   |
| PART  | 0.9986   |
| PROPN | 0.9779   |
| ADP   | 1.0000   |
| CCONJ | 1.0000   |
| ADV   | 0.9948   |
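Per-POS accuracies like those above can be computed from (gold, predicted) tag pairs by grouping tokens on the gold POS. A minimal sketch with toy data, assuming the coarse POS is recoverable as the first `+`-separated component of a tag (the table itself uses UD-style labels such as NOUN, which may come from a separate mapping):

```python
from collections import defaultdict

# Toy gold/predicted tag pairs (illustrative, not real evaluation data).
pairs = [
    ("N+Sg+Nom", "N+Sg+Nom"),
    ("N+Pl+Nom", "N+Sg+Nom"),
    ("V+PRES(Й)+3SG", "V+PRES(Й)+3SG"),
    ("PUNCT", "PUNCT"),
]

# Assumption: the coarse POS is the first '+'-separated tag component.
def pos_of(tag: str) -> str:
    return tag.split("+", 1)[0]

correct = defaultdict(int)
total = defaultdict(int)
for gold, pred in pairs:
    pos = pos_of(gold)
    total[pos] += 1
    correct[pos] += int(gold == pred)

for pos in sorted(total):
    print(pos, correct[pos] / total[pos])
```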

Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "TatarNLPWorld/mbert-tatar-morph"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

tokens = ["Татар", "теле", "бик", "бай", "."]
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Map predicted label IDs to tag strings via the model config.
id2tag = model.config.id2label

# Decode only the first sub-token of each word (requires a fast tokenizer
# for word_ids()).
word_ids = inputs.word_ids()
prev_word = None
for idx, word_idx in enumerate(word_ids):
    if word_idx is not None and word_idx != prev_word:
        tag_id = predictions[0][idx].item()
        # id2label keys may be ints or strings depending on how the
        # config was serialized, so try both.
        if isinstance(id2tag, dict):
            tag = id2tag.get(tag_id, id2tag.get(str(tag_id), "UNK"))
        else:
            tag = id2tag[tag_id] if tag_id < len(id2tag) else "UNK"
        print(tokens[word_idx], "->", tag)
    prev_word = word_idx
```

Expected output (approximately):

```
Татар -> N+Sg+Nom
теле -> N+Sg+POSS_3(СЫ)+Nom
бик -> Adv
бай -> Adj
. -> PUNCT
```
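The predicted tags are `+`-separated strings. A small helper, shown here as an illustrative sketch rather than part of the model's API, splits one into a coarse POS and its feature list:

```python
def parse_tag(tag: str):
    """Split a morphological tag like 'N+Sg+POSS_3(СЫ)+Nom' into
    (POS, features), assuming the first component is the POS."""
    parts = tag.split("+")
    return parts[0], parts[1:]

print(parse_tag("N+Sg+POSS_3(СЫ)+Nom"))  # ('N', ['Sg', 'POSS_3(СЫ)', 'Nom'])
print(parse_tag("PUNCT"))                # ('PUNCT', [])
```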

Citation

If you use this model, please cite it as:

```bibtex
@misc{arabov-mbert-tatar-morph-2026,
  title = {Multilingual BERT (mBERT) fine-tuned for Tatar Morphological Analysis},
  author = {Arabov Mullosharaf Kurbonovich},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/TatarNLPWorld/mbert-tatar-morph}
}
```

License

Apache 2.0

Model size: 0.2B params (F32, Safetensors)