Model Card for tatar-morph-mbert
Multilingual BERT (mBERT) fine‑tuned for morphological analysis of the Tatar language – token‑level prediction of full morphological tags (including part‑of‑speech, number, case, possession, etc.). This model is part of the TatarNLPWorld collection of Turkic and low‑resource language tools.
Model Details
Model Description
- Developed by: Arabov Mullosharaf Kurbonovich (TatarNLPWorld community)
- Model type: Transformer‑based token classification (fine‑tuned multilingual BERT)
- Language(s) (NLP): Tatar (tt); the base model supports 104 languages, enabling strong cross‑lingual transfer
- License: Apache 2.0
- Finetuned from model: bert-base-multilingual-cased
- Original repository: TatarNLPWorld/tatar-morph-mbert
Model Sources
- Repository: TatarNLPWorld/tatar-morph-mbert
- Demo: Tatar Morphological Analyzer Space (select this model)
- Paper (to appear): "Comparing Multilingual and Monolingual Transformers for Tatar Morphology" (LREC 2026, submitted)
Uses
Direct Use
The model performs token‑level morphological tagging of Tatar sentences. Given a raw sentence, it returns a list of tokens with the predicted full morphological tags (e.g., N+Sg+Nom, V+Past+3, PUNCT).
Example use cases:
- Linguistic research and corpus annotation
- Preprocessing for downstream Tatar NLP tasks (machine translation, information extraction)
- Educational tools for learning Tatar morphology
Downstream Use
The predicted tags can be used as features in higher‑level systems:
- Dependency parsing
- Named entity recognition
- Text‑to‑speech (grapheme‑to‑phoneme conversion)
Out-of-Scope Use
The model is not intended for:
- Languages other than Tatar (the model will still emit tags for unrelated languages, but those predictions are unreliable)
- Grammatical error correction (it only labels existing tokens)
- Dialectal or historical forms not present in the training corpus
Bias, Risks, and Limitations
- Training data bias: The model was fine‑tuned on a 60k‑sentence subset of the Tatar morphological corpus, which may under‑represent certain genres (e.g., spoken language, very informal texts) and rare morphological phenomena.
- Tokenization: mBERT uses WordPiece tokenization, so some Tatar words are split into subwords at linguistically suboptimal boundaries. Fine‑tuning largely compensates for this, but segmentation can still contribute to errors on rare word forms.
- Computational resource: The model is a full‑size BERT (∼180M parameters) and may be too heavy for real‑time applications on CPU. Consider using the DistilBERT version for faster inference.
Recommendations
- Users should evaluate the model on their own domain data before deployment.
- For highly infrequent word forms, manual verification of predictions is advised.
- The model may reflect social biases present in the training corpus; use responsibly.
How to Get Started with the Model
```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

model_name = "TatarNLPWorld/tatar-morph-mbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Using the token-classification pipeline (one prediction per subword)
pipe = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="none")

sentence = "Мин татарча сөйләшәм."
predictions = pipe(sentence)
for pred in predictions:
    print(f"{pred['word']}: {pred['entity']}")
```
For a full inference example with proper word alignment, check the demo space or the repository examples.
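The alignment itself is simple enough to sketch without loading the model: with a fast tokenizer, `tokenizer(...).word_ids()` maps each subword position to its word index (or `None` for special tokens), and predictions are kept only at each word's first subword. The `word_ids` list below is a hypothetical example of that output, not real tokenizer output for this sentence:

```python
def first_subword_indices(word_ids):
    """Given word_ids from a fast tokenizer (None for special tokens, the
    word index repeated for each subword), return the positions of each
    word's first subword -- the positions whose predictions we keep."""
    seen, keep = set(), []
    for pos, wid in enumerate(word_ids):
        if wid is not None and wid not in seen:
            seen.add(wid)
            keep.append(pos)
    return keep

# Hypothetical segmentation: [CLS] Мин татар ##ча сөйләш ##әм . [SEP]
word_ids = [None, 0, 1, 1, 2, 2, 3, None]
print(first_subword_indices(word_ids))  # -> [1, 2, 4, 6]
```

Indexing the per-subword predictions with these positions yields one tag per word.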
Training Details
Training Data
The model was fine‑tuned on a 60,000‑sentence subset of the TatarNLPWorld/tatar-morphological-corpus.
- Total sentences (after filtering empty): 59,992
- Train / validation / test split: 47,993 / 5,999 / 6,000 sentences
- Tag set size: 1,181 unique morphological tags (full tag sequences, e.g., N+Sg+Nom, V+Past+3, PUNCT)
- Sampling: shuffled with seed 42
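For reference, an 80/10/10 split of 59,992 shuffled sentences reproduces the reported counts exactly; the sketch below makes that arithmetic concrete (an assumption about the split procedure, not the official processing script):

```python
import random

def split_dataset(sentences, seed=42, train=0.8, val=0.1):
    """Shuffle with a fixed seed, then split 80/10/10 by truncating
    the fractional boundaries (test gets the remainder)."""
    data = list(sentences)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(n * train)
    n_val = int(n * val)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

train, val, test = split_dataset(range(59992))
print(len(train), len(val), len(test))  # -> 47993 5999 6000
```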
Training Procedure
Preprocessing
- Sentences and their token‑level tags were extracted using the official processing script.
- Labels aligned to the first subword token of each word (-100 for all other subwords).
- Maximum sequence length: 128 tokens (median sentence length is 6 tokens, so truncation is rare).
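A minimal sketch of this label-alignment step (a hypothetical helper, not the official processing script), where `word_ids` mimics the output of a fast tokenizer's `word_ids()`:

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Assign each word's label id to its first subword; all other
    positions (special tokens and continuation subwords) get -100 so
    they are ignored by the cross-entropy loss."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[wid])
        prev = wid
    return aligned

# Hypothetical word_ids for "[CLS] Мин татар ##ча . [SEP]" with 3 word labels
print(align_labels([None, 0, 1, 1, 2, None], [5, 7, 9]))
# -> [-100, 5, 7, -100, 9, -100]
```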
Training Hyperparameters
- Model: bert-base-multilingual-cased
- Batch size: 16 (per device) × 2 gradient accumulation steps (effective batch size 32)
- Learning rate: 2e-5
- Optimizer: AdamW (weight decay 0.01)
- Warmup steps: 500
- Number of epochs: 4
- Mixed precision: FP16 (enabled on GPU)
- Evaluation strategy: per epoch
- Save strategy: per epoch, keep best model based on validation token accuracy
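These hyperparameters map roughly onto a `TrainingArguments` configuration like the following (a sketch; the actual training script may differ, and older `transformers` releases name `eval_strategy` as `evaluation_strategy`):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="tatar-morph-mbert",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch size 32
    learning_rate=2e-5,
    weight_decay=0.01,               # AdamW is the default optimizer
    warmup_steps=500,
    num_train_epochs=4,
    fp16=True,                       # mixed precision; requires a GPU
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    seed=42,
)
```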
Speeds, Sizes, Times
- Hardware: 1× NVIDIA Tesla V100 32GB
- Training time: ~6.5 hours (4 epochs)
- Model size: ~680 MB (PyTorch checkpoint)
- Inference speed: ~150 sentences/sec on V100 (batch size 16)
Evaluation
Testing Data, Factors & Metrics
Testing Data
The test set consists of 6,000 sentences (held‑out, not seen during training) containing 47,373 tokens that are present in the tag vocabulary (evaluable tokens).
Metrics
We report token‑level classification metrics computed only on evaluable tokens:
- Token Accuracy – proportion of correctly predicted tags.
- Precision / Recall / F1 (micro) – micro‑averaged over all tags.
- F1 (macro) – macro‑average over tags (treats each tag equally, irrespective of frequency).
- Confidence intervals – 95% bootstrap intervals (1,000 iterations).
Detailed per‑POS accuracies are available in the results/pos_accuracy.csv file of this repository.
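A percentile-bootstrap interval of this kind can be sketched in a few lines (illustrative only, with made-up gold/predicted tag lists; not the repository's evaluation code):

```python
import random

def bootstrap_accuracy_ci(gold, pred, iters=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for token accuracy: resample token pairs
    with replacement and take the empirical alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(gold)
    scores = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(gold[i] == pred[i] for i in idx) / n)
    scores.sort()
    lo = scores[int(alpha / 2 * iters)]
    hi = scores[min(iters - 1, int((1 - alpha / 2) * iters))]
    return lo, hi

# Toy data: 299 of 300 tokens tagged correctly
gold = ["N+Sg+Nom", "V+Past+3", "PUNCT"] * 100
pred = gold[:-1] + ["N+Sg+Nom"]
print(bootstrap_accuracy_ci(gold, pred))
```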
Results
| Metric | Value | 95% CI |
|---|---|---|
| Token Accuracy | 0.9868 | [0.9858, 0.9878] |
| F1 (micro) | 0.9868 | [0.9858, 0.9878] |
| F1 (macro) | 0.5094 | [0.4873, 0.5315] |
| Precision (micro) | 0.9868 | (same as F1 micro) |
| Recall (micro) | 0.9868 | (same as F1 micro) |
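The large gap between micro and macro F1 is expected with 1,181 tags: for single-label tagging, micro F1 equals token accuracy, while macro F1 averages per-tag F1 and is dragged down by the long tail of rare tags. A toy illustration with hypothetical counts (one frequent tag almost always right, one rare tag always confused):

```python
from collections import Counter

def per_tag_f1(gold, pred):
    """Per-tag F1 for single-label tagging."""
    tags = set(gold) | set(pred)
    tp = Counter(g for g, p in zip(gold, pred) if g == p)
    gold_n, pred_n = Counter(gold), Counter(pred)
    f1 = {}
    for t in tags:
        prec = tp[t] / pred_n[t] if pred_n[t] else 0.0
        rec = tp[t] / gold_n[t] if gold_n[t] else 0.0
        f1[t] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return f1

gold = ["N+Sg+Nom"] * 98 + ["V+Imp+2Pl"] * 2   # rare tag: 2 of 100 tokens
pred = ["N+Sg+Nom"] * 100                       # rare tag always mislabeled
f1 = per_tag_f1(gold, pred)
micro = sum(g == p for g, p in zip(gold, pred)) / len(gold)
macro = sum(f1.values()) / len(f1)
print(micro, round(macro, 3))  # -> 0.98 0.495
```

The same mechanism explains a token accuracy of 0.9868 coexisting with a macro F1 near 0.51.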
Performance by part‑of‑speech (top 5 frequent POS):
| POS | Accuracy |
|---|---|
| PUNCT | 1.0000 |
| NOUN | 0.9875 |
| VERB | 0.9812 |
| ADP | 0.9965 |
| ADJ | 0.9750 |
Full POS breakdown is available in results/pos_accuracy.csv.
Summary
Multilingual BERT achieves the highest accuracy among all models in our series, demonstrating excellent cross‑lingual transfer to Tatar. It correctly tags almost all tokens, with the macro F1 being lower only due to the long tail of extremely rare tag combinations. This model is recommended when maximum accuracy is required and computational resources are sufficient.
Model Examination
Manual error analysis revealed that most errors occur on:
- Rare verb forms with multiple affixes.
- Compound words and neologisms.
- Proper nouns of foreign origin.
Citation
BibTeX:
@misc{tatar-morph-mbert,
author = {Arabov, Mullosharaf Kurbonovich and TatarNLPWorld Contributors},
title = {Multilingual BERT for Tatar Morphological Analysis},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/TatarNLPWorld/tatar-morph-mbert}}
}
APA:
Arabov, M. K., & TatarNLPWorld. (2026). Multilingual BERT for Tatar Morphological Analysis [Model]. Hugging Face. https://huggingface.co/TatarNLPWorld/tatar-morph-mbert
More Information
- Sister models:
- Dataset: TatarNLPWorld/tatar-morphological-corpus
- Contact: For questions or collaborations, please open an issue on the repository.
Model Card Authors
Arabov Mullosharaf Kurbonovich (TatarNLPWorld)
Model Card Contact
For questions, open an issue on the TatarNLPWorld/tatar-morph-mbert repository.