Model Card for tatar-morph-distilbert

DistilBERT (multilingual) fine‑tuned for morphological analysis of the Tatar language – token‑level prediction of full morphological tags (including part‑of‑speech, number, case, possession, etc.). This model is part of the TatarNLPWorld collection of Turkic and low‑resource language tools.

Model Details

Model Description

  • Developed by: Arabov Mullosharaf Kurbonovich (TatarNLPWorld community)
  • Model type: Lightweight transformer‑based token classification (fine‑tuned DistilBERT)
  • Language(s) (NLP): Tatar (tt); the base model supports 100+ languages
  • License: Apache 2.0
  • Finetuned from model: distilbert-base-multilingual-cased
  • Original repository: TatarNLPWorld/tatar-morph-distilbert

Uses

Direct Use

The model performs token‑level morphological tagging of Tatar sentences. Given a raw sentence, it returns a list of tokens with the predicted full morphological tags (e.g., N+Sg+Nom, V+Past+3, PUNCT).
Example use cases:

  • Linguistic research and corpus annotation
  • Preprocessing for downstream Tatar NLP tasks (machine translation, information extraction)
  • Educational tools for learning Tatar morphology

Downstream Use

The predicted tags can be used as features in higher‑level systems:

  • Dependency parsing
  • Named entity recognition
  • Text‑to‑speech (grapheme‑to‑phoneme conversion)

Out-of-Scope Use

The model is not intended for:

  • Languages other than Tatar (it will still emit tags for other languages, but they will be unreliable)
  • Grammatical error correction (it only labels existing tokens)
  • Dialectal or historical forms not present in the training corpus

Bias, Risks, and Limitations

  • Training data bias: The model was fine‑tuned on a 60k‑sentence subset of the Tatar morphological corpus, which may under‑represent certain genres and rare morphological phenomena.
  • Tokenization: DistilBERT uses the same tokenizer as mBERT, so some Tatar words may be split into subwords suboptimally; fine‑tuning mitigates this only partially, and accuracy can suffer on rare or morphologically complex forms.
  • Lightweight architecture: DistilBERT has fewer parameters (∼134M) than full BERT. This reduces memory use and latency, but it may limit accuracy on rare tags (see the low macro F1 in the Results section).

Recommendations

  • Users should evaluate the model on their own domain data before deployment.
  • For highly infrequent word forms, manual verification of predictions is advised.
  • The model may reflect social biases present in the training corpus; use responsibly.

How to Get Started with the Model

```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

model_name = "TatarNLPWorld/tatar-morph-distilbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Using the token-classification pipeline (one prediction per subword)
pipe = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="none")
sentence = "Мин татарча сөйләшәм."
predictions = pipe(sentence)
for pred in predictions:
    print(f"{pred['word']}: {pred['entity']}")
```

For a full inference example with proper word alignment, check the demo space or the repository examples.
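
The word‑alignment step itself can be sketched as a pure function: since labels are attached to the first subword of each word, decoding keeps only the first prediction per word. The `word_ids`, `pred_ids`, and tag names below are illustrative stand‑ins; in real use they come from `tokenizer(..., is_split_into_words=True).word_ids()` and the model's argmaxed logits.

```python
def align_word_tags(words, word_ids, pred_ids, id2label):
    """Keep the prediction for the first subword of each word."""
    tags, seen = [], set()
    for wid, pid in zip(word_ids, pred_ids):
        if wid is None or wid in seen:
            continue  # special tokens and continuation subwords
        seen.add(wid)
        tags.append((words[wid], id2label[pid]))
    return tags

# Illustrative values: word 1 and word 2 each split into two subwords,
# with [CLS]/[SEP] mapped to None. Tag names are examples only.
words = ["Мин", "татарча", "сөйләшәм", "."]
word_ids = [None, 0, 1, 1, 2, 2, 3, None]
pred_ids = [0, 5, 7, 7, 9, 9, 1, 0]
id2label = {0: "X", 1: "PUNCT", 5: "PRON", 7: "ADV", 9: "V+Pres+1Sg"}
print(align_word_tags(words, word_ids, pred_ids, id2label))
# → [('Мин', 'PRON'), ('татарча', 'ADV'), ('сөйләшәм', 'V+Pres+1Sg'), ('.', 'PUNCT')]
```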

Training Details

Training Data

The model was fine‑tuned on a 60,000‑sentence subset of the TatarNLPWorld/tatar-morphological-corpus.

  • Total sentences (after filtering empty): 59,992
  • Train / validation / test split: 47,993 / 5,999 / 6,000 sentences
  • Tag set size: 1,181 unique morphological tags (full tag sequences)
  • Sampling: Shuffled with seed 42.

Training Procedure

Preprocessing

  • Sentences and their token‑level tags were extracted using the official processing script.
  • Labels aligned to the first subword token of each word (-100 for other subwords).
  • Maximum sequence length: 128 tokens (median sentence length 6, so truncation is rare).
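
The first‑subword alignment described above can be sketched as follows (a minimal sketch, not the official processing script; `tag2id` and the example `word_ids` are illustrative):

```python
def align_labels(word_ids, word_tags, tag2id):
    """Label only the first subword of each word; -100 everywhere else."""
    labels, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            labels.append(-100)  # special token or continuation subword
        else:
            labels.append(tag2id[word_tags[wid]])
        prev = wid
    return labels

tag2id = {"N+Sg+Nom": 0, "V+Past+3": 1, "PUNCT": 2}
# word_ids for a 3-word sentence where word 1 splits into two subwords
word_ids = [None, 0, 1, 1, 2, None]
print(align_labels(word_ids, ["N+Sg+Nom", "V+Past+3", "PUNCT"], tag2id))
# → [-100, 0, 1, -100, 2, -100]
```

The -100 positions are ignored by the cross‑entropy loss during training, so only first subwords contribute gradients.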

Training Hyperparameters

  • Model: distilbert-base-multilingual-cased
  • Batch size: 32 (per device) × 1 gradient accumulation step (effective batch 32)
  • Learning rate: 2e-5
  • Optimizer: AdamW (weight decay 0.01)
  • Warmup steps: 500
  • Number of epochs: 4
  • Mixed precision: FP16 (enabled on GPU)
  • Evaluation strategy: per epoch
  • Save strategy: per epoch, keep best model based on validation token accuracy
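
The hyperparameters above map roughly onto a Hugging Face `TrainingArguments` configuration like the following (a sketch, not the exact training script; argument names follow recent transformers releases, where `evaluation_strategy` was renamed `eval_strategy`):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="tatar-morph-distilbert",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,     # effective batch size 32
    learning_rate=2e-5,
    weight_decay=0.01,                 # AdamW is the Trainer default
    warmup_steps=500,
    num_train_epochs=4,
    fp16=True,                         # mixed precision on GPU
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",  # validation token accuracy
)
```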

Speeds, Sizes, Times

  • Hardware: 1× NVIDIA Tesla V100 32GB
  • Training time: ~4.2 hours (4 epochs)
  • Model size: ~350 MB (PyTorch checkpoint)
  • Inference speed: ~250 sentences/sec on V100 (batch size 32) – significantly faster than BERT base.

Evaluation

Testing Data, Factors & Metrics

Testing Data

The test set consists of 6,000 sentences (held‑out, not seen during training) containing 47,373 tokens that are present in the tag vocabulary (evaluable tokens).

Metrics

We report token‑level classification metrics computed only on evaluable tokens:

  • Token Accuracy – proportion of correctly predicted tags.
  • Precision / Recall / F1 (micro) – micro‑averaged over all tags.
  • F1 (macro) – macro‑average over tags.
  • Confidence intervals – 95% bootstrap intervals (1,000 iterations).

Detailed per‑POS accuracies are available in the results/pos_accuracy.csv file of this repository.
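
A 95% bootstrap interval of the kind reported below can be sketched in plain Python (a minimal sketch assuming token‑level resampling with 1,000 iterations; the actual evaluation code may differ):

```python
import random

def bootstrap_ci(correct, n_iter=1000, alpha=0.05, seed=42):
    """95% bootstrap CI for accuracy; `correct` is a list of 0/1 per token."""
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_iter)
    )
    lo = stats[int((alpha / 2) * n_iter)]
    hi = stats[int((1 - alpha / 2) * n_iter) - 1]
    return lo, hi

# Toy example: 980 of 1,000 tokens tagged correctly
correct = [1] * 980 + [0] * 20
lo, hi = bootstrap_ci(correct)
print(f"[{lo:.4f}, {hi:.4f}]")
```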

Results

| Metric            | Value  | 95% CI              |
|-------------------|--------|---------------------|
| Token Accuracy    | 0.9798 | [0.9785, 0.9810]    |
| F1 (micro)        | 0.9798 | [0.9786, 0.9810]    |
| F1 (macro)        | 0.4402 | [0.4240, 0.4575]    |
| Precision (micro) | 0.9798 | same as F1 (micro)  |
| Recall (micro)    | 0.9798 | same as F1 (micro)  |

Performance by part‑of‑speech (top 5 frequent POS):

| POS   | Accuracy |
|-------|----------|
| PUNCT | 1.0000   |
| NOUN  | 0.9805   |
| VERB  | 0.9732   |
| ADP   | 0.9941   |
| ADJ   | 0.9603   |

Full POS breakdown is available in results/pos_accuracy.csv.

Summary

DistilBERT achieves strong token‑level accuracy (0.98) while being ∼40% smaller and ∼60% faster than full BERT models, making it a good choice for production deployments where speed and memory matter. Note, however, that the low macro F1 (0.44) indicates that rare tags in the 1,181‑tag set are predicted far less reliably than frequent ones.

Model Examination

Manual error analysis revealed that most errors occur on:

  • Rare verb forms with multiple affixes.
  • Words where the tokenizer splits in a way that obscures the stem.
  • Out‑of‑vocabulary proper nouns.

Citation

BibTeX:

@misc{tatar-morph-distilbert,
  author = {Arabov, Mullosharaf Kurbonovich and TatarNLPWorld Contributors},
  title = {DistilBERT for Tatar Morphological Analysis},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/TatarNLPWorld/tatar-morph-distilbert}}
}

APA:

Arabov, M. K., & TatarNLPWorld. (2026). DistilBERT for Tatar Morphological Analysis [Model]. Hugging Face. https://huggingface.co/TatarNLPWorld/tatar-morph-distilbert

Model Card Authors

Arabov Mullosharaf Kurbonovich (TatarNLPWorld)

Model Card Contact

https://huggingface.co/TatarNLPWorld
