Model Card for SirEthanK/opus-mt-en-ru-health-finetuned
This model is a domain-adapted English→Russian neural machine translation model, finetuned from Helsinki-NLP/opus-mt-en-ru. It has been trained specifically on medical and health-related English–Russian parallel text, enabling more accurate translations of clinical instructions, medical terminology, and news on public health.
This model uses full-parameter finetuning and is intended for research, experimentation, and academic work in domain-specific NMT.
Model Details
Model Description
This model is based on the OPUS-MT English→Russian MarianMT Transformer architecture and has been further trained on curated medical-domain parallel corpora. The finetuning objective was to improve translation quality for:
- medical terminology
- symptom descriptions
- health-related instructions
- public health reporting and medical news
Compared to the base OPUS model, this finetuned version achieves significantly higher SacreBLEU, chrF, and METEOR scores on medical-domain evaluation sets. Performance on general-domain text shows only modest, dataset-dependent changes, which is expected for domain-adapted NMT models.
This model is intended for research and educational purposes, especially for exploring domain adaptation in low-resource machine translation settings.
- Developed by: Ethan Kalika (SirEthanK), Graduate Mathematics Researcher
- Model type: Encoder-decoder Transformer
- Languages:
- Source: English (en)
- Target: Russian (ru)
- License: CC BY-NC-SA 4.0 (see the License section below)
- Finetuned from model: Helsinki-NLP/opus-mt-en-ru
Uses
Intended Use
This model is intended for research and experimentation in:
- domain adaptation for neural machine translation
- English→Russian medical translation studies
- comparison of baseline vs. domain-adapted NMT models
- academic coursework, demonstrations, and ML experiments
It is designed as a proof of concept illustrating how finetuning improves translation quality in a specialized low-resource domain. It is not intended for real-world clinical or diagnostic translation.
Foreseeable Users
- NLP researchers
- students and educators studying machine translation
- developers experimenting with domain adaptation techniques
- hobbyists exploring NMT in specialized domains
Affected Users
No individuals should rely on this model for medical communication in real-world settings. All outputs are intended strictly for educational and research purposes.
Direct Use
This model can be used as-is to generate English→Russian translations of medical and health-related text for research, experimentation, and evaluation. It is suitable for:
- running translation experiments using the 🤗 Transformers pipeline API
- benchmarking domain adaptation effects
- analyzing translation quality on medical vs. general-domain text
Direct use of this model should remain limited to non-clinical research. It has not been validated for real-world medical communication, diagnosis, or decision-making.
Out-of-Scope Use
This model is not intended for:
- clinical, diagnostic, or medical decision-making
- production-grade translation of sensitive health records
- any high-stakes or safety-critical application
The model has not been validated for professional medical use and may produce inaccurate or misleading translations of technical medical content.
Bias, Risks, and Limitations
This model inherits limitations and potential biases from both the base OPUS-MT model and the medical-domain datasets used for finetuning. Users should be aware of the following:
Technical Limitations
- The model has not been clinically validated and may produce incorrect or misleading translations of medical terminology.
- It may generate hallucinations, such as fabricated medical details or incorrect drug names.
- Performance decreases slightly on general-domain text, as expected with domain-specialized finetuning.
- The model may struggle with very long sentences, highly technical biomedical texts, or ambiguous phrasing.
Data-Related Biases
- Training data comes from specific medical datasets (e.g., WikiHealth, TICO-19), which may not represent the full diversity of medical language usage.
- Biases present in the source data (e.g., overrepresentation of certain conditions or terminology) may influence the translations.
- The dataset may lack representation of regional dialects, colloquial expressions, or rare clinical scenarios.
Sociotechnical Risks
- Incorrect translations could cause misunderstanding if misused in a medical context.
- The model should not be used for diagnosis, treatment guidance, or any safety-critical decision-making.
- Users without medical background may incorrectly trust the model’s output due to automation bias.
Overall Limitation
This model is intended only for research and educational purposes in machine translation and domain adaptation. It should not be used in real-world clinical, diagnostic, or patient-facing applications.
Recommendations
Users (both direct and downstream) should be aware of the model’s technical limitations and the risks associated with translating medical text. In particular, the following recommendations apply:
- Do not use the model for clinical, diagnostic, or patient-facing communication. All outputs should be treated as experimental.
- Verify translations manually when using them for research, benchmarking, or academic work.
- Avoid relying on the model for critical terminology, as it may generate inaccurate or hallucinated medical phrases.
- Evaluate the model on your specific domain before using it in any downstream system, as performance may vary significantly across medical subdomains.
- Be mindful of dataset-related biases, which may cause the model to favor certain phrasing, style, or terminology patterns.
Overall, users should exercise caution, treat the translations as non-authoritative, and use the model only in educational, experimental, and research contexts.
How to Get Started with the Model
Use the code below to load the model and run an English→Russian translation:
```python
from transformers import pipeline

model_id = "SirEthanK/opus-mt-en-ru-health-finetuned"

# Load the translation pipeline
translator = pipeline("translation", model=model_id, tokenizer=model_id)

text = "Take one tablet by mouth twice daily."
result = translator(text)
print(result[0]["translation_text"])
```
Or, using the model and tokenizer directly:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "SirEthanK/opus-mt-en-ru-health-finetuned"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("The patient has a mild fever.", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
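The evaluation tables below report greedy decoding. As a sketch, the decoding strategy can be adjusted through standard `generate()` keyword arguments; `max_length=160` matches the training configuration listed on this card, while the beam-search values are illustrative, not part of the card's evaluation setup.

```python
# Illustrative decoding configurations for model.generate() or the
# translation pipeline. Greedy decoding (num_beams=1) matches the setting
# used for the reported scores; the beam-search values are an example only.
greedy_kwargs = {"max_length": 160, "num_beams": 1}
beam_kwargs = {"max_length": 160, "num_beams": 4, "early_stopping": True}

# Usage (with `model` and `inputs` from the snippet above):
# outputs = model.generate(**inputs, **greedy_kwargs)
```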
Training Details
Training Data
This model was finetuned on a medical EN–RU dataset, available here: https://huggingface.co/datasets/SirEthanK/en-ru-health-only-dataset. The dataset consists of medical and health-domain English–Russian parallel text, cleaned and split into train/validation/test sets. The full dataset documentation is provided on the dataset card.
Training Procedure
The model was finetuned using the standard MarianMT encoder-decoder architecture implemented in 🤗 Transformers. Training followed a full parameter finetuning approach, using the same SentencePiece tokenizer and vocabulary as the base OPUS-MT model.
Preprocessing
All texts were preprocessed using the same pipeline described in the dataset card for SirEthanK/en-ru-health-only-dataset. For the full preprocessing procedure, see the dataset documentation: https://huggingface.co/datasets/SirEthanK/en-ru-health-only-dataset
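As a sketch of the tokenization step (the authoritative procedure is on the dataset card), a typical MarianMT preprocessing function truncates both sides to the 160-token maximum used in training. The `"en"`/`"ru"` column names here are assumptions about the dataset schema, not confirmed by this card:

```python
def preprocess(batch, tokenizer, max_length=160):
    """Tokenize an English/Russian batch for seq2seq finetuning.

    NOTE: the "en"/"ru" column names are assumed for illustration;
    check the dataset card for the actual schema.
    """
    # Truncate source sentences to the 160-token limit used in training
    model_inputs = tokenizer(batch["en"], max_length=max_length, truncation=True)
    # Tokenize targets with the same tokenizer (Marian uses a shared
    # SentencePiece vocabulary); text_target= switches to target-side mode
    labels = tokenizer(text_target=batch["ru"], max_length=max_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```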
Training Hyperparameters
- Training regime: fp16 mixed precision (on GPU)
- Optimizer: AdamW (Transformers default)
- Learning rate: 1e-5
- Learning rate schedule: linear decay with no warmup (Transformers default)
- Batch size: 4 (training) / 4 (evaluation)
- Max sequence length: 160 (source and target)
- Number of epochs: 5
- Weight decay: 0.01
- Gradient clipping: max grad norm = 1.0 (Transformers default)
- Generation max length: 160
- Group by length: enabled
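The settings above correspond to a `Seq2SeqTrainingArguments` configuration along the following lines. This is a reconstruction from the list, not the exact training script: `output_dir` and the per-epoch save/eval strategies are assumptions, and in newer Transformers releases the `evaluation_strategy` keyword is named `eval_strategy`.

```python
from transformers import Seq2SeqTrainingArguments

# Reconstructed hyperparameter configuration matching the list above
# (output_dir and the save/eval strategies are placeholders/assumptions).
args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-en-ru-health-finetuned",
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    weight_decay=0.01,
    max_grad_norm=1.0,
    fp16=True,
    group_by_length=True,
    predict_with_generate=True,
    generation_max_length=160,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
```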
Hardware
Training was performed on an NVIDIA A100 GPU using PyTorch and 🤗 Transformers.
Monitoring
- Validation loss was evaluated at the end of each epoch.
- Model checkpoints were saved during training, and the best-performing checkpoint was retained.
- Final model selection was based on validation loss and downstream translation metrics.
Evaluation
Testing Data, Factors & Metrics
Testing Data
The primary evaluation was performed on the held-out test split of the SirEthanK/en-ru-health-only-dataset: https://huggingface.co/datasets/SirEthanK/en-ru-health-only-dataset
This test set contains English–Russian parallel sentences from medical and health-related domains and was kept completely separate from the training and validation data used during finetuning.
To measure the effect of domain adaptation on general-domain translation quality, two additional evaluation sets were used:
- quickmt general-domain dataset (https://huggingface.co/datasets/quickmt/quickmt-train.ru-en), a diverse general-purpose English–Russian corpus commonly used for benchmarking MT systems.
- Custom general-domain test set (unpublished), constructed from a mixture of non-medical sources, including QuickMT, Webfant, SMOL, PHP, Golang, and Cryst. It provides a heterogeneous general-domain evaluation to measure how domain adaptation affects translation performance outside the medical domain.
Factors
The evaluation results were disaggregated by domain, comparing model performance on:
- Medical-domain text: the held-out test split of the en-ru-health-only-dataset.
- General-domain text (QuickMT): a broad, non-medical English–Russian parallel dataset.
- General-domain mixed dataset (custom): a heterogeneous mixture of non-medical sources used to measure how domain finetuning affects general translation performance.
These factors allow assessment of both in-domain (medical) performance and potential degradation or trade-offs on out-of-domain (general) text.
Metrics
The model was evaluated using the following standard machine translation metrics:
- SacreBLEU – the most widely used MT metric, enabling standardized comparison with existing and future models.
- chrF – a character n-gram F-score, well-suited for morphologically rich languages like Russian and more robust to word-order variation.
- METEOR – a recall-oriented metric that incorporates stemming and synonym matching.
- TER (Translation Edit Rate) – measures the number of edits required to transform the model output into the reference translation.
- ROUGE – evaluates overlapping n-grams, primarily used here as an additional text overlap metric.
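To make the chrF idea concrete, here is a minimal pure-Python sketch of a character n-gram F-score. It is simplified for illustration; the reported numbers come from the standard sacreBLEU implementation, which differs in details such as n-gram settings and corpus-level aggregation.

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams with whitespace removed, as in common chrF variants
    text = "".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: averaged character n-gram
    precision/recall combined into an F_beta score (0-100)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r) * 100
```

Because it operates on characters rather than words, this score is less sensitive to Russian's rich inflectional morphology than word-level BLEU, which is why chrF is reported alongside SacreBLEU.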
Results
Medical-Domain Evaluation
| Model | Decoding | BLEU | METEOR | TER | ROUGE-L | chrF |
|---|---|---|---|---|---|---|
| OPUS-MT (pretrained) | greedy | 22.78 | 0.49 | 67.90 | 0.37 | 51.26 |
| OPUS-MT (finetuned, med.) | greedy | 31.78 | 0.55 | 61.44 | 0.45 | 57.74 |
General-Domain Evaluation (QuickMT)
| Model | Decoding | BLEU | METEOR | TER | ROUGE-L | chrF |
|---|---|---|---|---|---|---|
| OPUS-MT (pretrained) | greedy | 27.99 | 0.48 | 64.95 | 0.16 | 54.62 |
| OPUS-MT (finetuned, med.) | greedy | 23.81 | 0.46 | 66.95 | 0.15 | 52.74 |
General-Domain Evaluation (Custom Mixed Dataset)
| Model | Decoding | BLEU | METEOR | TER | ROUGE-L | chrF |
|---|---|---|---|---|---|---|
| OPUS-MT (pretrained) | greedy | 24.37 | 0.48 | 66.68 | 0.13 | 52.12 |
| OPUS-MT (finetuned, med.) | greedy | 26.04 | 0.48 | 67.47 | 0.15 | 53.24 |
Summary
Finetuning OPUS-MT on medical-domain data resulted in substantial improvements on the in-domain medical test set, increasing BLEU from 22.78 to 31.78 and chrF from 51.26 to 57.74 under greedy decoding. These gains are consistent across all reported metrics.
As expected with domain-adapted NMT models, performance on a general-domain benchmark (QuickMT) decreased relative to the pretrained baseline (BLEU 27.99 → 23.81, chrF 54.62 → 52.74). This reflects the trade-off between specialization and broad-domain generalization.
However, evaluation on a heterogeneous custom general-domain dataset showed a different pattern: the finetuned model achieved slightly higher BLEU and chrF than the pretrained baseline. This suggests that domain adaptation did not uniformly degrade general-domain performance and may improve translation quality on certain non-medical sources with overlapping vocabulary or style.
Overall, the finetuned model provides substantial gains in its target medical domain while showing manageable and dataset-dependent trade-offs in general-domain translation performance.
Model Examination
Below are example translations comparing the pretrained OPUS-MT model with the finetuned medical model on short medical-domain sentences.
Example 1 — Medication instruction
Source (EN): Take one tablet twice daily after meals.
OPUS-MT (pretrained): Возьмите одну таблетку дважды в день после еды.
Finetuned medical model: возьмите одну таблетку дважды в день после еды.
In this case, both models produce essentially the same correct translation (the main difference is casing).
Example 2 — Symptom description
Source (EN): The patient reports increasing shortness of breath.
OPUS-MT (pretrained): Пациент сообщает, что у него растет дыхание.
Finetuned medical model: пациент сообщает о растущем затруднении дыхания.
The pretrained model output is awkward and semantically incorrect: "растет дыхание" literally means "breathing is increasing" and does not correspond to "shortness of breath."
The finetuned model produces a more medically appropriate phrasing using "затруднение дыхания," which correctly captures the intended meaning.
Example 3 — Clinical monitoring
Source (EN): Monitor blood pressure every four hours.
OPUS-MT (pretrained): Контролировать давление каждые четыре часа.
Finetuned medical model: контролировать артериальное давление каждые четыре часа.
The finetuned model explicitly specifies "артериальное давление" (“arterial blood pressure”), which is more precise and typical in medical Russian.
License
This model is released under the CC BY-NC-SA 4.0 license.
The model was finetuned using training data that includes WikiHow-derived text (through the WikiHealth corpus), which is distributed under CC BY-NC-SA 3.0. Because the dataset contains non-commercial, share-alike material, the finetuned model must also be released under a non-commercial, share-alike license.
This license permits:
- sharing and adaptation for non-commercial purposes
- redistribution under the same license
It does not permit:
- commercial use
- relicensing under a more permissive license
- deployment in production environments requiring commercial rights
Users are responsible for ensuring their use complies with the license terms.
Model Card Authors
- SirEthanK
Model Card Contact
For questions or issues, please contact: https://huggingface.co/SirEthanK