Model Card for SirEthanK/opus-mt-en-ru-health-finetuned
This model is a domain-adapted English→Russian neural machine translation model, finetuned from Helsinki-NLP/opus-mt-en-ru. It has been trained specifically on medical and health-related English–Russian parallel text, enabling more accurate translations of clinical instructions, medical terminology, and news on public health.
This model uses full-parameter finetuning and is intended for research, experimentation, and academic work in domain-specific NMT.
Model Details
Model Description
This model is based on the OPUS-MT English→Russian MarianMT Transformer architecture and has been further trained on curated medical-domain parallel corpora. The finetuning objective was to improve translation quality for:
- medical terminology
- symptom descriptions
- health-related instructions
- public health reporting and medical news
Compared to the base OPUS model, this finetuned version achieves significantly higher SacreBLEU, chrF, and METEOR scores on medical-domain evaluation sets. Performance on general-domain text shows only modest, dataset-dependent changes, which is expected for domain-adapted NMT models.
This model is intended for research and educational purposes, especially for exploring domain adaptation in low-resource machine translation settings.
- Developed by: Ethan Kalika (SirEthanK), Graduate Mathematics Researcher
- Model type: Encoder-decoder Transformer
- Languages:
- Source: English (en)
- Target: Russian (ru)
- License: CC BY-NC-SA 4.0 (see the License section below)
- Finetuned from model: Helsinki-NLP/opus-mt-en-ru
Uses
Intended Use
This model is intended for research and experimentation in:
- domain adaptation for neural machine translation
- English→Russian medical translation studies
- comparison of baseline vs. domain-adapted NMT models
- academic coursework, demonstrations, and ML experiments
It is designed as a proof of concept illustrating how finetuning improves translation quality in a specialized low-resource domain. It is not intended for real-world clinical or diagnostic translation.
Foreseeable Users
- NLP researchers
- students and educators studying machine translation
- developers experimenting with domain adaptation techniques
- hobbyists exploring NMT in specialized domains
Affected Users
No individuals should rely on this model for medical communication in real-world settings. All outputs are intended strictly for educational and research purposes.
Direct Use
This model can be used as-is to generate English→Russian translations of medical and health-related text for research, experimentation, and evaluation. It is suitable for:
- running translation experiments using the 🤗 Transformers pipeline API
- benchmarking domain adaptation effects
- analyzing translation quality on medical vs. general-domain text
Direct use of this model should remain limited to non-clinical research. It has not been validated for real-world medical communication, diagnosis, or decision-making.
Out-of-Scope Use
This model is not intended for:
- clinical, diagnostic, or medical decision-making
- production-grade translation of sensitive health records
- any high-stakes or safety-critical application
The model has not been validated for professional medical use and may produce inaccurate or misleading translations of technical medical content.
Bias, Risks, and Limitations
This model inherits limitations and potential biases from both the base OPUS-MT model and the medical-domain datasets used for finetuning. Users should be aware of the following:
Technical Limitations
- The model has not been clinically validated and may produce incorrect or misleading translations of medical terminology.
- It may generate hallucinations, such as fabricated medical details or incorrect drug names.
- Performance decreases slightly on general-domain text, as expected with domain-specialized finetuning.
- The model may struggle with very long sentences, highly technical biomedical texts, or ambiguous phrasing.
Data-Related Biases
- Training data comes from specific medical datasets (e.g., WikiHealth, TICO-19), which may not represent the full diversity of medical language usage.
- Biases present in the source data (e.g., overrepresentation of certain conditions or terminology) may influence the translations.
- The dataset may lack representation of regional dialects, colloquial expressions, or rare clinical scenarios.
Sociotechnical Risks
- Incorrect translations could cause misunderstanding if misused in a medical context.
- The model should not be used for diagnosis, treatment guidance, or any safety-critical decision-making.
- Users without medical background may incorrectly trust the model’s output due to automation bias.
Overall Limitation
This model is intended only for research and educational purposes in machine translation and domain adaptation. It should not be used in real-world clinical, diagnostic, or patient-facing applications.
Recommendations
Users (both direct and downstream) should be aware of the model’s technical limitations and the risks associated with translating medical text. In particular, the following recommendations apply:
- Do not use the model for clinical, diagnostic, or patient-facing communication. All outputs should be treated as experimental.
- Verify translations manually when using them for research, benchmarking, or academic work.
- Avoid relying on the model for critical terminology, as it may generate inaccurate or hallucinated medical phrases.
- Evaluate the model on your specific domain before using it in any downstream system, as performance may vary significantly across medical subdomains.
- Be mindful of dataset-related biases, which may cause the model to favor certain phrasing, style, or terminology patterns.
Overall, users should exercise caution, treat the translations as non-authoritative, and use the model only in educational, experimental, and research contexts.
How to Get Started with the Model
Use the code below to load the model and run an English→Russian translation:
```python
from transformers import pipeline

model_id = "SirEthanK/opus-mt-en-ru-health-finetuned"

# Load the translation pipeline
translator = pipeline("translation", model=model_id, tokenizer=model_id)

text = "Take one tablet by mouth twice daily."
result = translator(text)
print(result[0]["translation_text"])
```
Or, using the model and tokenizer directly:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "SirEthanK/opus-mt-en-ru-health-finetuned"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("The patient has a mild fever.", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
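The evaluation tables below report greedy decoding. As a sketch, the decoding strategy can be adjusted through standard `generate()` keyword arguments; `max_length=160` matches the training configuration listed on this card, while the beam-search values are illustrative, not part of the card's evaluation setup.

```python
# Illustrative decoding configurations for model.generate() or the
# translation pipeline. Greedy decoding (num_beams=1) matches the setting
# used for the reported scores; the beam-search values are an example only.
greedy_kwargs = {"max_length": 160, "num_beams": 1}
beam_kwargs = {"max_length": 160, "num_beams": 4, "early_stopping": True}

# Usage (with `model` and `inputs` from the snippet above):
# outputs = model.generate(**inputs, **greedy_kwargs)
```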
Training Details
Training Data
This model was finetuned on a medical EN–RU dataset, available here: https://huggingface.co/datasets/SirEthanK/en-ru-health-only-dataset. The dataset consists of medical and health-domain English–Russian parallel text, cleaned and split into train/validation/test sets. The full dataset documentation is provided on the dataset card.
Training Procedure
The model was finetuned using the standard MarianMT encoder-decoder architecture implemented in 🤗 Transformers. Training followed a full parameter finetuning approach, using the same SentencePiece tokenizer and vocabulary as the base OPUS-MT model.
Preprocessing
All texts were preprocessed using the same pipeline described in the dataset card for SirEthanK/en-ru-health-only-dataset. For the full preprocessing procedure, see the dataset documentation: https://huggingface.co/datasets/SirEthanK/en-ru-health-only-dataset
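As a sketch of the tokenization step (the authoritative procedure is on the dataset card), a typical MarianMT preprocessing function truncates both sides to the 160-token maximum used in training. The `"en"`/`"ru"` column names here are assumptions about the dataset schema, not confirmed by this card:

```python
def preprocess(batch, tokenizer, max_length=160):
    """Tokenize an English/Russian batch for seq2seq finetuning.

    NOTE: the "en"/"ru" column names are assumed for illustration;
    check the dataset card for the actual schema.
    """
    # Truncate source sentences to the 160-token limit used in training
    model_inputs = tokenizer(batch["en"], max_length=max_length, truncation=True)
    # Tokenize targets with the same tokenizer (Marian uses a shared
    # SentencePiece vocabulary); text_target= switches to target-side mode
    labels = tokenizer(text_target=batch["ru"], max_length=max_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```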
Training Hyperparameters
- Training regime: fp16 mixed precision (on GPU)
- Optimizer: AdamW (Transformers default)
- Learning rate: 1e-5
- Learning rate schedule: linear decay with no warmup (Transformers default)
- Batch size: 4 (training) / 4 (evaluation)
- Max sequence length: 160 (source and target)
- Number of epochs: 5
- Weight decay: 0.01
- Gradient clipping: max grad norm = 1.0 (Transformers default)
- Generation max length: 160
- Group by length: enabled
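The settings above correspond to a `Seq2SeqTrainingArguments` configuration along the following lines. This is a reconstruction from the list, not the exact training script: `output_dir` and the per-epoch save/eval strategies are assumptions, and in newer Transformers releases the `evaluation_strategy` keyword is named `eval_strategy`.

```python
from transformers import Seq2SeqTrainingArguments

# Reconstructed hyperparameter configuration matching the list above
# (output_dir and the save/eval strategies are placeholders/assumptions).
args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-en-ru-health-finetuned",
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    weight_decay=0.01,
    max_grad_norm=1.0,
    fp16=True,
    group_by_length=True,
    predict_with_generate=True,
    generation_max_length=160,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
```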
Hardware
Training was performed on an NVIDIA A100 GPU using PyTorch and 🤗 Transformers.
Monitoring
- Validation loss was evaluated at the end of each epoch.
- Model checkpoints were saved during training, and the best-performing checkpoint was retained.
- Final model selection was based on validation loss and downstream translation metrics.
Evaluation
Testing Data, Factors & Metrics
Testing Data
The primary evaluation was performed on the held-out test split of the SirEthanK/en-ru-health-only-dataset: https://huggingface.co/datasets/SirEthanK/en-ru-health-only-dataset
This test set contains English–Russian parallel sentences from medical and health-related domains and was kept completely separate from the training and validation data used during finetuning.
To measure the effect of domain adaptation on general-domain translation quality, two additional evaluation sets were used:
- quickmt general-domain dataset (https://huggingface.co/datasets/quickmt/quickmt-train.ru-en), a diverse general-purpose English–Russian corpus commonly used for benchmarking MT systems.
- Custom general-domain test set (unpublished), constructed from a mixture of non-medical sources, including QuickMT, Webfant, SMOL, PHP, Golang, and Cryst. It provides a heterogeneous general-domain evaluation to measure how domain adaptation affects translation performance outside the medical domain.
Factors
The evaluation results were disaggregated by domain, comparing model performance on:
- Medical-domain text: the held-out test split of the en-ru-health-only-dataset.
- General-domain text (QuickMT): a broad, non-medical English–Russian parallel dataset.
- General-domain mixed dataset (custom): a heterogeneous mixture of non-medical sources used to measure how domain finetuning affects general translation performance.
These factors allow assessment of both in-domain (medical) performance and potential degradation or trade-offs on out-of-domain (general) text.
Metrics
The model was evaluated using the following standard machine translation metrics:
- SacreBLEU – the most widely used MT metric, enabling standardized comparison with existing and future models.
- chrF – a character n-gram F-score, well-suited for morphologically rich languages like Russian and more robust to word-order variation.
- METEOR – a recall-oriented metric that incorporates stemming and synonym matching.
- TER (Translation Edit Rate) – measures the number of edits required to transform the model output into the reference translation.
- ROUGE – evaluates overlapping n-grams, primarily used here as an additional text overlap metric.
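To make the chrF idea concrete, here is a minimal pure-Python sketch of a character n-gram F-score. It is simplified for illustration; the reported numbers come from the standard sacreBLEU implementation, which differs in details such as n-gram settings and corpus-level aggregation.

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams with whitespace removed, as in common chrF variants
    text = "".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: averaged character n-gram
    precision/recall combined into an F_beta score (0-100)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r) * 100
```

Because it operates on characters rather than words, this score is less sensitive to Russian's rich inflectional morphology than word-level BLEU, which is why chrF is reported alongside SacreBLEU.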
Results
Medical-Domain Evaluation
| Model | Decoding | BLEU | METEOR | TER | ROUGE-L | chrF |
|---|---|---|---|---|---|---|
| OPUS-MT (pretrained) | greedy | 22.78 | 0.49 | 67.90 | 0.37 | 51.26 |
| OPUS-MT (finetuned, med.) | greedy | 31.78 | 0.55 | 61.44 | 0.45 | 57.74 |
General-Domain Evaluation (QuickMT)
| Model | Decoding | BLEU | METEOR | TER | ROUGE-L | chrF |
|---|---|---|---|---|---|---|
| OPUS-MT (pretrained) | greedy | 27.99 | 0.48 | 64.95 | 0.16 | 54.62 |
| OPUS-MT (finetuned, med.) | greedy | 23.81 | 0.46 | 66.95 | 0.15 | 52.74 |
General-Domain Evaluation (Custom Mixed Dataset)
| Model | Decoding | BLEU | METEOR | TER | ROUGE-L | chrF |
|---|---|---|---|---|---|---|
| OPUS-MT (pretrained) | greedy | 24.37 | 0.48 | 66.68 | 0.13 | 52.12 |
| OPUS-MT (finetuned, med.) | greedy | 26.04 | 0.48 | 67.47 | 0.15 | 53.24 |
Summary
Finetuning OPUS-MT on medical-domain data resulted in substantial improvements on the in-domain medical test set, increasing BLEU from 22.78 to 31.78 and chrF from 51.26 to 57.74 under greedy decoding. These gains are consistent across all reported metrics.
As expected with domain-adapted NMT models, performance on a general-domain benchmark (QuickMT) decreased relative to the pretrained baseline (BLEU 27.99 → 23.81, chrF 54.62 → 52.74). This reflects the trade-off between specialization and broad-domain generalization.
However, evaluation on a heterogeneous custom general-domain dataset showed a different pattern: the finetuned model achieved slightly higher BLEU and chrF than the pretrained baseline. This suggests that domain adaptation did not uniformly degrade general-domain performance and may improve translation quality on certain non-medical sources with overlapping vocabulary or style.
Overall, the finetuned model provides substantial gains in its target medical domain while showing manageable and dataset-dependent trade-offs in general-domain translation performance.
Model Examination
Below are example translations comparing the pretrained OPUS-MT model with the finetuned medical model on short medical-domain sentences.
Example 1 — Medication instruction
Source (EN): Take one tablet twice daily after meals.
OPUS-MT (pretrained): Возьмите одну таблетку дважды в день после еды.
Finetuned medical model: возьмите одну таблетку дважды в день после еды.
In this case, both models produce essentially the same correct translation (the main difference is casing).
Example 2 — Symptom description
Source (EN): The patient reports increasing shortness of breath.
OPUS-MT (pretrained): Пациент сообщает, что у него растет дыхание.
Finetuned medical model: пациент сообщает о растущем затруднении дыхания.
The pretrained model output is awkward and semantically incorrect: "растет дыхание" literally means "breathing is increasing" and does not correspond to "shortness of breath."
The finetuned model produces a more medically appropriate phrasing using "затруднение дыхания," which correctly captures the intended meaning.
Example 3 — Clinical monitoring
Source (EN): Monitor blood pressure every four hours.
OPUS-MT (pretrained): Контролировать давление каждые четыре часа.
Finetuned medical model: контролировать артериальное давление каждые четыре часа.
The finetuned model explicitly specifies "артериальное давление" (“arterial blood pressure”), which is more precise and typical in medical Russian.
License
This model is released under the CC BY-NC-SA 4.0 license.
The model was finetuned using training data that includes WikiHow-derived text (through the WikiHealth corpus), which is distributed under CC BY-NC-SA 3.0. Because the dataset contains non-commercial, share-alike material, the finetuned model must also be released under a non-commercial, share-alike license.
This license permits:
- sharing and adaptation for non-commercial purposes
- redistribution under the same license
It does not permit:
- commercial use
- relicensing under a more permissive license
- deployment in production environments requiring commercial rights
Users are responsible for ensuring their use complies with the license terms.
Model Card Authors
- SirEthanK
Model Card Contact
For questions or issues, please contact: https://huggingface.co/SirEthanK