HRCS Research Activity Code Classifier
Overview
This model, developed by the National Institute for Health and Care Research (NIHR), assigns HRCS Research Activity Codes to research awards using the award title and abstract (micro F1 = 0.60). When tags are aggregated to Research Activity Groups (RAGs), performance increases to a micro F1 of 0.71. It is a multi-label transformer classifier built on BiomedBERT-large, domain-adapted (DAPT) on healthcare grant titles and abstracts, then fine-tuned on cross-funder labelled HRCS data. The goal is to support portfolio analysis, automated tagging, and reproducible classification of biomedical research funding.
Model details
- Base model: microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract
- Architecture: Transformer encoder + multi-label classification head
- Task: Multi-label text classification
- Input: Award title + abstract
- Output: Probability per Research Activity Code
Training approach
The model was trained in two stages using a 24GB GPU.
Domain-adaptive pretraining (DAPT)
We continued masked language modelling on grant titles and abstracts to adapt the encoder to the language of research funding rather than publications. The data used was a healthcare-funder-specific subset of Gomez Magenti, J. (2025) ‘Harmonised datasets of research project grants from UK and European funders’. Zenodo. doi:10.5281/zenodo.15479412.
Settings:
- Max sequence length: 512
- Mask probability: 0.15
- Epochs: 1
- Learning rate: 5e-5
- Warmup ratio: 0.01
- Weight decay: 0.01
- Effective batch size: 64
- Mixed precision: bf16/fp16
- Gradient checkpointing enabled
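The masking step at the heart of the MLM objective above can be sketched as follows. This is a minimal illustration with the stated 0.15 mask probability, not the actual training code; `mask_tokens` is a hypothetical helper, and the standard BERT 80/10/10 mask/random/keep split is omitted for brevity.

```python
import numpy as np

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15, seed=0):
    """Select ~mask_prob of positions as MLM targets and replace them
    with the mask token; unselected positions get label -100, the value
    conventionally ignored by the loss."""
    rng = np.random.default_rng(seed)
    ids = np.array(input_ids)
    selected = rng.random(ids.shape) < mask_prob
    labels = np.where(selected, ids, -100)  # predict only masked positions
    ids[selected] = mask_token_id
    return ids, labels
```

In practice this is handled by a standard MLM data collator; the sketch just makes the 15% masking rate concrete.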
The adapted checkpoint was then used for supervised training.
Supervised fine-tuning
The adapted model was fine-tuned for multi-label classification using sigmoid outputs and binary cross-entropy loss.
Input format:
AwardTitle + newline + AwardAbstract
Tokenisation:
- Max length: 512 tokens
- Truncation enabled
- Fixed-length padding during training
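The input construction above (title, newline, abstract) can be sketched with toy data; the column names `AwardTitle` and `AwardAbstract` are the ones the inference section requires.

```python
import pandas as pd

# Toy rows using the required column names.
df = pd.DataFrame({
    "AwardTitle": ["Gene therapy for inherited retinal disease"],
    "AwardAbstract": ["We will evaluate vector delivery and safety."],
})

# Title + newline + abstract, as described above.
texts = (df["AwardTitle"] + "\n" + df["AwardAbstract"]).tolist()
```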
Handling class imbalance:
A per-label weighting vector (pos_weight) is applied in the loss to reduce bias toward common categories.
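A common way to build such a `pos_weight` vector is the negatives-to-positives ratio per label, as accepted by `torch.nn.BCEWithLogitsLoss(pos_weight=...)`. The sketch below assumes that heuristic (the card does not specify the exact scheme) and shows the weighted loss in plain NumPy:

```python
import numpy as np

def pos_weight_from_counts(pos_counts, n_samples):
    """Negatives / positives per label (assumed heuristic): rare
    labels get a larger weight on their positive term."""
    pos = np.asarray(pos_counts, dtype=float)
    neg = n_samples - pos
    return neg / np.clip(pos, 1.0, None)

def weighted_bce(logits, targets, pos_weight):
    """Binary cross-entropy with sigmoid outputs and per-label
    positive weighting, mirroring BCEWithLogitsLoss(pos_weight=...)."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    t = np.asarray(targets, dtype=float)
    w = np.asarray(pos_weight, dtype=float)
    return -(w * t * np.log(p) + (1.0 - t) * np.log(1.0 - p)).mean()
```

With 10 positives out of 100 samples, a missed positive on that label costs 9x more than it would unweighted, which is what pushes the model away from always predicting the common categories.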
Training configuration:
- Learning rate: 3e-5
- Weight decay: 0.01
- Epochs: up to 20
- Batch size: 14 per device
- Gradient accumulation: 2
- Mixed precision: fp16
- Early stopping patience: 4
- Best checkpoint selected by micro-F1
Evaluation protocol
Data was split into three disjoint sets:
- Training set – used for optimisation
- Validation set – used for early stopping and threshold tuning
- Held-out test set – used only once for final evaluation
The test set was not used during training, checkpoint selection, or threshold tuning. The dataset used is listed at the top of the model card. Predictions are converted to labels using per-category probability thresholds tuned on the validation set. These thresholds are included in metadata.json.
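Applying those per-category thresholds is a simple elementwise comparison; a minimal sketch (the exact key layout inside metadata.json is not specified here, so the thresholds vector is assumed to be one value per category, in label order):

```python
import numpy as np

def apply_thresholds(probs, thresholds):
    """Per-category thresholding: assign a label when its probability
    meets or exceeds that category's tuned threshold."""
    return (np.asarray(probs) >= np.asarray(thresholds)).astype(int)
```

Note that a single global cutoff (e.g. 0.5) would under-predict rare categories; per-category tuning on the validation set is what the card describes.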
Full Evaluation Results
Overall RAC Metrics:
- F1 (micro): 0.60
- F1 (macro): 0.51
- Precision (micro): 0.56
- Recall (micro): 0.63
Overall RAG Metrics:
- F1 (micro): 0.71
- F1 (macro): 0.68
- Precision (micro): 0.70
- Recall (micro): 0.73
For a comprehensive breakdown of the model's performance, including overall metrics, per-category metrics across both validation and test sets, and per-funder metrics on the validation set, please refer to the detailed evaluation spreadsheet included in this repository.
Download/View the Evaluation Results (Located in the Files and versions tab of this repository).
Intended use
This model is intended for:
- Portfolio analysis
- Large-scale tagging of funding datasets
- Exploratory research landscape mapping
- Automation support for HRCS coding workflows
It is not intended to completely replace expert review.
Limitations
- Performance depends on similarity to the training corpus.
- Rare categories remain harder to detect despite class weighting.
- Abstract length: Long or poorly structured abstracts may be truncated.
- Threshold calibration: Thresholds are tuned for this dataset and may need recalibration for new domains.
- Temporal bias: The model was trained on data up to 2022. Any evaluation should therefore use awards starting in 2023 or later, since earlier awards may overlap the training data and inflate metrics.
- Annotation Ambiguity and Niche Categories: The model's performance reflects the historical consistency of human coding within the training data. Categories that are historically difficult for human coders to classify consistently under HRCS guidelines (such as 7.1, 8.1 and 8.3) are naturally more challenging for the model.
Inference / How to use
We have provided a ready-to-use Python script that runs both this model (Research Activity Codes) and a Health Categories model on new award data simultaneously.
You can download the script and a sample dataset directly from the 'inference' subfolder in the Files and versions tab of this repository.
Instructions
Prerequisites:
- Download the script and test data to your computer from the inference subfolder.
- Open your terminal or command prompt and install the required libraries by running:
`pip install torch pandas numpy tqdm transformers huggingface_hub`
- Use the `test_data.csv` file, or prepare a CSV in the same format containing your grant data. It must include two columns named exactly `AwardTitle` and `AwardAbstract`.
Running the Code:
- Open the script.
- Under the `# --- USER SETTINGS ---` section, update `DATA_FOLDERS` to point to the folder containing your CSV. (Leave it as `["./"]` if your CSV is in the same folder as the script.)
- Update `TEST_FILENAME` to match the name of your CSV.
- Run the script.
The script will automatically download the necessary AI models, process your text, and output a new CSV containing the predicted categories and an "AI Certainty Score" (`SmallestLogitDiff`) to help you identify which borderline grants require human review.
Selective automation and human-in-the-loop use
In addition to predicted labels, the inference script reports how close each prediction is to the model’s decision boundary in logit space. This is computed as the smallest absolute difference between any category’s logit and its corresponding decision threshold.
Records with logits close to the threshold represent borderline cases where the model is uncertain. These can be prioritised for human review, while higher-confidence predictions can be automated.
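The boundary-distance measure described above can be sketched as follows; `smallest_logit_diff` is an illustrative re-implementation of the `SmallestLogitDiff` score, assuming the thresholds have been mapped into logit space:

```python
import numpy as np

def smallest_logit_diff(logits, logit_thresholds):
    """Smallest absolute distance between any category's logit and its
    decision threshold; low values flag borderline records."""
    diffs = np.abs(np.asarray(logits, dtype=float)
                   - np.asarray(logit_thresholds, dtype=float))
    return diffs.min(axis=-1)
```

A record is borderline if even one category sits near its threshold, which is why the minimum (rather than the mean) over categories is used.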
When progressively excluding records whose predictions lie closest to the decision boundary, the remaining high-confidence subset shows increasing accuracy:
| % of records excluded for human review | RAG Micro-F1 on remaining subset |
|---|---|
| 0% | 0.71 |
| 10% | 0.72 |
| 20% | 0.72 |
| 30% | 0.74 |
| 40% | 0.74 |
| 50% | 0.76 |
| 60% | 0.79 |
| 70% | 0.83 |
| 80% | 0.85 |
| 90% | 0.90 |
This demonstrates that the model supports hybrid workflows in which uncertain cases are reviewed by experts while confident predictions can be automated.
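Such a hybrid workflow can be sketched as a simple split on the certainty score; `split_for_review` is a hypothetical helper, not part of the shipped script:

```python
import numpy as np

def split_for_review(certainty_scores, review_fraction):
    """Route the least-certain fraction of records (smallest boundary
    distance) to human review and automate the rest.
    Returns (review_indices, automated_indices)."""
    scores = np.asarray(certainty_scores, dtype=float)
    k = int(round(len(scores) * review_fraction))
    order = np.argsort(scores, kind="stable")  # least certain first
    return order[:k], order[k:]
```

Per the table above, reviewing the bottom 30% by certainty would lift RAG micro-F1 on the automated remainder from 0.71 to roughly 0.74.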
Citation
NIHR, 2026. HRCS Research Activity Classifier (BiomedBERT, DAPT). [Model]. Developed by Banks, A., Baghurst, D., Carter, J., Manville, C., Wang, K. and Downes, N. Available from: https://huggingface.co/NIHRDataInsights/HRCSResearchActivityCodes