HRCS Research Activity Code Classifier
Overview
This model, developed by the National Institute for Health and Care Research (NIHR), assigns HRCS Research Activity Codes to research awards using the award title and abstract (micro F1 = 0.60). When tags are aggregated to Research Activity Groups (RAGs), performance increases to a micro F1 of 0.71. It is a multi-label transformer classifier built on BiomedBERT-large, domain-adapted (DAPT) on healthcare grant titles and abstracts, then fine-tuned on cross-funder labelled HRCS data. The goal is to support portfolio analysis, automated tagging, and reproducible classification of biomedical research funding.
Model details
- Base model: microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract
- Architecture: Transformer encoder + multi-label classification head
- Task: Multi-label text classification
- Input: Award title + abstract
- Output: Probability per Research Activity Code
Training approach
The model was trained in two stages using a 24GB GPU.
Domain-adaptive pretraining (DAPT)
We continued masked language modelling on grant titles and abstracts to adapt the encoder to the language of research funding rather than publications. The data used was a healthcare-funder-specific subset of Gomez Magenti, J. (2025) ‘Harmonised datasets of research project grants from UK and European funders’. Zenodo. doi:10.5281/zenodo.15479412.
Settings:
- Max sequence length: 512
- Mask probability: 0.15
- Epochs: 1
- Learning rate: 5e-5
- Warmup ratio: 0.01
- Weight decay: 0.01
- Effective batch size: 64
- Mixed precision: bf16/fp16
- Gradient checkpointing enabled
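The masking step at the heart of the MLM objective above can be sketched as follows. This is a minimal illustration with the stated 0.15 mask probability, not the actual training code; `mask_tokens` is a hypothetical helper, and the standard BERT 80/10/10 mask/random/keep split is omitted for brevity.

```python
import numpy as np

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15, seed=0):
    """Select ~mask_prob of positions as MLM targets and replace them
    with the mask token; unselected positions get label -100, the value
    conventionally ignored by the loss."""
    rng = np.random.default_rng(seed)
    ids = np.array(input_ids)
    selected = rng.random(ids.shape) < mask_prob
    labels = np.where(selected, ids, -100)  # predict only masked positions
    ids[selected] = mask_token_id
    return ids, labels
```

In practice this is handled by a standard MLM data collator; the sketch just makes the 15% masking rate concrete.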
The adapted checkpoint was then used for supervised training.
Supervised fine-tuning
The adapted model was fine-tuned for multi-label classification using sigmoid outputs and binary cross-entropy loss.
Input format:
AwardTitle + newline + AwardAbstract
Tokenisation:
- Max length: 512 tokens
- Truncation enabled
- Fixed-length padding during training
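The input construction above (title, newline, abstract) can be sketched with toy data; the column names `AwardTitle` and `AwardAbstract` are the ones the inference section requires.

```python
import pandas as pd

# Toy rows using the required column names.
df = pd.DataFrame({
    "AwardTitle": ["Gene therapy for inherited retinal disease"],
    "AwardAbstract": ["We will evaluate vector delivery and safety."],
})

# Title + newline + abstract, as described above.
texts = (df["AwardTitle"] + "\n" + df["AwardAbstract"]).tolist()
```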
Handling class imbalance:
A per-label weighting vector (pos_weight) is applied in the loss to reduce bias toward common categories.
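A common way to build such a `pos_weight` vector is the negatives-to-positives ratio per label, as accepted by `torch.nn.BCEWithLogitsLoss(pos_weight=...)`. The sketch below assumes that heuristic (the card does not specify the exact scheme) and shows the weighted loss in plain NumPy:

```python
import numpy as np

def pos_weight_from_counts(pos_counts, n_samples):
    """Negatives / positives per label (assumed heuristic): rare
    labels get a larger weight on their positive term."""
    pos = np.asarray(pos_counts, dtype=float)
    neg = n_samples - pos
    return neg / np.clip(pos, 1.0, None)

def weighted_bce(logits, targets, pos_weight):
    """Binary cross-entropy with sigmoid outputs and per-label
    positive weighting, mirroring BCEWithLogitsLoss(pos_weight=...)."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    t = np.asarray(targets, dtype=float)
    w = np.asarray(pos_weight, dtype=float)
    return -(w * t * np.log(p) + (1.0 - t) * np.log(1.0 - p)).mean()
```

With 10 positives out of 100 samples, a missed positive on that label costs 9x more than it would unweighted, which is what pushes the model away from always predicting the common categories.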
Training configuration:
- Learning rate: 3e-5
- Weight decay: 0.01
- Epochs: up to 20
- Batch size: 14 per device
- Gradient accumulation: 2
- Mixed precision: fp16
- Early stopping patience: 4
- Best checkpoint selected by micro-F1
Evaluation protocol
Data was split into three disjoint sets:
- Training set – used for optimisation
- Validation set – used for early stopping and threshold tuning
- Held-out test set – used only once for final evaluation
The test set was not used during training, checkpoint selection, or threshold tuning. The dataset used is listed at the top of the model card. Predictions are converted to labels using per-category probability thresholds tuned on the validation set. These thresholds are included in metadata.json.
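Applying those per-category thresholds is a simple elementwise comparison; a minimal sketch (the exact key layout inside metadata.json is not specified here, so the thresholds vector is assumed to be one value per category, in label order):

```python
import numpy as np

def apply_thresholds(probs, thresholds):
    """Per-category thresholding: assign a label when its probability
    meets or exceeds that category's tuned threshold."""
    return (np.asarray(probs) >= np.asarray(thresholds)).astype(int)
```

Note that a single global cutoff (e.g. 0.5) would under-predict rare categories; per-category tuning on the validation set is what the card describes.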
Full Evaluation Results
Overall RAC Metrics:
- F1 (micro): 0.60
- F1 (macro): 0.51
- Precision (micro): 0.56
- Recall (micro): 0.63
Overall RAG Metrics:
- F1 (micro): 0.71
- F1 (macro): 0.68
- Precision (micro): 0.70
- Recall (micro): 0.73
For a comprehensive breakdown of the model's performance, including overall metrics, per-category metrics across both validation and test sets, and per-funder metrics on the validation set, please refer to the detailed evaluation spreadsheet included in this repository.
Download/View the Evaluation Results (Located in the Files and versions tab of this repository).
Intended use
This model is intended for:
- Portfolio analysis
- Large-scale tagging of funding datasets
- Exploratory research landscape mapping
- Automation support for HRCS coding workflows
It is not intended to completely replace expert review.
Limitations
- Performance depends on similarity to the training corpus.
- Rare categories remain harder to detect despite class weighting.
- Abstract length: Long or poorly structured abstracts may be truncated.
- Threshold calibration: Thresholds are tuned for this dataset and may need recalibration for new domains.
- Temporal bias: The model was trained on data up to 2022. Any evaluation should therefore use awards starting in 2023 or later, since earlier awards may overlap the training data and inflate metrics.
- Annotation Ambiguity and Niche Categories: The model's performance reflects the historical consistency of human coding within the training data. Categories that are historically difficult for human coders to classify consistently under HRCS guidelines (such as 7.1, 8.1 and 8.3) are naturally more challenging for the model.
Inference / How to use
We have provided a ready-to-use Python script that runs both this model (Research Activity Codes) and a Health Categories model on new award data simultaneously.
You can download the script and a sample dataset directly from the 'inference' subfolder in the Files and versions tab of this repository.
Instructions
Prerequisites:
- Download the script and test data to your computer from the inference subfolder.
- Open your terminal or command prompt and install the required libraries by running:
`pip install torch pandas numpy tqdm transformers huggingface_hub`
- Use the `test_data.csv` file, or prepare a CSV in the same format containing your grant data. It must include two columns named exactly `AwardTitle` and `AwardAbstract`.
Running the Code:
- Open the script.
- Under the `# --- USER SETTINGS ---` section, update `DATA_FOLDERS` to point to the folder containing your CSV. (Leave it as `["./"]` if your CSV is in the same folder as the script.)
- Update `TEST_FILENAME` to match the name of your CSV.
- Run the script.
The script will automatically download the necessary AI models, process your text, and output a new CSV containing the predicted categories and an "AI Certainty Score" (`SmallestLogitDiff`) to help you identify which borderline grants require human review.
Selective automation and human-in-the-loop use
In addition to predicted labels, the inference script reports how close each prediction is to the model’s decision boundary in logit space. This is computed as the smallest absolute difference between any category’s logit and its corresponding decision threshold.
Records with logits close to the threshold represent borderline cases where the model is uncertain. These can be prioritised for human review, while higher-confidence predictions can be automated.
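The boundary-distance measure described above can be sketched as follows; `smallest_logit_diff` is an illustrative re-implementation of the `SmallestLogitDiff` score, assuming the thresholds have been mapped into logit space:

```python
import numpy as np

def smallest_logit_diff(logits, logit_thresholds):
    """Smallest absolute distance between any category's logit and its
    decision threshold; low values flag borderline records."""
    diffs = np.abs(np.asarray(logits, dtype=float)
                   - np.asarray(logit_thresholds, dtype=float))
    return diffs.min(axis=-1)
```

A record is borderline if even one category sits near its threshold, which is why the minimum (rather than the mean) over categories is used.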
When progressively excluding records whose predictions lie closest to the decision boundary, the remaining high-confidence subset shows increasing accuracy:
| % of records excluded for human review | RAG Micro-F1 on remaining subset |
|---|---|
| 0% | 0.71 |
| 10% | 0.72 |
| 20% | 0.72 |
| 30% | 0.74 |
| 40% | 0.74 |
| 50% | 0.76 |
| 60% | 0.79 |
| 70% | 0.83 |
| 80% | 0.85 |
| 90% | 0.90 |
This demonstrates that the model supports hybrid workflows in which uncertain cases are reviewed by experts while confident predictions can be automated.
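Such a hybrid workflow can be sketched as a simple split on the certainty score; `split_for_review` is a hypothetical helper, not part of the shipped script:

```python
import numpy as np

def split_for_review(certainty_scores, review_fraction):
    """Route the least-certain fraction of records (smallest boundary
    distance) to human review and automate the rest.
    Returns (review_indices, automated_indices)."""
    scores = np.asarray(certainty_scores, dtype=float)
    k = int(round(len(scores) * review_fraction))
    order = np.argsort(scores, kind="stable")  # least certain first
    return order[:k], order[k:]
```

Per the table above, reviewing the bottom 30% by certainty would lift RAG micro-F1 on the automated remainder from 0.71 to roughly 0.74.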
Citation
NIHR, 2026. HRCS Research Activity Classifier (BiomedBERT, DAPT). [Model]. Developed by Banks, A., Baghurst, D., Carter, J., Manville, C., Wang, K. and Downes, N. Available from: https://huggingface.co/NIHRDataInsights/HRCSResearchActivityCodes