wellcometrust/grant_hrcs_tagger

Grant HRCS Tagging Model

A machine learning model for tagging research grants with HRCS (Health Research Classification System) tags based on title and abstract text:

  • Research Activity Codes (RAC) - categorising the type of research activity
  • Health Categories (HC) - categorising the health area of the research

Developed by the Machine Learning team within Data & Digital at the Wellcome Trust.


Quick Start

  • Just want to tag grants? → See Inference
  • Want to train your own model? → See Training

Data Acknowledgement

For training and evaluation of our model, we used the following datasets:

Project Structure

├── config/                 # Configuration files
│   ├── train_config.yaml   # Training hyperparameters and settings
│   └── deploy_config.yaml  # Deployment configuration
├── data/                   # Data directory
│   ├── raw/                # Raw downloaded data
│   ├── clean/              # Cleaned parquet files
│   ├── preprocessed/       # Train/test splits
│   ├── label_names/        # Label name mappings
│   └── model/              # Trained model outputs
├── src/                    # Source code
│   ├── data_processing/    # Data processing scripts
│   ├── train.py            # Model training script
│   ├── train_test_split.py # Data splitting utilities
│   ├── inference.py        # Inference functions
│   ├── deploy.py           # SageMaker deployment
│   └── metrics.py          # Evaluation metrics
├── notebooks/              # Jupyter notebooks for exploration
└── test/                   # Test suite

Inference

Use the trained model to tag grants with HRCS codes. Choose the approach that fits your use case:

Using a Hugging Face Hosted Model

The easiest way to get started is using our pre-trained model hosted on Hugging Face:

from transformers import pipeline

# Load the model from Hugging Face Hub
classifier = pipeline(
    "text-classification",
    model="wellcometrust/grant-hrcs-tagger",  # TODO: Update with actual model path
    top_k=None
)

# Predict HRCS tags
result = classifier("Your grant title and abstract text here")
print(result)
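With top_k=None, the pipeline returns a list of {"label", "score"} dicts per input rather than a single tag. To turn that into a set of HRCS tags, you can keep every label whose score clears a chosen cut-off (the 0.5 threshold and the helper below are illustrative assumptions, not the model's calibrated values):

```python
def select_tags(predictions, threshold=0.5):
    """Keep labels whose score clears the threshold.

    `predictions` is one item from the pipeline output: a list of
    {"label": str, "score": float} dicts, as returned with top_k=None.
    """
    return [p["label"] for p in predictions if p["score"] >= threshold]

# Example with a mock pipeline output (label names are hypothetical)
mock = [
    {"label": "RAC 2.1", "score": 0.91},
    {"label": "RAC 5.3", "score": 0.64},
    {"label": "RAC 8.2", "score": 0.07},
]
print(select_tags(mock))  # -> ['RAC 2.1', 'RAC 5.3']
```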

Note: The Hugging Face model URL will be updated once the model is published.

Using a Local Model

If you have trained your own model or downloaded one locally, use the inference functions in src/inference.py:

from src.inference import model_fn, predict_fn

# Load model from local directory
model_dict = model_fn("data/model/")

# Predict
result = predict_fn({"inputs": "Grant title and abstract text here"}, model_dict)
print(result)

The inference pipeline applies ranked thresholds (configurable in the training config) to convert logits to multi-label predictions.
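As a rough sketch of that idea (the helper and threshold values below are hypothetical, not the actual implementation in src/inference.py), per-label thresholding converts each logit to a sigmoid probability and compares it against that label's own cut-off:

```python
import math

def apply_ranked_thresholds(logits, thresholds):
    """Convert raw logits to multi-label 0/1 predictions.

    Each label gets its own sigmoid-probability threshold, which is
    what per-label ("ranked") thresholding amounts to.
    """
    probs = [1 / (1 + math.exp(-z)) for z in logits]
    return [int(p >= t) for p, t in zip(probs, thresholds)]

# Hypothetical logits for three labels, each with its own threshold
print(apply_ranked_thresholds([2.0, -0.5, 0.1], [0.5, 0.3, 0.6]))  # -> [1, 1, 0]
```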


Training

This section is for users who want to fine-tune their own HRCS tagging model.

Platform Requirements

This project uses a Makefile with bash commands, which run natively on Linux and macOS.

Windows users: We recommend using WSL (Windows Subsystem for Linux) and following the Linux instructions. Makefiles are not supported natively on Windows without third-party tools, and the bash commands in the Makefile only run on Unix-like systems.

Installing uv

First, install uv, a fast Python package manager:

# On macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# On Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or with pip
pip install uv

For more installation options, see the uv installation guide.

Environment Setup

Set up the project environment using uv, which will read from the pyproject.toml file:

# Sync dependencies and create virtual environment
uv sync

You can run Python scripts directly with uv without explicitly activating the virtual environment:

uv run a_python_script.py

To make things even easier, common tasks are wrapped in make commands.

Make Commands

The project includes several make commands to streamline common tasks:

Command                  Description
make download_data       Download raw datasets from HRCS online and NIHR
make build_dataset       Process and clean the downloaded data into parquet files
make train_test_split    Split cleaned data into train/test sets for training
make train_ra            Train the Research Activity model
make train_ra_top        Train the Research Activity (top level) model
make help                List all available commands

Note: The download_data command requires wget. On macOS, install with brew install wget.

Fine-tuning the Model

To train your own HRCS tagging model, you'll need:

Hardware Requirements:

  • A GPU is highly recommended for training. The code supports:
    • CUDA GPUs (NVIDIA)
    • Apple Silicon (M1/M2/M3/M4/...) via MPS

Training Steps:

  1. Download and preprocess data (if not already done):
make download_data
make build_dataset  
make train_test_split
  2. Configure training settings by editing config/train_config.yaml:
training_settings:
  category: 'RA'  # Choose: "RA" (Research Activity), "RA_top" (top-level), or "HC" (Health Category)
  model: 'answerdotai/ModernBERT-base'  # Also supports 'distilbert-base-uncased' or any model compatible with AutoModelForSequenceClassification
  learning_rate: 0.0001
  num_train_epochs: 3  # Increase for better performance (try 3-5)
  per_device_train_batch_size: 16  # Reduce if you get GPU memory errors
  class_weighting: False  # Set to True to handle label imbalance
  output_weighting: True  # Set to True for custom prediction thresholds; we found this slightly improves performance
  3. Run training:
# Train the Research Activity models
make train_ra      # low-level RAC codes
make train_ra_top  # top-level RAC codes

Monitoring with Weights & Biases: The code integrates with wandb for experiment tracking. If you have a wandb account, training metrics will be automatically logged. If you don't have access to wandb, you can disable it by setting report_to: none in the train_config.yaml file.
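The relevant line in config/train_config.yaml would look something like this (its exact placement within the file is an assumption):

```yaml
training_settings:
  report_to: none  # disable Weights & Biases logging
```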

Trained models are saved to data/model/ and can be used for inference on new grants.


Deployment

Models can be deployed to AWS SageMaker. Configuration is managed in config/deploy_config.yaml:

model_args:
  transformers_version: "4.49.0"
  pytorch_version: "2.6.0"
  py_version: "py312"

endpoint_args:
  instance_type: "ml.m5.xlarge"

See src/deploy.py for deployment utilities and notebooks/deploy.ipynb for an interactive deployment workflow.


Testing

Run the test suite:

pytest test/
