CLIP-ViT-Base MTL Model for Multi-Modal Hate Speech Detection

A PyTorch-based multi-modal (image + text) hateful content classification model that uses a CLIP encoder with a Multi-Task Learning (MTL) architecture. It is trained on the MMHS150K dataset to detect hate speech in social media memes and posts.

Among the architectures compared in the Model Comparison section, this model achieves the highest F1 Macro and F1 Micro scores on the MMHS150K test set.

🎯 Model Description

This model implements a multi-task learning architecture with shared representation and task-specific binary heads for each hate category. It uses OpenAI's CLIP (ViT-Base-Patch32) as the backbone encoder.

The model performs multi-label classification across 5 hate speech categories, with each task having its own specialized classification head, making it capable of detecting multiple types of hate in a single post.

🏗️ Architecture

┌─────────────┐     ┌─────────────┐
│   Image     │     │    Text     │
│   Encoder   │     │   Encoder   │
│ (CLIP ViT)  │     │ (CLIP Text) │
└──────┬──────┘     └──────┬──────┘
       │                   │
       ▼                   ▼
┌─────────────┐     ┌─────────────┐
│  Projection │     │  Projection │
│   (Linear)  │     │   (Linear)  │
└──────┬──────┘     └──────┬──────┘
       │                   │
       └─────────┬─────────┘
                 │
                 ▼
         ┌─────────────┐
         │ Gated Fusion│◄── Modality presence flags
         │   Module    │    (handles missing modalities)
         └──────┬──────┘
                │
                ▼
         ┌─────────────┐
         │ Shared Head │
         │    (MLP)    │
         └──────┬──────┘
                │
       ┌────────┼────────┬────────┬────────┐
       ▼        ▼        ▼        ▼        ▼
   ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
   │Racist│ │Sexist│ │Homo- │ │Relig-│ │Other │
   │ Head │ │ Head │ │phobe │ │ ion  │ │ Hate │
   └──────┘ └──────┘ └──────┘ └──────┘ └──────┘
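
The diagram above can be read as the following minimal PyTorch sketch. It is an illustration only, not the repository's implementation: the class name `GatedFusionMTLHead`, the gate design, the dropout rate, and the input widths (768 for the CLIP ViT-B/32 image features, 512 for the text features) are assumptions. Only the fusion dimension of 512 and the five task heads come from this card.

```python
import torch
import torch.nn as nn


class GatedFusionMTLHead(nn.Module):
    """Hypothetical fusion + multi-task head (illustrative, not the repo's code)."""

    def __init__(self, image_dim=768, text_dim=512, fusion_dim=512,
                 task_names=("racist", "sexist", "homophobe", "religion", "otherhate")):
        super().__init__()
        self.task_names = list(task_names)
        # Project each modality into the shared fusion space.
        self.image_proj = nn.Linear(image_dim, fusion_dim)
        self.text_proj = nn.Linear(text_dim, fusion_dim)
        # The gate sees both projected features plus the two presence flags.
        self.gate = nn.Sequential(nn.Linear(2 * fusion_dim + 2, fusion_dim), nn.Sigmoid())
        # Shared representation used by every task head.
        self.shared = nn.Sequential(nn.Linear(fusion_dim, fusion_dim), nn.GELU(), nn.Dropout(0.1))
        # One binary (single-logit) head per hate category.
        self.heads = nn.ModuleDict({name: nn.Linear(fusion_dim, 1) for name in self.task_names})

    def forward(self, image_feat, text_feat, has_image, has_text):
        # Zero out the features of a missing modality using its presence flag.
        img = self.image_proj(image_feat) * has_image.unsqueeze(-1)
        txt = self.text_proj(text_feat) * has_text.unsqueeze(-1)
        flags = torch.stack([has_image, has_text], dim=-1)
        g = self.gate(torch.cat([img, txt, flags], dim=-1))
        fused = g * img + (1.0 - g) * txt
        hidden = self.shared(fused)
        # Concatenate the per-task logits into a [batch, num_tasks] tensor.
        return torch.cat([self.heads[n](hidden) for n in self.task_names], dim=-1)


model = GatedFusionMTLHead()
logits = model(
    torch.randn(2, 768),            # stand-in image features
    torch.randn(2, 512),            # stand-in text features
    torch.tensor([1.0, 0.0]),       # second sample has no image
    torch.tensor([1.0, 1.0]),
)
print(logits.shape)  # torch.Size([2, 5])
```

Applying a sigmoid to each column of `logits` gives one independent probability per category, which is what makes the output multi-label.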

🔑 Key Features

Feature	Description
Backbone	`openai/clip-vit-base-patch32` - Pre-trained CLIP model
Architecture	Multi-Task Learning with task-specific heads
Fusion Dimension	512
Max Text Length	77 tokens
Multi-label Output	5 hate speech categories (one head per task)
Gated Attention	Modality-aware fusion with learnable gates
Missing Modality Handling	Can handle text-only or image-only inputs

🏷️ Output Classes

Index	Class	Description	Train Distribution
0	Racist	Racist content targeting race/ethnicity	32.6%
1	Sexist	Sexist content targeting gender	12.0%
2	Homophobe	Homophobic content targeting sexual orientation	7.6%
3	Religion	Religion-based hate speech	1.5%
4	OtherHate	Other types of hate speech	15.6%

📊 Evaluation Results

Test Set Performance

Metric	Value
F1 Macro	0.569
F1 Micro	0.644
ROC-AUC Macro	0.783
Test Loss	1.377
Throughput	390.9 samples/sec

Per-Class Performance (Test Set)

Class	F1 Score	ROC-AUC
Racist	0.672	0.765
Sexist	0.589	0.810
Homophobe	0.745	0.882
Religion	0.223	0.618
OtherHate	0.616	0.842
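
As a quick consistency check, the macro averages in the Test Set Performance table are the unweighted means of the per-class values above. The short sketch below recomputes them from the rounded numbers in this card:

```python
# Per-class test metrics copied from the table above.
per_class_f1 = {"Racist": 0.672, "Sexist": 0.589, "Homophobe": 0.745,
                "Religion": 0.223, "OtherHate": 0.616}
per_class_auc = {"Racist": 0.765, "Sexist": 0.810, "Homophobe": 0.882,
                 "Religion": 0.618, "OtherHate": 0.842}

# Macro averaging weights every class equally, regardless of its frequency.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
macro_auc = sum(per_class_auc.values()) / len(per_class_auc)

print(f"F1 Macro: {macro_f1:.3f}")        # F1 Macro: 0.569
print(f"ROC-AUC Macro: {macro_auc:.3f}")  # ROC-AUC Macro: 0.783
```

Because every class counts equally, the low Religion score (F1 = 0.223) pulls the macro F1 well below the micro F1.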

Per-Class Performance (Validation Set)

Class	Precision	Recall	F1-Score	Support
Racist	0.566	0.872	0.686	1,994
Sexist	0.605	0.630	0.617	875
Homophobe	0.796	0.712	0.752	612
Religion	0.680	0.132	0.221	129
OtherHate	0.511	0.736	0.603	1,195
Micro Avg	0.577	0.754	0.654	4,805
Macro Avg	0.631	0.616	0.576	4,805
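
The F1-Score column follows from the precision and recall columns through F1 = 2PR / (P + R). The sketch below recomputes it from the rounded values in the table, so small differences in the last digit are expected:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the validation table above.
rows = {
    "Racist": (0.566, 0.872, 0.686),
    "Sexist": (0.605, 0.630, 0.617),
    "Homophobe": (0.796, 0.712, 0.752),
    "Religion": (0.680, 0.132, 0.221),
    "OtherHate": (0.511, 0.736, 0.603),
}

computed = {name: f1(p, r) for name, (p, r, _) in rows.items()}
for name, value in computed.items():
    print(f"{name}: computed {value:.3f}, reported {rows[name][2]:.3f}")

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(computed.values()) / len(computed)
# Micro F1 follows from the pooled (micro-averaged) precision and recall.
micro_f1 = f1(0.577, 0.754)
print(f"Macro F1: {macro_f1:.3f}, Micro F1: {micro_f1:.3f}")
```

The Religion row shows why its F1 is low: precision is reasonable (0.680), but recall is only 0.132, so most positive examples are missed.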

⚙️ Optimized Thresholds

The model uses per-class thresholds calibrated on the validation set instead of the default 0.5:

Class	Threshold
Racist	0.30
Sexist	0.70
Homophobe	0.50
Religion	0.30
OtherHate	0.55
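
The sketch below shows one way such thresholds could be found and applied. It assumes a simple grid search that maximizes per-class F1 on validation data; the card does not state the actual calibration procedure, so `calibrate_threshold` and the toy data are hypothetical. The threshold values in `THRESHOLDS` are the ones listed above.

```python
def f1_score(y_true, y_pred):
    """Binary F1 from 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_threshold(y_true, probs, grid=None):
    """Hypothetical calibration: pick the grid threshold with the best F1."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        score = f1_score(y_true, [int(p > t) for p in probs])
        if score > best_f1:  # strict comparison keeps the lowest best threshold
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy validation data for a single class (made up for illustration).
y_true = [0, 0, 1, 1, 1, 0]
probs = [0.12, 0.33, 0.42, 0.81, 0.66, 0.21]
best_t, best_f1 = calibrate_threshold(y_true, probs)

# Applying the thresholds from this card to one sample's probabilities.
THRESHOLDS = {"Racist": 0.30, "Sexist": 0.70, "Homophobe": 0.50,
              "Religion": 0.30, "OtherHate": 0.55}
sample_probs = {"Racist": 0.41, "Sexist": 0.62, "Homophobe": 0.50,
                "Religion": 0.10, "OtherHate": 0.58}
flagged = [name for name, p in sample_probs.items() if p > THRESHOLDS[name]]
print(flagged)  # ['Racist', 'OtherHate']
```

In this example the Sexist probability of 0.62 would be flagged at the default 0.5 but is not flagged at the calibrated 0.70, while the Racist probability of 0.41 is flagged only because its threshold is 0.30.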

📈 Model Comparison

Model	F1 Macro	F1 Micro	ROC-AUC Macro	Throughput (samples/sec)
CLIP MTL (this model)	0.569	0.644	0.783	390.9
CLIP Fusion	0.566	0.635	0.783	381.5
SigLIP Fusion	0.507	0.610	0.774	236.3
CLIP Fusion (Weighted Sampling)	0.557	0.636	0.772	266.4
CLIP Fusion (Bigger Batch)	0.515	0.517	0.804	400.9
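
The table does not have a single winner on every metric. A small sketch that picks the best model per column from the values above makes this explicit:

```python
# Values copied from the comparison table above.
results = {
    "CLIP MTL (this model)": {"f1_macro": 0.569, "f1_micro": 0.644, "roc_auc_macro": 0.783, "throughput": 390.9},
    "CLIP Fusion": {"f1_macro": 0.566, "f1_micro": 0.635, "roc_auc_macro": 0.783, "throughput": 381.5},
    "SigLIP Fusion": {"f1_macro": 0.507, "f1_micro": 0.610, "roc_auc_macro": 0.774, "throughput": 236.3},
    "CLIP Fusion (Weighted Sampling)": {"f1_macro": 0.557, "f1_micro": 0.636, "roc_auc_macro": 0.772, "throughput": 266.4},
    "CLIP Fusion (Bigger Batch)": {"f1_macro": 0.515, "f1_micro": 0.517, "roc_auc_macro": 0.804, "throughput": 400.9},
}

metrics = ["f1_macro", "f1_micro", "roc_auc_macro", "throughput"]
# For each metric, find the model with the highest value.
best_by_metric = {m: max(results, key=lambda name: results[name][m]) for m in metrics}

for metric, name in best_by_metric.items():
    print(f"{metric}: {name} ({results[name][metric]})")
```

CLIP MTL leads on both F1 metrics, while CLIP Fusion (Bigger Batch) has the highest ROC-AUC Macro and throughput.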

🎓 Training Data

MMHS150K Dataset

The model was trained on the MMHS150K (Multi-Modal Hate Speech) dataset, a large-scale multi-modal hate speech dataset collected from Twitter containing 150,000 tweet-image pairs annotated for hate speech detection.

Property	Value
Source	Twitter
Total Samples	~150,000
Modalities	Image + Text
Annotation	Multi-label (5 hate categories)
Language	English

Dataset Splits

Split	Samples
Train	~112,500
Validation	~15,000
Test	~22,500

Dataset Reference

Paper: "Exploring Hate Speech Detection in Multimodal Publications" (WACV 2020)
Authors: Raul Gomez, Jaume Gibert, Lluis Gomez, Dimosthenis Karatzas

🔧 Training Procedure

Training Configuration

# Model Configuration
backend: clip
head: mtl
encoder_name: openai/clip-vit-base-patch32
fusion_dim: 512
max_text_length: 77
freeze_text: false
freeze_image: false

# Training Configuration
num_train_epochs: 4
per_device_train_batch_size: 32
per_device_eval_batch_size: 64
gradient_accumulation_steps: 2

# Learning Rates (Differential)
lr_encoder: 1.0e-5
lr_head: 5.0e-4

# Regularization
weight_decay: 0.02
max_grad_norm: 1.0

# Scheduler
warmup_ratio: 0.05
lr_scheduler_type: cosine

# Loss
loss_type: bce  # per-task binary cross-entropy

# Precision
precision: fp16

# Data Augmentation
augment: true
aug_scale_min: 0.8
aug_scale_max: 1.0
horizontal_flip: true
color_jitter: true

# Early Stopping
early_stopping_patience: 3
metric_for_best_model: roc_macro
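
To make the batch and scheduler settings concrete, the sketch below computes the effective per-device batch size and a learning-rate curve with linear warmup followed by cosine decay. The function `lr_at_step` and the value `total_steps = 1000` are hypothetical; the exact step count and rounding used in training are not stated in this card.

```python
import math

# Effective batch size per device: 32 samples x 2 accumulation steps.
per_device_train_batch_size = 32
gradient_accumulation_steps = 2
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch)  # 64


def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.05):
    """Linear warmup to base_lr, then cosine decay to zero (illustrative)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; not the real number of optimizer steps
for step in (0, 25, 50, 525, 1000):
    head_lr = lr_at_step(step, total_steps, base_lr=5.0e-4)
    encoder_lr = lr_at_step(step, total_steps, base_lr=1.0e-5)
    print(f"step {step}: lr_head={head_lr:.2e}, lr_encoder={encoder_lr:.2e}")
```

Both parameter groups follow the same curve shape and differ only in their peak value, so the heads always train with a rate 50 times that of the encoder.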

Training Highlights

Multi-Task Learning: Separate binary head for each hate category
Differential Learning Rates: Encoder (1e-5) vs Classification Heads (5e-4)
Mixed Precision: FP16 training for efficiency
Data Augmentation: Random scaling, horizontal flip, color jitter
Threshold Calibration: Per-class threshold optimization on validation set
Early Stopping: Patience of 3 epochs based on ROC-AUC macro
Best Checkpoint: Step 11,236 (epoch 1)
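
Differential learning rates are usually implemented with optimizer parameter groups. The sketch below assumes AdamW and a toy stand-in model (`TinyModel`); the card does not name the optimizer, and the real parameter names in the repository may differ. The learning rates, weight decay, and gradient clipping value are taken from the configuration above.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Toy stand-in: an 'encoder' plus a 5-output head (not the real model)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.heads = nn.Linear(8, 5)

    def forward(self, x):
        return self.heads(torch.relu(self.encoder(x)))


model = TinyModel()

# Split parameters by name prefix so each group gets its own learning rate.
encoder_params = [p for n, p in model.named_parameters() if n.startswith("encoder.")]
head_params = [p for n, p in model.named_parameters() if not n.startswith("encoder.")]

optimizer = torch.optim.AdamW(  # optimizer choice is an assumption
    [
        {"params": encoder_params, "lr": 1.0e-5},  # lr_encoder
        {"params": head_params, "lr": 5.0e-4},     # lr_head
    ],
    weight_decay=0.02,
)

# One illustrative training step with per-task binary cross-entropy.
x = torch.randn(4, 8)
targets = torch.randint(0, 2, (4, 5)).float()
loss = nn.BCEWithLogitsLoss()(model(x), targets)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # max_grad_norm
optimizer.step()
optimizer.zero_grad()
```

The small encoder rate limits drift in the pre-trained CLIP weights, while the larger rate lets the randomly initialized heads converge quickly.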

Computational Resources

Training Duration: ~4 epochs
Best Checkpoint: Step 11,236
Hardware: GPU with FP16 support

🚀 How to Use

Installation

# Clone the training repository
git clone https://github.com/amirhossein-yousefi/multimodal-content-moderation.git
cd multimodal-content-moderation

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

Quick Inference with Trust Remote Code

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor

# Load model with trust_remote_code
model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-mtl",
    trust_remote_code=True
)
model.eval()

# Load tokenizer and processor from base CLIP model
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare inputs
image = Image.open("path/to/image.jpg").convert("RGB")
text = "Sample text from the meme"

# Process inputs
text_inputs = tokenizer(
    text,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=77
)
image_inputs = processor(images=image, return_tensors="pt")

# Inference
with torch.no_grad():
    outputs = model.predict(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"],
        pixel_values=image_inputs["pixel_values"],
    )

# Get predictions
predictions = outputs["predictions"][0]  # [num_tasks]
probabilities = outputs["probabilities"][0]  # [num_tasks]
task_names = outputs["task_names"]

# Print results
for name, pred, prob in zip(task_names, predictions, probabilities):
    print(f"{name}: {'Yes' if pred else 'No'} (prob: {prob:.3f})")

Batch Inference

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor

# Load model
model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-mtl",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare batch (convert to RGB so grayscale or RGBA files do not break preprocessing)
images = [Image.open(path).convert("RGB") for path in ("image1.jpg", "image2.jpg")]
texts = ["text for image 1", "text for image 2"]

# Process
text_inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=77
)
image_inputs = processor(images=images, return_tensors="pt")

# Inference
model.eval()
with torch.no_grad():
    outputs = model(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"],
        pixel_values=image_inputs["pixel_values"],
    )
    probabilities = torch.sigmoid(outputs["logits"])

# Apply optimized thresholds
thresholds = torch.tensor([0.30, 0.70, 0.50, 0.30, 0.55])
predictions = (probabilities > thresholds).int()
print(predictions)
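
The printed tensor contains one row of 0/1 flags per input. A small helper can turn those rows into readable label lists. This is a sketch: `decode_predictions` is a hypothetical helper, and the class order is taken from the Output Classes table.

```python
# Class order follows the index column of the Output Classes table.
CLASS_NAMES = ["Racist", "Sexist", "Homophobe", "Religion", "OtherHate"]


def decode_predictions(pred_rows, class_names=CLASS_NAMES):
    """Map rows of 0/1 flags to lists of predicted class names."""
    return [[name for name, flag in zip(class_names, row) if flag] for row in pred_rows]


# Stand-in for `predictions.tolist()` from the batch example.
rows = [[1, 0, 0, 0, 1], [0, 0, 0, 0, 0]]
labels = decode_predictions(rows)
print(labels)  # [['Racist', 'OtherHate'], []]
```

An empty list means no category crossed its threshold for that input.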

Using with the Original Repository

from src.models import MultiTaskClassifier
from src.utils import load_json
from safetensors.torch import load_file

# Load config
config = load_json("inference_config.json")

# Build model
model = MultiTaskClassifier(
    encoder_name=config["encoder_name"],
    task_names=config["class_names"],
    fusion_dim=config["fusion_dim"],
    backend=config["backend"],
)

# Load weights
state_dict = load_file("checkpoint-11236/model.safetensors")
model.load_state_dict(state_dict)
model.eval()

📁 Model Files

File	Description
`checkpoint-11236/model.safetensors`	Model weights in safetensors format
`config.json`	Model architecture configuration
`modeling_clip_mtl.py`	Custom model implementation for `trust_remote_code`
`inference_config.json`	Inference settings with thresholds and class names
`label_map.json`	Label name mapping
`test_metrics.json`	Test set evaluation metrics
`val_report.json`	Detailed validation classification report

☁️ AWS SageMaker Deployment

This model is compatible with AWS SageMaker for cloud deployment:

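import base64

import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# IAM role with SageMaker permissions (resolved automatically inside SageMaker notebooks)
role = sagemaker.get_execution_role()

model = PyTorchModel(
    model_data="s3://your-bucket/model.tar.gz",
    role=role,
    entry_point='inference.py',
    source_dir='sagemaker',
    framework_version='2.1.0',
    py_version='py310',
)

predictor = model.deploy(
    instance_type='ml.g4dn.xlarge',
    initial_instance_count=1,
    serializer=JSONSerializer(),      # send the request dict as JSON
    deserializer=JSONDeserializer(),  # parse the JSON response into Python objects
)

# Make prediction: images are sent as base64-encoded strings
with open('image.jpg', 'rb') as f:
    image_b64 = base64.b64encode(f.read()).decode('utf-8')

response = predictor.predict({
    'instances': [{
        'text': 'Sample text content',
        'image_base64': image_b64,
    }]
})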

See the SageMaker documentation for the full deployment guide.
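
The request format above can be exercised locally without an endpoint. The sketch below builds the payload and decodes it again, roughly as a server-side handler would need to. Both `build_payload` and `decode_instance` are hypothetical helpers; the repository's actual `inference.py` may parse requests differently.

```python
import base64
import json


def build_payload(text, image_bytes):
    """Build the request body in the format shown above."""
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {"instances": [{"text": text, "image_base64": image_b64}]}


def decode_instance(instance):
    """Recover the text and raw image bytes from one request instance."""
    return instance["text"], base64.b64decode(instance["image_base64"])


# Round trip with placeholder bytes instead of a real image file.
raw_bytes = b"\x89PNG placeholder bytes"
body = json.dumps(build_payload("Sample text content", raw_bytes))
text, image_bytes = decode_instance(json.loads(body)["instances"][0])
print(text, image_bytes == raw_bytes)  # Sample text content True
```

Base64 encoding is needed because JSON cannot carry raw binary data, at the cost of roughly one third more payload size.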

⚠️ Intended Uses & Limitations

✅ Intended Uses

Content moderation for social media platforms
Detecting hateful memes and posts
Research in multi-modal hate speech detection
Building content safety systems
Pre-filtering potentially harmful content for human review

⚠️ Limitations

Limitation	Description
Language	Trained only on English content
Domain	Twitter-specific; may not generalize to other platforms
Class Imbalance	Lower performance on rare categories (Religion: F1=0.223)
Cultural Context	May miss culturally-specific hate speech
Sarcasm/Irony	May struggle with subtle or ironic hateful content
Image-only Hate	Text encoder is important; purely visual hate may be missed

❌ Out-of-Scope Uses

NOT for making final moderation decisions without human review
NOT suitable for legal or compliance purposes without additional validation
NOT for censorship or suppression of legitimate speech
NOT for targeting or profiling individuals

🛡️ Ethical Considerations

This model should be used as a tool to assist human moderators, not replace them
False positives may incorrectly flag legitimate content
False negatives may miss harmful content
Regular evaluation and bias auditing are recommended
Consider the cultural and contextual factors in deployment

📝 Citation

If you use this model, please cite:

@misc{yousefi2024multimodal_mtl,
  title={Multi-Modal Hateful Content Classification with CLIP Multi-Task Learning},
  author={Yousefi, Amirhossein},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Amirhossein75/clip-vit-base-mmhs150k-mtl}
}

Dataset Citation

@inproceedings{gomez2020exploring,
  title={Exploring Hate Speech Detection in Multimodal Publications},
  author={Gomez, Raul and Gibert, Jaume and Gomez, Lluis and Karatzas, Dimosthenis},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  pages={1470--1478},
  year={2020}
}

CLIP Citation

@inproceedings{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
  booktitle={International Conference on Machine Learning},
  pages={8748--8763},
  year={2021}
}

🔗 Links

Resource	Link
GitHub Repository	multimodal-content-moderation
Base Model	openai/clip-vit-base-patch32
CLIP Fusion Model	Amirhossein75/clip-vit-base-mmhs150k-fusion
MMHS150K Dataset	Official Page
CLIP Paper	arXiv

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please see the GitHub repository for contribution guidelines.

Model size: 0.2B params (Safetensors, tensor type F32)
