CLIP-ViT-Base MTL Model for Multi-Modal Hate Speech Detection
A PyTorch-based multi-modal (image + text) hateful content classification model using a CLIP encoder with a Multi-Task Learning (MTL) architecture, trained on the MMHS150K dataset for detecting hate speech in social media memes and posts.
Among the five configurations listed in the Model Comparison table, this model achieves the highest F1 Macro (0.569) and F1 Micro (0.644) on the MMHS150K test set; the bigger-batch CLIP Fusion variant has a higher ROC-AUC Macro (0.804 vs. 0.783).
Model Description
This model implements a multi-task learning architecture with a shared representation and a task-specific binary head for each hate category. It uses OpenAI's CLIP (ViT-Base-Patch32) as the backbone encoder.
The model performs multi-label classification across 5 hate speech categories, with each task having its own specialized classification head, making it capable of detecting multiple types of hate in a single post.
Architecture
     ┌─────────────┐     ┌─────────────┐
     │    Image    │     │    Text     │
     │   Encoder   │     │   Encoder   │
     │ (CLIP ViT)  │     │ (CLIP Text) │
     └──────┬──────┘     └──────┬──────┘
            │                   │
            ▼                   ▼
     ┌─────────────┐     ┌─────────────┐
     │ Projection  │     │ Projection  │
     │  (Linear)   │     │  (Linear)   │
     └──────┬──────┘     └──────┬──────┘
            │                   │
            └─────────┬─────────┘
                      │
                      ▼
               ┌─────────────┐
               │ Gated Fusion│◄── Modality presence flags
               │   Module    │    (handles missing modalities)
               └──────┬──────┘
                      │
                      ▼
               ┌─────────────┐
               │ Shared Head │
               │    (MLP)    │
               └──────┬──────┘
                      │
    ┌────────┬────────┼────────┬────────┐
    ▼        ▼        ▼        ▼        ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│Racist│ │Sexist│ │Homo- │ │Relig-│ │Other │
│ Head │ │ Head │ │phobe │ │ ion  │ │ Hate │
└──────┘ └──────┘ └──────┘ └──────┘ └──────┘
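The data flow in the diagram can be sketched in PyTorch roughly as follows. This is an illustrative sketch, not the repository's implementation: the class name `GatedFusionMTLSketch`, the sigmoid gate over concatenated features, the layer sizes, the dropout rate, and the way presence flags zero out a missing modality are all assumptions. The CLIP encoders are replaced by random feature tensors (768-dimensional for the image tower and 512-dimensional for the text tower, assumed to be pooled encoder outputs) so the block runs without downloading any weights.

```python
import torch
import torch.nn as nn

TASKS = ["racist", "sexist", "homophobe", "religion", "otherhate"]


class GatedFusionMTLSketch(nn.Module):
    """Hypothetical projection -> gated fusion -> shared head -> task heads flow."""

    def __init__(self, image_dim=768, text_dim=512, fusion_dim=512, tasks=TASKS):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, fusion_dim)
        self.text_proj = nn.Linear(text_dim, fusion_dim)
        # Assumed gate design: it sees both projections plus the two presence flags.
        self.gate = nn.Sequential(
            nn.Linear(2 * fusion_dim + 2, fusion_dim), nn.Sigmoid()
        )
        self.shared = nn.Sequential(
            nn.Linear(fusion_dim, fusion_dim), nn.ReLU(), nn.Dropout(0.1)
        )
        # One binary head (a single logit) per hate category.
        self.heads = nn.ModuleDict({t: nn.Linear(fusion_dim, 1) for t in tasks})

    def forward(self, image_feats, text_feats, has_image, has_text):
        # Presence flags are floats of shape [batch, 1]; a missing modality
        # contributes an all-zero projection.
        img = self.image_proj(image_feats) * has_image
        txt = self.text_proj(text_feats) * has_text
        g = self.gate(torch.cat([img, txt, has_image, has_text], dim=-1))
        fused = g * img + (1.0 - g) * txt
        shared = self.shared(fused)
        # Concatenate the per-task logits to shape [batch, num_tasks].
        return torch.cat([head(shared) for head in self.heads.values()], dim=-1)


model = GatedFusionMTLSketch().eval()
batch = 2
with torch.no_grad():
    logits = model(
        torch.randn(batch, 768),
        torch.randn(batch, 512),
        has_image=torch.tensor([[1.0], [0.0]]),  # second sample is text-only
        has_text=torch.ones(batch, 1),
    )
probs = torch.sigmoid(logits)  # independent per-task probabilities
print(probs.shape)
```

Each head emits one logit, so the sigmoid is applied per task and several categories can be active for the same post.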
Key Features
| Feature | Description |
|---|---|
| Backbone | openai/clip-vit-base-patch32 - Pre-trained CLIP model |
| Architecture | Multi-Task Learning with task-specific heads |
| Fusion Dimension | 512 |
| Max Text Length | 77 tokens |
| Multi-label Output | 5 hate speech categories (one head per task) |
| Gated Attention | Modality-aware fusion with learnable gates |
| Missing Modality Handling | Can handle text-only or image-only inputs |
Output Classes
| Index | Class | Description | Train Distribution |
|---|---|---|---|
| 0 | Racist | Racist content targeting race/ethnicity | 32.6% |
| 1 | Sexist | Sexist content targeting gender | 12.0% |
| 2 | Homophobe | Homophobic content targeting sexual orientation | 7.6% |
| 3 | Religion | Religion-based hate speech | 1.5% |
| 4 | OtherHate | Other types of hate speech | 15.6% |
Evaluation Results
Test Set Performance
| Metric | Value |
|---|---|
| F1 Macro | 0.569 |
| F1 Micro | 0.644 |
| ROC-AUC Macro | 0.783 |
| Test Loss | 1.377 |
| Throughput | 390.9 samples/sec |
Per-Class Performance (Test Set)
| Class | F1 Score | ROC-AUC |
|---|---|---|
| Racist | 0.672 | 0.765 |
| Sexist | 0.589 | 0.810 |
| Homophobe | 0.745 | 0.882 |
| Religion | 0.223 | 0.618 |
| OtherHate | 0.616 | 0.842 |
Per-Class Performance (Validation Set)
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Racist | 0.566 | 0.872 | 0.686 | 1,994 |
| Sexist | 0.605 | 0.630 | 0.617 | 875 |
| Homophobe | 0.796 | 0.712 | 0.752 | 612 |
| Religion | 0.680 | 0.132 | 0.221 | 129 |
| OtherHate | 0.511 | 0.736 | 0.603 | 1,195 |
| Micro Avg | 0.577 | 0.754 | 0.654 | 4,805 |
| Macro Avg | 0.631 | 0.616 | 0.576 | 4,805 |
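The F1 column follows from the precision and recall columns as their harmonic mean, and the macro average is the unweighted mean over the five classes. A minimal sketch that recomputes both from the rounded table values (the helper name `f1_score` is illustrative, not part of the repository):

```python
# Precision and recall per class, copied from the validation table above.
val_report = {
    "Racist": (0.566, 0.872),
    "Sexist": (0.605, 0.630),
    "Homophobe": (0.796, 0.712),
    "Religion": (0.680, 0.132),
    "OtherHate": (0.511, 0.736),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {name: f1_score(p, r) for name, (p, r) in val_report.items()}
# Macro average: every class counts equally, regardless of support.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

for name, f1 in per_class_f1.items():
    print(f"{name}: {f1:.3f}")
print(f"Macro F1: {macro_f1:.3f}")
```

The recomputed values agree with the table to three decimals. Because the macro average weights all classes equally, the low Religion recall (0.132) pulls Macro F1 well below Micro F1.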
Optimized Thresholds
The model uses per-class thresholds tuned on the validation set instead of the default 0.5:
| Class | Threshold |
|---|---|
| Racist | 0.30 |
| Sexist | 0.70 |
| Homophobe | 0.50 |
| Religion | 0.30 |
| OtherHate | 0.55 |
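A small sketch of how these thresholds turn per-class probabilities into labels. The helper `apply_thresholds` and the example probabilities are made up for illustration; the strict `>` comparison mirrors the batch inference example in this card.

```python
# Per-class thresholds from the table above.
THRESHOLDS = {
    "Racist": 0.30,
    "Sexist": 0.70,
    "Homophobe": 0.50,
    "Religion": 0.30,
    "OtherHate": 0.55,
}


def apply_thresholds(probabilities, thresholds=THRESHOLDS):
    """Return the class names whose probability strictly exceeds its threshold."""
    return [name for name, t in thresholds.items() if probabilities[name] > t]


# Hypothetical model output for a single post.
example = {
    "Racist": 0.42,
    "Sexist": 0.55,
    "Homophobe": 0.10,
    "Religion": 0.31,
    "OtherHate": 0.55,
}

tuned = apply_thresholds(example)
default = [name for name, p in example.items() if p > 0.5]

print(tuned)    # ['Racist', 'Religion']
print(default)  # ['Sexist', 'OtherHate']
```

For this made-up input the tuned thresholds and a flat 0.5 cutoff flag entirely different classes, which is why the thresholds should be applied together with the model rather than treated as optional.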
Model Comparison
| Model | F1 Macro | F1 Micro | ROC-AUC Macro | Throughput |
|---|---|---|---|---|
| CLIP MTL (this model) | 0.569 | 0.644 | 0.783 | 390.9 |
| CLIP Fusion | 0.566 | 0.635 | 0.783 | 381.5 |
| SigLIP Fusion | 0.507 | 0.610 | 0.774 | 236.3 |
| CLIP Fusion (Weighted Sampling) | 0.557 | 0.636 | 0.772 | 266.4 |
| CLIP Fusion (Bigger Batch) | 0.515 | 0.517 | 0.804 | 400.9 |
Training Data
MMHS150K Dataset
The model was trained on the MMHS150K (Multi-Modal Hate Speech) dataset, a large-scale multi-modal hate speech dataset collected from Twitter containing 150,000 tweet-image pairs annotated for hate speech detection.
| Property | Value |
|---|---|
| Source | |
| Total Samples | ~150,000 |
| Modalities | Image + Text |
| Annotation | Multi-label (5 hate categories) |
| Language | English |
Dataset Splits
| Split | Samples |
|---|---|
| Train | ~112,500 |
| Validation | ~15,000 |
| Test | ~22,500 |
Dataset Reference
Paper: "Exploring Hate Speech Detection in Multimodal Publications" (WACV 2020)
Authors: Raul Gomez, Jaume Gibert, Lluis Gomez, Dimosthenis Karatzas
Training Procedure
Training Configuration
# Model Configuration
backend: clip
head: mtl
encoder_name: openai/clip-vit-base-patch32
fusion_dim: 512
max_text_length: 77
freeze_text: false
freeze_image: false
# Training Configuration
num_train_epochs: 4
per_device_train_batch_size: 32
per_device_eval_batch_size: 64
gradient_accumulation_steps: 2
# Learning Rates (Differential)
lr_encoder: 1.0e-5
lr_head: 5.0e-4
# Regularization
weight_decay: 0.02
max_grad_norm: 1.0
# Scheduler
warmup_ratio: 0.05
lr_scheduler_type: cosine
# Loss
loss_type: bce  # per-task binary cross-entropy
# Precision
precision: fp16
# Data Augmentation
augment: true
aug_scale_min: 0.8
aug_scale_max: 1.0
horizontal_flip: true
color_jitter: true
# Early Stopping
early_stopping_patience: 3
metric_for_best_model: roc_macro
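To make the scheduler settings concrete, here is a minimal sketch of a cosine schedule with linear warmup, applied to the two peak learning rates from the configuration. It assumes the common formulation (linear ramp from zero, then half-cosine decay to zero); the function name `cosine_lr` and the total step count are hypothetical and not taken from the training code.

```python
import math


def cosine_lr(step, total_steps, peak_lr, warmup_ratio=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical value, chosen only for illustration

for step in (0, 25, 50, 525, 1000):
    encoder_lr = cosine_lr(step, TOTAL_STEPS, peak_lr=1.0e-5)  # lr_encoder
    head_lr = cosine_lr(step, TOTAL_STEPS, peak_lr=5.0e-4)     # lr_head
    print(f"step {step:4d}: encoder={encoder_lr:.2e}  head={head_lr:.2e}")
```

Both parameter groups follow the same shape, so the heads stay at 50 times the encoder rate throughout training. With `per_device_train_batch_size: 32` and `gradient_accumulation_steps: 2`, each optimizer step covers 64 samples per device.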
Training Highlights
- Multi-Task Learning: Separate binary head for each hate category
- Differential Learning Rates: Encoder (1e-5) vs Classification Heads (5e-4)
- Mixed Precision: FP16 training for efficiency
- Data Augmentation: Random scaling, horizontal flip, color jitter
- Threshold Calibration: Per-class threshold optimization on validation set
- Early Stopping: Patience of 3 epochs based on ROC-AUC macro
- Best Checkpoint: Step 11,236 (epoch 1)
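The threshold calibration step can be sketched as a per-class grid search on validation predictions. The card only states that thresholds were optimized per class on the validation set; maximizing F1, the 0.05 grid, and the toy data below are assumptions made for illustration.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 of the binary predictions obtained with `prob > threshold`."""
    preds = [p > threshold for p in probs]
    tp = sum(1 for pred, y in zip(preds, labels) if pred and y)
    fp = sum(1 for pred, y in zip(preds, labels) if pred and not y)
    fn = sum(1 for pred, y in zip(preds, labels) if not pred and y)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, labels, grid=None):
    """Return the grid value with the highest F1 (lowest value wins ties)."""
    if grid is None:
        grid = [round(0.05 * i, 2) for i in range(1, 20)]  # 0.05 ... 0.95
    return max(grid, key=lambda t: f1_at_threshold(probs, labels, t))


# Toy validation scores and labels for a single class (made up).
val_probs = [0.10, 0.35, 0.40, 0.62, 0.80, 0.20]
val_labels = [0, 1, 1, 1, 1, 0]

threshold = best_threshold(val_probs, val_labels)
print(threshold, f1_at_threshold(val_probs, val_labels, threshold))
```

Running the search once per class yields one threshold per head. Tuning on the validation split keeps the test metrics free of threshold selection bias.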
Computational Resources
- Training Duration: up to 4 epochs (num_train_epochs: 4)
- Best Checkpoint: Step 11,236
- Hardware: GPU with FP16 support
How to Use
Installation
# Clone the training repository
git clone https://github.com/amirhossein-yousefi/multimodal-content-moderation.git
cd multimodal-content-moderation
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
pip install -e .
Quick Inference with Trust Remote Code
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor

# Load model with trust_remote_code
model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-mtl",
    trust_remote_code=True,
)
model.eval()

# Load tokenizer and processor from base CLIP model
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare inputs
image = Image.open("path/to/image.jpg").convert("RGB")
text = "Sample text from the meme"

# Process inputs
text_inputs = tokenizer(
    text,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=77,
)
image_inputs = processor(images=image, return_tensors="pt")

# Inference
with torch.no_grad():
    outputs = model.predict(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"],
        pixel_values=image_inputs["pixel_values"],
    )

# Get predictions
predictions = outputs["predictions"][0]  # [num_tasks]
probabilities = outputs["probabilities"][0]  # [num_tasks]
task_names = outputs["task_names"]

# Print results
for name, pred, prob in zip(task_names, predictions, probabilities):
    print(f"{name}: {'Yes' if pred else 'No'} (prob: {prob:.3f})")
Batch Inference
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor

# Load model
model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-mtl",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare batch (convert to RGB so grayscale or RGBA files are handled consistently)
images = [
    Image.open("image1.jpg").convert("RGB"),
    Image.open("image2.jpg").convert("RGB"),
]
texts = ["text for image 1", "text for image 2"]

# Process
text_inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=77,
)
image_inputs = processor(images=images, return_tensors="pt")

# Inference
model.eval()
with torch.no_grad():
    outputs = model(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"],
        pixel_values=image_inputs["pixel_values"],
    )
    probabilities = torch.sigmoid(outputs["logits"])  # [batch, num_tasks]

# Apply optimized thresholds (order: Racist, Sexist, Homophobe, Religion, OtherHate)
thresholds = torch.tensor([0.30, 0.70, 0.50, 0.30, 0.55])
predictions = (probabilities > thresholds).int()
print(predictions)
Using with the Original Repository
from src.models import MultiTaskClassifier
from src.utils import load_json
from safetensors.torch import load_file

# Load config
config = load_json("inference_config.json")

# Build model
model = MultiTaskClassifier(
    encoder_name=config["encoder_name"],
    task_names=config["class_names"],
    fusion_dim=config["fusion_dim"],
    backend=config["backend"],
)

# Load weights
state_dict = load_file("checkpoint-11236/model.safetensors")
model.load_state_dict(state_dict)
model.eval()
Model Files
| File | Description |
|---|---|
| checkpoint-11236/model.safetensors | Model weights in safetensors format |
| config.json | Model architecture configuration |
| modeling_clip_mtl.py | Custom model implementation for trust_remote_code |
| inference_config.json | Inference settings with thresholds and class names |
| label_map.json | Label name mapping |
| test_metrics.json | Test set evaluation metrics |
| val_report.json | Detailed validation classification report |
AWS SageMaker Deployment
This model is compatible with AWS SageMaker for cloud deployment:
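import base64

import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Assumption: running inside SageMaker (notebook or Studio), where
# get_execution_role() resolves the IAM role. Elsewhere, set `role` to a role ARN.
role = sagemaker.get_execution_role()

model = PyTorchModel(
    model_data="s3://your-bucket/model.tar.gz",
    role=role,
    entry_point='inference.py',
    source_dir='sagemaker',
    framework_version='2.1.0',
    py_version='py310',
)

# Send and receive JSON so that predict() accepts a plain dict
predictor = model.deploy(
    instance_type='ml.g4dn.xlarge',
    initial_instance_count=1,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Make prediction
with open('image.jpg', 'rb') as f:
    image_b64 = base64.b64encode(f.read()).decode('utf-8')

response = predictor.predict({
    'instances': [{
        'text': 'Sample text content',
        'image_base64': image_b64,
    }]
})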
See the SageMaker documentation for the full deployment guide.
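When sending several posts in one request, the payload shape used above can be built with a small helper. This is a sketch under assumptions: `build_payload` is a hypothetical helper, and whether the endpoint's `inference.py` accepts instances without an `image_base64` field is not confirmed by this card.

```python
import base64
import json


def build_payload(items):
    """Build a request body from (text, image_bytes or None) pairs."""
    instances = []
    for text, image_bytes in items:
        instance = {"text": text}
        if image_bytes is not None:
            # Raw image bytes are base64-encoded so they survive JSON transport.
            instance["image_base64"] = base64.b64encode(image_bytes).decode("utf-8")
        instances.append(instance)
    return {"instances": instances}


# Placeholder bytes stand in for a real image file read in binary mode.
fake_image = b"\xff\xd8\xff placeholder image bytes"
payload = build_payload([
    ("Sample text content", fake_image),
    ("Text-only post", None),  # assumption: text-only instances are accepted
])
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed to `predictor.predict(payload)` once the predictor is configured with JSON serialization.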
Intended Uses & Limitations
Intended Uses
- Content moderation for social media platforms
- Detecting hateful memes and posts
- Research in multi-modal hate speech detection
- Building content safety systems
- Pre-filtering potentially harmful content for human review
Limitations
| Limitation | Description |
|---|---|
| Language | Trained only on English content |
| Domain | Twitter-specific; may not generalize to other platforms |
| Class Imbalance | Lower performance on rare categories (Religion: F1=0.223) |
| Cultural Context | May miss culturally-specific hate speech |
| Sarcasm/Irony | May struggle with subtle or ironic hateful content |
| Image-only Hate | Text encoder is important; purely visual hate may be missed |
Out-of-Scope Uses
- NOT for making final moderation decisions without human review
- NOT suitable for legal or compliance purposes without additional validation
- NOT for censorship or suppression of legitimate speech
- NOT for targeting or profiling individuals
Ethical Considerations
- This model should be used as a tool to assist human moderators, not replace them
- False positives may incorrectly flag legitimate content
- False negatives may miss harmful content
- Regular evaluation and bias auditing are recommended
- Consider cultural and contextual factors in deployment
Citation
If you use this model, please cite:
@misc{yousefi2024multimodal_mtl,
title={Multi-Modal Hateful Content Classification with CLIP Multi-Task Learning},
author={Yousefi, Amirhossein},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/Amirhossein75/clip-vit-base-mmhs150k-mtl}
}
Dataset Citation
@inproceedings{gomez2020exploring,
title={Exploring Hate Speech Detection in Multimodal Publications},
author={Gomez, Raul and Gibert, Jaume and Gomez, Lluis and Karatzas, Dimosthenis},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
pages={1470--1478},
year={2020}
}
CLIP Citation
@inproceedings{radford2021learning,
title={Learning Transferable Visual Models From Natural Language Supervision},
author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
booktitle={International Conference on Machine Learning},
pages={8748--8763},
year={2021}
}
Links
| Resource | Link |
|---|---|
| GitHub Repository | multimodal-content-moderation |
| Base Model | openai/clip-vit-base-patch32 |
| CLIP Fusion Model | Amirhossein75/clip-vit-base-mmhs150k-fusion |
| MMHS150K Dataset | Official Page |
| CLIP Paper | arXiv |
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Contributions are welcome! Please see the GitHub repository for contribution guidelines.