SigLIP2-Base Fusion Model for Multi-Modal Hate Speech Detection

A PyTorch-based multi-modal (image + text) hateful content classifier that uses a SigLIP2 encoder with a late-fusion architecture. It is trained on the MMHS150K dataset to detect hate speech in social media memes and posts.

🎯 Model Description

This model implements a late-fusion architecture with a gated attention mechanism for detecting hateful content in social media memes and posts. It combines visual and textual features, using Google's SigLIP2 (Base-Patch16-224) as the backbone encoder.

The model performs multi-label classification across 5 hate speech categories, making it capable of detecting multiple types of hate in a single post (e.g., content that is both racist and sexist).

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Image     β”‚     β”‚    Text     β”‚
β”‚   Encoder   β”‚     β”‚   Encoder   β”‚
β”‚  (SigLIP2)  β”‚     β”‚  (SigLIP2)  β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚                   β”‚
       β–Ό                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Projection β”‚     β”‚  Projection β”‚
β”‚   (Linear)  β”‚     β”‚   (Linear)  β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚                   β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
                 β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚ Gated Fusion│◄── Modality presence flags
         β”‚   Module    β”‚    (handles missing modalities)
         β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
                β”‚
                β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  Interaction Features β”‚
    β”‚  β€’ Fused embedding    β”‚
    β”‚  β€’ Text embedding     β”‚
    β”‚  β€’ Visual embedding   β”‚
    β”‚  β€’ |text - visual|    β”‚
    β”‚  β€’ text βŠ™ visual      β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                β”‚
                β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚Classificationβ”‚
         β”‚  Head (MLP)  β”‚
         β”‚  β†’ 5 classes β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
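
The data flow in the diagram can be sketched in PyTorch roughly as follows. This is an illustrative reading of the diagram, not the code shipped in modeling_siglip_fusion.py: the class name GatedFusionHead, the 768-dimensional encoder output, the exact gate formulation, and the two-layer MLP are all assumptions. Only the fusion dimension (512), the five interaction features, and the five output classes come from this card.

```python
import torch
import torch.nn as nn


class GatedFusionHead(nn.Module):
    """Illustrative late-fusion head; layer sizes and names are assumptions."""

    def __init__(self, enc_dim=768, fusion_dim=512, num_classes=5):
        super().__init__()
        # One linear projection per modality, as in the diagram.
        self.img_proj = nn.Linear(enc_dim, fusion_dim)
        self.txt_proj = nn.Linear(enc_dim, fusion_dim)
        # The gate sees both projected embeddings plus the two presence flags.
        self.gate = nn.Linear(2 * fusion_dim + 2, fusion_dim)
        # MLP over the five interaction features listed in the diagram.
        self.classifier = nn.Sequential(
            nn.Linear(5 * fusion_dim, fusion_dim),
            nn.GELU(),
            nn.Linear(fusion_dim, num_classes),
        )

    def forward(self, img_feat, txt_feat, has_img, has_txt):
        # Zero out the projected embedding of any missing modality.
        v = self.img_proj(img_feat) * has_img.unsqueeze(-1)
        t = self.txt_proj(txt_feat) * has_txt.unsqueeze(-1)
        flags = torch.stack([has_img, has_txt], dim=-1)
        # Per-dimension gate in (0, 1) that mixes the two modalities.
        g = torch.sigmoid(self.gate(torch.cat([v, t, flags], dim=-1)))
        fused = g * v + (1.0 - g) * t
        # Fused, text, visual, |text - visual|, text * visual.
        feats = torch.cat([fused, t, v, (t - v).abs(), t * v], dim=-1)
        return self.classifier(feats)  # raw logits, shape (batch, num_classes)


head = GatedFusionHead()
img = torch.randn(2, 768)            # stand-in for pooled image features
txt = torch.randn(2, 768)            # stand-in for pooled text features
has_img = torch.tensor([1.0, 0.0])   # second sample is text-only
has_txt = torch.tensor([1.0, 1.0])
logits = head(img, txt, has_img, has_txt)
print(logits.shape)
```

Passing the presence flags into the gate lets it learn to rely on the remaining modality when one input is absent, which is one plausible way to realize the "handles missing modalities" note in the diagram.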

🔑 Key Features

| Feature | Description |
|---|---|
| Backbone | google/siglip2-base-patch16-224 (pre-trained SigLIP2 model) |
| Fusion Dimension | 512 |
| Max Text Length | 64 tokens |
| Multi-label Output | 5 hate speech categories |
| Gated Attention | Modality-aware fusion with learnable gates |
| Interaction Features | Rich feature interactions (concatenation, element-wise product, absolute difference) |
| Missing Modality Handling | Can handle text-only or image-only inputs |

🏷️ Output Classes

| Index | Class | Description | Dataset Prevalence |
|---|---|---|---|
| 0 | Racist | Racist content targeting race/ethnicity | 32.6% |
| 1 | Sexist | Sexist content targeting gender | 12.0% |
| 2 | Homophobe | Homophobic content targeting sexual orientation | 7.6% |
| 3 | Religion | Religion-based hate speech | 1.5% |
| 4 | OtherHate | Other types of hate speech | 15.6% |

📊 Evaluation Results

Test Set Performance

| Metric | Score |
|---|---|
| F1 Macro | 0.507 |
| F1 Micro | 0.610 |
| ROC-AUC Macro | 0.774 |
| Test Loss | 1.530 |
| Throughput | 236.3 samples/sec |

Per-Class Performance (Validation Set)

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Racist | 0.534 | 0.874 | 0.663 | 1,994 |
| Sexist | 0.620 | 0.585 | 0.602 | 875 |
| Homophobe | 0.818 | 0.632 | 0.713 | 612 |
| Religion | 0.118 | 0.140 | 0.128 | 129 |
| OtherHate | 0.479 | 0.707 | 0.571 | 1,195 |
| Micro Avg | 0.541 | 0.729 | 0.621 | 4,805 |
| Macro Avg | 0.514 | 0.588 | 0.535 | 4,805 |
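
As a quick sanity check on how the F1 column relates to precision and recall, the short sketch below recomputes per-class F1 and the macro average from the rounded values in the validation table. The helper name f1_score is illustrative, and because the inputs are rounded to three decimals the results match the table only to within rounding.

```python
# Per-class (precision, recall) copied from the validation table above.
report = {
    "Racist": (0.534, 0.874),
    "Sexist": (0.620, 0.585),
    "Homophobe": (0.818, 0.632),
    "Religion": (0.118, 0.140),
    "OtherHate": (0.479, 0.707),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


per_class_f1 = {name: f1_score(p, r) for name, (p, r) in report.items()}
# Macro F1 is the unweighted mean, so the rare Religion class counts
# as much as the frequent Racist class.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

for name, score in per_class_f1.items():
    print(f"{name}: {score:.3f}")
print(f"Macro F1: {macro_f1:.3f}")
```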

⚙️ Optimized Thresholds

Instead of the default 0.5 cutoff, the model uses per-class thresholds calibrated on the validation set:

| Class | Threshold |
|---|---|
| Racist | 0.30 |
| Sexist | 0.75 |
| Homophobe | 0.85 |
| Religion | 0.20 |
| OtherHate | 0.55 |
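
One common way to obtain per-class thresholds like these is a grid search that maximizes F1 for each class on held-out data. The sketch below shows that idea for a single class. It is an assumption about the general technique, not the calibration code from the training repository; the function names, the 0.05 grid step, and the toy probabilities are all hypothetical.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 for one class when predicting positive at prob > threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p <= threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_threshold(probs, labels, grid=None):
    """Return the grid threshold with the highest F1 (ties: lowest threshold)."""
    grid = grid or [round(0.05 * k, 2) for k in range(1, 20)]  # 0.05 .. 0.95
    return max(grid, key=lambda t: f1_at_threshold(probs, labels, t))


# Toy validation scores and labels for one class (hypothetical data).
val_probs = [0.10, 0.22, 0.35, 0.41, 0.58, 0.63, 0.77, 0.91]
val_labels = [0, 0, 1, 0, 1, 1, 1, 1]

best = calibrate_threshold(val_probs, val_labels)
print(f"best threshold: {best}")
```

Running the same search once per class would yield a vector of five thresholds. Thresholds tuned this way should be chosen on validation data only, so that test metrics remain an unbiased estimate.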

📈 Model Comparison

| Model | F1 Macro | F1 Micro | ROC-AUC | Throughput (samples/sec) |
|---|---|---|---|---|
| CLIP Fusion | 0.566 | 0.635 | 0.783 | 381.5 |
| CLIP MTL | 0.569 | 0.644 | 0.783 | 390.9 |
| SigLIP Fusion (this model) | 0.507 | 0.610 | 0.774 | 236.3 |
| CLIP Fusion (Weighted Sampling) | 0.557 | 0.636 | 0.772 | 266.4 |
| CLIP Fusion (Bigger Batch) | 0.515 | 0.517 | 0.804 | 400.9 |

🎓 Training Data

MMHS150K Dataset

The model was trained on the MMHS150K (Multi-Modal Hate Speech) dataset, a large-scale multi-modal hate speech dataset collected from Twitter containing 150,000 tweet-image pairs annotated for hate speech detection.

| Attribute | Value |
|---|---|
| Source | Twitter |
| Total Samples | ~150,000 |
| Modalities | Image + Text |
| Annotation | Multi-label (5 hate categories) |
| Language | English |

Dataset Splits

| Split | Samples |
|---|---|
| Train | ~112,500 |
| Validation | ~15,000 |
| Test | ~22,500 |

Dataset Reference

Paper: "Exploring Hate Speech Detection in Multimodal Publications" (WACV 2020)
Authors: Raul Gomez, Jaume Gibert, Lluis Gomez, Dimosthenis Karatzas

🔧 Training Procedure

Training Configuration

# Model Configuration
backend: siglip
head: fusion
encoder_name: google/siglip2-base-patch16-224
fusion_dim: 512
max_text_length: 64
freeze_text: false
freeze_image: false

# Training Configuration
num_train_epochs: 6
per_device_train_batch_size: 32
per_device_eval_batch_size: 64
gradient_accumulation_steps: 2

# Learning Rates (Differential)
lr_encoder: 1.0e-5
lr_head: 5.0e-4

# Regularization
weight_decay: 0.02
max_grad_norm: 1.0

# Scheduler
warmup_ratio: 0.05
lr_scheduler_type: cosine

# Loss
loss_type: bce
use_logit_adjustment: false

# Precision
precision: fp16

# Data Augmentation
augment: true
aug_scale_min: 0.8
aug_scale_max: 1.0
horizontal_flip: true
color_jitter: true

# Early Stopping
early_stopping_patience: 3
metric_for_best_model: roc_macro
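
To illustrate what the bce loss setting implies for multi-label training, here is a minimal pure-Python sketch of binary cross-entropy computed independently per class on raw logits. The function name and the example logits are hypothetical; in PyTorch the same quantity is what torch.nn.BCEWithLogitsLoss computes with its default mean reduction, though whether the training code uses that exact class is an assumption.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent per-class logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# One post labelled both racist and sexist (multi-hot target).
# Class order: racist, sexist, homophobe, religion, otherhate.
logits = [2.0, 1.0, -3.0, -4.0, -1.0]
targets = [1, 1, 0, 0, 0]

loss = bce_with_logits(logits, targets)
print(f"loss: {loss:.4f}")
```

Because each class gets its own sigmoid and its own loss term, a single post can be positive for several classes at once, unlike softmax cross-entropy, which forces the classes to compete.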

Training Highlights

  • Differential Learning Rates: Encoder (1e-5) vs Classification Head (5e-4)
  • Mixed Precision: FP16 training for efficiency
  • Data Augmentation: Random scaling, horizontal flip, color jitter
  • Threshold Calibration: Per-class threshold optimization on validation set
  • Early Stopping: Patience of 3 epochs based on ROC-AUC macro
  • Best Checkpoint: Selected based on validation ROC-AUC macro score
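
The differential learning rates and the warmup-plus-cosine schedule listed above can be visualized with a small standalone sketch. It assumes the usual linear warmup followed by cosine decay to zero (the shape implemented by, for example, get_cosine_schedule_with_warmup in Transformers); the function name lr_at_step and the total step count are hypothetical, and the training repository may differ in detail.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.05):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 40_000  # hypothetical; the real step count is not stated in this card
for step in (0, 1_000, 2_000, 21_000, 40_000):
    enc = lr_at_step(step, total_steps, base_lr=1.0e-5)   # lr_encoder
    head = lr_at_step(step, total_steps, base_lr=5.0e-4)  # lr_head
    print(f"step {step:>6}: encoder lr={enc:.2e}, head lr={head:.2e}")
```

Both parameter groups follow the same schedule shape, so the head stays 50 times faster than the encoder throughout training. This lets the randomly initialized head adapt quickly while the pre-trained encoder changes slowly.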

Computational Resources

  • Training Length: up to 6 epochs (with early stopping)
  • Best Checkpoint: Step 30,000
  • Hardware: GPU with FP16 support

🚀 How to Use

Installation

# Clone the training repository
git clone https://github.com/amirhossein-yousefi/multimodal-content-moderation.git
cd multimodal-content-moderation

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

Quick Inference with trust_remote_code

from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load model with trust_remote_code
model = AutoModel.from_pretrained(
    "Amirhossein75/siglip2-base-mmhs150k-fusion",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

# Prepare inputs
image = Image.open("path/to/image.jpg")
text = "sample text from the meme"

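inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
    padding="max_length",  # SigLIP checkpoints are trained with fixed-length padding
    max_length=64,         # matches the 64-token limit listed under Key Features
    truncation=True
)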

# Inference
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.sigmoid(outputs["logits"])

# Apply optimized thresholds
thresholds = torch.tensor([0.30, 0.75, 0.85, 0.20, 0.55])
predictions = (probabilities > thresholds).int()

class_names = ["racist", "sexist", "homophobe", "religion", "otherhate"]
for i, name in enumerate(class_names):
    print(f"{name}: {bool(predictions[0, i])} (prob: {probabilities[0, i]:.3f})")

Using the predict() Method

from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load model
model = AutoModel.from_pretrained(
    "Amirhossein75/siglip2-base-mmhs150k-fusion",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

# Prepare inputs
image = Image.open("path/to/image.jpg")
text = "sample text from the meme"

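inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
    padding="max_length",  # SigLIP checkpoints are trained with fixed-length padding
    max_length=64,         # matches the 64-token limit listed under Key Features
    truncation=True
)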

# Use built-in predict method with calibrated thresholds
result = model.predict(**inputs)

print(result)
# {'predictions': {'racist': False, 'sexist': True, 'homophobe': False, 'religion': False, 'otherhate': False},
#  'probabilities': {'racist': 0.12, 'sexist': 0.78, ...}}

Batch Inference

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load model
model = AutoModel.from_pretrained(
    "Amirhossein75/siglip2-base-mmhs150k-fusion",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

# Prepare batch
images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
texts = ["text for image 1", "text for image 2"]

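inputs = processor(
    text=texts,
    images=images,
    return_tensors="pt",
    padding="max_length",  # SigLIP checkpoints are trained with fixed-length padding
    max_length=64,         # matches the 64-token limit listed under Key Features
    truncation=True
)

# Inference (eval mode disables dropout)
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.sigmoid(outputs["logits"])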

# Apply optimized thresholds
thresholds = torch.tensor([0.30, 0.75, 0.85, 0.20, 0.55])
predictions = (probabilities > thresholds).int()

Loading the Inference Config (for Custom Pipelines)

from huggingface_hub import hf_hub_download
import json

# Download config
config_path = hf_hub_download(
    repo_id="Amirhossein75/siglip2-base-mmhs150k-fusion",
    filename="inference_config.json"
)

with open(config_path) as f:
    config = json.load(f)

print(f"Classes: {config['class_names']}")
print(f"Thresholds: {config['thresholds']}")
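
Once the class names and thresholds are loaded, turning probabilities into labels is a per-class comparison. The helper below is a minimal sketch: the name flag_labels is hypothetical, and it assumes the thresholds are stored as a list aligned with the class names, which this card does not confirm. The values are copied from the Optimized Thresholds section rather than read from the downloaded file.

```python
def flag_labels(probabilities, class_names, thresholds):
    """Return the class names whose probability exceeds its own threshold."""
    return [
        name
        for name, prob, thr in zip(class_names, probabilities, thresholds)
        if prob > thr
    ]


# Values copied from this card; in practice they would come from the config.
class_names = ["racist", "sexist", "homophobe", "religion", "otherhate"]
thresholds = [0.30, 0.75, 0.85, 0.20, 0.55]

# Hypothetical sigmoid outputs for one post.
probabilities = [0.42, 0.78, 0.10, 0.05, 0.50]
flagged = flag_labels(probabilities, class_names, thresholds)
print(flagged)
```

Note that a single 0.5 cutoff would have missed the racist label here (0.42) while the calibrated 0.30 threshold flags it, which is the practical effect of per-class calibration.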

📁 Model Files

| File | Description |
|---|---|
| model.safetensors | Model weights in safetensors format |
| config.json | Model architecture configuration |
| modeling_siglip_fusion.py | Custom model class for trust_remote_code |
| inference_config.json | Inference settings with thresholds and class names |
| label_map.json | Label name mapping |
| test_metrics.json | Test set evaluation metrics |
| val_report.json | Detailed validation classification report |

☁️ AWS SageMaker Deployment

This model is compatible with AWS SageMaker for cloud deployment:

import base64

import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# IAM role with SageMaker permissions (resolved automatically inside SageMaker notebooks)
role = sagemaker.get_execution_role()

model = PyTorchModel(
    model_data="s3://your-bucket/model.tar.gz",
    role=role,
    entry_point='inference.py',
    source_dir='sagemaker',
    framework_version='2.1.0',
    py_version='py310',
)

# Send and receive JSON so the dict payload below is serialized correctly
predictor = model.deploy(
    instance_type='ml.g4dn.xlarge',
    initial_instance_count=1,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Make prediction
with open('image.jpg', 'rb') as f:
    image_b64 = base64.b64encode(f.read()).decode('utf-8')

response = predictor.predict({
    'instances': [{
        'text': 'Sample text content',
        'image_base64': image_b64,
    }]
})

See the SageMaker documentation for the full deployment guide.

⚠️ Intended Uses & Limitations

✅ Intended Uses

  • Content moderation for social media platforms
  • Detecting hateful memes and posts
  • Research in multi-modal hate speech detection
  • Building content safety systems
  • Pre-filtering potentially harmful content for human review

⚠️ Limitations

| Limitation | Description |
|---|---|
| Language | Trained only on English content |
| Domain | Twitter-specific; may not generalize to other platforms |
| Class Imbalance | Lower performance on rare categories (Religion: F1=0.128) |
| Cultural Context | May miss culturally-specific hate speech |
| Sarcasm/Irony | May struggle with subtle or ironic hateful content |
| Image-only Hate | Text encoder is important; purely visual hate may be missed |

❌ Out-of-Scope Uses

  • NOT for making final moderation decisions without human review
  • NOT suitable for legal or compliance purposes without additional validation
  • NOT for censorship or suppression of legitimate speech
  • NOT for targeting or profiling individuals

🛡️ Ethical Considerations

  • This model should be used as a tool to assist human moderators, not to replace them
  • False positives may incorrectly flag legitimate content
  • False negatives may miss harmful content
  • Regular evaluation and bias auditing are recommended
  • Consider cultural and contextual factors when deploying the model

📝 Citation

If you use this model, please cite:

@misc{yousefi2024multimodal,
  title={Multi-Modal Hateful Content Classification with SigLIP2 Fusion},
  author={Yousefi, Amirhossein},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Amirhossein75/siglip2-base-mmhs150k-fusion}
}

Dataset Citation

@inproceedings{gomez2020exploring,
  title={Exploring Hate Speech Detection in Multimodal Publications},
  author={Gomez, Raul and Gibert, Jaume and Gomez, Lluis and Karatzas, Dimosthenis},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  pages={1470--1478},
  year={2020}
}

SigLIP Citation

@article{zhai2023sigmoid,
  title={Sigmoid Loss for Language Image Pre-Training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}

🔗 Links

| Resource | Link |
|---|---|
| GitHub Repository | multimodal-content-moderation |
| Base Model | google/siglip2-base-patch16-224 |
| CLIP Fusion Model | Amirhossein75/clip-vit-base-mmhs150k-fusion |
| MMHS150K Dataset | Official Page |
| SigLIP Paper | arXiv |

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please see the GitHub repository for contribution guidelines.
