Qwen2.5-VL-3B-SCT-Classifier

A Vision-Language Model fine-tuned to classify SEC proxy statement tables as Summary Compensation Tables (SCT) vs non-SCT tables.

Model Description

This model adds a classification head on top of Qwen/Qwen2.5-VL-3B-Instruct for binary table classification. The base model is frozen and only the classifier head is trained.

Task: Given an image of a table from an SEC DEF 14A filing, classify whether it is a Summary Compensation Table.

Architecture

  • Base Model: Qwen/Qwen2.5-VL-3B-Instruct (frozen)
  • Classifier Head: Linear(2048 → 512) → ReLU → Dropout(0.1) → Linear(512 → 2)
  • Pooling: Mean pooling over last hidden states

Training

  • Dataset: pierjoe/sec-table-classifier
    • ~1,500 positive samples (SCT tables)
    • ~3,000 negative samples (other tables from same documents)
  • Loss: CrossEntropyLoss with class weights [1.0, 2.0] to reduce false negatives (see the training sketch after this list)
  • Optimizer: AdamW, LR=5e-6
  • Epochs: 3
  • Batch Size: 2
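
A minimal sketch of a training loop consistent with these hyperparameters. It assumes the VLMClassifier defined in the Usage section below and a train_loader yielding processor outputs plus integer labels (both placeholders, not part of this repo); only the head's parameters go to the optimizer, since the backbone is frozen:

# Freeze the backbone; only the classifier head receives gradients
for p in model.base_model.parameters():
    p.requires_grad = False

# Weight the positive (SCT) class 2x to penalize false negatives
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0], device="cuda"))
optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=5e-6)

for epoch in range(3):
    for batch in train_loader:  # assumed DataLoader of processor outputs + labels
        logits = model(
            batch["input_ids"].to("cuda"),
            batch["attention_mask"].to("cuda"),
            batch["pixel_values"].to("cuda", dtype=torch.bfloat16),
            batch["image_grid_thw"].to("cuda"),
        )
        loss = criterion(logits, batch["labels"].to("cuda"))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()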

Performance

  • Test Accuracy: ~99%
  • False Negatives: Minimized via the weighted loss (priority: don't miss real SCT tables); lowering the inference decision threshold reduces them further (see the sweep below)
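
A hedged sketch of how a decision threshold like the 0.3 used in the Usage section can be chosen: sweep candidate thresholds on a held-out set and keep the largest one whose false-negative rate is acceptable. Here probs and labels are assumed NumPy arrays of P(SCT) scores and 0/1 ground truth (not provided by this repo):

import numpy as np

def false_negative_rate(probs, labels, thresh):
    preds = (probs >= thresh).astype(int)
    missed = np.sum((preds == 0) & (labels == 1))
    return missed / max(np.sum(labels == 1), 1)

for t in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"thresh={t:.1f}  FN rate={false_negative_rate(probs, labels, t):.3f}")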

Usage

import torch
import torch.nn as nn
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from safetensors.torch import load_file
from PIL import Image

# Classifier: frozen VLM backbone + small trainable head
class VLMClassifier(nn.Module):
    def __init__(self, base_model, num_labels=2):
        super().__init__()
        self.base_model = base_model
        hidden_size = base_model.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_labels)
        )
        
    def forward(self, input_ids, attention_mask, pixel_values, image_grid_thw):
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,
            image_grid_thw=image_grid_thw,
            output_hidden_states=True,
            return_dict=True
        )
        # Mean-pool the last-layer hidden states over the sequence dimension
        hidden_states = outputs.hidden_states[-1]
        pooled = hidden_states.mean(dim=1)
        # Cast up to fp32 for the classifier head
        return self.classifier(pooled.float())

# Load model
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", 
    torch_dtype=torch.bfloat16, 
    device_map="cuda:0"
)
model = VLMClassifier(base_model, num_labels=2).to("cuda")
model.classifier.load_state_dict(load_file("classifier_head.safetensors"))  # trained head weights only
model.eval()

# Inference
img = Image.open("table.png").convert("RGB")
messages = [[{
    "role": "user",
    "content": [
        {"type": "image", "image": img},
        {"type": "text", "text": "Classify this table."}
    ]
}]]
texts = [processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]
inputs = processor(text=texts, images=[img], padding=True, return_tensors="pt")

with torch.no_grad():
    logits = model(
        inputs["input_ids"].to("cuda"),
        inputs["attention_mask"].to("cuda"),
        inputs["pixel_values"].to("cuda", dtype=torch.bfloat16),
        inputs["image_grid_thw"].to("cuda")
    )
    prob_sct = torch.softmax(logits, dim=-1)[0, 1].item()

print(f"P(SCT) = {prob_sct:.3f}")
# A threshold below 0.5 (0.3 here) trades some precision for fewer false negatives
is_sct = prob_sct >= 0.3
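
The same call also works batched when scoring many tables at once; a sketch, with placeholder file names. One caveat: plain mean pooling includes padding tokens, so a given image's score can shift slightly with batch composition.

# Batched scoring of several table crops (file names are placeholders)
imgs = [Image.open(p).convert("RGB") for p in ["table_a.png", "table_b.png"]]
messages = [[{
    "role": "user",
    "content": [
        {"type": "image", "image": im},
        {"type": "text", "text": "Classify this table."}
    ]
}] for im in imgs]
texts = [processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]
inputs = processor(text=texts, images=imgs, padding=True, return_tensors="pt")

with torch.no_grad():
    logits = model(
        inputs["input_ids"].to("cuda"),
        inputs["attention_mask"].to("cuda"),
        inputs["pixel_values"].to("cuda", dtype=torch.bfloat16),
        inputs["image_grid_thw"].to("cuda")
    )
    probs_sct = torch.softmax(logits, dim=-1)[:, 1].tolist()  # P(SCT) per image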

Files

  • classifier_head.safetensors - Classifier head weights
  • classifier_config.json - Model configuration
  • config.json - Base model config
  • notebooks/ - Training and testing notebooks

Citation

This model is part of an SEC executive compensation extraction pipeline.

License

Apache 2.0 (same as base model)
