Arch-L3869-PageClassification

Model Details

Model Description

This is a Greek text classification model for categorizing document pages into 18 different classes. The model was trained using a two-phase approach:

  1. Phase 1 (Contrastive Learning): Further pre-training of the base BERT model using Supervised Contrastive Learning (SCL) to create better document embeddings.
  2. Phase 2 (Classification): Fine-tuning with Asymmetric Loss for handling class imbalance.

  • Developed by: Archeiothiki S.A. - AI Services Team
  • Model type: BertForSequenceClassification
  • Language(s): Greek (el)
  • Finetuned from model: nlpaueb/bert-base-greek-uncased-v1

Model Architecture

  • Base Model: nlpaueb/bert-base-greek-uncased-v1
  • Kept Layers: [0, 2, 4, 6, 8, 11] (6 of the 12 encoder layers retained for efficiency)
  • Hidden Size: 768
  • Attention Heads: 12
  • Max Position Embeddings: 512
  • Vocab Size: 35,000
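The layer pruning above can be sketched with 🤗 Transformers: build a 12-layer BERT encoder and keep only the listed layers. For illustration this uses a randomly initialized config with the same shape hyperparameters rather than downloading nlpaueb/bert-base-greek-uncased-v1; the real pipeline would load the pretrained weights first.

```python
import torch.nn as nn
from transformers import BertConfig, BertModel

# Randomly initialized stand-in with the architecture listed above.
config = BertConfig(
    vocab_size=35000,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=12,
    max_position_embeddings=512,
)
model = BertModel(config)

# Keep only these encoder layers, dropping the rest.
LAYERS_TO_KEEP = [0, 2, 4, 6, 8, 11]
model.encoder.layer = nn.ModuleList(
    [model.encoder.layer[i] for i in LAYERS_TO_KEEP]
)
model.config.num_hidden_layers = len(LAYERS_TO_KEEP)

print(len(model.encoder.layer))  # -> 6
```

Pruning roughly halves inference cost while keeping the embedding layer and the top of the encoder intact.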

Uses

Direct Use

This model classifies document pages (text extracted via OCR) into one of 18 categories:

| ID | Class Label | Description |
|----|-------------|-------------|
| 0 | AA_AADE_OTHER | Other AADE documents |
| 1 | AA_Certificate_of_Current_Image_of_Entity | Business/Partnership Certificates |
| 2 | AA_ENERGY | Energy bills |
| 3 | AA_Employer's_Certificate/Payroll | Employment certificates |
| 4 | AA_ID_Card | Identity cards |
| 5 | AA_INCOME_TAX_RETURN_-_E1 | Income tax return (E1 form) |
| 6 | AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS | Legal entity tax returns (N form) |
| 7 | AA_LEGAL_ENTITY_MINUTES | General Assembly/Board minutes |
| 8 | AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION | Articles of association |
| 9 | AA_LEGAL_ENT_CERTIFICATE | Commercial Registry certificates |
| 10 | AA_NEW_POLICE_IDENTITY_CARD | New police ID cards |
| 11 | AA_Natural_Person_Information_Form | Ownership certificates |
| 12 | AA_Pension_Certificate | Pension certificates |
| 13 | AA_Personal_Income_Tax_(FEP) | Personal income tax (FEP) |
| 14 | AA_SOLEMN_DECLARATION | Solemn declarations |
| 15 | AA_TELEPHONY | Phone bills |
| 16 | BB_Other_Documents | Other identifiable documents |
| 17 | Other | Unclassified pages |

How to Get Started with the Model

Prerequisites

pip install transformers torch

Preprocessing Function (Required!)

⚠️ IMPORTANT: This preprocessing MUST be applied to all texts before inference. The model was trained with this preprocessing.

import re
import unicodedata
from typing import Optional  # X | Y syntax needs Python 3.10+; this card targets 3.9

# Same symbols removed during training
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"

def strip_accents_and_lowercase(text: str) -> str:
    """Remove accents and convert to lowercase."""
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()

def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    """
    Main preprocessing function.

    Steps:
        1. Remove special symbols
        2. Collapse multiple dots into single dot
        3. Remove accents + lowercase
        4. Normalize whitespace
    """
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)

    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)
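As a quick sanity check, the cleaning logic above can be condensed into one self-contained function and run on a short hypothetical Greek input (the sample string is illustrative, not from the training data):

```python
import re
import unicodedata

SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"

def preprocess_text(text: str) -> str:
    text = re.sub(SYMBOLS_TO_REMOVE, " ", text)   # strip special symbols
    text = re.sub(r"\.{2,}", ". ", text)          # collapse runs of dots
    text = "".join(                               # strip accents, lowercase
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()
    return re.sub(r"\s+", " ", text).strip()      # normalize whitespace

print(preprocess_text("Αίτηση   (Νο: 123)!!"))  # -> "αιτηση νο 123"
```

Note how the accented ί becomes a plain ι and all punctuation disappears; feeding unpreprocessed text to the model will degrade predictions.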

Inference Code Snippet (includes preprocessing + example texts)

import json
import re
import unicodedata
from typing import Optional  # X | Y syntax needs Python 3.10+; this card targets 3.9

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Preprocessing (REQUIRED!)
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"

def strip_accents_and_lowercase(text: str) -> str:
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()

def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)

# Load model and tokenizer
MODEL_PATH = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

# Load label mapping
with open(f"{MODEL_PATH}/id2label.json", "r", encoding="utf-8") as f:
    id2label = json.load(f)

# Example texts (in practice, OCR output from document pages)
texts = [
    "ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ",
    "ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024",
]

# Preprocess texts
preprocessed_texts = [preprocess_text(t) for t in texts]

# Tokenize
inputs = tokenizer(
    preprocessed_texts,
    truncation=True,
    padding="max_length",
    max_length=512,
    return_tensors="pt"
)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.sigmoid(logits)  # sigmoid matches the Asymmetric Loss training objective
    predictions = probabilities.argmax(dim=1)  # single-label prediction per page

# Get labels
for i, pred in enumerate(predictions):
    label = id2label[str(pred.item())]
    confidence = probabilities[i][pred].item()
    print(f"Text: {texts[i][:50]}...")
    print(f"Prediction: {label} (confidence: {confidence:.4f})")
    print()

Expected Output

Text: ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ...
Prediction: AA_ID_Card (confidence: 0.9842)

Text: ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024...
Prediction: AA_INCOME_TAX_RETURN_-_E1 (confidence: 0.9567)
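Because training used a sigmoid-based Asymmetric Loss, the per-class scores are independent and need not sum to 1. If a normalized distribution over the 18 classes is preferred for reporting, softmax can be applied to the same logits; since both functions are monotonic in the logits, the argmax prediction is unchanged. A small sketch with dummy logits:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(1, 18)  # dummy logits for one page (18 classes)

sigmoid_scores = torch.sigmoid(logits)        # independent per-class scores, as trained
softmax_probs = torch.softmax(logits, dim=1)  # normalized distribution, sums to 1

# The predicted class is identical either way.
assert sigmoid_scores.argmax(dim=1).item() == softmax_probs.argmax(dim=1).item()
```

Only the reported "confidence" value differs between the two readouts, not the predicted label.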

Training Details

Training Data

  • Dataset: Internal annotated document dataset
  • Total Samples: ~6,600 (train + validation)
  • Test Samples: 1,336
  • Classes: 18 (imbalanced distribution)
  • Largest Class: Other (571 test samples, ~43%)
  • Smallest Class: AA_LEGAL_ENTITY_MINUTES (7 test samples, ~0.5%)

Training Procedure

Phase 1: Contrastive Learning

  • Base Model: nlpaueb/bert-base-greek-uncased-v1
  • Loss Function: Supervised Contrastive Loss (SCL)
  • Epochs: 200
  • Learning Rate: 2e-5
  • Batch Size: 32
  • Layer Pruning: Kept layers [0, 2, 4, 6, 8, 11]
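Supervised Contrastive Loss pulls together embeddings of pages with the same label and pushes apart those with different labels. A minimal batchwise sketch following the standard SupCon formulation (the temperature value here is a common default, not necessarily the one used in this training run):

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.07):
    """SupCon loss over one batch of document embeddings."""
    z = F.normalize(embeddings, dim=1)                 # cosine-similarity space
    sim = z @ z.t() / temperature                      # (N, N) similarity matrix
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))    # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)    # avoid -inf * 0 below
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                             # anchors with >= 1 positive
    loss = -(log_prob * pos_mask).sum(dim=1)[valid] / pos_counts[valid]
    return loss.mean()

# Toy batch: 4 embeddings from two classes
emb = torch.randn(4, 768)
labels = torch.tensor([0, 0, 1, 1])
loss = supervised_contrastive_loss(emb, labels)
```

In Phase 1 this objective replaces the classification head entirely; the fine-tuned encoder from this phase is what Phase 2 starts from.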

Phase 2: Classification

  • Base Model: Output of Phase 1 (26_01_2026_15_00_12)
  • Loss Function: Asymmetric Loss (gamma=4)
  • Epochs: 50
  • Learning Rate: 1e-4
  • Batch Size: 32
  • Gradient Accumulation: 2
  • Warmup Ratio: 0.1
  • LR Scheduler: Cosine
  • Oversampling: BB_Other_Documents (x2)
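Asymmetric Loss down-weights easy negatives so the many "Other" pages do not swamp the rare classes; the negative focusing parameter gamma (set to 4 above) controls how aggressively. A minimal sketch assuming the standard formulation of Ben-Baruch et al. with one-hot targets for this single-label setup (the clip value is a common default, not confirmed from this training run):

```python
import torch

def asymmetric_loss(logits, targets, gamma_neg=4.0, gamma_pos=0.0,
                    clip=0.05, eps=1e-8):
    """Asymmetric Loss on sigmoid probabilities; targets are one-hot floats."""
    p = torch.sigmoid(logits)
    p_neg = (p - clip).clamp(min=0)  # probability shifting for negatives
    loss_pos = targets * (1 - p) ** gamma_pos * torch.log(p.clamp(min=eps))
    loss_neg = (1 - targets) * p_neg ** gamma_neg * torch.log((1 - p_neg).clamp(min=eps))
    return -(loss_pos + loss_neg).sum(dim=1).mean()

# Toy batch: 2 samples over the 18 classes
logits = torch.randn(2, 18)
targets = torch.nn.functional.one_hot(torch.tensor([5, 17]), num_classes=18).float()
loss = asymmetric_loss(logits, targets)
```

The gamma_neg exponent makes confidently rejected negatives contribute almost nothing, so gradient signal concentrates on positives and hard negatives.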

Framework Versions

  • Python: 3.9.0
  • PyTorch: 2.x
  • Transformers: 4.38.2
  • Datasets: 2.x

Evaluation Results

Overall Metrics (Test Set: 1,336 samples)

| Metric | Score |
|--------|-------|
| Accuracy | 0.94 |
| Macro F1 | 0.92 |
| Weighted F1 | 0.94 |
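For context: macro F1 averages the per-class F1 scores with equal weight, while weighted F1 weights each class by its support. With "Other" making up ~43% of the test set, the gap between the two reflects performance on the smaller classes. A toy illustration with made-up numbers (not from this model):

```python
# Hypothetical per-class F1 scores: one large easy class, one small hard class
f1_scores = [0.95, 0.60]
supports = [900, 100]

macro_f1 = sum(f1_scores) / len(f1_scores)  # every class counts equally
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(round(macro_f1, 3), round(weighted_f1, 3))  # -> 0.775 0.915
```

Here the rare class drags macro F1 well below weighted F1; the small 0.02 gap in this model's scores indicates the minority classes are handled nearly as well as the majority.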

Per-Class Performance

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| AA_AADE_OTHER | 0.89 | 0.89 | 0.89 | 9 |
| AA_Certificate_of_Current_Image | 1.00 | 1.00 | 1.00 | 10 |
| AA_ENERGY | 0.92 | 0.89 | 0.91 | 27 |
| AA_Employer's_Certificate/Payroll | 0.86 | 0.97 | 0.92 | 39 |
| AA_ID_Card | 1.00 | 0.99 | 1.00 | 190 |
| AA_INCOME_TAX_RETURN_-_E1 | 0.92 | 0.86 | 0.89 | 77 |
| AA_INCOME_TAX_RETURN_LEGAL | 1.00 | 1.00 | 1.00 | 8 |
| AA_LEGAL_ENTITY_MINUTES | 1.00 | 1.00 | 1.00 | 7 |
| AA_LEGAL_ENT_ARTICLES | 0.80 | 1.00 | 0.89 | 8 |
| AA_LEGAL_ENT_CERTIFICATE | 0.71 | 0.88 | 0.79 | 17 |
| AA_NEW_POLICE_IDENTITY_CARD | 0.96 | 1.00 | 0.98 | 26 |
| AA_Natural_Person_Form | 0.90 | 0.93 | 0.92 | 30 |
| AA_Pension_Certificate | 0.92 | 0.95 | 0.93 | 74 |
| AA_Personal_Income_Tax_(FEP) | 1.00 | 0.94 | 0.97 | 147 |
| AA_SOLEMN_DECLARATION | 0.80 | 0.89 | 0.84 | 9 |
| AA_TELEPHONY | 0.97 | 0.92 | 0.94 | 65 |
| BB_Other_Documents | 0.82 | 0.64 | 0.72 | 22 |
| Other | 0.94 | 0.95 | 0.95 | 571 |

Key Performance Highlights

  • ✅ Other class: F1 = 0.95 (strong handling of the majority class)
  • ✅ BB_Other_Documents: F1 = 0.72 (best result on this rare class among the model variants trained internally)
  • ✅ High-confidence classes: AA_ID_Card, AA_Certificate_of_Current_Image, AA_INCOME_TAX_RETURN_LEGAL, and AA_LEGAL_ENTITY_MINUTES all reach F1 = 1.00
  • ⚠️ Lower performance: AA_LEGAL_ENT_CERTIFICATE (F1 = 0.79) would likely benefit from more training data

Model Files

| File | Description | Required |
|------|-------------|----------|
| model.safetensors | Model weights | ✅ Yes |
| config.json | Model architecture + id2label/label2id | ✅ Yes |
| tokenizer.json | Tokenizer | ✅ Yes |
| tokenizer_config.json | Tokenizer config | ✅ Yes |
| vocab.txt | Vocabulary | ✅ Yes |
| special_tokens_map.json | Special tokens | ✅ Yes |
| id2label.json | ID to label mapping | ✅ Yes |
| label2id.json | Label to ID mapping | ✅ Yes |
| test_report.txt | Classification report | Optional |

Model Card Authors

AI Services Team - Archeiothiki S.A.

Model Card Contact

Internal use only.

Model Size

70.4M parameters (F32, safetensors)