# Arch-L3869-PageClassification

## Model Details

### Model Description
This is a Greek text classification model for categorizing document pages into 18 different classes. The model was trained using a two-phase approach:
- Phase 1 (Contrastive Learning): Further pre-training of the base BERT model using Supervised Contrastive Learning (SCL) to create better document embeddings.
- Phase 2 (Classification): Fine-tuning with Asymmetric Loss for handling class imbalance.
- Developed by: Archeiothiki S.A. - AI Services Team
- Model type: BertForSequenceClassification
- Language(s): Greek (el)
- Finetuned from model: nlpaueb/bert-base-greek-uncased-v1
### Model Architecture
- Base Model: nlpaueb/bert-base-greek-uncased-v1
- Retained Layers: [0, 2, 4, 6, 8, 11] (6 of the original 12 layers kept for efficiency)
- Hidden Size: 768
- Attention Heads: 12
- Max Position Embeddings: 512
- Vocab Size: 35,000
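The layer pruning described above can be sketched as follows. This is an illustrative reconstruction, not the original training code: it builds a randomly initialised BERT with the card's dimensions (the real model starts from `nlpaueb/bert-base-greek-uncased-v1`) and keeps only the listed encoder layers.

```python
import torch
from transformers import BertConfig, BertModel

# Encoder layers retained after pruning (from the model card)
LAYERS_TO_KEEP = [0, 2, 4, 6, 8, 11]

# Randomly initialised stand-in with the card's dimensions
config = BertConfig(
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=512,
    vocab_size=35000,
)
model = BertModel(config)

# Keep only the selected encoder layers and update the config to match
model.encoder.layer = torch.nn.ModuleList(
    model.encoder.layer[i] for i in LAYERS_TO_KEEP
)
model.config.num_hidden_layers = len(LAYERS_TO_KEEP)

print(model.config.num_hidden_layers)  # 6
```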
## Uses

### Direct Use
This model classifies document pages (text extracted via OCR) into one of 18 categories:
| ID | Class Label | Description |
|---|---|---|
| 0 | AA_AADE_OTHER | Other AADE documents |
| 1 | AA_Certificate_of_Current_Image_of_Entity | Business/Partnership Certificates |
| 2 | AA_ENERGY | Energy bills |
| 3 | AA_Employer's_Certificate/Payroll | Employment certificates |
| 4 | AA_ID_Card | Identity cards |
| 5 | AA_INCOME_TAX_RETURN_-_E1 | Income tax return (E1 form) |
| 6 | AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS | Legal entity tax returns (N form) |
| 7 | AA_LEGAL_ENTITY_MINUTES | General Assembly/Board minutes |
| 8 | AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION | Articles of association |
| 9 | AA_LEGAL_ENT_CERTIFICATE | Commercial Registry certificates |
| 10 | AA_NEW_POLICE_IDENTITY_CARD | New police ID cards |
| 11 | AA_Natural_Person_Information_Form | Ownership certificates |
| 12 | AA_Pension_Certificate | Pension certificates |
| 13 | AA_Personal_Income_Tax_(FEP) | Personal income tax (FEP) |
| 14 | AA_SOLEMN_DECLARATION | Solemn declarations |
| 15 | AA_TELEPHONY | Phone bills |
| 16 | BB_Other_Documents | Other identifiable documents |
| 17 | Other | Unclassified pages |
## How to Get Started with the Model

### Prerequisites

```bash
pip install transformers torch
```

### Preprocessing Function (Required!)

⚠️ IMPORTANT: This preprocessing MUST be applied to all texts before inference. The model was trained with this preprocessing.
```python
import re
import unicodedata
from typing import Optional  # Optional[...] keeps the code compatible with Python 3.9

# Same symbols removed during training
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"

def strip_accents_and_lowercase(text: str) -> str:
    """Remove accents and convert to lowercase."""
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()

def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    """
    Main preprocessing function.

    Steps:
    1. Remove special symbols
    2. Collapse multiple dots into a single dot
    3. Remove accents + lowercase
    4. Normalize whitespace
    """
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)
```
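As a quick sanity check, the pipeline can be exercised end to end. The helpers are repeated (with a deliberately shortened symbol set for the demo) so the snippet runs standalone; use `SYMBOLS_TO_REMOVE` from above in practice.

```python
import re
import unicodedata

SYMBOLS = r"[!@#»«]"  # abbreviated demo set; the full SYMBOLS_TO_REMOVE is defined above

def strip_accents_and_lowercase(text):
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()

def clean_text(text, symbols_to_remove=None):
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)   # 1. strip symbols
    text = re.sub(r"\.{2,}", ". ", text)              # 2. collapse dot runs
    text = strip_accents_and_lowercase(text)          # 3. accents + lowercase
    return re.sub(r"\s+", " ", text).strip()          # 4. normalize whitespace

print(clean_text("Δήλωση... Φόρου!!", SYMBOLS))  # δηλωση. φορου
```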
### Inference Code Snippet (includes preprocessing + dummy strings)
```python
import json
import re
import unicodedata
from typing import Optional

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Preprocessing (REQUIRED!)
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"

def strip_accents_and_lowercase(text: str) -> str:
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()

def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)

# Load model and tokenizer
MODEL_PATH = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

# Load label mapping
with open(f"{MODEL_PATH}/id2label.json", "r", encoding="utf-8") as f:
    id2label = json.load(f)

# Dummy texts (examples)
texts = [
    "ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ",
    "ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024",
]

# Preprocess texts
preprocessed_texts = [preprocess_text(t) for t in texts]

# Tokenize
inputs = tokenizer(
    preprocessed_texts,
    truncation=True,
    padding="max_length",
    max_length=512,
    return_tensors="pt",
)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.sigmoid(logits)  # sigmoid scores (multi-label-style head)
    predictions = probabilities.argmax(dim=1)

# Get labels
for i, pred in enumerate(predictions):
    label = id2label[str(pred.item())]
    confidence = probabilities[i][pred].item()
    print(f"Text: {texts[i][:50]}...")
    print(f"Prediction: {label} (confidence: {confidence:.4f})")
    print()
```
### Expected Output

```
Text: ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ...
Prediction: AA_ID_Card (confidence: 0.9842)

Text: ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024...
Prediction: AA_INCOME_TAX_RETURN_-_E1 (confidence: 0.9567)
```
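Because the head produces independent sigmoid scores, a downstream pipeline may prefer to reject low-confidence pages rather than always trusting the argmax. A minimal sketch; the 0.5 threshold, the `predict_with_threshold` helper, and the fallback label choice are illustrative assumptions, not part of the released model:

```python
import torch

def predict_with_threshold(probabilities, id2label, threshold=0.5, fallback="Other"):
    """Pick the argmax label per row, falling back when the top score is too low."""
    results = []
    for row in probabilities:
        idx = int(row.argmax())
        score = float(row[idx])
        results.append((id2label[str(idx)] if score >= threshold else fallback, score))
    return results

# Toy example: two classes and hand-made sigmoid scores
id2label = {"0": "AA_ID_Card", "1": "AA_TELEPHONY"}
probs = torch.tensor([[0.98, 0.10], [0.30, 0.20]])
for label, score in predict_with_threshold(probs, id2label):
    print(label, round(score, 2))  # second row falls below 0.5 -> fallback
```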
## Training Details

### Training Data
- Dataset: Internal annotated document dataset
- Total Samples: ~6,600 (train + validation)
- Test Samples: 1,336
- Classes: 18 (imbalanced distribution)
- Largest Class: Other (571 test samples, ~43%)
- Smallest Class: AA_LEGAL_ENTITY_MINUTES (7 test samples, ~0.5%)
### Training Procedure

#### Phase 1: Contrastive Learning
- Base Model: nlpaueb/bert-base-greek-uncased-v1
- Loss Function: Supervised Contrastive Loss (SCL)
- Epochs: 200
- Learning Rate: 2e-5
- Batch Size: 32
- Layer Pruning: Kept layers [0, 2, 4, 6, 8, 11]
#### Phase 2: Classification
- Base Model: Output of Phase 1 (26_01_2026_15_00_12)
- Loss Function: Asymmetric Loss (gamma=4)
- Epochs: 50
- Learning Rate: 1e-4
- Batch Size: 32
- Gradient Accumulation: 2
- Warmup Ratio: 0.1
- LR Scheduler: Cosine
- Oversampling: BB_Other_Documents (x2)
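The oversampling step can be sketched as simple duplication of the rare class before shuffling. This is an illustrative reconstruction (the `oversample` helper and seed are assumptions), not the original training code:

```python
import random

def oversample(texts, labels, target_label, factor=2, seed=42):
    """Duplicate samples of `target_label` so the class appears `factor` times as often."""
    extra = [(t, l) for t, l in zip(texts, labels) if l == target_label]
    data = list(zip(texts, labels)) + extra * (factor - 1)
    random.Random(seed).shuffle(data)
    return data

data = oversample(
    ["page a", "page b", "page c"],
    ["Other", "BB_Other_Documents", "Other"],
    target_label="BB_Other_Documents",
    factor=2,
)
print(len(data))  # 4: the single BB_Other_Documents sample is duplicated
```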
### Framework Versions
- Python: 3.9.0
- PyTorch: 2.x
- Transformers: 4.38.2
- Datasets: 2.x
## Evaluation Results

### Overall Metrics (Test Set: 1,336 samples)
| Metric | Score |
|---|---|
| Accuracy | 0.94 |
| Macro F1 | 0.92 |
| Weighted F1 | 0.94 |
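For reference, macro F1 averages the per-class F1 scores equally, while weighted F1 weights each class by its support, which is why the large "Other" class pulls the weighted score toward overall accuracy. A minimal illustration with two made-up classes (the numbers below are examples, not the full 18-class computation):

```python
def macro_f1(f1_scores):
    # Unweighted mean: every class counts the same
    return sum(f1_scores) / len(f1_scores)

def weighted_f1(f1_scores, supports):
    # Support-weighted mean: big classes dominate
    total = sum(supports)
    return sum(f * s for f, s in zip(f1_scores, supports)) / total

f1s = [0.72, 0.95]      # a rare class vs. the majority class
supports = [22, 571]
print(round(macro_f1(f1s), 3))              # 0.835
print(round(weighted_f1(f1s, supports), 3)) # 0.941
```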
### Per-Class Performance
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| AA_AADE_OTHER | 0.89 | 0.89 | 0.89 | 9 |
| AA_Certificate_of_Current_Image | 1.00 | 1.00 | 1.00 | 10 |
| AA_ENERGY | 0.92 | 0.89 | 0.91 | 27 |
| AA_Employer's_Certificate/Payroll | 0.86 | 0.97 | 0.92 | 39 |
| AA_ID_Card | 1.00 | 0.99 | 1.00 | 190 |
| AA_INCOME_TAX_RETURN_-_E1 | 0.92 | 0.86 | 0.89 | 77 |
| AA_INCOME_TAX_RETURN_LEGAL | 1.00 | 1.00 | 1.00 | 8 |
| AA_LEGAL_ENTITY_MINUTES | 1.00 | 1.00 | 1.00 | 7 |
| AA_LEGAL_ENT_ARTICLES | 0.80 | 1.00 | 0.89 | 8 |
| AA_LEGAL_ENT_CERTIFICATE | 0.71 | 0.88 | 0.79 | 17 |
| AA_NEW_POLICE_IDENTITY_CARD | 0.96 | 1.00 | 0.98 | 26 |
| AA_Natural_Person_Form | 0.90 | 0.93 | 0.92 | 30 |
| AA_Pension_Certificate | 0.92 | 0.95 | 0.93 | 74 |
| AA_Personal_Income_Tax_(FEP) | 1.00 | 0.94 | 0.97 | 147 |
| AA_SOLEMN_DECLARATION | 0.80 | 0.89 | 0.84 | 9 |
| AA_TELEPHONY | 0.97 | 0.92 | 0.94 | 65 |
| BB_Other_Documents | 0.82 | 0.64 | 0.72 | 22 |
| Other | 0.94 | 0.95 | 0.95 | 571 |
### Key Performance Highlights

- ✅ Other class: F1 = 0.95 (excellent handling of the majority class)
- ✅ BB_Other_Documents: F1 = 0.72 (best among all trained models for this rare class)
- ✅ High-confidence classes: AA_ID_Card, AA_Certificate_of_Current_Image, and AA_LEGAL_ENTITY_MINUTES all reach F1 = 1.00
- ⚠️ Lower performance: AA_LEGAL_ENT_CERTIFICATE (F1 = 0.79) needs more training data
## Model Files

| File | Description | Required |
|---|---|---|
| `model.safetensors` | Model weights | ✅ Yes |
| `config.json` | Model architecture + id2label/label2id | ✅ Yes |
| `tokenizer.json` | Tokenizer | ✅ Yes |
| `tokenizer_config.json` | Tokenizer config | ✅ Yes |
| `vocab.txt` | Vocabulary | ✅ Yes |
| `special_tokens_map.json` | Special tokens | ✅ Yes |
| `id2label.json` | ID to label mapping | ✅ Yes |
| `label2id.json` | Label to ID mapping | ✅ Yes |
| `test_report.txt` | Classification report | Optional |
## Model Card Authors

AI Services Team - Archeiothiki S.A.

## Model Card Contact

Internal use only.