mini-ocr — Khmer & English Text Recognition

A lightweight CRNN (CNN + Bi-LSTM) model trained to recognise Khmer and English text in image crops.
Its CTC head lets it decode variable-length text without character-level segmentation.

Model Architecture

Component	Details
CNN backbone	6 × Conv-BN-ReLU blocks with MaxPool
Recurrent	2 × Bi-LSTM (hidden = 256) with a linear bridge
Output	Linear layer → `NUM_CHARS + 1` CTC classes (blank index = 0)
Input	Greyscale image, height normalised to 32 px, width variable
Vocabulary	222 characters — lowercase/uppercase Latin, digits, Khmer consonants, vowels, diacritics, punctuation
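
As a rough illustration of how the components in the table fit together, here is a minimal PyTorch sketch. It is not the author's class definition: the name `CRNNSketch`, the channel widths, and the pooling layout are all assumptions chosen so that a height-32 input collapses to height 1, and its parameter names will most likely not match the keys in `model.pt`.

```python
import torch
import torch.nn as nn


def conv_block(c_in, c_out, pool=None):
    """Conv-BN-ReLU, optionally followed by MaxPool."""
    layers = [
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    ]
    if pool is not None:
        layers.append(nn.MaxPool2d(kernel_size=pool))
    return nn.Sequential(*layers)


class CRNNSketch(nn.Module):
    """Illustrative CRNN; channel widths and pooling layout are assumptions."""

    def __init__(self, num_chars=222, hidden=256):
        super().__init__()
        # Pooling takes the height 32 -> 16 -> 8 -> 4 -> 1 and divides the width by 4.
        self.cnn = nn.Sequential(
            conv_block(1, 64, pool=(2, 2)),
            conv_block(64, 128, pool=(2, 2)),
            conv_block(128, 256),
            conv_block(256, 256, pool=(2, 1)),
            conv_block(256, 512),
            conv_block(512, 512, pool=(4, 1)),
        )
        self.rnn1 = nn.LSTM(512, hidden, bidirectional=True)
        self.bridge = nn.Linear(2 * hidden, hidden)  # linear bridge between the LSTMs
        self.rnn2 = nn.LSTM(hidden, hidden, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_chars + 1)  # +1 for the CTC blank (index 0)

    def forward(self, x):
        feats = self.cnn(x)              # (B, 512, 1, W // 4)
        feats = feats.squeeze(2)         # (B, 512, T)
        feats = feats.permute(2, 0, 1)   # (T, B, 512)
        out, _ = self.rnn1(feats)
        out = self.bridge(out)
        out, _ = self.rnn2(out)
        return self.head(out)            # (T, B, num_chars + 1)
```

With this layout, a batch of shape `(B, 1, 32, W)` yields one output step per four input columns, which is the `(T, B, C)` shape the greedy decoder in the Quick Start indexes into.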

Files

File	Description
`model.pt`	`state_dict`; loading it requires the matching model class definition
`model_scripted.pt`	TorchScript version — no class definition needed
`vocab.txt`	One character per line, index = line number (1-based)
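
If you prefer to build the index-to-character map from `vocab.txt` rather than hard-coding the token string, something like the following should work. The helper name `load_vocab` is hypothetical, and the sketch assumes the file is UTF-8 with exactly one character per line, as the table describes.

```python
from pathlib import Path


def load_vocab(path):
    """Map 1-based line numbers to characters; index 0 is left for the CTC blank."""
    lines = Path(path).read_text(encoding="utf-8").split("\n")
    # Drop the single empty entry produced by a trailing newline.
    if lines and lines[-1] == "":
        lines.pop()
    # Strip only a carriage return: the vocabulary includes a space character,
    # so a blanket strip() would erase that entry.
    return {i: line.rstrip("\r") for i, line in enumerate(lines, start=1)}
```

Calling `load_vocab("vocab.txt")` on a downloaded copy would then give a dictionary usable in place of `idx2char` in the Quick Start.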

Quick Start

Install dependencies

pip install torch torchvision pillow numpy huggingface_hub

import torch
import numpy as np
from PIL import Image
from huggingface_hub import hf_hub_download

TOKENS = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "កខគឃងចឆជឈញដឋឌឍណតថទធនបផពភមយរលវឝឞសហឡអឣឤឥឦឧឩឪឫឬឭឮឯឰឱឲឳ"
    "ាិីឹឺុូួើឿៀេែៃោៅំះៈ៉៊់៌៍៎៏័៑្។៕៖ៗ៘៛៝"
    "០១២៣៤៥៦៧៨៩៳"
    "!@#$%^&*()-_=+[]{};:'\",.<>?/|\\ "
)
idx2char = {i + 1: c for i, c in enumerate(TOKENS)}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

scripted_path = hf_hub_download(repo_id="phonsobon/mini-ocr", filename="model_scripted.pt")
model = torch.jit.load(scripted_path, map_location=device)
model.eval()

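def load_image(path):
    # Greyscale, then scale to height 32 while preserving the aspect ratio.
    img = Image.open(path).convert("L")
    w, h = img.size
    new_w = max(1, int(w / h * 32))  # guard against a zero-pixel width
    img = img.resize((new_w, 32))
    img = np.array(img, dtype=np.float32) / 255.0  # normalise to [0, 1]
    return torch.tensor(img).unsqueeze(0).unsqueeze(0)  # shape (1, 1, 32, W)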

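def ctc_decode(logits):
    # Greedy decode of the first batch item; the indexing assumes logits shaped (T, B, C).
    preds = torch.argmax(logits, dim=2)[:, 0].cpu().tolist()
    prev, text = -1, []
    for p in preds:
        # Keep a label only if it differs from the previous step and is not blank (0).
        if p != prev and p != 0:
            text.append(idx2char.get(p, ""))
        prev = p
    return "".join(text)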

img = load_image("your_image.png").to(device)
with torch.no_grad():
    result = ctc_decode(model(img))
print("OCR result:", result)
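
To make the collapse rule in `ctc_decode` concrete, here is a small torch-free worked example. The function name `greedy_collapse` and the 8-step index sequence are made up for illustration; the mapping 1 → `a`, 2 → `b` follows the 1-based indexing of `TOKENS` above.

```python
def greedy_collapse(indices, blank=0):
    """Merge consecutive repeats, then drop blanks (greedy CTC decoding)."""
    out, prev = [], None
    for idx in indices:
        if idx != prev and idx != blank:
            out.append(idx)
        prev = idx
    return out


# Hypothetical per-timestep argmax for an 8-step output.
steps = [1, 1, 0, 1, 2, 2, 0, 0]
collapsed = greedy_collapse(steps)
print(collapsed)  # [1, 1, 2]

sample_idx2char = {1: "a", 2: "b"}
text = "".join(sample_idx2char[i] for i in collapsed)
print(text)  # aab
```

The blank between the two runs of `1` is what allows a doubled letter to survive: without it, the repeated indices would merge into a single `a`.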

Input Format

Single text-line image (word, phrase, or a short line of text)
Converted to greyscale during preprocessing (`load_image` in the Quick Start)
Height resized to 32 px; width scales proportionally
Values normalised to [0, 1]

For full-document OCR, first crop individual text lines, then pass each crop to the model.
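
One simple way to obtain line crops from a clean, unskewed page is a horizontal projection profile: mark every pixel row that contains ink and group consecutive marked rows into lines. The sketch below is only an assumption about how this step could be done, not something shipped with the model; `find_text_lines`, `ink_threshold`, and `min_height` are hypothetical names, and the input is assumed to be a 2-D array in [0, 1] with dark text on a light background.

```python
import numpy as np


def find_text_lines(page, ink_threshold=0.5, min_height=3):
    """Return (top, bottom) row ranges that contain ink; bottom is exclusive."""
    has_ink = (page < ink_threshold).any(axis=1)  # one flag per pixel row
    lines, start = [], None
    for y, flag in enumerate(has_ink):
        if flag and start is None:
            start = y                             # a line begins
        elif not flag and start is not None:
            if y - start >= min_height:
                lines.append((start, y))          # a line ends
            start = None
    # Close a line that runs to the bottom edge of the page.
    if start is not None and len(has_ink) - start >= min_height:
        lines.append((start, len(has_ink)))
    return lines


# Synthetic page: white background with two dark bands standing in for text lines.
page = np.ones((40, 50), dtype=np.float32)
page[5:15, 10:40] = 0.0
page[25:35, 5:45] = 0.0
crops = [page[top:bottom] for top, bottom in find_text_lines(page)]
```

Each crop can then be resized to height 32 and passed to the model. This approach tends to break on skewed scans or touching lines, and marks separated from the main line by blank rows may be dropped by `min_height`, so a dedicated text-line detector may be the better choice for real documents.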

Training Details

Setting	Value
Epochs	50
Optimizer	Adam, lr = 1e-4
Loss	CTC (`blank = 0`, `zero_infinity = True`)
Image height	32 px
Dataset	Synthetic — rendered from a vocabulary text file across multiple fonts with noise augmentation (Gaussian, salt-and-pepper, blur, JPEG compression)
Train / Valid / Test split	80 / 10 / 10
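
The loss and optimizer settings in the table correspond to a setup along these lines. This is a sketch of a single optimisation step on random data, not the author's training script: the `nn.Linear` stand-in model, the feature size of 16, and the batch dimensions are placeholders.

```python
import torch
import torch.nn as nn

NUM_CHARS = 222
T, B, S = 25, 4, 10  # time steps, batch size, target length for the dummy batch

# Stand-in for the recogniser: any module emitting (T, B, NUM_CHARS + 1) logits.
model = nn.Linear(16, NUM_CHARS + 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CTCLoss(blank=0, zero_infinity=True)

features = torch.randn(T, B, 16)                   # dummy per-timestep features
targets = torch.randint(1, NUM_CHARS + 1, (B, S))  # labels are 1-based; 0 is the blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), S, dtype=torch.long)

# CTCLoss expects log-probabilities of shape (T, B, C).
log_probs = model(features).log_softmax(dim=2)
loss = criterion(log_probs, targets, input_lengths, target_lengths)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

`zero_infinity=True` replaces infinite losses, which arise when a target is too long to align with the available time steps, with zero so that one bad sample does not derail the update.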

Limitations

Designed for single text-line crops, not full documents or paragraphs.
Performance may degrade on handwritten text (trained on synthetic rendered images).
Very small fonts (< 10 px rendered height) may produce errors.

License

MIT
