mini-ocr: Khmer & English Text Recognition

A lightweight CRNN (CNN + Bi-LSTM) model trained to recognise Khmer and English text from image crops.
It uses a CTC head, so it can transcribe variable-length text without character-level segmentation.


Model Architecture

| Component | Details |
|---|---|
| CNN backbone | 6 × Conv-BN-ReLU blocks with MaxPool |
| Recurrent | 2 × Bi-LSTM (hidden = 256) with a linear bridge |
| Output | CTC linear → NUM_CHARS + 1 (blank = 0) |
| Input | Greyscale image, height normalised to 32 px, width variable |
| Vocabulary | 222 characters: lowercase/uppercase Latin, digits, Khmer consonants, vowels, diacritics, punctuation |

Files

| File | Description |
|---|---|
| model.pt | state_dict; load with the class definition below |
| model_scripted.pt | TorchScript version; no class definition needed |
| vocab.txt | One character per line, index = line number (1-based) |
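
The vocab.txt layout described above (one character per line, 1-based index, 0 left for the CTC blank) could be read along the following lines. This is a minimal sketch: `load_vocab` is a hypothetical helper, and the in-memory sample stands in for the real file, whose exact contents and ordering are assumed rather than checked here.

```python
import io

def load_vocab(lines):
    """Map 1-based line numbers to characters; index 0 stays free for the CTC blank."""
    idx2char = {}
    for line_no, line in enumerate(lines, start=1):
        # Strip only the line terminator: the vocabulary includes a space
        # character, which str.strip() would silently erase.
        idx2char[line_no] = line.rstrip("\r\n")
    return idx2char

# Stand-in for open("vocab.txt", encoding="utf-8"); the sample is illustrative only.
sample = io.StringIO("a\nb\n \nក\n")
idx2char = load_vocab(sample)
print(idx2char)
```

With the real file, passing `open(path, encoding="utf-8")` instead of the sample should yield one entry per vocabulary character.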

Quick Start

Install dependencies

pip install torch torchvision pillow numpy huggingface_hub

import torch
import numpy as np
from PIL import Image
from huggingface_hub import hf_hub_download

TOKENS = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "កខគឃងចឆជឈញដឋឌឍណតថទធនបផពភមយរលវឝឞសហឡអឣឤឥឦឧឩឪឫឬឭឮឯឰឱឲឳ"
    "ាិីឹឺុូួើឿៀេែៃោៅំះៈ៉៊់៌៍៎៏័៑្។៕៖ៗ៘៛៝"
    "០១២៣៤៥៦៧៨៩៳"
    "!@#$%^&*()-_=+[]{};:'\",.<>?/|\\ "
)
idx2char = {i + 1: c for i, c in enumerate(TOKENS)}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

scripted_path = hf_hub_download(repo_id="phonsobon/mini-ocr", filename="model_scripted.pt")
model = torch.jit.load(scripted_path, map_location=device)
model.eval()

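def load_image(path):
    # Greyscale, height fixed at 32 px, width scaled to keep the aspect ratio.
    img = Image.open(path).convert("L")
    w, h = img.size
    img = img.resize((max(1, int(w / h * 32)), 32))
    img = np.array(img, dtype=np.float32) / 255.0  # scale pixels to [0, 1]
    return torch.from_numpy(img).unsqueeze(0).unsqueeze(0)  # shape (1, 1, 32, W)

def ctc_decode(logits):
    # Greedy CTC decoding; assumes logits of shape (T, batch, classes) and reads batch item 0.
    preds = torch.argmax(logits, dim=2)[:, 0].tolist()
    prev, text = -1, []
    for p in preds:
        # Keep a label only if it differs from the previous frame and is not the blank (0).
        if p != prev and p != 0:
            text.append(idx2char.get(p, ""))
        prev = p
    return "".join(text)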

img = load_image("your_image.png").to(device)
with torch.no_grad():
    result = ctc_decode(model(img))
print("OCR result:", result)
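
The greedy rule inside `ctc_decode` (collapse consecutive repeats, then drop blanks) can be checked on a toy sequence without loading the model. The sketch below uses a made-up three-character vocabulary and hand-written per-frame indices; `greedy_ctc_collapse` is a hypothetical name, not part of the released code.

```python
def greedy_ctc_collapse(indices, blank=0):
    """Collapse consecutive repeats, then drop blanks (greedy CTC decoding)."""
    out = []
    prev = None
    for i in indices:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

toy_idx2char = {1: "a", 2: "b", 3: "c"}      # toy vocabulary, index 0 = blank
frames = [0, 1, 1, 0, 1, 2, 2, 0, 0, 3]      # per-frame argmax indices
labels = greedy_ctc_collapse(frames)
text = "".join(toy_idx2char[i] for i in labels)
print(labels, text)  # [1, 1, 2, 3] aabc
```

The blank between the two runs of `1` is what allows a doubled letter ("aa") to survive the collapse step.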

Input Format

  • Single text-line image (word, phrase, or a short line of text)
  • Converted to greyscale internally
  • Height resized to 32 px; width scales proportionally
  • Values normalised to [0, 1]
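
To make the resize rule concrete, here is a small sketch of the width calculation for a few crop sizes. `target_size` is a hypothetical helper; it truncates like the `int()` call in the Quick Start, and the clamp to a minimum width of 1 px is a defensive assumption.

```python
def target_size(width, height, target_height=32):
    """Return (new_width, target_height) preserving the aspect ratio."""
    # Integer arithmetic truncates toward zero for positive sizes.
    new_width = max(1, width * target_height // height)
    return new_width, target_height

print(target_size(200, 50))   # (128, 32)
print(target_size(640, 48))   # (426, 32)
print(target_size(2, 100))    # (1, 32)  clamped: the raw width would be 0
```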

For full-document OCR, first crop individual text lines, then pass each crop to the model.
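
The model card does not say how to find the lines. One naive option, shown here purely as an illustration, is a horizontal projection profile: rows containing dark pixels are grouped into bands, and each band becomes a crop. The sketch assumes a clean, unskewed, dark-on-light page stored as a list of rows with values in [0, 1]; `find_line_bands` and its thresholds are hypothetical, and real scans usually need a proper layout or text-detection step.

```python
def find_line_bands(image, ink_threshold=0.5, min_ink_pixels=1):
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink."""
    bands = []
    start = None
    for y, row in enumerate(image):
        ink = sum(1 for v in row if v < ink_threshold)  # count dark pixels
        has_ink = ink >= min_ink_pixels
        if has_ink and start is None:
            start = y                      # a text band begins
        elif not has_ink and start is not None:
            bands.append((start, y))       # the band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(image)))  # band runs to the bottom edge
    return bands

W, K = 1.0, 0.0  # white background, black ink
page = [
    [W, W, W, W],
    [W, K, K, W],
    [W, K, W, W],
    [W, W, W, W],
    [W, W, W, W],
    [K, K, W, W],
]
bands = find_line_bands(page)
crops = [page[top:bottom] for top, bottom in bands]
print(bands)  # [(1, 3), (5, 6)]
```

Each crop would then be resized to 32 px height and passed to the model individually.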


Training Details

| Setting | Value |
|---|---|
| Epochs | 50 |
| Optimizer | Adam, lr = 1e-4 |
| Loss | CTC (blank = 0, zero_infinity = True) |
| Image height | 32 px |
| Dataset | Synthetic: rendered from a vocabulary text file across multiple fonts with noise augmentation (Gaussian, salt-and-pepper, blur, JPEG compression) |
| Train / Valid / Test split | 80 / 10 / 10 % |
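
As a rough illustration of how the loss and optimizer settings above fit together in PyTorch, the sketch below runs a single optimisation step. It is not the author's training script: random logits stand in for the model output, and the sequence length, batch size, and target lengths are invented for the example.

```python
import torch
import torch.nn as nn

NUM_CHARS = 222   # vocabulary size stated in the model card
T, B = 40, 2      # hypothetical: time steps and batch size

torch.manual_seed(0)
# Stand-in for the model output: raw logits of shape (T, B, NUM_CHARS + 1).
logits = torch.randn(T, B, NUM_CHARS + 1, requires_grad=True)

# Label indices run from 1 to NUM_CHARS; 0 is reserved for the CTC blank.
targets = torch.randint(1, NUM_CHARS + 1, (B, 10), dtype=torch.long)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.tensor([10, 7], dtype=torch.long)  # second target is shorter

criterion = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.Adam([logits], lr=1e-4)

optimizer.zero_grad()
log_probs = logits.log_softmax(dim=2)  # CTCLoss expects log-probabilities
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()
optimizer.step()
print(float(loss))
```

`zero_infinity=True` replaces infinite losses, which occur when a target cannot be aligned to the available time steps, with zero so that a single bad sample does not poison the gradients.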

Limitations

  • Designed for single text-line crops, not full documents or paragraphs.
  • Performance may degrade on handwritten text (trained on synthetic rendered images).
  • Very small fonts (< 10 px rendered height) may produce errors.

License

MIT
