mini-text-detection — Khmer & English Text Detection

A YOLO11n-based text detection model fine-tuned to locate and classify text regions in images containing Khmer and English content.
It detects 3 types of text blocks and can be used as the first stage before passing crops to an OCR model (e.g. phonsobon/mini-ocr).

Model Details

Property	Value
Architecture	YOLO11n (nano)
Task	Object Detection — 3 classes
Weights file	`khmer-text-detection-mini.pt`
Framework	Ultralytics / PyTorch
Input	RGB image, any size (auto-resized internally)

Classes

ID	Name	Khmer	Description
`0`	`subject`	កម្មវត្ថុ	Title or subject heading
`1`	`reference`	យោង	Reference or citation
`2`	`content`	អត្ថបទ	Main body / paragraph text

Files

File	Description
`khmer-text-detection-mini.pt`	Full Ultralytics YOLO model (weights + config)

Quick Start

Install dependencies

pip install ultralytics huggingface_hub

Run inference

from ultralytics import YOLO
from huggingface_hub import hf_hub_download

# ── Download model ────────────────────────────────────────────────────────────
model_path = hf_hub_download(
    repo_id="phonsobon/mini-text-detection",
    filename="khmer-text-detection-mini.pt",
)

# ── Class names ───────────────────────────────────────────────────────────────
CLASS_NAMES = {0: "subject", 1: "reference", 2: "content"}

# ── Load & predict ────────────────────────────────────────────────────────────
model = YOLO(model_path)

results = model.predict(
    source="your_image.jpg",   # path, URL, or numpy array
    conf=0.25,                 # confidence threshold
    iou=0.45,                  # NMS IoU threshold
    imgsz=640,
)

# ── Print results ─────────────────────────────────────────────────────────────
for r in results:
    r.show()                                        # display with bounding boxes
    for box in r.boxes:
        cls_id = int(box.cls)
        label  = CLASS_NAMES[cls_id]
        conf   = float(box.conf)
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"[{label}] conf={conf:.2f}  box=({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")
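The `iou=0.45` argument above is the overlap threshold used during non-maximum suppression: when two boxes of the same class overlap by more than this fraction, only the higher-confidence one is normally kept. As a rough illustration of the overlap measure itself, here is a plain-Python sketch. The helper name `box_iou` is hypothetical and is not part of Ultralytics; it only mirrors the standard intersection-over-union definition.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Two boxes that cover each other halfway score about 0.33, which is below 0.45, so both would survive suppression. Lowering `iou` merges more aggressively, which may help when duplicate boxes appear on the same paragraph.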

Filter by class

# Get only subject (heading) boxes
subject_boxes = [b for b in results[0].boxes if int(b.cls) == 0]

# Get only content (body) boxes
content_boxes = [b for b in results[0].boxes if int(b.cls) == 2]
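If all three classes are needed at once, grouping in a single pass may be more convenient than one list comprehension per class. The sketch below assumes the detections have already been converted to plain `(cls_id, conf, (x1, y1, x2, y2))` tuples; that tuple layout and the helper name `group_by_class` are illustrative assumptions, not something the model or Ultralytics provides.

```python
from collections import defaultdict

CLASS_NAMES = {0: "subject", 1: "reference", 2: "content"}

def group_by_class(detections):
    """Group (cls_id, conf, xyxy) tuples into {class_name: [(conf, xyxy), ...]}."""
    groups = defaultdict(list)
    for cls_id, conf, xyxy in detections:
        groups[CLASS_NAMES[cls_id]].append((conf, xyxy))
    return dict(groups)

# Hypothetical detections, for illustration only.
example = [
    (0, 0.91, (40, 30, 600, 80)),
    (2, 0.88, (40, 120, 600, 400)),
    (2, 0.75, (40, 420, 600, 700)),
]
grouped = group_by_class(example)
```

Classes with no detections are simply absent from the result, so `grouped.get("reference", [])` is a safe way to read a class that may be missing.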

Save annotated images

results = model.predict(source="your_image.jpg", save=True, project="runs/detect")
# Saved to runs/detect/predict/ (later runs go to predict2/, predict3/, ...)

Batch inference on a folder

results = model.predict(source="path/to/images/", conf=0.25, imgsz=640)
for r in results:
    counts = {name: 0 for name in CLASS_NAMES.values()}
    for box in r.boxes:
        counts[CLASS_NAMES[int(box.cls)]] += 1
    print(r.path, "→", counts)

Crop + OCR Pipeline

Combine this model with phonsobon/mini-ocr for full end-to-end document reading, with each region labelled by type:

from ultralytics import YOLO
from huggingface_hub import hf_hub_download
from PIL import Image

CLASS_NAMES = {0: "subject", 1: "reference", 2: "content"}

# ── Load detection model ──────────────────────────────────────────────────────
det_path = hf_hub_download("phonsobon/mini-text-detection", "khmer-text-detection-mini.pt")
detector = YOLO(det_path)

# ── Detect text regions ───────────────────────────────────────────────────────
image_path = "your_image.jpg"
results = detector.predict(source=image_path, conf=0.25, imgsz=640)

img = Image.open(image_path).convert("RGB")

# ── Crop each region and tag the file name with its class ────────────────────
for i, box in enumerate(results[0].boxes):
    cls_id         = int(box.cls)
    label          = CLASS_NAMES[cls_id]
    x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())

    crop = img.crop((x1, y1, x2, y2))
    crop.save(f"crop_{i}_{label}.png")
    print(f"Saved crop {i} → class: {label}")
    # → feed each crop to phonsobon/mini-ocr for text recognition
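The loop above crops boxes in whatever order the detector returns them, which is not necessarily reading order, and it crops tightly to each box. A minimal sketch of two optional pre-processing steps follows: sorting boxes top to bottom and adding a small margin clamped to the image bounds. Both helper names (`reading_order`, `pad_box`) and the 4 px default margin are assumptions made for illustration; whether a margin helps the OCR stage has not been verified here.

```python
def reading_order(boxes):
    """Sort (x1, y1, x2, y2) boxes by top edge, then by left edge."""
    return sorted(boxes, key=lambda b: (b[1], b[0]))

def pad_box(box, img_w, img_h, pad=4):
    """Grow a box by `pad` pixels on every side, clamped to the image."""
    x1, y1, x2, y2 = box
    return (
        max(0, x1 - pad),
        max(0, y1 - pad),
        min(img_w, x2 + pad),
        min(img_h, y2 + pad),
    )
```

With a PIL image `img`, the padded tuple can be passed straight to `img.crop(...)`, using `img.width` and `img.height` as the bounds. The simple sort treats the page as a single column; multi-column layouts would need a more careful ordering.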

Input Tips

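Works on any image size: by default the image is rescaled so that its longer side is 640 px (`imgsz=640`), keeping the aspect ratio.
Best results on document photos, screenshots, and scanned pages.
Adjust `conf` (roughly 0.1 to 0.5) to trade recall against precision: lower values keep more low-confidence boxes (higher recall), higher values keep fewer but more reliable boxes (higher precision).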
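To get a feel for how much a large scan shrinks before detection, the resize can be approximated with a few lines of arithmetic. This is a sketch of the usual aspect-preserving ("letterbox") scaling, not a copy of the Ultralytics implementation: it ignores the padding added to reach a stride multiple, and the helper name `letterbox_scale` is hypothetical.

```python
def letterbox_scale(width, height, imgsz=640):
    """Return (scale, new_width, new_height) so the longer side equals imgsz."""
    scale = min(imgsz / width, imgsz / height)
    return scale, round(width * scale), round(height * scale)

# An A4 page scanned at 300 dpi is about 2480 x 3508 px.
scale, new_w, new_h = letterbox_scale(2480, 3508)
text_height_after = 40 * scale  # a 40 px text line in the original scan
```

For that scan the scale is about 0.18, so a 40 px line of text ends up roughly 7 px tall inside the network. If small text is being missed on high-resolution scans, passing a larger `imgsz` to `predict` (for example 1280) may be worth trying, at the cost of slower inference.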

Limitations

May miss very small text (< ~8 px height in the original image).
Not designed for handwritten or heavily stylised/artistic fonts.
Performance is best on document-style layouts similar to training data.

Related Model

Model	Task
phonsobon/mini-ocr	Text recognition (CRNN + CTC) for Khmer & English

License

MIT
