How to use PredictiveManish/Trimurti-LM with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="PredictiveManish/Trimurti-LM")

# Or load the model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("PredictiveManish/Trimurti-LM", dtype="auto")
```

How to use PredictiveManish/Trimurti-LM with vLLM:
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "PredictiveManish/Trimurti-LM"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "PredictiveManish/Trimurti-LM",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
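The same completion request can be issued from Python using only the standard library; a minimal sketch, with the endpoint and payload taken from the curl example above:

```python
import json
from urllib import request

# Same request body as the curl example above
payload = {
    "model": "PredictiveManish/Trimurti-LM",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5,
}

req = request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the vLLM server from the step above is running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

The same client works against the SGLang server below by swapping in port 30000.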
How to use PredictiveManish/Trimurti-LM with SGLang:
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "PredictiveManish/Trimurti-LM" \
    --host 0.0.0.0 \
    --port 30000

# Alternatively, start the server in Docker:
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "PredictiveManish/Trimurti-LM" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "PredictiveManish/Trimurti-LM",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

How to use PredictiveManish/Trimurti-LM with Docker Model Runner:
```shell
docker model run hf.co/PredictiveManish/Trimurti-LM
```
Trimurti-LM is a small, efficient multilingual language model trained from scratch on English, Hindi, and Punjabi text. Named after the Hindu trinity (Brahma-Vishnu-Shiva), it represents the three-fold capability of creating text, preserving meaning, and transforming across scripts.
Key Features:
| Aspect | Details |
|---|---|
| Architecture | GPT-2 style decoder-only Transformer |
| Parameters | 4,672,000 (≈4.7M) |
| Hidden Size | 256 |
| Layers | 4 |
| Attention Heads | 8 |
| Context Length | 128 tokens |
| Vocabulary | 8000 tokens (SentencePiece) |
| Training Steps | 5000 |
| Training Time | 2.38 hours |
| Hardware | NVIDIA GTX 1650 (4GB VRAM) |
The model was trained on a balanced multilingual corpus of English, Hindi, and Punjabi text.
Sources: the ai4bharat/samanantar parallel corpus (cited below).
Data Processing: each training example is prefixed with a language tag: [EN], [HI], or [PA].
Evaluation Results:
| Metric | Value | Notes |
|---|---|---|
| Final Loss | 1.206 | Cross-entropy loss |
| Perplexity | 3.34 | exp(1.206) ≈ 3.34 |
| Top-1 Accuracy | ~25% | Next token prediction |
| Top-5 Accuracy | ~60% | Next token prediction |
| Language ID Accuracy | 95% | With explicit tags |
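Perplexity here is simply the exponential of the final cross-entropy loss, so the table's value can be reproduced in two lines:

```python
import math

loss = 1.206                      # final cross-entropy loss from the table
perplexity = math.exp(loss)       # perplexity = e^loss
print(round(perplexity, 2))       # → 3.34
```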
Usage:

```python
from transformers import GPT2LMHeadModel
import sentencepiece as spm
import torch

# Load model and tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("multilingual_spm.model")
model = GPT2LMHeadModel.from_pretrained("PredictiveManish/Trimurti-LM")

# Generate text
prompt = "[EN] The weather is"
input_ids = tokenizer.encode(prompt)
input_tensor = torch.tensor([input_ids])

with torch.no_grad():
    output = model.generate(
        input_ids=input_tensor,
        max_length=50,
        temperature=0.7,
        do_sample=True,
        pad_token_id=0,
    )

generated = tokenizer.decode(output[0].tolist())
print(generated)
```
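The language-tag convention can be wrapped in a small helper so prompts always match the training-time format (the function name is hypothetical; the tags are the ones documented in this card):

```python
# Map language codes to the tags used during training.
LANG_TAGS = {"en": "[EN]", "hi": "[HI]", "pa": "[PA]"}

def tag_prompt(lang: str, text: str) -> str:
    """Prefix `text` with the language tag the model saw during training."""
    return f"{LANG_TAGS[lang.lower()]} {text}"

print(tag_prompt("en", "The weather is"))  # → [EN] The weather is
```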
If you use Trimurti-LM in your work, please cite:
```bibtex
@software{trimurti_lm_2026,
  title  = {Trimurti-LM: A 4.2M Parameter Multilingual Language Model},
  author = {Manish Tiwari},
  year   = {2026},
  url    = {https://huggingface.co/PredictiveManish/Trimurti-LM},
  note   = {Trained from scratch on English, Hindi, and Punjabi with consumer hardware}
}

@inproceedings{samanantar_2021,
  title     = {Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
  author    = {Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Srihari Nagaraj and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
  booktitle = {Proceedings of the Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks},
  year      = {2021},
  url       = {https://arxiv.org/abs/2104.05596}
}
```