| language: | |
| - und | |
| license: cc-by-4.0 | |
| tags: | |
| - indus-script | |
| - ancient-scripts | |
| - archaeology | |
| - nlp | |
| - text-generation | |
| - sequence-modeling | |
| - grammar-analysis | |
| - undeciphered-script | |
| library_name: transformers | |
| pipeline_tag: text-generation | |
| # Indus Script Models | |
| Trained models for validating, predicting, and generating sequences in the undeciphered | |
| Indus Valley Script (2600–1900 BCE). Built on 3,310 real archaeological inscriptions. | |
| --- | |
| ## Quick Start (3 steps) | |
| ```bash | |
| # Step 1: Clone the repo | |
| git clone https://huggingface.co/hellosindh/indus-script-models | |
| cd indus-script-models | |
| # Step 2: Install dependencies | |
| pip install torch transformers | |
| # Step 3: Run the demo | |
| python inference.py --task demo | |
| ``` | |
| --- | |
| ## What you can do | |
| ### 1. Validate a sequence | |
| Is this inscription grammatically valid? | |
| ```bash | |
| python inference.py --task validate --sequence "T638 T177 T420 T122" | |
| ``` | |
| Output: | |
| ``` | |
| Sequence : T638 T177 T420 T122 | |
| BERT : 0.9650 | |
| N-gram : 0.8930 | |
| ELECTRA : 0.9410 | |
| Ensemble : 0.9410 | |
| Verdict : VALID (>=85%) | |
| ``` | |
| ### 2. Predict a masked sign | |
| What sign most likely fills the missing position? | |
| ```bash | |
| python inference.py --task predict --sequence "T638 [MASK] T420 T122" | |
| ``` | |
| Output: | |
| ``` | |
| Position 1 predictions: | |
| T177 18.3% | |
| T243 12.1% | |
| T653 9.4% | |
| T684 7.2% | |
| T650 5.8% | |
| ``` | |
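The percentages above read like a softmax over the model's scores at the masked position. The following is a minimal sketch of that ranking step, not the code in `inference.py`; the function name `top_predictions` and the logit values are hypothetical and chosen only for illustration.

```python
import math

def top_predictions(logits, k=5):
    """Turn raw per-sign scores into probabilities and return the k most likely signs."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits.values())
    exps = {sign: math.exp(score - peak) for sign, score in logits.items()}
    total = sum(exps.values())
    probs = {sign: value / total for sign, value in exps.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical logits for illustration only (not real model output).
example_logits = {"T177": 2.0, "T243": 1.6, "T653": 1.3, "T684": 1.0, "T650": 0.8, "T101": 0.1}
ranked = top_predictions(example_logits, k=3)
for sign, prob in ranked:
    print(f"{sign}  {prob:.1%}")
```

Because the probabilities are spread over the whole sign vocabulary, even the top candidate can have a modest share, which is consistent with the 18.3% shown for T177.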
| ### 3. Generate new sequences | |
| ```bash | |
| # Generate 10 sequences (default threshold 85%) | |
| python inference.py --task generate --count 10 | |
| # More variety, less strict | |
| python inference.py --task generate --count 20 --threshold 0.78 | |
| # High quality only | |
| python inference.py --task generate --count 5 --threshold 0.92 | |
| ``` | |
| ### 4. Score any sequence | |
| ```bash | |
| python inference.py --task score --sequence "T604 T123 T609" | |
| ``` | |
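To score many sequences from Python, one option is to shell out to the same command. The sketch below uses only the `--task` and `--sequence` flags shown above; the helper names `build_command` and `score_many` are hypothetical, and the raw stdout is returned unparsed because the output format of the score task is not documented here. It assumes the script is run from the repository root.

```python
import subprocess
import sys

def build_command(task, sequence=None, script="inference.py"):
    """Build the argument list for one inference.py call."""
    command = [sys.executable, script, "--task", task]
    if sequence is not None:
        command += ["--sequence", sequence]
    return command

def score_many(sequences):
    """Run the score task once per sequence and return the raw stdout of each run."""
    outputs = {}
    for sequence in sequences:
        result = subprocess.run(
            build_command("score", sequence),
            capture_output=True, text=True, check=True,
        )
        outputs[sequence] = result.stdout
    return outputs

example_command = build_command("score", "T604 T123 T609")
print(example_command[1:])
```

Each call starts a new Python process, so this is convenient for small batches rather than large ones.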
| --- | |
| ## Generating more diverse or longer sequences | |
| Open `inference.py` and find the `task_generate` function. Change the temperature list: | |
| **More random (forces rare signs to appear):** | |
| ```python | |
| # Change this line: | |
| temps = [0.85, 0.90, 1.00, 1.10] | |
| # To: | |
| temps = [1.10, 1.20, 1.30, 1.40] | |
| ``` | |
| **Longer sequences:** | |
| Find the `generate()` method inside `load_nanogpt()` and change `max_len`: | |
| ```python | |
| # Default (avg 7 signs): | |
| def generate(self, temperature=0.85, top_k=40, max_len=15): | |
| # For longer sequences: | |
| def generate(self, temperature=0.85, top_k=40, max_len=25): | |
| # For shorter sequences: | |
| def generate(self, temperature=0.85, top_k=40, max_len=6): | |
| ``` | |
| --- | |
| ## Pros and cons of tuning | |
| | Setting | Effect | Good for | Watch out for | | |
| |---|---|---|---| | |
| | Temperature 0.7–0.8 | Very focused, repeats common signs | High quality outputs | Low diversity | |
| | Temperature 0.9–1.0 | Balanced (default) | General use | Nothing | |
| | Temperature 1.1–1.3 | More variety, rare signs appear | Exploring vocabulary | Some unusual sequences | |
| | Temperature above 1.4 | Very random | Stress testing | Most sequences fail quality gate | |
| | Threshold 0.85 | Strict (default) | Publication quality | Slower generation | |
| | Threshold 0.75 | Relaxed | Larger datasets | Lower average quality | | |
| | Threshold 0.92 | Very strict | Highest confidence only | Very few sequences pass | | |
| | max_len 6 | Short sequences | Matching real length distribution | Misses complex patterns | | |
| | max_len 20+ | Long sequences | Complex grammar patterns | Not representative of real seals | | |
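The temperature rows in the table follow from how sampling usually works: logits are divided by the temperature before the softmax, so values above 1 flatten the distribution and give rare signs more weight. The sketch below illustrates this with generic temperature and top-k sampling. It is an assumption about the general technique, not a copy of the `generate()` method; the logits and the `RARE` token are hypothetical.

```python
import math
import random

def sample_sign(logits, temperature=0.85, top_k=40, rng=random):
    """Sample one sign after top-k truncation and temperature scaling."""
    # Keep only the top_k highest-scoring signs.
    kept = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:top_k]
    # Dividing by the temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = [score / temperature for _, score in kept]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    signs = [sign for sign, _ in kept]
    return rng.choices(signs, weights=weights, k=1)[0]

def rare_sign_rate(logits, temperature, target="RARE", trials=5000, seed=0):
    """Fraction of draws that pick the target sign at a given temperature."""
    rng = random.Random(seed)
    draws = [sample_sign(logits, temperature, rng=rng) for _ in range(trials)]
    return draws.count(target) / trials

# Hypothetical logits: two common signs and one rare placeholder token.
demo_logits = {"T638": 3.0, "T177": 2.0, "RARE": 0.0}
low = rare_sign_rate(demo_logits, temperature=0.7)
high = rare_sign_rate(demo_logits, temperature=1.4)
print(f"rare-sign rate at 0.7: {low:.3f}, at 1.4: {high:.3f}")
```

With these illustrative logits the rare token's probability rises from about 1% at temperature 0.7 to about 7% at 1.4, which matches the direction of the table even though the real model's numbers will differ.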
| --- | |
| ## Displaying Indus glyphs | |
| Sequences use sign IDs like T638, T177. To see actual glyphs: | |
| 1. Search for **indus-brahmi-font** and download it | |
| 2. The `glyphs` field in output shows the rendered glyph characters | |
| 3. Open `data/id_to_glyph.json` to see the full sign to character mapping | |
| 4. To see the mapping keyed by T-number sign IDs, open `data/indus_tokenizer/indus_id_map.json` | |
| Without the font installed, glyphs show as boxes or question marks. | |
| The sign IDs (T638, T177 etc.) always work regardless of font. | |
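A small helper can turn a sign-ID sequence into glyph characters once a mapping is loaded. This sketch assumes the mapping is a flat JSON object from ID string to glyph character; the actual key format in `data/id_to_glyph.json` may differ (the T-number mapping lives in `indus_id_map.json`), so check the file before relying on it. The helper names and the two private-use code points in `demo_map` are hypothetical.

```python
import json

def load_glyph_map(path="data/id_to_glyph.json"):
    """Load the sign-ID-to-glyph mapping (assumed to be a flat JSON object)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

def to_glyphs(sequence, glyph_map, missing="?"):
    """Replace each sign ID in a space-separated sequence with its glyph character."""
    return "".join(glyph_map.get(sign, missing) for sign in sequence.split())

# Hypothetical two-entry mapping so the example runs without the repo files.
demo_map = {"T638": "\ue000", "T177": "\ue001"}
rendered = to_glyphs("T638 T177 T420", demo_map)
print(len(rendered))
```

Signs missing from the mapping are rendered as `?` rather than raising an error, which keeps partially mapped sequences readable.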
| --- | |
| ## Repo structure | |
| ``` | |
| indus-script-models/ | |
| ├── inference.py            run this for all tasks | |
| ├── indus_ngram.py          required by ngram_model.pkl; do not move | |
| ├── README.md | |
| ├── models/ | |
| │   ├── nanogpt_indus.pt    NanoGPT generator (153K params, PPL 13.3) | |
| │   ├── ngram_model.pkl     N-gram RTL model (88.2% pairwise accuracy) | |
| │   ├── mlm/                TinyBERT masked language model (val loss 2.06) | |
| │   ├── cls/                TinyBERT classifier (89.0% test accuracy) | |
| │   ├── electra/            ELECTRA discriminator (95.1% token accuracy) | |
| │   └── deberta/            DeBERTa discriminator (87.1% test accuracy) | |
| └── data/ | |
|     ├── id_to_glyph.json    641 sign ID to glyph character mappings | |
|     └── indus_tokenizer/    custom tokenizer for Indus Script | |
| ``` | |
| --- | |
| ## How the pipeline works | |
| **Stage 1: Train on 3,310 real inscriptions** | |
| Five models trained independently, each learning a different aspect of grammar: | |
| - **TinyBERT MLM**: learns which sign can fill a masked position in a sequence | |
| - **TinyBERT Classifier**: learns to tell valid sequences from corrupted ones | |
| - **N-gram RTL**: learns right-to-left transition probabilities between signs | |
| - **ELECTRA**: learns token-level discrimination between real and fake signs | |
| - **NanoGPT**: learns to generate new sequences from scratch | |
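As an illustration of what "right-to-left transition probabilities" means for the N-gram model, here is a minimal bigram sketch. It is not the implementation in `indus_ngram.py`: the function names, the add-alpha smoothing, and the toy corpus are assumptions, and the sketch assumes sequences are stored left to right and reversed before counting.

```python
from collections import Counter, defaultdict

def train_rtl_bigrams(sequences):
    """Count sign-to-sign transitions after reversing each LTR sequence into RTL order."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        signs = sequence.split()[::-1]  # read right to left
        for current, following in zip(signs, signs[1:]):
            counts[current][following] += 1
    return counts

def transition_prob(counts, current, following, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(following | current)."""
    seen = counts.get(current, Counter())
    return (seen[following] + alpha) / (sum(seen.values()) + alpha * vocab_size)

# Toy corpus for illustration only; these are not claimed to be real inscriptions.
toy_corpus = ["T638 T177 T420 T122", "T638 T177 T122", "T604 T123 T609"]
counts = train_rtl_bigrams(toy_corpus)
vocab = {sign for seq in toy_corpus for sign in seq.split()}
p = transition_prob(counts, "T177", "T638", len(vocab))
print(f"P(T638 | T177) reading right to left = {p:.3f}")
```

Smoothing keeps unseen transitions from getting zero probability, which matters in a corpus where most signs are rare.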
| **Stage 2: Generate and filter** | |
| NanoGPT generates candidate sequences in RTL order, then flips them to LTR. | |
| Each candidate is scored by three models: BERT (50%) + N-gram (25%) + ELECTRA (25%). | |
| Only sequences scoring 85% or higher are kept as valid synthetic sequences. | |
| Sequences that exactly match real inscriptions are separated as seal reproductions. | |
| Result: 5,000 novel sequences with 752 exact seal matches as validation evidence. | |
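The filtering step described above can be sketched as a weighted average followed by a threshold and a lookup against the real corpus. The weights (50/25/25) and the 0.85 threshold come from the text; the function names, the data layout, and the order of the checks are assumptions rather than the code in `inference.py`.

```python
WEIGHTS = {"bert": 0.50, "ngram": 0.25, "electra": 0.25}

def ensemble_score(scores, weights=WEIGHTS):
    """Weighted average of the three model scores."""
    return sum(weights[name] * scores[name] for name in weights)

def to_ltr(rtl_sequence):
    """Flip a right-to-left sequence into left-to-right order."""
    return " ".join(reversed(rtl_sequence.split()))

def triage(candidates, real_inscriptions, threshold=0.85):
    """Split scored RTL candidates into novel sequences and seal reproductions."""
    novel, reproductions = [], []
    for rtl_sequence, scores in candidates:
        if ensemble_score(scores) < threshold:
            continue  # fails the quality gate
        sequence = to_ltr(rtl_sequence)
        if sequence in real_inscriptions:
            reproductions.append(sequence)
        else:
            novel.append(sequence)
    return novel, reproductions

# Scores taken from the validation example earlier in this README.
example = {"bert": 0.9650, "ngram": 0.8930, "electra": 0.9410}
print(f"{ensemble_score(example):.4f}")
```

For the validation example, 0.50 × 0.9650 + 0.25 × 0.8930 + 0.25 × 0.9410 gives 0.9410, which matches the ensemble value shown in that output.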
| **Stage 3: Retrain on combined data** | |
| The 5,000 synthetic sequences were combined with 3,310 real sequences (8,310 total). | |
| All models were retrained on the larger dataset. Results improved significantly: | |
| | Model | Before | After | | |
| |---|---|---| | |
| | TinyBERT accuracy | 78.4% | 89.0% | | |
| | NanoGPT perplexity | 32.5 | 13.3 | | |
| | DeBERTa accuracy | 80.5% | 87.1% | | |
| The final 5,000 sequences in the dataset were generated with these retrained models. | |
| --- | |
| ## Key findings | |
| - **RTL reading confirmed**: right-to-left has 12% stronger grammatical structure than LTR | |
| - **Grammar proven**: entropy chain H1 to H2 to H3 = 6.03 to 3.41 to 2.39 bits (language-like decay) | |
| - **Zipf law confirmed**: R² = 0.968, language-like token distribution | |
| - **752 seal reproductions**: model independently reproduced real archaeological inscriptions | |
| - **Sign roles discovered:** | |
| - PREFIX signs at reading end: T638, T604, T406, T496 | |
| - SUFFIX signs at reading start: T123, T122, T701, T741 | |
| - CORE signs in the middle: T101, T268, T177, T243 | |
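One way to compute an entropy chain like the one above is sketched below. It assumes H_n means the conditional entropy of a sign given the previous n-1 signs, estimated from empirical n-gram counts; the README does not state the exact definition or estimator used, so treat this as an illustration. The toy corpus with signs `A` and `B` is hypothetical. For a right-to-left analysis, reverse each sequence before calling the function.

```python
import math
from collections import Counter

def conditional_entropy(sequences, n):
    """Entropy in bits of a sign given the previous n-1 signs (n=1 gives unigram entropy)."""
    ngrams = Counter()
    contexts = Counter()
    for sequence in sequences:
        signs = sequence.split()
        for i in range(len(signs) - n + 1):
            gram = tuple(signs[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1  # the n-1 signs preceding the last one
    total = sum(ngrams.values())
    return -sum(
        count / total * math.log2(count / contexts[gram[:-1]])
        for gram, count in ngrams.items()
    )

# Toy corpus: strictly alternating signs, so context fully determines the next sign.
toy = ["A B A B", "A B A B"]
h1 = conditional_entropy(toy, 1)
h2 = conditional_entropy(toy, 2)
print(h1, h2)
```

A falling chain (H1 > H2 > H3) means each extra sign of context makes the next sign more predictable. On small corpora, higher-order estimates are biased downward, so the drop should be compared against a shuffled baseline.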
| --- | |
| ## Known limitations | |
| **DeBERTa calibration issue:** | |
| DeBERTa outputs near-zero scores for all sequences because of a confidence calibration failure. | |
| It is logged in output but excluded from the quality gate. | |
| BERT, N-gram, and ELECTRA handle all scoring. | |
| **Vocabulary coverage:** | |
| Only about 26% of the 641 known Indus signs appear reliably in generated sequences. | |
| 475 signs appear 10 times or fewer in the real corpus, too rare for the model to learn. | |
| This is a property of the archaeological record, not a model bug. | |
| No synthetic corpus can reliably generate signs that barely exist in the training data. | |
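The two figures above are consistent with each other: 641 - 475 = 166 signs occur more than 10 times, and 166 / 641 is about 26%. The sketch below shows how such a count could be taken from a corpus. The function name and toy data are hypothetical, and it assumes signs that never occur are counted as rare.

```python
from collections import Counter

def rare_sign_report(sequences, vocab_size, max_count=10):
    """Count signs seen more than max_count times; all others (including unseen) are rare."""
    counts = Counter(sign for sequence in sequences for sign in sequence.split())
    frequent = sum(1 for c in counts.values() if c > max_count)
    rare = vocab_size - frequent
    return {"frequent": frequent, "rare": rare, "frequent_share": frequent / vocab_size}

# Arithmetic check on the figures quoted in this section.
share = (641 - 475) / 641
print(f"{share:.1%}")

# Toy corpus with a hypothetical 5-sign vocabulary and a low cutoff.
report = rare_sign_report(["A B", "A C", "A"], vocab_size=5, max_count=2)
print(report)
```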
| **Short sequences:** | |
| The model rarely generates length-2 sequences even though they are common in real inscriptions. | |
| If you need shorter outputs, set `max_len=4` in the generate function. | |
| --- | |
| ## Dataset | |
| The 5,000 synthetic sequences with full scores and sign index are available at: | |
| [hellosindh/indus-script-synthetic](https://huggingface.co/datasets/hellosindh/indus-script-synthetic) | |