	---
	language:
	- und
	license: cc-by-4.0
	tags:
	- indus-script
	- ancient-scripts
	- archaeology
	- nlp
	- text-generation
	- sequence-modeling
	- grammar-analysis
	- undeciphered-script
	library_name: transformers
	pipeline_tag: text-generation
	---

	# Indus Script Models

	Trained models for validating, predicting, and generating sequences in the undeciphered
	Indus Valley Script (2600–1900 BCE). Built on 3,310 real archaeological inscriptions.

	---

	## Quick Start (3 steps)

	```bash
	# Step 1 — Clone the repo
	git clone https://huggingface.co/hellosindh/indus-script-models
	cd indus-script-models

	# Step 2 — Install dependencies
	pip install torch transformers

	# Step 3 — Run the demo
	python inference.py --task demo
	```

	---

	## What you can do

	### 1. Validate a sequence
	Is this inscription grammatically valid?

	```bash
	python inference.py --task validate --sequence "T638 T177 T420 T122"
	```

	Output:
	```
	Sequence : T638 T177 T420 T122
	BERT : 0.9650
	N-gram : 0.8930
	ELECTRA : 0.9410
	Ensemble : 0.9410
	Verdict : VALID (>=85%)
	```

	### 2. Predict a masked sign
	What sign most likely fills the missing position?

	```bash
	python inference.py --task predict --sequence "T638 [MASK] T420 T122"
	```

	Output:
	```
	Position 1 predictions:
	T177 18.3%
	T243 12.1%
	T653 9.4%
	T684 7.2%
	T650 5.8%
	```

	### 3. Generate new sequences

	```bash
	# Generate 10 sequences (default threshold 85%)
	python inference.py --task generate --count 10

	# More variety, less strict
	python inference.py --task generate --count 20 --threshold 0.78

	# High quality only
	python inference.py --task generate --count 5 --threshold 0.92
	```

	### 4. Score any sequence

	```bash
	python inference.py --task score --sequence "T604 T123 T609"
	```
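
To run these tasks from a script rather than by hand, the CLI calls above can be wrapped with `subprocess`. This is a minimal sketch, not part of the repo: the helper names `build_command` and `run_task` are hypothetical, it uses only the flags shown in this README (`--task`, `--sequence`, `--count`, `--threshold`), and it assumes the working directory is the cloned repo where `inference.py` lives.

```python
import subprocess
import sys


def build_command(task, sequence=None, count=None, threshold=None):
    """Build the argument list for one inference.py call (hypothetical helper)."""
    cmd = [sys.executable, "inference.py", "--task", task]
    if sequence is not None:
        cmd += ["--sequence", sequence]
    if count is not None:
        cmd += ["--count", str(count)]
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    return cmd


def run_task(task, **options):
    """Run inference.py and return its stdout as text.

    Assumes the current working directory is the cloned repo.
    Raises subprocess.CalledProcessError if the script exits with an error.
    """
    result = subprocess.run(
        build_command(task, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `run_task("validate", sequence="T638 T177 T420 T122")` should return the same text the validate command prints in the terminal.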

	---

	## Generating more diverse or longer sequences

	Open `inference.py` and find the `task_generate` function. Change the temperature list:

	More random (rare signs become more likely to appear):
	```python
	# Change this line:
	temps = [0.85, 0.90, 1.00, 1.10]
	# To:
	temps = [1.10, 1.20, 1.30, 1.40]
	```

	Longer sequences:
	Find the `generate()` method inside `load_nanogpt()` and change `max_len`:
	```python
	# Default (average output is about 7 signs):
	def generate(self, temperature=0.85, top_k=40, max_len=15):
	    ...  # existing method body stays unchanged

	# For longer sequences, change the signature to:
	#   def generate(self, temperature=0.85, top_k=40, max_len=25):

	# For shorter sequences, change it to:
	#   def generate(self, temperature=0.85, top_k=40, max_len=6):
	```
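
To see why these settings change the output, here is a generic illustration of temperature scaling and top-k filtering over a toy score vector. It is not the code in `inference.py`: the function name `sampling_distribution` and the toy logits are made up for the example, and the repo's `generate()` may differ in detail.

```python
import math


def sampling_distribution(logits, temperature=0.85, top_k=40):
    """Return sampling probabilities after temperature scaling and top-k filtering.

    Generic illustration only; not taken from inference.py.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the top_k highest-scoring candidates.
    k = min(top_k, len(logits))
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(ranked[:k])
    # Divide by temperature: values below 1 sharpen, values above 1 flatten.
    scaled = {i: logits[i] / temperature for i in keep}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {i: math.exp(v - peak) for i, v in scaled.items()}
    total = sum(weights.values())
    return [weights[i] / total if i in keep else 0.0 for i in range(len(logits))]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate signs
for t in (0.7, 1.0, 1.4):
    probs = sampling_distribution(toy_logits, temperature=t, top_k=3)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability mass moves from the top candidate toward the lower-ranked ones, which is why rare signs show up more often. Candidates outside the top `top_k` always get probability zero.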

	---

	## Pros and cons of tuning

	| Setting | Effect | Good for | Watch out for |
	|---|---|---|---|
	| Temperature 0.7–0.8 | Very focused, repeats common signs | High quality outputs | Low diversity |
	| Temperature 0.9–1.0 | Balanced (default) | General use | Nothing |
	| Temperature 1.1–1.3 | More variety, rare signs appear | Exploring vocabulary | Some unusual sequences |
	| Temperature above 1.4 | Very random | Stress testing | Most sequences fail quality gate |
	| Threshold 0.85 | Strict (default) | Publication quality | Slower generation |
	| Threshold 0.75 | Relaxed | Larger datasets | Lower average quality |
	| Threshold 0.92 | Very strict | Highest confidence only | Very few sequences pass |
	| max_len 6 | Short sequences | Matching real length distribution | Misses complex patterns |
	| max_len 20+ | Long sequences | Complex grammar patterns | Not representative of real seals |

	---

	## Displaying Indus glyphs

	Sequences use sign IDs like T638, T177. To see actual glyphs:

	1. Search for indus-brahmi-font, then download and install it
	2. The `glyphs` field in the output shows the rendered glyph characters
	3. Open `data/id_to_glyph.json` to see the full sign-to-character mapping
	4. To see the mapping keyed by T-numbered sign IDs, open `data/indus_tokenizer/indus_id_map.json`

	Without the font installed, glyphs show as boxes or question marks.
	The sign IDs (T638, T177 etc.) always work regardless of font.
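
The lookup can also be done in a few lines of Python. This is a sketch under assumptions: it treats `data/id_to_glyph.json` as a flat JSON object that maps sign-ID strings to glyph characters, and since the key format is not documented here it tries each ID both with and without the `T` prefix. The helper names `load_glyph_map` and `render_glyphs` are hypothetical.

```python
import json
from pathlib import Path


def load_glyph_map(path="data/id_to_glyph.json"):
    """Load the sign-to-glyph mapping (assumed to be a flat JSON object)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def render_glyphs(sequence, glyph_map):
    """Replace each sign ID with its glyph; unknown IDs are left unchanged."""
    rendered = []
    for sign in sequence.split():
        # Assumption: keys may be stored with or without the "T" prefix.
        glyph = glyph_map.get(sign) or glyph_map.get(sign.lstrip("T"))
        rendered.append(glyph if glyph else sign)
    return " ".join(rendered)
```

From the repo root, `render_glyphs("T638 T177 T420 T122", load_glyph_map())` would return the glyph string if the file matches the assumed layout. The characters only display correctly with the font installed.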

	---

	## Repo structure

	```
	indus-script-models/
	├── inference.py               run this for all tasks
	├── indus_ngram.py             required by ngram_model.pkl (do not move)
	├── README.md
	├── models/
	│   ├── nanogpt_indus.pt       NanoGPT generator (153K params, PPL 13.3)
	│   ├── ngram_model.pkl        N-gram RTL model (88.2% pairwise accuracy)
	│   ├── mlm/                   TinyBERT masked language model (val loss 2.06)
	│   ├── cls/                   TinyBERT classifier (89.0% test accuracy)
	│   ├── electra/               ELECTRA discriminator (95.1% token accuracy)
	│   └── deberta/               DeBERTa discriminator (87.1% test accuracy)
	└── data/
	    ├── id_to_glyph.json       641 sign ID to glyph character mappings
	    └── indus_tokenizer/       custom tokenizer for Indus Script
	```

	---

	## How the pipeline works

	Stage 1 — Train on 3,310 real inscriptions:

	Five models are trained independently, each learning a different aspect of the grammar:

	- TinyBERT MLM — learns which sign can fill a masked position in a sequence
	- TinyBERT Classifier — learns to tell valid sequences from corrupted ones
	- N-gram RTL — learns right-to-left transition probabilities between signs
	- ELECTRA — learns token-level discrimination between real and fake signs
	- NanoGPT — learns to generate new sequences from scratch
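
As a rough illustration of what "right-to-left transition probabilities" means, the sketch below counts bigrams over reversed sequences. It is a toy model, not the contents of `indus_ngram.py`: the function name `rtl_bigram_probs` is hypothetical, the two inputs are the example sequences used earlier in this README, and it assumes sequences are stored left to right and must be reversed for RTL reading.

```python
from collections import Counter, defaultdict


def rtl_bigram_probs(sequences):
    """Estimate P(next sign | current sign) when reading right to left."""
    pair_counts = defaultdict(Counter)
    for seq in sequences:
        signs = seq.split()[::-1]  # reverse so index 0 is the first sign read
        for current, nxt in zip(signs, signs[1:]):
            pair_counts[current][nxt] += 1
    # Normalise each row of counts into probabilities.
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in pair_counts.items()
    }


examples = ["T638 T177 T420 T122", "T604 T123 T609"]
probs = rtl_bigram_probs(examples)
print(probs["T122"])  # signs that follow T122 when reading right to left
```

With only two sequences every transition is trivially certain. On a full corpus the same counts give a spread of probabilities per sign.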

	Stage 2 — Generate and filter:

	NanoGPT generates candidate sequences in RTL order, then flips them to LTR.
	Each candidate is scored by three models: BERT (50%) + N-gram (25%) + ELECTRA (25%).
	Only sequences scoring 85% or higher are kept as valid synthetic sequences.
	Sequences that exactly match real inscriptions are separated as seal reproductions.
	Result: 5,000 novel sequences with 752 exact seal matches as validation evidence.
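
The scoring step can be sketched as a weighted average followed by a threshold check. This assumes the ensemble is a plain weighted sum of the three scores, which is consistent with the validate example earlier in this README; the names `WEIGHTS`, `ensemble_score`, and `passes_gate` are hypothetical and not taken from `inference.py`.

```python
# Weights as stated above: BERT 50%, N-gram 25%, ELECTRA 25%.
WEIGHTS = {"bert": 0.50, "ngram": 0.25, "electra": 0.25}


def ensemble_score(bert, ngram, electra):
    """Weighted average of the three model scores, each in [0, 1]."""
    return (
        WEIGHTS["bert"] * bert
        + WEIGHTS["ngram"] * ngram
        + WEIGHTS["electra"] * electra
    )


def passes_gate(score, threshold=0.85):
    """True when a candidate meets the quality threshold."""
    return score >= threshold


# Scores from the validate example: BERT 0.9650, N-gram 0.8930, ELECTRA 0.9410.
score = ensemble_score(bert=0.9650, ngram=0.8930, electra=0.9410)
print(f"{score:.4f}", passes_gate(score))  # prints "0.9410 True"
```

The result matches the `Ensemble : 0.9410` line in the validate output, since 0.5 × 0.9650 + 0.25 × 0.8930 + 0.25 × 0.9410 = 0.9410.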

	Stage 3 — Retrain on combined data:

	The 5,000 synthetic sequences were combined with 3,310 real sequences (8,310 total).
	All models were retrained on the larger dataset. Results improved significantly:

	| Model and metric | Before (3,310 real) | After (8,310 combined) |
	|---|---|---|
	| TinyBERT classifier accuracy | 78.4% | 89.0% |
	| NanoGPT perplexity (lower is better) | 32.5 | 13.3 |
	| DeBERTa accuracy | 80.5% | 87.1% |

	The final 5,000 sequences in the dataset were generated with these retrained models.

	---

	## Key findings

	- RTL reading confirmed: right-to-left has 12% stronger grammatical structure than LTR
	- Grammar proven: entropy chain H1 → H2 → H3 = 6.03 → 3.41 → 2.39 bits (language-like decay)
	- Zipf law confirmed: R² = 0.968, language-like token distribution
	- 752 seal reproductions: model independently reproduced real archaeological inscriptions
	- Sign roles discovered:
	  - PREFIX signs at reading end: T638, T604, T406, T496
	  - SUFFIX signs at reading start: T123, T122, T701, T741
	  - CORE signs in the middle: T101, T268, T177, T243
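
One common way to compute an entropy chain like this is with conditional entropies: H1 is the entropy of single signs, H2 the entropy of a sign given the one before it, and H3 given the two before it. That reading is an assumption, since the exact definition behind the figures above is not stated here. The sketch below uses a hypothetical function name and made-up toy symbols, not corpus data.

```python
import math
from collections import Counter


def conditional_entropy(sequences, n):
    """Entropy in bits of a sign given the n-1 signs before it.

    With n=1 this is the plain unigram entropy.
    """
    ngrams = Counter()
    contexts = Counter()
    for seq in sequences:
        signs = seq.split()
        for i in range(len(signs) - n + 1):
            gram = tuple(signs[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1  # the n-1 signs that precede the last one
    total = sum(ngrams.values())
    if total == 0:
        return 0.0
    # Sum over n-grams of p(context, sign) * log2(1 / p(sign | context)).
    return sum(
        (count / total) * math.log2(contexts[gram[:-1]] / count)
        for gram, count in ngrams.items()
    )


toy = ["A B A B", "C D C D"]  # made-up symbols, not Indus data
print(conditional_entropy(toy, 1), conditional_entropy(toy, 2))
```

To measure the right-to-left direction, reverse each sequence before passing it in. A steady drop from H1 to H3 means each sign becomes more predictable as more context is known.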

	---

	## Known limitations

	DeBERTa calibration issue:
	DeBERTa returns near-zero scores for every sequence because of a confidence calibration failure.
	Its score is logged in the output but excluded from the quality gate;
	BERT, N-gram, and ELECTRA handle all scoring.

	Vocabulary coverage:
	Only about 26% of the 641 known Indus signs appear reliably in generated sequences.
	475 signs appear 10 times or fewer in the real corpus — too rare for the model to learn.
	This is a property of the archaeological record, not a model bug.
	No synthetic corpus can reliably generate signs that barely exist in the training data.

	Short sequences:
	The model rarely generates length-2 sequences even though they are common in real inscriptions.
	If you need shorter outputs, set `max_len=4` in the generate function.

	---

	## Dataset

	The 5,000 synthetic sequences with full scores and sign index are available at:

	[hellosindh/indus-script-synthetic](https://huggingface.co/datasets/hellosindh/indus-script-synthetic)