Misspelling Generator

A misspelling generator built on the 466k-word list in words.txt from the data provider. For demonstration, permutations are generated only for words of up to 7 letters:

  Words processed  : 125,414
  Lines written    : 173,110,626
  Output file      : misspellings_permutations.txt
  File size        : 2.53 GB

Depending on your storage, you can raise the letter limit by changing MAX_WORD_LEN in the script.

Use generate_permutations_colab.py or google_collab_173MSW.ipynb to generate 173,110,626 misspellings from 125,414 processed words; the output is downloadable from Hugging Face.
Use generate_typos_colab.py or google_collab_263MSW.ipynb to generate 26,636,990 misspellings from 415,701 processed words; the output is downloadable from Hugging Face.

Option 1 — use `generate_typos_local.py` (Run Locally)

Generates realistic typo variants using 4 strategies:

| Strategy | Example (`hello`) | Variants |
|---|---|---|
| Adjacent swap | `hlelo`, `helol` | n−1 per word |
| Char deletion | `hllo`, `helo`, `hell` | n per word |
| Char duplication | `hhello`, `heello` | n per word |
| Keyboard proximity | `gello`, `jello`, `hwllo` | varies |

Processes only pure-alpha words with length ≥ 3
Produces roughly 10–50 typos per word → ~5M–20M lines total
Output: data/misspellings.txt in misspelling=correction format

To run:

python generate_typos_local.py
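
The four strategies listed above can be sketched as follows. This is a minimal illustration, not the contents of generate_typos_local.py: the function names and the partial `NEIGHBOURS` keyboard map are assumptions made for the example, and the real script may filter or deduplicate differently.

```python
# Illustrative sketch only; names and the keyboard map are hypothetical.

# Partial QWERTY neighbour map (assumed; a full map would cover every key).
NEIGHBOURS = {
    "h": "gj",
    "e": "wr",
    "l": "k",
    "o": "ip",
}


def adjacent_swaps(word):
    """Swap each pair of neighbouring characters: n-1 candidates."""
    return [
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
    ]


def deletions(word):
    """Drop one character at a time: n candidates."""
    return [word[:i] + word[i + 1:] for i in range(len(word))]


def duplications(word):
    """Double one character at a time: n candidates."""
    return [word[:i] + word[i] + word[i:] for i in range(len(word))]


def keyboard_slips(word, neighbours=NEIGHBOURS):
    """Replace one character with a key next to it on the keyboard."""
    return [
        word[:i] + near + word[i + 1:]
        for i, ch in enumerate(word)
        for near in neighbours.get(ch, "")
    ]


def typos(word):
    """Return the distinct typo variants of `word`, excluding `word` itself."""
    # Only pure-alphabetic words of length >= 3 are processed.
    if not (word.isascii() and word.isalpha() and len(word) >= 3):
        return set()
    candidates = (
        adjacent_swaps(word) + deletions(word)
        + duplications(word) + keyboard_slips(word)
    )
    return {c for c in candidates if c != word}


for typo in sorted(typos("hello"))[:5]:
    print(f"{typo}=hello")  # misspelling=correction format
```

Collecting the candidates in a set removes repeats such as the two identical `helo` deletions and the swap of the two `l` characters, which reproduces the original word.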

Option 2 — use `generate_permutations_colab.py` (Google Colab)

Generates ALL letter permutations of each word. Key config at the top of the file:

MAX_WORD_LEN = 7   # ← CRITICAL control knob
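
As a rough illustration of what the length limit controls, the sketch below enumerates the reorderings of a single word with `itertools.permutations`. The helper name `distinct_permutations` is hypothetical, and the choice to deduplicate and to drop the original spelling is an assumption; the actual script may behave differently.

```python
from itertools import permutations

MAX_WORD_LEN = 7  # same knob as in the script; longer words are skipped here


def distinct_permutations(word, max_len=MAX_WORD_LEN):
    """Return every distinct reordering of `word` except `word` itself."""
    if len(word) > max_len:
        return set()
    perms = {"".join(p) for p in permutations(word)}
    perms.discard(word)  # the correct spelling is not a misspelling
    return perms


print(len(distinct_permutations("cat")))    # 3! - 1 = 5
print(len(distinct_permutations("hello")))  # 5!/2! - 1 = 59
```

Words with repeated letters yield fewer than n! distinct results, which is one reason real output can be smaller than a pure n! estimate.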

Google Colab Education

What Is Google Colab?

Google Colab gives you a free Linux VM with Python pre-installed. You get:

| Resource | Free Tier | Colab Pro ($12/mo) |
|---|---|---|
| Disk | ~78 GB (temporary) | ~225 GB (temporary) |
| RAM | ~12 GB | ~25-50 GB |
| GPU | T4 (limited) | A100/V100 |
| Runtime limit | ~12 hours, then VM resets | ~24 hours |
| Google Drive | 15 GB (persistent) | 15 GB (same) |

Colab disk is ephemeral — when the runtime disconnects, all files on the VM are deleted. Only Google Drive persists.

Step-by-Step: Running Option 2 on Colab

Step 1 — Open Colab

Go to colab.research.google.com → New Notebook

Step 2 — Upload `words.txt`

# Cell 1
from google.colab import files
uploaded = files.upload()   # select words.txt from your PC

Step 3 — (Optional) Mount Google Drive for persistent storage

# Cell 2
from google.colab import drive
drive.mount('/content/drive')

# Then change OUTPUT_PATH in the script to:
# '/content/drive/MyDrive/misspellings_permutations.txt'
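
One way to avoid editing the path by hand is to pick it at run time, depending on whether Drive is mounted. This is a small sketch under that assumption; `choose_output_path`, `DRIVE_DIR`, and `OUTPUT_NAME` are hypothetical names, not part of the script.

```python
import os

DRIVE_DIR = "/content/drive/MyDrive"  # mount point used in the cell above
OUTPUT_NAME = "misspellings_permutations.txt"


def choose_output_path(drive_dir=DRIVE_DIR, name=OUTPUT_NAME):
    """Prefer Google Drive when it is mounted, else fall back to the VM disk."""
    if os.path.isdir(drive_dir):
        return os.path.join(drive_dir, name)
    return name


OUTPUT_PATH = choose_output_path()
print(OUTPUT_PATH)
```

Keep in mind that the 15 GB Drive quota listed above is smaller than the output for larger MAX_WORD_LEN values.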

Step 4 — Paste & run the script

Copy the entire contents of generate_permutations_colab.py into a new cell. Adjust MAX_WORD_LEN as needed, then run.

Step 5 — Download the result

# If saved to VM disk:
files.download('misspellings_permutations.txt')

# If saved to Google Drive: just access it from drive.google.com

Scale Reference

Full permutations grow at n! (factorial) rate. Here's what to expect:

| `MAX_WORD_LEN` | Max perms/word | Est. total output |
|---|---|---|
| 5 | 120 | ~200 MB |
| 6 | 720 | ~1–2 GB |
| 7 | 5,040 | ~5–15 GB ← recommended start |
| 8 | 40,320 | ~50–150 GB |
| 9 | 362,880 | ~500 GB – 1 TB |
| 10 | 3,628,800 | ~5–50 TB ← impossible |

Start with MAX_WORD_LEN = 6 or 7, check the output size, then decide if you want to go higher. The script has a built-in safety check that aborts if the estimated size exceeds 70 GB.
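
A pre-flight estimate along these lines can be sketched as below. It is an upper bound under stated assumptions, not the script's actual check: one permutation per line plus a newline, n! permutations for an n-letter word, and a limit interpreted as 70 GiB. The names `estimate_output` and `SIZE_LIMIT_BYTES` are hypothetical.

```python
from math import factorial

SIZE_LIMIT_BYTES = 70 * 1024**3  # assumed reading of the 70 GB threshold


def estimate_output(words, max_word_len):
    """Upper-bound the line count and byte size of the permutation output.

    Assumes n! permutations for an n-letter word (repeated letters make the
    true count smaller) and n + 1 bytes per line (letters plus newline).
    """
    lines = 0
    size = 0
    for word in words:
        n = len(word)
        if n > max_word_len:
            continue  # skipped, exactly like words above MAX_WORD_LEN
        lines += factorial(n)
        size += factorial(n) * (n + 1)
    return lines, size


lines, size = estimate_output(["cat", "hello", "spelling"], max_word_len=7)
print(lines, size, size > SIZE_LIMIT_BYTES)  # 126 744 False
```

Running the same function over the full words.txt for a candidate MAX_WORD_LEN gives a worst-case figure to compare against available disk before starting a long run.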

Pro Tips for Colab

Keep the browser tab open — Colab disconnects if idle too long
Open the browser console (Ctrl+Shift+I → Console) and paste `setInterval(function(){document.querySelector("colab-connect-button").click()}, 60000)` to click the connect button every 60 seconds and prevent idle disconnects
For very large outputs, write directly to Google Drive so you don't lose data on disconnect
CPU-only is fine for this script: permutation generation is CPU-bound and does not use the GPU
