Misspelling Generator

A misspelling generator built on the roughly 466k-word list words.txt from the data provider. For demonstration, generation is limited to words of at most 7 letters, which produces:

  Words processed  : 125,414
  Lines written    : 173,110,626
  Output file      : misspellings_permutations.txt
  File size        : 2.53 GB

Depending on your storage, you can raise the letter limit by changing MAX_WORD_LEN in the script.

  • Use generate_permutations_colab.py or google_collab_173MSW.ipynb to generate 173,110,626 misspellings from 125,414 processed words; the output is downloadable from Hugging Face
  • Use generate_typos_colab.py or google_collab_263MSW.ipynb to generate 26,636,990 misspellings from 415,701 processed words; the output is downloadable from Hugging Face




Option 1: use generate_typos_local.py (Run Locally)

Generates realistic typo variants using 4 strategies:

| Strategy | Example (hello) | Variants |
|---|---|---|
| Adjacent swap | hlelo, helol | n−1 per word |
| Char deletion | hllo, helo, hell | n per word |
| Char duplication | hhello, heello | n per word |
| Keyboard proximity | gello, jello, hwllo | varies |

  • Processes only pure-alpha words with length ≥ 3
  • Produces roughly 10–50 typos per word → ~5M–20M lines total
  • Output: data/misspellings.txt in misspelling=correction format
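
The four strategies above can be sketched roughly as follows. This is a minimal illustration, not the code in generate_typos_local.py: the function names and the tiny KEY_NEIGHBORS map are hypothetical, and a real QWERTY map would cover every key.

```python
from typing import Dict, Iterator, Set

# Hypothetical, heavily truncated QWERTY neighbor map (assumption for this sketch).
KEY_NEIGHBORS: Dict[str, str] = {
    "h": "gj",
    "e": "wr",
    "l": "k",
    "o": "ip",
}


def adjacent_swaps(word: str) -> Iterator[str]:
    """Swap each pair of neighboring characters (n-1 candidates)."""
    for i in range(len(word) - 1):
        yield word[:i] + word[i + 1] + word[i] + word[i + 2:]


def deletions(word: str) -> Iterator[str]:
    """Drop the character at each position (n candidates)."""
    for i in range(len(word)):
        yield word[:i] + word[i + 1:]


def duplications(word: str) -> Iterator[str]:
    """Double the character at each position (n candidates)."""
    for i in range(len(word)):
        yield word[:i + 1] + word[i] + word[i + 1:]


def keyboard_slips(word: str) -> Iterator[str]:
    """Replace each character with each of its mapped keyboard neighbors."""
    for i, ch in enumerate(word):
        for near in KEY_NEIGHBORS.get(ch, ""):
            yield word[:i] + near + word[i + 1:]


def typos(word: str) -> Set[str]:
    """Return distinct typo variants, or an empty set for filtered-out words."""
    if not (word.isascii() and word.isalpha()) or len(word) < 3:
        return set()
    variants: Set[str] = set()
    for strategy in (adjacent_swaps, deletions, duplications, keyboard_slips):
        variants.update(strategy(word))
    variants.discard(word)  # swapping the two l's in "hello" changes nothing
    return variants


# Emit lines in the misspelling=correction format described above.
for typo in sorted(typos("hello")):
    print(f"{typo}=hello")
```

Collecting variants in a set removes duplicates that different strategies produce for the same word, which is one reason the per-word count varies.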

To run:

python generate_typos_local.py

Option 2: use generate_permutations_colab.py (Google Colab)

Generates ALL letter permutations of each word. Key config at the top of the file:

MAX_WORD_LEN = 7   # ← CRITICAL control knob
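
As a rough sketch of what this knob controls, the snippet below enumerates the permutations of one word and skips anything longer than the limit. The function name distinct_permutations is hypothetical, and the choices to drop duplicates and the original spelling are assumptions; generate_permutations_colab.py may behave differently.

```python
from itertools import permutations
from typing import Iterator

MAX_WORD_LEN = 7  # same name as the knob in the script


def distinct_permutations(word: str, max_len: int = MAX_WORD_LEN) -> Iterator[str]:
    """Yield each distinct rearrangement of `word`, excluding the word itself."""
    if len(word) > max_len:
        return  # skipped: n! candidates grow too fast beyond the limit
    seen = {word}
    for letters in permutations(word):
        candidate = "".join(letters)
        if candidate not in seen:
            seen.add(candidate)
            yield candidate


print(sorted(distinct_permutations("abc")))
```

Words with repeated letters yield fewer than n! lines here because identical rearrangements are emitted only once.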

Google Colab Education

What Is Google Colab?

Google Colab gives you a free Linux VM with Python pre-installed. You get:

| Resource | Free Tier | Colab Pro ($12/mo) |
|---|---|---|
| Disk | ~78 GB (temporary) | ~225 GB (temporary) |
| RAM | ~12 GB | ~25-50 GB |
| GPU | T4 (limited) | A100/V100 |
| Runtime limit | ~12 hours, then VM resets | ~24 hours |
| Google Drive | 15 GB (persistent) | 15 GB (same) |

Colab disk is ephemeral: when the runtime disconnects, all files on the VM are deleted. Only Google Drive persists.

Step-by-Step: Running Option 2 on Colab

Step 1: Open Colab. Go to colab.research.google.com → New Notebook

Step 2: Upload words.txt

# Cell 1
from google.colab import files
uploaded = files.upload()   # select words.txt from your PC

Step 3: (Optional) Mount Google Drive for persistent storage

# Cell 2
from google.colab import drive
drive.mount('/content/drive')

# Then change OUTPUT_PATH in the script to:
# '/content/drive/MyDrive/misspellings_permutations.txt'
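
Instead of editing the path by hand, one option is to pick it at run time depending on whether Drive is mounted. This is only a sketch: choose_output_path is a hypothetical helper, not part of the script, and it assumes the mount point used in Cell 2.

```python
import os

DRIVE_DIR = "/content/drive/MyDrive"  # where drive.mount('/content/drive') exposes My Drive
FILENAME = "misspellings_permutations.txt"


def choose_output_path(drive_dir: str = DRIVE_DIR, filename: str = FILENAME) -> str:
    """Prefer Google Drive when it is mounted, else fall back to the temporary VM disk."""
    if os.path.isdir(drive_dir):
        return os.path.join(drive_dir, filename)
    return filename


OUTPUT_PATH = choose_output_path()
print("Writing to:", OUTPUT_PATH)
```

Keep in mind that the free Drive quota is 15 GB, so larger outputs will not fit there.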

Step 4: Paste & run the script. Copy the entire contents of generate_permutations_colab.py into a new cell. Adjust MAX_WORD_LEN as needed, then run.

Step 5: Download the result

# If saved to VM disk:
files.download('misspellings_permutations.txt')

# If saved to Google Drive: just access it from drive.google.com

Scale Reference

The number of full permutations grows factorially: an n-letter word has up to n! arrangements. Here's what to expect:

| MAX_WORD_LEN | Max perms/word | Est. total output | Note |
|---|---|---|---|
| 5 | 120 | ~200 MB | |
| 6 | 720 | ~1–2 GB | |
| 7 | 5,040 | ~5–15 GB | ← recommended start |
| 8 | 40,320 | ~50–150 GB | |
| 9 | 362,880 | ~500 GB – 1 TB | |
| 10 | 3,628,800 | ~5–50 TB | ← impossible |

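Start with MAX_WORD_LEN = 6 or 7, check the output size, then decide if you want to go higher. Treat the estimates as rough: the demonstration run at the top of this document used MAX_WORD_LEN = 7 and produced 2.53 GB. The script has a built-in safety check that aborts if the estimated size exceeds 70 GB.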
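
A back-of-the-envelope version of such a size check might look like the sketch below. It is not the script's actual check: the function names are hypothetical, and the estimate assumes n! lines per n-letter word, each written as misspelling=correction plus a newline (2n + 2 bytes). It therefore overestimates for words with repeated letters.

```python
from math import factorial
from typing import Iterable

SIZE_LIMIT_BYTES = 70 * 10**9  # the 70 GB threshold mentioned above


def estimate_output_bytes(words: Iterable[str], max_word_len: int) -> int:
    """Upper bound: n! lines per n-letter word, 2n + 2 bytes per line (assumed format)."""
    total = 0
    for word in words:
        n = len(word)
        if n <= max_word_len:
            total += factorial(n) * (2 * n + 2)
    return total


def check_size(words: Iterable[str], max_word_len: int, limit: int = SIZE_LIMIT_BYTES) -> int:
    """Return the estimate, or raise if it exceeds the limit."""
    estimate = estimate_output_bytes(words, max_word_len)
    if estimate > limit:
        raise RuntimeError(f"Estimated {estimate:,} bytes exceeds limit of {limit:,} bytes")
    return estimate


sample = ["cat", "hello", "spelling"]  # illustrative words, not taken from words.txt
print(f"{check_size(sample, max_word_len=7):,} bytes")
```

Running the estimate over the word list before generating anything is cheap, since it needs only word lengths, not the permutations themselves.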

Pro Tips for Colab

  • Keep the browser tab open: Colab disconnects if idle too long
  • Use Ctrl+Shift+I → Console and paste setInterval(function(){document.querySelector("colab-connect-button").click()}, 60000) to prevent idle disconnects
  • For very large outputs, write directly to Google Drive so you don't lose data on disconnect
  • CPU-only is fine for this script: permutation generation is CPU-bound and does not use the GPU