Misspelling Generator
A misspelled-word generator built on the 466k-word list words.txt supplied by the data provider. For demonstration, we only permute words of up to 7 letters (MAX_WORD_LEN = 7), which produces:
Words processed : 125,414
Lines written : 173,110,626
Output file : misspellings_permutations.txt
File size : 2.53 GB
Depending on your available storage, you can raise the letter-count limit by changing MAX_WORD_LEN in the script.
- Use `generate_permutations_colab.py` or `google_collab_173MSW.ipynb` to generate 173,110,626 misspelled words from 125,414 processed words; the output is downloadable from Hugging Face.
- Use `generate_typos_colab.py` or `google_collab_263MSW.ipynb` to generate 26,636,990 misspelled words from 415,701 processed words; the output is downloadable from Hugging Face.
Option 1: use `generate_typos_local.py` (Run Locally)
Generates realistic typo variants using 4 strategies:
| Strategy | Example (hello) | Variants |
|---|---|---|
| Adjacent swap | hlelo, helol | n-1 per word |
| Char deletion | hllo, helo, hell | n per word |
| Char duplication | hhello, heello | n per word |
| Keyboard proximity | gello, jello, hwllo | varies |
- Processes only pure-alpha words with length ≥ 3
- Produces roughly 10-50 typos per word, or about 5M-20M lines in total
- Output: `data/misspellings.txt` in `misspelling=correction` format
To run:
python generate_typos_local.py
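The four strategies in the table above can be sketched roughly as follows. This is an illustration, not the contents of `generate_typos_local.py`: the function names, the tiny `QWERTY_NEIGHBORS` map, and the ASCII-only filter are all assumptions made for the example.

```python
# Hypothetical, deliberately tiny neighbor map (a real QWERTY map covers every key).
QWERTY_NEIGHBORS = {"h": "gj", "e": "wr", "l": "k", "o": "ip"}


def adjacent_swaps(word):
    """Swap each pair of neighboring characters: n-1 candidates."""
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)]


def deletions(word):
    """Drop one character at each position: n candidates."""
    return [word[:i] + word[i + 1:] for i in range(len(word))]


def duplications(word):
    """Double one character at each position: n candidates."""
    return [word[:i] + word[i] + word[i:] for i in range(len(word))]


def keyboard_slips(word, neighbors=QWERTY_NEIGHBORS):
    """Replace a character with a key next to it; the count varies per word."""
    return [word[:i] + near + word[i + 1:]
            for i, ch in enumerate(word)
            for near in neighbors.get(ch, "")]


def generate_typos(word):
    """Return the distinct typo variants of a word, excluding the word itself."""
    # Assumed reading of "pure-alpha": ASCII letters only, length >= 3.
    if not (word.isascii() and word.isalpha()) or len(word) < 3:
        return set()
    w = word.lower()
    typos = set(adjacent_swaps(w)) | set(deletions(w))
    typos |= set(duplications(w)) | set(keyboard_slips(w))
    typos.discard(w)  # swapping two identical letters reproduces the word
    return typos


def to_lines(word):
    """Format variants as misspelling=correction lines."""
    return [f"{typo}={word}" for typo in sorted(generate_typos(word))]
```

Calling `to_lines("hello")` returns lines such as `hlelo=hello`. Collecting variants in a set removes duplicates, for example the two identical deletions of the double `l`.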
Option 2: use `generate_permutations_colab.py` (Google Colab)
Generates ALL letter permutations of each word. Key config at the top of the file:
MAX_WORD_LEN = 7  # CRITICAL control knob
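As a rough illustration of what "all letter permutations" means for a single word, here is a minimal sketch built on `itertools.permutations`. The function name is hypothetical, and both the de-duplication and the exclusion of the original word are assumptions; the actual script may handle these differently.

```python
from itertools import permutations

MAX_WORD_LEN = 7  # same knob as in the script; words longer than this are skipped


def permutation_misspellings(word, max_len=MAX_WORD_LEN):
    """Return the distinct letter permutations of word, minus the word itself."""
    if not word.isalpha() or len(word) > max_len:
        return set()
    # A set collapses duplicates caused by repeated letters (e.g. "aab").
    variants = {"".join(p) for p in permutations(word)}
    variants.discard(word)
    return variants
```

For a word with n distinct letters this yields n! - 1 variants, which is why `MAX_WORD_LEN` has such a large effect on output size.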
Google Colab Education
What Is Google Colab?
Google Colab gives you a free Linux VM with Python pre-installed. You get:
| Resource | Free Tier | Colab Pro ($12/mo) |
|---|---|---|
| Disk | ~78 GB (temporary) | ~225 GB (temporary) |
| RAM | ~12 GB | ~25-50 GB |
| GPU | T4 (limited) | A100/V100 |
| Runtime limit | ~12 hours, then VM resets | ~24 hours |
| Google Drive | 15 GB (persistent) | 15 GB (same) |
Colab disk is ephemeral: when the runtime disconnects, all files on the VM are deleted. Only Google Drive persists.
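Since the VM disk is both temporary and limited, it can help to check free space before starting a large run. A minimal sketch using only the standard library; the helper names and the 20% headroom factor are assumptions for illustration and are not part of the script.

```python
import shutil


def free_gb(path="."):
    """Free space, in decimal gigabytes, on the disk that holds path."""
    return shutil.disk_usage(path).free / 10**9


def has_room(required_gb, path=".", headroom=1.2):
    """True if the disk can hold required_gb plus an assumed 20% safety margin."""
    return free_gb(path) >= required_gb * headroom
```

In a Colab cell, `has_room(15, "/content")` would check the VM disk, and passing a path under `/content/drive` would check the mounted Google Drive instead.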
Step-by-Step: Running Option 2 on Colab
Step 1: Open Colab
Go to colab.research.google.com → New Notebook
Step 2: Upload words.txt
# Cell 1
from google.colab import files
uploaded = files.upload() # select words.txt from your PC
Step 3: (Optional) Mount Google Drive for persistent storage
# Cell 2
from google.colab import drive
drive.mount('/content/drive')
# Then change OUTPUT_PATH in the script to:
# '/content/drive/MyDrive/misspellings_permutations.txt'
Step 4: Paste & run the script
Copy the entire contents of generate_permutations_colab.py into a new cell. Adjust MAX_WORD_LEN as needed, then run.
Step 5: Download the result
# If saved to VM disk:
files.download('misspellings_permutations.txt')
# If saved to Google Drive: just access it from drive.google.com
Scale Reference
Full permutations grow at n! (factorial) rate. Here's what to expect:
| MAX_WORD_LEN | Max perms/word | Est. total output |
|---|---|---|
| 5 | 120 | ~200 MB |
| 6 | 720 | ~1-2 GB |
| 7 | 5,040 | ~5-15 GB (recommended start) |
| 8 | 40,320 | ~50-150 GB |
| 9 | 362,880 | ~500 GB to 1 TB |
| 10 | 3,628,800 | ~5-50 TB (impossible) |
Start with `MAX_WORD_LEN = 6` or `7`, check the output size, then decide if you want to go higher. The script has a built-in safety check that aborts if the estimated size exceeds 70 GB.
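The factorial growth, and a size estimate in the spirit of the safety check, can be sketched as below. This is not the script's own check. The `permutation=word` line format, the one-byte-per-character assumption, and the decimal reading of 70 GB are all assumptions made for the example.

```python
from collections import Counter
from math import factorial

SIZE_LIMIT_BYTES = 70 * 10**9  # 70 GB threshold from the text, read as decimal GB


def distinct_permutations(word):
    """n! divided by k! for every letter that occurs k times."""
    total = factorial(len(word))
    for count in Counter(word).values():
        total //= factorial(count)
    return total


def estimate_output_bytes(words, max_len):
    """Estimate file size, assuming one 'permutation=word' line per variant."""
    total = 0
    for w in words:
        if len(w) <= max_len:
            line_bytes = 2 * len(w) + 2  # permutation + '=' + word + newline
            total += distinct_permutations(w) * line_bytes
    return total


def exceeds_limit(words, max_len, limit=SIZE_LIMIT_BYTES):
    """True if the estimated output would be larger than the limit."""
    return estimate_output_bytes(words, max_len) > limit
```

Repeated letters lower the count well below n!: `distinct_permutations("hello")` is 5!/2! = 60, not 120. This is one reason real output can come in under the "Max perms/word" column.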
Pro Tips for Colab
- Keep the browser tab open: Colab disconnects if idle too long
- Use Ctrl+Shift+I → Console and paste `setInterval(function(){document.querySelector("colab-connect-button").click()}, 60000)` to prevent idle disconnects
- For very large outputs, write directly to Google Drive so you don't lose data on disconnect
- CPU-only is fine for this script: permutation generation is CPU-bound and does not use the GPU