June 9th, 2008


A Spellchecker Used to Be a Major Feat of Software Engineering

Here's the situation: it's 1984, and you're assigned to write the spellchecker for a new MS-DOS word processor. Some users, but not many, will have 640K of memory in their PCs. You need to support systems with as little as 256K. That a quarter megabyte to contain the word processor, the document being edited, and the memory needed by the operating system. Oh, and the spellchecker. (For reference, on my MacBook, the standard dictionary in /usr/share/dict/words is 2,486,813 bytes and contains 234,936 words.)

Even if we could represent each word in the spellchecker dictionary as a single byte, we need almost all the full 256K just for that, and of course the single byte representation isn't going to work. So not only does keeping the whole dictionary in RAM look hopeless, but so does keeping the actual dictionary on disk with only an index in RAM.

Now it gets messy. We could try taking a subset of the dictionary, one containing the most common words, and heavily compressing that so it fits in memory. Then we come up with a slower, disk-based mechanism for looking up the rest of the words. Or maybe we jump directly to a completely disk-based solution using a custom database of sorts (remembering, too, that we can't assume the user has a hard disk, so the dictionary still needs to be crunched onto a 360K floppy disk).