> At this time, Bloom filter was not even called Bloom filter. In his paper, Douglas calls it a “superimposed code scheme”.
A Bloom filter is a specific type of superimposed code.
Calvin Mooers developed random (1) superimposed coding in his Master's thesis at MIT back in the 1940s, directly influenced by Shannon's work.
Bourne's superb 1963 book "Methods of Information Handling" gives details of the mathematics.
I've no doubt Douglas knew about the broader technique, which, for example, the author of "The Large Data Base File Structure Dilemma" (1975) at http://dx.doi.org/10.1021/ci60001a005 described as "an old technique called super-imposed coding".
(1) The "random" is an important qualifier because there were superimposed codes predating Mooers, but they were not mathematically interesting or all that practically important.
You can write an external memory spell checker with a tiny amount of RAM: something like
- sort the words in the document
- collapse duplicates into unique words (they sort together)
- merge the sorted words with the sorted dictionary and keep only the missing words
I saw this in BASIC in Creative Computing and got it working on my TRS-80 Color Computer, which had much less than 32k of available RAM, so that was the first thing I thought of when I saw the headline.
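A rough sketch of those three steps, in Python for clarity. The filenames are made up, the dictionary is assumed to be one sorted, lowercase word per line, and the in-memory sort stands in for the external on-disk passes a small machine would actually need.

    # Sketch of the sort/dedup/merge spell check.
    import re

    def document_words(path):
        with open(path) as f:
            for line in f:
                yield from re.findall(r"[a-z]+", line.lower())

    # steps 1 and 2: sort the document's words and drop duplicates
    words = sorted(set(document_words("document.txt")))

    # step 3: merge against the sorted dictionary, keeping words it lacks
    # (the same job `comm -23` does in a Unix pipeline)
    missing = []
    with open("dictionary.txt") as dict_file:
        dict_words = (line.strip() for line in dict_file)
        d = next(dict_words, None)
        for w in words:
            while d is not None and d < w:
                d = next(dict_words, None)
            if d != w:
                missing.append(w)

    print("\n".join(missing))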
Now this blew people away when it came out:
https://winworldpc.com/product/turbo-lightning/1x
It had a compressed dictionary that fit in memory alongside the other programs you were running on a PC, and it spell-checked as you typed. The PC had a 640k limit, but the checker could only use a fraction of that so as not to interfere, and in the early days of the PC you couldn't actually afford to fill it all out anyway.
Worth noting that the article mentions this alternative as their first PoC and its drawbacks: "Because of its simplistic implementation, it was not very accurate, and also slow because of dictionary lookups on the disk."
I guess you really used the fact that most words are repeated to keep the byte count in check? On the old C=64 I had, it was a bit of a problem not to blow out the memory with just the text of the document once you started using it for more than a 1- or 2-page paper. Keeping a second, sorted copy seems almost luxurious.
I guess you could save the working copy to disk first, then do the sort, then compare, then reload the working copy. I think the C=64 developers probably avoided that strategy because the disk interface was so damn slow.
I may be wrong, but from "external memory" in the description I think the idea is that each of those steps can be done on disk, not RAM. An external merge sort is a pretty standard database primitive, and the other two only require purely sequential access, so are friendly to spinning disks and the like.
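For anyone who hasn't seen one, an external merge sort is roughly: sort runs that fit in memory, write each sorted run to disk, then do one sequential k-way merge over the runs. A sketch in Python; the run size and temp-file handling are just illustrative.

    # Sketch of an external merge sort: sort memory-sized runs, spill them to
    # temp files, then merge the runs with purely sequential I/O.
    import heapq, os, tempfile

    def external_sort(in_path, out_path, run_lines=100_000):
        run_paths = []
        with open(in_path) as f:
            while True:
                run = [line.rstrip("\n") + "\n"
                       for _, line in zip(range(run_lines), f)]
                if not run:
                    break
                run.sort()                      # this run fits in memory
                fd, path = tempfile.mkstemp(text=True)
                with os.fdopen(fd, "w") as out:
                    out.writelines(run)
                run_paths.append(path)

        runs = [open(p) for p in run_paths]     # k-way merge, sequential reads
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*runs))
        for r in runs:
            r.close()
        for p in run_paths:
            os.remove(p)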
Knuth's Sorting and Searching book talks a lot about that sort of algorithm.
The old IBM 360 had a 16MB address space at best, but in 1968 there were very few installations anywhere near filling it. This kind of tape
https://en.wikipedia.org/wiki/9-track_tape
could have a capacity of 80 MB or so, which is (1) much larger than RAM, and (2) mainly sequential (it could fast-forward for a bit and try to find a particular block, but it was pretty slow). Contrast this with the floppy drives in the 70-140 kB range that the 8-bit micros had. Thus there was a lot of literature on external memory algorithms that would work on tapes in the early days -- and similar methods are still of interest today for "big data": RAM is much faster when you access it sequentially, you want to minimize round trips in distributed systems, etc.
(It was funny, though, when I talked to computer scientists around 2004 and was told to forget about external memory algorithms because 'main memory' was the thing; people realized, for instance, that Google Maps could store all the tiles in RAM in half a rack of 1U servers, and if utilization was high that was much cheaper than serving the tiles off disk. Hadoop came along as a blast from the past, and was itself obsolete in just a few years.)
Link seems to be broken.
works4me
I can't remember the name of the product but in the 80s there was a hardware spell checker for the IBM PC. It was a box that connected between your keyboard and your PC, and if you ever typed a string of letters that it did not recognize as a dictionary word, it would beep to let you know.
Xerox PC Type Right
https://vintageapple.org/pcworld/pdf/PC_World_8711_November_... page 237 has a review (big PDF warning)
>You can even set the device to recognize your password and beep if you mistype it, thereby eliminating a bevy of incorrect sign-on messages.
Wow, what a thing for a tech publication to suggest doing! Different times.
Way too smart for worse is better. Think Worser!
The main memory bandwidth and the disk bandwidth were about the same, a little over 1 MB/s.
I would have done this in multiple passes (but still used the Bloom Filters, those are cool).
https://github.com/arnoldrobbins/v10spell
https://code.google.com/archive/p/unix-spell/
The original paper is great https://www.semanticscholar.org/paper/Development-of-a-Spell...
It is hosted on his web page https://www.cs.dartmouth.edu/~doug/
https://en.wikipedia.org/wiki/Douglas_McIlroy
If you are a word nerd, you will have found obovate and, from there, this chart.
https://upload.wikimedia.org/wikipedia/commons/e/e8/Leaf_mor...
One of the things that got me intrigued by Unix was an early 1980s(ish) Byte article which walked through building a (trivial example, not the "real" one) spell checker out of a split/sort/comm pipeline, something like 7 commands? 8-bit PCs didn't have anything like that, and yet it didn't look like it needed that much sophistication...
On a similar note, there is this period video where Brian Kernighan shows how to construct a spell checker using a UNIX shell one-liner:
https://youtu.be/tc4ROCJYbm0?t=4m56s
I wonder what common typos this misses thanks to the hashing?
Related, a contest about compressing the Wordle dictionary: http://golf.horse/wordle/
In the mid 80's I ran into something similar. Fast is relative.
I had a lot of data, 640KB of RAM, 64KB of heap, and 64KB of stack. I had hundreds of megabytes that I had to search, extract data from, and then combine some of it.
I experimented with a data index structured as ternary trees. Conceptually it made sense, but implementation-wise the relationships and paths were still too big to keep in 64KB.
Instead of compression, I did swapping. I wrote a TSR [0] (think: a resident service), a piece of code that would process a chunk of the data, extract the results, store them on the stack, and dump the original data. It would then make an interrupt call to the TSR, which in turn would destroy the heap, read in the next chunk from storage, and return control to the program, which would process it, combine it with the stack data, and continue until the entire job was finished.
Originally this process took about a week for three data entry persons (think about a dozen 3" ring binders filled with tables) and a specialist combining the information. The program completed the work in just a few hours. It was amazingly "fast".
This was on a single threaded system.
[0] https://en.wikipedia.org/wiki/Terminate-and-stay-resident_pr...
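In modern terms (ignoring the DOS/TSR and interrupt mechanics), the shape of that approach is roughly the sketch below: only one chunk of raw data is ever resident, and only the small extracted results accumulate across chunks. The extract/combine callables and file paths are hypothetical stand-ins, not the original code.

    # Hypothetical sketch of the chunk-at-a-time pattern described above.
    def process_in_chunks(chunk_paths, extract, combine):
        kept = []                          # plays the role of results parked on the stack
        for path in chunk_paths:
            with open(path, "rb") as f:
                raw = f.read()             # bring in one chunk of raw data
            kept.append(extract(raw))      # keep only the small extracted result
            del raw                        # "destroy the heap": drop the raw chunk
        return combine(kept)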
64kB was huge. The first version of UNIX needed 24kB (half of which was taken by the kernel).
Reading about this and similar techniques in Programming Pearls (Second Edition) by Jon Bentley left the younger me spellbound. Similar to the evolution of linkers up to mold.
How about 39kB for a video game with physics, dynamic graphics, two music tracks, sound effects, online high scores, and built-in instructions? https://news.ycombinator.com/item?id=38372936
The original Elite was 22k on tape running on a machine with 32k of RAM
Sizecoding is a thing. .kkrieger is a rather famous 96kB FPS game. There is even an entire demoparty called Lovebyte that is dedicated to it, the biggest category is 1k, but all demoscene events I can think of have a sizecoding competition of some kind.
And it is a completely different thing. In general, it is more about procedural generation and tricks than good packing. Runtime packers are used, like crinkler and kkrunchy, but they actually use a lot of RAM, like hundreds of MB, which is a bit surprising considering that the decompressed executable is in the tens of kB. That's because they use very powerful but slow compression algorithms.
Sizecoding usually doesn't care about RAM (unless the platform requires it); the only thing that matters is the size of the executable file and its data. For that 39kB Playdate game, I guess it's the same idea. The Playdate has 16MB of RAM, and I bet the game took full advantage of it.
Not the same when the machines in question are not on the same level...
For perspective, in 1983 or so, Grammatik on CP/M ran in under 64k and did "grammar checking" (spell checking, plus a bunch of expert system rules) on an 8-bit system. (It sticks in my memory because of the time spent poking at the really interesting part: that it was so compact because it was in Forth, and there was enough of an outer interpreter in the product that with a little hex editing you could just use it as a Forth interpreter - with a very specialized set of functions preloaded :-)
The Wordstar editor I run on my own CP/M system, with 64k of RAM, contains the 2023-byte long "SPELL.COM" spell-checker.
I've not decompiled it to see how it works, but it's small and fast, and works well.
This spell check must have come later. The original spell check burst the words of the document, and sort -u'd them:
makewords sentence | lowercase | sort | unique | mismatch
Marginally related... has anyone ever ported the "typo" program to modern C?
For reference, https://github.com/robpike/typo/blob/master/unix/typo.c
Also for reference, for those who aren't familiar with typo and don't want to read the C source code listed above (side-comment: citing the comment-free source code 'for reference' is hilarious. Thank you for the laugh :) )
https://github.com/robpike/typo/blob/master/typo.png
> But 27-bit hash codes were too big: with 2^15 words, they needed 2^15 * 27 bits of memory, while the PDP-11 had only 2^15 * 16 bits (64kB) of RAM—compression was essential.
I'm frustrated when people put this kind of typography on the web. HTML can do superscript.
Then you should enjoy this article, because nearly all the expressions are presented as proper mathematical equations, bar the few places where those expressions are pseudocode.
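For what it's worth, the arithmetic in the quoted line works out like this (a quick check, nothing more):

    # 2^15 words at 27 bits each vs. the PDP-11's 2^15 * 16 bits of RAM
    words = 2**15
    print(words * 27 // 8)     # 110592 bytes needed for raw 27-bit codes
    print(words * 16 // 8)     # 65536 bytes (64kB) actually available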
I had spelling checkers on the Apple ][ that ran in 48K!
Exactly. 64K only seems tiny to people who didn't use home computers in the 1980s.
Cool. Now move forward.
> Cool. Now move forward.
They are trying, but only seem to go backwards. See Windows, Android, iOS, Teams for examples.