A Bloom filter is a data structure that may report it contains an item that it does not (a false positive), but is guaranteed to report correctly if it contains the item (“no false negatives”). The opposite of a Bloom filter is a data structure that may report a false negative, but can never report a false positive. That is, it may claim that it has not seen an item when it has, but will never claim to have seen an item it has not.
A colleague, Jeff Smick, had need of “the opposite of a Bloom filter”, a while back, when dealing with a data stream that was pumping hundreds of thousands of events per second. First, I’ll discuss the motivation, then we’ll dig into the implementation, and, finally, talk about alternatives.
Along this data stream comes items with a user id, and a few other longs and ints that identify the event completely. These items can be repeated many times and their repetitions are, generally, close together in time. For each unique item, at least one copy must be written to a datastore per hour but multiple duplicate writes are allowed.
This once-per-hour-per-item constraint happens to be very important but the writes were so many that filtering out duplicates became an important scaling concern.
In order to reduce the number of writes, Smick placed a Bloom filter in the stream processors, and was able to get the large reduction expected. However, the code was complicated by the need to switch it out every hour, and, more importantly, the Bloom filter would drop every instance of a unique item if it collided with another, causing the switch to never be flipped during the course of an hour.
Given these constraints, and the inability to hold the entire corpus of items in memory over the whole hour, I pitched him on an alternative design using the opposite of a Bloom filter.
The simplest data structure that fits the description of a “opposite of a Bloom filter” is a hash map. With a large corpus like the one we were dealing with, the memory consumed by a hash map is far too large and its growth is bound only by corpus size. However, from this simple idea a more fruitful one comes out.
The design I pitched uses an array, a hash function, and the ability to swap ids in and out of the array concurrently. The hash function and the array are used in the same way they are inside of a hash map implementation. Unlike a typical hash map implementation, the array size is bounded and hash collisions are not avoided.
The implementation I described used an AtomicReferenceArray internally. When an item’s byte array id is passed to
containsAndAdd, it is hashed to an index in the AtomicReferenceArray. Then using the atomic
getAndSet it swaps the byte array currently at that index with the one being queried for.
The old id at that index is compared to the new id just added. If they differ,
containsAndAdd returns false and the process should write the event. If the ids are the same,
containsAndAdd returns true and the process can safely drop the event on the ground. Occasionally, two byte arrays may hash to the same index. When this happens,
containsAndAdd will return false and a duplicate write will be emitted. However, since duplicate items come close together in time, this is relatively rare.
Note that there is no way for a false positive to be emitted from the filter. (This can take a second to see because the phrases “false positive”, “false negative”, “contains” and “filter” have a bunch of negations in their definitions that collide with the others.)
This filter implementation is a thread-safe “opposite of a Bloom filter” with a nice bound on its memory usage.
For reference, I’ve posted an implementation of this filter to github that is similar to what is being used in production that consumes byte arrays in Java and Go. I’ve discussed the Java code in this post, but the design in each is the same. Smick used a different implementation with a lock per index, but the design remained the same.
An alternative that some might propose is to use a cache object with expiration. Guava has a nice way of creating those, but the underlying hash map implementation will use 2.5x or more memory. And the maximum size setting means that items can be thrown out before we’d like them to be. All that said, using a expiring cache implementation is perfectly suitable for smaller datasets.
Further, other implementations could take advantage of loss-less compression to reduce garbage collection pressure. If your distribution of the ids is compact and orderable, tricks could be played with keeping only the intervals between ids and mapping them to the array with a simpler hash function that maintains their order.
And we’re done
 Complicated further by the timestamp in the event item itself that was the source of truth for when the event had occurred, instead of the time of arrival in the server processing it.
 Reproducing this code for another object type is trivial. Same if you don’t want to the guava dependency for the murmur hash function.