Mneme: Scalable Duplicate Filtering Service »
Created at: 24.03.2011 19:44, source: igvita.com, tagged: Architecture ruby bloomfilter goliath
Detecting and dealing with duplicates is a common problem: sometimes we want to avoid performing an operation based on this knowledge, and at other times, like in a case of a database, we want may want to only permit an operation based on a hit in the filter (ex: skip disk access on a cache miss). How do we build a system to solve the problem? The solution will depend on the amount of data, frequency of access, maintenance overhead, language, and so on. There are many ways to solve this puzzle.
In fact, that is the problem - they are too many ways. Having reimplemented at least half a dozen solutions in various languages and with various characteristics at PostRank, we arrived at the following requirements: we want a system that is able to scale to hundreds of millions of keys, we want it to be as space efficient as possible, have minimal maintenance, provide low latency access, and impose no language barriers. The tradeoff: we will accept a certain (customizable) degree of error, and we will not persist the keys forever.
Mneme: Duplicate filter & detection
Mneme is an HTTP web-service for recording and identifying previously seen records - aka, duplicate detection. To achieve the above requirements, it is implemented via a collection of bloomfilters. Each bloomfilter is responsible for efficiently storing the set membership information about a particular key for a defined period of time. Need to filter your keys for the trailing 24 hours? Mneme can create and automatically rotate 24 hourly filters on your behalf - no maintenance required.

By using a bloomfilter as the underlying datastore we can efficiently test for set membership while minimizing the amount of memory (review the background on bloomfilters here). Instead of storing the actual keys we can instead define the exact number of bits we will store per key by trading off against an acceptable error rate - the fewer bits, the higher the probability of a false positive.
Mneme: Goliath, Redis, HTTP and JSON
To meet all of our original criteria, Mneme is powered by a few components under the hood: Goliath web server, Redis for in-memory storage of the filters, and HTTP + JSON. The bloomfilter is stored inside of a bitstring in Redis, which allows us to easily query and share the same dataset between multiple services, and Goliath provides the high-performance HTTP frontend which abstracts all of the filter logic behind simple GET and POST operations:
$> curl "http://mneme:9000?key=abcd"
{"found":[],"missing":["abcd"]}# -d creates a POST request with key=abcd, aka insert into filter
$> curl "http://mneme:9000?key=abcd" -d' '$> curl "http://mneme:9000?key=abcd"
{"found":["abcd"],"missing":[]}
That's it. You can query for a single or multiple keys via a simple GET operation, and you can insert keys into the filter via a POST operation. The rest is taken care of by Mneme.
Configuring & Launching Mneme
Launching the service is as simple as installing the gem and starting up Redis:
# temporary: install my fork of redis-rb for the asynchronous driver
$> git clone git://github.com/igrigorik/redis-rb.git && cd redis-rb && rake install$> redis-server
$> gem install mneme
$> mneme -p 9000 -sv -c config.rb # run with -h to see all options
Once the gem is installed you will have a mneme executable which will start the Goliath web server. To launch the service, we just need to provide a configuration file to specify the length of time to store the data for, the granularity and the error rate:
config['namespace'] = 'default' # namespace for your app (if you're sharing a redis instance) config['periods'] = 3 # number of periods to store data for config['length'] = 60 # length of a period in seconds (length = 60, periods = 3.. 180s worth of data) config['size'] = 1000 # expected number of keys per period config['bits'] = 10 # number of bits allocated per key config['hashes'] = 7 # number of times each key will be hashed config['seed'] = 30 # seed value for the hash function
mneme.git (HTTP web-service for recording and identifying previously seen records)
Downloads: 14 File Size: 0.0 KB
Once again, to figure out your size, bits, and hashes, revisit the math behind bloomfilters. With the config file in place, you can launch the service and you are ready to go.
Mneme Performance & Memory Requirements
The cost of storing a new key in Mneme is O(num hashes), or O(1), since we only need to update the most recent filter. Similarly, retrieving a key is in the worst case O(num filters * num hashes), or once again, O(1). In both cases the performance is not dependent on number of keys, but rather on the time period and the error rate!
What about the memory footprint? Each filter is stored inside of a Redis bitstring, and GETBIT/SETBIT operations are used under the hood to perform the required operations. Redis does add some overhead to our bitstring, but a quick test shows that for a 1.0% error rate and 1M items, we need just about 2.5mb of memory! For more numbers, check the Mneme readme.
Simple to deploy, minimal maintenance, high performance and no language barriers. Give it a try, and hopefully it will help you avoid having to reimplement yet another duplicate filter solution!
more »
Flow Analysis & Time-based Bloom Filters »
Created at: 06.01.2010 19:45, source: igvita.com, tagged: Architecture bigdata bloomfilter flow
Working with large streams of data is becoming increasingly widespread, be it for log, user behavior, or raw firehose analysis of user generated content. There is some very interesting academic literature on this type of data crunching, although much of it is focused on query or network packet analysis and is often not directly applicable to the type of data we have to deal with in the social web. For example, if you were tasked to build (a better) "Trending Topics" algorithm for Twitter, how would you do it?
Of course, the challenge is that it has to be practical - it needs to be "real-time" and be able to react to emerging trends in under a minute, all the while using a reasonable amount of CPU and memory. Now, we don't know how the actual system is implemented at Twitter, nor will we look at any specific solutions - I have some ideas, but I am more curious to hear how you would approach it. Instead, I want to revisit the concept of Bloom Filters, because as I am making my way through the literature, it is surprising how sparsely they are employed for these types of tasks. Specifically, a concept I have been thinking of prototyping for some time now: time-based, counting bloom filters!
Bloom Filters: What & Why
A Bloom Filter is a probabilistic data structure which can tell if an element is a member of a set. However, the reason it is interesting is because it accomplishes this task with an incredibly efficient use of memory: instead of storing a full hash map, it is simply a bit vector which guarantees that you may have some small fraction of false positives (the filter will report that a key is in the bloom filter when it is really not), but it will never report a false negative. File system and web caches frequently use bloom filters as the first query to avoid otherwise costly database or file system lookups. There is some math involved in determining the right parameters for your bloom filter, which you can read about in an earlier post.

Of course, as is, the Bloom Filter data structure is not very useful for analyzing continuous data streams - eventually we would fill up the filter and it would begin reporting false positives all the time. But, what if your bloom filter only remembered seen data for a fixed interval of time? Imagine adding time-to-live (TTL) timestamp on each record. All of the sudden, if you knew the approximate number of messages for the interval of time you wanted to analyze, then a bloom filter is once again an incredibly fast and space-efficient (fixed memory footprint) data structure!
Time-based Bloom Filters
Arguably the key feature of bloom filters is their compact representation as a bit vector. By associating a timestamp with each record, the size of the filter immediately expands by an order of magnitude, but even with that, depending on the size of the time window you are analyzing, you could store the TTL's in just a few additional bits. Conversely, if counting bits is not mission critical, you could even used a backend such as Redis or Memcached to drive the filter as well. The direct benefit of such approach is that the data can be shared by many distributed processes. On that note, I have added a prototype Redis backend to the bloomfilter gem which implements a time-based, counting Bloom Filter. Let's take a look at a simple example:
require 'bloomfilter' options = { :size => 100, # size of bit vector :hashes => 4, # number of hash functions :seed => rand(100), # seed value for the filter :bucket => 3 # number of bits for the counting filter } # Regular, in-memory counting bloom filter bf = BloomFilter.new(options) bf.insert("mykey") bf.include?("mykey") # => true bf.include?("mykey1") # => false # # Redis-backed bloom filter, with optional time-based semantics # bf = BloomFilter.new(options.merge({:type => :redis, :ttl => 2, :server => {:host => 'localhost'}})) bf.insert("mykey") bf.include?("mykey") # => true sleep(3) bf.include?("mykey") # => false # custom 5s TTL for a key bf.insert("newkey", nil, 5)
Storing data in Redis or Memcached is roughly an order of magnitude less efficient, but it gives us an easy to use, distributed, and fixed memory filter for analyzing continuous data streams. In other words, a useful tool for applications such as duplicate detection, trends analysis, and many others.
Mechanics of Time-Based Bloom Filters
So how does it work? Given the settings above, we create a fixed memory vector of 100 buckets (or bits in raw C implementation). Then, for each key, we hash it 4 times with different key offsets and increment the counts in those buckets - a non-negative value indicates that one of the hash functions for some key has used that bucket. Then, for a lookup, we reverse the operation: generate the 4 different hash keys and look them up, if all of them are non-zero then either we have seen this key or there has been a collision (false positive). By optimizing the size of the bit vector we can control the false positive rate - you're always trading the of amount of allocated memory vs. collision rate. Finally, we also make use of the native expire functionality in Redis to guarantee that keys are only stored for a bounded amount of time.
Time-based bloom filters have seen a few rogue mentions in the academic literature, but to the best of my knowledge, have not seen wide applications in the real world. However, it is an incredibly powerful data structure, and one that could benefit many modern, big-data applications. Gem install the bloomfilter gem and give it a try, perhaps it will help you build a better trends analysis tool. Speaking of which, what other tools, algorithms, or data structures would you use to build a "Trending Topics" algorithm for a high-velocity stream?
more »
