0-60: Deploying Goliath on Heroku Cedar »
Created at: 02.06.2011 20:02, source: igvita.com, tagged: ruby goliath heroku
Earlier this week Heroku rolled out a major upgrade to their webstack. The HTTP/1.1 support and the billing upgrades are both great improvements, but the "process model" definitely takes the crown: all of a sudden Heroku is much more than a Ruby+Rack hosting platform. Prior to this release, Heroku hosted Ruby-only apps on top of the Thin web-server. Each "dyno" represented an instance of Thin, and as long as you deployed a Rack-compatible app, Heroku would spin it up and run it for you. This was great, but also somewhat limiting - for one, you couldn't run non-Rack apps!
The process model changes all that. You can now run any application within their cloud, with the help of a simple Procfile. In fact, it doesn't even have to be Ruby! Node, Clojure apps, long-running Ruby workers, it is all fair game.
Goliath on Heroku Cedar
Goliath is an async web server and framework we developed at PostRank. While it does follow the Rack spec, it is also substantially different: it requires a Ruby 1.9 runtime for Fiber support, and it uses an entirely different HTTP parser from Thin. Combined, these differences meant that up until now, deploying Goliath on any existing cloud environment was a no-go. However, with the new cedar stack, this is no longer an issue:
The example is a simple "Hello World HTTP proxy": our API parses an incoming request, dispatches an async HTTP request, and returns the results back to the user - no callbacks! We specify our libraries in a Gemfile, and we provide a Procfile with the startup parameters for our API. Believe it or not, that is all we need. Let's deploy it:
git init .
git add * && git commit -a -m 'hello world'
heroku create --stack cedar goliath-demo
git push heroku master
curl http://goliath-demo.herokuapp.com/
Running a simple benchmark and checking the logs ("heroku logs" on command prompt) shows that an average request to our new Goliath API takes approximately 0.35ms (2500+ req/s). Not enough to handle the load? Simply start another API: heroku scale web=2.
Goliath, Heroku & the Cloud
Combine easy deployment, support for HTTP/1.1 and the async nature of Goliath, and all of a sudden you can develop and deploy simple streaming APIs and async API endpoints with minimal effort! If you can't tell, this is definitely something I am very excited about - a huge step for Heroku! Now, we just need to wait for a response from the CloudFoundry team - this should get interesting.
Mneme: Scalable Duplicate Filtering Service »
Created at: 24.03.2011 19:44, source: igvita.com, tagged: Architecture ruby bloomfilter goliath
Detecting and dealing with duplicates is a common problem: sometimes we want to avoid performing an operation based on this knowledge, and at other times, as in the case of a database, we may want to permit an operation only on a hit in the filter (ex: skip disk access on a cache miss). How do we build a system to solve the problem? The solution will depend on the amount of data, frequency of access, maintenance overhead, language, and so on. There are many ways to solve this puzzle.
In fact, that is the problem - there are too many ways. Having reimplemented at least half a dozen solutions in various languages and with various characteristics at PostRank, we arrived at the following requirements: we want a system that is able to scale to hundreds of millions of keys, we want it to be as space efficient as possible, have minimal maintenance, provide low latency access, and impose no language barriers. The tradeoff: we will accept a certain (customizable) degree of error, and we will not persist the keys forever.
Mneme: Duplicate filter & detection
Mneme is an HTTP web-service for recording and identifying previously seen records - aka, duplicate detection. To achieve the above requirements, it is implemented via a collection of bloomfilters. Each bloomfilter is responsible for efficiently storing the set membership information about a particular key for a defined period of time. Need to filter your keys for the trailing 24 hours? Mneme can create and automatically rotate 24 hourly filters on your behalf - no maintenance required.

By using a bloomfilter as the underlying datastore we can efficiently test for set membership while minimizing the amount of memory (see the background on bloomfilters). Rather than storing the actual keys, we define the exact number of bits to store per key, trading it off against an acceptable error rate - the fewer bits, the higher the probability of a false positive.
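To make the rotation idea concrete, here is a minimal in-memory sketch. It is not Mneme's implementation: Mneme keeps its bits in Redis, and the class names (ToyBloomFilter, RotatingFilter), the MD5-based hashing, and the explicit rotate! call are all assumptions made for illustration. Inserts go to the newest filter, lookups check every filter in the window, and rotating drops the oldest one.

```ruby
require 'digest'

# A single in-memory bloom filter (hypothetical; Mneme stores its bits in Redis).
class ToyBloomFilter
  def initialize(size:, bits:, hashes:, seed: 0)
    @m = size * bits                 # total number of bits in the filter
    @hashes = hashes
    @seed = seed
    @bitmap = Array.new(@m, false)
  end

  def insert(key)
    positions(key).each { |i| @bitmap[i] = true }
  end

  def include?(key)
    positions(key).all? { |i| @bitmap[i] }
  end

  private

  # Derive one bit position per hash by salting the key differently each time.
  def positions(key)
    (0...@hashes).map do |i|
      Digest::MD5.hexdigest("#{@seed}:#{i}:#{key}").to_i(16) % @m
    end
  end
end

# A sliding window of filters, newest first.
class RotatingFilter
  def initialize(periods:, **filter_opts)
    @periods = periods
    @filter_opts = filter_opts
    @filters = [ToyBloomFilter.new(**filter_opts)]
  end

  # Only the most recent filter is written to.
  def insert(key)
    @filters.first.insert(key)
  end

  # A key counts as seen if any filter in the window reports it.
  def include?(key)
    @filters.any? { |f| f.include?(key) }
  end

  # Call once per period: add a fresh filter and drop any beyond the window.
  def rotate!
    @filters.unshift(ToyBloomFilter.new(**@filter_opts))
    @filters.pop while @filters.size > @periods
  end
end

filter = RotatingFilter.new(periods: 3, size: 1000, bits: 10, hashes: 7, seed: 30)
filter.insert('abcd')
puts filter.include?('abcd')
```

With periods set to 3, a key stays visible through two rotations and disappears on the third, which is the "trailing window" behaviour described above.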
Mneme: Goliath, Redis, HTTP and JSON
To meet all of our original criteria, Mneme is powered by a few components under the hood: Goliath web server, Redis for in-memory storage of the filters, and HTTP + JSON. The bloomfilter is stored inside of a bitstring in Redis, which allows us to easily query and share the same dataset between multiple services, and Goliath provides the high-performance HTTP frontend which abstracts all of the filter logic behind simple GET and POST operations:
$> curl "http://mneme:9000?key=abcd"
{"found":[],"missing":["abcd"]}
# -d creates a POST request with key=abcd, aka insert into filter
$> curl "http://mneme:9000?key=abcd" -d' '
$> curl "http://mneme:9000?key=abcd"
{"found":["abcd"],"missing":[]}
That's it. You can query for a single or multiple keys via a simple GET operation, and you can insert keys into the filter via a POST operation. The rest is taken care of by Mneme.
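Since the interface is plain HTTP and JSON, a client needs nothing beyond a standard library. The following is a rough Ruby sketch, not an official client: the helper names (mneme_uri, seen_in?, seen?, record) are hypothetical, the host and port are copied from the curl examples, and sending the insert as an empty form POST is an assumption about what the service accepts.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Host and port taken from the curl examples; adjust for your deployment.
MNEME_BASE = URI('http://mneme:9000/')

# Build the request URI for a single key.
def mneme_uri(key)
  uri = MNEME_BASE.dup
  uri.query = URI.encode_www_form(key: key)
  uri
end

# Interpret a response body of the form {"found":[...],"missing":[...]}.
def seen_in?(body, key)
  JSON.parse(body)['found'].include?(key)
end

# GET: has this key been recorded already? (performs a network call)
def seen?(key)
  seen_in?(Net::HTTP.get(mneme_uri(key)), key)
end

# POST: record the key (assumes an empty form body is acceptable).
def record(key)
  Net::HTTP.post_form(mneme_uri(key), {})
end

puts mneme_uri('abcd')
```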
Configuring & Launching Mneme
Launching the service is as simple as installing the gem and starting up Redis:
# temporary: install my fork of redis-rb for the asynchronous driver
$> git clone git://github.com/igrigorik/redis-rb.git && cd redis-rb && rake install
$> redis-server
$> gem install mneme
$> mneme -p 9000 -sv -c config.rb # run with -h to see all options
Once the gem is installed you will have a mneme executable which will start the Goliath web server. To launch the service, we just need to provide a configuration file to specify the length of time to store the data for, the granularity and the error rate:
config['namespace'] = 'default' # namespace for your app (if you're sharing a redis instance)
config['periods'] = 3           # number of periods to store data for
config['length'] = 60           # length of a period in seconds (length = 60, periods = 3.. 180s worth of data)
config['size'] = 1000           # expected number of keys per period
config['bits'] = 10             # number of bits allocated per key
config['hashes'] = 7            # number of times each key will be hashed
config['seed'] = 30             # seed value for the hash function
Once again, to figure out your size, bits, and hashes, revisit the math behind bloomfilters. With the config file in place, you can launch the service and you are ready to go.
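As a quick way to sanity-check those numbers, the standard bloomfilter approximations can be written out in a few lines of Ruby. The function names here are made up for illustration, and the formulas are the usual textbook estimates (they assume well-distributed hashes), not anything specific to Mneme.

```ruby
# Approximate false-positive rate for a filter with `bits` bits per key
# and `hashes` hash functions: (1 - e^(-hashes/bits))^hashes
def false_positive_rate(bits, hashes)
  (1 - Math.exp(-hashes.to_f / bits))**hashes
end

# Bits per key needed to reach a target error rate, assuming the optimal
# number of hashes is used: -ln(p) / (ln 2)^2
def bits_per_key(error_rate)
  -Math.log(error_rate) / (Math.log(2)**2)
end

# Number of hashes that minimizes the error rate for a given bits-per-key.
def optimal_hashes(bits)
  (bits * Math.log(2)).round
end

puts false_positive_rate(10, 7)
puts bits_per_key(0.01)
puts optimal_hashes(10)
```

For the sample config (bits = 10, hashes = 7) this works out to a false-positive rate of roughly 0.8%, and 7 is indeed the optimal hash count for 10 bits per key. Going the other way, a 1% target needs a little under 10 bits per key.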
Mneme Performance & Memory Requirements
The cost of storing a new key in Mneme is O(num hashes), or O(1), since we only need to update the most recent filter. Similarly, retrieving a key is in the worst case O(num filters * num hashes), or once again, O(1). In both cases the performance is not dependent on number of keys, but rather on the time period and the error rate!
What about the memory footprint? Each filter is stored inside of a Redis bitstring, and GETBIT/SETBIT operations are used under the hood to perform the required operations. Redis does add some overhead to our bitstring, but a quick test shows that for a 1.0% error rate and 1M items, we need just about 2.5 MB of memory! For more numbers, check the Mneme readme.
Simple to deploy, minimal maintenance, high performance and no language barriers. Give it a try, and hopefully it will help you avoid having to reimplement yet another duplicate filter solution!
Goliath: Non-blocking, Ruby 1.9 Web Server »
Created at: 08.03.2011 21:29, source: igvita.com, tagged: ruby goliath postrank
There are easily half a dozen factors you need to consider when picking an app server: the choice of the VM, implementation model, performance and memory usage, driver and library availability, community support, and so forth. In other words, it is a complex set of requirements, and no one solution is likely to meet all of them. Not surprisingly, the Ruby ecosystem alone offers a variety of choices where Mongrel, Passenger, Unicorn, and Thin are some of the most popular - each has its own set of advantages and its own set of tradeoffs.
At PostRank, weighing our own set of requirements, we chose an event-driven architecture with MRI Ruby + EventMachine as our primary runtime. In the process, we have iterated on several versions of our own web-stack, and arrived at a model which has been a rock solid performer: a fully asynchronous server powered by Ruby 1.9, with a Fiber context for each request. Today, we are releasing Goliath (http://goliath.io) to the public!
Goliath: Architecture & Features
At its core Goliath is an app server like Mongrel or Thin - it is built around a Rack API - but due to its fully asynchronous nature it is also not a direct substitute. Instead Goliath is both an app server and a lightweight framework designed to meet the following goals: fully asynchronous processing, middleware support, simple configuration, high-performance, and arguably most importantly, readable and maintainable code.
Asynchronous, or event-driven programming relies on the concept of callbacks: blocks of code whose execution is deferred until an appropriate event (ex: socket IO) triggers it. While this is not a complicated concept on its own, in the long run, it seems to result in complicated, non-linear execution models which are hard to maintain - we have experienced this firsthand at PostRank and hence made it a primary concern for Goliath.
To solve this, Goliath runs on Ruby 1.9 and leverages Fibers (coroutines) to allow us to transparently pause and resume the execution of our asynchronous codebase, while preserving the look and feel of a synchronous API!
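The pause-and-resume trick can be shown without Goliath or EventMachine at all. The sketch below uses only Ruby's built-in Fiber and a toy reactor; ToyReactor, fetch_async, fetch and run_demo are hypothetical names for illustration, and the "response" is a canned string rather than real IO. The point is that fetch wraps a callback-based call so that the caller sees an ordinary return value.

```ruby
require 'fiber' # Fiber.current needs this on older Rubies; harmless on newer ones

# A toy stand-in for an event loop: blocks queued with #defer run later,
# when #run drains the queue.
class ToyReactor
  def initialize
    @queue = []
  end

  def defer(&block)
    @queue << block
  end

  def run
    @queue.shift.call until @queue.empty?
  end
end

# Callback style: the result is only visible inside the block.
def fetch_async(reactor, url, &callback)
  reactor.defer { callback.call("response for #{url}") }
end

# Fiber style: park the current fiber until the callback fires, then
# hand the result back as a normal return value.
def fetch(reactor, url)
  fiber = Fiber.current
  fetch_async(reactor, url) { |body| fiber.resume(body) }
  Fiber.yield
end

def run_demo
  reactor = ToyReactor.new
  log = []
  Fiber.new do
    log << 'request started'
    log << fetch(reactor, 'http://example.com/') # reads like a blocking call
  end.resume
  log << 'fiber paused, reactor free to do other work'
  reactor.run # fires the callback, which resumes the fiber
  log
end

puts run_demo
```

The code inside the fiber reads top to bottom like synchronous code, yet control returns to the reactor while the "request" is outstanding.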
Goliath: async GitHub proxy
To get started, simply "gem install goliath" under Ruby 1.9 and copy the following example:
require 'goliath'
require 'em-synchrony/em-http'

class Github < Goliath::API
  use Goliath::Rack::Params                 # parse query & body params
  use Goliath::Rack::Formatters::JSON       # JSON output formatter
  use Goliath::Rack::Render                 # auto-negotiate response format
  use Goliath::Rack::ValidationError        # catch and render validation errors
  use Goliath::Rack::Validation::RequiredParam, {:key => 'query'}

  def response(env)
    gh = EM::HttpRequest.new("http://github.com/api/v2/json/repos/search/#{params['query']}").get
    logger.info "Received #{gh.response_header.status} from Github"
    [200, {'X-Goliath' => 'Proxy'}, gh.response]
  end
end

# > gem install em-http-request --pre
# > gem install em-synchrony --pre
#
# > ruby github.rb -sv -p 9000
# > Starting server on 0.0.0.0:9000 in development mode. Watch out for stones.
#
# > curl -vv "localhost:9000/?query=ruby"
If you are familiar with the Rack API, the above example should be very straightforward: first, we tell our API to use a handful of middleware filters (Params, the JSON formatter, Render, and the validations), and within our response method we return an array containing the response code, response headers, and the response body.
The asynchronous part, which is the HTTP request we dispatch to GitHub's search API, is automatically paused for us until the request is complete, and later resumed without any intervention on the part of the developer - no callbacks required! Best of all, this same pattern applies for any kind of asynchronous IO.
Performance: MRI, JRuby, Rubinius
Goliath is able to run on MRI Ruby, JRuby and Rubinius today. Depending on which platform you are working with, you will see different performance characteristics. At the moment, MRI Ruby is the best performer: a roundtrip through the full Goliath stack on MRI 1.9.2p136 takes ~0.33ms (~3000 req/s).
JRuby performance (with --1.9 flag) is currently much worse than MRI Ruby 1.9.2 due to the fact that JRuby fibers are currently mapped to native Java threads. However, once Lukas Stadler's JVM coroutine patch (JRUBY-5461) gets integrated, JRuby may well take the performance crown. At the moment, without the MLVM support, a request through the full Goliath stack takes ~6ms (166 req/s).
Rubinius + Goliath performance is tough to pin down - there is a lot of room for optimization within the Rubinius VM. Currently, requests can take as little as 0.2ms and later spike to 50ms - stay tuned!
Getting started with Goliath
Goliath has been in production at PostRank for well over a year, serving a sustained rate of 500+ requests/s for months at a time (no memory leaks, no restarts). Internally, we use it to interface with MySQL, MongoDB, Cassandra, as well as many other local and remote web-services. Goliath supports HTTP keep-alive, request pipelining, and can be used to build real-time, streaming APIs - all features we use to optimize our infrastructure.
Take a look through the readme, check out the documentation, and take a look at some of the examples in the repository: streaming API, handling large file uploads, building an HTTP proxy with MongoDB logging, and others. Install it, play with it, fork it, and let us know how it goes!
