Non-blocking ActiveRecord & Rails »
Created at: 15.04.2010 23:39, source: igvita.com, tagged: ruby Ruby on Rails activerecord eventmachine rails
Rails and MySQL go hand in hand. ActiveRecord is perfectly capable of using a number of different databases but MySQL is by far the most popular choice for production deployments. And therein lies a dirty secret: when it comes to performance and 'scalability' of the framework, the Ruby MySQL gem is a serious offender. The presence of the GIL means that concurrency is already somewhat of a myth in the Ruby VM, but the architecture of the driver makes the problem even worse. Let's take a look under the hood.
Dissecting Ruby MySQL drivers
The native mysql gem many of us use in production was designed to expose a blocking API: you issue a SQL query, and the library blocks until the server returns a response. So far so good, but unfortunately it also introduces a nasty side effect. Because it blocks inside of the native code (inside mysql_real_query() C function), the entire Ruby VM is frozen while we wait for the response. So, if you query happens to have taken several seconds, it means that no other block, fiber, or thread will be executed by the Ruby VM. Ever wondered why your threaded Mongrel server never really flexed its threaded muscle? Well, now you know.
Fortunately, the little known mysqlplus gem addresses the immediate problem. Instead of using a single blocking call, it forwards the query to the server, and then starts polling for the response. For the curious, there are also two implementations, one in pure Ruby with a select loop, and a native (C) one which uses rb_thread_select. The benefit? Well, now you can have multiple threads execute database queries without blocking the entire VM! In fact, with a little extra work, we can even get some concurrency out of ActiveRecord.
However, we could even drop threads in our quest for concurrency! Instead of making every thread poll on a socket, we could pass each of those sockets to a single event loop (EventMachine) library, and let it handle all the IO scheduling for us: gem install em-mysqlplus. Same API, in fact, it uses mysqlplus under the covers, but now every query has a callback for true non-blocking database access. Take a look at a few examples in the slides:
Non-blocking Rails with MySQL
Now we come around full circle. The downside of a true asynchronous library is that it requires callbacks, spaghetti code and a fully asynchronous stack. Thankfully, we already have Thin for our async app server, and with the introduction of Fibers in Ruby 1.9, we can wrap our asynchronous driver to behave just as if it had a blocking API.
So, we install em-mysqlplus, require em-synchrony to emulate the 'blocking api', implement an activerecord adapter, and we finally have a fully non-blocking ActiveRecord driver which we can drop into our Rails app! Well, almost, a few other modifications: Rails provides a threaded ConnectionPool, which we need to replace with a Fiber aware one, and finally, we need to disable the built in Mutex (hap tip to Mike Perham for doing all the dirty work for us). Now let's give it a try:
class WidgetsController < ApplicationController def index Widget.find_by_sql("select sleep(1)") render :text => "Oh hai" end end
thin -D start
ab -c 5 -n 10 http://127.0.0.1/widgets/Server Software: thin
Server Hostname: 127.0.0.1
Server Port: 3000Concurrency Level: 5
Time taken for tests: 2.210 seconds
Complete requests: 10
Requests per second: 4.53 [#/sec] (mean)
Our widget action simulates a blocking one-second query, we start up a single thin server, and run an ab test against it: 10 requests, with a max concurrency of 5. And as you can see, the test finishes in just slightly over 2 seconds!
Rails 3, Ruby 1.9 and Drizzle
By mid summer we will see production releases of Rails 3, Ruby 1.9, and Drizzle, and that convergence is worth paying attention to. Both Rails 3 and Ruby 1.9 offer raw performance improvements across the board. In the meantime, Drizzle already provides a fully async libdrizzle driver (talks to MySQL & Drizzle) which we could adopt to future proof our applications. Combine all three with a fibered ActiveRecord driver, an async application server such as Thin, and we could make some serious steps forward when it comes to performance of Rails: significantly lower memory footprint and much better performance across the board.
more »
Untangling Evented Code with Ruby Fibers »
Created at: 22.03.2010 18:50, source: igvita.com, tagged: ruby eventmachine fibers
Event-driven programming requires a mind-shift in how you architect the program, and oftentimes directly affects the choice of language, drivers, and even frameworks you can use. Most commonly found in highly interactive applications (GUI, network servers, etc), it usually implements the reactor pattern to handle concurrency in favor of threads: the “reactor” is a main loop which is responsible for receiving all inbound events (network IO, filesystem, IPC, etc) and demultiplexing them to appropriate handlers or callbacks.
Turns out, the reactor pattern performs extremely well under heavy loads (C10K challenge), hence the continuous rise in adoption (Nginx, Beanstalkd, EventMachine, Twisted, Node.js), but it does have its downsides: it requires reactor-aware libraries, still relies on background processing capabilities for long running computations, and last but not least, once you have nested several callbacks, it results in much more complicated code. Functional purists will disagree with the last statement – after all, we all love JavaScript, and node.js is the new hot thing on the block – but what if we could write event driven code without the added complexity imposed by hundreds of nested callbacks?
Accidental Complexity of Event-Driven Code
Anyone who has written a non-trivial event driven application will be familiar with the following pattern: you often start reading your code bottom-up and then navigate your way up the callback chain. In addition, since there is no single execution context, each callback requires its own nested exception handling, which adds to the complexity - debugging is non-trivial, to say the least. Of course, this is usually not too bad in a context of a simple demo, but it also quickly spirals out of control.
EventMachine.run { page = EventMachine::HttpRequest.new('http://google.ca/').get page.errback { p "Google is down! terminate?" } page.callback { about = EventMachine::HttpRequest.new('http://google.ca/search?q=eventmachine').get about.callback { # callback nesting, ad infinitum } about.errback { # error-handling code } } }
Call me old-fashioned, but I much prefer the if-then-else control flow, with top-down execution and code I can actually read without callback gymnastics. And as luck would have it, turns out these are not inconsistent requirements in Ruby 1.9. With the introduction of Fibers, our applications can do fully-cooperative scheduling (worth a re-read to make sense of the rest), which with a little extra work also means that we can abstract much of the complexity of event driven programming while maintaining all of its benefits!
Fibers & EventMachine: under the hood
Ruby 1.9 Fibers are a means of creating code blocks which can be paused and resumed by our application (think lightweight threads, minus the thread scheduler and less overhead). Each fiber comes with a small 4KB stack, which makes them cheap to spin up, pause and resume. Best of all, a fiber can yield the control and wait until someone else resumes it. I bet you see where we're going: start an async operation, yield the fiber, and then make the callback resume the fiber once the operation is complete. Let's wrap our async em-http client as an example:
def http_get(url) f = Fiber.current http = EventMachine::HttpRequest.new(url).get # resume fiber once http call is done http.callback { f.resume(http) } http.errback { f.resume(http) } return Fiber.yield end EventMachine.run do Fiber.new{ page = http_get('http://www.google.com/') puts "Fetched page: #{page.response_header.status}" if page page = http_get('http://www.google.com/search?q=eventmachine') puts "Fetched page 2: #{page.response_header.status}" end }.resume end
First thing to notice is that we are now executing our asynchronous code within a fiber (Fiber.new{}.resume), and our http_get method sets up the call, assigns the callbacks and then immediately yields control as it tries to return from the function. From there, EventMachine takes over, fetches the data in the background, and then calls the callback method, which in turn resumes our fiber, passing it the actual response. A little bit of fiber gymnastics, but it means that our original code with nested callbacks can now be unwound into a regular top-down execution context with if-then-else control flow. Not bad!
EM-Synchrony: Evented Code With Less Pain
Of course, we wouldn't gain much if the net effect of introducing fibers into our event driven code was swapping callbacks for fiber gymnastics. Thankfully, we can do better because much of the underlying implementation can be easily abstracted at the level of the driver. Let's take a look at our new helper library, em-synchrony:
EventMachine.synchrony do page = EventMachine::HttpRequest.new("http://www.google.com").get p "No callbacks! Fetched page: #{page}" EventMachine.stop end
Instead of invoking the default EM.run block, we call EM.synchrony, which in turn wraps our execution into a Ruby fiber behind the scenes. From there, the library also provides ready-made, fiber aware classes for some of the most common use cases (http: em-http-request, mysql: em-mysqlplus, and memcached: remcached), as well as, a fiber aware connection pool, iterator for concurrency control, and a multi-request interface. Let's take a look at an example which flexes all of the above:
EM.synchrony do # open 4 concurrent MySQL connections db = EventMachine::Synchrony::ConnectionPool.new(size: 4) do EventMachine::MySQL.new(host: "localhost") end # perform 4 http requests in parallel, and collect responses multi = EventMachine::Synchrony::Multi.new multi.add :page1, EventMachine::HttpRequest.new("http://service.com/page1").aget multi.add :page2, EventMachine::HttpRequest.new("http://service.com/page2").aget multi.add :page3, EventMachine::HttpRequest.new("http://service.com/page3").aget multi.add :page4, EventMachine::HttpRequest.new("http://service.com/page4").aget data = multi.perform.responses[:callback].values # insert fetched HTTP data into a mysql database, using at most 2 connections at a time # - note that we're writing async code within the iterator! EM::Synchrony::Iterator.new(data, 2).each do |page, iter| db.aquery("INSERT INTO table (data) VALUES(#{page});") db.callback { iter.return(http) } end puts "All done! Stopping event loop." EventMachine.stop end
Synchrony implements a common pattern: original asynchronous methods which return a deferrable object are aliased with "a" prefix (.get becomes .aget, .query becomes .aquery) to indicate that they are asynchronous, and the fiber aware methods take their place as the defaults. This way, you can still mix sync and async code all in the same context (within an iterator, for example). For more examples of em-synchrony in action, take a look at the specs provided within the library itself.
Towards Scalable & Manageable Event-Driven Code
Event driven programming does not have to be complicated. With a little help from Ruby 1.9 much of the complexity is easily abstracted, which means that we can have all the benefits of event-driven IO, without any of the overhead of a thread scheduler or complicated code. Node.js, Twisted, and other reactor frameworks have a lot going for them, but the combination of EventMachine and Ruby 1.9, to me, is a clear winner.
We should also mention that fibers have been ported to Ruby 1.8, but due to their implementation on top of green threads they incur much larger overhead - in other words, this is your reason to switch to Ruby 1.9! And last but not least, fibers or not, don't forget that Ruby 1.9 still has a GIL which means that only one CPU core will be used. Until we see MVM (multi-VM) support, the solution is simply to run multiple reactors (one or more for each core).
more »
Schema-Free MySQL vs NoSQL »
Created at: 01.03.2010 20:55, source: igvita.com, tagged: databases eventmachine mysql nosql
Amidst the cambrian explosion of alternative database engines (aka, NoSQL) it is almost too easy to lose sight of the fact that the more established solutions, such as relational databases, still have a lot to offer: stable and proven code base, drivers and tools for every conceivable language, and more features than any DBA cares to learn about. Not to mention that relational or not, they often times perform just as well as any other single instance key-value store when faced with large datasets - hence the reason why Riak, Voldemort and others use InnoDB as their data stores. Granted, the “feature bloat” is also the reason why a rewrite can be a good idea, but it also feels like this gray zone is too often overlooked in the NoSQL community - just because you are “NoSQL” does not mean you have to throw away years of work put into relational databases.
Setting aside the fact that we are yet to define what “NoSQL” actually is, some of the attributes that we commonly glob under this label are: document based, schema-free, distributed and “scalable”. The fact that being distributed and being scalable are not one and the same is a subject for another post, instead let’s take a closer look at what schema-free and document-based actually means. In fact, let me jump ahead: I am genuinely surprised that we are yet to see a schema-free engine built on top of MySQL. I know, I know, but suspend you disbelief for a second, because it is not as outrageous as it sounds.
Document Based: a Double Edged Sword
The original reason for and the benefit of the relational model is that by constraining the data schema (read, eliminating structural complexity of the data, or decomposing it into relations), you actually gain power and flexibility in the types of queries you can execute against your database. Said another way, normalized data design allows us to have a general-purpose query language, which allows for queries whose parameters we do not even know at design time, whereas denormalized designs do not. What we loose in flexibility of our data structures, we gain in our ability to interact with the data. Hence, in theory, if you have no way to anticipate the types of queries in the future, a relation model is your best bet. Lose some, win some, chose your poison.
At the same time, we all know that “no join is faster than no join”. The inherent disadvantage of decomposing your data is the required assembly. If you are looking for “speed” or “scalability”, then denormalizing your data is usually the first step. The disadvantage? Now you have introduced a number of potential anomalies into your data: updates, inserts, and deletes can cause data inconsistencies unless you keep careful accounting of all duplication. One-to-One, and One-to-Many relations are usually easy to manage, but Many-to-Many in denormalized schemas are nothing but a recipe for disaster. That is, if you care about consistency.
Finally, since you lose the power of a general purpose query language (SQL), you are now at a mercy of the DSL provided by your new database. Mongo, Couch and many others had to introduce their own query language constructs alongside "map-reduce" functionality to address the problem of querying arbitrarily deep records. Now, I am a fan of both, but frankly, none I have worked with so far are as clean, or as easy to understand as SQL (case in point) - with the downside of making me learn yet another query language.
Schema-free != Document Based
Document based and schema-free are often used interchangeably, but there is an important difference: schema-free does not necessarily imply nested data structures. Likewise, just because MySQL is “relational” does not mean that it must be fixed to a predefined schema - at create time, maybe, but not at runtime. Intersect the two statements, and it means that there is absolutely no reason why we cannot have a schema-free engine in MySQL:
mysql> USE noschema; mysql> CREATE TABLE widgets; /* look ma, no schema! */ mysql> INSERT INTO widgets (id, name) VALUES("a", "apple"); mysql> INSERT INTO widgets (id, name, type) VALUES("b", "blackberry", "phone"); mysql> SELECT * FROM widgets WHERE id = "a"; +---------+---------------+ | id | name | +---------+---------------+ | a | apple | +---------+---------------+ mysql> SELECT * FROM widgets; +---------+---------------+--------+ | id | name | type | +---------+---------------+--------+ | a | apple | NULL | | b | blackberry | phone | +---------+---------------+--------+
As long as we avoid nested data structures, then there is no reason why we should be limited by the columns defined in our tables because we can compose and decompose any relation at runtime. Not only would this mean no migrations or need to store null values, but you could also keep all the tools, drivers, and the SQL query language while adding the full flexibility of being schema-free.
Schema-free DB on top of MySQL
Not able to find any project that would give me this behavior, I ended up prototyping it myself over the weekend, and believe it or not, it works just fine. In fact, the output above is from a real console session with MySQL. All it took is an em-proxy server with a little low-level protocol and query rewriting, and all of the sudden, my MySQL forgot that it requires a schema. Take it for a test-drive yourself (you will need Ruby 1.9):
git clone git://github.com/igrigorik/em-proxy.git && cd em-proxy
ruby examples/schemaless-mysql/mysql_interceptor.rb
mysql -h localhost -P 3307 --protocol=tcp
# snip ... # build the select statements, hide the tables behind each attribute join = "select #{table}.id as id " tables.each do |column| join += " , #{table}_#{column}.value as #{column} " end # add the joins to stich it all together join += " FROM #{table} " tables.each do |column| join += " LEFT OUTER JOIN #{table}_#{column} ON #{table}_#{column}.id = #{table}.id " end join += " WHERE #{table}.id = '#{key}' " if key
Of course, this is nothing but a cute code example nor does it even cover all the different use cases, but let us look at the feature set: driver support for every language (you can point Rails + ActiveRecord, JDBC, etc. at it out the box, no problem), tool support (GUI and command line), replication that works, basically impossible to corrupt, transactions, and so on. Not bad for half a day of hacking with a simple data model in the background:

Instead of defining columns on a table, each attribute has its own table (new tables are created on the fly), which means that we can add and remove attributes at will. In turn, performing a select simply means joining all of the tables on that individual key. To the client this is completely transparent, and while the proxy server does the actual work, this functionality could be easily extracted into a proper MySQL engine - I’m just surprised that no one has done so already. For a closer look, check out the proxy code itself, there are plenty of comments, which explain how it is all pieced together.
The gray zone of SQL vs NoSQL
So what is the point of all this? Well, I hope someone actually writes such an engine, because I believe there is a market for it. There is a lot to be said for a drop in, SQL compatible, schema-free engine, and unlike what the NoSQL propaganda may say, there is absolutely no reason why we can’t have many of the benefits of “NoSQL” within MySQL itself. There is no one clear winner for a database engine or model, so put some thought into your decision up front. Just because Mongo, TC, or Couch are 'document-oriented' or 'schema-free' does not mean they are necessarily better for your application. In the meantime, don't get me wrong, I am still rooting for all the NoSQL projects, as well as have high expectations for Drizzle - they are all doing fantastic work.
more »
