Advanced Messaging & Routing with AMQP »
Created at: 08.10.2009 19:34, source: igvita.com, tagged: Architecture ruby amqp pubsub
Not all message queues are made equal. In the simplest case, a message queue is synonymous with an asynchronous protocol in which the sender and the receiver do not operate on the message at the same time. However, while this pattern is most commonly used to decouple distinct services (an intermediate mailbox, of sorts), the more advanced implementations also enable a host of more sophisticated recipes: load balancing, queueing, failover, pubsub, etc. AMQP can do all of the above, and yesterday's announcement of RabbitMQ 1.7 (an open source AMQP broker) warrants a closer look.
Originally developed at JP Morgan as a vendor neutral wire and broker protocol, AMQP (Advanced Message Queuing Protocol) is, in fact, a general purpose messaging bus. The protocol itself is still under active development, but there are a variety of open source client and server implementations for it, as well as some big commercial supporters (RedHat, Microsoft, etc). In other words, it works, it is production ready, and I can vouch for it from personal experience - we stream tens of millions of messages through AMQP at PostRank on a daily basis.
AMQP vs XMPP: Features & Architecture
The AMQP vs XMPP debate has been raging for years now. On the surface the two look nearly identical, but in reality there are a number of important distinctions. For example, presence is one of the central components of XMPP, but it is not part of the AMQP specification. XMPP uses XML, whereas AMQP has a binary protocol. AMQP has native support for a number of delivery use cases (at least once, exactly once, select subscribers, persistence, etc) and also a variety of exchange implementations which allow fine-grained control over where and how the messages are routed.

The AMQP spec is a fast and recommended read, but by way of a quick introduction, the core architectural components are: publisher, exchange, queue, and consumer. As you may have guessed, the publisher is the data producer which pushes messages to an exchange. Why is it called an exchange? Because the exchange is a routing engine which is responsible for delivering the messages to the right queues (exchanges never store messages). For example, a message may need to be routed to just a single queue by an exact key (direct exchange), maybe the message should be forwarded to every bound queue for pubsub (fanout exchange), or perhaps the message should be routed by matching its key against a pattern (topic exchange).
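To make the three exchange types concrete, here is a minimal in-memory sketch of the routing decision in plain Ruby. It is an illustration only: the `SketchExchange` class and its methods are hypothetical and are not part of the amqp gem or of any broker, queues are modeled as plain arrays, and the topic matching assumes the usual AMQP wildcard convention (`*` for exactly one dot-separated word, `#` for zero or more).

```ruby
# Minimal in-memory model of AMQP-style exchanges (illustration only,
# not the amqp gem's API). Queues are plain arrays of messages.
class SketchExchange
  def initialize(type)
    @type = type      # :direct, :fanout or :topic
    @bindings = []    # list of [binding_key, queue] pairs
  end

  def bind(queue, key = '')
    @bindings << [key, queue]
  end

  # Deliver a copy of the message to every queue whose binding matches.
  # The exchange itself never stores the message.
  def publish(message, routing_key = '')
    @bindings.each do |binding_key, queue|
      queue << message if match?(binding_key, routing_key)
    end
  end

  private

  def match?(binding_key, routing_key)
    case @type
    when :fanout then true
    when :direct then binding_key == routing_key
    when :topic  then topic_match?(binding_key.split('.'), routing_key.split('.'))
    end
  end

  # '*' matches exactly one dot-separated word, '#' matches zero or more.
  def topic_match?(pattern, words)
    return words.empty? if pattern.empty?
    head, *rest = pattern
    if head == '#'
      (0..words.size).any? { |n| topic_match?(rest, words.drop(n)) }
    else
      !words.empty? && (head == '*' || head == words.first) &&
        topic_match?(rest, words.drop(1))
    end
  end
end

all_logs = []
errors = []
topic = SketchExchange.new(:topic)
topic.bind(all_logs, 'logs.#')
topic.bind(errors, 'logs.*.error')

topic.publish('disk full', 'logs.web.error')   # lands in both queues
topic.publish('user login', 'logs.web.info')   # lands in all_logs only
```

A real broker performs the same kind of matching on the server, so publishers never need to know which queues exist.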
Publishing & Consuming AMQP in Ruby
The type of exchange, message parameters, and the name of the attached queue can all contribute to the delivery and routing behavior of the message. However, for the sake of example, let's create a simple pubsub fanout exchange in Ruby:
    require 'amqp'

    AMQP.start(:host => 'localhost') do
      # create a fanout exchange on the broker
      exchange = MQ.new.fanout('multicast')

      # publish multiple messages to fanout
      exchange.publish('hello')
      exchange.publish('world')
    end
In order to consume the messages from an exchange the consumer needs to create a queue and then bind it to an exchange. A queue can be durable (survive between server restarts), or auto-deletable for cases when the queue should disappear if the consumer goes down. Best of all, once the queue is bound to an exchange, the messages are streamed to the client in real-time via a persistent connection (no polling!):
    require 'amqp'

    AMQP.start(:host => 'localhost') do
      amq = MQ.new

      # bind 'listener' queue to 'multicast' exchange
      amq.queue('listener').bind(amq.fanout('multicast')).subscribe do |msg|
        puts msg # process your message here
      end
    end
Advanced AMQP Recipes
The flexibility of the message and the exchange model is what makes AMQP such a powerful tool. Whenever a publisher generates a message, he can mark it as 'persistent' which will guarantee delivery through the broker - if there is an attached queue, it will accumulate messages until the consumer requests them. However, if you're streaming transient data (access logs, for example), you can also disable message persistence and not worry about overwhelming your broker. That's how you achieve 'exactly-once' vs 'at least once' semantics.
Trying to build a pubsub hub? Create a fanout exchange and attach as many queues as you want: each consumer will receive a copy of the message. Load balancing? Bind two workers to the same queue and the broker will automatically round-robin the messages (there is no upper limit on the number of workers). Failover? By default the AMQP broker does not require a message to be ACKed by a consumer, but with a simple configuration flag the messages will be kept on the server until the ACK is received. If the consumer goes down without ACKing its messages, they will be automatically put back on the queue. Need to route a message based on a key? A topic exchange allows partial matching based on a message key that is set by the producer. Do you want to notify the producer if there are no subscribers attached to a queue? Set the immediate flag on the message and the broker will do all the work.
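The load balancing and failover recipes can be sketched without a broker. The plain-Ruby model below is a hypothetical illustration (the `SketchWorkQueue` class is invented for this example and is not a RabbitMQ or amqp gem API): messages are handed to consumers in rotation, held as unacknowledged until an ACK arrives, and put back on the queue if their consumer disconnects first.

```ruby
# In-memory sketch of a work queue with round-robin dispatch and
# explicit ACKs (hypothetical class, not a real broker API).
class SketchWorkQueue
  attr_reader :ready, :unacked

  def initialize
    @ready = []       # messages waiting for delivery
    @unacked = {}     # delivery tag => [consumer name, message]
    @consumers = []
    @next_tag = 0
    @cursor = 0
  end

  def subscribe(name)
    @consumers << name
  end

  def publish(message)
    @ready << message
  end

  # Hand the next message to the next consumer in rotation.
  # Returns [tag, consumer, message], or nil if nothing can be delivered.
  def deliver
    return nil if @ready.empty? || @consumers.empty?
    consumer = @consumers[@cursor % @consumers.size]
    @cursor += 1
    tag = (@next_tag += 1)
    message = @ready.shift
    @unacked[tag] = [consumer, message]
    [tag, consumer, message]
  end

  # The consumer finished its work: the message can be forgotten.
  def ack(tag)
    @unacked.delete(tag)
  end

  # A consumer died: put its unacknowledged messages back on the queue.
  def disconnect(name)
    @consumers.delete(name)
    lost = @unacked.select { |_tag, (owner, _msg)| owner == name }
    lost.each_key { |tag| @unacked.delete(tag) }
    @ready.unshift(*lost.values.map { |_owner, msg| msg })
  end
end

queue = SketchWorkQueue.new
queue.subscribe('worker-1')
queue.subscribe('worker-2')
%w[a b c].each { |message| queue.publish(message) }

tag1, consumer1, message1 = queue.deliver   # 'a' goes to worker-1
tag2, consumer2, message2 = queue.deliver   # 'b' goes to worker-2
queue.ack(tag1)                             # worker-1 finished 'a'
queue.disconnect('worker-2')                # 'b' was never ACKed, so it is requeued
tag3, consumer3, message3 = queue.deliver   # 'b' is redelivered to worker-1
```

Requeueing at the front is a choice made for this sketch; the point is only that an unacknowledged message survives the loss of its consumer.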
Best of all, you can also compose these patterns to cover virtually any delivery use case!
AMQP Brokers & Ruby / Rails
There are a variety of available broker implementations: ZeroMQ, ActiveMQ, OpenAMQ, and RabbitMQ. Because the underlying protocol is still in flux, there is definitely some variation between all the implementations - do your homework. If you're looking for a speed demon, ZeroMQ claims a 15.4 microsecond routing overhead (4+ million msgs/s). However, RabbitMQ is arguably the most stable and feature complete broker implementation. If you are a CentOS or a Fedora user, you'll be happy to know that it is now part of the distro (yum install rabbitmq-server), otherwise follow the installation instructions.
Once the server is installed, follow the administration guide to start the broker. If you're looking for a RESTful or a GUI tool to help you configure the broker, drop in Alice on the same server. Like the SQL prompt? Install the BQL plugin and familiarize yourself with the syntax.
The AMQP gem is probably the best choice when it comes to Ruby clients - it is asynchronous, it is fast, and it is in use by dozens of companies. If you're looking for a synchronous client, Carrot gem is the answer. If you're using async-observer plugin in your Rails projects, you can drop in async-observer-amqp to migrate to AMQP. In other words, it is easy to get started, it is incredibly powerful, and it has great library support for virtually every language. Give it a try.
Deploying JRuby on Google App Engine »
Created at: 23.09.2009 18:44, source: igvita.com, tagged: Architecture ruby gae google jruby rails
Following the Java support announcement for Google App Engine back in April there was a small flurry of excitement about the possibility of JRuby on GAE. However, while the notion was appealing, the execution was lacking - you had to do backflips to get JRuby, servlets, and XML configs all properly bundled. Compared to the aesthetically pleasing experience of running a Sinatra, Rack, or even a Rails app with Phusion, it is no wonder that most of us promptly forgot about GAE.
Having said that, the auto-scale, load-balancing, geographic redundancy, BigTable datastore, XMPP, memcached and the newer cron and taskqueue services are definitely all reasons to revisit the platform. John Woodell and Ryan Brown, both Google employees, have been doing terrific work on helping to lower the entry barrier. In fact, while the documentation still needs a lot of work and polish, the deployment story and the APIs are basically there - it now takes just a few keystrokes to get your Ruby app on GAE!
Migrating to Google App Engine
The first thing you'll notice with App Engine is that most of your Ruby/Rails applications won't run on the platform out of the box. GAE offers a lot of great benefits, such as free load-balancing and access to the BigTable datastore amongst many other things, but it does so by abstracting or removing some of the interfaces we're all used to: there is no filesystem, BigTable is not a relational database, cron jobs are HTTP calls, and so on. While an inconvenience at first, the functionality is still there but it does require some extra work on behalf of the developer. For example, while ActiveRecord does not work at the moment, there is a DataMapper adapter that can meet all of your needs. Likewise, your code cannot make arbitrary outbound HTTP connections, but you can patch net/http to use the URLFetch API.

The net positive of all of these changes is that your application can then be uniformly scaled across the GAE infrastructure. By default, you receive a free quota for up to (roughly) five million requests, and beyond that a credit card is required. Best of all, you no longer have to think about hardware, databases, or any other supporting services. BigTable guarantees uniform access time for your queries irrespective of the size of the dataset (at a cost of higher overall latency), and CPU, memcached, and XMPP load is automatically spread within the GAE cluster. And if you're feeling adventurous, you could try deploying a distributed file system (GaeVFS), or even build your own GAE hosting environment from scratch with AppScale.
Hello World with GAE
The tooling support for getting your application on GAE is now almost fully automated. Ryan Brown's appengine-apis gem has full coverage of all the essential services, and the google-appengine gem provides a number of essential tools to help you with the generation, setup, and deployment of your applications. A "Hello World" Rack application is as simple as:
    # install GAE tools (no JRuby required)
    # > sudo gem install google-appengine

    # create a rackup file: config.ru
    require 'appengine-rack'
    AppEngine::Rack.configure_app(
      :application => "your-app-id",
      :version => 1)

    # a Rack app receives the request env and returns [status, headers, body]
    run lambda { |env| Rack::Response.new("Hello World!").finish }

    # start development server on local server
    # > dev_appserver.rb .

    # deploy application to GAE
    # > appcfg.rb update .
All the configuration and deployment logic is done on your behalf by the dev_appserver and appcfg executables: packaging JRuby, creating the XML configs, authentication, etc. Deploying a Rails application takes just a few more steps. Once the app is live, log in to your GAE dashboard to scan the logs, query BigTable, or track your quotas - no need for mucking around with any system utilities!
Developing for App Engine
In theory, GAE has all the potential to become a popular deployment platform for Ruby applications (not to mention a nice market for custom Google Apps modules). What is lacking at this point are the blog posts, the developers, and the overall Ruby community mindshare around App Engine. John Woodell and Ryan Brown have done a fantastic job of lowering the entry barriers, but we're still missing the real-life production deployments that would weed out the API bugs and inconsistencies, as well as build the overall trust in the platform.
Curious to give it a try, I took an evening to try and port a small, but non-trivial app to App Engine. Having settled on watercoolr (webhooks pubsub), I swapped out SQLite for BigTable and a few keystrokes later, I had a public watercoolr service (http://watercoolr.appspot.com) in production. No configuration, no need to worry about overloading the database. Feel free to use it for your own HTTP pubsub needs! GAE is an exciting platform: check out the examples and articles, and make sure to give it a try.
Post-Javascript DOM with Aptana Jaxer »
Created at: 25.08.2009 20:42, source: igvita.com, tagged: Architecture ajax dom javascript jaxer
It is hard to imagine the web as we know it today without AJAX. Within just a couple of years, it has transformed how we think about web applications, how we build them, and the technology stack underneath. Prototype, JQuery, YUI and dozens of other frameworks let us build rich and interactive applications, while Javascript and rendering engines are progressing in leaps and bounds when it comes to performance. Exciting stuff. However, this client-side functionality comes at a price. The AJAX testing experience is still painful (even with solutions like Selenium, Webrat, etc), and even more generally, working with, or even getting access to, the post-Javascript DOM remains a challenge. Want to build a crawler which sees the page as the author intended? That's a tough problem.
Javascript is (Almost) Everywhere
Like it or not, disabling JavaScript in your browser today is likely to render a large section of the web as completely unusable, and that is unlikely to change anytime soon. For that reason, over the years there have been many recurring rumors of the Google bot rendering Javascript when it came to your page. Google denies this, but there are also theories that the V8 engine is a direct result of such initiatives - which, I think, is not completely unfounded.
In either case, if you have ever attempted this yourself, you quickly realized just how non-trivial this problem actually is. A typical spider just downloads the HTML of the page; it has no knowledge of the DOM or Javascript. So, at the end of the day, what you need is a browser: the content must be downloaded, the HTML needs to be interpreted into a DOM model, and then finally handed off to the Javascript runtime for further processing. That's a lot of work.
Jaxer, the AJAX Server
Having attempted this problem many times with Rhino, SpiderMonkey and even raw Webkit bindings, I finally stumbled on Jaxer. Developed at Aptana, it is based directly on Gecko (Firefox rendering engine), minus the graphical rendering, which means that the DOM is already there and Javascript is free of charge. I never bought into the server-side Javascript movement, but seeing Jaxer in this light definitely opened my eyes to a lot of opportunities. The installation is as easy as unpacking the Apache runtime and then proxying your requests through the server. For example, to get the full post-Javascript DOM of any web-page, just drop this template into your Jaxer directory:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        <title>Post JavaScript DOM</title>
      </head>
      <body>
        <!-- Render post-JS DOM: http://jaxer/post-js.html?www.google.com -->
        <script runat='server'>
          var url_query = Jaxer.request.queryString;
          var sb = new Jaxer.Sandbox(url_query);
          Jaxer.response.setContents(sb.toHTML());
        </script>
      </body>
    </html>
Accessing the Server-Side DOM
The first thing you'll notice is just how slow most of the pages will come back when proxied through Jaxer. However, this has nothing to do with Jaxer and everything to do with the dozens of Javascript files and slow remote calls our browsers have to do on our behalf. Their asynchronous nature just happens to mask the abysmal performance of most pages. Which, incidentally, is also likely the reason why our search engines today do not look at the post-JS DOM.
If you haven't already, take a look at the Jaxer tutorials. You have full access to the DOM on the server, which means that you can use JQuery to alter the response, connect to a database, test your Javascript on the fly, etc. Having said that, I'm still secretly hoping that one day our curl command client will just have a flag to return the final post-Javascript DOM.
Smart Clients: ReverseHTTP & WebSockets »
Created at: 18.08.2009 18:30, source: igvita.com, tagged: Architecture reversehttp websocket
Polling architectures, as pervasive as they are today, did not come about due to their efficiency. Whether you are maintaining a popular endpoint (Twitter), or trying to get near real-time news (RSS), neither side benefits from this architecture. Over the years we've built a number of crutches in the form of Cache headers, ETags, accelerators, but none have fundamentally solved the problem - because the client remains 'dumb' the burden is still always on the server. For that reason, it's worth paying attention to some of the technologies which are seeking to reverse this trend: ReverseHTTP & WebSockets.
Smart(er) Clients: Pushing Complexity to the Edges
Our web-browsers today are, for the most part, passive consumers. This model has worked wonderfully up until now, but as a thought experiment, think about the implications for your architecture if the browsers were smarter. For example, what if the browser also contained a web server? For one, this would mean that the browser and the web-server would be peers: they could talk via HTTP, use OAuth, Webhooks, etc!
Brushing NAT and firewall complications aside for a few seconds, this would mean that any browser could register itself via a webhook, and the server would simply propagate the updates to each client whenever new data is available (aka, PubSubHubbub with the client). Fast and efficient.
So how do you go about running a web-server in your browser? It is possible in theory, but the support is lacking. Firefox has a built-in Javascript webserver for testing, but it requires XPCOM privileges; there is some talk on the Jetpack mailing lists about supporting this feature, but nothing conclusive; node.js is an evented V8 powered web-server, but it can't run within Chrome. The closest you will get without modifying the core of your browser is the Firefox POW (Plain Old Webserver) extension, but we need something much more flexible.
Hybrid Clients: ReverseHTTP & Supernodes
The first time you read the proposed ReverseHTTP, or equivalent spec, it will undoubtedly feel like a crazy idea. Having said that, it works. Most clients are hidden behind personal / corporate firewalls or NATs, which renders them unreachable for a raw client-side HTTP push model. ReverseHTTP proposes a hybrid solution: another web-server in the middle acts as a proxy, and the client maintains a persistent connection to it. Whenever a request comes to the proxy, it is relayed to the client via the persistent connection, where the response is determined by the browser and finally relayed back to the originating source. In effect, you are running a webserver in your browser. Give it a try, and if you’re curious, check out the server code as well.
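The relay flow can be modeled in a few lines of plain Ruby. This is an in-process sketch under loud assumptions: the `SketchRelay` class, its method names, and the client name are all hypothetical, a thread stands in for the browser, and in-memory queues stand in for the persistent connection. It does not implement the ReverseHTTP wire protocol.

```ruby
require 'thread' # Queue; only needed explicitly on older Rubies

# Hypothetical in-process model of a ReverseHTTP-style relay.
# Real deployments speak HTTP over the network; here queues stand in
# for the persistent connection between the client and the proxy.
class SketchRelay
  def initialize
    @pending = {}   # client name => Queue of [path, reply_box]
  end

  # The client announces itself to the proxy (its "persistent connection").
  def register(client_name)
    @pending[client_name] = Queue.new
  end

  # Called by the outside world: forward a request and wait for the reply.
  def request(client_name, path)
    reply_box = Queue.new
    @pending.fetch(client_name) << [path, reply_box]
    reply_box.pop   # blocks until the client answers
  end

  # Called by the client: wait for the next relayed request, compute a
  # response with the given block, and send it back through the proxy.
  def serve_one(client_name)
    path, reply_box = @pending.fetch(client_name).pop
    reply_box << yield(path)
  end
end

relay = SketchRelay.new
relay.register('browser-1')

# The "browser" sits behind NAT and only ever talks to the relay.
browser = Thread.new do
  relay.serve_one('browser-1') { |path| "200 OK: handled #{path} in the browser" }
end

response = relay.request('browser-1', '/status')
browser.join
```

The caller of `request` never connects to the client directly, which is what lets the client stay behind a firewall while still answering requests.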

Unfortunately, this is still a half-hearted solution because it introduces yet another layer of infrastructure into the equation, but it also has its benefits. First, the scalability complexity is propagated down to the edges – a client, a hosting provider, or even an ISP can maintain such proxy nodes for their users. In fact, that is exactly what Opera Unite is all about. Blur your eyes to cut through the marketing, and you'll see that, in fact, the "revolution" is the ability for a browser to act as a server, via the Opera Unite proxy service. Sounds crazy? Microsoft's Teredo service (tunneling IPv6 traffic over IPv4) has been providing this very service for free for several years now to all Windows users. Last but not least, this is also the same principle that powers your favorite P2P client (Skype, Kazaa, etc). It is not as crazy as it seems.
Bi-directional Communication With WebSockets API
Of course, we also can't forget Comet, BOSH, or many other attempts at the same problem. Personally, I've been holding off for a simpler solution. All I want is a socket – Flash has it, why can't the web browser too? Thankfully, HTML 5 has an answer: WebSocket API.
    // open a websocket
    var conn = new WebSocket("ws://yourwebservice.com/service");

    // act on incoming data (incoming messages arrive via onmessage)
    conn.onopen = function(e) { /* ... */ }
    conn.onmessage = function(e) { /* ... */ }
    conn.onclose = function(e) { /* ... */ }

    // push messages back to server
    conn.send("Bi-directional!");
As more and more browsers start adopting HTML 5 features I'm hoping to see WebSockets become a reality within the next couple of years. Full bi-directional data exchange between the client and server, without a need for a third party in between – that is what the Opera Unite announcement should have been. It still means maintaining persistent connections between client and server, which is arguably not as elegant as registering your browser session through a webhook, but there is a need for both solutions. One is great for intermittent updates, the other (WebSockets) is the right answer for high-velocity streams.
Masking Latency & Failures with Squid »
Created at: 05.08.2009 19:27, source: igvita.com, tagged: Architecture cache squid varnish
Latency matters. Watching the interviews with the Yahoo developers on the launch of their new homepage, I was once again reminded about the great amount of effort they have put in to make it all work. In fact, I've written about Mark Nottingham's proposals for stale-while-revalidate and stale-if-error in the past (with a homebaked implementation), but a more realistic deployment scenario is to use a caching server like Squid or Varnish. Let's take a look at how these extensions can help your specific deployment.
HTTP Caching Extensions: stale-*
Both stale-while-revalidate and stale-if-error are the direct results of Mark Nottingham's work at Yahoo. It is clear that Squid plays a big role in the Y! infrastructure and these extensions are undoubtedly deployed in their data centers. Stale-while-revalidate addresses a simple problem: when a record becomes stale, instead of allowing the request to hit the application server, serve the stale data to the client and create an asynchronous request to update the cache (read more & spec). The benefit? All of your customers see consistent performance because the data is always served out of the cache (think RSS feeds, search results, etc).
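The serve-stale-then-refresh behavior can be sketched in plain Ruby. The `SketchStaleCache` class below is a hypothetical single-entry cache written for illustration; it is not how Squid implements the extension, and it ignores the upper bound on staleness that the real directive carries.

```ruby
# Hypothetical single-entry cache illustrating stale-while-revalidate:
# a stale hit is answered from the cache immediately while a background
# thread asks the origin for a fresh copy.
class SketchStaleCache
  attr_reader :refresher

  def initialize(max_age, &origin)
    @max_age = max_age   # seconds a value stays fresh
    @origin = origin     # block that calls the (slow) application server
    @lock = Mutex.new
    @value = nil
    @fetched_at = nil
    @refresher = nil
  end

  def get(now = Time.now)
    @lock.synchronize do
      if @fetched_at.nil?
        store(@origin.call, now)   # cold cache: the caller has to wait
      elsif now - @fetched_at >= @max_age && !refreshing?
        # Stale: answer from the cache now, refresh in the background.
        @refresher = Thread.new do
          fresh = @origin.call
          @lock.synchronize { store(fresh, Time.now) }
        end
      end
      @value
    end
  end

  private

  def refreshing?
    @refresher && @refresher.alive?
  end

  def store(value, time)
    @value = value
    @fetched_at = time
  end
end

calls = 0
cache = SketchStaleCache.new(10) { calls += 1; "response #{calls}" }

start = Time.now
first = cache.get(start)        # cold miss, fetched synchronously
stale = cache.get(start + 60)   # stale hit: old value returned at once
cache.refresher.join            # wait for the background refresh (demo only)
fresh = cache.get               # the refreshed value is now served
```

Only the very first request pays the origin's latency; every later client is served out of the cache, which is the consistency benefit described above.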

The second extension (stale-if-error) can help you mask downtime by returning stale data while your ops team resolves the problem. How does it work? You specify a cache-control header (Cache-Control: max-age=600, stale-if-error=1200) which indicates for how long after the record has expired the cache server can keep serving it if the application server is down. Of course, nothing stops you from setting this to a high value like a day or longer! This way, if your server goes down, at least the clients won't time out on their requests while you're resolving the problem!
Setting up Squid as a Reverse Proxy
Varnish and Squid are the two top choices when it comes to caching reverse proxies. Varnish has been getting a lot of momentum but unfortunately the features we're looking for are still in the works. There is an open ticket for stale-if-error support, and the stale-while-revalidate implementation unfortunately does not provide the full asynchronous refresh model. For that reason, we're going to use Squid 2.7 (note: Squid 3.0 is a full rewrite of the 2.x branch, and while stale-* patches exist, they haven't officially made it into the tree). A minimal Squid config (if there is such a thing) to get us up and running:
    # http_port public_ip:port accel defaultsite= default hostname, if not provided
    http_port 0.0.0.0:80 accel defaultsite=yourdomain.com

    # IP and port of your main application server (or multiple)
    cache_peer 192.168.0.1 parent 80 0 no-query originserver name=main
    cache_peer_domain main yourdomain.com

    # Do not tell the world which Squid version we're running
    httpd_suppress_version_string on

    # Remove the Caching Control header for upstream servers
    header_access Cache-Control deny all

    # log all incoming traffic in Apache format
    logformat combined %>a %ui %un [%tl] "%rm %ru HTTP/%rv" %Hs %<st "%{Referer}>h" "%{User-Agent}>h" %Ss:%Sh
    access_log /usr/local/squid/var/logs/squid.log combined all
Connecting Squid and your Application
Once Squid is deployed as part of your request chain, taking advantage of its caching mechanism is a trivial matter, just add the "Cache-Control" header! A simple Rack application to make it all work:
    require "rack"
    require "time" # for Time#httpdate

    app = lambda { |env|
      p [:new_request, Time.now]

      headers = {
        # cache for 10 seconds, serve stale for up to 20s, and for up to 30s on errors
        "Cache-Control" => "max-age=10, stale-while-revalidate=10, stale-if-error=20",
        # Last-Modified must be formatted as an HTTP date
        "Last-Modified" => Time.now.httpdate
      }

      # Rack expects the body to respond to each, so wrap the string in an array
      [200, headers, ["Hello World @ #{Time.now}"]]
    }

    Rack::Handler::Mongrel.run(app, {:Host => "127.0.0.1", :Port => 80})
The only requirements are that we set a Last-Modified time, and then provide our preferred caching intervals. Max-age specifies the general lifetime of the object (TTL), while stale-while-revalidate indicates for how long after max-age has expired the server can keep providing the stale data (total lifetime is max-age + stale-while-revalidate). Finally, stale-if-error indicates for how long the server can provide the stale data if the application server is down - it is not unusual to set this higher than stale-while-revalidate. That's it!
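To check the arithmetic behind these windows, here is a small plain-Ruby sketch that classifies a cached response by its age. The helper names (`parse_cache_control`, `cache_state`) and the returned symbols are hypothetical, and the `origin_up` flag is a simplification made for this example: a real cache only learns the origin is down when a revalidation attempt fails.

```ruby
# Parse the numeric directives out of a Cache-Control header value.
def parse_cache_control(header)
  header.split(',').each_with_object({}) do |directive, result|
    name, value = directive.strip.split('=', 2)
    result[name] = value.to_i if value
  end
end

# Classify a cached response of the given age (in seconds).
# origin_up is a simplification: it says whether the application
# server would answer a revalidation request right now.
def cache_state(header, age, origin_up = true)
  directives = parse_cache_control(header)
  staleness = age - directives.fetch('max-age', 0)

  return :fresh if staleness < 0
  if origin_up
    # Both stale windows are measured from the moment max-age runs out.
    return :stale_while_revalidate if staleness < directives.fetch('stale-while-revalidate', 0)
  elsif staleness < directives.fetch('stale-if-error', 0)
    return :stale_if_error
  end
  :expired
end

header = "max-age=10, stale-while-revalidate=10, stale-if-error=20"

cache_state(header, 5)          # => :fresh
cache_state(header, 15)         # => :stale_while_revalidate
cache_state(header, 25)         # => :expired
cache_state(header, 25, false)  # => :stale_if_error
cache_state(header, 35, false)  # => :expired
```

With the header from the Rack example, the object is fresh for 10 seconds, can be served stale during revalidation until it is 20 seconds old, and can be served on errors until it is 30 seconds old.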
Masking Latency and Failures
Of course, these extensions do not excuse any of us from building fast and reliable web-services. In theory, we would have no need for these extensions, but as the saying goes: "in theory, theory and practice are the same, in practice, they are not." If you have data that can be served slightly stale (frankly, the majority of the data falls into this category), then Squid can make all the difference next time your server decides to take a break at 4AM.
