Varnish: It’s Not Just For Wood Anymore »

Created at: 15.04.2010 20:00, source: Engine Yard Blog, tagged: Technology varnish

In previous posts, I’ve routinely mentioned a piece of software called Varnish. Varnish is a caching reverse proxy for web traffic, and if your job or your interests lean toward production web applications at all, you definitely want to get familiar with it.

This post isn’t going to try to make a case for using a caching reverse proxy, as I think that’s already sufficiently covered. Instead, it’ll focus specifically on an overview of Varnish: what you need to do with it out of the box to implement decent caching for a typical web application, and some of the more sophisticated capabilities you’ll want to get familiar with.

Varnish was written from the ground up to be a high performance HTTP accelerator. It leverages its host operating system’s own memory management abilities and threading abilities in order to provide a high capacity, high throughput caching system. It has many features that make it a nice tool, but it does avoid the massive feature bloat of many other caching proxy implementations.

Among its features are load balancing and graceful handling of dead proxy back ends, a built-in Perl-esque configuration language that permits sophisticated behavior customization, URL rewriting, and support for the most useful parts of ESI.

As previously mentioned, Varnish is threaded. Specifically, it manages a thread pool, or a set of thread pools (as determined by configuration), and it uses one thread for each connection. This generally works well, but it does mean you need to think about configuration a little bit. If you configure Varnish to accept 20,000 concurrent connections, then it’ll be running 20,000 system threads. Make sure you’re on a system that isn’t going to have its manhood threatened by that situation before you do it in production.

Installing Varnish is straightforward. It’s available in most operating system package management systems, though it may not always be the most recent version; definitely check that you’re getting an acceptably recent version from your package manager. You can also easily build it from source if you want to ensure that you have the most recent version. Simply download the source from Sourceforge. To build:

./autogen.sh
./configure
make
make install

With Varnish, building it isn’t quite the end of the story. While it’ll run out of the box with its build defaults for all parameters, it doesn’t actually run very well that way. It’ll deliver great performance to a point, but the defaults allow it to be overwhelmed pretty easily. Running a durable Varnish instance requires a bit of configuration love, and the command line configuration options are legion.

varnishd -a :80 \
-b 127.0.0.1:81 \
-T 127.0.0.1:6082 \
-s file,/var/lib/varnish,100GB \
-f /etc/varnish/default.vcl \
-u nobody \
-g nobody \
-p obj_workspace=4096 \
-p sess_workspace=262144 \
-p listen_depth=2048 \
-p overflow_max=2000 \
-p ping_interval=2 \
-p log_hashstring=off \
-h classic,5000009 \
-p thread_pool_max=1000 \
-p lru_interval=60 \
-p esi_syntax=0x00000003 \
-p sess_timeout=10 \
-p thread_pools=1 \
-p thread_pool_min=100 \
-p shm_workspace=32768 \
-p srcadd_ttl=0 \
-p thread_pool_add_delay=1

The command line is outrageously long, but don’t hyperventilate. You won’t be typing this by hand in a production deployment anyway, because your startups are all scripted, right? I am not going to go over every one of these settings—the Varnish web site has lots of getting started documentation to guide you when things get confusing. However, let’s take a look at a few of the more interesting parameters that you should know about when configuring.

-a :80

The -a option provides a host and port for Varnish to listen to. If the host is omitted, the given port is listened to on all interfaces.

-b 127.0.0.1:81

This provides a single default backend for varnish to proxy to. It also accepts a HOST:PORT pair.

-s file,/var/lib/varnish,100GB

Varnish functions by allocating a system controlled area of memory to use to store the cached data. This can either be a malloc allocated area, specified by the keyword ‘malloc’, or an area backed by a file, specified by the keyword ‘file’. The file variant uses mmap, while the malloc variant, with a large cache, will make use of swap space and the swapping subsystem.

If using the malloc type of storage, the only option that one provides is the amount of memory to allocate for the storage area. This is a number, in bytes, or a number in bytes suffixed by:

  • K or k for kibibytes
  • M or m for mebibytes
  • G or g for gibibytes
  • T or t for tebibytes

This is pretty straightforward. Tune your cache to the amount of content that you have and the amount of space you have at your disposal. These cache spaces only persist for the life of the Varnish process, which means that if Varnish is killed and restarted, the cache must be populated anew. However, as of version 2.1.0, which was released on March 24th, there is now experimental support for persistent caches.
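As a rough illustration of the size suffixes listed above, here is a minimal Ruby sketch that turns a size string into a byte count. The `parse_size` helper is hypothetical and is not part of Varnish; it simply assumes the suffixes are binary multiples, as the list states.

```ruby
# Hypothetical helper, not part of Varnish: convert a size such as "512M"
# into bytes, treating the suffixes as binary multiples (K = 1024 bytes).
SIZE_MULTIPLIERS = {
  "k" => 1024,
  "m" => 1024**2,
  "g" => 1024**3,
  "t" => 1024**4
}.freeze

def parse_size(text)
  match = /\A(\d+)([kmgt])?\z/i.match(text.strip)
  raise ArgumentError, "unrecognized size: #{text.inspect}" unless match

  number = match[1].to_i
  suffix = match[2]
  suffix ? number * SIZE_MULTIPLIERS.fetch(suffix.downcase) : number
end

puts parse_size("4096")  # plain byte count
puts parse_size("512M")  # 512 mebibytes
puts parse_size("2g")    # 2 gibibytes
```

Doing this arithmetic before you pick a value makes it easier to compare the cache size against the RAM or disk you actually have available.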

-p overflow_max=2000

When there are more accepted requests than there are threads to handle them, Varnish sticks them into an overflow queue. If the overflow queue fills up, and the listen queue (the size of which can be controlled with the listen_depth option) is full, then requests start getting dropped.

Requests that are just sitting there waiting to be handled take up space, so this parameter shouldn’t be set absurdly high with no reason. That said, it needs to have a bit of a ceiling to allow for traffic spikes to occur without detrimental effects. This also lets it survive things like someone pointing ‘ab’ at the proxy and saturating it with requests.

Increasing the size of this parameter is one of the crucial changes from the default configuration which is necessary to help ensure a production capable Varnish deployment. If you don’t change it, it’s quite easy to DoS Varnish with something as ubiquitous as Apache Bench.

-p thread_pool_max=1000

-p thread_pools=1

-p thread_pool_min=100

-p thread_pool_add_delay=1

Taken together, these four parameters describe the thread pooling behavior for Varnish. The thread_pools parameter is self describing. Generally, you probably want one pool per core.

The thread_pool_min parameter gives a bottom limit for the number of threads to maintain, per pool, regardless of traffic. Don’t keep this too low, or you may limit Varnish’s ability to rapidly respond to traffic spikes when it’s otherwise not very busy. At the same time, setting it too high just increases the amount of time the OS spends babysitting threads that aren’t doing anything, so practice moderation.

When Varnish doesn’t have enough threads in its thread pool(s) to handle the traffic, it creates new ones. In order to avoid swamping the system, there’s a delay between the launching of each thread. The parameter that controls this is thread_pool_add_delay. This defaults to 20 milliseconds, but that’s far too long to handle load spikes. The prevailing wisdom right now is to set it at one to two milliseconds.
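To see why the delay matters, here is a back-of-the-envelope sketch in Ruby. It assumes threads are added strictly one at a time with the full delay between each, which is a simplification of whatever Varnish really does; the `ramp_up_seconds` name is just for illustration.

```ruby
# Rough estimate: how long does it take to grow from a starting thread
# count to a target when one thread is added per delay interval?
# Assumes strictly serial thread creation, which is a simplification.
def ramp_up_seconds(from_threads, to_threads, add_delay_ms)
  extra = [to_threads - from_threads, 0].max
  extra * add_delay_ms / 1000.0
end

puts ramp_up_seconds(100, 1000, 20)  # 20 ms default delay
puts ramp_up_seconds(100, 1000, 1)   # tuned 1 ms delay
```

Under that assumption, going from 100 to 1000 threads takes 18 seconds with the 20 millisecond default and under a second with a 1 millisecond delay.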

Finally, Varnish has a limit on the total number of threads that it’ll spawn. This is thread_pool_max. Pay attention here: the semantics of the two limits differ. thread_pool_min is the minimum per thread pool, while thread_pool_max is the maximum for all pools collectively. So, for example, if one has thread_pools=4 and also has thread_pool_max=1000, then the entire Varnish process is limited to 1000 threads; this is not a per pool attribute. At the same time, if one had thread_pool_min=100, then there would be a minimum of 400 threads running at all times; that is 100 per thread pool.
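The per-pool versus collective distinction is easy to get wrong, so here is the same arithmetic as a small Ruby sketch. The `thread_limits` helper and its sanity check are my own illustration, not something Varnish provides.

```ruby
# Thread limits as described above: the minimum applies per pool,
# while the maximum applies to the whole process.
# The consistency check is an assumption added for illustration.
def thread_limits(thread_pools:, thread_pool_min:, thread_pool_max:)
  minimum = thread_pools * thread_pool_min
  if minimum > thread_pool_max
    raise ArgumentError, "minimum of #{minimum} threads exceeds thread_pool_max"
  end
  { min_total: minimum, max_total: thread_pool_max }
end

p thread_limits(thread_pools: 4, thread_pool_min: 100, thread_pool_max: 1000)
```

With four pools, a per-pool minimum of 100, and a maximum of 1000, the process always keeps 400 threads and never grows past 1000.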

These command line options just scratch the surface of what one can do with Varnish. Some of the best features of Varnish come from the Varnish Configuration Language.

This language, commonly called VCL, is a domain specific language that is used to customize Varnish’s request handling and caching behaviors. In appearance, it is reminiscent of a Perl kept to the most C-like basics, but it is pretty easy to both read and use. VCL is also very fast because Varnish actually translates the VCL code into C and then compiles it into a shared object, on the fly, so even complicated logic expressed in VCL has little overall impact on Varnish performance.

Varnish runs with a simple, functional default VCL configuration, but you may add to the configuration by providing your own VCL file like so: -f /etc/varnish/default.vcl

In the sample command line, above, Varnish is given a single back end to proxy to. However, backends can be defined in a VCL file, and when doing so, additional information about the behavior of a backend can be encoded.

backend fast {
  .host = "fasthost.mydomain.com";
  .port = "http";
  .connect_timeout = 1s;
  .first_byte_timeout = 2s;
  .between_bytes_timeout = 1s;
  .probe = {
    .url = "/ping";
    .timeout = 1s;
    .window = 4;
    .threshold = 4;
  }
}
 
backend slow {
  .host = "slowhost.mydomain.com";
  .port = "http";
  .connect_timeout = 6s;
  .first_byte_timeout = 8s;
  .between_bytes_timeout = 3s;
  .probe = {
    .request =
      "GET /ping HTTP/1.1"
      "Host: pinghost.mydomain.com"
      "X-Ping: true"
      "Connection: close";
    .timeout = 5s;
  }
}

Take note of those .probe sections. These are completely optional, but if provided, Varnish will use them to perform health checks on the backend. The .window and .threshold options can be used to provide a health tolerance. That is, given a certain number of checks (the .window), how many have to have been successful (the .threshold) for the backend to be considered healthy.
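The window and threshold idea can be sketched in a few lines of Ruby. This is only a model of the concept, not Varnish's implementation, and the `HealthWindow` class is a hypothetical name.

```ruby
# Sketch of the window/threshold idea (not Varnish's implementation):
# remember the last `window` probe results and call the backend healthy
# when at least `threshold` of them succeeded.
class HealthWindow
  def initialize(window:, threshold:)
    @window = window
    @threshold = threshold
    @results = []
  end

  # Record one probe result (true or false), keeping only the newest ones.
  def record(success)
    @results << success
    @results.shift while @results.size > @window
    self
  end

  def healthy?
    @results.count(true) >= @threshold
  end
end

health = HealthWindow.new(window: 4, threshold: 3)
[true, true, false, true].each { |ok| health.record(ok) }
puts health.healthy?   # three of the last four probes succeeded
health.record(false)
puts health.healthy?   # the window is now true, false, true, false
```

Setting the threshold equal to the window, as the fast backend above does, means a single failed probe is enough to mark the backend unhealthy; a lower threshold tolerates the occasional blip.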

Varnish also has some support for load balancing between backends. It currently only supports round-robin and random selection, but this behavior can be controlled via VCL, as well.

director plump round-robin {
  { .backend = fast; }
  { .backend = slow; }
  /* Yep, you can define them inline, too */
  {
    .backend = {
      .host = "alternate.mydomain.com";
      .port = "8080";
    }
  }
}

or

director grasshopper random {
  .retries = 3;
  {
    .backend = fast;
    .weight = 9;
  }
  {
    .backend = slow;
    .weight = 1;
  }
}
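If the weights look opaque, this Ruby sketch shows one common way weighted random selection works. It illustrates the technique in general and makes no claim about how Varnish implements its random director; `pick_backend` is a hypothetical helper, and integer weights are assumed.

```ruby
# Sketch of weighted random selection, in the spirit of the random director.
# `rng` is injectable so the choice can be made deterministic in tests.
def pick_backend(weights, rng: Random)
  total = weights.values.sum
  point = rng.rand(total)          # integer in 0...total
  weights.each do |name, weight|
    return name if point < weight
    point -= weight
  end
end

weights = { "fast" => 9, "slow" => 1 }
counts = Hash.new(0)
10_000.times { counts[pick_backend(weights)] += 1 }
p counts   # roughly nine "fast" picks for every "slow" one
```

With weights of 9 and 1, the fast backend should receive about 90% of the requests over a long enough run.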

As additional features, you can define both access control lists and grace periods for cached content in Varnish. A grace period is simply a period of time after an object in the cache has expired during which it can still be returned in response to a request. You’d use this if there are objects in the cache that take a long time to generate, in order to avoid having a bunch of requests piling up waiting for the generation of the new object.
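To make the grace period concrete, here is a minimal Ruby sketch of an entry that is fresh, then usable under grace, then gone. It models the idea only; the `CacheEntry` struct and its state names are hypothetical, and times are plain seconds.

```ruby
# Sketch of TTL plus grace (not Varnish's implementation). An entry is
# :fresh until its TTL passes, usable as :graced for `grace` more seconds,
# and :expired after that.
CacheEntry = Struct.new(:body, :stored_at, :ttl, :grace) do
  def state(now)
    age = now - stored_at
    if age <= ttl
      :fresh
    elsif age <= ttl + grace
      :graced
    else
      :expired
    end
  end
end

entry = CacheEntry.new("cached page body", 0, 60, 60)
p entry.state(30)    # within the TTL
p entry.state(90)    # past the TTL, inside the grace period
p entry.state(130)   # past both
```

While an entry is in the graced state, waiting requests can be answered from the old copy while a single request regenerates the object.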

These parts of VCL are just scratching the surface of the power of VCL, though. VCL offers the ability to define subroutines for grouping your VCL code, several useful built in functions for regular expression matching and cache manipulation, and a whole host of built in subroutines which serve as hooks into the entire request/response cycle for Varnish, allowing you to customize any point of that cycle.

For example, let’s say that you are using a round robin director to load balance between backends, and you want to be sure that if a request for a given resource from one backend fails, no more attempts to that backend, for that resource, are made for a short period of time, to allow it to recover from whatever problem it is having. You can do that with VCL.

sub vcl_recv {
  set req.grace = 60s;
}
 
sub vcl_fetch {
  if (beresp.status == 500) {
    set beresp.saintmode = 15s;
    restart;
  }
  set beresp.grace = 60s;
}

As another example, consider the case from my last post, where I used a Ruby proxy to cache content from Redmine. Redmine isn’t particularly cache friendly, returning cache control headers that normally don’t allow any caching of content. If you wanted to, though, you could make Varnish do it, using VCL.

sub vcl_fetch {
  /* This just says that no matter what those cache control headers are saying, */
  /* insert the content into the cache with a TTL of 60s */
  if (beresp.ttl < 60s) {
    set beresp.ttl = 60s;
  }
}
 
sub vcl_hash {
  /* This causes varnish to use the cookie contents as part */
  /* of the key for storing and looking up content from the */
  /* cache. For Redmine, this would mean per-user contents */
  /* in the Varnish cache. */
  set req.hash += req.http.cookie;
}

That’s pretty nice. A few lines, and the behavior of the cache is significantly customized. However, remember earlier in the article when I mentioned that VCL is translated into C code and compiled on the fly into a dynamically linked shared object? Well, this means that you can embed arbitrary C code into your VCL:

C{
  #include
  #include
}C
 
sub vcl_mylibstuff {
  C{
    mylib_superfunction(VRT_r_req_request(sp));
  }C
}

The capabilities of VCL are far too expansive to cover well in a short post, but this should give you a taste of how flexible and powerful Varnish is. The Varnish wiki has expanded documentation (which should continue to improve) and a number of examples of using VCL to do useful real world work. Varnish is a true power tool for caching. Check it out, and leave questions and comments here!



Ruby Scales, AND It’s Fast – If You Do It Right! »

Created at: 19.03.2010 23:00, source: Engine Yard Blog, tagged: Technology ruby varnish

“Why does everybody say that CPUs are fast nowadays and that ‘it doesn’t matter that language XYZ is slow’?

It does matter: web applications. If your applications can’t serve all the visitors, then you’re going to lose your customer or you’ll have to learn some other language with better performance.

Once our application serves 200 million page views each day… the languange is really sensitive, so we go with C/C++.”

—ruby-talk Thread

Performance: it’s a topic that comes up over and over again in the Ruby world, and everyone’s got an opinion. Unfortunately, those opinions often focus on minutiae, and tend to miss the big picture.

On top of that, things in the Ruby world are far more complex, today, when discussing performance, because one really has to talk about Ruby performance in the context of a specific implementation. Are we talking about Matz Ruby 1.8.x, or 1.9.x? Are we talking about Rubinius or JRuby? What about MacRuby? IronRuby? MagLev? Every one of these has a different performance profile and level of completeness.

For the purposes of this post, and for the purposes of the attention I paid to the two quotes above, I’m going to focus on Matz’ Ruby 1.8.x (MRI). It’s been the Ruby for many years, and it’s what most people are pointing at when they complain about Ruby being slow. Don’t just take my word for it though—check out The Computer Language Benchmarks Game for a substantial set of flawed micro-benchmarks using a plethora of different languages. What they call “Ruby MRI” is, at this time, ruby 1.8.7 (2009-06-12 patchlevel 174). It’s not even close to being the most recent version of 1.8.7, but that’s OK. The benchmarks there have to be taken with a couple grains of salt, anyway.

Here’s why: Micro-benchmarks for languages have only a weak relationship to the performance of complex systems implemented in those languages, even when implemented well. Or, to put it another way, the speed at which a language can complete a simple, discrete task, is not necessarily a strong predictor of how fast a complicated application, composed of many tasks, will perform when implemented in that language. There are other factors which come into play that can strongly influence overall performance; factors like application architecture, and the ability to leverage higher-level built in capabilities, that simplify things which may be complex to implement in other languages.

Many of you probably know people who claim Ruby can’t scale, or is too slow for business-critical web applications. Since you’re reading this, you also know those people are wrong. In fact, it’s usually far easier to scale a Rails application’s web-facing aspect than it is to scale the data storage parts of the application. Nonetheless, scaling that web-facing aspect has costs, and if your application can return content to your customers more efficiently, reducing your hardware needs, you reduce your costs.

Returning to the ruby-talk thread that those quotes came from, my response included an assertion that I thought I could spin up a single Engine Yard Cloud instance, and that running it with an all Ruby stack, I could push 200,000,000 requests through it in less than a day. When I say an all Ruby stack, I’m not talking about the database layer, but rather, the application and anything above it (such as the web server). I wouldn’t use Apache, nginx, or any other non-Ruby web server, and I’d use a real, complex application.

Since I already had a 64bit, 4ECU instance running that I use for testing Ruby 1.8.6 changes, I just used that existing instance. I used Ruby 1.8.6 pl287 for this. I could’ve used any version, as RVM makes it simple to pick and choose, but I selected that one because many sites have run on it for a long time (though if you are running on it now, you really should upgrade), and by being a less than current version, it serves my point well.

For generating test traffic, I used the venerable Apache Bench. Even after all these years it’s still got some buggy corner cases, but it’s straightforward and easy to use, and its own performance is high enough that it takes some pretty fast test subjects before you start running into the performance limitations of the tool, instead of the test subjects. I ran it on the same machine as my application’s stack because I wanted to eliminate the network as a factor in results, and just feed as many requests to my stack as quickly as possible.

The test application was Redmine, version 0.8.7. I selected Redmine because it’s a complex application familiar to many people, and it’s easy to install. It’s also not yet optimized for speed. Development has been far more focused on features and function than on optimizing for resource usage efficiency. The Rails version that I used is 2.3.2.

So, after installing and configuring Redmine, I started it:

ruby script/server -e production -d

Note that I did not use Mongrel, evented_mongrel, Thin, or anything else sophisticated as the container for the application. It was just webrick, and it was just a single instance of webrick.

I then threw some random data into it just so that there was something other than the empty pages. So, let’s see how it performed!

ab -n 10000 -c 1 http://127.0.0.1:3000/

Hmmm. I rode my exercise bike 1.3 miles while that ran… That didn’t feel fast at all.

Requests per second:    33.98 [#/sec] (mean)
Time per request:       29.432 [ms] (mean)
Time per request:       29.432 [ms] (mean, across all concurrent requests)

OK. I mean, that’s not horrible. Redmine isn’t a lightweight app, and that’s over 2.5 million requests a day on a single process. What happens if there’s some concurrency?

ab -n 10000 -c 25 http://127.0.0.1:3000/
Requests per second:    31.11 [#/sec] (mean)
Time per request:       803.707 [ms] (mean)
Time per request:       32.148 [ms] (mean, across all concurrent requests)

That was a 1.4 mile benchmark ride. Shoot; does that mean Ruby really is slow? That did not go in the direction we need, and let’s be real here: in a real application deployment, there are going to be concurrent requests—many of them, if you’re at all successful. It’s pretty clear what direction everything was moving in, but I wanted to take it one step further.

ab -n 10000 -c 500 http://127.0.0.1:3000/
Benchmarking 127.0.0.1 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
apr_socket_recv: Connection reset by peer (104)

Well, good to know. Clearly, Redmine running inside of webrick can scale, but there are limits that aren’t too hard to hit on a single process. If we were spreading these requests over multiple processes on multiple instances, we could reasonably scale to many millions of requests per day, even running our code on webrick, assuming that the database layer could keep up with all of that. However, that’s still a long way from two hundred million requests per day.
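To put a number on "a long way", here is the arithmetic as a small Ruby sketch. It assumes throughput scales perfectly linearly with the number of processes, which real deployments rarely manage, so treat the result as a lower bound; the helper names are just for illustration.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Requests a single process could serve in a day at a steady rate.
def requests_per_day(requests_per_second)
  (requests_per_second * SECONDS_PER_DAY).round
end

# Processes needed to reach a daily target, assuming perfect linear scaling.
def processes_needed(target_per_day, requests_per_second)
  (target_per_day.to_f / requests_per_day(requests_per_second)).ceil
end

puts requests_per_day(33.98)                 # one webrick process
puts processes_needed(200_000_000, 33.98)    # processes for 200 million a day
```

At the 33.98 requests per second measured above, one process handles about 2.9 million requests a day, so 200 million a day would take on the order of 69 webrick processes even before the database becomes the bottleneck.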

Even if we were running on a Ruby implementation that was 2x as fast, or 5x as fast, and even if the application were running in a faster container, the basic problem is still the same—we’d have to throw hardware at it until the problem went away. Even if you spent a lot of time laboriously building Redmine in C++ while focusing on performance, you still wouldn’t escape the need, with this simple architecture, to throw hardware at the problem. So, what do you do if you need more throughput out of your application, but aren’t excited about adding more hardware resources?

Consider these runs:

ab -n 10000 -c 1 -C '_redmine_session=9ec759408f1ae3c6f919e50baba5a3dc; path=/' http://127.0.0.1/
Requests per second:    2839.37 [#/sec] (mean)
Time per request:       0.352 [ms] (mean)
Time per request:       0.352 [ms] (mean, across all concurrent requests)

ab -n 10000 -c 1000 -C '_redmine_session=9ec759408f1ae3c6f919e50baba5a3dc; path=/' http://127.0.0.1/
Requests per second:    3862.33 [#/sec] (mean)
Time per request:       258.911 [ms] (mean)
Time per request:       0.259 [ms] (mean, across all concurrent requests)

ab -n 100000 -c 25 -k  -C '_redmine_session=9ec759408f1ae3c6f919e50baba5a3dc; path=/' http://127.0.0.1/
Requests per second:    7797.39 [#/sec] (mean)
Time per request:       3.206 [ms] (mean)
Time per request:       0.128 [ms] (mean, across all concurrent requests)

I barely had time to turn the cranks on the exercise bike for those runs! It turns out that to get that performance, I needed to look at my architecture and rethink how I was positioning my application’s web facing aspect. Most applications, even highly dynamic ones, show lots of the same stuff to the users. In many cases completely identical content is being displayed for many different users. It’s senseless to regenerate this content over and over again. This is where caching enters the architecture picture.

Rails 2 has some built in support for caching. It’ll do page caching, which basically writes a static copy of a dynamically generated page to a persistent location, so that on subsequent hits the web server can deliver the page. This works great, but it has limitations.

All content, for everyone, for a given URL must be identical, and you’re responsible for providing a sweeper that clears old content. Also, requests will still fall down to your web server, which may mean that you still encounter some significant performance penalties when delivering your content in some situations. For example, nginx delivers static files quite quickly if it’s sitting on top of a fast disk. Sit it on a slow disk, though, and page caching returns limited dividends. If it can work for your application though, use it.

Rails also supports partial caching in some different guises—to the file system, to memory, to memcached, etc. Partial caching can be a win architecturally, because it bypasses all of the heavy work involved in generating content; your app can just assemble pregenerated fragments into a complete page. If you haven’t done so, look into that as well. It can be very helpful.

Along those same conceptual lines, there’s also edge side includes, or ESI. ESI essentially lets one’s application return a skeleton of a page, or an incomplete page with some special markup embedded. The proxy that receives that content, and that understands ESI markup can then insert content, either from its own cache, or from a subrequest that it issues to some other URL.

This lets a proxy cache a generated, but incomplete page, yet still fill it out with smaller pieces of dynamically generated content without pushing all of that work back into the dynamic application. So it’s a bit like partial caching, but it’s handled at a shallower level in the stack. I’ve heard that Rails 3 will have a plugin to facilitate the use of ESI, and that it may come built in with a later dot release. Not all reverse caching proxies support ESI, but many of them do.
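As a toy illustration of the ESI idea, this Ruby sketch fills include tags in a cached skeleton from a hash of fragments. The fragment paths and the `expand_esi` helper are hypothetical, and a real proxy would consult its cache or issue a subrequest rather than read from a hash.

```ruby
# Sketch of the ESI idea: a cached page skeleton contains include tags, and
# the proxy fills each one from a fragment source. Here the fragment source
# is just a hash standing in for a cache lookup or subrequest.
ESI_INCLUDE = %r{<esi:include\s+src="([^"]+)"\s*/>}

def expand_esi(skeleton, fragments)
  skeleton.gsub(ESI_INCLUDE) do
    # Unknown fragments are replaced with an empty string in this sketch.
    fragments.fetch(Regexp.last_match(1), "")
  end
end

skeleton  = '<h1>Issues</h1><esi:include src="/fragments/user_menu"/>'
fragments = { "/fragments/user_menu" => "<ul><li>Log out</li></ul>" }
puts expand_esi(skeleton, fragments)
```

The skeleton can be cached once for everybody, while the small per-user fragment is the only piece that needs separate handling.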

For Redmine, page caching doesn’t work very well. It, like many applications, uses cookies. Applications can use cookies to identify users, to handle authentication, or to persist data on the user’s browser, instead of on the server. When an application needs to deliver cookies in addition to content, simple page caching won’t work. Redmine falls into this category. And besides… I promised to use a Ruby stack, so leveraging Nginx or Apache to serve files from a page cache would be cheating.

What I really needed was a caching reverse proxy that would sit in front of the application. It had to be smart enough to do the right thing with regard to caching content that has cookies attached (at least for some definition of the right thing), and it had to be stubborn enough to not-quite follow the Cache-Control headers that Redmine set. It needed to be implemented in Ruby, and it had to be fast enough to be worthwhile.

Most caching reverse proxies are implemented in fast languages. Varnish, one of the fastest caching reverse proxies, is written in C. Nginx, which can be configured to provide a caching reverse proxy, is also implemented with C, as is Squid, one of the oldest proxy servers. Traffic Server is implemented with C++.

Refer back to the benchmarks site. C is a lot faster than MRI Ruby. C++ is significantly faster, too. So, to borrow a phrase from my grandmother, how on God’s green Earth do I expect to write a proxy in Ruby that can compete with one in a language that benchmarks 100x-200x faster than it is?

Bullheaded stubbornness in the face of ignorance? Well, yes, a little bit, combined with some specific architectural decisions. Most of those proxies try to do everything. I think there are probably configuration options in Squid that would get it to cook breakfast for me. Traffic Server probably won’t cook breakfast for me, yet, but it will make the bed, and somewhere in the TODO, I’m sure they have plans to allow for it to make breakfast, too, if you can figure out how to configure it. Varnish is one of the fastest proxies, and it gets its speed, in large part, because it won’t make the bed or cook my breakfast. It’s like Charles Emerson Winchester III from M*A*S*H, “I do one thing at a time, I do it very well, and then I move on.” Varnish does still take some configuration education to get it to work well, though.

And that is the secret to keeping things fast. Or, at least one of the secrets, anyway. I took it one step further. My approach was:

Do one thing at a time, do it well enough, and then move on.

A couple of years ago I wrote a very fast proxy and simple web server in Ruby that I called Swiftiply. It leverages EventMachine for handling network traffic, and then tries to squeeze the rest of the performance that it needs out of Ruby by not providing any more capability than is really needed to get the job done. Someone once said that “No code is faster than no code.”

Swiftiply didn’t provide enough capability for a caching reverse proxy, but it did have the capability to serve and cache static assets very quickly (on a lot of hardware my benchmarking efforts have run up against Apache Bench’s own performance limits), and it did already function as a proxy, so much of the capability was there. One advantage to it being written in Ruby was that it was relatively straightforward for me to add additional capability to it. So I did.

To really handle Redmine properly requires the ability to cache different versions of the same URL, where the only differentiator is the cookies. Also, Redmine sets a Cache-Control header that looks like this:

Cache-Control: private, max-age=0, must-revalidate

Without digging into it deeply, this means that public caches should not cache the content, and private caches need to confirm with the server that it has valid content before using it. But we want to ignore that (unless Cache-Control is set to no-cache, in which case we’ll pay attention), because we do want to keep private content cached, and we do not want to have to always go back to the application to revalidate on every request. My assumption is that it is OK if, for example, a new issue is added, but it takes a few seconds before a url which shows the issues is refreshed to display that new issue.
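The policy in the last two paragraphs can be sketched roughly like this in Ruby. This is not Swiftiply's actual code; `cacheable?` and `cache_key` are hypothetical names, and the sketch only looks for a bare no-cache directive.

```ruby
require "digest/sha1"

# Sketch of the policy described above (names are hypothetical):
# cache everything unless the response says no-cache, and key the cache on
# URL plus cookies so each session gets its own copy of a page.
def cacheable?(cache_control)
  directives = cache_control.to_s.downcase.split(",").map(&:strip)
  !directives.include?("no-cache")
end

def cache_key(url, cookie_header)
  Digest::SHA1.hexdigest("#{url}\n#{cookie_header}")
end

puts cacheable?("private, max-age=0, must-revalidate")  # Redmine's header
puts cacheable?("no-cache")

same = cache_key("/issues", "_redmine_session=abc") ==
       cache_key("/issues", "_redmine_session=xyz")
puts same   # different cookies, so different cache entries
```

Keying on the cookie trades cache efficiency for correctness: two users never see each other's pages, but the cache holds one copy of each URL per session.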

The end result is a caching reverse proxy with very few tuning knobs, and behavior that’s not quite HTTP 1.1 correct, but that is very fast, stable, and hackable. It’s probably not actually as fast as it could be, since I piggy backed the implementation onto something that’s doing more than I really need, but it’s good enough. Ruby, as a “slow” language, delivers on something that runs very fast and is good enough for the goal that I had.

If you’re wondering how many requests were pushed through my Ruby stack in 24 hours:

Requests per second:    3283.09 [#/sec] (mean)

That’s 283,659,084 requests in 24 hours (and none of them were keepalive requests). All handled in a Ruby stack. All with a completely browseable and useable Redmine installation that was still responsive while the test was running; I added issues, edited them, removed them, and did administrative actions with no perceptible delays.

I readily admit that this isn’t a test that faithfully simulates real production loads; you probably aren’t going to roll out a production web app servicing two or three hundred million requests a day on a single modestly sized EY Cloud instance. But if you were doing something that wasn’t going to be bottlenecked by the data store, you just might be able to do it, all with slow, slow Ruby. Not bad.

It’s no Varnish, and it never will be. Varnish does far more, more correctly, and all a little bit faster. Varnish also requires some careful tuning to run well, and is not nearly so hackable, so there are tradeoffs. If you need more performance out of your application, look closely at what a caching reverse proxy can do for you. In the larger view of your application’s deployment architecture, it can make a tremendous difference in your users’ experience. Varnish is a great piece of software, and deserves a post of its own covering configuration and usage.

And if you truly find that you need some specialized capability, don’t be afraid to spike something out with Ruby. Paying a little attention to writing lean code that delivers just the capabilities that you need can result in surprisingly fast, capable code, even in a slow implementation of a slow language like Ruby ;)

Questions and comments welcome!



Masking Latency & Failures with Squid »

Created at: 05.08.2009 19:27, source: igvita.com, tagged: Architecture cache squid varnish

Latency matters. Watching the interviews with the Yahoo developers on the launch of their new homepage, I was once again reminded about the great amount of effort they have put in to make it all work. In fact, I've written about Mark Nottingham's proposals for stale-while-revalidate and stale-if-error in the past (with a homebaked implementation), but a more realistic deployment scenario is to use a caching server like Squid or Varnish. Let's take a look at how these extensions can help your specific deployment.

HTTP Caching Extensions: stale-*

Both stale-while-revalidate and stale-if-error are the direct results of Mark Nottingham's work at Yahoo. It is clear that Squid plays a big role in the Y! infrastructure and these extensions are undoubtedly deployed in their data centers. Stale-while-revalidate addresses a simple problem: when a record becomes stale, instead of allowing the request to hit the application server, serve the stale data to the client and create an asynchronous request to update the cache (read more & spec). The benefit? All of your customers see consistent performance because the data is always served out of the cache (think RSS feeds, search results, etc).

The second extension (stale-if-error) can help you mask downtime by returning stale data while your ops team resolves the problem. How does it work? You specify a cache-control header (Cache-Control: max-age=600, stale-if-error=1200) which indicates for how long after the record has expired the cache server can serve this data if the application server is down. Of course, nothing stops you from setting this to a high value like a day or longer! This way, if your server goes down, at least the clients won't time out on their requests while you're resolving the problem!

Setting up Squid as a Reverse Proxy

Varnish and Squid are the two top choices when it comes to caching reverse proxies. Varnish has been getting a lot of momentum but unfortunately the features we're looking for are still in the works. There is an open ticket for stale-if-error support, and the stale-while-revalidate implementation unfortunately does not provide the full asynchronous refresh model. For that reason, we're going to use Squid 2.7 (note: Squid 3.0 is a full rewrite of the 2.x branch, and while stale-* patches exist, they haven't officially made it into the tree). A minimal Squid config (if there is such a thing) to get us up and running:

> squid.conf

# http_port public_ip:port accel defaultsite= default hostname, if not provided
http_port 0.0.0.0:80 accel defaultsite=yourdomain.com
 
# IP and port of your main application server (or multiple)
cache_peer 192.168.0.1 parent 80 0 no-query originserver name=main
cache_peer_domain main yourdomain.com
 
# Do not tell the world which Squid version we're running
httpd_suppress_version_string on
 
# Remove the Cache-Control header for upstream servers
header_access Cache-Control deny all
 
# log all incoming traffic in Apache format
logformat combined %>a %ui %un [%tl] "%rm %ru HTTP/%rv" %Hs %<st "%{Referer}>h" "%{User-Agent}>h" %Ss:%Sh
access_log /usr/local/squid/var/logs/squid.log combined all
 

Connecting Squid and your Application

Once Squid is deployed as part of your request chain, taking advantage of its caching mechanism is a trivial matter, just add the "Cache-Control" header! A simple Rack application to make it all work:

> rack-stale.rb

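require "rack"
require "time"

# Rack calls the app with the request environment, so the lambda takes |env|
app = lambda { |env|
  p [:new_request, Time.now]

  headers = {
    # cache for 10 seconds, serve stale for up to 20s, and for up to 30s on errors
    "Cache-Control" => "max-age=10, stale-while-revalidate=10, stale-if-error=20",
    # Last-Modified must be an HTTP-date, which Time#httpdate (from "time") produces
    "Last-Modified" => Time.now.httpdate
  }

  # the body must respond to #each, so wrap the string in an array
  [200, headers, ["Hello World @ #{Time.now}"]]
}

Rack::Handler::Mongrel.run(app, {:Host => "127.0.0.1", :Port => 80})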
 

The only requirements are that we set a Last-Modified time, and then provide our preferred caching intervals. Max-age specifies the general lifetime of the object (TTL), stale-while-revalidate indicates for how long after max-age has expired the server can provide the stale data (total lifetime is max-age + stale-while-revalidate). Finally, stale-if-error indicates for how long the server can provide the stale data if the application server is down - it is not unusual to set this higher than stale-while-revalidate. That's it!
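To make those intervals concrete, here is a Ruby sketch of the decision a cache might make for a stored response of a given age. It follows the description in this post rather than Squid's source, the function and symbol names are hypothetical, and it pretends the cache already knows whether the origin is up, whereas a real cache only finds out by trying.

```ruby
# Parse a Cache-Control header into a hash of directive => value.
def parse_cache_control(header)
  header.split(",").each_with_object({}) do |part, directives|
    name, value = part.strip.split("=", 2)
    directives[name.downcase] = value ? value.to_i : true
  end
end

# Decide what to do with a cached response that is `age` seconds old.
# Simplification: `origin_up` is supplied by the caller.
def cache_action(age, header, origin_up: true)
  cc = parse_cache_control(header)
  max_age = cc.fetch("max-age", 0)
  swr     = cc.fetch("stale-while-revalidate", 0)
  sie     = cc.fetch("stale-if-error", 0)

  return :serve_fresh if age <= max_age
  return :serve_stale_and_refresh_in_background if origin_up && age <= max_age + swr
  return :serve_stale_because_origin_failed if !origin_up && age <= max_age + sie
  origin_up ? :fetch_from_origin : :return_error
end

header = "max-age=10, stale-while-revalidate=10, stale-if-error=20"
p cache_action(5, header)
p cache_action(15, header)
p cache_action(25, header)
p cache_action(25, header, origin_up: false)
p cache_action(45, header, origin_up: false)
```

With the header from the Rack example, a response is fresh for 10 seconds, served stale while refreshing until it is 20 seconds old, and served stale on origin failure until it is 30 seconds old.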

Masking Latency and Failures

Of course, these extensions do not excuse any of us from building fast and reliable web-services. In theory, we would have no need for these extensions, but as the saying goes: "in theory, theory and practice are the same, in practice, they are not." If you have data that can be served slightly stale (frankly, the majority of the data falls into this category), then Squid can make all the difference next time your server decides to take a break at 4AM.

