Weak Consistency and CAP Implications »
Created at: 24.06.2010 20:07, source: igvita.com, tagged: Architecture cap consistency
Migrating your web application from a single node to a distributed setup is always a deceivingly large architectural change. You may need to do it due to a resource constraint of a single machine, for better availability, to decouple components, or for a variety of other reasons. Under this new architecture, each node is on its own, and a network link is present to piece it all back together. So far so good, in fact, ideally we would also like for our new architecture to provide a few key properties: Consistency (no data conflicts), Availability (no single point of failure), and Partition tolerance (maintain availability and consistency in light of network problems).
Problem is, the CAP theorem proposed by Eric Brewer and later proved by Seth Gilbert and Nancy Lynch, shows that together, these three requirements are impossible to achieve at the same time. In other words, in a distributed system with an unreliable communications channel, it is impossible to achieve consistency and availability at the same time in the case of a network partition. Alas, such is the tradeoff.
"Pick Two" is too simple
The original CAP conjecture presented by Eric Brewer states that as architects, we can only pick two properties (CA, CP, or PA) at the same time, and many attempts have since been made to classify different distributed architectures into these three categories. Problem is, as Daniel Abadi recently pointed out (and Eric Brewer agrees), the relationships between CA, CP and AP are not nearly as clear-cut as they appear on paper. In fact, any attempt to create a hard partitioning into these buckets seems to only increase the confusion since many of the systems can arbitrarily shift their properties with just a few operational tweaks - in the real world, it is rarely an all or nothing deal.
Focus on Consistency
Following some great conversations about CAP at a recent NoSQL Summer meetup and hours of trying to reconcile all the edge cases, it is clear that the CA vs. CP vs. PA model is, in fact, a poor representation of the implications of the CAP theorem - the simplicity of the model is nice, but in reality the actual design space requires more nuance. Specifically, instead of focusing on all three properties at once, it is more productive to first focus along the continuum of “data consistency” options: none, weak, and full.
On one extreme, a system can demand no consistency. For example, a clickstream application which is used for best effort personalization can easily tolerate a few missed clicks. In fact, the data may even be partitioned by data centre, geography, or server, such that depending on where you are, a different “context” is applied - from home, your search returns one set of results, from work, another! The advantage of such a system is that it is inherently highly available (HA) as it is a share nothing, best effort architecture.
On the other extreme, a system can demand full consistency across all participating nodes, which implies some communications protocol to reach a consensus. A canonical example is a “debit / credit” scenario where full agreement across all nodes is required prior to any data write or read. In this scenario, all nodes maintain the exact same version of the data, but compromise HA in the process - if one node is down, or is in disagreement, the system is down.
CAP Implies Weak Consistency
Strong consistency and high availability are both desirable properties, however the CAP theorem shows that we can’t achieve both of these over an unreliable channel at once. Hence, CAP pushes us into a “weak consistency” model where dealing with failures is a fact of life. However, the good news is that we do have a gamut of possible strategies at our disposal.

In case of a failure, your first choice could be to choose consistency over availability. In this scenario, if a quorum can be reached, then one of the network partitions can remain available, while the second goes offline. Once the link between the two networks is restored, a simple data repair can take place - the minority partition is strictly behind, hence there are no possible data conflicts. Hence we sacrifice HA, but do continue to serve some of the clients.
On the other hand, we could lean towards availability over consistency. In this case, both sides can continue to accept reads and/or writes. Both sides of the partition remain available, and mechanisms such as vector clocks can be used to assist with conflict resolution (although, some conflicts will always require application level resolution). Repeatable reads, read-your-own-writes, and quorum updates are just a few of the examples of possible consistency vs. availability strategies in this scenario.
Hence, a simple corollary to the CAP theorem: when choosing availability under the weak consistency model, multiple versions of a data object will be present, will require conflict resolution, and it is up to your application to determine what is an acceptable consistency tradeoff and a resolution strategy for each type of object.
Speed of Light: Too Slow for PNUTS!
Interestingly enough, dealing with network partitions is not the only case for adopting “weak consistency”. The PNUTS system deployed at Yahoo must deal with WAN replication of data between different continents, and unfortunately, the speed of light imposes some strict latency limits on the performance of such a system. In Yahoo’s case, the communications latency is enough of a performance barrier such that their system is configured, by default, to operate under the “choose availability, under weak consistency” model - think of latency as a pseudo-permanent network partition.
Architecting for Weak Consistency
Instead of arguing over CA vs. CP vs. PA, first determine the consistency model for your application: strong, weak, or shared nothing / best effort. Notice that this choice has nothing to do with the underlying technology, and everything with the demands and the types of data processed by your application. From there, if you land in the weak-consistency model (and you most likely will, if you have a distributed architecture), start thinking how you can deal with the inevitable data conflicts: will you lean towards consistency and some partial downtime, or will you optimize for availability and conflict resolution?
Finally, if you are working under weak consistency, it is also worth noting that it is not a matter of picking just a single strategy. Depending on the context, the application layer can choose a different set of requirements for each data object! Systems such as Voldemort, Cassandra, and Dynamo all provide mechanisms to specify a desired level of consistency for each individual read and write. So, an order processing function can fail if it fails to establish a quorum (consistency over availability), while at the same time, a new user comment can be accepted by the same data store (availability over consistency).
more »
Rails Performance Needs an Overhaul »
Created at: 07.06.2010 17:32, source: igvita.com, tagged: Architecture Ruby on Rails performance rails
Browsers are getting faster; JavaScript frameworks are getting faster; MVC frameworks are getting faster; databases are getting faster. And yet, even with all of this innovation around us, it feels like there is massive gap when it comes to the end product of delivering an effective and scalable service as a developer: the performance of most of our web stacks, when measured end to end is poor at best of times, and plain terrible in most.
The fact that a vanilla Rails application requires a dedicated worker with a 50MB stack to render a login page is nothing short of absurd. There is nothing new about this, nor is this exclusive to Rails or a function of Ruby as a language - whatever language or web framework you are using, chances are, you are stuck with a similar problem. But GIL or no GIL, we ought to do better than that. Node.js is a recent innovator in the space, and as a community, we can either learn from it, or ignore it at our own peril.
Measuring End-to-End Performance

A modern web-service is composed of many moving components, all of which come together to create the final experience. First, you have to model your data layer, pick the database and then ensure that it can get your data in and out in the required amount of time - lots of innovation in this space thanks to the NoSQL movement. Then, we layer our MVC frameworks on top, and fight religious wars as developers on whose DSL is more beautiful - to me, Rails 3 deserves all the hype. On the user side, we are building faster browsers with blazing-fast JavaScript interpreters and CSS engines. However, the driveshaft (the app server) which connects the two pieces (the engine: data & MVC), and the front-end (the browser + DOM & JavaScript), is often just a checkbox in the deployment diagram. The problem is, this checkbox is also the reason why the ‘scalability’ story of our web frameworks is nothing short of terrible.
It doesn't take much to construct a pathological example where a popular framework (Rails), combined with a popular database (MySQL), and a popular app server (Mongrel) produce less than stellar results. Now the finger pointing begins. MySQL is more than capable of serving thousands of concurrent requests, the app server also claims to be threaded, and the framework even allows us to configure a database pool!
Except that, the database driver locks our VM, and both the framework and the app server still have a few mutexes deep in their guts, which impose hard limits on the concurrency (read, serial processing). The problem is, this is the default behaviour! No wonder people complain about 'scalability'. The other popular choices (Passenger / Unicorn) “work around” this problem by requiring dedicated VMs per request - that's not a feature, that's a bug!
The Rails Ecosystem
To be fair, we have come a long way since the days of WEBrick. In many ways, Mongrel made Rails viable, Rack gave us the much needed interface to become app-server independent, and the guys at Phusion gave us Passenger which both simplified the deployment, and made the resource allocation story moderately better. To complete the picture, Unicorn recently rediscovered the *nix IPC worker model, and is currently in use at Twitter. Problem is, none of this is new (at best, we are iterating on the Apache 1.x to 2.x model), nor does it solve our underlying problem.
Turns out, while all the components are separate, and its great to treat them as such, we do need to look at the entire stack as one picture when it comes to performance: the database driver needs to be smarter, the framework should take advantage of the app servers capabilities, and the app server itself can't pretend to work in isolation.
If you are looking for a great working example of this concept in action, look no further than node.js. There is nothing about node that can't be reproduced in Ruby or Python (EventMachine and Twisted), but the fact that the framework forces you to think and use the right components in place (fully async & non-blocking) is exactly why it is currently grabbing the mindshare of the early adopters. Rubyists, Pythonistas, and others can ignore this trend at their own peril. Moving forward, end-to-end performance and scalability of any framework will only become more important.
Fixing the "Scalability" story in Ruby
The good news is, for every outlined problem, there is already a working solution. With a little extra work, the driver story is easily addressed (MySQL driver is just an example, the same story applies to virtually every other SQL/NoSQL driver), and the frameworks are steadily removing the bottlenecks one at a time.
After a few iterations at PostRank, we rewrote some key drivers, grabbed Thin (evented app server), and made heavy use of continuations in Ruby 1.9 to create our own API framework (Goliath) which is perfectly capable of serving hundreds of concurrent requests at a time from within a single Ruby VM. In fact, we even managed to avoid all the callback spaghetti that plagues node.js applications, which also means that the same continuation approach works just as well with a vanilla Rails application. It just baffles me that this is not a solved problem already.
The state of art in the end-to-end Rails stack performance is not good enough. We need to fix that.
more »
Scalable Work Queues with Beanstalk »
Created at: 20.05.2010 19:38, source: igvita.com, tagged: Architecture beanstalk ruby
Any web application that reaches some critical mass eventually discovers that separation of services, where possible, is a great strategy for scaling the service. In fact, oftentimes a user action can be offloaded into a background task, which can be handled asynchronously while the user continues to explore the site. However, coordinating this workflow does require some infrastructure: a message queue, or a work queue. The distinction between the two is subtle and blurry, but it does carry important architectural implications. Should you pick a messaging bus such as AMQP or XMPP, roll your own database backed system such as BJ, go with Resque, or evaluate the other three dozen variants available in every conceivable language?
Of course, there is no single answer to that question - it depends on your application. AMQP is a great power tool for message routing, but there are other systems that can do a better job at specific tasks. One of such tools is Beanstalkd, which is a simple, and a very fast work queue service rolled into a single binary - it is the memcached of work queues. Originally built to power the backend for Causes Facebook app, it is a mature and production ready open source project. It just seems that not too many people talk about it, perhaps exactly because it works so well.
Beanstalkd Features & Recipes
Adam Wiggins recently published a great comparison of Beanstalk to a few other work-queue services, and speed is where it stands out. A single instance of Beanstalk is perfectly capable of handling thousands of jobs a second (or more, depending on your job size) because it is an in-memory, event-driven system. Powered by libevent under the hood, it requires zero setup (launch and forget, ala memcached), optional log based persistence, an easily parsed ASCII protocol, and a rich set of tools for job management that go well beyond a simple FIFO work queue.
Out of the box, Beanstalk supports multiple 'tubes' (work queues), which are created and deleted on demand. In turn, each job is associated with a single tube, and has a number of parameters: priority, time to run, delay, an id, and an opaque job body itself.
Once a job is inserted into the work queue, the server returns an ID, which we can use to inspect the job. From there, the queue itself is actually a priority heap! Need to jump a head in line? Set a higher priority on the job and Beanstalk will do the rest. Or, what if your worker goes down while processing a job? Because you specified a time to run on the job, Beanstalk will monitor the checked out job and put it back on the work queue if the timeout expires - seamless recovery, nice. Does the worker need more time to complete the job? There is a 'touch' command to notify the server to prolong the timeout. Have a bad job that you want to save for later inspection? Just bury it and take care of it later. Need to throttle all of the workers? You can pause the tube for a specified period of time. And there is more, do checkout the protocol specification.
Beanstalk at PostRank: Chronos
At PostRank we have dozens of Beanstalk processes sprinkled throughout which are being used for job management within the same machine and coordination between entire clusters. The larger deployments, which are the front-line coordinators to our crawlers are serving 50+ million jobs on a daily basis (average job is several kb), without breaking a sweat. On average, each job is just several kilobytes, but the numbers add up, meaning that a pure memory system would require 60GB+ of RAM to make it work for our use case. That is where the Beanstalk ASCII protocol, good old MySQL, and a little Ruby come together to create our scheduling system: Chronos.
The idea behind Chronos is based on a simple observation: we have tens of millions of crawler jobs, each of which is repeated on a custom interval, but only a small portion of that entire set needs to be in memory to make the system run! So, out of that observation, two projects were born: em-jack and em-proxy. EM-Jack is a Ruby Eventmachine client, which provides a simple mechanism to define custom command handlers that go beyond the native Beanstalk protocol. On the other hand, em-proxy is a protocol agnostic (layer 3) proxy, which allows us to intercept any TCP data stream and manipulate it at will.
So, instead of talking directly to Beanstalk, all the traffic is routed through our custom em-proxy (~150 LOC) which parses the Beanstalk protocol, intercepts custom commands, or simply inspects the "delay" parameter, and decides where the job should be routed: beanstalk or the MySQL instance. Jobs that are scheduled at least one hour into the future are persisted into the database, which significantly reduces the memory footprint. Finally, in the background, the upcoming jobs are silently loaded into beanstalk as their execution time approaches. Simple, reliable, scales well, and it gives us all the features available in Beanstalk for job management and coordination (plus persistence and replication of MySQL). A quick API demo:
EM.run do jack = EMJack::Connection.new r = jack.put("my message", :ttr => 300) { |jobid| puts "put successful #{jobid}" } j = jack.reserve j.callback do |job| puts job.jobid puts job.body jack.delete(job) { puts "Successfully deleted" } end end
Architecture Limitations and Alternatives
There is an abundance of different job and message queue systems, in part because they are so seemingly simple to write. However, if you ever tried to build your own, you will also know that the scaling and the features available in Beanstalk are non-trivial to replicate. However, there are a few limitations as well. Beanstalk does not currently support replication or any other form of high availability. Likewise, there is no native sharding (a few clients support it), or a native GUI. Other then that, it works, it is fast, it is easy to extend, and it is definitely worth a test drive.
more »
Distributed Coordination with Zookeeper »
Created at: 30.04.2010 21:27, source: igvita.com, tagged: Architecture hadoop zookeeper
The Apache Hadoop project is an umbrella for a number of sub-projects, many of which are quite useful outside the Hadoop framework itself. Avro, for one, is a serialization framework with great performance and an API tailored for dynamic environments. Likewise, Zookeeper is a project incubated within the Hadoop ecosystem, but worth spending some time with due its wide applicability for building distributed systems.
Back in 2006, Google published a paper on "Chubby", a distributed lock service which gained wide adoption within their data centers. Zookeeper, not surprisingly, is a close clone of Chubby designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure. The original paper from Google offers a number of interesting insights, but the biggest takeaway is: Chubby and Zookeeper are both much more than a distributed lock service. In fact, it may be better to think of them as implementations of highly available, distributed metadata filesystems.
Zookeeper as a Metadata Filesystem
At its core, Zookeeper is modeled after a straightforward, tree based, file system API. A client is able to create a node, store up to 1MB of data with it, and also associate as many children nodes as they wish. However, there are no renames, soft or hard links, and no append semantics. Instead, by dropping those features, Zookeeper guarantees completely ordered updates, data versioning, conditional updates (CAS), as well as, a few more advanced features such as "ephemeral nodes", "generated names", and an async notification ("watch") API.
Now, this may seem like a grab bag of features, but don't forget that all of this is functionality is exposed as a remote service. Meaning, any network client can create a node, update associated metadata, submit a conditional update, or request an async notification via a "watch" request, which essentially mirrors the inotify functionality exposed in the Linux kernel.
Likewise, to make things easier, a client can request for Zookeeper to generate the node name to avoid collisions (e.g. to allow multiple concurrent clients to create nodes within same namespace without any collisions). And last but not least, what if you wanted to create a node, which only existed for the lifetime of your connection to Zookeper? That's what "ephemeral nodes" are for - essentially, they give you presence. Now, put all of these things together, and you have a powerful toolkit to solve many problems in distributed computing.
Zookeeper: Distributed Architecture
Of course, if we are to deploy Zookeeper in a distributed environment, we have to think about both the availability and scalability of the service. At Google, Chubby runs on a minimum of five machines, to guarantee high availability, and also transparently provides a built-in master election and failover mechanisms. Not surprisingly, Zookeeper is built under the same model. Given a cluster of Zookeeper servers, only one acts as a leader, whose role is to accept and coordinate all writes (via a quorum). All other servers are direct, read-only replicas of the master. This way, if the master goes down, any other server can pick up the slack and immediately continue serving requests. As an interesting sidenote, Zookeeper allows the standby servers to serve reads, whereas Google's Chubby directs all reads and writes to the single master - apparently a single server can handle all the load!

The one limitation of this design is that every node in the cluster is an exact replica - there is no sharding and hence the capacity of the service is limited by the size of an individual machine. The Google paper briefly mentions a possible logical sharding scheme, but in practice, it seems that they had no need for such a feature just yet. Now let's take a look at the applications.
Applications on top of Zookeeper & Chubby
Within the Hadoop/HBase umbrella, Zookeeper is used to manage master election and store other process metadata. Hadoop/HBase competitors love to point to Zookeeper as a SPOF (single point of failure), but in reality, given the backing architecture this seems to be too often over-dramatized. Perhaps even more interestingly, as the Google paper mentions, Chubby has in effect replaced DNS within the Google infrastructure!
By allowing any client to access and store metadata, you can imagine a simple case where a cluster of databases can share their configuration information (sharding functions, system configs, etc), within a single namespace: "/app/database/config". In fact, each database could request a watch on that node, and receive real-time notifications whenever someone updates the data (for example, a data rebalancing due to adding a new database into the cluster). However, that is just the beginning. Since "ephemeral nodes" provide basic presence information, by having each worker machine register with Zookeeper, we can perform real-time group membership queries to see which nodes are online, and perhaps even figure out what they are currently doing.
What about consensus? With a little extra work we can also leverage the data versioning and notifications APIs to build distributed primitives such as worker queues and barriers. From there, once we have locks and consensus, we can tackle virtually any distributed problem: master election, quorum commits, and so on. In other words, it becomes your swiss army knife for coordinating distributed services.
Working with Zookeeper
Getting started with Zookeeper is relatively straightforward: download the source, load the jar file, and you're ready to experiment in a single node mode. From there, you can try interacting with Zookeeper via a simple shell provided with the project:
# create root node for an application, with world read-write permissions
[zk: localhost:2181(CONNECTED) 2] create /myapp description world:anyone:cdrw
Created /myapp# Create a sequential (-s) and ephemeral (-e) node
[zk: localhost:2181(CONNECTED) 6] create -s -e /myapp/server- appserver world:anyone:cdrw
Created /myapp/server-0000000001# List current nodes
[zk: localhost:2181(CONNECTED) 5] ls /myapp
=> [server-0000000000]# Fetch data for one of the nodes
[zk: localhost:2181(CONNECTED) 8] get /myapp/server-0000000001
appserver
Similarly, we can perform all of the same actions directly from Ruby via several libraries. The zookeeper_client gem provides bindings against the C-based API, but unfortunately it doesn't support some of the more advanced features such as watches and asynchronous notifications. Alternatively, the zookeeper gem by Shane Mingins provides a fully featured JRuby version, which seems to cover the entire API.
And if you are feeling adventurous, you could even try the experimental FUSE mount, or the REST server for Zookeeper. For some ideas, check out game_warden (Ruby DSL for dynamic configuration management) built around zookeeper, and a few other articles on the project. Zookeeper is already in use in a number of projects and companies (Digg, LinkedIn, Wall Street), so you are in good company.
more »
Cluster Monitoring with Ganglia & Ruby »
Created at: 28.01.2010 19:16, source: igvita.com, tagged: Architecture monitoring cloud ganglia
A good monitoring solution can make or break an entire service - a well implemented one will enable you to forecast and plan ahead, as well as, quickly spot and debug problems when they arise. However, anyone that has worked with a cluster of machines will know that this is also a non-trivial problem. There are a number of options, both open-source and commercial, and they span a variety of use cases: alerting and notification (Nagios), intrusion detection (SNORT), performance monitoring (Ganglia, Cacti, Scout, etc), or even customized systems such as MySQL Monitor for in-depth database analysis.
Chances are, you will have to run a mix of all of the above to cover all the cases (ex: WikiMedia's Nagios & Ganglia) since there is no all-in-one solution and each has its tradeoffs. On that note, for performance monitoring, Ganglia is definitely an option to explore (it seems that I tried everything but Ganglia first, and I wish I did so much earlier). Originally developed at University of California, it is an open-source project designed from the ground up to be a distributed monitoring system for high-performance system such as clusters and grids - let's see what that means.
Ganglia's Distributed Architecture
Ganglia is powered by three independent components: gmond, gmetad and a PHP frontend. Due to how the system is architected, all three could either run on the same host, or more likely, be distributed between a number of different nodes. Gmond is the workhorse responsible for gathering user specified stats and sharing them over the network: a gmond daemon runs on every monitored node. It is designed to be fast, portable, with low memory footprint, and comes with a number of native monitoring modules (disk, memory, network, etc). Importantly, the gmond daemon never actually persists any data (memory only) to optimize for speed. But that's not all, because it can also receive data from other gmond's, allowing us to build arbitrary hierarchies of nodes - this is how and why Ganglia is capable of scaling to thousands of nodes.

Unlike other monitoring solutions such as Nagios or Cacti, all of the Ganglia metrics rely on data push - there is absolutely no polling involved. The gmond daemons are all responsible for periodically gathering and distributing their stats upstream. However, the gmond's do not persist data, and that is where the gmetad daemon comes in. This daemon is responsible for collecting data from an arbitrary number of gmond's, or even other gmetad daemons, persisting the metrics into correct RRD (round robin database) files, and then making this data available to the PHP frontend (or any other service that consumes RRD's).
Distributed Monitoring & Custom Metrics
Wikimedia's Ganglia setup is a good example to dissect: it consists of two "clusters" (Florida and Kennisnet), each most likely running their own gmetad node, which is then monitored by a central gmetad node which stitches them together. Each cluster, in turn, has its own hierarchy of gmond nodes: squids, apaches, databases, and so on. All together, it is monitoring over 350 nodes without breaking a sweat - not bad. Alas, one gotcha to be aware of upfront: run your gmetad's off a ram-disk to avoid the IO bottleneck associated with updating hundreds of RRD's.
With default configuration the gmond daemon will automatically monitor over 20 core metrics: load, network in and out, disk, memory and so on. However, it is also easily extensible via several mechanisms: custom monitoring modules, or straight up gmetric command line client. As of version 3.1.0, Ganglia now offers a simple Python API which allows us to create custom metrics modules which will be integrated directly into the gmond process. Hence the gmond process will call the code, gather the metrics and then distribute the data for us. A great working example is Gilad Raphaelli's MySQL extension, which gathers over 50 metrics about your database, and creating your own is also pretty straightforward.
However, if writing a python script is too heavy weight, Ganglia also provides the 'gmetric' executable which you can call right from the command line. Give the metric a name, type and a value, and you are off to the races. In other words, you can use bash, Ruby, or anything that will execute from the command line, in combination with gmetric, to funnel data into Ganglia:
# submit "variable_name" which is a string, with value of "hello", remove this var after 600s
gmetric -n variable_name -t string -v "hello" -d 600# submit "int_var", which is an 8-bit int, with value of 20, remove this var after 60s
gmetric -n int_var -t uint8 -v 20 -d 60
Connecting Ganglia With Ruby
One of the key advantages of Ganglia over Cacti or similar services is that there is no per variable data source setup. Once a metric is pushed to a gmetad node, it automatically gets its own RRD file, and appears on your dashboard - no configuration required, just bring up a new service, push the data and you're rocking! Now, if you want to monitor a Ruby process, you have a couple of alternatives: write a python module, or shell out to gmetric. Neither is optimal because while the first method requires more (Python) code and extra polling, the system exec option can also be prohibitively expensive in terms of performance. Thankfully, Ganglia uses a simple UDP protocol with XDR data formatting which means that with a little reverse-engineering, we can talk to our gmond process directly from Ruby (by pretending that it is gmetric generating the packets):
require 'gmetric' # generate metric from Ruby and send it over UDP Ganglia::GMetric.send("127.0.0.1", 8670, { :name => 'pageviews', :units => 'req/min', :type => 'uint8', # unsigned 8-bit int :value => 7000, # value of metric :tmax => 60, # maximum time in seconds between gmetric calls :dmax => 300 # lifetime in seconds of this metric })
Install the gmetric gem, specify the hostname and port number of your gmond daemon and fire off your metrics via UDP directly into the monitoring system. This way, you can track latency, request rates, throughput, or any other application metric directly from Ruby and push it into Ganglia without any performance penalties.
Finally, once you have pushed a dozen new metrics, you can then also generate your own cumulative reports, which aggregate data from multiple sources (key database metrics, etc). Ganglia is an incredibly flexible platform, and the 3.1.x release has done a lot to improve the ability to customize and extend it to fit your applications. If you haven't already, definitely a tool to look into for your cloud applications.
more »

