Misleading Title About Queueing »
Created at: 05.03.2012 18:00, source: RailsTips - Home, tagged: gauges kestrel
I don’t know about you, but I find it super frustrating when people blog about cool stuff at the beginning of a project, but then as it grows, they either don’t take the time to teach or they get all protective about what they are doing.
I am going to do my best to continue to discuss the strategies we are using to grow Gauges. I hope you find them useful and, by all means, if you have tips or ideas, hit me. Without any further ado…
March 1st of last year (2011), we launched Gauges. March 1st of this year (a few days ago), we finally switched to a queue for track requests. Yes, for one full year, we did all report generation in the track request.
1. In the Beginning
My goal for Gauges in the beginning was realtime. I wanted data to be so freakin’ up-to-date that it blew people’s minds. What I’ve realized over the past year of talking to customers is that sometimes Gauges is so realtime, it is too realtime.
That is definitely not to say that we are going to work on slowing Gauges down. More what it means, is that my priorities are shifting. As more and more websites use Gauges to track, availability moves more and more to the front of my mind.
Gut Detects Issue
A few weeks back, with much help from friends (Brandon Keepers, Jesse Newland, Kyle Banker, Eric Lindvall, and the top notch dudes at Fastest Forward), I started digging into some performance issues that were getting increasingly worse. They weren’t bad yet, but I had this gut feeling they would be soon.
My gut was right. Our disk io utilization on our primary database doubled from January to February, which was also our biggest growth in terms of number of track requests. If we doubled again from February to March, it was not going to be pretty.
Back to the Beginning
From the beginning, Gauges built all tracking reports on the fly in the track request. When a track came in, Gauges did a few queries and then performed around 5-10 updates.
When you are small, this is fine, but as growth happens, updating live during a track request can become an issue. I had no way to throttle traffic to the database. This meant if we had enough large sites start tracking at once, most likely our primary database would say uncle.
As you can guess, if your primary says uncle, you start losing tracking data. In my mind, priority number one is now to never lose tracking data. In order to do this effectively, I felt we were finally at the point where we needed to separate tracking from reporting.
2. Availability Takes Front Seat
My goal is for tracking to never be down. If, occasionally, you can’t get to your reporting data, or if, occasionally, your data gets behind for a few minutes, I will survive. If, however, tracking requests start getting tossed to the wayside while the primary screams for help, I will not.
I talked with some friends and found Kestrel to be very highly recommended, particularly by Eric (linked above). He swore by it, and was pushing it harder than we needed to, so I decided to give it a try.
A few hours later, my lacking JVM skills (Kestrel is Scala) were bearing their head big time. I still had not figured out how to build or run the darn thing. I posted to the mailing list, where someone quickly pointed out that Kestrel defaults to /var for logging, data, etc. and, unfortunately, spits out no error on startup about lacking permissions on OSX. One sudo !! later and I was in business.
3. Kestrel
Before I get too far a long with this fairy tail, let’s talk about Kestrel — what is it and why did I pick it?
Kestrel is a simple, distributed message queue, based on Blaine Cook’s starling. Here are a few great paragraphs from the readme:
Each server handles a set of reliable, ordered message queues. When you put a cluster of these servers together, with no cross communication, and pick a server at random whenever you do a set or get, you end up with a reliable, loosely ordered message queue.
In many situations, loose ordering is sufficient. Dropping the requirement on cross communication makes it horizontally scale to infinity and beyond: no multicast, no clustering, no “elections”, no coordination at all. No talking! Shhh!
It features the memcached protocol, is durable (journaled), has fanout queues, item expiration, and even supports transactional reads.
My favorite thing about Kestrel? It is simple, soooo simple. Sound too good to be true? Probably is, but the honeymoon has been great so far.
Now that we’ve covered what Kestrel is and that it is amazing, let’s talk about how I rolled it out.
4. Architecture
Here is the general idea. The app writes track requests to the tracking service. Workers process off those track requests and generate the reports in the primary database.
After the primary database writes, we send the information through a pusher proxy process, which sends it off to pusher.com, the service that provides all the live web socket goodness that is in Gauges. Below is a helpful sketch:

That probably all makes sense, but remember that we weren’t starting from scratch. We already had servers setup that were tracking requests and I needed to ensure that was uninterrupted.
5. Rollout
Brandon and I have been on a tiny classes and services kick of late. What I am about to say may sound heretical, but we’ve felt that we need a few more layers in our apps. We’ve started using Gauges as a test bed for this stuff, while also spending a lot of time reading about clean code and design patterns.
We decided to create a tiny standardization around exposing services and choosing which one gets used in which environment. Brandon took the standardization and moved it into a gem where we could start trying stuff and share it with others. It isn’t much now, but we haven’t needed it to be.
Declaring Services
We created a Registry class for Gauges, which defined the various pieces we would use for Kestrel. It looked something like this:
class Registry
include Morphine
register :track_service do
KestrelTrackService.new(kestrel_client, track_config['queue'])
end
register :track_processor do
KestrelTrackProcessor.new(blocking_kestrel_client, track_config['queue'])
end
end
We then store an instance of this register in Gauges.app. We probably should have named it Gauges.registry, but we can worry about that later.
At this point, what we did probably seems pointless. The kestrel track service and processor look something like this:
class KestrelTrackService
def initialize(client, queue)
@client = client
@queue = queue
end
def record(attrs)
@client.set(@queue, MessagePack.pack(attrs))
end
end
class KestrelTrackProcessor
def initialize(client, queue)
@client = client
@queue = queue
end
def run
loop { process }
end
def process
record @client.get(@queue)
end
def record(data)
Hit.record(MessagePack.unpack(data))
end
end
The processor uses a blocking kestrel client, which is just a decorator of the vanilla kestrel client. As you can see, all we are doing is wrapping the kestrel-client and making it send the data to the right place.
Using Services
We then used the track_service in our TrackApp like this:
class TrackApp < Sinatra::Base
get '/track.gif' do
# stuff
Gauges.app.track_service.record(track_attrs)
# more stuff
end
end
Then, in our track_processor.rb process, we started the processor like so:
Gauges.app.track_processor.run
Like any good programmer, I knew that we couldn’t just push this to production and cross our fingers. Instead, I wanted to roll it out to work like normal, but also push track requests to kestrel. This would allow me to see kestrel receiving jobs.
On top of that, I also wanted to deploy the track processors to pop track requests off. At this point, I didn’t want them to actually process those track requests and write to the database, I just wanted to make sure the whole system was wired up correctly and stuff was flowing through it.
Another important piece was seeing how many track request we could store in memory with Kestrel, based on our configuration, and how it performed when it used up all the allocated memory and started going to disk.
Service Magic
The extra layer around tracking and processing proved to be super helpful. Note that the above examples used the new Kestrel system, but that I wanted to push this out and go through a verification process first. First, to do the verification process, we created a real-time track service:
class RealtimeTrackService
def record(attrs)
Hit.record(attrs)
end
end
This would allow us to change the track_service in the registry to perform as it currently was in production. Now, we have two services that know how to record track requests in a particular way. What I needed next was to use both of these services at the same time so I created a multi track service:
class MultiTrackService
include Enumerable
def initialize(*services)
@services = services
end
def record(attrs)
each { |service| service.record(attrs) }
end
def each
@services.each do |service|
yield service
end
end
end
This multi track services allowed me to record to both services for a single track request. The updated registry looked something like this:
class Registry
include Morphine
register :track_service do
which = track_config.fetch(:service, :realtime)
send("#{which}_track_service")
end
register :multi_track_service do
MultiTrackService.new(realtime_track_service, kestrel_track_service)
end
register :realtime_track_service do
RealtimeTrackService.new
end
register :kestrel_track_service do
KestrelTrackService.new(kestrel_client, track_config['queue'])
end
end
Note that now, track_service selects which service to use based on the config. All I had to do was update the config to use “multi” as the track service and we were performing realtime track requests while queueing them in Kestrel at the same time.
The only thing left was to beef up failure around the Kestrel service so that it was limited in how it could affect production. For this, I chose to catch failures, log them, and move on as if they didn’t happen.
class KestrelTrackService
def initialize(client, queue, options={})
@client = client
@queue = queue
@logger = options.fetch(:logger, Logger.new(STDOUT))
end
def record(attrs)
begin
@client.set(@queue, MessagePack.pack(attrs))
rescue => e
log_failure(attrs, e)
:error
end
end
private
def log_failure(attrs, exception)
@logger.info "attrs: #{attrs.inspect} exception: #{exception.inspect}"
end
end
I also had a lot of instrumentation in the various track services, so that I could verify counts at a later point. These verifications counts would prove whether or not things were working. I left that out as it doesn’t help the article, but you definitely want to verify things when you roll them out.
Now that the track service was ready to go, I needed a way to ensure that messages would flow through the track processors without actually modifying data. I used a similar technique as above. I created a new processor, aptly titled NoopTrackProcessor.
class NoopTrackProcessor < KestrelTrackProcessor
def record(data)
# don't actually record
# instead just run verification
end
end
The noop track processor just inherits from the kestrel track processor and overrides the record method to run verification instead of generating reports.
Next, I adjusted the registry to allow flipping the processor that is used based on the config.
class Registry
include Morphine
register :track_processor do
which = track_config.fetch(:processor, :noop)
send("#{which}_track_processor")
end
register :kestrel_track_processor do
KestrelTrackProcessor.new(blocking_kestrel_client, track_config['queue'])
end
register :noop_track_processor do
NoopTrackProcessor.new(blocking_kestrel_client, track_config['queue'])
end
end
With those changes in place, I could now set the track service to multi, the track processor to noop, and I was good to deploy. So I did. And it was wonderful.
6. Verification
For the first few hours, I ran the multi track service and turned off the track processors. This created the effect of queueing and never dequeueing. The point was to see how many messages kestrel could hold in memory and how it performed once messages started going to disk.
I used scout realtime to watch things during the evening while enjoying some of my favorite TV shows. A few hours later and almost 530k track requests later, Kestrel hit disk and hummed along like nothing happened.

Now that I had a better handle of Kestrel, I turned the track processors back on. Within a few minutes they had popped all the messages off. Remember, at this point, I was still just noop’ing in the track processors. All reports were still being built in the track request.
I let the multi track service and noop track processors run through the night and by morning, when I checked my graphs, I felt pretty confident. I removed the error suppression from the kestrel service and flipped both track service and track processor to kestrel in the config.
One more deploy and we were queueing all track requests in Kestrel and popping them off in the track processors after which, the reports were updated in the primary database. This meant our track request now performed a single Kestrel set, instead of several queries and updates. As you would expect, response times dropped like a rock.

It is pretty obvious when Kestrel was rolled out as the graph went perfectly flat and dropped to ~4ms response times. BOOM.
You might say, yeah, your track requests are now fast, but your track processors are doing the same work that the app was doing before. You would be correct. Sometimes growing is just about moving slowness into a more manageable place, until you have time to fix it.
This change did not just move slowness to a different place though. It separated tracking and reporting. We can now turn the track processors off, make adjustments to the database, turn them back on, and instantly, they start working through the back log of track requests queued up while the database was down. No tracking data lost.
I only showed you a handful of things that we instrumented to verify things were working. Another key metric for us, since we aim to be as close to realtime as possible, is the amount of time that it takes to go from queued to processing.
Based on the numbers, it takes us around 500ms right now. I believe as long as we keep that number under a second, most people will have no clue that we aren’t doing everything live.
7. Conclusion
By no means are we where I want us to be availability-wise, but at least we are one more step in the right direction. Hopefully this article gives you a better idea how to roll things out into production safely. Layers are good. Whether you are using Rails, Sinatra, or some other language entirely, layer services so that you can easily change them.
Also, we are now a few days in and Kestrel is a beast. Much thanks to Robey for writing it and Twitter for open sourcing it!
more »
More Tiny Classes »
Created at: 06.02.2012 14:00, source: RailsTips - Home, tagged: gauges refactoring
My last post, Keep ’Em Separated, made me realize I should start sharing more about what we are doing to make Gauges maintainable. This post is another in the same vein.
Gauges allows you to share a gauge with someone else by email. That email does not have to exist prior to your adding it, because nothing is more annoying that wanting to share something with a friend or co-worker, but first having to get them to sign up for the service.
If the email address is found, we add the user to the gauge and notify them that they have been added.
If the email address is not found, we create an invite and then send an email to notify them they should sign up, so they can see the data.
The Problem: McUggo Route
The aforementioned sharing logic isn’t difficult, but it was just enough that our share route was getting uggo. It started off looking something like this:
post('/gauges/:id/shares') do
gauge = Gauge.get(params['id'])
if user = User.first_by_email(params[:email])
Stats.increment('shares.existing')
gauge.add_user(user)
ShareWithExistingUserMailer.new(gauge, user).deliver
{:share => SharePresenter.new(gauge, user)}.to_json
else
invite = gauge.invite(params['email'])
Stats.increment('shares.new')
ShareWithNewUserMailer.new(gauge, invite).deliver
{:share => SharePresenter.new(gauge, invite)}.to_json
end
end
Let’s be honest. We’ve all seen Rails controller actions and Sinatra routes that are fantastically worse, but this was really burning my eyes, so I charged our programming butler to refactor it.
The Solution: Move Logic to Separate Class
We talked some ideas through, and once he had finished, the route looked more like this:
post('/gauges/:id/shares') do
gauge = Gauge.get(params['id'])
sharer = GaugeSharer.new(gauge, params['email'])
receiver = sharer.perform
{:share => SharePresenter.new(gauge, receiver)}.to_json
end
Perfect? Who cares. Waaaaaaaaay better? Yes. The concern of a user existing or not is moved away to a place where the route could care less.
Also, the bonus is that sharing a gauge can now be used without invoking a route.
So what does GaugeSharer look like?
class GaugeSharer
def initialize(gauge, email)
@gauge = gauge
@email = email
end
def user
@user ||= … # user from database
end
def existing?
user.present?
end
def perform
if existing?
share_with_existing_user
else
share_with_invitee
end
end
def share_with_existing_user
# add user to gauge
ShareWithExistingUserMailer.new(@gauge, user).deliver
user
end
def share_with_invitee
invite = ... # invite to db
ShareWithNewUserMailer.new(@gauge, invite).deliver
invite
end
end
Now, instead of having several higher-level tests to check each piece of logic, we can just ensure that GaugeSharer is invoked correctly in the route test and then test the crap out of GaugeSharer with unit tests. We can also use GaugeSharer anywhere else in the application that we want to.
This isn’t a dramatic change in code, but it has a dramatic effect on the coder. Moving all these bits into separate classes and tiny methods improves ease of testing and, probably more importantly, ease of grokking for another developer, including yourself at a later point in time.
more »
Keep 'Em Separated »
Created at: 04.02.2012 20:00, source: RailsTips - Home, tagged: gauges pusher
Note: If you end up enjoying this post, you should do two things: sign up for Pusher and then subscribe to destroy all software screencasts. I’m not telling you do this because I get referrals, I just really like both services.
For those that do not know, Gauges currently uses Pusher.com for flinging around all the traffic live.
Every track request to Gauges sends a request to Pusher. We do this using EventMachine in a thread, as I have previously written about.
The Problem
The downside of this, is when you get to the point we were (thousands of a requests a minute), there are so many pusher notifications to send (thousands of a minute) that the EM thread starts stealing a lot of time from the main request thread. You end up with random slow requests that have one to five seconds of “uninstrumented” time. Definitely not a happy scaler does this make.
In the past, we had talked about keeping track of which gauges were actually being watched and only sending a notification for those, but never actually did anything about it.
The Solution
Recently, Pusher added web hooks on channel occupy and channel vacate. This, combined with a growing number of slow requests, was just the motivation I needed to come up with a solution.
We (@bkeepers and I) started by mapping a simple route to a class.
class PusherApp < BaseApp
post '/pusher/ping' do
webhook = Pusher::WebHook.new(request)
if webhook.valid?
PusherPing.receive(webhook)
'ok'
else
status 401
'invalid'
end
end
end
Using a simple class method like this moves all logic out of the route and into a place that is easier to test. The receive method iterates the events and runs each ping individually.
class PusherPing
def self.receive(webhook)
webhook.events.each do |event|
new(event, webhook.time).run
end
end
end
At first, we had something like this for each PusherPing instance.
class PusherPing
def initialize(event, time)
@event = event || {}
@time = time
@event_name = @event['name']
@event_channel = @event['channel']
end
def run
case @event_name
when 'channel_occupied'
occupied
when 'channel_vacated'
vacated
end
end
def occupied
update(@time)
end
def vacated
update(nil)
end
def update(value)
# update the gauge in the
# db with the value
end
end
We pushed out the change so we could start marking gauges as occupied. We then forced a browser refresh, which effectively vacated and re-occupied all gauges people were watching.
Once we new the occupied state of each gauge was correct, we added the code to only send the request to pusher on track if a gauge was occupied.
Deploy. Celebrate. Booyeah.
The New Problem
Then, less than a day later, we realized that pusher doesn’t guarantee the order of events. Imagine someone vacating and then occupying a gauge, but receiving the occupy first and then the vacate.
This situation would mean that live tracking would never turn on for the gauge. Indeed, it started happening to a few people, who quickly let us know.
The New Solution
We figured it was better to send a few extra notifications than never send any, so we decided to “occupy” gauges on our own when people loaded up the Gauges dashboard.
We started in and quickly realized the error of our ways in the pusher ping. Having the database calls directly tied to the PusherPing class meant that we had two options:
- Use the PusherPing class to occupy a gauge when the dashboard loads, which just felt wrong.
- Re-write it to separate the occupying and vacating of a gauge from the PusherPing class.
Since we are good little developers, we went with 2. We created a GaugeOccupier class that looks like this:
class GaugeOccupier
attr_reader :ids
def initialize(*ids)
@ids = ids.flatten.compact.uniq
end
def occupy(time=Time.now.utc)
update(time)
end
def vacate
update(nil)
end
private
def update(value)
return if @ids.blank?
# do the db updates
end
end
We tested that class on its own quite quickly and refactored the PusherPing to use it.
class PusherPing
def run
case @event_name
when 'channel_occupied'
GaugeOccupier.new(gauge_id).occupy(@time)
when 'channel_vacated'
GaugeOccupier.new(gauge_id).vacate
end
end
end
Boom. PusherPing now worked the same and we had a way to “occupy” gauges separate from the PusherPing. We added the occupy logic to the correct point in our app like so:
ids = gauges.map { |gauge| gauge.id }
GaugeOccupier.new(ids).occupy
At this point, we were now “occupied” more than “vacated”, which is good. However, you may have noticed, that we still had the issue where someone loads the dashboard, we occupy the gauge, but then receive a delayed, or what I will now refer to as “stale”, hook.
To fix the stale hook issue, we simply added a bit of logic to the PusherPing class to detect staleness and simple ignore the ping if it is stale.
class PusherPing
def run
return if stale?
# do occupy/vacate
end
def stale?
return false if gauge.occupied_at.blank?
gauge.occupied_at > @time
end
end
Closing Thoughts
This is by no means a perfect solution. There are still other holes. For example, a gauge could be occupied by us after we receive a vacate hook from pusher and stay in an “occupied” state, sending notifications that no one is looking for.
To fix that issue, we can add a cleanup cron or something that occasionally gets all occupied channels from pusher and vacates gauges that are not in the list.
We decided it wasn’t worth the time. We pushed out the occupy fix and are now reaping the benefits of sending about 1/6th of the pusher requests we were before. This means our EventMachine thread is doing less work, which gives our main thread more time to process requests.
You might think us crazy for sending hundreds of http requests in a thread that shares time with the main request thread, but it is actually working quite well.
We know that some day we will have to move this to a queue and an external process that processes the queue, but that day is not today. Instead, we can focus on the next round of features that will blow people’s socks off.
more »
Creating an API »
Created at: 01.12.2011 16:00, source: RailsTips - Home, tagged: api gauges
A few weeks back, we publicly released the Gauges API. Despite building Gauges from the ground up as an API, it was a lot of work. You really have to cross your t’s and dot your i’s when releasing an API.
1. Document as You Build
We made the mistake of documenting after most of the build was done. The problem is documenting sucks. Leaving that pain until the end, when you are excited to release it, makes doing the work twice as hard. Thankfully, we have a closer on our team who powered through it.
2. Be Consistent
As we documented the API, we noticed a lot of inconsistencies. For example, in some places we return a hash and in others we returned an array. Upon realizing these issues, we started making some rules.
To solve the array/hash issue, we elected that every response should return a hash. This is the most flexible solution going forward. It allows us to inject new keys without having to convert the response or release a whole new version of the API.
Changing from an array to a hash meant that we needed to namespace the array with a key. We then noticed that some places were name-spaced and others weren’t. Again, we decided on a rule. In this case, all top level objects should be name-spaced, but objects referenced from a top level object or a collection of several objects did not require name-spacing.
{users:[{user:{...}}, {user:{...}}]} // nope
{users:[{...}, {...}]} // yep
{username: 'jnunemaker'} // nope
{user: {username:'jnunemaker'}} // yep
You get the idea. Consistency is important. It is not so much how you do it as that you always do it the same.
3. Provide the URLs
Most of my initial open source work was wrapping APIs. The one thing that always annoyed me was having to generate urls. Each resource should know the URLs that matter. For example, a user resource in Gauges has a few URLs that can be called to get various data:
{
"user": {
"name": "John Doe",
"urls": {
"self": "https://secure.gaug.es/me",
"gauges": "https://secure.gaug.es/gauges",
"clients": "https://secure.gaug.es/clients"
},
"id": "4e206261e5947c1d38000001",
"last_name": "Doe",
"email": "john@doe.com",
"first_name": "John"
}
}
The previous JSON is the response of the resource /me. /me returns data about the authenticated user and the URLs to update itself (self), get all gauges (/gauges), and get all API clients (/clients). Let’s say next you request /gauges. Each gauge returned has the URLs to get more data about the gauge.
{
"gauges": [
{
// various attributes
"urls": {
"self":"https://secure.gaug.es/gauges/4ea97a8be5947ccda1000001",
"referrers":"https://secure.gaug.es/gauges/4ea97a8be5947ccda1000001/referrers",
"technology":"https://secure.gaug.es/gauges/4ea97a8be5947ccda1000001/technology",
// ... etc
},
}
]
}
We thought this would prove helpful. We’ll see in the long run if it turns out to work well.
4. Present the Data
Finally, never ever use to_json and friends from a controller or sinatra get/post/put block. At least as a bare minimum rule, the second you start calling to_json with :methods, :except, :only, or any of the other options, you probably want to move it to a separate class.
For Gauges, we call these classes presenters. For example, here is a simplified version of the UserPresenter.
class UserPresenter
def initialize(user)
@user = user
end
def as_json(*)
{
'id' => @user.id,
'email' => @user.email,
'name' => @user.name,
'first_name' => @user.first_name,
'last_name' => @user.last_name,
'urls' => {
'self' => "#{Gauges.api_url}/me",
'gauges' => "#{Gauges.api_url}/gauges",
'clients' => "#{Gauges.api_url}/clients",
}
}
end
end
Nothing fancy. Just a simple ruby class that sits in app/presenters. Here is an example of the the /me route looks like in our Sinatra app.
get('/me') do
content_type(:json)
sign_in_required
{:user => UserPresenter.new(current_user)}.to_json
end
This simple presentation layer makes it really easy to test the responses in detail using unit tests and then just have a single integration test that makes sure overall things look good. I’ve found this tiny layer a breath of fresh air.
I am sure that nothing above was shocking or awe-inspiring, but I hope that it saves you some time on your next public API.
more »
