an obscure kernel feature to get more info about dying processes »

Created at: 20.09.2010 14:59, source: time to bleed by Joe Damato, tagged: linux monitoring systems kernel


If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

tl;dr

This post will describe how I stumbled upon a code path in the Linux kernel which allows external programs to be launched when a core dump is about to happen. I provide a link to a short and ugly Ruby script which captures a faulting process, runs gdb to get a backtrace (and other information), captures the core dump, and then generates a notification email.

I don’t care about faults because I use monit, god, etc

Chill.

Your processes may get automatically restarted when a fault occurs and you may even get an email letting you know your process died. Both of those things are useful, but it turns out that with just a tiny bit of extra work you can actually get very detailed emails showing a stack trace, register information, and a snapshot of the process’ files in /proc.

random walking the linux kernel

One day I was sitting around wondering how exactly the coredump code paths are wired. I cracked open the kernel source and started reading.

It wasn’t long until I saw this piece of code from exec.c1:

void do_coredump(long signr, int exit_code, struct pt_regs *regs)
{
  /* .... */
  lock_kernel();
  ispipe = format_corename(corename, signr);
  unlock_kernel();

   if (ispipe) {
   /* ... */

Hrm. ispipe? That seems interesting. I wonder what format_corename does and what ispipe means. Following through and reading format_corename2:

static int format_corename(char *corename, long signr)
{
	/* ... */

        const char *pat_ptr = core_pattern;
        int ispipe = (*pat_ptr == '|');

	/* ... */

        return ispipe;
}

In the above code, core_pattern (which can be set with a sysctl or via /proc/sys/kernel/core_pattern) to determine if the first character is a |. If so, format_corename returns 1. So | seems relatively important, but at this point I’m still unclear on what it actually means.

Scanning the rest of the code for do_coredump reveals something very interesting3 (this is more code from the function in the first code snippet above):

     /* ... */

     helper_argv = argv_split(GFP_KERNEL, corename+1, NULL);

     /* ... */

     retval = call_usermodehelper_fns(helper_argv[0], helper_argv,
                             NULL, UMH_WAIT_EXEC, umh_pipe_setup,
                             NULL, &cprm);

    /* ... */

WTF? call_usermodehelper_fns? umh_pipe_setup? This is looking pretty interesting. If you follow the code down a few layers, you end up at call_usermodehelper_exec which has the following very enlightening comment:

/**
 * call_usermodehelper_exec - start a usermode application
 *
 *  . . .
 *
 * Runs a user-space application.  The application is started
 * asynchronously if wait is not set, and runs as a child of keventd.
 * (ie. it runs with full root capabilities).
 */

what it all means

All together this is actually pretty fucking sick:

  • You can set /proc/sys/kernel/core_pattern to run a script when a process is going to dump core.
  • Your script is run before the process is killed.
  • A pipe is opened and attached to your script. The kernel writes the coredump to the pipe. Your script can read it and write it to storage.
  • Your script can attach GDB, get a backtrace, and gather other information to send a detailed email.

But the coolest part of all:

  • All of the files in /proc/[pid] for that process remain intact and can be inspected. You can check the open file descriptors, the process’s memory map, and much much more.

ruby codez to harness this amazing code path

I whipped up a pretty simple, ugly, ruby script. You can get it here. I set up my system to use it by:

% echo "|/path/to/core_helper.rb %p %s %u %g" > /proc/sys/kernel/core_pattern

Where:

  • %pPID of the dying process
  • %s – signal number causing the core dump
  • %u – real user id of the dying process
  • %g – real group id of the dyning process

Why didn’t you read the documentation instead?

This (as far as I can tell) little-known feature is documented at linux-kernel-source/Documentation/sysctl/kernel.txt under the “core_pattern” section. I didn’t read the documentation because (little known fact) I actually don’t know how to read. I found the code path randomly and it was much more fun an interesting to discover this little feature by diving into the code.

Conclusion

  • This could/should probably be a feature/plugin/whatever for god/monit/etc instead of a stand-alone script.
  • Reading code to discover features doesn’t scale very well, but it is a lot more fun than reading documentation all the time. Also, you learn stuff and reading code makes you a better programmer.

References

  1. http://lxr.linux.no/linux+v2.6.35.4/fs/exec.c#L1836
  2. http://lxr.linux.no/linux+v2.6.35.4/fs/exec.c#L1446
  3. http://lxr.linux.no/linux+v2.6.35.4/fs/exec.c#L1836


more »

Cluster Monitoring with Ganglia & Ruby »

Created at: 28.01.2010 19:16, source: igvita.com, tagged: Architecture monitoring cloud ganglia

A good monitoring solution can make or break an entire service - a well implemented one will enable you to forecast and plan ahead, as well as, quickly spot and debug problems when they arise. However, anyone that has worked with a cluster of machines will know that this is also a non-trivial problem. There are a number of options, both open-source and commercial, and they span a variety of use cases: alerting and notification (Nagios), intrusion detection (SNORT), performance monitoring (Ganglia, Cacti, Scout, etc), or even customized systems such as MySQL Monitor for in-depth database analysis.

Chances are, you will have to run a mix of all of the above to cover all the cases (ex: WikiMedia's Nagios & Ganglia) since there is no all-in-one solution and each has its tradeoffs. On that note, for performance monitoring, Ganglia is definitely an option to explore (it seems that I tried everything but Ganglia first, and I wish I did so much earlier). Originally developed at University of California, it is an open-source project designed from the ground up to be a distributed monitoring system for high-performance system such as clusters and grids - let's see what that means.

Ganglia's Distributed Architecture

Ganglia is powered by three independent components: gmond, gmetad and a PHP frontend. Due to how the system is architected, all three could either run on the same host, or more likely, be distributed between a number of different nodes. Gmond is the workhorse responsible for gathering user specified stats and sharing them over the network: a gmond daemon runs on every monitored node. It is designed to be fast, portable, with low memory footprint, and comes with a number of native monitoring modules (disk, memory, network, etc). Importantly, the gmond daemon never actually persists any data (memory only) to optimize for speed. But that's not all, because it can also receive data from other gmond's, allowing us to build arbitrary hierarchies of nodes - this is how and why Ganglia is capable of scaling to thousands of nodes.

Unlike other monitoring solutions such as Nagios or Cacti, all of the Ganglia metrics rely on data push - there is absolutely no polling involved. The gmond daemons are all responsible for periodically gathering and distributing their stats upstream. However, the gmond's do not persist data, and that is where the gmetad daemon comes in. This daemon is responsible for collecting data from an arbitrary number of gmond's, or even other gmetad daemons, persisting the metrics into correct RRD (round robin database) files, and then making this data available to the PHP frontend (or any other service that consumes RRD's).

Distributed Monitoring & Custom Metrics

Wikimedia's Ganglia setup is a good example to dissect: it consists of two "clusters" (Florida and Kennisnet), each most likely running their own gmetad node, which is then monitored by a central gmetad node which stitches them together. Each cluster, in turn, has its own hierarchy of gmond nodes: squids, apaches, databases, and so on. All together, it is monitoring over 350 nodes without breaking a sweat - not bad. Alas, one gotcha to be aware of upfront: run your gmetad's off a ram-disk to avoid the IO bottleneck associated with updating hundreds of RRD's.

With default configuration the gmond daemon will automatically monitor over 20 core metrics: load, network in and out, disk, memory and so on. However, it is also easily extensible via several mechanisms: custom monitoring modules, or straight up gmetric command line client. As of version 3.1.0, Ganglia now offers a simple Python API which allows us to create custom metrics modules which will be integrated directly into the gmond process. Hence the gmond process will call the code, gather the metrics and then distribute the data for us. A great working example is Gilad Raphaelli's MySQL extension, which gathers over 50 metrics about your database, and creating your own is also pretty straightforward.

However, if writing a python script is too heavy weight, Ganglia also provides the 'gmetric' executable which you can call right from the command line. Give the metric a name, type and a value, and you are off to the races. In other words, you can use bash, Ruby, or anything that will execute from the command line, in combination with gmetric, to funnel data into Ganglia:

# submit "variable_name" which is a string, with value of "hello", remove this var after 600s
gmetric -n variable_name -t string -v "hello" -d 600

# submit "int_var", which is an 8-bit int, with value of 20, remove this var after 60s
gmetric -n int_var -t uint8 -v 20 -d 60

Connecting Ganglia With Ruby

One of the key advantages of Ganglia over Cacti or similar services is that there is no per variable data source setup. Once a metric is pushed to a gmetad node, it automatically gets its own RRD file, and appears on your dashboard - no configuration required, just bring up a new service, push the data and you're rocking! Now, if you want to monitor a Ruby process, you have a couple of alternatives: write a python module, or shell out to gmetric. Neither is optimal because while the first method requires more (Python) code and extra polling, the system exec option can also be prohibitively expensive in terms of performance. Thankfully, Ganglia uses a simple UDP protocol with XDR data formatting which means that with a little reverse-engineering, we can talk to our gmond process directly from Ruby (by pretending that it is gmetric generating the packets):

> gmetric.rb

require 'gmetric'
 
# generate metric from Ruby and send it over UDP
Ganglia::GMetric.send("127.0.0.1", 8670, {
  :name => 'pageviews',
  :units => 'req/min',
  :type => 'uint8',     # unsigned 8-bit int 
  :value => 7000,       # value of metric
  :tmax => 60,          # maximum time in seconds between gmetric calls
  :dmax => 300          # lifetime in seconds of this metric
})
 

Install the gmetric gem, specify the hostname and port number of your gmond daemon and fire off your metrics via UDP directly into the monitoring system. This way, you can track latency, request rates, throughput, or any other application metric directly from Ruby and push it into Ganglia without any performance penalties.

Finally, once you have pushed a dozen new metrics, you can then also generate your own cumulative reports, which aggregate data from multiple sources (key database metrics, etc). Ganglia is an incredibly flexible platform, and the 3.1.x release has done a lot to improve the ability to customize and extend it to fit your applications. If you haven't already, definitely a tool to look into for your cloud applications.


more »

memprof: A Ruby level memory profiler »

Created at: 11.12.2009 14:59, source: time to bleed by Joe Damato, tagged: bugfix debugging linux monitoring ruby systems x86 debug garbage collection GC memory performance profiling system health x86_64


If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

What is memprof and why do I care?

memprof is a Ruby gem which supplies memory profiler functionality similar to bleak_house without patching the Ruby VM. You just install the gem, call a function or two, and off you go.

Where do I get it?

memprof is available on gemcutter, so you can just:

gem install memprof

Feel free to browse the source code at: http://github.com/ice799/memprof.

How do I use it?

Using memprof is simple. Before we look at some examples, let me explain more precisely what memprof is measuring.

memprof is measuring the number of objects created and not destroyed during a segment of Ruby code. The ideal use case for memprof is to show you where objects that do not get destroyed are being created:

  • Objects are created and not destroyed when you create new classes. This is a good thing.
  • Sometimes garbage objects sit around until garbage_collect has had a chance to run. These objects will go away.
  • Yet in other cases you might be holding a reference to a large chain of objects without knowing it. Until you remove this reference, the entire chain of objects will remain in memory taking up space.

memprof will show objects created in all cases listed above.

OK, now Let’s take a look at two examples and their output.

A simple program with an obvious memory “leak”:

require 'memprof'

@blah = Hash.new([])

Memprof.start
100.times {
  @blah[1] << "aaaaa"
}

1000.times {
   @blah[2] << "bbbbb"
}
Memprof.stats
Memprof.stop

This program creates 1100 objects which are not destroyed during the start and stop sections of the file because references are held for each object created.

Let's look at the output from memprof:

   1000 test.rb:11:String
    100 test.rb:7:String

In this example memprof shows the 1100 created, broken up by file, line number, and type.

Let's take a look at another example:

require 'memprof'
Memprof.start
require "stringio"
StringIO.new
Memprof.stats

This simple program is measuring the number of objects created when requiring stringio.

Let's take a look at the output:

    108 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:__node__
     14 test2.rb:3:String
      2 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Class
      1 test2.rb:4:StringIO
      1 test2.rb:4:String
      1 test2.rb:3:Array
      1 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Enumerable

This output shows an internal Ruby interpreter type __node__ was created (these represent code), as well as a few Strings and other objects. Some of these objects are just garbage objects which haven't had a chance to be recycled yet.

What if nudge the garbage_collector along a little bit just for our example? Let's add the following two lines of code to our previous example:

GC.start
Memprof.stats

We're now nudging the garbage collector and outputting memprof stats information again. This should show fewer objects, as the garbage collector will recycle some of the garbage objects:

    108 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:__node__
      2 test2.rb:3:String
      2 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Class
      1 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Enumerable

As you can see above, a few Strings and other objects went away after the garbage collector ran.

Which Rubies and systems are supported?

  • Only unstripped binaries are supported. To determine if your Ruby binary is stripped, simply run: file `which ruby`. If it is, consult your package manager's documentation. Most Linux distributions offer a package with an unstripped Ruby binary.
  • Only x86_64 is supported at this time. Hopefully, I'll have time to add support for i386/i686 in the immediate future.
  • Linux Ruby Enterprise Edition (1.8.6 and 1.8.7) is supported.
  • Linux MRI Ruby 1.8.6 and 1.8.7 built with --disable-shared are supported. Support for --enable-shared binaries is coming soon.
  • Snow Leopard support is experimental at this time.
  • Ruby 1.9 support coming soon.

How does it work?

If you've been reading my blog over the last week or so, you'd have noticed two previous blog posts (here and here) that describe some tricks I came up with for modifying a running binary image in memory.

memprof is a combination of all those tricks and other hacks to allow memory profiling in Ruby without the need for custom patches to the Ruby VM. You simply require the gem and off you go.

memprof works by inserting trampolines on object allocation and deallocation routines. It gathers metadata about the objects and outputs this information when the stats method is called.

What else is planned?

Myself, Jake Douglas, and Aman Gupta have lots of interesting ideas for new features. We don't want to ruin the surprise, but stay tuned. More cool stuff coming really soon :)

Thanks for reading and don't forget to subscribe (via RSS or e-mail) and follow me on twitter.


more »

Rewrite your Ruby VM at runtime to hot patch useful features »

Created at: 23.11.2009 14:59, source: time to bleed by Joe Damato, tagged: bugfix debugging linux monitoring python ruby systems testing x86 allocator debug garbage collection GC memory x86_64


If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

Some notes before the blood starts flowin’

  • CAUTION: What you are about to read is dangerous, non-portable, and (in most cases) stupid.
  • The code and article below refer only to the x86_64 architecture.
  • Grab some gauze. This is going to get ugly.

TLDR

This article shows off a Ruby gem which has the power to overwrite a Ruby binary in memory while it is running to allow your code to execute in place of internal VM functions. This is useful if you’d like to hook all object allocation functions to build a memory profiler.

This gem is on GitHub

Yes, it’s on GitHub: http://github.com/ice799/memprof.

I want a memory profiler for Ruby

This whole science experiment started during RubyConf when Aman and I began brainstorming ways to build a memory profiling tool for Ruby.

The big problem in our minds was that for most tools we’d have to include patches to the Ruby VM. That process is long and somewhat difficult, so I started thinking about ways to do this without modifying the Ruby source code itself.

The memory profiler is NOT DONE just yet. I thought that the hack I wrote to let us build something without modifying Ruby source code was interesting enough that it warranted a blog post. So let’s get rolling.

What is a trampoline?

Let’s pretend you have 2 functions: functionA() and functionB(). Let’s assume that functionA() calls functionB().

Now also imagine that you’d like to insert a piece of code to execute in between the call to functionB(). You can imagine inserting a piece of code that diverts execution elsewhere, creating a flow: functionA() –> functionC() –> functionB()

You can accomplish this by inserting a trampoline.

A trampoline is a piece of code that program execution jumps into and then bounces out of and on to somewhere else1.

This hack relies on the use of multiple trampolines. We’ll see why shortly.

Two different kinds of trampolines

There are two different kinds of trampolines that I considered while writing this hack, let’s take a closer look at both.

Caller-side trampoline

A caller-side trampoline works by overwriting the opcodes in the .text segment of the program in the calling function causing it to call a different function at runtime.

The big pros of this method are:

  • You aren’t overwriting any code, only the address operand of a callq instruction.
  • Since you are only changing an operand, you can hook any function. You don’t need to build custom trampolines for each function.

This method also has some big cons too:

  • You’ll need to scan the entire binary in memory and find and overwrite all address operands of callq. This is problematic because if you overwrite any false-positives you might break your application.
  • You have to deal with the implications of callq, which can be painful as we’ll see soon.

Callee-side trampoline

A callee-side trampoline works by overwriting the opcodes in the .text segment of the program in the called function, causing it to call another function immediately

The big pro of this method is:

  • You only need to overwrite code in one place and don’t need to worry about accidentally scribbling on bytes that you didn’t mean to.

this method has some big cons too:

  • You’ll need to carefully construct your trampoline code to only overwrite as little of the function as possible (or some how restore opcodes), especially if you expect the original function to work as expected later.
  • You’ll need to special case each trampoline you build for different optimization levels of the binary you are hooking into.

I went with a caller-side trampoline because I wanted to ensure that I can hook any function and not have to worry about different Ruby binaries causing problems when they are compiled with different optimization levels.

The stage 1 trampoline

To insert my trampolines I needed to insert some binary into the process and then overwrite callq instructions like this:

  41150b:       e8 cc 4e 02 00         callq  4363dc [rb_newobj]
  411510:       48 89 45 f8             ....

In the above code snippet, the byte e8 is the callq opcode and the bytes cc 4e 02 00 are the distance to rb_newobj from the address of the next instruction, 0×411510

All I need to do is change the 4 bytes following e8 to equal the displacement between the next instruction, 0×411510 in this case, and my trampoline.

Problem.

My first cut at this code lead me to an important realization: the callq instructions used expect a 32bit displacement from the function I am calling and not absolute addresses. But, the 64bit address space is very large. The displacement between the code for the Ruby binary that lives in the .text segment is so far away from my Ruby gem that the displacement cannot be represented with only 32bits.

So what now?

Well, luckily mmap has a flag MAP_32BIT which maps a page in the first 2GB of the address space. If I map some code there, it should be well within the range of values whose displacement I can represent in 32bits.

So, why not map a second trampoline to that page which can contains code that can call an absolute address?

My stage 1 trampoline code looks something like this:

  /* the struct below is just a sequence of bytes which represent the
    *  following bit of assembly code, including 3 nops for padding:
    *
    *  mov $address, %rbx
    *  callq *%rbx
    *  ret
    *  nop
    *  nop
    *  nop
    */
  struct tramp_tbl_entry ent = {
    .mov = {'\x48','\xbb'},
    .addr = (long long)&error_tramp,
    .callq = {'\xff','\xd3'},
    .ret = '\xc3',
    .pad =  {'\x90','\x90','\x90'},
  };

  tramp_table = mmap(NULL, 4096, PROT_WRITE|PROT_READ|PROT_EXEC,
                                   MAP_32BIT|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
  if (tramp_table != MAP_FAILED) {
    for (; i < 4096/sizeof(struct tramp_tbl_entry); i ++ ) {
      memcpy(tramp_table + i, &ent, sizeof(struct tramp_tbl_entry));
    }
  }
}

It mmaps a single page and writes a table of default trampolines (like a jump table) that all call an error trampoline by default. When a new trampoline is inserted, I just go to that entry in the table and insert the address that should be called.

To get around the displacement challenge described above, the addresses I insert into the stage 1 trampoline table are addresses for stage 2 trampolines.

The stage 2 trampoline

Setting up the stage 2 trampolines are pretty simple once the stage 1 trampoline table has been written to memory. All that needs to be done is update the address field in a free stage 1 trampoline to be the address of my stage 2 trampoline. These trampolines are written in C and live in my Ruby gem.

static void
insert_tramp(char *trampee, void *tramp) {
  void *trampee_addr = find_symbol(trampee);
  int entry = tramp_size;
  tramp_table[tramp_size].addr = (long long)tramp;
  tramp_size++;
  update_image(entry, trampee_addr);
}

An example of a stage 2 trampoline for rb_newobj might be:

static VALUE
newobj_tramp() {
  /* print the ruby source and line number where the allocation is occuring */
  printf("source = %s, line = %d\n", ruby_sourcefile, ruby_sourceline);

  /* call newobj like normal so the ruby app can continue */
  return rb_newobj();
}

Programatically rewriting the Ruby binary in memory

Overwriting the Ruby binary to cause my stage 1 trampolines to get hit is pretty simple, too. I can just scan the .text segment of the binary looking for bytes which look like callq instructions. Then, I can sanity check by reading the next 4 bytes which should be the displacement to the original function. Doing that sanity check should prevent false positives.

static void
update_image(int entry, void *trampee_addr) {
  char *byte = text_segment;
  size_t count = 0;
  int fn_addr = 0;
  void *aligned_addr = NULL;

 /* check each byte in the .text segment */
  for(; count < text_segment_len; count++) {

    /* if it looks like a callq instruction... */
    if (*byte == '\xe8') {

      /* the next 4 bytes SHOULD BE the original displacement */
      fn_addr = *(int *)(byte+1);

      /* do a sanity check to make sure the next few bytes are an accurate displacement.
        * this helps to eliminate false positives.
        */
      if (trampee_addr - (void *)(byte+5) == fn_addr) {
        aligned_addr = (void*)(((long)byte+1)&~(0xffff));

        /* mark the page in the .text segment as writable so it can be modified */
        mprotect(aligned_addr, (void *)byte+1 - aligned_addr + 10,
                       PROT_READ|PROT_WRITE|PROT_EXEC);

        /* calculate the new displacement and write it */
        *(int  *)(byte+1) = (uint32_t)((void *)(tramp_table + entry)
                                     - (void *)(byte + 5));

        /* disallow writing to this page of the .text segment again  */
        mprotect(aligned_addr, (((void *)byte+1) - aligned_addr) + 10,
                      PROT_READ|PROT_EXEC);
      }
    }
    byte++;
  }
}

Sample output

After requiring my ruby gem and running a test script which creates lots of objects, I see this output:

...
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
...

Showing the file name and line number for each object getting allocated. That should be a strong enough primitive to build a Ruby memory profiler without requiring end users to build a custom version of Ruby. It should also be possible to re-implement bleak_house by using this gem (and maybe another trick or two).

Awesome.

Conclusion

  • One step closer to building a memory profiler without requiring end users to find and use patches floating around the internet.
  • It is unclear whether cheap tricks like this are useful or harmful, but they are fun to write.
  • If you understand how your system works at an intimate level, nearly anything is possible. The work required to make it happen might be difficult though.

Thanks for reading and don't forget to subscribe (via RSS or e-mail) and follow me on twitter.

References

  1. http://en.wikipedia.org/wiki/Trampoline_(computers)


more »