Hot patching inlined functions with x86_64 asm metaprogramming »
Created at: 10.12.2009 14:59, source: time to bleed by Joe Damato, tagged: debugging linux ruby systems x86 debug garbage collection GC memory profiling x86_64

If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.
Disclaimer
The tricks, techniques, and ugly hacks in this article are PLATFORM SPECIFIC, DANGEROUS, and NOT PORTABLE.
This article will make reference to information in my previous article Rewrite your Ruby VM at runtime to hot patch useful features so be sure to check it out if you find yourself lost during this article.
Also, this might not qualify as metaprogramming in the traditional definition [1], but this article will show how to generate assembly at runtime that works well with the particular instructions generated for a binary. In other words, the assembly is constructed based on data collected from the binary at runtime. When I explained this to Aman, he called it assembly metaprogramming.
TLDR
This article expands on a previous article by showing how to hook functions which are inlined by the compiler. This technique can be applied to other binaries, but the binary in question is Ruby Enterprise Edition 1.8.7. The use case is to build a memory profiler without requiring patches to the VM, but just a Ruby gem.
It’s on GitHub
The memory profiler is NOT DONE, yet. It will be soon. Stay tuned.
The code described here is incorporated into a Ruby Gem which can be found on github: http://github.com/ice799/memprof specifically at: http://github.com/ice799/memprof/blob/master/ext/memprof.c#L202-318
Overview of the plan of attack
The plan of attack is relatively straightforward:
- Find the inlined code.
- Overwrite part of it to redirect to a stub.
- Call out to a handler from the stub.
- Make sure the return path is sane.
As simple as this seems, implementing these steps is actually a bit tricky.
Finding pieces of inlined code
Before finding pieces of inlined code, let’s first examine the C code we want to hook. I’m going to be showing how to hook the inline function add_freelist.
The code for add_freelist is short:
static inline void
add_freelist(p)
    RVALUE *p;
{
    if (p->as.free.flags != 0)
        p->as.free.flags = 0;
    if (p->as.free.next != freelist)
        p->as.free.next = freelist;
    freelist = p;
}
There is one really important feature of this code which stands out almost immediately. freelist has (at least) compilation unit scope. This is awesome because freelist serves as a marker when searching for assembly instructions to overwrite. Since the freelist has compilation unit scope, it’ll live at some static memory location.
If we find writes to this static memory location, we find our inline function code.
Let’s take a look at the instructions generated from this C code (unrelated instructions snipped out):
437f21: 48 c7 00 00 00 00 00    movq $0x0,(%rax)
. . . . .
437f2c: 48 8b 05 65 de 2d 00    mov 0x2dde65(%rip),%rax # 715d98 [freelist]
. . . . .
437f48: 48 89 05 49 de 2d 00    mov %rax,0x2dde49(%rip) # 715d98 [freelist]
The last instruction above updates freelist; it is the instruction generated for the C statement freelist = p;.
As you can see from the instruction, the destination is freelist. This makes it insanely easy to locate instances of this inline function: write a piece of C code which scans the binary image in memory for mov instructions whose destination is freelist, and every match is an inlined instance of add_freelist.
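To make the address arithmetic concrete, here is a minimal sketch of the check for a single candidate instruction. The function names are hypothetical and the ModRM test is an extra sanity check of my own; it assumes a little-endian host, which is always true on x86_64.

```c
#include <stdint.h>
#include <string.h>

/* Absolute address referenced by a RIP-relative operand.
 * insn_addr: address of the first byte of the instruction
 * insn_len:  total length of the instruction in bytes
 * disp:      signed 32-bit displacement encoded in the instruction */
static uint64_t rip_relative_target(uint64_t insn_addr, unsigned insn_len,
                                    int32_t disp)
{
    /* RIP points at the NEXT instruction when the operand is resolved */
    return insn_addr + insn_len + (int64_t)disp;
}

/* Return 1 if the 7 bytes at insn encode "mov %reg, disp32(%rip)" and the
 * operand resolves to target, otherwise 0. */
static int is_mov_reg_to(const unsigned char *insn, uint64_t insn_addr,
                         uint64_t target)
{
    int32_t disp;

    if (insn[0] != 0x48 && insn[0] != 0x4c)  /* REX.W, optionally with REX.R */
        return 0;
    if (insn[1] != 0x89)                     /* mov r64 -> r/m64 */
        return 0;
    if ((insn[2] & 0xc7) != 0x05)            /* ModRM mod=00, rm=101: RIP + disp32 */
        return 0;
    memcpy(&disp, insn + 3, sizeof(disp));   /* disp32 is stored little-endian */
    return rip_relative_target(insn_addr, 7, disp) == target;
}
```

Fed the bytes 48 89 05 49 de 2d 00 at address 0x437f48 from the listing above, the operand resolves to 0x437f48 + 7 + 0x2dde49 = 0x715d98, which is freelist.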
Why not insert a trampoline by overwriting that last mov instruction?
Overwriting with a jmp
The mov instruction above is 7 bytes wide. As long as the instruction we’re going to implant is 7 bytes or shorter, everything is good to go. Using a callq is out of the question because we can’t ensure the stack is 16-byte aligned as per the x86_64 ABI [2]. As it turns out, a jmp instruction that uses a 32-bit displacement from the instruction pointer only requires 5 bytes. We’ll be able to implant the instruction that’s needed, and even have room to spare.
I created a struct to encapsulate this short 7 byte trampoline. 5 bytes for the jmp, 2 bytes for NOPs. Let’s take a look:
struct tramp_inline tramp = {
    .jmp = {'\xe9'},
    .displacement = 0,
    .pad = {'\x90', '\x90'},
};
Let’s fill in the displacement later, after actually finding the instruction that’s going to get overwritten.
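The initializer implies a struct that is exactly 7 bytes with no compiler padding. The declaration below is my guess at such a layout, not necessarily the one in the gem: it relies on the GCC/Clang packed attribute, and make_stage1 is a hypothetical helper that takes plain integer addresses so the arithmetic is easy to follow.

```c
#include <stdint.h>

/* Assumed layout of the 7-byte stage 1 trampoline: jmp rel32 plus 2 NOPs.
 * Without the packed attribute the compiler would insert 3 bytes of
 * padding in front of the 4-byte displacement. */
struct tramp_inline {
    unsigned char jmp[1];
    uint32_t displacement;
    unsigned char pad[2];
} __attribute__((__packed__));

/* Build a trampoline to be written at address site that jumps to dest.
 * rel32 is measured from the end of the 5-byte jmp, i.e. from site + 5.
 * Unsigned wrap-around yields the right two's complement value for
 * backward jumps as well. */
static struct tramp_inline make_stage1(uint64_t site, uint64_t dest)
{
    struct tramp_inline t = {
        .jmp = {0xe9},
        .displacement = (uint32_t)(dest - (site + 5)),
        .pad = {0x90, 0x90},
    };
    return t;
}
```

For example, a trampoline written at 0x437f48 that should land on 0x438000 gets a displacement of 0x438000 - 0x437f4d = 0xb3.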
So, to find the instruction that’ll be overwritten, just look for a mov opcode and check that the destination is freelist:
/* make sure it is a mov instruction */
if (byte[1] == '\x89') {
    /* Read the REX byte to make sure it is a mov that we care about */
    if ( (byte[0] == '\x48') ||
         (byte[0] == '\x4c') ) {
        /* Grab the target of the mov. REMEMBER: in this case the target is
         * a 32bit displacement that gets added to RIP (where RIP is the address of
         * the next instruction).
         */
        mov_target = *(uint32_t *)(byte + 3);
        /* Sanity check. Ensure that the displacement from freelist to the next
         * instruction matches the mov_target. If so, we know this mov is
         * updating freelist.
         */
        if ( (freelist - (void *)(byte+7) ) == mov_target) {
At this point we’ve definitely found a mov instruction with freelist as the destination. Let’s calculate the displacement to the stage 2 trampoline for our jmp instruction and write the instruction into memory.
/* Setup the stage 1 trampoline. Calculate the displacement to
 * the stage 2 trampoline from the next instruction.
 *
 * REMEMBER!!!! The next instruction will be NOP after our stage 1
 * trampoline is written. This is 5 bytes into the structure, even
 * though the original instruction we overwrote was 7 bytes.
 */
tramp.displacement = (uint32_t)(destination - (void *)(byte+5));
/* Figure out what page the stage 1 tramp is gonna be written to, mark
 * it WRITE, write the trampoline in, and then remove WRITE permission.
 */
aligned_addr = page_align(byte);
mprotect(aligned_addr, (void *)byte - aligned_addr + 10,
         PROT_READ|PROT_WRITE|PROT_EXEC);
memcpy(byte, &tramp, sizeof(struct tramp_inline));
mprotect(aligned_addr, (void *)byte - aligned_addr + 10,
         PROT_READ|PROT_EXEC);
Cool, all that’s left is to build the stage 2 trampoline which will set everything up for the C level handler.
An assembly stub to set the stage for our C handler
So, what does the assembly need to do to call the C handler? Quite a bit actually so let’s map it out, step by step:
- Replicate the instruction which was overwritten so that the object is actually added to the freelist.
- Save the value of the rdi register. This register is where the first argument to a function lives and will store the obj that was added to the freelist for the C handler to do analysis on.
- Load the object being added to the freelist into rdi.
- Save the value of rbx so that we can use the register as an operand for an absolute indirect callq instruction.
- Save rbp and rsp to allow a way to undo the stack alignment later.
- Align the stack to a 16-byte boundary to comply with the x86_64 ABI.
- Move the address of the handler into rbx.
- Call the handler through rbx.
- Restore rbp, rsp, rdi, rbx.
- Jump back to the instruction after the instruction which was overwritten.
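The alignment step is the least obvious item in the list, so here is the same operation as plain integer arithmetic. It is only an illustration with a hypothetical function name; the stub itself does this with a single and instruction on rsp.

```c
#include <stdint.h>

/* Round a stack pointer value down to a 16-byte boundary, mirroring
 * "and $0xfffffffffffffff0, %rsp" (encoded as 48 83 e4 f0). The stack
 * grows toward lower addresses, so rounding down only skips unused
 * space and never lands on data that is already live. */
static uint64_t align_stack_16(uint64_t rsp)
{
    return rsp & ~(uint64_t)0xf;
}
```

Rounding throws away the low four bits of the original value, which is why rsp has to be saved (in rbp) before the alignment and restored afterwards.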
To accomplish this let’s build out a structure with as much set up as possible and fill in the displacement fields later. This “base” struct looks like this:
struct inline_tramp_tbl_entry inline_ent = {
    .rex = {'\x48'},
    .mov = {'\x89'},
    .src_reg = {'\x05'},
    .mov_displacement = 0,
    .frame = {
        .push_rdi = {'\x57'},
        .mov_rdi = {'\x48', '\x8b', '\x3d'},
        .rdi_source_displacement = 0,
        .push_rbx = {'\x53'},
        .push_rbp = {'\x55'},
        .save_rsp = {'\x48', '\x89', '\xe5'},
        .align_rsp = {'\x48', '\x83', '\xe4', '\xf0'},
        .mov = {'\x48', '\xbb'},
        .addr = error_tramp,
        .callq = {'\xff', '\xd3'},
        .leave = {'\xc9'},
        .rbx_restore = {'\x5b'},
        .rdi_restore = {'\x5f'},
    },
    .jmp = {'\xe9'},
    .jmp_displacement = 0,
};
So, what’s left to do:
- Copy the REX and source register bytes of the instruction which was overwritten to replicate it.
- Calculate the displacement to freelist to fully generate the overwritten mov.
- Calculate the displacement to freelist so that it can be stored in rdi as an argument to the C handler.
- Fill in the absolute address for the handler.
- Calculate the displacement to the instruction after the stage 1 trampoline in order to jmp back to resume execution as normal.
Doing that is relatively straightforward. Let’s take a look at the C snippets that make this happen:
/* Before the stage 1 trampoline gets written, we need to generate
 * the code for the stage 2 trampoline. Let's copy over the REX byte
 * and the byte which mentions the source register into the stage 2
 * trampoline.
 */
inl_tramp_st2 = inline_tramp_table + entry;
inl_tramp_st2->rex[0] = byte[0];
inl_tramp_st2->src_reg[0] = byte[2];
. . . . .
/* Finish setting up the stage 2 trampoline. */
/* calculate the displacement to freelist from the next instruction.
 *
 * This is used to replicate the original instruction we overwrote.
 */
inl_tramp_st2->mov_displacement = freelist - (void *)&(inl_tramp_st2->frame);
/* fill in the displacement to freelist from the next instruction.
 *
 * This is to arrange for the new value in freelist to be in %rdi, and as such
 * be the first argument to the C handler. As per the amd64 ABI.
 */
inl_tramp_st2->frame.rdi_source_displacement = freelist -
                                      (void *)&(inl_tramp_st2->frame.push_rbx);
/* jmp back to the instruction after stage 1 trampoline was inserted
 *
 * This can be 5 or 7, it doesn't matter. If it's 5, we'll hit our 2
 * NOPS. If it's 7, we'll land directly on the next instruction.
 */
inl_tramp_st2->jmp_displacement = (uint32_t)((void *)(byte + 7) -
                                   (void *)(inline_tramp_table + entry + 1));
/* write the address of our C level trampoline in to the structure */
inl_tramp_st2->frame.addr = freelist_tramp;
Awesome.
We’ve successfully patched the binary in memory, inserted an assembly stub which was generated at runtime, called a hook function, and ensured that execution can resume normally.
So, what’s the status on that memory profiler?
Almost done, stay tuned for more updates coming SOON.
Conclusion
- Hackery like this is unmaintainable, unstable, stupid, but also fun to work on and think about.
- Being able to hook add_freelist like this provides the last tool needed to implement a version of bleak_house (a Ruby memory profiler) without patching the Ruby VM.
- x86_64 instruction set is a painful instruction set.
- Use the GNU assembler (gas) instead of trying to generate opcodes by reading the Intel instruction set PDFs if you value your sanity.
Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.
References
Debugging Ruby: Understanding and Troubleshooting the VM and your Application »
Created at: 03.12.2009 05:30, source: time to bleed by Joe Damato, tagged: bugfix debugging linux ruby systems x86 debug ltrace performance profiling scaling strace syscall system health x86_64
Download the PDF here.
Rewrite your Ruby VM at runtime to hot patch useful features »
Created at: 23.11.2009 14:59, source: time to bleed by Joe Damato, tagged: bugfix debugging linux monitoring python ruby systems testing x86 allocator debug garbage collection GC memory x86_64

Some notes before the blood starts flowin’
- CAUTION: What you are about to read is dangerous, non-portable, and (in most cases) stupid.
- The code and article below refer only to the x86_64 architecture.
- Grab some gauze. This is going to get ugly.
TLDR
This article shows off a Ruby gem which has the power to overwrite a Ruby binary in memory while it is running to allow your code to execute in place of internal VM functions. This is useful if you’d like to hook all object allocation functions to build a memory profiler.
This gem is on GitHub
Yes, it’s on GitHub: http://github.com/ice799/memprof.
I want a memory profiler for Ruby
This whole science experiment started during RubyConf when Aman and I began brainstorming ways to build a memory profiling tool for Ruby.
The big problem in our minds was that for most tools we’d have to include patches to the Ruby VM. That process is long and somewhat difficult, so I started thinking about ways to do this without modifying the Ruby source code itself.
The memory profiler is NOT DONE just yet. I thought that the hack I wrote to let us build something without modifying Ruby source code was interesting enough that it warranted a blog post. So let’s get rolling.
What is a trampoline?
Let’s pretend you have 2 functions: functionA() and functionB(). Let’s assume that functionA() calls functionB().
Now also imagine that you’d like to insert a piece of code that executes before functionB() is called. You can imagine inserting a piece of code that diverts execution elsewhere, creating a flow: functionA() –> functionC() –> functionB()
You can accomplish this by inserting a trampoline.
A trampoline is a piece of code that program execution jumps into and then bounces out of and on to somewhere else [1].
This hack relies on the use of multiple trampolines. We’ll see why shortly.
Two different kinds of trampolines
There are two different kinds of trampolines that I considered while writing this hack. Let’s take a closer look at both.
Caller-side trampoline
A caller-side trampoline works by overwriting the opcodes in the .text segment of the program in the calling function causing it to call a different function at runtime.
The big pros of this method are:
- You aren’t overwriting any code, only the address operand of a callq instruction.
- Since you are only changing an operand, you can hook any function. You don’t need to build custom trampolines for each function.
This method has some big cons, too:
- You’ll need to scan the entire binary in memory and find and overwrite all address operands of callq. This is problematic because if you overwrite any false-positives you might break your application.
- You have to deal with the implications of callq, which can be painful as we’ll see soon.
Callee-side trampoline
A callee-side trampoline works by overwriting the opcodes in the .text segment of the program in the called function, causing it to call another function immediately.
The big pro of this method is:
- You only need to overwrite code in one place and don’t need to worry about accidentally scribbling on bytes that you didn’t mean to.
This method has some big cons, too:
- You’ll need to carefully construct your trampoline code to overwrite as little of the function as possible (or somehow restore opcodes), especially if you expect the original function to work as expected later.
- You’ll need to special case each trampoline you build for different optimization levels of the binary you are hooking into.
I went with a caller-side trampoline because I wanted to ensure that I can hook any function and not have to worry about different Ruby binaries causing problems when they are compiled with different optimization levels.
The stage 1 trampoline
To insert my trampolines I needed to insert some binary into the process and then overwrite callq instructions like this:
41150b: e8 cc 4e 02 00    callq 4363dc [rb_newobj]
411510: 48 89 45 f8       ....
In the above code snippet, the byte e8 is the callq opcode and the bytes cc 4e 02 00 are the distance to rb_newobj from the address of the next instruction, 0x411510.
All I need to do is change the 4 bytes following e8 to equal the displacement between the next instruction, 0x411510 in this case, and my trampoline.
Problem.
My first cut at this code led me to an important realization: the callq instructions used expect a 32-bit displacement to the function being called, not an absolute address. But the 64-bit address space is very large. My Ruby gem is loaded so far away from the Ruby binary’s code in the .text segment that the displacement between the two cannot be represented in only 32 bits.
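A short sketch makes the range limit concrete. call_rel32 is a hypothetical helper that computes the operand for a callq and reports whether it fits; the far-away address used in the note after the code is made up for illustration, chosen only because shared libraries on x86_64 Linux are typically mapped very high in the address space.

```c
#include <stdint.h>

/* Compute the rel32 operand for a 5-byte "callq" located at call_addr
 * that should reach target. Returns 1 and stores the displacement if it
 * fits in a signed 32-bit value, otherwise returns 0 and leaves *disp
 * untouched. */
static int call_rel32(uint64_t call_addr, uint64_t target, int32_t *disp)
{
    /* the displacement is relative to the instruction after the call */
    int64_t delta = (int64_t)(target - (call_addr + 5));

    if (delta < INT32_MIN || delta > INT32_MAX)
        return 0;
    *disp = (int32_t)delta;
    return 1;
}
```

For the call site at 0x41150b and rb_newobj at 0x4363dc this yields 0x24ecc, which is exactly the cc 4e 02 00 seen in the disassembly. A target such as 0x7f0000000000 is rejected because the distance needs far more than 32 bits.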
So what now?
Well, luckily mmap has a flag MAP_32BIT which maps a page in the first 2GB of the address space. If I map some code there, it should be well within the range of values whose displacement I can represent in 32bits.
So, why not map a second trampoline to that page which contains code that can call an absolute address?
My stage 1 trampoline code looks something like this:
    /* the struct below is just a sequence of bytes which represent the
     * following bit of assembly code, including 3 nops for padding:
     *
     * mov $address, %rbx
     * callq *%rbx
     * ret
     * nop
     * nop
     * nop
     */
    struct tramp_tbl_entry ent = {
        .mov = {'\x48','\xbb'},
        .addr = (long long)&error_tramp,
        .callq = {'\xff','\xd3'},
        .ret = '\xc3',
        .pad = {'\x90','\x90','\x90'},
    };
    tramp_table = mmap(NULL, 4096, PROT_WRITE|PROT_READ|PROT_EXEC,
                       MAP_32BIT|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
    if (tramp_table != MAP_FAILED) {
        for (; i < 4096/sizeof(struct tramp_tbl_entry); i ++ ) {
            memcpy(tramp_table + i, &ent, sizeof(struct tramp_tbl_entry));
        }
    }
}
It mmaps a single page and writes a table of default trampolines (like a jump table) that all call an error trampoline by default. When a new trampoline is inserted, I just go to that entry in the table and insert the address that should be called.
To get around the displacement challenge described above, the addresses I insert into the stage 1 trampoline table are addresses for stage 2 trampolines.
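For reference, this is roughly what the table entry has to look like in memory. The declaration is my assumption of a layout that matches the initializer (it uses the GCC/Clang packed attribute and uint64_t where the snippet above casts to long long), and tramp_entry_init is a hypothetical helper.

```c
#include <stdint.h>
#include <string.h>

/* Assumed packed layout: 2 + 8 + 2 + 1 + 3 = 16 bytes per entry, so a
 * 4096-byte page holds 256 entries. */
struct tramp_tbl_entry {
    unsigned char mov[2];    /* 48 bb: movabs $imm64, %rbx */
    uint64_t addr;           /* imm64: absolute address to call */
    unsigned char callq[2];  /* ff d3: callq *%rbx */
    unsigned char ret;       /* c3:    ret */
    unsigned char pad[3];    /* 90 90 90: nops */
} __attribute__((__packed__));

/* Fill one entry so that it calls target and then returns. */
static void tramp_entry_init(struct tramp_tbl_entry *ent, uint64_t target)
{
    ent->mov[0] = 0x48;
    ent->mov[1] = 0xbb;
    ent->addr = target;
    ent->callq[0] = 0xff;
    ent->callq[1] = 0xd3;
    ent->ret = 0xc3;
    memset(ent->pad, 0x90, sizeof(ent->pad));
}
```

The three nops are what bring the entry up to 16 bytes, so the entries tile the page exactly. Retargeting an entry later only means rewriting its 8-byte addr field.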
The stage 2 trampoline
Setting up the stage 2 trampolines is pretty simple once the stage 1 trampoline table has been written to memory. All that needs to be done is to update the address field in a free stage 1 trampoline to be the address of my stage 2 trampoline. These trampolines are written in C and live in my Ruby gem.
static void
insert_tramp(char *trampee, void *tramp) {
    void *trampee_addr = find_symbol(trampee);
    int entry = tramp_size;
    tramp_table[tramp_size].addr = (long long)tramp;
    tramp_size++;
    update_image(entry, trampee_addr);
}
An example of a stage 2 trampoline for rb_newobj might be:
static VALUE
newobj_tramp() {
    /* print the ruby source and line number where the allocation is occurring */
    printf("source = %s, line = %d\n", ruby_sourcefile, ruby_sourceline);
    /* call newobj like normal so the ruby app can continue */
    return rb_newobj();
}
Programmatically rewriting the Ruby binary in memory
Overwriting the Ruby binary to cause my stage 1 trampolines to get hit is pretty simple, too. I can just scan the .text segment of the binary looking for bytes which look like callq instructions. Then, I can sanity check by reading the next 4 bytes which should be the displacement to the original function. Doing that sanity check should prevent false positives.
static void
update_image(int entry, void *trampee_addr) {
    char *byte = text_segment;
    size_t count = 0;
    int fn_addr = 0;
    void *aligned_addr = NULL;
    /* check each byte in the .text segment */
    for(; count < text_segment_len; count++) {
        /* if it looks like a callq instruction... */
        if (*byte == '\xe8') {
            /* the next 4 bytes SHOULD BE the original displacement */
            fn_addr = *(int *)(byte+1);
            /* do a sanity check to make sure the next few bytes are an accurate displacement.
             * this helps to eliminate false positives.
             */
            if (trampee_addr - (void *)(byte+5) == fn_addr) {
                aligned_addr = (void*)(((long)byte+1)&~(0xffff));
                /* mark the page in the .text segment as writable so it can be modified */
                mprotect(aligned_addr, (void *)byte+1 - aligned_addr + 10,
                         PROT_READ|PROT_WRITE|PROT_EXEC);
                /* calculate the new displacement and write it */
                *(int *)(byte+1) = (uint32_t)((void *)(tramp_table + entry)
                                              - (void *)(byte + 5));
                /* disallow writing to this page of the .text segment again */
                mprotect(aligned_addr, (((void *)byte+1) - aligned_addr) + 10,
                         PROT_READ|PROT_EXEC);
            }
        }
        byte++;
    }
}
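The same scan can be exercised against an ordinary byte buffer, which is a handy way to test the displacement math without touching a live process. This is a sketch with a hypothetical function name, not the gem’s code: it takes the address the buffer would be mapped at as a parameter, skips the mprotect calls, and assumes new_target is within 32-bit range of every call site.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scan len bytes of code that would be mapped at address base and retarget
 * every "callq rel32" (opcode e8) that currently reaches old_target so that
 * it reaches new_target instead. Returns the number of call sites patched. */
static size_t retarget_calls(unsigned char *text, size_t len, uint64_t base,
                             uint64_t old_target, uint64_t new_target)
{
    size_t patched = 0;

    /* stop 5 bytes before the end so the rel32 read stays inside the buffer */
    for (size_t i = 0; i + 5 <= len; i++) {
        int32_t disp;
        uint64_t next = base + i + 5;  /* address of the following instruction */

        if (text[i] != 0xe8)
            continue;
        memcpy(&disp, text + i + 1, sizeof(disp));
        if (next + (int64_t)disp != old_target)
            continue;                  /* an e8 byte that is not a call to old_target */
        disp = (int32_t)(new_target - next);
        memcpy(text + i + 1, &disp, sizeof(disp));
        patched++;
        i += 4;                        /* skip the operand that was just written */
    }
    return patched;
}
```

Run over the bytes e8 cc 4e 02 00 at base 0x41150b with old_target 0x4363dc and new_target 0x412000, the operand becomes 0x412000 - 0x411510 = 0xaf0, stored as f0 0a 00 00.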
Sample output
After requiring my ruby gem and running a test script which creates lots of objects, I see this output:
...
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
...
This shows the file name and line number for each object getting allocated. That should be a strong enough primitive to build a Ruby memory profiler without requiring end users to build a custom version of Ruby. It should also be possible to re-implement bleak_house by using this gem (and maybe another trick or two).
Awesome.
Conclusion
- One step closer to building a memory profiler without requiring end users to find and use patches floating around the internet.
- It is unclear whether cheap tricks like this are useful or harmful, but they are fun to write.
- If you understand how your system works at an intimate level, nearly anything is possible. The work required to make it happen might be difficult though.
References
Defeating the Matasano C++ Challenge with ASLR enabled »
Created at: 16.10.2009 14:59, source: time to bleed by Joe Damato, tagged: bugfix debugging linux security systems testing x86 memory vulnerability x86_64

Important note
I am NOT a security researcher (I kinda want to be though). As such, there are probably way better ways to do everything in this article. This article is just illustrating my thought process when cracking this challenge.
The Challenge
The Matasano Security blog recently posted an article titled A C++ Challenge [1] which included a particularly ugly piece of C++ code that has a security vulnerability. The challenge is for the reader to find the vulnerability, use it to execute arbitrary code, and submit the data to Matasano.
Sounds easy enough, let’s do this! cue hacking music
Making it harder
Recent Linux kernels have a feature called Address Space Layout Randomization (ASLR) which can be set in /proc/sys/kernel/randomize_va_space. ASLR is a security feature which randomizes the start address of various parts of a process image. Doing this makes exploiting a security bug more difficult because the exploit cannot use any hard coded addresses.
The options you can set are:
- 0 – ASLR off
- 1 – Randomize the addresses of the stack, mmap area, and VDSO page. This is the default.
- 2 – Everything in option 1, but also randomize the brk area so the heap is randomized.
Just for fun I decided to set it to 2 to make exploiting the challenge more difficult.
Got the code, but now what?
I decided to start attacking this problem by looking for a few common errors, in this order:
- strcpy()/strncpy() bugs – No calls
- memcpy() bugs – A few calls
- Off by one bugs – None obvious
It turned out from a quick look that all calls to memcpy() included sane, hard-coded values. So, it had to be something more complex.
Digging deeper – finding input streams the user can control
Next, I decided to actually read the code and see what it was doing at a high level and what inputs could be controlled. Turns out that the program reads data from a file and uses the data from the file to determine how many objects to allocate.
Obviously, this portion of the code caught my interest so let’s take a quick look:
/* ... */
fd.read(file_in_mem, MAX_FILE_SIZE-1);
/* ... */
struct _stream_hdr *s = (struct _stream_hdr *) file_in_mem;
if(s->num_of_streams >= INT_MAX / (int)sizeof(int)) {
    safe_count = MAX_STREAMS;
} else {
    safe_count = s->num_of_streams;
}
Obj *o = new Obj[safe_count];
OK, so clearly that if statement is suspect. At the very least it doesn’t check for negative values, so you could end up with safe_count = -1 which might do something interesting when passed to the new operator. Moreover, it appears this if statement will allow values as large as 536870910 ([INT_MAX / sizeof(int)] – 1).
Maybe the exploit has something to do with values this if statement is allowing through?
A closer look at the integer overflow in new
Let’s use GDB to take a closer look at what the compiler does before calling new. I’ve added a few comments in line to explain the assembly code:
mov %edx,%eax               ; %edx and %eax store s->num_of_streams
add %eax,%eax               ; add %eax to itself (s->num_of_streams * 2)
add %edx,%eax               ; add s->num_of_streams + %eax (s->num_of_streams*3)
shl $0x2,%eax               ; multiply (s->num_of_streams * 3) by 4 (s->num_of_streams * 12)
mov %eax,(%esp)             ; move it into position to pass to new
call 0x8048a7c <_Znaj@plt>  ; call new
The compiler has generated code to calculate s->num_of_streams * sizeof(Obj), where sizeof(Obj) is 12 bytes. For large values of s->num_of_streams, multiplying by 12 causes an integer overflow and the value passed to new will actually be less than what was intended.
For my exploit, I ended up using the value 357913943. This value causes an overflow because 357913943 * 12 = 4294967316, which exceeds 2^32 (4294967296) by 20. The 32-bit result wraps around, so the value passed to new is 20, which is, of course, significantly less than what we actually wanted to allocate. Other people have written about integer overflow in new in other compilers [2] before.
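The arithmetic is easy to reproduce outside the challenge binary. The sketch below uses hypothetical helper names, assumes a 32-bit int (true on i386 and x86_64 Linux), and models the i386 size_t with uint32_t so that it behaves the same when built on a 64-bit machine.

```c
#include <limits.h>
#include <stdint.h>

#define OBJ_SIZE 12u  /* sizeof(Obj) on the i386 build, per the disassembly */

/* Mirror of the bounds check in the challenge: returns 1 if count is let
 * through unchanged, 0 if it would be replaced by MAX_STREAMS. */
static int passes_check(int count)
{
    return !(count >= INT_MAX / (int)sizeof(int));
}

/* Byte count handed to operator new[] when size_t is 32 bits wide.
 * Unsigned multiplication is performed modulo 2^32. */
static uint32_t alloc_bytes_32(uint32_t count)
{
    return count * OBJ_SIZE;
}
```

passes_check(357913943) returns 1, so the value sails through the if statement, and alloc_bytes_32(357913943) returns 20. Negative counts such as -1 pass the check as well, which is the other hole in that if statement.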
Let’s see how this can be used to cause arbitrary code to execute. Remember, for arbitrary code execution to occur there must be a way to cause the target program to write some data to a memory address that can be controlled.
Find the (possible) hand-off(s) to arbitrary code
To find any hand-off locations, I looked for places where memory writes were occurring in the program. I found a few memory writes:
- 2 calls to memset()
- 2 calls to memcpy()
- parse_stream() of class Obj
Unfortunately (from the attacker’s perspective) the calls to memcpy() and memset() looked pretty sane. The parse_stream() function caught my interest, though.
Take a look:
class Obj {
public:
    int parse_stream(int t, char *stream)
    {
        type = t;
        // ... do something with stream here ...
        return 0;
    }
    int length;
    int type;
    /* ... */
REMEMBER: In C++, member functions of classes have a sekrit parameter which is a pointer to the object the function is being called on. In the function itself, this parameter is accessed using this. So the line writing to the type variable is actually doing this->type = t; where this is supplied to the function sektrily by the compiler.
This is important because this piece of code could be our hand-off! We need to find a way to control the value of this so we can cause a memory write to a location of our choice.
Controlling this to cause arbitrary code to execute
Take a look at an important piece of code in the challenge:
struct imetad {
    int msg_length;
    int (*callback)(int, struct imetad *);
    /* ... */
Nice! The callback field of struct imetad is offset by 4 bytes into the structure. The type field of class Obj is also offset by 4 bytes. See where I’m going?
If we can control the this pointer to point at the struct imetad on the heap when parse_stream is called, it will overwrite the callback pointer. We’ll then be able to set the pointer to any address we want and hand-off execution to arbitrary code!
But how can we manipulate this?
Take a look at this piece of code that calls callback:
o[i].parse_stream(dword, stream_temp);
imd->callback(o[i].type, imd);
Since it is possible to overflow new and allocate fewer objects than safe_count is counting, that means that for some values of i, o[i] will be pointing at data that isn’t actually an Obj object, but just other data on the heap. In fact, when i = 2, o[i] will be pointing at the struct imetad object on the heap. The call to parse_stream will pass in a corrupted this pointer that points at struct imetad. The write to type will actually overwrite callback since they are both offset equal amounts into their respective structures.
And with that, we’ve successfully exploited the challenge causing arbitrary code to execute.
Let’s now figure out how to beat ASLR!
How to defeat address space layout randomization
I did NOT invent this technique, but I read about it and thought it was cool. You can read a more verbose explanation of this technique here. The idea behind the technique is pretty simple:
- When you call exec, the PID remains the same, but the image of the process in memory is changed.
- The kernel uses the PID and the number of jiffies (jiffies is a fine-grained time measurement in the kernel) to pull data from the entropy pool.
- If you can run a program which records stack, heap, and other addresses and then quickly call exec to start the vulnerable program, you can end up with the same memory layout.
My exploit program is actually a wrapper which records an approximate location of the heap (by just calling malloc()), generates the exploit file, and then executes the challenge binary.
Take a look at the relevant pieces of my exploit to get an idea of how it works:
/* ... */
/* do a malloc to get an idea of where the heap lives */
void *dummy = malloc(10);
/* ... */
unsigned int shell_addr = reinterpret_void_ptr_as_uint(dummy);
/*
 * XXX TODO FIXME - on my platform, execl'ing from here to the challenge binary
 * incurs a constant offset of 0x3160, probably for changes in the environment
 * (libs linked for c++ and whatnot).
 */
shell_addr += 0x3160;
/*
 * a guess as to how far off the heap the shellcode lives.
 *
 * luckily we have a large NOP sled, so we should only fail when we miss
 * the current entropy cycle (see below).
 */
shell_addr += 700;
/* ... build exploit file in memory ... */
/* copy in our best guess as to the address of the shellcode, pray NOPs
 * take care of the rest! */
memcpy(entire_file+88, &shell_addr, sizeof(shell_addr));
/* ... write exploit out to disk ... */
/* launch program with the generated exploit file!
 *
 * calling execl here inherits the PID of this process, and IF we get lucky
 * ~85%+ of the time, we'll execute before the next entropy cycle and hit
 * the shellcode, even with ASLR=2.
 */
execl("./cpp_challenge", "cpp_challenge", "exploit", (char *)0);
My exploit for the C++ challenge
My exploit comes with the following caveats:
- i386 system
- The challenge binary is called “cpp_challenge” and lives in the same directory as the exploit binary.
- The exploit binary can write to the directory and create a file called “exploit” which will be handed off to “cpp_challenge”
Get the full code of my exploit here.
Results
Results on my i386 Ubuntu 8.04 VM running in VMware Fusion, for each level of randomize_va_space:
- 0 – 100% exploit hit rate
- 1 – 100% exploit hit rate
- 2 – ~85% exploit hit rate. Sometimes, my exploit code falls out of the time window and the address map changes before the challenge binary is run.
I could probably boost the hit rate for 2 a bit, but then I’d probably re-write the entire exploit in assembly to make it run as fast as possible. I didn’t think there was really a point to going to such an extreme, though. So, an 85% hit rate is good enough.
Conclusion
- Security challenges are fun.
- More emphasis and more freely available information on secure coding would be very useful.
- Like it or not, developers need to be security conscious when writing code in C and C++.
- As C and C++ change, developers need to carefully consider security implications of new features.
References
Extending ltrace to make your Ruby/Python/Perl/PHP apps faster »
Created at: 08.10.2009 14:59, source: time to bleed by Joe Damato, tagged: debugging linux monitoring python ruby systems x86 debug ltrace performance profiling strace system health x86_64

A few days ago, Aman (@tmm1) was complaining to me about a slow running process:
I want to see what is happening in userland and trace calls to extensions. Why doesn’t ltrace work for Ruby processes? I want to figure out which MySQL queries are causing my app to be slow.
It turns out that ltrace did not have support for libraries loaded with libdl. This is a problem for languages like Ruby, Python, PHP, Perl, and others because in many cases extensions, libraries, and plugins for these languages are loaded by the VM using libdl. This means that ltrace is somewhat useless for tracking down performance issues in dynamic languages.
A couple late nights of hacking and I managed to finagle libdl support in ltrace. Since most people probably don’t care about the technical details of how it was implemented, I’ll start with showing how to use the patch I wrote and what sort of output you can expect. This patch has made tracking down slow queries (among other things) really easy and I hope others will find this useful.
How to use ltrace:
After you’ve applied my patch (below) and rebuilt ltrace, let’s say you’d like to trace MySQL queries and have ltrace tell you when the query was executed and how long it took. There are two steps:
- Give ltrace info so it can pretty print: echo "int mysql_real_query(addr,string,ulong);" > custom.conf
- Tell ltrace you want to hear about mysql_real_query: ltrace -F custom.conf -ttTgx mysql_real_query -p <pid>
Here’s what those arguments mean:
- -F use a custom config file when pretty-printing (default: /etc/ltrace.conf, add your stuff there to avoid -F if you wish).
- -tt print the time (including microseconds) when the call was executed
- -T time the call and print how long it took
- -x tells ltrace the name of the function you care about
- -g avoid placing breakpoints on all library calls except the ones you specify with -x. This is optional, but it makes ltrace produce much less output, which is a lot easier to read if you only care about one function.
PHP
Test script
<?php
mysql_connect("localhost", "root");
while (true) {
    mysql_query("SELECT sleep(1)");
}
ltrace output
22:31:50.507523 zend_hash_find(0x025dc3a0, "mysql_query", 12) = 0 <0.000029>
22:31:50.507781 mysql_real_query(0x027bc540, "SELECT sleep(1)", 15) = 0 <1.000600>
22:31:51.508531 zend_hash_find(0x025dc3a0, "mysql_query", 12) = 0 <0.000025>
22:31:51.508675 mysql_real_query(0x027bc540, "SELECT sleep(1)", 15) = 0 <1.000926>
ltrace command
ltrace -ttTg -x zend_hash_find -x mysql_real_query -p [pid of script above]
Python
Test script
import MySQLdb
db = MySQLdb.connect("localhost", "root", "", "test")
cursor = db.cursor()
sql = """SELECT sleep(1)"""
while True:
    cursor.execute(sql)
    data = cursor.fetchone()
db.close()
ltrace output
22:24:39.104786 PyEval_SaveThread() = 0x21222e0 <0.000029>
22:24:39.105020 PyEval_SaveThread() = 0x21222e0 <0.000024>
22:24:39.105210 PyEval_SaveThread() = 0x21222e0 <0.000024>
22:24:39.105303 mysql_real_query(0x021d01d0, "SELECT sleep(1)", 15) = 0 <1.002083>
22:24:40.107553 PyEval_SaveThread() = 0x21222e0 <0.000026>
22:24:40.107713 PyEval_SaveThread() = 0x21222e0 <0.000024>
22:24:40.107909 PyEval_SaveThread() = 0x21222e0 <0.000025>
22:24:40.108013 mysql_real_query(0x021d01d0, "SELECT sleep(1)", 15) = 0 <1.001821>
ltrace command
ltrace -ttTg -x PyEval_SaveThread -x mysql_real_query -p [pid of script above]
Perl
Test script
#!/usr/bin/perl
use DBI;
$dsn = "DBI:mysql:database=test;host=localhost";
$dbh = DBI->connect($dsn, "root", "");
$drh = DBI->install_driver("mysql");
@databases = DBI->data_sources("mysql");
$sth = $dbh->prepare("SELECT SLEEP(1)");
while (1) {
    $sth->execute;
}
ltrace output
22:42:11.194073 Perl_push_scope(0x01bd3010) = <0.000028>
22:42:11.194299 mysql_real_query(0x01bfbf40, "SELECT SLEEP(1)", 15) = 0 <1.000876>
22:42:12.195302 Perl_push_scope(0x01bd3010) = <0.000024>
22:42:12.195408 mysql_real_query(0x01bfbf40, "SELECT SLEEP(1)", 15) = 0 <1.000967>
ltrace command
ltrace -ttTg -x mysql_real_query -x Perl_push_scope -p [pid of script above]
Ruby
Test script
require 'rubygems'
require 'sequel'
DB = Sequel.connect('mysql://root@localhost/test')
while true
  p DB['select sleep(1)'].select.first
  GC.start
end
snip of ltrace output
22:10:00.195814 garbage_collect() = 0 <0.022194>
22:10:00.218438 mysql_real_query(0x02740000, "select sleep(1)", 15) = 0 <1.001100>
22:10:01.219884 garbage_collect() = 0 <0.021401>
22:10:01.241679 mysql_real_query(0x02740000, "select sleep(1)", 15) = 0 <1.000812>
ltrace command used:
ltrace -ttTg -x garbage_collect -x mysql_real_query -p [pid of script above]
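All of the traces above share one record shape: a timestamp from -tt, the call with its arguments, the return value, and the elapsed time from -T in angle brackets. As a rough sketch of how such output could be post-processed to list only the slow calls, something like the following might work. The regular expression, the slow_calls helper, and the 0.5 second threshold are illustrative assumptions, not part of ltrace or the patch.

```python
import re

# One record of `ltrace -ttT` output, as it appears in the traces above:
#   HH:MM:SS.usec name(args) = retval <seconds>
# The return value may be empty (see the Perl_push_scope lines).
RECORD = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<func>\w+)\((?P<args>.*?)\)\s*=\s*"
    r"(?P<ret>\S*?)\s*<(?P<elapsed>\d+\.\d+)>"
)


def slow_calls(trace, threshold=0.5):
    """Return (time, func, args, elapsed) for calls taking >= threshold seconds."""
    hits = []
    for m in RECORD.finditer(trace):
        elapsed = float(m.group("elapsed"))
        if elapsed >= threshold:
            hits.append((m.group("time"), m.group("func"), m.group("args"), elapsed))
    return hits


sample = (
    '22:31:50.507523 zend_hash_find(0x025dc3a0, "mysql_query", 12) = 0 <0.000029>\n'
    '22:31:50.507781 mysql_real_query(0x027bc540, "SELECT sleep(1)", 15) = 0 <1.000600>\n'
)

for when, func, args, elapsed in slow_calls(sample):
    print(f"{when} {func}({args}) took {elapsed:.6f}s")
```

With the two sample records, only the mysql_real_query call clears the threshold, which is exactly the kind of filtering you want when hunting slow queries.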
Where to get it
- On github: http://github.com/ice799/ltrace/tree/libdl
- Raw patch (NOTE: This should apply cleanly against ltrace 0.5.3): ltrace.patch
How ltrace works normally
ltrace works by setting software breakpoints on entries in a process’ Procedure Linkage Table (PLT).
What is a software breakpoint
A software breakpoint is just a one-byte instruction (0xcc, int3, on the x86 and x86_64) that raises the breakpoint interrupt (interrupt 3 on the x86 and x86_64). When interrupt 3 is raised, the CPU executes a handler installed by the kernel. The kernel then sends a signal (SIGTRAP on Linux) to the process that generated the interrupt. (Want to know more about how signals and interrupts work? Check out an earlier blog post: here)
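To make the byte patching concrete, here is a toy model of what a tracer does when it plants and later removes a breakpoint. It works on a plain bytearray standing in for the traced process's code; a real tracer such as ltrace would read and write the target's memory through ptrace instead. The Breakpoints class and its method names are made up for illustration.

```python
INT3 = 0xCC  # breakpoint opcode on x86 and x86_64


class Breakpoints:
    """Toy model: `memory` is a bytearray standing in for a traced process's code."""

    def __init__(self, memory):
        self.memory = memory
        self.saved = {}  # address -> original byte that the breakpoint replaced

    def insert(self, addr):
        # Remember the original byte so the instruction can be restored later.
        if addr not in self.saved:
            self.saved[addr] = self.memory[addr]
            self.memory[addr] = INT3

    def remove(self, addr):
        # Put the original byte back so the real instruction can execute.
        self.memory[addr] = self.saved.pop(addr)


# 55 = push %rbp; 48 89 e5 = mov %rsp,%rbp; c3 = ret
code = bytearray(b"\x55\x48\x89\xe5\xc3")
bp = Breakpoints(code)

bp.insert(0)
patched = bytes(code)    # first byte is now 0xcc
bp.remove(0)
restored = bytes(code)   # original function prologue is back
```

Saving the overwritten byte is the important part: once the breakpoint fires, the tracer has to restore it before the original instruction can run.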
What is a PLT and how does it work?
A PLT is a table of absolute addresses to functions. It is used because the link editor doesn’t know where functions in shared objects will be located. Instead, a table is created so that the program and the dynamic linker can work together to find and execute functions in shared objects. I’ve simplified the explanation a bit [1], but at a high level:
- When the program calls a function in a shared object, the link editor has made sure that the call jumps to a slot in the PLT.
- The program sets some data up for the dynamic linker and then hands control over to it.
- The dynamic linker looks at the info set up by the program and fills in the absolute address of the function that was called in the PLT.
- Then the dynamic linker calls the function.
- Subsequent calls to the same function jump to the same slot in the PLT, but every time after the first call the absolute address is already in the PLT (because when the dynamic linker is invoked the first time, it fills in the absolute address in the PLT).
Since all calls to library functions occur via the PLT, ltrace sets breakpoints on each PLT entry in a program.
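The lazy binding described above can be modelled in a few lines. This is only a toy: the LazyPLT class, its dictionary of slots, and the resolutions counter are inventions for illustration and say nothing about how the real table is laid out in memory. What it does show is that the resolver runs once per function, on the first call, and that every later call goes straight through the filled-in slot.

```python
class LazyPLT:
    """Toy model of lazy binding; not how the real PLT is laid out."""

    def __init__(self, shared_objects):
        self.shared_objects = shared_objects  # name -> callable, stands in for loaded libraries
        self.slots = {}                       # name -> resolved callable (the filled-in slots)
        self.resolutions = 0                  # how many times the "dynamic linker" ran

    def _resolve(self, name):
        # First call only: look the function up and fill in the slot.
        self.resolutions += 1
        self.slots[name] = self.shared_objects[name]
        return self.slots[name]

    def call(self, name, *args):
        # Every call goes through the slot; unresolved slots detour via the resolver.
        target = self.slots.get(name) or self._resolve(name)
        return target(*args)


plt = LazyPLT({"strlen": len})
first = plt.call("strlen", "hello")    # resolver runs, then the function is called
second = plt.call("strlen", "ltrace")  # slot already filled in, no resolver
```

Because every call funnels through call(), a tracer that watches the slots sees every library call, which is why ltrace puts its breakpoints there.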
Why ltrace didn’t work with libdl loaded libraries
Libraries loaded with libdl are loaded at run time and functions (and other symbols) are accessed by querying the dynamic linker (by calling dlsym()). The compiler and link editor don’t know anything about libraries loaded this way (they may not even exist!) and as such no PLT entries are created for them.
Since no PLT entries exist, ltrace can’t trace these functions.
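You can see this style of loading from Python itself, since ctypes loads libraries at run time with dlopen() and looks symbols up by name, much like dlsym(). The sketch below assumes a Linux or similar system where the math library can be found; the fallback name libm.so.6 is the usual glibc name and is an assumption about your system.

```python
import ctypes
import ctypes.util

# find_library may return None on minimal systems; "libm.so.6" is the usual glibc name.
libm_name = ctypes.util.find_library("m") or "libm.so.6"

libm = ctypes.CDLL(libm_name)   # loads the library at run time (dlopen on Linux)
cos = libm.cos                  # looks the symbol up by name (like dlsym)
cos.restype = ctypes.c_double
cos.argtypes = [ctypes.c_double]

result = cos(0.0)
print(libm_name, result)
```

The call to cos here goes through a function pointer obtained at run time, so nothing in the Python binary's PLT refers to it. Extensions loaded by Ruby, PHP, and Perl are reached the same way.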
What needed to be done to make ltrace libdl-aware
OK, so we understand the problem. ltrace only sets breakpoints on PLT entries and libdl loaded libraries don’t have PLT entries. How can this be fixed?
Luckily, the dynamic linker and ELF all work together to save your ass.
Executable and Linking Format (ELF) is a file format for executables, shared libraries, and more [2]. The file format can get a bit complicated, but all you really need to know is: ELF consists of different sections which hold different types of entries. There is a section called .dynamic which has an entry named DT_DEBUG. This entry stores the address of a debugging structure in the address space of the process. On Linux, this struct has type struct r_debug.
How to use struct r_debug to win the game
The debug structure is updated by the dynamic linker at runtime to reflect the current state of shared object loading. The structure contains 3 things that will help us in our quest:
- state – the current state of the mapping change taking place (begin add, begin delete, consistent)
- brk – the address of a function internal to the dynamic linker that will be called when the linker maps, unmaps, or has completed mapping a shared object.
- link map – Pointer to the start of a list of currently loaded objects. This list is called the link map and is represented as a struct link_map in Linux.
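In glibc's link.h the three items above are the fields r_state, r_brk, and r_map of struct r_debug. As a hedged sketch, the layout can be mirrored with ctypes and used to walk the link map of the running Python process. This assumes Linux with glibc, where the dynamic linker exports the structure under the symbol _r_debug; on other systems the lookup is expected to fail and the helper simply returns an empty list. The class and function names are my own.

```python
import ctypes


class LinkMap(ctypes.Structure):
    pass


# Public part of struct link_map from glibc's <link.h>.
LinkMap._fields_ = [
    ("l_addr", ctypes.c_void_p),           # load address offset of the object
    ("l_name", ctypes.c_char_p),           # path of the object ("" for the main program)
    ("l_ld", ctypes.c_void_p),             # address of its .dynamic section
    ("l_next", ctypes.POINTER(LinkMap)),
    ("l_prev", ctypes.POINTER(LinkMap)),
]


class RDebug(ctypes.Structure):
    # struct r_debug from glibc's <link.h>.
    _fields_ = [
        ("r_version", ctypes.c_int),
        ("r_map", ctypes.POINTER(LinkMap)),  # "link map" in the list above
        ("r_brk", ctypes.c_void_p),          # "brk" in the list above
        ("r_state", ctypes.c_int),           # "state" in the list above
        ("r_ldbase", ctypes.c_void_p),
    ]


def loaded_objects():
    """Walk this process's link map; returns [] if _r_debug cannot be found."""
    try:
        dbg = RDebug.in_dll(ctypes.CDLL(None), "_r_debug")
    except (ValueError, OSError):
        return []
    names = []
    node = dbg.r_map
    while node:  # a NULL pointer is falsy in ctypes
        names.append(node.contents.l_name or b"")
        node = node.contents.l_next
    return names


for name in loaded_objects():
    print(name.decode(errors="replace") or "(main program)")
```

A tracer cannot read the structure this way, because it lives in another process. It has to find the address through DT_DEBUG and read the target's memory, but the fields it reads are the same.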
Tie it all together and bring it home
To add support for libdl loaded libraries to ltrace, the steps are:
1. Find the address of the debug structure in the .dynamic section of the program.
2. Set a software breakpoint on brk.
3. When the dynamic linker updates the link map, it will trigger the software breakpoint.
4. When the breakpoint is triggered, check state in the debug structure.
5. If a new library has been added, walk the link map and figure out what was added.
6. Search the added library’s symbol table for the symbols we care about.
7. Set software breakpoints on whatever is found.
8. Steps 3-7 repeat for every library loaded afterwards.
That isn’t too hard, all thanks to the dynamic linker providing a way for us to hook into its internal events.
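The breakpoint-handling part of that loop can be sketched as a small simulation. Nothing here traces a real process: the LibdlTracer class, the list of tuples standing in for the link map, and the library contents are all hypothetical. The state constants match the r_state values in glibc's link.h. Waiting for the consistent state before walking the list is an assumption of this sketch, based on glibc calling brk both when a change begins and again when it has finished.

```python
RT_CONSISTENT, RT_ADD, RT_DELETE = 0, 1, 2  # r_state values from glibc's <link.h>


class LibdlTracer:
    """Toy model of the breakpoint-on-brk loop; no real process is traced."""

    def __init__(self, wanted):
        self.wanted = set(wanted)  # symbol names the user asked for with -x
        self.seen = set()          # libraries whose symbols were already searched
        self.breakpoints = {}      # symbol -> library it was found in

    def on_brk(self, state, link_map):
        """Called whenever the breakpoint on brk fires.

        link_map is a list of (library name, set of exported symbols),
        standing in for the linked list of struct link_map.
        """
        if state != RT_CONSISTENT:
            return  # a mapping change is still in progress; wait for the next hit
        for name, symbols in link_map:
            if name in self.seen:
                continue  # already searched on an earlier hit
            self.seen.add(name)
            for sym in self.wanted & symbols:
                self.breakpoints[sym] = name  # "set a software breakpoint"


tracer = LibdlTracer(["mysql_real_query"])
libc = ("libc.so.6", {"malloc", "free"})
mysql = ("libmysqlclient.so", {"mysql_real_query", "mysql_init"})

tracer.on_brk(RT_CONSISTENT, [libc])         # at attach time only libc is mapped
tracer.on_brk(RT_ADD, [libc])                # a dlopen() has started
tracer.on_brk(RT_CONSISTENT, [libc, mysql])  # the dlopen() has finished
```

After the last call, tracer.breakpoints maps mysql_real_query to the library that was loaded at run time, which is the step plain ltrace could never reach.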
Conclusion
- Read the System V ABI for your CPU. It is filled with insanely useful information that can help you be a better programmer.
- Use the source. A few times while hacking on this patch I looked through the source for GDB and glibc to help me figure out what was going on.
- Understanding how things work at a low-level can help you build tools to solve your high-level problems.
References
- [1] System V Application Binary Interface AMD64 Architecture Processor Supplement, p 78
- [2] Executable and Linking Format (ELF) Specification
