GCC optimization flag makes your 64bit binary fatter and slower »

Created at: 20.07.2010 15:59, source: time to bleed by Joe Damato, tagged: debugging linux systems testing x86


If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

The intention of this post is to highlight a subtle GCC optimization bug that leads to slower and larger code being generated than would have been generated without the optimization flag.

UPDATED: Graphs are now 0 based on the y axis. Links in the tidbits section (below conclusion) for my ugly test harness and terminal session of the build of the test case in the bug report, objdump, and corresponding system information.

Hold the #gccfail tweets, son.

Everyone fucks up. The point of this post is not to rag on GCC. If writing a C compiler was easy then every asshole with a keyboard would write one for fun.

WARNING: THERE IS MATH, SCIENCE, AND GRAPHS BELOW.

Watch yourself.

The original bug report for -fomit-frame-pointer.

I stumbled across a bug report for GCC that was very interesting. It points out a very subtle bug that occurs when the -fomit-frame-pointer flag is passed to GCC. The bug report is for 32bit code, however after some testing I found that this bug also rears its head in 64bit code.

What is -fomit-frame-pointer supposed to do?

The -fomit-frame-pointer flag is intended to direct GCC to avoid saving and restoring the frame pointer (%ebp or %rbp). This is supposed to make function calls faster, since the function is doing less work each invocation. It should also make function code take fewer bytes since there are fewer instructions being executed.

A caveat of using -fomit-frame-pointer is that it may make debugging impossible on certain systems. To combat this on Linux, .debug_frame and .eh_frame sections are added to ELF binaries to assist in the stack unwinding process when the frame pointer is omitted.

What is the bug?

The bug is that when -fomit-frame-pointer is used, GCC erroneously uses the frame pointer register as a general purpose register when a different register could be used instead.

wat.

The amd64 and i386 ABIs1 2 specify a list of caller and callee saved registers.

  • The frame pointer register is callee saved. That means that if a function is going to use the frame pointer register, it must save and restore the value in the register.
  • The test case provided in the bug report shows that other caller saved registers were available for use.
  • Had the function used a caller saved register instead, there would be no need for the additional save and restore instructions in the function.
  • Removing those instructions would take fewer bytes and execute faster.

What are the consequences?

Let’s take a look at two potential pieces of code.

The first piece is the code that would be generated if -fomit-frame-pointer is not used:

test1:
        pushq %rbp       ; save frame pointer
        movq %rsp,%rbp   ; update frame pointer to the current stack pointer
           ; here is where your function would do work
        leave            ; restore the stack pointer and frame pointer
        ret              ; return

Size: 6 bytes.

The above assembly sequence uses the frame pointer.

Let’s take a look at the code that is generated by GCC when -fomit-frame-pointer is used:

        sub $0x8, %rsp    ; make room on the stack
        movq %rbp, (%rsp) ; store rbp on the stack
          ; here is where your function would modify and use %rbp as needed
        movq (%rsp), %rbp ; restore %rbp
        add $0x8, %rsp    ; get rid of the extra stack space
        ret               ; return

Size: 17 bytes.

The above assembly sequence is what is generated when GCC decides to use the frame pointer register as a general purpose register. Since it is callee saved, it must be saved before being modified and restored after being modified.

So -fomit-frame-pointer makes your binary fatter, but does it make it slower?

Only one way to find out: do science.

I built a simple (and very ugly) testing harness to test the above pieces of code to determine which piece of code is faster. Before we get into the benchmark results, I want to tell you why my benchmark is bullshit.

Yes, bullshit.

You see, it makes me sad when people post benchmarks and neglect to tell others why their benchmark may be inaccurate. So, lemme start the trend.

This benchmark is useless because:

  • Reading the CPU cycle counter is unreliable (more on this below the conclusion). I also tracked wall clock time, too.
  • I don’t have the ideal test environment. I ran this on bare metal hardware, and set the CPU affinity to keep the process pinned to a single CPU… BUT
  • I could have done better if I had pinned init to CPU0 (thereby forcing all children of init to be pinned to CPU0 – remember child processes inherit the affinity mask). I would have then had an entire CPU for nothing but my benchmark.
  • I could have done better if I forced the CPU running my benchmark program to not handle any IRQs.
  • I only tested one version of GCC: (Debian 4.3.2-1.1) 4.3.2
  • I could have taken more samples.

You can find more testing harness tidbits below the conclusion.

Benchmark Results

test 1 — Code sequence simulating using the frame pointer.
test 2 — Code sequence simulating using the frame pointer as a general purpose register.

64bit results

Using -fomit-frame-pointer is SLOWER (contrary to what you’d expect) than not using it!

cycles test 1 cycles test 2 microsecs test 1 microsecs test 2
mean 3514422987.92 4559685515.66 1882707.27 2442663.94
median 3507007423.5 4562511684.5 1878721.5 2444171.5
max 3922780211 4672066854 2101457 2502869
min 3502194976 4327782795 1876113 2318452
std dev 31927179.5632 15449507.8196 17103.7755 8275.49788
variance 1.02E+15 238687291867021 292539135.936 68483865.11835


32bit results

Using -fomit-frame-pointer is FASTER (as it should be) than not using it! The binary is still fatter, though.

cycles test 1 cycles test 2 microsecs test 1 microsecs test 2
mean 3502932799.49 3491263364.89 1876553.08 1870301.35
median 3501486586.5 3492013955.5 1875778 1870702.5
max 3905163528 3731985243 2092032 1999259
min 3500916510 3408834436 1875472 1826144
std dev 10066939.1113 7992367.6913 5393.0412 4281.5466
variance 101343263071403 63877941312996.4 29084893.2588 18331640.9459


Conclusion

  • GCC is a really complex piece of software; this bug is very subtle and may have existed for a while.
  • I’ve said this a few times, but knowing and understanding your system’s ABI is crucial for catching bugs like these.
  • Math and science are cool now, much like computers. You should use both.

Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.

Testing harness tidbits

Each run of the benchmark executes either test1 or test2 (from above) 500,000,000 times. I did around 2500 runs for each test function.

You can get the testing harness, a build script, and a test script here: http://gist.github.com/483524

You can look at the terminal session where I build the test from the original bug report on my system: http://gist.github.com/483494

The code I used to read the CPU cycle counter looks like this:

static __inline__ unsigned long long rdtsc(void)
{
  unsigned long hi = 0, lo = 0;
  __asm__ __volatile__ ("lfence\n\trdtsc" : "=a"(lo), "=d"(hi));
  return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

The lfence instruction is a serializing instruction that ensures that all load instructions which were issued before the lfence instruction have been executed before proceeding. I did this to make sure that the cycle counter was being read after all operations in the test functions were executed.

The values returned by this function are misleading because CPU frequency may be scaled at any time. This is why I also measured wall clock time.

References

  1. http://www.sco.com/developers/devspecs/abi386-4.pdf
  2. http://www.x86-64.org/documentation/abi.pdf


more »

Dynamic symbol table duel: ELF vs Mach-O, round 2 »

Created at: 01.06.2010 15:59, source: time to bleed by Joe Damato, tagged: linux osx systems x86 debug elf mach-o x86_64


If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

The intention of this post is to continue highlighting some of the similarities and differences between ELF and Mach-O that I encountered while building memprof. The previous post in this series can be found here.

What is a symbol table?

A symbol table is simply a list of names in an object. The names in the list may be names of functions, initialized/uninitialized memory regions, or other things depending on the object format. The symbol table does not need to be mapped into a running process and is only useful for debugging. The symbol table (and other sections) may be removed from an object when you use strip.

Symbol tables in ELF objects

An entry in the symbol table in an ELF object can best be described by the following struct from /usr/include/elf.h:

typedef struct
{
  Elf64_Word    st_name;                /* Symbol name (string tbl index) */
  unsigned char st_info;                /* Symbol type and binding */
  unsigned char st_other;               /* Symbol visibility */
  Elf64_Section st_shndx;               /* Section index */
  Elf64_Addr    st_value;               /* Symbol value */
  Elf64_Xword   st_size;                /* Symbol size */
} Elf64_Sym;

In most cases, this structure is used to find the mapping from a symbol name to the address where it lives. Although, different symbol types (specified by st_info) provide mappings from symbols to other data.

The st_name field is an index into a section called strtab which is just a table of strings.

Symbol tables in Mach-O objects

Let’s take a look at the struct for a symbol table entry in a Mach-O object from /usr/include/mach-o/nlist.h:

struct nlist_64 {
    union {
        uint32_t  n_strx; /* index into the string table */
    } n_un;
    uint8_t n_type;        /* type flag */
    uint8_t n_sect;        /* section number or NO_SECT */
    uint16_t n_desc;       /* see  */
    uint64_t n_value;      /* value of this symbol (or stab offset) */
};

It looks very similar. The immediately noticeable difference with ELF:

  • lack of size field – The only noticeable difference on your first glance is the lack of a size field. The size field in ELF objects describes the number of bytes occupied by the symbol. This is actually pretty useful, especially for memprof. The lack of this field in Mach-O was a source of frustration for Jake when he was implementing Mach-O support.

What is a dynamic symbol table?

Shared objects in both Mach-O and ELF have a symbol table listing only functions that are exporteed by the object.

This table is used during dynamic linking and is mapped into the process’ address space when the object is loaded, unlike the symbol table which is just used for debugging.

The dynamic symbol table is a subset of the symbol table.

Dynamic symbol table in ELF objects

The dynamic symbol table in ELF objects is stored in a section named dynsym. The indexes stored in the st_name field (from the structure listed above) are indexes into the string table in a section named dynstr. dynstr is a string table specifically for entries in the dynamic symbol table.

If you know the symbol you care about, you can simply calculate a hash of the symbol name to find the symbol table entry for that symbol. Unfortunately, there is not very much documentation about the hash function that is to be used.

Your two options are:

The sections storing the hash table data for an object are called .hash and .gnu.hash.

Dynamic symbol table in Mach-O objects

Finding the dynamic symbol table in a Mach-O object is a bit complicated. The pieces to the puzzle are found across different structures and the documentation on how it all works is sparse.

Mach-O objects have a load command called LC_DYSYMTAB which describes information about the dynamic symbol table in Mach-O objects.

I’ve shortened the structure definition, as it is quite large and contains documentation about stuff that is not directly relevant to this post. From /usr/include/mach-o/loader.h:

struct dysymtab_command {
    uint32_t cmd; /* LC_DYSYMTAB */
    uint32_t cmdsize; /* sizeof(struct dysymtab_command) */

    /* .... */

    /*
     * The sections that contain "symbol pointers" and "routine stubs" have
     * indexes and (implied counts based on the size of the section and fixed
     * size of the entry) into the "indirect symbol" table for each pointer
     * and stub.  For every section of these two types the index into the
     * indirect symbol table is stored in the section header in the field
     * reserved1.  An indirect symbol table entry is simply a 32bit index into
     * the symbol table to the symbol that the pointer or stub is referring to.
     * The indirect symbol table is ordered to match the entries in the section.
     */
    uint32_t indirectsymoff; /* file offset to the indirect symbol table */
    uint32_t nindirectsyms;  /* number of indirect symbol table entries */

    /* .... */
};

The LC_DYSYMTAB load command provides the fields indirectsymoff and nindirectsyms which describe the offset into the file where the indirect symbol tables lives and the number of entries in the table, respectively.

The dynamic symbol table in Mach-O is surprisingly simple. Each entry in the table is just a 32bit index into the symbol table. The dynamic symbol table is just a list of indexes and nothing else.

It turns out there are a few more pieces to the puzzle.

Take a look at the definition for a Mach-O section:

struct section_64 { /* for 64-bit architectures */
  char    sectname[16]; /* name of this section */
  char    segname[16];  /* segment this section goes in */
  uint64_t  addr;   /* memory address of this section */
  uint64_t  size;   /* size in bytes of this section */
  uint32_t  offset;   /* file offset of this section */
  uint32_t  align;    /* section alignment (power of 2) */
  uint32_t  reloff;   /* file offset of relocation entries */
  uint32_t  nreloc;   /* number of relocation entries */
  uint32_t  flags;    /* flags (section type and attributes)*/
  uint32_t  reserved1;  /* reserved (for offset or index) */
  uint32_t  reserved2;  /* reserved (for count or sizeof) */
  uint32_t  reserved3;  /* reserved */
};

It turns out that the fields reserved1 and reserved2 are useful too.

If a section_64 structure is describing a symbol_stub or __la_symbol_ptr sections (read the previous post to learn about these sections), then the reserved1 field hold the index into the dynamic symbol table for the sections entries in the table.

symbol_stub sections also make use of the reserved2 field; the size of a single stub entry is stored in reserved2 otherwise, the field is set to 0.

Two notable differences between the dynamic symbol tables

  • There is an explicit section in ELF that contains Elf64_Sym entries. On Mach-O it’s just a list of 32bit offsets.
  • ELF provides a .hash section and/or .gnu_hash section to speed up symbol lookup. Mach-O does not.

What happens when you run strip?

Let’s use strip with no options (other than the filename).

On ELF:

  • All .debug_* sections are removed. These sections contain extra debugging information that helps debuggers figure out more precisely what went wrong.
  • .symtab section is removed.
  • .strtab section is removed.

On Mach-O:

  • Only undefined symbols and dynamic symbols are left in the symbol table. Everything else is removed.

How to strip so I can debug later (linux only)

If you decide to strip your binary please be considerate to future hackers who may need to debug your app for some reason.

You can be considerate by following the directions in strip(1):

1. Link the executable as normal. Assuming that is is called
“foo” then…

2. Run “objcopy –only-keep-debug foo foo.dbg” to
create a file containing the debugging info.

3. Run “objcopy –strip-debug foo” to create a
stripped executable.

4. Run “objcopy –add-gnu-debuglink=foo.dbg foo”
to add a link to the debugging info into the stripped executable.

And don’t forget to put your debugging information somewhere easily accessible and googleable.

If you do this: you are cool. If you don’t…

Conclusion

  1. I like the way ELF does dynamic symbol tables, the gnu_debuglink section, and the lookup hash table for dynamic symbols. All of these pieces are really useful and I am glad they exist.
  2. The indirect symbol table was a bit of a pain to track down on Mach-O as the information is hard to parse on the first pass. To be fair, it is all there if you google around a bit and put the pieces together.
  3. On Linux, if you strip, please add a gnu_debuglink section and put the debug information somewhere I can find it.

Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.

References


more »

WARNING: American Express fails miserably at basic security. »

Created at: 25.05.2010 21:54, source: time to bleed by Joe Damato, tagged: security systems vulnerability


If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

As of 3:35pm PST on 5/25/2010 it seems to be fixed. wireshark shows only TLS traffic now, nothing in the clear. Pretty quick fix, since this was published at 11:54am. Good deal.

This article is going to reveal a pretty serious error in a web form on the American Express Network website. I would strongly recommend NOT filling out the web form described below.

Daily Wish from the American Express Network

Daily wish from the American Express Network sent me an email this morning trying to get me to sign up for their deal of the day service where they offer a very limited quantity of products for a low price.

Sounds simple enough, right?

Well, the time of the sale is not released until the day the sale occurs, unless you are an American Express cardholder. If you are a card holder, you get a special landing page on their website telling you that if you sign up, you can get the sale times before the sale date.

The white arrow below points to the tab that only appears if you clicked through from an email from American Express. The red arrow below points to the sign up button. Take a look:

Sign up page

After clicking the sign up button (red arrow above), a lightbox appears asking for:

  • First and last name
  • American Express credit card number
  • Security code
  • Expiration date
  • Billing zip

Quite a bit of personal information, much of it sensitive. [sarcarsm]Don’t worry the page is secure[/sarcasm], see the form and the white arrow below:

The code from the form

This form looked very suspicious to me, so I decided to take a look at the code to see if the action for this sign up form was over HTTPS. Check it:

<form name="form1" method="post" action="preid2.aspx?ct=7" onsubmit="javascript:return WebForm_OnSubmit();" id="form1">

So the action is to a handler at http://dailywish.amexnetwork.com/preid2.aspx?ct=7. The lack of https doesn’t make me feel very good.

Maybe the WebForm_OnSubmit() function is doing something that might make this secure? Let’s take a look:

<script type="text/javascript">
//<![CDATA[
function WebForm_OnSubmit() {
if (typeof(ValidatorOnSubmit) == "function" && ValidatorOnSubmit() == false) return false;
return true;
}
//]]>
</script>


So it looks like that function is just a validator. It is really starting to feel like this form is insecure.

Let’s bring out wireshark and see what it has to say.

Wireshark packet sniff

So I filled out the form with fake information and sniffed the POST to the server.

The Daily Wish sign up form from the American Express Network is sending credit card numbers, expiration dates, and all the other personal information on the sign up form in the clear back to their server.

Holy. Fuck.

Conclusion

  • Do NOT fill out the form until American Express fixes this issue.

Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.


more »