GCC optimization flag makes your 64bit binary fatter and slower »

Created at: 20.07.2010 15:59, source: time to bleed by Joe Damato, tagged: debugging linux systems testing x86


If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

The intention of this post is to highlight a subtle GCC optimization bug that leads to slower and larger code being generated than would have been generated without the optimization flag.

UPDATED: Graphs are now 0 based on the y axis. Links in the tidbits section (below conclusion) for my ugly test harness and terminal session of the build of the test case in the bug report, objdump, and corresponding system information.

Hold the #gccfail tweets, son.

Everyone fucks up. The point of this post is not to rag on GCC. If writing a C compiler was easy then every asshole with a keyboard would write one for fun.

WARNING: THERE IS MATH, SCIENCE, AND GRAPHS BELOW.

Watch yourself.

The original bug report for -fomit-frame-pointer.

I stumbled across a bug report for GCC that was very interesting. It points out a very subtle bug that occurs when the -fomit-frame-pointer flag is passed to GCC. The bug report is for 32bit code, however after some testing I found that this bug also rears its head in 64bit code.

What is -fomit-frame-pointer supposed to do?

The -fomit-frame-pointer flag is intended to direct GCC to avoid saving and restoring the frame pointer (%ebp or %rbp). This is supposed to make function calls faster, since the function is doing less work each invocation. It should also make function code take fewer bytes since there are fewer instructions being executed.

A caveat of using -fomit-frame-pointer is that it may make debugging impossible on certain systems. To combat this on Linux, .debug_frame and .eh_frame sections are added to ELF binaries to assist in the stack unwinding process when the frame pointer is omitted.

What is the bug?

The bug is that when -fomit-frame-pointer is used, GCC erroneously uses the frame pointer register as a general purpose register when a different register could be used instead.

wat.

The amd64 and i386 ABIs1 2 specify a list of caller and callee saved registers.

  • The frame pointer register is callee saved. That means that if a function is going to use the frame pointer register, it must save and restore the value in the register.
  • The test case provided in the bug report shows that other caller saved registers were available for use.
  • Had the function used a caller saved register instead, there would be no need for the additional save and restore instructions in the function.
  • Removing those instructions would take fewer bytes and execute faster.

What are the consequences?

Let’s take a look at two potential pieces of code.

The first piece is the code that would be generated if -fomit-frame-pointer is not used:

test1:
        pushq %rbp       ; save frame pointer
        movq %rsp,%rbp   ; update frame pointer to the current stack pointer
           ; here is where your function would do work
        leave            ; restore the stack pointer and frame pointer
        ret              ; return

Size: 6 bytes.

The above assembly sequence uses the frame pointer.

Let’s take a look at the code that is generated by GCC when -fomit-frame-pointer is used:

        sub $0x8, %rsp    ; make room on the stack
        movq %rbp, (%rsp) ; store rbp on the stack
          ; here is where your function would modify and use %rbp as needed
        movq (%rsp), %rbp ; restore %rbp
        add $0x8, %rsp    ; get rid of the extra stack space
        ret               ; return

Size: 17 bytes.

The above assembly sequence is what is generated when GCC decides to use the frame pointer register as a general purpose register. Since it is callee saved, it must be saved before being modified and restored after being modified.

So -fomit-frame-pointer makes your binary fatter, but does it make it slower?

Only one way to find out: do science.

I built a simple (and very ugly) testing harness to test the above pieces of code to determine which piece of code is faster. Before we get into the benchmark results, I want to tell you why my benchmark is bullshit.

Yes, bullshit.

You see, it makes me sad when people post benchmarks and neglect to tell others why their benchmark may be inaccurate. So, lemme start the trend.

This benchmark is useless because:

  • Reading the CPU cycle counter is unreliable (more on this below the conclusion). I also tracked wall clock time, too.
  • I don’t have the ideal test environment. I ran this on bare metal hardware, and set the CPU affinity to keep the process pinned to a single CPU… BUT
  • I could have done better if I had pinned init to CPU0 (thereby forcing all children of init to be pinned to CPU0 – remember child processes inherit the affinity mask). I would have then had an entire CPU for nothing but my benchmark.
  • I could have done better if I forced the CPU running my benchmark program to not handle any IRQs.
  • I only tested one version of GCC: (Debian 4.3.2-1.1) 4.3.2
  • I could have taken more samples.

You can find more testing harness tidbits below the conclusion.

Benchmark Results

test 1 — Code sequence simulating using the frame pointer.
test 2 — Code sequence simulating using the frame pointer as a general purpose register.

64bit results

Using -fomit-frame-pointer is SLOWER (contrary to what you’d expect) than not using it!

cycles test 1 cycles test 2 microsecs test 1 microsecs test 2
mean 3514422987.92 4559685515.66 1882707.27 2442663.94
median 3507007423.5 4562511684.5 1878721.5 2444171.5
max 3922780211 4672066854 2101457 2502869
min 3502194976 4327782795 1876113 2318452
std dev 31927179.5632 15449507.8196 17103.7755 8275.49788
variance 1.02E+15 238687291867021 292539135.936 68483865.11835


32bit results

Using -fomit-frame-pointer is FASTER (as it should be) than not using it! The binary is still fatter, though.

cycles test 1 cycles test 2 microsecs test 1 microsecs test 2
mean 3502932799.49 3491263364.89 1876553.08 1870301.35
median 3501486586.5 3492013955.5 1875778 1870702.5
max 3905163528 3731985243 2092032 1999259
min 3500916510 3408834436 1875472 1826144
std dev 10066939.1113 7992367.6913 5393.0412 4281.5466
variance 101343263071403 63877941312996.4 29084893.2588 18331640.9459


Conclusion

  • GCC is a really complex piece of software; this bug is very subtle and may have existed for a while.
  • I’ve said this a few times, but knowing and understanding your system’s ABI is crucial for catching bugs like these.
  • Math and science are cool now, much like computers. You should use both.

Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.

Testing harness tidbits

Each run of the benchmark executes either test1 or test2 (from above) 500,000,000 times. I did around 2500 runs for each test function.

You can get the testing harness, a build script, and a test script here: http://gist.github.com/483524

You can look at the terminal session where I build the test from the original bug report on my system: http://gist.github.com/483494

The code I used to read the CPU cycle counter looks like this:

static __inline__ unsigned long long rdtsc(void)
{
  unsigned long hi = 0, lo = 0;
  __asm__ __volatile__ ("lfence\n\trdtsc" : "=a"(lo), "=d"(hi));
  return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

The lfence instruction is a serializing instruction that ensures that all load instructions which were issued before the lfence instruction have been executed before proceeding. I did this to make sure that the cycle counter was being read after all operations in the test functions were executed.

The values returned by this function are misleading because CPU frequency may be scaled at any time. This is why I also measured wall clock time.

References

  1. http://www.sco.com/developers/devspecs/abi386-4.pdf
  2. http://www.x86-64.org/documentation/abi.pdf


more »

Because Gem Names Are Like Domains in the 90's »

Created at: 30.03.2010 18:00, source: RailsTips - Home, tagged: gems testing

One of my favorite parts of every new gem is naming it. The other day, when I was trying to name joint, it occurred to me that I should always check if a gem name is available before I create my project. I did a quick search on RubyGems and discovered it was available.

Last night, I decided I should whip together a tiny gem that allows you to issue a whois command to see if a gem name is taken. Why leave the command line, eh?

Installation

gem install gemwhois

This adds the whois command to gem. Which means usage is pretty fun.

Usage

$ gem whois httparty

   gem name: httparty
     owners: John Nunemaker, Sandro Turriate
       info: Makes http fun! Also, makes consuming restful web services dead easy.
    version: 0.5.2
  downloads: 40714
  
$ gem whois somenonexistantgem

  Gem not found. It will be mine. Oh yes. It will be mine. *sinister laugh*

If the gem is found, you will see some details about the project (maybe you can convince them to hand over rights if they are squatting). If the gem is not found, you will receive a creepy message in the same vein as the RubyGems 404 page.

The Fun Parts

The fun part of this gem was recently I noticed that other gems have been adding commands to the gem command. I thought that was interesting so I did a bit of research. I knew that both gemedit and gemcutter added commands so I downloaded both from Github and began to peruse the source. Turns out it is quite easy.

First, you have to have a rubygems_plugin.rb file in your gems lib directory. This is mostly ripped from gemcutter:

if Gem::Version.new(Gem::RubyGemsVersion) >= Gem::Version.new('1.3.6')
  require File.join(File.dirname(__FILE__), 'gemwhois')
end

Next, you have to create a command. At the time of this post, here is the entirety of the whois command:

require 'rubygems/gemcutter_utilities'

class Gem::Commands::WhoisCommand < Gem::Command
  include Gem::GemcutterUtilities

  def description
    'Perform a whois lookup based on a gem name so you can see if it is available or not'
  end

  def arguments
    "GEM       name of gem"
  end

  def usage
    "#{program_name} GEM"
  end

  def initialize
    super 'whois', description
  end

  def execute
    whois get_one_gem_name
  end

  def whois(gem_name)
    response = rubygems_api_request(:get, "api/v1/gems/#{gem_name}.json") do |request|
      request.set_form_data("gem_name" => gem_name)
    end

    with_response(response) do |resp|
      json = Crack::JSON.parse(resp.body)
      puts <<-STR.unindent

        gem name: #{json['name']}
          owners: #{json['authors']}
            info: #{json['info']}
         version: #{json['version']}
       downloads: #{json['downloads']}

      STR
    end
  end

  def with_response(resp)
    case resp
    when Net::HTTPSuccess
      block_given? ? yield(resp) : say(resp.body)
    else
      if resp.body == 'This rubygem could not be found.'
        puts '','Gem not found. It will be mine. Oh yes. It will be mine. *sinister laugh*',''
      else
        say resp.body
      end
    end
  end
end

The important part is inheriting from Gem::Command. Be sure to require 'rubygems/command_manager' at some point as well. Once you have the rubygems_plugin file and a command created, you simple register the command:

Gem::CommandManager.instance.register_command(:whois)

The comments and code in RubyGems itself is pretty helpful if you are curious about what you can do.

Testing

The trickier part was testing the command. Obviously, building the gem from gemspec and installing over and over does not a happy tester make. I did a bit of research and found the following testing output helpers and the unindent gem:

module Helpers
  module Output
    def assert_output(expected, &block)
      keep_stdout do |stdout|
        block.call
        if expected.is_a?(Regexp)
          assert_match expected, stdout.string
        else
          assert_equal expected.to_s, stdout.string
        end
      end
    end

    def keep_stdout(&block)
      begin
        orig_stream, $stdout = $stdout, StringIO.new
        block.call($stdout)
      ensure
        s, $stdout = $stdout.string, orig_stream
        s
      end
    end
  end
end

With these little helpers, it was quite easy to setup the command and run it in an automated way:

require 'helper'

class TestGemwhois < Test::Unit::TestCase
  context 'Whois for found gem' do
    setup do
      @gem = 'httparty'
      stub_gem(@gem)
      @command = Gem::Commands::WhoisCommand.new
      @command.handle_options([@gem])
    end

    should "work" do
      output = <<-STR.unindent

        gem name: httparty
          owners: John Nunemaker, Sandro Turriate
            info: Makes http fun! Also, makes consuming restful web services dead easy.
         version: 0.5.2
       downloads: 40707

      STR
      assert_output(output) { @command.execute }
    end
  end
  
  context "Whois for missing gem" do
    setup do
      @gem = 'missing'
      stub_gem(@gem, :status => ["404", "Not Found"])
      @command = Gem::Commands::WhoisCommand.new
      @command.handle_options([@gem])
    end

    should "work" do
      output = <<-STR.unindent

        Gem not found. It will be mine. Oh yes. It will be mine. *sinister laugh*

      STR
      assert_output(output) { @command.execute }
    end
  end
end

The only other piece of the puzzle was using FakeWeb to stub the http responses for the found and missing gems. You can see more on that in the test helper file.

Conclusion

At any rate, the gem is pretty tiny and possibly useless to others, but it was fun. Gave me a chance to play around with testing STDOUT and creating RubyGem commands. Plus, now I know if the gem name I want is available in just a few keystrokes.


more »

Double Shot #663 »

Created at: 08.03.2010 13:55, source: A Fresh Cup, tagged: Double Shot git heroku highlight ruby testing

I'll just be over here in the corner, trying to write some code.


more »

Easy rcov with Test::Unit and Rails 2 »

Created at: 01.02.2010 14:23, source: A Fresh Cup, tagged: Rails Articles rails rcov testing

Yesterday, I needed test coverage numbers for an application using Rails 2.3 and plain old Test::Unit (still my test framework of choice). This proved to be a good deal more difficult than I expected: there are so many forks of the original rcov code, and so much disparate advice on how to get it hooked up, that I spent a great deal of unprofitable time reading before I got the right combination. Hopefully this short list will save you from doing the same:

  1. Install the relevance-rcov gem. This is the stable and maintained fork.
  2. Install the commondream rcov_plugin. This has the best set of working rake tasks I've found.
  3. Add coverage/* and test/coverage/* to your .gitignore.
  4. Run rake test:coverage to generate coverage results.
  5. Open /coverage/index.html in your project to view the results.


more »

Double Shot #619 »

Created at: 05.01.2010 14:40, source: A Fresh Cup, tagged: Double Shot cromwell firebug rubinius sphinx stencil testing

Looks like the next RailsBridge BugMash is going to focus on Rails 3, just before the beta. This could be interesting.

  • Lemon - Yet another ruby unit test framework, this one focused on making sure there's a test case for every method.
  • Reincarnate - How to make a ruby class inherit from itself.
  • Sphinx Monitoring Plugin - The latest nice touch from the Scout guys.
  • Cromwell - Wrapper for ruby signal handling to allow you to easily create unkillable scripts.
  • Stencil - Free-text (ie, not assuming HTML) templating system.
  • Rubinius 1.0.0-RC2 Released - Ruby in Ruby moves along.
  • Firebug 1.6a1 - The next branch of Firebug (for Firefox 3.6/3.7) has gone live.


more »