Enabling BIOS options on a live server with no rebooting »

Created at: 06.07.2009 16:00, source: time to bleed by Joe Damato, tagged: linux scaling systems x86 BIOS kernel performance x86_64


This blog post is going to describe a C program that toggles some CPU and chipset registers directly to enable Direct Cache Access without needing a reboot or a switch in the BIOS. A very fun hack to write and investigate.

Special thanks…

Special thanks going out to Roman Nurik for helping me make the code CSS much, much prettier and easier to read.

Special thanks going out to Jake Douglas for convincing me that I shouldn’t use a stupid sensationalist title for this blog article :)

Intel I/OAT and Direct Cache Access (DCA)

From the Linux Foundation I/OAT project page1:

I/OAT (I/O Acceleration Technology) is the name for a collection of techniques by Intel to improve network throughput. The most significant of these is the DMA engine. The DMA engine is meant to offload from the CPU the copying of [socket buffer] data to the user buffer. This is not a zero-copy receive, but does allow the CPU to do other work while the copy operations are performed by the DMA engine.

Cool! So by using I/OAT the network stack in the Linux kernel can offload copy operations to increase throughput. I/OAT also includes a feature called Direct Cache Access (DCA) which can deliver data directly into processor caches. This is particularly cool because when a network interrupt arrives and data is copied to system memory, the CPU which will access this data will not cause a cache-miss on the CPU because DCA has already put the data it needs in the cache. Sick.

Measurements from the Linux Foundation project2 indicate a 10% reduction in CPU usage, while the Myri-10G NIC website claims they’ve measured a 40% reduction in CPU usage3. For more information describing the performance benefits of DCA see this incredibly detailed paper: Direct Cache Access for High Bandwidth Network I/O.

How to get I/OAT and DCA

To get I/OAT and DCA you need a few things:

  • Intel XEON CPU(s)
  • A NIC(s) which has DCA support
  • A chipset which supports DCA
  • The ioatdma and dca Linux kernel modules
  • And last but not least, a switch in your BIOS to turn DCA on

That last item can actually be a bit more tricky than it sounds for several reasons:

  • some BIOSes don’t expose a way to turn DCA on even though it is supported by the CPU, chipset, and NIC!
  • Your hosting provider may not allow BIOS access
  • Your system might be up and running and you don’t want to reboot to enter the BIOS to enable DCA

Let’s see what you can do to coerce DCA into working on your system if one of the above applies to you.

Build ioatdma kernel module

This is pretty easy, just make menuconfig and toggle I/OAT as a module. You must build it as a module if you cannot or do not want to enable DCA in your BIOS.

The option can be found in Device Drivers -> DMA Engine Support -> Intel I/OAT DMA Support.

Toggling that option will build the ioatdma and dca modules. Build and install the new module.

Enabling DCA without a reboot or BIOS access: Hack overview

In order to enable DCA a few special registers need to be touched.

  • The DCA capability bit in the PCI Express Control Register 4 in the configuration space for the PCI bridge your NIC(s) are attached to.
  • The DCA Model Specific Register on your CPU(s)

Let’s take a closer look at each stage of the hack.

Enable DCA in PCI Configuration Space

PCI configuration space is a memory region where control registers for PCI devices live. By changing register values, you can enable/disable specific features of that PCI device. The configuration space is addressable if you know the PCI bus, device, and function bits for a specific PCI device and the feature you care about.

To find the DCA register for the Intel 5000, 5100, and 7300 chipsets, we need to consult the documentation4:


Cool, so the register needed lives at offset 0×64. To enable DCA, bit 6 needs to be set to 1.

Toggling these register can be a bit cumbersome, but luckily there is libpci which provides some simple APIs to scan for PCI devices and accessing configuration space registers.

#define INTEL_BRIDGE_DCAEN_OFFSET   0x64
#define INTEL_BRIDGE_DCAEN_BIT      6
#define PCI_HEADER_TYPE_BRIDGE     1
#define PCI_VENDOR_ID_INTEL        0x8086 /* lol @ intel */
#define PCI_HEADER_TYPE             0x0e
#define MSR_P6_DCA_CAP             0x000001f8

void check_dca(struct pci_dev *dev)
{
  /* read DCA status */
  u32 dca = pci_read_long(dev, INTEL_BRIDGE_DCAEN_OFFSET);

  /* if it's not enabled */
  if (!(dca & (1 << INTEL_BRIDGE_DCAEN_BIT))) {
    printf("DCA disabled, enabling now.\n");

    /* enable it */
    dca |= 1 << INTEL_BRIDGE_DCAEN_BIT;

    /* write it back */
    pci_write_long(dev, INTEL_BRIDGE_DCAEN_OFFSET, dca);
  } else {
    printf("DCA already enabled!\n");
  }
}

int main(void)
{
  struct pci_access *pacc;
  struct pci_dev *dev;
  u8 type;

  pacc = pci_alloc();
  pci_init(pacc);

  /* scan the PCI bus */
  pci_scan_bus(pacc);

  /* for each device */
  for (dev = pacc->devices; dev; dev=dev->next) {
    pci_fill_info(dev, PCI_FILL_IDENT | PCI_FILL_BASES);

    /* if it's an intel device */
    if (dev->vendor_id == PCI_VENDOR_ID_INTEL) {

        /* read the header byte */
        type = pci_read_byte(dev, PCI_HEADER_TYPE);

        /* if its a PCI bridge, check and enable DCA */
        if (type == PCI_HEADER_TYPE_BRIDGE) {
          check_dca(dev);
        }
    }
  }

  msr_dca_enable();
  return 0;
}

Enable DCA in the CPU MSR

A model specific register (MSR) is a control register that is provided by a CPU to enable a feature that exists on a specific CPU. In this case, we care about the DCA MSR. In order to find it’s address, let’s consult the Intel Developer’s Manual 3B5.

This register lives at offset 0×1f8. We just need to set it to 1 and we should be good to go.

Thankfully, there are device files in /dev for the MSRs of each CPU:

#define MSR_P6_DCA_CAP      0x000001f8
void msr_dca_enable(void)
{
  char msr_file_name[64];
  int fd = 0, i = 0;
  u64 data;

  /* for each CPU */
  for (;i < NUM_CPUS; i++) {
    sprintf(msr_file_name, "/dev/cpu/%d/msr", i);

    /* open the MSR device file */
    fd = open(msr_file_name, O_RDWR);
    if (fd < 0) {
      perror("open failed!");
      exit(1);
    }

    /* read the current DCA status */
    if (pread(fd, &data, sizeof(data), MSR_P6_DCA_CAP) != sizeof(data)) {
      perror("reading msr failed!");
      exit(1);
    }

    printf("got msr value: %*llx\n", 1, (unsigned long long)data);

    /* if DCA is not enabled */
    if (!(data & 1)) {

      /* enable it */
      data |= 1;

      /* write it back */
      if (pwrite(fd, &data, sizeof(data), MSR_P6_DCA_CAP) != sizeof(data)) {
        perror("writing msr failed!");
        exit(1);
      }
    } else {
      printf("msr already enabled for CPU %d\n", i);
    }
  }
}

Code for the hack is on github

Get it here: http://github.com/ice799/dca_force/tree/master

Putting it all together to get your speed boost

  1. Checkout the hack from github: git clone git://github.com/ice799/dca_force.git
  2. Build the hack: make NUM_CPUS=whatever
  3. Run it: sudo ./dca_force
  4. Load the kernel module: sudo modprobe ioatdma
  5. Check your dmesg: dmesg | tail

You should see:

[   72.782249] dca service started, version 1.8
[   72.838853] ioatdma 0000:00:08.0: setting latency timer to 64
[   72.838865] ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, device version 0x12, driver version 3.64
[   72.904027]   alloc irq_desc for 56 on cpu 0 node 0
[   72.904030]   alloc kstat_irqs on cpu 0 node 0
[   72.904039] ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X

in your dmesg.

You should NOT SEE

[    8.367333] ioatdma 0000:00:08.0: DCA is disabled in BIOS

You can now enjoy the DCA performance boost your BIOS or hosting provider didn't want you to have!

Conclusion

  • Intel I/OAT and DCA is pretty cool, and enabling it can give pretty substantial performance wins
  • Cool features are sometimes stuffed away in the BIOS
  • If you don't have access to your BIOS, you should ask you provider nicely to do it for you
  • If your BIOS doesn't have a toggle switch for the feature you need, do a BIOS update
  • If all else fails and you know what you are doing, you can sometimes pull off nasty hacks like this in userland to get what you want

Thanks for reading and don't forget to subscribe (via RSS or e-mail) and follow me on twitter.

P.S.

I know, I know. I skipped Part 2 of the signals post (here's Part 1 if you missed it). Part 2 is coming soon!

References

  1. http://www.linuxfoundation.org/en/Net:I/OAT
  2. http://www.linuxfoundation.org/en/Net:I/OAT
  3. http://www.myri.com/serve/cache/626.html
  4. Intel® 7300 Chipset Memory Controller Hub (MCH) Datasheet, Section 4.8.12.6
  5. Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2, Appendix B-19


more »

Requests Per Second »

Created at: 06.01.2009 09:12, source: The Rails Way - Home, tagged: performance scalability speed

One of the most harmful things about people discussing the performance of web applications is the key metric that we use. Requests per second seems like the obvious metric to use, but it has several insidious characteristics that poison the discourse and lead developers down ever deeper rabbit holes chasing irrelevant gains. The metric prevents us from doing A/B comparisons, or discussing potential improvements without doing some mental arithmetic which appears beyond the capabilities of most of us.

Instead of talking about requests per second, we should always be focussed on the duration of a given request. It’s what our users notice, and it’s the only thing which gives us a notice.

I should prefix the remaining discussion here by saying that most of it does not apply to discussing performance problems at the scale of facebook, google or yahoo. The thing is, statistically speaking, none of you are building applications that will operate at that scale. Sorry if I’m the one who broke this to you, but you’re not building the next google :).

I should also state that requests per second is a really interesting metric when considering the throughput of a whole cluster. But throughput isn’t performance.

Diminshing marginal returns

The biggest problem I have with requests per second is the fact that developers seem incapable of knowing when to stop optimising their applications. As the requests per second get higher and higher, the improvements become less and less relevant. This lets us think we’ve defeated that pareto guy, while we waste ever-larger amounts of our employers’ time.

Let’s take two different performance improvements and compare them using both duration and req/s.

Patch Before After Improvement
A 120 req/s 300 req/s 180 req/s
B 3000 req/s 4000 req/s 1000 req/s

As you can see, when you use req/s as your metric, change B seems like a MUCH bigger saving. It improves performance by 1000 requests a second instead of that measly 180, give that guy a raise! But let’s see what happens when we switch to using durations:

Patch Before After Improvement
A 8.33 ms 3.33 ms 5 ms
B 0.33 ms 0.25 ms 0.08 ms

You see that the actual changes in duration in B is vanishingly tiny. 8% of one millisecond! Odds are that that improvement will vanish into statistical noise when compared to the latency of your network, or your user’s internet connection.

But when we use requests per second, that 1000 is so big and enticing that developers will do almost anything to get it. If they used durations as their metric, they’d probably have spent that time implementing a neat new feature, or responding to customer feedback.

Deltas become meaningless

A special case of my first complaint is that with requests per second the deltas aren’t meaningful without knowing the start and the finish points. As I showed above, a 1000 req/s change could be a tiny change, but it could also be an amazing performance coup. Take this next example:

Before After Diff
1 req/s 1001 req/s 1000 req/s

When expressed as durations you can see that it made a huge difference

Before After Diff
1000 ms 0.99 ms 999.01 ms

So 1000 requests per second could either be irrelevant, or fantastic. Durations don’t have this problem at all. 0.02ms is obviously questionable, and 999.01 ms is an obvious improvement.

This problem most commonly expresses itself when people say “that changeset took 50 requests per second off my application”. Without the before and after numbers, we can’t tell if that’s a big deal, or if the guy needs to take a deep breath and get back to work.

The numbers don’t add up

Finally, requests per second don’t lend themselves nicely to arithmetic, and make developers make silly decisions. The most common case I see this is when comparing web servers to put in front of their rails applications. The reasoning goes something like this:

Nginx does over 9000 requests per second, and apache only does 6000 requests per second!! I’d better use nginx unless I want to pay a 3000 requests per second tax.

When people do this comparison they seem to believe that by switching to nginx from apache their application will go from 100 req/s to 3100 req/s. As always, durations tell us a different story.

Apache Nginx Diff
6000 req/s 9000 req/s 3000 req/s
0.16 ms 0.11 ms 0.05 ms

So we can see that odds are you’ll only gain a 5% of a millisecond’s improvement when switching. Perhaps that improvement is worthwhile for your application, but is it worth the additional complexity?

Conclusion

Durations are a much more useful, and more honest, metric when comparing performance changes in your applications. Requests per second is too wide-spread for us to stop using it entirely, but please don’t use it when talking about performance of your web applications or libraries.


more »