Ruby, Concurrency, and You »

Created at: 14.10.2011 19:41, source: Engine Yard Blog, tagged: Open Source Technology 1.8 1.9 concurrency GIL implementations ironruby jruby macruby maglev MRI parallelism rubinius ruby threads

tl;dr
Ruby Implementation Concurrency Parallelism
MRI 1.8
MRI 1.9
Rubinius 1
Rubinius 2
JRuby
MacRuby
Maglev
IronRuby

A big topic in the world of Ruby this year has been how to get more out of Ruby, specifically, how to get more done in parallel. The topic of concurrency, though, is one fraught with misunderstanding. This is largely due to the complexities of not only thinking about multiple things at once, but the limitations of Ruby implementations and operating systems.

In this article, I’ll lay the groundwork for understanding the difference between concurrency and parallelism. Then, I’ll look at how a programmer experiences them.

Concurrency vs. Parallelism

This has been discussed many times, but I sometimes still have difficulty with it. Let’s first break down the definitions of these two words:

  • Concurrent: existing, happening, or done at the same time
  • Parallel: occurring or existing at the same time or in a simple way

Hmm, ok. Well, that hasn’t improved our thinking about these two topics. We need to dig deeper into how the world of computing applies to these words. Rather than looking at the abstract, let’s instead consider some real world examples.

A “Real World” Example

Let’s say you’ve sat down for the evening to complete tomorrow’s homework. This evening you’ve got both Math and History worksheets to fill out. Tonight for some reason, you decide to do one problem in Math, then one problem in History, then back to Math, etc until all the problems are done.

In the parlance of computing, you’re now doing your Math and History worksheets concurrently. This is because your Current task list includes 2 items: Math worksheet and History worksheet.

Now, clearly you the reader can see a problem here. By switching back and forth, completing your homework will probably take longer than if you did the complete Math worksheet then did the History worksheet. In other words, if you did the worksheets in serial.

So, if concurrent means “having multiple outstanding tasks at once”, then what is parallel? Parallel is the ability to make progress on multiple tasks simultaneously.

Let’s say you’ve been asked to read the book One O’Clock Jump by Lise McClendon. You also need to drive down to San Diego for Comic-Con. Thankfully you find that One O’Clock Jump is available on audiobook!

You can now listen to the book while driving. You’re simultaneously making progress on two separate tasks. This is the equivalent of parallelism in computing.

I hope that these real world examples help illustrate the difference between concurrency and parallelism. Now let's apply this newfound knowledge to Ruby.

Back to Ruby

One reason this problem can be difficult to understand is because Ruby only provides a single mechanism for concurrency. But, whether or not these Threads are parallel depends on a number of factors.

MRI 1.8

Let’s look at MRI 1.8 (and MRI forks such as REE) to begin with, because it has the simplest model. MRI 1.8 uses a technique known as “green threads” to implement Threads. This means that every once in a while (around 100 milliseconds), the program says “oh, I should let another thread run now!” This saves the current info into the current thread and restores another thread. This is exactly like our homework example above. We can have as many things as we’d like in our task list, but we can only make progress on one of them at a time.

There is a wrinkle in the concurrency/parallelism game that I haven’t mentioned before now. This wrinkle is IO, namely how Threads interact when waiting for some external event. MRI 1.8.7 is quite smart, and knows that when a Thread is waiting for some external event (such as a browser to send an HTTP request), the Thread can be put to sleep and be woken up when data is detected. This simple consolation improves the usage of Threads so much that for a very long time the MRI 1.8.7 model was good enough for all Ruby programs.

MRI 1.9

Switching back to Ruby implementations, let’s look at MRI 1.9. As has been previously reported, MRI 1.9 removes the “green threads” we had in MRI 1.8 and uses native threads to implement the Thread class. Now, what are these “native threads”? These are are units of concurrency that the underlying operating system is aware of. A big reason to switch to use native threads is that it vastly simplifies the implementation of Threading. The operating system handles the low level parts of saving and restoring Thread information in a completely transparent way. Additionally, letting the OS know what parts of a program should be concurrent allows it to use the full resources of the computer to make that happen. In this modern world, that means using multiple cores.

Up until now, all we’ve talked about with Ruby’s Threading model was about concurrency, the ability to have multiple outstanding tasks at once. Now when we add in the idea of multiple cores, we can finally talk about parallelism. When a computer includes multiple cores (which is pretty much every computer now), those cores can run different code simultaneously, providing true parallelism. When a computer only has one core, there is no true parallelism, instead there is just simple concurrency, even at the OS level. The OS manages all the processes and threads in the system the same way you handled your Math and History worksheets, doing one for a little while, then grabbing another one.

Back to multiple cores though. Now that there is the opportunity to run things truly in parallel, we have to look at if Ruby can take advantage of that. Since MRI 1.9 uses OS threads, it can actually spread out your Ruby Threads to multiple cores!

Unfortunately, MRI 1.9 prevents the Ruby code itself from running in parallel by requiring that any thread running Ruby code hold a lock. This lock is commonly knows as the GIL (Global Interpreter Lock) or GVL (Global VM Lock).

There are a few reasons the GIL to exists, but for this discussion we will say that it’s because the non-Ruby parts of MRI 1.9 are not thread-safe. This means if data were manipulated by multiple threads at the same time, the data could become corrupt. The important thing for this post is how it applies to parallelism: the GIL inhibits parallelism within Ruby code.

MRI 1.9 uses the same technique as MRI 1.8 to improve the situation, namely the GIL is released if a Thread is waiting on an external event (normally IO) which improves responsiveness. MRI 1.9 also includes an experimental API that C extensions can use to run some C code without the GIL locked to utilize parallelism. This API is very restrictive though because no Ruby object may be accessed in any way while the GIL is not held by the current thread.

That about sums up the situation with MRI 1.8 and 1.9 with regards to concurrency and parallelism. Both provide concurrency of Ruby code, but neither provide parallelism of Ruby code.

Rubinius

Let’s take a quick look at other Ruby implementations where things are a bit different than MRI. I’ll start with Rubinius, since it’s the one I’m most familiar with. Rubinius 1.x also had a GIL and worked pretty much the same as MRI 1.9. With the upcoming 2.0 release though, the GIL will be removed, allowing Ruby code to run fully concurrent and fully parallel. We think this opens up a lot of uses for Ruby (parallel algorithms, etc) that Rubinius couldn’t handle well previously.

JRuby

JRuby layers the Thread class on top of Java’s thread class, so the threading model is whatever the JVM supports. That being said, OpenJDK is the primary JVM; it puts a Java thread directly onto an OS thread with no GIL. Thusly, JRuby almost always has full concurrency and parallelism available to it.

MacRuby

MacRuby also uses Cocoa’s NSThread as its abstraction, which runs without a GIL. So, this is another fully parallel implementation.

Maglev

Maglev runs directly on top of a Smalltalk VM and thusly layers the Thread class on top of a concept called Smalltalk Processes. In this case, the GemStone VM implements Processes in the same way as MRI 1.8, namely via “green threads” that don’t expose concurrency to the OS, and therefore, have no parallelism.

IronRuby

Lastly, IronRuby layers Thread directly on top of CLR’s threads without a GIL.

Conclusion

I hope that this helps to clear up what concurrency and parallelism are and how the different Ruby implementations address them. Having this understanding is critical for discussing and understanding topics such and thread-safety of libraries and performance of applications.

In future posts, we’ll look to build on this knowledge to help you make the best use of Ruby!


more »

Concurrency, Real and Imagined, in MRI; Threads »

Created at: 11.08.2010 12:00, source: Engine Yard Blog, tagged: Technology MRI

In computer science, concurrency is a property of systems in which several computations are executing simultaneously, and potentially interacting with each other. The computations may be executing on multiple cores in the same chip, preemptively time-shared threads on the same processor, or executed on physically separated processors.
-- Wikipedia Concurrency article Simply put, concurrency is when you have more than one logical thread of execution occurring simultaneously, or at least appearing to occur simultaneously. When you write software that makes use of concurrency, you want your software to do two or more things at once. The motivations for using concurrency are varied. Sometimes you may have architectural reasons for using concurrency -- your code makes more sense to you or is easier to write if you conceive it in more than one discretely executing unit. In other cases you may want to employ concurrency in order to make better use of the multiple cores that many modern computers have, enabling you to get better total throughput out of your code than you would have from a non-concurrent implementation. Whatever the motivation for employing concurrency, the reality is that concurrency is a complex subject. There are many different ways to achieve concurrency in software, and they each have their own set of tradeoffs. Furthermore, if your platform is Ruby, your decisions about what kind of concurrency to employ will be influenced by the specific Ruby implementation you are targeting. Each provides a different set of concurrency options for you to consider. This is the first installment in a new series of articles focusing on introducing and exploring the variety of concurrency options available in the Ruby ecosystem. Advantages and disadvantages will be discussed for each, and I'll leave you with a few examples of how you can leverage these different options in your code. It should be a fun subject to explore! Concurrency is all about multitasking -- doing more than one thing at once. The building blocks of multitasking are processes, threads, and fibers. Each of these components is complex in itself, both because of the nuances in how they interact and can be combined, and because different platforms have variations in which capabilities they implement and in how they are implemented. Luckily, their overall description can be summarized in a useful way. Processes are independent units of execution that generally share nothing with other processes, except for resources which are intended to be shared (such as shared memory segments, shared IO resources, or memory mapped files). Processes carry a lot of state information with them and have their own address spaces. Communication between them has to be through an interprocess communication mechanism provided by the platform that the processes are running in. Processes running on the same machine will be scheduled by the kernel, which will typically use some sort of time slicing algorithm to spread CPU usage of all running processes across the available cores. Threads come in several different flavors, including kernel, user space, and green threads. On some platforms there are entities called light-weight processes that bring kernel threads into user space so they look somewhat like processes, but are less expensive. For our purposes, threads are contained within a process, and share the memory space and process state of the process with each other. Green threads differ in that they are not controlled or scheduled by the operating system. Rather, they are provided by the process itself. This has a portability advantage because it means that the threads will be available on every platform that the process can run on, and will work the same on each. The main disadvantage is that green threads, being managed by the process itself, are generally confined to sharing a single core, and are limited to the peculiarities of the process's threading implementation (which may vary substantially from the platform's own threading implementation). Regardless of the type of threading, context switching with threads is generally faster than it is with processes. Fibers are like user space threads, except the operating system doesn't handle scheduling for them. Instead, fibers must be explicitly yielded to allow other fibers to run. This can have performance advantages like the reduction of system scheduling overhead. Since multitasking with fibers is cooperative, the need to use locks on shared resources is reduced or eliminated. Programmers can also leverage fibers to their advantage with IO operations by allowing other things to run while waiting for a slow or blocking IO operation. Ruby concurrency isn't quite as simple as selecting one of the above and using it, however. In the beginning, there was just Ruby, a single implementation that everyone used. This Ruby implementation, now commonly called the Matz Ruby Implmenetation (MRI), saw a widespread usage explosion with the 1.8.x version. It's pretty old now. This is from the ftp://ruby-lang.org FTP server:
carbon:/home/ftp/pub/ruby/1.8$ ls -la | grep ruby-1.8.0
-rw-rw-r--  1 root     ftp   1979070 Aug  4  2003 ruby-1.8.0.tar.gz
So, it has been around for a while, and offers a good starting point for discussing concurrency in Ruby. MRI Ruby 1.8.x supports concurrency in a few ways. One of the first things newcomers to Ruby leap for are its threads. Depending on the language these newcomers were familiar with before arriving at Ruby, they may be in for a surprise. MRI Ruby 1.8.x provides a green thread implementation. As mentioned above, green threads do not make use of any threading system native to the platform. Instead, 1.8.x's threads are implemented within the interpreter itself. This leads to threads behaving consistently across any platform the interpreter runs on. Because they are green threads, however, they offer no advantages for CPU bound tasks. cpu_bound_threads.rb
require 'benchmark'
threads = []
thread_count = ARGV[0].to_i
iterations = ARGV[1].to_i
increment = iterations / thread_count.to_f
sum = 0

Benchmark.bm do |bm|
  bm.report do
    thread_count.times do |counter|
      threads << Thread.new do
        my_sum = 0
        queue = (1 + (increment * counter).to_i)..(0 + (increment * (counter + 1)).to_i)
        queue.each do |x|
          my_sum += x
        end
        Thread.current[:sum] = my_sum
      end
    end

    threads.each {|thread| thread.join; sum += thread[:sum]}

    puts "The sum of #{iterations} is #{sum}"

  end
end
This is a simple program that takes a large range of numbers, divides them into smaller ranges, and hands each smaller range to a thread that calculates the sum of the range it was given. The results from each individual thread are then added together to arrive at a final answer. All examples ran on an 8 core Linux machine. The numbers below are an average of the results of 100 runs for each set of inputs.
Threads
Iterations 50000 500000 5000000
1 0.01730298 0.17149276 1.70610744
2 0.01724724 0.17179465 1.70557474
4 0.01729293 0.17181384 1.70570264
8 0.01741591 0.17210276 1.71201153
As demonstrated by the numbers, MRI 1.8 threads are absolutely no help at all for a CPU bound application. In fact, there is a small but measurable cost to the overhead of managing them that is apparent in the numbers. As thread count increased, timing consistently and measurably slowed. If you are an MRI 1.8 user, do not despair; threads are but one concurrency option available to you. An option that will better serve you for CPU bound tasks is process based concurrency. The idea is simple. In order to leverage multiple cores/CPUs, just create more than one process to handle the work load. Ruby provides a fork() method call which, on platforms that support it using the underlying fork() call from the C standard library. This will create a new process, with a new process ID, that can be considered an exact copy of the parent process, except that its resource allocations will be reset to 0. Since processes do not share memory spaces, you must utilize another system provided communication mechanism in order to pass work to or from processes; this avoids the potential pitfalls that arise when trying to correctly manage locks on shared resources, but it does force one to think more specifically about exactly how to achieve communication. cpu_bound_processes.rb
require 'benchmark'
processes = []
process_count = ARGV[0].to_i
iterations = ARGV[1].to_i
increment = iterations / process_count.to_f
sum = 0

def in_subprocess
  from_subprocess, to_parent = IO.pipe

  pid = fork do
    from_subprocess.close
    r = yield
    to_parent.puts [Marshal.dump(r)].pack("m")
    exit!
  end

  to_parent.close
  [pid,from_subprocess]
end

def get_result_from_subprocess(pid, from_subprocess)
  r = from_subprocess.read
  from_subprocess.close
  Process.waitpid(pid)
  Marshal.load(r.unpack("m")[0])
end

Benchmark.bm do |bm|
  bm.report do
    process_count.times do |counter|
      processes << in_subprocess do
        my_sum = 0
        queue = (1 + (increment * counter).to_i)..(0 + (increment * (counter + 1)).to_i)
        queue.each do |x|
          my_sum += x
        end
        my_sum
      end
    end

   processes.each {|process| sum += get_result_from_subprocess(*process)}

   puts "The sum of #{iterations} is #{sum}"

  end
end
In this example I used IO pipes to send data from the master process to the children, and to receive data from the children, back into the master. As earlier, testing was done on an 8 core linux machine, with 100 runs of each test. The program is equivalent to the threaded version, and was changed only as necessary to enable it to be used in a multiprocess model instead of a multithread model.
Worker Processes
Iterations 50000 500000 5000000
1 0.01805432 0.17199047 1.70812685
2 0.0098329 0.08675517 0.85509328
4 0.00609409 0.0446612 0.43100698
8 0.00847991 0.05346145 0.25621009
Take a good look at these numbers. Everything moves in the correct direction, until you get to the 8 process column. Then timing slows for both the 50000 and 500000 iteration rows that are under the 4 process column. Do you have any theories as to why? Processes are, in many ways, a great way to handle concurrency. One of their drawbacks, though, is that they are heavy structures. They can take up significant time and resources to create . Linux uses copy-on-write semantics when creating forked processes. This means it doesn't actually duplicate the address space of the forked process until pages in that space start changing. Then, it duplicates what changes. This means that forked processes on Linux can be created fairly quickly. However, MRI 1.8 is not very friendly to copy-on-write semantics. If you are unfamiliar with the way memory is managed and garbage is collected in MRI 1.8, you should check out my article on MRI Memory Allocation. One key aspect is that objects carry all of their status bits with them. This means that when the garbage collector scans the object space to find objects it can collect, it touches every object in the address space. For a process forked with copy-on-write semantics, this forces the kernel to make copies of all of those pages. This takes time, and largely negates the fast-creation benefit of copy-on-write forked processes. The times for the lower iterations on the 8 thread test reveal a cost to this form of concurrency. The overhead associated with creating the forked processes overwhelms the performance gains from the division of labor when the work to be done is brief enough. This is a reality for any form of concurrency -- there is always a performance tax from some amount of overhead. That tax is just higher when spawning something heavy like a process. Keep this in mind when you explore concurrency options for your task. These first two examples both represent CPU bound problems. Many real world problems are not CPU bound, though. Rather, they are IO bound issues. Because an IO bound problem has latencies imposed on it by something outside of the program itself, IO bound problems can provide an excellent case for using MRI 1.8's green threads to improve performance. io_bound_threads.rb
require 'net/http'
require 'thread'
require 'benchmark'

def get_data(url)
  tries = 0
  response = nil
  if /^http/.match(url)
    m = /^http:\/\/([^\/]*)(.*)/.match(url)
    site = m[1]
    path = m[2]
    begin
      http = Net::HTTP.new(site)
      http.open_timeout = 30
      http.start {|h| response = h.get(path)}
    rescue Exception
      tries += 1
      retry if tries < 5
    end
  end
  response.kind_of?(Array) ? response[1] : response.respond_to?(:body) ? response.body : ''
end

mutex = Mutex.new
signal = ConditionVariable.new
thread_count = ARGV[0].to_i
fetches = ARGV[1].to_i
url = ARGV[2]
threads = []
count = 0
active_threads = 0

Benchmark.bm do |bm|
  bm.report do
    while count < fetches
      while count < fetches && active_threads < thread_count
        mutex.synchronize do
          active_threads += 1
          count += 1
        end
        Thread.new do
          get_data(url)
          mutex.synchronize do
            active_threads -= 1
            threads << Thread.current
            signal.signal
          end
        end
      end

      mutex.synchronize do
        signal.wait(mutex)
      end
      while th = threads.shift
        th.join
      end
    end
  end
end
This script makes many HTTP requests. For simplicity's sake, lets say it just makes the same request over and over again, but could easily be expanded to take a list of URLs, and to do something useful with the returned data. The script uses threads much like the CPU bound example, except that it is a bit more sophisticated in how it counts the work it has assigned to generated threads, and how it waits for all the threads to be completed. This table shows timing from it in action. The target URL used was not local to the testing machine. Each run used the indicated number of threads to gather the URL, either a "fast" URL, with an over-the-net response speed of about 35 requests per second, or a "slow" URL with an over-the-net response speed of about 3 requests per second, 400 times. There were 100 runs completed. The numbers below are an average from those runs.
Worker Threads
Request speed 35/second 3/second
1 6.53462668 61.1016239
2 3.34861606 30.4514539
5 1.38942396 12.1620945
10 0.72804622 6.0968646
20 0.47964698 3.0411382
Just a glance at these numbers clearly shows that Ruby threads are a big help with an IO bound activity like this. The relationship between number of threads and reduction in time to complete the task is not linear; but even with up to 20 threads there is a significant benefit to additional numbers of threads. The benefit is more linear, and evident for slower requests because the requests spend more time waiting on IO, and less on CPU bound activities. There are some caveats to be aware of with regard to Ruby threads. First, even though they are green threads, as soon as one starts sharing resources between threads, threading becomes something that can be hard to get right. Share as little as possible, thoroughly think through your code, and use tests to support your reasoning, because threading problems can be hard to diagnose and solve. Second, MRI 1.8 has a limit on the number of threads that it will manage. As a consequence of how the internals are implemented, this means that on most systems (notably excluding win32 systems), total thread count is limited to 1024. Also, because of the way it is implemented, the overhead increases to manage a larger number of threads versus smaller. Each thread consumes a significant amount of memory, so do not go crazy with threads or it will backfire on you. Third, because of the way that Ruby threading is implemented, it is possible for a C extension to Ruby to take control of the process and prevent Ruby from allowing context switches to other threads. It is possible to write extensions so that they do not do this, but many are not written in this way. Where this bites most people, is with code that interacts with a database. One can reasonably look at a database query as an IO bound activity -- all the Ruby process is really doing is sending a request to the DB and waiting for a response. However, most DB interaction libraries are implemented as C extensions, and some of them do not play well with Ruby threads. One of the most common offenders is Mysql-Ruby. It will block all of Ruby while waiting for the result from a long running query. This means that a long running query will block the whole process until it returns. On the other hand, Ruby-PG, the driver for Postgres, will context switch within pgconn_block(), the function that makes blocking calls to the database, thus permitting other MRI 1.8 threads to run even during a long running query. Fourth, because MRI 1.8 threads are green threads, they all run inside the context of a single process and a single system thread. Thus, while they give the appearance of concurrency, there is actually only one thread running at once. This is okay, because it is the appearance of concurrency that matters. If you run top on your laptop or VM shell, you will see a large number of processes running on your system. This number will exceed the number of cores that you have by a large margin, but you rarely have to worry about which processes are actually running on one of the cores at any given time. Your kernel takes care of slicing up access to the CPU into fine enough grains that it appears that all the running processes are executing on a core at the same time (even though most of them probably are not actually running at any given time). Concurrency in computing doesn't strictly mean that two or more things are actually running at the same time. Rather, it means that there is an appearance that they are, and that one works with them on the assumption that they are, and lets the underlying scheduler deal with making reality fit that appearance. An entire book could be written about concurrency in Ruby. I've just scratched the surface with this overview of process and thread based concurrency in Ruby. Hopefully this helped answer a few questions or suggested some techniques to consider. Future installments in this series will cover Ruby 1.9.x (which uses system threads as opposed to green threads), JRuby, Rubinius, and using event systems like EventMachine to handle concurrency. So stay tuned! There is a lot more coming soon!


more »

MRI Memory Allocation, A Primer For Developers »

Created at: 04.05.2010 11:00, source: Engine Yard Blog, tagged: Technology 1.8.6 1.8.7 C memprof MRI ruby

Memory allocation in the MRI 1.8.x series of Ruby is seen by many developers to be a black box. A developer writes code and the interpreter just does some magic to make sure that the memory for the code is allocated, and more importantly, eventually garbage collected. You don’t have to think about, it or even care about it all that much.

And generally… that attitude is a productive one. The less you have to actively worry about the little details—like memory management—the more you can concentrate on the parts of the code that do the actual work. At the same time, though, a developer who remains ignorant of what’s going on under the covers does so at his or her own peril.

It’s very useful to have a general understanding of the mechanics involved, as they can sometimes steer you towards making better design choices in applications where your memory footprint matters; it’s also very useful if things start going wrong with the memory footprint of the code. If your carefully built Rails application works a little bit like Mr Creosote, repeatedly misbehaving until it blows up, you need to have a basic understanding of what’s going on with memory management in the interpreter.

There are two types of memory allocations that occur in MRI 1.8.x. First, objects are allocated on a heap, which is really just a collection of slots that Ruby uses to store information about an object. When Ruby runs out of slots, and it can’t free up any slots by running a garbage collection cycle, it will allocate a new heap for additional space.

The second type of allocation is when Ruby allocates memory off of the C heap to provide storage for the actual data contained within an object. This second type of storage is the most direct, and is the easiest to understand:

foo = 'x' * (1024 * 1024 * 10)

What actually happens there is that Ruby uses a slot out of its heap to store a String. The String implementation allocates, via a function called xmalloc(), enough memory to hold that x. xmalloc() is actually an alias, setup via a #define in the defines.h file.

#define xmalloc ruby_xmalloc
#define xcalloc ruby_xcalloc
#define xrealloc ruby_xrealloc
#define xfree ruby_xfree
 
void xmalloc _((long));
void *xcalloc _((long,long));
void *xrealloc _((void,long));
void xfree _((void*));

It does some error checking and runs a garbage collection cycle if allocations have exceeded a hard coded threshold (8000000 bytes), or if an allocation fails (meaning that the system lacks the RAM to fulfill the allocation request).

Then the String#* method creates a new String object (using another slot on the Ruby heap), calculating the size of the buffer by multiplying its own length (1 byte) by the number of repetitions (10,485,760). This buffer is allocated, as before, via xmalloc().

You see that allocation as an immediate increase in RSS.

Try it in irb. Here’s a ps line immediately after starting irb (using Ruby 1.8.7 on an OS X laptop):

wyhaines 35539 0.0 0.1 602836 3060 s007 S+ 9:21AM 0:00.02 irb

And here’s the same thing on a Linux instance:

wyhaines 20720 1.0 0.1 18360 3956 pts/1 S+ 06:49 0:00 irb

I execute the following line in irb:

foo = 'x' * (1024 * 1024 * 10); nil

And here’s the ps output for that process immediately afterwards:

wyhaines 35539 0.0 0.3 613080 13332 s007 S+ 9:21AM 0:00.11 irb # OSX 

wyhaines 20800 1.4 0.6 28604 14212 pts/1 S+ 06:51 0:00 irb # Linux

You can see that the jump in RSS is directly tied to the amount of data that needed to be stored, which is expected, given that the memory was directly allocated in the String implementation. Any class implementation that has to allocate space for its own data storage will behave similarly. Some may use the xalloc function from Ruby, while others may make use of malloc or related functions directly, or may have their own xalloc-like function.

This type of allocation is easy to understand because it’s expected. When an object that needs to hold 10Mb of data is created, there will be an allocation of 10Mb to store it. It does get a little more tricky when dealing with deallocation, since that should not happen until the object is garbage collected by Ruby, and unless you explicitly invoke a GC collection cycle, you can’t really know when it is going to happen. Also, classes implemented in C or C++ can sometimes have bugs with deallocation, leading to RAM being left unexpectedly allocated. MRI Ruby’s own Array#shift method once had a bug of this nature in it.

However, because this sort of allocation comes directly out of the C heap, when a deallocation occurs, you should immediately see it in your process size.

irb(main):002:0> foo = nil
=> nil
irb(main):003:0> GC.start
=> nil
irb(main):004:0>

A ps shows what happened:

wyhaines 39596 0.0 0.1 602836 3092 s011 S+ 10:04AM 0:00.13 irb # OSX

wyhaines 20800 0.0 0.1 18360 3968 pts/1 S+ 06:51 0:00 irb # Linux

The more tricky to understand allocation type is Ruby’s management of its own heap space. Ruby maintains a series of heaps, which are just presized collections of RVALUE structures referred to as slots. Each slot is a little table (an RVALUE) that’s used for keeping track of fundamental object data. On the MRI 1.8.x Ruby, a slot is about 20 bytes in size for a 32 bit build. I added a little instrumentation to a Ruby instance to show this:

wyhaines$ /usr/local/rubyxxx/bin/irb
size of pointer to a heap: 12
length of the array that contains pointers to heaps: 10
  total size of the array of heap pointers: 120
Allocating heap of 10001 slots, each of 20 bytes
  malloc(200020)
Allocating heap of 18001 slots, each of 20 bytes
  malloc(360020)

On my 64 bit Linux instance, each RVALUE is 40 bytes, doubling the size of the Ruby heap.

In general conversation, when talking about Ruby’s heap, we think of it as one big scratch space for storing object data. However, it’s actually represented by a list of pointers to a collection of smaller spaces. Each of these individual spaces is a heap, and all of them together represent the process’s total heap space.

By default, Ruby allocates a heap big enough to store 10000+1 slots on startup. After that first allocation, the number allocated on subsequent allocations is increased by a factor of 1.8 over the previous allocation. So the second heap allocation is for 18000+1 heap slots. The third is for 32400+1, and so on.

The theory is that as RAM usage grows, the likelyhood of needing even more RAM increases, so allocating ever larger buckets hedges against needing to do a new allocation. As you can see in the above example, the initial chunk of 10k buckets isn’t sufficient for running irb, so Ruby ends up allocating a second chunk of 10000 * 1.8 + 1 == 18001 object slots in the next chunk of heap.

The RVALUEs in the Ruby heap are a linked list. Ruby allocates space for them with a simple malloc:

RUBY_CRITICAL(p = (RVALUE)malloc(sizeof(RVALUE)(heap_slots+1)));

What that really does is to ask malloc to allocte a buffer that’s the size of an RVALUE multipled by heap_slots+1, and then cast the pointer that malloc returns to an RVALUE pointer. There is some additional code to deal with error conditions. If malloc can not allocate the space, Ruby will set heap_slots = HEAP_SLOTS_MIN, which is normally hard coded to 10000, and then try again. If it fails again, it throws an error.

Once the space is allocated, Ruby does some housekeeping to make sure it stores the pointer in this new heap, and to increase the size of heap_slots for the next allocation, then it needs to go through the new heap space and to initialize the RVALUE structures.

while (p as.free.flags = 0;
  p->as.free.next = freelist;
  freelist = p;
  p++;
}

Even if you don’t know C, you can probably figure out what’s happening there. It’s walking through each allocated struct, setting flags to 0 and establishing the linked list structure, with each slot pointing to the next one in the list. In doing so, it touches all of the heap space it just malloc’d. This has the side effect of forcing all of those pages into the resident memory of the process.

You can see that in operation with irb. To refresh your memory, here are a couple ps lines, from OSX, and Linux, for an irb process that has just been started:

wyhaines 36996 0.0 0.1 602836 3060 s007 S+ 9:40AM 0:00.03 irb # OSX

wyhaines 20080 1.0 0.1 18360 3956 pts/1 S+ 07:48 0:00 irb # Linux

Remember that just starting IRB creates a bunch of objects. It will already have gone through a couple heap allocations. So, we want to trigger a third. We also want to try to get close to catching it in action. So, in IRB, do this:

a = []; 9000.times {a << ''} # OSX w/ ruby 1.8.7 (2009-06-12 patchlevel 174) [i686-darwin9]

a = []; 5000.times {a << ''} # Linux w/ ruby 1.8.7 (2010-01-10 patchlevel 249) [x86_64-linux]

There’s no magic there. I just did some trial and error experiments to figure out how many objects I needed to create to be close to the threshold of a new allocation. This number will vary some, depending on which Ruby you are using. A ps of the process will now look something like this:

wyhaines 38195 0.0 0.1 602876 3132 s007 S+ 9:56AM 0:00.03 irb # OSX

wyhaines 20379 0.0 0.1 18476 4028 pts/1 S+ 08:05 0:00 irb # Linux

RAM usage has grown a tiny bit, basically to accomodate the individual in-object allocations that happened when creating a whole bunch of tiny objects, but there have been no allocations of significant chunks of memory.

Now, go back to irb, and do this:

1000.times {a << ''}

When you look at ps again:

wyhaines 38195 0.0 0.1 603528 3776 s007 S+ 9:56AM 0:00.04 irb # OSX

root 20379 0.0 0.2 19744 5300 pts/1 S+ 08:05 0:00 irb # Linux

It jumped by quite a big chunk. Doing some quick math, this third heap allocation would be for 32401 heap slots (18000 * 1.8 + 1). If each heap slot is 20 bytes, then (32401 * 20) == 648020 bytes needed. That looks pretty spot on for the RSS size bump that we observed with OSX. For the Linux system, it was already established that each RVALUE takes 40 bytes, so (32401 * 40) == 1296040 bytes, which also is a match for the jump that is seen.

As more objects are created by your code, more heap slots will be used. Ruby does reuse heap slots when object are garbage collected, and if all of the slots in a section of heap are freed, Ruby will free the entire section, but in most code that’s pretty unlikely, meaning that the typical expectation is that when heap is allocated, it’s going to stay allocated.

With the 1.8 scaling factor that’s in MRI, here’s a table to show you how much memory is allocated just for the heap as object counts increase:

Threshold # of Slots RAM w/ 20 byte RVALUEs RAM w/ 40 byte RVALUEs
1000010001200020400040
2800018001360020720040
60400324016480201296040
1187205832111664202332840
22369610497720995404199080
41265218895737791407558280
752772340121680242013604840
13649886122171224434024488680
246697611019892203978044079560
445055419835793967158079343160

As you can see, while those first few allocations are pretty small, they get large fast. With an RVALUE size of 20 bytes, the 10th allocation is about 38Mb, and if the RVALUE size is 40 bytes, that’s about 76Mb.

When talking about Ruby memory allocations, that’s about all that there is to it. However, allocations without deallocations eventually make a developer sad. With Ruby, there’s no way to specifically deallocate an object. Deallocations are the job of the garbage collection system.

MRI Ruby implements a conservative mark and sweep garbage collector. This means that it operates by walking through memory, marking every object that it can find which is accessible at the current point of execution. After it finishes marking everything, it takes a second pass, collecting all of the marked objects.

Garbage collection can be invoked manually, via GC.start, but is typically invoked by Ruby. All of the garbage collection triggers are connected to the allocation behaviors of Ruby, and there are two mechanisms to be aware of, as a developer:

First, as I referred to near the beginning of the article, when the ruby_xmalloc() function runs, it looks at the total size of allocations from the C heap, and if they exceed a hard coded threshold (which defaults to 8000000 bytes), it will trigger a GC cycle. This means that if you have code which does a lot of C heap allocations, or does large C heap allocations, you’ll be triggering garbage collection often.

The other main trigger occurs when a new object is created. Remember that each slot in the Ruby heap is used to store data about a single object. So when a new Ruby object is created, a slot on the heap is necessary.

Ruby maintains a linked list of all unused slots in its heaps. This list is called the freelist. rb_newobj(), in gc.c, creates new objects. However, it first checks to see if there’s anything left in the freelist. If there isn’t, it will first invoke garbage_collect().

The garbage collection code will attempt to add some slots to the freelist by collecting and deallocating unused objects. If it fails, meaning that every object currently allocated on the heap is deemed to be in use, it will call add_heap(), with the effects we discussed earlier.

This second type of trigger is a very common cause of large processes. Imagine you have some code that queries a database with a query that pulls a large number of records. Maybe it does something cool, like pulling two sets of records, and then uses Ruby’s set facilities to get a union of the two sets. It’s all very slick, and works just fine. But then you notice that when the code runs, your process size immediately jumps by many megabytes, and it never goes down. What ’s happened is that your queries created a very large number of temporary objects, and they exceeded the available space in the Ruby heap, so a new heap allocation was performed.

If all that went into this new heap were those temporary objects, then once they were garbage collected, the new heap could be deallocated after it was emptied. But remember what I said earlier: it only removes heap spaces if they’re empty. So any new, longer lived object in that heap will anchor the whole thing into your process forever.

This behavior is important to be aware of, because it’s one of the easiest ways that a developer can inadvertently bump their Ruby process size up higher than they want. While you shouldn’t be paranoid about object creation, you also never want to create thousands or tens of thousands of temporary objects when you could’ve gotten the job done with hundreds, because in practice, those allocation thresholds are a one way street.

Also, be aware that any object with a C/C++ implementation that allocates its own memory should be deallocating that memory when the object is garbage collected. Every C/C++ extension should define a *_free function, which will be called when the object is garbage collected, and which is responsible for freeing any allocations that took place inside the extension’s code.

Memory management in C is an easy place for a programmer to make errors though, so if your code is using an extension, and you’re seeing strange memory behavior, it’s usually a good idea to double check it. At least make sure that you are on the latest version, and that there are no known memory management related bugs with it.

The original outline of this article was actually written as a response to a customer’s trouble ticket here at Engine Yard. He was seeing a large jump in the RSS size of his processes after they ran for a while, and we were trying to figure it out. The customer was seeing a sudden jump of about 43Mb in a long running process.

At the time, it was difficult to really pin the cause down. A sudden jump, when not doing anything extraordinary, fits the MO of a Ruby heap allocation, and the 10th allocation, if you refer to the table above, is almost that large—but any real serious debugging was going to require substantial work.

This has changed some in the last few months. Don’t misunderstand; it’s still a lot of work if you have to try to understand, in depth, the memory allocation/deallocation behavior of a complex piece of Ruby code, but now, Joe Damato and Aman Gupta have brought us memprof.

The next time you’re trying to understand why your program’s RAM usage is doing something that seems strange, arm yourself with the background knowledge from this post, then go grab memprof.

It’ll give you detailed information about your process’s memory behavior, allocations, deallocations, and in depth details about all of the objects currently in your Ruby process’s heap. It will give you all of the details to follow exactly what’s happening inside that black box of allocation and deallocation, and given my personal experience in looking for the source of memory leaks and strange memory behavior, it can turn an all day job into a job that takes a half an hour.

Understanding the basics of your Ruby implementation’s memory management isn’t necessary to write Ruby code, but it’s a good idea if you’re writing and deploying substantial pieces of software. So, dig in and enjoy! They basics aren’t too hard to understand. As always, happy to help answer questions here!


more »