Ruby, Concurrency, and You »

Created at: 14.10.2011 19:41, source: Engine Yard Blog, tagged: Open Source Technology 1.8 1.9 concurrency GIL implementations ironruby jruby macruby maglev MRI parallelism rubinius ruby threads

tl;dr
Ruby Implementation   Concurrency   Parallelism
MRI 1.8               Yes           No
MRI 1.9               Yes           No
Rubinius 1            Yes           No
Rubinius 2            Yes           Yes
JRuby                 Yes           Yes
MacRuby               Yes           Yes
Maglev                Yes           No
IronRuby              Yes           Yes

A big topic in the world of Ruby this year has been how to get more out of Ruby, specifically, how to get more done in parallel. The topic of concurrency, though, is one fraught with misunderstanding. This is largely due not only to the complexity of thinking about multiple things at once, but also to the limitations of Ruby implementations and operating systems.

In this article, I’ll lay the groundwork for understanding the difference between concurrency and parallelism. Then, I’ll look at how a programmer experiences them.

Concurrency vs. Parallelism

This has been discussed many times, but I sometimes still have difficulty with it. Let’s first break down the definitions of these two words:

  • Concurrent: existing, happening, or done at the same time
  • Parallel: occurring or existing at the same time or in a similar way

Hmm, ok. Well, that hasn’t improved our thinking about these two topics. We need to dig deeper into how the world of computing applies to these words. Rather than looking at the abstract, let’s instead consider some real world examples.

A “Real World” Example

Let’s say you’ve sat down for the evening to complete tomorrow’s homework. This evening you’ve got both Math and History worksheets to fill out. Tonight for some reason, you decide to do one problem in Math, then one problem in History, then back to Math, etc until all the problems are done.

In the parlance of computing, you’re now doing your Math and History worksheets concurrently. This is because your Current task list includes 2 items: Math worksheet and History worksheet.

Now, clearly you the reader can see a problem here. By switching back and forth, completing your homework will probably take longer than if you did the complete Math worksheet then did the History worksheet. In other words, if you did the worksheets in serial.

So, if concurrent means “having multiple outstanding tasks at once”, then what is parallel? Parallel is the ability to make progress on multiple tasks simultaneously.

Let’s say you’ve been asked to read the book One O’Clock Jump by Lise McClendon. You also need to drive down to San Diego for Comic-Con. Thankfully you find that One O’Clock Jump is available on audiobook!

You can now listen to the book while driving. You’re simultaneously making progress on two separate tasks. This is the equivalent of parallelism in computing.

I hope that these real world examples help illustrate the difference between concurrency and parallelism. Now let's apply this newfound knowledge to Ruby.

Back to Ruby

One reason this problem can be difficult to understand is that Ruby provides only a single mechanism for concurrency: the Thread class. Whether or not these Threads run in parallel, though, depends on a number of factors.

MRI 1.8

Let’s look at MRI 1.8 (and MRI forks such as REE) to begin with, because it has the simplest model. MRI 1.8 uses a technique known as “green threads” to implement Threads. This means that every once in a while (around 100 milliseconds), the program says “oh, I should let another thread run now!” It then saves the current execution state into the running thread and restores the state of another thread. This is exactly like our homework example above. We can have as many things as we’d like in our task list, but we can only make progress on one of them at a time.
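To make the homework analogy concrete, here is a minimal sketch of two Threads working through two worksheets. The worksheet names and problem labels are made up for illustration, and the exact interleaving you see depends on the implementation and its scheduler; Thread.pass is only a hint.

```ruby
require "thread"

# A thread-safe queue records the order in which problems were finished.
log = Queue.new

# Hypothetical worksheets: each one is just a list of problem labels.
worksheets = {
  "math"    => %w[m1 m2 m3],
  "history" => %w[h1 h2 h3],
}

threads = worksheets.map do |subject, problems|
  Thread.new do
    problems.each do |problem|
      log << "#{subject}:#{problem}"
      Thread.pass # hint to the scheduler that another thread may run now
    end
  end
end

threads.each(&:join)

# Drain the queue into an array so we can look at the order.
finished = []
finished << log.pop until log.empty?
puts finished.inspect
```

Both worksheets are "in progress" at once, which is concurrency. Whether any two problems are ever worked on at the same instant is a separate question, and that is the one the rest of this article is about.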

There is a wrinkle in the concurrency/parallelism game that I haven’t mentioned before now. This wrinkle is IO, namely how Threads interact when waiting for some external event. MRI 1.8.7 is quite smart, and knows that when a Thread is waiting for some external event (such as a browser to send an HTTP request), the Thread can be put to sleep and be woken up when data is detected. This simple optimization improves the usage of Threads so much that for a very long time the MRI 1.8.7 model was good enough for all Ruby programs.
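A rough way to see why waiting threads matter so much: in the sketch below, three threads each spend 0.3 seconds waiting. Here sleep is only a stand-in for waiting on a socket or file, and the numbers are arbitrary. Because a waiting thread is parked rather than occupying the interpreter, the total elapsed time should be close to one wait, not three.

```ruby
# Three pretend "requests", each of which spends 0.3 seconds waiting.
# sleep is an assumption standing in for real IO such as reading a socket.
wait_time = 0.3

started = Time.now
threads = 3.times.map do |i|
  Thread.new do
    sleep wait_time # the thread is parked; other threads are free to run
    "response #{i}"
  end
end

# Thread#value joins the thread and returns the block's result.
responses = threads.map(&:value)
elapsed = Time.now - started

puts responses.inspect
puts "waited %.2fs in total" % elapsed
```

Run serially, the same three waits would take roughly 0.9 seconds.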

MRI 1.9

Switching back to Ruby implementations, let’s look at MRI 1.9. As has been previously reported, MRI 1.9 removes the “green threads” we had in MRI 1.8 and uses native threads to implement the Thread class. Now, what are these “native threads”? These are units of concurrency that the underlying operating system is aware of. A big reason to switch to native threads is that it vastly simplifies the implementation of Threading. The operating system handles the low level parts of saving and restoring Thread information in a completely transparent way. Additionally, letting the OS know what parts of a program should be concurrent allows it to use the full resources of the computer to make that happen. In this modern world, that means using multiple cores.

Up until now, all we’ve talked about with Ruby’s Threading model was about concurrency, the ability to have multiple outstanding tasks at once. Now when we add in the idea of multiple cores, we can finally talk about parallelism. When a computer includes multiple cores (which is pretty much every computer now), those cores can run different code simultaneously, providing true parallelism. When a computer only has one core, there is no true parallelism, instead there is just simple concurrency, even at the OS level. The OS manages all the processes and threads in the system the same way you handled your Math and History worksheets, doing one for a little while, then grabbing another one.
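If you are curious how many cores your own machine offers, newer Rubies can tell you. This small sketch uses Etc.nprocessors, which was added to the standard library in Ruby 2.2 and so postdates the implementations discussed here.

```ruby
require "etc"

# Etc.nprocessors returns the number of online processors
# (available in Ruby 2.2 and later).
cores = Etc.nprocessors

if cores > 1
  puts "#{cores} cores: the OS can run threads truly in parallel"
else
  puts "1 core: the OS can only interleave threads"
end
```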

Back to multiple cores though. Now that there is the opportunity to run things truly in parallel, we have to look at whether Ruby can take advantage of that. Since MRI 1.9 uses OS threads, it can actually spread out your Ruby Threads to multiple cores!

Unfortunately, MRI 1.9 prevents the Ruby code itself from running in parallel by requiring that any thread running Ruby code hold a lock. This lock is commonly known as the GIL (Global Interpreter Lock) or GVL (Global VM Lock).

There are a few reasons for the GIL to exist, but for this discussion we will say that it’s because the non-Ruby parts of MRI 1.9 are not thread-safe. This means if data were manipulated by multiple threads at the same time, the data could become corrupt. The important thing for this post is how it applies to parallelism: the GIL inhibits parallelism within Ruby code.
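The effect of a single global lock can be imitated in plain Ruby. The sketch below is not how MRI implements the GIL; it is a toy in which a Mutex named global_lock (a hypothetical stand-in) guards all the "interesting" work, so that no matter how many threads exist, at most one is ever inside the guarded section.

```ruby
global_lock = Mutex.new # hypothetical stand-in for the GIL
inside      = 0         # threads currently inside the guarded section
max_inside  = 0         # highest value of `inside` ever observed
counter     = 0         # shared data the lock protects

threads = 4.times.map do
  Thread.new do
    1_000.times do
      # Only the thread holding the lock may touch the shared data.
      global_lock.synchronize do
        inside += 1
        max_inside = inside if inside > max_inside
        counter += 1
        inside -= 1
      end
    end
  end
end
threads.each(&:join)

puts "counter: #{counter}, most threads inside at once: #{max_inside}"
```

The shared data stays consistent, and the price is that the guarded work never overlaps. That is the same trade the GIL makes for the interpreter as a whole.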

MRI 1.9 uses the same technique as MRI 1.8 to improve the situation: the GIL is released if a Thread is waiting on an external event (normally IO), which improves responsiveness. MRI 1.9 also includes an experimental API that C extensions can use to run some C code without holding the GIL, in order to utilize parallelism. This API is very restrictive, though, because no Ruby object may be accessed in any way while the GIL is not held by the current thread.

That about sums up the situation with MRI 1.8 and 1.9 with regards to concurrency and parallelism. Both provide concurrency of Ruby code, but neither provides parallelism of Ruby code.
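One way to check this for yourself is to time the same CPU-bound work done serially and done in threads. The sketch below is only an illustration: busy_work, the iteration count, and the number of workers are all arbitrary choices. On an implementation with a GIL the two timings should come out roughly the same; on one without a GIL, running on a multi-core machine, the threaded run has the chance to be faster.

```ruby
require "benchmark"

# Pure-Ruby CPU-bound work: sum the integers 0...n.
def busy_work(n)
  sum = 0
  n.times { |i| sum += i }
  sum
end

iterations = 2_000_000 # arbitrary; large enough to take measurable time
workers    = 4         # arbitrary number of tasks

serial_result   = nil
threaded_result = nil

serial_time = Benchmark.realtime do
  serial_result = workers.times.map { busy_work(iterations) }
end

threaded_time = Benchmark.realtime do
  threads = workers.times.map { Thread.new { busy_work(iterations) } }
  threaded_result = threads.map(&:value)
end

puts "serial:   %.2fs" % serial_time
puts "threaded: %.2fs" % threaded_time
```

Both runs compute the same answers; only the elapsed time tells you whether the Ruby code actually ran in parallel.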

Rubinius

Let’s take a quick look at other Ruby implementations where things are a bit different than MRI. I’ll start with Rubinius, since it’s the one I’m most familiar with. Rubinius 1.x also had a GIL and worked pretty much the same as MRI 1.9. With the upcoming 2.0 release though, the GIL will be removed, allowing Ruby code to run fully concurrent and fully parallel. We think this opens up a lot of uses for Ruby (parallel algorithms, etc) that Rubinius couldn’t handle well previously.

JRuby

JRuby layers the Thread class on top of Java’s thread class, so the threading model is whatever the JVM supports. That being said, OpenJDK is the primary JVM; it puts a Java thread directly onto an OS thread with no GIL. Thusly, JRuby almost always has full concurrency and parallelism available to it.

MacRuby

MacRuby uses Cocoa’s NSThread as its abstraction, which also runs without a GIL. So, this is another fully parallel implementation.

Maglev

Maglev runs directly on top of a Smalltalk VM and thusly layers the Thread class on top of a concept called Smalltalk Processes. In this case, the GemStone VM implements Processes in the same way as MRI 1.8, namely via “green threads” that don’t expose concurrency to the OS, and therefore, have no parallelism.

IronRuby

Lastly, IronRuby layers Thread directly on top of CLR’s threads without a GIL.
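If a program needs to know which of these situations it is in, it can inspect the RUBY_ENGINE constant, which Ruby 1.9 and the alternative implementations define. The lookup table below is just this article's summary written as a Hash; the variable names are made up, and Rubinius is left as nil because the answer depends on whether you are on 1.x or 2.0.

```ruby
# Parallelism of Ruby threads per engine, as summarized in this article.
# true = Ruby code can run in parallel, false = it cannot, nil = depends.
parallel_by_engine = {
  "ruby"     => false, # MRI: GIL
  "rbx"      => nil,   # Rubinius: GIL in 1.x, removed in 2.0
  "jruby"    => true,
  "macruby"  => true,
  "maglev"   => false, # green threads
  "ironruby" => true,
}

# MRI 1.8 does not define RUBY_ENGINE, so fall back to "ruby".
engine   = defined?(RUBY_ENGINE) ? RUBY_ENGINE : "ruby"
parallel = parallel_by_engine.fetch(engine, nil)

answer = case parallel
         when true  then "yes"
         when false then "no"
         else            "depends on the version (or engine not listed)"
         end
puts "#{engine}: parallel Ruby threads? #{answer}"
```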

Conclusion

I hope that this helps to clear up what concurrency and parallelism are and how the different Ruby implementations address them. Having this understanding is critical for discussing and understanding topics such as thread-safety of libraries and performance of applications.

In future posts, we’ll look to build on this knowledge to help you make the best use of Ruby!



Double Shot #628 »

Created at: 18.01.2010 14:13, source: A Fresh Cup, tagged: Double Shot aws cacheable deadweight github has_price helium maglev popthis rails ruby

The initial round of the Developer Wellwishes campaign is drawing to an end on Thursday, when we'll be combining donations with a few other campaigns to make a contribution to charity:water. If you've done your share for earthquake relief and still want to help more people out, do drop by.

  • Distributed Ruby with the MagLev VM - A nice look at what's going on with this alternate Ruby, which to my mind is more significant than Rubinius or MacRuby. Being able to store distributed objects with no ORM is likely to be a killer feature.
  • Deadweight and Helium - CSS coverage tools that can report which selectors are unused in your application.
  • popthis - Turn a folder of email files into a POP3 server for testing purposes.
  • git-goggles - Git utilities for branch management and code review.
  • The least you should know about S3 with Rails - This stuff is all out there, but here's a one page overview.
  • Yaml Cookbook - Turns out Ruby's YAML support is a good deal more flexible than I had realized.
  • Ruby Fibonacci Shootout - MacRuby is turning out some impressive performance numbers. Too bad it can't run anything I care about yet.
  • has_price - DSL for organizing things like taxes, discounts, and base prices. Wish I'd had this a year ago.
  • cacheable - Gem that adds cacheify, which is similar to memoize but uses memcached.



Distributed Ruby with the MagLev VM »

Created at: 15.01.2010 19:02, source: igvita.com, tagged: ruby maglev smalltalk

The GemStone team made a splash with MagLev at RailsConf '08, where they attracted a fair dose of attention from the attendees. Based on an existing GemStone/Smalltalk VM, it promised a lot of inherent advantages: 64-bit, JIT, years of VM optimizations, and built-in persistence and distribution layers. Since then the team has been making steady progress, which recently resulted in the announcement of the first public alpha. In fact, the project appears to be on track for 1.0 status later this year, alongside IronRuby, MacRuby, and Rubinius.

However, while the initial focus centered around the potential speed improvements offered by the VM, it is the persistence and distribution aspects of the runtime which make it stand out - if it happens to be faster, so much the better. Based on the Smalltalk VM, it offers integrated persistence (with ACID semantics) and distribution. In other words, you can treat MagLev as a distributed database that is capable of running Ruby code and storing native Ruby bytecode internally. Now that's a mouthful, let's see what it actually means.

MagLev VM: Features & Limitations

The goal of the GemStone team is to write as much of MagLev as possible in Ruby (the standard libraries, the parser, etc), which has already resulted in some good collaboration and synergies with the Rubinius project. As of the first public beta release, the project passes over 27,900 RubySpecs, features a pure ruby parser (slightly modified fork of ruby_parser), and runs RubyGems 1.3.5 out of the box. Popular gems such as rack, sinatra, and minitest all run unmodified, and there is even work on FFI support for C and Smalltalk extensions.

The end goal is full RubySpec compatibility, support for Ruby 1.9, and of course, running Rails - a stripped down version was demoed at RailsConf '09, but more work still needs to be done to make it fully compatible. The VM also ships with a MySQL driver, which means that you can use MagLev like any other Ruby runtime to power your applications, or you could leverage the built-in persistence APIs. MagLev has a distinctly different VM architecture which allows it to persist and share both code and data between multiple runtimes and execution cycles, all through a straightforward Ruby API! Incidentally, this is also the reason for the lack of support for several ObjectSpace methods (garbage_collect, each_object), as the enumeration could potentially mean retrieving gigabytes of persistent objects.

To get started, install MagLev via RVM, or follow the simple instructions on the wiki.

MagLev VM Architecture

The first thing you will notice about working with MagLev is that before you can run the interpreter, you will have to launch the MagLev service itself (maglev start). Turns out, unlike other Ruby VMs, all of the core Ruby classes and all other persisted code and data actually live in a separate "stone" process. The VMs ("gems") connect to the stone and retrieve all of their data from this service. Ruby classes are stored as bytecode in the stone server, which is transported via shared memory for local connections, and via an optimized binary protocol for remote connections, to the local interpreter and then compiled down to native machine code. This is how object persistence is made possible in MagLev: the stone server is a standalone process that acts as a database for your Ruby bytecode!

The added advantage is that the stone server supports full ACID semantics, which means that multiple processes can interact with the same repository and share state, objects, and code. A simple example of sharing data between multiple runs:

> maglev-data.rb

 # persist a string in the stone server
 Maglev::PERSISTENT_ROOT[:hello] = "world"
 Maglev.commit_transaction
 
 $ maglev-ruby -e 'p Maglev::PERSISTENT_ROOT[:hello]'
 > "world"
 

That covers a simple key-value example, but Maglev is also capable of transparently persisting entire object graphs without any data-modeling impedance mismatch:

> maglev-persist.rb

graph_node =<<-EOS
  class Graph
    def initialize; @nodes = []; end
    def push(node); @nodes.push node; end
  end
  class Node; end
EOS
 
# commit Graph class's bytecode into stone server
# - can also load external file: load 'class.rb'
Maglev.persistent { eval graph_node }
 
# build a simple in memory graph
g = Graph.new
g.push Node.new
g.push Node.new
 
# commit in-memory graph to stone server
Maglev::PERSISTENT_ROOT[:data] = g
Maglev.commit_transaction
 
############################
# in different process / VM:
graph = Maglev::PERSISTENT_ROOT[:data]
puts graph.inspect
# > #<Graph:0xa205f01 @nodes=[#<Node:0xa202d01>, #<Node:0xa202c01>]>
 

Instead of using an ORM to map Ruby classes to rows or documents in a database, you can simply store the objects directly in the stone server and interact with them through multiple processes, all without any extra conversions or additional infrastructure. The only caveat is that you would have to build your own indexing structures to power search and lookups beyond the key-value semantics. The KD-Tree example is a great showcase of the power and flexibility this can enable.
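As a rough idea of what "building your own indexing structures" can look like, the sketch below keeps a hand-built secondary index next to the data it indexes. To keep it runnable on any Ruby, a plain Hash named root stands in for Maglev::PERSISTENT_ROOT, and the person struct and its fields are invented for the example. On MagLev you would presumably store both under the real persistent root and finish with Maglev.commit_transaction, as in the examples above.

```ruby
# Assumption: a plain Hash stands in for Maglev::PERSISTENT_ROOT here,
# so nothing in this sketch is actually persisted.
root = {}

# Hypothetical record type, invented for the example.
person = Struct.new(:name, :city)

root[:people] = [
  person.new("Ann", "Portland"),
  person.new("Bob", "Seattle"),
  person.new("Cy",  "Portland"),
]

# Hand-built secondary index: city => list of people living there.
# It must be rebuilt or updated whenever root[:people] changes.
root[:people_by_city] = root[:people].group_by(&:city)

# A lookup by city now avoids scanning every person.
portlanders = root[:people_by_city]["Portland"].map(&:name)
puts portlanders.inspect
```

The index holds references to the same objects as the main collection, so nothing is copied; the cost is that keeping the two in sync becomes your job rather than the database's.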

MagLev at Scale & In-Production

While the "stone" server persists all the core Ruby classes and any additional data, the VMs ("gems") are not free. According to the documentation, each VM takes ~30 MB of memory at boot time and starts growing from there. On the other hand, the shared memory communication is extremely efficient, which means that hundreds of VMs can be run in parallel on a single box. GemStone claims production deployments of their Smalltalk VM on 64-128 core machines with up to 512GB RAM, running hundreds of concurrent VMs, and achieving over 10K transactions per second (TPS) on their "stone" servers - impressive numbers!

With the new Smalltalk VM (3.0) on the horizon and years of production optimization and research, MagLev is definitely a project to watch. The GemStone team has recently started a blog, opened up a Google group, and is starting to produce some great content to help Rubyists leverage their platform. What is missing now are the deployments, case studies, and new frameworks that can leverage all of these features - though, I'm sure, that will come.

