A Modern Guide to Threads »
Created at: 21.10.2011 03:49, source: Engine Yard Blog, tagged: Technology code ruby threads
NOTE: Mike Perham from Carbon Five recently wrote a blog post about using threads in Ruby. With his permission, we're reposting it here.
Carbon Five has been building state-of-the-art web applications for startups and large institutions since early 2000. Since their inception, they have focused on quality and value as the critical components of project success.
I spoke recently at RubyConf 2011 on some advanced topics in threading. What surprised me was how little experience people had with threads, so I decided to write this post to give people a little more background. Matz actually recommends not using threads (see below for why), and I think this is a big reason why Rubyists tend not to understand threading.
Simple Threading
Every time you execute ruby, rails or irb, you are creating a process. Within each process, you have something which is executing the code in your process. This is called a thread.
Your operating system starts every process with a "main" thread. Ruby allows you to create as many additional threads as you want by calling Thread.new with a block of code to be executed. Once the block of code has finished executing, the thread is considered dead. If the main thread exits, the process dies.
    t1 = Thread.new do
      i = 0
      1_000_000.times do
        i += 1
      end
    end
    t2 = Thread.new do
      j = 0
      1_000_000.times do
        j += 1
      end
    end
    t1.join
    t2.join
Above we have two threads independently counting up to one million, while the main thread waits for them to finish by calling join on each thread. These two threads will execute concurrently ("operating or occurring at the same time") with your process's main thread. Not so hard, right?
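As a small extension of the example above, here is a sketch of how the main thread can collect a result from each thread. It relies on Thread#value, which waits for the thread to finish (like join) and then returns the value of its block. The helper name count_in_thread is made up for illustration.

```ruby
# Hypothetical helper: starts a thread that counts up to `limit`
# and returns the final count as the value of the thread's block.
def count_in_thread(limit)
  Thread.new do
    count = 0
    limit.times { count += 1 }
    count
  end
end

threads = [count_in_thread(1_000), count_in_thread(2_000)]

# Thread#value joins the thread, then hands back the block's result.
results = threads.map(&:value)
puts results.inspect

# Once its block has finished, a thread is no longer alive.
puts threads.map(&:alive?).inspect
```

Each thread here only touches its own local variable, so there is nothing shared between them to go wrong.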
Race Conditions
Generally your computer can execute one thread per core. I have a dual core CPU in this laptop which means I can execute two threads at the exact same time [1]. Now imagine I want to parallelize my counting above. Instead of having one thread count to two million, I will have two threads count to one million each. That should execute twice as fast because I'll be using two threads and thus both cores:
    i = 0
    t1 = Thread.new do
      1_000_000.times do
        i += 1
      end
    end
    t2 = Thread.new do
      1_000_000.times do
        i += 1
      end
    end
    t1.join
    t2.join
    puts i
You'd expect the result to print "2000000", right? Nice try.
    > jruby threading.rb
    1330864
Any time multiple threads try to change the same variables, they have the potential for race conditions. Why is this?
The race condition is fundamentally due to the multi-step process of changing a variable. Even a simple increment in most languages is actually a multi-step process:
    register = i             # read the current value from RAM into a register
    register = register + 1  # increment it by one
    i = register             # write the value back to the variable in RAM
One of the features of threads is that they are controlled by the operating system; the OS can decide to stop Thread 1 and start executing Thread 2 at any point in time. This means that the OS can stop your thread after it has read the value of i into a register. Imagine this sequence of events:
    i = 0
    # OS is running Thread 1
    register = i             # 0
    register = register + 1  # 1
    # OS switches to Thread 2
    register = i             # 0
    register = register + 1  # 1
    i = register             # 1
    # Now OS switches back to Thread 1
    i = register             # 1
Now technically both threads have incremented i. Will the resulting value be 2? No, because Thread 2's increment was lost when Thread 1's last operation overwrote the memory. This is exactly why we saw 1330864 instead of 2000000; we lost a lot of increments due to this race condition. To avoid race conditions, any variable changes (fancy CS terminology: "mutation of shared state") must be done atomically so that other threads cannot see the change midway through the change process.
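To make the lost update concrete, the interleaving can be replayed by hand in a single thread. This is only an illustrative sketch: the method name replay_lost_update and the variables register_1 and register_2 are invented here to stand in for each thread's private register.

```ruby
# Replays the interleaving described above in a single thread, using
# explicit local variables to stand in for each thread's CPU register.
def replay_lost_update
  i = 0

  # Thread 1 reads and increments, but is paused before writing back.
  register_1 = i
  register_1 += 1

  # Thread 2 runs its complete read / increment / write sequence.
  register_2 = i
  register_2 += 1
  i = register_2

  # Thread 1 resumes and writes its stale result over Thread 2's.
  i = register_1
  i
end

puts replay_lost_update
```

Two increments ran, yet the method returns 1, because both "threads" started from the same stale value of i.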
Thread Safety
Now you know the fundamental requirement for thread-safe code: mutation of shared state must be done atomically. Any time you change a variable that is shared by many threads, it needs to be done atomically. Unfortunately Ruby and most other mainstream languages only give you one tool to do this: the lock aka the mutex.
Mutex is short for "mutual exclusion" as in "only one thread can be executing this code at a time". Usage is simple:
    @mutex = Mutex.new
    @mutex.synchronize do
      i += 1
    end
Remember that increment is a three-step process but because only one thread can be in the synchronize block at a time, we won't have any problems with race conditions; the Mutex effectively makes the increment atomic.
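Applied to the shared counter from the race condition section, a lock-protected version might look like the sketch below. The method name locked_count and its parameters are hypothetical; only Mutex#synchronize and Thread come from Ruby itself.

```ruby
# Runs `thread_count` threads that each add `per_thread` to a shared
# counter, guarding every increment with one shared Mutex.
def locked_count(thread_count, per_thread)
  total = 0
  mutex = Mutex.new

  threads = Array.new(thread_count) do
    Thread.new do
      per_thread.times do
        # Only one thread at a time can be inside this block.
        mutex.synchronize { total += 1 }
      end
    end
  end

  threads.each(&:join)
  total
end

puts locked_count(2, 100_000)
```

Note that every thread must go through the same Mutex instance; creating a new Mutex per thread would protect nothing.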
Here's the dirty secret that everyone who uses threads learns eventually: Threads have such a terrible reputation because locks are very painful to use in practice.
Modern Threading
What are the alternatives? There are several:
- Atomic Instructions: turn multi-step operations into a single atomic operation
- Transactional Memory (STM): ensure that changes are done as part of a transaction, which guarantees atomicity
- Actors: refactor our code so that only one thread may change a variable
My take is that locks exponentially grow the complexity of your codebase, and this is a major reason why Matz has always advised Rubyists to use Processes rather than Threads for concurrency. My recent RubyConf talk on Threads discusses these options. The Clojure language mandates transactional memory for all variable changes. Scala and Erlang offer Actors. Using plain old threads and locks is akin to writing in assembly language: there are better ways now.
In my opinion, the last option is the preferred option since you avoid the race condition in the first place: "Don't communicate by sharing state; share state by communicating". The fundamental idea behind actors is to give each thread a separate responsibility and pass messages between threads according to those responsibilities.
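The idea can be sketched with nothing more than a thread and a Queue from Ruby's standard library. This is a toy illustration of message passing, not how Celluloid or any other actor library is implemented; the class CounterActor and the method count_with_actor are invented names.

```ruby
require "thread" # needed for Queue on older Rubies; harmless on newer ones

# A hypothetical, minimal "counter actor": one thread owns the count and
# is the only code that ever changes it. Other threads send it messages.
class CounterActor
  def initialize
    @mailbox = Queue.new
    @count = 0
    @thread = Thread.new { run }
  end

  # Ask the actor to add one to its count. Safe to call from any thread.
  def increment
    @mailbox << :increment
  end

  # Ask the actor to stop once it has handled the messages queued so far,
  # then return the final count.
  def stop
    @mailbox << :stop
    @thread.value
  end

  private

  # Runs inside the actor's own thread and handles one message at a time.
  def run
    loop do
      case @mailbox.pop
      when :increment then @count += 1
      when :stop then break
      end
    end
    @count
  end
end

def count_with_actor(thread_count, per_thread)
  actor = CounterActor.new
  senders = Array.new(thread_count) do
    Thread.new { per_thread.times { actor.increment } }
  end
  senders.each(&:join)
  actor.stop
end

puts count_with_actor(2, 10_000)
```

No lock appears in this code: the senders never touch the count, and the queue is the only thing shared between threads.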
My first piece of advice to Rubyists: avoid Thread.new. This is exactly what Matz is saying also. Instead look for infrastructure that can abstract the use of threads into a safer concurrency model. See Celluloid and girl_friday for instance. Of course, MRI is not particularly suited to high concurrency applications; JRuby is a better choice. Other languages like Clojure or Erlang were designed with concurrency as a language feature right from the start.
I'm not saying that threads and locks should be removed completely from all software. Rather we should treat them for what they are: low-level abstractions that developers should not be using directly. As with assembly language, I see a need for threads and locks, but they should be used very sparingly. Understanding and knowing how to use higher level concurrency abstractions like actors and STM will make concurrent pieces of your application easier to write and maintain. Unfortunately not all of these options are available to MRI, but all are available to JRuby via Java libraries.
[1] True with JRuby, not true with MRI because of the infamous "Global Interpreter Lock".
rspec-core 2.7.1 is released! »
Created at: 20.10.2011 15:13, source: David Chelimsky, tagged: bdd rspec ruby
rspec-core-2.7.1
- Bug fixes
- tell autotest the correct place to find the rspec executable
Ruby, Concurrency, and You »
Created at: 14.10.2011 19:41, source: Engine Yard Blog, tagged: Open Source Technology 1.8 1.9 concurrency GIL implementations ironruby jruby macruby maglev MRI parallelism rubinius ruby threads
| Ruby Implementation | Concurrency | Parallelism |
|---|---|---|
| MRI 1.8 | ✔ | |
| MRI 1.9 | ✔ | |
| Rubinius 1 | ✔ | |
| Rubinius 2 | ✔ | ✔ |
| JRuby | ✔ | ✔ |
| MacRuby | ✔ | ✔ |
| Maglev | ✔ | |
| IronRuby | ✔ | ✔ |
A big topic in the world of Ruby this year has been how to get more out of Ruby, specifically, how to get more done in parallel. The topic of concurrency, though, is one fraught with misunderstanding. This is largely due not only to the complexity of thinking about multiple things at once, but also to the limitations of Ruby implementations and operating systems.
In this article, I’ll lay the groundwork for understanding the difference between concurrency and parallelism. Then, I’ll look at how a programmer experiences them.
Concurrency vs. Parallelism
This has been discussed many times, but I sometimes still have difficulty with it. Let’s first break down the definitions of these two words:
- Concurrent: existing, happening, or done at the same time
- Parallel: occurring or existing at the same time or in a similar way
Hmm, ok. Well, that hasn’t improved our thinking about these two topics. We need to dig deeper into how the world of computing applies to these words. Rather than looking at the abstract, let’s instead consider some real world examples.
A “Real World” Example
Let’s say you’ve sat down for the evening to complete tomorrow’s homework. This evening you’ve got both Math and History worksheets to fill out. Tonight for some reason, you decide to do one problem in Math, then one problem in History, then back to Math, etc until all the problems are done.
In the parlance of computing, you’re now doing your Math and History worksheets concurrently. This is because your Current task list includes 2 items: Math worksheet and History worksheet.
Now, clearly you the reader can see a problem here. By switching back and forth, completing your homework will probably take longer than if you did the complete Math worksheet then did the History worksheet. In other words, if you did the worksheets in serial.
So, if concurrent means “having multiple outstanding tasks at once”, then what is parallel? Parallel is the ability to make progress on multiple tasks simultaneously.
Let’s say you’ve been asked to read the book One O’Clock Jump by Lise McClendon. You also need to drive down to San Diego for Comic-Con. Thankfully you find that One O’Clock Jump is available on audiobook!
You can now listen to the book while driving. You’re simultaneously making progress on two separate tasks. This is the equivalent of parallelism in computing.
I hope that these real world examples help illustrate the difference between concurrency and parallelism. Now let's apply this newfound knowledge to Ruby.
Back to Ruby
One reason this problem can be difficult to understand is that Ruby only provides a single mechanism for concurrency: the Thread class. But whether or not these Threads are parallel depends on a number of factors.
MRI 1.8
Let’s look at MRI 1.8 (and MRI forks such as REE) to begin with, because it has the simplest model. MRI 1.8 uses a technique known as “green threads” to implement Threads. This means that every once in a while (around 100 milliseconds), the program says “oh, I should let another thread run now!” This saves the current info into the current thread and restores another thread. This is exactly like our homework example above. We can have as many things as we’d like in our task list, but we can only make progress on one of them at a time.
There is a wrinkle in the concurrency/parallelism game that I haven’t mentioned before now. This wrinkle is IO, namely how Threads interact when waiting for some external event. MRI 1.8.7 is quite smart, and knows that when a Thread is waiting for some external event (such as a browser to send an HTTP request), the Thread can be put to sleep and be woken up when data is detected. This simple optimization improves the usage of Threads so much that for a very long time the MRI 1.8.7 model was good enough for all Ruby programs.
MRI 1.9
Switching back to Ruby implementations, let’s look at MRI 1.9. As has been previously reported, MRI 1.9 removes the “green threads” we had in MRI 1.8 and uses native threads to implement the Thread class. Now, what are these “native threads”? These are units of concurrency that the underlying operating system is aware of. A big reason to switch to native threads is that it vastly simplifies the implementation of Threading. The operating system handles the low level parts of saving and restoring Thread information in a completely transparent way. Additionally, letting the OS know what parts of a program should be concurrent allows it to use the full resources of the computer to make that happen. In this modern world, that means using multiple cores.
Up until now, all we’ve talked about with Ruby’s Threading model was about concurrency, the ability to have multiple outstanding tasks at once. Now when we add in the idea of multiple cores, we can finally talk about parallelism. When a computer includes multiple cores (which is pretty much every computer now), those cores can run different code simultaneously, providing true parallelism. When a computer only has one core, there is no true parallelism, instead there is just simple concurrency, even at the OS level. The OS manages all the processes and threads in the system the same way you handled your Math and History worksheets, doing one for a little while, then grabbing another one.
Back to multiple cores though. Now that there is the opportunity to run things truly in parallel, we have to look at whether Ruby can take advantage of that. Since MRI 1.9 uses OS threads, it can actually spread out your Ruby Threads to multiple cores!
Unfortunately, MRI 1.9 prevents the Ruby code itself from running in parallel by requiring that any thread running Ruby code hold a lock. This lock is commonly known as the GIL (Global Interpreter Lock) or GVL (Global VM Lock).
There are a few reasons the GIL exists, but for this discussion we will say that it’s because the non-Ruby parts of MRI 1.9 are not thread-safe. This means if data were manipulated by multiple threads at the same time, the data could become corrupt. The important thing for this post is how it applies to parallelism: the GIL inhibits parallelism within Ruby code.
MRI 1.9 uses the same technique as MRI 1.8 to improve the situation, namely the GIL is released if a Thread is waiting on an external event (normally IO) which improves responsiveness. MRI 1.9 also includes an experimental API that C extensions can use to run some C code without the GIL locked to utilize parallelism. This API is very restrictive though because no Ruby object may be accessed in any way while the GIL is not held by the current thread.
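The effect of releasing the lock while waiting can be observed with a rough timing sketch. Here sleep stands in for blocking IO, on the assumption that a sleeping thread lets other threads run, just as a thread waiting on a socket would. The method name elapsed_for_waiting_threads is made up for this example, and it assumes a Ruby recent enough to provide Process.clock_gettime.

```ruby
# Starts `thread_count` threads that each wait for `seconds`, and returns
# how long the whole batch took. `sleep` stands in for blocking IO here.
def elapsed_for_waiting_threads(thread_count, seconds)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  threads = Array.new(thread_count) do
    Thread.new { sleep(seconds) }
  end
  threads.each(&:join)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

elapsed = elapsed_for_waiting_threads(4, 0.2)
puts format("4 threads waiting 0.2s each took %.2fs in total", elapsed)
```

If the waits ran one after another, the total would be about 0.8 seconds; because waiting threads do not hold each other up, the total should be close to a single 0.2 second wait. The same experiment with CPU-bound Ruby code in each thread would not show this speedup on MRI.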
That about sums up the situation with MRI 1.8 and 1.9 with regard to concurrency and parallelism. Both provide concurrency of Ruby code, but neither provides parallelism of Ruby code.
Rubinius
Let’s take a quick look at other Ruby implementations where things are a bit different than MRI. I’ll start with Rubinius, since it’s the one I’m most familiar with. Rubinius 1.x also had a GIL and worked pretty much the same as MRI 1.9. With the upcoming 2.0 release though, the GIL will be removed, allowing Ruby code to run fully concurrent and fully parallel. We think this opens up a lot of uses for Ruby (parallel algorithms, etc) that Rubinius couldn’t handle well previously.
JRuby
JRuby layers the Thread class on top of Java’s thread class, so the threading model is whatever the JVM supports. That being said, OpenJDK is the primary JVM; it puts a Java thread directly onto an OS thread with no GIL. Thus, JRuby almost always has full concurrency and parallelism available to it.
MacRuby
MacRuby uses Cocoa’s NSThread as its abstraction, which runs without a GIL. So, this is another fully parallel implementation.
Maglev
Maglev runs directly on top of a Smalltalk VM and thus layers the Thread class on top of a concept called Smalltalk Processes. In this case, the GemStone VM implements Processes in the same way as MRI 1.8, namely via “green threads” that don’t expose concurrency to the OS and therefore have no parallelism.
IronRuby
Lastly, IronRuby layers Thread directly on top of CLR’s threads without a GIL.
Conclusion
I hope that this helps to clear up what concurrency and parallelism are and how the different Ruby implementations address them. Having this understanding is critical for discussing and understanding topics such as thread safety of libraries and performance of applications.
In future posts, we’ll look to build on this knowledge to help you make the best use of Ruby!
Avoid stubbing methods invoked by a framework »
Created at: 23.09.2011 02:23, source: David Chelimsky, tagged: bdd rspec rails ruby Test Doubles
In a github issue reported to the rspec-mocks project, the user had run into a problem in a Rails’ controller spec in which an RSpec-generated test double didn’t behave as expected. What follows is an edited version of the issue and my response, with the hope that it reaches a wider audience and/or sparks some conversation.
The reported problem: ActiveSupport::JSON::Encoding::CircularReferenceError using doubles
This spec …
    require 'spec_helper'

    describe ListsController do
      let(:list) { double("List") }

      describe "GET 'index'" do
        let(:expected) { [{id: "1", name: "test"}] }

        before do
          list.stub(:id){ "1" }
          list.stub(:name){ "test" }
          List.stub(:select){ [ list ] }
        end

        it "should return the list of lists" do
          get :index, format: :json
          response.body.should == expected.to_json
        end
      end
    end
… plus this implementation …
    class ListsController < ApplicationController
      respond_to :json
      expose(:lists) { List.select("id, name") }

      def index
        respond_with(lists)
      end
    end
… produces this failure:
    Failure/Error: get :index, format: :json
    ActiveSupport::JSON::Encoding::CircularReferenceError:
      object references itself
The deeper problem: this is a great example of when not to use stubs.
Here’s why: there are three incorrect assumptions hiding behind the stubs!
- select takes an Array: List.select(["id","name"]), but the example stubs it incorrectly.
- the id is numeric, but the example uses String.
- the json is wrapped: {"list":{"id":1,"name":"test"}}, but the example doesn’t wrap it.
Even if the stubs were properly aligned with reality, the reason for the error
is that respond_with(lists) eventually calls as_json on the list object,
which, in this example, is an RSpec double that doesn’t implement as_json.
We need to either use a stub_model (which does implement as_json), or
explicitly stub it in the example:
    list.stub(:as_json) { { list: {id: 1, name: "test"} } }
But I’d avoid stubs altogether in this case. Stubs are great for well defined
(and understood) public APIs which are invoked by the code being specified.
In this case, we’re stubbing an API (as_json) that is invoked by the Rails
framework, not the code being specified. If the Rails framework ever changes
how it renders json, this example would continue to pass, but it would be a
false positive.
One possible remedy
Here’s how I’d approach this outside-in (based on my own flow, design preferences, and target outcomes. YMMV.)
Start with a request spec:
    require 'spec_helper'

    describe "Lists" do
      describe "GET 'index.json'" do
        it "returns the list of lists" do
          list = List.create!(name: "test")
          get "/lists.json"
          response.body.should == [{list: {id: list.id, name: "test"}}].to_json
        end
      end
    end
This shows exactly what to expect, so when working on clients we can refer directly to this without having to dig into internals.
Run this and it fails with uninitialized constant List, so generate the list
resource:
    rails generate resource list name:string
    rake db:migrate
    rake db:test:prepare
Run it again and it fails with ActionView::MissingTemplate. Now we have a
couple of choices. The purist view says “write a controller spec”, but some
people say controller specs are unnecessary if there are already request specs
(or cukes) as they just add duplication.
For me, the answer depends upon the complexity of the requirement as it
compares to what we get for free from Rails. In this case, the only difference
between the requirement and what Rails gives us for free is that we constrain
the fields to id and name. This is something we can implement in the model,
so I’d just implement this very simple controller code and move on:
    class ListsController < ApplicationController
      respond_to :json

      def index
        respond_with List.all
      end
    end
Now the request spec fails with:
    expected: "[{\"list\":{\"id\":1,\"name\":\"test\"}}]"
         got: "[{\"list\":{\"created_at\":\"2011-08-27T14:56:19Z\",\"id\":1,\"name\":\"test\",\"updated_at\":\"2011-08-27T14:56:19Z\"}}]"
We’re getting more key/value pairs than we want. I want the model responsible for constraining the keys in the json (Rails implements json transformations in the context of the model, so why shouldn’t we?), so I’d add a model spec:
    require 'spec_helper'

    describe List do
      describe "#as_json" do
        it "constrains keys to id and name" do
          list = List.new(:name => "things")
          list.as_json['list'].keys.should eq(%w[id name])
        end
      end
    end
This fails with:
    expected ["id", "name"]
         got ["created_at", "name", "updated_at"]
I expect to see created_at and updated_at, but I’m surprised (initially) to
see that id is missing. Thinking this through, it makes sense because the
example generates the list using new, so no id is generated. To get id
to show up in the list of keys, we can use create instead of new, or we can
explicitly set it. I’m going to go with setting the id explicitly to avoid the
db hit, accepting the self-imposed leaky abstraction. It’s all trade-offs.
    it "constrains fields to id and name" do
      list = List.new(:name => "things")
      list.id = 37
      list.as_json['list'].keys.should eq(%w[id name])
    end
Now it fails with:
    expected ["id", "name"]
         got ["created_at", "id", "name", "updated_at"]
Now we can implement the constraint:
    class List < ActiveRecord::Base
      def as_json
        super({ only: %w[id name]})
      end
    end
Now the model spec passes, but the request spec fails with:
ArgumentError: wrong number of arguments (1 for 0)
This is because the as_json implementation fails to honor the Rails
API:
as_json(options = nil)
as_json is called by the Rails framework with an options hash. Had we done
this without the request spec and weren’t aware of this information, we’d have
a bunch of passing specs but the app would blow up. Hooray for testing at
multiple levels!
So we add a new example to the model spec:
    it "honors the submitted options hash" do
      list = List.new(:name => "things")
      list.id = 37
      list.as_json(:only => :name)['list'].keys.should eq(%w[name])
    end
This fails with wrong number of arguments (1 for 0) as well, so now we adjust
the model implementation:
    def as_json(opts={})
      super({ only: %w[id name]}.merge(opts))
    end
Now the model spec passes again, and so does the request spec! DONE!
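To see the merge behavior without a Rails app, here is a plain Ruby sketch. BaseRecord is a hypothetical stand-in invented for this example; it only imitates the general shape of as_json(options = nil) with an :only filter and a "list" root key, and is not how ActiveRecord actually implements serialization.

```ruby
# Hypothetical stand-in for the framework's serialization, for
# illustration only. It supports just the :only option.
class BaseRecord
  def initialize(attributes)
    @attributes = attributes
  end

  # Mimics the shape of the framework API: as_json(options = nil)
  def as_json(options = nil)
    options ||= {}
    only = Array(options[:only]).map(&:to_s)
    selected = only.empty? ? @attributes : @attributes.select { |key, _| only.include?(key) }
    { "list" => selected }
  end
end

class List < BaseRecord
  # Default to id and name, but let the caller's options win on conflict.
  def as_json(opts = {})
    super({ only: %w[id name] }.merge(opts || {}))
  end
end

list = List.new("id" => 37, "name" => "things", "created_at" => "2011-08-27T14:56:19Z")
puts list.as_json.inspect
puts list.as_json(only: :name).inspect
```

Because the defaults are the receiver of merge, any :only passed by the caller replaces the default rather than adding to it, which is the behavior the second model example specifies. The extra `opts || {}` guard is an addition in this sketch so that an explicit nil is tolerated.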
The result is a very nice balance of clarity, speed (in spite of the one db hit in the request spec) and flexibility. Any new endpoints we add will get the same json representation because it is expressed in the model (heeding the principle of least surprise). The model spec not only specifies how the model should represent itself as json, but it helps to explain how the rails framework uses the model. All of this with no stubbing at all, and especially no stubbing of APIs our code isn’t invoking.
Termistat : a status bar for your terminal »
Created at: 26.08.2011 20:03, source: OnRails.org, tagged: ruby terminal status bar
When running background processes that produce detail logging, it’s often difficult to strike the right balance between providing overall status information and details about the current step in the process. It’s helpful to be able to see “tail-like” information at the detail level to monitor and debug your processes; however, it’s also helpful to be able to know summary information, such as the overall progress through the entire task. You can intersperse “record 1 of n” lines in your output, but they are easy to miss in all the noise.
In order to be able to display both types of information concurrently, I built a simple gem called termistat, which allows you to display a status bar for summary information at the top of your terminal in addition to the original detailed output. It was meant to be a whyday contribution, but I didn’t quite finish it in time to be released on whyday…oh well.
Here’s a screenshot of termistat in action:

Termistat requires the ffi-ncurses gem (which requires the ncurses library to be on your system), and has a configuration DSL to customize the appearance somewhat. Check it out and let me know if you have any ideas for improvement!

