Key-Value Stores in Ruby: The Wrap Up »
Created at: 17.11.2009 20:00, source: Engine Yard Blog, tagged: Technology couchdb javascript key-value stores mongodb ruby s3
This last article in our key-value series will briefly cover a few interesting topics that could each have had full articles of their own. This means that if they seem interesting to you, follow the links that I provide to get more information on them. Lastly, I’ll wrap up by introducing Moneta, written by Yehuda Katz, which provides a unified API for a wide variety of different Key-Value Stores. If you want to write code that allows the user to choose the store to use, you’ll want to pay attention to Moneta.
The difficult part of discussing Key-Value Stores today is that it’s a product area seeing rapid development and constant evolution. There are more interesting stores and libraries available than can easily be covered, even in a series like this. I could probably be writing posts every two weeks into next year without running out of subjects. So, alas, many things must be left undiscussed or underdiscussed. But let’s move on to the topics we can cover…
CouchDB
The first great Key-Value Store that isn’t going to get its own article is CouchDB. Apache’s CouchDB is a document-oriented database, like MongoDB. However, it exposes a RESTful, JSON-based API that you address through a built-in HTTP interface. Like MongoDB, it offers a schema-free data store. CouchDB offers solid, built-in replication, and uses JavaScript as its query language. It is a powerful tool.
There are several Ruby libraries which can be used to facilitate using CouchDB. In the examples below, I have used CouchRest, which is based on CouchDB’s own couch.js library:
require 'rubygems'
require 'couchrest'
require 'yaml'

DBH = CouchRest.database!('exercise-log')
response = DBH.save_doc({
  :date => Time.now,
  :activity => ARGV[0],
  :duration => ARGV[1]})
stored_record = DBH.get(response['id'])
puts "Stored:\n#{stored_record.to_yaml}"
wyhaines$ ruby /tmp/couch1.rb
Stored:
--- !map:CouchRest::Document
duration: "97:34"
_rev: 1-eb6f6e3a3e2eae0cd99f3fcbc63d29d6
_id: 0d9e71f44b3e0d3a2013c282bbccb5a0
activity: pedaling
date: 2009/11/12 21:07:45 +0000
As with MongoDB, one can store any set of keys/values together as a document in CouchDB, and then retrieve it later. CouchRest returns a response from the server that contains an id field, which can be used to retrieve the record that was just stored.
For more complex queries of the document store, one can use views. Views have a lot of power, because they are ultimately defined using JavaScript, but they don’t lend themselves to easy ad-hoc manipulation of the database.
DBH.save_doc({
  "_id" => "_design/query",
  :views => {
    :allkeys => {
      :map => "function(doc) { for (var word in doc) { if (!word.match(/^_/)) emit(word,doc[word])}}"
    }
  }
})
That inserts a view into the database that will be identified by query/allkeys. What a view does is defined by the JavaScript code it contains. Once a view is inserted into CouchDB, using it is simple:
puts DBH.view('query/allkeys').to_yaml
That particular function was lifted shamelessly from the CouchRest README, and just has a couple terms renamed to make it a little more clear. The output:
---
total_rows: 3
rows:
- id: 0d9e71f44b3e0d3a2013c282bbccb5a0
  value: pedaling
  key: activity
- id: 0d9e71f44b3e0d3a2013c282bbccb5a0
  value: 2009/11/12 21:07:45 +0000
  key: date
- id: 0d9e71f44b3e0d3a2013c282bbccb5a0
  value: "97:34"
  key: duration
offset: 0
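To see what that map function is doing without a CouchDB server at hand, here is a rough plain-Ruby imitation of it. This is only a sketch: `emit_all_keys` is a hypothetical helper name (not part of CouchRest or CouchDB), and the sorting step simply mimics the fact that view rows come back ordered by key.

```ruby
# Plain-Ruby imitation of the allkeys map function; no CouchDB involved.
# `emit_all_keys` is a hypothetical helper, not a CouchRest or CouchDB call.
def emit_all_keys(doc)
  rows = []
  doc.each do |key, value|
    # Skip _id, _rev and other internal fields, like the /^_/ test in the view.
    next if key.to_s.start_with?('_')
    rows << { 'id' => doc['_id'], 'key' => key.to_s, 'value' => value }
  end
  # View rows are returned ordered by key, so sort to match.
  rows.sort_by { |row| row['key'] }
end

doc = {
  '_id' => '0d9e71f44b3e0d3a2013c282bbccb5a0',
  '_rev' => '1-eb6f6e3a3e2eae0cd99f3fcbc63d29d6',
  'duration' => '97:34',
  'activity' => 'pedaling',
  'date' => '2009/11/12 21:07:45 +0000'
}

emit_all_keys(doc).each { |row| puts "#{row['key']}: #{row['value']}" }
```

Every non-internal field of the document becomes one row, which is why a single stored record produced three rows in the view output.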
This is really just the tip of the iceberg with CouchDB/CouchRest; there’s a wealth of functionality. CouchDB views are implemented with map/reduce capability, which means you can use them to crunch some pretty complex problems on your data. Additionally, CouchRest provides a CouchRest::ExtendedDocument, which your own classes can inherit from. This lets you easily create a Ruby model for your data, which is then transparently stored inside CouchDB.
class Exercise < CouchRest::ExtendedDocument
  [...]
[...] "running", :date => Time.now, :duration => "23:44")
Dig into the CouchDB and CouchRest documentation if this looks interesting to you.
S3
I just wanted to briefly mention Amazon’s Simple Storage Service. It is, fundamentally, a simple HTTP-accessible Key-Value Store that Amazon has turned into a service. Requests to S3 will have higher latency than requests to a locally hosted data store (and that latency can be high in absolute terms, too), but if you want a simple, robust store that will scale to as much data as you can push at it, you might seriously consider S3.
Moneta
Moneta is a unified interface to a variety of different key-value type data stores. That is, the same code can be run against a variety of different backing stores, and it will just work. Moneta supports the following stores as of this posting:
- Basic File Store
- BerkeleyDB
- CouchDB
- DataMapper
- File store for xattr
- In-memory store
- Memcache store
- Redis
- S3
- SDBM
- Tokyo
- Xattrs in a file system
Consider this example, which, again, uses CouchDB:
require 'rubygems'
require 'yaml'
require 'moneta'
require 'moneta/couch'
cache = Moneta::Couch.new(:db => 'football')
cache['1a_final'] = {
  :where => 'Laramie; War Memorial Stadium',
  :when => "11:30 MST",
  :who => "Southeast Cyclones & Lingle-Ft. Laramie Doggers",
  :prediction => "SE Cyclones by 14"}
puts cache['1a_final'].inspect
wyhaines$ ruby /tmp/moneta1.rb
---
- prediction: SE Cyclones by 14
  when: 11:30 MST
  who: Southeast Cyclones & Lingle-Ft. Laramie Doggers
  where: Laramie; War Memorial Stadium
It works, very simply. If I want to change the code to use something else, like a file based store, it’s as simple as changing one line:
--- couch.rb 2009-11-19 15:00:07.000000000 -0700
+++ file.rb 2009-11-19 15:01:12.000000000 -0700
@@ -1,9 +1,9 @@
require 'rubygems'
require 'yaml'
require 'moneta'
-require 'moneta/couch'
+require 'moneta/file'
-cache = Moneta::Couch.new(:db => 'football')
+cache = Moneta::File.new(:path => '/tmp/football')
cache['1a_final'] = {
:where => 'Laramie; War Memorial Stadium',
The rest of the code works without alteration. The Moneta API is designed to be very similar to that of Hash. It has a limited feature set (for example, it doesn’t currently support iteration or partial matches), but the features it provides work identically across all of the supported stores. If your Key-Value Store needs are simple and you want something that can work with whatever store your users want to use, definitely check out Moneta; it’s a well-written tool.
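The value of a Hash-like interface can be illustrated without Moneta itself. In the minimal sketch below, `TinyFileStore` and `record_prediction` are hypothetical names invented for illustration; the class is a stand-in for a file-backed store and is not how Moneta is implemented. The point is only that code which sticks to `[]` and `[]=` does not care what sits behind them.

```ruby
require 'tmpdir'
require 'fileutils'
require 'digest/md5'

# Hypothetical stand-in for a file-backed store; NOT Moneta's implementation.
# It supports only the Hash-style [] and []= calls used in the article.
class TinyFileStore
  def initialize(path)
    @path = path
    FileUtils.mkdir_p(@path)
  end

  def [](key)
    file = file_for(key)
    File.exist?(file) ? Marshal.load(File.binread(file)) : nil
  end

  def []=(key, value)
    File.binwrite(file_for(key), Marshal.dump(value))
  end

  private

  # Hash the key so that any string is safe to use as a file name.
  def file_for(key)
    File.join(@path, Digest::MD5.hexdigest(key.to_s))
  end
end

# Application code that relies on nothing but [] and []=.
def record_prediction(store, game, prediction)
  store[game] = { :prediction => prediction }
  store[game]
end

# The same method works with an in-memory Hash...
puts record_prediction({}, '1a_final', 'SE Cyclones by 14').inspect

# ...and with the file-backed stand-in.
Dir.mktmpdir do |dir|
  store = TinyFileStore.new(dir)
  puts record_prediction(store, '1a_final', 'SE Cyclones by 14').inspect
end
```

The trade-off is the one described above: by restricting yourself to the smallest common interface, you give up store-specific features such as iteration or queries.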
With that, we’ve reached the end of this series. It’s been fun to explore the unique features of each store, as well as the threads that unify these different approaches to the problem of non-SQL, key-value data storage. I hope that I’ve exposed you to new and useful tools.
The landscape of Key-Value Stores is changing rapidly, so it is difficult to stay fully informed all the time. For instance, just a couple days ago there was a blog post implementing a SQL front end for CouchDB. It’s done in Perl, but all it would take is an interested person and a little time, and you could have it in Ruby, too.
If you use a Key-Value Store system, or plan to, keep your eyes open for new developments, because you can bet that someone else will have something interesting next week or next month that may change the landscape again. As always, leave feedback in the comments, and thanks for reading!
Storing Your Files »
Created at: 16.03.2009 05:51, source: The Rails Way - Home, tagged: files nfs s3 uploads
This is the second article in my series on file management. The third article will cover the challenges of handling uploads, and then we should be able to move on to some more advanced topics.
The second problem you’ll face when building an application to handle files is where and how to store them. Thankfully there are lots of well-supported options, each with their own pros and cons.
The local file system
If your application only runs on a single server, the simplest option is to store the files on the local disk of your web/application server. This leaves you with very few moving parts, and you know that both your Rails application and your web server can see the same files, at the same location. But even though this is a simple option, there are a few things that you need to be careful of.
A common mistake I see is to use a single directory to handle all of the users’ uploaded files. So your directory structure ends up looking something like this:
/home/railsway/uploads/koz_avatar.png
/home/railsway/uploads/dhh_avatar.png
/home/railsway/uploads/other_avatar.png
The first, and most obvious, problem with this structure is that unless you’re careful you could end up with users overwriting each other’s files. The second, and more painful, problem is that you end up with too many files in a single directory, which will cause you some pain when you try to do things like list the directory or start removing old files.
A better option is to store the uploads in a directory which corresponds to the ID of the object which owns those files. But something like the following will still leave you with a huge top-level directory:
/home/railsway/uploads/1/koz_avatar.png
/home/railsway/uploads/2/dhh_avatar.png
/home/railsway/uploads/3/other_avatar.png
The best bet is to partition that directory into a number of sub directories like this:
/home/railsway/uploads/000/000/001/koz_avatar.png
/home/railsway/uploads/000/000/002/dhh_avatar.png
/home/railsway/uploads/000/000/003/other_avatar.png
Thankfully, both of the popular file management plugins have built-in support for partitioned storage: :id_partition in Paperclip and :partition in attachment_fu.
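If you ever need to build such a path by hand, the partitioning scheme shown above is easy to sketch. The helper below is hypothetical (`partitioned_path` is a made-up name, and this is not the plugins’ actual code); it assumes IDs below one billion, so that the zero-padded ID is exactly nine digits long.

```ruby
# Hypothetical helper: zero-pad the ID to nine digits and split it into
# three-digit directory segments, so that 1 becomes "000/000/001".
# Assumes IDs below one billion; not taken from Paperclip or attachment_fu.
def partitioned_path(root, id, filename)
  segments = format('%09d', id).scan(/\d{3}/)
  File.join(root, *segments, filename)
end

puts partitioned_path('/home/railsway/uploads', 1, 'koz_avatar.png')
# => /home/railsway/uploads/000/000/001/koz_avatar.png
```

With three levels of three digits each, no directory ever holds more than a thousand entries.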
NFS, GFS and friends
Once you’ve grown beyond a single app / web server, using the file system gets a little more complicated. In order to ensure that all your app and web servers can see the same files, you have to use a shared file system of some sort. Setting up and running a shared file system is beyond the scope of this site, but a few words of caution are in order.
It’s deceptively easy to set up a simple NFS server for your network and just run your application as you did when it was on a single disk, but some things which are cheap on local disk are slow and expensive over NFS and friends. Make sure you stress test your file server and pay an expert to help you tune the system. The bigger problem I’ve had with NFS and GFS is the impact of downtime or difficulties on your application. Your NFS server becomes a single point of failure for your whole site, and a minor network glitch can render your application completely useless as all the processes get tied up waiting on a blocking read from an NFS mount that’s gone away.
You can solve all those kinds of problems by hiring a good sysadmin and / or spending a large amount of money on serious storage hardware. It’s not a path that I personally choose, but it’s definitely an option you should consider.
Amazon S3
It’s not really possible to write about storage without touching on Amazon S3. In case you’ve been living under a rock for a few years, S3 is a hugely scalable, incredibly cheap storage service. There are several good gems to use with your applications, and the major file management plugins provide semi-transparent S3 support.
S3 isn’t a file system, so there are several things which you have to do differently; however, there are alternatives for most of those operations. For instance, instead of using X-Sendfile to stream the files to your user, you redirect them to the signed URL on Amazon’s own service. By way of example, our download action from the earlier article would look like this if using S3 and Marcel’s S3 library:
def download
  redirect_to S3Object.url_for('download.zip',
                               'railswayexample',
                               :expires_in => 3.hours)
end
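For the curious, a signed URL is nothing more than the resource path, an expiry time, and an HMAC of the two computed with your secret key. The sketch below follows S3’s legacy query-string scheme (Signature Version 2), which is the one in use when this was written; `signed_s3_url` is a hypothetical helper and the credentials are placeholders. It is meant to show the idea only; in a real application, let a maintained S3 library do the signing.

```ruby
require 'openssl'
require 'uri'

# Sketch of S3's legacy (Signature Version 2) query-string signing for a GET.
# `signed_s3_url` is a hypothetical helper; the credentials are placeholders.
def signed_s3_url(bucket, key, access_key, secret_key, expires_at)
  expires = expires_at.to_i
  # For a plain GET, the Content-MD5 and Content-Type lines are left empty.
  string_to_sign = "GET\n\n\n#{expires}\n/#{bucket}/#{key}"
  digest = OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha1'), secret_key, string_to_sign)
  # pack('m0') yields Base64 without line breaks; then escape it for the query string.
  signature = URI.encode_www_form_component([digest].pack('m0'))
  "https://s3.amazonaws.com/#{bucket}/#{key}" \
    "?AWSAccessKeyId=#{access_key}&Expires=#{expires}&Signature=#{signature}"
end

puts signed_s3_url('railswayexample', 'download.zip',
                   'EXAMPLEACCESSKEY', 'example-secret', Time.now + 3 * 60 * 60)
```

Because the signature covers the expiry time, a user cannot extend the life of a link by editing the URL.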
But there are a few things you have to be careful with when using S3. The first is that uploading to S3 is much slower than simply writing your file to local disk. Unless you want your Rails processes to be tied up for ages, you’ll probably want to have a background job running which transfers the files from your server up to Amazon’s. Another factor is that when S3 errors occur, your users will be greeted by a very ugly error page.
Finally there’s always the risk of amazon having another bad day which takes your application down for a few hours. Amazon’s engineers are pretty amazing, but nothing’s perfect.
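The background transfer mentioned above can be sketched with nothing more than a queue and a worker thread. Treat this as an illustration of the shape of the solution, not a recommendation: `BackgroundUploader` is a hypothetical class, the `uploader` callable stands in for a real S3 client call, and a production application would more likely use a persistent job queue so that pending uploads survive a restart.

```ruby
# Sketch: hand uploads to a worker thread so the request can return quickly.
# `BackgroundUploader` is hypothetical; `uploader` stands in for a real S3 call.
class BackgroundUploader
  def initialize(uploader)
    @uploader = uploader
    @queue = Queue.new
    @worker = Thread.new { work }
  end

  # Called from the request cycle: record the path and return immediately.
  def enqueue(path)
    @queue << path
  end

  # Signal the worker to stop, then wait for queued uploads to finish.
  def shutdown
    @queue << :stop
    @worker.join
  end

  private

  def work
    while (path = @queue.pop) != :stop
      begin
        @uploader.call(path)
      rescue StandardError => e
        # One failed upload should not kill the worker.
        warn "upload of #{path} failed: #{e.message}"
      end
    end
  end
end

uploaded = []
uploader = BackgroundUploader.new(lambda { |path| uploaded << path })
uploader.enqueue('/home/railsway/uploads/000/000/001/koz_avatar.png')
uploader.shutdown
puts uploaded.inspect
```

Until the transfer completes, the file exists only on your own server, so downloads need to fall back to the local copy in the meantime.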
Other options
There are a few options I’ve not used before, but you could investigate:
BLOBs in your database
I’ve never been a fan of using BLOBs to store large files, however some people swear by them. If you’re aware of great tutorial resources for BLOBs and rails, let me know and I’ll link to them from here.
Rackspace’s Cloud Files
When it was first announced, Cloud Files from Rackspace seemed like it was going to be a great competitor to S3. However, there’s currently no equivalent to S3’s signed-URL authentication option, which means downloads become much harder. To use Cloud Files, you would have to build a streaming proxy in your application and use it to stream files from Rackspace back out to the user. You’d also have to pay for the bandwidth twice: once from Rackspace, and once from your hosting provider.
This makes it much more complicated than S3 but hopefully this will be addressed in a future release.
MogileFS
MogileFS is a really interesting option. It has some similarities to S3 in that it’s a write-once file storage system which operates over HTTP. But unlike S3 it’s open source software you can run on your own servers. Unfortunately MogileFS is really thinly documented and quite difficult to get up and running. If you know of a really good getting-started tutorial for MogileFS, let me know and I’ll link to it from here.
It also would require you to use perlbal for your load balancer or find an apache module that can support X-Reproxy-Url.
Conclusion
There are a bunch of different options you should consider when picking the storage for your file uploads. Generally my advice would be to start with simple on-disk partitioned storage and grow from there. Don’t rush straight to S3 because all the blogs tell you to; stay as simple as possible for as long as you can.
