Announcing Database-less environments »
Created at: 16.05.2012 03:35, source: Engine Yard Blog, tagged: Add-ons databases Engine Yard Cloud mongodb mysql PostgreSQL Technology
At Engine Yard, we believe that you should have the flexibility to set up your environments and manage your data stores as you see fit. This is something we take seriously as we continue to evolve Engine Yard Cloud and today, we are happy to announce database-less environments as an alpha release. If you need to utilize data offerings outside of our natively supported MySQL or PostgreSQL, then this feature will enable you to do so.
Enabling the feature
With database-less environments, it is no longer necessary to have a MySQL or PostgreSQL instance in every environment. Simply boot up a ‘No Database’ cluster with one of our Add-on database providers or roll your own using utility instances. Now it is easier and more affordable than ever to get started on Engine Yard.
You can enable this feature using the Early Access tools. Once you have the 'No db' feature enabled, you will be able to select the "No Database (Alpha)" option under Database Stack on the new environment form.
You can add as many application instances and utilities as you need, and you can stop paying for database masters that you don’t use. For example, you can follow the Mongoid RailsCast (Episode 238) and create a simple blog using Mongoid using two application instances and three utility nodes.
Add-ons and DBaaS
You can also use the ‘No Database’ feature in combination with our Add-on Program (login required). For example, you can have a simple application with just one instance and an external database. See the Database section of our Add-on Program for more information.
We hope you enjoy this feature and let us know what you think.
Notes
Removal of the database.yml file
Environments without databases will not have a database.yml file generated by Engine Yard Cloud. Enabling this feature means that you are either not using ActiveRecord or you have supplied your own database.yml file in your repository.
more »
HandlerSocket: The NoSQL MySQL & Ruby »
Created at: 14.01.2011 18:49, source: igvita.com, tagged: databases handlersocket mysql nosql
The end of an architectural era, time for a complete rewrite? Is it really the case that by attempting to be a "one size fits all", the RDBMS "systems of the past" excel at nothing? The Cambrian explosion of alternative database engines certainly lends some credibility to that view. However, amidst all the hype, it is also easy to overlook the fundamentals: normalized or not, or with or without a "structured query language" (SQL), a B-Tree is still one of the best performing data structures when it comes to indexing data.
With that in mind, Yoshinori Matsunobu and a few of his collaborators at DeNA (one of the largest social game platform providers in Japan), decided to build the literal "NoSQL" directly into MySQL! The HandlerSocket plugin, which they have released to the public can be installed into any existing MySQL server to provide an optimized protocol for reading and writing data directly from the underlying storage engine (such as InnoDB) without any SQL overhead. And the results are stunning: faster than memcached, more flexible, and no cache coherency problems.
HandlerSocket: Under the Hood
The core insight behind HandlerSocket is that for in-memory workloads where data is accessed via an index, the overhead imposed by SQL parsing, locks and concurrency controls has nothing to do with the reading or writing of data from the underlying storage engine. In other words, if all you need is direct access to the index, then you can bypass the SQL layer altogether - that is exactly what HandlerSocket provides.
The plugin is a daemon which can be loaded into any MySQL server, and which opens additional ports on the server to accept direct reads and writes to the underlying storage engine. The protocol is incredibly simple, and most importantly, HandlerSocket runs inside of your existing MySQL server, accessing the same underlying data. This means that you have the full expressive power of SQL, the persistence and error recovery of the underlying engine, and an optimized read/write protocol for where you need it.
Benchmarks are a subjective sport, but Yoshi's tests show that direct access to InnoDB via the HandlerSocket protocol yields a significant improvement even when compared to a raw memcached connection! Perhaps it is memcached we should be optimizing, not running away from our RDBMS engines?
Hands on with HandlerSocket
To get started, follow the installation instructions provided in the HandlerSocket repo, or simply pickup one of the latest builds of the Percona Server - it now ships with HandlerSocket built in! Once you're up and running, you can tweak your settings for the plugin in your my.cnf:
[mysqld]
plugin-load=handlersocket.so # or mysql> install plugin handlersocket soname 'handlersocket.so';
loose_handlersocket_port = 9998
loose_handlersocket_port_wr = 9999
loose_handlersocket_threads = 16
loose_handlersocket_threads_wr = 1
open_files_limit = 65535
If all is well, launch your mysql CLI, and do a "show processlist" to see the open HandlerSocket connections. Now, if you are curious, you can telnet directly to the read or write port and issue some queries, or pick one of the existing clients (PHP, Java, Python, Node.js, Ruby) and do it from the comfort of your favorite language.
Ruby & HandlerSocket
HandlerSocket ships with a C++ library (libhsclient) for which there are several Ruby wrappers: ruby-handlersocket and handlersocket. If native extensions are not a problem, then "gem install handlersocket":
require 'handlersocket' h = HandlerSocket.new(:host => '127.0.0.1', :port => '9998') # open PRIMARY index on widgets.user table, assign it ID #1 h.open_index(1, 'widgets', 'user', 'PRIMARY', 'user_name,user_email,created') # fetch record from index ID 1, where PRIMARY key is equal to 1 p h.execute_single(1, '=', [1]) # > [0, [["Ilya", "ilya@igvita.com", "2010-01-01 00:00:00"]]] # open 'id_created' index on widgets.user table, assign it ID #2 p h.open_index(2, 'widgets', 'user', 'id_created', 'user_name,user_email,created') # fetch record from index ID 2, where id >= 2, and date >= 2010-01-03 p h.execute_single(2, '>=', [2, '2010-01-03'], 10) # > [0, [["Bob", "bob@example.com", "2010-01-03 00:00:00"]]]
Alternatively, if you are looking for a non-blocking version, you can also use em-handlersocket which does not require any native extensions. Non-blocking reactor, combined with the fact that HandlerSocket allows pipelined execution, makes a for a very fast API! A simple example of connecting to a MySQL server and doing a range scan on a composite InnoDB index:
EM.run { c = EM::HandlerSocket.new idx = {:id => 0, :db => 'widgets', :table => 'user', :index_name => 'id_created', :columns => 'user_name'} d = c.open_index(idx) d.callback do |s| # Query index 0 for records where id >= 2, and created_at >= 2010-01-03 d = c.query(:id => 0, :op => '>=', :key => ['2', '2010-01-03']) d.callback do |data| p [:received, data] end end }
Stop the revolution! Hold the rewrite!
If HandlerSocket can fetch data faster than memcached, then this is a game changer. Stop the revolution! Hold the rewrite! After all, caching data in another datastore leads to data duplication, cache coherency problems and additional operational complexity. Perhaps it is too early to simply throw away the billions of dollars we have invested into developing and optimizing the underlying RDBMS engines!
more »
Schema-Free MySQL vs NoSQL »
Created at: 01.03.2010 20:55, source: igvita.com, tagged: databases eventmachine mysql nosql
Amidst the cambrian explosion of alternative database engines (aka, NoSQL) it is almost too easy to lose sight of the fact that the more established solutions, such as relational databases, still have a lot to offer: stable and proven code base, drivers and tools for every conceivable language, and more features than any DBA cares to learn about. Not to mention that relational or not, they often times perform just as well as any other single instance key-value store when faced with large datasets - hence the reason why Riak, Voldemort and others use InnoDB as their data stores. Granted, the “feature bloat” is also the reason why a rewrite can be a good idea, but it also feels like this gray zone is too often overlooked in the NoSQL community - just because you are “NoSQL” does not mean you have to throw away years of work put into relational databases.
Setting aside the fact that we are yet to define what “NoSQL” actually is, some of the attributes that we commonly glob under this label are: document based, schema-free, distributed and “scalable”. The fact that being distributed and being scalable are not one and the same is a subject for another post, instead let’s take a closer look at what schema-free and document-based actually means. In fact, let me jump ahead: I am genuinely surprised that we are yet to see a schema-free engine built on top of MySQL. I know, I know, but suspend you disbelief for a second, because it is not as outrageous as it sounds.
Document Based: a Double Edged Sword
The original reason for and the benefit of the relational model is that by constraining the data schema (read, eliminating structural complexity of the data, or decomposing it into relations), you actually gain power and flexibility in the types of queries you can execute against your database. Said another way, normalized data design allows us to have a general-purpose query language, which allows for queries whose parameters we do not even know at design time, whereas denormalized designs do not. What we loose in flexibility of our data structures, we gain in our ability to interact with the data. Hence, in theory, if you have no way to anticipate the types of queries in the future, a relation model is your best bet. Lose some, win some, chose your poison.
At the same time, we all know that “no join is faster than no join”. The inherent disadvantage of decomposing your data is the required assembly. If you are looking for “speed” or “scalability”, then denormalizing your data is usually the first step. The disadvantage? Now you have introduced a number of potential anomalies into your data: updates, inserts, and deletes can cause data inconsistencies unless you keep careful accounting of all duplication. One-to-One, and One-to-Many relations are usually easy to manage, but Many-to-Many in denormalized schemas are nothing but a recipe for disaster. That is, if you care about consistency.
Finally, since you lose the power of a general purpose query language (SQL), you are now at a mercy of the DSL provided by your new database. Mongo, Couch and many others had to introduce their own query language constructs alongside "map-reduce" functionality to address the problem of querying arbitrarily deep records. Now, I am a fan of both, but frankly, none I have worked with so far are as clean, or as easy to understand as SQL (case in point) - with the downside of making me learn yet another query language.
Schema-free != Document Based
Document based and schema-free are often used interchangeably, but there is an important difference: schema-free does not necessarily imply nested data structures. Likewise, just because MySQL is “relational” does not mean that it must be fixed to a predefined schema - at create time, maybe, but not at runtime. Intersect the two statements, and it means that there is absolutely no reason why we cannot have a schema-free engine in MySQL:
mysql> USE noschema; mysql> CREATE TABLE widgets; /* look ma, no schema! */ mysql> INSERT INTO widgets (id, name) VALUES("a", "apple"); mysql> INSERT INTO widgets (id, name, type) VALUES("b", "blackberry", "phone"); mysql> SELECT * FROM widgets WHERE id = "a"; +---------+---------------+ | id | name | +---------+---------------+ | a | apple | +---------+---------------+ mysql> SELECT * FROM widgets; +---------+---------------+--------+ | id | name | type | +---------+---------------+--------+ | a | apple | NULL | | b | blackberry | phone | +---------+---------------+--------+
As long as we avoid nested data structures, then there is no reason why we should be limited by the columns defined in our tables because we can compose and decompose any relation at runtime. Not only would this mean no migrations or need to store null values, but you could also keep all the tools, drivers, and the SQL query language while adding the full flexibility of being schema-free.
Schema-free DB on top of MySQL
Not able to find any project that would give me this behavior, I ended up prototyping it myself over the weekend, and believe it or not, it works just fine. In fact, the output above is from a real console session with MySQL. All it took is an em-proxy server with a little low-level protocol and query rewriting, and all of the sudden, my MySQL forgot that it requires a schema. Take it for a test-drive yourself (you will need Ruby 1.9):
git clone git://github.com/igrigorik/em-proxy.git && cd em-proxy
ruby examples/schemaless-mysql/mysql_interceptor.rb
mysql -h localhost -P 3307 --protocol=tcp
# snip ... # build the select statements, hide the tables behind each attribute join = "select #{table}.id as id " tables.each do |column| join += " , #{table}_#{column}.value as #{column} " end # add the joins to stich it all together join += " FROM #{table} " tables.each do |column| join += " LEFT OUTER JOIN #{table}_#{column} ON #{table}_#{column}.id = #{table}.id " end join += " WHERE #{table}.id = '#{key}' " if key
Of course, this is nothing but a cute code example nor does it even cover all the different use cases, but let us look at the feature set: driver support for every language (you can point Rails + ActiveRecord, JDBC, etc. at it out the box, no problem), tool support (GUI and command line), replication that works, basically impossible to corrupt, transactions, and so on. Not bad for half a day of hacking with a simple data model in the background:

Instead of defining columns on a table, each attribute has its own table (new tables are created on the fly), which means that we can add and remove attributes at will. In turn, performing a select simply means joining all of the tables on that individual key. To the client this is completely transparent, and while the proxy server does the actual work, this functionality could be easily extracted into a proper MySQL engine - I’m just surprised that no one has done so already. For a closer look, check out the proxy code itself, there are plenty of comments, which explain how it is all pieced together.
The gray zone of SQL vs NoSQL
So what is the point of all this? Well, I hope someone actually writes such an engine, because I believe there is a market for it. There is a lot to be said for a drop in, SQL compatible, schema-free engine, and unlike what the NoSQL propaganda may say, there is absolutely no reason why we can’t have many of the benefits of “NoSQL” within MySQL itself. There is no one clear winner for a database engine or model, so put some thought into your decision up front. Just because Mongo, TC, or Couch are 'document-oriented' or 'schema-free' does not mean they are necessarily better for your application. In the meantime, don't get me wrong, I am still rooting for all the NoSQL projects, as well as have high expectations for Drizzle - they are all doing fantastic work.
more »
Future of RDBMS is RAM Clouds & SSD »
Created at: 07.12.2009 18:51, source: igvita.com, tagged: Architecture databases database ramcloud ssd
Rumors of the demise of relational database systems are greatly exaggerated. The NoSQL movement is increasingly capturing the mindshare of the developers, all the while the academia have been talking about the move away from "RDBMS as one size fits all" for several years. However, while the new storage engines are exciting to see, it is also important to recognize that relational databases still have a bright future ahead - RDBMS systems are headed into main memory, which changes the playing field all together.
Performance is only one aspect that influences the choice of a database. Tree and graph structures are not easy to model within a relational structure, which in turn leads to complicated schemas and system overhead. For that reason alone, document-stores (Tokyo, CouchDB, MongoDB), graph stores (Neo4J), and other alternative data structure databases (Redis) are finding fertile ground for adoption. However, the end of "RDBMS as one size fits all" does not mean the end of relational systems all together. It is too early to bury RDBMS in favor of No (or Less) SQL. We just need to reset how we think about the RDBMS.
Disks are the New Tape
The evolution of disks has been extremely uneven over the last 25 years: disk capacity has increased 1000x, data transfer speeds increased 50x, while seek and rotational delays have only gone up by a factor of 2. Hence, if we only needed to transfer several hundred kilobytes of data in the mid 80's to achieve good disk utilization, then today we need to read at least 10MB of data to amortize the costs of seeking the data - refresh your memory on seek, rotational, and transfer times of our rusty hard drives.
When the best we can hope for is 100-200 IOPS out of a modern hard drive, the trend towards significantly larger block sizes begins to make a lot more sense. Whereas your local filesystem is likely to use 4 or 8kb blocks, systems such as Google's GFS and Hadoop's HDFS are opting out for 64MB+ blocks in order to amortize the cost of seeking for the data - by using much larger blocks, the cost of seeks and access time is once again brought down to single digit percent figures over the transfer time.
Hence, as we generate and store more and more data, the role of the disks must inevitably become more archival. Batch processing systems such as Map-Reduce are well suited for this world and are quickly replacing the old business intelligence (BI) systems for exactly these reasons. In the meantime, the limitations imposed by the random access to disk mean that we need to reconsider the role of disk in our database systems.
OLTP is Headed Into Main Memory & Flash
An average random seek will take 5-10ms when hitting the physical disk and hundreds of microseconds for accessing data from cache. Compare that to a fixed cost of 5-10 microseconds for accessing data in RAM and the benefits of a 100-1000x speed difference can be transformative. Instead of treating memory as a cache, why not treat it as a primary data store? John Ousterhout and his co-authors outline a compelling argument for "RAMCloud". After all, if Facebook keeps over 80% of their data in memcached, and Google stores entire indexes of the web in memory many times over, then your average database-backed application should easily fit and be able to take advantage of the pure memory model also.
The moment all of the data is available in memory, it is an entirely new game: access time and seek times become irrelevant (no disk seeks), the value of optimizing for locality and access patterns is diminished by orders of magnitude, and in fact, entirely new and much richer query models can enable a new class of data-intensive applications. In a world where the developer's time is orders of magnitude more expensive than the hardware (a recent phenomenon), this also means faster iterations and less data-optimization overhead.
The downside to the RAMCloud is the equivalent order of magnitude increase in costs - RAM prices are dropping, but dollar for dollar, RAMCloud systems are still significantly more expensive. Flash storage is an obvious compromise for both speed and price. Theoretical access time for solid-state devices is on the order of 50 microseconds for reads, and 200 microseconds for writes. However, in reality, wrapping solid-state storage in SATA-like hardware devices brings us back to ~200 microseconds for reads, or ~5000 IOPS. Though, of course, innovation continues and devices such as FusionIO’s PCI-E flash storage controller bring us back to 80 microsecond reads at a cost of ~$11 per Gigabyte.
However, even the significantly higher hardware price point is often quickly offset once you factor in the saved developer time and adjacent benefits such as guaranteed performance independent of access patterns or data locality. Database servers with 32GB and 64GB of RAM are no longer unusual, and when combined with SSDs, such as the system deployed at SmugMug, often offer a much easier upgrade path than switching your underlying database system to a NoSQL alternative.
Database Architecture for the RAMCloud
Migrating your data into RAM or Flash yields significant improvements via pure speedup in hardware, however, "it is time for a complete rewrite" argument still holds: majority of existing database systems are built with implicit assumptions for disk-backed storage. These architectures optimize for disk-based indexing structures, and have to rely on multithreading and locking-based concurrency to hide latency of the underlying storage.
When access time is measured in microseconds, optimistic and lock-free concurrency is fair game, which leads to much better multi-core performance and allows us to drop thousands of lines of code for multi-threaded data structures (concurrent B-Trees, etc). RethinkDB is a drop-in MySQL engine designed for SSD drives leveraging exactly these trends, and Drizzle is a larger fork of the entire MySQL codebase aimed at optimizing the relational model for "cloud and net applications": massively distributed, lightweight kernel and extensible.
Migrating Into Main Memory
Best of all, you can start leveraging the benefits of storing your data in main memory even with the existing MySQL databases - most of them are small enough to make the memory buffers nothing but a leaky abstraction. Enable periodic flush to disk for InnoDB (innodb_flush_log_at_trx_commit=2), and create covering indexes for your data (a covering index is an index which itself contains all the required data to answer the query). Issue a couple of warm-up requests to load the data into memory and you are off to the races.
Of course, the above strategy is at best an intermediate solution, so investigating SSD’s as a primary storage layer, and if you are adventurous, give RethinkDB a try. Also keep an eye on Drizzle as the first production release is aimed for summer of 2010. Alternative data storage engines such as Redis, MongoDB and others are also worth looking into, but let us not forget: laws of physics still apply to NoSQL. There is no magic there. Memory is fast, disks are slow. Nothing is stopping relational systems from taking advantage of main memory or SSD storage.
more »
RailsOnPg released »
Created at: 22.10.2009 01:07, source: Robby on Rails, tagged: Ruby on Rails programming PostgreSQL PostgreSQL rubyonrails plugins databases
Hello fellow PostgreSQL and Ruby on Rails geeks,
Alexander Tretyakov (twitter) recently released a plugin for Ruby on Rails, which extends migrations and provides you with the ability to create.
While you can already do something like this with execute in your migrations:
execute("CREATE VIEW my_tasty_snacks AS SELECT * FROM snacks WHERE food = 'Tasty';")With RailsOnPage, you’re provided a DSL so that you can do the following:
create_view :my_tasy_snacks do |view|
view.select '*'
view.from 'snacks'
view.conditions 'food' => 'Tasty'
endnote: I haven’t tested the above, just a hypothetical example
Anyhow, if you’re in the habit of using views, functions, or triggers with your PostgreSQL database and are using Ruby on Rails, you might give RailsOnPg a whirl.
more »

