Protocol Buffers, Avro, Thrift & MessagePack »

Created at: 01.08.2011 22:31, source: igvita.com, tagged: Architecture avro protobuf serialization thrift

Perhaps one of the first inescapable observations that a new Google developer (Noogler) makes once they dive into the code is that Protocol Buffers (PB) is the "language of data" at Google. Put simply, Protocol Buffers are used for serialization, RPC, and about everything in between.

Initially developed in early 2000's as an optimized server request/response protocol (hence the name), they have become the de-facto data persistence format and RPC protocol. Later, following a major (v2) rewrite in 2008, Protocol Buffers was open sourced by Google and now, through a number of third party extensions, can be used across dozens of languages - including Ruby, of course.

But, Protocol Buffers for everything? Well, it appears to work for Google, but more importantly I think this is a great example of where understanding the historical context in which each was developed is just as instrumental as comparing features and benchmarking speed.

Protocol Buffers vs. Thrift

Let's take a step back and compare Protocol Buffers to the "competitors", of which there are plenty. Between PB, Thrift, Avro and MessagePack, which is the best? Truth of the matter is, they are all very good and each has its own strong points. Hence, the answer is as much of a personal choice, as well as understanding of the historical context for each, and correctly identifying your own, individual requirements.

When Protocol Buffers was first being developed (early 2000's), the preferred language at Google was C++ (nowadays, Java is on par). Hence it should not be surprising that PB is strongly typed, has a separate schema file, and also requires a compilation step to output the language-specific boilerplate to read and serialize messages. To achieve this, Google defined their own language (IDL) for specifying the proto files, and limited PB's design scope to efficient serialization of common types and attributes found in Java, C++ and Python. Hence, PB was designed to be layered over an (existing) RPC mechanism.

By comparison, Thrift which was open sourced by Facebook in late 2007, looks and feels very similar to Protocol Buffers - in all likelihood, there was some design influence from PB there. However, unlike PB, Thrift makes RPC a first class citizen: Thrift compiler provides a variety of transport options (network, file, memory), and also tries to target many more languages.

Which is the "better" of the two? Both have been production tested at scale, so it really depends on your own situation. If you are primarily interested in the binary serialization, or if you already have an RPC mechanism then Protocol Buffers is a great place to start. Conversely, if you don't yet have an RPC mechanism and are looking for one, then Thrift may be a good choice. (Word of warning: historically, Thrift has not been consistent in their feature support and performance across all the languages, so do some research).

Protocol Buffers vs. Avro, MessagePack

While Thrift and PB differ primarily in their scope, Avro and MessagePack should really be compared in light of the more recent trends: rising popularity of dynamic languages, and JSON over XML. As most every web developers knows, JSON is now ubiquitous, and easy to parse, generate, and read, which explains its popularity. JSON also requires no schema, provides no type checking, and it is a UTF-8 based protocol - in other words, easy to work with, but not very efficient when put on the wire.

MessagePack is effectively JSON, but with efficient binary encoding. Like JSON, there is no type checking or schemas, which depending on your application can be either be a pro or a con. But, if you are already streaming JSON via an API or using it for storage, then MessagePack can be a drop-in replacement.

Avro, on the other hand, is somewhat of a hybrid. In its scope and functionality it is close to PB and Thrift, but it was designed with dynamic languages in mind. Unlike PB and Thrift, the Avro schema is embedded directly in the header of the messages, which eliminates the need for the extra compile stage. Additionally, the schema itself is just a JSON blob - no custom parser required! By enforcing a schema Avro allows us to do data projections (read individual fields out of each record), perform type checking, and enforce the overall message structure.

"The Best" Serialization Format

Reflecting on the use of Protocol Buffers at Google and all of the above competitors it is clear that there is no one definitive, "best" option. Rather, each solution makes perfect sense in the context it was developed and hence the same logic should be applied to your own situation.

If you are looking for a battle-tested, strongly typed serialization format, then Protocol Buffers is a great choice. If you also need a variety of built-in RPC mechanisms, then Thrift is worth investigating. If you are already exchanging or working with JSON, then MessagePack is almost a drop-in optimization. And finally, if you like the strongly typed aspects, but want the flexibility of easy interoperability with dynamic languages, then Avro may be your best bet at this point in time.


more »

Data Serialization + RPC with Avro & Ruby »

Created at: 16.02.2010 20:35, source: igvita.com, tagged: ruby avro hadoop rpc

Any programmer or project worth their salt needs to invent their own serialization, and if they are serious, an RPC framework - or, at least, that is what it seems like. Between, Protocol Buffers, Thrift, BERT, BSON, or even plain JSON, there is no shortage of choices and architectural decisions packed into each one. For that reason, when Doug Cutting (one of the lead developers on Hadoop) first proposed Avro in April of 2009, a healthy dose of skepticism was in order, after all, both Thrift and PB already had thriving communities - why reinvent the wheel? Having said that, the proposal passed and since then Avro has been making good progress.

Reviewing the latest benchmarks shows Avro as fully competitive in both speed and size of the output data to PB and Thrift. Though, neither speed nor size, while critical components, were the motivating reasons for Avro. Interestingly enough, Avro was designed by Doug Cutting with the goal of making it more friendly to dynamic environments (Python, Ruby, Pig, Hive, etc) where code generation is often an unnecessary and an unwanted step. Unlike PB, or Thrift, Avro stores its schema in plain JSON, as part of the output file, which makes it extremely easy to parse (JSON parsers are easy and abundant) and avoid the need for extra IDL definition stubs and compilers (though if you really want to, Avro can generate code stubs as well).

Embedding IDL with Avro

The decision to embed the Avro data schema alongside the binary packed data opens up a number of interesting use cases. First, dynamic frameworks such as Pig, Hive, or any other Hadoop infrastructure (the goal of Avro is to become the standard data exchange and RPC protocol for all of Hadoop), can load and process data on the fly, without looking for or invoking an IDL compiler. Additionally, having the original schema also allow us to do “data projection”: if the reader is only interested in a subset of the data, then it can selectively parse it out of the stream, allowing for faster processing and easy "versioning" support out of the box. Let’s take a look at a simple example:

> avro-writer.rb

SCHEMA = <<-JSON
{ "type": "record",
  "name": "User",
  "fields" : [
    {"name": "username", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "verified", "type": "boolean", "default": "false"}
  ]}
JSON
 
file = File.open('data.avr', 'wb')
schema = Avro::Schema.parse(SCHEMA)
writer = Avro::IO::DatumWriter.new(schema)
dw = Avro::DataFile::Writer.new(file, writer, schema)
dw << {"username" => "john", "age" => 25, "verified" => true}
dw << {"username" => "ryan", "age" => 23, "verified" => false}
dw.close
 

Avro specification provides all the primitives types you would expect (string, bool, double, etc.), and also a number of complex types such as records, enums, arrays, maps, unions, and fixed. Also, default values and sort order can be applied for some of the types. The schema itself is a JSON document, which you can peek at in the header of any serialized Avro file, which means that when it comes to reading the data, we don’t need to know anything about the data itself, or alternatively, only read an available subset:

> avro-reader.rb

# read all data from avro file 
file = File.open('data.avr', 'r+')
dr = Avro::DataFile::Reader.new(file, Avro::IO::DatumReader.new)
dr.each { |record| p record }
 
# extract the username only from the avro serialized file
READER_SCHEMA = <<-JSON
{ "type": "record",
  "name": "User",
  "fields" : [
    {"name": "username", "type": "string"}
 ]}
JSON
 
reader = Avro::IO::DatumReader.new(nil, Avro::Schema.parse(READER_SCHEMA))
dr = Avro::DataFile::Reader.new(file, reader)
dr.each { |record| p record }
 

RPC with Ruby and Avro

The RPC piece of Avro is also pretty straight forward: the protocol is defined as an Avro schema, where both the inputs and the methods (along side with the request / response input and output parameters) are provided inline. In principle, this also means that given the right framework, different clients could easily negotiate different data-formatting on the fly, without having to worry about versioning or conditional code paths. A simple Mail protocol with Avro:

> mail-protocol.js

{
  "namespace": "example.proto",
  "protocol": "Mail",
 
  "types": [{"name": "Message", "type": "record", "fields": [
        {"name": "to", "type": "string"},
        {"name": "from", "type": "string"},
        {"name": "body", "type": "string"}]
       }],
 
   "messages": {
      "replay": { "response": "string", "request": [] },
      "send": { "response": "string", "request": [{"name": "message", "type": "Message"}] }
  }
}
 

Here we defined a "Mail" protocol, which takes as input a record of type “Message”, which in turn, includes three strings: to, from, and body. Additionally, we defined two available methods (send and replay), which our server and client can use, as well as, their input and output parameters. Check out the full Ruby implementations of the server and client in the repo.

Avro, Ruby and Hadoop

The Ruby implementation of Avro still needs a lot of love and polish (it currently has a distinct Python smell to it), but given the growing adoption of Hadoop and the rising popularity of all the adjacent frameworks, Avro is definitely here to stay and for good reasons. So, before you invent another serialization framework, grab the source, build the gem, and give it a try.


more »