Routing with Ruby & ZeroMQ Devices »
Created at: 17.11.2010 20:18, source: igvita.com, tagged: Architecture rabbitmq zeromq
ZeroMQ sockets provide message-oriented messaging, support for multiple transports, transparent setup and teardown, and an entire array of routing patterns via different socket types - see quick refresher on ZeroMQ. The combination of these features, as well as the fact that we can bind or connect a single socket to multiple endpoints is what makes ZeroMQ "topology aware". We can encode and construct varying routing strategies at will, without the requirement for external routers, or message brokers.
Building a multi-threaded application? With ZeroMQ you can assemble a high-performance, in-process pub-sub fanout, a worker pipeline, or a custom worfklow combining the two with just a few lines of code. In fact, if you work with ZeroMQ you will quickly find yourself reusing similar sets of patterns. That's where ZeroMQ devices come in. The library ships with support for a few common routing patterns (queue, streamer, and forwarder), and also defines a ZDCF JSON specification to simplify their assembly.
Routing Devices: Queue, Streamer, Forwarder
If RabbitMQ is a specific message-broker implementation that codifies a set of allowed messaging patterns, then ZeroMQ is the low-level toolkit, which allows us to assemble these patterns at will. This flexibility eliminates the need for explicit message routers within our infrastructure, allowing us to assemble dynamic, P2P routing topologies. However, this also does not mean that the standalone message broker does not have its place within our infrastructure. In fact, in practice you want the flexibility of both: sometimes you want to run the router within the same runtime, and at other times, you may want to run it as a standalone process.
ZeroMQ ships with several devices out of the box which implement the most common routing patterns. ZMQ Queue is a simple forwarding device for request-reply messaging, which can allow you to aggregate requests from many requesters, and distribute them to one or more reply sockets - a chokepoint. ZMQ Streamer serves a similar function, but is designed to handle the pipeline pattern instead. And last but not least, is the ZMQ Forwarder, which allows you to aggregate data from multiple publishers, and distribute these messages via a fanout to all the connected subscribers.
Getting started with ZMQ Devices
The above patterns are simple but powerful, and best of all you can run them in memory, or as standalone processes! When you build and install ZeroMQ, you will find three ZMQ binaries on your system: zmq_forwarder, zmq_queue, and zmq_streamer. Simply pass them an XML file outlining your socket configuration, and you are ready to go.
Where would we use a standalone broker? Imagine a pub-sub scenario with multiple data centers. Each remote subscriber could bind directly to the publisher, but that would also mean a lot of expensive network hops and many open connections. Instead, we can run a pub-sub forwarding device (ZMQ Streamer) in each data center, which would bind to the remote socket, stream the messages to the local data center and make them available to the local subscribers!
Assembling Custom ZMQ Devices
While the devices shipped with ZeroMQ cover some of the most common use cases, the real power of the framework is in its flexibility to allow us to assemble arbitrary messaging patterns: mixing different socket types, defining custom routing strategies based on meta-data of request, and so on. This is where ZeroMQ’s ZDCF configuration syntax, and a Ruby DSL can do wonders:
d = ZMQ::Device::Builder.new( context: { iothreads: 2 }, main: { type: :queue, frontend: { type: :SUB, option: { swap: 25 }, connect: ["tcp://127.0.0.1:5555"]}, backend: { type: :PUB, bind: ["tcp://127.0.0.1:5556"] } } ) d.main.start do loop do msg = ZMQ::Message.new frontend.recv msg backend.send msg end end
In the example above, we used ZDCF’s JSON configuration syntax to open all the socket connections, and then defined a simple relay routing strategy in our start block. Of course, you could also implement much more interesting patterns: open more than two sockets, examine the contents of the message, relay the message to select endpoints, and so on! Grab the DSL, and give it a try.
RabbitMQ and ZeroMQ work together
It is easy to mistake RabbitMQ and ZeroMQ for competitors. In reality, their features are complimentary, and the two can work together in very helpful ways! For example, by installing the rmq-0mq plugin into your RabbitMQ instance, you can effectively turn RabbitMQ into a ZeroMQ device!
RabbitMQ has proven to be an incredibly stable, high-performance message broker with support for persistence and a variety of routing topologies. So, if you are looking for a ZeroMQ device capable of handling thousands of connections, with support for persistence, and a number of other features, then the combination of ZeroMQ and RabbitMQ is definitely worth a close look.
more »
Open Source Search with Lucene & Solr »
Created at: 22.10.2010 19:09, source: igvita.com, tagged: Architecture lucene search solr
If you have ever had the need to add full-text indexing or search capability to one of your projects, chances are you will be familiar with Apache Lucene or one of its many derivatives such as PyLucene, Lucene.NET, Ferret (Ruby), or Lucy (C port). With history dating back to the early 2000's, Lucene has become one of the most feature complete information retrieval (IR) libraries with extensive support for dozens of tokenizers, analyzers, query parsers and scoring algorithms.
Having recently attended Lucene Revolution in Boston, it was clear from the presentations, as well as the hallway track, that the project has found its way into applications that span the whole gamut: embedded use, desktop and enterprise search, and even large-scale distributed deployments. Best of all, the number of projects around it, as well as the planned core improvements continues to impress - if you are looking for an open source search solution, Lucene is definitely worth a close look.
Lucene in the wild: Salesforce, LinkedIn, Twitter, et al.
Salesforce, LinkedIn and just recently Twitter are all examples of large-scale Lucene deployments that many of us use on a day-to-day basis. Salesforce started with Lucene back in 2002 and today manages an 8TB+ index (~20 billion documents). Their cluster consists of roughly 16 machines, which in turn contain many small (sharded) Lucene indexes. Currently, they handle ~4000 queries per second (qps), and provide an incremental indexing model where the new user data is searchable within ~3 minutes.
LinkedIn is another Lucene power user, servicing 350+ million queries/week. Their "people search" product currently peaks at ~200 qps on a single node while providing consistent sub 100ms latency. To deliver the real-time experience the LinkedIn team wrote their own sorting framework, and implemented their own faceted search (Bobo) and a real-time indexing system (Zoie).
Twitter is a recent Lucene convert - prior to the switch their search API was powered by (gasp!) MySQL. Michael Busch gave a great talk ("A Billion Queries Per Day") where he described the challenges and the changes they had to make to the internals of Lucene to support their query load, which currently peaks at 1 billion queries/day (~12,000 qps). Not surprisingly, their entire index is held in memory and currently holds up to a billion of the most recent tweets - about a weeks worth of data. Good news is, many of their optimizations should find their way upstream into Lucene core within the next year.
Of course, the above is anything but a complete list. iTunes is another notable user, said to be handling up to 800 queries/sec, and at PostRank we are currently managing a 1TB+ index (growing at ~40GB a week) with over 1.2 billion documents.
Lucene + HTTP: Solr Server
If Lucene is a low-level IR toolkit, then Solr is the fully-featured HTTP search server which wraps the Lucene library and adds a number of additional features: additional query parsers, HTTP caching, search faceting, highlighting, and many others. Best of all, once you bring up the Solr server, you can speak to it directly via REST XML/JSON API's. No need to write any Java code or use Java clients to access your Lucene indexes. Solr and Lucene began as independent projects, but just this past year both teams have decided to merge their efforts - all around, great news for both communities. If you haven't already, definitely take Solr for a spin.
Real-time Search with Lucene
Real-time search was a big theme at Lucene Revolution. Unlike many other IR toolkits, Lucene has always supported incremental index updates, but unfortunately it also required an fsync (flush new documents from memory to disk) and a reopen of the "index reader" to make those documents visible to the incoming requests. Needless to say, fsyncs are expensive, hence the recommended patterns have been: commit after N seconds, or commit after adding N documents. These strategies work in most cases, but become major bottlenecks when you are trying to index a high-velocity stream of updates.
To address this issue, the core team has been working on the "Near Realtime" (NRT) branch, where the in-memory index will also become searchable as soon as it is written to - this alleviates the need to wait for an fsync. However, even with the NRT branch in place, the periodic fsync and an index reader reopen can remain a major bottleneck, as both of those operations can be CPU and memory intensive.
An alternative to NRT is Zoie, a project developed and maintained by the LinkedIn team. Similar to the NRT branch, it makes all the updates immediately visible to the incoming queries, but it also implements an incremental indexing and flush model which alleviates the need to perform large, explicit commits. Zoie is currently deployed in production at LinkedIn as a standalone service (comes with HTTP server), but there is also some work for integrating it with Solr.
Last but not least, realtime search is obviously a big deal for Twitter. At the moment, it takes approximately 10 seconds for a tweet to appear in the search results from the time you publish it to the service - that's pretty fast. To achieve this, all Lucene indexes are maintained in memory in many small segments (up to 16 million tweets per segment) and are heavily optimized for Twitter's small document structure. Once again, take a look at Michael's presentation for more details.
Distributed Search
Out of the box, Lucene does not provide any support for distributed indexes - your application can open multiple index readers, but all of that has to be coordinated manually. Solr on the other hand allows hosting and querying multiple "cores" (indexes), as well as adds support for distributed queries, where results are aggregated from multiple servers, are appropriately scored, and then returned to the user. However, even with Solr, managing cores, shuffling data (rebalancing), and failover logic is left to the application.
SolrCloud is attempting to address some of these issues by embedding a Zookeeper instance to simplify the creation and management of distributed Solr clusters. Specifically, the aim is to provide central configuration, discovery, load balancing, and failover. The project is still in incubation, but a few people at Lucene Revolution have reported to already have it running in production! Promising.
As competition to SolrCloud there are also the Katta and ElasticSearch projects. Out of the two, ElasticSearch appears to have the higher momentum and implements a distributed, RESTful search engine built on top of Lucene. ElasticSearch speaks native JSON, supports automated master election and failover, index replication, atomic operations (no need to commit) and looks like a solid overall competitor to SolrCloud.
Solr, Lucene and NoSQL
Instead of running Lucene or Solr in standalone mode, both are also easily integrated within other applications. For example, Lucandra is aiming to implement a distributed Lucene index directly on top of Cassandra. Jake Luciani, the lead developer of the project, has recently joined the Riptano team as a full-time developer, so do not be surprised if Cassandra will soon support a Lucene powered IR toolkit as one of its features!
At the same time, Lily is aiming to transparently integrate Solr with HBase to allow for a much more flexible query and indexing model of your HBase datasets. Unlike Lucandra, Lily is not leveraging HBase as an index store (see HBasene for that), but runs standalone, albeit tightly integrated Solr servers for flexible indexing and query support.
In summary ...
Lucene may not be on top of your high profile open-source radar, but it definitely has the momentum and the developer community behind it to be considered as such. The projects listed above are only but a small sample of all the great work being done by the community. To get started, check out some of the great books on Lucene and Solr, or dive into one of the online tutorials. Finally, kudos to Lucid Imagination for organizing Lucene Revolution this year, and to all the speakers for great presentations!
more »
Case for Smartphone Web Activity Feeds »
Created at: 05.10.2010 20:32, source: igvita.com, tagged: Architecture mobile pubsubhubbub
Sensor networks have historically been the domain of the military and large-scale commercial applications - powerful, but expensive to setup and operate. At the same time, Morgan Stanley is reporting that by 2012 the smartphone market will outship the global PC market (notebook, netbook, desktop, etc), which will easily create the largest, always-on, broadband-enabled sensor network to date. Only one problem, most mobile OS vendors (Apple, RIM, Google, etc) are app store obsessed at the moment, each fighting for the developer mindshare to stockpile their native app arsenals. But, for a second, imagine if instead of rushing to build mobile apps which pull data off the web to local devices, what if we also had the infrastructure that could efficiently push the data back to the web? Billions of smartphone devices pushing real-time updates to our web applications - now that would be exciting.
Motivation & Use Cases
The question is not whether pull or push is more efficient (you need both), but enabling push from the smartphone as a platform capability could enable an entirely new class of applications: instead of focusing on an experience of a singular mobile user, you could aggregate data from many agents and build "network aware" applications! For the sake of an example, imagine we had the following:
http://www.apple.com/mobile/feed/{subscriber_id}.xml
http://www.blackberry.com/mobile/feed/{subscriber_id}.rss
http://www.android.com/mobile/feed/{subscriber_id}.js
Yes, that's an RSS feed - it is as simple as that. Each vendor already offers their own platform for "push notifications" to deliver data to the phone, so why not offer the reverse? What if each phone also had a representative notification feed that any web app could access? Calendar notifications, battery notifications, geo notifications, and so on, it is all fair game! Add a subscription and an access control management layer on top, and all of the sudden, any web-developer is now but a URL away from enriching the experience of any mobile user: process a notification from the phone, push an update back. Not to mention, the capability to aggregate data from thousands of mobile devices for trends analysis, data-mining applications and so forth - a global mobile sensor network at your disposal!
Mobile Activity Feeds via PubSubHubbub
Maintaining direct tethered links to each individual mobile device is obviously an expensive and a technically challenging proposition, especially if we are after real-time updates. But the good news is, we have already solved this problem in a different context. WebHooks allow us to establish callback (push) semantics between web-services, and PubSubHubbub solves the problem of efficiently delivering real-time notifications from a single publisher (mobile device, in this case) to many subscribers: the phone pushes a single update to the platform provider and the PSHB hub does all the hard work of distributing the individual updates to each subscriber. Distributed architecture, simple protocols, efficient delivery, and it scales well.
Smarter Web via Mobile Sensor Networks
Native apps are great, but why keep the data locked to within a single device? If we could efficiently aggregate activity feeds from thousands of mobile subscribers with rich geo and contextual meta-data (user, or device generated), then imagine all the numerous mash-ups and data-mining applications that could be built on top!
In fact, if the rumours of Facebook looking at building their own mobile device/OS are true (really, not as crazy as it sounds), then I would argue, it is in part because they understand the (commercial and technical) potential of these activity feeds better than anyone else - what seems completely foreign to Apple, RIM and others is completely obvious to the web-natives such as Facebook and Twitter. Question is, who is going to get there first?
more »
ZeroMQ: Modern & Fast Networking Stack »
Created at: 03.09.2010 20:13, source: igvita.com, tagged: Architecture sockets zeromq
Berkeley Sockets (BSD) are the de facto API for all network communication. With roots from the early 1980's, it is the original implementation of the TCP/IP suite, and arguably one of the most widely supported and critical components of any operating system today. BSD sockets that most of us are familiar with are peer-to-peer connections, which require explicit setup, teardown, choice of transport (TCP, UDP), error handling, and so on. And once you solve all of the above, then you are into the world of application protocols (ex: HTTP), which require additional framing, buffering and processing logic. In other words, it is no wonder that a high-performance network application is anything but trivial to write.
Wouldn't it be nice if we could abstract some of the low-level details of different socket types, connection handling, framing, or even routing? This is exactly where the ZeroMQ (ØMQ/ZMQ) networking library comes in: "it gives you sockets that carry whole messages across various transports like inproc, IPC, TCP, and multicast; you can connect sockets N-to-N with patterns like fanout, pubsub, task distribution, and request-reply". That's a lot buzzwords, so lets dissect some of these concepts in more detail.
Message-Oriented vs. Streams & Datagrams
ZeroMQ sockets provide a layer of abstraction on top of the traditional socket API, which allows it to hide much of the everyday boilerplate complexity we are forced to repeat in our applications. To begin, instead of being stream (TCP), or datagram (UDP) oriented, ZeroMQ communication is message-oriented. This means that if a client socket sends a 150kb message, then the server socket will receive a complete, identical message on the other end without having to implement any explicit buffering or framing. Of course, we could still implement a streaming interface, but doing so would require an explicit application-level protocol.
# create zeromq request / reply socket pair ctx = ZMQ::Context.new req = ctx.socket ZMQ::REQ rep = ctx.socket ZMQ::REP # connect sockets: notice that reply can connect first even with no server! rep.connect('tcp://127.0.0.1:5555') req.bind('tcp://127.0.0.1:5555') req.send ZMQ::Message.new('hello' * (1024*1024)) msg = ZMQ::Message.new rep.recv(msg) msg.copy_out_string.size # => 5242880
Switching from a streaming/datagram to a message-oriented model is seemingly a minor change, but one that carries a lot of implications. Because ZeroMQ will handle all of the buffering and framing for you, the client and server applications become an order of magnitude simpler, more secure, and much easier to write.
Transport Agnostic Sockets
ZeroMQ sockets are also transport agnostic: there is a single, unified API for sending and receiving messages across all protocols. By default, there is support for in-process, IPC, multicast, and TCP, and switching between all of them is as simple as changing the prefix on your connection string. This means we can start with IPC for fast local communication, and then switch to TCP at any point for distributed cases with minimal effort. As an added benefit, ZeroMQ handles all connection setup, teardown, and reconnect logic under the hood. That's about as simple as it gets.
Routing & Topology Aware Sockets
ZeroMQ sockets are routing and network topology aware. Since we don't have to explicitly manage the peer-to-peer connection state - all of that is abstracted by the library, as we saw above - nothing stops a single ZeroMQ socket from binding to two distinct ports to listen to for inbound requests, or in reverse, send data to two distinct sockets via a single API call. How does ZeroMQ know who to listen to or push data to? That depends on the type of the socket pair we pick for our application: Request/Reply, Publish/Subscribe, Pipeline, and Pair (alpha).
ctx = ZMQ::Context.new # create publisher socket, and publish to two pipes! pub = ctx.socket(ZMQ::PUB) pub.bind('tcp://127.0.0.1:5000') pub.bind('inproc://some.pipe') # generate random message, ex: '1 9' Thread.new { loop { pub.send [rand(2), rand(10)].join(' ') } } # create a consumer, and listen for messages whose key is '1' sub = ctx.socket(ZMQ::SUB) sub.connect('inproc://some.pipe') sub.setsockopt(ZMQ::SUBSCRIBE, '1') loop { p sub.recv } # => "1 9" ...
In the case of a Publish/Subscribe socket pair (unidirectional communication from publisher to subscribers), the publisher socket will replicate the message to all connected clients (local IPC clients, remote TCP listeners, etc). In the case of a Request/Reply socket pair (bi-directional communication: server, client), the messages will be automatically load balanced by the socket generating the request to one of the connected clients. Finally, a Push/Pull socket pair (pipeline: unidirectional, load-balanced) will allow you to simulate a staged message passing architecture with built-in load balancing.
ZeroMQ allows us to encode the topology of our services directly via the socket API, without having to define and maintain a separate coordination layer of routers, load balancers, and message brokers. Of course, nothing stops us from using any of these tools in combination with ZeroMQ, but in many cases, the ZeroMQ route can yield better performance and much simpler operational complexity.
ZeroMQ under the hood
By default, all communication in ZeroMQ is done in asynchronous fashion. To enable this, anytime you create an application with ZeroMQ, you will have to explicitly declare the number of background I/O threads - in most cases, a single dedicated I/O thread will suffice. All of the thread logic is handled by the C++ core of the library itself, but it does mean that at very minimum, your application will have two scheduled threads.
This asynchronous processing model allows ZeroMQ to abstract all connection setup, teardown, reconnect logic, and also to minimize message delivery latency: no blocking means the messages can be dispatched, delivered and queued (sender or receiver side) in parallel to the regular processing done by your application. Of course, you can also control the queuing behavior of ZeroMQ sockets by setting an allowed memory bound and even a swap size for each socket. Hence, you can simulate the blocking API if desired, but asynchronous I/O is the default. Combined with zero copy semantics, optimized framing, and no locking data structures, the end result is a high performance and throughput oriented messaging middleware with a modern API.
ZeroMQ in the Wild: Mongrel 2
Mongrel2 offers an interesting case-study of applying ZeroMQ to the world of web-servers: all inbound requests are routed by Mongrel2 via a "Push" socket which automatically load-balances the requests to connected handlers. The handlers, in turn, process the incoming requests (via Pull socket) and publish them to a "Pub" socket, to which the Mongrel2 server itself is subscribed to and is listening for its process ID (via a topic filter).
Hence, the processing is not tied to a simple request-response cycle we are commonly used to, where a single backend has to handle the full request start to finish. Instead, we can setup several processing stages (via pipeline pattern), and emit our reply only after it is processed by all stages.
Ambitious and worth exploring
Needless to say, ZeroMQ is an ambitious project, and this short introduction only scratches the surface of the full feature set. The stated goal of ZeroMQ is to "become part of the standard networking stack, and then the Linux kernel". Whether they will succeed, remains to be seen, but it is definitely a very promising and arguably a much needed layer of abstraction on top of the "traditional" BSD sockets. ZeroMQ makes writing high performance networking applications incredibly easy and fun.
The best way to get started with ZeroMQ is to work through some hands-on examples - the concepts are not new, but the ease with which you can compose them takes some getting use to. For Rubyists, Andrew Cholakian has put together a great set of examples to get you started (check out dripdrop as well), and for everyone else, head to the ZeroMQ site, grab your language bindings and dive into the code.
more »
Multi-core, Threads & Message Passing »
Created at: 18.08.2010 20:21, source: igvita.com, tagged: Architecture message-passing multi-core threads
Moore's Law marches on, the transistor counts are continuing to increase at the predicted rate and will continue to do so for the foreseeable future. However, what has changed is where these transistors are going: instead of a single core, they are appearing in multi-core designs, which place a much higher premium on hardware and software parallelism. This is hardly news, I know. However, before we get back to arguing about the "correct" parallelism & concurrency abstractions (threads, events, actors, channels, and so on) for our software and runtimes, it is helpful to step back and take a closer look at the actual hardware and where it is heading.
Single Core Architecture & Optimizations

The conceptual architecture of a single core system is deceivingly simple: single CPU, which is connected to a block of memory and a collection of other I/O devices. Turns out, simple is not practical. Even with modern architectures, the latency of a main memory reference (~100ns roundtrip) is prohibitively high, which combined with highly unpredictable control flow has led CPU manufacturers to introduce multi-level caches directly onto the chip: Level 1 (L1) cache reference: ~0.5 ns; Level 2 (L2) cache reference: ~7ns, and so on.
However, even that is not enough. To keep the CPU busy, most manufacturers have also introduced some cache prefetching and management schemes (ex: Intel's SmartCache), as well as invested billions of dollars into branch prediction, instruction pipelining, and other tricks to squeeze every ounce of performance. After all, if the CPU has a separate floating point and an integer unit, then there is no reason why two threads of execution could not simultaneously run on the same chip - see SMT. Remember Intel's Hyperthreading? As another point of reference, Sun's Niagara chips are designed to run four execution threads per core.
But wait, how did threads get in here? Turns out, threads are a way to expose the potential (and desired) hardware parallelism to the rest of the system. Put another way, threads are a low-level hardware and operating system feature, which we need to take full advantage of the underlying capabilities of our hardware.
Architecting for the Multi-core World
Since the manufacturers could no longer continue scaling the single core (power, density, communication), the designs have shifted to the next logical architecture: multiple cores on a single chip. After all, hardware parallelism existed all along, so the conceptual shift wasn't that large - shared memory, multiple cores, more concurrent threads of execution. Only one gotcha, remember those L1, L2 caches we introduced earlier? Turns out, they may well be the Achilles' heel for multi-core.

If you were to design a multi-core chip, would you allow your cores to share the L1, or L2 cache, or should they all be independent? Unfortunately, there is one answer to this question. Shared caches can allow higher utilization, which may lead to power savings (ex: great for laptops), as well as higher hit rates in certain scenarios. However, that same shared cache can easily create resource contention if one is not careful (DMA is a known offender). Intel's Core Duo and Xeon processors use a shared L2, whereas AMD's Opteron, Athlon, and Intel's Pentium D opted out for independent L1's and L2's. Even more interestingly, Intel's recent Itanium 2 gives each core an independent L1, L2, and an L3 cache! Different workloads benefit from different layouts.
As Phil Karlton once famously said: "There are only two hard things in Computer Science: cache invalidation and naming things," and as someone cleverly added later, "and off by one errors". Turns out, cache coherency is a major problem for all multi-core systems: if we prefetch the same block of data into an L1, L2, or L3 of each core, and one of the cores happens to make a modification to its cache, then we have a problem - the data is now in an inconsistent state across the different cores. We can't afford to go back to main memory to verify if the data is valid on each reference (as that would defeat the purpose of the cache), and a shared mutex is the very anti-pattern of independent caches!
To address this problem, hardware designers have iterated over a number of data invalidation and propagation schemes, but the key point is simple: the cores share a bus or an interconnect over which messages are propagated to keep all of the caches in sync (coherent), and therein lies the problem. While, the numbers vary, the overall consensus is that after approximately 32 cores on a single chip, the amount of required communication to support the shared memory model leads to diminished performance. Put another way, shared memory systems have limited scalability.
Turtles all the way down: Distributed Memory
So if cache coherence puts an upper bound on the number of cores we can support within the shared memory model, then lets drop the shared memory requirement! What if, instead of a monolithic view of the memory, each core instead had its own, albeit much smaller main memory? Distributed memory model has the advantage of avoiding all of the cache coherency problems we listed above. However, it is also easy to imagine a number of workloads where the distributed memory will underperform the shared memory model.
There doesn't appear to be any consensus in the industry yet, but if one had to guess, then a hybrid model seems likely: push the shared memory model as far as you can, and then stamp it out multiple times on a chip, with a distributed memory interconnect - it is cache and interconnect turtles all the way down. In other words, while message passing may be a choice today, in the future, it may well be a requirement if we want to extract the full capabilities of the hardware.
Turtles all the way up: Web Architecture
Most interesting of all, we can find the exact same architecture patterns and their associated problems in the web world. We start with a single machine running the app server and the database (CPU and main memory), which we later split into separate instances (multiple app servers share a remote DB, aka 'multi-core'), and eventually we shard the database (distributed memory) to achieve the required throughput. The similarity of the challenges and the approaches seems hardly like a coincidence. It is turtles all the way down, and it is turtles all the way up.
Threads, Events & Message Passing
As software developers, we are all intimately familiar with the shared memory model and the good news is: it is not going anywhere. However, as the core counts continue to increase, it is also very likely that we will quickly hit diminishing returns with the existing shared memory model. So, while we may disagree on whether threads are a correct application level API (see process calculi variants), they are also not going anywhere - either the VM, the language designer, or you yourself will have to deal with them.
With that in mind, the more interesting question to explore is not which abstraction is "correct" or "more performant" (one can always craft an optimized workload), but rather how do we make all of these paradigms work together, in a context of a simple programming model? We need threads, we need events, and we need message passing - it is not a question of which is better.
more »
