Intuition & Data-Driven Machine Learning »

Created at: 20.04.2011 21:19, source: igvita.com, tagged: Machine Learning ai ml

Clever algorithms and pages of mathematical formulas filled with probability and optimization theory are usually the associations that get invoked when you ask someone to describe the fields of AI and Machine Learning. Granted, there is definitely an abundance of both, but this mental picture also tends to obscure some of the more interesting and recent developments in these fields: data driven learning, and the fact that you are often better off developing simple intuitive insights instead of complicated domain models which are meant to represent every attribute of the problem.

Back in early 2001, Eric Brill and Michele Banko published an incredibly interesting paper (Mitigating the Paucity of Data Problem), which showed that simply increasing the training set by orders of magnitude yielded significant improvements in the performance of common Machine Learning algorithms. In fact, in certain cases, you are simply better off working on getting more data, then spending your time on improving the algorithm - think about that for a minute! We can even take this model to an extreme: data is the algorithm. How does Google Translate support 60+ languages? They don't employ any linguists. Instead, the translation is done through pure data-driven learning.

GoGaRuCo 2010: Machine Learning Trends & Patterns

The goal for my presentation at GoGaRuCo 2010 (slides) was to highlight the following themes: first, the algorithm is important, but we tend to overemphasize its importance; simple, intuitive insights are usually the underpinnings of much more (seemingly) complicated algorithms; data can be an algorithm in by itself. None of these ideas are novel or my own. In fact, much of the recent credit goes to Peter Norvig for popularizing these ideas and driving their adoption at Google (Translate, Sets, Spelling correction, and so on) and throughout the industry at large.

"It's really quiet simple"

Perhaps the largest mental barrier for most developers is that Machine Learning "is hard". It doesn't have to be. In fact, having an intuitive mental model is often more helpful than having a "tight proof of the error boundary" when it comes to implementing one of these systems in real life.

One of my favorite examples, and one that I shared in my presentation, is that of clustering. Namely, to perform a clustering, we need to define a function to measure the "pairwise distance" between all pairs of objects. Can you think of a generic way to do so? How about using your favorite Zlib library?

require 'zlib'
require 'pp'
 
files = Dir[ARGV[0] + '/*']
 
def deflate(*files)
  z = Zlib::Deflate.new
  z.deflate(files.collect {|f| open(f).read}.join("\\n"), Zlib::FINISH).size
end
 
pairwise = files.combination(2).collect do |f1, f2|
  a, b = deflate(f1), deflate(f2)
  both = deflate(f1, f2)
 
  {:files => [f1, f2], :score => (a+b)-both}
end
 
pp pairwise.sort {|a,b| b[:score] <=> a[:score]}[0,20]
 
# > ruby clusterer.rb /path/to/dir
 

The insight is simple: if any two objects, whatever data they main contain, have overlapping data inside, then the compressed filesize of these two files when they are joined together should be smaller than the sum of their individually compressed files. Work through a few examples in your head. Once you prove it to yourself, it will seem remarkably simple. Best of all, this is a generic insight which allows you to cluster virtually any type of data. In fact, the example above will faithfully detect and cluster different languages, separate png and mp3 files, and so on.

Ensembles & Collaborative learning

That is not to say that having a solid understanding of the fundamentals of ML and AI fields is not important - it absolutely is. However, given a choice between a more complicated algorithm and more data, try more data first. Further, instead of trying to develop a complicated algorithm, which encompasses many attributes of your problem, try developing an ensemble method which builds on dozens of simple, intuitive insights. In other words, think simpler, get more data, and see where it takes you - chances are, you will be surprised.


more »

Collaborative Filtering with Ensembles »

Created at: 01.09.2009 21:08, source: igvita.com, tagged: Machine Learning ensemble github ml netflix

One of the most interesting insights from the results of the Netflix challenge is that while the algorithms, the psychology, and good knowledge of statistics goes a long way, it was ultimately the cross-team collaboration that ended the contest. "The Ensemble" team, appropriately named for the technique they used to merge their results consists of over 30 people. Likewise, the runner up team ("BellKor") is a collaborative effort of several distinct groups that merged their results. It is easy to overlook this fact, except that it is not a one-time occurrence. The leaderboard for the recent GitHub contest also clearly shows over half of the top ten entries as ensemble techniques!

Few people would argue against the idea of consulting several experts on any complex subject - as the saying goes: "two heads are better than one". Having said that, the ensemble techniques which leverage this exact strategy are rarely a hot topic in the machine learning communities, or at least, up until now. Given their recent success this is likely to change, but perhaps even more importantly, their effectiveness may actually force the 'collaborative filtering' space to become, well, a lot more collaborative.

GitHub Contest: The Good, The Bad

Religious arguments about the merits of each distributed version control system aside, one thing GitHub does really well is allow users to follow and discover new open source projects. In that light, a competition for a recommendations engine makes perfect sense. As part of the challenge you get access to records of over 56K users, 120K repositories, and 440K relationships between each. Great data set, immediate feedback via post-commit hooks, all results available in the open, and a few people even shared their code!

If there was one thing to improve about the contest, then it would have to be the ranking methodology. The goal was to "guess" the 4,788 removed repositories, which I would argue, is optimizing for the wrong thing. The goal should have been not to guess what users are already watching, but to produce a more general predictor for what users should be watching - the former is a subset of the latter. To see why, think of the case where a general predictor might actually have a set of better recommendations (based on other user's patterns) than the few repos that the GitHub team hid from the dataset for any user.

Ensemble Techniques & Machine Learning

Ranking methodology aside, perhaps the most interesting result of the GitHub contest is due to how it was setup: each submission is a push into a public Git repo, which means that each entry is automatically in the public domain. Unlike the Netflix contest, where all submissions were private and teams had to agree to merge their results, the GitHub contest became a free-for-all and a fertile crowd for testing ensembles!

While the topic of Ensembles is a rich one, the general idea is remarkably simple: instead of attempting to build one general model to capture all the subtleties of the data, use multiple predictors and combine their results. For an illustrative example, assume that we were trying to build a model that would separate the data in the diagram above. After a brief examination, we form a simple hypothesis (call this hypothesis family H): we could separate the data using simple geometric shapes such as a circle (H1), or a square (H2). Visually we can clearly see that neither one by itself will do a good job, but applying them both gives us a pretty good approximation while the model remains very simple (we want our model to be simple, for computational reasons). So what do we do? We train two independent classifiers and then merge their results, aka, use an ensemble of classifiers.

The description above sweeps a lot of details under the rug, but core of the idea is there. Boosting, bagging, consensus aggregation, dynamic classifier selection, and dozens of other techniques have been developed on top of this process and have proven to be very successful. The overarching principle of ensemble techniques is to make each predictor as unique as possible: use a different learning algorithm (decision trees, svm, svd), or use a different feature (random subspace method). Then, once you have many individual classifiers, determine a mechanism to join the results (simple method: each predictor votes on each point, then tally up the votes) and make a final prediction using the new classifier.

GitHub Ensembles

Perhaps the most illustrative example of this simple technique is John Rowell’s entry into the GitHub contest, which is best described by a snippet right from the source:

> ensemble.rb

require 'nokogiri'
require 'open-uri'
 
class Crowdsource
  def initialize
    load_leaderboard  # scrape github contest leaders
    parse_leaders     # find their top performing results
    fetch_results     # download best results
    cleanup_leaders   # cleanup missing or incorrect data
    crunchit          # build an ensemble
  end
  #...
end
 
Crowdsource.new
 

Because all the entries in the GitHub contest are in the public domain (actually, a few people caught on and started deleting their repos immediately after their submissions because of this technique), we can download the best predictions of other users, construct an ensemble, and then submit our own set of results.

Collaborative, Collaborative Filtering: CCF!

While on first examination the ensemble technique may have a flavor of cheating, I think it can actually have a significant effect on how future challenges of this sort could be structured. Don't fight it, embrace it. Instead of focusing on individual effort or insight, it is not hard to imagine much more collaborative, collaborative-filtering competitions (CCF's). Some teams can focus on building great independent predictors, while others can focus on developing and improving the methods for constructing ensembles (a mashup, of sorts). In other words, collaborative filtering has the potential to become a lot more collaborative! And in the meantime, I wish the GitHub guys luck in trying to determine the winner.

Update: GitHub removed most ensemble submissions / released the code to power the challenge.


more »