Collaborative Filtering with Ensembles »
Created at: 01.09.2009 21:08, source: igvita.com, tagged: Machine Learning ensemble github ml netflix
One of the most interesting insights from the results of the Netflix challenge is that while the algorithms, the psychology, and a good knowledge of statistics go a long way, it was ultimately the cross-team collaboration that ended the contest. "The Ensemble" team, appropriately named for the technique they used to merge their results, consists of over 30 people. Likewise, the runner-up team ("BellKor") is a collaborative effort of several distinct groups that merged their results. It is easy to overlook this fact, except that it is not a one-time occurrence. The leaderboard for the recent GitHub contest also clearly shows that over half of the top ten entries use ensemble techniques!
Few people would argue against the idea of consulting several experts on any complex subject - as the saying goes: "two heads are better than one". Having said that, the ensemble techniques which leverage this exact strategy are rarely a hot topic in the machine learning communities, or at least, up until now. Given their recent success this is likely to change, but perhaps even more importantly, their effectiveness may actually force the 'collaborative filtering' space to become, well, a lot more collaborative.
GitHub Contest: The Good, The Bad
Religious arguments about the merits of each distributed version control system aside, one thing GitHub does really well is allow users to follow and discover new open source projects. In that light, a competition for a recommendations engine makes perfect sense. As part of the challenge you get access to records of over 56K users, 120K repositories, and 440K relationships between them. Great data set, immediate feedback via post-commit hooks, all results available in the open, and a few people even shared their code!
If there was one thing to improve about the contest, then it would have to be the ranking methodology. The goal was to "guess" the 4,788 removed repositories, which, I would argue, is optimizing for the wrong thing. The goal should have been not to guess what users are already watching, but to produce a more general predictor for what users should be watching - the former is a subset of the latter. To see why, think of the case where a general predictor might actually have a set of better recommendations (based on other users' patterns) than the few repos that the GitHub team hid from the dataset for any user.
Ensemble Techniques & Machine Learning
Ranking methodology aside, perhaps the most interesting result of the GitHub contest is due to how it was set up: each submission is a push into a public Git repo, which means that each entry is automatically in the public domain. Unlike the Netflix contest, where all submissions were private and teams had to agree to merge their results, the GitHub contest became a free-for-all and a fertile ground for testing ensembles!
While the topic of Ensembles is a rich one, the general idea is remarkably simple: instead of attempting to build one general model to capture all the subtleties of the data, use multiple predictors and combine their results. For an illustrative example, assume that we were trying to build a model that would separate the data in the diagram above. After a brief examination, we form a simple hypothesis (call this hypothesis family H): we could separate the data using simple geometric shapes such as a circle (H1), or a square (H2). Visually we can clearly see that neither one by itself will do a good job, but applying them both gives us a pretty good approximation while the model remains very simple (we want our model to be simple, for computational reasons). So what do we do? We train two independent classifiers and then merge their results, aka, use an ensemble of classifiers.
The description above sweeps a lot of details under the rug, but the core of the idea is there. Boosting, bagging, consensus aggregation, dynamic classifier selection, and dozens of other techniques have been developed on top of this process and have proven to be very successful. The overarching principle of ensemble techniques is to make each predictor as unique as possible: use a different learning algorithm (decision trees, SVM, SVD), or use a different feature (random subspace method). Then, once you have many individual classifiers, determine a mechanism to join the results (simple method: each predictor votes on each point, then tally up the votes) and make a final prediction using the new classifier.
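To make the "each predictor votes, then tally up the votes" step concrete, here is a minimal sketch in Ruby. The three predictors (`circle`, `square`, `stripe`), their thresholds, and the `majority_vote` helper are all made up for illustration; they only stand in for independently trained classifiers and are not taken from any contest entry.

```ruby
# Hypothetical predictors: each maps a 2D point [x, y] to a label (:in or :out).
# The shapes and thresholds below are illustrative assumptions.
circle = ->(pt) { pt[0]**2 + pt[1]**2 <= 1.0 ? :in : :out }
square = ->(pt) { pt[0].abs <= 0.8 && pt[1].abs <= 0.8 ? :in : :out }
stripe = ->(pt) { pt[1].abs <= 0.5 ? :in : :out }

# Each predictor votes on the point; the label with the most votes wins.
def majority_vote(predictors, point)
  counts = Hash.new(0)
  predictors.each { |predict| counts[predict.call(point)] += 1 }
  counts.max_by { |_label, count| count }.first
end

predictors = [circle, square, stripe]

# The square rejects this point, but the circle and the stripe accept it.
p majority_vote(predictors, [0.9, 0.0]) # prints :in
```

An odd number of predictors over two labels avoids ties; with more labels or an even number of voters you would need an explicit tie-breaking rule, or weighted votes.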
GitHub Ensembles
Perhaps the most illustrative example of this simple technique is John Rowell’s entry into the GitHub contest, which is best described by a snippet right from the source:
require 'nokogiri'
require 'open-uri'

class Crowdsource
  def initialize
    load_leaderboard # scrape github contest leaders
    parse_leaders    # find their top performing results
    fetch_results    # download best results
    cleanup_leaders  # cleanup missing or incorrect data
    crunchit         # build an ensemble
  end
  #...
end

Crowdsource.new
Because all the entries in the GitHub contest are in the public domain (actually, a few people caught on and started deleting their repos immediately after their submissions because of this technique), we can download the best predictions of other users, construct an ensemble, and then submit our own set of results.
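The snippet above elides how the ensemble itself is built, so the following is only a guess at what such a step could look like, not the code from that entry. It assumes each downloaded result set has already been parsed into a hash of user id to guessed repository ids; `build_ensemble`, the `top_n` parameter, and the sample data are hypothetical.

```ruby
# Hypothetical ensemble step: every entry "votes" for the repos it guessed
# for a user, and the most-voted repos become our own guesses.
def build_ensemble(entries, top_n = 10)
  # user id => { repo id => number of entries that guessed it }
  tallies = Hash.new { |hash, user| hash[user] = Hash.new(0) }

  entries.each do |entry|
    entry.each do |user, repos|
      repos.uniq.each { |repo| tallies[user][repo] += 1 }
    end
  end

  tallies.each_with_object({}) do |(user, votes), result|
    # Most votes first; ties are broken by repo id to keep the output stable.
    ranked = votes.sort_by { |repo, count| [-count, repo] }
    result[user] = ranked.first(top_n).map(&:first)
  end
end

# Made-up sample data: three entries guessing repos for users 1 and 2.
entries = [
  { 1 => [10, 20, 30] },
  { 1 => [20, 30, 40] },
  { 1 => [30, 50], 2 => [7] }
]

ensemble = build_ensemble(entries, 3)
# For user 1, repo 30 has three votes, repo 20 has two, and repo 10 wins
# the one-vote tie on id, so the top three are 30, 20, 10.
```

A plain tally treats every entry as equally trustworthy; weighting each entry's votes by its leaderboard score would be a natural refinement.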
Collaborative, Collaborative Filtering: CCF!
While at first glance the ensemble technique may have a flavor of cheating, I think it can actually have a significant effect on how future challenges of this sort could be structured. Don't fight it, embrace it. Instead of focusing on individual effort or insight, it is not hard to imagine much more collaborative, collaborative-filtering competitions (CCFs). Some teams can focus on building great independent predictors, while others can focus on developing and improving the methods for constructing ensembles (a mashup, of sorts). In other words, collaborative filtering has the potential to become a lot more collaborative! And in the meantime, I wish the GitHub guys luck in trying to determine the winner.
Update: GitHub removed most ensemble submissions / released the code to power the challenge.
..and on the seventh day, Science created zsh »
Created at: 31.08.2009 04:08, source: Robby on Rails, tagged: programming terminal osx linux console bash zsh git github commandline
Inspired by some recent posts from Tom on zsh, I decided that I’d do my part to help people give it a whirl. I’ve been using zsh for a few years now and haven’t found myself missing bash.
If you’re interested in taking a few minutes to give zsh a whirl, you’re in luck. I recently reorganized all of my zsh config into a package and tossed it on GitHub to share. My goal was to create a reusable tool that would allow people to get up and running quickly with some of the fun configuration that I’ve come to rely on every day.
For example:
- Auto-complete rake and capistrano tasks
- Git branch names when you’re in a git project directory structure
- Tons of color highlighting (grep, git, etc.)
- Sexy prompts.. (so say me)
- much much more…
I invite you to give Oh My Zsh a whirl, which should take you less than a minute. Just follow the instructions.
Also, Oh My Zsh is Snow Leopard compatible. ;-)
Howdy Rip! »
Created at: 11.06.2009 20:35, source: Robby on Rails, tagged: Ruby on Rails ruby programming ruby rubygems gems git github rubyonrails development
Chris Wanstrath (@defunkt) just posted the following on twitter.
“Hello Rip – http://hellorip.com/”
The Rip project describes itself as, “an attempt to create a next generation packaging system for Ruby.”
One of the cool features is that it supports multiple environments. For example, you can have different Rip environments (with different gem versioning) that are targeted towards specific applications. I have to dig around more through the project, but this looks fascinating.
Check it out at http://hellorip.com/
I’m also curious as to how you think you might be able to start using this.
- What are some ways that you could use Rip? http://heybrainstormr.com/t/pgte
A Public “Thank You” »
Created at: 06.05.2009 09:04, source: Rails Dog, tagged: Uncategorized github phusion railsconf resource_controller
@GreggPollack put together an excellent RubyHeroes presentation before the keynote. In addition to recognizing some of the great contributions these individuals made, Gregg suggested that we all take a moment to thank at least three people whose work has made our lives that much easier. So instead of getting caught up in minor complaints about the accommodations or speeches, I decided to throw out some more positive energy.
GitHub - I’m not going to waste a lot of space talking about how much of an impact GitHub has made on the Rails community. It’s not like these guys are toiling away in anonymity, but their contribution has been so important that I’m just going to go on record as saying “Thanks.”
Phusion - Again, everyone knows what these guys have done for simplifying Rails deployment. Just because they get plenty of “Thank Yous” doesn’t mean I can’t throw my own two cents in here. I also had the opportunity to thank one of the Phusion guys in person at dinner this evening.
ResourceController - James Golick’s plugin finally convinced me to embrace REST. If you’re not familiar with it, it’s a great way to simplify the implementation of your standard RESTful controller. My controllers became way more DRY as a result. Again, I saw James at dinner and shook his hand in person, but it doesn’t hurt to say it here.
Honorary Mentions:
- DHH and RailsCore for Rails (DUH)
- Chad Fowler and company for organizing RailsConf
- Gregg Pollack for the Ruby Heroes presentation, conference videos and the entertaining RailsEnvy podcasts (along with Jason Seifer.)
Git commit-msg for Lighthouse tickets »
Created at: 16.02.2009 21:51, source: Robby on Rails, tagged: programming git lighthouse github workflow bash
A quick follow-up to a post from a few months ago on how our team has a naming convention for git branches when we’re working on Lighthouse tickets (read previous post).
I’ve just put together a quick git hook for commit-msg, which will automatically amend the commit message with the current ticket number when you’re following the branch naming conventions described here.
Just toss this gist into .git/hooks/commit-msg.
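#!/bin/bash
#
# Will append the current Lighthouse ticket number to the commit message automatically
# when you use the LH_* branch naming convention.
# Needs bash rather than plain sh: it relies on [[ =~ ]] and BASH_REMATCH.
#
# Drop into .git/hooks/commit-msg
# chmod +x .git/hooks/commit-msg

# Hooks do not get the terminal on stdin; reattach it so `read` can prompt.
exec < /dev/tty

# Git passes the path of the file holding the commit message as $1.
commit_message_file=$1

# Not on a branch (e.g. detached HEAD): leave the message alone.
ref=$(git symbolic-ref HEAD 2> /dev/null) || exit 0
branch=${ref#refs/heads/}

if [[ $branch =~ LH_(.*) ]]
then
  lighthouse_ticket=${BASH_REMATCH[1]}

  echo "What is the state of ticket #${lighthouse_ticket}? "
  echo "(o)pen "
  echo "(h)old"
  echo "(r)esolved"
  echo "Enter the current state for #${lighthouse_ticket}: (o)"

  # Default to "open" when the answer is empty or unrecognized.
  state="open"
  read -r state_selection

  case $state_selection in
    "o" )
      state="open"
      ;;
    "h" )
      state="hold"
      ;;
    "r" )
      state="resolved"
      ;;
  esac

  # Append the Lighthouse tag to the commit message file.
  echo "[#${lighthouse_ticket} state:${state}]" >> "$commit_message_file"
  exit 0
fi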
Then a quick example of how this works…
➜ bin git:(LH_9912 ♻ ) git ci -m "another test"
What is the state of this ticket?
(o)pen
(h)old
(r)esolved
Enter the current state: (o)
h
Created commit 1ed2713: another test
1 files changed, 3 insertions(+), 1 deletions(-)
Now to see this in action… (screenshot)
Then we’ll check out the git log really quick.
➜ bin git:(LH_9912) git log
commit 1ed271323c4a054fe56e76bddc9ac81d241a1032
Author: Robby Russell <robby@planetargon.com>
Date: Mon Feb 16 12:06:33 2009 -0800
another test
[#9912 state:hold]
Thanks to Andy for helping me figure out how to read user input during a git hook.

