Netflix is known for using quantitative analysis to improve its performance. In 2006 the company announced a $1 million prize for the first team that could improve its recommendation system by 10%. The recommendation system, which suggests movies to individual customers, predicts whether someone will enjoy a movie based on how much they liked or disliked other movies. Netflix provided anonymized rating data for mining, along with a test data set for evaluating how closely predicted ratings matched subsequent actual ratings. This set off a flurry of activity among individuals, teams, and coalitions of teams. In mid-2009, a team called BellKor's Pragmatic Chaos was the first to reach the goal, improving the system by 10.09%. Under the rules, the other teams then had 30 days to improve upon BellKor's method. Just before the deadline, another team, The Ensemble, submitted a method that improved the rating system by 10.10%. BellKor did not have time to respond.
However, shortly thereafter, BellKor's captain, Yehuda Koren, posted a note on his blog saying that Netflix had contacted him: his team had the best test accuracy and would be declared the winner. Why? It appears that Netflix kept two verification test sets: one that was the basis for the public standings and another that was kept secret. The winner was selected by performance on the secret data set. So BellKor, which appeared to come in second on the public verification set, seems poised to win on the hidden test set. Apparently The Ensemble gained its extra improvement by overfitting its algorithm to the public test data set; when evaluated on the unseen data, its algorithm was inferior.
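The Ensemble's fate illustrates a general selection effect: when many candidate models are compared on the same public test set, the best public score is optimistically biased, and the apparent winner typically regresses on fresh data. The sketch below simulates this in Python under stated assumptions; it uses hypothetical synthetic ratings rather than the actual Netflix data, and the candidate count and noise levels are illustrative choices, not anything from the competition.

```python
# A minimal sketch of leaderboard overfitting, using synthetic ratings
# (hypothetical data, not the actual Netflix Prize sets).
import numpy as np

rng = np.random.default_rng(0)

n_public, n_hidden = 1_000, 1_000
truth_public = rng.normal(3.5, 1.0, n_public)   # "true" ratings, public set
truth_hidden = rng.normal(3.5, 1.0, n_hidden)   # "true" ratings, hidden set

def rmse(pred, truth):
    return np.sqrt(np.mean((pred - truth) ** 2))

# Simulate many candidate models that are equally good in expectation:
# each predicts the truth plus independent noise of the same magnitude
# on both sets.
n_candidates = 500
public_scores, hidden_scores = [], []
for _ in range(n_candidates):
    public_scores.append(rmse(truth_public + rng.normal(0, 0.9, n_public), truth_public))
    hidden_scores.append(rmse(truth_hidden + rng.normal(0, 0.9, n_hidden), truth_hidden))

public_scores = np.array(public_scores)
hidden_scores = np.array(hidden_scores)

# Crown the "leaderboard winner" using the public set alone.
winner = public_scores.argmin()
print(f"winner's public RMSE: {public_scores[winner]:.4f}")  # unusually good: lucky noise
print(f"winner's hidden RMSE: {hidden_scores[winner]:.4f}")  # regresses toward ~0.9
print(f"average hidden RMSE:  {hidden_scores.mean():.4f}")
```

Because every candidate has the same expected error, the model with the lowest public-set RMSE simply got lucky on that set; its hidden-set score is ordinary. The same logic explains why a hidden test set, consulted only once at the end, gives a more trustworthy ranking than a public leaderboard that teams can probe repeatedly.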