
Combining Preferences for Pizza Toppings to Predict Sales

The world’s most expensive pizza, auctioned for $4,200 as a charity gift in 2007, was topped with edible gold, lobster marinated in cognac, champagne-soaked caviar, smoked salmon, and medallions of venison. While most of us prefer (or can only afford to prefer) more humble ingredients, our preferences are similarly diverse. Ranker has a Tastiest Pizza Toppings list that asks people to express their preferences. At the time of writing there are 29 re-ranks of this list, and a total of 64 different ingredients mentioned. Edible gold, by the way, is not one of them.

Equipped with this data about popular pizza toppings, we were interested in finding out whether pizzerias are actually selling the toppings that people say they want. We also wanted to see if we could predict sales for individual ingredients by looking at one list that combined all of the responses about pizza topping preferences. This “Ultimate List” contains all of the toppings that were listed in individual lists (known as re-ranks) and is ordered in a way that reflects how many times each ingredient was mentioned and where it ranked on individual lists. Many of the re-ranks only list a few ingredients, so it is fitting to combine lists and rely on the “wisdom of the crowd” to get a more complete ranking of the many possible ingredients.

As a real-world test of how people’s preferences correspond to sales, we used Strombolini’s New York Pizzeria’s list of its top 10 selling ingredients. Pepperoni, cheese, sausage, and mushrooms topped the list, followed by pineapple, bacon, ham, shrimp, onion, and green peppers. All of these ingredients, save for shrimp, appear in the Ranker lists, so we considered the 9 overlapping ingredients and measured how close each user’s preference list was to the pizzeria’s sales list.

To compare lists, we used a standard statistical measure known as Kendall’s tau, which counts how many swaps of adjacent items (known as pair-wise swaps) would be needed to turn one list into the other. A Kendall’s tau of zero means the two lists are exactly the same; the larger the Kendall’s tau value, the further apart the two lists are.
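As a sketch of the computation (the helper name here is ours, not from the original analysis), Kendall’s tau distance can be found by counting the pairs of items that the two lists order differently, which equals the number of adjacent pair-wise swaps needed to make them identical:

```python
from itertools import combinations

def kendall_tau_distance(list_a, list_b):
    """Count discordant pairs: item pairs the two lists rank in opposite order.

    This count equals the number of adjacent pair-wise swaps needed
    to turn one list into the other.
    """
    pos_a = {item: i for i, item in enumerate(list_a)}
    pos_b = {item: i for i, item in enumerate(list_b)}
    return sum(
        1
        for x, y in combinations(list_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )

# Identical lists are 0 swaps apart; fully reversed lists hit the
# maximum of n * (n - 1) / 2 swaps.
print(kendall_tau_distance(["a", "b", "c"], ["a", "b", "c"]))  # 0
print(kendall_tau_distance(["a", "b", "c"], ["c", "b", "a"]))  # 3
```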

The figure shows, using little stick people, the Kendall’s tau distances between users’ lists and Strombolini’s sales list. The green dot corresponds to a perfect tau of zero, and the red dot marks the highest possible tau (reached when two lists are the exact reverse of each other). The dotted line is provided as a reference to show how likely each Kendall’s tau value is by chance (that is, how often different Kendall’s tau values occur for random orderings of the ingredients). It is clear that there are large differences in how close individual users’ lists came to the sales-based list, and that many users produced rankings quite different from it.
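The chance reference line can be approximated by simulation. A minimal sketch (our own illustration, not the original analysis code): shuffle the 9 overlapping ingredients many times and tally the Kendall’s tau of each random ordering against a fixed reference list:

```python
import random
from itertools import combinations

def kendall_tau_distance(list_a, list_b):
    # Number of item pairs the two lists rank in opposite order.
    pos_b = {item: i for i, item in enumerate(list_b)}
    return sum(
        1
        for i, j in combinations(range(len(list_a)), 2)
        if pos_b[list_a[i]] > pos_b[list_a[j]]
    )

random.seed(42)
reference = list(range(9))   # stand-ins for the 9 overlapping toppings
max_tau = 9 * 8 // 2         # 36, the fully reversed case (the red dot)

# Tally tau values for random orderings; their distribution traces out
# the dotted chance line in the figure.
taus = []
for _ in range(10_000):
    shuffled = reference[:]
    random.shuffle(shuffled)
    taus.append(kendall_tau_distance(shuffled, reference))

print(max_tau, sum(taus) / len(taus))  # chance average is max_tau / 2 = 18
```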

Using this model, the combined list came out to be: cheese, pepperoni, bacon, mushrooms, sausage, onion, pineapple, ham, and green peppers. This is a Kendall’s tau of 7 pair-wise swaps from the Strombolini list, as shown in the figure by the blue dot representing the crowd. This means the combined list is closer to the sales list than all but one of the individual users.
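As a check on that number, using only the two orderings reported above, counting the pairs that the crowd’s combined list and the sales list order differently reproduces the distance of 7:

```python
from itertools import combinations

# The 9 overlapping ingredients, in sales order and in the crowd's combined order.
sales = ["pepperoni", "cheese", "sausage", "mushrooms", "pineapple",
         "bacon", "ham", "onion", "green peppers"]
crowd = ["cheese", "pepperoni", "bacon", "mushrooms", "sausage",
         "onion", "pineapple", "ham", "green peppers"]

sales_pos = {item: i for i, item in enumerate(sales)}
tau = sum(
    1
    for x, y in combinations(crowd, 2)  # x precedes y in the crowd list...
    if sales_pos[x] > sales_pos[y]      # ...but y precedes x in the sales list
)
print(tau)  # 7
```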

Our “wisdom of the crowd” analysis, combining all the users’ lists, used the same approach we previously applied to predicting celebrity deaths using Ranker data. It is a “Top-N” variant of the psychological approach developed in our work modeling decision-making in ranking tasks, and it has the nice property of naturally incorporating individual differences.

This analysis is a first example of a couple of interesting ideas. One is that it is possible to extract relatively complete information from a set of incomplete opinions provided by many people. The other is that this combined knowledge can be compared to, and may be predictive of, real-world ground truths, like whether more pizzas have bacon or green peppers on them. It may never explain, however, why someone would waste champagne-soaked caviar as a pizza topping.

Predicting Box Office Success a Year in Advance from Ranker Data

A number of data scientists have attempted to predict movie box office success from various datasets. For example, researchers at HP Labs were able to use tweets around the release date, plus the number of theaters a movie was released in, to predict 97.3% of the variance in movie box office revenue in the first weekend. The Hollywood Stock Exchange, which lets participants bet on box office revenues and infers a prediction from those bets, predicts 96.5% of the variance in opening-weekend box office revenue. Wikipedia activity predicts 77% of the variance in box office revenue, according to a collaboration of European researchers. Ranker runs lists of anticipated movies each year, often more than a year in advance, so the question I wanted to answer with our data was how predictive Ranker data is of box office success.

However, since the above researchers have already shown that online activity at the time of the opening weekend predicts box office success during that weekend, I wanted to build upon that work and see if Ranker data could predict box office receipts well in advance of opening weekend. Below is a simple scatterplot of results, showing that Ranker data from the previous year predicts 82% of the variance in movie box office revenue for movies released the next year.

Predicting Box Office Success from Ranker Data

The above graph uses votes cast in 2011 to predict revenues from our Most Anticipated 2012 Films list. While our data is not as predictive as Twitter data collected leading up to opening weekend, the remarkable thing about this result is that most votes (8,200 votes from 1,146 voters) were cast 7-13 months before the actual release date. I look forward to doing the same analysis on our Most Anticipated 2013 Films list at the end of this year.
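The 82% figure is a coefficient of determination (R²) from a simple linear fit. A sketch of that computation, with purely hypothetical vote and revenue numbers (not the actual 2011 Ranker votes or 2012 box-office results):

```python
import numpy as np

# Hypothetical illustration only: these vote counts and grosses are made up.
votes   = np.array([120, 340, 560, 800, 150, 430, 700, 260], dtype=float)
revenue = np.array([ 45, 110, 180, 260,  60, 140, 230,  90], dtype=float)  # $M

# Ordinary least-squares line of revenue on votes, then R^2
# (the share of variance in revenue that the fit explains).
slope, intercept = np.polyfit(votes, revenue, 1)
predicted = slope * votes + intercept
ss_res = ((revenue - predicted) ** 2).sum()
ss_tot = ((revenue - revenue.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```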

– Ravi Iyer