Why Topsy/Twitter Data may never predict what matters to the rest of us

Recently Apple paid a reported $200 million for Topsy, and some speculate that the purchase is meant to improve recommendations for products consumed on Apple devices, leveraging the data that Topsy has from Twitter.  This makes perfect sense to me, but the utility of Twitter data in predicting what people want is easy to overstate, largely because people often confuse bigger data with better data.  There are at least two reasons why there is a fairly hard ceiling on how much Twitter data will ever allow one to predict about what regular people want.

1.  Sampling – Twitter has a ton of data, but with daily usage of around 10%, the people who use Twitter are a very specific set of people.  Sample size isn’t the issue here, as there is plenty of data; the problem is who is in the sample.  Even if you correct for demographics, the psychographic profile of people who want to share their opinions publicly and regularly (far more people have heard of Twitter than actually use it) is far too distinctive to generalize to the average person, in the same way that surveys of landline users cannot be used to predict what psychographically distinct cellphone users think.

2. Domain Comprehensiveness – The opinions that people share on Twitter are biased by the medium, such that they do not represent the full spectrum of things people care about.  There are tons of opinions on entertainment, pop culture, and links that people want to promote, since these are easy to share quickly, but very little on people’s important life goals, the qualities we admire most in a person, or any domain where opinions are likely to be more nuanced.  Even where people do share opinions in those domains, they are likely to be skewed by the 140-character limit.
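The sampling point can be made concrete with a minimal simulation (all numbers below are invented for illustration): when the people who share opinions publicly differ systematically from everyone else, a sample drawn only from that subgroup stays biased no matter how large it gets.

```python
import random

random.seed(0)

# Hypothetical population of 100k people, each with a "sentiment" score.
# A small "sharer" subgroup (10%) holds systematically different opinions.
population = []
for _ in range(100_000):
    sharer = random.random() < 0.10
    sentiment = random.gauss(0.0, 1.0) + (1.5 if sharer else 0.0)
    population.append((sharer, sentiment))

true_mean = sum(s for _, s in population) / len(population)

# A huge sample drawn only from sharers remains biased, regardless of size:
# more data from the same skewed pool does not converge on the truth.
sharer_scores = [s for is_sharer, s in population if is_sharer]
sharer_mean = sum(sharer_scores) / len(sharer_scores)

print(round(true_mean, 2), round(sharer_mean, 2))
```

The gap between the two means is the sampling bias; collecting ten times more tweets from the same psychographic would not shrink it.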

Twitter (and by extension, companies such as Topsy and DataSift that use its data) has a treasure trove of information, but people working on next-generation recommendations and semantic search should realize that, given the above limitations, it is a small part of the overall puzzle.  The volume of information gives you a very precise measure of a very specific group of people’s opinions about very specific things, leaving out the vast majority of people’s opinions about the vast majority of things.  When you add in the bias introduced by analyzing 140-character natural language, a great deal of the variance in recommendations will likely have to be explained by other sources.

At Ranker, we have similar sampling issues, in that we collect much of our data at Ranker.com, but we are actively broadening our reach through our widget program, which now collects data on thousands of partner sites.  Our ranked-list methodology certainly has biases too, which we attempt to mitigate by combining voting and ranking data.  The key is not the volume of data, but rather the diversity of data, which helps mitigate the bias inherent in any particular sampling/data collection method.

Similarly, people using Twitter data would do well to consider issues of data diversity and not be blinded by large numbers of users and data points.  Certainly Twitter is bound to be a part of understanding consumer opinions, but the size of the dataset alone will not guarantee that it will be a central part.  Given these issues, either Twitter will start to diversify the ways that it collects consumer sentiment data or the best semantic search algorithms will eventually use Twitter data as but one narrowly targeted input of many.

– Ravi Iyer


Hierarchical Clustering of a Ranker list of Beers

This is a guest post by Markus Pudenz.

Ranker is currently exploring ways to visualize the millions of votes collected on various topics each month.  I’ve recently begun using hierarchical cluster analysis to produce taxonomies (visualized as dendrograms), and applied these techniques to Ranker’s Best Beers from Around the World list.  A dendrogram allows one to visualize relationships in voting patterns (scroll down to see what a dendrogram looks like).  Hierarchical clustering breaks the list into related groups based on users’ voting patterns, grouping together items that were voted on similarly by the same users.  The algorithm is agglomerative, meaning it starts with individual items and combines them iteratively until one large cluster (all of the beers in the list) remains.
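As a rough sketch of the agglomerative idea (the beer names and vote vectors below are toy stand-ins, not Ranker’s actual data), the following starts with singleton clusters and repeatedly merges the two closest ones until one cluster remains:

```python
import math

# Toy stand-in for a vote matrix: each beer is a vector of per-user
# votes (+1 up, -1 down, 0 no vote).
votes = {
    "Guinness":          [1, 1, 1, -1, 0],
    "Guinness Original": [1, 1, 1, -1, -1],
    "Miller Lite":       [-1, -1, 0, 1, 1],
    "Coors Light":       [-1, 0, -1, 1, 1],
}

def dist(a, b):
    # Euclidean distance between two beers' vote vectors
    return math.dist(votes[a], votes[b])

# Agglomerative (bottom-up) clustering with single linkage:
# start with singletons, repeatedly merge the two closest clusters.
clusters = [[name] for name in votes]
merges = []
while len(clusters) > 1:
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: min(dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]),
    )
    merged = clusters[i] + clusters[j]
    merges.append(merged)
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print(merges)  # the earliest merges are the closest relationships
```

The order of the merges is exactly what the height axis of a dendrogram encodes: the lower a junction sits, the earlier (and tighter) the merge.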

Every beer in our dendrogram is related to another at some level, whether in the original cluster or further down the dendrogram.  See the height axis on the left side?  The lower a cluster sits on that axis, the closer the relationship between the beers.  For example, the cluster containing Guinness and Guinness Original is the lowest in this dendrogram, indicating that these two beers have the closest relationship based on the voting patterns.  Regarding our list, voters have the option to Vote Up or Vote Down any beer they want.  Let’s start at the top of the dendrogram and work our way down.

Hierarchical Clustering of Beer Preferences

Looking at the first split of the clusters, one can observe that the cluster on the right contains beers that would generally be considered well-known, including Guinness, Sam Adams, Heineken and Corona.  In fact, the right cluster includes seven of the top ten beers from the list.  That most of the popular beers fall in this cluster indicates a strong order effect, with voters more likely to select popular beers when ranking their favorites: if someone selects a beer that is in the top ten, then another beer they select is also more likely to be in the top ten.

As we examine the right cluster further, its first split divides it into two smaller clusters.  In the left of these, we can clearly see, unsurprisingly, that a drinker who likes Guinness is more likely to vote for another variety of Guinness; this cluster is comprised almost entirely of Guinness varieties, the one exception being Murphy’s Irish Stout.  The right cluster lists a larger variety of beer makers, including Sam Adams, Stella Artois and Pyramid, and unlike the left cluster, none of its beers are stouts.  The only brewer here with multiple varieties is Sam Adams, with Boston Lager and Octoberfest, meaning drinkers in this cluster were not as brand loyal and were more likely to select varieties from different brewers.  Reviewing this first split of the dendrogram, there is a clearly defined divide between drinkers who prefer a heavier beer (stout) and those who prefer lighter beers like lagers, pilsners, pale ales or hefeweizens.

Conversely, drinkers of beers in the left cluster are more likely to vote for other beers that are less popular; only three of the top ten beers appear here.  In addition, because of its larger size, this cluster spans a more varied range of beer styles and brewers than the right cluster.  The left cluster splits into three smaller clusters before splitting further.  One clearly distinct cluster is the second of these, comprised almost entirely of Belgian-style beers, the only exception being Pliny the Elder, an IPA.  La Fin du Monde is a Belgian-style tripel from Quebec, with the remaining brewers from Belgium.  One split within this cluster is comprised entirely of beer varieties from Chimay, indicating a strong relationship: voters who select Chimay are more likely to also select a different style from Chimay when ranking their favorites.  Our remaining clusters have a little more variety.  The first cluster, the smallest of the three, has a strong representation from California, with varieties from Stone, Sierra Nevada and Anchor Steam taking four of its six nodes; Stone IPA and Stone Arrogant Bastard Ale have the strongest relationship here.  The third cluster, the largest of the three, has even more variety than the first, with an especially strong relationship between Hoegaarden and Leffe.

I was also curious as to whether the beers in the top ten were associated with larger or smaller breweries.  As the following list shows, there is an even split between larger conglomerates like AB InBev, Diageo and MillerCoors and independent breweries like New Belgium and Sierra Nevada.

  1. Guinness (Diageo)
  2. Newcastle (Heineken)
  3. Sam Adams Boston Lager (Boston Beer Company)
  4. Stella Artois (AB InBev)
  5. Fat Tire (New Belgium Brewing Company)
  6. Sierra Nevada Pale Ale (Sierra Nevada Brewing Company)
  7. Blue Moon (Miller Coors)
  8. Stone IPA (Stone Brewing Company)
  9. Guinness Original (Diageo)
  10. Hoegaarden Witbier (AB InBev)

Markus Pudenz


Examining Regional Voting Differences with Ranker’s Polling Widget

Ranker has a new program where we offer a polling widget to partner sites that want the engagement of a poll in list format (as opposed to the standard radio-button poll).  Currently, sites that use our poll (e.g. TheNextWeb or CBC) are seeing 20-50% of visitors engaging with the poll and an increase in returning visitors who want to keep track of results.  We also give partners prominent placement on Ranker.com (details of that here), but a less obvious benefit is the insight into one’s users that can be gained from the data behind a poll.  To illustrate what is possible, I’m going to use data from one of our regular widget users, Phish.net, who posted this poll on Phish’s best summer concert jams.

One piece of data that Ranker can give partners is a regional breakdown of voters.  Unsurprisingly, there were strong regional differences in voting behavior, with voters from the Northeast often choosing a jam from the New Jersey show, voters from the West Coast often choosing a jam from the Hollywood Bowl show, voters from the South often choosing a jam from the Maryland show, voters from the Midwest often choosing a jam from the Chicago show, and voters from the Mountain region often choosing a jam from the show at The Gorge.  However, the interesting thing to me was that the leading jam in every region was Tweezer – Lake Tahoe from July 31st.  As someone who believes that better crowdsourced answers are produced by aggregating across bias, and who has only been to one Phish concert, I’m definitely going to have to check out this jam.  Perhaps the answer is obvious to more experienced Phish fans, but the results of the poll are certainly instructive to the more casual music fan who wants a taste of Phish.

Below are the results of the poll in graphical format.  Notice how the shows cluster based on venue and geography except for Tweezer – Lake Tahoe which is directly in the center of the graph.

If you’re interested in running a widget poll on your site, the benefits are more clearly spelled out here and you can email us at “widget at ranker.com”.  We’d love to provide similar region based insights for your polls as well.

– Ravi Iyer



Rankings are the Future of Mobile Search

Did you know that Ranker is one of the top 100 web destinations for mobile per Quantcast, ahead of household names like The Onion and People magazine?  We are ranked #520 in the non-mobile world.  Why do we do better with mobile users than with people using a desktop computer?  I’ve made this argument for a while, but I’m hardly an authority, so I was heartened to see Google making a similar argument.

This embrace of mobile computing impacts search behavior in a number of important ways.

First, it makes the process of refining search queries much more tiresome. …While refining queries is never a great user experience, on a mobile device (and particularly on a mobile phone) it is especially onerous.  This has provided the search engines with a compelling incentive to ensure that the right search results are delivered to users on the first go, freeing them of laborious refinements.

Second, the process of navigating to web pages (is) a royal pain on a hand-held mobile device.

This situation provides a compelling incentive for the search engines to circumvent additional web page visits altogether, and instead present answers to queries – especially straightforward informational queries – directly in the search results.  While many in the search marketing field have suggested that the search engines have increasingly introduced direct answers in the search results to rob publishers of clicks, there’s more than a trivial case to be made that this is in the best interest of mobile users.  Is it really a good thing to compel an iPhone user to browse to a web page – which may or may not be optimized for mobile – and wait for it to load in order to learn the height of the Eiffel Tower?

As a result, if you ask your mobile phone for the height of a famous building (Taipei 101 in the below case), it doesn’t direct you to a web page.  Instead it answers the question itself.

That’s great for a question that has a single answer, but an increasing number of searches are not for objective facts with a single answer, but rather for subjective opinions where a ranked list is the best result.  Consider the below chart showing the increase in searches for the term “best”.  A similar pattern can be found for most any adjective.

So if consumers are increasingly doing searches on mobile phones, requiring a concise list of potential answers to questions with more than one answer, they naturally are going to end up at sites which have ranked lists…like Ranker. As such, a lot of Ranker’s future growth is likely to parallel the growth of mobile and the growth of searches for opinion based questions.

– Ravi Iyer

An Opinion Graph of the World’s Beers

One of the strengths of Ranker‘s data is that we collect such a wide variety of opinions from users that we can put opinions about a wide variety of subjects into a graph format.  Graphs are useful as they let you go beyond the individual relationships between items and see overall patterns.  In anticipation of Cinco de Mayo, I produced the below opinion graph of beers, based on votes on lists such as our Best World Beers list.  Connections in this graph represent significant correlations between sentiment towards connected beers, which vary in terms of strength.  A layout algorithm (force atlas in Gephi) placed beers that were more related closer to each other and beers that had fewer/weaker connections further apart.  I also ran a classification algorithm that clustered beers according to preference and colored the graph according to these clusters.  Click on the below graph to expand it.

Ranker’s Beer Opinion Graph
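The edge-building step described above can be sketched in a few lines.  In this toy version (the beer names and vote vectors are invented stand-ins, not Ranker’s actual data), each beer gets a vector of per-user votes, pairwise Pearson correlations are computed, and only strong correlations are kept as graph edges; a layout tool like Gephi’s Force Atlas would then place connected beers near each other.

```python
# Hypothetical vote vectors: +1 upvote, -1 downvote, 0 no vote per user.
votes = {
    "Stone IPA":   [1, 1, -1, 1, -1, 1],
    "Chimay":      [1, 1, -1, 0, -1, 1],
    "Miller Lite": [-1, -1, 1, -1, 1, -1],
    "Heineken":    [0, -1, 1, -1, 1, 0],
}

def pearson(xs, ys):
    # Pearson correlation between two vote vectors
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Keep only strong sentiment correlations as edges of the opinion graph.
names = list(votes)
edges = [
    (a, b, round(pearson(votes[a], votes[b]), 2))
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(pearson(votes[a], votes[b])) > 0.7
]
for edge in edges:
    print(edge)
```

Negative edges (e.g. a light beer versus a craft beer) are what push clusters apart in the layout, which is why the light beers end up opposite the craft beers rather than the dark ones.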

One of the fun things about graphs is that different people will see different patterns.  Among the things I learned from this exercise are:

  • The opposite of light beer, from a taste perspective, isn’t dark beer.  Rather, light beers like Miller Lite are most opposite craft beers like Stone IPA and Chimay.
  • Coors Light is the light beer that is closest to the mainstream cluster.  Stella Artois, Corona, and Heineken are also reasonable bridge beers between the main cluster and the light beer world.
  • The classification algorithm revealed six main taste/opinion clusters, which I would label: Really Light Beers (e.g. Natural Light), Lighter Mainstream Beers (e.g. Blue Moon), Stout Beers (e.g. Guinness), Craft Beers (e.g. Stone IPA), Darker European Beers (e.g. Chimay), and Lighter European Beers (e.g. Leffe Blonde).  The interesting parts of the classifications are the cases on the edge, such as how Newcastle Brown Ale appeals to both Guinness and Heineken drinkers.
  • Seeing beers graphed according to opinions made me wonder if companies consciously position their beers accordingly.  Is Pyramid Hefeweizen successfully appealing to the Sam Adams drinker who wants a bit of European flavor?  Is Anchor Steam supposed to appeal to both the Guinness drinker and the craft beer drinker?  I’m not sure I know enough about the marketing of beers to answer this, but I’d be curious whether beer companies place their beers in the same space that this opinion graph does.

These are just a few observations based on my own limited beer drinking experience.  I tend to be more of a whiskey drinker, and hope more of you will vote on our Best Tasting Whiskey list, so I can graph that next.  I’d love to hear comments about other observations that you might make from this graph.

– Ravi Iyer

Ranker Uses Big Data to Rank the World’s 25 Best Film Schools

NYU, USC, UCLA, Yale, Juilliard, Columbia, and Harvard top the rankings.

Does USC or NYU have a better film school?  “Big data” can provide an answer to this question by linking data about movies and the actors, directors, and producers who have worked on specific movies, to data about universities and the graduates of those universities.  As such, one can use semantic data from sources like Freebase, DBPedia, and IMDB to figure out which schools have produced the most working graduates.  However, what if you cared about the quality of the movies they worked on rather than just the quantity?  Educating a student who went on to work on The Godfather must certainly be worth more than producing a student who received a credit on Gigli.

Leveraging opinion data from Ranker’s Best Movies of All-Time list in addition to widely available semantic data, Ranker recently produced a ranked list of the world’s 25 best film schools, based on credits on movies within the top 500 movies of all time.  USC produces the most film credits by graduates overall, but when film quality is taken into account, NYU (208 credits) actually produces more credits among the top 500 movies of all time than USC (186 credits).  UCLA, Yale, Juilliard, Columbia, and Harvard take places 3 through 7 on Ranker’s list.  Several professional schools that focus on the arts also place in the top 25 (e.g. London’s Royal Academy of Dramatic Art), as do some well-located high schools (New York’s Fiorello H. LaGuardia High School and Beverly Hills High School).

The World’s Top 25 Film Schools

  1. New York University (208 credits)
  2. University of Southern California (186 credits)
  3. University of California – Los Angeles (165 credits)
  4. Yale University (110 credits)
  5. Juilliard School (106 credits)
  6. Columbia University (100 credits)
  7. Harvard University (90 credits)
  8. Royal Academy of Dramatic Art (86 credits)
  9. Fiorello H. LaGuardia High School of Music & Art (64 credits)
  10. American Academy of Dramatic Arts (51 credits)
  11. London Academy of Music and Dramatic Art (51 credits)
  12. Stanford University (50 credits)
  13. HB Studio (49 credits)
  14. Northwestern University (47 credits)
  15. The Actors Studio (44 credits)
  16. Brown University (43 credits)
  17. University of Texas – Austin (40 credits)
  18. Central School of Speech and Drama (39 credits)
  19. Cornell University (39 credits)
  20. Guildhall School of Music and Drama (38 credits)
  21. University of California – Berkeley (38 credits)
  22. California Institute of the Arts (38 credits)
  23. University of Michigan (37 credits)
  24. Beverly Hills High School (36 credits)
  25. Boston University (35 credits)

“Clearly, there is a huge effect of geography, as prominent New York and Los Angeles based high schools appear to produce more graduates who work on quality films compared to many colleges and universities,” says Ravi Iyer, Ranker’s Principal Data Scientist, a graduate of the University of Southern California.

Ranker is able to combine factual semantic data with an opinion layer because Ranker is powered by a Virtuoso triple store with over 700 million triples of information that are processed into an entertaining list format for users on Ranker’s consumer facing website, Ranker.com.  Each month, over 7 million unique users interact with this data – ranking, listing and voting on various objects – effectively adding a layer of opinion data on top of the factual data from Ranker’s triple store. The result is a continually growing opinion graph that connects factual and opinion data.  As of January 2013, Ranker’s opinion graph included over 30,000 nodes with over 5 million edges connecting these nodes.
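The idea of layering opinion data over a factual triple store can be illustrated with a toy example (the schema, entities, and vote counts below are invented; Ranker’s actual Virtuoso store and queries will differ): factual triples answer “which entities are films, and who worked on them,” while the opinion layer supplies vote tallies that let you rank those entities.

```python
# Invented factual triples, standing in for a store like Virtuoso.
facts = [
    ("The Godfather", "directedBy", "Francis Ford Coppola"),
    ("The Godfather", "type", "Film"),
    ("Goodfellas", "type", "Film"),
]

# Invented opinion layer: up/down vote tallies per entity.
opinions = {
    "The Godfather": {"up": 9500, "down": 400},
    "Goodfellas": {"up": 7200, "down": 600},
}

def match(pattern):
    """Very small SPARQL-like triple pattern match: None is a wildcard."""
    s, p, o = pattern
    return [t for t in facts
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Join the factual layer ("which entities are films?") with the opinion
# layer (vote tallies) to rank films by upvote share.
films = [s for s, _, _ in match((None, "type", "Film"))]
ranked = sorted(
    films,
    key=lambda f: opinions[f]["up"] / (opinions[f]["up"] + opinions[f]["down"]),
    reverse=True,
)
print(ranked)
```

The film-school list above is the same join at larger scale: graduate-to-movie triples from the factual layer, movie rankings from the opinion layer.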

– Ravi Iyer

Predicting Box Office Success a Year in Advance from Ranker Data

A number of data scientists have attempted to predict movie box office success from various datasets.  For example, researchers at HP Labs were able to use tweets around the release date, plus the number of theaters a movie was released in, to predict 97.3% of the variance in first-weekend box office revenue.  The Hollywood Stock Exchange, which lets participants bet on box office revenues and infers a prediction, predicts 96.5% of the variance in opening-weekend revenue.  Wikipedia activity predicts 77% of the variance, according to a collaboration of European researchers.  Ranker runs lists of anticipated movies each year, often more than a year in advance, so the question I wanted to analyze in our data was how predictive Ranker data is of box office success.

However, since the above researchers have already shown that online activity at the time of the opening weekend predicts box office success during that weekend, I wanted to build upon that work and see if Ranker data could predict box office receipts well in advance of opening weekend.  Below is a simple scatterplot of results, showing that Ranker data from the previous year predicts 82% of variance in movie box office revenue for movies released in the next year.

Predicting Box Office Success from Ranker Data

The above graph uses votes cast in 2011 to predict revenues from our Most Anticipated 2012 Films list.  While our data is not as predictive as Twitter data collected leading up to opening weekend, the remarkable thing about this result is that most votes (8,200 votes from 1,146 voters) were cast 7-13 months before the actual release date.  I look forward to doing the same analysis on our Most Anticipated 2013 Films list at the end of this year.
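“Predicts 82% of the variance” here means the R² of a regression of revenue on votes.  A minimal sketch of that computation (the vote counts and revenues below are made-up illustration data, not Ranker’s actual figures):

```python
# Made-up illustration data: prior-year votes vs. next-year revenue ($M).
votes   = [120, 340, 560, 800, 950, 1500]
revenue = [35, 90, 150, 210, 260, 410]

n = len(votes)
mx, my = sum(votes) / n, sum(revenue) / n

# Ordinary least squares fit of revenue on votes.
beta = (sum((x - mx) * (y - my) for x, y in zip(votes, revenue))
        / sum((x - mx) ** 2 for x in votes))
alpha = my - beta * mx
pred = [alpha + beta * x for x in votes]

# R² = proportion of variance in revenue explained by the fit.
ss_res = sum((y - p) ** 2 for y, p in zip(revenue, pred))
ss_tot = sum((y - my) ** 2 for y in revenue)
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```

An R² of 0.82 on real data means the scatter of points sits fairly close to the fitted line, as in the plot above.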

– Ravi Iyer


Crowdsourcing Objective Answers to Subjective Questions – Nerd Nite Los Angeles

A lot of the questions on Ranker are subjective, but that doesn’t mean that we cannot use data to bring some objectivity to this analysis.  In the same way that Yelp crowdsources answers to subjective questions about restaurants and TripAdvisor crowdsources answers to subjective questions about hotels, Ranker crowdsources answers to a broader assortment of relatively subjective questions such as the Tastiest Pizza Toppings, the Best Cruise Destination, and the Worst Way to Die.

A few weeks ago, I did an informal talk on the Wisdom of Crowds approach that Ranker takes to crowdsource such answers at a Los Angeles bar as part of “Nerd Nite”.  The gist of it is that one can crowdsource objective answers to subjective questions by asking diverse groups of people questions in diverse ways.  Greater diversity, when aggregated effectively, enables the error inherent in answering any subjective question to be minimized.  For example, we know intuitively that relying on only the young or only the elderly or only people in cities or only people who live in rural areas gives us biased answers to subjective questions.  But when all of these diverse groups agree on a subjective question, there is reason to believe that there is an objective truth that they are responding to.  Below is the video of that talk.
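The aggregation argument can be shown with a small simulation (the groups, biases, and numbers are invented for illustration): any single group mis-estimates the truth in its own direction, but pooling across groups with opposing biases cancels much of the error.

```python
import random

random.seed(1)
truth = 50.0  # the "objective" answer we are trying to recover

# Two hypothetical groups with opposite systematic biases.
def group_estimates(bias, n=2000):
    return [truth + bias + random.gauss(0, 10) for _ in range(n)]

young = group_estimates(bias=+8)    # systematically overshoots
elderly = group_estimates(bias=-8)  # systematically undershoots

def mean(xs):
    return sum(xs) / len(xs)

# Each group alone is biased; aggregating across the diverse groups
# cancels the opposing biases and leaves an estimate near the truth.
pooled = mean(young + elderly)
print(round(mean(young), 1), round(mean(elderly), 1), round(pooled, 1))
```

This is the sense in which agreement across diverse groups is evidence of an underlying objective answer: the biases that remain are the ones shared by every group.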

If you want to see a more formal version of this talk, I’ll be speaking at greater length on Ranker’s methodologies at the Big Data Innovation Summit in San Francisco this Friday.

– Ravi Iyer


A Battle of Taste Graphs: Baltimore Ravens Fans vs. San Francisco 49ers Fans

Super Bowl Sunday is a day when two cities and two fan groups are competing for bragging rights, even as the Baltimore Ravens and San Francisco 49ers themselves do the playing.  You might be interested in understanding these teams’ fans better through an exploration of their fans’ taste graphs, from a recent post on our data blog, which examines correlations between votes on lists like the Top NFL Teams of 2012 and non-sports lists like our list of delicious vegetables (yum!).

For one, there is absolutely zero consensus where music is concerned.  49ers fans listen to an eclectic mixture of genres: up-and-coming rappers like Kendrick Lamar sit right next to INXS and 90s Brit-poppers Pulp.  Where the Ravens are concerned, classic rock is still king: Hendrix, CCR, and Neil Young are an undisputed top three.  The 49ers also have the Ravens utterly beat in terms of culinary taste.  Monterey Jack and Cosmos are fairly clear favorites among fans, while Baltimore’s fans stick to staples: coffee, bell peppers, and ham are the only food items that correlated strongly enough to even be tracked.

 A Snapshot from Ranker’s Data Mining Tool

TV tastes also varied between the two fan bases: Ravens fans stuck almost exclusively to comedic fare (Pinky and the Brain, Rugrats, Mythbusters and Louie correlated strongly), while 49ers fans stuck to more structured, dramatic shows, such as The Walking Dead and Dexter.

Read the full post here over on our data blog.

– Ravi Iyer


The Opinion Graph predicts more than the Interest Graph

At Ranker, we keep track of talk about the “interest graph,” as we have our own parallel graph of relationships between objects in our system, which we call an “opinion graph.”  I was recently sent this video concerning the power of the interest graph to drive personalization.

The video makes good points about how the interest graph is more predictive than the social graph as far as personalization goes.  I love my friends, but the kinds of things they read and the kinds of things I read are very different; while there is often overlap, there is also a lot of diversity.  For example, trying to personalize my movie recommendations based on my wife’s tastes would not be a satisfying experience.  Collaborative filtering using people who have common interests with me is a step in the right direction, and the interest graph is certainly an important part of that.

However, you can predict more about a person with an opinion graph versus an interest graph. The difference is that while many companies can infer from web behavior what people are interested in, perhaps by looking at the kinds of articles and websites they consume, a graph of opinions actually knows what people think about the things they are reading about.  Anyone who works with data knows that the more specific a data point is, the more you can predict, as the amount of “error” in your measurement is reduced.  Reduced measurement error is far more important for prediction than sample size, which is a point that gets lost in the drive toward bigger and bigger data sets.  Nate Silver often makes this point in talks and in his book.

For example, if you know someone reads articles about Slumdog Millionaire, then you can serve them content about Slumdog Millionaire.  That would be a typical use case for interest graph data.  Using collaborative filtering, you can find out what other Slumdog Millionaire fans like and serve them appropriate content.  With opinion graph data, of the type we collect at Ranker, you might be able to differentiate between a person who thinks Slumdog Millionaire is simply a great movie and someone who thinks the soundtrack was one of the best ever.  If you liked the movie, we would predict that you would also like Fight Club.  But if you liked the soundtrack, you might instead be interested in other music by A.R. Rahman.
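The distinction can be sketched as follows (the recommendation mappings and function are invented for illustration, not Ranker’s actual system): an interest graph only records *that* a user engaged with an item, while an opinion graph also records *which aspect* they liked, so the recommendation can be narrower.

```python
# Invented mappings for illustration only.
# Interest-graph signal: the user engaged with the item, nothing more.
interest_recs = {
    "Slumdog Millionaire": ["Fight Club", "A.R. Rahman albums"],  # broad
}

# Opinion-graph signal: which aspect of the item the user liked.
opinion_recs = {
    ("Slumdog Millionaire", "liked the film"): ["Fight Club"],
    ("Slumdog Millionaire", "liked the soundtrack"): ["A.R. Rahman albums"],
}

def recommend(item, aspect=None):
    # The more specific opinion signal wins when available;
    # otherwise fall back to the broad interest signal.
    if aspect is not None and (item, aspect) in opinion_recs:
        return opinion_recs[(item, aspect)]
    return interest_recs.get(item, [])

print(recommend("Slumdog Millionaire"))
print(recommend("Slumdog Millionaire", "liked the soundtrack"))
```

The opinion-level lookup is the “reduced measurement error” in action: the same item, with a more specific signal attached, yields a narrower and more accurate prediction.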

Simply put, the opinion graph can predict more about people than the interest graph can.

– Ravi Iyer
