Why Topsy/Twitter Data may never predict what matters to the rest of us

Apple recently paid a reported $200 million for Topsy, and some speculate that the purchase is meant to improve recommendations for products consumed on Apple devices by leveraging the data that Topsy has from Twitter.  This makes perfect sense to me, but the utility of Twitter data in predicting what people want is easy to overstate, largely because people often confuse bigger data with better data.  There are at least two reasons why there is a fairly hard ceiling on how much Twitter data will ever allow one to predict about what regular people want.

1.  Sampling – Twitter has a ton of data, with daily usage of around 10%.  Sample size isn’t the issue here, as there is plenty of data; rather, the people who use Twitter are a very specific set of people.  Even if you correct for demographics, the psychographic profile of people who want to share their opinions publicly and regularly (far more people have heard of Twitter than actually use it) is too distinctive to generalize to the average person, in the same way that surveys of landline users cannot be used to predict what psychographically distinct cellphone users think.
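To make the sampling point concrete, here is a minimal simulation sketch (all numbers and names are illustrative assumptions, not actual Twitter statistics) of why reweighting by an observed demographic cannot correct for an unobserved psychographic trait that drives both platform use and the opinion being measured:

```python
import random

random.seed(42)

# Hypothetical population: each person has an observed demographic ("young")
# and an unobserved psychographic trait ("outspoken").  The opinion we want
# to measure, and the chance of being on the platform at all, both depend on
# the hidden trait, not the demographic.
population = []
for _ in range(100_000):
    young = random.random() < 0.5             # observed demographic
    outspoken = random.random() < 0.2         # unobserved psychographic trait
    opinion = 1.0 if outspoken else 0.0       # opinion tracks the hidden trait
    on_platform = random.random() < (0.6 if outspoken else 0.05)
    population.append((young, outspoken, opinion, on_platform))

truth = sum(p[2] for p in population) / len(population)

sample = [p for p in population if p[3]]
raw = sum(p[2] for p in sample) / len(sample)

# Reweight the sample so its age mix matches the population's age mix.
pop_young = sum(p[0] for p in population) / len(population)
smp_young = sum(p[0] for p in sample) / len(sample)
w_young, w_old = pop_young / smp_young, (1 - pop_young) / (1 - smp_young)
weights = [w_young if p[0] else w_old for p in sample]
weighted = sum(w * p[2] for w, p in zip(weights, sample)) / sum(weights)

print(f"true mean opinion:        {truth:.3f}")
print(f"raw platform estimate:    {raw:.3f}")
print(f"demographically weighted: {weighted:.3f}")
```

Because age is independent of the outspoken trait in this toy setup, the demographic weights are close to 1 and the weighted estimate stays nearly as biased as the raw one.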

2. Domain Comprehensiveness – The opinions that people share on Twitter are biased by the medium, so they do not represent the full spectrum of things people care about.  There are tons of opinions on entertainment, pop culture, and links that people want to promote, since these are easy to share quickly, but very little on people’s important life goals, the qualities we admire most in a person, or any topic where opinions are likely to be more nuanced.  Even where people do voice opinions in those domains, the 140-character limit is likely to skew them.

Twitter (and by extension, companies that use its data, like Topsy and DataSift) has a treasure trove of information, but people working on next-generation recommendations and semantic search should realize that, given the above limitations, it is a small part of the overall puzzle.  The volume of information gives you a very precise measure of a very specific group of people’s opinions about very specific things, leaving out the vast majority of people’s opinions about the vast majority of things.  When you add in the bias introduced by analyzing 140-character natural language, a great deal of the variance in recommendations will likely have to be explained by other sources.

At Ranker, we have similar sampling issues, in that we collect much of our data at Ranker.com, but we are actively broadening our reach through our widget program, which now collects data on thousands of partner sites.  Our ranked-list methodology certainly has bias too, which we attempt to mitigate by combining voting and ranking data.  The key is not the volume of data but the diversity of data, which helps mitigate the bias inherent in any particular sampling/data-collection method.

Similarly, people using Twitter data would do well to consider issues of data diversity and not be blinded by large numbers of users and data points.  Twitter is certainly bound to be part of understanding consumer opinions, but the size of the dataset alone will not guarantee that it will be a central part.  Given these issues, either Twitter will start to diversify the ways it collects consumer-sentiment data, or the best semantic-search algorithms will eventually use Twitter data as but one narrowly targeted input among many.

– Ravi Iyer


The Opinion Graph predicts more than the Interest Graph

At Ranker, we keep track of talk about the “interest graph” because we have our own parallel graph of relationships between the objects in our system, which we call an “opinion graph”.  I was recently sent this video concerning the power of the interest graph to drive personalization.

The video makes good points about how the interest graph is more predictive than the social graph where personalization is concerned.  I love my friends, but the kinds of things they read and the kinds of things I read are very different; there is often overlap, but also a lot of diversity.  For example, trying to personalize my movie recommendations based on my wife’s tastes would not make for a satisfying experience.  Collaborative filtering using people who share my interests is a step in the right direction, and the interest graph is certainly an important part of that.
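As a concrete illustration of interest-based collaborative filtering, here is a minimal user-based sketch; the users, likes, and similarity choice (cosine over like-sets) are invented for illustration:

```python
import math

# Hypothetical users represented by the sets of movies they liked.
likes = {
    "me":     {"Inception", "The Matrix", "Interstellar"},
    "wife":   {"Pride and Prejudice", "Notting Hill"},
    "friend": {"The Matrix", "Interstellar", "Blade Runner"},
}

def cosine(a, b):
    # Cosine similarity between two binary like-sets.
    return len(a & b) / math.sqrt(len(a) * len(b))

def recommend(user):
    # Find the most similar other user and suggest their likes
    # that this user has not already seen.
    others = [(cosine(likes[user], v), u) for u, v in likes.items() if u != user]
    best = max(others)[1]
    return sorted(likes[best] - likes[user])

print(recommend("me"))  # recommendations come from the most similar user
```

The point of the sketch is that "me" gets recommendations from the similarly-interested friend, not from the socially closest person.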

However, you can predict more about a person with an opinion graph than with an interest graph. The difference is that while many companies can infer from web behavior what people are interested in, perhaps by looking at the kinds of articles and websites they consume, a graph of opinions actually knows what people think about the things they are reading about.  Anyone who works with data knows that the more specific a data point is, the more you can predict, because the amount of “error” in your measurement is reduced.  Reduced measurement error is far more important for prediction than sample size, a point that gets lost in the drive toward bigger and bigger datasets.  Nate Silver often makes this point in talks and in his book.
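A quick simulation illustrates the attenuation effect (all numbers are illustrative assumptions): per-user measurement error caps how well a predictor can correlate with an outcome, and adding more users does not raise that cap:

```python
import math
import random

random.seed(0)

def pearson(xs, ys):
    # Plain Pearson correlation, computed from scratch.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def simulate(n_users, noise_sd):
    # Each user has a true preference; the outcome we want to predict
    # tracks that true preference.  The predictor is the preference
    # measured once per user with error, so the error never averages out
    # no matter how many users we add.
    xs, ys = [], []
    for _ in range(n_users):
        true_pref = random.gauss(0, 1)
        xs.append(true_pref + random.gauss(0, noise_sd))  # measured predictor
        ys.append(true_pref + random.gauss(0, 0.5))       # outcome
    return pearson(xs, ys)

r_precise = simulate(2_000, noise_sd=0.2)    # small but precise dataset
r_noisy = simulate(200_000, noise_sd=2.0)    # 100x bigger, much noisier
print(f"precise small sample: r = {r_precise:.2f}")
print(f"noisy large sample:   r = {r_noisy:.2f}")
```

The small precise dataset predicts the outcome substantially better than the dataset with 100 times as many users, which is the measurement-error point in miniature.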

For example, if you know someone reads articles about Slumdog Millionaire, then you can serve them content about Slumdog Millionaire.  That is a typical use case for interest-graph data. Using collaborative filtering, you can find out what other Slumdog Millionaire fans like and serve the appropriate content.  With opinion-graph data of the type we collect at Ranker, you might be able to differentiate between a person who thinks Slumdog Millionaire is simply a great movie and someone who thinks its soundtrack was one of the best ever.  If you liked the movie, we would predict that you would also like Fight Club.  But if you liked the soundtrack, you might instead be interested in other music by A.R. Rahman.
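The Slumdog Millionaire example can be sketched as a lookup difference between the two graphs; the mappings below are hypothetical illustrative data, not Ranker’s actual graph:

```python
# An interest graph knows only that a user engaged with an entity.
interest_graph = {
    "Slumdog Millionaire": ["more Slumdog Millionaire content"],
}

# An opinion graph additionally knows *what* the user thinks: which
# aspect of the entity they rated highly.
opinion_graph = {
    ("Slumdog Millionaire", "great movie"): "Fight Club",
    ("Slumdog Millionaire", "great soundtrack"): "other music by A.R. Rahman",
}

def recommend_from_interest(entity):
    # Interest-level data can only route you back to the same entity.
    return interest_graph.get(entity, [])

def recommend_from_opinion(entity, opinion):
    # Opinion-level data distinguishes fans of different aspects.
    return opinion_graph.get((entity, opinion))

print(recommend_from_interest("Slumdog Millionaire"))
print(recommend_from_opinion("Slumdog Millionaire", "great movie"))
print(recommend_from_opinion("Slumdog Millionaire", "great soundtrack"))
```

The same entity-level interest yields two different recommendations once the opinion is known, which is exactly the extra predictive power the post describes.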

Simply put, the opinion graph can predict more about people than the interest graph can.

– Ravi Iyer