
The Long Tail of Opinion Data

If you want to find out what the best restaurant in your area is, what the best printer under $80 is, or what the best movie of 2010 was, there are many websites out there that can help you. Sites like Yelp, Rotten Tomatoes, and Engadget have built sustainable businesses by providing opinions in these vertical domains. Ranker also has a best movies of all time list, and while I might argue that our list is better than Rotten Tomatoes’ list (is Man on Wire really the best movie ever?), there isn’t anything particularly novel about having a list of best movies. At the point where Ranker is the go-to site for opinions about restaurants, electronics, and movies, it will be a very big business indeed.

We are already competitive for movies, but Ranker’s unique value is in the long tail of opinions. There are lots of domains where opinions are valuable but rarely systematically polled. As this Motley Fool writer points out, we are one of the few places with opinions about companies with the worst customer service, and the only one that updates in real time. Memes are arguably some of the most valuable things to know about, yet there is little data-oriented competition for our funniest memes lists. Because we are inherently social creatures, opinions about people are of tremendous value, yet outside of Gallup polls about politicians and our own votable opinions about people lists, there is little systematic knowledge of people’s opinions about people in the news.

Not only are there countless domains where systematic opinions are not collected, but even in the domains where they are, opinions tend to be unidimensionally focused on “best”, with little differentiation for other adjectives. What if you want to identify the funniest, most annoying, dumbest, worst, or hottest item in a domain? “Best” searches far outnumber “worst” searches on Google (about 50 to 1 according to Google Trends), but if you take all the adjectives (e.g. funniest, dumbest) and combine them with all the qualifiers (e.g. of 2011, that remind you of college, that you love to hate), there is an unserved long tail of opinions even in the most popular domains. Where else is data systematically collected on British Comedians?
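To make the arithmetic concrete, here is a minimal Python sketch of how quickly the combinations multiply. The sample domains, adjectives, and qualifiers are hypothetical values picked for illustration, and `list_titles` is a made-up helper, not anything Ranker actually runs.

```python
from itertools import product

# Hypothetical sample values, chosen only for illustration.
domains = ["movies", "memes", "British comedians"]
adjectives = ["best", "worst", "funniest", "dumbest", "most annoying"]
# The empty string stands for "no qualifier".
qualifiers = ["", "of 2011", "that remind you of college", "that you love to hate"]


def list_titles(domains, adjectives, qualifiers):
    """Return every adjective/domain/qualifier combination as a list title."""
    return [
        " ".join(part for part in (adjective, domain, qualifier) if part)
        for domain, adjective, qualifier in product(domains, adjectives, qualifiers)
    ]


titles = list_titles(domains, adjectives, qualifiers)
print(len(titles))  # 3 domains * 5 adjectives * 4 qualifiers = 60
print(titles[0])    # best movies
```

Even this toy vocabulary yields 60 distinct opinion lists from a dozen words, and only a handful of them are the "best" lists that most sites already cover.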

When you combine the opportunities available in the long tail of domains with the long tail of adjectives and qualifiers, you get a truly large set of opinions that makes up the long tail of opinions on the internet. There are myriad companies trying to mine Twitter for this data, which somewhat validates my intuition that there is opportunity here, but clever algorithms will never make up for the imperfections of mining 140-character text. Many companies will try to compete by squeezing the last bit of signal from imperfect data, but my experience in academia and in technology has taught me that there is no substitute for collecting better data. If my previous assertion that the knowledge graph is more than just facts is true, then there will be great demand for this long tail of opinions, just as there is great demand for the long tail of niche searches. And Ranker is one of the few companies empirically sampling this long tail.

– Ravi Iyer