by    in Data Science

The Long Tail of Opinion Data

If you want to find out what the best restaurant in your area is, what the best printer under $80 is, or what the best movie of 2010 was, there are many websites out there that can help you.  Sites like Yelp, Rotten Tomatoes, and Engadget have built sustainable businesses by providing opinions in these vertical domains.  Ranker also has a best movies of all time list, and while I might argue that our list is better than Rotten Tomatoes’ list (is Man on Wire really the best movie ever?), there isn’t anything particularly novel about having a list of best movies.  At the point where Ranker is the go-to site for opinions about restaurants, electronics, and movies, it will be a very big business indeed.

We are actually competitive already for movies, but where Ranker has unique value is in the long tail of opinions.  There are lots of domains where opinions are valuable but rarely systematically polled.  As this Motley Fool writer points out, we are one of the few places with opinions about companies with the worst customer service, and the only one that updates in real time.  Memes are arguably some of the most valuable things to know about, yet there is little data-oriented competition for our funniest memes lists.  Because we are inherently social creatures, opinions about people are obviously of tremendous value, yet apart from Gallup polls about politicians and our own votable opinions about people lists, there is little systematic knowledge of people’s opinions about people in the news.

Not only are there countless domains where systematic opinions are not collected, but even in the domains where they are collected, opinions tend to be unidimensionally focused on “best”, with little differentiation for other adjectives.  What if you want to identify the funniest, most annoying, dumbest, worst, or hottest item in a domain?  “Best” searches far outnumber “worst” searches on Google (about 50 to 1 according to Google Trends), but if you take all the adjectives (e.g. funniest, dumbest) and combine them with all the qualifiers (e.g. of 2011, that remind you of college, that you love to hate), there is a long tail of opinions, even in the most popular domains, that is unserved.  Where else is data systematically collected on British Comedians?

When you combine the opportunities available in the long tail of domains with the long tail of adjectives and qualifiers, you get a truly large set of opinions that makes up the long tail of opinions on the internet.  There are myriad companies trying to mine Twitter for this data, which somewhat validates my intuition that there is opportunity here, but clever algorithms will never make up for the imperfections of mining 140-character text.  Many companies will try to compete by squeezing the last bit of signal from imperfect data, but my experience in academia and in technology has taught me that there is no substitute for collecting better data.  If my previous assertion that the knowledge graph is more than just facts is true, then there will be great demand for this long tail of opinions, just as there is great demand for the long tail of niche searches.  And Ranker is one of the few companies empirically sampling this long tail.

– Ravi Iyer

by    in Data Science, Google Knowledge Graph

The Knowledge Graph is about more than facts

Today, Google announced the “knowledge graph”, which brings facts into Google searches.  So now, when you search for an object that Google understands, search results reflect Google’s actual understanding, leveraging what it knows about each object.  Here is a video with more detail.

At Ranker, we know things about specific objects too, as almost every item in the Ranker system maps to an object in Freebase, the database built by Metaweb, the company that Google bought in order to provide these features.  We know a lot of the same information that Google knows, since we leverage the Freebase dataset.  For example, on our Godfather page, we present facts such as who directed the movie, when it was released, and what its rating was.  However, we also present other facts that people traditionally do not think of as part of the knowledge graph, but that are actually just as essential to understanding the world.  We tell you that it’s one of the best movies of all time.  We also tell you that people who like The Godfather also tend to like Goodfellas, The Shawshank Redemption, and Scarlett Johansson.

Is this “knowledge”?  These aren’t “hard” facts, but it is a fact that people generally think of The Godfather as a good movie and Gigli as a bad movie.  Moreover, knowledge about people’s opinions is essential for understanding the world in the way that the “Star Trek computer” referred to in Google’s blog post understands the world.  Could you pick a college based on factual information about enrollment and majors offered?  Could you hold an intelligent conversation about Harvard without knowing its place in the universe of universities?  Could you choose a neighborhood to live in based solely on statistics about the neighborhood, or would understanding what neighborhoods people like you also tend to like help you make the right choice?  If the broader mission of a search engine is to help you answer questions, then information about people’s opinions about colleges and neighborhoods is essential in these cases.  The knowledge graph isn’t just about facts; it’s about opinions as well.  Much of the knowledge you use in everyday reasoning concerns opinions, and if the internet is to get smarter, it needs this knowledge just as much as it needs to know factual information.

My guess is that Google gets this.  In 2004, searches for the word “best” were roughly equal to searches for words like car, computer, or software, but people are increasingly searching for opinions online.  My uneducated guess is that Google bought Zagat, in part, for this reason.  Bing, Wolfram Alpha, Apple, and Facebook are all working on similar semantic search solutions, and as long as people continue to dream about the holodeck computer that can intelligently answer requests like “book me a hotel room in Toronto” or “buy my niece a present for her birthday”, data about opinions will be a part of the future of the knowledge graph.

– Ravi Iyer