by Ravi Iyer in Data Science, Interest Graph, Opinion Graph

The Opinion Graph predicts more than the Interest Graph

At Ranker, we keep track of discussion about the “interest graph,” since we maintain our own parallel graph of relationships between the objects in our system, which we call an “opinion graph.”  I was recently sent this video about the power of the interest graph to drive personalization.

The video makes good points about how the interest graph is more predictive than the social graph where personalization is concerned.  I love my friends, but the kinds of things they read and the kinds of things I read are very different; while there is often overlap, there is also a lot of diversity.  For example, trying to personalize my movie recommendations based on my wife’s tastes would not be a satisfying experience.  Collaborative filtering based on people who share my interests is a step in the right direction, and the interest graph is certainly an important part of that.

However, you can predict more about a person with an opinion graph than with an interest graph.  The difference is that while many companies can infer from web behavior what people are interested in, perhaps by looking at the kinds of articles and websites they consume, a graph of opinions actually knows what people think about the things they are reading about.  Anyone who works with data knows that the more specific a data point is, the more you can predict, because the amount of “error” in your measurement is reduced.  Reduced measurement error is far more important for prediction than sample size, a point that gets lost in the drive toward bigger and bigger data sets.  Nate Silver often makes this point in talks and in his book.
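To make the point concrete, here is a toy simulation (all parameters invented): we predict a trait y from a measurement x = y + noise, and compare a modest sample measured precisely against a sample twenty times larger measured noisily.

```python
import random

random.seed(42)

def r_squared(n, noise_sd):
    """Simulate predicting a trait y from a noisy measurement x = y + noise,
    returning the squared correlation between x and y (predictive power)."""
    ys = [random.gauss(0, 1) for _ in range(n)]
    xs = [y + random.gauss(0, noise_sd) for y in ys]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)

precise_small = r_squared(n=500, noise_sd=0.2)   # small sample, low error
noisy_big = r_squared(n=10000, noise_sd=2.0)     # 20x the data, high error

print(precise_small, noisy_big)
```

With these settings the precise measure explains most of the variance while the noisy one explains only a fraction, despite the twentyfold difference in sample size.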

For example, if you know someone reads articles about Slumdog Millionaire, then you can serve them content about Slumdog Millionaire.  That would be a typical use case for interest graph data.  Using collaborative filtering, you can find out what other Slumdog Millionaire fans like and serve appropriate content.  With opinion graph data, of the type we collect at Ranker, you might be able to differentiate between a person who thinks that Slumdog Millionaire is simply a great movie and someone who thinks the soundtrack was one of the best ever.  If you liked the movie, we would predict that you would also like Fight Club.  But if you liked the soundtrack, you might instead be interested in other music by A.R. Rahman.
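A minimal sketch of that distinction (the recommendation tables below are hypothetical, not Ranker’s actual data): an interest graph can only key recommendations off the item, while an opinion graph can key them off the item plus what the user thought about it.

```python
# Hypothetical recommendation tables, for illustration only -- not Ranker's
# actual data.  The interest graph keys on the item alone; the opinion graph
# keys on the item plus what the user actually thought about it.

interest_recs = {
    "Slumdog Millionaire": ["Fight Club"],
}

opinion_recs = {
    ("Slumdog Millionaire", "great movie"): ["Fight Club"],
    ("Slumdog Millionaire", "great soundtrack"): ["A.R. Rahman albums"],
}

def recommend(item, opinion=None):
    """Use the more specific opinion-level data when we have it,
    falling back to interest-level data otherwise."""
    if opinion is not None and (item, opinion) in opinion_recs:
        return opinion_recs[(item, opinion)]
    return interest_recs.get(item, [])

print(recommend("Slumdog Millionaire"))                      # interest graph only
print(recommend("Slumdog Millionaire", "great soundtrack"))  # opinion graph
```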

Simply put, the opinion graph can predict more about people than the interest graph can.

– Ravi Iyer

by Ravi Iyer in Data Science

Mitt Romney Should Have Advertised on the X-Files

With the election recently behind us, many political analysts are dissecting the campaigns, examining what worked and what didn’t.  One specific area where the Obama team is getting praise is its unprecedented use of data to drive campaign decisions, and more specifically, how it used data to micro-target voters who watched specific TV shows.  From this New York Times article concerning the Obama team’s TV analytics:

“Culling never-before-used data about viewing habits, and combining it with more personal information about the voters the campaign was trying to reach and persuade than was ever before available, the system allowed Mr. Obama’s team to direct advertising with a previously unheard-of level of efficiency, strategists from both sides agree….

[They] created a new set of ratings based on the political leanings of categories of people the Obama campaign was interested in reaching, allowing the campaign to buy its advertising on political terms as opposed to traditional television industry terms….

[They focused] on niche networks and programs that did not necessarily deliver large audiences but, as Mr. Grisolano put it, did provide the right ones.”


The Obama team focused more on undecided and apolitical voters in an effort to get them to the polls.  Given that some Mitt Romney supporters have blamed the election results on low turnout among supporters, perhaps Romney would have been smart to create a ranked list of TV shows, ordered by how much each show’s fans supported him, and then place positive, motivating ads on those shows in an effort to increase turnout of his base.  Where would Romney get such data?  From Ranker!

Mitt Romney appears on many votable Ranker lists (e.g. Most Influential People of 2012).  Based on people who voted both on those lists and on lists such as our Best Recent TV Shows list, we can examine which TV shows are positively or negatively associated with Mitt Romney.  Below are the top positive results from one of our internal tools.

As you can see, The X-Files appears to be the most highly correlated show, by a fair margin.  I don’t watch The X-Files, so I wasn’t sure why this correlation exists, but a bit of research turned up this article exploring how the show supported a number of conservative themes, such as the persistence of evil, objective truth, and distrust of government (also see here).  The article points out that in one episode, right-wing militiamen are depicted as heroic, which would never happen in a more liberal-leaning plot.  Perhaps if you are a conservative politician seeking to motivate your base, you should consider running ads on reruns of The X-Files; if you run a television station that shows X-Files reruns, consider contacting local conservative politicians with this data.
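The underlying computation can be sketched roughly as follows.  The vote data here is invented, and the method (a simple Pearson correlation between candidate votes and show votes across shared users) is one plausible way to produce such a ranking, not necessarily our internal tool’s exact algorithm.

```python
from math import sqrt

# Invented per-user votes (+1 = upvote, -1 = downvote); real vote matrices
# are much larger and sparser than this.
romney_votes = {"u1": 1, "u2": 1, "u3": -1, "u4": -1, "u5": 1}
show_votes = {
    "The X-Files":        {"u1": 1, "u2": 1, "u3": -1, "u4": -1, "u5": 1},
    "Leave It to Beaver": {"u1": 1, "u2": -1, "u3": -1, "u4": 1, "u5": 1},
}

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def show_correlation(show):
    """Correlate a show's votes with the candidate's votes over shared users."""
    users = [u for u in romney_votes if u in show_votes[show]]
    return pearson([romney_votes[u] for u in users],
                   [show_votes[show][u] for u in users])

ranked = sorted(show_votes, key=show_correlation, reverse=True)
print(ranked)  # shows ordered by how strongly their fans lean toward the candidate
```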

You may notice that this list contains more classic or rerun shows (e.g. Leave It to Beaver) than current shows.  This appears to be part of a general trend in which conservatives on Ranker tend to vote positively for classic TV, a subject we’ll cover in a future blog post.  The possibility of advertising on reruns is part of what we would like to highlight in this post, as such ads are likely relatively cheap and their audiences can be more easily targeted, a tactic the Obama campaign has been praised for.  At Ranker, we’re hopeful that more advertisers will seek value in the long tail and mid-tail and will mimic the tactics of the Obama campaign, as our data is uniquely suited for such psychographic targeting.

– Ravi Iyer

by Ravi Iyer in Data Science

How Crowdsourcing can uncover Niche/Trending shows

At Ranker, people give us their opinions in a variety of ways.  Some people vote.  Other people make long lists.  Still others make really short lists.  Some people tell us their absolute favorite things, while others list everything they’ve ever experienced.  One of the advantages of this diversity is that it allows us to examine patterns within these divergent types of opinions.  For example, some things are really popular, meaning that everyone lists them (e.g. Michael Jordan is on everyone’s best basketball players list).  Most popular things also tend to be listed high on people’s lists and to get lots of positive votes (e.g. Michael Jordan).  However, some things don’t get listed very often, but when they do get listed, people are passionate about them, meaning that they rank high on people’s lists.  We highlight these items in our system using the niche symbol.

I’ve recently been examining our “niche” tag, which signifies that something is not particularly popular, but that people are passionate about it.  There are many reasons why things can be niche.  Some things appeal specifically to younger (e.g. Rugrats) or older crowds (e.g. The Rockford Files).  Other things have natural audiences (e.g. baseball fans who appreciate defense and think Ozzie Smith is one of the greatest players of all time).  The most interesting case is when something I can’t identify starts showing up on the niche list (see the list at the time of this writing here).
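A rough sketch of how such a tag might be computed (the items, numbers, and one-standard-deviation threshold below are all invented for illustration; this is not Ranker’s actual algorithm): flag items whose listing count is below the median but whose average list position is unusually close to the top.

```python
from statistics import mean, median, stdev

# Invented items: how many lists each appears on, and its average position
# from the top of those lists (lower = listed higher = more passion).
items = {
    "Community":      {"listings": 12,  "avg_position": 2.1},
    "Popular Show A": {"listings": 480, "avg_position": 8.5},
    "Popular Show B": {"listings": 350, "avg_position": 11.0},
    "Filler Show C":  {"listings": 15,  "avg_position": 19.0},
}

positions = [v["avg_position"] for v in items.values()]
mu, sd = mean(positions), stdev(positions)
popularity_cutoff = median(v["listings"] for v in items.values())

def is_niche(name, z_threshold=1.0):
    """Niche = rarely listed (below-median popularity) but placed unusually
    high when listed (average position well above the mean, i.e. a strongly
    negative z-score)."""
    item = items[name]
    z = (item["avg_position"] - mu) / sd
    return item["listings"] < popularity_cutoff and z < -z_threshold

print([name for name in items if is_niche(name)])  # -> ['Community']
```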

This is especially helpful for someone like me, who doesn’t always know what is ‘hot’ and naturally looks to data to find new quality entertainment.  A while back, the show Community consistently ranked highest in our niche algorithm.  Few people listed it as one of the best recent TV shows, but those who did tended to think very highly of it.  I was intrigued enough to watch the pilot on Hulu and have since become hooked.  Community has since graduated from our niche algorithm as it became popular.  Sometimes passion amongst a small group is how a trend starts.

Just as Margaret Mead believed that a small group of citizens could change the world, Malcolm Gladwell has shown how a small group of trendsetters can signal changes in pop culture.  Not everything on our niche list will become the next big thing, but it’s certainly a good place to search for candidates.

Among the things that people seem to be passionate about now, but that aren’t especially popular, are several good candidates for up-and-coming movies, bands, or TV shows.  Papillon is currently hot, scoring over 2 standard deviations higher in terms of list position on our best movies list, despite being less popular than most movies.  Another Earth and 13 Assassins seem like potentially interesting, under-the-radar films from 2011.  Real Time with Bill Maher’s niche status may be due to its appeal to a particular ideological group, but Warehouse 13 appealed to just my niche, as it had passionate fans on both the best recent TV shows list and the best sci-fi TV shows list (it has since graduated from the list due to increased popularity).  Warehouse 13’s most highly correlated show is one of my favorites, Battlestar Galactica, so I’m definitely going to check it out.

I tend to be a late adopter of pop culture, but thanks to the niche tag, maybe I can be a little hipper going forward.  Take a look at our niche items as of October 20, 2012; any suggestions of other things worth checking out would be appreciated.  Or take a look in a few months’ time and consider whether our niche tag successfully captured coming trends.

– Ravi Iyer

by Ravi Iyer in Data Science, Market Research

Validating Ranker’s Aggregated Data vs. a Gallup Poll of Best Colleges

We were talking to someone in the market research field about the credibility of Ranker’s aggregated rankings, and they were intrigued and suggested that we validate our data by comparing the aggregated results of one of our lists to the results achieved by a traditional research company using traditional market research methodologies.  Companies like Gallup often do not ask the same types of questions that we ask at Ranker, in part due to the inherent difficulties of open-ended polling via random digit dialing.  You can’t realistically call someone up at dinner time and ask them to list their 50 favorite TV shows.  You could ask them to name one favorite, but doing that, you can end up with headlines like “Americans admire Glenn Beck more than they admire the Pope.”  However, one question that both Gallup and Ranker have asked concerns the nation’s top colleges/universities.  How do Ranker’s results compare to Gallup’s data?  Below are our results, side by side.

Ranker vs Gallup Best US Colleges

From a market researcher’s perspective, this is good news for Ranker data.  Our algorithms have replicated the top 4 results from the Gallup poll exactly, at a fraction of the cost.  This likely occurs because Ranker data is largely collected from users who find our website via organic search, so while our data is not a representative probability sample (assuming such a thing still exists in a world where people screen their calls on cellphones), our users tend to be more representative than the motivated Yelp user or the intellectual Quora user.  If you compare Ranker’s best movies list to Rotten Tomatoes’ aggregated opinion list (Toy Story 2 and Man on Wire are #1 and #2!?!?), you get a sense of the importance of having relatively representative data.

In addition, the fact that our lists are derived from a combination of methodologies (listing, reranking, and voting) means that the error associated with each method partly cancels out.  Indeed, one might argue that Ranker’s top dream colleges list is better than Gallup’s for precisely this reason: individuals are often tempted to name their alma mater or their local school as the best college, and the long tail of answers may actually contain more pertinent information.  Aggregating ranked lists from motivated users and combining that data with casual votes may be the best way to answer a question like this.
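One simple way to combine the two methodologies (a sketch with invented data, not Ranker’s actual algorithm) is to normalize a Borda-style score from the ranked lists and a net-vote score onto the same scale, then average them, so that neither method’s quirks dominate the final ranking.

```python
# Invented data: two user-submitted ranked lists plus net up/down votes.
ranked_lists = [
    ["Harvard", "Stanford", "Yale"],
    ["Stanford", "Harvard", "Princeton"],
]
net_votes = {"Harvard": 40, "Stanford": 55, "Yale": 10, "Princeton": 5}

def borda_scores(lists):
    """Borda-style points: first place on a 3-item list earns 3, second 2, ..."""
    scores = {}
    for lst in lists:
        for i, item in enumerate(lst):
            scores[item] = scores.get(item, 0) + (len(lst) - i)
    return scores

def normalize(scores):
    top = max(scores.values())
    return {k: v / top for k, v in scores.items()}

list_scores = normalize(borda_scores(ranked_lists))
vote_scores = normalize(net_votes)

# Average the two normalized scores so errors in one method partly cancel.
combined = {k: (list_scores.get(k, 0) + vote_scores.get(k, 0)) / 2
            for k in set(list_scores) | set(vote_scores)}

final_ranking = sorted(combined, key=combined.get, reverse=True)
print(final_ranking)
```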

– Ravi Iyer

by Ravi Iyer in Data Science, Google Knowledge Graph

How Ranker leverages Google’s Knowledge Graph

Google recently held its I/O conference, and one of the talks was given by Shawn Simister, who was once one of Freebase’s biggest fans and has since gone on to work at Google, which acquired Freebase a few years ago.  What is Freebase?  It’s the structured semantic data that powers Google’s Knowledge Graph and Ranker, along with many other organizations featured in the talk (Ranker is mentioned around the 8:45 mark).  The talk gives organizations that may not be familiar with Freebase an overview of how they can leverage Freebase’s semantic data.

How does Ranker use the knowledge graph?  Freebase’s semantic data powers much of what we do at Ranker and the below graph illustrates how we relate to the semantic web.

How Ranker Relates to the Semantic Web

We leverage data from the semantic web, often via Freebase, to create content in list format (e.g. The Best Beatles Songs), which our users then vote on and re-rank.  This creates an opinion data layer that is easily exportable to any other entity (e.g. The New York Times or Netflix) that is connected to the larger semantic web.  Our hope is that just as people in the presentation are beginning to create mashups of factual data, eventually people will also want to merge in opinion data, and that we’ll have the best semantic opinion dataset out there when that happens.  The more people connect their data to the semantic web, the more lists we can create, and the more potential consumers exist for our opinion data.  As such, we’d encourage you to check out Shawn’s presentation; hopefully you’ll find Freebase as useful as we do.
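For readers unfamiliar with Freebase, queries are written in MQL (Metaweb Query Language) as JSON templates in which null values mark the fields you want filled in.  The sketch below only builds the request payload; the type and property names are illustrative, so consult the Freebase documentation for the exact schema.

```python
import json

# An MQL-style query template: nulls are the blanks Freebase fills in.
# The type and property names here are illustrative, not guaranteed schema.
query = [{
    "type": "/music/recording",
    "artist": "The Beatles",
    "name": None,   # ask for each recording's name
    "limit": 5,
}]

payload = json.dumps({"query": query})
print(payload)  # this JSON would be sent to the mqlread service
```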

– Ravi Iyer


by Ravi Iyer in Data Science

Siri (and other mobile interfaces) will eventually need semantic opinion data

Search engines, which process text and give you a menu of potential matches, make sense when you use an interface with a keyboard, a mouse, and a relatively large screen. Consider the below search for information about Columbia.  Whether I mean Columbia University, Columbia Sportswear, or Columbia Records, I can relatively easily navigate to the official website of the place that I need.

Mobile devices require specificity, as the cost of an incorrect result is magnified by the limits of the user interface.  When using something like Siri, it is important to give a precise answer to a question, rather than a menu of potential answers, because choosing from a menu is far harder on these interfaces.  As technology improves, we will expect intelligent devices to make the same inferences we make about what is meant, given limited information.  For example, if I say “how do I get to Columbia?” to my phone while in New York, it should direct me to Columbia University, whereas in Chicago, it should direct me to Columbia College of Chicago.  Leveraging contextual information is part of what makes Siri special, as it allows you to, for example, use pronouns.  Some have said that Siri has resurrected the semantic web, because in order to make the above choice of “Columbia” intelligently, it needs to know that Columbia University is located in New York while Columbia College is located in Chicago.
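The Columbia example boils down to a simple lookup against semantic data.  Here is a minimal sketch (the tiny knowledge graph and its fields are invented stand-ins for real semantic data): prefer the sense of an ambiguous name whose known location matches the user’s current city.

```python
# A toy knowledge graph mapping an ambiguous name to candidate entities,
# each with a known location.  Entities and fields are illustrative.
knowledge = {
    "Columbia": [
        {"entity": "Columbia University", "city": "New York"},
        {"entity": "Columbia College of Chicago", "city": "Chicago"},
    ],
}

def disambiguate(query, user_city):
    """Prefer the sense whose location matches the user's current city;
    fall back to the first known sense when nothing matches."""
    senses = knowledge.get(query, [])
    for sense in senses:
        if sense["city"] == user_city:
            return sense["entity"]
    return senses[0]["entity"] if senses else None

print(disambiguate("Columbia", "New York"))  # -> Columbia University
print(disambiguate("Columbia", "Chicago"))   # -> Columbia College of Chicago
```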

I have made the case before that people are increasingly seeking opinion data, not just factual data, online.  It bears repeating that, as depicted in the below graph, searches for opinion words like “best” are increasing, relative to factual words like “car”, “computer”, and “software” which once were as prevalent as “best”, but now lag behind.

The implication of these two trends is clear.  If more knowledge discovery is done via mobile devices that need semantic data to deliver precise contextual answers, and more knowledge discovery is about opinions, then mobile interfaces such as Siri, or Google’s answer to Siri, will increasingly require semantic opinion datasets to power them.  Using such a dataset, you could ask your mobile device to “find a foreign movie” while traveling, and it could cross-reference your preferences with those of others to find the best foreign movie that happens to be playing in your area and conforms to your taste.  You could ask your mobile device to play some jazz music, and it could consider what music you might like or dislike, in addition to the genre classifications of available albums.  These are the kinds of intelligent operations that human beings do every day, leveraging our knowledge of both the world’s facts and the world’s opinions.  To do these tasks well, any intelligent agent will require the same set of structured knowledge, in the form of semantic opinion data.  Not coincidentally, Ranker’s unique competency is the development of a comprehensive semantic opinion dataset.

– Ravi Iyer

by Ravi Iyer in Data Science

The Long Tail of Opinion Data

If you want to find out what the best restaurant in your area is, what the best printer under $80 is, or what the best movie of 2010 was, there are many websites that can help you.  Sites like Yelp, Rotten Tomatoes, and Engadget have built sustainable businesses by providing opinions in these vertical domains.  Ranker also has a best movies of all time list, and while I might argue that our list is better than Rotten Tomatoes’ list (is Man on Wire really the best movie ever?), there isn’t anything particularly novel about having a list of best movies.  At the point where Ranker is the go-to site for opinions about restaurants, electronics, and movies, it will be a very big business indeed.

We are actually competitive already for movies, but where Ranker has unique value is in the long tail of opinions.  There are lots of domains where opinions are valuable but rarely systematically polled.  As this Motley Fool writer points out, we are one of the few places with opinions about companies with the worst customer service, and the only one that updates in real time.  Memes are arguably some of the most valuable things to know about, yet there is little data-oriented competition for our funniest memes lists.  Since we are inherently social creatures, opinions about people are obviously of tremendous value, yet outside of Gallup polls about politicians, there is little systematic knowledge of people’s opinions about people in the news, beyond our votable opinions about people lists.

Not only are there countless domains where systematic opinions are not collected, but even in the domains that are covered, opinions tend to be unidimensionally focused on “best,” with little differentiation for other adjectives.  What if you want to identify the funniest, most annoying, dumbest, worst, or hottest item in a domain?  “Best” searches far outnumber “worst” searches on Google (about 50 to 1 according to Google Trends), but if you take all the adjectives (e.g. funniest, dumbest) and combine them with all the qualifiers (e.g. of 2011, that remind you of college, that you love to hate), there is a long tail of opinions, even in the most popular domains, that goes unserved.  Where else is data systematically collected on British comedians?

When you combine the opportunities in the long tail of domains with the long tail of adjectives and qualifiers, you get a truly large set of opinions that make up the long tail of opinions on the internet.  There are myriad companies trying to mine Twitter for this data, which somewhat validates my intuition that there is opportunity here, but clever algorithms will never make up for the imperfections of mining 140-character text.  Many companies will try to compete by squeezing the last bit of signal from imperfect data, but my experience in academia and in technology has taught me that there is no substitute for collecting better data.  If my previous assertion that the knowledge graph is more than just facts is true, then there will be great demand for this long tail of opinions, just as there is great demand for the long tail of niche searches.  And Ranker is one of the few companies empirically sampling this long tail.

– Ravi Iyer

by Ravi Iyer in Data Science, Google Knowledge Graph

The Knowledge Graph is about more than facts

Today, Google announced the “Knowledge Graph,” which brings facts into Google searches.  Now, when you search for an object that Google understands, search results reflect Google’s actual understanding, leveraging what it knows about each object.  Here is a video with more detail.

At Ranker, we know things about specific objects too, as almost every item in the Ranker system maps to an object in Freebase, the dataset built by MetaWeb, a company Google bought in order to provide these features.  We know a lot of the same information that Google knows, since we leverage the Freebase dataset.  For example, on our Godfather page, we present facts such as who directed the movie, when it was released, and what its rating is.  However, we also present other facts that people traditionally do not think of as part of the knowledge graph, but that are actually just as essential to understanding the world.  We tell you that it’s one of the best movies of all time.  We also tell you that people who like The Godfather also tend to like Goodfellas, The Shawshank Redemption, and Scarlett Johansson.

Is this “knowledge”?  These aren’t “hard” facts, but it is a fact that people generally think of The Godfather as a good movie and Gigli as a bad movie.  Moreover, knowledge about people’s opinions is essential for understanding the world in the way that the “Star Trek computer” referred to in Google’s blog post understands the world.  Could you pick a college based only on factual information about enrollment and majors offered?  Could you hold an intelligent conversation about Harvard without knowing its place in the universe of universities?  Could you choose a neighborhood to live in based solely on statistics about the neighborhood, or would understanding which neighborhoods people like you tend to like help you make the right choice?  If the broader mission of a search engine is to help you answer questions, then information about people’s opinions about colleges and neighborhoods is essential in these cases.  The knowledge graph isn’t just about facts; it’s about opinions as well.  Much of the knowledge you use in everyday reasoning concerns opinions, and if the internet is to get smarter, it needs this knowledge just as much as it needs factual information.

My guess is that Google gets this.  In 2004, searches for the word “best” were roughly equal to searches for words like “car,” “computer,” or “software,” but people are increasingly searching for opinions online.  My uneducated guess is that Google bought Zagat, in part, for this reason.  Bing, Wolfram Alpha, Apple, and Facebook are all working on similar semantic search solutions, and as long as people continue to dream about the holodeck computer that can intelligently answer requests like “book me a hotel room in Toronto” or “buy my niece a present for her birthday,” data about opinions will be part of the future of the knowledge graph.

– Ravi Iyer

by Ravi Iyer in Data Science, Market Research

Better Data, Not Bigger Data – Thoughts from the Data 2.0 Conference

As part of our effort to promote Ranker’s unique dataset, I recently attended the Data 2.0 conference in San Francisco.  “Data 2.0” is a relatively vague term, and as Ranker’s resident data scientist, I have a particular perspective on what constitutes the future of data.  My PhD is in psychology, not computer science, so for me, data has always been a means rather than an end.  One thing that became readily apparent in the first few talks I saw was that much of the conference’s emphasis was on dealing with bigger data sets, without much consideration of what one could do with the data.  It goes without saying that larger samples allow for more statistical power than smaller ones, but as someone who has collected some of the larger samples of psychological data (via YourMorals.org and BeyondThePurchase.org), I have often found that what holds back my predictive power is not the volume of data, but rather the diversity of variables in my dataset.  What I often need is not bigger data; it’s better data.

The same premise has informed much of our data decision making at Ranker, where we emphasize the quality of our semantic, linked data, as opposed to the quantity.  Again, both quality and quantity are important, but my impression going through the conference was that there was an over-emphasis on quantity.  I didn’t find anyone talking about semantic data, which is one of the primary “Data 2.0” concepts that relates more to quality than quantity.

I tested this idea out with a few people at the conference, framed as “better data beats better algorithms,” and generally got positive feedback on the phrase.  I was heartened when the moderator of a panel entitled “Data Science and Predicting the Future,” which included Alex Gray, Anthony Goldbloom, and Josh Wills, specifically asked what was more important: data, people, or algorithms.  It wasn’t quite the question I had in mind, but it served as a great jumping-off point for the discussion.  Josh Wills, who previously worked as a data scientist at Google, said the following, which I’m paraphrasing, as I didn’t take exact notes:

“Google and Facebook both have really smart people.  They use essentially the same algorithms.  The reason why Google can target ads better than Facebook is purely a function of better data.  There is more intent in the data related to the Google user, who is actively searching for something, and so there is more predictive power.  If I had a choice between asking my team to work on better algorithms or joining the data we have with other data, I’d want my team joining my data with other data, as that is what will lead to the most value.”


Again, that is paraphrased.  Some of the panelists disagreed a bit.  Alex Gray works on algorithms and so emphasized their importance.  To be fair, I work with relatively precise data, so I have the same bias in emphasizing the importance of quality data.  Daniel Tunkelang, Principal Data Scientist at LinkedIn, supported Josh in saying that better data was indeed more important than bigger data, a point his colleague Monica Rogati had made recently at a conference.  I was excited to hear that others had been having similar thoughts about the need for better, not bigger, data.

I ended up asking a question about the Netflix challenge, where enormous algorithmic effort and collective intelligence were applied to the problem (reducing prediction error), yet the goal was a relatively modest 10% gain, ultimately won by a truly complex algorithm that Netflix itself found too costly to use relative to the gains.  Surely better data (e.g. user opinions about different genres, or user opinions about more dimensions of each movie) would have led to much more than a 10% gain.  There seemed to be general agreement, though Anthony Goldbloom rightly pointed out that you need the right people to figure out how to get better data.

In the end, we all have our perspectives, based perhaps on what we work on, but I do think that the “better data” perspective is often lost in the rush toward larger datasets with more complex algorithms.  For more on this perspective, here and here are two blog posts I found interesting on the subject.  Daniel Tunkelang blogged about the same panel here.

– Ravi Iyer

by Ravi Iyer in Data Science

The Moral Psychology and Big Data Singularity – SXSW 2012

Below is a narrated PowerPoint presentation from a talk I gave at South by Southwest Interactive on March 11, 2012.  The point of the presentation was to explore the intersection of technology and psychology, and hopefully to convince technologists to use our data to examine intangible things like values.  While the talk focuses more on psychology, many of the ideas were inspired by the semantic datasets we work with at Ranker.  Working with semantic datasets puts one in the mindset of considering synergy among different fields with different kinds of data.
