How Netflix’s AltGenre Movie Grammar Illustrates the Future of Search Personalization

A few contacts recently sent me this Atlantic article on how Netflix reverse engineered Hollywood, and it happens to mirror my long-term vision for how Ranker’s data fits into the future of search personalization.  Netflix’s goal, to put “the right title in front of the right person at the right time,” is very similar to what Apple, Bing, Google, and Facebook are attempting to do with personalized contextual search.  Rather than requiring you to type in “best kitchen gadgets for mothers”, applications like Google Now and Cue (bought by Apple) hope to eventually surface this information to you in real time, knowing not only when your mother’s birthday is, but also that you tend to buy kitchen gadgets for her, and which highly rated gadgets are simple enough and within your price range.  If the application were good enough, a lot of us would trust it to simply charge our credit card and send the right gift.  But obviously we are a long way from that reality.

Netflix’s altgenre movie grammar (e.g. Irreverent Werewolf Movies Of The 1960s) gives us a glimpse of the level of specificity that would be required to get us there.  Consider what you need to know to buy the right gift for your mom.  You aren’t just looking for a kitchen gadget, but one with specific attributes.  In altgenre terminology, you might be looking for “best simple, beautifully designed kitchen gadgets of 2014 that cost between $25 and $100” or “best kitchen gadgets for vegetarian technophobes”.  Google knows that simple text matching will not deliver the precision necessary to provide such answers, which is why semantic search, in which the precise meaning of pages is mapped, has become a strategic priority.
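To make that level of specificity concrete, here is a rough sketch of how such a request might be represented as structured facets rather than free text.  The class and field names are entirely hypothetical, not any actual Google, Netflix, or Ranker schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class AltGenreQuery:
    """A hypothetical structured form of an altgenre-style request."""
    category: str                                       # e.g. "kitchen gadgets"
    year: Optional[int] = None                          # factual facet (from semantic data)
    price_range: Optional[Tuple[float, float]] = None   # factual facet
    opinion_facets: List[str] = field(default_factory=list)  # facets that require opinion data

# "best simple, beautifully designed kitchen gadgets of 2014 that cost between $25 and $100"
query = AltGenreQuery(
    category="kitchen gadgets",
    year=2014,
    price_range=(25.0, 100.0),
    opinion_facets=["simple", "beautifully designed"],
)
print(query)
```

The factual facets can be filled from semantic data sources; the opinion facets are exactly the part that needs rated human judgments.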

However, the universe of altgenre equivalents in the non-movie world is nearly endless (e.g. Netflix has thousands of ways just to classify movies), which is where Ranker comes in, as one of the world’s largest sources for collecting explicit, cross-domain, altgenre-like opinions.  Semantic data from sources like Wikipedia, DBpedia, and Freebase can help you put together factual altgenres like “of the 60s” or “that starred Brad Pitt”, but you need opinion ratings to put together subtler data like “guilty pleasures” or “toughest movie badasses”.  Netflix’s success is proof of the power of this level of specificity in personalizing movies, so consider how they produced this knowledge: not by running machine learning algorithms on their endless stream of user behavior data, but by soliciting explicit ratings along these dimensions, paying “people to watch films and tag them with all kinds of metadata” using a “36-page training document that teaches them how to rate movies on their suggestive content, goriness, romance levels, and even narrative elements like plot conclusiveness.”

Some people may think that with enough data, TripAdvisor should be able to tell you which cities are “cool”, but big data is not always better data.  Most data scientists will tell you about the importance of defining the features in any recommendation task (see this article for technical detail), rather than assuming that a large amount of data will reveal all of the right dimensions.  Working at the wrong level of abstraction makes prediction akin to trying to predict who will win the Super Bowl from the precise position and status of every cell in every player on every NFL team.  Netflix’s system allows them to make predictions at the right level of abstraction.
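As a toy illustration of predicting at the right level of abstraction, the sketch below scores movies against a viewer’s preferences using explicit, human-assigned dimensions rather than raw behavior logs.  The dimension names echo the ones quoted above, but the numbers and schema are invented, not Netflix’s actual tagging data:

```python
# Movies tagged by human raters on explicit dimensions (0-1 scale); values are invented.
movie_tags = {
    "Movie A": {"goriness": 0.9, "romance": 0.1, "plot_conclusiveness": 0.8},
    "Movie B": {"goriness": 0.2, "romance": 0.8, "plot_conclusiveness": 0.9},
}

# A viewer's preferences expressed on the same dimensions.
viewer_profile = {"goriness": 0.8, "romance": 0.2, "plot_conclusiveness": 0.6}

def match_score(tags, profile):
    """Similarity as 1 minus the mean absolute difference across shared dimensions."""
    dims = set(tags) & set(profile)
    return 1 - sum(abs(tags[d] - profile[d]) for d in dims) / len(dims)

# Rank movies by how well their explicit tags fit this viewer's profile.
ranked = sorted(movie_tags, key=lambda m: match_score(movie_tags[m], viewer_profile), reverse=True)
print(ranked)
```

The point is not the particular similarity measure, but that the features themselves are defined by people along meaningful dimensions.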

The future of search needs a Netflix grammar that goes beyond movies.  It needs to be able to understand not only which movies are dark versus gritty, but also which cities are better babymoon destinations versus party cities, and which rock singers are great vocalists versus great frontmen.  Ranker lists actually follow a similar grammar to Netflix’s altgenres, except that we apply it beyond the movie domain.  In a subsequent post, I’ll go into more detail about this, but suffice it to say for now that I’m hopeful that our data will eventually play the same role in personalizing non-movie content that Netflix’s microtagging plays in film recommendations.

– Ravi Iyer

 

Ranker Uses Big Data to Rank the World’s 25 Best Film Schools

NYU, USC, UCLA, Yale, Juilliard, Columbia, and Harvard top the rankings.

Does USC or NYU have a better film school?  “Big data” can provide an answer to this question by linking data about movies, and the actors, directors, and producers who have worked on them, to data about universities and their graduates.  As such, one can use semantic data from sources like Freebase, DBpedia, and IMDB to figure out which schools have produced the most working graduates.  However, what if you cared about the quality of the movies those graduates worked on, rather than just the quantity?  Educating a student who went on to work on The Godfather must certainly be worth more than producing a student who received a credit on Gigli.

Leveraging opinion data from Ranker’s Best Movies of All-Time list in addition to widely available semantic data, Ranker recently produced a ranked list of the world’s 25 best film schools, based on credits on movies within the top 500 movies of all time.  USC produces the most film credits by graduates overall, but when film quality is taken into account, NYU (208 credits) actually produces more credits among the top 500 movies of all time than USC (186 credits).  UCLA, Yale, Juilliard, Columbia, and Harvard take places 3 through 7 on Ranker’s list.  Several professional schools that focus on the arts also place in the top 25 (e.g. London’s Royal Academy of Dramatic Art), as do some well-located high schools (New York’s Fiorello H. LaGuardia High School and Beverly Hills High School).
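For the curious, the computation behind a list like this boils down to a join between semantic credit data and an opinion-ranked movie list.  Here is a minimal sketch with invented sample records standing in for the Freebase/DBpedia/IMDB data and Ranker’s top 500:

```python
from collections import Counter

# Hypothetical semantic data: which school each person attended, and their film credits.
alumni = {"Person A": "New York University", "Person B": "University of Southern California"}
credits = [("Person A", "The Godfather"), ("Person A", "Goodfellas"), ("Person B", "Gigli")]

# Stand-in for the opinion layer: titles appearing in the top 500 movies of all time.
top_500 = {"The Godfather", "Goodfellas"}

# Count only credits on films that the opinion data places in the top 500.
quality_credits = Counter(
    alumni[person]
    for person, film in credits
    if person in alumni and film in top_500
)
print(quality_credits.most_common())
```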

The World’s Top 25 Film Schools

  1. New York University (208 credits)
  2. University of Southern California (186 credits)
  3. University of California – Los Angeles (165 credits)
  4. Yale University (110 credits)
  5. Juilliard School (106 credits)
  6. Columbia University (100 credits)
  7. Harvard University (90 credits)
  8. Royal Academy of Dramatic Art (86 credits)
  9. Fiorello H. LaGuardia High School of Music & Art (64 credits)
  10. American Academy of Dramatic Arts (51 credits)
  11. London Academy of Music and Dramatic Art (51 credits)
  12. Stanford University (50 credits)
  13. HB Studio (49 credits)
  14. Northwestern University (47 credits)
  15. The Actors Studio (44 credits)
  16. Brown University (43 credits)
  17. University of Texas – Austin (40 credits)
  18. Central School of Speech and Drama (39 credits)
  19. Cornell University (39 credits)
  20. Guildhall School of Music and Drama (38 credits)
  21. University of California – Berkeley (38 credits)
  22. California Institute of the Arts (38 credits)
  23. University of Michigan (37 credits)
  24. Beverly Hills High School (36 credits)
  25. Boston University (35 credits)

“Clearly, there is a huge effect of geography, as prominent New York- and Los Angeles-based high schools appear to produce more graduates who work on quality films than many colleges and universities,” says Ravi Iyer, Ranker’s Principal Data Scientist and a graduate of the University of Southern California.

Ranker is able to combine factual semantic data with an opinion layer because it is powered by a Virtuoso triple store holding over 700 million triples, which are processed into an entertaining list format for users of Ranker’s consumer-facing website, Ranker.com.  Each month, over 7 million unique users interact with this data – ranking, listing, and voting on various objects – effectively adding a layer of opinion data on top of the factual data from Ranker’s triple store.  The result is a continually growing opinion graph that connects factual and opinion data.  As of January 2013, Ranker’s opinion graph included over 30,000 nodes connected by over 5 million edges.
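For readers who want a sense of what querying a setup like this looks like, here is a sketch using the SPARQLWrapper library against a Virtuoso-style SPARQL endpoint.  The endpoint URL, predicates, and vocabulary are placeholders of my own, not Ranker’s actual (non-public) schema:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint; a real Virtuoso instance exposes SPARQL at a URL like this.
sparql = SPARQLWrapper("http://example.org/sparql")
sparql.setReturnFormat(JSON)

# Hypothetical vocabulary mixing a factual predicate with an opinion predicate.
sparql.setQuery("""
    PREFIX ex: <http://example.org/vocab#>
    SELECT ?film ?year ?upvotes WHERE {
        ?film ex:releaseYear ?year ;
              ex:upvotes ?upvotes .
        FILTER (?year >= 1960 && ?year < 1970)
    }
    ORDER BY DESC(?upvotes)
    LIMIT 10
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["film"]["value"], row["year"]["value"], row["upvotes"]["value"])
```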

– Ravi Iyer


How Ranker leverages Google’s Knowledge Graph

Google recently held its I/O conference, and one of the talks was given by Shawn Simister, who was once Freebase’s biggest fan and has since gone on to work at Google, which acquired Freebase a few years ago.  What is Freebase?  It’s the structured semantic data that powers Google’s knowledge graph and Ranker, along with many other organizations featured in this talk (Ranker is mentioned around the 8:45 mark).  The talk gives organizations that may not be familiar with Freebase an overview of how they can leverage Freebase’s semantic data.

How does Ranker use the knowledge graph?  Freebase’s semantic data powers much of what we do at Ranker, and the graph below illustrates how we relate to the semantic web.

How Ranker Relates to the Semantic Web

We leverage data from the semantic web, often via Freebase, to create content in list format (e.g. The Best Beatles Songs), which our users then vote on and re-rank.  This creates an opinion data layer that is easily exportable to any other entity (e.g. The New York Times or Netflix) that is connected to the larger semantic web.  Our hope is that, just as people in the presentation are beginning to create mashups of factual data, eventually people will also want to merge in opinion data, and that we will have the best semantic opinion dataset out there when that happens.  The more people connect their data to the semantic web, the more lists we can create, and the more potential consumers exist for our opinion data.  As such, we’d encourage you to check out Shawn’s presentation, and hopefully you’ll find Freebase as useful as we do.
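To show what “easily exportable” could mean in practice, here is a sketch that serializes a couple of opinion statements as RDF using the rdflib library, keyed to Freebase-style identifiers so they stay connected to the semantic web.  The vocabulary, the MID, and the numbers are placeholders made up for illustration:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

FB = Namespace("http://rdf.freebase.com/ns/")            # Freebase RDF namespace
RANKER = Namespace("http://example.org/ranker/vocab#")   # hypothetical opinion vocabulary

g = Graph()
song = FB["m.0placeholder"]  # placeholder Freebase MID for a song on a Ranker list

# Hypothetical opinion statements layered on top of the factual identifier.
g.add((song, RANKER.listRank, Literal(1, datatype=XSD.integer)))
g.add((song, RANKER.upvotes, Literal(1234, datatype=XSD.integer)))

print(g.serialize(format="turtle"))
```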

– Ravi Iyer

 


The Knowledge Graph is about more than facts

Today, Google announced the “knowledge graph”, which brings facts into Google searches.  Now, when you search for an object that Google understands, the results reflect Google’s actual understanding, leveraging what Google knows about each object.  Here is a video with more detail.

At Ranker, we know things about specific objects too, as almost every item in the Ranker system maps to an object in Freebase, the dataset built by MetaWeb, the company Google bought in order to provide these features.  We know a lot of the same information that Google knows, since we leverage the Freebase dataset.  For example, on our Godfather page, we present facts such as who directed the movie, when it was released, and what its rating was.  However, we also present other facts that people traditionally do not think of as part of the knowledge graph, but that are actually just as essential to understanding the world.  We tell you that it’s one of the best movies of all time.  We also tell you that people who like The Godfather also tend to like Goodfellas, The Shawshank Redemption, and Scarlett Johansson.
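Those “people who like X also like Y” facts come from co-preference patterns in voting data.  Here is a minimal sketch of that idea, with invented votes rather than Ranker’s real data:

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-user sets of liked items, derived from upvotes on lists.
user_likes = {
    "user1": {"The Godfather", "Goodfellas", "The Shawshank Redemption"},
    "user2": {"The Godfather", "Goodfellas"},
    "user3": {"The Godfather", "The Shawshank Redemption"},
}

# Count how often each pair of items is liked by the same user.
pair_counts = Counter()
for likes in user_likes.values():
    for a, b in combinations(sorted(likes), 2):
        pair_counts[(a, b)] += 1

# Items most often co-liked with The Godfather.
target = "The Godfather"
related = Counter()
for (a, b), n in pair_counts.items():
    if target in (a, b):
        related[b if a == target else a] += n
print(related.most_common())
```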

Is this “knowledge”?  These aren’t “hard” facts, but it is a fact that people generally think of The Godfather as a good movie and Gigli as a bad movie.  Moreover, knowledge about people’s opinions is essential for understanding the world in the way that the “Star Trek computer” referred to in Google’s blog post understands the world.  Could you pick a college based only on factual information about enrollment and majors offered?  Could you hold an intelligent conversation about Harvard without knowing its place in the universe of universities?  Could you choose a neighborhood to live in based solely on statistics about the neighborhood, or would understanding which neighborhoods people like you tend to like help you make the right choice?  If the broader mission of a search engine is to help you answer questions, then information about people’s opinions of colleges and neighborhoods is essential in these cases.  The knowledge graph isn’t just about facts; it’s about opinions as well.  Much of the knowledge you use in everyday reasoning concerns opinions, and if the internet is to get smarter, it needs this knowledge just as much as it needs factual information.

My guess is that Google gets this.  In 2004, searches for the word “best” were roughly equal to searches for words like car, computer, or software, but people are increasingly searching for opinions online.  My uneducated guess is that Google bought Zagat, in part, for this reason.  Bing, Wolfram Alpha, Apple, and Facebook are all working on similar semantic search solutions, and as long as people continue to dream about the holodeck computer that can intelligently answer requests like “book me a hotel room in Toronto” or “buy my niece a present for her birthday”, data about opinions will be part of the future of the knowledge graph.

– Ravi Iyer