
Applying Machine Learning to the Diversity within our Worst Presidents List

Ranker visitors come from a diverse array of backgrounds, perspectives, and opinions.  That diversity, however, is often lost when we look at a list's overall rankings, because the rankings reflect a raw average of all the votes on a given item, regardless of how voters behave on other items.  It would be useful, then, to learn more about how users vote across a range of items and to recover some of the diversity inherent in how people vote on our lists.

Take, for instance, one of our most popular lists: Ranking the Worst U.S. Presidents, which has been voted on by over 60,000 people and comprises over half a million votes.

In this partisan age, it is easy to imagine that such a list would create some discord, and indeed, when we look at the average voting behavior of all the voters, the list itself has some inconsistencies.  For instance, the five worst-rated presidents alternate along party lines, which is unlikely to represent a historically accurate account of which presidents were actually the worst.  The result is a list that reflects our partisan opinions about our nation's presidents:

 

[Image: overall rankings of the Worst U.S. Presidents list]

 

The list itself provides an interesting glimpse of what happens when two parties collide in voting for the worst presidents, but we are missing interesting data about how diverse our visitors are.  So how can we reconstruct the distinct groups of voters on the list and see how each cluster of voters ranks the presidents?

To solve this, we turn to a common machine learning technique referred to as "k-means clustering."  K-means clustering takes the voting data for each user, summarizes it, and then finds other users with similar voting patterns.  The algorithm is given no information whatsoever from me as the data scientist and has no real idea what the data mean; it simply looks at each Ranker visitor's votes, finds people who vote similarly, and clusters the patterns according to the data itself.  K-means can partition the data into as many clusters as you like, and there are standard ways to determine how many clusters to use.  Once the clusters are drawn, I re-rank the presidents for each cluster using Ranker's algorithm, and then we can see how the different clusters ranked the presidents.
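To make this concrete, here is a minimal sketch of the clustering step in Python with scikit-learn (not necessarily the tools used here).  The vote matrix is randomly generated as a stand-in for real data, and the re-ranking is a simple mean-vote ordering rather than Ranker's actual algorithm.

import numpy as np
from sklearn.cluster import KMeans

# Stand-in vote matrix: rows = voters, columns = presidents,
# +1 = "worst" upvote, -1 = downvote, 0 = no vote on that president.
rng = np.random.default_rng(0)
votes = rng.choice([-1, 0, 1], size=(1000, 44))

# Cluster voters by the similarity of their voting patterns.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(votes)

# One standard way to choose the number of clusters: watch how inertia
# (within-cluster sum of squares) falls as k grows (the "elbow" method).
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(votes)
    print(k, round(model.inertia_, 1))

# Re-rank presidents within each cluster; a simple mean-vote ordering
# stands in for Ranker's actual ranking algorithm.
for label in np.unique(kmeans.labels_):
    cluster_votes = votes[kmeans.labels_ == label]
    worst_first = np.argsort(-cluster_votes.mean(axis=0))
    print(f"cluster {label}, 'worst' column indices: {worst_first[:5]}")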

As it happens, there are some differences in how clusters of Ranker visitors voted on the list.  In a two-cluster analysis, we find two groups of people with almost completely opposite voting behavior.

(*Note that since this is a list voting on the worst president, voters are not being asked to rank the presidents from best to worst; rather, the ranking reflects how much worse each president is considered relative to the others.)

The k-means analysis found one cluster that appears to think Republican presidents are worst:

[Image: rankings from cluster one]

Here is the other cluster, with opposite voting behavior:

[Image: rankings from cluster two]

In this two-cluster analysis, the shape of the data is pretty clear and fits our preconceived picture of how partisan politics might play out on the list.  But there is a bias toward recent presidents, and the lists do not mimic academic lists and polls ranking the worst presidents.

To explore the data further, I ran a five-cluster analysis; in other words, I looked for five different types of voters in the data.

Here is what the five-cluster analysis returned:

[Image: rankings from all five clusters]

The results show a little more diversity in how the clusters ranked the presidents.  Again, we see some clusters that are more or less voting along party lines based on recent presidents (Clusters 5 and 4).  Clusters 1 and 3 are also interesting in that the algorithm seems to be picking up clusters of visitors who are voting for people who have not been president (Hillary Clinton, Ben Carson) or, thankfully, were never president (Adolf Hitler).  Clusters 2 and 3 are most interesting to me, however, as they bear a greater resemblance to the academic lists of worst presidents (for reference, see Wikipedia's rankings of presidents), with a more historical bent on how we think of these presidents; I think of this as a more informed partisanship.

By understanding the diverse sets of users that make up our crowdranked lists, we can improve our overall rankings and provide a more nuanced understanding of how different groups' opinions compare, beyond the demographic groups we currently expose on our Ultimate Lists.  Such analyses also help us identify outliers and agenda pushers in the voting patterns, and allow us to rebalance our sample to make lists that more closely resemble a national average.

– Glenn Fox

 

 

A Ranker World of Comedy Opinion Graph: Who Connects the Funny Universe?

In the previous post, we showed how a Gephi layout algorithm was able to capture different domains in the world of comedy across all of the Ranker lists tagged with the word "funny."  However, these algorithms also give us information about the roles that individuals play within clusters.  The size of a node reflects its ability to connect other nodes, so bigger nodes indicate individuals who serve as gateways between different nodes and categories.  These are the nodes you would want to target to reach the broadest audience, since people who like these comedic individuals also like many others.  It's a bit like having that one friend who knows everyone send out the event invite, instead of sending it to a smaller group of friends in your own social network and hoping it gets around.  So who connects the comedic universe?

The short answer: Dave Chappelle

[Image: the comedy Opinion Graph, centered on Dave Chappelle]

Dave Chappelle is the superconnector. He has both the largest number of direct connections and the largest number of overall connections. If you want to reach the most people, go to him. If you want to connect people between different kinds of comedy, go to him.  He is the center of the comedic universe. He’s not the only one with connections though.
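A standard way to quantify this kind of "connector" role is betweenness centrality, which Gephi also exposes.  Here is a small networkx sketch of the computation; the comedians and correlation weights are made up for illustration.

import networkx as nx

G = nx.Graph()
# Hypothetical item-item vote correlations standing in for the Opinion Graph.
edges = [
    ("Dave Chappelle", "Eddie Izzard", 0.8),
    ("Dave Chappelle", "South Park", 0.5),
    ("Eddie Izzard", "John Cleese", 0.7),
    ("John Cleese", "Eric Idle", 0.9),
    ("South Park", "Family Guy", 0.9),
]
for a, b, r in edges:
    # networkx treats edge weight as a distance, so convert correlation to 1 - r.
    G.add_edge(a, b, weight=1 - r)

# Betweenness centrality scores how often a node bridges paths between others.
centrality = nx.betweenness_centrality(G, weight="weight")
for name, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")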

Top 10 Overall Connectors

  1. Dave Chappelle 
  2. Eddie Izzard 
  3. John Cleese 
  4. Ricky Gervais
  5. Rowan Atkinson
  6. Eric Idle
  7. Billy Connolly
  8. Bill Hicks
  9. It's Always Sunny in Philadelphia
  10. Sarah Silverman

 

We can also look at who the biggest connectors are between different comedy domains.

  • Contemporary TV Shows: It’s Always Sunny in Philadelphia, ALF, and The Daily Show are the strongest connectors. They provide bridges to all 6 other comedy domains.
  • Contemporary Comedians on American Television: Dave Chappelle, Eddie Izzard and Ricky Gervais are the strongest connectors. They provide bridges to all 6 other comedy domains.
  •  Classic Comedians: John Cleese and Eric Idle are the strongest connectors. They provide bridges to all 6 other comedy domains.
  • Classic TV Shows: The Muppet Show and Monty Python’s Flying Circus are the strongest connectors. They provide bridges to Classic TV Comedians, Animated TV shows, and Classic Comedy Movies.
  • British Comedians: Rowan Atkinson is the strongest connector. He serves as a bridge to all of the other 6 comedy domains.
  • Animated TV Shows: South Park is the strongest connector. It serves as a bridge to Classic Comedians, Classic TV Shows, and British Comedians.
  • Classic Comedy Movies: None of the nodes in this domain were strong connectors to other domains, though National Lampoon’s Christmas Vacation was the strongest node in this network.

 

 

A Ranker Opinion Graph of the Domains of the World of Comedy

One unique aspect of Ranker data is that people rank a wide variety of lists, allowing us to look at connections beyond the scope of any individual topic.  We compiled data from all of the lists on Ranker tagged with the word "funny" to get a bigger picture of the interconnected world of comedy.  Using Gephi layout algorithms, we created an Opinion Graph that categorizes comedy domains and identifies points of intersection between them.
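As a rough, hypothetical sketch of the assembly step that happens before any Gephi styling: correlate items by users' votes, keep the strong links as weighted edges, and export to a format Gephi can read.  The items, the random vote matrix, and the 0.1 threshold below are all illustrative; on real, correlated vote data many more edges survive the cutoff.

import numpy as np
import networkx as nx

items = ["It's Always Sunny in Philadelphia", "ALF", "The Daily Show",
         "Dave Chappelle", "South Park"]
rng = np.random.default_rng(1)
votes = rng.choice([-1, 0, 1], size=(500, len(items)))  # users x items stand-in

corr = np.corrcoef(votes, rowvar=False)  # item-item vote correlations

G = nx.Graph()
G.add_nodes_from(items)
threshold = 0.1  # illustrative cutoff; keep only meaningful correlations
for i in range(len(items)):
    for j in range(i + 1, len(items)):
        if corr[i, j] > threshold:
            G.add_edge(items[i], items[j], weight=float(corr[i, j]))

nx.write_gexf(G, "comedy_graph.gexf")  # GEXF files open directly in Gephi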

[Image: the full comedy Opinion Graph]

In the following graphs, colors indicate the different comedic categories that emerged from a cluster analysis, and the connecting lines indicate correlations between nodes, with thicker lines marking stronger relationships.  Circles (or nodes) that are closest together are most similar.  The classification algorithm produced 7 comedy domains:

 

American TV Shows and Characters: 26% of comedy, central nodes = It's Always Sunny in Philadelphia, ALF, The Daily Show, Chappelle's Show, and Friends.

Contemporary Comedians on American Television: 25% of nodes, includes Dave Chappelle, Eddie Izzard, Ricky Gervais, Billy Connolly, and Bill Hicks.

 

Classic Comedians: 15% of comedy, central nodes = John Cleese, Eric Idle, Michael Palin, Charlie Chaplin, and George Carlin.

Classic TV Shows and Characters: 14% of comedy, central nodes = The Muppet Show, Monty Python's Flying Circus, In Living Color, WKRP in Cincinnati, and The Carol Burnett Show.

British Comedians: 9% of comedy, central nodes = Rowan Atkinson, Jennifer Saunders, Stephen Fry, Hugh Laurie, and Dawn French.

Animated TV Shows and Characters: 9% of comedy, central nodes = South Park, Family Guy, Futurama, The Simpsons, and Moe Szyslak.

Classic Comedy Movies: 1.5% of comedy, central nodes = National Lampoon's Christmas Vacation, Ghostbusters, Airplane!, Vacation, and Caddyshack.

 

 

Clusters that are the most similar (most overlap/closest together):

  • Classic TV Shows and Contemporary TV Shows
  • British Comedians and Classic TV shows
  • British Comedians and Contemporary Comedians on American Television
  • Animated TV Shows and Contemporary TV Shows

Clusters that are the most distinct (least overlap/furthest apart):

  • Classic Comedy Movies do not overlap with any other comedy domains
  • Animated TV Shows and British Comedians
  • Contemporary Comedians on American Television and Classic TV Shows

 

Take a look at our follow-up post on the individuals who connect the comedic universe.

– Kate Johnson

 


A Ranker Opinion Graph of Important Life Goals

What does it mean to be successful, and what life goals should we set in order to get there? Is spending time with family most important? What about your career?  We asked people to rank their life goals in order of importance on Ranker, and using a layout algorithm (Force Atlas in Gephi), we identified goal categories and arranged the goals so that the most closely related ones sit nearest to each other.

The connecting lines in the graph represent significant correlations between different life goals, with thicker lines indicating stronger relationships.  The colors in the graph differentiate the unique groups that emerged from a cluster analysis.
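Gephi's Force Atlas isn't available outside that tool, but networkx's spring layout (Fruchterman-Reingold) is a comparable force-directed algorithm, and a small sketch shows the idea: strongly correlated goals are pulled close together, while weakly connected ones drift apart.  The goals and weights below are illustrative stand-ins for the real data.

import networkx as nx

G = nx.Graph()
# Hypothetical goal-goal correlation strengths standing in for real votes.
G.add_weighted_edges_from([
    ("being happy", "enjoying life", 0.9),
    ("being happy", "being healthy", 0.7),
    ("having money/wealth", "being a leader", 0.8),
    ("doing the right thing", "sharing life", 0.6),
])

# spring_layout pulls strongly weighted pairs together, like Force Atlas.
pos = nx.spring_layout(G, weight="weight", seed=42)
for goal, (x, y) in pos.items():
    print(f"{goal}: ({x:.2f}, {y:.2f})")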

[Image: the life goals Opinion Graph]

The classification algorithm produced 5 main life goal clusters:
(1) Religion/Spirituality (e.g., Christian values, achieving Religion & Spirituality),
(2) Achievement and Material Goods (e.g., being a leader, avoiding failure, having money/wealth),
(3) Interpersonal Involvement/Moral Values (e.g., sharing life, doing the right thing, being inspiring),
(4) Personal Growth (e.g., achieving wisdom & serenity, pursuing ideals and passions, peace of mind), and
(5) Emotional/Physical Well-Being (e.g., being healthy, enjoying life, being happy).

These clusters are well matched to those identified in Robert Emmons's (1999) psychological research on goal pursuit and well-being. Emmons found that life goals form 4 primary categories: work and achievement, relationships and intimacy, religion and spirituality, and generativity (leaving a legacy/contributing to society).

However, not all goals are created equal.  While success-related goals may help us get ahead in life, they also have downsides.  People who focus on zero-sum goals such as work and achievement tend to report less happiness and life satisfaction than people who pursue other kinds of goals. Our data also show a large divide between Well-Being and Work/Achievement goals, with virtually no overlap between the two groups.

Other interesting relationships in our graph:

  • Goals related to moral values (e.g., doing the right thing) were clustered with (and therefore more closely related to) interpersonal goals than they were to religious goals.
  • Sexuality was related to goals from opposite ends of the space in unique ways. Well-being goals were related to sexual intimacy whereas Achievement goals were related to promiscuity.
  • While most goal clusters were primarily made up of goals for pursuing positive outcomes, the Achievement/Material Goods goal cluster also included the most goals related to avoiding negative consequences (e.g., avoiding failure, avoiding effort, never going to jail).
  • Our Personal Growth goal cluster does not appear in many of the traditional goal taxonomies in the psychological literature, and our data did not produce the typical goal cluster related to Generativity. This may reflect a shift in goal striving from community growth to personal growth.

– Kate Johnson

Citation: Emmons, R. A. (1999). The psychology of ultimate concerns: Motivation and spirituality in personality. New York: Guilford Press.

 


A Cluster Analysis of the Superpower Opinion Graph produces 5 Superhero types

If you could have one superpower, which would you choose?  Data from the Ranker list "Badass Superpowers We'd Give Anything to Have" improves on the age-old classroom icebreaker by letting people rank all of the superpowers in order of how much they would want them.  Because really, unless you're one of the X-Men, you would probably want more than one power. So, if you could have a collection of superpowers, what kind of superhero would you be?

Using Gephi and data from Ranker’s Opinion Graph, we ran a cluster analysis on people’s votes on the superpowers list to determine what groupings of superpowers different people wanted.

This analysis grouped superpowers into 5 clusters, which we interpreted to represent unique superhero types.

 

The Overall Superpower Opinion Graph

[Image: the overall superpower Opinion Graph]

 

 

The 5 Types of Superheroes

[Image: The Creationist God cluster]

1. The Creationist God: This superhero type is characterized by creation and destruction, Old-Testament Christian God-style. Notable superpowers: the ability to create/destroy worlds, die and come back to life, have gods’ weapons (Thor’s Hammer, Zeus’ Thunderbolt), remove others’ senses, and resurrect the dead.

[Image: The Time Lord cluster]

2. The Time Lord: This superhero type is basically The Doctor from Doctor Who. Notable superpowers: omnipotence, travel to other dimensions, open portals to anywhere, and travel beyond the omniverse.

[Image: The Elementalist cluster]

3. The Elementalist: This superhero type can manipulate the elements and use them as weapons. Notable superpowers: manipulation of water, fire, weather, and plants, the ability to shapeshift, and the ability to shoot ice, lightning, and fire.

[Image: The Superhuman cluster]

4. The Superhuman: This superhero type is humans+, with enhanced human senses and decreased human limitations. Notable superpowers: sense danger, x-ray vision, walk through walls, super speed, mind reading, flight, super strength, and enhanced flexibility.

[Image: The Zen Master cluster]

5. The Zen Master: This superhero type sounds a bit like being permanently on mind-altering psychoactive substances, crossed with Gandhi. Notable superpowers: speech empowerment, spiritual enlightenment, and infinite appetite!

 

– Kate Johnson


Ranky Goes to Washington?

Something pretty cool happened last week here at Ranker, and it had nothing to do with the season premiere of "The Big Bang Theory," which we're also really excited about. Cincinnati's number one digital paper used our widget to create a votable list of ideas mentioned in Cincinnati Mayor John Cranley's first State of the City address. As of right now, 1,958 voters have cast 5,586 votes on the list of proposals from Mayor Cranley (not surprisingly, "fixing streets" ranks higher than the "German-style beer garden" that's apparently also an option).

Now, our widget is used by thousands of websites to either take one of our votable lists or create their own and embed it on their site, but this was the very first time Ranker was used to directly poll people on public policy initiatives.

Here's why we're loving this idea: we feel confident that Ranker lists are the most fun and reliable way to poll people at scale about a list of items within a specific context. That's what we've been obsessing over for the past 6 years. But we also think this could lead to a whole new way for people to weigh in, in fairly large numbers, on complex public policy issues on an ongoing basis, from municipal budgets to foreign policy. That's because Ranker is very good at getting a large number of people to cast their opinions about complex issues in ways that can't be achieved at this scale through regular polling methods (nobody's going to call you at dinner time to ask you to rank 10 or 20 municipal budget items ... and what is "dinner time" these days, anyway?).  It may not be a representative sample, but it may be the only sample that matters, given that the average citizen of Cincinnati has no idea about the details of the Mayor's speech and would likely offer whatever opinion moves a phone survey along on a topic they know little about.

Of course, the democratic process is the best way to get the best sample (there's little bias when it's the whole friggin voting population!) to weigh in on public policy as a whole. But elections are very expensive and infrequent, and the focus of their policy debates is the broadest possible relative to their geographical units, meaning that micro-issues like these will often get lost in the same tired partisan debates.

Meanwhile, society, technology, and the economy no longer operate on cycles consistent with election cycles: the rate and breadth of societal change is such that the public policy environment specific to an election quickly becomes obsolete, and new issues need sorting out as they emerge, something our increasingly polarized legislative processes have a hard time doing.

Online polls are an imperfect, but necessary, way to evaluate public policy choices on an ongoing basis. Yes, they are susceptible to bias, but good statistical models can overcome a lot of such bias and in a world where the response rates for telephone polls continue to drop, there simply isn’t an alternative.  All polling is becoming a function of statistical modeling applied to imperfect datasets.  Offline polls are also expensive, and that cost is climbing as rapidly as response rates are dropping. A poll with a sample size of 800 can cost anywhere between $25,000 and $50,000 depending on the type of sample and the response rate.  Social media is, well, very approximate. As we’ve covered elsewhere in this blog, social media sentiment is noisy, biased, and overall very difficult to measure accurately.

In comes Ranker. The cost of that Cincinnati.com Ranker widget? $0. Its sample size? Nearly 2,000 people, or anywhere from 2 to 4x the average sample size of current political polls. Ranker is also the best way to get people to quickly and efficiently express a meaningful opinion about a complex set of issues; we have collected thousands of precise opinions about conceptually complex topics like the scariest diseases and the most important life goals by making the act of providing opinions entertaining, within a context that makes simple actions meaningful.

Politics is the art of the possible, and we shouldn’t let the impossibility of perfect survey precision preclude the possibility of using technology to improve civic engagement at scale.  If you are an organization seeking to poll public opinion about a particular set of issues that may work well in a list format, we’d invite you to contact us.

– Ravi Iyer


Comparing World Cup Prediction Algorithms – Ranker vs. FiveThirtyEight

Like most Americans, I pay attention to soccer/football once every four years.  But I think about prediction almost daily, so this year's World Cup is especially interesting to me, as I have a dog in this fight.  Specifically, UC-Irvine Professor Michael Lee put together a prediction model based on the combined wisdom of Ranker users who voted on our Who Will Win the 2014 World Cup list, plus the structure of the tournament itself.  The methodology runs in contrast to the FiveThirtyEight model, which uses entirely different data (national team results, plus the league results of the players on each national team) to make predictions.  As such, the battle lines are clearly drawn.  Will the wisdom of crowds outperform algorithmic analyses based on match results?  Put another way, this is a test of whether human beings notice things that aren't picked up in the box scores and statistics that form the core of FiveThirtyEight's predictions and of sabermetrics.

So who will I be rooting for?  Both methodologies agree that Brazil, Germany, Argentina, and Spain are the teams to beat.  But the crowds believe that those four teams are relatively evenly matched while the FiveThirtyEight statistical model puts Brazil as having a 45% chance to win.  After those first four, the models diverge quite a bit with the crowd picking the Netherlands, Italy, and Portugal amongst the next few (both models agree on Colombia), while the FiveThirtyEight model picks Chile, France, and Uruguay.  Accordingly, I’ll be rooting for the Netherlands, Italy, and Portugal and against Chile, France, and Uruguay.

In truth, the best model would combine the signal from both methodologies, similar to how the Netflix Prize was won, or how baseball teams combine scouting and sabermetric opinions.  I'm pretty sure Nate Silver would agree that his model would be improved by adding our data (or similar data from betting markets like Betfair, which likewise thought FiveThirtyEight was underrating Italy and Portugal), and vice versa.  Still, even knowing that chance will play a big part in the outcome, I'm hoping Ranker data wins in this year's World Cup.
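As a toy illustration of that kind of blending: take a weighted average of each model's win probabilities and renormalize.  FiveThirtyEight's 45% for Brazil appears earlier in this post; every other number, and the 50/50 weighting, is purely illustrative.

# Blend two forecasts by weighted average, then renormalize over the
# teams shown (the real distributions would cover all 32 teams).
crowd = {"Brazil": 0.22, "Germany": 0.20, "Argentina": 0.20, "Spain": 0.20}
stats = {"Brazil": 0.45, "Germany": 0.12, "Argentina": 0.13, "Spain": 0.10}

w = 0.5  # trust placed in the crowd model vs. the statistical model
blend = {team: w * crowd[team] + (1 - w) * stats[team] for team in crowd}
total = sum(blend.values())
blend = {team: p / total for team, p in blend.items()}

for team, p in sorted(blend.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {p:.1%}")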

– Ravi Iyer

Ranker’s Pre-Tournament Predictions:

FiveThirtyEight’s Pre-Tournament Predictions:


Ranker’s Rankings API Now in Beta

Increasingly, people are looking for specific answers to questions as opposed to webpages that happen to match the text they type into a search engine.  For example, if you search for the capital of France or the birthdate of Leonardo Da Vinci, you get a specific answer.  However, the questions that people ask are increasingly about opinions, not facts, as people are understandably more interested in what the best movie of 2013 was, as opposed to who the producer for Star Trek: Into Darkness was.

Enter Ranker's Rankings API, which is currently in beta; we'd love the input of potential users of our API to help improve it.  Our API returns aggregated opinions about specific movies, people, TV shows, places, etc.  As an input, we can take a Wikipedia, Freebase, or Ranker ID.  For example, below is a request for information about Tom Cruise, using his Ranker ID from his Ranker page (contact us if you want to use other IDs for access).
http://api.ranker.com/rankings/?ids=2257588&type=RANKER

In the response to this request, you'll get a set of rankings for the requested object, including list names (e.g. "listName":"The Greatest 80s Teen Stars"), list URLs (e.g. "listUrl":"http://www.ranker.com/crowdranked-list/45-greatest-80_s-teen-stars"; note that the domain, www.ranker.com, is implied), item names (e.g. "itemName":"Tom Cruise"), the position of the item on the list (e.g. "position":21), the number of items on the list (e.g. "numItemsOnList":70), the number of people who have voted on the list (e.g. "numVoters":1149), the number of positive votes for the item (e.g. "numUpVotes":245) vs. the number of negative votes (e.g. "numDownVotes":169), and the Ranker list id (e.g. "listId":584305).  Note that results are cached, so they may not match the current page exactly.

Here is a snippet of the response for Tom Cruise.

[ { "itemName" : "Tom Cruise",
"listId" : 346881,
"listName" : "The Greatest Film Actors & Actresses of All Time",
"listUrl" : "http://www.ranker.com/crowdranked-list/the-greatest-film-actors-and-actresses-of-all-time",
"numDownVotes" : 306,
"numItemsOnList" : 524,
"numUpVotes" : 285,
"numVoters" : 5305,
"position" : 85
},
{ "itemName" : "Tom Cruise",
"listId" : 542455,
"listName" : "The Hottest Male Celebrities",
"listUrl" : "http://www.ranker.com/crowdranked-list/hottest-male-celebrities",
"numDownVotes" : 175,
"numItemsOnList" : 171,
"numUpVotes" : 86,
"numVoters" : 1937,
"position" : 63
},
{ "itemName" : "Tom Cruise",
"listId" : 679173,
"listName" : "The Best Actors in Film History",
"listUrl" : "http://www.ranker.com/crowdranked-list/best-actors",
"numDownVotes" : 151,
"numItemsOnList" : 272,
"numUpVotes" : 124,
"numVoters" : 1507,
"position" : 102
},

...CLIPPED...
]
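For anyone who wants to try it, here is a minimal Python sketch of calling the endpoint above and summarizing each ranking.  The URL and field names come from the example; the "top share" figure computed at the end is just an illustration, and real code would want more error handling.

import requests

# Fetch rankings for Tom Cruise using the Ranker ID shown above.
resp = requests.get("http://api.ranker.com/rankings/",
                    params={"ids": "2257588", "type": "RANKER"})
resp.raise_for_status()

for ranking in resp.json():  # the API returns a JSON array of rankings
    share = ranking["position"] / ranking["numItemsOnList"]  # illustrative
    print(f"{ranking['listName']}: #{ranking['position']} of "
          f"{ranking['numItemsOnList']} "
          f"(+{ranking['numUpVotes']}/-{ranking['numDownVotes']}, "
          f"top {share:.0%})")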

What can you do with this API?  Consider this page about Tom Cruise from Google’s Knowledge Graph.  It tells you his children, his spouse(s), and his movies.  But our API will tell you that he is one of the hottest male celebrities, an annoying A-List actor, an action star, a short actor, and an 80s teen star.  His name comes up in discussions of great actors, but he tends to get more downvotes than upvotes on such lists, and even shows up on lists of “overrated” actors.

We can provide this information, not just about actors, but also about politicians, books, places, movies, tv shows, bands, athletes, colleges, brands, food, beer, and more.  We will tend to have more information about entertainment related categories, for now, but as the domains of our lists grow, so too will the breadth of opinion related information available from our API.

Our API is free and no registration is required, though we would request that you provide links and attributions to the Ranker lists that provide this data.  We likely will add some free registration at some point.  There are currently no formal rate limits, though there are obviously practical limits so please contact us if you plan to use the API heavily as we may need to make changes to accommodate such usage.  Please do let me know (ravi a t ranker) your experiences with our API and any suggestions for improvements as we are definitely looking to improve upon our beta offering.

– Ravi Iyer

Ranker Opinion Graph: the Best Froyo Toppings

It's hard to resist a cold treat on a hot summer afternoon, and frozen yogurt shops, with their arrays of flavors and toppings, have a little something for everyone. Once you're done agonizing over whether you want New York cheesecake or wild berry froyo (and trying a sample of each at least twice), it's time for the topping bar. But which topping should you choose? We asked people to vote for their favorite frozen yogurt toppings on Ranker from a list of 32 toppings, and they responded with over 7,500 votes.

The Top 5 Frozen Yogurt Toppings (by number of upvotes):
1. Oreo (235 votes)
2. Strawberries (225 votes)
3. Brownie bits (223 votes)
4. Hot fudge (216 votes)
5. Whipped cream (201 votes)

But let's be honest, who can choose just ONE topping for their froyo? Using Gephi and data from Ranker's Opinion Graph, we ran a cluster analysis on people's favorite-topping votes to determine which toppings people like to eat together. In the graph, larger circles indicate toppings that are liked alongside more other toppings. Most of the versatile toppings were either a syrup (like strawberry sauce) or a chocolate candy (like Reese's Pieces).

[Image: the froyo toppings Opinion Graph]

The 10 Most Versatile Froyo Toppings:

1. Strawberry sauce
2. Snickers
3. Magic Shell
4. White Chocolate chips
5. Peanut butter chips
6. Butterscotch syrup
  7. Nestle Butterfinger bar
8. Reese’s Pieces
9. M&Ms
10. Brownie bits

 

Using the modularity clustering tool in Gephi, we were then able to sort toppings into groups based on which toppings people were most likely to upvote together (a minimal sketch of this step follows the cluster descriptions below). We identified 4 kinds of froyo topping lovers:

1. Fruit and Nuts (Blue): This cluster is all about the fruits and nuts. These people love Strawberry sauce, sliced almonds, and Maraschino cherries.

2. Chocolate (Purple): This cluster encompasses all things chocolate. These people love Magic Shell, Brownie bits, and chocolate syrup.

 

3. Sugar Candy (Green): This cluster is made up of pure sugar. These people love gummy worms, Rainbow sprinkles, and Skittles.

 

 

4. Salty and Cake (Red): This cluster encompasses cake bites and toppings that have a salty taste to them. These people like Snickers, Cheesecake bits, and Caramel Syrup.
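Here is the promised sketch. Gephi's modularity tool implements Louvain-style community detection; networkx ships a comparable modularity-maximizing routine. The toppings and edge weights below are made-up stand-ins for the real co-vote strengths.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
# Hypothetical co-vote strengths between toppings.
G.add_weighted_edges_from([
    ("Strawberry sauce", "Sliced almonds", 0.6),
    ("Strawberry sauce", "Maraschino cherries", 0.7),
    ("Magic Shell", "Brownie bits", 0.8),
    ("Magic Shell", "Chocolate syrup", 0.8),
    ("Gummy worms", "Rainbow sprinkles", 0.7),
    ("Gummy worms", "Skittles", 0.6),
    ("Snickers", "Cheesecake bits", 0.5),
])

# Greedily merge communities to maximize modularity, as Gephi's tool does.
for i, group in enumerate(greedy_modularity_communities(G, weight="weight"), 1):
    print(f"Cluster {i}: {sorted(group)}")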

 

Some additional thoughts:

  • Banana was a strange topping that was only linked with Snickers.
  •  People who like nuts like both fruit and items from the salty category.
  •  People who like blueberries only like other fruits.
  • People who like sugar items like gummy worms also like chocolate, but don’t particularly like fruit.

 

– Kate Johnson

Why Topsy/Twitter Data may never predict what matters to the rest of us

Recently Apple paid a reported $200 million for Topsy and some speculate that the reason for this purchase is to improve recommendations for products consumed using Apple devices, leveraging the data that Topsy has from Twitter.  This makes perfect sense to me, but the utility of Twitter data in predicting what people want is easy to overstate, largely because people often confuse bigger data with better data.  There are at least 2 reasons why there is a fairly hard ceiling on how much Twitter data will ever allow one to predict about what regular people want.

1. Sampling – Twitter has a ton of data, with daily usage of around 10%.  Sample size isn't the issue here, as there is plenty of data; rather, the people who use Twitter are a very specific set of people.  Even if you correct for demographics, the psychographic of people who want to share their opinions publicly and regularly (far more people have heard of Twitter than actually use it) is far too distinctive to generalize to the average person, in the same way that surveys of landline users cannot be used to predict what psychographically distinct cellphone users think.

2. Domain Comprehensiveness – The opinions that people share on Twitter are biased by the medium, such that they do not represent the spectrum of things many people care about.  There are tons of opinions on entertainment, pop culture, and links that people want to promote, since they are easy to share quickly, but very little information on people’s important life goals or the qualities we admire most in a person or anything where people’s opinions are likely to be more nuanced.  Even where we have opinions in those domains, they are likely to be skewed by the 140 character limit.

Twitter (and by extension, companies that use their data like Topsy and DataSift) has a treasure trove of information, but people working on next generation recommendations and semantic search should realize that it is a small part of the overall puzzle given the above limitations.  The volume of information gives you a very precise measure of a very specific group of people’s opinions about very specific things, leaving out the vast majority of people’s opinions about the vast majority of things.  When you add in the bias introduced by analyzing 140 character natural language, there is a great deal of variance in recommendations that likely will have to be provided by other sources.

At Ranker, we have similar sampling issues, in that we collect much of our data at Ranker.com, but we are actively broadening our reach through our widget program, which now collects data on thousands of partner sites.  Our ranked-list methodology certainly has bias too, which we attempt to mitigate by combining voting and ranking data.  The key is not the volume of data, but the diversity of data, which helps mitigate the bias inherent in any particular sampling/data collection method.

Similarly, people using Twitter data would do well to consider issues of data diversity and not be blinded by large numbers of users and data points.  Certainly Twitter is bound to be a part of understanding consumer opinions, but the size of the dataset alone will not guarantee that it will be a central part.  Given these issues, either Twitter will start to diversify the ways that it collects consumer sentiment data or the best semantic search algorithms will eventually use Twitter data as but one narrowly targeted input of many.

– Ravi Iyer
