by    in Data Science, Popular Lists, Rankings

In Good Company: Varieties of Women we would like to Drink With

They say you’re defined by the company you keep.  But how are you defined by the company you want to keep?

The list “Famous Women You’d Want to Have a Beer With”  provides an interesting way to examine this idea.  In other words, how people vote on this list can define something about what kind of person is doing the voting.

We can think of people as having many traits, or dimensions.  The traits and dimensions that are most important to the voters will be given higher rankings.  For instance, some people may rank the list thinking about the trait of how funny the person is, so may be more inclined to rate comedians higher than drama actresses.  Others may vote just on attractiveness, or based on singing talent, etc…  It may be the case that some people rank comedians and singers in a certain way, whereas others would only spend time with models and actresses.  By examining how people rank the various celebrities along these dimensions, we can learn something about the people doing the voting.

The rankings on the site, however, are based on the sum of all of the voters’ behavior on the list, so the final rankings do not tell us about how certain types of people are voting on the list.  While we could manually go through the list to sort the celebrities according to their traits, i.e. put comedians with comedians, singers with singers,  we would risk using our own biases to put voters into categories where they do not naturally belong.  It would be much better to let the voter’s own voting decide how the celebrities should be clustered.  To do this, we can use some fancy-math techniques from machine learning, called clustering algorithms, to let a computer examine the voting patterns and then tell us which patterns are similar between all the voters.   In other words, we use the algorithm to find patterns in the voting data, to then put similar patterns together into groups of voters, and then examine how the different groups of voters ranked the celebrities.  How each group ranked the celebrities tells us something about the group, and about the type of people they would like to keep them company.

As it happens, using this approach actually finds unique clusters, or groups, in the voting data, and we can then guess for ourselves how the voters from each group can be defined based on the company they wish to keep.

Here are the results:

Cluster 1:

Cluster4_MakeCelebPanels

Cluster 1 includes females known to be funny, and includes established comedians like Carol Burnett and Ellen DeGeneres. What is interesting is that Emma Stone and Jennifer Lawrence are also included, who are also highly ranked on lists based on physical attractiveness, they also have a reputation for being funny.  The clustering algorithm is showing us that they are often categorized alongside other funny females as well.  Among the clusters, this cluster has the highest proportion of female voters, which may explain why the celebrities are ranked along dimensions other than attractiveness.

 

Cluster 2:

Cluster1_MakeCelebPanels

Cluster 2 appears to consist of celebrities that are more in the nerdy camp, with Yvonne Strahovski and Morena Baccarin, both of whom play roles on shows popular with science fiction fans.  In the bottom of this list we see something of a contrarian streak as well, with downvotes handed out to some of the best known celebrities who rank highly on the list overall.

Cluster 3:

Cluster2_MakeCelebPanels

Cluster 3 is a bit more of a puzzle.  The celebrities tend to be a bit older, and come from a wide variety of backgrounds that are less known for a single role or attribute.  This cluster could be basing their votes more on the celebrity’s degree of uniqueness, which is somewhat in contrast with the bottom ranked celebrities who represent the most common and regularly listed female celebrities on Ranker.

Cluster 4:

Cluster3_MakeCelebPanels

We would also expect a list such as this to be heavily correlated with physical attractiveness, or perhaps for the celebrity’s role as a model.  Cluster 4 is perhaps the best example of this, and likely represents our youngest cluster.  The top ranked women are from the entertainment sector and are known for their looks, whereas in the bottom ranked people are from politics, comedy, or are older and probably less well known to the younger voters.  As we might expect, cluster 3 also has a high proportion of younger voters.

Here is the list of the top and bottom ten for each cluster (note that the order within these lists is not particularly important since the celebrity’s scores will be very close to one another):

TopCelebsPerClusterTable

 

In the end, the adage that we are defined by the company we keep appears to have some merit–and can be detected with machine learning approaches.  Though not a perfect split among the groups, there were trends in each group that drew the people of the cluster together.  This approach can provide a useful tool as we improve the site and improve the content for our visitors.   We are using these approaches to help improve the site and to provide better content to our visitors.

 

–Glenn R. Fox, PhD