An octopus called Paul was one of the media stars of the 2010 soccer World Cup. Paul correctly predicted 11 out of 13 matches, including the final in which Spain defeated the Netherlands. The 2014 World Cup is in Brazil and, in an attempt to avoid eating mussels painted with national flags, we made predictions by analyzing data from Ranker’s “Who Will Win The 2014 World Cup?” list.
Ranker lists provide two sources of information, and we used both to make our predictions. One source is the original ranking, together with the re-ranks provided by other users. For the World Cup list, some users were very thorough, ranking all (or nearly all) of the 32 teams that qualified for the World Cup. Other users were more selective, listing just the teams they thought would finish in the top places. An interesting question for data analysis is how much weight should be given to different rankings, depending on how complete they are.
The second source of information on Ranker is the thumbs-up and thumbs-down votes that other users make in response to the master list of rankings. Ranker lists often have many more votes than re-ranks, so the voting data are potentially very valuable. Another interesting question for data analysis, then, is how the voting information should be combined with the ranking information.
A special feature of making World Cup predictions is that there is very useful information provided by the structure of the competition itself. The 32 teams have been drawn in 8 brackets with 4 teams each. Within a bracket, every team plays every other team once in initial group play. The top two teams from each bracket then advance to a series of elimination games. This system places strong constraints on possible outcomes, which a good prediction should follow. For example, although Group B contains Spain, the Netherlands, and Chile (all strong teams, currently ranked in the top 16 in the world according to FIFA rankings), only two can progress from group play and finish in the top 16 for the World Cup.
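To make the bracket constraint concrete, here is a minimal sketch in Python. It is not the model we used. The function name `advance_from_groups` is hypothetical, and the strength numbers are made-up placeholders rather than inferred values. It only illustrates that, however strong three teams in one bracket are, exactly two of them can advance.

```python
def advance_from_groups(groups, strength, n_advance=2):
    """Return, for each bracket, the n_advance teams with the highest strength."""
    return {
        name: sorted(teams, key=lambda team: strength[team], reverse=True)[:n_advance]
        for name, teams in groups.items()
    }


# Group B of the 2014 draw. The strengths are illustrative placeholders only,
# not values produced by any analysis.
groups = {"B": ["Spain", "Netherlands", "Chile", "Australia"]}
strength = {"Spain": 0.9, "Netherlands": 0.8, "Chile": 0.7, "Australia": 0.3}

qualifiers = advance_from_groups(groups, strength)
print(qualifiers)
```

With these placeholder numbers Chile misses out despite being rated well above Australia, which is the kind of outcome the bracket structure forces on any prediction.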
We developed a model that accounts for all three of these sources of information. It uses the ranking and re-ranking data, the voting data, and the constraints coming from the brackets to make an overall prediction. The results of this analysis are shown in the figure. The left panel shows the thumbs-up (to the right, lighter) and thumbs-down (to the left, darker) votes for each team. The middle panel summarizes the ranking data, with the area of the circles corresponding to how often each team was ranked in each position. The right-hand panel shows the inferred “strength” of each team, on which we based our predicted order.
Our overall prediction has host nation Brazil winning. But the distribution of strengths shown in the model inferences panel suggests it is possible that Germany, Argentina, or Spain could win. There is little to separate the remainder of the top 16, with any country from the Netherlands to Algeria capable of doing well in the finals. The impact of the drawn brackets on our predictions is clear, with a raft of strong countries (England, the USA, Uruguay, and Chile) predicted to miss the finals because they have been drawn in difficult brackets.
– Michael Lee