As part of our effort to promote Ranker’s unique dataset, I recently attended the Data 2.0 conference in San Francisco. “Data 2.0” is a relatively vague term, and as Ranker’s resident Data Scientist, I have a particular perspective on what constitutes the future of data. My PhD is in psychology, not computer science, so for me data has always been a means rather than an end. One thing that became readily apparent in the first few talks I saw was that much of the conference’s emphasis was on handling ever-bigger datasets, with little consideration of what one could actually do with that data. It goes without saying that larger samples allow for more statistical power than smaller ones, but as someone who has collected some of the larger samples of psychological data (via YourMorals.org and BeyondThePurchase.org), I have often found that what holds back the predictive power of my data is not its volume, but rather the diversity of variables it contains. What I often need is not bigger data; it’s better data.
The same premise has informed much of our data decision-making at Ranker, where we emphasize the quality of our semantic, linked data over its sheer quantity. Both quality and quantity are important, of course, but my impression throughout the conference was that quantity was over-emphasized. I didn’t hear anyone talking about semantic data, one of the primary “Data 2.0” concepts, and one that relates more to quality than quantity.
I tested this idea out with a few people at the conference, framed as “better data beats better algorithms,” and generally got positive feedback on the phrase. I was heartened when the moderator of a panel entitled “Data Science and Predicting the Future,” which included Alex Gray, Anthony Goldbloom, and Josh Wills, specifically posed the question of which matters most: data, people, or algorithms. It wasn’t quite the question I had in mind, but it served as a great jumping-off point for the discussion. Josh Wills, who previously worked as a data scientist at Google, said the following, which I’m paraphrasing, as I didn’t take exact notes:
“Google and Facebook both have really smart people. They use essentially the same algorithms. The reason why Google can target ads better than Facebook is purely a function of better data. There is more intent in the data related to the Google user, who is actively searching for something, and so there is more predictive power. If I had a choice between asking my team to work on better algorithms or joining the data we have with other data, I’d want my team joining my data with other data, as that is what will lead to the most value.”
Again, that is paraphrased. Some of the panelists disagreed a bit. Alex Gray works on algorithms and so emphasized their importance; to be fair, I work with relatively precise data, so I have the same bias in emphasizing data quality. Daniel Tunkelang, Principal Data Scientist at LinkedIn, supported Josh, saying that better data was indeed more important than bigger data, a point his colleague Monica Rogati had made recently at a conference. I was excited to hear that others had been having similar thoughts about the need for better, not bigger, data.
I ended up asking a question myself about the Netflix Prize, where the algorithmic effort and collective intelligence brought to bear on the problem (reducing prediction error) were about as great as they could be, yet the goal was a relatively modest 10% gain, ultimately won by an algorithm so complex that Netflix itself found it too costly to use relative to the benefits. Surely better data (e.g. user opinions about different genres, or user opinions about more dimensions of each movie) would have led to much more than a 10% gain. There seemed to be general agreement, though Anthony Goldbloom rightly pointed out that you need the right people to help figure out how to get better data.
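To make that intuition concrete, here is a minimal, entirely hypothetical sketch (this is not Netflix’s data or their models, just synthetic numbers I made up): ratings are generated from two user signals, and a simple linear model that sees only one of them hits an error floor that no amount of algorithmic tuning can break through, while the same simple model given the extra signal (the “better data”) gets close to the noise level.

```python
import math
import random

random.seed(42)

# Hypothetical setup: synthetic "ratings" driven by two user signals.
# signal a stands in for what a one-dimensional dataset captures;
# signal b stands in for an extra dimension (e.g. genre opinions).
n = 2000
rows = []
for _ in range(n):
    a = random.random()
    b = random.random()
    noise = random.gauss(0, 0.1)
    rows.append((a, b, 2.0 * a + 3.0 * b + noise))

def rmse_one_feature(rows):
    """Ordinary least squares on signal a alone: slope = cov(a, y) / var(a)."""
    n = len(rows)
    ma = sum(r[0] for r in rows) / n
    my = sum(r[2] for r in rows) / n
    cov = sum((r[0] - ma) * (r[2] - my) for r in rows)
    var = sum((r[0] - ma) ** 2 for r in rows)
    slope = cov / var
    intercept = my - slope * ma
    sq = sum((r[2] - (slope * r[0] + intercept)) ** 2 for r in rows)
    return math.sqrt(sq / n)

def rmse_two_features(rows):
    """OLS on centered a and b via the 2x2 normal equations (Cramer's rule)."""
    n = len(rows)
    ma = sum(r[0] for r in rows) / n
    mb = sum(r[1] for r in rows) / n
    my = sum(r[2] for r in rows) / n
    saa = sum((r[0] - ma) ** 2 for r in rows)
    sbb = sum((r[1] - mb) ** 2 for r in rows)
    sab = sum((r[0] - ma) * (r[1] - mb) for r in rows)
    say = sum((r[0] - ma) * (r[2] - my) for r in rows)
    sby = sum((r[1] - mb) * (r[2] - my) for r in rows)
    det = saa * sbb - sab * sab
    wa = (say * sbb - sby * sab) / det
    wb = (sby * saa - say * sab) / det
    sq = sum((r[2] - my - wa * (r[0] - ma) - wb * (r[1] - mb)) ** 2 for r in rows)
    return math.sqrt(sq / n)

print(rmse_one_feature(rows))   # roughly 0.87: an error floor set by the missing signal
print(rmse_two_features(rows))  # roughly 0.10: close to the noise level
```

The one-feature model is stuck no matter how cleverly it is fit, because the information simply isn’t in its data; adding the second variable, not a fancier algorithm, is what closes the gap.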
In the end, we all have our perspectives, shaped perhaps by what we work on, but I do think the “better data” perspective is often lost in the rush toward larger datasets and more complex algorithms. For more on this perspective, here and here are two blog posts I found interesting on the subject. Daniel Tunkelang blogged about the same panel here.
– Ravi Iyer