Data is a tool, not an end, but understandably, some people are really into their tools. They like to describe how many petabytes, no, zettabytes their data takes up every second, no, every picosecond, requiring even more tools that allow them to analyze that data ever faster. It's very, very cool. But just like the engines on those Lamborghinis I see idling in Los Angeles traffic on the way to the office, I have to question how truly useful all that engineering is.
Do we really need zettabytes of data to produce the insight that I might, in my weaker moments, click on a link advertising photos of singles in my area or detailing "13+ Things You Shouldn't Eat in a Restaurant"? [These are actual headlines served by content recommendation companies that leverage enormous datasets on web behavior.] Does Facebook really need all my likes, interests, and friends to know to serve me clickbait, or is the single biggest predictor of whether I might generate a click for an advertiser simply the fact that I have enjoyed clickbait in the past? If 8% of internet users account for 85% of banner ad clicks, how effective can the plethora of data scientists who work on advertising actually be, over and above a simple cookie that identifies that 8% and removes banner ads for everyone else?
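To see how lopsided that 8%/85% split actually is, a little back-of-the-envelope arithmetic helps (the percentages are the ones quoted above; the calculation itself is just illustrative):

```python
# If 8% of users generate 85% of banner ad clicks, how much more likely
# is a member of that small group to click than everyone else?

clicker_share, click_share = 0.08, 0.85

# Clicks per capita within each group.
clicks_per_clicker = click_share / clicker_share              # 0.85 / 0.08
clicks_per_everyone_else = (1 - click_share) / (1 - clicker_share)  # 0.15 / 0.92

ratio = clicks_per_clicker / clicks_per_everyone_else
print(f"A heavy clicker is about {ratio:.0f}x as likely to click a banner ad.")
```

That works out to roughly a 65-fold difference, which is the point: one cookie flagging the heavy clickers captures most of what an army of advertising data scientists is mining for.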
Rather than simply declaring, in rather clichéd form, that "big data is dead," I have a solution: Better Data. If I want to know what to buy my wife for Christmas, I can analyze everything she has done on the internet for the past 10 years…or I could just ask her. If I want to know who is going to win the World Cup, I could analyze the statistics of every player and team in every situation and create an algorithm that scores their collective talents…or I could just ask people who they think will win. Small datasets with rich variables that intelligently incorporate lots of information (e.g., stock prices) almost always outperform complex algorithms run over massive low-level datasets.
Evidence for this is found not only in the fact that algorithms cannot reliably beat the stock market (though they can make money by beating slower, dumber algorithms), but in the fact that the world's biggest companies, like Google, Facebook, and Baidu, are making "deep learning" artificial intelligence a primary initiative. Deep learning attempts to encode the patterns hiding in lots of low-level data points (e.g., pixel colors) into higher-order variables that human beings find meaningful (e.g., a cat or a smiling friend), effectively creating smaller, better datasets. The excitement over deep learning is an acknowledgment that zettabytes of data yield far less meaningful information about a person than the average human can get from a 15-minute conversation. Deep learning may someday allow Google to read our email with the same sophistication as a human, but the average toddler still far outpaces the most sophisticated deep learning algorithms. And deep learning still needs good data to be trained on: it will never be able to take all the videos ever uploaded to YouTube and predict much variance in the direction of the stock market, because the data simply is not there. If you want to predict the stock market, you need better data on companies. If you want to predict what a person will buy, or better yet, what really motivates them, you need to ask them questions about what motivates them.
How can we create better datasets? Think less like an engineer and more like someone writing a biography. Rather than trying ever more technological solutions to squeeze knowledge from a stone, think about what is missing in our understanding of the average person. If, through some combination of deep learning and data aggregation, I am able to fully understand 1% or 25% or 100% of a person's online behavior, I will still only understand the part of their world that is revealed through their online behavior. How can we start to ask people what their most meaningful moments from college were, what annoys them most, or what makes them happiest in their quiet moments? Dating sites probably have some of the best data around because they ask meaningful questions, even though relatively few people use those sites compared to Gmail or Facebook, and the sharpness of the insights they are able to produce is no accident. The OkCupid blog (better data) will always be more interesting than the Facebook data blog (bigger data) until Facebook is able to collect data more meaningful than the generic "like".
2015 is an exciting time to be working on data. Tools are more accessible than ever, such that many engineers can find a tutorial and learn to run any algorithm in a weekend. Data is more ubiquitous and accessible than ever as well. But the world doesn’t need yet another company that takes publicly accessible data and mines it for sentiment, while throwing off stats about how big their data is. Think like a biographer, figure out what nobody else is asking and create meaningful data.