In recent years, we’ve been repeatedly told that we’re living and working in an era of Big Data (and Big Science). We’ve heard how Nate Silver and others are revolutionizing how we analyze and interpret data. In many areas of science and in many aspects of life, for that matter, we’re obtaining collections of datasets so large and complex that it becomes necessary to change our traditional analysis methods. Since the volume, velocity, and variety of data are growing rapidly, it is increasingly important to develop and apply appropriate techniques and statistical tools.
However, is it true that Big Data changes everything? Much can be gained from proper data analysis and from “data-driven science.” For example, the popular story about Billy Beane and Moneyball shows how Big Data and statistics transformed how baseball teams are assessed. But I’d like to point out some misconceptions and dangers of the concept of data-driven science.
Governments, corporations, and employers are already collecting (too?) much of our precious, precious data and expending massive effort to study it. We might worry about this because of privacy concerns, but we should also worry about what might happen to analyses that are excessively focused on the data. There are questions that we should be asking more often: Who’s collecting the data? Which data and why? Which analysis tools and why? What are their assumptions and priors? My main point will be that the results from computer codes churning through massive datasets are not objective or impartial, and the data don’t inevitably drive us to a particular conclusion. This is why the concept of “data-driven” anything is misleading.
Let’s take a look at a few examples of data-driven analysis that have been in the news lately…
Nate Silver and FiveThirtyEight
Many media organizations are about something, and they use a variety of methods to study it. In a sense, FiveThirtyEight isn’t really about something. (If I wanted to read about nothing, I’d check out ClickHole and be more entertained.) Instead, FiveThirtyEight is about their method, which they call “data journalism” and by which they mean “statistical analysis, but also data visualization, computer programming and data-literate reporting.”
I’m exaggerating, though. They cover broad topics related to politics, economics, science, life, and sports. They’ve had considerable success making probabilistic predictions about baseball, March Madness, and World Cup teams and in packaging statistics in a slick and easy-to-understand way. They also successfully predicted the 2012 US elections on a state-by-state basis, though they stuck to the usual script of treating it as a horse race: one team against another. Their statistical methods are sometimes “black boxes,” but if you look, they’ll often provide additional information about them. Their statistics are usually sound, but maybe they should be more forthcoming about the assumptions and uncertainties involved.
Their “life” section basically allows them to cover whatever they think is the popular meme of the day, which in my opinion isn’t what a non-tabloid media organization should be focused on doing. This section includes their “burrito competition,” which could be a fun idea, but their bracket apparently neglected sparsely populated states like New Mexico and Arizona, where the burrito historically originated.
The “economics” section has faced substantial criticism. For example, Ben Casselman’s article, “Typical minimum-wage earners aren’t poor, but they’re not quite middle class,” was criticized in Al Jazeera America for being based on a single set of data plotting minimum-wage workers by household income. He doesn’t consider the controversial issue of how to measure poverty or the decrease in the real value of the minimum wage, and he ends up undermining the case for raising the minimum wage. Another article about corporate cash hoarding was criticized by Paul Krugman and others for jumping to conclusions based on revised data. As Malcolm Harris (an editor at The New Inquiry) writes, “Data extrapolation is a very impressive trick when performed with skill and grace…but it doesn’t come equipped with the humility we should demand from our writers.”
Their “science” section leaves a lot to be desired. For example, they have a piece assessing health news reports in which the author (Jeff Leek) uses Bayesian priors based on an “initial gut feeling” before assigning numbers to a checklist. As pointed out in this Columbia Journalism Review article, “plenty of people have already produced such checklists—only more thoughtfully and with greater detail…Not to mention that interpreting the value of an individual scientific study is difficult—a subject worthy of much more description and analysis than FiveThirtyEight provides.” And then there was the brouhaha about Roger Pielke, whose writings about the effects of climate change I criticized before, and who’s now left the organization.
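To see why a prior based on an “initial gut feeling” matters, here is a minimal sketch of Bayes’ rule in odds form. It is not the method or the numbers from the FiveThirtyEight piece: the function name, the checklist values, and the assumption that the checklist items are independent are all hypothetical and chosen only for illustration.

```python
def posterior_probability(prior, likelihood_ratios):
    """Update a prior probability using Bayes' rule in odds form.

    Assumes (for illustration) that the checklist items are independent,
    so their likelihood ratios can simply be multiplied together.
    """
    odds = prior / (1 - prior)          # convert probability to odds
    for ratio in likelihood_ratios:
        odds *= ratio                   # each item scales the odds
    return odds / (1 + odds)            # convert odds back to probability


# Hypothetical checklist: each value says how much more likely an observation
# is if the study's claim is true than if it is false.
checklist = [2.0, 1.5, 0.5]

# Same evidence, three different "gut feeling" priors.
for gut_feeling in (0.1, 0.5, 0.9):
    result = posterior_probability(gut_feeling, checklist)
    print(f"prior {gut_feeling:.1f} -> posterior {result:.2f}")
```

With identical evidence, the skeptic and the enthusiast still end up far apart, which is why the prior deserves to be stated and defended rather than assumed.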
Maybe Nate Silver should leave these topics to the experts and stick to covering sports? He does that really well.
Thomas Piketty on Inequality
Let’s briefly consider two more examples. You’ve probably heard about Thomas Piketty’s best-selling magnum opus, Capital in the Twenty-First Century, a popular work of data-driven economics. It’s a long but well-written book in which Piketty makes convincing arguments about how income and wealth inequality are worsening in the United States, France, and other developed countries. (See these reviews in the NY Review of Books and Slate.) It’s influential because of its excellent and systematic use of statistics and data analysis, because of the neglect of wealth inequality by other mainstream economists, and of course because of the economic recession and the dominance of the top 1 percent.
Piketty has been criticized by conservatives, and he has successfully responded to these critics. His proposal for a progressive tax on wealth has also been criticized by some. Perhaps the book’s popularity and the clearly widespread and underestimated economic inequality will result in more discussion and consideration of this and other proposals.
I want to make a different point though. As impressive as Piketty’s book is, we should be careful about how we interpret it and his ideas for reducing inequality. For example, as argued by Russell Jacoby, unlike Marx in Das Kapital, Piketty takes the current system of capitalism for granted. Equality “as an idea and demand also contains an element of resignation; it accepts society, but wants to balance out the goods or privileges…Equalizing pollution pollutes equally, but does not end pollution.” While Piketty’s ideas for reducing economic extremes could be very helpful, they don’t “address a redundant labor force, alienating work, or a society driven by money and profit.” You may or may not agree with Piketty’s starting point—and you do have to start somewhere—but it’s important to keep it in mind when interpreting the results.
As before, just because something is “data-driven” doesn’t mean that the data, analysis, or conclusions can’t be questioned. We always need to be grounded in data, but we need to be careful about how we interpret analyses of them.
HealthMap on Ebola
Harvard’s HealthMap gained attention for using algorithms to detect the beginning of the Ebola outbreak in Africa before the World Health Organization did. Is that a big success for “big data”? Not so, according to Foreign Policy. “It’s an inspirational story that is a common refrain in the ‘big data’ world—sophisticated computer algorithms sift through millions of data points and divine hidden patterns indicating a previously unrecognized outbreak that was then used to alert unsuspecting health authorities and government officials…The problem is that this story isn’t quite true.” By the time HealthMap monitored its very first report, the Guinean government had actually already announced the outbreak and notified the WHO. Part of the problem is that the government’s announcement was published in French, while most monitoring systems today emphasize English-language material.
This seems to be another case of people jumping to conclusions to fit a popular narrative.
What does all this mean for Science?
Are “big data” and “data-driven” science more than just buzzwords? Maybe. But as these examples show, we have to be careful when using them and interpreting their results. When some people conduct various kinds of statistical analyses and data mining, they act as if the data speak for themselves. So their conclusions must be indisputable! But the data never speak for themselves. We scientists and analysts are not simply going around performing induction, collecting every relevant datum around us, and cranking the data through machines.
Every analysis has some assumptions. We all make assumptions about which data to collect, how to analyze them, which models to use, how to reduce our biases, and how to assess our uncertainties. All machine learning methods, including “unsupervised” learning (in which one tries to find hidden patterns in data), require assumptions. The data definitely do not “drive” one to a particular conclusion. When we interpret someone’s analysis, we may or may not agree with their assumptions, but we should know what they are. And any analyst who does not clearly disclose their assumptions and uncertainties is doing everyone a disservice. Scientists are human and make mistakes, but these are obvious mistakes to avoid. Although objective data-driven science might not be possible, as long as we’re clear about how we choose our data and models and how we analyze them, it’s still possible to make progress and reach a consensus on some issues and ask new questions on others.
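The point about “unsupervised” learning can be made concrete with a small sketch. Below is a bare-bones, one-dimensional version of k-means clustering, written from scratch for illustration; the function name, the way the starting centers are chosen, and the toy dataset are my own assumptions, not anyone’s production method. The data are twelve evenly spaced numbers with no obvious groups in them, yet the algorithm reports exactly as many “clusters” as the analyst asks for.

```python
def kmeans_1d(values, k, max_iterations=100):
    """Minimal 1-D k-means (Lloyd's algorithm) with deterministic starting centers."""
    data = sorted(values)
    n = len(data)
    # Assumption: start the k centers at evenly spaced positions in the sorted data.
    centers = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(max_iterations):
        # Assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        updated = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
        if updated == centers:
            break  # converged: assignments will not change any more
        centers = updated
    return clusters


# Toy data with no built-in group structure: twelve evenly spaced values.
evenly_spaced = list(range(12))

# The number of clusters is chosen by the analyst, not discovered in the data.
for k in (2, 3):
    groups = kmeans_1d(evenly_spaced, k)
    print(f"k={k}: {groups}")
```

Whether the “right” answer is two groups, three groups, or none at all is not something the data settle on their own; it comes from the choice of k, the distance measure, and the algorithm itself.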