The Opinion Pages | Op-Ed Contributor

Eight (No, Nine!) Problems With Big Data

By GARY MARCUS and ERNEST DAVIS
APRIL 6, 2014

BIG data is suddenly everywhere. Everyone seems to be collecting it, analyzing it, making money from it and celebrating (or fearing) its powers. Whether we’re talking about analyzing zillions of Google search queries to predict flu outbreaks, or zillions of phone records to detect signs of terrorist activity, or zillions of airline stats to find the best time to buy plane tickets, big data is on the case. By combining the power of modern computing with the plentiful data of the digital era, it promises to solve virtually any problem — crime, public health, the evolution of grammar, the perils of dating — just by crunching the numbers.

Or so its champions allege. “In the next two decades,” the journalist Patrick Tucker writes in the latest big data manifesto, “The Naked Future,” “we will be able to predict huge areas of the future with far greater accuracy than ever before in human history, including events long thought to be beyond the realm of human inference.” Statistical correlations have never sounded so good.

Is big data really all it’s cracked up to be? There is no doubt that big data is a valuable tool that has already had a critical impact in certain areas. For instance, almost every successful artificial intelligence computer program in the last 20 years, from Google’s search engine to the I.B.M. “Jeopardy!” champion Watson, has involved the substantial crunching of large bodies of data. But precisely because of its newfound popularity and growing use, we need to be levelheaded about what big data can — and can’t — do.

The first thing to note is that although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, it never tells us which correlations are meaningful. A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two. Likewise, from 1998 to 2007 the number of new cases of autism diagnosed was extremely well correlated with sales of organic food (both went up sharply), but identifying the correlation won’t by itself tell us whether diet has anything to do with autism.
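The point can be made concrete with a small sketch. The two series below are invented for illustration (they are not the murder-rate or browser data) and are generated independently of each other; all they share is a downward trend. The function names, slopes and noise levels are assumptions chosen for the demonstration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def trending_series(rng, n, slope, noise_sd):
    """A straight-line trend plus independent Gaussian noise (hypothetical data)."""
    return [slope * t + rng.gauss(0, noise_sd) for t in range(n)]


def differences(xs):
    """Period-to-period changes, which removes a linear trend."""
    return [b - a for a, b in zip(xs, xs[1:])]


rng = random.Random(0)
series_a = trending_series(rng, 200, slope=-1.0, noise_sd=5.0)
series_b = trending_series(rng, 200, slope=-0.5, noise_sd=5.0)

# Correlation of the raw levels is dominated by the shared downward trend.
r_levels = pearson(series_a, series_b)
# Correlation of the changes reflects only the unrelated noise.
r_changes = pearson(differences(series_a), differences(series_b))

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

The raw levels come out strongly correlated simply because both series fall over time, while the period-to-period changes show little relationship. Neither number says anything about causation; deciding which correlations matter still takes outside knowledge.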

Second, big data can work well as an adjunct to scientific inquiry but rarely succeeds as a wholesale replacement. Molecular biologists, for example, would very much like to be able to infer the three-dimensional structure of proteins from their underlying DNA sequence, and scientists working on the problem use big data as one tool among many. But no scientist thinks you can solve this problem by crunching data alone, no matter how powerful the statistical analysis; you will always need to start with an analysis that relies on an understanding of physics and biochemistry.

Third, many tools that are based on big data can be easily gamed. For example, big data programs for grading student essays often rely on measures like sentence length and word sophistication, which are found to correlate well with the scores given by human graders. But once students figure out how such a program works, they start writing long sentences and using obscure words, rather than learning how to actually formulate and write clear, coherent text. Even Google’s celebrated search engine, rightly seen as a big data success story, is not immune to “Google bombing” and “spamdexing,” wily techniques for artificially elevating website search placement.

Fourth, even when the results of a big data analysis aren’t intentionally gamed, they often turn out to be less robust than they initially seem. Consider Google Flu Trends, once the poster child for big data. In 2009, Google reported — to considerable fanfare — that by analyzing flu-related search queries, it had been able to detect the spread of the flu as accurately and more quickly than the Centers for Disease Control and Prevention. A few years later, though, Google Flu Trends began to falter; for the last two years it has made more bad predictions than good ones.

As a recent article in the journal Science explained, one major contributing cause of the failures of Google Flu Trends may have been that the Google search engine itself constantly changes, such that patterns in data collected at one time do not necessarily apply to data collected at another time. As the statistician Kaiser Fung has noted, collections of big data that rely on web hits often merge data that was collected in different ways and with different purposes — sometimes to ill effect. It can be risky to draw conclusions from data sets of this kind.

A fifth concern might be called the echo-chamber effect, which also stems from the fact that much of big data comes from the web. Whenever the source of information for a big data analysis is itself a product of big data, opportunities for vicious cycles abound. Consider translation programs like Google Translate, which draw on many pairs of parallel texts from different languages — for example, the same Wikipedia entry in two different languages — to discern the patterns of translation between those languages. This is a perfectly reasonable strategy, except for the fact that with some of the less common languages, many of the Wikipedia articles themselves may have been written using Google Translate. In those cases, any initial errors in Google Translate infect Wikipedia, which is fed back into Google Translate, reinforcing the error.

A sixth worry is the risk of too many correlations. If you look 100 times for correlations between two variables, you risk finding, purely by chance, about five bogus correlations that appear statistically significant — even though there is no actual meaningful connection between the variables. Absent careful supervision, the magnitudes of big data can greatly amplify such errors.
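The arithmetic behind that figure is simple, and a rough simulation bears it out. The sketch below assumes the conventional 5 percent significance threshold, assumes the 100 tests are independent, and uses the Fisher z-transformation as an approximate significance test; the sample size and the number of repeated runs are arbitrary choices made for illustration.

```python
import math
import random

ALPHA = 0.05   # conventional significance threshold (an assumption)
Z_CRIT = 1.96  # two-sided 5% critical value of the standard normal


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def looks_significant(rng, n):
    """Correlate two unrelated random variables and apply an approximate test."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [rng.gauss(0, 1) for _ in range(n)]
    r = pearson(xs, ys)
    # Fisher z-transformation: approximately standard normal when there is
    # no true correlation.
    z = math.atanh(r) * math.sqrt(n - 3)
    return abs(z) > Z_CRIT


def bogus_hits(rng, tests=100, n=40):
    """Number of 'significant' results among tests on pure noise."""
    return sum(looks_significant(rng, n) for _ in range(tests))


# Back-of-the-envelope figures, assuming independent tests.
expected_bogus = 100 * ALPHA
p_at_least_one = 1 - (1 - ALPHA) ** 100

# Repeat the 100-test search many times and average the bogus hits.
rng = random.Random(2014)
runs = [bogus_hits(rng) for _ in range(50)]
average_bogus = sum(runs) / len(runs)

print(f"expected bogus correlations per 100 tests: {expected_bogus:.1f}")
print(f"chance of at least one bogus correlation:  {p_at_least_one:.3f}")
print(f"simulated average per 100 tests:           {average_bogus:.2f}")
```

Under these assumptions the chance of turning up at least one bogus correlation in 100 looks is about 99 percent, which is why a search across thousands of variable pairs needs some correction for multiple comparisons.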

Seventh, big data is prone to giving scientific-sounding solutions to hopelessly imprecise questions. In the past few months, for instance, there have been two separate attempts to rank people in terms of their “historical importance” or “cultural contributions,” based on data drawn from Wikipedia. One is the book “Who’s Bigger? Where Historical Figures Really Rank,” by the computer scientist Steven Skiena and the engineer Charles Ward. The other is an M.I.T. Media Lab project called Pantheon.

Both efforts get many things right — Jesus, Lincoln and Shakespeare were surely important people — but both also make some egregious errors. “Who’s Bigger?” claims that Francis Scott Key was the 19th most important poet in history; Pantheon has claimed that Nostradamus was the 20th most important writer in history, well ahead of Jane Austen (78th) and George Eliot (380th). Worse, both projects suggest a misleading degree of scientific precision with evaluations that are inherently vague, or even meaningless. Big data can reduce anything to a single number, but you shouldn’t be fooled by the appearance of exactitude.

FINALLY, big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, often rely heavily on something called trigrams: sequences of three words in a row (like “in a row”). Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.

To select an example more or less at random, a book review that the actor Rob Lowe recently wrote for this newspaper contained nine trigrams such as “dumbed-down escapist fare” that had never before appeared anywhere in all the petabytes of text indexed by Google. To witness the limitations that big data can have with novelty, Google-translate “dumbed-down escapist fare” into German and then back into English: out comes the incoherent “scaled-flight fare.” That is a long way from what Mr. Lowe intended — and from big data’s aspirations for translation.
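A toy illustration of the sparsity problem follows. The three-sentence "corpus" is invented, and the whitespace tokenizer is a simplifying assumption; real systems count trigrams over vastly more text, but the effect is the same in kind.

```python
from collections import Counter


def trigrams(text):
    """Split on whitespace and return every run of three consecutive words."""
    words = text.lower().split()
    return list(zip(words, words[1:], words[2:]))


# A tiny, made-up corpus standing in for a large body of indexed text.
corpus = (
    "the cat sat in a row "
    "the dog sat in a row "
    "the bird sat in a tree"
)

counts = Counter(trigrams(corpus))

common = ("in", "a", "row")
novel = ("dumbed-down", "escapist", "fare")

# A Counter returns 0 for any trigram it has never seen.
print(common, counts[common])
print(novel, counts[novel])
```

The common trigram has been seen more than once, so statistics about it can be estimated; the novel one has a count of zero, so a purely count-based system has nothing to go on.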

Wait, we almost forgot one last problem: the hype. Champions of big data promote it as a revolutionary advance. But even the examples that people give of the successes of big data, like Google Flu Trends, though useful, are small potatoes in the larger scheme of things. They are far less important than the great innovations of the 19th and 20th centuries, like antibiotics, automobiles and the airplane.

Big data is here to stay, as it should be. But let’s be realistic: It’s an important resource for anyone analyzing data, not a silver bullet.

104 Comments

Dave Powell

Florida 19 minutes ago

The statement that big data can't tell us which correlations are meaningful is simply wrong. The examples cited are called autocorrelation and there are well known statistical tests to detect it and to correct for it.

SAF93

Boston, MA 19 minutes ago

This is a nice article, focusing on limitations of big data. I agree that those who tout this approach should look deeper for intrinsic limitations of the approach. Computers are capable of handling huge amounts of data, far more than humans, and they don't get distracted or forget things, and generally have a low error rate. Thus, they enhance our limited computational abilities. However, computer programs do not yet enable machines to learn and think like humans (do computers have "aha" moments when they converge on p < 0.001?). Biological nervous systems are composed of many dynamically linked circuits, each specialized to handle different types of data. Thus, for example, we can easily understand metaphors, which is a challenge even for IBM's Watson. Current computers know only the binary language of ones and zeros.

John Lentini

Big Pine Key, FL 19 minutes ago

Isaac Asimov's character Hari Seldon used big data (which he termed "psychohistory") to predict future trends in the Foundation series, first published in 1951. Even 12,000 years into the future, big data was of no help in predicting one-off events.

Kilgore Trout

USA 19 minutes ago

As Mark Twain famously said, "There are lies, there are damned lies, and then there's Big Data"

yoyo

pianosa 19 minutes ago

Big data mining has the capability of finding whatever you want to find. One should not confuse it with statistical analysis. Substituting the discovery of correlations for analysis and careful testing is a fool's errand.

"If you torture the data long enough, Nature will confess." - Jan Kmenta

joe

Getzville, NY 25 minutes ago

There are two other problems with big data. The first involves the nature of statistics, the statement of the problem. Statistics are accomplished by proving a null hypothesis, i.e., an assumption. An assumption is stated that is to be proved, and then the statistics are used to prove the validity of the hypothesis. The statement of the hypothesis can influence the outcome, like the old saw of the glass half full vs. half empty.

Statistics and computer simulations can never account for the so-called "Black Swan" effect, the outlier, one-in-a-million event that couldn't be predicted. (Nassim Nicholas Taleb's book in 2001)

The financial crisis of 2007/8 is a prime example. The assumption was made by almost every major computer model that the housing market could never crash. That shows the power of the wrong assumption.

RAP

Connecticut 25 minutes ago

What! You mean "Big Data" is not a wrestler performing for the WWE?
Actually, your column makes a good case for, perhaps, relegating Big Data to the ranks of that particular "dog and pony" show. In essence, collecting all sorts of data and statistics is one thing; interpreting this mass of information is quite another with just about everyone falling short in the "predicting" future trends part of the equation.
Do the recent tremors in California mean "the big one" is more likely? Nope.
Can weather forecasters' accuracy extend beyond 3 days? Nope.
Can anyone predict Mr. Putin's next move? Nope.
Heck, even the computer guys running ads trying to sell me stuff seem to always get it wrong (just because I looked at the Dyson fan doesn't mean that I want to buy or be inundated with ads for "fans, ostrich feathers and ice cube makers" on the off chance I may want to buy something).
But if "Big Data" and its collection keep a good segment of the population employed, then it's probably a good thing. The danger is assuming all the information being collected points out some kind of trend. We are still a long way from Mr. Asimov's "Foundation Trilogy" where the actions of masses of humans were entirely predictable. I just hope the people collecting all this information realize this.

W.A. Spitzer

Faywood, New Mexico 26 minutes ago

Big data may have its place, but sometimes small data is more useful. I went to WalMart yesterday to look for a shirt. All the shirts available were short sleeved. Now big data may tell WalMart that in the month of April most buyers are looking for short sleeved shirts for the summer; but with the intense sun in the high desert of New Mexico anybody who would wear a short sleeved shirt outside is either crazy or a tourist. In this case a person with only half a brain is smarter than big data.

skeptonomist

Tennessee 1 hour ago

This piece points out some rather obvious dangers of working with correlations and other types of analysis, but completely omits the main reasons for the increasing prevalence of data-based analysis: the Internet and personal computers. Sixty years ago the data were kept in printed form which had to be transcribed and analyzed by hand. Now as the data are put into electronic form they can be instantly accessed by anyone through the Internet, and quickly analyzed with small but powerful computers.

Data-based analysis - "big data" - is not a fad, it is the natural consequence of the vastly greater accessibility of data. Things will be done with data that weren't done before. What would be beneficial, rather than pulling back from looking at data, is better education about the proper use of it.

ACW

New Jersey 1 hour ago

'from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two.'

Maybe there is such a thing as 'browser rage', akin to 'road rage', in which one vents one's frustrations. Maybe fewer people were driven to distraction after switching to less buggy browsers and didn't take out their anger on the nearest target.

Stranger cause-and-effect relationships have happened.

Margaret

Atlanta 1 hour ago

And the key phrase "careful supervision" is what needs emphasis. This is what will prevent the finance company from sending an email to the homeowner, internally addressed to the recipient's bankrupt former spouse, offering credit based on the value of property to which said former spouse has no claim. Careful supervision is needed at every step in the process, something lost by marketers with dollar signs in their eyes.

thefrenchguy

HHI, SC 1 hour ago

The value or precision of data analysis depends on one big factor, the input data itself: garbage in, garbage out. This is well pointed out by the authors. The other important dimension in data analysis, which none of the super-powered engines has, is CONTEXT. Any data analyzed without the proper context should not be considered reliable. I have developed a simple formula: data + context = knowledge
The big data crunchers like Google want everyone to believe that they hold the secret of the data so they can get bigger and richer. And everyone is falling into the trap at light speed for one simple reason: fewer and fewer people are "critical thinkers."

Liz R

Catskill Mountains 1 hour ago

The impact of serial translation across languages is interesting. It reminds me of the ever-narrowing view of reality that an "infinity of mirrors" offers.

I wonder about that every time someone emails me a link from a news aggregation site with a one-line comment tailored to our particular understanding of each other or the general topic of the linked piece. One knows that Big Data Collection is chugging away in the background, compiling its info-trove on the numbers of emails about what to whom from where, and when, with what sort of comment. And that's before we get to Monetizable Big Data on the generated ad content based on however fanciful or banal are those one-line comments.

Meanwhile somewhere someone's particular toast is burning, which --as a potential but never gathered data point in an array of such singular facts-- may be the only (subsumed) reality in the Data That Matters right now. So it's not that Big Data is intrinsically misleading, or even that it may offer newer ways to "lie with statistics" as Huff's venerable book would have it. It's that the available Big Data on anything may be a massively trivial distraction from some very small subset of Small Data that really matters to one person this morning, and maybe to the planet tomorrow.

So how do we keep Big Data from hogging more bandwidth than it deserves? Teach critical thinking. Teach critical thinking. Teach critical thinking. Teach critical thinking. Teach critical thinking...

Doug F

Illinois 1 hour ago

Interesting piece and many interesting comments. And wrong-axied or not, I really liked the graph(ic). For my own contribution I'll quote my dad: "The old adage that 'statistics don't lie, only statisticians do,' is only half true."

Charlie B

North Port, Florida 19 minutes ago

Or there is the similar adage "Figures don't lie, but liars figure"

In a past career where we tried to predict the reliability of complex computer systems we had this caveat "don't mistake the model for reality"

Richard Luettgen

New Jersey 1 hour ago

As to the first cautionary, it's really a matter of increasing sophistication in the parsing of big data. The ability to determine the meaningfulness of correlations will grow with its analysis, and with enough we'll be able to score correlations by their likely contribution to events or states; and thereby focus on the more impactful correlations. The maps that trace a single correlation to a likely event or state haven't yet been built, but they'll get there.

As to the second, the parsing of big data to come to useful conclusions of any kind, scientific or otherwise, relies on an underlying understanding by the software of the principles underpinning the inquiry, including physics and biochemistry. The decision models we develop through which we strain big data simply need to rest on those basic premises.

The third cautionary simply points out the need for layered analysis, another characteristic of increasing sophistication. Much as a regulator layers law on law to better assure that scams, once detected generally, can be identified as they're happening, the tools applied to big data will include checks to assure that the base analysis isn't being gamed.

I could go on to the eighth or ninth, but you get the pattern, and there's a limit to the capacity to debunk in 1500 characters (including spaces). Big data indeed is here to stay, but silver bullets sometimes are crafted over time, and don't always spring full-grown from the forehead of Zeus.

Steve Ruis

Chicago 1 hour ago

One thinks of one of the first exponents of big data, the CIA. Its track record at making useful predictions is close to "abject failure." It has gotten very little right since its inception, so Big Data or Small Data, successfully predicting singular events is exceedingly rare, and those who do are often just speculating. Consider whether Russia will invade eastern Ukraine: some say yes, others say no. Whatever happens, some will be able to say "they predicted it."

Ynes Brueckner

Gallup New Mexico 1 hour ago

Woah Nellie!

The potential for delay in tracking a deadly disease outbreak puts shivers in the crew at CDC. If Google can do it in certain cases it could avert a pandemic. The potential value in such a thing has vast implications for public health.

Of course there has been many a boy who cried wolf, but the general principles of screening apply. The question is whether there would be so many false positives as to make it burdensome for the various public health entities to track them down. If the process can be utilized as a rudimentary screen it could be very helpful in early detection of critical outbreaks.

rjon

Mahomet, IL 1 hour ago

Isn't Big Data the father of little Data? Oh, I'm sorry. I thought we were talking about Star Wars, an equally imaginative tale of the future.

William Hatcher

Dayton, Oregon 1 hour ago

As ever, the real problem is not to have one's desired conclusions drive one's analysis.

Tom

Boston 1 hour ago

The story reminds me of a story a science professor early on in my career used to illustrate false correlations. A biologist wants to study the hearing ability of fleas. His hypothesis is that it is related to the hairs on their legs. He sets up the following experiment: He takes a flea, puts it in a small container and then proceeds to make a loud noise. The flea jumps, and the scientist uses sophisticated technology to measure the height of the jump. He repeats the experiment the appropriate number of times to achieve statistical significance. Then he removes one leg at a time, observing that the height of the jumps diminishes with the removal of more and more legs, until finally, when all legs are removed, the flea no longer jumps. He concludes that removing all legs of fleas makes them deaf.

John

Upstate New York 1 hour ago

Claims about Big Data remind me of the early days of the internet, which was touted as a wonderfully useful "Information Superhighway" but quickly devolved into the advertising/mass consumption/trivial mass culture morass that we know today. As with the internet, there will be many useful and worthy purposes to which Big Data will be put, but they will be swamped by the overwhelming efforts to sell things to people who didn't know they wanted these things.

The Perspective

Chicago 1 hour ago

No one produces and collects more data--by volume--than public education. And the vast majority of administrators who attempt to decipher and relay the data do not understand causal relationships or use specious reasoning in attempting to analyze data. Too often this half-understood data is then tossed out to Boards of Education, parents, and teachers (many of whom actually know mathematical analysis and statistics) to demonstrate perceived growth or lack thereof.

C Murali

New York 1 hour ago

People have been making spurious claims with statistics for eons. Big data just gives them the ability to do it with enormous amounts of data.

pfh777

Chicago 1 hour ago

The reason this article is poignant is not so much the issues of current statistics and the time required to teach analysis to computers all of which the article nicely covers. Rather, it is poignant because people have expectations that big data will quickly solve problems so accurately. It takes time and experience to hone the algorithms and gather enough examples to drive precision from the analysis.

Fastjazz

CT 1 hour ago

The issue with big data isn't that it is a silver bullet - does anyone really believe this is the case? We all know big data is at the starting point. The real issue is that big data does provide all sorts of new insights not in the 'big' metaphysical problems of the world but in the little day to day problems of the world - such as retailers think about with customers every day - analyzing big data in real time (real time BI) DOES offer a competitive advantage and businesses need to be on top of this or they will fall behind. Big picture hype yes, but small picture reality is that big data is now an important issue.