BIG data is suddenly everywhere. Everyone seems to be collecting it, analyzing it, making money from it and celebrating (or fearing) its powers. Whether we’re talking about analyzing zillions of Google search queries to predict flu outbreaks, or zillions of phone records to detect signs of terrorist activity, or zillions of airline stats to find the best time to buy plane tickets, big data is on the case. By combining the power of modern computing with the plentiful data of the digital era, it promises to solve virtually any problem — crime, public health, the evolution of grammar, the perils of dating — just by crunching the numbers.
Or so its champions allege. “In the next two decades,” the journalist Patrick Tucker writes in the latest big data manifesto, “The Naked Future,” “we will be able to predict huge areas of the future with far greater accuracy than ever before in human history, including events long thought to be beyond the realm of human inference.” Statistical correlations have never sounded so good.
Is big data really all it’s cracked up to be? There is no doubt that big data is a valuable tool that has already had a critical impact in certain areas. For instance, almost every successful artificial intelligence computer program in the last 20 years, from Google’s search engine to the I.B.M. “Jeopardy!” champion Watson, has involved the substantial crunching of large bodies of data. But precisely because of its newfound popularity and growing use, we need to be levelheaded about what big data can and can’t do.
The first thing to note is that although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, it never tells us which correlations are meaningful. A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two. Likewise, from 1998 to 2007 the number of new cases of autism diagnosed was extremely well correlated with sales of organic food (both went up sharply), but identifying the correlation won’t by itself tell us whether diet has anything to do with autism.
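To see how easily two unrelated trends can line up, here is a minimal sketch in Python. The yearly figures are invented for illustration; they are not the actual murder-rate or browser-share numbers.

import numpy as np

# Hypothetical series for 2006-2011: both simply trend downward.
murder_rate = np.array([5.8, 5.7, 5.4, 5.0, 4.8, 4.7])
ie_market_share = np.array([80.0, 75.0, 68.0, 60.0, 52.0, 45.0])

# Two declining series are strongly correlated even with no causal link.
r = np.corrcoef(murder_rate, ie_market_share)[0, 1]
print(f"Pearson r = {r:.2f}")  # close to +1, yet neither trend causes the other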
Second, big data can work well as an adjunct to scientific inquiry but rarely succeeds as a wholesale replacement. Molecular biologists, for example, would very much like to be able to infer the three-dimensional structure of proteins from their underlying DNA sequence, and scientists working on the problem use big data as one tool among many. But no scientist thinks you can solve this problem by crunching data alone, no matter how powerful the statistical analysis; you will always need to start with an analysis that relies on an understanding of physics and biochemistry.
Third, many tools that are based on big data can be easily gamed. For example, big data programs for grading student essays often rely on measures like sentence length and word sophistication, which are found to correlate well with the scores given by human graders. But once students figure out how such a program works, they start writing long sentences and using obscure words, rather than learning how to actually formulate and write clear, coherent text. Even Google’s celebrated search engine, rightly seen as a big data success story, is not immune to “Google bombing” and “spamdexing,” wily techniques for artificially elevating website search placement.
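As a rough illustration of how such proxies can be gamed, here is a hypothetical toy scorer in Python. It is not any real grading product, just a sketch of the surface features described above.

def toy_essay_score(text: str) -> float:
    # Reward long sentences and long "sophisticated" words, the proxies above.
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    words = text.split()
    if not sentences or not words:
        return 0.0
    avg_sentence_length = len(words) / len(sentences)
    long_word_fraction = sum(1 for w in words if len(w) >= 8) / len(words)
    return avg_sentence_length + 20 * long_word_fraction

plain = "The plan failed. We did not test it. Next time we will."
padded = ("Notwithstanding multitudinous considerations, the aforementioned "
          "implementation experienced unanticipated complications necessitating "
          "comprehensive reconsideration of methodological preconditions.")
print(toy_essay_score(plain))   # low score for short, clear prose
print(toy_essay_score(padded))  # high score for padded, obscure prose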
Fourth, even when the results of a big data analysis aren’t intentionally gamed, they often turn out to be less robust than they initially seem. Consider Google Flu Trends, once the poster child for big data. In 2009, Google reported — to considerable fanfare — that by analyzing flu-related search queries, it had been able to detect the spread of the flu as accurately as, and more quickly than, the Centers for Disease Control and Prevention. A few years later, though, Google Flu Trends began to falter; in recent seasons it has made more bad predictions than good ones.
As a recent article in the journal Science explained, one major contributing cause of the failures of Google Flu Trends may have been that the Google search engine itself constantly changes, such that patterns in data collected at one time do not necessarily apply to data collected at another time. As the statistician Kaiser Fung has noted, collections of big data that rely on web hits often merge data that was collected in different ways and with different purposes — sometimes to ill effect. It can be risky to draw conclusions from data sets of this kind.
A fifth concern might be called the echo-chamber effect, which also stems from the fact that much of big data comes from the web. Whenever the source of information for a big data analysis is itself a product of big data, opportunities for vicious cycles abound. Consider translation programs like Google Translate, which draw on many pairs of parallel texts in different languages (for example, the same Wikipedia entry in two languages) to discern the patterns of translation between those languages. This is a perfectly reasonable strategy, except that for some of the less common languages, many of the Wikipedia articles themselves may have been written using Google Translate. In those cases, any initial errors in Google Translate infect Wikipedia, which is fed back into Google Translate, reinforcing the error.
A sixth worry is the risk of too many correlations. If you test 100 different pairs of variables for correlation, you can expect to find, purely by chance, about five bogus correlations that appear statistically significant at the conventional 5 percent level, even though there is no actual meaningful connection between the variables. Absent careful supervision, the sheer scale of big data greatly multiplies the opportunities for such errors.
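A small simulation makes the arithmetic concrete: test 100 pairs of unrelated random variables at the 5 percent level and roughly five will look "significant" by chance alone. This sketch assumes NumPy and SciPy are available.

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(seed=0)
false_positives = 0
for _ in range(100):
    x = rng.normal(size=50)
    y = rng.normal(size=50)      # generated independently of x
    _, p_value = pearsonr(x, y)  # test for correlation anyway
    if p_value < 0.05:
        false_positives += 1

print(f"{false_positives} of 100 unrelated pairs appear 'significant' at p < 0.05")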
Seventh, big data is prone to giving scientific-sounding solutions to hopelessly imprecise questions. In the past few months, for instance, there have been two separate attempts to rank people in terms of their “historical importance” or “cultural contributions,” based on data drawn from Wikipedia. One is the book “Who’s Bigger? Where Historical Figures Really Rank,” by the computer scientist Steven Skiena and the engineer Charles Ward. The other is an M.I.T. Media Lab project called Pantheon.
Both efforts get many things right — Jesus, Lincoln and Shakespeare were surely important people — but both also make some egregious errors. “Who’s Bigger?” ranks Francis Scott Key among the most important poets in history, and Pantheon places Nostradamus far ahead of Jane Austen and George Eliot among writers. Worse, both projects dress inherently fuzzy judgments in a misleading appearance of scientific precision.
FINALLY, big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, often rely heavily on something called trigrams: sequences of three words in a row (like “in a row”). Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.
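For readers who want to see what trigram statistics look like in practice, here is a minimal sketch in Python. The tiny corpus is invented; real systems count trigrams over billions of words.

from collections import Counter

def trigram_counts(corpus: str) -> Counter:
    # Count every sequence of three consecutive words.
    words = corpus.lower().split()
    return Counter(zip(words, words[1:], words[2:]))

corpus = "the cat sat in a row and the dog sat in a row"
counts = trigram_counts(corpus)
print(counts[("in", "a", "row")])                 # seen twice: reliable statistics
print(counts[("purple", "quantum", "giraffes")])  # novel phrase: count is zero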
To select an example more or less at random, a book review that the actor Rob Lowe recently wrote for this newspaper contained nine trigrams, such as “dumbed-down escapist fare,” that had never before appeared anywhere in the vast body of English text indexed by Google.
Wait, we almost forgot one last problem: the hype. Champions of big data promote it as a revolutionary advance. But even the examples that people give of the successes of big data, like Google Flu Trends, though useful, are small potatoes in the larger scheme of things. They are far less important than the great innovations of the 19th and 20th centuries, like antibiotics, automobiles and the airplane.
Big data is here to stay, as it should be. But let’s be realistic: It’s an important resource for anyone analyzing data, not a silver bullet.
104 Comments
Dave Powell
Florida 19 minutes ago
The statement that big data can't tell us which correlations are meaningful is simply wrong. The examples cited are called autocorrelation and there are well known statistical tests to detect it and to correct for it.
SAF93
Boston, MA 19 minutes ago
This is a nice article, focusing on limitations of big data. I agree that those who tout this approach should look deeper for intrinsic limitations of the approach. Computers are capable of handling huge amounts of data, far more than humans, and they don't get distracted or forget things, and generally have a low error rate. Thus, they enhance our limited computational abilities. However, computer programs do not yet enable machines to learn and think like humans (do computers have "aha" moments when they converge on p < 0.001?). Biological nervous systems are composed of many dynamically linked circuits, each specialized to handle different types of data. Thus, for example, we can easily understand metaphors, which is a challenge even for IBM's Watson. Current computers know only the binary language of ones and zeros.
John Lentini
Big Pine Key, FL 19 minutes ago
Isaac Asimov's character Hari Seldon used big data (which he termed "psychohistory") to predict future trends in the Foundation series, first published in 1951. Even 12,000 years into the future, big data was of no help in predicting one-off events.
Kilgore Trout
USA 19 minutes ago
As Mark Twain famously said, "There are lies, there are damned lies, and then there's Big Data."
yoyo
pianosa 19 minutes ago
Big data mining has the capability of finding whatever you want to find. One should not confuse it with statistical analysis. Substituting the discovery of correlations for analysis and careful testing is a fool's errand.
"If you torture the data long enough, Nature will confess." - Jan Kmenta
joe
Getzville, NY 25 minutes ago
There are two other problems with big data. The first involves the nature of statistics, namely the statement of the problem. Statistics are done by testing a null hypothesis, i.e., an assumption. An assumption is stated that is to be proved, and then the statistics are used to assess the validity of the hypothesis. The statement of the hypothesis can influence the outcome, like the old saw of the glass half full vs. half empty.
Statistics and computer simulations can never account for the so-called "Black Swan" effect, the outlier one-in-a-million event that couldn't be predicted (Nassim Nicholas Taleb's book in 2001).
The financial crisis of 2007-8 is a prime example. The assumption was made by almost every major computer model that the housing market could never crash. That shows the power of the wrong assumption.
RAP
Connecticut 25 minutes ago
What! You mean "Big Data" is not a wrestler performing for the WWE?
Actually, your column makes a good case for, perhaps, relegating Big Data to the ranks of that particular "dog and pony" show. In essence, collecting all sorts of data and statistics is one thing; interpreting this mass of information is quite another with just about everyone falling short in the "predicting" future trends part of the equation.
Do the recent tremors in California mean "the big one" is more likely? Nope.
Can weather forecasters' accuracy extend beyond 3 days? Nope.
Can anyone predict Mr. Putin's next move? Nope.
Heck, even the computer guys running ads trying to sell me stuff seem to always get it wrong (just because I looked at the Dyson fan doesn't mean that I want to buy or be inundated with ads for "fans, ostrich feathers and ice cube makers" on the off chance I may want to buy something).
But if "Big Data" and its collecting keep a good segment of the population employed, then it's probably a good thing. The danger is assuming all the information being collected points out some kind of trend. We are still a long way from Mr. Asimov's "Foundation Trilogy," where the actions of masses of humans were entirely predictable. I just hope the people collecting all this information realize this.
W.A. Spitzer
Faywood, New Mexico 26 minutes ago
Big data may have its place, but sometimes small data is more useful. I went to WalMart yesterday to look for a shirt. All the shirts available were short sleeved. Now big data may tell WalMart that in the month of April most buyers are looking for short sleeved shirts for the summer; but with the intense sun in the high desert of New Mexico, anybody who would wear a short sleeved shirt outside is either crazy or a tourist. In this case a person with only half a brain is smarter than big data.
skeptonomist
Tennessee 1 hour ago
This piece points out some rather obvious dangers of working with correlations and other types of analysis, but completely omits the main reasons for the increasing prevalence of data-based analysis: the Internet and personal computers. Sixty years ago the data were kept in printed form which had to be transcribed and analyzed by hand. Now as the data are put into electronic form they can be instantly accessed by anyone through the Internet, and quickly analyzed with small but powerful computers.
Data-based analysis - "big data" - is not a fad, it is the natural consequence of the vastly greater accessibility of data. Things will be done with data that weren't done before. What would be beneficial, rather than pulling back from looking at data, is better education about the proper use of it.
ACW
New Jersey 1 hour ago
"From 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it's hard to imagine there is any causal relationship between the two."
Maybe there is such a thing as 'browser rage', akin to 'road rage', in which one vents one's frustrations. Maybe fewer people were driven to distraction after switching to less buggy browsers and didn't take out their anger on the nearest target.
Stranger cause-and-effect relationships have happened.
Margaret
Atlanta 1 hour ago
And the key phrase "careful supervision" is what needs emphasis. This is what will prevent the finance company from sending an email to the homeowner, internally addressed to the recipient's bankrupt former spouse, offering credit based on the value of property to which said former spouse has no claim. Careful supervision is needed at every step in the process, something lost by marketers with dollar signs in their eyes.
thefrenchguy
HHI, SC 1 hour ago
The value or precision of data analysis depends on one big factor: the input data itself. Garbage in, garbage out; this is well pointed out by the author. The other important dimension in data analysis that none of the superpowered engines have is CONTEXT. Any data analyzed without the proper context should not be considered reliable. I have developed a simple formula: data + context = knowledge.
The big data crunchers like Google want everyone to believe that they hold the secret of the data so they can get bigger and richer. And everyone is falling into the trap at light speed for one simple reason: people are becoming less and less critical thinkers than ever.
Liz R
Catskill Mountains 1 hour ago
The impact of serial translation across languages is interesting. It reminds me of the ever-narrowing view of reality that an "infinity of mirrors" offers.
I wonder about that every time someone emails me a link from a news aggregation site with a one-line comment tailored to our particular understanding of each other or the general topic of the linked piece. One knows that Big Data Collection is chugging away in the background, compiling its info-trove on the numbers of emails about what to whom from where, and when, with what sort of comment. And that's before we get to Monetizable Big Data on the generated ad content based on however fanciful or banal are those one-line comments.
Meanwhile somewhere someone's particular toast is burning, which --as a potential but never gathered data point in an array of such singular facts-- may be the only (subsumed) reality in the Data That Matters right now. So it's not that Big Data is intrinsically misleading, or even that it may offer newer ways to "lie with statistics" as Huff's venerable book would have it. It's that the available Big Data on anything may be a massively trivial distraction from some very small subset of Small Data that really matters to one person this morning, and maybe to the planet tomorrow.
So how do we keep Big Data from hogging more bandwidth than it deserves? Teach critical thinking. Teach critical thinking. Teach critical thinking. Teach critical thinking. Teach critical thinking...
Doug F
Illinois 1 hour ago
Interesting piece and many interesting comments. And wrong-axied or not, I really liked the graph(ic). For my own contribution I'll quote my dad: "The old adage that 'statistics don't lie, only statisticians do' is only half true."
Charlie B
North Port, Florida 19 minutes ago
Or there is the similar adage "Figures don't lie, but liars figure."
In a past career where we tried to predict the reliability of complex computer systems, we had this caveat: "Don't mistake the model for reality."
Richard Luettgen
New Jersey 1 hour ago
As to the first cautionary, it's really a matter of increasing sophistication in the parsing of big data. The ability to determine the meaningfulness of correlations will grow with its analysis, and with enough we'll be able to score correlations by their likely contribution to events or states; and thereby focus on the more impactful correlations. The maps that trace a single correlation to a likely event or state haven't yet been built, but they'll get there.
As to the second, the parsing of big data to come to useful conclusions of any kind, scientific or otherwise, relies on an underlying understanding by the software of the principles underpinning the inquiry, including physics and biochemistry. The decision models we develop through which we strain big data simply need to rest on those basic premises.
The third cautionary simply points out the need for layered analysis, another characteristic of increasing sophistication. Much as a regulator layers law on law to better assure that scams, once detected generally, can be identified as they're happening, the tools applied to big data will include checks to assure that the base analysis isn't being gamed.
I could go on to the eighth or ninth, but you get the pattern, and there's a limit to the capacity to debunk in 1500 characters (including spaces). Big data indeed is here to stay, but silver bullets sometimes are crafted over time, and don't always spring full-grown from the forehead of Zeus.
Steve Ruis
Chicago 1 hour ago
One thinks of one of the first exponents of big data, the CIA. Its track record at making useful predictions is close to abject failure; it has gotten very little right since its inception. So Big Data or Small Data, predicting singular events is exceedingly rare, and those who do are often just speculating. Consider whether Russia will invade eastern Ukraine: some say yes, others say no. Whatever happens, someone will be able to say "they predicted it."
Ynes Brueckner
Gallup, New Mexico 1 hour ago
Whoa, Nellie!
The potential of delay in tracking a deadly disease outbreak puts shivers in the crew at the CDC. If Google can do it in certain cases, it could help head off a pandemic. The potential value of such a thing has vast implications for public health.
Of course there has been many a boy who cried wolf, but the general principles of screening apply. The question is whether there would be too many false positives to make it burdensome for the various public health entities to track them down. If the process can be used as a rudimentary screen, it could be very helpful in early detection of critical outbreaks.
rjon
Mahomet, IL 1 hour ago
Isn't Big Data the father of little Data? Oh, I'm sorry. I thought we were talking about Star Wars, an equally imaginative tale of the future.
William Hatcher
Dayton, Oregon 1 hour ago
As ever, the real problem is not to have one's desired conclusions drive one's analysis.
Tom
Boston 1 hour ago
The article reminds me of a story a science professor told early in my career to illustrate false correlations. A biologist wants to study the hearing ability of fleas. His hypothesis is that it is related to the hairs on their legs. He sets up the following experiment: He takes a flea, puts it in a small container and then proceeds to make a loud noise. The flea jumps, and the scientist uses sophisticated technology to measure the height of the jump. He repeats the experiment the appropriate number of times to achieve statistical significance. Then he removes one leg at a time, observing that the height of the jumps diminishes with the removal of more and more legs, until finally, when all legs are removed, the flea no longer jumps. He concludes that removing all the legs of a flea makes it deaf.
John
Upstate New York 1 hour ago
Claims about Big Data remind me of the early days of the internet, which was touted as a wonderfully useful "Information Superhighway" but quickly devolved into the advertising/mass consumption/trivial mass culture morass that we know today. As with the internet, there will be many useful and worthy purposes to which Big Data will be put, but they will be swamped by the overwhelming efforts to sell things to people who didn't know they wanted these things.
The Perspective
Chicago 1 hour ago
No one produces and collects more data, by volume, than public education. And the vast majority of administrators who attempt to decipher and relay the data either do not understand causal relationships or use specious reasoning when analyzing it. Too often this half-understood data is then tossed out to Boards of Education, parents, and teachers (many of whom actually know mathematical analysis and statistics) to demonstrate perceived growth or lack thereof.
C Murali
New York 1 hour ago
People have been making spurious claims with statistics for eons. Big data just gives them the ability to do it with enormous amounts of data.
pfh777
Chicago 1 hour ago
The reason this article is poignant is not so much the issues of current statistics or the time required to teach analysis to computers, all of which the article nicely covers. Rather, it is poignant because people expect big data to solve problems quickly and accurately. It takes time and experience to hone the algorithms and gather enough examples to drive precision from the analysis.
Fastjazz
CT 1 hour ago
The issue with big data isn't that it is a silver bullet; does anyone really believe this is the case? We all know big data is at the starting point. The real issue is that big data does provide all sorts of new insights, not in the "big" metaphysical problems of the world but in the little day-to-day problems of the world, such as the ones retailers think about with customers every day. Analyzing big data in real time (real-time BI) DOES offer a competitive advantage, and businesses need to be on top of this or they will fall behind. Big picture hype, yes, but the small picture reality is that big data is now an important issue.