Monday, July 11, 2016

Correlation And Causation In The Age Of Data

I recently came across some unusual, rather interesting information which I’d like to share with you.
Did you know that the marriage rate in Kentucky is closely correlated with the number of people who fall out of fishing boats and drown each year? Or that the number of letters in the winning word of the Scripps National Spelling Bee closely tracks how many people were killed by venomous spiders each year? Now, what if I told you that as the number of Facebook users grew, Greek sovereign debt spiked? How about if we noted that as box office receipts for M. Night Shyamalan movies dropped, so did newspaper sales? Or, on a somewhat morbid note, suppose you learned that swimming pool drowning deaths in a given year were directly correlated with the number of movies in which Nicolas Cage appeared during that time. Pretty interesting stuff, no?
Now, do we agree that Facebook’s expansion caused the Greek debt crisis? Or that the increased number of drowning deaths in a particular year was thanks to Nicolas Cage’s movie appearances? You’ll stop me right there, I’m sure. Does this really make sense? What relationship is there between the two variables in any of the scenarios mentioned above? How can we say that the occurrence of one event actually caused the other to come about?
While some of these statements might sound exceptionally silly, they underscore an important point: Correlation does not necessarily mean causation. Just because two variables somehow move in tandem doesn’t mean that the occurrence of one in fact brought about a change in the other.
That is, Shyamalan’s less acclaimed movies aren’t killing off the newspaper business. Spelling bees aren’t causing a rash of spider attacks. This might sound like a statement of the obvious, but, in today’s data-rich environment, this basic principle remains as important as ever.
With each passing year, we have access to more data than ever before. According to 2013 findings from Norwegian research firm SINTEF, 90% of all the data available in the world at the time was created from 2011 to 2013. Other studies have found that the amount of available digital data is doubling every two years (faster than even Moore’s Law), while some observers believe we’ll see a 4,300% increase in annual data generation by 2020.
This exponential growth in data is driven in part by the proliferation of cellphones, tablets, and other electronic devices, but also by connected devices (components of the “Internet of Things”: basically, a range of Internet-capable devices that transmit data, including wearable devices, sensors, medical equipment, machine components, and more), as well as by increasingly powerful computing tools for analyzing large volumes of complex data (i.e., Big Data). In short, we will enjoy access to more information, about more aspects of life, than ever before.
In making sense of this new knowledge, we are likely to discover countless new correlations between seemingly disparate pieces of data. Yet how we interpret this information remains as critical as ever. In a 2015 piece in Information Age, Ben Rossi cites the example of Google Flu Trends and Google Dengue Trends. Rossi notes that while these tools are intended to detect the spread of the flu and other illnesses by monitoring increases in Google searches around illness-related phrases, people sometimes mindlessly Google a variety of such terms.
This can inflate the search frequency, and supposed incidence, of these diseases. It is yet another example of confusing correlation and causation, with potentially serious consequences. After all, what if public health agencies began incorrectly directing resources towards one illness, and away from another, more serious disease, based on a misunderstanding caused by Google search trends?
Writing in technology journal The New Atlantis, statistician Nick Barrowman expands upon some of the challenges posed by the rise of massive sets of data, combined with increasingly powerful computing systems. Thanks to these two developments, correlations may well be “mass produced” such that “many of them will be meaningless.” Barrowman observes that some experts have heralded the rise of Big Data as eliminating the need for any real understanding of causation, or for giving any real thought as to why things happen.
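Barrowman’s “mass produced” correlations are easy to demonstrate for yourself. The sketch below is my own illustration, not from Barrowman’s article: it generates a few dozen completely unrelated random walks (all numbers invented) and counts how many pairs nonetheless correlate strongly, purely by chance.

```python
import random

# Illustrative sketch: with enough unrelated data series, strong
# correlations appear by chance alone. Random walks, like many real-world
# time series that trend over time, correlate spuriously especially often.

def pearson(xs, ys):
    # Standard Pearson correlation coefficient, computed from scratch.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def random_walk(steps, rng):
    # Each series wanders randomly; no series influences any other.
    level, series = 0.0, []
    for _ in range(steps):
        level += rng.gauss(0, 1)
        series.append(level)
    return series

rng = random.Random(42)
series = [random_walk(10, rng) for _ in range(50)]  # 50 unrelated series

strong, total = 0, 0
for i in range(len(series)):
    for j in range(i + 1, len(series)):
        total += 1
        if abs(pearson(series[i], series[j])) > 0.8:
            strong += 1

print(f"{strong} of {total} unrelated pairs correlate above |r| = 0.8")
```

Fifty short series yield 1,225 pairs to compare, and a sizable number of them clear even a stringent correlation threshold despite having no connection whatsoever, which is exactly how spelling bees end up “tracking” spider deaths.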
Specifically, Barrowman critiques the work of author and former Wired magazine editor Chris Anderson, who in a widely read 2008 piece argued that: “This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear….who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity….correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.”
Barrowman acknowledges that correlation data holds value, but argues that without truly considering counterfactual situations, we can’t really confirm whether one event actually caused another. That is, if we believe that A caused B, we can’t truly confirm such causation, unless we ask ourselves what would happen if A had not occurred.
For example, suppose one argues that thanks to a new diet, your neighbor lost 30 pounds. What if this neighbor had never changed his or her diet? What if he or she actually took up jogging, and that was a catalyst behind the weight loss? Without digging further and considering alternatives, we might miscategorize or oversimplify the causes of a particular phenomenon. Barrowman ultimately pushes for a rigorous experimental approach, along with tools like randomization, in order to truly understand whether or not one event actually caused another.
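The dieting example can be made concrete with a toy simulation (entirely hypothetical numbers, mine rather than Barrowman’s): in the model below, only jogging causes weight loss, but joggers are also far more likely to adopt the new diet. The naive observational comparison credits the diet; randomizing who diets, as Barrowman’s experimental approach prescribes, does not.

```python
import random

# Toy model of the weight-loss example. Hypothetical numbers throughout:
# jogging sheds ~10 lbs, the diet does nothing, and joggers self-select
# into dieting, confounding the observational comparison.

rng = random.Random(0)

def weight_loss(jogs):
    # True causal model: only jogging matters; the diet has zero effect.
    return (10.0 if jogs else 0.0) + rng.gauss(0, 1)

def diet_effect(people):
    # Naive estimate: mean loss of dieters minus mean loss of non-dieters.
    dieters = [loss for diets, loss in people if diets]
    others = [loss for diets, loss in people if not diets]
    return sum(dieters) / len(dieters) - sum(others) / len(others)

def simulate(n, randomize):
    people = []
    for _ in range(n):
        jogs = rng.random() < 0.5
        if randomize:
            diets = rng.random() < 0.5  # coin flip, as in a trial
        else:
            diets = rng.random() < (0.9 if jogs else 0.1)  # self-selection
        people.append((diets, weight_loss(jogs)))
    return people

obs = diet_effect(simulate(10000, randomize=False))
rct = diet_effect(simulate(10000, randomize=True))
print(f"observational diet 'effect': {obs:+.1f} lbs")
print(f"randomized diet effect:      {rct:+.1f} lbs")
```

The observational estimate reflects who chooses to diet, not what the diet does; randomization severs that link, answering the counterfactual question by construction, and the diet’s apparent effect vanishes.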
This approach might sound a bit abstract and theoretical, until we consider how oversimplifying correlation and causation can lead to misdirected public policy. University of Chicago economist Steven Levitt and his coauthor Stephen Dubner offered an instructive example of this in Freakonomics. In 2004, then-governor of Illinois Rod Blagojevich put forward a plan to mail one book per month to the home of each newborn child, until that child turned five years old. Blagojevich crafted this initiative, which would cost around $26 million per year, in response to a study which found that children from homes where books were present earned higher reading test scores.
Of course, as Levitt and Dubner wonder, were these children doing better in school, simply because books were present in the home, or because they were raised in families which valued education, and where intellectual pursuits, including reading, were emphasized by parents? Based on the empirical data before us, the latter is more likely true.
Yet, Blagojevich was prepared to direct tens of millions of dollars per year (this legislation ultimately wasn’t adopted), in pursuit of what was a classic correlation-causation misunderstanding. It isn’t tough to imagine similar misunderstandings, whether in business, public policy, or other crucial arenas, resulting in the adoption of otherwise poorly directed schemes and solutions.
To a certain extent, we are victims of our own brains, that is, our cognitive biases. In his book Thinking, Fast and Slow, psychologist and behavioral economist Daniel Kahneman reviews fascinating research on the perception of causality, pioneered by the late Belgian psychologist Albert Michotte, demonstrating that even infants as young as six months of age quickly form ideas of causation from seemingly associated visual stimuli, and are caught by surprise when this pattern is disrupted. As Kahneman puts it: “We are evidently ready from birth to have impressions of causality, which do not depend on reasoning about patterns of causation.” In a bid to make sense of our world, we are predisposed, from a very early time in life, to draw conclusions of causation between seemingly associated events.
Kahneman also refers to an instructive anecdote in Nassim Taleb’s bestseller The Black Swan, which further illustrates the human need to find order through developing narratives of causation. On the day when Saddam Hussein was captured by American forces in Iraq, Bloomberg News displayed a headline stating that US Treasury prices (which fluctuate throughout the day) had risen (which is generally associated with a lower tolerance for risk) because Hussein’s capture might not prevent terrorism. Later in the day, when Treasury prices fell (indicating greater investor receptiveness to risk), Bloomberg posted a new headline indicating that this was because Hussein’s detention increased the attractiveness of riskier assets.
Rather perplexingly, the same cause (the capture of Saddam Hussein) was used to explain two seemingly opposing events (that is, a rise, followed by a fall, in Treasury prices), which occurred in a short time span. What could explain this seemingly irrational, contradictory approach? As Taleb explains, human beings are predisposed to making sense of information in terms of narratives, and to finding causality in any confusing situation, even when doing so might appear, at second glance, less than prudent.
So, what’s the solution to all of this? In today’s world, vast amounts of data are compiled on virtually every aspect of our existence, and parsed at an unprecedented pace and scale. This offers myriad information and vast potential, in fields ranging from public health and the treatment of cancer, to finance, law, urban planning, retail sales, and a million other areas. With each passing day, we are gaining greater insight into actual human existence and behavior.
Not surprisingly, correlations can be spotted just about everywhere. Given this reality, how do we avoid inaccurate findings of causation, which can result in flawed decisions? Simply put, we need to be skeptical. When we find a correlation between two events, or sets of data, we should actively search for alternate explanations, rather than simply accepting causation as a fact. We must also engage in the sort of counterfactual thinking that Barrowman advocated, with focus on experimentation and empirical data. In short, we ought to keep our eyes open, and be wary of drawing sweeping conclusions, purely on the basis of correlations and association.
There must be a conscious effort to avoid assuming that we know more than we do, and to steer clear of falling in love with seductive computer-generated correlations. Big Data is powerful, but it must be applied prudently. With this approach, we can make sound use of the opportunities which new streams of data present us with, while avoiding our inherent mental blind spots.