The Book of Why is a book on probability and causality by the American computer scientist Judea Pearl, with science writer Dana Mackenzie. In The Book of Why, Pearl elucidates the recent shift in statistics and science towards the use of causal thinking.
Most anyone who takes a class in statistics is taught very early on the mantra “correlation does not equal causation.” This simple formula, drilled into our heads, is supposed to keep us from inferring a causal link from a spurious association. Everyone has seen examples of parallel trends that match each other almost exactly over time, yet are entirely unrelated. That's obvious. But in this book, Pearl argues that there are times when we can clearly say that correlation does imply causation. He shows that, through the use of mathematics and an appropriate causal model, it is possible to determine with some precision how much of an effect is caused by other phenomena.
For instance, what percentage of increased risk of cancer is caused by cigarette smoke versus genetic factors? This is a very difficult question to answer in the absence of a causal model, but once an appropriate causal model is proposed, it becomes possible to determine. Pearl's preferred tool for creating causal models is a simple graphical one: a map of nodes connected by arrows, where an arrow from one node to another denotes a causal relationship. When two variables share an unmeasured common cause, a dashed, double-headed arc between them marks the hidden confounding. Randomizing a variable, as in a controlled experiment, lets the researcher “erase” the arrows pointing into that variable, effectively negating the influence its usual causes have on it.
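To make the arrow-erasing idea concrete, here is a minimal sketch in Python. The node names and edges are my own invented toy example, not a model from the book; the point is only that a diagram is just a mapping from each node to its direct causes, and that an intervention deletes the arrows pointing into the intervened node.

```python
# A toy causal diagram as a mapping from each node to its direct causes.
# Nodes and edges are illustrative inventions, not Pearl's actual model.
graph = {
    "genotype": [],
    "smoking": ["genotype"],
    "cancer": ["genotype", "smoking"],
}

def do(diagram, node):
    """Simulate an intervention on `node`: erase the arrows pointing
    into it, so its usual causes no longer influence it."""
    mutated = {name: list(parents) for name, parents in diagram.items()}
    mutated[node] = []
    return mutated

experiment = do(graph, "smoking")
# After do(smoking), genotype no longer influences smoking...
assert experiment["smoking"] == []
# ...but the rest of the diagram is untouched.
assert experiment["cancer"] == ["genotype", "smoking"]
```

Notice that the erasure is local: the arrow out of "smoking" into "cancer" survives, which is exactly what lets a randomized experiment isolate that one causal pathway.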
For a layperson this might not seem like a big deal. So what? You've got some lines and arrows between different things. Why is that any different from what we've done before? I agree, it does seem like a rather uninteresting distinction at first glance. But it's actually a massive shift. Previously, a scientist would propose a hypothesis, for instance that smoking caused cancer. Then, the scientist would compare cancer rates between smokers and non-smokers to see if there was a relationship. Usually, a few confounding variables, additional things that could influence the results, would be proposed. For instance, perhaps if your parents were smokers, you would be more likely to smoke yourself. Confounding variables would be controlled for. In the end, the scientist would publish their results. But it wasn't entirely clear why some things were considered confounders, or how much they influenced other things, or which other things they affected. These sorts of complex systems could get overwhelming very quickly. It also wasn't always easy to input them into a computer. Often it would involve a spreadsheet with dozens of columns, some of which influenced each other through formulas, and others that didn't. In short, for anything more than a toy example, it was a mess, and it didn't inspire confidence in the results. Scientists were therefore very cautious about saying “This shows a causal relationship between Factor X and Effect Y.”
Pearl's invention (building on the work of earlier scientists such as the geneticist Sewall Wright) is his simple graphical modeling system. If a scientist can describe the relationships between all the variables graphically, then a computer program can automate the calculation of the influence between each of the factors. If a single variable is influenced by eight other factors, the program can estimate the relative influence of each, given enough data. But the key here is the graphical model. Pearl notes that, from what he's seen, only a handful of scientists in history have managed to calculate the effects of many interacting variables on each other using purely numerical methods. It's very hard to estimate the effects of more than a few variables with a spreadsheet, but inputting a graphical model is a cinch. It's so simple a high school student can do it, and because it can be easily grasped by anyone, it allows for far more confidence in the result.
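As a toy illustration of that estimation step, the sketch below generates data from an invented one-arrow model X -> Y (the coefficient is my own choice, not something from the book) and recovers the arrow's strength from the data alone. This only works because the assumed diagram says there are no confounders to worry about.

```python
import random

# Sketch: once the diagram says X -> Y with no confounders, the strength
# of the arrow can be estimated from data by simple least squares.
# The generating model below is invented purely for illustration.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(10_000)]
ys = [2.0 * x + random.gauss(0, 1) for x in xs]  # true effect of X on Y is 2.0

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)
var_x = sum((x - mean_x) ** 2 for x in xs) / len(xs)

estimate = cov_xy / var_x
print(round(estimate, 2))  # close to the true value of 2.0
```

With many variables the algebra gets harder, but the principle is the same: the diagram tells the program which regressions are the right ones to run.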
Pearl also describes something he calls the “Ladder of Causation.” This is a hierarchy of increasing causal sophistication. The ladder consists of three rungs: the first is association (seeing patterns in data), the second is intervention (doing), and the third is counterfactuals (imagining). Pearl posits that for the majority of statistical investigation we have been stuck at the first rung. We have gathered and processed data, and looked at relationships between variables to find links. This is the level that computers and statistical software currently operate at. They have largely failed to ascend to the second rung, the interventional level, because that level requires a working causal model of the system. A question posed at this level might look like this: “Given that I know that the price of bananas is influenced by the price of fertilizer, what would the price be if I doubled the price of fertilizer?” Because this relies on the causal model “price of fertilizer -> price of bananas,” it has largely been outside the capabilities of computer programs and statistical packages. But now that Dr. Pearl has worked out the mathematics for deriving causal effects from such models, this rung of the ladder should be available to a computer! The third rung is unfortunately still outside its grasp. This is the level of counterfactuals, questions like “What if I had not done X, but instead had done X’?” At present, computers are still largely incapable of handling counterfactuals, but Dr. Pearl believes that this is one of the keys to seriously advancing machine intelligence. A computer that can ask and answer counterfactual questions will be a truly remarkable machine.
One complaint about the book itself: it felt very unpolished. Maybe rushed to the press, perhaps just cheaply made? The diagrams were the most glaringly obvious problem. For one, many of them were clearly handmade in something like MS PowerPoint. The arrows and lines don't always make neat angles, and sometimes multiple arrowheads going into the same point are visibly crooked. This could be remedied by drawing the diagrams with a proper tool such as TikZ, the diagramming package for LaTeX, something I'm sure the scientific publishing industry is more than capable of. Additionally, at multiple points throughout the book, the figures and the text explaining them land on different pages. For instance, a sentence will describe the figure, or tell the reader to examine an important feature of it, but the figure itself is on the next page. Sometimes this is unavoidable, but it happens quite frequently in this book. This seems pretty sloppy, and it's the kind of thing that good book design and planning can alleviate.
Overall, I liked the book a lot. I was unaware of the shift in thinking in statistics over the past 40 years, and didn't realize that the causal revolution had quietly taken place. It's exhilarating to read that we may finally be out of the woods on the whole question of whether two variables tracking each other can really be said to be “causally related.”