Ben Goldacre’s got a fab example of misleading statistics, and the ways in which you can learn to think about things to avoid jumping to a wrong conclusion.

Look at the first nerdy table of data in that article. All they’ve done is take a bunch of people who drink alcohol, and a bunch who don’t, and counted how many from each group ended up with lung cancer. It turns out that the drinkers are more likely to get lung cancer than the non-drinkers.

The obvious conclusion – and (spoiler alert) the wrong one – is that drinking alcohol somehow puts you at greater risk of developing lung cancer. You might conclude, from that table, that if you currently drink alcohol, you can reduce your risk of developing cancer by no longer drinking alcohol, thus moving yourself to the safer “non-drinkers” group.

This is actually a fine example of the Bad Science mantra, and Ben makes an important point which many non-nerds might not naturally appreciate about statistics: the need to control for other variables.

If drinking doesn’t give you cancer, then why do drinkers get more cancer? The other two tables offer a beautiful explanation. Of all the drinkers and non-drinkers originally counted, try asking them another question: whether or not they smoke cigarettes. What you get when you do that is the next two tables.

If you *just* look at the smokers, then the chances of a drinker and a non-drinker getting lung cancer are almost exactly the same. If you look only at the non-smokers, ditto. In other words, once you know whether someone smokes cigarettes, *whether or not they drink makes no difference to their odds of getting lung cancer*.

Which is a long way away from the obvious conclusion we were tempted to draw from the first set of data.

What we did here was to *control for* another variable – namely smoking – before drawing sweeping conclusions from the data. When we give smokers and non-smokers their own separate tables, smoking can no longer skew the comparison between drinkers and non-drinkers. It becomes clear that drinkers aren’t simply more likely to get cancer; they’re *more likely to be smokers*.
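Here's a minimal Python sketch of the same trick, for anyone who'd rather poke at it than take it on trust. The counts are invented for illustration and are *not* the numbers from Ben's tables; they've just been picked so that 16% of drinkers overall, and 3% of non-smokers, end up with lung cancer. The `rate` helper is a made-up name too.

```python
# Hypothetical counts, invented for illustration (NOT the figures from the article).
# Key: (drinker, smoker) -> (lung cancer cases, total people in that group)
counts = {
    (True, True): (390, 1300),    # drinkers who smoke
    (True, False): (42, 1400),    # drinkers who don't smoke
    (False, True): (60, 200),     # non-drinkers who smoke
    (False, False): (60, 2000),   # non-drinkers who don't smoke
}


def rate(drinker=None, smoker=None):
    """Proportion with lung cancer among everyone matching the filters.

    Passing None for a filter means "don't split on this variable".
    """
    cases = total = 0
    for (d, s), (c, n) in counts.items():
        if drinker is not None and d != drinker:
            continue
        if smoker is not None and s != smoker:
            continue
        cases += c
        total += n
    return cases / total


# The first table: ignore smoking entirely.
print(f"drinkers overall:       {rate(drinker=True):.1%}")
print(f"non-drinkers overall:   {rate(drinker=False):.1%}")

# The next two tables: control for smoking by splitting on it.
for smoker in (True, False):
    label = "smokers" if smoker else "non-smokers"
    print(f"{label}, drinkers:     {rate(drinker=True, smoker=smoker):.1%}")
    print(f"{label}, non-drinkers: {rate(drinker=False, smoker=smoker):.1%}")
```

With these made-up counts the overall rates differ a lot between drinkers and non-drinkers, but within each smoking group the drinkers and non-drinkers come out identical. The whole gap comes from how many people in each drinking group smoke.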

And although Ben’s right to point out the importance of controlling for other variables like this, what interests me is the reminder of the importance of Bayesian probability.

In particular, the thing to remember is that the probability of an event is a measure of *your uncertainty*, and not something inherent in the event itself.

For instance, if that first table is all the data you have, then all you know is that drinkers are more at risk of cancer than non-drinkers. If you were to estimate somebody’s odds of getting lung cancer, and the *only thing you knew about them* is that they’re a drinker, the best you could do is to place it at 16% – the proportion of drinkers who developed lung cancer in the study.

If you later acquire the extra data in the second set of tables, *and* find out that the individual you’re interested in is *not* a smoker, then suddenly you can re-adjust your estimate, and give them about a 3% chance of getting lung cancer. They haven’t done anything differently; nothing about their situation has changed for them to suddenly appear much more healthy. You’ve just learned more about them.
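One way to see that the 16% and the 3% don't contradict each other: the 16% is just a weighted average over the thing you didn't know yet. Here's a rough sketch of that arithmetic. The 16% and 3% are the figures quoted above; the 30% rate for smokers and the 13-in-27 share of drinkers who smoke are assumptions, chosen only so the sums work out, and `estimate` is a hypothetical helper.

```python
def estimate(rate_if_smoker, rate_if_nonsmoker, p_smoker):
    """Law of total probability: average the two group rates,
    weighted by how likely you think it is that the person smokes."""
    return p_smoker * rate_if_smoker + (1 - p_smoker) * rate_if_nonsmoker


RATE_IF_SMOKER = 0.30             # assumption, not from the article
RATE_IF_NONSMOKER = 0.03          # the "about 3%" quoted above
P_SMOKER_GIVEN_DRINKER = 13 / 27  # assumption, not from the article

# All you know: they drink. Smoking status is still uncertain.
before = estimate(RATE_IF_SMOKER, RATE_IF_NONSMOKER, P_SMOKER_GIVEN_DRINKER)

# You then learn they don't smoke, so that uncertainty collapses to zero.
after = estimate(RATE_IF_SMOKER, RATE_IF_NONSMOKER, 0.0)

print(f"knowing only that they drink: {before:.0%}")
print(f"knowing they don't smoke:     {after:.0%}")
```

Nothing about the person changes between the two calls. The only thing that moves is `p_smoker`, which describes what you know rather than anything about them.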

And it’s still not true that their odds of developing cancer are exactly 3% in any objective sense. Maybe tomorrow you’ll learn something about their age, or gender, or family history, and adjust your estimate again based on the new data. Maybe you don’t know that a doctor actually diagnosed them with lung cancer yesterday. This, obviously, makes a huge difference to their odds of having lung cancer – but it doesn’t change the fact that they’re in a low-risk group, and a 3% estimate is the best you can do based on your current knowledge.

In conclusion: stats are hard, so listen to maths geeks (or become one yourself) before panicking about the latest tabloid health scare.