Lies, damned lies and statistics!

On the eve of the biggest lottery jackpot in the history of mankind, I was pondering probability theory (and yes, I did zoom over to the US and buy a ticket, so if this blog becomes somewhat less active you will know why). It reminded me of an excellent paper I recently came across on the role of significance tests by Charles Lambdin (2012), who has resurrected the arguments against significance tests as our fundamental statistical method for hypothesis testing. He makes the case that these really represent modern magic rather than empirical science. I must admit that to some extent I tend to agree, and I wonder why we still rely on p-values today, with an almost blind acceptance of their rightful place in the scientific world.

The modern formulation and philosophy of hypothesis testing was developed by three men between 1915 and 1933: Ronald Fisher (1890-1962), Jerzy Neyman (1894-1981) and Egon Pearson (1895-1980). Fisher developed the principles of statistical significance testing and p-values, whilst Neyman & Pearson took a slightly different view and formulated the method we use today, where we compare two competing hypotheses, H0 and an alternative hypothesis (Ha). They also developed the notions of Type I and Type II errors.

As we all know from our undergraduate studies, with hypothesis testing using p-values two outcomes are possible for a statistical test result. In the first, the test statistic falls in the critical region, and the result is declared statistically significant at the α (5%) significance level. In this case two logical choices are open to a researcher: they can reject the null hypothesis, or accept the possibility that a Type I error has occurred (with probability ≤ α). In the second, the test statistic falls outside the critical region (the shaded areas below).

Here the p-value is greater than the statistical significance level (α) and either there is not enough evidence to reject H0, or there is a Type II error. We should note this is not the same as finding evidence in favour of H0. A lack of evidence against a hypothesis is not evidence for it, which is another common scientific error (see below). Another problem with H0 is that many researchers view accepting H0 as a failure of the experiment. This view is unfortunately rather poor science, as accepting or rejecting any hypothesis is a positive result in that it contributes to knowledge. Even if H0 is not refuted we have learned something new, and the term ‘failure’ should only be applied to errors of experimental design or incorrect initial assumptions. That said, the relative dearth of reports of negative outcome studies compared to positive ones in the scientific literature should give us some cause for question here.
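To make the Type I error rate a little more concrete, here is a minimal simulation sketch. It assumes a simple two-sided z-test with a known standard deviation of 1, and the sample size, number of experiments and seed are arbitrary choices of mine rather than anything from the studies cited here. Every simulated experiment is generated with H0 true, so any "significant" result is a Type I error by construction.

```python
import math
import random


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def type_one_error_rate(n_experiments=2000, n=30, alpha=0.05, seed=1):
    """Fraction of experiments that reject H0 when H0 is actually true."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(n_experiments):
        # H0 is true by construction: the data come from N(0, 1).
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        z = mean * math.sqrt(n)  # z = mean / (sigma / sqrt(n)), with sigma = 1
        if two_sided_p_from_z(z) < alpha:
            rejections += 1
    return rejections / n_experiments


rate = type_one_error_rate()
print(f"Rejected a true H0 in {rate:.1%} of simulated experiments")
```

The rejection rate should land close to α, which is all that α promises: a long-run error rate over many repetitions, not a statement about any single experiment.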

The Neyman-Pearson approach has attained the status of orthodoxy in modern health science, but we should note that Fisher’s inductive inference and Neyman-Pearson’s deductive approach are actually philosophically opposed, and many researchers are unaware of the philosophical distinctions between them and misinterpret this in their reports.

Commonly, researchers state H0 and Ha and the Type I error rate (α), determine the p-value, and compute the power of the test (Neyman–Pearson’s approach). But then the p-value (Fisher’s approach) is mistakenly presented as the Type I error rate, or as the probability that the null hypothesis is true, rather than being given its correct interpretation: the probability that random sampling would lead to a difference between sample means as large as, or larger than, that observed if H0 were true. This seems a subtle difference, but technically these are two very different things. Fisher viewed his p-value as an objective measure of evidence against the null hypothesis, whilst Neyman & Pearson did not; in their view a test result (and particularly a single test in a single study) never “proves” anything.
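The definition of a p-value given above can be sketched in code. The example below uses a permutation test, which is my own choice for illustration and not a method discussed by any of the authors cited here; the measurements are made up. It shuffles the group labels many times (which is what "H0 is true" means in this setting) and counts how often the shuffled difference in means is at least as large as the observed one.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate P(difference in means >= observed | H0: labels are arbitrary)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance guards against floating-point rounding on ties.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical measurements, invented purely for illustration.
treatment = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5]
control = [4.8, 5.0, 4.7, 5.2, 4.9, 5.1]

p = permutation_p_value(treatment, control)
print(f"Estimated two-sided p-value: {p:.4f}")
```

Note what the number is conditional on: it is computed entirely under the assumption that H0 holds, so it cannot also be the probability that H0 holds.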

As p-values represent conditional probabilities (conditional on the premise that H0 is true), an example may help us understand this difference: the probability that a given nurse is female, Prob(female | nurse), is around 80%, whereas the inverse probability, that a given female is a nurse, Prob(nurse | female), is likely smaller than 5%. Arguments over this misinterpretation of p-values continue to lead to widespread disagreement on the interpretation of test results.
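The nurse example can be worked through with some round numbers. The counts below are invented purely so that the arithmetic matches the rough percentages above; they are not real workforce statistics.

```python
# Hypothetical population counts, chosen only to illustrate the arithmetic.
population = 100_000
females = 50_000
female_nurses = 2_000
male_nurses = 500

nurses = female_nurses + male_nurses

# Prob(female | nurse): of all nurses, what fraction are female?
p_female_given_nurse = female_nurses / nurses

# Prob(nurse | female): of all females, what fraction are nurses?
p_nurse_given_female = female_nurses / females

# Bayes' rule links the two through the base rates.
p_nurse = nurses / population
p_female = females / population
via_bayes = p_female_given_nurse * p_nurse / p_female

print(f"Prob(female | nurse) = {p_female_given_nurse:.0%}")
print(f"Prob(nurse | female) = {p_nurse_given_female:.0%}")
print(f"Prob(nurse | female) via Bayes' rule = {via_bayes:.0%}")
```

The two conditional probabilities differ by a factor of twenty here, and you cannot get from one to the other without the base rates. The same is true of Prob(data | H0) and Prob(H0 | data).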

There is also some good evidence that, because of the way p-values are calculated, they actually exaggerate the evidence against the null hypothesis (Berger & Sellke, 1987; Hubbard & Lindsay, 2008), and this represents one of the most significant (no pun intended – OK, well just a bit) criticisms of p-values as an overall measure of evidence.

In science, we are typically interested in the causal effect size, i.e., the amount and nature of differences, yet if a sample is large enough almost any difference can be found to be “statistically significant.” Moreover, the standard 5% significance level doesn’t really have any mathematical basis; it is a convention resulting from long-standing tradition. Together with the exaggeration of evidence against the null hypothesis inherent in the use of p-values, this limits the value of significance testing to empirical scientific enquiry. To be clear, statistical test results and significance values do not provide any of the following (although many researchers assume they do):

  • the probability that the null hypothesis is true,
  • the probability that the alternative hypothesis is true,
  • the probability that the initial finding can be replicated, and,
  • if a result is important (or not).

They do give us some statistical evidence that the phenomenon we are examining likely exists, but that is about it, and we should be very aware that the ubiquitous p-value does not provide an objective or unambiguous measure of evidence in hypothesis testing. This is not a new argument; it has been made since the 1930s, with some researchers arguing that significance tests really represent modern sorcery rather than science (Bakan, 1966; Lambdin, 2012) and that their counter-intuitive nature frequently leads to confusion about the terminology.
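The point that a large enough sample can make a trivially small difference “statistically significant” is easy to illustrate. The sketch below assumes a two-sample z-test with equal group sizes and a known standard deviation, and holds the observed difference fixed at a made-up 0.02 standard deviations, an effect most of us would regard as negligible. Only the sample size changes.

```python
import math


def two_sided_p(mean_difference, sd, n_per_group):
    """Two-sided p-value for a two-sample z-test with equal group sizes."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_difference / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical effect: a difference of 0.02 standard deviations.
for n in (100, 10_000, 1_000_000):
    p = two_sided_p(mean_difference=0.02, sd=1.0, n_per_group=n)
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

The effect size never changes, yet the p-value falls from clearly non-significant to far below 0.05 as n grows. A small p-value on its own therefore says nothing about whether a result is important.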

It has also been noted that p-values fail to meet a simple logical condition required of a measure of support: if hypothesis Ha implies hypothesis H0, we should expect at least as much support for H0 as there is for Ha (Hubbard & Lindsay, 2008; Schervish, 1996).

Our problem is that we need practical methods that help us avoid both dismissing meaningful results and exaggerating evidence. Bayesian techniques have been suggested as a replacement for significance testing and p-values, but have yet to take hold, probably because of the simplicity of implementing p-values and their widespread perceived objectivity (Thompson, 1998). Overall, these concerns should emphasize the importance of repeated studies and of considering findings in a larger context, and ultimately this leads us to a good argument for the value of meta-analysis.

But the question remains, why are we still relying on p-values when there are so many issues with them, and probably much better techniques?

Beats me, and maybe the sooner we become Bayesians the better for science.



Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423-437.

Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association, 82(397), 112-139.

Hubbard, R., & Lindsay, R. M. (2008). Why P values are not a useful measure of evidence in statistical significance testing. Theory & Psychology, 18(1), 69-88.

Lambdin, C. (2012). Significance tests as sorcery: Science is empirical – significance tests are not. Theory & Psychology, 22(1), 67-90.

Schervish, M. J. (1996). P values: What they are and what they are not. The American Statistician, 50, 203-206.

Thompson, J. R. (1998). A response to “describing data requires no adjustment for multiple comparisons”. American Journal of Epidemiology, 147(9).

My faith in no faith.

Well, the big news this week in the UK has been the resignation of the Archbishop of Canterbury, Dr Rowan Williams. For those of you who may not know, the ‘Archbishop of Canterbury’ is the head of the Church of England in everything but name (that title actually goes to the Queen).

OK, so this is hardly world-shaking news, as the Church of England is scarcely a big player in terms of global religions. In fact I only know a few people who are members, but it’s always seemed a rather benign organisation really: a church of gentle hymns and prayers, of church fetes and jam making. I for one will rather miss this liberal Archbishop; he is in favour of same-sex marriage and women bishops, questions the nature of miracles and has been a pain in the arse to the UK Government over the morality (or rather immorality) of youth unemployment.

The controversy, however, concerns the matter of his new job. He is to become the new Master of Magdalene College, Cambridge. Despite the fact that he had a previous academic career, this move still brought out what Alain de Botton has come to call the ‘North Oxford Mafia’, including the Queen Bee himself, Richard Dawkins. As early as ‘Unweaving the Rainbow’ (1999) Dawkins argued that Theology should not be taught in Universities as, in a nutshell, it is uncritical and therefore not academic, and he still maintains that line. In that same book, he calls for atheist scientists to ‘come out’ and declare their atheism. So OK, I’ll put my cards on the table. I am an atheist. I don’t believe in gods. I don’t believe in any supernatural being that can influence, or intervene in, my life. There have been times in my life when I really tried (normally when in extremis) but I have never changed my mind.

The curious thing about this is that I haven’t come to the non-belief in gods through any rational process. It is not through my background in science. I haven’t researched comparative religions, familiarised myself with the metaphysical arguments, nor carried out exhaustive multivariate meta-analyses on huge parameterised data sets and come to this conclusion on the basis of such work. Rather, I intuitively feel that there is nothing there. I sometimes feel that I end up using the inverse arguments to those who do have a belief. So my atheism is certainly not based on science, nor rationality, but rather on a ‘belief’ that there are no gods.

This is possibly not exactly what Richard Dawkins has in mind.

Particularly as in ‘The God Delusion’ (2006) he goes even further and develops the line of argument that those who do believe in a god are self-delusional. Now, as I’ve said, I’m an atheist and I should be on his side, but even I think this quite outrageous. This is a Professor in Public Understanding of Science at Oxford University arguing that people who disagree with him and believe in gods are delusional and are even hallucinating (you can see examples of this on YouTube if you’re so inclined). For crying out loud, what sort of academic stance is that? You defend your idea by saying that any criticism is invalid, as any criticism is by definition delusional. I can’t wait to get to an academic conference and use that response on anyone who critically engages with my work!

Of course, that is not an academic argument, as it works both ways. My atheism is based on an intuitive sense that there is nothing there; is that delusional as well? I presume it must be, as the probability is surely the same both ways, isn’t it? Either we’re all delusional, or none of us are, or perhaps for some reason unknown to me (but presumably known to Richard Dawkins) only those that agree with archbishops are?

No, I’ll live and let live. I’ve become increasingly angered by the intolerance of Neo-Atheism and its association with science. I’ve tried to argue on this blog for a science of equality, peace and social justice. Dawkins’ line of argument seems divisive, designed to upset and ultimately barren.

I’m a scientist and an atheist, but I quite like home-made jam, fruit cake and liberalism. More tea, vicar?


The Placebo Effect; how does it work?

There was an interesting discussion recently posted on the neurophysiologist Dr. Marcello Costa’s blog about the nature of the placebo and nocebo effects. See:

Basically, he argues a well-researched position that there is now considerable evidence showing that expectations of getting better have significant effects on how patients actually feel, and gives some suggested physiological explanations of the phenomena.

We hear a lot about the placebo effect, so what is it?

The placebo effect (placebo is Latin for “I shall please”) is the measurable, observable, or experienced improvement in health or behavior not attributable to a medication or invasive treatment that has been administered.

It is frequently argued (see for example) that the placebo effect is not really mind over matter, but has become a catchall term for a positive change in health not attributable to a therapeutic intervention.

The change seen with placebos has been suggested to be due to a number of things:

1) Regression to the mean – the fascinating statistical phenomenon that if a variable is extreme on its first measurement, it will tend to be closer to the average on a second measurement. Barnett et al. (2005) give a good explanation. Regression to the mean is another reason why we need repeat studies to reinforce findings.

2) Spontaneous resolution – Leave people alone and they often get better without any therapeutic intervention (much to the chagrin of many surgeons)! A proportion of the population will naturally resolve an illness without treatment. This is a good argument for minimizing interventions, vs. the “let’s throw the kitchen sink at this health problem” approach.

3) Reduction of psychological stress (stress has a direct physiological link through the neuro-endocrine response) and a reduction of stress can have positive physiological benefits.

4) Misdiagnosis – frequently conditions are misdiagnosed (especially in early phases), and differential diagnosis remains as much an art as a science.

5) Subject expectancy, e.g. classical conditioning. Remember Pavlov?
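Regression to the mean (item 1 above) is the least intuitive of these, so here is a small simulation sketch. It assumes each person has a stable underlying symptom level plus independent random day-to-day variation; all the numbers (means, standard deviations, group sizes, seed) are made up for illustration. We pick the people with the most extreme first measurement, do absolutely nothing to them, and measure again.

```python
import random
import statistics


def simulate_regression_to_mean(n_people=10_000, top_fraction=0.1, seed=42):
    """Return (first, second) mean scores for those most extreme at first measurement."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable

    # Hypothetical model: stable underlying level plus independent measurement noise.
    true_level = [rng.gauss(50, 10) for _ in range(n_people)]
    first = [level + rng.gauss(0, 10) for level in true_level]
    second = [level + rng.gauss(0, 10) for level in true_level]

    # Select the people with the highest (worst) scores on the first measurement.
    n_selected = int(n_people * top_fraction)
    ranked = sorted(range(n_people), key=lambda i: first[i], reverse=True)
    selected = ranked[:n_selected]

    first_mean = statistics.mean(first[i] for i in selected)
    second_mean = statistics.mean(second[i] for i in selected)
    return first_mean, second_mean


first_mean, second_mean = simulate_regression_to_mean()
print(f"Selected group, first measurement:  {first_mean:.1f}")
print(f"Selected group, second measurement: {second_mean:.1f} (no treatment given)")
```

The selected group looks markedly better on the second measurement even though no treatment, placebo or otherwise, was given. Since patients tend to enrol in studies when their symptoms are at their worst, an uncontrolled before-and-after comparison will show this kind of apparent improvement for free.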

The placebo effect is nicely characterized by this quote:

“The physician’s belief in the treatment and the patient’s faith in the physician exert a mutually reinforcing effect; the result is a powerful remedy that is almost guaranteed to produce an improvement and sometimes a cure.” — Petr Skrabanek and James McCormick, Follies and Fallacies in Medicine, p. 13.

In this way we can see that the placebo effect can work very well to support dubious non-evidence-based health practices such as nutritional supplements (and I mean of the “wonder-food” variety) or other dodgy and fake practices: drinking shark-fin soup (now thought toxic), rhino horn for increased potency, psychic surgery, etc.

In scientific experimentation we frequently use controls such as inert substances (e.g. normal saline) and have to consider that in some cases these will produce an effect similar to what would be expected with an active substance (e.g. an IV analgesic). However, we can counter this with large samples, double-blind designs and repeat studies. Indeed, in scientific clinical trials we are required to take the placebo effect into account (a requirement introduced in a revision of the Declaration of Helsinki).

A related phenomenon is the nocebo (Latin for “I shall harm”) effect, which is basically the same as the placebo effect, but this time the subject experiences harmful, unpleasant, or undesirable effects after receiving a placebo. Nocebo responses are thought to be due only to the subject’s pessimistic belief and expectation that the inert drug will produce negative consequences. One well-known example is C. K. Meador’s claim that people who believe in voodoo can actually die because of their belief (Meador, 1992), and there are other studies that have demonstrated this effect.

Dr Costa suggests that these effects are neurologically mediated by higher brain centres. Pain, for example, is significantly affected by the higher brain, so it’s very open to the placebo/nocebo effects.

He also makes a good point about the ethical issues of using placebos in research. Clearly research using nocebos has ethical problems, but even with placebos, is it ethical to deceive patients in this way for the sake of science (even if they know they “might” get a placebo), or for physicians to give antibiotics for viral infections and vitamins for fatigue (a common practice, even though it is not for the overall good of the population)?

So it seems there is lots of room for more research into this interesting phenomenon. We would be interested in what people think of the ethics of using placebos in both scientific research and practice.



Barnett, A. G., van der Pols, J. C., & Dobson, A. J. (2005). Regression to the mean: What it is and how to deal with it. International Journal of Epidemiology, 34(1), 215-220. doi:10.1093/ije/dyh299

Meador, C. K. (1992). Hex death: Voodoo magic or persuasion? Southern Medical Journal, 85(3), 244-247.