The Internet is flooded with explanations of P-values (*To P or not to P?*) and of the problems around multiple testing (the more tests you do, the more significant results you get, even if everything is random). I don’t bring anything new; I just wrap it up for myself and add a few examples that hopefully make some concepts clearer. I will talk about Type I and Type II error rates, the power of the test, the disadvantages of multiple tests, what it looks like when we apply a correction for multiple testing (not much significance left), and what else we could do about it (some global analysis?). Read further if you also think about these things.

**P-values**

Several decades ago, ecologists argued strongly that it was important to start using rigorous statistical approaches and standards in ecology; that was the era when ecological studies were based mostly on natural history observations rather than on systematically sampled data analysed with rigorous statistics. Meanwhile, ecology has grown into a standard scientific discipline, and some have started to argue that P-values are often misused and misunderstood, and should be banned from ecological publications (as has happened in medicine and psychology). Putting this discussion aside, P-values still represent a useful tool, telling us how much evidence we have against the null hypothesis we are testing.

**The P-value is the probability of obtaining a value of the test statistic equal to or more extreme than the observed one, assuming the null hypothesis is true.** The null hypothesis is usually formulated as “there is no relationship between the variables” (or there is some fixed relationship, from which we are testing the difference). For example, if I relate species composition to some environmental factor using constrained ordination, the P-value is the probability of obtaining a test statistic (R^{2}, pseudo-F value) equal to or higher than the one actually calculated, assuming the environment is truly not related to species composition (the null hypothesis is true). If the P-value is sufficiently low (e.g. less than 5%, the common standard originally proposed by Ronald Fisher), we may conclude that such a result would be quite unlikely if the null hypothesis were true, and declare the observed value “significant”. That 5% (or, in general, the alpha level) is the probability of a Type I error, and we report such a result as “significant” at the given alpha threshold (e.g. P < 0.05).

Some examples. First, let’s generate data for which **we are sure that the null hypothesis is true**: we generate two random variables, independent of each other; then we calculate the correlation between them (Pearson’s *r*) and test it using a parametric *t*-test (all this is done by the `cor.test` function in R):

```r
x.rand <- rnorm (100)
y.rand <- rnorm (100)
cor.test (x.rand, y.rand)$p.value
```

We obtain some P-value (e.g. if you include `set.seed (1234)` before the script above, the P-value should be P = 0.8020633, i.e. not significant at alpha = 0.05). Yet, even though these variables are randomly generated, we can occasionally still obtain a “significant” result with a P-value smaller than alpha, with probability alpha (try it – with e.g. `set.seed (1257)` you obtain P-value = 0.001678019, well significant at the 0.05 level).
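The claim that false positives occur with probability alpha can be checked directly by brute-force simulation; the sketch below (not part of the original script) repeats the random-data test many times and counts how often P < 0.05:

```r
# Under a true null hypothesis, P < alpha should occur with probability alpha.
# Repeat the test on freshly generated random data and count the "hits".
set.seed (1)
p.vals <- replicate (10000, cor.test (rnorm (100), rnorm (100))$p.value)
mean (p.vals < 0.05)  # proportion of false positives, close to 0.05
```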

Now, the second situation – we generate data for which **we are sure that the null hypothesis is not true**; variable x is still generated randomly, but variable y is a function of x with added random noise. The same as before – calculate the correlation and test it:

```r
x.dep <- rnorm (100)
y.dep <- 0.2 * x.dep + rnorm (100)
cor.test (x.dep, y.dep)$p.value
```

If in the code above you `set.seed (12345)`, you get P-value = 0.001676154, well below the 0.05 threshold, so you would conclude that the value is significant at P < 0.05. But of course, it may also happen that the result is not significant even though it should be (e.g. with `set.seed (12355)`, you obtain P-value = 0.06517154, which would not be significant at the 0.05 significance level).

To summarize, see the possible outcomes in the table:

|  | Null hypothesis is true | Null hypothesis is not true |
|---|---|---|
| **Null hypothesis rejected** | Type I error (false positive), probability = alpha | true positive, probability = 1 – beta (power of the test) |
| **Null hypothesis not rejected** | true negative, probability = 1 – alpha | Type II error (false negative), probability = beta |

The left column shows the situation in which the null hypothesis is, in reality, true (our first example with randomly generated variables). If I still reject it, I commit a Type I error, or false positive. The probability that this will happen is set by me and equals the alpha level at which I decided to reject the null hypothesis (alpha = 0.05, or 5%, in this case). If I do not reject it, I make a true negative decision (with probability 1 – alpha).

The right column shows the situation in which the null hypothesis is in reality not true (our second example, where y depends on x because I enforced this dependence). If I reject the null hypothesis, I make the correct decision (true positive); if I do not reject it, I commit a Type II error, or false negative. The probability of a Type II error is beta, and the probability of a true positive is 1 – beta, also called **the power of the test**. Beta and the power of the test depend on several factors: the test itself (some tests are more powerful than others), and also the data – the sample size and the amount of variance. In our case, the data are quite noisy and I have only 100 samples, so it may not be so easy to find the relationship which is there (this will improve if we increase the sample size, e.g. to 1000).
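The power mentioned above can be estimated by simulation: generate the dependent data many times and count the proportion of true positives. A minimal sketch (the helper `power.sim` is my own name, not from the original script):

```r
# Estimate the power of the correlation test by simulation:
# the proportion of significant results when the alternative is true.
power.sim <- function (n, n.rep = 1000, alpha = 0.05) {
  hits <- replicate (n.rep, {
    x <- rnorm (n)
    y <- 0.2 * x + rnorm (n)   # null hypothesis is not true
    cor.test (x, y)$p.value < alpha
  })
  mean (hits)
}
set.seed (1)
power.sim (100)    # around 0.5 with this effect size and sample size
power.sim (1000)   # close to 1 - larger sample size, higher power
```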

Now, in reality, I do not know whether the null hypothesis is true or not (that is why I want to test it), and I need to collect data and use statistics to get an idea. Both Type I and Type II errors are problematic – they mean that I rejected something I should not have (Type I error) or did not reject something I should have (Type II error). The probability of a Type I error is selected by me (here 0.05), while the probability of a Type II error depends on three factors: the sample size (a larger sample size means a lower Type II error rate), the selected value of the Type I error rate (lower alpha means higher beta), and the true difference of the data from the null hypothesis, i.e. the true effect size (the larger the difference, the lower the Type II error rate – it’s easier to reject the null and score a true positive). This means that decreasing our choice of the Type I error rate (alpha) increases the probability of a Type II error (beta) and decreases the power of the test (1 – beta). Not an easy life; the best solution is to sample more to get a larger sample size, which will reduce the probability of a Type II error (of course, not always possible).
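The trade-off between alpha and power can be seen directly in a small simulation (again my own sketch, not the original script): simulate P-values from the dependent-data scenario and evaluate them at two different alpha thresholds.

```r
# Lowering alpha (the Type I error rate) reduces the power (1 - beta):
# the same P-values cross a stricter threshold less often.
set.seed (1)
p.vals <- replicate (1000, {
  x <- rnorm (100)
  cor.test (x, 0.2 * x + rnorm (100))$p.value
})
mean (p.vals < 0.05)    # power at alpha = 0.05
mean (p.vals < 0.005)   # power at alpha = 0.005 - noticeably lower
```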

Both the effect size and the number of samples in the dataset also influence the P-value (it decreases with both), so in the case of a large dataset, we are likely to get strongly significant results even if the true effect (e.g. R^{2}) is pretty small. Some argue that for datasets with more than several hundred samples, the P-value no longer represents a useful tool for decisions, because virtually anything can become highly significant. Also, the threshold for rejecting the null hypothesis (often set at P < 0.05) should not be taken too seriously. It does not mean that at P = 0.048 we have evidence against the null hypothesis, but at P = 0.052 we no longer do. It is better to consider the P-value a continuous measure, and use the thresholds (P < 0.05, P < 0.01, P < 0.001) only as a recommendation of which results deserve more attention. The other extreme, i.e. interpreting everything, would take us several decades back, to the era when interpretations of ecological studies were based on feelings and intuition rather than on the results of statistical analyses.

**Multiple testing issue**

Sometimes a study includes an analysis based on testing more than one null hypothesis. Imagine an example: you want to know whether plant functional traits (e.g. leaf area, SLA) are related to some environmental factors (e.g. elevation, soil P). Setting aside the technical details of how this can be done, just imagine that you have ten different traits and ten different environmental variables, and you collect 100 samples (e.g. vegetation plots) in which you take all these measurements. Then, you take each of the ten traits and regress it against each of the ten environmental variables, test the significance, and interpret the significant results. What is the problem?

Let’s do the same exercise as above. First, **simulate the situation when all null hypotheses are true** (variables are randomly and independently generated). The R code gets more complex, so I deposited it in the Github repository here. What I do: I generate two random variables from the normal distribution, each with 100 values, calculate the regression of y on x, and test the significance (by a parametric F-test in this case; note that if I calculated the Pearson’s r correlation, the P-value would be identical). I repeat this 100 times and highlight significant results by adding a red regression line to the scatterplot.

You see that I got a few significant results, although I should not have – the null hypothesis is truly correct; I generated the data that way. These are false positives, and their probability equals the significance level alpha I set (here 0.05, i.e. 5%). You can see that if I repeat the same analysis again, I again get some significant hits, just this time in different combinations of traits and environmental variables.
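The core of this scenario fits into a few lines (a simplified sketch; the full version with the scatterplots is in the linked repository): run 100 tests on independent random variables and count the significant ones.

```r
# 100 tests, all with a true null hypothesis: on average about
# 100 * 0.05 = 5 of them come out "significant" purely by chance.
set.seed (1)
p.vals <- replicate (100, cor.test (rnorm (100), rnorm (100))$p.value)
sum (p.vals < 0.05)  # number of false positives among the 100 tests
```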

In fact, I can calculate the probability that, for a given number of tests and a given alpha value, I will get at least one significant result (with P < alpha), even if the null hypothesis is true: P_{at least one significant result} = 1 – (1 – alpha)^{m}, where m is the number of tests. In the case of 100 tests, the probability of obtaining at least one result significant at P < 0.05 is 1 – (1 – 0.05)^100 = 0.994, i.e. 99.4%; I can be almost sure that I get at least one!
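In R, this formula is a one-liner (just the arithmetic from the text):

```r
# Probability of at least one false positive among m independent tests:
alpha <- 0.05
m <- 100
1 - (1 - alpha)^m  # ~0.994, i.e. 99.4%
```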

Now, the second scenario, when all 100 relationships are generated with the null hypothesis not true – I created variables which depend on each other. The same as above applies – I test each of them, and if significant, I add the red regression line. In this case, I coloured the background of each scatterplot light blue, to indicate that the variables are truly not independent.

The number of hits is high, but not all combinations are significant. Those which are not are false negatives, and those which are significant are true positives; the probability of a true positive equals the power of the test, in this case something around 50% (how do I know that? Repeat the analysis many times, e.g. 1000 times, and see the proportion of true positives; this gives you the power of the test, from which you can calculate the Type II error rate as beta = 1 – power). If you repeat the same thing, again quite a high proportion of the relationships will be significant (true positives), but some will not be, even though they should (false negatives).

In reality, we don’t know for which combinations of variables the null hypothesis is true and for which it is not. Let’s make a third scenario, in which we mix both options – some of the relationships are generated from independent variables (null hypothesis is true), some from related variables (null hypothesis is not true). For simplicity, the 10 relationships in the first row are those with the null hypothesis not true, and the rest (90) have the null hypothesis true.

The figure shows the whole spectrum of situations: true positives (blue background with a red regression line), true negatives (white background, no regression line), false positives (white background, regression line) and false negatives (blue background, no regression line). The problem is that, in reality, we do not know which of the relationships have the blue background (null hypothesis is not true), so when we interpret them one by one, we may well be interpreting false positives (not correct!) right alongside true positives (correct!).

How to deal with this? One way is to adjust the P-values of the individual tests to correct for the multiple testing issue. The simplest is the Bonferroni correction: you simply multiply each P-value by the number of tests you are doing (P_{adj} = P * m) and then evaluate P_{adj} instead of P for significance. This helps control the overall (or family-wise) Type I error rate, but at the same time it quite drastically reduces the power of the test: you are unlikely to get false positives, but you are also quite unlikely to get true positives. There are better methods than Bonferroni (e.g. FDR, the false discovery rate, or Holm’s correction), but the drastic reduction of test power applies to all of them.
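In R, all of these corrections are available through the built-in `p.adjust` function, so there is no need to multiply by hand (the P-values below are just made-up examples for five tests):

```r
# Adjust a vector of P-values for multiple testing:
p.vals <- c (0.001, 0.008, 0.039, 0.041, 0.27)
p.adjust (p.vals, method = "bonferroni")  # each P multiplied by m = 5, capped at 1
p.adjust (p.vals, method = "holm")        # step-down version, a bit less conservative
p.adjust (p.vals, method = "fdr")         # controls the false discovery rate instead
```

Note how values that were comfortably below 0.05 before the adjustment may no longer be significant afterwards.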

If we increase the power of the test by increasing the sample size, this problem becomes less serious, but that is not always possible. Below, I increased the number of samples in each comparison from 100 to 500.

This looks great! No false positives, and only two false negatives. However, with real data this is quite unlikely to happen, because you won’t have a test power of more than 99% as in this case.

It seems that the multiple testing strategy is not the best way to go – either you suffer from too many false positives, or you correct and get (almost) nothing significant. If possible, opt for other types of analysis, ones which conduct a single global test instead of a series of pairwise ones, or combine the two strategies together. A well-known example of such an approach is one-way ANOVA with post-hoc comparison. ANOVA tests whether the mean of at least one of the groups deviates from the overall mean; if yes, then one can conduct a pairwise post-hoc comparison of groups to identify which groups differ (still, this pairwise comparison is not a simple series of t-tests, but includes an implicit correction for multiple testing). In the case of trying to find relationships between two sets of variables (e.g. traits and environment, as in our example), you may opt for constrained ordination, which is in fact multivariate regression (many traits regressed on many environmental variables; more precisely, CWM-RDA in this case, but the details are beyond this text). If the ordination is significant, then go for pairwise regressions to see which of the variables are truly related (still, using one of the corrections for multiple testing is advisable).
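The ANOVA-then-post-hoc strategy looks like this in base R (a toy example with made-up groups; `TukeyHSD` does the pairwise comparisons with the multiple-testing correction built in):

```r
# Global test first; pairwise comparisons only if the global test is significant.
set.seed (1)
dat <- data.frame (
  value = c (rnorm (30, mean = 0), rnorm (30, mean = 0), rnorm (30, mean = 1.5)),
  group = rep (c ("A", "B", "C"), each = 30))
fit <- aov (value ~ group, data = dat)
summary (fit)    # one global F-test: does any group mean differ?
TukeyHSD (fit)   # corrected pairwise comparisons to find which groups differ
```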

You may ask: so if I measure things, relate them to each other and find a significant relationship, does it mean that these variables may in fact have no real meaning – that they could just be random variables? The answer is yes, it may happen that the result is just a false positive. The way to make sure that the relationship between the variables (e.g. a given trait and an environmental factor) truly exists is to replicate the study, ideally in a different location. Or search the literature to see whether someone has ever detected the same relationship in a similar context. It’s unlikely that the same combination of variables would be a false positive in many studies (you see above that false positives occur, but in each new replication they occur in different variable combinations). If you repeat the study again (and again and again) and the same relationship comes out significant (or you see multiple papers reporting the same thing), that is strong evidence that it may truly mean something.

**Where to go next?**

A nice, readable text on this topic is the chapter *The p value and the base rate fallacy* in the book *Statistics Done Wrong* by Alex Reinhart (see the comics there!). Also, a few years ago there was an inspiring series of forum papers in Ecology, started by the paper of Paul A. Murtaugh entitled *In defense of P values* (Murtaugh 2014), followed up by several others and wrapped up again by Murtaugh. There is also a recent, interesting suggestion that we should redefine statistical significance and lower the alpha threshold from 0.05 to 0.005, which would greatly reduce the probability of reporting false positives in scientific studies and increase reproducibility. As it goes, one of the first comments (of many) on this paper is entitled “Remove, rather than redefine, statistical significance”…

R script for all calculations in this post is in Github repository: https://gist.github.com/zdealveindy/39feabd7c7b4f3d780d2b8c5232456a1