## Fisher’s significance test: A gentle introduction

Published: May 11, 2020

### Abstract

The p-value is often misunderstood and, for example, misinterpreted as a probability for the correctness of the null hypothesis. The aim of this article is first to explain the definition of the p-value. Determining the p-value requires knowledge of a probability function. How an appropriate statistical model is selected and how the p-value is determined from this model, the null hypothesis and the empirical data is explained using the t-distribution. When interpreting the p-value obtained in this way, two incompatible statistical schools of thought meet: the orthodox Neyman-Pearson hypothesis test, which amounts to a decision between the null hypothesis and a complementary alternative hypothesis, and Fisher’s significance test, in which no alternative hypothesis is formulated and in which the smaller the p-value, the greater the evidence against the null hypothesis. The article ends with some critical remarks about the handling of p-values.

### Introduction

The p-value is often misunderstood and, for example, misinterpreted as a probability for the correctness of the null hypothesis. P-values play an important role in two schools of thought: Fisher’s significance test and Neyman and Pearson’s hypothesis test [1], [2]. While the significance test leads to a quantitative interpretation of the p-value, in which it is read as a continuous measure of evidence against the null hypothesis, the p-value in the hypothesis test merely serves as a decision criterion applied according to predefined rules.

In 2016, the American Statistical Association (ASA) published a statement on the handling of p-values. Among other things, it stated: “The widespread use of ‘statistical significance’ (generally interpreted as ‘p≤0.05’) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process” [3]. In 2019, Amrhein et al. published an article entitled “Retire statistical significance” in Nature, in which they draw attention to the many pitfalls of dichotomizing p-values into “significant” (usually p≤0.05) and “non-significant” (usually p>0.05) and generally discourage this dichotomization, i.e. the categorization of p-values into two classes [4].

A dilemma in the application of the significance or hypothesis test remains the lack of understanding of what these methods can answer at all. The aim of this paper is to illustrate the essential background and the steps of the significance test by means of a fictitious study in which two groups are compared with each other. Most biostatistics textbooks do not present this background and these steps consistently. The article is intended for readers who can only vaguely describe what the procedure does.

### Fundamental statistical concepts − standard deviation, sampling error, and standard error

#### Basic understanding – random sampling from a target population (population model)

The target population of a scientific question represents the totality of all observation units. If the target population is the resident population of the Federal Republic of Germany, the total population in 2016 is 82.5 million. Quantities of interest in this population could be means and spreads of characteristics (e.g. mean sleep latency, i.e. the average time from switching off the light in the bedroom to falling asleep, in minutes). These parameters of the target population, which are usually unknown to us, are abbreviated with Greek letters by statistical convention. For example, the Greek letters µ and σ^{2} are used for the mean and the variance of a variable in the target population.

When conducting empirical studies, it is generally not possible to examine the whole target population. For this reason, only a sample from the target population is examined, and information from the sample is used to make statements about the target population. The statistical inference from a sample to a target population is an inductive conclusion and is referred to as inferential statistics.

When random samples are drawn from a target population, the so-called sampling error (sampling variability) occurs. Since only a part of the target population is examined, there is variability from sample to sample. This can easily be illustrated by tossing a fair coin. One would expect 50% of all tosses to show heads. This expected value, i.e. the probability, is a prognosis of a relative frequency. If the coin were flipped 10 times, heads might appear 4 times. Flipping the coin 10 times again would not necessarily result in 4 heads, but e.g. 6 heads. This variability is an expression of the sampling error. Thus there can be no certain conclusion from a sample about a target population. The law of large numbers states that with increasing study size the sampling error becomes smaller and smaller.
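The sampling error and the law of large numbers can be illustrated with a short simulation (a minimal sketch; the seed and the sample sizes are arbitrary choices made for this illustration):

```python
import random

def heads_proportion(n_tosses, rng):
    """Proportion of heads in n_tosses flips of a fair coin."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

rng = random.Random(1)
small = heads_proportion(10, rng)       # varies strongly from run to run
large = heads_proportion(100_000, rng)  # settles close to the expected value 0.5
```

With only 10 tosses, the observed proportion can easily be 0.4 or 0.6; with 100,000 tosses it lies close to 0.5, reflecting the law of large numbers.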

#### Variability versus uncertainty

If, for example, one undertakes a study on the basis of a sample of 30 adult women with sleep disorders aged 55–64 living in Germany with the aim of estimating the true mean value µ of the sleep latency of the target population, the sample provides a mean value of e.g. 38 min and a corresponding empirical variance s^{2}, which is calculated according to the following formula:

s^{2} = Σ (x_{i} − x̄)^{2} / (n − 1)

Assuming a normal distribution of the variable sleep latency, a suitable statistical measure describing the variability in the sample would be, in addition to the variance, the standard deviation (SD), i.e. the square root of the variance. The standard deviation s for the sample would be 8.5 min. If this study were repeated with a new random sample of 30 adult women with sleep disorders aged 55–64, resident in Germany, the mean value might be, for example, 33 min and the standard deviation 8.4 min. The standard error of the mean (SE) is not a measure that quantifies the variability of the measured values within the sample, but rather the uncertainty of the estimate of the mean µ of the target population [5]. The standard error is calculated according to the following formula:

SE = s / √n

where *n* is the number of observations. It can be seen that the smaller the variability of the characteristic in the sample and the larger the sample, the smaller the SE becomes.
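The distinction between variability (SD) and uncertainty (SE) can be sketched with Python's standard library (the sample values below are invented for illustration; only the formulas SD = √s² and SE = s/√n are taken from the text):

```python
import math
import statistics

# invented sleep-latency values (minutes) for a small illustrative sample
latencies = [38, 45, 29, 41, 33, 50, 36, 40, 31, 37]

n = len(latencies)
s = statistics.stdev(latencies)  # sample standard deviation (variability)
se = s / math.sqrt(n)            # standard error of the mean (uncertainty)
```

For the study described in the text (s = 8.5 min, n = 30), the same formula gives SE = 8.5/√30 ≈ 1.55 min.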

### How does a statistical test work – the t-test as an example

#### Two-group comparison

In an example of two randomly sampled groups, we compare the effect of a new sleeping pill on sleep latency. The verum group includes 32 persons, the placebo group 30 persons (cf. Table 1 [Tab. 1]). In both groups, sleep latency was determined after 7 days of treatment in the sleep laboratory (polysomnography). The null hypothesis is that the two groups do not differ with regard to sleep latency. Several tests have been suggested for such a group comparison.

In Table 2 [Tab. 2], we briefly explain the historically important permutation test. The permutation test is rarely used nowadays because the computing effort may be huge; in our example, there are 4.5×10^{17} permutations. Therefore, in our case the t-test would be preferred, which can be regarded as a good approximation of the permutation test and is the most popular test in the biomedical literature.
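The idea of the permutation test can be sketched on a tiny invented data set: under the null hypothesis, the group labels are exchangeable, so every reassignment of subjects to groups is equally likely, and the p-value is the fraction of reassignments whose mean difference is at least as extreme as the observed one (the latency values below are hypothetical):

```python
import math
from itertools import combinations

# invented sleep latencies (minutes) for two very small groups
group_a = [30, 35, 28, 33]
group_b = [40, 38, 42, 37]

pooled = group_a + group_b
observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))

# enumerate all ways to choose 4 of the 8 subjects as "group A"
count_extreme = 0
total = 0
for idx in combinations(range(len(pooled)), len(group_a)):
    a = [pooled[i] for i in idx]
    b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    diff = abs(sum(a) / len(a) - sum(b) / len(b))
    if diff >= observed - 1e-12:
        count_extreme += 1
    total += 1

p_value = count_extreme / total  # exact two-sided permutation p-value
```

With the 62 patients of the example, the same enumeration would require math.comb(62, 30) ≈ 4.5×10^{17} arrangements, which is why the t-test is used as an approximation.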

A comparison of the mean values of the two samples shows that the mean sleep latency in the verum group is 5 min lower than in the placebo group. In both groups, sleep latency varied, as can be seen from the standard deviations. Both samples are associated with random error due to sampling error.

The question that arises here is whether the difference of 5 min is only an expression of a random error or whether this difference is an expression of an actual effect of the sleeping pill. In the first case, both samples would come from identical populations (µ_{p}=µ_{v}); in the second case, the two samples would come from different populations, i.e., populations with µ_{p}≠µ_{v}. Figure 1 [Fig. 1] illustrates the problem: could it be that placebo and verum do not differ with respect to the true sleep latency averages, i.e. come from the same population with e.g. µ=38 min, and the two sample averages (33 min and 38 min) are merely an expression of the sampling error, similar to the toss of a fair coin? Or could it be that the new sleeping pill actually has an effect on sleep latency, so that the true mean values come from target populations with different mean values (µ_{p}≠µ_{v})?

#### Expectation of statistical variability of study results due to random error

A significance test can provide some, albeit imperfect, information on these central questions. To answer them, the behavior of the mean difference due to random error must first be determined, assuming that a null hypothesis H_{0} were true. In principle, there are infinitely many possible null hypotheses. In medicine, the nil hypothesis has prevailed, i.e. the null hypothesis of no association between treatment assignment (placebo or verum) and sleep latency (i.e. µ_{p}=µ_{v}). The Greek letters indicate that this null hypothesis refers to the target population. Under this hypothesis, mean differences that are not equal to zero are an expression of random error. Just as extreme outcomes are rarely observed when tossing a fair coin (e.g. heads on all 10 of 10 tosses), the difference of the means rarely takes extreme values under the null hypothesis.

But how many permuted arrangements of patients split into two groups exist, and how would the differences of the means in these arrangements behave if the null hypothesis µ_{p}=µ_{v} were true? The difficulty in answering this question lies in the fact that the behavior of the difference of the means under the null hypothesis depends on the variability of the sleep latency within the samples and on the size of the samples.

So in order to predict how the differences of the means would behave if the null hypothesis were true, one has to take these two influencing variables into account. Here a kind of normalization is helpful, which can be illustrated by the following example. A difference of means of 3 seconds is observed for two groups of marathon runners (2 hours, 3 min, 40 seconds versus 2 hours, 3 min, 43 seconds) and for two groups of 400-meter runners (46 seconds versus 49 seconds). For comparable groups of runners, a difference of 3 seconds has a different meaning: for marathon runners, it is very small in relation to the average total duration of the run, while it is relatively large for 400-meter runners. The relation to the average running time is a kind of normalization. The chosen statistical test provides such a normalization through its test statistic. If, for example, the t-test for independent samples is selected, the corresponding test statistic is the t-statistic; for the Chi-square test it is the Chi-square statistic, etc. The choice of the appropriate statistical test in turn depends on criteria which are briefly explained in Table 3 [Tab. 3].

The t-statistic is defined as:

t = ((x̄_{1} − x̄_{2}) − (µ_{1} − µ_{2})) / SE(x̄_{1} − x̄_{2})

The expected difference of means in the t-statistic formula is the value assumed under the null hypothesis H_{0}. In the case of the nil hypothesis µ_{p}=µ_{v}, a difference of zero minutes is expected. This simplifies the t-statistic:

t = (x̄_{1} − x̄_{2}) / SE(x̄_{1} − x̄_{2})

In the case of unequal variances, the standard error of the difference of the means is calculated according to the following formula:

SE(x̄_{1} − x̄_{2}) = √(s_{1}^{2}/n_{1} + s_{2}^{2}/n_{2})

with

n_{1}: number of patients in group 1 (placebo)

n_{2}: number of patients in group 2 (verum)

s_{1}^{2}: variance of sleep latency in group 1

s_{2}^{2}: variance of sleep latency in group 2

The formula changes if the variances are equal (formula not shown). The standard error of the difference of the means depends on the variances of the variable (sleep latency) and on the sizes of the groups being compared. After determining the standard error, the t-statistic for two independent samples with unequal variances is:

t = (x̄_{1} − x̄_{2}) / √(s_{1}^{2}/n_{1} + s_{2}^{2}/n_{2})

Independence means that the two patient groups are independent of each other and also that patients within the groups are independent of each other. For example, independence is violated if the outcome of a patient contributed statistically to both patient groups. Similarly, independence would be violated if patients in the same group influenced each other in terms of the outcomes of interest. Independence is also violated when a characteristic is collected from a group of patients several times over time (e.g. before and after treatment). The data of the sleep study now yield the following t-value:

t = (38 − 33) / √(8.5^{2}/30 + 8.4^{2}/32) ≈ +2.33

The t-value for the concrete study is therefore +2.33. How this value behaves under the null hypothesis is described by the t-distribution, which is determined by the so-called degrees of freedom (df). The number of degrees of freedom is the number of values that can be varied freely without changing the mean values. If, for example, there are three numbers k, l and m and their sum is 100, then once two of the three numbers are known, the third is automatically given: if k=20 and l=70, m must be 10. With 62 patients in the study one has n_{1}−1+n_{2}−1=30−1+32−1=60 degrees of freedom. If 60 values were selected freely, there would be no further choice for the last two observations, one in each group, because each group mean is fixed.
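The computation can be retraced in a few lines (a sketch; the group means and standard deviations are taken to be the values mentioned in the text, 38 min vs. 33 min with SDs of 8.5 and 8.4 min, which reproduce the reported t-value):

```python
import math

def welch_t(mean1, s1, n1, mean2, s2, n2):
    """t-statistic for two independent samples with unequal variances."""
    se_diff = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (mean1 - mean2) / se_diff

# placebo: n=30, mean 38 min, SD 8.5; verum: n=32, mean 33 min, SD 8.4 (assumed)
t = welch_t(38, 8.5, 30, 33, 8.4, 32)
df = (30 - 1) + (32 - 1)  # degrees of freedom as counted in the text
```

This reproduces t ≈ +2.33 with df = 60.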

With the help of the 60 degrees of freedom, the appropriate distribution under the assumption of the null hypothesis can now be determined. The formula for constructing the t-distribution is omitted for didactic reasons (it is the distribution of the ratio of a standard normal variable z and the square root of a chi-square variable with n degrees of freedom divided by n). The t-distribution is symmetrical and bell-shaped like the normal distribution (Figure 2 [Fig. 2]).
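The construction mentioned in parentheses, t = z/√(χ²_{n}/n), can be checked by simulation (a sketch; the seed and the number of simulated draws are arbitrary choices):

```python
import math
import random

def t_sample(df, rng):
    """Draw one t-distributed value as z / sqrt(chi2_df / df)."""
    z = rng.gauss(0, 1)
    chi2 = sum(rng.gauss(0, 1) ** 2 for _ in range(df))  # chi-square with df degrees of freedom
    return z / math.sqrt(chi2 / df)

rng = random.Random(7)
draws = [t_sample(60, rng) for _ in range(20_000)]

mean = sum(draws) / len(draws)                     # close to 0 (symmetry)
tail = sum(d >= 2.33 for d in draws) / len(draws)  # fraction beyond +2.33
```

The simulated upper-tail fraction beyond +2.33 comes out near 0.01, in line with the tail areas discussed in the text.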

The probability density function (PDF) supplies so-called density values depending on the t-values. In contrast to probabilities, which can only assume values between 0 and 1, densities can also assume values >1.

#### Interpretation of the t-value

A single density value of the PDF has no practical interpretation. The total area under the curve of the PDF is 1, so that (partial) areas under the probability density function have the interpretation of probabilities. In the context of the study, it is now possible to answer the question of how high the probability is that the t-value takes a value of ≥+2.33 under the null hypothesis (µ_{p}=µ_{v}, under which the expected t-value is 0).

The cumulative distribution function (CDF) returns the probability that a t-value is smaller than or equal to a concrete value t_{k}. The CDF can also be used to calculate the probability that t is ≥t_{k}, by subtracting the probability for t-values <t_{k} from one. The formula for this function is omitted at this point, but can easily be found on the Internet [6]. In the case of the sleep study, t_{k}=+2.33. Figure 3 [Fig. 3] shows the area under the curve for t≥+2.33 (one-sided view) and the areas under the curve for t≤–2.33 and t≥+2.33 (two-sided view).

The one-sided area amounts to 0.01. This means that the probability that studies generate a t-value of ≥+2.33 under the assumption of the null hypothesis (µ_{p}=µ_{v}) is 1%. On a two-sided basis, the probability that studies generate a t-value of ≤–2.33 or ≥+2.33 under the assumption of the null hypothesis is 2%. The probability of 1% corresponds to the one-sided p-value, the probability of 2% to the two-sided p-value.
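These tail areas can be reproduced without statistical software by integrating the t-density numerically (a sketch; math.gamma supplies the normalizing constant, and the upper integration bound and step count are pragmatic choices):

```python
import math

def t_pdf(df):
    """Return the density function of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, upper=40.0, steps=40_000):
    """P(T >= t): trapezoidal integration of the density from t to `upper`."""
    pdf = t_pdf(df)
    h = (upper - t) / steps
    area = 0.5 * (pdf(t) + pdf(upper))
    area += sum(pdf(t + i * h) for i in range(1, steps))
    return area * h

p_one = upper_tail(2.33, 60)  # one-sided p-value
p_two = 2 * p_one             # two-sided, by symmetry of the t-distribution
```

For t=+2.33 and 60 degrees of freedom, this yields a one-sided p-value of about 0.01 and a two-sided p-value of about 0.02, as stated above.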

### The p-value – explanation and some caveats

#### Interpretation of the p-value

The p-value thus provides the probability (criterion 1) under a null hypothesis (criterion 2) of finding a result such as the present study result or observing study results that deviate even more from the null hypothesis (criterion 3). All three criteria are necessary criteria for the definition of the p-value.

It is important to note here that the p-value makes a statement about the behavior of a test statistic in the presence of random error given the null hypothesis. At a p-value of 0.01, only 1% of studies would generate a t-value of ≥+2.33 if the null hypothesis were true. Thus, the p-value also makes a statement about the outcomes of studies that were not observed (counterfactual element). Furthermore, it must be emphasized that the p-value was calculated under a condition: that the null hypothesis H_{0} is true, which is why the p-value is also referred to as a conditional probability. The null hypothesis was merely assumed, regardless of how much truth this hypothesis contains.

Fisher interpreted the p-value as a continuous measure of evidence against the null hypothesis. He wrote: “No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas” [7]. This means that, according to Fisher’s school, the classification of a p-value is context-dependent and the application of a fixed threshold of typically 0.05 is not justified. The orthodox rejection of a null hypothesis at a pre-defined threshold of 0.05 comes from the competing school of Neyman and Pearson, who introduced the hypothesis test as a decision-theoretical procedure.

What does a large p-value of e.g. 0.70 mean? Technically, it means that under the assumption of the null hypothesis, the probability of the observed study outcome, or of study outcomes deviating even more from the null hypothesis, is 70%. In practice, this means that the significance test provided little evidence against the tested null hypothesis or statistical model. However, it does not mean that the null hypothesis is true. The p-value is a function of the strength of the effect (e.g. the observed mean difference, here 5 min) and the study size (here 62 subjects). With a large p-value, a strong effect can actually be present while the study size was simply very small. Typical errors in the definition of p-values are discussed below.

“The p-value is the probability that the null hypothesis is true.” The p-value does not provide a statement about the probability of the truth of the null hypothesis, but the p-value was calculated under the assumption that the null hypothesis was true. Incidentally, the reference to even more extreme outcomes of the study (counterfactual element) is missing here.

“The p-value is the probability of type I error.” This statement is incorrect because it mixes principles of the significance test (Fisher) with those of the hypothesis test (Neyman & Pearson). According to the school of Fisher, there is no a priori fixed level of significance (also called type I error). In contrast, according to Neyman & Pearson, the level of significance, called type I error, is fixed before the study starts, whereas the p-value is derived from the statistical model and the study data after the study has been done. According to Neyman & Pearson, the type I error remains unchanged after the end of the study, and the p-value is compared to the a priori fixed type I error for making a decision.

The type I error, also called α error, is thus fixed according to Neyman and Pearson before the beginning of the study, and at the end of the study the p-value, obtained from the null hypothesis, the statistical model (e.g. t-test) and the study data, is compared with α (most often 0.05). The statement that “a low p-value excludes chance as an explanation for an observed difference” reveals a gross lack of understanding.

Almost-correct-sounding definitions of the p-value are, for example: “The p-value is the probability of observing the present study result or even more extreme study results.” In this definition, the central condition (criterion 2) of the p-value is missing: the calculation takes place under the assumption that the null hypothesis is true. The following incorrect definition is also popular: “The p-value is the probability of observing the present study result under the null hypothesis.” Here criterion 3 is missing: the p-value also makes a statement about unobserved study results that deviate even more from the null hypothesis than the present study result.

In the significance test according to Fisher, there is no type I error and no type II error, no confidence interval, no alternative hypothesis and no concept of statistical power or sample size calculation. These concepts originate from Neyman & Pearson and only become relevant when performing hypothesis tests, which are decision-theoretically valid only if all steps of the hypothesis test procedure are adhered to, which is why authors also speak of the Neyman-Pearson orthodoxy [8]:

1. Definition of the null and alternative hypothesis before the start of the study.
2. Determination of the type I and type II error before the start of the study.
3. Determination of the test statistic before the start of the study.
4. Calculation of the required sample size before the start of the study.
5. Conduct of the study in compliance with the required sample size.
6. Calculation of the test statistic and comparison with a critical value of the test statistic, or comparison of the p-value with the specified type I error (after the study).
7. Decision: if p≤α, the null hypothesis is rejected; if p>α, the null hypothesis is not rejected (after the study).

If steps 1–7 are not complied with, the decision-theoretical procedure of hypothesis testing loses its validity. The decision (7^{th} step) must be applied consistently. If, for example, α=0.05 was specified and p=0.07 was obtained at the end of the study, then according to Neyman & Pearson one cannot speak of a “significance trend” or the like, but only state that the null hypothesis was not rejected. Likewise, p-values ≤0.05 are not sub-categorized into e.g. p≤0.05*, p≤0.01** and p≤0.001*** according to Neyman & Pearson.

#### Conditions necessary for the correct interpretation of the p-value

Many introductory textbooks of biostatistics merely introduce the theory of significance testing, in which there are no sources of error other than random error. In the practice of empirical studies, however, this is an unrealistic assumption. Greenland et al. [9] rightly point out that a low p-value only signals that something may be wrong with the so-called statistical model. The statistical model consists of three components: the chosen test statistic, the chosen null hypothesis and the empirical study data.

In addition to the hypothesis that the low p-value represents evidence against the null hypothesis, the following alternative explanations need to be considered, all of which are related to the statistical model and thus influence the p-value:

- An unsuitable test statistic was applied.
- Selection bias into the study or selection bias during follow-up of study subjects occurred.
- The comparison between two samples is confounded (mixing of effects).
- There is information bias in the measurement of the variables in the study.

If the p-value is low, we can only conclude that something may be wrong with the statistical model; the p-value itself does not show what is wrong with the model. The inexperienced user of the significance test takes a low p-value only as an indication that the null hypothesis might be wrong. In addition to the context dependence of the meaning of low p-values explained by Fisher, the result of a significance test must therefore always be seen in the light of the complete statistical model.

### Summary

Fisher’s significance test is a different procedure than the Neyman & Pearson hypothesis test, a fact that is often ignored. While the significance test produces a p-value that, according to Fisher, should be interpreted context-dependently as a continuous measure of evidence against the null hypothesis, in the hypothesis test the p-value serves as a decision criterion, provided the necessary steps of the hypothesis test are followed. The significance test leads to the p-value, whose definition must contain three criteria: probability, the null hypothesis assumption, and the counterfactual element. P-values can be small for various reasons, and evidence against the null hypothesis is only one of several competing explanations in empirical studies.

### References

- 1.
- Gigerenzer G, Swijtink Z, Porter T, Daston L, Beatty J, Krüger L. The empire of chance. How probability changed science and everyday life. Cambridge: Cambridge University Press; 1989.
- 2.
- Amrhein V, Trafimow D, Greenland S. Inferential statistics as descriptive statistics: there is no replication crisis if we don't expect replication. PeerJ Preprints. 2018;6:e26857v4. DOI: 10.7287/peerj.preprints.26857v3
- 3.
- Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process, and purpose. Am Stat. 2016;70:129-33. DOI: 10.1080/00031305.2016.1154108
- 4.
- Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019 Mar;567(7748):305-307. DOI: 10.1038/d41586-019-00857-9
- 5.
- Cox DR. Principles of statistical inference. Cambridge: Cambridge University Press; 2006. DOI: 10.1017/CBO9780511813559
- 6.
- Student's t-distribution. In: Wikipedia. [accessed 2019 May 16]. Available from: https://en.wikipedia.org/wiki/Student%27s_t-distribution
- 7.
- Fisher RA. Statistical methods and scientific inference. Edinburgh: Oliver & Boyd; 1956.
- 8.
- Oakes MW. Statistical inference. Chichester: Wiley; 1986.
- 9.
- Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016 Apr;31(4):337-50. DOI: 10.1007/s10654-016-0149-3
- 10.
- Manly BFJ. Randomization, bootstrap and Monte Carlo methods in biology. London: Chapman & Hall; 1996. Randomization; p. 3-7.
- 11.
- Feinstein AR. Principles of medical statistics. Boca Raton: Chapman & Hall/CRC; 2002. Testing stochastic hypotheses; p. 190-1.