The ANOVA

The analysis of variance (ANOVA) is the most widely used method of statistical analysis of quantitative data. It calculates the probability that differences among the observed means could simply be due to chance. Every scientist should know how to use it.

It is closely related to Student’s t-test, but whereas the t-test is only suitable for comparing two treatment means, the ANOVA can be used both for comparing several means and in more complex situations.

The ANOVA partitions the total variation into a number of parts such as Treatment, Block, Error and Total, depending on the design of the experiment.

The one-way ANOVA is used for a single-factor between subjects design, i.e. for comparing two or more treatment means.

The two-way ANOVA “without interaction” is also used to compare treatment means, but these arise from a randomised block design where the experiment has been split up into a number of “mini-experiments”.

The two-way ANOVA “with interaction” is used for a design with two or more fixed-effects factors, known as a “factorial” design.

A mixed design (with and without interaction) is used for factorial designs in randomised blocks.

Residuals plots are used to assess the assumptions on which the ANOVA depends.

A scale transformation may be needed if the assumptions are not met.

The one-way ANOVA

This type of ANOVA is done if there is a single factor such as “treatment” or “group”. There can be any number of levels. The calculations are usually done by computer, but the hypothetical data in the table below, with just two levels (treated and control), show how they would be done long-hand. It is not suggested that anyone should actually do the calculations long-hand, but many people want to understand how the calculations are done.

The aim in this case is to partition the total variation into parts associated with treatment and residual or error. Most computer packages would require the data to be presented with one line per individual with the observation and a code indicating the treatment group. The appropriate ANOVA command or menu button would then be invoked.

Treated	Control
10	7
12	10
11	8
13	9
Total 46	Total 34
Grand total	80

By longhand, calculate:
1. The correction factor, CF = (Grand total)²/N = 80²/8 = 800
2. The total sum of squares = sum(X²) - CF = 10²+12²+11²+13²+7²+10²+8²+9² - CF = 828 - 800 = 28
3. The treatment sum of squares = sum((treatment total)²/n) - CF = 46²/4 + 34²/4 - CF = 818 - 800 = 18

The results are set out in the ANOVA table as shown below. There are three sources of variation, Treatment, Error and Total. As there are two treatments (T = 2), there will be T-1 = 1 DF (degrees of freedom) for treatments, and as there are 8 observations in total there will be 7 total DF. The error DF is obtained by subtraction. The SS calculated above are then put in the table, and the error SS is obtained by subtraction (i.e. 28-18 = 10). The MS is calculated as the SS/DF. The F ratio is the treatment MS divided by the error MS, 18.0/1.67 = 10.8. The p-value needs to be looked up in tables, but a computer will give an actual value.

Analysis of variance table

Source DF SS MS F P
Treatment 1 18.00 18.00 10.80 <0.05
Error 6 10.00 1.67
Total 7 28.00
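
The longhand steps can also be checked with a few lines of code. The following is a minimal Python sketch using only the standard library; the variable names are illustrative and do not come from any statistical package. It reproduces the SS, DF, MS and F columns of the table but not the p-value, which would need the F distribution from tables or a statistics library.

```python
# One-way ANOVA by the longhand method, using the hypothetical data above.
groups = {
    "Treated": [10, 12, 11, 13],
    "Control": [7, 10, 8, 9],
}

observations = [x for values in groups.values() for x in values]
n_total = len(observations)

# Step 1: correction factor = (grand total)^2 / N
cf = sum(observations) ** 2 / n_total

# Step 2: total SS = sum of squared observations - CF
ss_total = sum(x ** 2 for x in observations) - cf

# Step 3: treatment SS = sum of (treatment total)^2 / n - CF
ss_treatment = sum(sum(v) ** 2 / len(v) for v in groups.values()) - cf

# Error SS and error DF are obtained by subtraction.
ss_error = ss_total - ss_treatment
df_treatment = len(groups) - 1
df_total = n_total - 1
df_error = df_total - df_treatment

ms_treatment = ss_treatment / df_treatment
ms_error = ss_error / df_error
f_ratio = ms_treatment / ms_error

print(f"Treatment DF={df_treatment} SS={ss_treatment:.2f} MS={ms_treatment:.2f} F={f_ratio:.2f}")
print(f"Error     DF={df_error} SS={ss_error:.2f} MS={ms_error:.2f}")
print(f"Total     DF={df_total} SS={ss_total:.2f}")
```

Because the groups are held in a dictionary, the same sketch works for any number of treatment groups and for unequal group sizes.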

From the p-value we would conclude that the null hypothesis that there is no difference between treatment means should be rejected at the 5% significance level, since p is less than 0.05.
Note that the error mean square 1.67 is an estimate of the pooled variance, so the pooled standard deviation will be the square root of this, 1.29. Note also that the proportion of the variation associated with treatments can be obtained from the Sums of Squares as 18/28 = 0.64 or 64%.
When there are just two groups the ANOVA is mathematically identical to Student’s two-sample t-test, but the ANOVA can also be used where there are more than two treatment groups, for which the t-test is not appropriate. The ANOVA can also cope with more than one factor, such as the data arising from a randomised block or factorial experimental design.
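
The link between the two tests can be illustrated numerically. The sketch below is only an illustration with made-up variable names: it computes the pooled standard deviation and the two-sample t statistic for the hypothetical treated and control data, and shows that the square of t equals the F ratio from the one-way ANOVA.

```python
import math
from statistics import mean, variance

treated = [10, 12, 11, 13]
control = [7, 10, 8, 9]
n1, n2 = len(treated), len(control)

# Pooled variance: the two sample variances weighted by their DF.
pooled_var = ((n1 - 1) * variance(treated) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
pooled_sd = math.sqrt(pooled_var)

# Two-sample t statistic based on the pooled standard deviation.
t_stat = (mean(treated) - mean(control)) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))

print(f"pooled variance = {pooled_var:.2f}, pooled SD = {pooled_sd:.2f}")
print(f"t = {t_stat:.3f}, t squared = {t_stat ** 2:.2f}")
```

The pooled variance is the same quantity as the error mean square in the ANOVA table, and t squared matches the F ratio of 10.8.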

Assumptions
The ANOVA and Student’s t-test are so-called “parametric” tests. They depend on the assumptions 1) that the observations are independent, 2) that the residuals (deviations from group means) have a normal distribution, and 3) that the variation is the same in each group. These last two assumptions should always be examined by studying the residuals.

Residuals plots are given by most good computer programs. An example of such a plot is given below (see Residuals plots). However, the ANOVA is quite robust against some deviation from assumptions 2 and 3. Assumption 1 depends on correct experimental design, and in particular on the correct identification of the experimental unit and appropriate randomisation.
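
To make the idea of a residual concrete, the following sketch computes the residuals and the within-group variances for the hypothetical treated and control data from the one-way example. It is an illustration only: with four observations per group there are far too few points to judge normality or equality of variation, and the variable names are made up for the example.

```python
from statistics import mean, variance

groups = {"Treated": [10, 12, 11, 13], "Control": [7, 10, 8, 9]}

residuals = {}
for name, values in groups.items():
    fitted = mean(values)  # the fitted value for every animal is its group mean
    residuals[name] = [x - fitted for x in values]
    print(name, "fitted value:", fitted)
    print(name, "residuals:", residuals[name])
    print(name, "within-group variance:", round(variance(values), 3))
```

A statistical package would plot these residuals against the fitted values and on a normal probability scale rather than listing them.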

The two-way ANOVA without interaction

Randomised block and within-subject designs involve two factors: a treatment and a “block” or “animal”. The treatment is known as a “fixed effect” because it is repeatable, i.e. it is possible to repeat “treatment A” at different times and on different animals. However, a block is a “random effect”. It is not possible to repeat “block 1” or “animal 1” on different occasions. Block means are therefore random variables.

The two-way ANOVA without interaction partitions the total variation into parts associated with treatments, blocks and error. This is illustrated using the hypothetical data given below. Suppose the aim of the experiment was to compare three diets A, B and C on growth rate in mice, but the available mice vary substantially in weight. They were therefore grouped according to weight, with the heaviest three being assigned to block 1, the next three to block 2 and the lightest three to block 3. Within each block the animals were then assigned at random to one of the diets. The results (say g/week) are shown in the table. The ANOVA would be calculated long-hand in exactly the same way as the one-way ANOVA.

Treatment	Block 1	Block 2	Block 3	Treatment totals
A	10	12	7	29
B	7	8	5	20
C	5	6	4	15
Block totals	22	26	16	Grand total 64

Thus, the correction factor CF would be 64²/9,
the total sum of squares will be the sum of each number squared, minus the CF, i.e. (10²+7²+5²+…+4²) - CF,
the treatment sum of squares will be the sum of the (treatment totals)²/nt, minus the CF, where nt is the number of observations making up each treatment total, i.e. (29²+20²+15²)/3 - CF, and the blocks sum of squares will be the sum of the (block totals)²/nb, minus the CF, where nb is the number of observations making up each block total, i.e. (22²+26²+16²)/3 - CF.
The error sum of squares is obtained by subtraction.

The two-way ANOVA table for these data (calculated by computer) is as follows:

Analysis of Variance for Observations
Source DF SS MS F P
Block 2 16.889 8.444 13.82 0.016
Treatment 2 33.556 16.778 27.45 0.005
Error 4 2.444 0.611
Total 8 52.889
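
The same longhand method can be sketched in Python for the randomised block data. This is a minimal illustration with made-up variable names, not the output of any package, and it stops at the F ratios because the p-values need the F distribution.

```python
# Two-way ANOVA without interaction for the hypothetical diet data.
# Each row is a diet (treatment); the three columns are blocks 1 to 3.
data = {
    "A": [10, 12, 7],
    "B": [7, 8, 5],
    "C": [5, 6, 4],
}
n_treatments = len(data)
n_blocks = 3

observations = [x for row in data.values() for x in row]
n_total = len(observations)

cf = sum(observations) ** 2 / n_total
ss_total = sum(x ** 2 for x in observations) - cf

treatment_totals = [sum(row) for row in data.values()]
block_totals = [sum(row[j] for row in data.values()) for j in range(n_blocks)]

# Each treatment total is made up of n_blocks observations,
# and each block total of n_treatments observations.
ss_treatment = sum(t ** 2 for t in treatment_totals) / n_blocks - cf
ss_block = sum(b ** 2 for b in block_totals) / n_treatments - cf
ss_error = ss_total - ss_treatment - ss_block

df_treatment = n_treatments - 1
df_block = n_blocks - 1
df_error = (n_total - 1) - df_treatment - df_block

ms_error = ss_error / df_error
f_block = (ss_block / df_block) / ms_error
f_treatment = (ss_treatment / df_treatment) / ms_error

print(f"Block     DF={df_block} SS={ss_block:.3f} F={f_block:.2f}")
print(f"Treatment DF={df_treatment} SS={ss_treatment:.3f} F={f_treatment:.2f}")
print(f"Error     DF={df_error} SS={ss_error:.3f} MS={ms_error:.3f}")
print(f"Total     DF={n_total - 1} SS={ss_total:.3f}")
```

The printed values agree with the computer-generated table above to the number of decimal places shown.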

From these results it is clear that there were large block differences (p=0.016), implying that blocking was worthwhile: it will have improved the power of the experiment. There were statistically significant treatment effects i.e. the null hypothesis that there are no differences among the three treatment means is rejected at p=0.005. However, the analysis does not indicate whether all means are different, or just some of them. This will require post-hoc comparisons. These are available in most computer packages. Dunnett’s test will indicate which means differ from a control mean. Tukey’s test will show which means differ from each other. These tests are not considered in any more detail here. They should be explained in the literature accompanying the statistical package.

The pooled within-group variance is 0.611, so the pooled standard deviation is the square root of this, 0.78, and the proportion of the variation associated with differences between treatments is 33.556/52.889 = 63.4%.

The two-way or N-way ANOVA with interaction

This analysis is used when there are two or more fixed-effect factors. Usually the aim is to see whether these interact or “potentiate” each other. Consider the hypothetical data in the table below. This is a 2×2 factorial. The aim would be to see if the strains differ in response to some treatment.

Strain	Control	Treated
A	5	5
A	6	7
A	5	8
B	3	9
B	4	7
B	5	9

The calculations will normally be done by computer, but they are done long-hand below to show how they are done.

By hand calculate first the Correction Factor CF= GT²/N, the grand total squared, divided by the number of observations (73×73/12=444.0833).

The total SSQ would be the sum of each number squared minus the CF (485-444.0833=40.9167).

The strain SSQ is the sum of the strain totals squared, divided by the number of observations on each strain, minus the CF, which is (36²+37²)/6-CF = 444.1667-CF = 0.0834.

The treatment SSQ is the sum of the treatment totals squared, divided by the number in each, minus the CF, which is (28²+45²)/6-CF = 468.1667-CF = 24.0834.

Finally, the interaction SSQ is the sum of the treatment by strain sub-group totals squared and divided by the number in each group, minus the CF, minus the strain SSQ, minus the treatment SSQ. This is {(16²+20²+12²+25²)/3}-CF-Strain SSQ-Treatment SSQ, which is 475.0000-444.0833-0.0834-24.0834 = 6.7499.

The ANOVA table is shown below, rounding the numbers to two decimal places. The p-values were calculated using MINITAB rather than looked up in a table.

Analysis of Variance for Observations
Source DF SS MS F P
Strain 1 0.08 0.08 0.07 0.803
Treatment 1 24.08 24.08 19.27 0.002
Strain x Trt 1 6.75 6.75 5.40 0.049
Error 8 10.00 1.25
Total 11 40.92
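
The factorial calculation can be sketched in the same longhand style. The Python below is an illustration only: the helper `marginal_ss` and the variable names are invented for this example. Working at full precision gives an interaction SS of 6.75, where the hand calculation gives 6.7499 because of rounding.

```python
# 2x2 factorial ANOVA for the hypothetical strain by treatment data.
cells = {
    ("A", "Control"): [5, 6, 5],
    ("A", "Treated"): [5, 7, 8],
    ("B", "Control"): [3, 4, 5],
    ("B", "Treated"): [9, 7, 9],
}

observations = [x for values in cells.values() for x in values]
n_total = len(observations)
cf = sum(observations) ** 2 / n_total
ss_total = sum(x ** 2 for x in observations) - cf


def marginal_ss(index):
    """SS for one factor: sum of (level total)^2 / (n in level), minus CF."""
    totals, counts = {}, {}
    for key, values in cells.items():
        level = key[index]
        totals[level] = totals.get(level, 0) + sum(values)
        counts[level] = counts.get(level, 0) + len(values)
    return sum(totals[level] ** 2 / counts[level] for level in totals) - cf


ss_strain = marginal_ss(0)
ss_treatment = marginal_ss(1)

# SS among the four strain by treatment sub-groups, then the interaction
# as what remains after removing the two main effects.
ss_cells = sum(sum(v) ** 2 / len(v) for v in cells.values()) - cf
ss_interaction = ss_cells - ss_strain - ss_treatment
ss_error = ss_total - ss_cells

df_error = n_total - len(cells)
ms_error = ss_error / df_error

# Each effect has 1 DF here, so its MS equals its SS.
for name, ss in [("Strain", ss_strain), ("Treatment", ss_treatment),
                 ("Strain x Trt", ss_interaction)]:
    print(f"{name:13s} SS={ss:.2f} F={ss / ms_error:.2f}")
print(f"{'Error':13s} SS={ss_error:.2f} DF={df_error} MS={ms_error:.2f}")
```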

From this analysis the null hypothesis that the response to treatment is independent of strain would be rejected at the 5% level of significance, as the p-value for the interaction is 0.049.

The cell, strain and treatment means are shown in the table below. Note that the statistically significant interaction arises because strain B controls had lower values than strain A controls, whereas strain B treated animals had higher values than strain A treated animals.

Strain	Control	Treated	Means
A	5.33	6.67	6.00
B	4.00	8.33	6.17
Means	4.67	7.50
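
One way to see the interaction in these means is to compare the response to treatment (treated mean minus control mean) within each strain. The short sketch below does this for the hypothetical data; the names `response` and `difference_in_response` are illustrative only.

```python
from statistics import mean

cells = {
    ("A", "Control"): [5, 6, 5],
    ("A", "Treated"): [5, 7, 8],
    ("B", "Control"): [3, 4, 5],
    ("B", "Treated"): [9, 7, 9],
}

cell_means = {key: mean(values) for key, values in cells.items()}

# Response to treatment within each strain.
response = {
    strain: cell_means[(strain, "Treated")] - cell_means[(strain, "Control")]
    for strain in ("A", "B")
}

# With no interaction the two responses would be similar.
difference_in_response = response["B"] - response["A"]

for strain, value in response.items():
    print(f"Strain {strain}: response to treatment = {value:.2f}")
print(f"Difference in response (B - A) = {difference_in_response:.2f}")
```

Strain A responds by about 1.33 units and strain B by about 4.33 units, and it is this difference of about 3 units that the interaction term tests.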

The error MS is the pooled within group variance (1.25) so the standard deviation is the square root of this, which is 1.12. This would be used in the presentation of the results. Standard deviations worked out within each cell would be unreliable because they would only be based on three animals, so the pooled standard deviation would be the most appropriate estimate of the population standard deviation.

Higher factorial experiments

Suppose 16 animals were available, and the experimenter wanted to compare two strains (S), two sexes (X) and two treatments (T). This would be a 2x2x2 factorial with 8 different treatment combinations, and there could be two animals on each treatment combination. Such a design would be analysed in a very similar way to the one above, except that there would now be three “main effects” to be estimated (differences between the two strains, the two sexes and the two treatments, in each case averaging across the other factors), three two-way interactions (SxX, SxT and XxT) and one three-way interaction (SxXxT). There would be eight degrees of freedom to estimate the error, which, although a bit low according to the resource equation method of determining sample size (see Sample size button), is not too bad.

Suppose it was decided to add two diets to this experiment. This would then be a 2x2x2x2 factorial design. The problem would be that there would be no degrees of freedom left to estimate error because there would only be one animal on each of the 16 treatment combinations. This may not be an insuperable problem. High order interactions (3-way and 4-way or higher) usually turn out to be non-significant (i.e. the interaction SS is not very different in magnitude from the error SS). So it is possible to pool these and use them as the error term. High level factorials of this sort are widely used in industrial research, but are not discussed in further detail here.
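
The degrees of freedom bookkeeping for these designs can be sketched as follows. The function `factorial_df` is a hypothetical helper written for this illustration, and it assumes every factor has exactly two levels, so that each main effect and each interaction has 1 DF.

```python
from math import comb


def factorial_df(n_factors, n_animals, pool_from_order=None):
    """Effects per order and error DF when every factor has two levels.

    There are comb(n_factors, k) effects of order k (k=1 main effects,
    k=2 two-way interactions, and so on), each with 1 DF.
    """
    effects_by_order = {k: comb(n_factors, k) for k in range(1, n_factors + 1)}
    df_total = n_animals - 1
    df_error = df_total - sum(effects_by_order.values())
    if pool_from_order is not None:
        # Pool interactions of this order and higher into the error term.
        df_error += sum(n for k, n in effects_by_order.items() if k >= pool_from_order)
    return effects_by_order, df_error


# 2x2x2 design with 16 animals: 3 main effects, 3 two-way, 1 three-way.
print(factorial_df(3, 16))

# 2x2x2x2 design with 16 animals: nothing left for error.
print(factorial_df(4, 16))

# Same design with 3-way and 4-way interactions pooled as error.
print(factorial_df(4, 16, pool_from_order=3))
```

In the last case the four 3-way interactions and the single 4-way interaction together supply 5 DF for error.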

A mixed design (with and without interaction)

Experiments with both fixed-effect factors and random-effect factors present no particular problems. A numerical example of a 4 (strains) x 2 (treatments) factorial design done as a randomised block is given in the pages on Experimental Designs.

Residuals plots

These are used to examine the assumptions underlying the ANOVA, and will also show whether there are any serious outliers. An example of some residuals plots produced by MINITAB version 14 is given below. The data are the percent of liver cells staining positive following treatment of rats with a hormone.

The first, Normal Probability plot should give a straight line if the residuals have a normal distribution. In this case there is clearly some deviation from this ideal. The plot of Residuals Versus the Fitted Values (the fitted values are the group means) shows whether the variation is the same in each group. In this case small fitted values clearly vary less than large ones.

The Histogram of the Residuals should look like a Normal bell-shaped curve, but in this case it is slightly skewed with a few large values although with this sample size it is probably not very different from normal.

Finally, if the data are in order of collection, the plot of Residuals Versus the Order of the Data will show whether there is a trend. In this case the data have already been sorted by treatment and the order has been lost, so it is not possible to look for such a trend. It does, however, clearly show that the animals in the treated groups (nos 20-36) are more variable than the controls.

There is clear evidence in this case for heterogeneity of variance and lack of normality of the residuals, although this is not particularly extreme. However, the ANOVA is quite robust and is still valid if there are only slight departures from these two assumptions. Unfortunately, there is no general rule on how much of a departure from normality and homogeneity of variance there has to be to make a transformation of scale (see below) or the use of non-parametric methods (see button) necessary. In this case it would probably be sensible to transform the data (see below), but it is a borderline case.

Transformations of scale

Where the assumptions about normality of the residuals and homogeneity of variances are not met, the situation may be improved by a transformation of scale. Three transformations commonly used are:

The logarithmic transformation for skewed measurement data

Biological data often have a skewed distribution, particularly when the concentration of something is being measured. Concentrations cannot be less than zero, but often there are a few high values, as shown in the histogram in the Residuals Plots above. Taking logs (to any base, but usually to base 10) will often result in a better fit to the assumptions. All the statistical analyses would then be done on the logarithm of each observation, but in presenting the results, the means should be back-transformed by taking antilogs. However, the standard deviations cannot be treated in this way. If there are some numbers below one, negative numbers can be avoided by adding one before taking logs, and subtracting it again after taking the antilogs.
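
The transformation and back-transformation can be sketched as below. The concentrations are invented purely for illustration and are not data from this page. The back-transformed mean of the logs is the geometric mean, which is pulled up less by the few high values than the ordinary mean is.

```python
import math
from statistics import mean

# Hypothetical skewed concentrations (made up for this example).
concentrations = [0.4, 1.2, 1.5, 2.0, 2.4, 9.8]

# Analyse on the log scale, then back-transform the mean with an antilog.
logs = [math.log10(x) for x in concentrations]
back_transformed_mean = 10 ** mean(logs)

# Variant for data with values below one: add one before taking logs,
# and subtract it again after taking the antilog.
logs_plus_one = [math.log10(x + 1) for x in concentrations]
back_transformed_plus_one = 10 ** mean(logs_plus_one) - 1

print(f"arithmetic mean       = {mean(concentrations):.2f}")
print(f"back-transformed mean = {back_transformed_mean:.2f}")
print(f"with add-one variant  = {back_transformed_plus_one:.2f}")
```

Note that the value 0.4 gives a negative logarithm, whereas all the values are non-negative after adding one.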

The square root transformation for counts

Counts where the mean count is low (e.g. where a lot of the counts are 0, 1, 2 or 3) often have a Poisson distribution where the mean is equal to the variance. A square root transformation will normalise the residuals, i.e. each count should be replaced by its square root. Sometimes one is added to each number before taking square roots.

The logit transformation for percentages

Percentages where a large proportion of the values are either less than 20% or greater than 80% have a skewed distribution because it is not possible to have values of less than 0% or greater than 100%. The logit transformation, X=ln{p/(1-p)}, where p is the proportion, should correct this situation.
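
A minimal sketch of the logit transformation is given below. The percentages are invented for illustration, and the `inverse_logit` function is an addition not described in the text above: it is the standard inverse of the formula, included so that means can be returned to the proportion scale. The logit is undefined for proportions of exactly 0 or 1, so the sketch rejects them rather than guessing at an adjustment.

```python
import math


def logit(p):
    """Logit of a proportion p, where 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("logit is undefined for proportions of exactly 0 or 1")
    return math.log(p / (1 - p))


def inverse_logit(x):
    """Convert a value on the logit scale back to a proportion."""
    return 1 / (1 + math.exp(-x))


# Hypothetical percentages of cells staining positive (made up).
percentages = [2, 5, 8, 12, 15, 90]
logits = [logit(pct / 100) for pct in percentages]

mean_logit = sum(logits) / len(logits)
print("logits:", [round(x, 2) for x in logits])
print(f"back-transformed mean = {100 * inverse_logit(mean_logit):.1f}%")
```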