Randomized Block Experimental Designs Can Increase the Power and Reproducibility of Laboratory Animal Experiments

Abstract

Randomized block experimental designs have been widely used in agricultural and industrial research for many decades. Usually they are more powerful, have higher external validity, are less subject to bias, and produce more reproducible results than the completely randomized designs typically used in research involving laboratory animals. Reproducibility can be further increased by using time as a blocking factor. These benefits can be achieved at no extra cost. A small experiment investigating the effect of an antioxidant on the activity of a liver enzyme in four inbred mouse strains, which had two replications (blocks) separated by a period of two months, illustrates this approach. The failure to use these designs more widely in research involving laboratory animals has probably led to a substantial waste of animals, money, and scientific resources and has slowed the development of new treatments for human and animal diseases.

Keywords: experimental design, randomized block, repeatability, animal experiments, reproducibility

Introduction

A fundamental assumption in experimental biology is that if an experiment is well designed, correctly executed, properly analysed, and adequately documented, the results should be reproducible, apart from the occasional type I error (false positive) associated with the chosen significance level. However, several recent publications have found that excessive numbers of animal experiments are unreproducible (i.e., the results could not be repeated by different investigators). For example, Begley and Ellis (2012) attempted to repeat 53 landmark experiments concerned with cancer research but were only able to do so with six of them. In some cases, the original authors were unable to repeat their own experiments. In another paper, investigators (Scott et al. 2008) noted that there were more than 50 reports of drugs that alleviated the symptoms of amyotrophic lateral sclerosis (ALS) in the standard transgenic mouse model of this disease, but only one had any effect in humans. A detailed study showed that there are a number of confounding factors that need to be controlled when using this model. The authors devised a better protocol, which controlled these confounding factors, and then rescreened these 50 drugs plus another 20. They found that none of the drugs was effective in the mouse model. Similarly, Prinz and colleagues (2011) were able to reproduce only 20 to 25% of the results of 67 published studies and claimed that, within the pharmaceutical industry, it is accepted anecdotally that less than half of academic papers give reproducible results.

In many cases, lack of repeatability is due to failure to adhere to some of the most basic requirements of good experimental design. A survey of 271 animal experiments showed that 87% of papers did not report randomization of treatments to subjects (although this does not necessarily mean that it was not done), and 86% did not report blinding in situations where it would be appropriate (Kilkenny et al. 2009). Such failures can lead to biased results and unrepeatable (in the same laboratory) or unreproducible (in a different laboratory) experiments. Lack of reproducibility in other laboratories may also be caused by treatment x environment interactions. For example, animal houses may differ in the physical environment, management, or microflora in such a way as to alter the relative treatment differences. Results may also be unrepeatable or unreproducible because the wrong strain of animals was used. There is no effective genetic quality control of outbred stocks. In one study, for example, the investigators obtained 26 weekly samples of 30 Sprague-Dawley rats from a commercial supplier and tested them for response to a synthetic polypeptide, a response controlled by a single gene in the major histocompatibility complex (Simonian et al. 1968). On average, about 80% of the rats were responders and, for the first 12 weeks, the percentage of responders in each sample varied about this mean. However, in weeks 13, 17, 18, 19, and 20, only about 5% of the rats were responders. These rats cannot have come from the same colony and may have responded differently to other experimental treatments, but there was no indication from the breeder that different rats had been supplied. There have been other examples of the wrong animals being supplied by commercial breeders (Festing 1982). And, of course, any single experiment has a 5% chance of getting a false positive result due to statistical sampling, assuming a 5% significance level is chosen. 
Finally, some results may be unreproducible because the authors detected serious errors and later withdrew the paper, unknown to other investigators, or because the paper is fraudulent (Steen 2011). For all these reasons, it is legitimate to require evidence that the results of important experiments are reproducible. However, repeating experiments is time consuming and expensive both financially and in the use of animals. Because the work is not new, it may also be difficult to obtain funding and get the results published. An alternative could be to design better experiments with built-in repeatability. This could be done using randomized block designs, with the blocking factor being time (i.e., the experiment is split up over a period of hours, days, or months).

Randomized Block Experimental Designs

The “randomized block” (RB) design is a generic name for a family of experimental designs in which the experimental material is split up into a number of “mini-experiments” that are recombined in the final statistical analysis. Typically, each treatment is assigned to a single experimental unit within each block (although there can be more than one). The family includes crossover designs, within-subjects designs, matched designs, and Latin square designs. These have a number of useful properties and should be more widely used in research involving laboratory animals. They can be used to:

  1. Spread the experiment over a period of time and/or space. If there are, say, four treatments and the cage of animals is the experimental unit, each block will consist of four cages. Block 1 might be started this week, block 2 next week, and so on. If six blocks are needed, the experiment will extend over a 6-week period. Each block may involve a different batch of animals, which could be of a slightly different age or weight, with possible differences in the batches of diet or in its age since manufacture. Cages may be placed at different levels in a rack. None of these variables is of interest, so they are removed as a block effect in the statistical analysis. If the relative differences among treatments change from block to block, treatment effects will be statistically less significant. If the relative values remain unchanged, this implies a good level of repeatability, with real treatment effects being more likely to be detected.

  2. Increase the power of an experiment by matching the experimental units in each block, say, on age, weight, or location in the animal house. This means that powerful experiments can often be done even though the experimental units are somewhat heterogeneous, providing that matching is possible. This is particularly important with large experiments in which it is often difficult to obtain a sufficiently homogeneous group of animals.

  3. Take account of material which has a natural structure. Within-litter experiments are an obvious example.

  4. Split the experiment up into smaller bits in order to make it more manageable. This would be useful with large experiments and should help to minimize measurement errors because the work can be done under less time pressure.

  5. Increase the external validity of an experiment because each block samples a different environment and/or time period.
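As a concrete illustration of the within-block randomization described in point 1 (the treatment labels and block count here are hypothetical), each block receives every treatment once, in a freshly randomized order:

```python
import random

def randomize_blocks(treatments, n_blocks, seed=None):
    """Assign each treatment once per block, in a fresh random order per block."""
    rng = random.Random(seed)
    layout = []
    for block in range(1, n_blocks + 1):
        order = rng.sample(treatments, k=len(treatments))  # a random permutation
        layout.append({"block": block, "order": order})
    return layout

# Example: 4 treatments in 6 weekly blocks (one cage per treatment per block)
for row in randomize_blocks(["A", "B", "C", "D"], n_blocks=6, seed=1):
    print(row["block"], row["order"])
```

Each printed line is one week's mini-experiment; the final analysis recombines all six.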

Sample Size Determination

Either a power analysis or, for smaller, more fundamental studies, the resource equation method can be used to set sample size. A power analysis requires an estimate of the standard deviation (for measurement variables). As the magnitude of the likely block effect and the standard deviation are usually unknown, the estimate will probably have to come from an unblocked (completely randomized) experiment; the payoff is then taken as increased power (experience has shown that RB designs are nearly always more powerful than completely randomized designs of the same size, with the possible exception of very small experiments). When similar blocked experiments are done frequently, the standard deviation estimated from them can be used in future power analyses to estimate the sample size.
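As an illustration of the kind of calculation involved, the normal-approximation formula underlying most two-group power analyses can be sketched as follows (a minimal sketch, not a substitute for dedicated power software; the difference of 100 units in the second call is a hypothetical effect size):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(sigma, delta, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-group comparison
    of means: n = 2 * (sigma * (z_{1-alpha/2} + z_{power}) / delta) ** 2."""
    z = NormalDist().inv_cdf
    return ceil(2 * (sigma * (z(1 - alpha / 2) + z(power)) / delta) ** 2)

# Detecting a difference of one standard deviation (delta = sigma):
print(n_per_group(sigma=1.0, delta=1.0))   # 16 per group
# With the example experiment's SD of 54.5 and a hypothetical difference of 100 units:
print(n_per_group(sigma=54.5, delta=100))  # 5 per group
```

For a blocked design, sigma should be the within-block standard deviation, which is why an estimate from a comparable blocked experiment is preferable when one exists.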

The resource equation method (Mead 1988) is E = (total number of experimental units) − (number of treatments), where E, which is essentially the error degrees of freedom, should be between about 10 and 20, but with some leeway. It depends on the law of diminishing returns and aims to ensure that there is an adequate estimate of the error variance. It is particularly useful for small fundamental studies and for more complex designs with many treatments, such as factorial designs, and where there is no estimate of the standard deviation, which prevents the use of a power analysis.
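In its fuller form, E is the error degrees of freedom, (N − 1) − (treatment df) − (block df), which reduces to approximately N − T; a one-line function makes this explicit (the 16/8/2 figures are those of the example experiment below):

```python
def resource_E(n_units, n_treatment_groups, n_blocks=1):
    """Error degrees of freedom: E = (N - 1) - (T - 1) - (B - 1).
    Mead suggests E between about 10 and 20."""
    return (n_units - 1) - (n_treatment_groups - 1) - (n_blocks - 1)

# The example experiment: 16 mice, 4 strains x 2 treatments = 8 groups, 2 blocks
print(resource_E(16, 8, 2))  # 7 -- a little small by this criterion
```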

The Statistical Analysis of Randomized Block Designs

In an RB design, typically each observation can be classified by two factors, one “fixed” (usually called the treatment, which is deliberately varied and is of scientific interest) and the other “random” (which may be called a “block” or replicate), which is of no scientific interest but which could cause noise if not removed in the statistical analysis. There can be any number of replications (blocks). Randomization is done separately within each block. An individual observation is made up of a grand mean, a deviation due to the treatment it receives, a deviation due to the block, and an individual deviation:

Yij = μ + ti + bj + eij

where Yij is the individual observation, μ represents the overall mean, ti represents a deviation due to the ith treatment, bj represents a deviation due to the jth block, and eij represents a random deviation associated with the individual experimental unit. Here, i = 1 … t, where t is the number of treatments, and j = 1 … b, where b is the number of blocks.

Assuming a single treatment factor and a single block factor, the experiment is analyzed by a two-way analysis of variance without interaction. However, a factorial treatment structure can also be used (as in the example below), so there can be two or more treatment factors. A Latin square design has two blocking factors, often designated “rows” and “columns.”

Note that with these designs no two observations come from the same block and treatment combination, so the estimate of the standard deviation is obtained as the square root of the error mean square in the analysis of variance.

Randomized block designs can sometimes have within-block replication (e.g., two or more experimental units per block and treatment combination), but this is not discussed here.

An Example

The example given below comes from a series of studies that were aimed at exploring the effect of antioxidants on susceptibility to cancer. In this example, diallyl sulphide (DS), a substance found in garlic, was administered by gavage in three daily doses of 0.2 mg/g to 8-week-old female mice of four inbred strains, and the activity of a number of liver enzymes was compared in treated and vehicle-treated controls. This work was carried out in 1993 as part of MAFF Project FS1710 entitled “Mechanisms of modulation of carcinogens by antioxidants: genetic control of the anticarcinogenic response in mice.” The work was done under UK legislation, and all animals were humanely euthanized as directed under the Animals (Scientific Procedures) Act 1986. Were this an original research paper rather than an example, full details would need to be given according to the Animal Research: Reporting of In Vivo Experiments (ARRIVE) guidelines (Kilkenny and Altman 2010).

The purpose of the experiment was to assess whether the activity of the liver enzymes was altered by the DS treatment and to see if there were any important strain differences in response. Altogether there were eight treatment combinations: 4 inbred strains × 2 treatments in a factorial arrangement, in two blocks. Both treatment and strain were regarded as fixed effects, block being a random effect. Note that strain is a classification that cannot be randomized. So in this experiment the only randomization was the decision of which of two mice of each strain within a block would receive the treatment and which would be the control. However, the order in which the animals were sacrificed in each block was randomized.

The work was started when the MRC Toxicology Unit was being relocated from South London to Leicester. The new animal house was not yet ready, and the first block of the experiment was done with the mice housed in a plastic film isolator. The second block was done approximately two months later with the animals housed in one of the new animal rooms, so the two blocks had different environmental conditions although these were not quantified. The determinations of enzyme activity were done separately for each block using freshly made up solutions.

The raw data showing the activity of one of the liver enzymes, glutathione-S-transferase (Gst), assayed using the chlorodinitrobenzene (CDNB) method, is shown in Table 1. The units are nmol conjugate formed per minute per mg of protein.

Table 1

Gst levels (nmol conjugate formed per minute per mg of protein) in individual mice in an RB experiment in two blocks separated by approximately two months. Note that all Block 2 values are higher than the corresponding Block 1 values

Strain     Treatment   Block 1   Block 2
NIH        Control     444       764
NIH        Treated     614       831
BALB/c     Control     423       586
BALB/c     Treated     625       782
A/J        Control     408       609
A/J        Treated     856       1002
129/Ola    Control     447       606
129/Ola    Treated     719       766
Block means            567       743

Treatment: Control = vehicle only; Treated = diallyl sulphide (DS) by gavage.


There was a large block effect, with every Block 1 value being lower than the corresponding Block 2 value. Why was this? The protocols were identical. It could have been due to slight differences in the calibration of instruments or minor differences in the reagents and solutions used to assess the enzyme activity. Possibly the animals supplied were of a slightly different age, had a different microflora, or were on a different batch of diet, or perhaps the different environment of isolator versus animal room altered their response. There are many variables that can influence such results, and it is impossible to identify or control them all. What is important is the relative magnitude of each observation. This was maintained, as shown by the strong correlation of 0.88 between the two blocks, shown in Figure 1. Large block effects are common in this type of design. They highlight the importance of concurrent controls and randomization, as well as the danger of using historical data, where differences of the sort seen here between blocks might be mistaken for treatment effects.

Figure 1

Plot of Gst levels in Block 1 versus Block 2 for the randomized block experiment. The correlation between the blocks, r = 0.88, is large and statistically highly significant (p < 0.01).


The analysis of variance (ANOVA; Table 2) shows a large treatment effect, no significant difference between strains (p = 0.091), but some evidence of a strain by treatment interaction (p = 0.028). When a significant two-way interaction is observed, the individual means need to be examined separately. These are shown graphically in Figure 2, with strain NIH being less responsive to treatment with DS than the other strains. The residual mean square (2957) provides an estimate of the pooled within-group variance, so the standard deviation is about 54.4 units. The experiment is a bit small according to the resource equation, with E = 7, but power has probably been increased by using inbred strains and the randomized block design.

Table 2

The analysis of variance of the data shown in Table 1. By convention, p values of less than 0.05 are considered “statistically significant,” so there is no evidence of strain differences in mean Gst levels but some evidence of strain differences in response to treatment (but also see text)

Source               Df   Sum Sq    Mean Sq   F value   p value
Blocks                1   124256    124256    42.01     <0.001
Strain                3    28613      9538     3.22      0.091
Treatment             1   227529    227529    76.93     <0.001
Strain × treatment    3    49590     16530     5.58      0.028
Residuals             7    20701      2957
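For readers who wish to check the arithmetic, the Table 2 sums of squares can be reproduced from the raw Table 1 data with a short script (standard library only; the control/treated labelling of the rows is inferred from the text):

```python
# Reproducing the Table 2 ANOVA from the raw Table 1 data: a strain x treatment
# factorial in two blocks with one observation per cell.
from math import sqrt

y = {  # (strain, treatment, block) -> Gst level
    ("NIH", "C", 1): 444, ("NIH", "C", 2): 764,
    ("NIH", "T", 1): 614, ("NIH", "T", 2): 831,
    ("BALB/c", "C", 1): 423, ("BALB/c", "C", 2): 586,
    ("BALB/c", "T", 1): 625, ("BALB/c", "T", 2): 782,
    ("A/J", "C", 1): 408, ("A/J", "C", 2): 609,
    ("A/J", "T", 1): 856, ("A/J", "T", 2): 1002,
    ("129/Ola", "C", 1): 447, ("129/Ola", "C", 2): 606,
    ("129/Ola", "T", 1): 719, ("129/Ola", "T", 2): 766,
}
N, G = len(y), sum(y.values())
CF = G * G / N  # correction factor

def ss(positions):
    """Sum of squares for the totals classified by the given key positions."""
    totals = {}
    for key, val in y.items():
        k = tuple(key[p] for p in positions)
        totals[k] = totals.get(k, 0) + val
    per_total = N // len(totals)  # balanced design: equal counts per total
    return sum(t * t for t in totals.values()) / per_total - CF

ss_block, ss_strain, ss_treat = ss([2]), ss([0]), ss([1])
ss_inter = ss([0, 1]) - ss_strain - ss_treat
ss_resid = sum(v * v for v in y.values()) - CF - ss_block - ss_strain - ss_treat - ss_inter
ms_resid = ss_resid / 7  # residual df = 15 - 1 - 3 - 1 - 3 = 7

print(round(ms_resid))                      # 2957, as quoted in the text
print(round(sqrt(ms_resid), 1))             # 54.4, the pooled standard deviation
print(round(ss_block / ms_resid, 1))        # F for blocks (1 df), about 42.0
print(round((ss_inter / 3) / ms_resid, 1))  # F for the interaction, about 5.6
```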


Figure 2

Bar plot showing Gst levels in control (left) and treated (right) means for each mouse strain. Each bar is the mean of two observations taken approximately two months apart. Error bars are ± the least significant difference (α = 0.05) so if they overlap there is no significant difference between the means, and if they don't there is a significant difference (p < 0.05). All bars are the same length because sample sizes are identical and a pooled standard deviation has been used. Note that the control values are reasonably similar, but there are slight strain differences in response (strain × treatment interaction, p < 0.05 with strain A/J responding most, and NIH not significantly responding). But see text for discussion.


An ANOVA makes three assumptions about the data: (1) the experimental units are independent (i.e., the treatments have been individually assigned to the experimental units by randomization); (2) the residuals (the deviation of each observation from its group mean) have a normal distribution; and (3) the variances are homogeneous. The first assumption is met in this case by randomization. It is good practice to investigate the second and third assumptions using residual diagnostic plots, as shown in Figure 3. The top plot is used to study the homogeneity of variances: there should be a scattering of points with no pattern, as is the case here. The lower plot is a normal probability plot of the residuals; if they have a normal distribution, the points should lie on a straight line. In this case, a few points lie off the line, showing that the assumption of a normal distribution of the residuals is slightly marginal. The ANOVA is quite robust to deviations from these assumptions, but a transformation of the scale of measurement can often be tried if there is a serious deviation from normality, to see if the fit is better. In this case, a log transformation of the raw data provides a slightly better fit (not shown). The analysis of variance of the log-transformed data differs in that there is no statistically significant strain × treatment interaction. It is a matter of judgment whether to use the raw or the log-transformed scale, but in this case it makes little difference to the conclusions. There is clearly a strong treatment effect with no large strain differences, and even if the strains differed slightly in response, a difference of that magnitude is unlikely to be of much biological significance. It is also worth noting that the ANOVA provides a single overall test of each of the treatment, strain, and interaction effects; that, however, amounts to three tests.
Applying a Bonferroni correction would imply that a p value of 0.05/3 = 0.017 should be used, in which case the interaction would not be “significant.” But it is the overall interpretation that is important. That rarely depends on such minute details of the statistical analysis. Even a well-designed experiment will give slightly different results if it is repeated. If it were important to find strain differences in Gst levels or in the Gst response to chemicals, which was not a purpose in this case, then further work would be needed with a different experimental design and a wider choice of strains.
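The Bonferroni arithmetic is trivial but worth making explicit:

```python
alpha, n_tests = 0.05, 3  # treatment, strain, and interaction tests
adjusted = alpha / n_tests
print(round(adjusted, 4))  # 0.0167
print(0.028 < adjusted)    # False: the interaction no longer reaches "significance"
```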

Figure 3

Residuals diagnostic plots for the example experiment using the raw data. The top plot of residuals as a function of fitted values is used to assess homogeneity of variances. If these are homogeneous, there should be a scattering of points with no pattern, as is the case here. The lower plot is a normal Q-Q plot and should give a straight-line fit to the points if the residuals have a normal distribution. In this case, there is some deviation from that ideal associated with points 1, 8 and 9 (see text for discussion).


Discussion

The purpose of this paper is to bring randomized block experimental designs to the attention of scientists using laboratory animals. Because of their valuable properties, these designs have been widely used in agricultural and industrial research for many decades. Mead (1988), who has experience in medicine, agriculture, and industry, complains that about 85% of all experiments are randomized complete block designs and suggests that investigators should be more flexible in their choice of design. Yet these designs are rarely used in experiments involving laboratory animals. This cannot be because they are unsuitable: there is nothing about research with laboratory animals that sets it apart from all other disciplines. The reason must be that scientists are unfamiliar with these designs. Mead also says that a statistician should be fully involved with a research scientist in designing his or her experiments and that “this is the only efficient approach to designing experiments.” Yet in the last 50 years few statisticians of stature have been closely involved (e.g., by publishing papers or books) in this area of research. The failure over such a long period to use the most efficient designs must surely have led to a serious waste of animals, time, and other scientific resources.

The example experiment, to determine the effect of DS on Gst levels, could have been done on a single occasion using 16 outbred CD-1 mice assigned at random to the two treatments. But it would have given no indication of whether there was a genetic component to the response, nor of whether the results were repeatable. The good agreement between the two blocks, separated by two months and conducted in different animal house environments, suggests that the results are likely to be robust. There was no evidence of strain differences in Gst levels, although there was a hint of strain differences in the treatment response, depending on the scale of measurement. Altogether, the randomized block design gave extra information and had higher external validity at virtually no extra cost, with some assurance that the results should be reproducible.

Conclusions

Randomized block experimental designs include within-subject, crossover, and matched designs in which the experimental material is split up into a number of mini-experiments that are combined in the statistical analysis. They are widely used in many other research disciplines and, because of their useful properties, should be more widely used in laboratory animal research. They can be more convenient, more powerful, and can use fewer animals than completely randomized designs. If the blocks are separated in time and there is good agreement between them, then this gives some assurance that the experiment is reproducible. Their more widespread use would save money, animals, and other scientific resources and would speed up the development of new treatments for diseases of humans and animals.

References

Festing (1982). Genetic contamination of laboratory animal colonies: an increasingly serious problem. ILAR News 25:6-10.

Kilkenny et al. (2009). Survey of the quality of experimental design, statistical analysis and reporting of research using animals. PLoS One 4:e7824.

Mead (1988). The Design of Experiments. New York: Cambridge University Press.

Prinz et al. (2011). Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov 10:712.

Scott et al. (2008). Design, power, and interpretation of studies in the standard murine model of ALS. Amyotroph Lateral Scler 9:4-15.

Simonian et al. (1968). Studies on synthetic polypeptide antigens. XX. Genetic control of the antibody response in the rat to structurally different synthetic polypeptide antigens. J Immunol 101:730-742.

Introduction

In designed experiments, one measures responses on multiple experimental subjects with the goal of analyzing the effect of changes in controlled experimental conditions, or treatments, on the responses of different subjects. In laboratory experiments, one controls for extraneous conditions, to ensure each experimental run is conducted under similar conditions, so that one is reasonably confident that consistent differences in response are caused by the treatments. However, it is not always possible to ensure that all extraneous conditions are properly controlled. To reduce the effects of extraneous conditions, two techniques are commonly employed: randomization and blocking.

Blocking refers to the division of experimental runs into smaller sub-groups, or blocks. Each treatment is applied randomly to a number of subjects within each block. This design, known as a Randomized Complete Block Design (RCBD), is commonly employed in biological experiments, where, for example, experimental runs on a given day may be treated as a block (Sokal and Rohlf, 1981). The randomization protocol reduces any bias in favor of particular treatments, while the blocking enables extraneous variation to be absorbed into block effects. Consequently, one obtains better estimates of treatment effects and more powerful tests for treatment differences (Cochran and Cox, 1957).

In the RCBD, the application of treatments to subjects within a block must be completely randomized. If a treatment is applied to five subjects within a block, then the subjects are chosen randomly within the block. However, in many experimental designs, practical constraints prevent this ideal situation from being realistic. The motivation for this research arises from insect pheromone-tracking studies, where a pheromone plume is generated at one end of a wind tunnel. An insect begins at the other end of the tunnel and is challenged to track the plume to the pheromone source. The goal is to detect differences in the response of an insect to different types of plumes or treatments (Willis and Baker, 1984; Linn et al., 1988a,b; Mafra-Neto and Cardé, 1998; Zanen and Cardé, 1999; Cardé and Knols, 2000; Dekker et al., 2001; Willis and Avondet, 2005). The effect of various types of formulated synthetic pheromone on different species of walking and flying insects has been studied by various researchers (Linn et al., 1988a,b; Willis and Arbas, 1991; Linn et al., 1996; Cardé et al., 1998; Willis and Avondet, 2005).

Whenever the complete randomization protocol is violated, we refer to the corresponding design framework as 'restricted randomization'; one such type is illustrated in Fig. 1. For instance, in chemo-orientation studies, changing the treatment typically involves dismantling and reconfiguring the experimental device, which is often not practical after each experimental run. A more practical approach, often taken by experimental scientists, is to challenge multiple subjects within each block to a single treatment before changing the device to administer the next treatment, randomizing only the order in which the treatments are applied. By running subjects in groups, the experiments can be conducted in a relatively short time span. We demonstrate the effects of such restricted randomization on the analysis and scientific conclusions using a computer-simulated experiment and data involving virgin male Periplaneta americana.
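This restricted protocol, in which only the order of treatments is randomized within each block and each treatment is then run on a consecutive group of subjects, can be sketched as follows (the plume labels and group size are hypothetical):

```python
import random

def restricted_layout(treatments, group_size, n_blocks, seed=None):
    """Within each block, randomize only the ORDER of treatments; each treatment
    is then run on a consecutive group of subjects before the device is changed."""
    rng = random.Random(seed)
    layout = []
    for block in range(1, n_blocks + 1):
        order = rng.sample(treatments, k=len(treatments))
        runs = [(trt, [f"subject {s + 1}" for s in range(group_size)]) for trt in order]
        layout.append({"block": block, "runs": runs})
    return layout

# Three hypothetical plume treatments, four subjects per run, three blocks
for blk in restricted_layout(["plume A", "plume B", "plume C"], 4, n_blocks=3, seed=2):
    print(blk["block"], [trt for trt, _ in blk["runs"]])
```

Contrast this with a complete randomization, in which each subject, not each run, would be randomly assigned a treatment.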

Fig. 1.

A design layout under a type of restricted randomization for three blocks, in which multiple subjects were challenged to a single treatment before administering the next treatment. The order of the treatments as well as the assignment of the subjects within each block were randomized.

Analysis of variance (ANOVA) is the fundamental tool for analyzing data from designed experiments (Cochran and Cox, 1957; Searle, 1971). Chapter 8 of Sokal and Rohlf (1981) emphasizes that ANOVA, although an effective tool for the modern biologist, may create artificial constructs in the mind of a scientist that can lead to misleading conclusions. In subsequent sections of this article, we demonstrate this in the context of restricted randomization.

In many of the animal chemo-orientation studies published over the past two decades (see, for instance, Table 1), an RCBD or a related design was employed to analyze experimental data obtained under restricted or otherwise modified randomization protocols. One such modified experimental design was employed by Linn et al. (1988a), in which some treatment effects were confounded with block effects; this may be one reason the treatment effects were found to be non-significant. In a flight tunnel experiment, Linn et al. (1988b) challenged 5-10 males of each of the species Trichoplusia ni and Pseudoplusia includens to the treatments at one of two dosages. From their experimental design description, it appears that there were multiple levels of blocking and a type of restricted randomization. However, they analysed the data using a one-way ANOVA, ignoring both the block effects and the restriction in randomization.

Mafra-Neto and Cardé (1998) utilized an RCBD to test the effect of treatments; however, it is not clear from their 'Materials and methods' section whether their experimental design satisfies the complete randomization protocol. From the 'Materials and methods' section of Linn et al. (1996), it appears that they did not have a complete randomization protocol; nevertheless, they analysed their experimental data using an ANOVA (although it is not clear whether it was one-way or two-way). The effect of light levels and plume structure on the orientation maneuvers of male Lymantria dispar (gypsy moths) flying along pheromone plumes was studied by Cardé and Knols (2000), who used the flight tracks of 20 males per treatment (a total of six treatments), tested in a randomized complete block design. The goal was to study the effects of odor plume structure on the orientation maneuvers of different species of walking and flying insects (Willis and Baker, 1984; Baker, 1990; Cardé and Knols, 2000). Justus et al. (2002) employed an ANOVA; however, they do not state the details of the experimental design or whether the analysis was a one-way ANOVA or an RCBD ANOVA. Vickers (2002) considered male Heliothis subflexa, which were flown in a wind tunnel to a variety of combinations of synthetic pheromone components admixed on a filter paper disk. That experimental design violated the complete randomization protocol, since groups of 3-5 males were flown under each treatment on any given day; however, the experimental data were analysed using an RCBD ANOVA.

The flight behavior of mosquitoes in host-odor plumes and the effects of the fine-scale structures of such plumes have been studied by Dekker et al. (2001). They considered seven treatments; each treatment had eight replicates and treatments were randomized within each test day. Zanen and Cardé (1999) applied an RCBD to test the treatments on a given day on male L. dispar; however, they did not employ a two-way ANOVA to analyse the resulting experimental data. Consequently, their analysis does not match the design. Furthermore, it is not clear whether their design satisfied a complete randomization protocol.

All the above analyses lead to an important question: how does violating the fundamental assumption of complete randomization affect the interpretation of experimental results or scientific conclusions? It is often very difficult (1) to assess the consequences of subtle modifications of the design on the resulting analysis and scientific conclusions and/or (2) to identify an alternative method of analysis corresponding to the randomization protocol scheme at hand by simply referring to the statistical literature (Cochran and Cox, 1957; Searle, 1971; Sokal and Rohlf, 1981).

The goal of this article is to provide insight into the statistical analysis issues embedded within designed experiments when practical constraints impose restrictions on the randomization of treatments. The statistical analyses of simulated experiments and of data involving virgin male P. americana demonstrate the consequences of overlooking the restricted randomization for the scientific questions being addressed, as well as for the analysis and interpretation of the results. Our simulated experimental data, presented in the 'Results' section, demonstrate that the RCBD incorrectly finds a highly significant treatment effect that was not present in the model, while failing to find the real effect that was present. However, the risk of a false positive (Type I error) indication of treatment significance is substantially reduced under the alternative analysis. In essence, by employing an RCBD when the underlying assumption is not satisfied, we are more likely to declare that an effect exists when it does not. This has implications for the understanding of experimental results as well as for the scientific conclusions. It is important to note that the methodology and analysis employed for the simulated experiment are equally applicable to any organism or artificial agent tested under a restricted randomization framework.

Applying an appropriate model to account for changes in the design is relevant for two reasons. Violation of the assumptions of a particular design could result in (1) underestimation of the experimental error variance and (2) false positives (Type I errors). These, in turn, may lead to incorrect results or invalidate the analysis employed by a researcher. Therefore, it is important to choose an appropriate model and error structure when considering designed experiments.

Table 1.

A summary of several experimental studies conducted by various researchers on chemo-orientation studies

Materials and methods

The effect of a restricted randomization on the analysis of experimental data is best illustrated through a simulated experiment. In general, if there is a restriction on randomization at a given level of experimentation, there will be a 'split' in the design, leading to a split-plot design nested within an RCBD structure. There are two sizes of experimental units: the larger units, the groups of subjects (for example, insects), are called the whole-plots, and the smaller ones, individual subjects, the sub-plots. A split-plot design creates a nesting within the design structure, since the whole-plots are nested within blocks and the sub-plots are in turn nested within the whole-plots. The design structure for the whole-plot experimental units is essentially an RCBD. A split-plot design has two advantages over a simpler completely randomized design: (1) if a treatment can be applied to a whole-plot at once, rather than separately to each sub-plot, costs may be reduced; and (2) because sub-plots are usually more uniform, parameters measuring comparisons among sub-plot conditions may be estimated more precisely.
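This nested randomization can be made concrete with a short sketch. The paper's own analyses used S and SAS; the layout below is written in Python purely for illustration, and the default numbers of blocks, treatments and ages (taken from the simulated experiment described later) as well as all names are our own choices:

```python
import random

def split_plot_layout(n_blocks=5, treatments=("A", "B", "C", "D"),
                      ages=(10, 20, 30, 40), seed=1):
    """Randomize treatments to whole-plots within each block, then
    randomize the sub-plot factor (age) within each whole-plot."""
    rng = random.Random(seed)
    runs = []
    for block in range(1, n_blocks + 1):
        trt_order = list(treatments)
        rng.shuffle(trt_order)              # whole-plot randomization
        for wp, trt in enumerate(trt_order, start=1):
            age_order = list(ages)
            rng.shuffle(age_order)          # sub-plot randomization
            for age in age_order:
                runs.append({"block": block, "whole_plot": wp,
                             "treatment": trt, "age": age})
    return runs

layout = split_plot_layout()
# 5 blocks x 4 whole-plots x 4 sub-plots = 80 runs; every treatment
# occupies exactly one whole-plot per block, and every age occurs
# exactly once within each whole-plot.
```

Note that the treatment labels are shuffled only once per block (whole-plot level), whereas the ages are reshuffled inside every whole-plot (sub-plot level); this is precisely the 'split' in the randomization.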

Several examples of split-plot designs can be found in the biological literature. Linn and Roelofs (1983) considered a total of 100 treatments in a 5×4×5 factorial design such that two of the three factors were varied between days and one factor was varied within days. This design has a split-plot structure, with days serving as whole-plots. The experiments of Linn et al. (1988a) considered two species, Grapholita molesta and Pectinophora gossypiella. They challenged 5–10 males with each treatment per day, with a total of 70 males for each treatment–temperature combination. Both the treatments and the temperatures were randomized over the experimental period. This experimental design has a split-plot structure with unbalanced data. In both of these examples, the analysis was based on ANOVA and regression techniques, rather than on a split-plot analysis.

Model for a hypothetical split-plot design

Every ANOVA is associated with a linear model specifying the effects being considered. The linear model for a split-plot ANOVA includes hierarchies of terms modeling both the block effects and treatment effects. The key concept in constructing models for split-plot designs is recognizing the different sizes of experimental units and consequently identifying the corresponding design structures and treatment structures.

Consider an experiment in which several treatments are administered to different subjects and the experiment is conducted over several blocks. Suppose that a biological experiment consists of subjects of different ages, challenged with various treatments, such as pheromones, on different blocks (for example, days). The age factor may be included in the model to assess how the behavior of an animal changes as it develops over time. The split-plot design originated in agricultural field trials; in that setting, one factor may be fertilizer and another irrigation level.

Consider a hypothetical design in which Y_ijk denotes a response measured on a subject at the kth (k=1,...,n) age when challenged with the jth (j=1,...,t) treatment in the ith (i=1,...,b) block. The simplest possible model can be written as

Y_ijk = μ + α_i + β_j + γ_ij + δ_k + ε_ijk,   (1)

where μ represents the overall mean, α_i is the block effect, β_j is the effect of the jth treatment applied at the whole-plot level, γ_ij is the whole-plot effect, and δ_k is the age effect. The term ε_ijk measures random error.

The scientific interest is in the treatment effect β_j and the age effect δ_k. The basic statistical problem is to detect the significance of these effects and to estimate the size of any effects that are present. The effects β_j and δ_k are treated as fixed effects, or parameters, in the model. The block effect α_i and the whole-plot effect γ_ij are of no inherent interest; however, they can cause considerable variation from block to block and from whole-plot to whole-plot. Therefore, these effects are treated as random. A typical assumption is that they follow N(0, σ_α²) and N(0, σ_γ²) distributions, respectively, where σ_α² and σ_γ² are unknown variance components corresponding to the block and whole-plot effects. The random errors ε_ijk are assumed to follow a N(0, σ_ε²) distribution.

Simulated experiment: generation of data

We demonstrate how to specify a model for split-plot design and how to construct appropriate F-test statistics through simulated experimental data. The design is generated as follows:

  1. Sixteen runs, numbered 1 to 16, were performed for each of five blocks.

  2. Each block was divided into four whole-plots of four runs. The four treatments, such as pheromone plumes, were randomly assigned to the four whole-plots.

  3. Four subjects, one of each age, were assigned in a random order to the runs within each whole-plot.

The important feature of this design is that two factors, treatment and age, are being varied across experimental runs. The treatment is varied randomly at the whole-plot level, while the age is varied randomly at the level of individual runs or sub-plots.

After creating the experimental design, responses were generated according to the following model:

Y_ijk = μ + α_i + δ_k + λ_i r_ijk + ε_ijk,   (2)

where the subscripts i, j and k denote the block, treatment and age, respectively. (1) The overall mean is μ=100; (2) the block effect α_i has a normal distribution with mean zero and σ_α²=25; (3) the term δ_k, representing the age effect, takes the values 1,...,4, corresponding to subjects of four different age groups (for example, insects 10, 20, 30 and 40 days old, respectively); (4) the component λ_i r_ijk represents drift of the experimental conditions over time, with r_ijk taking the run number 1,...,16 of the (j,k) treatment combination in the ith block; (5) the coefficient λ_i follows a N(0,1) distribution; and (6) ε_ijk has a normal distribution with mean zero and σ_ε²=1. That is, we assumed that different random effects contribute differently to the level of variation in the final measurement of Y_ijk. The statistical model used to generate the data includes an age effect, a random block effect and a random drift effect within each block. However, the model does not include any treatment effect β_j, and, therefore, the responses are independent of the treatment employed.

It is important to emphasize that the model under which the simulated data were generated is different from the split-plot model (1) used for the analysis. In particular, the experimental drift in the generating model does not exactly match the assumptions of the split-plot model. This effect was intentionally introduced, since in practice one does not know the precise form of any uncontrolled variation; it is important that the statistical analysis be robust to misspecification of this term. The data were generated using the statistical language S (Venables and Ripley, 2002) and the models were fitted using the lm() function in S. The model fitting and data analysis can also be performed using R, a freely available open-source statistical language available from http://www.r-project.org.
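The paper generated its data in S; a rough Python/numpy equivalent of generating model (2) might look as follows (the seed, variable names and loop structure are our own; only the distributional settings come from the text above):

```python
import numpy as np

# Model (2): Y_ijk = mu + alpha_i + delta_k + lambda_i * r_ijk + eps_ijk
rng = np.random.default_rng(0)
n_blocks, n_trt, n_age = 5, 4, 4

mu = 100.0
alpha = rng.normal(0.0, 5.0, size=n_blocks)   # block effects, variance 25
lam = rng.normal(0.0, 1.0, size=n_blocks)     # per-block drift coefficients

rows = []
for i in range(n_blocks):
    run = 0
    for j in range(n_trt):            # whole-plots: note no treatment effect
        for k in range(n_age):        # sub-plots
            run += 1                  # run number r_ijk (1..16) within block i
            y = mu + alpha[i] + (k + 1) + lam[i] * run + rng.normal(0.0, 1.0)
            rows.append({"block": i + 1, "treatment": j + 1,
                         "age": k + 1, "run": run, "y": float(y)})
# 80 responses in total; the treatment index j never enters y, so any
# 'significant' treatment effect found later is a false positive.
```

Because no β_j term appears in the generating loop, a correct analysis of these data should declare the treatment effect non-significant.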

P. americana experiment

The P. americana experiment is an example of a split-plot design, characterized by multiple levels of blocking. The experiment involved virgin male P. americana, 3–18 weeks old, which were challenged to track wind-borne plumes of (–)-periplanone-B 2 h into their scotophase (12 h:12 h L:D cycle). The animals were video recorded as they tracked wind-borne plumes of the female sex pheromone (–)-periplanone-B in a laboratory wind tunnel. Each videotaped walking path was digitized using a computerized motion analysis system.

Plume structure

For this experiment, four different plume structures were constructed by varying the size, shape and orientation of the pheromone source (Fig. 2). The point source plume was constructed using a 0.7 cm diameter circular filter paper disk (Whatman No. 1, Eastbourne, East Sussex, UK) held perpendicular to the airflow with an insect pin (Fig. 2A). A ribbon plume with a chemical source 0.05 cm wide was constructed by rotating the 0.7 cm filter paper disk 90°, so that the disk was parallel to the air flow in the wind tunnel, resulting in a very narrow plume (Fig. 2B). The third plume was created by increasing the surface area of the source ca. 25-fold while proportionally increasing the dosage of pheromone solution applied to the source; the wide plume treatment source was 14.3 cm wide × 0.7 cm tall (Fig. 2C). The cylinder plume structure was generated by placing a Plexiglas® cylinder (81.28 cm tall × 7.62 cm diameter) 5 cm upwind of the 0.7 cm diameter circular filter paper disk held perpendicular to the airflow (Fig. 2D). The reader is referred to Willis and Avondet (2005) for further details on the materials employed in this experiment.

Fig. 2.

An illustration of P. americana males tracking female pheromone upwind (right to left) in a laboratory wind tunnel, containing the time-averaged plume boundaries of titanium tetrachloride smoke plume in 25 cm s–1 wind. Each circle represents the body position at every 0.083 s. (A) Point source plume, 2.4 cm wide at the source spreads to 7.7 cm wide at the downwind end. (B) Ribbon plume, 1.5 cm wide at the source spreads to 6.1 cm wide at the downwind end. (C) Wide plume, 17 cm wide at the source spreads to 26.5 cm at the downwind end. (D) Cylinder plume, 7.6 cm wide at the source spreads to 68.1 cm at the downwind end.

The aim of this experiment was to test the hypothesis that male cockroaches steer their walking while tracking female pheromone using a chemotactic strategy characterized by counter turning (turning-back) when they experience a sharp pheromone-clean air edge (Willis and Avondet, 2005).

Measurements

Response variables measured from the digitized insect movement tracks included: track angle (degrees), track width (cm), ground speed (cm s–1), body axis (degrees), net velocity (cm s–1), inter-turn duration (s), the number of times each animal stopped, and the duration of each stop (s). For the purposes of the analysis, Willis and Avondet (2005) considered (1) a turn as the location at which the head reached a local maximum or minimum value with respect to the lateral frame of reference of the wind tunnel, and (2) an animal to be in stopping position if there was no movement between two sequential positions of the head point. Measurement of these response variables from one animal was considered as one trial. The response variable is the average of the measurements for an entire walking track of a single animal.

The animals are expected to show their peak response during a specific period in each scotophase, and only a limited number of experimental runs can be performed each day. The experiment was therefore carried out over 5 days. The design can be summarized as follows:

  1. The experiment was run over 5 days.

  2. Each day was divided into four whole-plots. Each of the four pheromone treatments was randomly assigned to one of the whole-plots.

  3. Within each whole-plot, five animals were tracked, each animal, or sub-plot, representing the smallest experimental unit.

This gives a total of 100 experimental runs, corresponding to 5×4×5. Three animals did not respond when challenged with the experimental conditions, so 24 observations were completed for each treatment, except for the second treatment, which yielded 25. The analysis therefore includes a total of 97 observations.

Split-plot model for the P. americana experiment

The key feature of the P. americana design is that the treatments (the pheromone plumes) were varied at the level of whole-plots, and not at the level of individual experimental runs. In the simulated experimental data described above, a second factor of age was varied at the level of an experimental unit; however, in the P. americana experiment, this second factor is absent.

Since the treatments were applied to groups of animals within each apparatus setup, the treatments must be associated with the whole-plot part of the design. Therefore, in order to make an appropriate inference regarding treatments, the F-statistic denominator must include the random variation between the whole-plots.

In the context of the P. americana experiment, the response Y_ijk in model (1) represents the ground speed (cm s–1) averaged over different time points; α_i and γ_ij are the day and whole-plot effects, respectively; β_j is the effect of the jth pheromone applied at the whole-plot level; and δ_k is the effect of a treatment applied at the subject level. However, since no factor was varied at the subject level in this experiment, the δ_k term is absent from the model.

Appropriate F-test statistics

The analysis involves computing the sums of squares (SS) and mean squares (MS) due to each of the terms in the model; see Sokal and Rohlf (1981) and Searle et al. (1992) for the partitioning of the SS. In order to test for the significance of treatment effects, one forms an F-ratio using the treatment MS (MS_trt) and the error MS (MSE) as

F = MS_trt / MSE,   (3)

where the MSE is an unbiased estimator of the error variance σ_ε², the variation between subjects within groups. This F-statistic is the appropriate one to employ when the only source of random variation in the estimated effects is the random errors. This is the case for the age effect δ_k in model (1): since each age occurs once in each whole-plot, any block or whole-plot effects must influence all ages equally. Consequently, the presence of such effects does not inflate the age MS.

The F-statistic in Equation (3) is no longer valid when one is testing for the treatment effects applied at the whole-plot level, the βj term in model (1). In the P. americana experiment, there may be additional whole-plot variation due to differences in responsiveness of animals during the scotophase, or due to any random variation in resetting the experimental device. The MSE estimates only the subject-to-subject variation while ignoring these other potential sources of random variation. Therefore, the F-statistic in Equation (3) corresponding to the treatment effects is biased upwards, leading to false indications of significant treatment effects (Type I error).

The split-plot analysis overcomes this difficulty by modeling the whole-plot variation through the random γ_ij term. The appropriate denominator for the F-ratio is the MS attributed to γ_ij, i.e. MS_intr. Consequently, the F-ratio becomes

F = MS_trt / MS_intr.
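The difference between the two denominators, MSE for the naive RCBD test and MS_intr for the whole-plot test, can be illustrated with a tiny sketch; the mean-square values below are hypothetical numbers chosen for illustration, not values from the paper's tables:

```python
def f_ratios(ms_trt, ms_intr, mse):
    """Return (naive RCBD F, split-plot F) for a whole-plot treatment.
    The naive test divides by the subject-level MSE; the split-plot
    test divides by the whole-plot (interaction) mean square."""
    return ms_trt / mse, ms_trt / ms_intr

# hypothetical mean squares, for illustration only
f_rcbd, f_split = f_ratios(ms_trt=50.0, ms_intr=20.0, mse=2.0)
# f_rcbd = 25.0, f_split = 2.5
```

Because MSE ignores the whole-plot-to-whole-plot variation, the naive ratio (25.0 here) can greatly exceed the split-plot ratio (2.5), which is how the inflated significance arises.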

Results

We demonstrate the consequences of the randomization protocol on the analysis of experiments and scientific conclusions using the simulated and real data.

Comparison of the RCBD and split-plot analyses: simulation experiment

In terms of practical significance, the main findings from our simulation experiment are summarized in Tables 2 and 3. Table 2 presents the results of an RCBD ANOVA, which assumes complete randomization, as illustrated in Fig. 3. The analysis shows a highly significant block effect (P<0.00001). However, scientific interest is usually in the treatment effects, and the analysis in Table 2 incorrectly finds a highly significant treatment effect (P=2.533×10–5), while failing to detect the real age effect (P=0.2919). The restricted randomization in this design has led to both false positive (Type I error) and false negative (Type II error) results.

Fig. 3.

A completely randomized design layout for a single block in which every subject and every treatment were individually randomized.

The restricted randomization means that the treatments are applied in whole-plots of four runs, and, for the purpose of analyzing treatment effects, the correct analysis is to treat each set of four runs as a single experimental unit. This leads to the split-plot ANOVA. The results of applying the split-plot analysis to the simulated experimental data are shown in Table 3. This analysis correctly identifies that there is no significant treatment effect (P=0.1889) and that the age effect is highly significant (P=6.126×10–5). Therefore, the risk of a false positive indication of treatment significance is substantially reduced under the split-plot design. In essence, by employing an RCBD when the underlying assumption is not satisfied, we are more likely to reject a true null hypothesis. This has implications for the understanding of experimental results, since the absence of a treatment effect is itself a finding of scientific relevance.

Table 2.

ANOVA under the RCBD for the simulated experimental data assuming the complete randomization within each block, such as day, while treating the block effect as fixed

Table 3.

ANOVA under a split-plot design for the simulated experimental data assuming the restricted randomization

Comparison of the RCBD and split-plot analyses: the P. americana experiment

We present the analysis of the data from the P. americana experiment using the RCBD and split-plot models to demonstrate when and how false positives can occur and their consequences for the biological questions. This illustrates how violation of the underlying assumption of the RCBD leads to underestimation of the error variability and inflation of the statistical significance of the treatment effects.

The response variable, ground speed (cm s–1), is shown in Fig. 4 sorted first by day and then by pheromone, and in Fig. 5 sorted first by pheromone and then by day. It is clear that pheromone D (the cylinder source) behaves differently. It also appears that the treatments behave differently on different days. For example, a comparison of pheromone A (point source) with pheromone B (ribbon source) shows that the animals responded more rapidly to the point source on days 2 and 4 and more rapidly to the ribbon source on days 1 and 5.

Fig. 4.

The response variable ground speed (cm s–1) for the P. americana experiment sorted first by day and next by pheromone.

Fig. 5.

The response variable ground speed (cm s–1) for the P. americana experiment sorted first by pheromone (Pher) and then by day.

Analysis of the data was conducted using the SAS PROC MIXED program (Littell et al., 1996). Table 4 presents a two-way ANOVA for the response variable ground speed (cm s–1). While the usual analysis of an RCBD includes only main effects for treatments and blocks, in the present experiment there are multiple replications of the pheromones on each day. Therefore, we are able to include an interaction term between the treatments and blocks. The table shows highly significant day and pheromone effects (P<0.0001) and a significant pheromone-by-day interaction effect (P=0.0307). These results are consistent with those obtained by Willis and Avondet (2005).
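The paper fitted the unbalanced data with SAS PROC MIXED; purely for illustration, the two-way ANOVA with interaction can be sketched for a balanced layout in Python. The data below are simulated toy values (5 days × 4 pheromones × 5 replicates, with an injected effect for one pheromone), not the experimental measurements:

```python
import numpy as np

def two_way_anova(y):
    """Balanced two-way ANOVA with interaction.
    y has shape (a, b, n): a day levels, b pheromone levels, n replicates."""
    a, b, n = y.shape
    grand = y.mean()
    m_i = y.mean(axis=(1, 2))          # day means
    m_j = y.mean(axis=(0, 2))          # pheromone means
    m_ij = y.mean(axis=2)              # cell means
    ss_a = b * n * ((m_i - grand) ** 2).sum()
    ss_b = a * n * ((m_j - grand) ** 2).sum()
    ss_ab = n * ((m_ij - m_i[:, None] - m_j[None, :] + grand) ** 2).sum()
    ss_e = ((y - m_ij[..., None]) ** 2).sum()
    ms_e = ss_e / (a * b * (n - 1))    # error mean square
    return {"F_day": (ss_a / (a - 1)) / ms_e,
            "F_pher": (ss_b / (b - 1)) / ms_e,
            "F_inter": (ss_ab / ((a - 1) * (b - 1))) / ms_e}

# toy balanced data with a real effect for the fourth pheromone only
rng = np.random.default_rng(1)
effect = np.array([0.0, 0.0, 0.0, 5.0])
y = 10.0 + effect[None, :, None] + rng.normal(0.0, 1.0, size=(5, 4, 5))
res = two_way_anova(y)
```

Each F-statistic divides the factor mean square by the same error mean square; for a whole-plot factor in a split-plot design, as discussed above, the interaction mean square would replace MSE in the denominator of the pheromone test.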
