lab09_nested.Rmd
Remember from lecture that nested designs are useful in situations where we subsample within each experimental unit and we’re interested in treatment effects at the experimental (whole) unit level, not the subunit level. For example:
We count larvae at multiple subplots within a plot
We weigh multiple chicks in a brood
\[\Large y_{ijk} = \mu + \alpha_i + \beta_{ij} + \epsilon_{ijk}\]
Because we want our inferences to apply to all experimental units, not just the ones in our sample, \(\beta_{ij}\) is random:
\[\Large \beta_{ij} \sim normal(0, \sigma^2_B)\]
And as always,
\[\Large \epsilon_{ijk} \sim normal(0, \sigma^2)\]
Treatment effects:
\[\Large H_0 : \alpha_1 = · · · = \alpha_a = 0\] \[\Large H_a : at\; least\; one\; inequality\] Random variation among experimental units:
\[\Large H_0 : \sigma^2_B = 0\]
\[\Large H_a : \sigma^2_B > 0\]
Recall that a nested ANOVA is primarily used when the variation among experimental units is of interest. If it is not, we can simplify to a one-way ANOVA.
For this lab, we’ll use another data set on the response of gypsy moth larvae to pesticide treatments:
library(FANR6750)
data("mothData2")
str(mothData2)
#> 'data.frame': 36 obs. of 3 variables:
#> $ larvae : num 16 16 15.8 14.2 13.9 14.2 13.5 13.4 14 13.1 ...
#> $ Treatment: chr "Bt" "Bt" "Bt" "Bt" ...
#> $ Plot : int 1 1 1 1 2 2 2 2 3 3 ...
As before, we’ll need to convert Plot
and
Treatment
to be factors:
mothData2$Treatment <- factor(mothData2$Treatment,
levels = c("Control", "Bt", "Dimilin"))
mothData2$Plot <- factor(mothData2$Plot)
Note that we explicitly define the levels for Treatment
so that control will be treated as the baseline level and plotted first.
We don’t need to do that with Plot
since the default
numeric order is fine.
To see the level of replication within each treatment and plot, let’s cross-tabulate the data:
table(mothData2$Treatment, mothData2$Plot)
#>
#> 1 2 3 4 5 6 7 8 9
#> Control 0 0 0 4 4 4 0 0 0
#> Bt 4 4 4 0 0 0 0 0 0
#> Dimilin 0 0 0 0 0 0 4 4 4
Notice how plots are labeled! Not according to \(j \in {1, 2, 3}\), even though there are three plots within each treatment (and four replicates within each plot). Rather, each plot is labeled independently, 1–9.
Why is this important?
aov()
Based on the models we’ve fit previously this semester, we might instinctively analyze these data using the following:
aov.wrong <- aov(larvae ~ Treatment + Plot, data = mothData2)
summary(aov.wrong)
#> Df Sum Sq Mean Sq F value Pr(>F)
#> Treatment 2 215.4 107.7 208.89 <2e-16 ***
#> Plot 6 11.2 1.9 3.61 0.0093 **
#> Residuals 27 13.9 0.5
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
But if we do so, the denominator degrees of freedom are incorrect. Instead, we should use:
aov.correct <- aov(larvae ~ Treatment + Error(Plot), data = mothData2)
summary(aov.correct)
#>
#> Error: Plot
#> Df Sum Sq Mean Sq F value Pr(>F)
#> Treatment 2 215.4 107.7 57.9 0.00012 ***
#> Residuals 6 11.2 1.9
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Error: Within
#> Df Sum Sq Mean Sq F value Pr(>F)
#> Residuals 27 13.9 0.516
Can you spot the difference? What calculations changed between these two analyses?
What happens if we analyze plot-level means? The
aggregate()
function is similar to tapply()
but it works on entire data frames. Here we get averages for each whole
plot:
plotData <- aggregate(larvae ~ Treatment + Plot,
data = mothData2, FUN = mean)
plotData
#> Treatment Plot larvae
#> 1 Bt 1 15.50
#> 2 Bt 2 13.75
#> 3 Bt 3 14.00
#> 4 Control 4 18.25
#> 5 Control 5 18.75
#> 6 Control 6 19.25
#> 7 Dimilin 7 12.50
#> 8 Dimilin 8 13.50
#> 9 Dimilin 9 13.00
When we use aov()
to analyze the aggregated data, we get
the same F and p as before:
When using using aov()
with Error term:
You can’t use TukeyHSD()
You don’t get a direct estimate of \(\sigma^2_B\)
Doesn’t handle unbalanced designs well
But, you can use model.tables()
and
se.contrast()
An alternative is to use lme()
function in
nlme
package
Possible to get direct estimates of \(\sigma^2_B\) and other variance parameters
Handles very complex models and unbalanced designs
Possible to do multiple comparisons and contrasts using the
glht()
function in the multcomp
package
But…
Only works if there random effects
ANOVA tables aren’t as complete as aov()
lme()
function
To see how the lme()
function works let’s fit the moth
data model using this approach:
The syntax for random effects can be a little confusing at first but
see ?lme
for additional information.
Now we can view the ANOVA table from the model using:
anova(lme1, Terms = "Treatment")
#> F-test for: Treatment
#> numDF denDF F-value p-value
#> 1 2 6 57.87 1e-04
And we can view the variance parameter estimates:
VarCorr(lme1)
#> Plot = pdLogChol(1)
#> Variance StdDev
#> (Intercept) 0.3364 0.580
#> Residual 0.5156 0.718
The first row shows the estimates of \(\sigma^2_B\) and \(\sigma_B\). The second row shows the estimates of \(\sigma^2\) and \(\sigma\).
What can we say about the amount of random variation within whole units relative to among whole units (after accounting for treatment effects)?
Next, we can extract the plot-level random effects. These are the \(\beta_{ij}\)’s:
round(ranef(lme1), 2)
#> (Intercept)
#> 1 0.78
#> 2 -0.48
#> 3 -0.30
#> 4 -0.36
#> 5 0.00
#> 6 0.36
#> 7 -0.36
#> 8 0.36
#> 9 0.00
And finally, multiple comparisons:
tuk <- glht(lme1, linfct = mcp(Treatment="Tukey"))
summary(tuk)
#>
#> Simultaneous Tests for General Linear Hypotheses
#>
#> Multiple Comparisons of Means: Tukey Contrasts
#>
#>
#> Fit: lme.formula(fixed = larvae ~ Treatment, data = mothData2, random = ~1 |
#> Plot)
#>
#> Linear Hypotheses:
#> Estimate Std. Error z value Pr(>|z|)
#> Bt - Control == 0 -4.333 0.557 -7.78 <0.001 ***
#> Dimilin - Control == 0 -5.750 0.557 -10.32 <0.001 ***
#> Dimilin - Bt == 0 -1.417 0.557 -2.54 0.03 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> (Adjusted p values reported -- single-step method)
confint(tuk)
#>
#> Simultaneous Confidence Intervals
#>
#> Multiple Comparisons of Means: Tukey Contrasts
#>
#>
#> Fit: lme.formula(fixed = larvae ~ Treatment, data = mothData2, random = ~1 |
#> Plot)
#>
#> Quantile = 2.345
#> 95% family-wise confidence level
#>
#>
#> Linear Hypotheses:
#> Estimate lwr upr
#> Bt - Control == 0 -4.333 -5.639 -3.027
#> Dimilin - Control == 0 -5.750 -7.056 -4.444
#> Dimilin - Bt == 0 -1.417 -2.723 -0.111
To determine if salinity affects adult fish reproductive performance, a researcher places one pregnant female in a tank with one of three salinity levels: low, medium, and high, or a control tank. A week after birth, two offspring (fry) are measured.
The data can be accessed using:
Run a nested ANOVA using aov()
and lme()
on
the fish dataset. Answer the following questions:
What are the null and alternative hypotheses?
Does salinity affect fry growth?
If so, which salinity levels differ?
Is there more random variation among or within experimental units?
Put your answers in an R Markdown report. Your report should include:
A well-formatted ANOVA table (with caption); and
A publication-quality plot of the estimated effect of salinity on fry growth, including 95% confidence intervals. The figure should also have a descriptive caption and any aesthetics (color, line type, etc.) should be clearly defined.
You may format the report however you like but it should be well-organized, with relevant headers, plain text, and the elements described above.
As always:
Be sure the output type is set to:
output: html_document
Title the document:
title: "Homework 3: Nested design"
Be sure to include your first and last name in the
author
section
Be sure to set echo = TRUE
in all R
chunks so we can see both your code and the output
Please upload both the html
and .Rmd
files when you submit your assignment
See the R Markdown reference sheet for help with creating
R
chunks, equations, tables, etc.
See the “Creating publication-quality graphics” reference sheet for tips on formatting figures