LECTURE 5: One-way ANOVA

class: center, middle, inverse, title-slide

.title[
# LECTURE 5: One-way ANOVA
]
.subtitle[
## FANR 6750 (Experimental design)
]
.author[
### <br/><br/><br/>Fall 2022
]

---

class: inverse

# outline

<br/>
#### 1) Overview

<br/>  
--

#### 2) ANOVA as a linear model

<br/> 
--

#### 3) ANOVA table

<br/> 
--

#### 4) Example

---
# general idea

### Extension of the *t*-test for comparing the means of more than 2 populations

---
# motivating example

Foresters are studying the effect of 4 different fertilizers (treatments) on the growth of loblolly pine. Each treatment is applied to 3 plots (replicates). The data are the average height per plot after 5 years:

.pull-left[
<br/>
<table class="table table-condensed" style="font-size: 18px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
<tr>
<th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th>
<th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="4"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Treatment</div></th>
</tr>
  <tr>
   <th style="text-align:center;"> Replicate </th>
   <th style="text-align:center;"> A </th>
   <th style="text-align:center;"> B </th>
   <th style="text-align:center;"> C </th>
   <th style="text-align:center;"> D </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> 11 </td>
   <td style="text-align:center;"> 7 </td>
   <td style="text-align:center;"> 6 </td>
   <td style="text-align:center;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 2 </td>
   <td style="text-align:center;"> 9 </td>
   <td style="text-align:center;"> 9 </td>
   <td style="text-align:center;"> 5 </td>
   <td style="text-align:center;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 3 </td>
   <td style="text-align:center;"> 10 </td>
   <td style="text-align:center;"> 8 </td>
   <td style="text-align:center;"> 7 </td>
   <td style="text-align:center;"> 4 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
#### Notation

- The number of groups (treatments) is `$\large a=4$`

- The number of observations within each group (replicates) is `$\large n=3$`

- `$\large y_{ij}$` denotes the `$\large j$`th observation from the `$\large i$`th group
]

---
# a brief tangent

#### What counts as an observation?

#### Experimental unit

> the physical unit that receives a particular treatment

#### Observational unit

> the physical unit on which measurements are taken

These are not always the same!

Examples

- Agricultural fields given different fertilizer, crop yield measured

- Rats given different diets, disease state measured

- Microcosm given different predator abundance, tadpole growth measured

---
# motivating example

**Question:** Is there a difference in growth among the four treatment groups?

--
<img src="05_anova_files/figure-html/pine1-1.png" width="576" style="display: block; margin: auto;" />

---
# motivating example

#### Hypotheses
- `$\large H_0 : \mu_A = \mu_B = \mu_C = \mu_D$`

- `$\large H_a :$` At least one inequality

#### How should we test the null?

--
We could do this using 6 *t*-tests

<br/>
--
But this would inflate the overall (experiment-wise) `$\large \alpha$` level: each individual test has a chance (usually `$\large \alpha = 0.05$`) of incorrectly rejecting a true null hypothesis, and these chances compound as more tests are run (for 6 independent tests, roughly `$1 - 0.95^6 \approx 0.26$`)

<br/>
--
An alternative procedure involves comparing the variation among the groups with the variation within the groups. If `$H_0$` is false, we expect the variance among groups to be greater than the variance within groups.
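
As a rough illustration of why running every pairwise *t*-test inflates the experiment-wise error rate, the sketch below counts the comparisons among the 4 treatments and computes the chance of at least one false rejection. It assumes the tests are independent, which is not strictly true here because the comparisons share groups, so the result is only an approximation.

```r
# Number of pairwise comparisons among a = 4 treatment groups
a <- 4
n_tests <- choose(a, 2)

# Per-test type I error rate
alpha <- 0.05

# Approximate experiment-wise error rate, assuming independent tests
# (an assumption: pairwise tests that share groups are not independent)
alpha_experiment <- 1 - (1 - alpha)^n_tests

n_tests
round(alpha_experiment, 3)
```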

---
# toward the additive model

#### To understand why the test is based on variance, it is helpful to consider several types of means:

--
- Grand mean

`$$\large \bar{y}. = \frac{\sum_i\sum_j y_{ij}}{a \times n}$$`

---
# toward the additive model

#### To understand why the test is based on variance, it is helpful to consider several types of means:

- Grand mean

`$$\large \bar{y}. = \frac{\sum_i\sum_j y_{ij}}{a \times n}$$`

- Group means

`$$\large \bar{y}_i = \frac{\sum_j y_{ij}}{n}$$`

---
# toward the additive model

#### To understand why the test is based on variance, it is helpful to consider several types of means:

- Grand mean

`$$\large \bar{y}. = \frac{\sum_i\sum_j y_{ij}}{a \times n}$$`

- Group means

`$$\large \bar{y}_i = \frac{\sum_j y_{ij}}{n}$$`

We can now decompose the observations as

`$$\large y_{ij} = \color{#446E9B}{\bar{y}.} + \color{#D47500}{(\bar{y}_i - \bar{y}.)} + \color{#3CB521}{(y_{ij} - \bar{y}_i)}$$`
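
The decomposition can be checked numerically with the pine data from the motivating example. This is a minimal sketch in base R; the object and column names (`pine`, `treatment`, `height`) are chosen here for illustration and do not come from the slides.

```r
# Pine heights from the motivating example (3 replicate plots per treatment)
pine <- data.frame(
  treatment = rep(c("A", "B", "C", "D"), each = 3),
  height = c(11, 9, 10,  7, 9, 8,  6, 5, 7,  5, 3, 4)
)

# Grand mean (a single number)
grand_mean <- mean(pine$height)

# Group mean repeated for every observation in that group
group_mean <- ave(pine$height, pine$treatment)

# The two deviations in the decomposition
treatment_dev <- group_mean - grand_mean    # (ybar_i - ybar.)
residual_dev <- pine$height - group_mean    # (y_ij - ybar_i)

# Adding the three pieces back together recovers the observations
reconstructed <- grand_mean + treatment_dev + residual_dev
all.equal(reconstructed, pine$height)
```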

---
# the additive model

#### The decomposition

`$$\Large y_{ij} = \color{#446E9B}{\bar{y}.} + \color{#D47500}{(\bar{y}_i - \bar{y}.)} + \color{#3CB521}{(y_{ij} - \bar{y}_i)}$$`
--

#### The additive model

`$$\Large y_{ij} = \color{#446E9B}{\mu} + \color{#D47500}{\alpha_i} + \color{#3CB521}{\epsilon_{ij}}$$`

#### where

`$$\Large \epsilon_{ij} \sim normal(0, \sigma^2)$$`

---
# the additive model

`$$\large y_{ij} = \mu + \alpha_i + \epsilon_{ij}$$`

`$$\large \epsilon_{ij} \sim normal(0, \sigma^2)$$`

#### Notes

- `$\large \mu$` is the grand mean of the population, estimated by `$\large \bar{y}.$`  
  
--

- `$\large \alpha_i$` is the effect of treatment *i*, estimated by `$\large\bar{y}_i - \bar{y}.$`

--
  + It is the deviation of the group mean from the grand mean

  + If all `$\large\alpha_i = 0$`, there is no treatment effect

  + Thus, we can write either  
      - `$H_0 : \mu_1 = \mu_2 = \dots = \mu_a$`, or  
      - `$H_0 : \alpha_1 = \alpha_2 = \dots = \alpha_a = 0$`

- `$\large \epsilon_{ij}$` is the residual error, estimated by `$\large y_{ij} - \bar{y}_i$`

  + It is the unexplained (random) deviation of the observation from the group mean
  
---
# sums of squares

#### Variation among groups

`$$\Large SS_A = n \sum_i (\bar{y}_i - \bar{y}.)^2$$`

---
# sums of squares

#### Variation among groups

`$$\Large SS_A = n \sum_i (\bar{y}_i - \bar{y}.)^2$$`

#### Variation within groups

`$$\Large SS_W = \sum_i \sum_j (y_{ij} - \bar{y}_i)^2$$`
---
# sums of squares

#### Variation among groups

`$$\Large SS_A = n \sum_i (\bar{y}_i - \bar{y}.)^2$$`

#### Variation within groups

`$$\Large SS_W = \sum_i \sum_j (y_{ij} - \bar{y}_i)^2$$`

#### Total variation

`$$\Large SS_T = SS_A + SS_W = \sum_i \sum_j (y_{ij} - \bar{y}.)^2$$`
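
The three sums of squares can be computed directly for the pine data, which also confirms that `$SS_T = SS_A + SS_W$`. This is a sketch in base R; the object names are illustrative only.

```r
# Pine heights from the motivating example (3 replicate plots per treatment)
pine <- data.frame(
  treatment = rep(c("A", "B", "C", "D"), each = 3),
  height = c(11, 9, 10,  7, 9, 8,  6, 5, 7,  5, 3, 4)
)

grand_mean <- mean(pine$height)
group_mean <- ave(pine$height, pine$treatment)  # group mean for each observation

# Summing over all observations repeats each group's term n times,
# which is the same as n * sum_i (ybar_i - ybar.)^2
ss_among <- sum((group_mean - grand_mean)^2)

# Deviations of each observation from its own group mean
ss_within <- sum((pine$height - group_mean)^2)

# Deviations of each observation from the grand mean
ss_total <- sum((pine$height - grand_mean)^2)

c(SS_A = ss_among, SS_W = ss_within, SS_T = ss_total)
```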
---
# mean squares

### To convert the sums of squares to variances, divide by the degrees of freedom

--
#### Mean squares among

`$$\Large MS_A = \frac{SS_A}{a-1}$$`

--
#### Mean squares within

`$$\Large MS_W = \frac{SS_W}{a(n-1)}$$`

---
# F-statistic

`$$\LARGE F = \frac{MS_A}{MS_W}$$`

### To test the null hypothesis

- Compare the F statistic to the critical value: `$\large F_{a-1,a(n-1)}$`

- This is always a one-tailed test. Why?
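
As a sketch of the full calculation, the code below computes the mean squares and the F statistic for the pine data and compares it with the upper-tail critical value from `qf()`. The object names are illustrative, and `$\alpha = 0.05$` is assumed.

```r
# Pine heights from the motivating example (3 replicate plots per treatment)
pine <- data.frame(
  treatment = rep(c("A", "B", "C", "D"), each = 3),
  height = c(11, 9, 10,  7, 9, 8,  6, 5, 7,  5, 3, 4)
)
a <- 4  # number of groups
n <- 3  # replicates per group

group_mean <- ave(pine$height, pine$treatment)

# Mean squares = sums of squares divided by their degrees of freedom
ms_among <- sum((group_mean - mean(pine$height))^2) / (a - 1)
ms_within <- sum((pine$height - group_mean)^2) / (a * (n - 1))

# F statistic and the upper-tail critical value at alpha = 0.05
f_stat <- ms_among / ms_within
f_crit <- qf(0.95, df1 = a - 1, df2 = a * (n - 1))

# Upper-tail p-value
p_value <- pf(f_stat, df1 = a - 1, df2 = a * (n - 1), lower.tail = FALSE)

c(F = f_stat, critical = f_crit, p = p_value)
```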

---
class: inverse, center, middle

# anova table

---
# anova table

<br/>

<table class="table table-condensed" style="font-size: 18px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;"> Source </th>
   <th style="text-align:center;"> df </th>
   <th style="text-align:center;"> SS </th>
   <th style="text-align:center;"> MS </th>
   <th style="text-align:center;"> F </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> Among groups </td>
   <td style="text-align:center;"> $a-1$ </td>
   <td style="text-align:center;"> $n \sum_i (\bar{y}_i - \bar{y}.)^2$ </td>
   <td style="text-align:center;"> $\frac{SS_A}{a-1}$ </td>
   <td style="text-align:center;"> $\frac{MS_A}{MS_W}$ </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Within groups </td>
   <td style="text-align:center;"> $a(n-1)$ </td>
   <td style="text-align:center;"> $\sum_i \sum_j (y_{ij} - \bar{y}_i)^2$ </td>
   <td style="text-align:center;"> $\frac{SS_W}{a(n-1)}$ </td>
   <td style="text-align:center;">  </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Total </td>
   <td style="text-align:center;"> $an-1$ </td>
   <td style="text-align:center;"> $\sum_i \sum_j (y_{ij} - \bar{y}.)^2$ </td>
   <td style="text-align:center;">  </td>
   <td style="text-align:center;">  </td>
  </tr>
</tbody>
</table>
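
In `R`, the same table can be produced by fitting the model with `lm()` and passing the fit to `anova()`. The sketch below does this for the pine data; the object names are illustrative.

```r
# Pine heights from the motivating example (3 replicate plots per treatment)
pine <- data.frame(
  treatment = rep(c("A", "B", "C", "D"), each = 3),
  height = c(11, 9, 10,  7, 9, 8,  6, 5, 7,  5, 3, 4)
)

# Fit the one-way model and build the ANOVA table
fit_pine <- lm(height ~ treatment, data = pine)
tab <- anova(fit_pine)
tab

# Columns of the table: degrees of freedom, sums of squares, F statistic
tab$Df
tab$`Sum Sq`
tab$`F value`[1]
```

Note that `anova()` labels the within-groups row `Residuals` and does not print a Total row; the total is the sum of the two rows shown.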

---
# worked example

#### Suppose we are interested in the effect of elevation on the abundance of Canada Warblers

.pull-left[

]

--
.pull-right[
<table class="table table-condensed" style="font-size: 14px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
<tr>
<th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th>
<th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="3"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Elevation</div></th>
</tr>
  <tr>
   <th style="text-align:center;"> Replicate </th>
   <th style="text-align:center;"> Low </th>
   <th style="text-align:center;"> Medium </th>
   <th style="text-align:center;"> High </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> 2 </td>
   <td style="text-align:center;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 2 </td>
   <td style="text-align:center;"> 3 </td>
   <td style="text-align:center;"> 0 </td>
   <td style="text-align:center;"> 7 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 3 </td>
   <td style="text-align:center;"> 0 </td>
   <td style="text-align:center;"> 4 </td>
   <td style="text-align:center;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 4 </td>
   <td style="text-align:center;"> 2 </td>
   <td style="text-align:center;"> 3 </td>
   <td style="text-align:center;"> 5 </td>
  </tr>
</tbody>
</table>
]

???

Image courtesy of William H. Majoros via Wikicommons
--

#### Hypotheses
- `$H_0 : \mu_L = \mu_M = \mu_H$` or `$H_0 : \alpha_L = \alpha_M = \alpha_H = 0$`

- `$H_a$` : At least one inequality

---
# procedure

.pull-left[
<table class="table table-condensed" style="font-size: 14px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
<tr>
<th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th>
<th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="3"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Elevation</div></th>
</tr>
  <tr>
   <th style="text-align:center;"> Replicate </th>
   <th style="text-align:center;"> Low </th>
   <th style="text-align:center;"> Medium </th>
   <th style="text-align:center;"> High </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> 2 </td>
   <td style="text-align:center;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 2 </td>
   <td style="text-align:center;"> 3 </td>
   <td style="text-align:center;"> 0 </td>
   <td style="text-align:center;"> 7 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 3 </td>
   <td style="text-align:center;"> 0 </td>
   <td style="text-align:center;"> 4 </td>
   <td style="text-align:center;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 4 </td>
   <td style="text-align:center;"> 2 </td>
   <td style="text-align:center;"> 3 </td>
   <td style="text-align:center;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Group means </td>
   <td style="text-align:center;"> 1.50 </td>
   <td style="text-align:center;"> 2.25 </td>
   <td style="text-align:center;"> 5.25 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Grand mean </td>
   <td style="text-align:center;">  </td>
   <td style="text-align:center;"> 3.00 </td>
   <td style="text-align:center;">  </td>
  </tr>
</tbody>
</table>
]

.pull-right[
#### Calculate sums of squares **among** groups ( `$SS_A$` )

`$= n \sum_i (\bar{y}_i - \bar{y}.)^2$`

`$\small = 4 \times ((1.5 - 3)^2 + (2.25 - 3)^2 + (5.25-3)^2)$`

`$= 31.5$`
]

---
# procedure

.pull-right[
#### Calculate sums of squares **within** groups ( `$SS_W$` )

`$= \sum_i \sum_j (y_{ij} - \bar{y}_i)^2$`

]

---
# procedure

.pull-right[
#### Calculate sums of squares **within** groups ( `$SS_W$` )

`$= \sum_i \sum_j (y_{ij} - \bar{y}_i)^2$`

`$\scriptsize = (1 - 1.50)^2 + (3 - 1.50)^2 + (0 - 1.50)^2 + (2 - 1.50)^2 +$`  
`$\scriptsize \;\;\; (2 - 2.25)^2 + (0 - 2.25)^2 + (4 - 2.25)^2 + (3 - 2.25)^2 +$`  
`$\scriptsize \;\;\; (4 - 5.25)^2 + (7 - 5.25)^2 + (5 - 5.25)^2 + (5 - 5.25)^2$`

]

---
# procedure

.pull-right[
#### Calculate sums of squares **within** groups ( `$SS_W$` )

`$= \sum_i \sum_j (y_{ij} - \bar{y}_i)^2$`

`$= 18.5$`

]

---
# procedure

<br/>

#### Critical value: `$\large F_{\alpha=0.05,2,9} = 4.26$`

### Decision?
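
The remaining arithmetic between the sums of squares and the decision can be sketched as follows: compute both mean squares, form the F statistic, and compare it with the critical value. The object names are illustrative, and the data are typed in from the elevation table.

```r
# Canada Warbler counts from the elevation table (4 replicates per elevation)
warbler <- data.frame(
  elevation = rep(c("Low", "Medium", "High"), each = 4),
  count = c(1, 3, 0, 2,  2, 0, 4, 3,  4, 7, 5, 5)
)
a <- 3  # number of groups
n <- 4  # replicates per group

group_mean <- ave(warbler$count, warbler$elevation)

# Sums of squares, matching the hand calculations
ss_among <- sum((group_mean - mean(warbler$count))^2)
ss_within <- sum((warbler$count - group_mean)^2)

# Mean squares and F statistic
ms_among <- ss_among / (a - 1)
ms_within <- ss_within / (a * (n - 1))
f_stat <- ms_among / ms_within

# Critical value at alpha = 0.05 with 2 and 9 degrees of freedom
f_crit <- qf(0.95, df1 = a - 1, df2 = a * (n - 1))

round(c(MS_A = ms_among, MS_W = ms_within, F = f_stat, critical = f_crit), 2)
f_stat > f_crit  # TRUE means the null hypothesis is rejected
```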

---
# anova as a linear model

As discussed previously, ANOVA is a linear model

`$$\large y_{j} = \beta_0 + \beta_1 x^1_j + \beta_2x^2_j + \epsilon_{j}$$`

So we could also analyze these data using the `lm()` function:

```r
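# `cawa` is assumed to hold the elevation table in wide format (one column per
# elevation); the slides do not show how it was created
cawa <- data.frame(Low = c(1, 3, 0, 2), Medium = c(2, 0, 4, 3), High = c(4, 7, 5, 5))

# Reshape to long format: one row per count
cawa_long <- tidyr::pivot_longer(cawa, 
                                 cols = c("Low", "Medium", "High"), 
                                 names_to = "Elevation", values_to = "Count")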

fit.lm <- lm(Count ~ Elevation, data = cawa_long)
summary(fit.lm)
```

```
##              term estimate std.error statistic   p.value
## 1     (Intercept)     5.25    0.7169     7.324 4.451e-05
## 2    ElevationLow    -3.75    1.0138    -3.699 4.928e-03
## 3 ElevationMedium    -3.00    1.0138    -2.959 1.598e-02
```

---
# anova as a linear model

Before we can interpret this output (and see how it relates to the ANOVA table), we need to understand how `R` fits this model

---
# anova as a linear model

#### The model matrix

```r
head(model.matrix(fit.lm), 2)
```

```
##   (Intercept) ElevationLow ElevationMedium
## 1           1            1               0
## 2           1            0               1
```

- One row for each observation

- Intercept = reference level (the first factor level, alphabetical by default, so `High` here)

- Low and Medium treated as *dummy variables* (0/1)

---
# anova as a linear model

#### The model matrix

```r
head(model.matrix(fit.lm), 2)
```

```
##   (Intercept) ElevationLow ElevationMedium
## 1           1            1               0
## 2           1            0               1
```

- Multiplied by the vector of model coefficients `$\beta_0$`, `$\beta_1$`, `$\beta_2$` to get `$E[y_j]$`

- `R` names the coefficients `Intercept`, `ElevationLow`, `ElevationMedium`

- e.g., row 1 = `$E[y_1] = Intercept \times 1 + ElevationLow \times 1 + ElevationMedium \times 0$`
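
The matrix multiplication described above can be sketched in base R. The long-format data are typed in directly here, in the same row order as the slides (Low, Medium, High within each replicate), so that the example does not depend on `tidyr`; the object names are illustrative.

```r
# Long-format counts, ordered Low, Medium, High within each replicate
cawa_long <- data.frame(
  Elevation = rep(c("Low", "Medium", "High"), times = 4),
  Count = c(1, 2, 4,  3, 0, 7,  0, 4, 5,  2, 3, 5)
)

fit <- lm(Count ~ Elevation, data = cawa_long)

X <- model.matrix(fit)  # 12 x 3 matrix of 0/1 indicators plus the intercept
beta <- coef(fit)       # Intercept, ElevationLow, ElevationMedium

# Expected value for each observation = row of X times the coefficients
expected <- as.vector(X %*% beta)

# These match the fitted values returned by lm()
all.equal(expected, unname(fitted(fit)))
expected[1:3]
```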

---
# anova as a linear model

#### How do we interpret the coefficients?

- `Intercept` is the expected count at a high elevation site (the reference level)

- `ElevationLow` is the *difference* between low and high elevation (low minus high)

- `ElevationMedium` is the *difference* between medium and high elevation (medium minus high)
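
One way to check this interpretation is to compare the coefficients with the group means computed by hand. This is a sketch in base R, with the data typed in from the elevation table and illustrative object names.

```r
# Long-format counts, ordered Low, Medium, High within each replicate
cawa_long <- data.frame(
  Elevation = rep(c("Low", "Medium", "High"), times = 4),
  Count = c(1, 2, 4,  3, 0, 7,  0, 4, 5,  2, 3, 5)
)

fit <- lm(Count ~ Elevation, data = cawa_long)

# Group means computed directly from the data
mean_high <- mean(cawa_long$Count[cawa_long$Elevation == "High"])
mean_low <- mean(cawa_long$Count[cawa_long$Elevation == "Low"])
mean_medium <- mean(cawa_long$Count[cawa_long$Elevation == "Medium"])

# Intercept = mean of the reference level; other coefficients = differences from it
by_hand <- c(mean_high, mean_low - mean_high, mean_medium - mean_high)

cbind(lm = coef(fit), by_hand = by_hand)
```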

---
# anova as a linear model

#### Residuals

- `lm()` also returns residuals (i.e., `$y_j - E[y_j]$`)

```r
fit.lm$residuals
```

```
##     1     2     3     4     5     6     7     8     9    10    11    12 
## -0.50 -0.25 -1.25  1.50 -2.25  1.75 -1.50  1.75 -0.25  0.50  0.75 -0.25
```
--

```r
sum(fit.lm$residuals^2)
```

```
## [1] 18.5
```

- Does this look familiar?

---
# anova as a linear model

#### Residuals

What about among-group variation?

```r
fit.lm$fitted.values
```

```
##    1    2    3    4    5    6    7    8    9   10   11   12 
## 1.50 2.25 5.25 1.50 2.25 5.25 1.50 2.25 5.25 1.50 2.25 5.25
```

```r
sum((fit.lm$fitted.values - mean(fit.lm$fitted.values))^2)
```

```
## [1] 31.5
```

- So the model is the same; the only difference is *how* we present the results
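
To see the two presentations side by side, the fitted `lm` object can be passed to `anova()`, which reports the same sums of squares in ANOVA-table form. This is a sketch with the data typed in from the elevation table; the object names are illustrative.

```r
# Long-format counts, ordered Low, Medium, High within each replicate
cawa_long <- data.frame(
  Elevation = rep(c("Low", "Medium", "High"), times = 4),
  Count = c(1, 2, 4,  3, 0, 7,  0, 4, 5,  2, 3, 5)
)

fit <- lm(Count ~ Elevation, data = cawa_long)

# ANOVA table built from the linear model
tab <- anova(fit)
tab

# Among-group and within-group (residual) sums of squares
tab$`Sum Sq`
```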

---
# anova as a linear model

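One more way to fit the model is to drop the intercept (`- 1`), so that each coefficient is a group mean rather than a difference from the reference level: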

```r
fit.lm2 <- lm(Count ~ Elevation - 1, data = cawa_long)
summary(fit.lm2)
```

```
##              term estimate std.error statistic   p.value
## 1   ElevationHigh     5.25    0.7169     7.324 4.451e-05
## 2    ElevationLow     1.50    0.7169     2.092 6.592e-02
## 3 ElevationMedium     2.25    0.7169     3.139 1.195e-02
```

```r
head(model.matrix(fit.lm2), 5)
```

```
##   ElevationHigh ElevationLow ElevationMedium
## 1             0            1               0
## 2             0            0               1
## 3             1            0               0
## 4             0            1               0
## 5             0            0               1
```

---
# causal inference

#### Can we make causal inference about the effect of elevation on Canada Warbler abundance?

<br/>

### **Answer**: Definitely not!

<br/>

### **What was missing?**

---
# looking ahead

<br/>

#### **Next time:** Multiple comparisons

<br/>

#### **Reading:** Quinn chp. 3.4