class: center, middle, inverse, title-slide .title[ # LECTURE 7: linear models part 2: categorical predictor > 2 levels ] .subtitle[ ## FANR 6750 (Experimental design) ] .author[ ###
Fall 2023 ] --- class: inverse # outline #### 1) Motivating example <br/> -- #### 2) Extending the linear model <br/> -- #### 3) Dummy variables with > 2 levels <br/> -- #### 4) *F*-tests <br/> -- #### 5) Comparing linear model and ANOVA approaches --- # categorical predictors #### In lecture 4, we learned about fitting models with categorical predictors with 1 (measurement error) or 2 (frog) levels -- - Linear models with these categorical predictors correspond to one-sample or two-sample *t*-tests -- - Binary categorical predictors are quite common in ecological studies (e.g., male vs female, juvenile vs adult, disturbed vs undisturbed) -- #### However, you will often encounter categorical predictors with >2 levels - What are some examples? --- # motivating example Foresters are studying the effect of 4 different fertilizers (treatments) on the growth of loblolly pine, with 3 plots (replicates) receiving each treatment. Data are average height per plot after 5 years: .pull-left[ <br/> <table class="table table-condensed" style="font-size: 18px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="4"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Treatment</div></th> </tr> <tr> <th style="text-align:center;"> Replicate </th> <th style="text-align:center;"> A </th> <th style="text-align:center;"> B </th> <th style="text-align:center;"> C </th> <th style="text-align:center;"> D </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 11 </td> <td style="text-align:center;"> 9 </td> <td style="text-align:center;"> 7 </td> <td style="text-align:center;"> 5 </td> </tr> <tr> <td style="text-align:center;"> 2 </td> <td style="text-align:center;"> 10 </td> <td style="text-align:center;"> 8 </td> <td style="text-align:center;"> 6 </td> <td style="text-align:center;"> 4 </td> </tr> <tr> <td style="text-align:center;"> 3 </td> <td style="text-align:center;"> 9 </td> <td style="text-align:center;"> 7 </td> <td style="text-align:center;"> 5 </td> <td style="text-align:center;"> 3 </td> </tr> </tbody> </table> ] -- .pull-right[ #### Notation - The number of groups (treatments) is `\(\large a=4\)` - The number of observations within each group (replicates) is `\(\large n=3\)` - `\(\large y_{ij}\)` denotes the `\(\large j\)`th observation from the `\(\large i\)`th group ] --- # a brief tangent #### What counts as an observation? -- #### Experimental unit > the physical unit that receives a particular treatment -- #### Observational unit > the physical unit on which measurements are taken -- These are not always the same! -- Examples - Agricultural fields given different fertilizers, crop yield measured - Rats given different diets, disease state measured - Microcosms given different predator abundances, tadpole growth measured --- # motivating example **Question:** Is there a difference in growth among the four treatment groups? -- <img src="07_anova_files/figure-html/pine1-1.png" width="576" style="display: block; margin: auto;" /> --- # motivating example #### Hypotheses - `\(\large H_0 : \mu_A = \mu_B = \mu_C = \mu_D\)` - `\(\large H_a :\)` At least one inequality -- #### How should we test the null? 
-- We could do this using 6 *t*-tests <br/> -- But this would alter the overall (experiment-wise) `\(\large \alpha\)` level because each individual test has a chance (usually `\(\large \alpha = 0.05\)`) of incorrectly rejecting a true null hypothesis, and these errors compound across tests (with 6 independent tests at `\(\large \alpha = 0.05\)`, the probability of at least one false rejection is `\(1 - 0.95^6 \approx 0.26\)`) --- # extending the linear model #### Fortunately, extending the previous linear model to include additional predictor levels is "straightforward" - The model looks familiar `$$\large y_{ij} = \beta_0 + \beta_ix_{ij} + \epsilon_{ij}$$` `$$\large \epsilon_{ij} \sim normal(0, \sigma)$$` -- - What's tricky is how we code `\(\large x_{ij}\)` when it is more than just 0 or 1 -- - Fortunately, what we learned about how `R` codes dummy variables will help --- # dummy variable coding ```r data(pinedata) pinedata[c(1,4,7,10),] ``` ``` ## Replicate Treatment Height ## 1 1 A 11 ## 4 1 B 9 ## 7 1 C 7 ## 10 1 D 5 ``` ```r fit1 <- lm(Height ~ Treatment, data = pinedata) model.matrix(fit1)[c(1,4,7,10),] ``` ``` ## (Intercept) TreatmentB TreatmentC TreatmentD ## 1 1 0 0 0 ## 4 1 1 0 0 ## 7 1 0 1 0 ## 10 1 0 0 1 ``` --- # dummy variable coding ```r model.matrix(fit1)[c(1,4,7,10),] ``` ``` ## (Intercept) TreatmentB TreatmentC TreatmentD ## 1 1 0 0 0 ## 4 1 1 0 0 ## 7 1 0 1 0 ## 10 1 0 0 1 ``` Remember that the model matrix has one column for every parameter in the model - How many parameters is this model estimating? -- `$$\large E[y_{ij}] = \beta_0 + \beta_1 I(B)_{ij} + \beta_2 I(C)_{ij} + \beta_3 I(D)_{ij}$$` where `\(I(B/C/D)_{ij}\)` are dummy variables (0/1 depending on treatment) -- - What is the interpretation of each parameter? --- # motivating example ```r broom::tidy(fit1) ``` ``` ## # A tibble: 4 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 10.0 0.577 17.3 0.000000126 ## 2 TreatmentB -2.00 0.816 -2.45 0.0400 ## 3 TreatmentC -4.00 0.816 -4.90 0.00120 ## 4 TreatmentD -6.00 0.816 -7.35 0.0000801 ``` -- - Be sure you understand what each parameter means! -- - *t*-statistics and *p*-values are calculated and interpreted in the same way as the frog example -- - But wait, the null hypothesis was `\(\mu_A = \mu_B = \mu_C = \mu_D\)` --- # regression F-statistic .small-code[ ```r summary(fit1) ``` ``` ## ## Call: ## lm(formula = Height ~ Treatment, data = pinedata) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1 -1 0 1 1 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 10.000 0.577 17.32 1.3e-07 *** ## TreatmentB -2.000 0.816 -2.45 0.0400 * ## TreatmentC -4.000 0.816 -4.90 0.0012 ** ## TreatmentD -6.000 0.816 -7.35 8.0e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1 on 8 degrees of freedom ## Multiple R-squared: 0.882, Adjusted R-squared: 0.838 ## F-statistic: 20 on 3 and 8 DF, p-value: 0.000449 ``` ] --- # regression F-statistic The *F*-statistic tests whether the regression coefficients (other than the intercept) are all 0 - `\(H_0\)`: All regression coefficients are 0 - `\(H_A\)`: At least one regression coefficient is not 0 -- *F*-statistics are the ratio of *sample variances* `$$F = \frac{s^2_A}{s^2_B}$$` - Null hypothesis is that population variances are equal ( `\(\sigma^2_A = \sigma^2_B\)` ) - Always positive (variances cannot be negative) - Usually > 1 (by convention, the larger variance goes in the numerator) -- Where do the variances come from in the linear model and what do they tell us about differences among groups? 
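One way to preview the answer is to compute the two variances directly (a minimal sketch, not part of the original slides; it uses `pinedata` and `fit1` from the previous slides, but the variable names below are ours):

```r
# Sketch: among- and within-group variation for the pine data
grand_mean <- mean(pinedata$Height)                     # grand mean
group_mean <- ave(pinedata$Height, pinedata$Treatment)  # group mean for each observation

SSA <- sum((group_mean - grand_mean)^2)                 # variation among groups (= 60)
SSW <- sum((pinedata$Height - group_mean)^2)            # variation within groups (= 8)

a <- 4; n <- 3
MSA <- SSA / (a - 1)        # among-group variance (= 20)
MSW <- SSW / (a * (n - 1))  # within-group variance (= 1)
MSA / MSW                   # ratio (= 20) matches the F-statistic reported by summary(fit1)
```

The next slides develop exactly these quantities (sums of squares, mean squares, and their ratio).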
--- # regression F-statistic #### To understand why the test is based on variance, it is helpful to consider several types of means: -- .pull-left[ - Grand mean `$$\large \bar{y}. = \frac{\sum_i \sum_jy_{ij}}{a \times n}$$` ] .pull-right[ <img src="07_anova_files/figure-html/pine_grm-1.png" width="288" style="display: block; margin: auto;" /> ] --- # regression F-statistic #### To understand why the test is based on variance, it is helpful to consider several types of means: .pull-left[ - Grand mean `$$\large \bar{y}. = \frac{\sum_i \sum_jy_{ij}}{a \times n}$$` - Group means `$$\large \bar{y}_i = \frac{\sum_j y_{ij}}{n}$$` ] .pull-right[ <img src="07_anova_files/figure-html/pinedata_gm-1.png" width="288" style="display: block; margin: auto;" /> ] --- # regression F-statistic #### To understand why the test is based on variance, it is helpful to consider several types of means: .pull-left[ - Grand mean `$$\large \bar{y}. = \frac{\sum_i \sum_jy_{ij}}{a \times n}$$` - Group means `$$\large \bar{y}_i = \frac{\sum_j y_{ij}}{n}$$` We can now decompose the observations as `$$\large y_{ij} = \color{#446E9B}{\bar{y}.} + \color{#D47500}{(\bar{y}_i - \bar{y}.)} + \color{#3CB521}{(y_{ij} - \bar{y}_i)}$$` ] .pull-right[ <img src="07_anova_files/figure-html/pinedata_var-1.png" width="288" style="display: block; margin: auto;" /> ] --- # the additive model #### The decomposition `$$\Large y_{ij} = \color{#446E9B}{\bar{y}.} + \color{#D47500}{(\bar{y}_i - \bar{y}.)} + \color{#3CB521}{(y_{ij} - \bar{y}_i)}$$` -- #### The additive model `$$\Large y_{ij} = \color{#446E9B}{\mu} + \color{#D47500}{\alpha_i} + \color{#3CB521}{\epsilon_{ij}}$$` -- #### where `$$\Large \epsilon_{ij} \sim normal(0, \sigma^2)$$` --- # the additive model `$$\large y_{ij} = \mu + \alpha_i + \epsilon_{ij}$$` `$$\large \epsilon_{ij} \sim normal(0, \sigma^2)$$` #### Notes - `\(\large \mu\)` is the grand mean of the population, estimated by `\(\large \bar{y}.\)` -- - `\(\large \alpha_i\)` is the effect of treatment *i*, estimated by `\(\large\bar{y}_i - \bar{y}.\)` -- + It is the deviation of the group mean from the grand mean + If all `\(\large\alpha_i = 0\)`, there is no treatment effect + Thus, we can re-write the null hypothesis `\(H_0 : \alpha_1 = \alpha_2=... =\alpha_a = 0\)` -- - `\(\large \epsilon_{ij}\)` is the residual error, estimated by `\(\large y_{ij} - \bar{y}_i\)` + It is the unexplained (random) deviation of the observation from the group mean --- # sums of squares #### Variation among groups `$$\Large SS_A = n \sum_i (\bar{y}_i - \bar{y}.)^2$$` --- # motivating example **Question:** Is there a difference in growth among the four treatment groups? <img src="07_anova_files/figure-html/pinedata_ssa-1.png" width="576" style="display: block; margin: auto;" /> --- # sums of squares #### Variation among groups `$$\Large SS_A = n \sum_i (\bar{y}_i - \bar{y}.)^2$$` #### Variation within groups `$$\Large SS_W = \sum_i \sum_j (y_{ij} - \bar{y}_i)^2$$` --- # motivating example **Question:** Is there a difference in growth among the four treatment groups? 
<img src="07_anova_files/figure-html/pinedata_ssw-1.png" width="576" style="display: block; margin: auto;" /> --- # sums of squares #### Variation among groups `$$\Large SS_A = n \sum_i (\bar{y}_i - \bar{y}.)^2$$` #### Variation within groups `$$\Large SS_W = \sum_i \sum_j (y_{ij} - \bar{y}_i)^2$$` #### Total variation `$$\Large SS_T = SS_A + SS_W = \sum_i \sum_j (y_{ij} - \bar{y}.)^2$$` --- # motivating example **Question:** Is there a difference in growth among the four treatment groups? <img src="07_anova_files/figure-html/pinedata_sst-1.png" width="576" style="display: block; margin: auto;" /> --- # mean squares ### To convert the sums of squares to variances, divide by the degrees of freedom -- #### Mean squares among `$$\Large MS_A = \frac{SS_A}{a-1}$$` -- #### Mean squares within `$$\Large MS_W = \frac{SS_W}{a(n-1)}$$` --- # F-statistic <br/> <br/> `$$\LARGE F = \frac{MS_A}{MS_W}$$` -- #### Note: -- - *F* is the ratio of the variance among groups to the variance within groups -- - If there is a large treatment effect, what happens to `\(MS_A\)`? -- - If there is little residual variation, what happens to `\(MS_W\)`? -- - Large values of `\(F\)` indicate treatment effects are large relative to residual variation, but can we conclude there is a treatment effect? --- # the F-distribution #### **If the null hypothesis is true** ( `\(\sigma^2_A = \sigma^2_B\)`), the ratio of sample variances follows an *F*-distribution .pull-left[ #### Properties: - `\(F > 0\)` - `\(F\)`-distribution is not symmetrical - Shape of distribution depends on an ordered pair of degrees of freedom, `\(df_A\)` and `\(df_B\)` ] .pull-right[ <img src="07_anova_files/figure-html/f-1.png" width="288" style="display: block; margin: auto;" /> ] --- # f-distribution <br/> <img src="07_anova_files/figure-html/f2-1.png" width="504" style="display: block; margin: auto;" /> Just like the *t*-distribution, we can use the *F*-distribution to calculate *p*-values and test the null hypothesis that the variances are equal --- # analysis of variance #### Using the *F*-statistic to test whether there is a treatment effect is commonly referred to as *Analysis of Variance* (ANOVA) -- - Especially in experimental settings, ANOVA is very commonly used -- - The linear model approach is increasingly common, especially in observational settings -- - From a practical standpoint, the linear model and ANOVA approaches provide exactly the same information, just presented in different ways (just like the linear model vs t-test we saw previously) --- # worked example #### Suppose we are interested in the effect of elevation on the abundance of Canada Warblers .pull-left[ <img src="https://upload.wikimedia.org/wikipedia/commons/b/b1/8G7D5475-Canada.jpg" width="80%" /> ] -- .pull-right[ <table class="table table-condensed" style="font-size: 14px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="3"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Elevation</div></th> </tr> <tr> <th style="text-align:center;"> Replicate </th> <th style="text-align:center;"> Low </th> <th style="text-align:center;"> Medium </th> <th style="text-align:center;"> High </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 2 </td> <td style="text-align:center;"> 4 </td> </tr> <tr> <td style="text-align:center;"> 2 </td> <td style="text-align:center;"> 3 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 7 </td> </tr> <tr> <td style="text-align:center;"> 3 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 4 </td> <td style="text-align:center;"> 5 </td> </tr> <tr> <td style="text-align:center;"> 4 </td> <td style="text-align:center;"> 2 </td> <td style="text-align:center;"> 3 </td> <td style="text-align:center;"> 5 </td> </tr> </tbody> </table> ] ??? Image courtesy of William H. Majoros via Wikicommons --- # worked example .small-code[ ```r data(warblerdata) fit.lm <- lm(Count ~ Elevation, data = warblerdata) summary(fit.lm) ``` ``` ## ## Call: ## lm(formula = Count ~ Elevation, data = warblerdata) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.250 -0.688 -0.250 0.938 1.750 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.250 0.717 7.32 4.5e-05 *** ## ElevationLow -3.750 1.014 -3.70 0.0049 ** ## ElevationMedium -3.000 1.014 -2.96 0.0160 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.43 on 9 degrees of freedom ## Multiple R-squared: 0.63, Adjusted R-squared: 0.548 ## F-statistic: 7.66 on 2 and 9 DF, p-value: 0.0114 ``` ] --- # worked example ```r data(warblerdata) fit.aov <- aov(Count ~ Elevation, data = warblerdata) summary(fit.aov) ``` ``` ## Df Sum Sq Mean Sq F value Pr(>F) ## Elevation 2 31.5 15.75 7.66 0.011 * ## Residuals 9 18.5 2.06 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- # anova table from `lm` In case you want more detail about the relationship between `lm` and `aov` output: - The `lm()` function also returns residuals (i.e., `\(y_i - E[y_i]\)`) .small-code[ ```r fit.lm$residuals ``` ``` ## 1 2 3 4 5 6 7 8 9 10 11 12 ## -0.50 1.50 -1.50 0.50 -0.25 -2.25 1.75 0.75 -1.25 1.75 -0.25 -0.25 ``` ] - Residual sum of squares .small-code[ ```r sum(fit.lm$residuals^2) ``` ``` ## [1] 18.5 ``` ] - Residual mean square .small-code[ ```r sum(fit.lm$residuals^2)/9 ``` ``` ## [1] 2.056 ``` ] --- # anova table from `lm` In case you want more detail about the relationship between `lm` and `aov` output: What about among group variation? 
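One shortcut worth knowing (a sketch, not part of the original slides): calling `anova()` on the `lm` fit returns the same ANOVA table as `summary(fit.aov)`, with the among-group and residual sums of squares, mean squares, and *F*-statistic computed for you.

```r
# Sketch: the full ANOVA table directly from the lm fit
# (should match summary(fit.aov) from the previous slide)
anova(fit.lm)
```

The chunks below rebuild the among-group pieces of that table by hand.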
.small-code[ ```r fit.lm$fitted.values ``` ``` ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 1.50 1.50 1.50 1.50 2.25 2.25 2.25 2.25 5.25 5.25 5.25 5.25 ``` ] - Treatment sum of squares .small-code[ ```r sum((fit.lm$fitted.values - mean(fit.lm$fitted.values))^2) ``` ``` ## [1] 31.5 ``` ] - Treatment mean square .small-code[ ```r sum((fit.lm$fitted.values - mean(fit.lm$fitted.values))^2)/2 ``` ``` ## [1] 15.75 ``` ] --- # anova table from `lm` In case you want more detail about the relationship between `lm` and `aov` output: *F*-statistic and *p*-value ```r MSa <- sum((fit.lm$fitted.values - mean(fit.lm$fitted.values))^2)/2 MSe <- sum(fit.lm$residuals^2)/9 (F <- MSa/MSe) ``` ``` ## [1] 7.662 ``` ```r (p <- pf(F, 2, 9, lower.tail = FALSE)) ``` ``` ## [1] 0.0114 ``` --- # recap 1) Models with categorical variables can be fit using either `lm()` or `aov()` -- 2) The test traditionally known as *ANOVA* is just an extension of the linear model used to analyze a binary predictor variable -- 3) `R` codes categorical predictors using dummy variables, and `lm` coefficients correspond to the differences between each treatment level and the reference level -- 4) `lm()` and `aov()` provide the exact same information, just presented in different ways -- 5) ANOVA output is commonly used for manipulative experiments, linear model output for observational studies -- 6) But how do we tell which groups differ from each other (aside from just the reference level)... --- # looking ahead <br/> ### **Next time**: Multiple comparisons <br/> ### **Reading**: [Fieberg chp. 3.9](https://statistics4ecologists-v1.netlify.app/matrixreg#pairwise-comparisons) and [3.12](https://statistics4ecologists-v1.netlify.app/matrixreg#contrasts)