LECTURE 9: multiple regression

class: center, middle, inverse, title-slide

.title[
# LECTURE 9: multiple regression
]
.subtitle[
## FANR 6750 (Experimental design)
]
.author[
### <br/><br/><br/>Fall 2024
]

---

class: inverse

# outline

#### 1) Model structure

<br/>  
--

#### 2) Factor + continuous predictor (ANCOVA)

<br/> 
--

#### 3) Two factors (blocking design)

<br/>

#### 4) Two continuous predictors

<br/>

#### 5) Centering predictors

---
# review

`$$\large y_i = \beta_0 + \beta_1 \times X_i + \epsilon_i$$`
`$$\large \epsilon_i \sim normal(0, \sigma)$$`
So far, we have learned about linear models that contain only a single predictor variable `$X$`:

1) Single continuous variable = "simple" regression (lecture 2)

2) Single categorical predictor w/ one level = one-sample *t*-test (lecture 4)

3) Single categorical predictor w/ two levels = two-sample *t*-test (lecture 4)

4) Single categorical predictor w/ > 2 levels = ANOVA (lecture 7)

More often than not, your models will need to contain more than one predictor

---
# multiple regression

Models with more than one predictor go by many names (blocking, ANCOVA, factorial, etc) but are all forms of *multiple regression* of the form:

`$$\Large y_i = \beta_0 + \beta_1 \times X1_i + \beta_2 \times X2_i + \epsilon_i$$`
`$$\Large \epsilon_i \sim normal(0, \sigma)$$`

\*Note that this model only contains two predictors but multiple regression models often contain many predictors

Interpretation of the intercept `$\beta_0$`, residual error `$\epsilon_i$`, and residual variance `$\sigma$` remains the same as before

Interpretation of the slope coefficients `$\beta_1$`, `$\beta_2$`, etc. changes *slightly*

- Slopes are the expected change in the response variable `$y_i$` for a unit change in the corresponding predictor variable *while holding all other predictors constant*

- This is a subtle difference but an important one

---
# example

One of the most common reasons for using multiple regression models is to control for extraneous sources of variation (i.e., sources of variation that influence the response variable but are not of interest in and of themselves)

Why would we want to control for extraneous variation?

Perhaps we are raising desert tortoises (*Gopherus agassizii*) for release into the wild and are interested in whether different diets influence their weight gain

.pull-left[
<img src="https://upload.wikimedia.org/wikipedia/commons/d/dc/Desert_Tortoise_%28Gopherus_agassizii%292.jpg" width="80%" height="80%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="09_regression_files/figure-html/unnamed-chunk-2-1.png" width="360" style="display: block; margin: auto;" />
]

---
# example

A tortoises final weight, however, is not influenced only by diet. For example, it may also be influenced by it's starting body size

As we will see, accurately measuring the effect of diet requires taking into account each individual's initial body size. Multiple regression allows us to do this

---
# anova

Let's start by fitting a model we've already learned about:

.vsmall-code[

```r
fit.lm <- lm(weight ~ diet, data = dietdata)
summary(fit.lm)
```

```
## 
## Call:
## lm(formula = weight ~ diet, data = dietdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -76.39 -19.29  -0.37  19.78  54.61 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   261.19       7.69   33.94   <2e-16 ***
## dietLow         4.56      10.88    0.42    0.677    
## dietMed        11.02      10.88    1.01    0.316    
## dietHigh       25.28      10.88    2.32    0.024 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.8 on 56 degrees of freedom
## Multiple R-squared:  0.0989,	Adjusted R-squared:  0.0506 
## F-statistic: 2.05 on 3 and 56 DF,  p-value: 0.117
```

]

Conclusions?

---
# "ancova"

Now let's fit a slightly different model:

.vsmall-code[

```r
fit.lm2 <- lm(weight ~ diet + length, data = dietdata)
summary(fit.lm2)
```

```
## 
## Call:
## lm(formula = weight ~ diet + length, data = dietdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38.25 -12.18  -2.53  12.15  49.23 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  191.364      8.862   21.59  < 2e-16 ***
## dietLow       13.727      6.878    2.00  0.05091 .  
## dietMed       25.226      6.974    3.62  0.00065 ***
## dietHigh      30.784      6.833    4.51  3.5e-05 ***
## length         2.789      0.297    9.38  5.2e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.6 on 55 degrees of freedom
## Multiple R-squared:  0.654,	Adjusted R-squared:  0.628 
## F-statistic: 25.9 on 4 and 55 DF,  p-value: 4.14e-12
```

]

--
What changed?

---
# signal vs. noise

`$$\large y_i = \beta_0 + \beta_1 X1_i + \beta_2 X2_i + \epsilon_i$$`

In any statistical test, our goal is to detect a signal ( `$\beta$` ) in the presence of noise (residual variation `$\epsilon$`)

Our ability to do that depends on the strength of the signal (usually beyond our control) and the amount of noise (partially within our control)

- In the first model, where was all the variation in weight caused by variation in length?

- We can see this clearly by looking at the ANOVA tables for the two models

---
# anova tables

.small-code[

```r
anova(fit.lm)
```

```
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value Pr(>F)
## diet       3   5459    1820    2.05   0.12
## Residuals 56  49734     888
```

```r
anova(fit.lm2)
```

```
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## diet       3   5459    1820    5.23   0.003 ** 
## length     1  30615   30615   88.07 5.2e-13 ***
## Residuals 55  19119     348                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
]

Note the residual sum of squares and mean square error of the two models

---
class:inverse
# sums of squares

When calculating the sums of squares for multiple regression models, the order that we write the model matters

We wrote the model formula with diet first and length second (`weight ~ diet + length`). `R` calculates the sums of squares for diet first, then length

- Notice that the diet SS in the two previous tables are the same

If we wrote `weight ~ length + diet`, `R` calculates the length SS first, then the diet SS

- The diet SS tells us how much variation is explained by the treatment variable **after** accounting for the covariate

This *sequential* method of calculating is called the Type I sums of squares

- Type I SS is the default used by `R`

---
class:inverse
# sums of squares

.small-code[

```r
anova(fit.lm2)
```

```r
fit.lm3 <- lm(weight ~ length + diet, data = dietdata)
anova(fit.lm3)
```

```
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## length     1  27882   27882   80.21 2.5e-12 ***
## diet       3   8192    2731    7.86 0.00019 ***
## Residuals 55  19119     348                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
]

---
class:inverse
# sums of squares

Type I SS is only one of several ways of calculating sums of squares

- Type II and Type III SS use slightly different methods that do not depend on order

- For balanced designs, all three approaches will give the same answers

- For unbalanced designs, the correct approach depends on the hypotheses being tested

- Other software programs (SAS, SPSS) default to Type III so may give different results than `R` (but all programs have methods to calculate SS using each method)

---
# another example

In 1960, Herbert Stoddard established an experiment at Tall Timbers Research Station to study the effects of fire frequency on long-leaf pine forests in the Red Hills region of north Florida and southwest Georgia. The experiment originally consisted of eighty four, 0.5ha plots, each of which was assigned one of four treatments: 1-year, 2-year, or 3-year fire intervals or unburned. Stoddard recognized the soil type likely influenced response to fire, so treatment plots were allocated across the 3 primary soil types found on Tall Timbers

.pull-left[

]

.pull-right[

]
---
# another example

We will use a made-up data set from this real experiment, with the goal of understanding whether understory plant richness (measured as the number of understory vascular plants recorded on each plot) differs among burn treatments.

<table class="table table-condensed" style="font-size: 14px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
<tr>
<th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th>
<th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="4"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Fire frequency</div></th>
<th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th>
</tr>
  <tr>
   <th style="text-align:center;"> Replicate </th>
   <th style="text-align:center;"> Unburned </th>
   <th style="text-align:center;"> 1-year </th>
   <th style="text-align:center;"> 2-year </th>
   <th style="text-align:center;"> 3-year </th>
   <th style="text-align:center;"> Soil </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> 30 </td>
   <td style="text-align:center;"> 43 </td>
   <td style="text-align:center;"> 36 </td>
   <td style="text-align:center;"> 37 </td>
   <td style="text-align:center;"> Loamy sand </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 2 </td>
   <td style="text-align:center;"> 28 </td>
   <td style="text-align:center;"> 37 </td>
   <td style="text-align:center;"> 30 </td>
   <td style="text-align:center;"> 27 </td>
   <td style="text-align:center;"> Sandy loam </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 3 </td>
   <td style="text-align:center;"> 18 </td>
   <td style="text-align:center;"> 27 </td>
   <td style="text-align:center;"> 24 </td>
   <td style="text-align:center;"> 21 </td>
   <td style="text-align:center;"> Upland </td>
  </tr>
</tbody>
</table>

What are the experimental units? What are the observational units?

---
# another example

Fit the model with treatment only:

```r
data(firedata)
fm1 <- lm(species ~ interval, data = firedata)
anova(fm1)
```

```
## Analysis of Variance Table
## 
## Response: species
##           Df Sum Sq Mean Sq F value Pr(>F)
## interval   3    173    57.6     1.1    0.4
## Residuals  8    419    52.4
```

Conclusions?

---
# another example

Because the soil types differ in a variety of ways that may influence plant communities, we probably want to control for this variability in our analysis

We do that by adding it to the model:

.small-code[

```r
data(firedata)
fm2 <- lm(species ~ interval + soil, data = firedata)
anova(fm2)
```

```
## Analysis of Variance Table
## 
## Response: species
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## interval   3    173    57.6    13.4 0.00456 ** 
## soil       2    394   196.8    45.7 0.00023 ***
## Residuals  6     26     4.3                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
]

Conclusions?

---
# another example

Note that in this case, each treatment is applied exactly once within each soil type

Soil type itself is not a experimental treatment and is not replicated

This design is often referred to as a "randomized complete blocking" design (RCBD)

- The soil types are called "blocks" and should be identified as a source of variability **before** the experiment is conducted

- "Complete" refers to each treatment being represented in each block

- "Randomized" means that treatments are assigned randomly to each experimental unit

Again, blocks must be identified during the design phase. You should not "search" for blocks after the fact (why?)

---
# multiple regression

As these examples show, one reason to use a multiple regression model is to increase power to detect treatment effects

- Remember that `$F = MS_{treatment}/MS_{error}$`

- By explaining some of the residual variation, including predictors in the model reduces `$MS_{errer}$` and thereby increases `$F$`

- When one predictor is a factor and the other(s) is continuous, this model is often referred to as an *ANCOVA* (**An**alysis of **Cova**riance)

- When both predictors are factors and each treatment is included within each block, this model is often referred to a randomized complete block design

In experimental setting, ANCOVA and blocking are used when there is extraneous variation that we cannot control during the design phase

- Predictor/blocks are not of interest; only included to increase power

---
# interpreting model output

If we are only interested in the treatment effect and not the blocks/continuous predictor, the ANOVA table (possibly in combination with multiple comparisons) is often sufficient for interpreting the model

- This is often the case in experimental settings

However, in observational studies, we are often interested in the effects of both predictors

- In this case, it helps to understand the model structure and parameter interpretation in more detail

- Understanding how multiple regression models are structured will also help when you need to include different combinations of factors and continuous predictors

Luckily, you already have the tools exploring these models in detail...

---
# interpreting the ancova model

.small-code[

```r
broom::tidy(fit.lm2)
```

```
## # A tibble: 5 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   191.       8.86      21.6  1.56e-28
## 2 dietLow        13.7      6.88       2.00 5.09e- 2
## 3 dietMed        25.2      6.97       3.62 6.48e- 4
## 4 dietHigh       30.8      6.83       4.51 3.50e- 5
## 5 length          2.79     0.297      9.38 5.16e-13
```
]

- `Intercept` = predicted weight of control group **when length = 0**

- `dietLow` = difference between control and low diet

- `dietMed` = difference between control and medium diet

- `dietHigh` = difference between control and high diet

- `length` = predicted increase in weight for one unit increase in length

---
# interpreting the ancova model

#### Question: does the effect of length depend on diet treatment?

#### Answer: No. The model assumes the effect of length is the same for every treatment

---
# interpreting the rcbd model

.vsmall-code[

```r
summary(fm2)
```

```
## 
## Call:
## lm(formula = species ~ interval + soil, data = firedata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5000 -0.9167  0.0833  0.9375  2.2500 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       32.08       1.47   21.87  6.0e-07 ***
## interval1         10.33       1.69    6.10  0.00088 ***
## interval2          4.67       1.69    2.75  0.03310 *  
## interval3          2.67       1.69    1.57  0.16656    
## soilsandy loam    -6.25       1.47   -4.26  0.00532 ** 
## soilupland       -14.00       1.47   -9.54  7.6e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.07 on 6 degrees of freedom
## Multiple R-squared:  0.956,	Adjusted R-squared:  0.92 
## F-statistic: 26.3 on 5 and 6 DF,  p-value: 0.000518
```
]

This scenario (two factors) is a little more tricky. But, as before, we'll use the design matrix to help understand what each parameter means

---
# interpreting the rcbd model

.small-code[

```r
model.matrix(fm2)
```

```
##    (Intercept) interval1 interval2 interval3 soilsandy loam soilupland
## 1            1         0         0         0              1          0
## 2            1         0         0         0              0          0
## 3            1         0         0         0              0          1
## 4            1         1         0         0              1          0
## 5            1         1         0         0              0          0
## 6            1         1         0         0              0          1
## 7            1         0         1         0              1          0
## 8            1         0         1         0              0          0
## 9            1         0         1         0              0          1
## 10           1         0         0         1              1          0
## 11           1         0         0         1              0          0
## 12           1         0         0         1              0          1
## attr(,"assign")
## [1] 0 1 1 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$interval
## [1] "contr.treatment"
## 
## attr(,"contrasts")$soil
## [1] "contr.treatment"
```
]

---
# interpreting the rcbd model

- `Intercept` = predicted number of species in the unburned treatment in loamy sand

- `interval1` = difference between unburned and 1-year fire interval

- `interval2` = difference between unburned and 2-year fire interval

- `interval3` = difference between unburned and 3-year fire interval

- `soilsandy loam` = difference between loamy sand and sandy loam soils

- `soilupland` = difference between loamy sand and upland soils

Again, because there is no interaction between `interval` and `soil`, the model assumes that the effects of fire frequency are the same in every soil type (and that the soil differences are the same for every fire frequency)

---
# one more example

We have looked at multiple regression examples with two factors and with one factor and one continuous covariate, but what about multiple continuous covariates?

The `biomassdata` object contains (made up) biomass measurements (kg) as a function of rainfall (mm) and elevation (km)

```r
data("biomassdata")
head(biomassdata)
```

```
##   biomass rainfall elevation
## 1   96.12    166.0    4.4397
## 2  122.89    150.3    4.9462
## 3   69.14    195.3    5.8502
## 4   47.02    192.0    0.5679
## 5   20.56    211.6    2.5554
## 6   10.77    218.0    1.0106
```

---
# multiple regression model

.vsmall-code[

```r
fit.lm <- lm(biomass ~ rainfall + elevation, data = biomassdata)
broom::tidy(fit.lm)
```

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   320.      4.11        78.0 2.31e-51
## 2 rainfall       -1.44    0.0217     -66.4 3.97e-48
## 3 elevation       3.97    0.241       16.5 3.77e-21
```
]

- `Intercept` = predicted biomass when rainfall = 0 and elevation = 0

- `rainfall` = predicted change in biomass for 1mm increase in rainfall while holding elevation constant

- `elevation` = predicted change in biomass for 1km increase in elevation while holding rainfall constant

---
# centering and standardizing data

When fitting models to continuous covariates, it is common to center or standardize covariates.

Centering is done by subtracting the mean

.small-code[

```r
biomassdata$elevation.c <- biomassdata$elevation - mean(biomassdata$elevation)
biomassdata$rainfall.c <- biomassdata$rainfall - mean(biomassdata$rainfall)
```
]

When interpreting centered data:

- positive values indicate observations larger than the mean

- negative values indicate observation smaller than the mean

- units don't change

---
# multiple regression model

.vsmall-code[

```r
fit.lm <- lm(biomass ~ rainfall.c + elevation.c, data = biomassdata)
broom::tidy(fit.lm)
```

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    62.0     0.414      150.  1.24e-64
## 2 rainfall.c     -1.44    0.0217     -66.4 3.97e-48
## 3 elevation.c     3.97    0.241       16.5 3.77e-21
```
]

- `Intercept` = predicted biomass when rainfall and elevation are **at their mean**

- `rainfall` = predicted change in biomass for 1mm increase in rainfall while holding elevation constant

- `elevation` = predicted change in biomass for 1km increase in elevation while holding rainfall constant

---
# multiple regression model

---
# looking ahead

<br/>

### **Next time**: Interactions

<br/>

### **Reading**: [Fieberg chp. 3.8](https://statistics4ecologists-v1.netlify.app/matrixreg#models-with-interactions)