LECTURE 9: multiple regression

class: center, middle, inverse, title-slide

.title[
# LECTURE 9: multiple regression
]
.subtitle[
## FANR 6750 (Experimental design)
]
.author[
### <br/><br/><br/>Fall 2023
]

---

class: inverse

# outline

#### 1) Model structure

<br/>  
--

#### 2) Factor + continuous predictor (ANCOVA)

<br/> 
--

#### 3) Two factors (blocking design)

<br/>

#### 4) Two continuous predictors

<br/>

#### 5) Centering predictors

---
# review

`$$\large y_i = \beta_0 + \beta_1 \times X_i + \epsilon_i$$`
`$$\large \epsilon_i \sim normal(0, \sigma)$$`
So far, we have learned about linear models that contain only a single predictor variable `$X$`:

1) Single continuous variable = "simple" regression (lecture 2)

2) Single categorical predictor w/ one level = one-sample *t*-test (lecture 4)

3) Single categorical predictor w/ two levels = two-sample *t*-test (lecture 4)

4) Single categorical predictor w/ > 2 levels = ANOVA (lecture 7)

More often than not, your models will need to contain more than one predictor

---
# multiple regression

Models with more than one predictor go by many names (blocking, ANCOVA, factorial, etc) but are all forms of *multiple regression* of the form:

`$$\Large y_i = \beta_0 + \beta_1 \times X1_i + \beta_2 \times X2_i + \epsilon_i$$`
`$$\Large \epsilon_i \sim normal(0, \sigma)$$`

\*Note that this model only contains two predictors but multiple regression models often contain many predictors

Interpretation of the intercept `$\beta_0$`, residual error `$\epsilon_i$`, and residual variance `$\sigma$` remains the same as before

Interpretation of the slope coefficients `$\beta_1$`, `$\beta_2$`, etc. changes *slightly*

- Slopes are the expected change in the response variable `$y_i$` for a unit change in the corresponding predictor variable *while holding all other predictors constant*

- This is a subtle difference but an important one

---
# example

One of the most common reasons for using multiple regression models is to control for extraneous sources of variation (i.e., sources of variation that influence the response variable but are not of interest in and of themselves)

Why would we want to control for extraneous variation?

Perhaps we are raising desert tortoises (*Gopherus agassizii*) for release into the wild and are interested in whether different diets influence their weight gain

.pull-left[
<img src="https://upload.wikimedia.org/wikipedia/commons/d/dc/Desert_Tortoise_%28Gopherus_agassizii%292.jpg" width="80%" height="80%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="09_regression_files/figure-html/unnamed-chunk-2-1.png" width="360" style="display: block; margin: auto;" />
]

---
# example

A tortoises final weight, however, is not influenced only by diet. For example, it may also be influenced by it's starting body size

As we will see, accurately measuring the effect of diet requires taking into account each individual's initial body size. Multiple regression allows us to do this

---
# anova

Let's start by fitting a model we've already learned about:

.vsmall-code[

```r
fit.lm <- lm(weight ~ diet, data = dietdata)
summary(fit.lm)
```

```
## 
## Call:
## lm(formula = weight ~ diet, data = dietdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.637 -1.925 -0.037  1.977  5.458 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   26.118      0.769   33.96   <2e-16 ***
## dietLow        0.456      1.088    0.42    0.677    
## dietMed        1.102      1.088    1.01    0.315    
## dietHigh       2.529      1.088    2.33    0.024 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.98 on 56 degrees of freedom
## Multiple R-squared:  0.0991,	Adjusted R-squared:  0.0508 
## F-statistic: 2.05 on 3 and 56 DF,  p-value: 0.117
```

]

Conclusions?

---
# "ancova"

Now let's fit a slightly different model:

.vsmall-code[

```r
fit.lm2 <- lm(weight ~ diet + length, data = dietdata)
summary(fit.lm2)
```

```
## 
## Call:
## lm(formula = weight ~ diet + length, data = dietdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.821 -1.221 -0.252  1.216  4.919 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.1402     0.8858   21.61  < 2e-16 ***
## dietLow       1.3688     0.6875    1.99  0.05147 .  
## dietMed       2.5265     0.6973    3.62  0.00064 ***
## dietHigh      3.0830     0.6832    4.51  3.4e-05 ***
## length        0.5573     0.0594    9.38  5.2e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.86 on 55 degrees of freedom
## Multiple R-squared:  0.654,	Adjusted R-squared:  0.628 
## F-statistic: 25.9 on 4 and 55 DF,  p-value: 4.15e-12
```

]

--
What changed?

---
# signal vs. noise

`$$\large y_i = \beta_0 + \beta_1 X1_i + \beta_2 X2_i + \epsilon_i$$`

In any statistical test, our goal is to detect a signal ( `$\beta$` ) in the presence of noise (residual variation `$\epsilon$`)

Our ability to do that depends on the strength of the signal (usually beyond our control) and the amount of noise (partially within our control)

- In the first model, where was all the variation in weight caused by variation in length?

- We can see this clearly by looking at the ANOVA tables for the two models

---
# anova tables

.small-code[

```r
anova(fit.lm)
```

```
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value Pr(>F)
## diet       3     55   18.22    2.05   0.12
## Residuals 56    497    8.87
```

```r
anova(fit.lm2)
```

```
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## diet       3   54.6    18.2    5.24   0.003 ** 
## length     1  305.8   305.8   88.03 5.2e-13 ***
## Residuals 55  191.1     3.5                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
]

Note the residual sum of squares and mean square error of the two models

---
class:inverse
# sums of squares

When calculating the sums of squares for multiple regression models, the order that we write the model matters

We wrote the model formula with diet first and length second (`weight ~ diet + length`). `R` calculates the sums of squares for diet first, then length

- Notice that the diet SS in the two previous tables are the same

- This is called the Type I sums of squares

If we wrote `weight ~ length + diet`, `R` calculates the length SS first, then the diet SS

- The diet SS tells us how much variation is explained by the treatment variable **after** accounting for the covariate

- This is called the Type III sums of squares

- In experimental settings, Type III is generally preferred

---
class:inverse
# sums of squares

.small-code[

```r
anova(fit.lm2)
```

```r
fit.lm3 <- lm(weight ~ length + diet, data = dietdata)
anova(fit.lm3)
```

```
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## length     1  278.3   278.3   80.09 2.5e-12 ***
## diet       3   82.2    27.4    7.89 0.00018 ***
## Residuals 55  191.1     3.5                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
]

---
# another example

We are interested in comparing the alternative gypsy moth control strategies (Bt, Dimilin, no spray) in their effectiveness in controlling gypsy moth. Because sprayed areas are large, and
different treatments are applied on different ridges, extraneous variability due to location is expected. Data are the average number of moths captured in pheromone traps placed in the plots.

<table class="table table-condensed" style="font-size: 10px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
<tr>
<th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="1"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Region</div></th>
<th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="3"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Treatment</div></th>
</tr>
  <tr>
   <th style="text-align:right;">   </th>
   <th style="text-align:right;"> Control </th>
   <th style="text-align:right;"> Bt </th>
   <th style="text-align:right;"> Dimilin </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 25 </td>
   <td style="text-align:right;"> 16 </td>
   <td style="text-align:right;"> 14 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 15 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 16 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 32 </td>
   <td style="text-align:right;"> 18 </td>
   <td style="text-align:right;"> 12 </td>
  </tr>
</tbody>
</table>

What are the experimental units? What are the observational units?

---
# another example

Fit the model with treatment only:

```r
data(mothdata)
fm1 <- lm(Larvae ~ Treatment, data = mothdata)
anova(fm1)
```

```
## Analysis of Variance Table
## 
## Response: Larvae
##           Df Sum Sq Mean Sq F value Pr(>F)
## Treatment  2    223   111.6    1.84   0.21
## Residuals  9    546    60.6
```

Conclusions?

---
# another example

Because the ridges differ in a variety of ways that may influence moth abundance, we probably want to control for this variability in our analysis

We do that by adding it to the model:

.small-code[

```r
data(mothdata)
fm2 <- lm(Larvae ~ Treatment + Region, data = mothdata)
anova(fm2)
```

```
## Analysis of Variance Table
## 
## Response: Larvae
##           Df Sum Sq Mean Sq F value Pr(>F)  
## Treatment  2    223   111.6    5.83  0.039 *
## Region     3    431   143.6    7.51  0.019 *
## Residuals  6    115    19.1                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
]

Conclusions?

---
# another example

Note that in this case, each treatment is applied exactly once within each region

Region itself is not a experimental treatment and is not replicated

This design is often referred to as a "randomized complete blocking" design (RCBD)

- The regions are called "blocks" and should be identified as a source of variability **before** the experiment is conducted

- "Complete" refers to each treatment being represented in each block

- "Randomized" means that treatments are assigned randomly to each experimental unit

Again, blocks must be identified during the design phase. You should not "search" for blocks after the fact (why?)
---
# multiple regression

As these examples show, one reason to use a multiple regression model is to increase power to detect treatment effects

- Remember that `$F = MS_{treatment}/MS_{error}$`

- By explaining some of the residual variation, including predictors in the model reduces `$MS_{errer}$` and thereby increases `$F$`

When one predictor is a factor and the other(s) is continuous (as in this example), this model is often referred to as an *ANCOVA* (**An**alysis of **Cova**riance)

In experimental setting, ANCOVA is often used when there is an extraneous source of information that we cannot control during the design phase

- the continuous predictor is often not of interest, it is only included to increase power (similar to blocks)

In observational settings, the same model may be used but we often have have interest in both predictors

---
# interpreting model output

If we are only interested in the treatment effect and not the blocks/continuous predictor, the ANOVA table (possibly in combination with multiple comparisons) is often sufficient for interpreting the model

However, if we are interested in the effects of both predictors, it helps to understand the model structure and parameter interpretation in more detail

- Understanding how multiple regression models are structured will also help when you need to include different combinations of factors and continuous predictors

Luckily, you already have the tools exploring these models in detail...

---
# interpreting the ancova model

.small-code[

```r
broom::tidy(fit.lm2)
```

```
## # A tibble: 5 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   19.1      0.886      21.6  1.51e-28
## 2 dietLow        1.37     0.688       1.99 5.15e- 2
## 3 dietMed        2.53     0.697       3.62 6.36e- 4
## 4 dietHigh       3.08     0.683       4.51 3.41e- 5
## 5 length         0.557    0.0594      9.38 5.20e-13
```
]

- `Intercept` = predicted weight of control group **when length = 0**

- `dietLow` = difference between control and low diet

- `dietMed` = difference between control and medium diet

- `dietHigh` = difference between control and high diet

- `length` = predicted increase in weight for one unit increase in length

---
# interpreting the ancova model

#### Question: does the effect of length depend on diet treatment?

#### Answer: No. The model assumes the effect of length is the same for every treatment

---
# interpreting the rcbd model

.vsmall-code[

```r
summary(fm2)
```

```
## 
## Call:
## lm(formula = Larvae ~ Treatment + Region, data = mothdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.250 -1.021  0.167  0.604  5.750 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         24.42       3.09    7.89  0.00022 ***
## TreatmentBt         -8.75       3.09   -2.83  0.03001 *  
## TreatmentDimilin    -9.50       3.09   -3.07  0.02191 *  
## Region2            -13.33       3.57   -3.73  0.00971 ** 
## Region3             -4.67       3.57   -1.31  0.23924    
## Region4              2.33       3.57    0.65  0.53782    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.37 on 6 degrees of freedom
## Multiple R-squared:  0.851,	Adjusted R-squared:  0.726 
## F-statistic: 6.84 on 5 and 6 DF,  p-value: 0.0183
```
]

This scenario (two factors) is a little more tricky. But, as before, we'll use the design matrix to help understand what each parameter means

---
# interpreting the rcbd model

.small-code[

```r
model.matrix(fm2)
```

```
##    (Intercept) TreatmentBt TreatmentDimilin Region2 Region3 Region4
## 1            1           0                0       0       0       0
## 2            1           1                0       0       0       0
## 3            1           0                1       0       0       0
## 4            1           0                0       1       0       0
## 5            1           1                0       1       0       0
## 6            1           0                1       1       0       0
## 7            1           0                0       0       1       0
## 8            1           1                0       0       1       0
## 9            1           0                1       0       1       0
## 10           1           0                0       0       0       1
## 11           1           1                0       0       0       1
## 12           1           0                1       0       0       1
## attr(,"assign")
## [1] 0 1 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$Treatment
## [1] "contr.treatment"
## 
## attr(,"contrasts")$Region
## [1] "contr.treatment"
```
]

---
# interpreting the rcbd model

- `Intercept` = predicted number of larvae of control treatment in region 1

- `TreatmentBt` = difference between control and Bt treatments

- `TreatmentDimilin` = difference between control and dimilin treatments

- `Region2` = difference region 1 and region 2

- `Region 3` = difference region 1 and region 3

- `Region 4` = difference region 1 and region 4

Again, because there is no interaction between `Treatment` and `Region`, the model assumes that the effects of pesticide treatments are the same in every region (and that the regional differences are the same for every treatment level)

---
# one more example

We have looked at multiple regression examples with two factors and with one factor and one continuous covariate, but what about multiple continuous covariates?

The `biomassdata` object contains (made up) biomass measurements (kg) as a function of rainfall (mm) and elevation (km)

```r
data("biomassdata")
head(biomassdata)
```

```
##   biomass rainfall elevation
## 1   96.12    166.0    4.4397
## 2  122.89    150.3    4.9462
## 3   69.14    195.3    5.8502
## 4   47.02    192.0    0.5679
## 5   20.56    211.6    2.5554
## 6   10.77    218.0    1.0106
```

---
# multiple regression model

.vsmall-code[

```r
fit.lm <- lm(biomass ~ rainfall + elevation, data = biomassdata)
broom::tidy(fit.lm)
```

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   320.      4.11        78.0 2.31e-51
## 2 rainfall       -1.44    0.0217     -66.4 3.97e-48
## 3 elevation       3.97    0.241       16.5 3.77e-21
```
]

- `Intercept` = predicted biomass when rainfall = 0 and elevation = 0

- `rainfall` = predicted change in biomass for 1mm increase in rainfall while holding elevation constant

- `elevation` = predicted change in biomass for 1km increase in elevation while holding rainfall constant

---
# centering and standardizing data

When fitting models to continuous covariates, it is common to center or standardize covariates.

Centering is done by subtracting the mean

.small-code[

```r
biomassdata$elevation.c <- biomassdata$elevation - mean(biomassdata$elevation)
biomassdata$rainfall.c <- biomassdata$rainfall - mean(biomassdata$rainfall)
```
]

When interpreting centered data:

- positive values indicate observations larger than the mean

- negative values indicate observation smaller than the mean

- units don't change

---
# multiple regression model

.vsmall-code[

```r
fit.lm <- lm(biomass ~ rainfall.c + elevation.c, data = biomassdata)
broom::tidy(fit.lm)
```

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    62.0     0.414      150.  1.24e-64
## 2 rainfall.c     -1.44    0.0217     -66.4 3.97e-48
## 3 elevation.c     3.97    0.241       16.5 3.77e-21
```
]

- `Intercept` = predicted biomass when rainfall and elevation are **at their mean**

- `rainfall` = predicted change in biomass for 1mm increase in rainfall while holding elevation constant

- `elevation` = predicted change in biomass for 1km increase in elevation while holding rainfall constant

---
# multiple regression model

---
# looking ahead

<br/>

### **Next time**: Interactions

<br/>

### **Reading**: [Fieberg chp. 3.8](https://statistics4ecologists-v1.netlify.app/matrixreg#models-with-interactions)