LECTURE 18: linear regression (continued)

class: center, middle, inverse, title-slide

.title[
# LECTURE 18: linear regression (continued)
]
.subtitle[
## FANR 6750 (Experimental design)
]
.author[
### <br/><br/><br/>Fall 2022
]

---

# outline

<br/>

1) Motivation

<br/>

2) Linear models

<br/>

3) Example

<br/>

4) Matrix notation

---
# motivation

#### Why do we need this part of the course?

- We have been modeling all along

- Good experimental design + ANOVA is often the most direct route to causal inference

- However, it isn’t always possible (or even desirable) to control some aspects of the system being investigated

- When manipulative experiments aren’t possible, observational studies and predictive models can be the next best option

---
# what is a model?

#### Definition

> A model is an abstraction of reality used to describe the relationship between two or more variables

#### Types of models

- Conceptual

- Mathematical

- Statistical

#### Cautionary note

> All models are wrong but some are useful (George Box, 1976)

---
# statistical models

<br/>

#### What are they useful for?

--
- Formalizing hypotheses using math and probability

- Evaulating hypotheses by confronting models with data

- Predicting unobserved (including future) outcomes

---
# statistical models

#### Unlike many other types of models, statistical models are fitted to data

#### Two important components:

1) Deterministic component

- Equation for the expected value of the response variable

2) Stochastic component

- Probability distribution describing the differences between the expected values and the observed values

- In parametric statistics, we assume we know the distribution, but not the parameters of the distribution

---
class: inverse, center, middle

# linear models

---
# is this a linear model?

`$$\Large y = 20 + 0.5x$$`

---
# is this a linear model?

`$$\Large y = 20 + 0.5x - 0.3x^2$$`

---
# linear models

### All of the models we've covered this semester, including fixed-effects regression and ANOVA, are linear models

#### You must understand linear models before you can apply more advanced models such as GLMs, GAMS, GLMMs, etc. . .

---
# linear models

#### A linear model is an equation of the form:

`$$\large y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_px_{ip} + \epsilon_i$$`

where the `$\beta$`’s are coefficients, and the `$x$` values are predictor variables (or dummy variables for categorical predictors)

#### This equation is often expressed in matrix notation as:

`$$\Large y = \mathbf{X \beta} + \epsilon$$`
where `$\mathbf{X}$` is a **design matrix** and `$\mathbf{\beta}$` is a vector of coefficients

More on matrix notation later...

---
## INTERPRETING THE `$\beta$`'s

<br/>

#### You must be able to interpret the `$\large \beta$` coefficients for any model that you fit to your data

<br/>

#### A linear model might have dozens of continuous and categorical predictors variables, with dozens of associated `$\large \beta$` coefficients

<br/>

#### Linear models can also include polynomial terms and interactions

---
## INTERPRETING THE `$\beta$`'s

#### The intercept `$\large \beta_0$` is the expected value of `$\large y$`, when all `$\large x$`’s are 0

#### If `$\large x$` is a continuous explanatory variable:

- `$\large \beta$` can usually be interpreted as a slope parameter

- In this case, `$\large \beta$` is the change in `$\large y$` resulting from a 1 unit change in `$\large x$` (while holding the other predictors constant)

.pull-left[

```r
lm(y ~ x, data = df)
```

```
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Coefficients:
## (Intercept)            x  
##      11.346        0.957
```

]

.pull-right[

<img src="18_lm_files/figure-html/unnamed-chunk-6-1.png" width="648" style="display: block; margin: auto;" />
]

---
### INTERPRETING THE `$\large \beta$`'s FOR CATEGORICAL PREDICTORS

#### Things are more complicated for categorical explantory variables (i.e., factors) because they must be converted to dummy variables

#### There are many ways of creating dummy variables

In `R`, the default method for creating dummy variables from unordered factors works like this:

- One level of the factor is treated as a reference level

- The reference level is associated with the intercept

- The `$\beta$` coefficients for the other levels of the factor are differences from the reference level

#### The default method corresponds to:

```
options(contrasts=c("contr.treatment","contr.poly"))
```

---
### INTERPRETING THE `$\large \beta$`'s FOR CATEGORICAL PREDICTORS

#### Another common method for creating dummy variables results in `$\large \beta$`’s that can be interpretted as the `$\large \alpha$`’s from the additive models that we saw earlier in the class

#### With this method:

- The `$\beta$` associated with each level of the factor is the difference from the intercept

- The intercept can be interpeted as the grand mean if the continuous variables have been centered

- One of the levels of the factor will not be displayed because it is redundant when the intercept is estimated

#### This method corresponds to:

```
options(contrasts=c("contr.sum","contr.poly"))
```

---
class: inverse, middle, center

# example

---
# example

### The Island Scrub Jay (*Aphelocoma insularis*)

---
# example

---
# santa cruz data

#### Habitat data for all 2787 grid cells covering the island

```r
data("cruzData")
head(cruzData)
```

```
##        x       y elevation forest chaparral habitat seeds
## 1 230737 3774324       241      0         0     Oak   Low
## 2 231037 3774324       323      0         0    Pine   Med
## 3 231337 3774324       277      0         0    Pine  High
## 4 230437 3774024        13      0         0     Oak   Med
## 5 230737 3774024       590      0         0     Oak  High
## 6 231037 3774024       533      0         0     Oak   Low
```

---
# maps of predictor variables

### Elevation

---
# maps of predictor variables

### Forest cover

---
# questions

<br/>

--
1) How many jays are on the island?

<br/>

2) What environmental variables influence abundance?

<br/>

3) Can we predict consequences of environmental change?

---
# maps of predictor variables

### Chaparral and survey locations

---
# the (fake) jay data

```r
data("jayData")
head(jayData)
```

```
##        x       y elevation forest chaparral habitat seeds jays
## 1 258637 3764124       423   0.00      0.02     Oak   Med   34
## 2 261937 3769224       506   0.10      0.45     Oak   Med   38
## 3 246337 3764124       859   0.00      0.26     Oak  High   40
## 4 239437 3763524      1508   0.02      0.03    Pine   Med   43
## 5 239437 3767724       483   0.26      0.37     Oak   Med   36
## 6 236437 3769524       830   0.00      0.01     Oak   Low   39
```

---
# simple linear regression

```r
fm1 <- lm(jays ~ elevation, data = jayData)
broom::tidy(fm1)
```

???

Don't forget that you can also use `broom::glance()` to get other output in data.frame format

---
class: center, middle, inverse

# matrix notation

---
# matrix notation

<br/>

#### Linear models are often expressed in matrix notation

<br/>

#### There are two reasons for this:

- It is more compact and therefore easier to write

- Matrix multiplication is fast on a computer

---
# linear model

#### All of the fixed effects models that we have covered can be expressed this way:

`$$\Large y = \mathbf{X \beta} + \epsilon$$`

where

`$$\large \epsilon \sim normal(0, \sigma^2)$$`
--

#### Examples include

- Completely randomized ANOVA

- Randomized complete block designs with fixed block effects

- Factorial designs

- ANCOVA

---
# then how do they differ?

<br/>

- The design matrices are different

- And so are the number of parameters (coefficients) to be estimated

- Important to understand how to construct design matrix that includes categorical variables

---
# design matrix

<br/>

- A design matrix has `$N$` rows and `$K$` columns, where `$N$` is the total sample size and `$K$` is the number of coefficients (parameters) to be estimated

- The first column contains just 1’s. This column corresponds to the intercept (`$\beta_0$`)

- Continuous predictor variables appear unchanged in the design matrix

- Categorical predictor variables appear as dummy variables

- In `R`, the design matrix is created internally based on the formula that you provide

- The design matrix can be viewed using the `model.matrix()` function

---
# design matrix

#### Model

```r
fm1 <- lm(jays ~ elevation, data = jayData)
```

#### Design matrix

```r
X1 <- model.matrix(fm1)
head(X1, n = 4) # First 4 rows of design matrix
```

```
##   (Intercept) elevation
## 1           1       423
## 2           1       506
## 3           1       859
## 4           1      1508
```

#### Estimated `$\large \beta$` coefficients

```r
(beta.hat1 <- coef(fm1)) # Estimates of beta0 and beta1
```

```
## (Intercept)   elevation 
##   33.082808    0.008337
```

---
# matrix multiplication

<br/>

`$$\Large E(y) = \mathbf{X \beta}$$`

<br/>

`$$\large \begin{bmatrix}
    a\times x + b\times y + c\times z\\
    d\times x + e\times y + f\times z\\
    g\times x + h\times y + i\times z
\end{bmatrix} = \begin{bmatrix}
    a & b & c \\
    d & e & f \\
    g & h & i 
\end{bmatrix}\;\;
\mathbf \times \begin{bmatrix}
    x \\
    y \\
    z
\end{bmatrix}$$`

<br/>

where:

- The first matrix corresponds to the expected values of `$y$`

- The second matrix corresponds to the design matrix `$\mathbf{X}$`

- The third matrix corresponds to `$\mathbf{\beta}$`

---
# matrix multiplication

<br/>

`$$\Large E(y) = \mathbf{X \beta}$$` or `$$\Large E(y_i) = \beta_0 + \beta_1ELEV_i$$`

<br/>

```r
Ey1 <- X1 %*% beta.hat1 # Expected number of jays at each site
head(Ey1, 5)
```

```
##    [,1]
## 1 36.61
## 2 37.30
## 3 40.24
## 4 45.65
## 5 37.11
```

---
# simple linear regression

---
# multiple linear regression

```r
fm2 <- lm(jays ~ elevation + forest, data = jayData)
broom::tidy(fm2)
```

---
# design matrix for multiple regression

#### Design matrix

```r
X2 <- model.matrix(fm2)
head(X2, n = 4) # First 4 rows of design matrix
```

```
##   (Intercept) elevation forest
## 1           1       423   0.00
## 2           1       506   0.10
## 3           1       859   0.00
## 4           1      1508   0.02
```

#### Estimated `$\large \beta$` coefficients

```r
(beta.hat2 <- coef(fm2)) # Estimates of beta0 and beta1
```

```
## (Intercept)   elevation      forest 
##   33.065994    0.008337    0.294350
```

---
# matrix multiplication

`$$\Large E(y) = \mathbf{X \beta}$$` or 
`$$\large E(y_i) = \beta_0 + \beta_1ELEV_i + \beta_2FOREST_i$$`
<br/>

```r
Ey2 <- X2 %*% beta.hat2 # Expected number of jays at each site
head(Ey2, 5)
```

```
##    [,1]
## 1 36.59
## 2 37.31
## 3 40.23
## 4 45.64
## 5 37.17
```

---
# multiple linear regression

---
# one-way anova

```r
fm3 <- lm(jays ~ habitat, data = jayData)
broom::tidy(fm3)
```

---
# design matrix for anova

#### Model

```r
fm3 <- lm(jays ~ habitat, data=jayData)
```

#### Design matrix

```r
X3 <- model.matrix(fm3)
head(X3, n = 4) # First 4 rows of design matrix
```

```
##   (Intercept) habitatOak habitatPine
## 1           1          1           0
## 2           1          1           0
## 3           1          1           0
## 4           1          0           1
```

#### Estimated `$\large \beta$` coefficients

```r
(beta.hat3 <- coef(fm3)) # Estimates of beta0, beta1, beta2
```

```
## (Intercept)  habitatOak habitatPine 
##      35.875       3.493       2.039
```

---
# matrix multiplication

`$$\Large E(y) = \mathbf{X \beta}$$` or 
`$$\large E(y_i) = \beta_0 + \beta_1OAK_i + \beta_2PINE_i$$`
<br/>

```r
Ey3 <- X3 %*% beta.hat3 # Expected number of jays at each site
head(Ey3, 5)
```

```
##    [,1]
## 1 39.37
## 2 39.37
## 3 39.37
## 4 37.91
## 5 39.37
```

---
# one-way anova

---
# ancova

```r
fm4 <- lm(jays ~ habitat + elevation, data = jayData)
broom::tidy(fm4)
```

---
# design matrix for ancova

#### Model

```r
fm4 <- lm(jays ~  habitat + elevation, data = jayData)
```

#### Design matrix

```r
X4 <- model.matrix(fm4)
head(X4, n = 4) # First 4 rows of design matrix
```

```
##   (Intercept) habitatOak habitatPine elevation
## 1           1          1           0       423
## 2           1          1           0       506
## 3           1          1           0       859
## 4           1          0           1      1508
```

#### Estimated `$\large \beta$` coefficients

```r
(beta.hat4 <- coef(fm4)) # Beta estimates
```

```
## (Intercept)  habitatOak habitatPine   elevation 
##   30.716372    3.166148    1.695456    0.008289
```

---
# matrix multiplication

`$$\Large E(y) = \mathbf{X \beta}$$` or 
`$$\large E(y_i) = \beta_0 + \beta_1OAK_i + \beta_2PINE_i+ \beta_3ELEV_i$$`
<br/>

```r
Ey4 <- X4 %*% beta.hat4 # Expected number of jays at each site
head(Ey4, 5)
```

```
##    [,1]
## 1 37.39
## 2 38.08
## 3 41.00
## 4 44.91
## 5 37.89
```

---
# ancova

---
# continuous-categorical interaction

```r
fm5 <- lm(jays ~ habitat * elevation, data = jayData)
broom::tidy(fm5)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 31.6544 </td>
   <td style="text-align:right;"> 1.4463 </td>
   <td style="text-align:right;"> 21.8861 </td>
   <td style="text-align:right;"> 0.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatOak </td>
   <td style="text-align:right;"> 2.4287 </td>
   <td style="text-align:right;"> 1.5652 </td>
   <td style="text-align:right;"> 1.5516 </td>
   <td style="text-align:right;"> 0.1241 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatPine </td>
   <td style="text-align:right;"> 0.4000 </td>
   <td style="text-align:right;"> 1.5799 </td>
   <td style="text-align:right;"> 0.2532 </td>
   <td style="text-align:right;"> 0.8007 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> elevation </td>
   <td style="text-align:right;"> 0.0068 </td>
   <td style="text-align:right;"> 0.0020 </td>
   <td style="text-align:right;"> 3.3931 </td>
   <td style="text-align:right;"> 0.0010 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatOak:elevation </td>
   <td style="text-align:right;"> 0.0012 </td>
   <td style="text-align:right;"> 0.0022 </td>
   <td style="text-align:right;"> 0.5592 </td>
   <td style="text-align:right;"> 0.5774 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatPine:elevation </td>
   <td style="text-align:right;"> 0.0020 </td>
   <td style="text-align:right;"> 0.0022 </td>
   <td style="text-align:right;"> 0.9508 </td>
   <td style="text-align:right;"> 0.3441 </td>
  </tr>
</tbody>
</table>

---
# continuous-categorical interaction

---
# quadratic effect of elevation

```r
fm6 <- lm(jays ~ elevation + I(elevation^2), data = jayData)
broom::tidy(fm6)
```

---
# quadratic effect of elevation

---
# interaction and quadratic effects

```r
fm7 <- lm(jays ~ habitat * forest + elevation + I(elevation^2), data = jayData)
broom::tidy(fm7)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 29.1977 </td>
   <td style="text-align:right;"> 1.0304 </td>
   <td style="text-align:right;"> 28.338 </td>
   <td style="text-align:right;"> 0.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatOak </td>
   <td style="text-align:right;"> 3.7052 </td>
   <td style="text-align:right;"> 0.8433 </td>
   <td style="text-align:right;"> 4.394 </td>
   <td style="text-align:right;"> 0.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatPine </td>
   <td style="text-align:right;"> 2.2160 </td>
   <td style="text-align:right;"> 0.8757 </td>
   <td style="text-align:right;"> 2.531 </td>
   <td style="text-align:right;"> 0.0131 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> forest </td>
   <td style="text-align:right;"> 40.0659 </td>
   <td style="text-align:right;"> 27.8020 </td>
   <td style="text-align:right;"> 1.441 </td>
   <td style="text-align:right;"> 0.1529 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> elevation </td>
   <td style="text-align:right;"> 0.0122 </td>
   <td style="text-align:right;"> 0.0023 </td>
   <td style="text-align:right;"> 5.285 </td>
   <td style="text-align:right;"> 0.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> I(elevation^2) </td>
   <td style="text-align:right;"> 0.0000 </td>
   <td style="text-align:right;"> 0.0000 </td>
   <td style="text-align:right;"> -1.721 </td>
   <td style="text-align:right;"> 0.0886 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatOak:forest </td>
   <td style="text-align:right;"> -42.9246 </td>
   <td style="text-align:right;"> 27.8496 </td>
   <td style="text-align:right;"> -1.541 </td>
   <td style="text-align:right;"> 0.1267 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatPine:forest </td>
   <td style="text-align:right;"> -39.1819 </td>
   <td style="text-align:right;"> 27.8386 </td>
   <td style="text-align:right;"> -1.407 </td>
   <td style="text-align:right;"> 0.1627 </td>
  </tr>
</tbody>
</table>

---
# Predict jay abundance at each grid cell

```r
E5 <- predict(fm5, type="response", newdata=cruzData, interval="confidence")
```

```r
E5 <- cbind(cruzData[,c("x","y")], E5)
head(E5)
```

```
##        x       y   fit   lwr   upr
## 1 230737 3774324 36.01 35.14 36.87
## 2 231037 3774324 34.91 34.02 35.79
## 3 231337 3774324 34.50 33.57 35.43
## 4 230437 3774024 34.19 33.02 35.36
## 5 230737 3774024 38.79 38.23 39.36
## 6 231037 3774024 38.34 37.75 38.93
```

---
# map the predictions

---
# map the predictions

---
# map the predictions

---
# future scenarios

#### What if pine and oak disappear?

---
# future scenarios

#### What if pine and oak disappear?

---
# summary

<br/>

- Linear models are the foundation of modern statistical modeling techniques

<br/>

- They can be used to model a wide array of biological processes, and they can be easily extended when their assumptions do not hold

<br/>

- One of the most important extensions is to cases where the residuals are not normally distributed. Generalized linear models address this issue