LECTURE 19: generalized linear models

class: center, middle, inverse, title-slide

.title[
# LECTURE 19: generalized linear models
]
.subtitle[
## FANR 6750 (Experimental design)
]
.author[
### <br/><br/><br/>Fall 2022
]

---

# outline

<br/>

1) Motivation

<br/>

2) Logistic regression

<br/>

3) Poisson regression

---
# motivation

#### Benefits of generalized linear models

- The residuals don’t have to be normally distributed

- The response variable can be binary, integer, strictly-positive, etc...

- The variance is not assumed to be constant

- Useful for manipulative experiments or observational studies, just like linear models

#### Examples

- Presence-absence studies

- Studies of survival

- Seed germination studies

---
# from linear to generalized linear

#### Linear model

`$$\Large E(y_i) = \beta_0 + \beta_1x_{i1} +\beta_2x_{i2} + ...$$`
`$$\Large y_i \sim normal(E(y_i), \sigma^2)$$`

#### Generalized linear model

`$$\Large g\big(E(y_i)\big) = \beta_0 + \beta_1x_{i1} +\beta_2x_{i2} + ...$$`
`$$\Large y_i \sim f(E(y_i))$$`

where `$g()$` is a **link function**, such as the log or logit link and `$f$` is a probability distribution such as the binomial or Poisson

---
# alternative representations

#### This:
`$$\Large g\big(E(y_i)\big) = \beta_0 + \beta_1x_{i1} +\beta_2x_{i2} + ...$$`
`$$\Large y_i \sim f(E(y_i))$$`

#### Is the same as this:

`$$\Large E(y_i) = g^{-1}(\beta_0 + \beta_1x_{i1} +\beta_2x_{i2} + ...)$$`
`$$\Large y_i \sim f(E(y_i))$$`

#### Is the same as this:

`$$\Large g\big(E(y_i)\big) = \mathbf{X \beta}$$`
`$$\Large y_i \sim f(E(y_i))$$`

---
# link functions

<table>
 <thead>
  <tr>
   <th style="text-align:center;"> Distribution </th>
   <th style="text-align:center;"> link name </th>
   <th style="text-align:center;"> link equation </th>
   <th style="text-align:center;"> inverse link equation </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> Binomial </td>
   <td style="text-align:center;"> logit </td>
   <td style="text-align:center;"> $log(\frac{p}{1-p})$ </td>
   <td style="text-align:center;"> $\frac{e^{X\beta}}{1+e^{X\beta}}$ </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Poisson </td>
   <td style="text-align:center;"> log </td>
   <td style="text-align:center;"> $log(\lambda)$ </td>
   <td style="text-align:center;"> $e^{X\beta}$ </td>
  </tr>
</tbody>
</table>

<br/>

<table>
 <thead>
  <tr>
   <th style="text-align:center;"> Distribution </th>
   <th style="text-align:center;"> link name </th>
   <th style="text-align:center;"> link in R </th>
   <th style="text-align:center;"> inverse link in R </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> Binomial </td>
   <td style="text-align:center;"> logit </td>
   <td style="text-align:center;"> qlogis </td>
   <td style="text-align:center;"> plogis </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Poisson </td>
   <td style="text-align:center;"> log </td>
   <td style="text-align:center;"> log </td>
   <td style="text-align:center;"> exp </td>
  </tr>
</tbody>
</table>

---
# logit link example

```r
beta0 <- 5
beta1 <- -0.08
(logit.p <- beta0 + beta1*100) # linear predictor for x = 100
```

```
## [1] -3
```

#### How do we convert -3 to a probability? Use the inverse-link:

```r
(p <- exp(logit.p)/(1+exp(logit.p))) # same as plogis(logit.p)
```

```
## [1] 0.04743
```

#### To go back, use the link function itself:

```r
log(p/(1-p))
```

```
## [1] -3
```

```r
qlogis(p)
```

```
## [1] -3
```

---
# logit link example

```r
plot(function(x) 5 + -0.08*x, from=0, to=100,
xlab="Elevation", ylab="logit(prob of occurrence)")
```

---
# logit link example

```r
plot(function(x) plogis(5 + -0.08*x), from=0, to=100,
xlab="Elevation", ylab="Probability of occurrence")
```

---
class: inverse, middle, center
# logistic regression

---
# logistic regression

#### Logistic regression is a specific type of GLM in which the response variable follows a binomial distribution and the link function is the logit

#### It would be better to call it “binomial regression” since other link functions (e.g. the probit) can be used

#### Appropriate when the response is binary or a count with an upper limit

#### Examples:

- Presence/absence studies

- Survival studies

- Disease prevalance studies

---
# logistic regression

<br/>

`$$\Large logit(p_i) = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ...$$`
`$$\Large y_i \sim binomial(N, p_i)$$`
--

#### where:

- `$\large N$` is the number of “trials” (e.g. coin flips)

- `$\large p_i$` is the probability of success for sample unit *i*

---
# binomial distribution

<br/>

---
# binomial distribution

<br/>

---
# binomial distribution

#### Properties

- The expected value of `$\large y$` is `$\large Np$`

- The variance is `$\large Np(1 - p)$`

#### Bernoulli distribution

- The Bernoulli distribution is a binomial distribution with a single trial (N = 1)

- Logistic regression is usually done in this context, such that the response variable is 0/1 or No/Yes or Bad/Good, etc.

---
# worked example using glm

First we will model the presence-absence response variable to determine if elevation and habitat affect the probability of occurrence

```r
data("frogData")
head(frogData, n=12)
```

```
##    presence abundance elevation habitat
## 1         0         0        58     Oak
## 2         1         7       191     Oak
## 3         0         0        43     Oak
## 4         1        11       374     Oak
## 5         1        11       337     Oak
## 6         1         1        64     Oak
## 7         1         4       195     Oak
## 8         1         6       263     Oak
## 9         0         0       181     Oak
## 10        1         1        59     Oak
## 11        1        50       489   Maple
## 12        1         5       317   Maple
```

---
# raw data

```r
ggplot(frogData, aes(x = elevation, y = presence)) +
  geom_point() +
  scale_y_continuous("Frog Occurrence") + scale_x_continuous("Elevation")
```

---
# raw data

```r
library(dplyr)

frogData %>% group_by(habitat) %>% summarise(group.prob = mean(presence)) %>%
  ggplot(., aes(x = habitat, y = group.prob)) +
  geom_col(fill = "grey70", color = "black") +
  scale_y_continuous("Proportion of sites with frogs") + scale_x_discrete("Habitat")
```

---
# the `glm` function

<br/>

```r
fm1 <- glm(presence ~ habitat + elevation, 
           family=binomial(link="logit"), 
           data = frogData)
broom::tidy(fm1)
```

<br/>

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> -0.9960 </td>
   <td style="text-align:right;"> 1.217 </td>
   <td style="text-align:right;"> -0.8184 </td>
   <td style="text-align:right;"> 0.4131 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatOak </td>
   <td style="text-align:right;"> -0.0968 </td>
   <td style="text-align:right;"> 1.367 </td>
   <td style="text-align:right;"> -0.0708 </td>
   <td style="text-align:right;"> 0.9436 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatPine </td>
   <td style="text-align:right;"> -0.3372 </td>
   <td style="text-align:right;"> 1.382 </td>
   <td style="text-align:right;"> -0.2441 </td>
   <td style="text-align:right;"> 0.8072 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> elevation </td>
   <td style="text-align:right;"> 0.0137 </td>
   <td style="text-align:right;"> 0.006 </td>
   <td style="text-align:right;"> 2.2723 </td>
   <td style="text-align:right;"> 0.0231 </td>
  </tr>
</tbody>
</table>

---
# occurrence probability and elevation

```r
newdat <- data.frame(elevation=seq(12, 489, length = 50), 
                     habitat="Oak")
head(newdat)
```

```
##   elevation habitat
## 1     12.00     Oak
## 2     21.73     Oak
## 3     31.47     Oak
## 4     41.20     Oak
## 5     50.94     Oak
## 6     60.67     Oak
```

To get confidence intervals on (0,1) scale, predict on linear (link) scale and then backtransform using inverse-link

```r
pred.link <- predict(fm1, newdata=newdat, se.fit=TRUE, type="link")
newdat$mu <- plogis(pred.link$fit)
newdat$lower <- plogis(pred.link$fit - 1.96*pred.link$se.fit)
newdat$upper <- plogis(pred.link$fit + 1.96*pred.link$se.fit)
```

---
# occurrence probability and elevation

---
class: center, middle, inverse

# poisson regression

---
# poisson regression

<br/>

`$$\Large log(\lambda_i) = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ...$$`
`$$\Large y_i \sim Poisson(\lambda_i)$$`

<br/>

#### where

- `$\large \lambda_i$` is the expected value of `$\large y_i$`

---
# poisson regression

#### Useful for:

- Count data

- Number of events in time intervals

- Other types of integer data

#### Properties

- The expected value of *y* ( `$\large \lambda$`) is equal to the variance

- This is an assumption of the Poisson model

- Like all assumptions, it can be relaxed if you have enough data

---
# poisson distribution

<br/>

---
# poisson distribution

<br/>

---
# poisson distribution

<br/>

---
# log link example

```r
plot(function(x) 5 + -0.08*x, from=0, to=100,
xlab="Elevation", ylab="log(Expected abundance)")
```

---
# log link example

```r
plot(function(x) exp(5 + -0.08*x), from=0, to=100,
xlab="Elevation", ylab="Expected abundance")
```

---
# the `glm` function

<br/>

```r
fm2 <- glm(abundance ~ habitat + elevation, family=poisson(link="log"), 
           data=frogData)
broom::tidy(fm2)
```

<br/>

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> -0.7870 </td>
   <td style="text-align:right;"> 0.2941 </td>
   <td style="text-align:right;"> -2.6756 </td>
   <td style="text-align:right;"> 0.0075 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatOak </td>
   <td style="text-align:right;"> -0.1534 </td>
   <td style="text-align:right;"> 0.1971 </td>
   <td style="text-align:right;"> -0.7782 </td>
   <td style="text-align:right;"> 0.4364 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatPine </td>
   <td style="text-align:right;"> -0.0653 </td>
   <td style="text-align:right;"> 0.1104 </td>
   <td style="text-align:right;"> -0.5914 </td>
   <td style="text-align:right;"> 0.5543 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> elevation </td>
   <td style="text-align:right;"> 0.0098 </td>
   <td style="text-align:right;"> 0.0006 </td>
   <td style="text-align:right;"> 15.5507 </td>
   <td style="text-align:right;"> 0.0000 </td>
  </tr>
</tbody>
</table>

---
# prediction

```r
newdat <- data.frame(elevation = seq(12, 489, length = 50),
                     habitat = "Oak")
head(newdat)
```

```
##   elevation habitat
## 1     12.00     Oak
## 2     21.73     Oak
## 3     31.47     Oak
## 4     41.20     Oak
## 5     50.94     Oak
## 6     60.67     Oak
```

To get confidence intervals on (0, `$\infty$`) scale, predict on linear (link) scale and then backtransform using inverse-link

```r
pred.link <- predict(fm2, newdata=newdat, se.fit=TRUE, type="link")
newdat$mu <- exp(pred.link$fit)
newdat$lower <- exp(pred.link$fit - 1.96*pred.link$se.fit)
newdat$upper <- exp(pred.link$fit + 1.96*pred.link$se.fit)
```

---
# prediction

<br/>

---
# assessing model fit

#### The most common problem in Poisson regression is overdispersion

#### Overdispersion is the situation in which there is more variability in the data than predicted by the model

#### Overdispersion cannot be assessed by simply comparing the mean and variance of the response variable

### The presence of many zeros is not necessarily indicative of overdispersion

### Overdispersion can be assessed using a goodness-of-fit test

---
# goodness-of-fit

#### The fit of a Poisson regression can be assessed using a `$\large \chi^2$` test

The test statistic is the residual deviance:

`$$\Large D = 2\bigg[\sum y_i log\bigg(\frac{y_i}{\hat{\lambda}_i}\bigg)-(y_i-\hat{\lambda}_i)\bigg]$$`

If the null hypothesis is true (ie, the model fits the data), `$D$` should follow `$\chi^2$` distribution with `$N - K$` degrees-of-freedom

```r
N <- nrow(frogData) # sample size
K <- length(coef(fm2)) # number of parameters
df.resid <- N-K # degrees-of-freedom
Dev <- deviance(fm2) # residual deviance
(p.value <- 1-pchisq(Dev, df=df.resid))# p-value
```

```
## [1] 0.0121
```

---
### `$\large \chi^2$` DISTRIBUTION AND RESIDUAL DEVIANCE

<br/>

```r
curve(dchisq(x, df=df.resid), from=0, to=50, xlab="Deviance", ylab="Density")
abline(v=Dev, lwd=3, col="red")
```

--
The red line is the residual deviance. We reject the null hypothesis, and we conclude that the Poisson model does not fit the data

---
# what if model doesn't fit the data?

<br/>

#### Alternatives to the Poisson distribution

- Negative binomial

- Zero-inflated Poisson

---
# negative binomial distribution

---
# negative binomial distribution

---
# negative binomial distribution