LECTURE 18: logistic regression

class: center, middle, inverse, title-slide

.title[
# LECTURE 18: logistic regression
]
.subtitle[
## FANR 6750 (Experimental design)
]
.author[
### <br/><br/><br/>Fall 2023
]

---

# outline

1) Motivation

<br/>

2) Model

<br/>

3) Model fitting

<br/>
--

4) Model interpretation

---
# logistic regression

Logistic regression is a specific type of GLM in which the response variable follows a binomial distribution and the link function is (usually) the logit

Logistic regression is used to model a binary response variable as a function of explanatory variables

#### Examples:

- Is presence of a focal species related to habitat type?

- Is survival a function of body condition?

- Is disease prevalence related to population density?

---
# logistic regression

<br/>

`$$\large y_i \sim Bernoulli(p_i)$$`

`$$\large log\bigg(\frac{p_i}{1 - p_i}\bigg) = logit(p_i) = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ...$$`

#### where:

- `$\large y_i$` is the response (coded as 0/1)

- `$\large p_i$` is the probability of success for sample unit *i*

- `$\large \frac{p_i}{1 - p_i}$` is the *odds* of success

- `$\large log\big(\frac{p_i}{1 - p_i}\big)$` is the log odds

---
# logit link

The logit link  allows us to transform the probability `$p_i$` (constrained to `$0$` to `$1$`) to the real scale ( `$-\infty$` to `$\infty$`)

`$$\large \log\bigg(\frac{0.5}{1-0.5}\bigg) = log(1) = 0$$`
--

The inverse logit `$\frac{e^\mu}{1+e^\mu}$` allows us to transform the linear predictor to the probability scale

`$$\large \frac{e^0}{1+e^0} = \frac{1}{2}$$`

The odds ratio `$\frac{p}{1-p}$` varies from `$0$` to `$\infty$` and indicate the probability of success relative to the probability of failure

- `$1:2$` odds = `$0.33/(1 - 0.33)$` = failure is twice as likely as success

- `$4:1$` odds = `$0.8/(1 - 0.8)$` = success if four times a likely as failure

---
# logit link

<br/>

---
# example

Imagine we are interested in the effects of elevation and habitat on the probability of occurrence for a rare orchid

```r
data("orchidata")
head(orchiddata, n=12)
```

```
##    presence abundance elevation habitat
## 1         0         0        58     Oak
## 2         1         7       191     Oak
## 3         0         0        43     Oak
## 4         1        11       374     Oak
## 5         1        11       337     Oak
## 6         1         1        64     Oak
## 7         1         4       195     Oak
## 8         1         6       263     Oak
## 9         0         0       181     Oak
## 10        1         1        59     Oak
## 11        1        50       489   Maple
## 12        1         5       317   Maple
```

---
# raw data

```r
ggplot(orchiddata, aes(x = elevation, y = presence)) +
  geom_point() +
  scale_y_continuous("Orchid Occurrence") + scale_x_continuous("Elevation")
```

---
# raw data

```r
library(dplyr)

orchiddata %>% group_by(habitat) %>% summarise(group.prob = mean(presence)) %>%
  ggplot(., aes(x = habitat, y = group.prob)) +
  geom_col(fill = "grey70", color = "black") +
  scale_y_continuous("Proportion of sites with orchids") + scale_x_discrete("Habitat")
```

---
# the `glm` function

<br/>

```r
fm1 <- glm(presence ~ habitat + elevation, 
           family=binomial(link="logit"), 
           data = orchiddata)
broom::tidy(fm1)
```

<br/>

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> -0.9960 </td>
   <td style="text-align:right;"> 1.217 </td>
   <td style="text-align:right;"> -0.8184 </td>
   <td style="text-align:right;"> 0.4131 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatOak </td>
   <td style="text-align:right;"> -0.0968 </td>
   <td style="text-align:right;"> 1.367 </td>
   <td style="text-align:right;"> -0.0708 </td>
   <td style="text-align:right;"> 0.9436 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatPine </td>
   <td style="text-align:right;"> -0.3372 </td>
   <td style="text-align:right;"> 1.382 </td>
   <td style="text-align:right;"> -0.2441 </td>
   <td style="text-align:right;"> 0.8072 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> elevation </td>
   <td style="text-align:right;"> 0.0137 </td>
   <td style="text-align:right;"> 0.006 </td>
   <td style="text-align:right;"> 2.2723 </td>
   <td style="text-align:right;"> 0.0231 </td>
  </tr>
</tbody>
</table>

---
# intepreting parameters

<table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> -0.9960 </td>
   <td style="text-align:right;"> 1.217 </td>
   <td style="text-align:right;"> -0.8184 </td>
   <td style="text-align:right;"> 0.4131 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatOak </td>
   <td style="text-align:right;"> -0.0968 </td>
   <td style="text-align:right;"> 1.367 </td>
   <td style="text-align:right;"> -0.0708 </td>
   <td style="text-align:right;"> 0.9436 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> habitatPine </td>
   <td style="text-align:right;"> -0.3372 </td>
   <td style="text-align:right;"> 1.382 </td>
   <td style="text-align:right;"> -0.2441 </td>
   <td style="text-align:right;"> 0.8072 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> elevation </td>
   <td style="text-align:right;"> 0.0137 </td>
   <td style="text-align:right;"> 0.006 </td>
   <td style="text-align:right;"> 2.2723 </td>
   <td style="text-align:right;"> 0.0231 </td>
  </tr>
</tbody>
</table>

- The probability of occurrence in maple-dominated forest at sea level = 0.27 (`plogis(-0.996)`)

- The probability of occurrence in oak-dominated forest at sea level = 0.25 (`plogis(-0.996 - 0.09)`)

+ The *odds* of occurrence in oak (relative to maple) are `exp(-0.09)` = 0.91
    
--

- The probability of occurrence in pine-dominated forest at sea level = 0.21 (`plogis(-0.996 - 0.34)`)

+ The *odds* of occurrence in pine (relative to maple) are `exp(-0.34)` = 0.71
    
---
# intepreting parameters

What about elevation?

First, let's visualize the fitted relationship

---
# example

<br/>

---
# example

What's the change in probability of occurrence at 1m vs 0m elevation (in maple habitat)?

- `plogis(-0.996 + 0.014 * 1) - plogis(-0.996 + 0.014 * 0)` = 0.002

What's the change in probability of occurrence at 101m vs 100m elevation ?

- `plogis(-0.996 + 0.014 * 101) - plogis(-0.996 + 0.014 * 100)` = 0.004

Change in probability is not linear

---
# example

Change in probability is not linear.

But the change is *odds* is

- Odds ratio: `$\frac{p_2}{1 - p_2}/\frac{p_1}{1 - p_1}$`

.small-code[

```r
p1 <- plogis(-0.996 + 0.014 * 0)
p2 <- plogis(-0.996 + 0.014 * 1)
(OR1 <- (p2/(1-p2))/(p1/(1-p1)))
```

```
## [1] 1.014
```

```r
p3 <- plogis(-0.996 + 0.014 * 400)
p4 <- plogis(-0.996 + 0.014 * 401)
(OR2 <- (p4/(1-p4))/(p3/(1-p3)))
```

```
## [1] 1.014
```
]

What is the change in odds for one unit change in elevation?

- `$e^{\beta_{Elev}}$` = `exp(0.014)` = 1.0141

---
# summary

Logistic regression is used when we want to model a binary response as a function of explanatory variables

The response `$y_i$` is modeled as arising from a Bernoulli distribution with probability `$p_i$` and the logit link is used to map the linear predictor to the probability scale

All of the linear modeling concepts we have learned this semester (continuous/categorical explanatory variables, multiple regression, interactions, random effects) can be used within the logistic regression framework

Parameter estimates from the logistic regression measure the change in log odds for one unit change in the explanatory variables

- Log odds are constant across all values of `$x$`, but the probabilities are not

---
# looking ahead

<br/>

### **Next time**: Poisson regression

<br/>

### **Reading**: [Fieberg chp. 15](https://statistics4ecologists-v1.netlify.app/poissonr)