class: center, middle, inverse, title-slide .title[ # LECTURE 18: logistic regression ] .subtitle[ ## FANR 6750 (Experimental design) ] .author[ ###
Fall 2023 ] --- # outline 1) Motivation <br/> -- 2) Model <br/> -- 3) Model fitting <br/> -- 4) Model interpretation --- # logistic regression Logistic regression is a specific type of GLM in which the response variable follows a binomial distribution and the link function is (usually) the logit -- Logistic regression is used to model a binary response variable as a function of explanatory variables -- #### Examples: - Is presence of a focal species related to habitat type? - Is survival a function of body condition? - Is disease prevalence related to population density? --- # logistic regression <br/> `$$\large y_i \sim Bernoulli(p_i)$$` `$$\large log\bigg(\frac{p_i}{1 - p_i}\bigg) = logit(p_i) = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ...$$` -- #### where: - `\(\large y_i\)` is the response (coded as 0/1) - `\(\large p_i\)` is the probability of success for sample unit *i* - `\(\large \frac{p_i}{1 - p_i}\)` is the *odds* of success - `\(\large log\big(\frac{p_i}{1 - p_i}\big)\)` is the log odds --- # logit link The logit link allows us to transform the probability `\(p_i\)` (constrained to `\(0\)` to `\(1\)`) to the real scale ( `\(-\infty\)` to `\(\infty\)`) `$$\large \log\bigg(\frac{0.5}{1-0.5}\bigg) = log(1) = 0$$` -- The inverse logit `\(\frac{e^\mu}{1+e^\mu}\)` allows us to transform the linear predictor to the probability scale `$$\large \frac{e^0}{1+e^0} = \frac{1}{2}$$` -- The odds ratio `\(\frac{p}{1-p}\)` varies from `\(0\)` to `\(\infty\)` and indicate the probability of success relative to the probability of failure -- - `\(1:2\)` odds = `\(0.33/(1 - 0.33)\)` = failure is twice as likely as success - `\(4:1\)` odds = `\(0.8/(1 - 0.8)\)` = success if four times a likely as failure --- # logit link <br/> <table> <tbody> <tr> <td style="text-align:left;"> log odds = log(p/(1-p)) </td> <td style="text-align:right;"> -2.20 </td> <td style="text-align:right;"> -1.10 </td> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 1.39 </td> <td style="text-align:right;"> 4.60 </td> </tr> <tr> <td style="text-align:left;"> odds = p/(1-p) </td> <td style="text-align:right;"> 0.11 </td> <td style="text-align:right;"> 0.33 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 4.00 </td> <td style="text-align:right;"> 99.00 </td> </tr> <tr> <td style="text-align:left;"> p </td> <td style="text-align:right;"> 0.10 </td> <td style="text-align:right;"> 0.25 </td> <td style="text-align:right;"> 0.5 </td> <td style="text-align:right;"> 0.80 </td> <td style="text-align:right;"> 0.99 </td> </tr> </tbody> </table> -- <br/> <img src="18_logistic_regression_files/figure-html/unnamed-chunk-2-1.png" width="720" style="display: block; margin: auto;" /> --- # example Imagine we are interested in the effects of elevation and habitat on the probability of occurrence for a rare orchid ```r data("orchidata") head(orchiddata, n=12) ``` ``` ## presence abundance elevation habitat ## 1 0 0 58 Oak ## 2 1 7 191 Oak ## 3 0 0 43 Oak ## 4 1 11 374 Oak ## 5 1 11 337 Oak ## 6 1 1 64 Oak ## 7 1 4 195 Oak ## 8 1 6 263 Oak ## 9 0 0 181 Oak ## 10 1 1 59 Oak ## 11 1 50 489 Maple ## 12 1 5 317 Maple ``` --- # raw data ```r ggplot(orchiddata, aes(x = elevation, y = presence)) + geom_point() + scale_y_continuous("Orchid Occurrence") + scale_x_continuous("Elevation") ``` <img src="18_logistic_regression_files/figure-html/unnamed-chunk-4-1.png" width="504" style="display: block; margin: auto;" /> --- # raw data ```r library(dplyr) orchiddata %>% group_by(habitat) %>% summarise(group.prob = mean(presence)) %>% ggplot(., aes(x = habitat, y = group.prob)) + geom_col(fill = "grey70", color = "black") + scale_y_continuous("Proportion of sites with orchids") + scale_x_discrete("Habitat") ``` <img src="18_logistic_regression_files/figure-html/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" /> --- # the `glm` function <br/> ```r fm1 <- glm(presence ~ habitat + elevation, family=binomial(link="logit"), data = orchiddata) broom::tidy(fm1) ``` <br/> <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.9960 </td> <td style="text-align:right;"> 1.217 </td> <td style="text-align:right;"> -0.8184 </td> <td style="text-align:right;"> 0.4131 </td> </tr> <tr> <td style="text-align:left;"> habitatOak </td> <td style="text-align:right;"> -0.0968 </td> <td style="text-align:right;"> 1.367 </td> <td style="text-align:right;"> -0.0708 </td> <td style="text-align:right;"> 0.9436 </td> </tr> <tr> <td style="text-align:left;"> habitatPine </td> <td style="text-align:right;"> -0.3372 </td> <td style="text-align:right;"> 1.382 </td> <td style="text-align:right;"> -0.2441 </td> <td style="text-align:right;"> 0.8072 </td> </tr> <tr> <td style="text-align:left;"> elevation </td> <td style="text-align:right;"> 0.0137 </td> <td style="text-align:right;"> 0.006 </td> <td style="text-align:right;"> 2.2723 </td> <td style="text-align:right;"> 0.0231 </td> </tr> </tbody> </table> --- # intepreting parameters <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.9960 </td> <td style="text-align:right;"> 1.217 </td> <td style="text-align:right;"> -0.8184 </td> <td style="text-align:right;"> 0.4131 </td> </tr> <tr> <td style="text-align:left;"> habitatOak </td> <td style="text-align:right;"> -0.0968 </td> <td style="text-align:right;"> 1.367 </td> <td style="text-align:right;"> -0.0708 </td> <td style="text-align:right;"> 0.9436 </td> </tr> <tr> <td style="text-align:left;"> habitatPine </td> <td style="text-align:right;"> -0.3372 </td> <td style="text-align:right;"> 1.382 </td> <td style="text-align:right;"> -0.2441 </td> <td style="text-align:right;"> 0.8072 </td> </tr> <tr> <td style="text-align:left;"> elevation </td> <td style="text-align:right;"> 0.0137 </td> <td style="text-align:right;"> 0.006 </td> <td style="text-align:right;"> 2.2723 </td> <td style="text-align:right;"> 0.0231 </td> </tr> </tbody> </table> -- - The probability of occurrence in maple-dominated forest at sea level = 0.27 (`plogis(-0.996)`) -- - The probability of occurrence in oak-dominated forest at sea level = 0.25 (`plogis(-0.996 - 0.09)`) -- + The *odds* of occurrence in oak (relative to maple) are `exp(-0.09)` = 0.91 -- - The probability of occurrence in pine-dominated forest at sea level = 0.21 (`plogis(-0.996 - 0.34)`) -- + The *odds* of occurrence in pine (relative to maple) are `exp(-0.34)` = 0.71 --- # intepreting parameters <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.9960 </td> <td style="text-align:right;"> 1.217 </td> <td style="text-align:right;"> -0.8184 </td> <td style="text-align:right;"> 0.4131 </td> </tr> <tr> <td style="text-align:left;"> habitatOak </td> <td style="text-align:right;"> -0.0968 </td> <td style="text-align:right;"> 1.367 </td> <td style="text-align:right;"> -0.0708 </td> <td style="text-align:right;"> 0.9436 </td> </tr> <tr> <td style="text-align:left;"> habitatPine </td> <td style="text-align:right;"> -0.3372 </td> <td style="text-align:right;"> 1.382 </td> <td style="text-align:right;"> -0.2441 </td> <td style="text-align:right;"> 0.8072 </td> </tr> <tr> <td style="text-align:left;"> elevation </td> <td style="text-align:right;"> 0.0137 </td> <td style="text-align:right;"> 0.006 </td> <td style="text-align:right;"> 2.2723 </td> <td style="text-align:right;"> 0.0231 </td> </tr> </tbody> </table> -- What about elevation? -- First, let's visualize the fitted relationship --- # example <br/> <img src="18_logistic_regression_files/figure-html/unnamed-chunk-10-1.png" width="576" style="display: block; margin: auto;" /> --- # example What's the change in probability of occurrence at 1m vs 0m elevation (in maple habitat)? - `plogis(-0.996 + 0.014 * 1) - plogis(-0.996 + 0.014 * 0)` = 0.002 -- What's the change in probability of occurrence at 101m vs 100m elevation ? - `plogis(-0.996 + 0.014 * 101) - plogis(-0.996 + 0.014 * 100)` = 0.004 -- Change in probability is not linear <img src="18_logistic_regression_files/figure-html/unnamed-chunk-11-1.png" width="576" style="display: block; margin: auto;" /> --- # example Change in probability is not linear. -- But the change is *odds* is - Odds ratio: `\(\frac{p_2}{1 - p_2}/\frac{p_1}{1 - p_1}\)` .small-code[ ```r p1 <- plogis(-0.996 + 0.014 * 0) p2 <- plogis(-0.996 + 0.014 * 1) (OR1 <- (p2/(1-p2))/(p1/(1-p1))) ``` ``` ## [1] 1.014 ``` ```r p3 <- plogis(-0.996 + 0.014 * 400) p4 <- plogis(-0.996 + 0.014 * 401) (OR2 <- (p4/(1-p4))/(p3/(1-p3))) ``` ``` ## [1] 1.014 ``` ] -- What is the change in odds for one unit change in elevation? -- - `\(e^{\beta_{Elev}}\)` = `exp(0.014)` = 1.0141 --- # summary Logistic regression is used when we want to model a binary response as a function of explanatory variables -- The response `\(y_i\)` is modeled as arising from a Bernoulli distribution with probability `\(p_i\)` and the logit link is used to map the linear predictor to the probability scale -- All of the linear modeling concepts we have learned this semester (continuous/categorical explanatory variables, multiple regression, interactions, random effects) can be used within the logistic regression framework -- Parameter estimates from the logistic regression measure the change in log odds for one unit change in the explanatory variables - Log odds are constant across all values of `\(x\)`, but the probabilities are not --- # looking ahead <br/> ### **Next time**: Poisson regression <br/> ### **Reading**: [Fieberg chp. 15](https://statistics4ecologists-v1.netlify.app/poissonr)