class: center, middle, inverse, title-slide .title[ # LECTURE 19: generalized linear models ] .subtitle[ ## FANR 6750 (Experimental design) ] .author[ ###
Fall 2022 ] --- # outline <br/> 1) Motivation <br/> -- 2) Logistic regression <br/> -- 3) Poisson regression --- # motivation #### Benefits of generalized linear models -- - The residuals don’t have to be normally distributed -- - The response variable can be binary, integer, strictly-positive, etc... -- - The variance is not assumed to be constant -- - Useful for manipulative experiments or observational studies, just like linear models -- #### Examples - Presence-absence studies - Studies of survival - Seed germination studies --- # from linear to generalized linear #### Linear model `$$\Large E(y_i) = \beta_0 + \beta_1x_{i1} +\beta_2x_{i2} + ...$$` `$$\Large y_i \sim normal(E(y_i), \sigma^2)$$` -- #### Generalized linear model `$$\Large g\big(E(y_i)\big) = \beta_0 + \beta_1x_{i1} +\beta_2x_{i2} + ...$$` `$$\Large y_i \sim f(E(y_i))$$` -- where `\(g()\)` is a **link function**, such as the log or logit link and `\(f\)` is a probability distribution such as the binomial or Poisson --- # alternative representations #### This: `$$\Large g\big(E(y_i)\big) = \beta_0 + \beta_1x_{i1} +\beta_2x_{i2} + ...$$` `$$\Large y_i \sim f(E(y_i))$$` #### Is the same as this: `$$\Large E(y_i) = g^{-1}(\beta_0 + \beta_1x_{i1} +\beta_2x_{i2} + ...)$$` `$$\Large y_i \sim f(E(y_i))$$` -- #### Is the same as this: `$$\Large g\big(E(y_i)\big) = \mathbf{X \beta}$$` `$$\Large y_i \sim f(E(y_i))$$` --- # link functions <table> <thead> <tr> <th style="text-align:center;"> Distribution </th> <th style="text-align:center;"> link name </th> <th style="text-align:center;"> link equation </th> <th style="text-align:center;"> inverse link equation </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Binomial </td> <td style="text-align:center;"> logit </td> <td style="text-align:center;"> \(log(\frac{p}{1-p})\) </td> <td style="text-align:center;"> \(\frac{e^{X\beta}}{1+e^{X\beta}}\) </td> </tr> <tr> <td style="text-align:center;"> Poisson </td> <td style="text-align:center;"> log </td> <td style="text-align:center;"> \(log(\lambda)\) </td> <td style="text-align:center;"> \(e^{X\beta}\) </td> </tr> </tbody> </table> <br/> -- <table> <thead> <tr> <th style="text-align:center;"> Distribution </th> <th style="text-align:center;"> link name </th> <th style="text-align:center;"> link in R </th> <th style="text-align:center;"> inverse link in R </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Binomial </td> <td style="text-align:center;"> logit </td> <td style="text-align:center;"> qlogis </td> <td style="text-align:center;"> plogis </td> </tr> <tr> <td style="text-align:center;"> Poisson </td> <td style="text-align:center;"> log </td> <td style="text-align:center;"> log </td> <td style="text-align:center;"> exp </td> </tr> </tbody> </table> --- # logit link example ```r beta0 <- 5 beta1 <- -0.08 (logit.p <- beta0 + beta1*100) # linear predictor for x = 100 ``` ``` ## [1] -3 ``` -- #### How do we convert -3 to a probability? Use the inverse-link: ```r (p <- exp(logit.p)/(1+exp(logit.p))) # same as plogis(logit.p) ``` ``` ## [1] 0.04743 ``` -- #### To go back, use the link function itself: ```r log(p/(1-p)) ``` ``` ## [1] -3 ``` ```r qlogis(p) ``` ``` ## [1] -3 ``` --- # logit link example ```r plot(function(x) 5 + -0.08*x, from=0, to=100, xlab="Elevation", ylab="logit(prob of occurrence)") ``` <img src="19_glm_files/figure-html/unnamed-chunk-6-1.png" width="648" style="display: block; margin: auto;" /> --- # logit link example ```r plot(function(x) plogis(5 + -0.08*x), from=0, to=100, xlab="Elevation", ylab="Probability of occurrence") ``` <img src="19_glm_files/figure-html/unnamed-chunk-7-1.png" width="648" style="display: block; margin: auto;" /> --- class: inverse, middle, center # logistic regression --- # logistic regression #### Logistic regression is a specific type of GLM in which the response variable follows a binomial distribution and the link function is the logit -- #### It would be better to call it “binomial regression” since other link functions (e.g. the probit) can be used -- #### Appropriate when the response is binary or a count with an upper limit -- #### Examples: - Presence/absence studies - Survival studies - Disease prevalance studies --- # logistic regression <br/> `$$\Large logit(p_i) = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ...$$` `$$\Large y_i \sim binomial(N, p_i)$$` -- #### where: - `\(\large N\)` is the number of “trials” (e.g. coin flips) - `\(\large p_i\)` is the probability of success for sample unit *i* --- # binomial distribution <br/> <img src="19_glm_files/figure-html/unnamed-chunk-8-1.png" width="648" style="display: block; margin: auto;" /> --- # binomial distribution <br/> <img src="19_glm_files/figure-html/unnamed-chunk-9-1.png" width="648" style="display: block; margin: auto;" /> --- # binomial distribution #### Properties - The expected value of `\(\large y\)` is `\(\large Np\)` - The variance is `\(\large Np(1 - p)\)` -- #### Bernoulli distribution - The Bernoulli distribution is a binomial distribution with a single trial (N = 1) - Logistic regression is usually done in this context, such that the response variable is 0/1 or No/Yes or Bad/Good, etc. --- # worked example using glm First we will model the presence-absence response variable to determine if elevation and habitat affect the probability of occurrence ```r data("frogData") head(frogData, n=12) ``` ``` ## presence abundance elevation habitat ## 1 0 0 58 Oak ## 2 1 7 191 Oak ## 3 0 0 43 Oak ## 4 1 11 374 Oak ## 5 1 11 337 Oak ## 6 1 1 64 Oak ## 7 1 4 195 Oak ## 8 1 6 263 Oak ## 9 0 0 181 Oak ## 10 1 1 59 Oak ## 11 1 50 489 Maple ## 12 1 5 317 Maple ``` --- # raw data ```r ggplot(frogData, aes(x = elevation, y = presence)) + geom_point() + scale_y_continuous("Frog Occurrence") + scale_x_continuous("Elevation") ``` <img src="19_glm_files/figure-html/unnamed-chunk-11-1.png" width="648" style="display: block; margin: auto;" /> --- # raw data ```r library(dplyr) frogData %>% group_by(habitat) %>% summarise(group.prob = mean(presence)) %>% ggplot(., aes(x = habitat, y = group.prob)) + geom_col(fill = "grey70", color = "black") + scale_y_continuous("Proportion of sites with frogs") + scale_x_discrete("Habitat") ``` <img src="19_glm_files/figure-html/unnamed-chunk-12-1.png" width="576" style="display: block; margin: auto;" /> --- # the `glm` function <br/> ```r fm1 <- glm(presence ~ habitat + elevation, family=binomial(link="logit"), data = frogData) broom::tidy(fm1) ``` <br/> <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.9960 </td> <td style="text-align:right;"> 1.217 </td> <td style="text-align:right;"> -0.8184 </td> <td style="text-align:right;"> 0.4131 </td> </tr> <tr> <td style="text-align:left;"> habitatOak </td> <td style="text-align:right;"> -0.0968 </td> <td style="text-align:right;"> 1.367 </td> <td style="text-align:right;"> -0.0708 </td> <td style="text-align:right;"> 0.9436 </td> </tr> <tr> <td style="text-align:left;"> habitatPine </td> <td style="text-align:right;"> -0.3372 </td> <td style="text-align:right;"> 1.382 </td> <td style="text-align:right;"> -0.2441 </td> <td style="text-align:right;"> 0.8072 </td> </tr> <tr> <td style="text-align:left;"> elevation </td> <td style="text-align:right;"> 0.0137 </td> <td style="text-align:right;"> 0.006 </td> <td style="text-align:right;"> 2.2723 </td> <td style="text-align:right;"> 0.0231 </td> </tr> </tbody> </table> --- # occurrence probability and elevation ```r newdat <- data.frame(elevation=seq(12, 489, length = 50), habitat="Oak") head(newdat) ``` ``` ## elevation habitat ## 1 12.00 Oak ## 2 21.73 Oak ## 3 31.47 Oak ## 4 41.20 Oak ## 5 50.94 Oak ## 6 60.67 Oak ``` -- To get confidence intervals on (0,1) scale, predict on linear (link) scale and then backtransform using inverse-link ```r pred.link <- predict(fm1, newdata=newdat, se.fit=TRUE, type="link") newdat$mu <- plogis(pred.link$fit) newdat$lower <- plogis(pred.link$fit - 1.96*pred.link$se.fit) newdat$upper <- plogis(pred.link$fit + 1.96*pred.link$se.fit) ``` --- # occurrence probability and elevation <img src="19_glm_files/figure-html/unnamed-chunk-17-1.png" width="648" style="display: block; margin: auto;" /> --- class: center, middle, inverse # poisson regression --- # poisson regression <br/> `$$\Large log(\lambda_i) = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ...$$` `$$\Large y_i \sim Poisson(\lambda_i)$$` <br/> -- #### where - `\(\large \lambda_i\)` is the expected value of `\(\large y_i\)` --- # poisson regression #### Useful for: - Count data - Number of events in time intervals - Other types of integer data -- #### Properties - The expected value of *y* ( `\(\large \lambda\)`) is equal to the variance - This is an assumption of the Poisson model - Like all assumptions, it can be relaxed if you have enough data --- # poisson distribution <br/> <img src="19_glm_files/figure-html/unnamed-chunk-18-1.png" width="648" style="display: block; margin: auto;" /> --- # poisson distribution <br/> <img src="19_glm_files/figure-html/unnamed-chunk-19-1.png" width="648" style="display: block; margin: auto;" /> --- # poisson distribution <br/> <img src="19_glm_files/figure-html/unnamed-chunk-20-1.png" width="648" style="display: block; margin: auto;" /> --- # log link example ```r plot(function(x) 5 + -0.08*x, from=0, to=100, xlab="Elevation", ylab="log(Expected abundance)") ``` <img src="19_glm_files/figure-html/unnamed-chunk-21-1.png" width="648" style="display: block; margin: auto;" /> --- # log link example ```r plot(function(x) exp(5 + -0.08*x), from=0, to=100, xlab="Elevation", ylab="Expected abundance") ``` <img src="19_glm_files/figure-html/unnamed-chunk-22-1.png" width="648" style="display: block; margin: auto;" /> --- # the `glm` function <br/> ```r fm2 <- glm(abundance ~ habitat + elevation, family=poisson(link="log"), data=frogData) broom::tidy(fm2) ``` <br/> <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.7870 </td> <td style="text-align:right;"> 0.2941 </td> <td style="text-align:right;"> -2.6756 </td> <td style="text-align:right;"> 0.0075 </td> </tr> <tr> <td style="text-align:left;"> habitatOak </td> <td style="text-align:right;"> -0.1534 </td> <td style="text-align:right;"> 0.1971 </td> <td style="text-align:right;"> -0.7782 </td> <td style="text-align:right;"> 0.4364 </td> </tr> <tr> <td style="text-align:left;"> habitatPine </td> <td style="text-align:right;"> -0.0653 </td> <td style="text-align:right;"> 0.1104 </td> <td style="text-align:right;"> -0.5914 </td> <td style="text-align:right;"> 0.5543 </td> </tr> <tr> <td style="text-align:left;"> elevation </td> <td style="text-align:right;"> 0.0098 </td> <td style="text-align:right;"> 0.0006 </td> <td style="text-align:right;"> 15.5507 </td> <td style="text-align:right;"> 0.0000 </td> </tr> </tbody> </table> --- # prediction ```r newdat <- data.frame(elevation = seq(12, 489, length = 50), habitat = "Oak") head(newdat) ``` ``` ## elevation habitat ## 1 12.00 Oak ## 2 21.73 Oak ## 3 31.47 Oak ## 4 41.20 Oak ## 5 50.94 Oak ## 6 60.67 Oak ``` -- To get confidence intervals on (0, `\(\infty\)`) scale, predict on linear (link) scale and then backtransform using inverse-link ```r pred.link <- predict(fm2, newdata=newdat, se.fit=TRUE, type="link") newdat$mu <- exp(pred.link$fit) newdat$lower <- exp(pred.link$fit - 1.96*pred.link$se.fit) newdat$upper <- exp(pred.link$fit + 1.96*pred.link$se.fit) ``` --- # prediction <br/> <img src="19_glm_files/figure-html/unnamed-chunk-27-1.png" width="648" style="display: block; margin: auto;" /> --- # assessing model fit #### The most common problem in Poisson regression is overdispersion -- #### Overdispersion is the situation in which there is more variability in the data than predicted by the model -- #### Overdispersion cannot be assessed by simply comparing the mean and variance of the response variable -- ### The presence of many zeros is not necessarily indicative of overdispersion -- ### Overdispersion can be assessed using a goodness-of-fit test --- # goodness-of-fit #### The fit of a Poisson regression can be assessed using a `\(\large \chi^2\)` test The test statistic is the residual deviance: `$$\Large D = 2\bigg[\sum y_i log\bigg(\frac{y_i}{\hat{\lambda}_i}\bigg)-(y_i-\hat{\lambda}_i)\bigg]$$` -- If the null hypothesis is true (ie, the model fits the data), `\(D\)` should follow `\(\chi^2\)` distribution with `\(N - K\)` degrees-of-freedom -- ```r N <- nrow(frogData) # sample size K <- length(coef(fm2)) # number of parameters df.resid <- N-K # degrees-of-freedom Dev <- deviance(fm2) # residual deviance (p.value <- 1-pchisq(Dev, df=df.resid))# p-value ``` ``` ## [1] 0.0121 ``` --- ### `\(\large \chi^2\)` DISTRIBUTION AND RESIDUAL DEVIANCE <br/> ```r curve(dchisq(x, df=df.resid), from=0, to=50, xlab="Deviance", ylab="Density") abline(v=Dev, lwd=3, col="red") ``` <img src="19_glm_files/figure-html/unnamed-chunk-29-1.png" width="504" style="display: block; margin: auto;" /> -- The red line is the residual deviance. We reject the null hypothesis, and we conclude that the Poisson model does not fit the data --- # what if model doesn't fit the data? <br/> #### Alternatives to the Poisson distribution - Negative binomial - Zero-inflated Poisson --- # negative binomial distribution <img src="19_glm_files/figure-html/unnamed-chunk-30-1.png" width="648" style="display: block; margin: auto;" /> --- # negative binomial distribution <img src="19_glm_files/figure-html/unnamed-chunk-31-1.png" width="648" style="display: block; margin: auto;" /> --- # negative binomial distribution <img src="19_glm_files/figure-html/unnamed-chunk-32-1.png" width="648" style="display: block; margin: auto;" />