class: center, middle, inverse, title-slide .title[ # LECTURE 19: poisson regression ] .subtitle[ ## FANR 6750 (Experimental design) ] .author[ ###
Fall 2023 ] --- # outline 1) Motivation <br/> -- 2) Model <br/> -- 3) Model fitting <br/> -- 4) Model interpretation <br/> -- 5) Over-dispersion --- # poisson regression Poisson regression is a specific type of GLM in which the response variable follows a Poisson distribution and the link function is the log -- Poisson regression is used to model a positive, integer response variable as a function of explanatory variables -- #### Examples: - Is the abundance of a focal species related to habitat type? - Is the number of foraging bouts a function of predator density? - Is flock size influenced by food availability? --- # poisson regression <br/> `$$\large y_i \sim Poisson(\lambda_i)$$` `$$\large log(\lambda_i) = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ...$$` -- #### where: - `\(\large y_i\)` is the response - `\(\large \lambda_i\)` is the expected value of the response variable for sample unit *i* --- # log link The log link allows us to transform the expected value `\(\lambda_i\)` (constrained to `\(0\)` to `\(\infty\)`) to the real scale ( `\(-\infty\)` to `\(\infty\)`) `$$\large \log(2) = 0.693$$` -- The exponential allows us to transform the linear predictor to the positive scale `$$\large e^{0.693} = 2$$` --- # log link <br/> <img src="19_poisson_regression_files/figure-html/unnamed-chunk-1-1.png" width="720" style="display: block; margin: auto;" /> --- # example Imagine we are interested in the effects of elevation and habitat on the abundance of a rare orchid ```r data("orchidata") head(orchiddata, n=12) ``` ``` ## presence abundance elevation habitat ## 1 0 0 58 Oak ## 2 1 7 191 Oak ## 3 0 0 43 Oak ## 4 1 11 374 Oak ## 5 1 11 337 Oak ## 6 1 1 64 Oak ## 7 1 4 195 Oak ## 8 1 6 263 Oak ## 9 0 0 181 Oak ## 10 1 1 59 Oak ## 11 1 50 489 Maple ## 12 1 5 317 Maple ``` --- # raw data ```r ggplot(orchiddata, aes(x = elevation, y = abundance)) + geom_point() + scale_y_continuous("Orchid Count") + scale_x_continuous("Elevation") ``` <img src="19_poisson_regression_files/figure-html/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" /> --- # raw data ```r library(dplyr) orchiddata %>% group_by(habitat) %>% summarise(group.prob = mean(abundance)) %>% ggplot(., aes(x = habitat, y = group.prob)) + geom_col(fill = "grey70", color = "black") + scale_y_continuous("Mean number of orchids") + scale_x_discrete("Habitat") ``` <img src="19_poisson_regression_files/figure-html/unnamed-chunk-4-1.png" width="504" style="display: block; margin: auto;" /> --- # the `glm` function <br/> ```r fm1 <- glm(abundance ~ habitat + elevation, family=poisson(link="log"), data = orchiddata) broom::tidy(fm1) ``` <br/> <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.7870 </td> <td style="text-align:right;"> 0.2941 </td> <td style="text-align:right;"> -2.6756 </td> <td style="text-align:right;"> 0.0075 </td> </tr> <tr> <td style="text-align:left;"> habitatOak </td> <td style="text-align:right;"> -0.1534 </td> <td style="text-align:right;"> 0.1971 </td> <td style="text-align:right;"> -0.7782 </td> <td style="text-align:right;"> 0.4364 </td> </tr> <tr> <td style="text-align:left;"> habitatPine </td> <td style="text-align:right;"> -0.0653 </td> <td style="text-align:right;"> 0.1104 </td> <td style="text-align:right;"> -0.5914 </td> <td style="text-align:right;"> 0.5543 </td> </tr> <tr> <td style="text-align:left;"> elevation </td> <td style="text-align:right;"> 0.0098 </td> <td style="text-align:right;"> 0.0006 </td> <td style="text-align:right;"> 15.5507 </td> <td style="text-align:right;"> 0.0000 </td> </tr> </tbody> </table> --- # intepreting parameters <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.7870 </td> <td style="text-align:right;"> 0.2941 </td> <td style="text-align:right;"> -2.6756 </td> <td style="text-align:right;"> 0.0075 </td> </tr> <tr> <td style="text-align:left;"> habitatOak </td> <td style="text-align:right;"> -0.1534 </td> <td style="text-align:right;"> 0.1971 </td> <td style="text-align:right;"> -0.7782 </td> <td style="text-align:right;"> 0.4364 </td> </tr> <tr> <td style="text-align:left;"> habitatPine </td> <td style="text-align:right;"> -0.0653 </td> <td style="text-align:right;"> 0.1104 </td> <td style="text-align:right;"> -0.5914 </td> <td style="text-align:right;"> 0.5543 </td> </tr> <tr> <td style="text-align:left;"> elevation </td> <td style="text-align:right;"> 0.0098 </td> <td style="text-align:right;"> 0.0006 </td> <td style="text-align:right;"> 15.5507 </td> <td style="text-align:right;"> 0.0000 </td> </tr> </tbody> </table> -- - The expected number of orchids in maple-dominated forest at sea level = 0.46 (`exp(-0.787)`) -- - The expected number of orchids in oak-dominated forest at sea level = 0.39 (`exp(-0.787 - 0.153)`) -- - The expected number of orchids in pine-dominated forest at sea level = 0.43 (`exp(-0.787 - 0.065)`) --- # intepreting parameters <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.7870 </td> <td style="text-align:right;"> 0.2941 </td> <td style="text-align:right;"> -2.6756 </td> <td style="text-align:right;"> 0.0075 </td> </tr> <tr> <td style="text-align:left;"> habitatOak </td> <td style="text-align:right;"> -0.1534 </td> <td style="text-align:right;"> 0.1971 </td> <td style="text-align:right;"> -0.7782 </td> <td style="text-align:right;"> 0.4364 </td> </tr> <tr> <td style="text-align:left;"> habitatPine </td> <td style="text-align:right;"> -0.0653 </td> <td style="text-align:right;"> 0.1104 </td> <td style="text-align:right;"> -0.5914 </td> <td style="text-align:right;"> 0.5543 </td> </tr> <tr> <td style="text-align:left;"> elevation </td> <td style="text-align:right;"> 0.0098 </td> <td style="text-align:right;"> 0.0006 </td> <td style="text-align:right;"> 15.5507 </td> <td style="text-align:right;"> 0.0000 </td> </tr> </tbody> </table> -- What about elevation? -- Again, let's visualize the fitted relationship --- # example <br/> <img src="19_poisson_regression_files/figure-html/unnamed-chunk-9-1.png" width="576" style="display: block; margin: auto;" /> --- # example What's the change in abundance at 1m vs 0m elevation (in oak habitat)? - `exp(-0.786 + 0.009 * 1) - exp(-0.786 + 0.009 * 0)` = 0.005 -- What's the change in abundance at 101m vs 100m elevation ? - `exp(-0.786 + 0.009 * 101) - exp(-0.786 + 0.009 * 100)` = 0.012 -- Change in abundance is not linear <img src="19_poisson_regression_files/figure-html/unnamed-chunk-10-1.png" width="576" style="display: block; margin: auto;" /> --- # example Change in `\(\lambda\)` is not linear because exponential functions are multiplicative -- Remember that `\(e^{(a + b)} = e^a e^b\)`, so `\(\lambda_i = exp(\beta_0 + \beta_1x_i) = exp(\beta_0)exp(\beta_1x_i)\)` -- `$$\frac{\lambda_2}{\lambda_1}=\frac{exp(\beta_0)exp(\beta_1(x_1+1))}{exp(\beta_0)exp(\beta_1x_1)} = exp(\beta_1)$$` -- .vsmall-code[ ```r l1 <- exp(-0.786 + 0.009 * 0) l2 <- exp(-0.786 + 0.009 * 1) l2/l1 ``` ``` ## [1] 1.009 ``` ```r l3 <- exp(-0.786 + 0.009 * 400) l4 <- exp(-0.786 + 0.009 * 401) l4/l3 ``` ``` ## [1] 1.009 ``` ] -- What is the change in abundance for one unit change in elevation? -- - `\(e^{\beta_{Elev}}\)` = `exp(0.009)` = 1.009 (about 1% increase per meter) --- class: inverse, middle # over-dispersion --- # over-dispersion Poisson regression assumes that the mean ( `\(E[y_i|x_i] = \lambda_i\)`) is equal to the residual variance ( `\(Var(y_i|x_i) = \lambda_i\)`) -- This is assumption is often violated with real data. Specificallly, the residual variance is often larger than the mean ( `\(Var(y_i|x_i) > E[y_i|x_i]\)`). This is called **over-dispersion** -- There are many possible reasons data will be over-dispersed: - Omission of important predictor variables - Measurement error in response and explanatory variables - Relationship between `\(log(\lambda)\)` and `\(x\)` may be non-linear - Data may contain outliers - Clustering (spatial, temporal, within-individual) - Response variable may be a mixture of random processes (e.g., is a site occupied? If occupied, how many individuals?) --- # example Suppose we are interested in how predator abundance influences foraging behavior of eastern chipmunks (*Tamias striatus*). -- .pull-left[ We place 100 chipmunks in individual enclosures with a small tray of food in the middle, letting each individual habituate to the enclosures before beginning the experiment. We then randomly simulate one of 3 levels of aerial predator abundance (absent, low, high) using cardboard cutouts of raptor silhouettes. While the predator silhouettes are in place, we record the number of times an individual visits the food tray within a 15-minute interval. ] .pull-right[ <img src="https://upload.wikimedia.org/wikipedia/commons/9/9c/Tamias_striatus_UL_14.jpg" width="225" height="150" style="display: block; margin: auto;" /> After 15-minutes, the silhouettes are removed and the individuals are allowed to re-habituate before another the predator treatment is randomly applied (why is this useful?). Due to logistical contraints, not every individual was measured the same number of time (unbalanced) ] --- # example .pull-left[ ```r data("foragingdata") head(foragingdata) ``` ``` ## predator individual foraging ## 1 Low 1 3 ## 2 Low 1 1 ## 3 Low 1 4 ## 4 High 1 2 ## 5 High 1 2 ## 6 Absent 2 8 ``` ] .pull-right[ <img src="19_poisson_regression_files/figure-html/unnamed-chunk-14-1.png" width="504" style="display: block; margin: auto;" /> ] --- # example .pull-left[ Model fitting: .vsmall-code[ ```r glm1 <- glm(foraging ~ predator, data = foragingdata, family = "poisson") summary(glm1) ``` ``` ## ## Call: ## glm(formula = foraging ~ predator, family = "poisson", data = foragingdata) ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.3001 0.0507 25.6 < 2e-16 *** ## predatorLow -0.7448 0.0908 -8.2 2.4e-16 *** ## predatorHigh -1.1788 0.1100 -10.7 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 709.36 on 299 degrees of freedom ## Residual deviance: 553.57 on 297 degrees of freedom ## AIC: 1193 ## ## Number of Fisher Scoring iterations: 5 ``` ] ] -- .pull-right[ Checking for over-dispersion: .vsmall-code[ ```r performance::check_overdispersion(glm1) ``` ``` ## # Overdispersion test ## ## dispersion ratio = 1.907 ## Pearson's Chi-Squared = 566.245 ## p-value = < 0.001 ``` ] ] --- # over-dispersion In general, over-dispersion will not bias parameter estimates but it will makes standard error's too small (what happens to *p*-values?) -- So one way to deal with over-dispersion is to adjust standard errors using a variance inflation factor: - `\(E[y_i|x_i] = \lambda_i\)` - `\(Var[y_i|x_i] = \phi\lambda_i\)` -- In `R`, this can be done using a *quasi-likelihood* approach .small-code[ ```r glm2 <- glm(foraging ~ predator, data = foragingdata, family = quasipoisson()) ``` ] --- # over-dispersion Let's compare the output of the two models: -- .small-code[ ```r broom::tidy(glm1) # Poisson model ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.30 0.0507 25.6 5.09e-145 ## 2 predatorLow -0.745 0.0908 -8.20 2.43e- 16 ## 3 predatorHigh -1.18 0.110 -10.7 8.32e- 27 ``` ] .small-code[ ```r broom::tidy(glm2) # Quasi-Poisson model ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.30 0.0700 18.6 1.25e-51 ## 2 predatorLow -0.745 0.125 -5.94 8.07e- 9 ## 3 predatorHigh -1.18 0.152 -7.76 1.35e-13 ``` ] --- # over-dispersion Did the quasi-Poisson method fix the over-dispersion? -- ```r performance::check_overdispersion(glm2) ``` ``` ## # Overdispersion test ## ## dispersion ratio = 1.907 ## Pearson's Chi-Squared = 566.245 ## p-value = < 0.001 ``` -- No! It only adjusted the SE --- # over-dispersion As mentioned, another reason for over-dispersion (and thus way to address it) is the omission of important predictor variables -- In this example, is there a predictor variable missing? How should this variable be treated in the model? -- .small-code[ ```r library(lme4) glmm1 <- glmer(foraging ~ predator + (1|individual), data = foragingdata, family = "poisson") ``` ] -- What does it mean to treat individual as a random effect in this model? What additional information does this model provide? --- # over-dispersion .vsmall-code[ ```r summary(glmm1) ``` ``` ## Generalized linear mixed model fit by maximum likelihood (Laplace ## Approximation) [glmerMod] ## Family: poisson ( log ) ## Formula: foraging ~ predator + (1 | individual) ## Data: foragingdata ## ## AIC BIC logLik deviance df.resid ## 1068.2 1083.0 -530.1 1060.2 296 ## ## Scaled residuals: ## Min 1Q Median 3Q Max ## -1.5209 -0.7969 -0.0868 0.4888 2.8575 ## ## Random effects: ## Groups Name Variance Std.Dev. ## individual (Intercept) 0.33 0.575 ## Number of obs: 300, groups: individual, 95 ## ## Fixed effects: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.194 0.087 13.72 < 2e-16 *** ## predatorLow -0.767 0.104 -7.39 1.4e-13 *** ## predatorHigh -1.230 0.118 -10.40 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Correlation of Fixed Effects: ## (Intr) prdtrL ## predatorLow -0.430 ## predatorHgh -0.340 0.300 ``` ] --- # over-dispersion Did that fix the over-dispersion? ```r performance::check_overdispersion(glmm1) ``` ``` ## # Overdispersion test ## ## dispersion ratio = 0.749 ## Pearson's Chi-Squared = 221.656 ## p-value = 1 ``` --- # summary Poisson regression is used when we want to model a count-based response as a function of explanatory variables -- The response `\(y_i\)` is modeled as arising from a Poisson distribution with mean `\(\lambda_i\)` and the log link is used to map the linear predictor to the positive scale -- All of the linear modeling concepts we have learned this semester (continuous/categorical explanatory variables, multiple regression, interactions, random effects) can be used within the Poisson regression framework -- Count data will often be over-dispersed, which can lead to biased standard error estimates - Over-dispersion can be addressed by using quasi-likelihood methods or adding additional covariates - In some cases, other modeling approaches may be necessary to address over-dispersion. We will cover these methods in more detail in the coming weeks --- # looking ahead <br/> ### **Next time**: Zero-inflated count data <br/> ### **Reading**: [Fieberg chp. 15.5](https://fw8051statistics4ecologists.netlify.app/poissonr#NBR) & [chp. 17](https://fw8051statistics4ecologists.netlify.app/zimodels)