class: center, middle, inverse, title-slide .title[ # LECTURE 20: zero-inflated data ] .subtitle[ ## FANR 6750 (Experimental design) ] .author[ ###
Fall 2023 ] --- # outline 1) Motivation <br/> -- 2) Negative binomial regression <br/> -- 3) Hurdle models <br/> -- 4) Zero-inflated mixture models --- # poisson regression Poisson regression is a specific type of GLM in which the response variable follows a Poisson distribution and the link function is the log -- Poisson regression is used to model a positive, integer response variable as a function of explanatory variables -- #### Examples: - Is the abundance of a focal species related to habitat type? - Is the number of foraging bouts a function of predator density? - Is flock size influenced by food availability? --- # poisson distribution <br/> <img src="20_zero_inflation_files/figure-html/unnamed-chunk-1-1.png" width="576" style="display: block; margin: auto;" /> --- # example Imagine we are interested in restoring oyster beds along the Georgia coast. We "seed" experimental plots with larvae and then record the number of oysters that establish within a randomly-selected quadrat within each plot. We also record the substrate (rock vs sand), water temperature, water depth, salinity, and water clarity of each plot .pull-left[ .vsmall-code[ ```r data("oysterdata") head(oysterdata) ``` ``` ## oyster temperature substrate depth clarity salinity ## 1 0 50.32 rock 7.515 0.42599 0.81431 ## 2 0 43.20 rock 4.110 0.87892 0.93008 ## 3 1 52.15 rock 6.298 0.07195 0.23231 ## 4 4 59.88 rock 3.056 0.10583 0.65415 ## 5 1 51.23 rock 1.502 0.32987 0.83385 ## 6 6 48.53 rock 1.866 0.30914 0.09163 ``` ] ] .pull-right[ <br/> <img src="20_zero_inflation_files/figure-html/unnamed-chunk-3-1.png" width="288" style="display: block; margin: auto;" /> ] --- # zero inflation Zero inflation occurs when there are more zeros in the response variable `\(y\)` than expected under the assumed probability distribution (usually the Poisson distribution) -- .pull-left[ .small-code[ ```r fit.pr <- glm(oyster ~ ., data = oysterdata, family = "poisson") ``` ] ] .pull-right[ .vvsmall-code[ ```r summary(fit.pr) ``` ``` ## ## Call: ## glm(formula = oyster ~ ., family = "poisson", data = oysterdata) ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -4.6070 0.5910 -7.80 6.4e-15 *** ## temperature 0.1158 0.0104 11.11 < 2e-16 *** ## substratesand -0.9148 0.1124 -8.14 4.0e-16 *** ## depth -0.0703 0.0198 -3.56 0.00038 *** ## clarity 0.0232 0.1834 0.13 0.89945 ## salinity -0.3483 0.1733 -2.01 0.04438 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 753.98 on 231 degrees of freedom ## Residual deviance: 490.90 on 226 degrees of freedom ## (18 observations deleted due to missingness) ## AIC: 841.9 ## ## Number of Fisher Scoring iterations: 6 ``` ] ] --- # zero inflation For models with with the `glm()` function, we can test for zero inflation: .vsmall-code[ ```r performance::check_zeroinflation(fit.pr) ``` ``` ## # Check for zero-inflation ## ## Observed zeros: 119 ## Predicted zeros: 73 ## Ratio: 0.61 ``` ``` ## Model is underfitting zeros (probable zero-inflation). ``` ] -- If we suspect zero inflation, we have several options: - Negative binomial regression - Hurdle models - Zero-inflated mixture model --- class: inverse, center, middle # negative binomial regression --- # negative binomial distribution The negative binomial distribution is a *discrete* probability distribution (only integers) with support (i.e., possible values) from 0 to `\(+\infty\)` -- The negative binomial distribution has two parameters: `\(\mu\)` and `\(\theta\)` -- - The mean of the negative binomial distribution is `\(\mu\)` - The variance is `\(\mu + \frac{\mu^2}{\theta}\)` - As `\(\theta \to \infty\)`, the negative binomial distribution converges to the Poisson -- In `R`: .vsmall-code[ ```r x <- rnbinom(n = 1000, mu = 1, size = 1) # theta = size mean(x) ``` ``` ## [1] 1.03 ``` ```r var(x) ``` ``` ## [1] 1.887 ``` ] --- # negative binomial distribution Because the variance is larger than the mean, the negative binomial distribution is often used to model over-dispersed data, including zero inflation -- .left-nudge[ .small-code[ ```r library(MASS) fit.nb <- glm.nb(oyster ~ ., data = oysterdata) ``` ] ] .right-nudge[ .vvsmall-code[ ```r summary(fit.nb) ``` ``` ## ## Call: ## glm.nb(formula = oyster ~ ., data = oysterdata, init.theta = 1.057151222, ## link = log) ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -4.8872 0.9499 -5.14 2.7e-07 *** ## temperature 0.1262 0.0173 7.29 3.2e-13 *** ## substratesand -0.9872 0.1813 -5.45 5.2e-08 *** ## depth -0.1059 0.0348 -3.04 0.0024 ** ## clarity -0.1884 0.3287 -0.57 0.5665 ## salinity -0.2624 0.3055 -0.86 0.3905 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for Negative Binomial(1.057) family taken to be 1) ## ## Null deviance: 340.60 on 231 degrees of freedom ## Residual deviance: 234.91 on 226 degrees of freedom ## (18 observations deleted due to missingness) ## AIC: 756.9 ## ## Number of Fisher Scoring iterations: 1 ## ## ## Theta: 1.057 ## Std. Err.: 0.225 ## ## 2 x log-likelihood: -742.860 ``` ] ] --- # negative binomial regression Did the negative binomial model address the zero inflation? -- .small-code[ ```r performance::check_zeroinflation(fit.nb) ``` ``` ## # Check for zero-inflation ## ## Observed zeros: 119 ## Predicted zeros: 105 ## Ratio: 0.88 ``` ``` ## Model is underfitting zeros (probable zero-inflation). ``` ] -- Although the `check_zeroinflation()` indicates some remaining zero inflation, the negative binomial model seems to fit relatively well - additional tests are possible and probably warranted. See Fieberg chp. 17.3 for additional detail -- Sometimes, rather than address zero inflation using a different distribution, we may suspect that (some of the) zero's are the result of a *different process* than the one generating the counts - What are some examples? - In these cases, models that treat zeros as a mixture of different distributions may be warranted --- class: inverse, center, middle # hurdle models --- # hurdle models Hurdle models are actually composed of two separate models, one estimating the probability that an observation is 0 and another modeling the non-zero counts -- - What distribution do you think is used for the zero/non-zero part of the model? -- + Assume `\(z_i = 0\)` if `\(y_i = 0\)`, otherwise `\(z_i = 1\)` `$$z_i \sim Bernouli(p_i)$$` `$$logit(p_i) = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2}...$$` -- - What distribution do you think is used for the non-zero observations? -- + zero-truncated Poisson/Negative binomial --- # zero-truncated distributions <br/> <img src="20_zero_inflation_files/figure-html/unnamed-chunk-11-1.png" width="576" style="display: block; margin: auto;" /> --- # hurdle models .left-nudge[ .vsmall-code[ ```r library(pscl) fit.hrp <- hurdle(oyster ~ ., data = oysterdata) ``` ] ] .right-nudge[ .vvsmall-code[ ```r summary(fit.hrp) ``` ``` ## ## Call: ## hurdle(formula = oyster ~ ., data = oysterdata) ## ## Pearson residuals: ## Min 1Q Median 3Q Max ## -1.651 -0.576 -0.343 0.483 4.656 ## ## Count model coefficients (truncated poisson with log link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.7032 0.6980 -5.31 1.1e-07 *** ## temperature 0.0944 0.0124 7.59 3.3e-14 *** ## substratesand -0.0861 0.1248 -0.69 0.49 ## depth 0.0135 0.0218 0.62 0.54 ## clarity 0.0505 0.1936 0.26 0.79 ## salinity -0.2491 0.1780 -1.40 0.16 ## Zero hurdle model coefficients (binomial with logit link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.3330 1.5656 -2.13 0.03326 * ## temperature 0.1130 0.0299 3.78 0.00016 *** ## substratesand -1.8083 0.3158 -5.73 1e-08 *** ## depth -0.2192 0.0628 -3.49 0.00049 *** ## clarity -0.3924 0.5693 -0.69 0.49061 ## salinity -0.3552 0.5349 -0.66 0.50667 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Number of iterations in BFGS optimization: 15 ## Log-likelihood: -334 on 12 Df ``` ] ] --- # hurdle models .left-nudge[ .vsmall-code[ ```r fit.hrnb <- hurdle(oyster ~ ., data = oysterdata, dist = "negbin") ``` ] ] .right-nudge[ .vvsmall-code[ ```r summary(fit.hrnb) ``` ``` ## ## Call: ## hurdle(formula = oyster ~ ., data = oysterdata, dist = "negbin") ## ## Pearson residuals: ## Min 1Q Median 3Q Max ## -1.651 -0.576 -0.343 0.483 4.656 ## ## Count model coefficients (truncated negbin with log link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.7034 0.6981 -5.30 1.1e-07 *** ## temperature 0.0944 0.0124 7.58 3.4e-14 *** ## substratesand -0.0861 0.1248 -0.69 0.49 ## depth 0.0135 0.0218 0.62 0.54 ## clarity 0.0505 0.1936 0.26 0.79 ## salinity -0.2491 0.1780 -1.40 0.16 ## Log(theta) 10.1957 98.4641 0.10 0.92 ## Zero hurdle model coefficients (binomial with logit link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.3330 1.5656 -2.13 0.03326 * ## temperature 0.1130 0.0299 3.78 0.00016 *** ## substratesand -1.8083 0.3158 -5.73 1e-08 *** ## depth -0.2192 0.0628 -3.49 0.00049 *** ## clarity -0.3924 0.5693 -0.69 0.49061 ## salinity -0.3552 0.5349 -0.66 0.50667 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Theta: count = 26787.464 ## Number of iterations in BFGS optimization: 55 ## Log-likelihood: -334 on 13 Df ``` ] ] --- # hurdle models Sub-models can contain the same or different covariates -- Parameters for sub-models are interpreted in the same way as for logistic/Poisson regression -- `\(\lambda_i\)` is the expected value (i.e., mean) of the non-zero observations, not the overall mean of all observations (which is `\(\frac{1-p_i}{1-e^{\lambda_i}}\lambda_i\)`) -- Models can be compared using AIC - note that `performance::check_zeroinflation()` does not work for `hurdle` model objects -- Generally, Hurdle models are most appropriate when the count-based process cannot generate zeros (i.e., zeros are determined by a separate process than counts) - Examples? --- class: inverse, center, middle # zero-inflated mixture models --- # zero-inflated mixture models Sometimes, zero observations can occur due to both the binomial process and the count process - Habitat may be suitable or unsuitable, but abundance may be zero at suitable locations (especially if density is low) - Individuals may be breeders or non-breeders, but some breeders will still fail to produce offspring (e.g., nest predation) -- In these cases, the zero-truncated distribution used by the Hurdle model is not appropriate -- Zero-inflated mixture models allow zero observations to arise from both a logistic/binomial process and a count model - Both Poisson and negative binomial count models are possible --- # zero-inflated mixture models .left-nudge[ .vsmall-code[ ```r fit.zip <- zeroinfl(oyster ~ .|., data = oysterdata) ``` ] ] .right-nudge[ .vvsmall-code[ ```r summary(fit.zip) ``` ``` ## ## Call: ## zeroinfl(formula = oyster ~ . | ., data = oysterdata) ## ## Pearson residuals: ## Min 1Q Median 3Q Max ## -1.686 -0.574 -0.343 0.535 4.489 ## ## Count model coefficients (poisson with log link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.8550 0.6942 -5.55 2.8e-08 *** ## temperature 0.0973 0.0125 7.80 6.0e-15 *** ## substratesand -0.0734 0.1229 -0.60 0.55 ## depth 0.0139 0.0216 0.64 0.52 ## clarity 0.0242 0.1912 0.13 0.90 ## salinity -0.2573 0.1750 -1.47 0.14 ## ## Zero-inflation model coefficients (binomial with logit link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.4511 2.0654 0.22 0.82711 ## temperature -0.0666 0.0371 -1.80 0.07222 . ## substratesand 2.0329 0.3784 5.37 7.8e-08 *** ## depth 0.2554 0.0725 3.52 0.00043 *** ## clarity 0.4385 0.6488 0.68 0.49917 ## salinity 0.2308 0.6026 0.38 0.70176 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Number of iterations in BFGS optimization: 17 ## Log-likelihood: -333 on 12 Df ``` ] ] --- # zero-inflated mixture models .left-nudge[ .vsmall-code[ ```r fit.zinb <- zeroinfl(oyster ~ .|., data = oysterdata, dist = "negbin") ``` ] ] .right-nudge[ .vvsmall-code[ ```r summary(fit.zinb) ``` ``` ## ## Call: ## zeroinfl(formula = oyster ~ . | ., data = oysterdata, dist = "negbin") ## ## Pearson residuals: ## Min 1Q Median 3Q Max ## -1.686 -0.574 -0.343 0.535 4.489 ## ## Count model coefficients (negbin with log link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.8552 0.6945 -5.55 2.8e-08 *** ## temperature 0.0973 0.0125 7.80 6.1e-15 *** ## substratesand -0.0734 0.1230 -0.60 0.55 ## depth 0.0139 0.0216 0.64 0.52 ## clarity 0.0242 0.1912 0.13 0.90 ## salinity -0.2573 0.1750 -1.47 0.14 ## Log(theta) 9.9406 97.7048 0.10 0.92 ## ## Zero-inflation model coefficients (binomial with logit link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.4508 2.0658 0.22 0.82727 ## temperature -0.0666 0.0371 -1.80 0.07228 . ## substratesand 2.0330 0.3785 5.37 7.8e-08 *** ## depth 0.2554 0.0725 3.52 0.00043 *** ## clarity 0.4384 0.6488 0.68 0.49919 ## salinity 0.2307 0.6026 0.38 0.70177 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Theta = 20755.352 ## Number of iterations in BFGS optimization: 17 ## Log-likelihood: -333 on 13 Df ``` ] ] --- # model comparison Generally, alternative zero-inflated models are compared using AIC <table> <thead> <tr> <th style="text-align:left;"> Model </th> <th style="text-align:right;"> k </th> <th style="text-align:right;"> AIC </th> <th style="text-align:right;"> delta AIC </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Zero-inflated Poisson </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 690.4 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Zero-inflated NegBig </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 692.4 </td> <td style="text-align:right;"> 2.000 </td> </tr> <tr> <td style="text-align:left;"> Poisson Hurdle </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 692.5 </td> <td style="text-align:right;"> 2.078 </td> </tr> <tr> <td style="text-align:left;"> NegBig Hurdle </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 694.5 </td> <td style="text-align:right;"> 4.078 </td> </tr> <tr> <td style="text-align:left;"> NegBin GLM </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 756.9 </td> <td style="text-align:right;"> 66.479 </td> </tr> <tr> <td style="text-align:left;"> Poisson GLM </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 841.9 </td> <td style="text-align:right;"> 151.496 </td> </tr> </tbody> </table> -- Do AIC scores tell us if the models are "good"? -- - NO! - Goodness-of-fit tests are also important for zero-inflated model. See [Fieberg 17.7](https://fw8051statistics4ecologists.netlify.app/zimodels#goodness-of-fit) for examples --- # looking ahead <br/> ### **Next time**: Exam 3 review