class: center, middle, inverse, title-slide .title[ # LECTURE 21: zero-inflated data ] .subtitle[ ## FANR 6750 (Experimental design) ] .author[ ###
Fall 2025 ] --- # outline 1) Motivation <br/> -- 2) Negative binomial regression <br/> -- 3) Hurdle models <br/> -- 4) Zero-inflated mixture models --- # poisson regression Poisson regression is a specific type of GLM in which the response variable follows a Poisson distribution and the link function is the log -- Poisson regression is used to model a positive, integer response variable as a function of explanatory variables -- #### Examples: - Is the abundance of a focal species related to habitat type? - Is the number of foraging bouts a function of predator density? - Is flock size influenced by food availability? --- # poisson distribution <br/> <img src="21_zero_inflation_files/figure-html/unnamed-chunk-1-1.png" width="576" style="display: block; margin: auto;" /> --- # example Imagine we are interested in restoring oyster beds along the Georgia coast. We "seed" experimental plots with larvae and then record the number of oysters that establish within a randomly-selected quadrat within each plot. We also record the substrate (rock vs sand), water temperature, water depth, salinity, and water clarity of each plot .pull-left[ .vsmall-code[ ``` r data("oysterdata") head(oysterdata) ``` ``` ## oyster temperature substrate depth clarity salinity ## 1 0 50.32 rock 7.515 0.42599 0.81431 ## 2 0 43.20 rock 4.110 0.87892 0.93008 ## 3 1 52.15 rock 6.298 0.07195 0.23231 ## 4 4 59.88 rock 3.056 0.10583 0.65415 ## 5 1 51.23 rock 1.502 0.32987 0.83385 ## 6 6 48.53 rock 1.866 0.30914 0.09163 ``` ] ] .pull-right[ <br/> <img src="21_zero_inflation_files/figure-html/unnamed-chunk-3-1.png" width="288" style="display: block; margin: auto;" /> ] --- # zero inflation Zero inflation occurs when there are more zeros in the response variable `\(y\)` than expected under the assumed probability distribution (usually the Poisson distribution) -- .pull-left[ .small-code[ ``` r fit.pr <- glm(oyster ~ ., data = oysterdata, family = "poisson") ``` ] ] .pull-right[ .vvsmall-code[ ``` r summary(fit.pr) ``` ``` ## ## Call: ## glm(formula = oyster ~ ., family = "poisson", data = oysterdata) ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -4.6070 0.5910 -7.80 6.4e-15 *** ## temperature 0.1158 0.0104 11.11 < 2e-16 *** ## substratesand -0.9148 0.1124 -8.14 4.0e-16 *** ## depth -0.0703 0.0198 -3.56 0.00038 *** ## clarity 0.0232 0.1834 0.13 0.89945 ## salinity -0.3483 0.1733 -2.01 0.04438 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 753.98 on 231 degrees of freedom ## Residual deviance: 490.90 on 226 degrees of freedom ## (18 observations deleted due to missingness) ## AIC: 841.9 ## ## Number of Fisher Scoring iterations: 6 ``` ] ] --- # zero inflation For models with with the `glm()` function, we can test for zero inflation: .vsmall-code[ ``` r performance::check_zeroinflation(fit.pr) ``` ``` ## # Check for zero-inflation ## ## Observed zeros: 119 ## Predicted zeros: 73 ## Ratio: 0.61 ``` ``` ## Model is underfitting zeros (probable zero-inflation). ``` ] -- If we suspect zero inflation, we have several options: - Negative binomial regression - Hurdle models - Zero-inflated mixture model --- class: inverse, center, middle # negative binomial regression --- # negative binomial distribution The negative binomial distribution is a *discrete* probability distribution (only integers) with support (i.e., possible values) from 0 to `\(+\infty\)` -- The negative binomial distribution has two parameters: `\(\mu\)` and `\(\theta\)` -- - The mean of the negative binomial distribution is `\(\mu\)` - The variance is `\(\mu + \frac{\mu^2}{\theta}\)` - As `\(\theta \to \infty\)`, the negative binomial distribution converges to the Poisson -- In `R`: .vsmall-code[ ``` r x <- rnbinom(n = 1000, mu = 1, size = 1) # theta = size mean(x) ``` ``` ## [1] 1.015 ``` ``` r var(x) ``` ``` ## [1] 2.125 ``` ] --- # negative binomial distribution Because the variance is larger than the mean, the negative binomial distribution is often used to model over-dispersed data, including zero inflation -- .left-nudge[ .small-code[ ``` r library(MASS) fit.nb <- glm.nb(oyster ~ ., data = oysterdata) ``` ] ] .right-nudge[ .vvsmall-code[ ``` r summary(fit.nb) ``` ``` ## ## Call: ## glm.nb(formula = oyster ~ ., data = oysterdata, init.theta = 1.057151222, ## link = log) ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -4.8872 0.9499 -5.14 2.7e-07 *** ## temperature 0.1262 0.0173 7.29 3.2e-13 *** ## substratesand -0.9872 0.1813 -5.45 5.2e-08 *** ## depth -0.1059 0.0348 -3.04 0.0024 ** ## clarity -0.1884 0.3287 -0.57 0.5665 ## salinity -0.2624 0.3055 -0.86 0.3905 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for Negative Binomial(1.057) family taken to be 1) ## ## Null deviance: 340.60 on 231 degrees of freedom ## Residual deviance: 234.91 on 226 degrees of freedom ## (18 observations deleted due to missingness) ## AIC: 756.9 ## ## Number of Fisher Scoring iterations: 1 ## ## ## Theta: 1.057 ## Std. Err.: 0.225 ## ## 2 x log-likelihood: -742.860 ``` ] ] --- # negative binomial regression Did the negative binomial model address the zero inflation? -- .small-code[ ``` r performance::check_zeroinflation(fit.nb) ``` ``` ## # Check for zero-inflation ## ## Observed zeros: 119 ## Predicted zeros: 105 ## Ratio: 0.88 ``` ``` ## Model is underfitting zeros (probable zero-inflation) (p = 0.080). ``` ] -- Although the `check_zeroinflation()` indicates some remaining zero inflation, the negative binomial model seems to fit relatively well - additional tests are possible and probably warranted. See Fieberg chp. 17.3 for additional detail -- Sometimes, rather than address zero inflation using a different distribution, we may suspect that (some of the) zero's are the result of a *different process* than the one generating the counts - What are some examples? - In these cases, models that treat zeros as a mixture of different distributions may be warranted --- class: inverse, center, middle # hurdle models --- # hurdle models Hurdle models are actually composed of two separate models, one estimating the probability that an observation is 0 and another modeling the non-zero counts -- - What distribution do you think is used for the zero/non-zero part of the model? -- + Assume `\(z_i = 0\)` if `\(y_i = 0\)`, otherwise `\(z_i = 1\)` `$$z_i \sim Bernouli(p_i)$$` `$$logit(p_i) = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2}...$$` -- - What distribution do you think is used for the non-zero observations? -- + zero-truncated Poisson/Negative binomial --- # zero-truncated distributions <br/> <img src="21_zero_inflation_files/figure-html/unnamed-chunk-11-1.png" width="576" style="display: block; margin: auto;" /> --- # hurdle models .left-nudge[ .vsmall-code[ ``` r library(pscl) fit.hrp <- hurdle(oyster ~ ., data = oysterdata) ``` ] ] .right-nudge[ .vvsmall-code[ ``` r summary(fit.hrp) ``` ``` ## ## Call: ## hurdle(formula = oyster ~ ., data = oysterdata) ## ## Pearson residuals: ## Min 1Q Median 3Q Max ## -1.651 -0.576 -0.343 0.483 4.656 ## ## Count model coefficients (truncated poisson with log link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.7032 0.6980 -5.31 1.1e-07 *** ## temperature 0.0944 0.0124 7.59 3.3e-14 *** ## substratesand -0.0861 0.1248 -0.69 0.49 ## depth 0.0135 0.0218 0.62 0.54 ## clarity 0.0505 0.1936 0.26 0.79 ## salinity -0.2491 0.1780 -1.40 0.16 ## Zero hurdle model coefficients (binomial with logit link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.3330 1.5656 -2.13 0.03326 * ## temperature 0.1130 0.0299 3.78 0.00016 *** ## substratesand -1.8083 0.3158 -5.73 1e-08 *** ## depth -0.2192 0.0628 -3.49 0.00049 *** ## clarity -0.3924 0.5693 -0.69 0.49061 ## salinity -0.3552 0.5349 -0.66 0.50667 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Number of iterations in BFGS optimization: 15 ## Log-likelihood: -334 on 12 Df ``` ] ] --- # hurdle models .left-nudge[ .vsmall-code[ ``` r fit.hrnb <- hurdle(oyster ~ ., data = oysterdata, dist = "negbin") ``` ] ] .right-nudge[ .vvsmall-code[ ``` r summary(fit.hrnb) ``` ``` ## ## Call: ## hurdle(formula = oyster ~ ., data = oysterdata, dist = "negbin") ## ## Pearson residuals: ## Min 1Q Median 3Q Max ## -1.651 -0.576 -0.343 0.483 4.656 ## ## Count model coefficients (truncated negbin with log link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.7034 0.6981 -5.30 1.1e-07 *** ## temperature 0.0944 0.0124 7.58 3.4e-14 *** ## substratesand -0.0861 0.1248 -0.69 0.49 ## depth 0.0135 0.0218 0.62 0.54 ## clarity 0.0505 0.1936 0.26 0.79 ## salinity -0.2491 0.1780 -1.40 0.16 ## Log(theta) 10.1956 98.1745 0.10 0.92 ## Zero hurdle model coefficients (binomial with logit link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.3330 1.5656 -2.13 0.03326 * ## temperature 0.1130 0.0299 3.78 0.00016 *** ## substratesand -1.8083 0.3158 -5.73 1e-08 *** ## depth -0.2192 0.0628 -3.49 0.00049 *** ## clarity -0.3924 0.5693 -0.69 0.49061 ## salinity -0.3552 0.5349 -0.66 0.50667 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Theta: count = 26785.49 ## Number of iterations in BFGS optimization: 55 ## Log-likelihood: -334 on 13 Df ``` ] ] --- # hurdle models Sub-models can contain the same or different covariates -- Parameters for sub-models are interpreted in the same way as for logistic/Poisson regression -- `\(\lambda_i\)` is the expected value (i.e., mean) of the non-zero observations, not the overall mean of all observations (which is `\(\frac{1-p_i}{1-e^{\lambda_i}}\lambda_i\)`) -- Models can be compared using AIC - note that `performance::check_zeroinflation()` does not work for `hurdle` model objects -- Generally, Hurdle models are most appropriate when the count-based process cannot generate zeros (i.e., zeros are determined by a separate process than counts) - Examples? --- class: inverse, center, middle # zero-inflated mixture models --- # zero-inflated mixture models Sometimes, zero observations can occur due to both the binomial process and the count process - Habitat may be suitable or unsuitable, but abundance may be zero at suitable locations (especially if density is low) - Individuals may be breeders or non-breeders, but some breeders will still fail to produce offspring (e.g., nest predation) -- In these cases, the zero-truncated distribution used by the Hurdle model is not appropriate -- Zero-inflated mixture models allow zero observations to arise from both a logistic/binomial process and a count model - Both Poisson and negative binomial count models are possible --- # zero-inflated mixture models .left-nudge[ .vsmall-code[ ``` r fit.zip <- zeroinfl(oyster ~ .|., data = oysterdata) ``` ] ] .right-nudge[ .vvsmall-code[ ``` r summary(fit.zip) ``` ``` ## ## Call: ## zeroinfl(formula = oyster ~ . | ., data = oysterdata) ## ## Pearson residuals: ## Min 1Q Median 3Q Max ## -1.686 -0.574 -0.343 0.535 4.489 ## ## Count model coefficients (poisson with log link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.8550 0.6942 -5.55 2.8e-08 *** ## temperature 0.0973 0.0125 7.80 6.0e-15 *** ## substratesand -0.0734 0.1229 -0.60 0.55 ## depth 0.0139 0.0216 0.64 0.52 ## clarity 0.0242 0.1912 0.13 0.90 ## salinity -0.2573 0.1750 -1.47 0.14 ## ## Zero-inflation model coefficients (binomial with logit link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.4511 2.0654 0.22 0.82712 ## temperature -0.0666 0.0371 -1.80 0.07222 . ## substratesand 2.0329 0.3784 5.37 7.8e-08 *** ## depth 0.2554 0.0725 3.52 0.00043 *** ## clarity 0.4385 0.6488 0.68 0.49917 ## salinity 0.2308 0.6026 0.38 0.70176 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Number of iterations in BFGS optimization: 21 ## Log-likelihood: -333 on 12 Df ``` ] ] --- # zero-inflated mixture models .left-nudge[ .vsmall-code[ ``` r fit.zinb <- zeroinfl(oyster ~ .|., data = oysterdata, dist = "negbin") ``` ] ] .right-nudge[ .vvsmall-code[ ``` r summary(fit.zinb) ``` ``` ## ## Call: ## zeroinfl(formula = oyster ~ . | ., data = oysterdata, dist = "negbin") ## ## Pearson residuals: ## Min 1Q Median 3Q Max ## -1.686 -0.574 -0.343 0.535 4.489 ## ## Count model coefficients (negbin with log link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.8548 0.6942 -5.55 2.8e-08 *** ## temperature 0.0973 0.0125 7.80 6.0e-15 *** ## substratesand -0.0734 0.1229 -0.60 0.55 ## depth 0.0139 0.0216 0.64 0.52 ## clarity 0.0242 0.1912 0.13 0.90 ## salinity -0.2573 0.1750 -1.47 0.14 ## Log(theta) 14.0816 NaN NaN NaN ## ## Zero-inflation model coefficients (binomial with logit link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.4522 2.0653 0.22 0.82667 ## temperature -0.0667 0.0371 -1.80 0.07214 . ## substratesand 2.0328 0.3784 5.37 7.8e-08 *** ## depth 0.2554 0.0725 3.52 0.00043 *** ## clarity 0.4384 0.6488 0.68 0.49926 ## salinity 0.2306 0.6026 0.38 0.70192 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Theta = 1304901.633 ## Number of iterations in BFGS optimization: 55 ## Log-likelihood: -333 on 13 Df ``` ] ] --- # model comparison Generally, alternative zero-inflated models are compared using AIC |Model | k| AIC| delta AIC| |:---------------------|--:|-----:|---------:| |Zero-inflated Poisson | 12| 690.4| 0.000| |Zero-inflated NegBig | 13| 692.4| 2.000| |Poisson Hurdle | 12| 692.5| 2.078| |NegBig Hurdle | 13| 694.5| 4.078| |NegBin GLM | 7| 756.9| 66.479| |Poisson GLM | 6| 841.9| 151.496| -- Do AIC scores tell us if the models are "good"? -- - NO! - Goodness-of-fit tests are also important for zero-inflated model. See [Fieberg 17.7](https://fw8051statistics4ecologists.netlify.app/zimodels#goodness-of-fit) for examples --- # looking ahead <br/> ### **Next time**: Exam 3!