GLMs for modeling count data
Clark Rushing
Lab4_glms.Rmd
In this activity, we will analyze a small data set containing counts of both population size and reproductive success using Poisson and Binomial GLMs.
Objectives
Analyze count data using Poisson and Binomial GLMs
Gain experience diagnosing Nimble output to check convergence
R functions used in this exercise: nimbleCode(), nimbleMCMC(), nimbleModel(), configureMCMC(), buildMCMC(), compileNimble(), coda::gelman.diag(), MCMCvis::MCMCtrace()
Data
The data for this activity comes from a long-term project that monitored a population of peregrine falcons nesting in the French Jura from 1964 to 2003 1.
Load and inspect the data:
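The code below is only a sketch of how you might load the data; the file name falcons.csv is an assumption, so adjust the path to match wherever your copy of the data is saved:
# Sketch only: assumes the counts are stored in "falcons.csv" in your working directory
falcons <- read.csv("falcons.csv")

head(falcons) # first six rows (shown in the table below)
str(falcons)  # confirm all four columns are integers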
| Year | Pairs | R.Pairs | Eyasses |
|---|---|---|---|
| 1964 | 34 | 19 | 27 |
| 1965 | 45 | 18 | 30 |
| 1966 | 39 | 22 | 36 |
| 1967 | 36 | 19 | 37 |
| 1968 | 20 | 8 | 19 |
| 1969 | 18 | 7 | 10 |
The falcons data frame has 4 columns:
- Year: the year (integer).
- Pairs: the number of adult pairs (integer).
- R.Pairs: the number of reproductive pairs (integer).
- Eyasses: the number of fledged young (integer).
Analysis 1: Change in population size
The first model we will fit examines change in the number of adult falcon pairs over time. Plotting the data shows that this change has not been linear:
library(ggplot2)

ggplot(falcons, aes(x = Year, y = Pairs)) +
  geom_point() +
  stat_smooth(se = FALSE)
After a short decline at the beginning of the study period, the population then increased dramatically before perhaps reaching its carrying capacity.
Modeling non-linear effects using linear models
How can we model the non-linear change in abundance if, by definition, linear models model linear effects? Using polynomials!
Remember the equation for a curved line with a single peak (or trough):
y = a + b \times x + c \times x^2

Where a is the value of y when x = 0 (the intercept), b and c together determine where the peak (or trough) occurs (at x = -b/(2c)), and c determines whether that point is a maximum (c < 0) or a minimum (c > 0).
We can create more complex shapes by adding additional polynomial terms. For example, including a cubic term creates an s-shaped curve:
y = a + b \times x + c \times x^2 + d \times x^3

Including polynomial terms in the linear predictor of the model gives us enormous flexibility to model non-linear relationships using GLMs.
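If you want to see these shapes for yourself, the quick base-R sketch below plots one example of each curve (the coefficient values are arbitrary, chosen only for illustration):
# Arbitrary coefficients, chosen only to make the shapes visible
curve(2 + 1 * x - 0.5 * x^2 + 0.3 * x^3, from = -4, to = 4,
      ylab = "y", lty = 2)               # cubic term: s-shaped curve
curve(2 + 1 * x - 0.5 * x^2, from = -4, to = 4,
      add = TRUE)                        # quadratic only: single peak (or trough)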
Modeling change in falcon counts
To build a model for the falcon data, we need to define the components required in all GLMs (the distribution, link function, and linear predictor). Because these are counts, a natural choice for the distribution is:
C_t \sim Poisson(\lambda_t)

where C_t is the observed count in year t and \lambda_t is the expected count.
As we learned in lecture, the conventional link function for count data is the log link, which means we model the log of the expected count:
log(\lambda_t) = log(E(C_t))
Finally, we need to write the linear predictor. Based on the preliminary visualization of the data, a cubic polynomial might be appropriate to capture the non-linear change over time:
log(\lambda_t) = \alpha + \beta_1 \times year_t + \beta_2 \times year^2_t + \beta_3 \times year^3_t
Accessing and viewing the Nimble model
You can copy and paste the following code into your own R script.
library(nimble)

falcon_mod <- nimbleCode({
  # Priors
  alpha ~ dnorm(0, 0.33)
  beta1 ~ dnorm(0, 0.33)
  beta2 ~ dnorm(0, 0.33)
  beta3 ~ dnorm(0, 0.33)

  # Likelihood
  for (i in 1:n){
    C[i] ~ dpois(lambda[i])
    log(lambda[i]) <- alpha + beta1 * year[i] + beta2 * pow(year[i], 2) + beta3 * pow(year[i], 3)
  } #end i
})

From this code, you can see that we will use relatively non-informative normal priors on each of the regression coefficients.
You can also see that the likelihood statement is very similar to the
linear regression model from the last lecture, with a few minor
differences. First, because we assume the observed falcon counts come
from a Poisson distribution, we use dpois(lambda[i]) rather
than dnorm(mu[i], tau). Also, we have to apply the log-link
function to the predicted counts (log(lambda[i])=...).
Notice that Nimble allows you to model the transformed predicted counts
on the left-hand side of the linear predictor equation.
In-Lab Questions
- Plot a histogram of random samples from the normal prior used in the model (remember to convert the precision of 0.33 to a standard deviation). As you can see, this is not as vague as the normal priors we have used in the past. What is the advantage of using less-vague normal priors?
- In the linear regression model we fit in the last lecture, we also had a prior for \tau, the inverse of the process variance. Why do we not include that parameter in this model?
- Creating the lambda[i] object is not strictly necessary since it is a deterministic function of year. If you wanted fewer lines of code, you could include the linear predictor directly inside of the dpois() function instead of lambda[i], though you would need to appropriately transform the linear predictor. What transformation would you use to put the linear predictor on the count scale?
Fitting the model
Before fitting the model, we need to prepare the input for Nimble. This includes:
- storing the data as a named list
- storing the constants as a named list
- creating a function to randomly generate initial values for each parameter
- creating a vector with the parameters we want Nimble to monitor
- choosing the MCMC settings (number of chains, iterations, burn-in, and thinning)
As we've mentioned several times, it's often a bad idea to include covariate values that are too far from 0. For this reason, we will first scale year to mean = 0 and sd = 1:
year <- (falcons$Year - mean(falcons$Year))/sd(falcons$Year)
nimdat <- list(C = falcons$Pairs)
nimconsts <- list(year = year, n = nrow(falcons))
niminits <- function(){list(alpha = rnorm(1), beta1 = rnorm(1), beta2 = rnorm(1), beta3 = rnorm(1))}
params <- c("alpha", "beta1", "beta2", "beta3", "lambda")

nC <- 3     # chains
nI <- 10000 # iterations
nB <- 2500  # burn-in
nT <- 1     # thinning

Now we're ready to run the model. The easiest way to run a Nimble model is to run it in one command:
falcon_res <- nimbleMCMC(code = falcon_mod,
data = nimdat,
constants = nimconsts,
inits = niminits(),
monitors = params,
thin = nT,
niter = nI,
nburnin = nB,
nchains = nC,
samplesAsCodaMCMC = TRUE
)
However, sometimes (as you get more advanced) you'll want to customize your Nimble runs a bit more. It's also much easier to error-check your initial values for models that take a very long time to run if you build and run them step by step. Here's what that would look like with one chain:
prepnim <- nimbleModel(code = falcon_mod, constants = nimconsts,
                       data = nimdat, inits = niminits(), calculate = TRUE)
prepnim$initializeInfo()  # tells you what is or isn't initialized
prepnim$calculate()       # if this is NA or -Inf, you know something has gone wrong
mcmcnim <- configureMCMC(prepnim, monitors = params, print = TRUE)
nimMCMC <- buildMCMC(mcmcnim)     # actually build the code for those samplers
Cmodel <- compileNimble(prepnim)  # compile the model itself in C++
Compnim <- compileNimble(nimMCMC, project = prepnim)  # compile the samplers next
Compnim$run(niter = nI, nburnin = nB, thin = nT)
falcon_res <- as.mcmc(as.matrix(Compnim$mvSamples))

View a portion of the results (printing all of the lambda values takes up too much room):
summary(falcon_res[,c('alpha', 'beta1', 'beta2', 'beta3', 'lambda[1]', 'lambda[2]', 'lambda[3]')])$quantiles

|  | 2.5% | 25% | 50% | 75% | 97.5% |
|---|---|---|---|---|---|
| alpha | 4.1739 | 4.2118 | 4.2306 | 4.2501 | 4.2884 |
| beta1 | 1.0238 | 1.0844 | 1.1159 | 1.1452 | 1.2022 |
| beta2 | -0.0393 | -0.0082 | 0.0075 | 0.0229 | 0.0527 |
| beta3 | -0.2785 | -0.2490 | -0.2335 | -0.2171 | -0.1853 |
| lambda[1] | 26.6889 | 30.2211 | 32.3052 | 34.4064 | 38.6127 |
| lambda[2] | 25.6956 | 28.5859 | 30.2625 | 31.9602 | 35.2361 |
| lambda[3] | 25.0503 | 27.4629 | 28.8161 | 30.2126 | 32.8666 |
Note that the credible interval for beta2 crosses 0 - there's not much of an effect of year^2 on lambda.
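If you want to quantify that statement, one option (not required for the lab) is to compute the posterior probability that beta2 is positive directly from the samples:
# Combine the chains into a matrix and compute Pr(beta2 > 0);
# a value near 0.5 means the posterior is centered near 0 (little evidence of an effect)
mean(as.matrix(falcon_res)[, "beta2"] > 0)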
This seems reasonable, but let’s make sure the Rhat
values are less than 1.1. To do this, we’ll use the
gelman.diag() function from the coda
package.
library(coda)
gelman.diag(falcon_res, multivariate = F)$psrf[1:10,] #don't want to print out all 40 lambdas
#> Point est. Upper C.I.
#> alpha 1.002 1.008
#> beta1 1.004 1.011
#> beta2 1.004 1.012
#> beta3 1.004 1.009
#> lambda[1] 1.003 1.007
#> lambda[2] 1.003 1.006
#> lambda[3] 1.003 1.006
#> lambda[4] 1.002 1.006
#> lambda[5] 1.002 1.006
#> lambda[6] 1.002 1.007

All parameters appear to have converged.
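If you'd like an additional diagnostic beyond Rhat, the coda package can also report effective sample sizes for the monitored parameters (optional, but a useful habit):
# Effective sample size for the regression coefficients
# (larger values indicate better mixing of the chains)
effectiveSize(falcon_res[, c("alpha", "beta1", "beta2", "beta3")])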
As usual, let’s check the trace plots to see how they look:
# View traceplots for alpha, beta1, beta2, and beta3 (not for lambda)
MCMCvis::MCMCtrace(falcon_res[,params[-5],], Rhat = T, pdf = F)

By monitoring lambda we can also plot the predicted
counts along with the observed counts. First, we need to calculate the
posterior means and upper/lower bounds of the 95% credible interval and
add them to the falcons data frame, then use ggplot to visualize:
#get the quantiles for just the lambda parameters:
lambdas <- summary(falcon_res[,paste0('lambda[', 1:40, ']')])$quantiles
falcons <- dplyr::mutate(falcons, lambda = lambdas[,3],
q2.5 = lambdas[,1],
q97.5 = lambdas[,5])
ggplot(falcons) +
geom_ribbon(aes(x = Year, ymin = q2.5, ymax = q97.5), fill = "grey90") +
geom_path(aes(x = Year, y = lambda), color = "red") +
geom_point(aes(x = Year, y = Pairs)) +
scale_y_continuous("Pairs")+
theme_bw()
Analysis 2: Nest success model
Next, let’s use the falcons data set to fit another type
of GLM - the binomial GLM. Hopefully this exercise will show you that
once you’re comfortable writing out and coding the GLM components
(distribution, link function, and linear predictor), it is extremely
easy to fit models with different distributional assumptions.
To estimate reproductive success (i.e., the probability that a pair
successfully produces offspring), we will model the number of
reproductive pairs (falcons$R.Pairs) as a function of the
total number of pairs (falcons$Pairs).
Because the number of reproductive pairs cannot exceed the total
number of pairs, the counts in falcons$R.Pairs are
bounded above by falcons$Pairs. In this case, the
Poisson distribution is not an appropriate count model. Instead, we will
use the binomial distribution:
C_t \sim binomial(N_t, p_t)
Our goal is to model p_t, the probability of nesting successfully in each year. In this case, the log link is not appropriate because p_t is bounded between 0 and 1. For probabilities, the logit link is generally the appropriate link function:
logit(p_t) = log\bigg(\frac{p_t}{1-p_t}\bigg)
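As a quick reminder of what this link does, base R's qlogis() and plogis() functions compute the logit and its inverse (no extra packages needed):
# logit and inverse-logit in base R
qlogis(0.5)          # logit(0.5) = 0 (the log-odds of an even chance)
plogis(0)            # inverse-logit of 0 = 0.5
plogis(c(-2, 0, 2))  # any real value maps back to a probability in (0, 1)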
Following Kéry & Schaub, we’ll model probability as a quadratic function of year:
logit(p_t) = \alpha + \beta_1 \times year_t + \beta_2 \times year^2_t
As in the last example, you can copy the code here. Note that in Nimble the order of the arguments for the binomial is probability and then size, unlike what we would type in R:
pairs_mod <- nimbleCode({
  # Priors
  alpha ~ dnorm(0, 0.33)
  beta1 ~ dnorm(0, 0.33)
  beta2 ~ dnorm(0, 0.33)

  # Likelihood
  for (t in 1:nyears){
    C[t] ~ dbinom(p[t], N[t])
    logit(p[t]) <- alpha + beta1 * year[t] + beta2 * pow(year[t], 2)
  } #end t
})

As before, we prepare the data and run the model:
#using standardized year from above
nimdat2 <- list(C = falcons$R.Pairs)
nimconsts2 <- list(N = falcons$Pairs, year = year, nyears = nrow(falcons))
niminits2 <- function(){list(alpha = rnorm(1), beta1 = rnorm(1), beta2 = rnorm(1))}
params2 <- c("alpha", "beta1", "beta2", "p")
nC <- 3
nI <- 10000
nB <- 2500
nT <- 1
pairs_res <- nimbleMCMC(code = pairs_mod,
data = nimdat2,
constants = nimconsts2,
inits = niminits2(),
monitors = params2,
thin = nT,
niter = nI,
nburnin = nB,
nchains = nC,
samplesAsCodaMCMC = TRUE
)

|  | 2.5% | 25% | 50% | 75% | 97.5% |
|---|---|---|---|---|---|
| alpha | 0.6699 | 0.7432 | 0.7818 | 0.8188 | 0.8914 |
| beta1 | -0.0298 | 0.0279 | 0.0585 | 0.0893 | 0.1469 |
| beta2 | -0.3833 | -0.3305 | -0.3026 | -0.2747 | -0.2220 |
| p[1] | 0.3893 | 0.4352 | 0.4602 | 0.4854 | 0.5324 |
| p[2] | 0.4163 | 0.4591 | 0.4825 | 0.5058 | 0.5489 |
| p[3] | 0.4422 | 0.4821 | 0.5037 | 0.5252 | 0.5649 |
# View traceplots for alpha, beta1, and beta2 (not for p)
MCMCvis::MCMCtrace(pairs_res[,params2[-4],], pdf = F, Rhat = T)
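As with the first model, it is worth confirming that the Rhat values are below 1.1; a quick check using the same gelman.diag() call as before:
# Convergence check for the binomial model (first few parameters only)
gelman.diag(pairs_res, multivariate = FALSE)$psrf[1:6, ]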
ps <- summary(pairs_res[,paste0('p[', 1:40, ']')])$quantiles
falcons <- dplyr::mutate(falcons, p = ps[,3],
q2.5_p = ps[,1],
q97.5_p = ps[,5])
ggplot(falcons) +
geom_ribbon(aes(x = Year, ymin = q2.5_p, ymax = q97.5_p), fill = "grey90") +
geom_path(aes(x = Year, y = p), color = "red") +
geom_point(aes(x = Year, y = R.Pairs/Pairs)) +
scale_y_continuous("Pairs")+
theme_classic()
Homework Questions
Using what you’ve learned in this lab, write out a model for the expected number of nestlings (Eyasses) per reproducing pair (R.Pairs) in each year.
- Begin by writing out the mathematical formulation of your model
- Write out the Nimble code.
- Provide your model with data, constants and initial values
- Check your output for convergence
- Display a ggplot showing the expected number of nestlings per reproducing pair in each year (with 95% credible interval) vs the raw data. Don’t forget a title and axis labels.
It may help to plot the data first to see what type of data you are working with:
ggplot(falcons, aes(x = Year, y = Eyasses/R.Pairs))+
geom_line(lwd = 1, lty = 2)+
geom_point()+
geom_smooth()+
theme_bw()
- In our second analysis in lab, we used a binomial GLM to describe the proportion of successful peregrine pairs per year in the French Jura mountains. To see the connections between three important types of GLMs, first use a Poisson GLM to model the number of successful pairs (thus disregarding the fact that the binomial total varies by year), and second, use a normal GLM to do the same. In the same graph, compare the predicted numbers of successful pairs for every year under all three models (binomial, Poisson, and normal GLMs). [This assignment stolen directly from the WinBUGS book, so blame Marc Kéry and Michael Schaub for this one.]
Note: If you find that your normal-distribution model predicts extremely low counts, take a close look at your priors. If the priors are chosen sensibly, you should see all three models roughly overlap the raw data.
On a 1-10 scale, with 1 being the worst week ever and 10 being the best, how would you rate this week’s content? What lingering questions/confusion about the lecture or lab do you still have?