class: center, middle, inverse, title-slide # Lecture 3 ## Principles of Bayesian inference ###
WILD6900 (Spring 2021) --- ## Readings > Hobbs & Hooten 71-78; 79-105 --- ## From probability to Bayes theorem #### Our goal as ecologists is to understand processes that we **cannot directly observe** based on quantities that we *can* observe -- #### We will refer to the unobserved processes as `\(\large \theta\)` -- #### `\(\large \theta\)` can include parameters of our models or latent states - population size - occupancy status - alive/dead state of individuals -- #### Each of these unobserved processes is governed by a probability distribution `\(\large [\theta]\)` --- ## From probability to Bayes theorem #### To learn about `\(\theta\)`, we take observations `\(\large y\)` -- #### Before those data are collected, they are *random variables* > the probability of observing `\(y\)` conditional on `\(\theta\)` is governed by a probability distribution `\([y|\theta]\)` -- #### We want to know the probability distribution of the unobserved `\(\large \theta\)` conditional on the observed data `\(\large y\)`, that is `\(\large [\theta|y]\)` --- ## From probability to Bayes theorem #### We know from last week that: `\(^1\)` `$$\Large [\theta|y] = \frac{[\theta, y]}{[y]}$$` -- #### and `\(^2\)` `$$\Large [\theta, y] = [y|\theta][\theta]$$` -- #### Through substitution, we get **Bayes theorem** `\(^3\)`: `$$\Large [\theta|y] = \frac{[y|\theta][\theta]}{[y]} \tag{1}$$` ??? `\(^1\)` This is the definition of conditional probabilities `\(^2\)` This is the definition of joint probabilities `\(^3\)` Although the use of Bayesian methods has historically been somewhat controversial, Bayes theorem is not. It is a simple outcome of basic probability --- ## From probability to Bayes theorem #### To understand what Bayes theorem says and why it is such a powerful principle, let's break down each part of equation 1: `$$\LARGE \underbrace{[\theta|y]}_{posterior\; distribution} = \frac{\overbrace{[y|\theta]}^{likelihood} \;\;\;\;\overbrace{[\theta]}^{prior}}{\underbrace{[y]}_{marginal\; distribution}}$$` --- class: inverse, center, middle # The likelihood distribution --- ## Likelihood The concept of likelihood may be familiar to you from previous statistics classes because it is the central principle of *frequentist* statistics <br/> -- The likelihood allows us to answer the question: > what is the probability that we will observe the data if our deterministic model `\(g(\theta)\)` is the true process that gives rise to the data? <br/> -- That is, in likelihood, we treat `\(\theta\)` as *fixed* and *known* rather than a random variable + By assuming `\(\theta\)` is fixed and known, we can calculate the probability density of our observation `\(y\)` conditional on `\(\theta\)` --- ## Likelihood #### **Example** Let's say we're sampling the number of trees on small plots and that we **_know_** the expected number of trees/plot is 40 `\(^4\)` <br/> -- On one plot, we observe 34 trees. What is the probability of `\(\large y = 34\)`? <br/> -- To answer this question, we first need to select a sensible probability distribution for the number of trees on a plot -- + Because these values have to be positive integers, the Poisson distribution is an obvious choice `\(^5\)` -- Next, we calculate the probability `\(\large Pr(y=34|\lambda = 40)\)`: ```r dpois(x = 34, lambda = 40) ``` ``` ## [1] 0.04247 ``` ??? `\(^4\)` Obviously in real life we would not know this but again, likelihood assumes that this quantify is known `\(^5\)` Remember that the expected value of a Poisson distribution is governed by the parameter `\(\lambda\)`. In this case, we know that `\(\lambda = 40\)` --- ## Likelihood **Example** On a second plot, we observed 42 trees <br/> -- What is the probability of that observation? + `\(\large Pr(y=42|\lambda=40) =\)` 0.058 <br/> -- What is the joint probability of both observations? <br/> -- Assuming the observations are independent, the joint probability (probability of `\(\large y=34\)` *and* `\(\large y=42\)`) is the product of the individual probabilities: + 0.04 x 0.06 = `0.00059`. --- ## Likelihood #### In this example, we start by assuming we know that `\(\large \lambda = 40\)` <br/> -- #### Of course that doesn't make much sense -- In our research, we never know `\(\large \lambda\)` (or to be consistent with eq. 1, `\(\large \theta\)`) <br/> - We want to estimate `\(\large \lambda\)` using our data -- We do this by using a *likelihood function*: `$$\Large \underbrace{L(\theta|y)}_{likelihood\; function} = \underbrace{[y|\theta]}_{likelihood} = \prod_{i=1}^n [y_i|\theta]$$` ??? This equation means that the likelihood of `\(\theta\)` given the data `\(y\)` is equal to the probability of the data conditional on `\(\theta\)` --- ## Likelihood profile An important distinction between the probability distribution `\(\large [y|\theta]\)` and a likelihood function `\(\large L(\theta|y)\)` is: + In the probability distribution, we treat the parameter as fixed and the data as random + In the likelihood function, we treat the data as fixed and the parameter as variable --- ## Likelihood profile In our example, the tree counts `\(\large y\)` are random variables - they can take a range of possible values due to chance The probability distribution `\(\large [y|\theta]\)` tells us the probability of each possible value `\(y\)`: ```r y_probs <- data.frame(y = 15:65, pr_y = dpois(15:65, lambda = 40)) ggplot(data = y_probs, aes(x = y, y = pr_y)) + geom_point() ``` <img src="Lecture3_files/figure-html/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" /> ??? To reiterate, the plot above assumes we know `\(\lambda = 40\)` and that the `\(y\)`'s are random variables from a Poisson distribution. Because this is a probability distribution, the area under the curve is 1. --- ## Likelihood profile To create a *likelihood profile*, we flip this around We treat our observation as fixed (for simplicity, let's use our observation `\(y=42\)`) and estimate the probability as a function of different values of `\(\lambda\)`: ```r y <- 42 y_probs <- data.frame(lambda = 15:65, pr_y = dpois(y, lambda = 15:65)) (lik_profile <- ggplot(data = y_probs, aes(x = lambda, y = pr_y)) + geom_path()) ``` <img src="Lecture3_files/figure-html/unnamed-chunk-5-1.png" width="432" style="display: block; margin: auto;" /> --- ## Likelihood profile #### In this plot, the area under the curve does not equal 1 - the likelihood profile is *not* a probability distribution <img src="Lecture3_files/figure-html/unnamed-chunk-6-1.png" width="720" style="display: block; margin: auto;" /> --- ## Likelihood profile #### Saying that `\(\large \lambda\)` is not fixed allows us to estimate the likelihood profile by varying the values of `\(\large \lambda\)` #### But this is not the same as saying it's a random variable! -- #### For something to be a random variable, it must be defined by a probability distribution + For the likelihood profile, we have not defined a probability distribution for `\(\large \lambda\)` (that is `\(\large [\lambda]\)`) <br/> -- As a result, we vary `\(\large \lambda\)` but it is not a random variable and likelihood profiles do not define the probability or probability density of `\(\large \lambda\)` --- ## Likelihood profile This distinction between likelihood profiles and probability distributions is one of the reasons that results of likelihood-based methods can be difficult to interpret Many of the methods familiar to ecologists use the principle of maximum likelihood to determine the value of `\(\large \theta\)` that is most supported by our data The maximum likelihood estimate is the peak of the likelihood curve `\(^6\)`: <img src="Lecture3_files/figure-html/unnamed-chunk-7-1.png" width="468" style="display: block; margin: auto;" /> ??? `\(^6\)` In a real study we would not base the MLE on a single observation --- ## Likelihood profile #### But the MLE does not tell us the probability of `\(\large \theta\)` given our data! #### So although MLE does tell us the value of `\(\large \theta\)` that is most consistent with our data, we can not say things like: + "There is a 90% probability that `\(\large \theta > 0\)`" `\(^7\)` + "There is a 96% probability that `\(\large a \geq \theta \geq b\)`" `\(^7\)` ??? `\(^7\)` Even though these are the things we usually want to know! --- class: inverse, center, middle # The prior `\(^8\)` ??? `\(^8\)` Choosing prior distributions is a complex topic that is an area of active research in the statistical community. As a result, the cultural norms for using priors in ecological modeling appears to be rapidly changing. For these reasons, we'll spend a good deal of time discussing how to choose priors in the next lecture. --- ## The prior distribution As we just learned, `\(\large \theta\)` is not a random variable in the likelihood function because it is not governed by a probability distribution -- `$$\Large \underbrace{[\theta|y]}_{posterior\; distribution} = \frac{\overbrace{[y|\theta]}^{likelihood} \;\;\;\;\overbrace{[\theta]}^{prior}}{\underbrace{[y]}_{marginal\; distribution}}$$` In eq. 1, the **prior** distribution is what allows us to treat `\(\large \theta\)` as a random variable -- + The prior describes what we know about the probability of `\(\large \theta\)` *before* we collect any data -- + Priors can contain a lot of information about `\(\large \theta\)` (*informative priors*) or very little (*uninformative priors*) -- + Well-constructed priors can also improve the behavior of our models --- ## The prior distribution #### The prior distribution provides us with a principled method of incorporating information about `\(\large \theta\)` into our analysis -- - This information could be results from a pilot study or results from previously published studies - In some cases, the prior could simply reflect our own domain expertise -- In this way, priors allow us to weigh conclusions drawn from our data against what we already know about our system `\(^9\)` -- In the words of Marc Kéry: >I find it hard not to be impressed by the application of Bayes rule to statistical inference since it so perfectly mimics the way of how we learn in everyday life ! In our guts, we always weigh any observation we make, or new information we get, with what we know to be the case or believe to know. ??? `\(^9\)` This is a profound property because it is consistent with both the way that science advances (the accumulation of evidence in support of specific hypotheses) as well as how we naturally learn about the world around us --- ## The prior distribution #### **Example** #### Suppose I tell you that on my way to class, I saw a 6-foot tall man -- - You would find this statement both believable and boring because a 6-ft tall man is consistent with what you know about the distribution of human heights -- #### If I said I saw a 7-ft tall man, you might find this more noteworthy but believable (because your prior tells you this a possible, though rare, occurrence) -- #### If I tell you I saw an 8-fit tall man, you'll question my credibility and require additional evidence because you know it is extremely implausible for someone to be this tall --- ## The prior distribution #### **Example** In our example of tree counts, we need to define a prior for `\(\lambda\)`, the average number of trees per plot To start, we know that `\(\lambda\)` has to be a positive real number (though not necessarily an integer) -- + The gamma distribution allows for positive real values -- In our discussion of likelihood functions, we assumed we know that `\(\large \lambda = 40\)`. Let's relax that assumption a bit + previous research has shown that the mean number of trees per plot is 40, with a variance of 6 --- ## The prior distribution #### **Example** We can use moment matching to turn `\(\large \mu = 40\)` and `\(\large \sigma^2 = 6\)` into the two parameters that govern the gamma distribution: `$$\Large \alpha = \frac{\mu^2}{\sigma^2}$$` `$$\Large \beta = \frac{\mu}{\sigma^2}$$` which in our example gives `\(\alpha=\)` 267 and `\(\beta =\)` 7 --- ## The prior distribution #### **Example** #### Now plot that prior alongside our previously define likelihood profile: ```r prior <- data.frame(lambda = seq(from = 15, to = 65, by = 0.25), pr_lambda = dgamma(seq(from = 15, to = 65, by = 0.25), 40^2/6, 40/6)) (prior_lik <- lik_profile + geom_path(data = prior, aes(x = lambda, y = pr_lambda), linetype = "longdash")) ``` --- ## The prior distribution **Example** <img src="Lecture3_files/figure-html/unnamed-chunk-9-1.png" width="576" style="display: block; margin: auto;" /> --- class: inverse, center, middle # The joint distribution --- ## The joint distribution The product of the likelihood `\(\large [y|\theta]\)` and the prior `\(\large [\theta]\)` (the numerator of Bayes theorem) is called the **joint distribution** It is important to note again that the joint distribution, like the likelihood profile, is not a probability distribution because the area under the curve does not sum to 1 ```r joint <- data.frame(lambda = seq(from = 15, to = 65, by = 0.25), jnt_dist = dgamma(seq(from = 15, to = 65, by = 0.25), 40^2/6, 40/6) * dpois(42, seq(from = 15, to = 65, by = 0.25))) (prior_lik_joint <- prior_lik + geom_path(data = joint, aes(x = lambda, y = jnt_dist), linetype = "dotted")) ``` --- ## The joint distribution <img src="Lecture3_files/figure-html/unnamed-chunk-11-1.png" width="576" style="display: block; margin: auto;" /> --- class: inverse, center, middle # The marginal distribution --- ## The marginal distribution To convert the joint distribution into a true probability distribution, we have to divide it by the total area under the joint distribution curve <br/> -- The denominator of eq. 1 `\(\large ([y])\)` is called the marginal distribution of the data - that is, the probability distribution of our data `\(\large y\)` across all possible values of `\(\large \theta\)` <br/> -- Remember from our previous lecture that: `$$\Large [y] = \int [y|\theta][\theta]d\theta$$` <img src="Lecture3_files/figure-html/unnamed-chunk-12-1.png" width="360" style="display: block; margin: auto;" /> --- ## The marginal distribution #### For some simple models, `\(\large [y]\)` can be estimated analytically <br/> -- #### But in many cases, particularly in models with a moderate to large number of parameters, this is very hard to do `\(^{10}\)` <br/> -- #### For most of the models you will need to fit as an ecologist, estimating the marginal distribution of the data is one of a major challenges of Bayesian inference `\(^{11}\)` ??? `\(^{10}\)` In fact, the difficulty of computing the marginal distribution is one of the reasons that it look a long time for Bayesian methods to be applied in practice (Bayes theorem was proven in 1763, long before the frequentist methods you are used to using were invented) `\(^{11}\)` We will return to this topic later --- class: inverse, center, middle # The posterior distribution --- ## The posterior distribution The LHS of equation, `\(\large [\theta|y]\)`, is known as the posterior distribution and it is what we want to know > What is the probability distribution of `\(\large \theta\)` given our data? <br/> -- The posterior distribution tells us everything we know about `\(\large \theta\)` given our data (and possibly prior knowledge) <img src="Lecture3_files/figure-html/unnamed-chunk-13-1.png" width="360" style="display: block; margin: auto;" /> --- ## The posterior distribution The posterior allows to make statements like: - The most probable value of `\(\large \theta\)` is `\(\large x\)` - There is a 95% probability that `\(\large a < \theta < b\)` - There is a 98% probability that `\(\large \theta <0\)` --- ## The posterior distribution #### It's important to realize that before we collect data, `\(\large y\)` is a random variable governed by the marginal distribution -- #### However, *after* we collect data, it is no longer random and `\(\large [y]\)` becomes a *fixed* quantity -- #### That means that: `$$\LARGE [\theta|y] \propto [y|\theta][\theta]$$` --- ## The posterior distribution #### In other words, because the denominator is a constant, the posterior distribution is *proportional* to the joint distribution -- #### This proportionality is important because it's easy to estimate the joint distribution (unlike the marginal distribution) -- #### This means we can *learn* about the shape of the posterior distribution from the joint distribution even if we can't compute `\(\large [y]\)` -- #### This is a central concept for applying modern tools for Bayesian analysis and one we will make use of shortly. --- ## More about the posterior distribution #### One of the cool things about Bayesian methods is that we don't get a point estimate of `\(\large \theta\)`, we get an entire probability distribution! `\(^{12}\)` -- #### These advantages will be come clear as we move towards applications of these methods but as a quick example, let's say we are estimating the abundance of two populations `\(\large (N_1\)` and `\(\large N_2)\)` -- #### We want to determine whether `\(\large N_1 > N_2\)` ??? `\(^{12}\)` Although this may seem like a minor point right now, it has some really practical advantages, namely that we can easily quantify uncertainty in our parameter estimates and we can summarize the distribution is whatever way we want (mean, median, mode, 95% quantiles, 50% quantiles, etc.). --- ## More about the posterior distribution #### In a frequentist framework, it relatively straightforward to get point estimates of `\(\large N_1\)` and `\(\large N_2\)` -- #### Saying that `\(\large N_1 > N_2\)` is the same as saying `\(\large N_1 - N_2 > 0\)` - to answer our question we could derive a new parameter `\(\Delta_N = N_1 - N_2\)` and test whether `\(\Delta_N > 0\)` -- #### Answering the question of whether `\(\large \Delta_N > 0\)` requires knowing not only the magnitude of this difference but also how certain we are in the value - How do we estimate the uncertainty of our new derived variable? -- #### That's not always straightforward in a frequentist world and will require application of the [delta method](https://en.wikipedia.org/wiki/Delta_method). --- ## More about the posterior distribution #### In a Bayesian world, we can actually estimate the entire posterior distribution of `\(\large \Delta_N\)`! -- #### All of the uncertainty in `\(\large N_1\)` and `\(\large N_2\)` will propagate into our uncertainty about `\(\large \Delta_N\)` -- #### It is trivially easy to estimate confidence intervals or specific probabilities for `\(\large \Delta_N\)` (e.g., `\(\large Pr(\Delta_N > 0)\)`) -- #### If nothing else turns you into a Bayesian, it's probably this point.