class: center, middle, inverse, title-slide .title[ # lecture 6: more about prior distributions ] .subtitle[ ## FANR 8370 ] .author[ ###
Spring 2025 ]

---
# readings

> Hobbs & Hooten 90-105

---
class: inverse, center, middle

> As scientists, we should always prefer to use appropriate, well-constructed, informative priors - Hobbs & Hooten

---
# more about prior distributions

Selecting priors is one of the more confusing and often contentious topics in Bayesian analyses

--

It is often confusing because there is no direct counterpart in traditional frequentist statistical training

--

It is controversial because modifying the prior will potentially modify the posterior

--

- Many scientists are uneasy with the idea that you have to *choose* the prior

--

- This imparts some subjectivity into the modeling workflow, which conflicts with the idea that we should only base our conclusions on what our data tell us

--

But this view is both philosophically counter to the scientific method and ignores the many benefits of using priors that contain some information about the parameter(s) `\(\large \theta\)`

---
# a note on this material

Best practices for selecting priors are an area of active research in the statistical literature, and advice in the ecological literature is changing rapidly

As a result, the following sections may be out-of-date in short order

Nonetheless, understanding how and why to construct priors will greatly benefit your analyses, so we need to spend some time on this topic

---
class: inverse, center, middle

# non-informative priors

---
# non-informative priors

In much of the ecological literature, the standard method of selecting priors is to choose priors that have minimal impact on the posterior distribution, so-called **non-informative priors** <br/>

--

Using non-informative priors is intuitively appealing because:

--

- they let the data "do the talking" <br/>

--

- they return posteriors that are generally consistent with frequentist-based analyses

???

at least for simple models. This gives folks who are uneasy with the idea of using priors some comfort

---
# non-informative priors

Non-informative priors generally try to be agnostic about the prior probability of `\(\large \theta\)`

--

For example, if `\(\theta\)` is a probability, `\(uniform(0,1)\)` gives equal prior probability to all possible values:

<img src="06_priors_files/figure-html/unnamed-chunk-1-1.png" width="576" style="display: block; margin: auto;" />

???

The `\(uniform(0,1)\)` is a special case of the beta distribution with `\(\alpha = \beta = 1\)`

---
# non-informative priors

For a parameter that could be any real number, a common choice is a normal prior with very large standard deviation:

<img src="06_priors_files/figure-html/unnamed-chunk-2-1.png" width="576" style="display: block; margin: auto;" />

---
# non-informative priors

Over realistic values of `\(\theta\)`, this distribution appears relatively flat:

<img src="06_priors_files/figure-html/unnamed-chunk-3-1.png" width="576" style="display: block; margin: auto;" />

---
# non-informative priors

Often ecological models have parameters that can only take real values `\(>0\)` (variance or standard deviation, for example)

In these cases, uniform priors from 0 to some large number (e.g., 100) or very diffuse gamma priors, e.g. `\(gamma(0.01, 0.01)\)`, are often used

<img src="06_priors_files/figure-html/unnamed-chunk-4-1.png" width="576" style="display: block; margin: auto;" />

???

This distribution may not *look* vague at first glance because most of the mass is near 0. But because the tail is very long, `\(gamma(0.01, 0.01)\)` actually tells us very little about `\(\theta\)`. When combined with data, the posterior distribution will be almost completely driven by the data. We will see why shortly when we discuss conjugate priors
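
---
# non-informative priors

It is easy to look at these "non-informative" priors yourself. The code below is a minimal sketch (it is not the code used to generate the figures in these slides, and the plotting ranges and prior values are just illustrative choices):

```r
# flat prior for a probability: uniform(0, 1) = beta(1, 1)
curve(dbeta(x, 1, 1), from = 0, to = 1,
      xlab = expression(theta), ylab = "Density")

# "flat" prior for a real-valued parameter: normal with a very large sd
curve(dnorm(x, mean = 0, sd = 100), from = -50, to = 50,
      xlab = expression(theta), ylab = "Density")

# diffuse gamma prior for a positive parameter
curve(dgamma(x, shape = 0.01, rate = 0.01), from = 0.01, to = 100,
      xlab = expression(theta), ylab = "Density")

# most of the gamma(0.01, 0.01) mass is near 0, but the quantiles show
# how enormously spread out (and therefore vague) this prior really is
qgamma(c(0.1, 0.5, 0.9, 0.99), shape = 0.01, rate = 0.01)
```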
---
# non-informative priors

Non-informative priors are appealing because they appear to take the subjectivity out of our analysis - they "let the data do the talking"

--

However, non-informative priors are often not the best choice for practical and philosophical reasons

???

technically they are only "objective" if we all come up with the *same* vague priors

---
# practical issues with non-informative priors

The prior always has *some* influence on the posterior and in some cases can actually have quite a bit of influence <br/>

--

For example, let's say we track 10 individuals over a period of time using GPS collars and we want to know the probability that an individual survives from time `\(t\)` to time `\(t+1\)`

- assume 8 individuals died during our study (so two 1's and eight 0's) <br/>

--

A natural model for these data is

`$$\LARGE binomial(p, N = 10)$$`

--

Because `\(p\)` is a parameter, it needs a prior:

- `\(beta(1,1)\)` gives equal prior probability to all values of `\(p\)` between 0 and 1

???

Our data are a series of `\(0\)`'s (dead) and `\(1\)`'s (alive)

---
# conjugate priors

It turns out that when we have a binomial likelihood and a beta prior, the posterior distribution will also be a beta distribution <br/>

--

This scenario, in which the prior and the posterior have the same distributional form, occurs when the likelihood and prior are **conjugate** distributions.

--

<table>
 <caption>A few conjugate distributions</caption>
 <thead>
  <tr>
   <th style="text-align:center;"> Likelihood </th>
   <th style="text-align:center;"> Prior </th>
   <th style="text-align:center;"> Posterior </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:center;"> \(y \sim binomial(n, p)\) </td>
   <td style="text-align:center;"> \(p \sim beta(\alpha, \beta)\) </td>
   <td style="text-align:center;"> \(p \sim beta(y + \alpha, n - y + \beta)\) </td>
  </tr>
  <tr>
   <td style="text-align:center;"> \(y_i \sim Bernoulli(p)\) </td>
   <td style="text-align:center;"> \(p \sim beta(\alpha, \beta)\) </td>
   <td style="text-align:center;"> \(p \sim beta(\sum_{i=1}^n y_i + \alpha, \sum_{i=1}^n (1-y_i) + \beta)\) </td>
  </tr>
  <tr>
   <td style="text-align:center;"> \(y_i \sim Poisson(\lambda)\) </td>
   <td style="text-align:center;"> \(\lambda \sim gamma(\alpha, \beta)\) </td>
   <td style="text-align:center;"> \(\lambda \sim gamma(\alpha + \sum_{i=1}^n y_i, \beta + n)\) </td>
  </tr>
 </tbody>
</table>

---
# conjugate priors

Conjugate distributions are useful because we can compute the posterior distribution algebraically

--

In our case, the posterior distribution for `\(p\)` is:

`$$\Large beta(2 + 1, 8 + 1) = beta(3, 9)$$`

<img src="06_priors_files/figure-html/unnamed-chunk-6-1.png" width="576" style="display: block; margin: auto;" />

---
# practical issues with non-informative priors

No prior is truly non-informative

--

As we collect more data, this point will matter less and less

- if we tracked 100 individuals and 20 lived, our posterior would be `\(beta(21, 81)\)`

--

With a sample size that big, we could use a pretty informative prior (e.g., `\(\large beta(4,1)\)`) and it doesn't matter too much:

<img src="06_priors_files/figure-html/unnamed-chunk-7-1.png" width="1008" style="display: block; margin: auto;" />
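
---
# conjugate priors

Because the beta-binomial update has a closed form, you can check these results yourself. The sketch below uses the prior and data values from the survival example on the previous slides (the plotting choices are just illustrative):

```r
# beta(1, 1) prior updated with 2 survivors out of 10 individuals
alpha <- 1; beta <- 1
y <- 2; N <- 10

curve(dbeta(x, alpha + y, beta + N - y), from = 0, to = 1,
      xlab = "p", ylab = "Density", lwd = 2)              # posterior: beta(3, 9)
curve(dbeta(x, alpha, beta), add = TRUE, lty = 2)          # prior: beta(1, 1)

# with more data (20 survivors out of 100), the choice of prior matters much less
y2 <- 20; N2 <- 100
curve(dbeta(x, 1 + y2, 1 + N2 - y2), from = 0, to = 1,
      xlab = "p", ylab = "Density", lwd = 2)               # posterior under beta(1, 1)
curve(dbeta(x, 4 + y2, 1 + N2 - y2), add = TRUE, lty = 2)  # posterior under beta(4, 1)
```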
---
# practical issues with non-informative priors

Another potential issue with non-informative priors is that they are not always non-informative if we have to transform them <br/>

--

For example, let's say instead of modeling `\(p\)` on the probability scale (0-1), we model it on the logit scale:

`$$\Large logit(p) \sim Normal(0, \sigma^2)$$`
<br/>

--

The logit scale transforms the probability to a real number:

<img src="06_priors_files/figure-html/unnamed-chunk-8-1.png" width="288" style="display: block; margin: auto;" />

???

The logit transformation is useful for modeling things like probabilities as a function of covariates (which we will learn more about soon)

---
# practical issues with non-informative priors

We saw that for real numbers, a "flat" normal prior is often chosen as a non-informative prior

But what happens when we transform those values back into probabilities?

<img src="06_priors_files/figure-html/unnamed-chunk-9-1.png" width="576" style="display: block; margin: auto;" />
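
---
# practical issues with non-informative priors

You can see this for yourself by simulating from the prior and back-transforming the draws. The sketch below assumes a normal prior with mean 0 and standard deviation 10 on the logit scale (the exact standard deviation is an illustrative choice):

```r
# draw values of logit(p) from a diffuse normal prior
logit_p <- rnorm(10000, mean = 0, sd = 10)

# back-transform to the probability scale with the inverse-logit
p <- plogis(logit_p)

# far from flat: most of the implied prior mass piles up near 0 and 1
hist(p, breaks = 50, xlab = "p", main = "Implied prior on the probability scale")
```

This is one reason that modestly sized standard deviations are often recommended in practice for priors on the logit scale.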
---
# philosophical issues with non-informative priors

Aside from the practical issues, there are several philosophical issues regarding the use of non-informative priors

--

Many of the justifications for using non-informative priors stem from the idea that we should aim to *reduce* the influence of the prior so our data "do the talking" <br/>

--

But even MLE methods are subjective - so avoiding Bayesian methods or choosing non-informative priors doesn't make our analyses more "objective" <br/>

--

Worse, MLE methods often hide these assumptions from view, making our subjective decisions implicit rather than explicit

???

By choosing the likelihood distribution, we are making assumptions about the data generating process (choose a different distribution, get a different answer)

---
# philosophical issues with non-informative priors

Science advances through the accumulation of facts

We are trained as scientists to be skeptical about results that are at odds with previous information about a system

--

If we find an unusual pattern in our data, we would be (and should be) very skeptical about these results

What's more plausible: that these data really do contradict what we know about the system, or that they were generated by chance events related to our finite sample size?

--

Bayesian analysis is powerful because it provides a formal mathematical framework for combining previous knowledge with newly collected data

???

In this way, Bayesian methods are a mathematical representation of the scientific method

---
# philosophical issues with non-informative priors

Why shouldn't we use prior knowledge to improve our inferences?

> "Ignoring prior information you have is like selectively throwing away data before an analysis" (Hobbs & Hooten)

---
# philosophical issues with non-informative priors

Another way to think about the prior is as a *hypothesis* about our system

--

This hypothesis could be based on previously collected data or our knowledge of the system

--

- If we know a lot, this prior could contain a lot of information

- If we don't know much, this prior could contain less information

<img src="06_priors_files/figure-html/unnamed-chunk-10-1.png" width="720" style="display: block; margin: auto;" />

???

for example, we're pretty sure the probability of getting heads is 50%

we might not have great estimates of survival for our study organism, but we know based on ecological theory and observed changes in abundance that it's probably not *really* low

---
# philosophical issues with non-informative priors

We allow our data to *update* this hypothesis

--

- Strong priors will require larger sample sizes and more informative data to change our beliefs about the system

--

- Less informative priors will require less data

--

This is how it should be - if we're really confident in our knowledge of the system, we require very strong evidence to convince us otherwise

---
# advantages of informative priors

So there are clearly philosophical advantages to using informative priors

But there are also a number of practical advantages:

--

1) Improved parameter estimates when sample sizes are small

- Making inferences from small sample sizes is exactly when we expect spurious results and poorly estimated parameters

- Priors are most influential when we have few data

- The additional information about the parameters can reduce the chances that we base our conclusions on spurious results

- Informative priors can also reduce uncertainty in our parameter estimates and improve estimates of poorly identified parameters

???

Thinking about the beta distribution as previous coin flips, using an informative prior has the same effect as increasing sample size

---
# advantages of informative priors

1) Improved parameter estimates when sample sizes are small <br/>

2) Stabilizing computational algorithms

- As models get more complex, the ratio of parameters to data grows

- As a result, we often end up with pathological problems related to parameter *identifiability*

- Informative priors can improve estimation by providing some structure to the parameter space explored by the fitting algorithm

???

Basically, even with large sample sizes, the model cannot estimate all the parameters in the model

---
# advantages of informative priors

1) Improved parameter estimates when sample sizes are small <br/>

2) Stabilizing computational algorithms

**Example**

If `\(y \sim Binomial(N, p)\)` and the only data we have is `\(y\)` (we don't know `\(N\)`), we cannot estimate `\(N\)` and `\(p\)` no matter how much data we have

Including information about either `\(N\)` or `\(p\)` via an informative prior can help estimate the other parameter

???

there are essentially unlimited combinations of `\(N\)` and `\(p\)` that are consistent with our data

we'll see this issue come up again when we discuss estimating abundance from count data
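
---
# advantages of informative priors

A small simulation can make this identifiability problem concrete. The values below (20 simulated counts with true `\(N = 50\)` and `\(p = 0.4\)`) are purely illustrative:

```r
set.seed(123)
y <- rbinom(20, size = 50, prob = 0.4)   # simulated counts; truth is N = 50, p = 0.4

# very different (N, p) combinations with the same product N * p
# receive nearly identical support from the data
sum(dbinom(y, size = 50,  prob = 0.4, log = TRUE))
sum(dbinom(y, size = 100, prob = 0.2, log = TRUE))
sum(dbinom(y, size = 200, prob = 0.1, log = TRUE))

# an informative prior on either N or p is what lets us separate them
```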
---
# advice on priors

--

1) Use non-informative priors as a starting point

> It's fine to use non-informative priors as you develop your model, but you should always prefer to use "appropriate, well-constructed informative priors" (Hobbs & Hooten)

---
# advice on priors

1) Use non-informative priors as a starting point

2) Think hard

> Non-informative priors are easy to use because they are the default option. You can usually do better than non-informative priors, but it requires thinking hard about the parameters in your model

---
# advice on priors

1) Use non-informative priors as a starting point

2) Think hard

3) Use your "domain knowledge"

> We can often come up with weakly informative priors just by knowing something about the range of plausible values of our parameters

---
# advice on priors

1) Use non-informative priors as a starting point

2) Think hard

3) Use your "domain knowledge"

4) Dive into the literature

> Find published estimates and use moment matching and other methods to convert published estimates into prior distributions

---
# advice on priors

1) Use non-informative priors as a starting point

2) Think hard

3) Use your "domain knowledge"

4) Dive into the literature

5) Visualize your prior distribution

> Be sure to look at the prior in terms of the parameters you want to make inferences about (use simulation!)

---
# advice on priors

1) Use non-informative priors as a starting point

2) Think hard

3) Use your "domain knowledge"

4) Dive into the literature

5) Visualize your prior distribution

6) Do a sensitivity analysis

> Does changing the prior change your posterior inference? If not, don't sweat it. If it does, you'll need to return to point 2 and justify your prior choice
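
---
# advice on priors

As an example of points 4 and 5, suppose a published study reported adult survival of 0.8 with a standard error of 0.05 (hypothetical numbers). Moment matching converts that estimate into a beta prior, which we can then plot to check that it looks reasonable:

```r
# hypothetical published estimate: mean survival 0.8, SE 0.05
mu    <- 0.8
sigma <- 0.05

# moment matching for the beta distribution
# (requires sigma^2 < mu * (1 - mu))
alpha <- mu * (mu * (1 - mu) / sigma^2 - 1)
beta  <- (1 - mu) * (mu * (1 - mu) / sigma^2 - 1)
c(alpha = alpha, beta = beta)   # roughly beta(50.4, 12.6)

# always visualize the prior on the scale you care about
curve(dbeta(x, alpha, beta), from = 0, to = 1,
      xlab = "Survival probability", ylab = "Density")
```

The same idea works for other distributions, e.g., matching a published mean and variance to a gamma or lognormal prior.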