class: center, middle, inverse, title-slide .title[ # lecture 2: probability refresher (or introduction) ] .subtitle[ ## FANR 8370 ] .author[ ###
Spring 2025 ] --- # readings: > Hobbs & Hooten 29-70 --- class: inverse, center, middle **Warning:** The material presented in this lecture is tedious. But the concepts in this lecture are critical to everything that will follow in this course. So push through and try your best to understand these topics. You do not need to be an expert in probability at the end of this lecture - we will reinforce these concepts over and over again throughout the semester - but getting the gist now will help you grasp other topics as we move forward --- # stochasticity and uncertainty in ecological models In each level of our models, we differentiate between: + a *deterministic* model `\(\large g()\)`, and + a *stochastic* model `\(\large [a|b,c]\)` <br/> -- The deterministic portion of the model contains no uncertainty -- *Stochastic* processes are different: + given an input, the model will not always return the same answer + the outputs of stochastic processes are *uncertain* + Even though stochastic processes are inherently uncertain, they are not unpredictable. ??? Given a certain input, the deterministic model will always return the same answer and thus there is no uncertainty in the outcome. However, there often is uncertainty about the model itself, but that is a different topic --- # stochasticity and uncertainty in ecological models In Bayesian models, all unobserved quantities are treated as **random variables**, that is, they can take on different values due to chance (i.e., they are stochastic) Each random variable in our model is governed by a **probability distribution** -- Our goal is to *use our data to learn about those distributions* ??? In frequentist statistics, in contrast, unknown parameters have a fixed value and the goal of inference is to estimate that value (and the corresponding uncertainty) --- class: inverse, middle, center # probability distributions --- # probability distributions In Bayesian statistics, unknown parameters follow distributions - The goal of inference is to estimate the parameters that govern the shape of that distribution - Bayesian methods require you to think in terms of probability distributions -- Using and understanding Bayesian methods therefore requires understanding probability distributions, including: -- - developing a "toolbox" of distributions and knowing when to use each -- - understanding and calculating probability density functions -- - linking the parameters of each distribution to its moments --- # probability distributions Random variables are numerical values that are governed by chance events (e.g., a coin flip) -- Although the exact value a random variable will take is unknown before the event, we usually know some properties, including the possible values it could take (e.g., heads or tails) -- Probability distributions describe the expected frequency of different values that a random variable can take -- - There are a lot of [probability distributions](https://www.acsu.buffalo.edu/~adamcunn/probability/probability.html) <img src="fig/distributions.png" width="450" style="display: block; margin: auto;" /> --- # probability distributions Initially, it may seem mysterious how to decide which probability distribution should be used to describe a given random variable -- However, as ecologists, there are a relatively small number of common probability distributions that we encounter and use regularly -- The choice of which distribution to use usually comes down to a few characteristics of the random variable we are trying to model: - Discrete vs.
continuous - *Support*, i.e., possible values with non-zero probability --- # discrete vs. continuous distributions -- **Continuous** random variables can take on an infinite number of values on a specific interval - normal `\((-\infty\)` to `\(\infty)\)` - gamma (0 to `\(\infty)\)` - beta (0 to 1) - uniform (? to ?) -- **Discrete** random variables are those that take on distinct values, usually integers - Poisson (integers `\(\geq 0)\)` - Bernoulli (0 or 1) - Binomial - Multinomial ??? We usually encounter continuous variables in the form of regression coefficients (slopes and intercepts), measurements (mass, lengths, etc.), and probabilities. We usually encounter discrete variables in the form of counts (the number of individuals can only be a non-negative integer, you can't have 4.234 individuals) or categories (alive vs. dead, present in location A vs. B vs. C) --- class: inverse # parameters Technically, there is not a single fixed curve for each probability distribution <img src="02_probability_files/figure-html/unnamed-chunk-2-1.png" width="432" style="display: block; margin: auto;" /> -- Instead, the shape of each probability distribution is governed by the values of its *parameters* -- Knowing the parameters of each distribution, and how they relate to its shape, is critical to implementing Bayesian models - This is a topic we will return to throughout the semester --- # probability functions Very often we want to know the probability that a random variable will take a specific value `\(\large z\)` -- Answering this question requires the use of probability functions, which we will denote `\(\large [z]\)` -- Probability functions differ between *continuous* and *discrete* distributions, so we will discuss these separately ??? Probability functions tell us `\([z]=Pr(z)\)` --- # probability mass functions For *discrete* random variables, the probability that the variable will take a specific value `\(\large z\)` is defined by the *probability mass function* (pmf) All pmf's share two properties: `$$\Large 0 \leq [z] \leq 1$$` `$$\Large \sum_{z \in S}[z]=1$$` where `\(\large S\)` is the set of all `\(\large z\)` for which `\(\large [z] > 0\)` (i.e., the support). --- # probability mass functions As an example, let's assume a random variable that follows a Poisson distribution - Poisson random variables can take any integer value `\(\large \geq 0\)` `\(\large (0, 1, 2,...)\)` - e.g., the number of individuals at a site or the number of seeds produced by a flower -- The shape of the Poisson distribution is determined by 1 parameter called `\(\large \lambda\)` - `\(\large \lambda\)` is the expected value (i.e., the mean) of a random variable generated from the Poisson distribution - larger `\(\large \lambda\)` means larger values of the variable --- # probability mass functions If `\(\large \lambda = 10\)`, what is the probability that `\(\large z\)` will equal 10? Or 8? Or 15? -- The pmf of the Poisson distribution is `\([z|\lambda] = \frac{e^{-\lambda} \lambda ^ z}{z!}\)` -- In R, probability mass is calculated using the `dpois()` function (or the equivalent for other discrete distributions) - it takes two arguments: the value we are interested in estimating the probability of `\((z)\)` and the expected value of our distribution `\((\lambda)\)` - `dpois(x = seq(0,25), lambda = 10)` <img src="02_probability_files/figure-html/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" /> ???
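To answer the questions on the slide directly (a quick check; the probabilities in the comments are approximate):

```r
# Probability mass for specific values of z when lambda = 10
dpois(x = 10, lambda = 10) # ~0.125
dpois(x = 8, lambda = 10)  # ~0.113
dpois(x = 15, lambda = 10) # ~0.035
```
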
R will let us put in a vector of values so we can also do the following to estimate the probability of all values from 0 to 25: `dpois(x = seq(0, 25), lambda = 10)` --- # probability density functions Probability mass functions provide the probability that a discrete random variable takes on a specific value `\(\large z\)`. For continuous variables, estimating probabilities is a little trickier because `\(\large Pr(z) = 0\)` for any specific value `\(\large z\)`. Why? Let's look at the probability distribution for a normal random variable with mean `\(\large =0\)` and standard deviation `\(\large =3\)`: <img src="02_probability_files/figure-html/unnamed-chunk-4-1.png" width="576" style="display: block; margin: auto;" /> --- # probability density functions The probability that `\(\large z\)` falls in an interval between `\(\large a\)` and `\(\large b\)` is the area under the curve over that interval, which has width `\(\large \Delta_z =(b-a)\)`. For example, the shaded area below shows the probability `\(\large Pr(-2 \leq z \leq -1)\)`: <img src="02_probability_files/figure-html/unnamed-chunk-5-1.png" width="576" style="display: block; margin: auto;" /> --- # probability density functions This area can be approximated by multiplying the width times the (average) height of the rectangle: `$$\Large Pr(a \leq z \leq b) \approx \Delta_z [(a + b)/2]$$` By making the interval `\(\Delta_z = b-a\)` smaller and smaller, we zero in on the probability at a single value `\(z\)`: <img src="02_probability_files/figure-html/unnamed-chunk-6-1.png" width="576" style="display: block; margin: auto;" /> --- # probability density functions At a single value `\(\large z\)`, `\(\large \Delta_z =0\)`, thus `\(\large Pr(z) = 0\)` -- However, we can use calculus to find the height of the curve `\(\large ([z])\)` as `\(\large \Delta_z\)` approaches 0 -- So for continuous random variables, the *probability density* of `\(\large z\)` is the probability per unit of `\(\large z\)`, i.e., `\(\large Pr(a \leq z \leq b)/\Delta_z\)` as `\(\large \Delta_z\)` approaches zero -- Estimating probability density in R is the same as for discrete variables: `dnorm()` ??? Now you know why the function starts with `d`! --- class: inverse, center, middle # moments --- # moments Every probability distribution we will use in the course can be described by its *moments* - `\(1^{st}\)` moment is the expected value (i.e., mean) - `\(2^{nd}\)` moment is the variance <img src="02_probability_files/figure-html/unnamed-chunk-7-1.png" width="432" style="display: block; margin: auto;" /> -- **Note**: moments are related to parameters but (in most cases) they are not the same!
- the shape of a distribution, and therefore its moments, is determined by the parameter values --- # expected value (i.e., the mean) The first moment of a distribution describes its central tendency (denoted `\(\large \mu\)`) or *expected value* (denoted `\(\large E(z)\)`) -- This is the value of `\(\large z\)` we expect on average (not necessarily the single most probable value) -- Think of this as a weighted average - the mean of all possible values of `\(z\)` weighted by the probability mass or density of each value `\(\large [z]\)` -- For discrete variables, the first moment can be calculated as `$$\Large \mu = E(z) = \sum_{z \in S} z[z]$$` -- For continuous variables, we need to use an integral instead of a sum: `$$\Large \mu = E(z) = \int_{-\infty}^{\infty} z[z]dz$$` --- # variance The second moment of a distribution describes the *variance* - that is, the spread of the distribution around its mean - on average, how far a random value drawn from the distribution falls from the mean of the distribution -- For discrete variables, variance can be estimated as the weighted average of the squared difference (squared to prevent negative values) between each value `\(\large z\)` and the mean `\(\large \mu\)` of the distribution: `$$\Large \sigma^2 = E((z - \mu)^2) = \sum_{z \in S} (z - \mu)^2[z]$$` -- and for continuous variables: `$$\Large \sigma^2 = E((z - \mu)^2) = \int_{-\infty}^{\infty} (z - \mu)^2[z]dz$$` --- # moments Every distribution has moment functions that can be used to calculate the mean and variance, given specific parameter values -- .pull-left[ For example, the beta distribution is a continuous distribution with support `\(0 \leq x \leq 1\)` - the shape of a beta distribution is governed by two parameters, `\(\large \alpha\)` and `\(\large \beta\)` <img src="02_probability_files/figure-html/unnamed-chunk-8-1.png" width="360" style="display: block; margin: auto;" /> ] -- .pull-right[ The moment functions are: `$$\large \mu = \frac{\alpha}{\alpha + \beta}$$` `$$\large \sigma^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$` ] --- # exercise: estimating moments using *monte carlo integration* Instead of calculating moments by hand, we can estimate them by simulating a large number of values from the probability distribution itself. This approach is very easy to implement in `R` using the `r` class of functions (e.g., `rnorm()`, `rpois()`, etc.) - These functions generate a specified number of random draws (`r` for random) from a given probability distribution ??? Monte Carlo integration is a form of simulation where we draw many random samples from a probability distribution and then use those samples to learn about properties of the distribution. This is a useful approach to understand because it is very similar to how we learn about parameter distributions in Bayesian analyses --- # exercise: estimating moments using *monte carlo integration* Let's estimate the first and second moments of a beta distribution <br/> -- In `R`, we can generate and visualize a large number (e.g., 10000) of random draws from the beta distribution using the following code:

```r
n <- 10000 # Sample size
alpha <- 2
beta <- 5
samp <- rbeta(n, shape1 = alpha, shape2 = beta)
```

???
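As a quick visual check (a minimal sketch, using the objects `samp`, `alpha`, and `beta` defined in the chunk above), we can compare a histogram of the draws to the corresponding beta density:

```r
# Histogram of the Monte Carlo draws with the beta pdf overlaid
hist(samp, breaks = 50, freq = FALSE, main = "", xlab = "z")
curve(dbeta(x, shape1 = alpha, shape2 = beta), from = 0, to = 1,
      add = TRUE, lwd = 2, col = "red")
```
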
Both `\(\alpha\)` and `\(\beta\)` must be `\(>0\)` --- # exercise: estimating moments using *monte carlo integration* Now let's use these samples to estimate the first moment (the mean) and the second moment (the variance) of the distribution <br/> -- We estimate the first moment by taking the arithmetic mean of our samples `\((\frac{1}{n}{\sum_{i=1}^{n}z_i})\)` and the variance as `\((\frac{1}{n}{\sum_{i=1}^{n}(z_i-\mu)^2})\)`:

```r
mu <- sum(samp)/n              # mean of the sample
sigma2 <- sum((samp - mu)^2)/n # variance of the sample
```

--- # exercise: estimating moments using *monte carlo integration* How close are these values to the true moments? Remember: `$$\large \mu = \frac{\alpha}{\alpha + \beta}$$` `$$\large \sigma^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$` For our samples:

```r
mu # Estimated mean
```

```
## [1] 0.2862
```

```r
alpha/(alpha + beta) # True mean
```

```
## [1] 0.2857
```

??? Your answers won't exactly match the ones here, but they should be pretty close --- # exercise: estimating moments using *monte carlo integration* How close are these values to the true moments? Remember: `$$\large \mu = \frac{\alpha}{\alpha + \beta}$$` `$$\large \sigma^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$` For our distribution:

```r
sigma2 # Estimated variance
```

```
## [1] 0.02567
```

```r
(alpha * beta)/((alpha + beta) ^ 2 * (alpha + beta + 1)) # True variance
```

```
## [1] 0.02551
```

--- # exercise: estimating moments using *monte carlo integration* Try this on your own - simulate data from a Poisson distribution and see if the moments you estimate from the sample are close to the true moments. **Hint** - the Poisson distribution has a single parameter `\(\lambda\)`, which is both the mean and the variance of the distribution <br/> -- Change both `\(\large \lambda\)` and `\(\large n\)`. *Does varying these values change how well your sample estimates the moments?* ??? **Question** - in the above simulations, we use the arithmetic mean to estimate the first moment of the distribution. But in the definition of the moment, we defined the mean as the weighted average of the `\(z\)`'s. Why don't we have to take the weighted average of our sample? --- # moment matching What if you know the mean and variance of a distribution and need the parameters? -- Why would you ever need to do this? - Because parameters are often reported as a point estimate with uncertainty, but as a Bayesian, everything is a distribution! - For example, you may have an estimate (and the associated standard error) of annual survival and need to convert that into the corresponding beta distribution - As we will see, this is particularly relevant to incorporating previous research into our analyses via *priors* -- Rather than using simulation, each distribution has a set of formulas for converting between parameters and moments (called *moment matching*) ??? If this is not obvious right now, don't worry. You'll see why later in the semester as we work through examples. Of course, this does not mean you need to memorize the moment equations - that's what google is for.
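For reference, one possible sketch of the Poisson exercise from the previous slide (the choices of `lambda` and `n` below are arbitrary, and your numbers will differ slightly):

```r
# Monte Carlo check for the Poisson: both moments should be close to lambda
n <- 10000
lambda <- 4
samp <- rpois(n, lambda = lambda)
sum(samp)/n                  # estimated first moment (mean)
sum((samp - mean(samp))^2)/n # estimated second moment (variance)
```
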
--- # moment matching For the normal distribution, it is relatively easy to understand moment matching because the parameters of the distribution (mean and standard deviation) *are* the first and second moments <img src="02_probability_files/figure-html/unnamed-chunk-13-1.png" width="576" style="display: block; margin: auto;" /> --- # moment matching The normal distribution has an interesting property - you can change the first moment without changing the second moment <br/> -- This is not true of all probability distributions <br/> -- For example, consider the beta distribution we saw earlier: if we change `\(\large \alpha\)` (or `\(\large \beta\)`), we will also change the mean and variance: `$$\Large \mu = \frac{\alpha}{\alpha + \beta}$$` `$$\Large \sigma^2 = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$` --- # moment matching Moment matching allows us to flip the previous equations. For the beta distribution: -- `$$\large \alpha = \bigg(\frac{1-\mu}{\sigma^2}- \frac{1}{\mu} \bigg)\mu^2$$` `$$\large \beta = \alpha \bigg(\frac{1}{\mu}-1\bigg)$$` -- So if we know the mean of a beta distribution is 0.28 and the variance is 0.025, we can calculate `\(\large \alpha\)` and `\(\large \beta\)`:

```r
(alpha <- ((1 - 0.28) / 0.025 - (1 / 0.28)) * 0.28 ^ 2)
```

```
## [1] 1.978
```

```r
(beta <- alpha * ((1 / 0.28) - 1))
```

```
## [1] 5.086
```

??? On your own, use our simulation method to check that our estimates are correct:

```r
samp <- rbeta(n, alpha, beta)
(mu <- sum(samp)/n)
(sigma2 <- sum((samp - mu)^2)/n)
```

--- # moment matching Understanding the relationships between parameters and moments is an important skill for implementing Bayesian methods -- Don't worry about memorizing the moment matching functions - just look them up when you need them! -- But do get comfortable moving back and forth between parameters and mean/variance - We will practice this throughout the semester --- class: inverse, middle, center # rules of probability --- # rules of probability In addition to having a good toolbox of probability distributions (and knowing how to use them!), application of Bayesian methods requires a solid understanding of the rules of probability <img src="fig/SAB.png" width="500" style="display: block; margin: auto;" /> -- We won't go into depth on these topics, but a basic understanding of some general principles is important for the material that follows --- # sample space For any given random variable, the *sample space* `\(S\)` includes all of the possible values the variable can take. For example, for a single-species occupancy model, `\(S\)` would be present or absent. For a model of species abundance, `\(S\)` would be `\(\{0, 1, 2, 3,...\}\)`. <img src="fig/S.png" width="450" style="display: block; margin: auto;" /> ??? For a random process to truly be a probability, the sum of the probabilities of all possible outcomes must equal 1: `\(\sum_{i=1}^n Pr(A_i) = 1\)` (if the outcomes are continuous we have to take the integral instead of the sum).
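We can check this numerically in R (a quick illustration; the Poisson sum is truncated at a large value rather than carried to infinity):

```r
# Discrete case: Poisson probability masses sum to 1
sum(dpois(0:100, lambda = 10))

# Continuous case: the normal density integrates to 1
integrate(dnorm, lower = -Inf, upper = Inf, mean = 0, sd = 3)
```
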
--- # sample space **Example** Imagine an occupancy model in which we want to know if species `\(\Large x\)` is present at a given location. We will denote the occupancy status `\(\Large z_x\)` and the sample space includes just two possible values: `$$\huge S_{z_x}=\{0, 1\}$$` --- class: inverse, middle, center # probability of single events --- # probability of single events The probability of `\(\large A\)` is the area of `\(\large A\)` divided by the area of `\(\large S\)` `$$\Large Pr(A) = \frac{area\; of\; A}{area\; of \;S}$$` <img src="fig/SA.png" width="500" style="display: block; margin: auto;" /> ??? Within the sample space, we can define a smaller polygon `\(A\)` which represents one possible outcome. `\(A\)` is smaller than `\(S\)` because it does not contain all possible outcomes, just a subset. What is the probability that `\(A\)` does not occur? It's the area outside of `\(A\)`: `$$Pr(not \; A) = \frac{area\; of \;S - area\; of\; A}{area\; of \;S} = 1 - \frac{area\; of\; A}{area\; of \;S} = 1 - Pr(A)$$` --- # probability of single events **Example** In our example, let's say that the probability of occupancy for species `\(\large x\)` is: `$$\LARGE Pr(z_x = 1) = 0.4$$` -- This means that the probability that the site is not occupied is: -- `$$\LARGE Pr(z_x = 0) =1-0.4=0.6$$` --- class: inverse, middle, center # probability of multiple events --- # probability of multiple events Often, we are not interested in the probability of a single event happening but instead in the probability that more than one event occurs -- The *joint* probability refers to the probability that two or more events occur and is usually denoted `\(\large Pr(A,B)\)` <img src="fig/SAB.png" width="500" style="display: block; margin: auto;" /> ??? Estimating joint probabilities is more challenging than estimating the probability of single events but is critical to understanding the logic behind Bayesian methods --- # probability of multiple events **Example** To extend our simple example, let's imagine we are interested in the occupancy status of two species - `\(\large x\)` and `\(\large y\)`. Our sample space is now: `$$\Large S_{z_x,z_y} = \{(0,0), (0,1), (1,0), (1,1)\}$$` -- The question we now want to answer is: what is the probability that a site is occupied by both species? `$$\Large Pr(z_x = 1, z_y = 1) = Pr(z_x, z_y)$$` -- The answer to that question depends on the relationship between `\(\large Pr(z_x)\)` and `\(\large Pr(z_y)\)` --- class: inverse, center, middle # conditional probability --- # conditional probability In some cases, knowing the status of one random variable tells us something about the status of another random variable. Let's say we know that species `\(\large x\)` is present, that is `\(\large z_x=1\)` Knowing that `\(\large z_x=1\)` does two things: <br/> -- 1) It shrinks the possible range of the sample space (if `\(\large z_x=1\)` occurred, the remainder of our sample space (in this case `\(\large z_x=0\)`) did not occur) <br/> -- 2) It effectively shrinks the area `\(\large z_y\)` - we know that the area of `\(\large z_y\)` outside of `\(\large z_x\)` didn't occur <br/> -- You can see this very clearly in this [awesome visualization](http://setosa.io/conditional/) --- # conditional probability `\(\large Pr(z_y|z_x)\)` is the area shared by the two events divided by the area of `\(\large z_x\)` (not `\(S\)`!)
`$$\large Pr(z_y|z_x) = \frac{area\; shared\; by\; z_x\; and\; z_y}{area\; of \;z_x} = \frac{Pr(z_x\; \cap\; z_y)}{Pr(z_x)}$$` likewise, `$$\large Pr(z_x|z_y) = \frac{Pr(z_x\; \cap\; z_y)}{Pr(z_y)}$$` ??? Read `\(Pr(z_y|z_x)\)` as "the probability of `\(z_y\)` conditional on `\(z_x\)`". `\(\cap\)` means "intersection" and it is the area shared by both `\(A\)` and `\(B\)` --- # conditional probability For conditional events, the joint probability is: `$$\LARGE Pr(z_y, z_x) = Pr(z_y|z_x)Pr(z_x) = Pr(z_x|z_y)Pr(z_y)$$` --- # probability of independent events In some cases, the probability of one event occurring is *independent* of whether or not the other event occurs -- In our example, the occupancy of the two species may be totally unrelated + if they occur together, it happens by complete chance -- In this case, knowing that `\(\large z_x=1\)` gives us no new information about the probability of `\(\large z_y=1\)`. Mathematically, this means that: `$$\large Pr(z_y|z_x) = Pr(z_y)$$` and `$$\large Pr(z_x|z_y) = Pr(z_x)$$` -- Thus, `$$\large Pr(z_x,z_y) = Pr(z_x)Pr(z_y)$$` ??? For example, the probability of a coin flip being heads is not dependent on whether or not the previous flip was heads. This may be unlikely for our two species since, even if they don't interact with each other, habitat preferences alone might lead to non-independence, but we'll discuss that in more detail shortly --- # disjoint events A special case of conditional probability occurs when events are *disjoint* <br/> -- In our example, maybe species `\(\large x\)` and species `\(\large y\)` **never** occur together <br/> -- In this case, knowing that `\(\large z_x = 1\)` means that `\(\large z_y = 0\)`. In other words, `$$\Large Pr(z_y|z_x) = Pr(z_x|z_y) = 0$$` <img src="fig/S_disjoint.png" width="300" style="display: block; margin: auto;" /> ??? Perhaps they are such fierce competitors that they will exclude each other from their territories --- # probability of one event or the other In some cases, we might want to know the probability that one event *or* the other occurs -- For example, what is the probability that species `\(\large x\)` or species `\(\large y\)` (or both) is present? -- This is the total area covered by `\(\large z_x\)` and `\(\large z_y\)`, counting the area of overlap only once: `$$\Large Pr(z_x \cup z_y) = Pr(z_x) + Pr(z_y) - Pr(z_x,z_y)$$` --- # probability of one event or the other When `\(\large z_x\)` and `\(\large z_y\)` are independent, `$$\large Pr(z_x \cup z_y) = Pr(z_x) + Pr(z_y) - Pr(z_x)Pr(z_y)$$` -- If they are conditional (not independent), `$$Pr(z_x \cup z_y) = Pr(z_x) + Pr(z_y) - Pr(z_x|z_y)Pr(z_y) = Pr(z_x) + Pr(z_y) - Pr(z_y|z_x)Pr(z_x)$$` -- If they are disjoint, `$$\Large Pr(z_x \cup z_y) = Pr(z_x) + Pr(z_y)$$` --- # marginal probability A critical concept in Bayesian models is **marginal probability**, that is, the probability of one event happening regardless of the state of other events -- Imagine that our occupancy model includes the effect of 3 different habitats on the occupancy probability of species `\(\large x\)`, so: `$$\large Pr(z_x|H_i) = \frac{Pr(z_x \cap H_i)}{Pr(H_i)}$$` -- What is the overall probability that species `\(\large x\)` occurs regardless of habitat type? That is, `\(\large Pr(z_x)\)`?
-- In this case, we *marginalize* over the different habitat types by summing the conditional probabilities weighted by the probability of each `\(\large H_i\)`: `$$Pr(z_x) = \sum_{i=1}^3 Pr(z_x|H_i)Pr(H_i)$$` -- Think of this as a weighted average - the probability that `\(\large z_x=1\)` in each habitat type weighted by the probability that each habitat type occurs. --- # marginal probability <img src="fig/S_marginal1.png" width="600" style="display: block; margin: auto;" /> --- # marginal probability <img src="fig/S_marginal2.png" width="600" style="display: block; margin: auto;" /> --- # marginal probability <table class="table table-striped table-hover table-condensed table-responsive" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:center;"> </th> <th style="text-align:center;"> H1 </th> <th style="text-align:center;"> H2 </th> <th style="text-align:center;"> H3 </th> <th style="text-align:center;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Occupied </td> <td style="text-align:center;"> 60 </td> <td style="text-align:center;"> 10 </td> <td style="text-align:center;"> 10 </td> <td style="text-align:center;"> 80 </td> </tr> <tr> <td style="text-align:center;"> Unoccupied </td> <td style="text-align:center;"> 20 </td> <td style="text-align:center;"> 70 </td> <td style="text-align:center;"> 250 </td> <td style="text-align:center;"> 340 </td> </tr> <tr> <td style="text-align:center;"> Total </td> <td style="text-align:center;"> 80 </td> <td style="text-align:center;"> 80 </td> <td style="text-align:center;"> 260 </td> <td style="text-align:center;"> 420 </td> </tr> </tbody> </table> ??? This is the reason random or stratified sampling is so important if you want to know `\(Pr(z)\)` - if you do not sample habitats in proportion to `\(Pr(H_i)\)`, you will get biased estimates of `\(Pr(z)\)`! --- # looking ahead ### From probability to Bayes theorem The rules of conditional probability we just reviewed are the basis of *Bayes theorem* - the foundation of Bayesian inference `$$\Large [\theta|y] = \frac{[y|\theta][\theta]}{[y]}$$` -- In the next lecture, we will look at the components of Bayes theorem to understand what they mean and how they relate to statistical inference -- We will then learn more about putting these ideas into practice, including: - developing priors - calculating the posterior distribution for simple models - estimating the posterior distribution for more complex models