Hobbs & Hooten 29-70
Warning: The material presented in this lecture is tedious. But the concepts in this lecture are critical to everything that will follow in this course. So push through and try your best to understand these topics. You do not need to be an expert in probability at the end of this lecture - we will reinforce these concepts over and over again throughout the semester - but getting the gist now will help you grasp other topics as we move forward
a deterministic model g(), and
a stochastic model [a|b,c]
given an input, the model will not always return the same answer
the outputs of stochastic processes are uncertain
Even though stochastic processes are inherently uncertain, they are not unpredictable.
1 Given a certain input, the deterministic model will always return the same answer.
2 Because probability distributions form the basis of Bayesian methods, a good understanding of probability is critical to everything that will follow.
For a random process to truly be a probability, the sum of the probabilities of all possible outcomes must equal 1: $\sum_{i=1}^{n} Pr(A_i) = 1$ (if the outcomes are continuous we have to take the integral instead of the sum).
$S_{z_x} = \{0, 1\}$
$Pr(A) = \frac{\text{area of } A}{\text{area of } S}$
3 Within the sample space, we can define a smaller polygon A which represents one possible outcome
A is smaller than S because it does not contain all possible outcomes, just a subset.
What is the probability that A does not occur? It's the area outside of A:
$Pr(\text{not } A) = \frac{\text{area of } S - \text{area of } A}{\text{area of } S} = 1 - \frac{\text{area of } A}{\text{area of } S} = 1 - Pr(A)$
$Pr(z_x = 1) = 0.4$
$Pr(z_x = 0) = 1 - 0.4 = 0.6$
4 Estimating joint probabilities is more challenging than estimating the probability of single events but is critical to understanding the logic behind Bayesian methods
To extend our simple example, let's imagine we are interested in the occupancy status of two species - x and y. Our sample space is now:
$S_{z_x, z_y} = \{(0,0), (0,1), (1,0), (1,1)\}$
The question we want to know now is: what is the probability that a site is occupied by both species?
$Pr(z_x = 1, z_y = 1) = Pr(z_x, z_y)$
The answer to that question depends on the relationship between $Pr(z_x)$ and $Pr(z_y)$
Let's say we know that species x is present, that is $z_x = 1$
Knowing that $z_x = 1$ does two things:
1) It shrinks the possible range of the sample space (if $z_x = 1$ occurred, the remainder of our sample space (in this case $z_x = 0$) did not occur)
2) It effectively shrinks the area of $z_y$ - we know that the part of $z_y$ outside of $z_x$ did not occur
You can see this very clearly in this awesome visualization
$Pr(z_y|z_x)$ is the area shared by the two events divided by the area of $z_x$ (not S!) 5
$Pr(z_y|z_x) = \frac{\text{area shared by } z_x \text{ and } z_y}{\text{area of } z_x} = \frac{Pr(z_x \cap z_y)}{Pr(z_x)}$
likewise,
$Pr(z_x|z_y) = \frac{Pr(z_x \cap z_y)}{Pr(z_y)}$
5 Read Pr(zy|zx) as "the probability of zy conditional on zx"
∩ means "intersection" and it is the area shared by both A and B
$Pr(z_y, z_x) = Pr(z_y|z_x)Pr(z_x) = Pr(z_x|z_y)Pr(z_y)$
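To see this identity in action, here is a minimal simulation sketch (not from the original slides; the occupancy probabilities are made up for illustration) showing that the joint probability estimated from simulated sites matches $Pr(z_y|z_x)Pr(z_x)$:
# Hypothetical example: simulate occupancy at many sites, where species y is
# more likely to occur when species x is present
set.seed(123)
n_sites <- 100000
zx <- rbinom(n_sites, size = 1, prob = 0.4)                       # Pr(zx = 1) = 0.4
zy <- rbinom(n_sites, size = 1, prob = ifelse(zx == 1, 0.7, 0.2)) # zy depends on zx
mean(zx == 1 & zy == 1)                # estimated Pr(zx = 1, zy = 1)
mean(zy[zx == 1] == 1) * mean(zx == 1) # Pr(zy = 1 | zx = 1) x Pr(zx = 1) - should match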
In some cases, the probability of one event occurring is independent of whether or not the other event occurs 6
In our example, the occupancy of the two species may be totally unrelated 7
In this case, knowing that $z_x = 1$ gives us no new information about the probability of $z_y = 1$
Mathematically, this means that:
$Pr(z_y|z_x) = Pr(z_y)$
and
$Pr(z_x|z_y) = Pr(z_x)$
Thus,
$Pr(z_x, z_y) = Pr(z_x)Pr(z_y)$
6 For example, the probability of a coin flip being heads is not dependent on whether or not the previous flip was heads.
7 This may be unlikely since even if the species don't interact with each other, habitat preferences alone might lead to non-independence, but we'll discuss that in more detail shortly
A special case of conditional probability occurs when events are disjoint
In our example, maybe species x and species y never occur together 8
In this case, knowing that $z_x = 1$ means that $z_y = 0$. In other words,
$Pr(z_y|z_x) = Pr(z_x|z_y) = 0$
8 Perhaps they are such fierce competitors that they will exclude each other from their territories
$Pr(z_x \cup z_y) = Pr(z_x) + Pr(z_y) - Pr(z_x, z_y)$
When $z_x$ and $z_y$ are independent,
$Pr(z_x \cup z_y) = Pr(z_x) + Pr(z_y) - Pr(z_x)Pr(z_y)$
If they are conditional,
$Pr(z_x \cup z_y) = Pr(z_x) + Pr(z_y) - Pr(z_x|z_y)Pr(z_y) = Pr(z_x) + Pr(z_y) - Pr(z_y|z_x)Pr(z_x)$
If they are disjoint,
$Pr(z_x \cup z_y) = Pr(z_x) + Pr(z_y)$
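As a quick numeric illustration (not from the original slides), suppose $Pr(z_x) = 0.4$ as before and assume $Pr(z_y) = 0.5$; then the union under independence and under disjointness can be computed directly:
pr_x <- 0.4 # Pr(zx = 1), from the earlier example
pr_y <- 0.5 # Pr(zy = 1), an assumed value for illustration
pr_x + pr_y - pr_x * pr_y # union if the events are independent: 0.7
pr_x + pr_y               # union if the events are disjoint: 0.9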
A critical concept in Bayesian models is marginal probability, that is, the probability of one event happening regardless of the state of other events
Imagine that our occupancy model includes the effect of 3 different habitats on the occupancy probability of species x, so:
$Pr(z_x|H_i) = \frac{Pr(z_x \cap H_i)}{Pr(H_i)}$
What is the overall probability that species x occurs regardless of habitat type? That is, $Pr(z_x)$?
In this case, we marginalize over the different habitat types by summing the conditional probabilities weighted by the probability of each $H_i$:
$Pr(z_x) = \sum_{i=1}^{3} Pr(z_x|H_i)Pr(H_i)$
Think of this as a weighted average - the probability that $z_x = 1$ in each habitat type weighted by the probability that each habitat type occurs.
|            | H1 | H2 | H3  | Total |
|------------|----|----|-----|-------|
| Occupied   | 60 | 10 | 10  | 80    |
| Unoccupied | 20 | 70 | 250 | 340   |
| Total      | 80 | 80 | 260 | 420   |
This is the reason random or stratified sampling is so important if you want to know $Pr(z)$ - if you do not sample habitats in proportion to $Pr(H_i)$, you will get biased estimates of $Pr(z)$!
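To make the marginalization concrete, here is a small R sketch (not part of the original slides) that computes $Pr(z_x)$ from the table above and shows the biased estimate you would get by ignoring $Pr(H_i)$:
n_occ <- c(H1 = 60, H2 = 10, H3 = 10)  # occupied sites in each habitat (from the table)
n_tot <- c(H1 = 80, H2 = 80, H3 = 260) # total sites in each habitat
pr_z_given_H <- n_occ / n_tot          # Pr(zx = 1 | Hi)
pr_H <- n_tot / sum(n_tot)             # Pr(Hi)
sum(pr_z_given_H * pr_H)               # marginal Pr(zx = 1) = 80/420, about 0.19
mean(pr_z_given_H)                     # naive average that ignores Pr(Hi), about 0.30 (biased)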
Many of the models you will work with as an ecologist will contain multiple random variables
$[z, \theta_p, \theta_o, \sigma^2_p, \sigma^2_s, \sigma^2_o, u_i | y_i] \propto [y_i | d(\Theta_o, u_i), \sigma^2_o][u_i | z, \sigma^2_s][z | g(\theta_p, x), \sigma^2_p][\theta_p][\theta_o][\sigma^2_p][\sigma^2_s][\sigma^2_o]$
The rules of probability allow us to express complex joint probabilities as a series of simpler conditional probabilities
Determining the dependencies between parameters in the models is aided by Bayesian network models
Bayesian networks graphically display the dependence among random variables 9
Random variables are nodes
Arrows point from parents to children
Children nodes are on the LHS of conditioning symbols
Parent nodes are on the RHS of conditioning symbols
Nodes without a parent are expressed unconditionally
$Pr(A, B) = Pr(A|B)Pr(B)$
9 These rules extend directly from the rules of probability we already learned
We can generalize the simple model in this slide to more than two events, which we will call z1,z2,...,zn:
$Pr(z_1, z_2, ..., z_n) = Pr(z_n|z_{n-1}, ..., z_1) \cdots Pr(z_3|z_2, z_1)Pr(z_2|z_1)Pr(z_1)$
The order of conditioning (i.e., the dependencies in the graph) is determined by the biology, not the statistics
$Pr(A, B, C) = Pr(A|B, C)Pr(B|C)Pr(C)$
$Pr(A, B, C, D) = Pr(A|C)Pr(B|C)Pr(C|D)Pr(D)$
$Pr(A, B, C, D, E) = Pr(A|C)Pr(B|C)Pr(C|D, E)Pr(D)Pr(E)$
$Pr(A, B, C, D) = Pr(A|B, C, D)Pr(B|C, D)Pr(C|D)Pr(D)$
$Pr(A, B, C, D) = Pr(A|B, C, D)Pr(C|D)Pr(D)Pr(B)$
If we know that B is independent of C and D, we can simplify the conditional expressions because (for independent events):
$Pr(z_1|z_2) = Pr(z_1)$
Because all unobserved quantities are treated as random variables governed by probability distributions, using and understanding Bayesian methods requires understanding probability distributions.
As ecologists, there are a number of very common probability distributions that we encounter and use regularly 10:
normal
Poisson
binomial
gamma
10We are not going to go over the properties of each of these distributions in lecture. Instead, I will talk about specific distributions as they come up in examples.
Even though I will discuss specific distributions as they come up, I highly recommend you read the chapter of Hobbs & Hooten on probability functions to familiarize yourself with the distributions we'll use throughout the semester. If you don't have that book, just google each distribution and read the wikipedia page.
Continuous random variables can take on an infinite number of values on a specific interval 11
Normal ($-\infty$ to $\infty$)
Gamma (0 to $\infty$)
Beta (0 to 1)
Uniform (? to ?)
Discrete random variables are those that take on distinct values, usually integers 12
Poisson (integers ≥0)
Bernoulli (0 or 1)
Binomial
Multinomial
11 We usually encounter continuous variables in the form of regression coefficients (slopes and intercepts), measurements (mass, length, etc.), and probabilities
12 We usually encounter discrete variables in the form of counts (the number of individuals can only be a positive integer; you can't have 4.234 individuals) or categories (alive vs. dead, present in location A vs. B vs. C)
13 Probability functions tell us $[z] = Pr(z)$
For discrete random variables, the probability that the variable will take a specific value z is defined by the probability mass function (pmf)
$0 \leq [z] \leq 1$
$\sum_{z \in S} [z] = 1$
where S is the set of all z for which [z]>0 (the range of possible values of z).
As an example, let's assume a random variable that follows a Poisson distribution
Poisson random variables can take any non-negative integer value (0, 1, 2, ...)
e.g., the number of individuals at a site or the number of seeds produced by a flower
The shape of the Poisson distribution is determined by 1 parameter called $\lambda$
$\lambda$ is the expected value (the mean) of a random variable generated from the Poisson distribution
larger $\lambda$ means larger values of the variable are more likely
In R, probability mass is estimated using the dpois() function (or the equivalent for other discrete distributions)
dpois() takes two arguments: the value we are interested in estimating the probability of (z) 14 and the expected value of our distribution ($\lambda$)
dpois(x = seq(0,25), lambda = 10)
14 R will let us put in a vector of values, so we can also do the following to estimate the probability of all values from 0 to 25: dpois(x = seq(0, 25), lambda = 10)
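As a quick sanity check (this snippet is not from the slides), we can confirm the pmf property from above - the probability mass over all (practically) possible values sums to 1:
sum(dpois(x = 0:100, lambda = 10)) # effectively 1; values above 100 have negligible probability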
Probability mass functions provide the probability that a discrete random variable takes on a specific value z
For continuous variables, estimating probabilities is a little trickier because Pr(z)=0 for any specific value z
Why? Let's look at the probability distribution for a normal random variable with mean =0 and standard deviation =3:
The probability is the area under the density curve for an interval between a and b, whose width we'll call $\Delta_z = (b - a)$.
For example, the shaded area below shows the probability $Pr(-2 \leq z \leq -1)$:
This area can be approximated by multiplying the width of the interval by the (average) height of the rectangle, i.e., the density at the midpoint of the interval:
$Pr(a \leq z \leq b) \approx \Delta_z \, [(a + b)/2]$
By making the interval $\Delta_z = b - a$ smaller and smaller, this approximation gets closer and closer to the true probability
In R, the probability density [z] is given by the dnorm() function (or the equivalent for other continuous distributions) 15
15 Now you know why the function name starts with d!
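Here is a small sketch (not from the original slides) comparing the rectangle approximation with the probability obtained from the cumulative distribution function pnorm(); the mean of 0 and standard deviation of 3 match the normal example above:
a <- -2
b <- -1
dz <- b - a                               # width of the interval
dz * dnorm((a + b) / 2, mean = 0, sd = 3) # width x density at the midpoint (rectangle approximation)
pnorm(b, mean = 0, sd = 3) - pnorm(a, mean = 0, sd = 3) # probability from the cdf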
Every probability distribution we will use in the course can be described by its moments
1st moment is the expected value (i.e., mean)
2nd moment is the variance
For discrete random variables:
$\mu = E(z) = \sum_{z \in S} z \, [z]$
$\sigma^2 = E\left((z - \mu)^2\right) = \sum_{z \in S} (z - \mu)^2 \, [z]$
For continuous random variables:
$\mu = E(z) = \int_{-\infty}^{\infty} z \, [z] \, dz$
$\sigma^2 = E\left((z - \mu)^2\right) = \int_{-\infty}^{\infty} (z - \mu)^2 \, [z] \, dz$
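For a discrete distribution we can apply these sums directly in R. A minimal sketch (not from the slides) for a Poisson distribution with $\lambda = 10$, where both moments should equal $\lambda$:
z <- 0:100                       # covers essentially all of the probability mass
pz <- dpois(z, lambda = 10)      # [z]
(mu <- sum(z * pz))              # first moment E(z); should be 10
(sigma2 <- sum((z - mu)^2 * pz)) # second moment Var(z); also 10 for the Poisson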
One way to estimate moments is by simulating a large number of values from a probability distribution and then using these samples to calculate the first and second moments 17
This approach is very easy to do in R using the r class of functions (e.g., rnorm(), rpois(), etc.), which generate random draws (hence the r, for random) from a given probability distribution 16
16 Monte Carlo integration is a form of simulation where we draw many random samples from a probability distribution and then use those samples to learn about properties of the distribution
17 This is a useful approach to understand because it is very similar to how we learn about parameter distributions in Bayesian analyses
Let's estimate the first and second moments of a gamma distribution 18
The shape of the gamma distribution is governed by two parameters, $\alpha$ (referred to as the shape) and $\beta$ (referred to as the rate or sometimes the scale) 19
In R, we can generate and visualize a large number (e.g., 10,000) of random draws from the gamma distribution using the following code:
n <- 10000 # Sample size
samp <- rgamma(n, shape = 0.5, rate = 2)
18 The gamma distribution is a continuous probability distribution that produces non-negative random variables
19 Both $\alpha$ and $\beta$ must be > 0
Now let's use these samples to estimate the first moment (the mean) and the second moment (the variance) of the distribution
We estimate the first moment by taking the arithmetic mean of our samples $\left(\frac{1}{n}\sum_{i=1}^{n} z_i\right)$ and the second moment as the variance $\left(\frac{1}{n}\sum_{i=1}^{n} (z_i - \mu)^2\right)$:
mu <- sum(samp)/n # mean of the sample
sigma2 <- sum((samp - mu)^2)/n # variance of the sample
How close are these values to the true moments? For the gamma distribution:
$\mu = \frac{\alpha}{\beta}$
$\sigma^2 = \frac{\alpha}{\beta^2}$
For our samples:
mu # Estimated mean
## [1] 0.2491
0.5/2 # True mean
## [1] 0.25
Your answer won't exactly match the ones here but they should be pretty close
For our distribution:
sigma2 # Estimated variance
## [1] 0.1278
0.5/2^2 # True variance
## [1] 0.125
Try this on your own - simulate data from a Poisson distribution and see if the moments you estimate from the sample are close to the true moments
Hint - the Poisson distribution has a single parameter $\lambda$, which is both the mean and the variance of the distribution
Change both $\lambda$ and n. Does varying these values change how well your sample estimates the moments? 20
20 Question - in the above simulations, we use the arithmetic mean to estimate the first moment of the distribution. But in the definition of the moment, we defined the mean as the weighted average of the z's. Why don't we have to take the weighted average of our sample?
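If you want to check your work, here is one possible way to set up the simulation (a sketch only; the values of $\lambda$ and n are arbitrary choices):
n <- 10000
lambda <- 4
samp <- rpois(n, lambda = lambda)
sum(samp) / n                  # estimated first moment; compare to lambda
sum((samp - mean(samp))^2) / n # estimated second moment; compare to lambda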
What if you know the mean and variance of a distribution and need the parameters?
Rather than using simulation, each distribution has a set of formulas for converting between parameters and moments (called moment matching)
Moment matching is very important because often we have the mean and variance of distributions but need to convert those summaries into the parameters of the underlying distribution 21,22
21If this is not obvious right now, don't worry. You'll see why later in the semester as we work through examples
22 Of course, this does not mean you need to memorize the moment equations - that's what google is for.
For the normal distribution, it is relatively easy to understand moments because the parameters of the distribution (mean and standard deviation) are the first and second moments
The normal distribution has an interesting property - you can change the first moment without changing the second moment
This is not true of all probability distributions
For example, the beta distribution is a continuous distribution with values between 0 and 1 23,24. Its first and second moments are:
$\mu = \frac{\alpha}{\alpha + \beta}$
$\sigma^2 = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$
23This makes it useful for modeling random variables that are probabilities (e.g., detection probability in an occupancy model)
24The shape of the beta distribution is governed by two parameters α and β
Solving these equations for the parameters gives:
$\alpha = \left(\frac{1 - \mu}{\sigma^2} - \frac{1}{\mu}\right)\mu^2$
$\beta = \alpha\left(\frac{1}{\mu} - 1\right)$
For our model (here, $\mu = 0.3$ and $\sigma^2 = 0.025$), that means 25:
(alpha <- ( (1 - 0.3)/0.025 - (1/0.3) )*0.3^2)
## [1] 2.22
(beta <- alpha * ( (1/0.3) - 1))
## [1] 5.18
25 On your own, use our simulation method to check that these estimates are correct:
samp <- rbeta(n, alpha, beta)
(mu <- sum(samp)/n)
(sigma2 <- sum((samp - mu)^2)/n)