class: center, middle, inverse, title-slide

.title[
# LECTURE 2: principles of estimation
]
.subtitle[
## FANR 6750 (Experimental design)
]
.author[
### Fall 2022
]

---
class: inverse

# outline

--

#### 1) Probability and parameters

<br/>

--

#### 2) Populations vs samples

<br/>

--

#### 3) Common "statistics"

<br/>

--

#### 4) Sampling distributions

<br/>

--

#### 5) Uncertainty: Standard errors and confidence intervals

---
class: inverse, center, middle

# probability

---
# probability

#### The dynamics of biological systems are inherently uncertain due to *stochasticity*

--

#### Stochastic processes:

--

+ Given an input, the process will not always return the same output

--

+ The output of a stochastic process is therefore *uncertain*

--

+ Even though stochastic processes are inherently uncertain, they are not *unpredictable*

---
# probability

#### *Random variables* can take on different values due to chance (i.e., they are stochastic)

--

#### *Probability* allows us to summarize how likely each possible value of a random variable is to occur

--

- Usually quantified using a *probability distribution*

---
# probability distributions

#### Mathematical function that tells us how likely each possible value of a random variable is to occur

--

- Characterized by a *sample space*, i.e., all possible values (real or integer? negative?)

--

- Area under the curve must equal 1 (for discrete distributions, the probabilities must sum to 1)

--

- Many available (normal, Poisson, gamma, beta, Dirichlet, binomial, Bernoulli, etc.)

--

- Shape of each distribution is governed by *parameters*

---
# the normal distribution

<img src="02_estimation_files/figure-html/unnamed-chunk-1-1.png" width="432" style="display: block; margin: auto;" />

--

- Two parameters: `\(\large \mu\)` and `\(\large \sigma\)`

  + `\(\mu\)` is the mean, i.e., the most probable value

  + `\(\sigma\)` is the standard deviation, i.e., how far (on average) values are from the mean

---
# the normal distribution

<img src="02_estimation_files/figure-html/unnamed-chunk-2-1.png" width="432" style="display: block; margin: auto;" />

- Two parameters: `\(\large \mu\)` and `\(\large \sigma\)`

- Very common in nature. Why?
  + Hint: [Central limit theorem](https://seeing-theory.brown.edu/probability-distributions/index.html)

---
# the normal distribution

<img src="02_estimation_files/figure-html/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" />

- Two parameters: `\(\large \mu\)` and `\(\large \sigma\)`

- Very common in nature. Why?

#### Much of what we'll do this semester comes down to determining whether different normal distributions have the same mean (or standard deviation)!

---
# populations vs samples

#### Population

- A collection of subjects of interest

- Often, a biologically meaningful unit

- Sometimes a process of interest

--

#### Sample

- A finite subset of the population of interest, i.e., the data we collect

- Samples allow us to draw inferences about the population

- Good samples are:

  + Random

  + Representative

  + Sufficiently large

---
# normal distribution

<img src="02_estimation_files/figure-html/normal-1.png" width="576" style="display: block; margin: auto;" />

--

**Remember: This is the population!**

---
# normal distribution

<img src="02_estimation_files/figure-html/normal_samp-1.png" width="576" style="display: block; margin: auto;" />

---
# parameters vs statistics

### Parameters

- Attributes of the population

  + Mean ( `\(\mu\)` )

  + Variance ( `\(\sigma^2\)` )

  + Standard deviation ( `\(\sigma\)` )

--

- Usually unknown

--

- Parameters are the quantities of interest

--

### Statistics

- Attributes of the sample

  + Mean ( `\(\bar{y}\)` or `\(\hat{\mu}\)` )

  + Variance ( `\(s^2\)` or `\(\hat{\sigma}^2\)` )

  + Standard deviation ( `\(s\)` or `\(\hat{\sigma}\)` )

--

- Often treated as estimates of parameters

---
# summary statistics

### Measures of central tendency

- Sample mean

`$$\large \bar{y} = \frac{\sum_{i=1}^n y_i}{n}$$`

<br/>

--

- Median

<br/>

--

- Mode

---
# summary statistics

<img src="02_estimation_files/figure-html/mu_samp-1.png" width="576" style="display: block; margin: auto;" />

---
# summary statistics

### Measures of dispersion

- Sample variance

`$$\large s^2 = \frac{\sum_{i=1}^n (y_i - \bar{y})^2}{n-1}$$`

<br/>

--

- Sample standard deviation

`$$\large s = \sqrt{s^2}$$`

<br/>

--

- Range

---
# summary statistics

<img src="02_estimation_files/figure-html/s_samp-1.png" width="576" style="display: block; margin: auto;" />

---
# sampling error

#### Question: What is the probability that `\(\large \bar{y} = \mu\)`?

--

- Answer: 0 (why?)

--

#### Fact: The sample mean will almost never exactly equal the population mean

--

- The difference between `\(\large \bar{y}\)` and `\(\large \mu\)` is **sampling error**

--

- Sampling error can be reduced but it cannot be eliminated

--

#### Problem: If we don't know `\(\large \mu\)`, how do we know how far our estimate is from the true value?

--

- Answer: We don't (for any specific sample)

--

- BUT...we do know how far, on average, the mean of a sample of size `\(n\)` will be from the true value

---
# descriptive vs inferential statistics

> The sample standard deviation ( `\(s\)` ) is a descriptive statistic

`$$\Large s = \sqrt{s^2}$$`

- `\(\large s\)` tells us how far, on average, each observation `\(\large y\)` is from the sample mean `\(\large \bar{y}\)`

<br/>

--

> The standard error (SE) is an inferential statistic

`$$\Large SE = \frac{s}{\sqrt{n}}$$`

- `\(SE\)` tells us how far, on average, each sample mean `\(\bar{y}\)` is from the population mean `\(\mu\)`

???

What do we mean by "each sample"? After all, we generally only have one sample. Standard error is based on the idea that we could collect (or, more likely, simulate) lots and lots and lots of samples (ideally infinitely many), all from the same population and with the same sample size `\(n\)`.
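
The repeated-sampling idea in these notes can be sketched in a few lines of Python (standard library only). This is an illustration under stated assumptions, not course code: the population values `MU` and `SIGMA`, the sample size, and the number of repeated samples are all hypothetical. It compares the standard deviation of many simulated sample means with `\(s/\sqrt{n}\)` computed from one sample.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population and design values, chosen only for illustration
MU, SIGMA = 10.0, 2.0   # population mean and standard deviation
N = 25                  # size of each sample
N_SAMPLES = 5000        # number of repeated samples to simulate


def draw_sample(n):
    """Draw n independent observations from a normal(MU, SIGMA) population."""
    return [random.gauss(MU, SIGMA) for _ in range(n)]


# What we can do with a single sample: SE = s / sqrt(n)
one_sample = draw_sample(N)
s = statistics.stdev(one_sample)       # sample SD (n - 1 in the denominator)
se_formula = s / N ** 0.5

# What the SE is estimating: the SD of the sample means across many samples
sample_means = [statistics.mean(draw_sample(N)) for _ in range(N_SAMPLES)]
se_simulated = statistics.stdev(sample_means)

# Known only because we chose the population ourselves
se_true = SIGMA / N ** 0.5

print(f"SE from one sample (s / sqrt(n)): {se_formula:.3f}")
print(f"SD of {N_SAMPLES} simulated sample means: {se_simulated:.3f}")
print(f"sigma / sqrt(n): {se_true:.3f}")
```

With these assumed settings, both estimates should land close to `\(\sigma/\sqrt{n} = 2/\sqrt{25} = 0.4\)`. The single-sample value moves around more from run to run because `\(s\)` is itself an estimate.
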
---
class: inverse, center, middle

# the sampling distribution

---
# a single sample (n = 25)

<br/>

<img src="02_estimation_files/figure-html/sampling-1.png" width="648" style="display: block; margin: auto;" />

---
# standard deviation

<br/>

<img src="02_estimation_files/figure-html/sampling_sd-1.png" width="648" style="display: block; margin: auto;" />

--

**Remember** - this error bar is the standard deviation **of our sample**!

---
# standard error

But remember, what we really want to know is: how far is the sample mean from the true parameter value?

--

<img src="02_estimation_files/figure-html/sample_se-1.png" width="648" style="display: block; margin: auto;" />

---
# standard error

Imagine we could repeat our experiment 100 times

<img src="samples.gif" style="display: block; margin: auto;" />

---
# standard error

The distribution of these 100 sample means approximates the **sampling distribution**

<img src="02_estimation_files/figure-html/sampling_dist, -1.png" width="648" style="display: block; margin: auto;" />

---
# standard error

The distribution of these 100 sample means approximates the **sampling distribution**

<img src="02_estimation_files/figure-html/sampling_dist2, -1.png" width="648" style="display: block; margin: auto;" />

---
# standard error

> The standard error is the standard deviation **of the sampling distribution**, i.e., how far, on average, a sample mean is from the true population value

<img src="02_estimation_files/figure-html/sampling_se, -1.png" width="648" style="display: block; margin: auto;" />

???

The standard error of the sample mean is an estimate of how far the sample mean is likely to be from the population mean, whereas the standard deviation of the sample is the degree to which individuals within the sample differ from the sample mean

---
# standard error

#### We rarely repeat experiments

<br/>

--

#### But we can estimate properties of the sampling distribution from a single sample!
<br/>

`$$\Large SE = \frac{s}{\sqrt{n}}$$`

--

#### This is very useful for quantifying uncertainty in our estimates

<br/>

---
# confidence intervals

> If we drew many samples from the population and calculated an `\(x\)`% confidence interval from each one, about `\(x\)`% of those confidence intervals would contain the true population mean

--

<img src="02_estimation_files/figure-html/ci-1.png" width="648" style="display: block; margin: auto;" />

???

It's worth remembering that (100 - `\(x\)`)% of the time, the confidence interval we calculate from our sample **will not** include the true population mean. Of course, with our real data, we have no way of knowing if our sample is one of the black points on this graph 😀 or one of the red dots 😢

---
# for thought

#### If our goal is generally to decrease uncertainty in parameter estimates:

--

- What factors determine the magnitude of our uncertainty estimates (SE or confidence intervals)?

<br/>

--

- What can we, as researchers, control when we design experiments to minimize uncertainty? What can we not control?

---
# looking ahead

<br/>

#### **Next time:** Introduction to linear models

<br/>

#### **Reading:** Quinn chp. 5.2-5.3
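
---
# appendix: checking confidence interval coverage

The confidence interval definition from earlier in the lecture can be checked by simulation. This is a minimal sketch, not the course's own code. The population values, sample size, and number of repeated samples are hypothetical, and the interval is assumed to be the usual t-based one, `\(\bar{y} \pm t \times SE\)`, with the critical value for 24 degrees of freedom hard-coded.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population and design values, chosen only for illustration
MU, SIGMA = 10.0, 2.0   # population mean and standard deviation
N = 25                  # size of each sample
N_SAMPLES = 2000        # number of repeated samples to simulate
T_CRIT = 2.064          # two-sided 95% critical value of t with N - 1 = 24 df


def ci_covers_mu():
    """Draw one sample, build a 95% t-based interval, report whether it contains MU."""
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    y_bar = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5   # SE = s / sqrt(n)
    lower, upper = y_bar - T_CRIT * se, y_bar + T_CRIT * se
    return lower <= MU <= upper


# Proportion of simulated intervals that contain the true mean
coverage = sum(ci_covers_mu() for _ in range(N_SAMPLES)) / N_SAMPLES
print(f"Proportion of 95% intervals containing mu: {coverage:.3f}")
```

Under these assumptions the proportion should be close to 0.95. Any single interval either contains `\(\mu\)` or does not; the 95% describes the procedure over many repeated samples.
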