class: center, middle, inverse, title-slide

.title[
# LECTURE 3: introduction to statistical modeling
]
.subtitle[
## (or, everything is a linear model)
]
.author[
### FANR 6750 (Experimental design)
]
.date[
### Fall 2022
]

---
class: inverse

# outline

<br/>

#### 1) What is a model?

<br/>

--

#### 2) What is a linear model?

<br/>

--

#### 3) Linear model assumptions

---
# what is a model?

![](https://media.giphy.com/media/12npFVlmZoXN4Y/giphy.gif)

---
# what is a model?

> "an informative representation of an object, person or system"

--

#### Many types (conceptual, graphical, mathematical)

--

#### In this class, we will deal with *statistical* models

--

- Mathematical representation of our hypothesis

--

- By necessity, models will be simplifications of reality ("all models are wrong...")

--

- Do not have to be complex

---
# but i don't want to be a modeler!

<img src="puffy_shirt.png" width="46%" height="65%" style="display: block; margin: auto;" />

--

- Inference **requires** models

--

- Models link **observations** to **processes**

--

- Models are tools that allow us to understand processes that we **cannot directly observe** based on quantities that we **can** observe

---
# a simple model

<br/>
<br/>

`$$\Huge y = a + bx$$`

--

<img src="03_models_files/figure-html/unnamed-chunk-2-1.png" width="360" style="display: block; margin: auto;" />

--

It may not be obvious, but this is essentially the only model we will use this semester<sup>1</sup>

.footnote[[1] With some minor variations, mainly in `\(x\)`]

---
# a simple model

<br/>
<br/>

`$$\Huge y = a + bx$$`

<img src="03_models_files/figure-html/unnamed-chunk-3-1.png" width="360" style="display: block; margin: auto;" />

If we want to use this as a statistical model, what's missing?

---
# a simple model

<br/>
<br/>

`$$\Huge y = a + bx$$`

<img src="03_models_files/figure-html/unnamed-chunk-4-1.png" width="360" style="display: block; margin: auto;" />

If we want to use this as a statistical model, what's missing?
#### **Stochasticity!**

---
# a simple model

<br/>
<br/>

`$$\Huge y = a + bx$$`

<img src="03_models_files/figure-html/unnamed-chunk-5-1.png" width="360" style="display: block; margin: auto;" />

If we want to use this as a statistical model, what's missing?

#### **Stochasticity!**

---
class:inverse, middle, center

# the linear model

---
# Statistics cookbook

<img src="stats_flow_chart.png" width="50%" style="display: block; margin: auto;" />

---
# the linear model

<br/>
<br/>

`$$\Large response = deterministic\; part+stochastic\; part$$`

<br/>
<br/>

--

`$$\underbrace{\LARGE E[y_i] = \beta_0 + \beta_1 \times x_i}_{Deterministic}$$`

<br/>
<br/>

--

`$$\underbrace{\LARGE y_i \sim normal(E[y_i], \sigma)}_{Stochastic}$$`

???
Note that the deterministic portion of the model has the same form as the equation for a line: `\(y = a + b \times x\)`, which is why we call these linear models

---
# the linear model

#### A "simple" example

`$$\underbrace{\LARGE E[y_i] = -2 + 0.5 \times x_i}_{Deterministic}$$`

--

<img src="03_models_files/figure-html/unnamed-chunk-7-1.png" width="288" style="display: block; margin: auto;" />

---
# the linear model

#### A "simple" example

`$$\underbrace{\LARGE E[y_i] = -2 + 0.5 \times x_i}_{Deterministic}$$`

`$$\underbrace{\LARGE y_i \sim normal(E[y_i], \sigma=0.25)}_{Stochastic}$$`

--

<img src="03_models_files/figure-html/unnamed-chunk-8-1.png" width="288" style="display: block; margin: auto;" />

---
# the linear model

#### Same model, different `\(\Large x\)`

`$$\underbrace{\LARGE E[y_i] = -2 + 0.5 \times x_i}_{Deterministic}$$`

`$$\underbrace{\LARGE y_i \sim normal(E[y_i], \sigma=0.25)}_{Stochastic}$$`

--

<img src="03_models_files/figure-html/unnamed-chunk-9-1.png" width="432" style="display: block; margin: auto;" />

---
# the linear model

#### A more complex model

`$$\large y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ...
+ \beta_px_{ip} + \epsilon_i$$`

--

- Each `\(\beta\)` coefficient is the effect of a specific predictor variable `\(x\)`

- Predictor variables may be continuous, binary, factors, or a combination

- We will cover more complex models (and interpretation) later

---
# is this a linear model?

`$$\Large y = 20 + 0.5x - 0.3x^2$$`

<img src="03_models_files/figure-html/unnamed-chunk-10-1.png" width="396" style="display: block; margin: auto;" />

---
# residuals

#### One concept we will talk about a lot is *residuals*

--

- Residuals are the difference between the observed values `\(y_i\)` and the predicted values `\(E[y_i]\)`

<img src="03_models_files/figure-html/unnamed-chunk-11-1.png" width="396" style="display: block; margin: auto;" />

---
# residuals

#### One concept we will talk about a lot is *residuals*

- Residuals are the difference between the observed values `\(y_i\)` and the predicted values `\(E[y_i]\)`

<img src="03_models_files/figure-html/unnamed-chunk-12-1.png" width="396" style="display: block; margin: auto;" />

--

- How much variation in `\(y\)` is explained by `\(x\)`?

--

- Useful for assessing whether data violate model assumptions

---
class:inverse, center, middle

# assumptions

---
# assumptions

#### **EVERY** model has assumptions

--

- Assumptions are necessary to simplify the real world into a workable model

--

- If your data violate the assumptions of your model, inferences *may* be invalid

--

- **Always** know (and test) the assumptions of your model<sup>1</sup>

.footnote[[1] You know what happens when you assume...]

---
# linear model assumptions

<br/>

`$$\Large y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$`

`$$\Large \epsilon_i \sim normal(0, \sigma)$$`

<br/>

--

1) **Linearity**: The relationship between `\(x\)` and `\(y\)` is linear

--

2) **Normality**: The residuals are normally distributed<sup>2</sup>

.footnote[[2] Note that these assumptions apply to the residuals, not the data!]
--

3) **Homoscedasticity**: The residuals have a constant variance at every level of `\(x\)`

--

4) **Independence**: The residuals are independent (i.e., uncorrelated with each other)

???
Because virtually every model we will use this semester is a linear model, these assumptions apply to everything we will discuss from here on out

---
# linear models

#### Very flexible

--

- Predictor(s) can take different forms (binary, continuous, factor)

--

- Can contain many predictors

--

- Can model non-linear relationships

--

#### Link different "tests" (e.g., t-tests, ANOVA, ANCOVA, linear regression)

--

#### Can be used for different statistical goals

- Estimating unknown parameters

- Testing hypotheses

- Describing stochastic systems

- Making predictions that account for uncertainty

---
# looking ahead

<br/>

#### **Next time:** t-tests and Null Hypothesis Testing

<br/>

#### **Reading:** Quinn chp. 3
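---
# bonus: simulating the linear model

The deterministic + stochastic recipe is easy to check by simulation: generate data from the "simple" example (`\(E[y_i] = -2 + 0.5 x_i\)`, `\(\sigma = 0.25\)`), refit a line by least squares, and confirm the estimates land near the true coefficients. A minimal sketch (in Python rather than the R used for the lecture figures; the sample size and `\(x\)` range are arbitrary choices for illustration):

```python
import random
import statistics

random.seed(42)

# Simulate from the model: deterministic part E[y_i] = -2 + 0.5 * x_i,
# stochastic part y_i ~ normal(E[y_i], sigma = 0.25)
n = 200
x = [random.uniform(0, 10) for _ in range(n)]
y = [random.gauss(-2 + 0.5 * xi, 0.25) for xi in x]

# Refit by ordinary least squares (closed-form simple linear regression)
xbar = statistics.fmean(x)
ybar = statistics.fmean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx           # slope estimate, should be near 0.5
b0 = ybar - b1 * xbar    # intercept estimate, should be near -2

# Residuals: observed minus predicted values; they sum to ~0 by construction
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```

Rerunning with a larger `\(\sigma\)` spreads the points further from the line and widens the spread of the estimates, which is a useful way to build intuition for the stochastic part.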