class: center, middle, inverse, title-slide

.title[
# LECTURE 11: evaluating assumptions
]
.subtitle[
## FANR 6750 (Experimental design)
]
.author[
### Fall 2023
]

---
class: inverse

# outline

#### 1) Review of linear model assumptions

<br/>

--

#### 2) Evaluating assumptions

<br/>

--

#### 3) Robustness to assumption violations

<br/>

--

#### 4) Multicollinearity

---
# assumptions

#### **EVERY** model has assumptions

--

- Assumptions are necessary to simplify the real world into a workable model

--

- If your data violate the assumptions of your model, inferences *may* be invalid

--

- **Always** know (and test) the assumptions of your model

---
# linear model assumptions

<br/>

`$$\Large y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$`

`$$\Large \epsilon_i \sim normal(0, \sigma)$$`

<br/>

--

1) **Linearity**: The relationship between `\(x\)` and `\(y\)` is linear

--

2) **Normality**: The residuals are normally distributed

--

3) **Homogeneity**: The residuals have constant variance at every level of `\(x\)`

--

4) **Independence**: The residuals are independent (i.e., uncorrelated with each other)

---
# evaluating assumptions

Evaluating whether your data violate the linear model assumptions relies heavily on **residuals**

- Residuals are the difference between the observed and predicted response value of each observation

- `\(r_i = y_i - \hat{y}_i\)`

--

Plotting the residuals vs the predicted values provides a visual assessment of the linearity assumption

--

Plotting the residuals vs the predicted values or `\(x\)` provides a visual assessment of the variance assumption

--

A Q-Q plot or histogram of the residuals provides a visual assessment of the normality assumption

---
# evaluating assumptions

Fortunately, multiple `R` packages include functions for making diagnostic plots from a fitted `lm` object

.vsmall-code[

```r
library(performance)

data("biomassdata")
fm1 <- lm(biomass ~ elevation + rainfall, data = biomassdata)

check_model(fm1, check = c("linearity", "homogeneity", "qq", "normality"))
```

<img src="11_assumptions_files/figure-html/unnamed-chunk-1-1.png" width="576" style="display: block; margin: auto;" />
]

---
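# evaluating assumptions

The same diagnostics can also be built by hand from `resid()` and `fitted()`. A minimal base `R` sketch (simulated data are used here so the example is self-contained; the variable names mimic `biomassdata` but the generating values are assumptions):

.vsmall-code[

```r
# simulate data with the same structure as biomassdata (values are made up)
set.seed(123)
n <- 50
elevation <- runif(n, 100, 1000)
rainfall <- runif(n, 20, 80)
biomass <- 2 + 0.01 * elevation + 0.3 * rainfall + rnorm(n, sd = 2)

fm <- lm(biomass ~ elevation + rainfall)

r <- resid(fm)       # r_i = y_i - y_hat_i
y_hat <- fitted(fm)  # predicted values

par(mfrow = c(2, 2))
plot(y_hat, r); abline(h = 0, lty = 2)  # linearity + homogeneity
plot(y_hat, sqrt(abs(r)))               # scale-location (homogeneity)
qqnorm(r); qqline(r)                    # normality (Q-Q plot)
hist(r)                                 # normality (histogram)
```
]

---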
# linearity

Deviations from linearity will appear as patterns in the residual plot

.pull-left[
<img src="11_assumptions_files/figure-html/unnamed-chunk-2-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="11_assumptions_files/figure-html/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" />
]

---
# homogeneity of variance

Heteroscedasticity will appear as patterns (often funnel-shaped) in the residual plot

.pull-left[
<img src="11_assumptions_files/figure-html/unnamed-chunk-4-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="11_assumptions_files/figure-html/unnamed-chunk-5-1.png" width="432" style="display: block; margin: auto;" />
]

---
# homogeneity of variance

Heteroscedasticity can also be assessed by plotting the square root of the absolute residuals vs the fitted values. Violations will appear as increasing or decreasing trends

.pull-left[
<img src="11_assumptions_files/figure-html/unnamed-chunk-6-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="11_assumptions_files/figure-html/unnamed-chunk-7-1.png" width="432" style="display: block; margin: auto;" />
]

---
# normality

Normality is usually assessed using *quantile-quantile* (Q-Q) plots, which plot the observed quantiles of the residuals vs
the expected quantiles of normally distributed data

.pull-left[
<img src="11_assumptions_files/figure-html/unnamed-chunk-8-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="11_assumptions_files/figure-html/unnamed-chunk-9-1.png" width="432" style="display: block; margin: auto;" />
]

--

Large deviations from the line indicate deviations from normality

---
# evaluating models with factors

Because all of the models we have seen so far this semester are linear models, they all share the same assumptions

--

Testing assumptions of models with factors (*t*-test, ANOVA) is therefore no different from testing assumptions of models with continuous predictors

--

.vsmall-code[

```r
data("tunadata")
fm2 <- lm(growth ~ status, data = tunadata)
```
]

<img src="11_assumptions_files/figure-html/unnamed-chunk-11-1.png" width="540" style="display: block; margin: auto;" />

---
# null hypothesis testing

It is also possible to "test" whether assumptions are met:

.vsmall-code[

```r
check_normality(fm1)
```

```
## OK: residuals appear as normally distributed (p = 0.377).
```

```r
check_heteroscedasticity(fm1)
```

```
## OK: Error variance appears to be homoscedastic (p = 0.957).
```
]

where the null hypothesis is that the data don't violate the assumption being tested (so `\(p>0.05\)` is good)

--

However, remember that the assumptions are about the population, not the sample

--

- samples can deviate from assumptions, even if they are met for the population

--

- deviations are more likely for small sample sizes

--

- there is always some judgement necessary when testing assumptions

--

We will explore some of these tests in lab

---
# independence

Non-independence of residuals is more difficult to test but is often related to un-modeled structure in the data

- points close in space/time tend to be more alike than distant points

- observations within "groups" (e.g., males/females, plots, growth chambers, lakes) tend to be more alike than individuals from other groups

--

Often, non-independence can be judged based on knowledge of the sampling design/system

- it can sometimes be seen as "clustering" in residual plots

<img src="11_assumptions_files/figure-html/unnamed-chunk-13-1.png" width="288" style="display: block; margin: auto;" />

---
# independence

<img src="11_assumptions_files/figure-html/unnamed-chunk-14-1.png" width="432" style="display: block; margin: auto;" />

Sometimes, non-independence can be remedied by including relevant covariates in the model

- e.g., `lm(y ~ x + Lake)`

- we'll learn more about this and related approaches in a few weeks

---
# what happens when assumptions are violated?

One consequence of sampling is that our samples will likely depart from assumptions to some degree

--

So rather than asking:

> "Do my data violate the assumptions of my model?"

It can be useful to ask:

> "Do my data violate the assumptions of my model *enough to change my conclusions*?"

--

Fortunately, linear models are relatively robust to minor (or even moderate) departures from the assumptions

---
# what happens when assumptions are violated?
General rules of thumb:

--

- Violating assumptions will not generally bias estimates of `\(\hat{\beta}_0\)`, `\(\hat{\beta}_1\)`, etc.

--

- For large sample sizes, the Central Limit Theorem often ensures that residuals of many types of data are normally distributed (or at least pretty close)

--

- Standard errors (and therefore confidence intervals, p-values, etc.) *will* be biased when assumptions are violated

--

- In general, violating assumptions results in standard errors that are too small. What does this do to p-values?

---
class: inverse, center, middle

# multicollinearity

---
# multicollinearity

One topic that often causes confusion when fitting multiple regression models is *correlated predictors* (aka multicollinearity)

--

Many sources suggest trying to avoid including highly correlated predictors in the same model

- correlation coefficients `\(>0.7\)` are often considered too correlated

--

However, linear models make no assumptions about correlations between predictor variables. So what is the problem?

---
# example

We are interested in the factors that govern the evolution of body size in our study species and hypothesize that body length will increase with latitude and decrease with temperature.

Using museum collections, we have body length measurements from 80 individuals collected at sites between 40 and 50 degrees latitude. We also have long-term temperature data from each collection location.
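---
# example

The `lengthdata` object comes from the course materials; for readers following along without it, data with the same qualitative structure can be simulated. A rough sketch (the generating values below are assumptions, not the ones behind the real data set):

.vsmall-code[

```r
set.seed(42)
n <- 80
Latitude <- runif(n, 40, 50)

# temperature declines with latitude, so the two predictors
# end up strongly (negatively) correlated
Temperature <- 60 - 0.9 * Latitude + rnorm(n, sd = 1)

# body length responds to temperature only in this simulation
Length <- 310 - 2.5 * Temperature + rnorm(n, sd = 5)

lengthdata_sim <- data.frame(Length, Latitude, Temperature)
```
]
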
---
# example

As expected, body length appears to vary as a function of both latitude and temperature:

<img src="11_assumptions_files/figure-html/unnamed-chunk-15-1.png" width="576" style="display: block; margin: auto;" />

---
# example

Fitting a model to each covariate separately also appears to confirm our hypotheses:

.vsmall-code[

```r
data("lengthdata")
```
]

.pull-left[
.vsmall-code[

```r
fmL <- lm(Length ~ Latitude, data = lengthdata)
summary(fmL)
```

```
## 
## Call:
## lm(formula = Length ~ Latitude, data = lengthdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.239  -4.102  -0.191   4.356  10.431 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  173.879     18.005    9.66  5.8e-15 ***
## Latitude       1.919      0.396    4.84  6.4e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.13 on 78 degrees of freedom
## Multiple R-squared:  0.231,	Adjusted R-squared:  0.221 
## F-statistic: 23.4 on 1 and 78 DF,  p-value: 6.45e-06
```
]
]

.pull-right[
.vsmall-code[

```r
fmT <- lm(Length ~ Temperature, data = lengthdata)
summary(fmT)
```

```
## 
## Call:
## lm(formula = Length ~ Temperature, data = lengthdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.376  -2.837  -0.423   3.498  12.258 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   308.95       8.10   38.17  < 2e-16 ***
## Temperature    -2.43       0.41   -5.94  7.5e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.85 on 78 degrees of freedom
## Multiple R-squared:  0.312,	Adjusted R-squared:  0.303 
## F-statistic: 35.3 on 1 and 78 DF,  p-value: 7.47e-08
```
]
]

---
# example

But when we fit a model with both covariates, our conclusions are quite different:

.vsmall-code[

```r
fmLT <- lm(Length ~ Latitude + Temperature, data = lengthdata)
summary(fmLT)
```

```
## 
## Call:
## lm(formula = Length ~ Latitude + Temperature, data = lengthdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.373  -2.784  -0.423   3.474  12.277 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  314.1379    49.7133    6.32  1.6e-08 ***
## Latitude      -0.0809     0.7649   -0.11   0.9160    
## Temperature   -2.5107     0.8353   -3.01   0.0036 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.88 on 77 degrees of freedom
## Multiple R-squared:  0.312,	Adjusted R-squared:  0.294 
## F-statistic: 17.4 on 2 and 77 DF,  p-value: 5.69e-07
```
]

---
# example

What happened?

--

The first thing to recognize is that latitude and temperature are highly negatively correlated:

```r
cor(lengthdata$Latitude, lengthdata$Temperature)
```

```
## [1] -0.8697
```

<img src="11_assumptions_files/figure-html/unnamed-chunk-21-1.png" width="360" style="display: block; margin: auto;" />

--

So high latitudes tend to have larger individuals *and* lower temperatures

---
# multicollinearity

At first, the fact that our estimated slope coefficient for Latitude changes sign and our conclusions about significance change between the two models may seem problematic

- the estimated effects and p-values are sensitive to which model we choose

- it may even be tempting to focus on the univariate models since they both give us significant results

--

But is there really a problem?
Not if we remember the purpose of multiple regression and what the coefficients are measuring

---
# multicollinearity

The model with only latitude is measuring whether body length changes with latitude. Clearly it does (same for temperature).

--

But the multiple regression model is measuring the effect of latitude *after controlling for temperature*

- If we hold temperature constant (statistically), does body length change with latitude? Apparently not<sup>\*</sup>

.footnote[\*In this **made up** data set]

---
# multicollinearity

In the presence of highly correlated predictors, multiple regression models do exactly what we want them to do!

--

Changes in the magnitude/sign/significance of coefficients across different regression models are expected because the coefficients measure different relationships

--

Larger standard errors are also common, and also expected

- multiple regression is a harder problem to solve, so, for a given sample size, uncertainty is larger

--

In general, do not automatically exclude correlated predictors just because they are correlated

- If you have an *a priori* reason for including each covariate, include them!

- If parameter estimates are unreliable, you may need to use a simpler model (or gather more data)

---
# looking ahead

<br/>

### **Next time**: Model selection

<br/>

### **Reading**: Tredennick et al. 2021 (pdf will be posted on eLC for next week's paper discussion)