class: center, middle, inverse, title-slide

.title[
# LECTURE 4: linear models part 1: categorical predictor w/ 2 levels
]
.subtitle[
## FANR 6750 (Experimental design)
]
.author[
### Fall 2024
]

---
class: inverse

# outline

<br/>

#### 1) Categorical predictor with 1 level (linear model vs t-test formulations)

<br/>

--

#### 2) Categorical predictor with 2 levels (linear model vs t-test formulations)

<br/>

--

#### 3) Dummy variables and model matrices

---
# context, part 1

#### We are interested in whether a certain brand of scale provides accurate mass measurements

--

Hypothesis

> Measurement error, measured as the difference between the mass indicated by the scale and the true mass of an object, is 0 (on average)

--

<img src="04_t-test_files/figure-html/unnamed-chunk-1-1.png" width="432" style="display: block; margin: auto;" />

---
# context, part 1

Field procedure:

- randomly sample 10 scales of the same make and model

- weigh a test object (of known mass) on each scale and record the difference between the measured and true mass

--

Data: `-0.062, -0.38, 0.85, -0.58, 0.53, 0.09, 0.31, 0.77, 0.59, -0.17`

<img src="04_t-test_files/figure-html/unnamed-chunk-2-1.png" width="432" style="display: block; margin: auto;" />

---
# the linear model

`$$\LARGE y_i = \beta_0 + \epsilon_i$$`

`$$\LARGE \epsilon_i \sim normal(0, \sigma)$$`

--

- `\(\large \beta_0\)` is the expected (or average) measurement error across all scales

- `\(\epsilon_i\)` is the residual error associated with each measurement

--

- This is the simplest linear model we can construct

--

- If our hypothesis is correct, `\(\large \beta_0 = 0\)`

--

- Question: What is the value of `\(\large \beta_0\)` if our hypothesis is wrong?

---
# summary statistics

.small-code[

```r
y <- c(-0.062, -0.38, 0.85, -0.58, 0.53, 0.09, 0.31, 0.77, 0.59, -0.17)
```
]

--

`\(\Large \bar{y}\)`


```r
mean(y)
```

```
## [1] 0.1948
```

--

`\(\Large s\)`


```r
sd(y)
```

```
## [1] 0.4925
```

--

?


```r
sd(y)/sqrt(10)
```

```
## [1] 0.1557
```

---
# visualizing the linear model

<br/>

<img src="04_t-test_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto;" />

---
# visualizing the linear model

<br/>

<img src="04_t-test_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" />

---
# the linear model

#### Because this is a linear model, we can fit it using the `lm` function:


```r
fit1 <- lm(y ~ 1)
broom::tidy(fit1)
```

```
## # A tibble: 1 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    0.195     0.156      1.25   0.243
```

--

- The `(Intercept)` estimate corresponds to `\(\beta_0\)` and is simply the mean value of `\(y\)`

--

- The standard error is calculated simply as `\(\frac{s}{\sqrt{n}}\)`

--

- We will discuss whether there is evidence that `\(\beta_0 \neq 0\)` in the next lecture

---
# the linear model as a t-test

#### The model we just fit has a corresponding "test" that you may encounter: the *one-sample t-test*

.pull-left[
- One-sample t-tests are used to test whether the mean of a population differs from some value (in this case 0)

- `R` has a built-in function, `t.test()`, for fitting t-tests

- the mean (and the SE, t-statistic, p-value, and confidence interval) are exactly the same in both approaches
]

.pull-right[

```r
t.test(y, mu = 0)
```

```
## 
## 	One Sample t-test
## 
## data:  y
## t = 1.3, df = 9, p-value = 0.2
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.1575  0.5471
## sample estimates:
## mean of x 
##    0.1948
```
]

---
# the one-sample t-test by hand

#### Because this is such a simple model, we can calculate `\(\beta_0\)` and the standard error by hand very easily


```r
(beta0 <- mean(y))
```

```
## [1] 0.1948
```

```r
(SE <- sd(y)/sqrt(10))
```

```
## [1] 0.1557
```

--

Again, we will discuss how to use this information to determine whether `\(\beta_0\)` is or is not equal to 0 in the next lecture

---
class: inverse

# context, part 2

#### We are interested in determining how urban development influences the abundance of the strawberry poison-dart frog (*Oophaga pumilio*)

.left-column[
</br>
<img src="fig/frog2.png" width="2016" style="display: block; margin: auto;" />
]

.right-column[
Hypothesis

> Strawberry frog abundance will differ between islands with high urban development and islands with low urban development

Field procedure:

- `\(\large n=10\)` plots are established on one high-development island and one low-development island

- Within each plot, strawberry frogs are counted within 5 randomly-located quadrats
]

---
# context, part 2

.left-column[
</br>
<img src="fig/frog.png" width="80%" style="display: block; margin: auto;" />
]

.right-column[
#### Data:

`Low development: 16, 14, 18, 17, 29, 31, 14, 16, 22, 15`

`High development: 2, 11, 6, 8, 0, 3, 19, 1, 6, 5`

<img src="04_t-test_files/figure-html/unnamed-chunk-14-1.png" width="468" style="display: block; margin: auto;" />
]

---
# quick review

#### A "simple" linear model

`$$\underbrace{\LARGE E[y_i] = 7 + 3.5 \times x_i}_{Deterministic}$$`

`$$\underbrace{\LARGE y_i \sim normal(E[y_i], \sigma=0.25)}_{Stochastic}$$`

<img src="04_t-test_files/figure-html/unnamed-chunk-15-1.png" width="432" style="display: block; margin: auto;" />

---
# the linear model

#### Same model, different `\(\Large x\)`

`$$\underbrace{\LARGE E[y_i] = 7 + 3.5 \times x_i}_{Deterministic}$$`

`$$\underbrace{\LARGE y_i \sim normal(E[y_i], \sigma=0.25)}_{Stochastic}$$`

<img src="04_t-test_files/figure-html/unnamed-chunk-16-1.png" width="288" style="display: block; margin: auto;" />

---
# the linear model

#### Same model, different `\(\Large x\)`

`$$\underbrace{\LARGE E[y_i] = 7 + 3.5 \times x_i}_{Deterministic}$$`

`$$\underbrace{\LARGE y_i \sim normal(E[y_i], \sigma=0.25)}_{Stochastic}$$`

<img src="04_t-test_files/figure-html/unnamed-chunk-17-1.png" width="288" style="display: block; margin: auto;" />

---
# the linear model

</br>

`$$\LARGE E[y_i] = \beta_0 + \beta_1 x_i$$`

--

#### When `\(\large x = 0\)`:

- `\(\large E[y] = \beta_0\)`

- What is the interpretation of `\(\large \beta_0\)`?

--

#### When `\(\large x = 1\)`:

- `\(\large E[y] = \beta_0 + \beta_1\)`

- What is the interpretation of `\(\large \beta_1\)`?

---
# fitting the model in `R`


```r
data(frogdata)

fit <- lm(Frogs ~ Development, data = frogdata)
broom::tidy(fit)
```

```
## # A tibble: 2 × 5
##   term            estimate std.error statistic       p.value
##   <chr>              <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)         19.2      1.87     10.3  0.00000000573
## 2 DevelopmentHigh    -13.1      2.64     -4.97 0.000100
```

--

To interpret this output, we need to know how `R` codes categorical predictors...

---
# categorical predictors in `R`

#### When you include a categorical predictor (factor or character object) in your model, `R` recodes it as a **dummy variable**

- For a predictor with two levels, one level gets coded as `\(\large 0\)` and the other as `\(\large 1\)`

- For character objects, the levels are ordered alphabetically

- For factors, the coding follows the order of the factor levels (see `levels()`); the first level becomes the reference level

--


```r
str(frogdata)
```

```
## 'data.frame':	20 obs. of  2 variables:
##  $ Frogs      : num  16 14 18 17 29 31 14 16 22 15 ...
##  $ Development: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
```
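--

If you need a different reference level, you can re-order the factor yourself. A minimal sketch (the copy `frogdata2` and the refit are illustrations, not output from these slides):


```r
## Make "High" the reference level instead of "Low"; relevel() moves the
## requested level to the front of the factor's levels
frogdata2 <- frogdata
frogdata2$Development <- relevel(frogdata2$Development, ref = "High")

levels(frogdata2$Development)  # "High" "Low": High is now coded as 0
```

Refitting the model with `frogdata2` would flip the interpretation: the intercept would be the high-development mean and the slope the low-minus-high difference.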
---
# model matrix

#### We can always check how `R` coded a linear model using the `model.matrix()` function

- the model matrix (also called the *design matrix*) will always have one row for every observation and one column for every parameter in the model


```r
### View rows 1, 2, 11, and 12 of the model matrix
model.matrix(fit)[c(1,2,11,12),]
```

```
##    (Intercept) DevelopmentHigh
## 1            1               0
## 2            1               0
## 11           1               1
## 12           1               1
```

--

To interpret this output (and see why it's useful), we need to review matrix algebra

---
# matrix multiplication

`$$\begin{bmatrix}
    a & b & c \\
    d & e & f \\
    g & h & i
\end{bmatrix} \;\; \times
\begin{bmatrix}
    x \\
    y \\
    z
\end{bmatrix} = \large
\begin{bmatrix}
    a\times x + b\times y + c\times z\\
    d\times x + e\times y + f\times z\\
    g\times x + h\times y + i\times z
\end{bmatrix}$$`

--

`$$\begin{bmatrix}
    1 & x_1 \\
    1 & x_2 \\
    1 & x_3
\end{bmatrix} \;\; \times
\begin{bmatrix}
    \beta_0 \\
    \beta_1
\end{bmatrix} = \large
\begin{bmatrix}
    \beta_0 \times 1 + \beta_1 \times x_1\\
    \beta_0 \times 1 + \beta_1 \times x_2\\
    \beta_0 \times 1 + \beta_1 \times x_3
\end{bmatrix}$$`

--

.pull-left[

```r
### View rows 1, 2, 11, and 12 of the model matrix
model.matrix(fit)[c(1,2,11,12),]
```

```
##    (Intercept) DevelopmentHigh
## 1            1               0
## 2            1               0
## 11           1               1
## 12           1               1
```
]

.pull-right[

```r
### View rows 1, 2, 11, and 12 of the data
frogdata[c(1,2,11,12),]
```

```
##    Frogs Development
## 1     16         Low
## 2     14         Low
## 11     2        High
## 12    11        High
```
]
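--

This multiplication is exactly how the model matrix gets used: multiplying it by the coefficient vector gives `\(E[y_i]\)` for every observation. A quick sketch you can run to confirm (not output shown in the original slides):


```r
## The model matrix times the coefficient vector reproduces lm's fitted
## values: beta0 for Low rows, beta0 + beta1 for High rows
X <- model.matrix(fit)
beta <- coef(fit)

head(X %*% beta)   # expected frog counts for the first 6 plots
head(fitted(fit))  # identical values
```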
---
# fitting the model in `R`


```r
data(frogdata)

fit2 <- lm(Frogs ~ Development, data = frogdata)
broom::tidy(fit2)
```

```
## # A tibble: 2 × 5
##   term            estimate std.error statistic       p.value
##   <chr>              <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)         19.2      1.87     10.3  0.00000000573
## 2 DevelopmentHigh    -13.1      2.64     -4.97 0.000100
```

--

- The `(Intercept)` estimate ( `\(\beta_0\)` ) is the expected number of frogs in a low-development plot

--

- The `DevelopmentHigh` estimate ( `\(\beta_1\)` ) is the *difference* between high-development and low-development plots (high minus low)

--

- **Question**: Are there fewer frogs in high-development plots?

---
# visualizing the linear model

<img src="04_t-test_files/figure-html/unnamed-chunk-24-1.png" width="504" style="display: block; margin: auto;" />

--

<img src="04_t-test_files/figure-html/unnamed-chunk-25-1.png" width="504" style="display: block; margin: auto;" />

---
# fitting the model by hand

In this case, we can also fit the model "by hand"

--

The parameter estimates are:


```r
mu_high <- mean(frogdata$Frogs[frogdata$Development == "High"])
mu_low <- mean(frogdata$Frogs[frogdata$Development == "Low"])

(beta_0.hat <- mu_low)
```

```
## [1] 19.2
```

```r
(beta_1.hat <- mu_high - mu_low)
```

```
## [1] -13.1
```

--

How do we calculate the standard errors?

---
# fitting the model by hand

How do we calculate the standard errors?

--

Because we have samples from two populations (low and high development), we have to use the *pooled* variance of both samples to calculate the correct standard errors:

`$$\large s^2_p = \frac{(n_L - 1)s^2_L + (n_H - 1)s^2_H}{n_L + n_H - 2}$$`

--

```
## [1] 34.81
```

---
# fitting the model by hand

How do we calculate the standard errors?

--

With the pooled variance calculated, we can calculate the standard errors. For the intercept:

`$$\large SE_{\beta_0} = \sqrt{\frac{s^2_p}{n_L}}$$`

```
## [1] 1.866
```

--

For the slope:

`$$\large SE_{\beta_1} = \sqrt{\frac{s^2_p}{n_L} + \frac{s^2_p}{n_H}}$$`

```
## [1] 2.638
```

--

What is the interpretation of these standard errors?
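--

The code behind the three outputs above is not shown; here is a minimal sketch consistent with the formulas (the object names are my own):


```r
## Pooled variance and standard errors, following the formulas above
low  <- frogdata$Frogs[frogdata$Development == "Low"]
high <- frogdata$Frogs[frogdata$Development == "High"]
n_L <- length(low)
n_H <- length(high)

## pooled variance
(s2_p <- ((n_L - 1) * var(low) + (n_H - 1) * var(high)) / (n_L + n_H - 2))  # 34.81

## standard errors for the intercept and the slope
(SE_beta0 <- sqrt(s2_p / n_L))               # 1.866
(SE_beta1 <- sqrt(s2_p / n_L + s2_p / n_H))  # 2.638
```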
---
# the linear model as a t-test

#### As before, the model we just fit has a corresponding "test" that you may encounter: the *two-sample t-test*

.small-code[

```r
t.test(Frogs ~ Development, data = frogdata)
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  Frogs by Development
## t = 5, df = 18, p-value = 1e-04
## alternative hypothesis: true difference in means between group Low and group High is not equal to 0
## 95 percent confidence interval:
##   7.554 18.646
## sample estimates:
##  mean in group Low mean in group High 
##               19.2                6.1
```
]

- Used to test whether the means of two populations are different

- Results match `lm` (note that `t.test` defaults to the Welch correction for unequal variances; set `var.equal = TRUE` to reproduce the `lm` output exactly)

---
# other ways to fit the model

#### By default, the linear model approach provides the mean of the reference group and the difference in the group means

--

- This version of the linear model formulation is usually referred to as the *effects parameterization*

- We can easily calculate the mean of group 2 from the linear model as `\(\beta_0 + \beta_1\)`

--

- We can also recode the model in `R` to provide the group means (the *means parameterization*)


```r
fit3 <- lm(Frogs ~ Development - 1, data = frogdata)
broom::tidy(fit3)
```

```
## # A tibble: 2 × 5
##   term            estimate std.error statistic       p.value
##   <chr>              <dbl>     <dbl>     <dbl>         <dbl>
## 1 DevelopmentLow      19.2      1.87     10.3  0.00000000573
## 2 DevelopmentHigh      6.1      1.87      3.27 0.00426
```

---
# other ways to fit the model

#### Challenge question: What do you think the model matrix looks like for the means parameterization of the model?

--


```r
model.matrix(fit3)[c(1,2,11,12),]
```

```
##    DevelopmentLow DevelopmentHigh
## 1               1               0
## 2               1               0
## 11              0               1
## 12              0               1
```

--

The key thing to recognize is that these alternative formulations (t-test, effects parameterization, means parameterization) provide the *exact same information*, just presented in different ways

- In other words, they do not change your conclusions!
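--

One way to convince yourself of that equivalence is to derive each quantity from the others. A small sketch (assuming `fit2` and `fit3` from the previous slides; the equal-variance t-test call is my addition):


```r
## Effects parameterization: intercept + slope = high-development mean
coef(fit2)["(Intercept)"] + coef(fit2)["DevelopmentHigh"]  # 6.1

## Means parameterization returns both group means directly
coef(fit3)  # 19.2 and 6.1

## A t-test assuming equal variances gives the same t-statistic as the lm
## slope (the sign flips because t.test takes Low minus High)
t.test(Frogs ~ Development, data = frogdata, var.equal = TRUE)$statistic  # 4.97
```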
---
# looking ahead

<br/>

### **Next time**: Null hypothesis testing

<br/>

### **Reading**: [Fieberg chp. 1.10](https://statistics4ecologists-v1.netlify.app/linreg.html#hypothesis-tests-and-p-values)