class: center, middle, inverse, title-slide

.title[
# LECTURE 4: linear models part 1: categorical predictor w/ 2 levels
]
.subtitle[
## FANR 6750 (Experimental design)
]
.author[
### Fall 2024
]

---
class: inverse

# outline

<br/>

#### 1) Categorical predictor with 1 level (linear model vs t-test formulations)

<br/>

--

#### 2) Categorical predictor with 2 levels (linear model vs t-test formulations)

<br/>

--

#### 3) Dummy variables and model matrices

---
# context, part 1

#### We are interested in whether a certain brand of scale provides accurate mass measurements

--

Hypothesis

> Measurement error, measured as the difference between the mass indicated by the scale and the true mass of an object, is 0 (on average)

--

<img src="04_t-test_files/figure-html/unnamed-chunk-1-1.png" width="432" style="display: block; margin: auto;" />

---
# context, part 1

Field procedure:

- randomly sample 10 scales of the same make and model

- weigh a test object (of known mass) on each scale and record the difference between the measured and true mass

--

Data: `-0.062, -0.38, 0.85, -0.58, 0.53, 0.09, 0.31, 0.77, 0.59, -0.17`

<img src="04_t-test_files/figure-html/unnamed-chunk-2-1.png" width="432" style="display: block; margin: auto;" />

---
# the linear model

`$$\LARGE y_i = \beta_0 + \epsilon_i$$`

`$$\LARGE \epsilon_i \sim normal(0, \sigma)$$`

--

- `\(\large \beta_0\)` is the expected (or average) measurement error across all scales

- `\(\epsilon_i\)` is the residual error associated with each measurement

--

- This is the simplest linear model we can construct

--

- If our hypothesis is correct, `\(\large \beta_0 = 0\)`

--

- Question: What is the value of `\(\large \beta_0\)` if our hypothesis is wrong?

---
# summary statistics

.small-code[

```r
y <- c(-0.062, -0.38, 0.85, -0.58, 0.53, 0.09, 0.31, 0.77, 0.59, -0.17)
```
]

--

`\(\Large \bar{y}\)`


```r
mean(y)
```

```
## [1] 0.1948
```

--

`\(\Large s\)`


```r
sd(y)
```

```
## [1] 0.4925
```

--

?


```r
sd(y)/sqrt(10)
```

```
## [1] 0.1557
```

---
# visualizing the linear model

<br/>

<img src="04_t-test_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto;" />

---
# visualizing the linear model

<br/>

<img src="04_t-test_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" />

---
# the linear model

#### Because this is a linear model, we can fit it using the `lm` function:


```r
fit1 <- lm(y ~ 1)
broom::tidy(fit1)
```

```
## # A tibble: 1 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    0.195     0.156      1.25   0.243
```

--

- The `(Intercept)` estimate corresponds to `\(\beta_0\)` and is simply the mean value of `\(y\)`

--

- The standard error is calculated simply as `\(\frac{s}{\sqrt{n}}\)`

--

- We will discuss whether there is evidence that `\(\beta_0 \neq 0\)` in the next lecture

---
# the linear model as a t-test

#### The model we just fit has a corresponding "test" that you may encounter: the *one-sample t-test*

.pull-left[
- One-sample t-tests are used to test whether the mean of a population differs from some value (in this case 0)

- `R` has a built-in function, `t.test()`, for fitting t-tests

- the mean (and the SE, t-statistic, p-value, and confidence interval) are exactly the same in both approaches
]

.pull-right[

```r
t.test(y, mu = 0)
```

```
## 
## 	One Sample t-test
## 
## data:  y
## t = 1.3, df = 9, p-value = 0.2
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.1575  0.5471
## sample estimates:
## mean of x 
##    0.1948
```
]

---
# the one-sample t-test by hand

#### Because this is such a simple model, we can calculate `\(\beta_0\)` and the standard error by hand very easily


```r
(beta0 <- mean(y))
```

```
## [1] 0.1948
```

```r
(SE <- sd(y)/sqrt(10))
```

```
## [1] 0.1557
```

--

Again, we will discuss how to use this information to determine whether `\(\beta_0\)` is or is not equal to 0 in the next lecture

---
class: inverse

# context, part 2

#### We are interested in determining how urban development influences the abundance of the strawberry poison-dart frog (*Oophaga pumilio*)

.left-column[
</br>
<img src="fig/frog2.png" width="2016" style="display: block; margin: auto;" />
]

.right-column[
Hypothesis

> Strawberry frog abundance will differ between islands with high urban development and islands with low urban development

Field procedure:

- `\(\large n=10\)` plots are established on one high-development island and one low-development island

- Within each plot, strawberry frogs are counted within 5 randomly-located quadrats
]

---
# context, part 2

.left-column[
</br>
<img src="fig/frog.png" width="80%" style="display: block; margin: auto;" />
]

.right-column[
#### Data:

`Low development: 16, 14, 18, 17, 29, 31, 14, 16, 22, 15`

`High development: 2, 11, 6, 8, 0, 3, 19, 1, 6, 5`

<img src="04_t-test_files/figure-html/unnamed-chunk-14-1.png" width="468" style="display: block; margin: auto;" />
]

---
# quick review

#### A "simple" linear model

`$$\underbrace{\LARGE E[y_i] = 7 + 3.5 \times x_i}_{Deterministic}$$`

`$$\underbrace{\LARGE y_i \sim normal(E[y_i], \sigma=0.25)}_{Stochastic}$$`

<img src="04_t-test_files/figure-html/unnamed-chunk-15-1.png" width="432" style="display: block; margin: auto;" />

---
# the linear model

#### Same model, different `\(\Large x\)`

`$$\underbrace{\LARGE E[y_i] = 7 + 3.5 \times x_i}_{Deterministic}$$`

`$$\underbrace{\LARGE y_i \sim normal(E[y_i], \sigma=0.25)}_{Stochastic}$$`

<img src="04_t-test_files/figure-html/unnamed-chunk-16-1.png" width="288" style="display: block; margin: auto;" />

---
# the linear model

#### Same model, different `\(\Large x\)`

`$$\underbrace{\LARGE E[y_i] = 7 + 3.5 \times x_i}_{Deterministic}$$`

`$$\underbrace{\LARGE y_i \sim normal(E[y_i], \sigma=0.25)}_{Stochastic}$$`

<img src="04_t-test_files/figure-html/unnamed-chunk-17-1.png" width="288" style="display: block; margin: auto;" />

---
# the linear model

</br>

`$$\LARGE E[y_i] = \beta_0 + \beta_1 x_i$$`

--

#### When `\(\large x = 0\)`:

- `\(\large E[y] = \beta_0\)`

- What is the interpretation of `\(\large \beta_0\)`?

--

#### When `\(\large x = 1\)`:

- `\(\large E[y] = \beta_0 + \beta_1\)`

- What is the interpretation of `\(\large \beta_1\)`?

---
# fitting the model in `R`


```r
data(frogdata)

fit <- lm(Frogs ~ Development, data = frogdata)
broom::tidy(fit)
```

```
## # A tibble: 2 × 5
##   term            estimate std.error statistic       p.value
##   <chr>              <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)         19.2      1.87     10.3  0.00000000573
## 2 DevelopmentHigh    -13.1      2.64     -4.97 0.000100
```

--

To interpret this output, we need to know how `R` codes categorical predictors...

---
# categorical predictors in `R`

#### When you include a categorical predictor (factor or character object) in your model, `R` recodes it as a **dummy variable**

- For a predictor with two levels, one level gets coded as `\(\large 0\)` and the other as `\(\large 1\)`

- For character objects, the levels are ordered alphabetically

- For factors, the coding follows the order of the factor levels (see `levels()`); the first level becomes the reference level

--


```r
str(frogdata)
```

```
## 'data.frame':	20 obs. of  2 variables:
##  $ Frogs      : num  16 14 18 17 29 31 14 16 22 15 ...
##  $ Development: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
```
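--

If you need a different reference level, you can re-order the factor yourself. A minimal sketch (the copy `frogdata2` and the refit are illustrations, not output from these slides):


```r
## Make "High" the reference level instead of "Low"; relevel() moves the
## requested level to the front of the factor's levels
frogdata2 <- frogdata
frogdata2$Development <- relevel(frogdata2$Development, ref = "High")

levels(frogdata2$Development)  # "High" "Low": High is now coded as 0
```

Refitting the model with `frogdata2` would flip the interpretation: the intercept would be the high-development mean and the slope the low-minus-high difference.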
---
# model matrix

#### We can always check how `R` coded a linear model using the `model.matrix()` function

- the model matrix (also called the *design matrix*) will always have one row for every observation and one column for every parameter in the model


```r
### View rows 1, 2, 11, and 12 of the model matrix
model.matrix(fit)[c(1,2,11,12),]
```

```
##    (Intercept) DevelopmentHigh
## 1            1               0
## 2            1               0
## 11           1               1
## 12           1               1
```

--

To interpret this output (and see why it's useful), we need to review matrix algebra

---
# matrix multiplication

`$$\begin{bmatrix}
    a & b & c \\
    d & e & f \\
    g & h & i
\end{bmatrix} \;\; \times
\begin{bmatrix}
    x \\
    y \\
    z
\end{bmatrix} = \large
\begin{bmatrix}
    a\times x + b\times y + c\times z\\
    d\times x + e\times y + f\times z\\
    g\times x + h\times y + i\times z
\end{bmatrix}$$`

--

`$$\begin{bmatrix}
    1 & x_1 \\
    1 & x_2 \\
    1 & x_3
\end{bmatrix} \;\; \times
\begin{bmatrix}
    \beta_0 \\
    \beta_1
\end{bmatrix} = \large
\begin{bmatrix}
    \beta_0 \times 1 + \beta_1 \times x_1\\
    \beta_0 \times 1 + \beta_1 \times x_2\\
    \beta_0 \times 1 + \beta_1 \times x_3
\end{bmatrix}$$`

--

.pull-left[

```r
### View rows 1, 2, 11, and 12 of the model matrix
model.matrix(fit)[c(1,2,11,12),]
```

```
##    (Intercept) DevelopmentHigh
## 1            1               0
## 2            1               0
## 11           1               1
## 12           1               1
```
]

.pull-right[

```r
### View rows 1, 2, 11, and 12 of the data
frogdata[c(1,2,11,12),]
```

```
##    Frogs Development
## 1     16         Low
## 2     14         Low
## 11     2        High
## 12    11        High
```
]
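--

This multiplication is exactly how the model matrix gets used: multiplying it by the coefficient vector gives `\(E[y_i]\)` for every observation. A quick sketch you can run to confirm (not output shown in the original slides):


```r
## The model matrix times the coefficient vector reproduces lm's fitted
## values: beta0 for Low rows, beta0 + beta1 for High rows
X <- model.matrix(fit)
beta <- coef(fit)

head(X %*% beta)   # expected frog counts for the first 6 plots
head(fitted(fit))  # identical values
```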
---
# fitting the model in `R`


```r
data(frogdata)

fit2 <- lm(Frogs ~ Development, data = frogdata)
broom::tidy(fit2)
```

```
## # A tibble: 2 × 5
##   term            estimate std.error statistic       p.value
##   <chr>              <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)         19.2      1.87     10.3  0.00000000573
## 2 DevelopmentHigh    -13.1      2.64     -4.97 0.000100
```

--

- The `(Intercept)` estimate ( `\(\beta_0\)` ) is the expected number of frogs in a low-development plot

--

- The `DevelopmentHigh` estimate ( `\(\beta_1\)` ) is the *difference* between high-development and low-development plots (high minus low)

--

- **Question**: Are there fewer frogs in high-development plots?

---
# visualizing the linear model

<img src="04_t-test_files/figure-html/unnamed-chunk-24-1.png" width="504" style="display: block; margin: auto;" />

--

<img src="04_t-test_files/figure-html/unnamed-chunk-25-1.png" width="504" style="display: block; margin: auto;" />

---
# fitting the model by hand

In this case, we can also fit the model "by hand"

--

The parameter estimates are:


```r
mu_high <- mean(frogdata$Frogs[frogdata$Development == "High"])
mu_low <- mean(frogdata$Frogs[frogdata$Development == "Low"])

(beta_0.hat <- mu_low)
```

```
## [1] 19.2
```

```r
(beta_1.hat <- mu_high - mu_low)
```

```
## [1] -13.1
```

--

How do we calculate the standard errors?

---
# fitting the model by hand

How do we calculate the standard errors?

--

Because we have samples from two populations (low and high development), we have to use the *pooled* variance of both samples to calculate the correct standard errors:

`$$\large s^2_p = \frac{(n_L - 1)s^2_L + (n_H - 1)s^2_H}{n_L + n_H - 2}$$`

--

```
## [1] 34.81
```

---
# fitting the model by hand

How do we calculate the standard errors?

--

With the pooled variance calculated, we can calculate the standard errors. For the intercept:

`$$\large SE_{\beta_0} = \sqrt{\frac{s^2_p}{n_L}}$$`

```
## [1] 1.866
```

--

For the slope:

`$$\large SE_{\beta_1} = \sqrt{\frac{s^2_p}{n_L} + \frac{s^2_p}{n_H}}$$`

```
## [1] 2.638
```

--

What is the interpretation of these standard errors?
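--

The code behind the three outputs above is not shown; here is a minimal sketch consistent with the formulas (the object names are my own):


```r
## Pooled variance and standard errors, following the formulas above
low  <- frogdata$Frogs[frogdata$Development == "Low"]
high <- frogdata$Frogs[frogdata$Development == "High"]
n_L <- length(low)
n_H <- length(high)

## pooled variance
(s2_p <- ((n_L - 1) * var(low) + (n_H - 1) * var(high)) / (n_L + n_H - 2))  # 34.81

## standard errors for the intercept and the slope
(SE_beta0 <- sqrt(s2_p / n_L))               # 1.866
(SE_beta1 <- sqrt(s2_p / n_L + s2_p / n_H))  # 2.638
```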
---
# the linear model as a t-test

#### As before, the model we just fit has a corresponding "test" that you may encounter: the *two-sample t-test*

.small-code[

```r
t.test(Frogs ~ Development, data = frogdata)
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  Frogs by Development
## t = 5, df = 18, p-value = 1e-04
## alternative hypothesis: true difference in means between group Low and group High is not equal to 0
## 95 percent confidence interval:
##   7.554 18.646
## sample estimates:
##  mean in group Low mean in group High 
##               19.2                6.1
```
]

- Used to test whether the means of two populations are different

- Results match `lm` (note that `t.test` defaults to the Welch correction for unequal variances; set `var.equal = TRUE` to reproduce the `lm` output exactly)

---
# other ways to fit the model

#### By default, the linear model approach provides the mean of the reference group and the difference in the group means

--

- This version of the linear model formulation is usually referred to as the *effects parameterization*

- We can easily calculate the mean of group 2 from the linear model as `\(\beta_0 + \beta_1\)`

--

- We can also recode the model in `R` to provide the group means (the *means parameterization*)


```r
fit3 <- lm(Frogs ~ Development - 1, data = frogdata)
broom::tidy(fit3)
```

```
## # A tibble: 2 × 5
##   term            estimate std.error statistic       p.value
##   <chr>              <dbl>     <dbl>     <dbl>         <dbl>
## 1 DevelopmentLow      19.2      1.87     10.3  0.00000000573
## 2 DevelopmentHigh      6.1      1.87      3.27 0.00426
```

---
# other ways to fit the model

#### Challenge question: What do you think the model matrix looks like for the means parameterization of the model?

--


```r
model.matrix(fit3)[c(1,2,11,12),]
```

```
##    DevelopmentLow DevelopmentHigh
## 1               1               0
## 2               1               0
## 11              0               1
## 12              0               1
```

--

The key thing to recognize is that these alternative formulations (t-test, effects parameterization, means parameterization) provide the *exact same information*, just presented in different ways

- In other words, they do not change your conclusions!
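--

One way to convince yourself of that equivalence is to derive each quantity from the others. A small sketch (assuming `fit2` and `fit3` from the previous slides; the equal-variance t-test call is my addition):


```r
## Effects parameterization: intercept + slope = high-development mean
coef(fit2)["(Intercept)"] + coef(fit2)["DevelopmentHigh"]  # 6.1

## Means parameterization returns both group means directly
coef(fit3)  # 19.2 and 6.1

## A t-test assuming equal variances gives the same t-statistic as the lm
## slope (the sign flips because t.test takes Low minus High)
t.test(Frogs ~ Development, data = frogdata, var.equal = TRUE)$statistic  # 4.97
```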
---
# looking ahead

<br/>

### **Next time**: Null hypothesis testing

<br/>

### **Reading**: [Fieberg chp. 1.10](https://statistics4ecologists-v1.netlify.app/linreg.html#hypothesis-tests-and-p-values)