class: center, middle, inverse, title-slide

.title[
# LECTURE 5: linear models part 1: categorical predictor w/ 2 levels
]
.subtitle[
## FANR 6750 (Experimental design)
]
.author[
### Fall 2025
]

---
class: inverse

# outline

<br/>

#### 1) Categorical predictor with 1 level (linear model vs t-test formulations)

<br/>

--

#### 2) Categorical predictor with 2 levels (linear model vs t-test formulations)

<br/>

--

#### 3) Dummy variables

<br/>

---
# context, part 1

#### We are interested in whether a certain brand of scale provides accurate mass measurements

--

Hypothesis

> Measurement error, measured as the difference between the mass indicated by the scale and the true mass of an object, is 0 (on average)

--

<img src="05_t-test_files/figure-html/unnamed-chunk-1-1.png" width="432" style="display: block; margin: auto;" />

---
# context, part 1

Field procedure:

- randomly sample 10 scales of the same make and model

- weigh a test object (of known mass) on each scale and record the difference between the measured and true mass

--

Data:

`-0.062, -0.38, 0.85, -0.58, 0.53, 0.09, 0.31, 0.77, 0.59, -0.17`

<img src="05_t-test_files/figure-html/unnamed-chunk-2-1.png" width="432" style="display: block; margin: auto;" />

---
# the linear model

`$$\LARGE y_i = \beta_0 + \epsilon_i$$`

`$$\LARGE \epsilon_i \sim normal(0, \sigma)$$`

--

- `\(\large \beta_0\)` is the expected (or average) measurement error across all scales

- `\(\epsilon_i\)` is the residual error associated with each measurement

--

- This is the simplest linear model we can construct

--

- If our hypothesis is correct, `\(\large \beta_0 = 0\)`

--

- Question: What is the value of `\(\large \beta_0\)` if our hypothesis is wrong?
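---
# simulating from the linear model

One way to build intuition for this model is to simulate from it. Below is a minimal sketch (the values of `\(\beta_0\)` and `\(\sigma\)` are made up purely for illustration) that draws 10 measurement errors and estimates `\(\beta_0\)` with the sample mean:

```r
set.seed(123)  # hypothetical seed, for reproducibility only

beta0_true <- 0.2   # assumed "true" average measurement error
sigma_true <- 0.5   # assumed residual standard deviation

# y_i = beta_0 + epsilon_i, with epsilon_i ~ normal(0, sigma)
y_sim <- beta0_true + rnorm(10, mean = 0, sd = sigma_true)

mean(y_sim)  # the sample mean is our estimate of beta_0
```

Re-running these lines with different seeds shows how much the estimate bounces around from sample to sample, which is exactly the variation that the standard error quantifies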
---
# summary statistics

.small-code[

```r
y <- c(-0.062, -0.38, 0.85, -0.58, 0.53, 0.09, 0.31, 0.77, 0.59, -0.17)
```
]

--

`\(\Large \bar{y}\)`

```r
mean(y)
```

```
## [1] 0.1948
```

--

`\(\Large s\)`

```r
sd(y)
```

```
## [1] 0.4925
```

--

SE

```r
sd(y)/sqrt(10)
```

```
## [1] 0.1557
```

---
# visualizing the linear model

<br/>

<img src="05_t-test_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto;" />

---
# visualizing the linear model

<br/>

<img src="05_t-test_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" />

---
# the linear model

#### Because this is a linear model, we can fit it using the `lm` function:

```r
fit1 <- lm(y ~ 1)
broom::tidy(fit1)
```

```
## # A tibble: 1 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    0.195     0.156      1.25   0.243
```

--

- The `(Intercept)` estimate corresponds to `\(\large \beta_0\)` and is simply the mean value of `\(\large y\)`

--

- The standard error is calculated simply as `\(\large \frac{s}{\sqrt{n}}\)`

--

- We will discuss whether there is evidence that `\(\beta_0 \neq 0\)` in the next lecture

---
# the linear model as a t-test

#### The model we just fit has a corresponding "test" that you may encounter: the *one-sample t-test*

.pull-left[
- One-sample t-tests are used to test whether the mean of a population differs from some value (in this case 0)

- `R` has a built-in function, `t.test()`, for fitting t-tests

- the estimated mean (along with the SE, t-statistic, p-value, and confidence interval) is exactly the same in both approaches
]

.pull-right[

```r
t.test(y, mu = 0)
```

```
## 
## 	One Sample t-test
## 
## data:  y
## t = 1.3, df = 9, p-value = 0.2
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.1575  0.5471
## sample estimates:
## mean of x 
##    0.1948
```
]

---
# the one-sample t-test by hand

#### Because this is such a simple model, we can calculate `\(\large \beta_0\)` and the standard error by hand very easily

```r
(beta0 <- mean(y))
```

```
## [1] 0.1948
```

```r
(SE <- sd(y)/sqrt(10))
```

```
## [1] 0.1557
```

--

Again, we will discuss how to use this information to determine whether `\(\large \beta_0\)` is or is not equal to 0 in the next lecture
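---
# the confidence interval by hand

We can also reconstruct the 95% confidence interval reported by `t.test()` from these two quantities. A minimal sketch (`t_crit` is an illustrative name) uses the critical value of the t distribution with `\(n - 1 = 9\)` degrees of freedom:

```r
# critical value for a 95% confidence interval, 9 degrees of freedom
t_crit <- qt(0.975, df = 9)

# beta0 +/- t_crit * SE gives the 95% confidence interval
beta0 + c(-1, 1) * t_crit * SE
```

This should match the interval from `t.test(y, mu = 0)` on the previous slide (about -0.158 to 0.547)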
---
class: inverse

# context, part 2

#### We are interested in determining how urban development influences the abundance of the strawberry poison-dart frog (*Oophaga pumilio*)

.left-column[
</br>
<img src="fig/frog2.png" width="2016" style="display: block; margin: auto;" />
]

.right-column[
Hypothesis

> Strawberry frog abundance will differ between islands with high urban development and islands with low urban development

Field procedure:

- `\(\large n=10\)` plots are established on one high-development island and one low-development island

- Within each plot, strawberry frogs are counted within 5 randomly-located quadrats
]

---
# context, part 2

.left-column[
</br>
<img src="fig/frog.png" width="80%" style="display: block; margin: auto;" />
]

.right-column[
#### Data:

`Low development: 16, 14, 18, 17, 29, 31, 14, 16, 22, 15`

`High development: 2, 11, 6, 8, 0, 3, 19, 1, 6, 5`

<img src="05_t-test_files/figure-html/unnamed-chunk-14-1.png" width="468" style="display: block; margin: auto;" />
]

---
# quick review

#### A "simple" linear model

`$$\underbrace{\LARGE E[y_i] = 7 + 3.5 \times x_i}_{Deterministic}$$`

`$$\underbrace{\LARGE y_i \sim normal(E[y_i], \sigma=0.25)}_{Stochastic}$$`

<img src="05_t-test_files/figure-html/unnamed-chunk-15-1.png" width="432" style="display: block; margin: auto;" />

---
# the linear model

#### Same model, different `\(\Large x\)`

`$$\underbrace{\LARGE E[y_i] = 7 + 3.5 \times x_i}_{Deterministic}$$`

`$$\underbrace{\LARGE y_i \sim normal(E[y_i], \sigma=0.25)}_{Stochastic}$$`

<img src="05_t-test_files/figure-html/unnamed-chunk-16-1.png" width="288" style="display: block; margin: auto;" />

---
# the linear model

#### Same model, different `\(\Large x\)`

`$$\underbrace{\LARGE E[y_i] = 7 + 3.5 \times x_i}_{Deterministic}$$`

`$$\underbrace{\LARGE y_i \sim normal(E[y_i], \sigma=0.25)}_{Stochastic}$$`

<img src="05_t-test_files/figure-html/unnamed-chunk-17-1.png" width="288" style="display: block; margin: auto;" />

---
# the linear model

</br>

`$$\LARGE E[y_i] = \beta_0 + \beta_1 x_i$$`

--

#### When `\(\large x = 0\)`:

- `\(\large E[y] = \beta_0\)`

- What is the interpretation of `\(\large \beta_0\)`?

--

#### When `\(\large x = 1\)`:

- `\(\large E[y] = \beta_0 + \beta_1\)`

- What is the interpretation of `\(\large \beta_1\)`?

---
# fitting the model in `R`

```r
data(frogdata)
fit <- lm(Frogs ~ Development, data = frogdata)
broom::tidy(fit)
```

```
## # A tibble: 2 × 5
##   term            estimate std.error statistic       p.value
##   <chr>              <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)         19.2      1.87     10.3  0.00000000573
## 2 DevelopmentHigh    -13.1      2.64     -4.97 0.000100
```

--

To interpret this output, we need to know how `R` codes categorical predictors...

---
# categorical predictors in `R`

#### When you include a categorical predictor (factor or character object) in your model, `R` recodes it as a **dummy variable**

- For a predictor with two levels, one level (the *reference level*) gets coded as `\(\large 0\)` and the other as `\(\large 1\)`

- For character objects and factors with default levels, the level order is alphabetical

- For factors with explicitly specified levels, that ordering is used instead

--

```r
str(frogdata)
```

```
## 'data.frame':	20 obs. of  2 variables:
##  $ Frogs      : num  16 14 18 17 29 31 14 16 22 15 ...
##  $ Development: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
```
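---
# categorical predictors in `R`

To see the dummy coding directly, `model.matrix()` returns the design matrix that `lm()` actually uses, and `relevel()` changes the reference level. A brief sketch (`Development2` is an illustrative name, not part of `frogdata`):

```r
# design matrix: an intercept column of 1s plus a 0/1 dummy
# variable (0 = Low development, 1 = High development)
model.matrix(~ Development, data = frogdata)

# make "High" the reference level instead of "Low" and refit
frogdata$Development2 <- relevel(frogdata$Development, ref = "High")
lm(Frogs ~ Development2, data = frogdata)
```

Releveling flips the sign of the slope (it becomes the low-minus-high difference) but does not change the model fit itself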
---
# fitting the model in `R`

```r
data(frogdata)
fit2 <- lm(Frogs ~ Development, data = frogdata)
broom::tidy(fit2)
```

```
## # A tibble: 2 × 5
##   term            estimate std.error statistic       p.value
##   <chr>              <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)         19.2      1.87     10.3  0.00000000573
## 2 DevelopmentHigh    -13.1      2.64     -4.97 0.000100
```

--

- The `(Intercept)` estimate ( `\(\beta_0\)` ) is the expected number of frogs in a low-development plot

--

- The `DevelopmentHigh` estimate ( `\(\beta_1\)` ) is the *difference* in the expected number of frogs between high-development and low-development plots

--

- **Question**: Are there fewer frogs in high-development plots?

---
# visualizing the linear model

<img src="05_t-test_files/figure-html/unnamed-chunk-21-1.png" width="504" style="display: block; margin: auto;" />

--

<img src="05_t-test_files/figure-html/unnamed-chunk-22-1.png" width="504" style="display: block; margin: auto;" />

---
# fitting the model by hand

In this case, we can also fit the model "by hand"

--

The parameter estimates are:

```r
mu_high <- mean(frogdata$Frogs[frogdata$Development == "High"])
mu_low <- mean(frogdata$Frogs[frogdata$Development == "Low"])

(beta_0.hat <- mu_low)
```

```
## [1] 19.2
```

```r
(beta_1.hat <- mu_high - mu_low)
```

```
## [1] -13.1
```

--

How do we calculate the standard errors?

---
# fitting the model by hand

How do we calculate the standard errors?

--

Because we have samples from two populations (low and high development), we have to use the *pooled* variance of both samples to calculate the correct standard errors:

`$$\large s^2_p = \frac{(n_L - 1)s^2_L + (n_H - 1)s^2_H}{n_L + n_H - 2}$$`

--

```
## [1] 34.81
```

---
# fitting the model by hand

How do we calculate the standard errors?

--

With the pooled variance in hand, we can calculate the standard errors. For the intercept:

`$$\large SE_{\beta_0} = \sqrt{\frac{s^2_p}{n_L}}$$`

```
## [1] 1.866
```

--

For the slope:

`$$\large SE_{\beta_1} = \sqrt{\frac{s^2_p}{n_L} + \frac{s^2_p}{n_H}}$$`

```
## [1] 2.638
```

--

What is the interpretation of these standard errors?
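---
# fitting the model by hand

The previous two slides show only the results of these calculations. A minimal sketch of the `R` code behind them (variable names here are illustrative) follows the formulas directly:

```r
# sample sizes and sample variances for each group
n_L <- 10
n_H <- 10
s2_L <- var(frogdata$Frogs[frogdata$Development == "Low"])
s2_H <- var(frogdata$Frogs[frogdata$Development == "High"])

# pooled variance
(s2_p <- ((n_L - 1) * s2_L + (n_H - 1) * s2_H) / (n_L + n_H - 2))

# standard error of the intercept (mean of the low-development group)
(SE_beta0 <- sqrt(s2_p / n_L))

# standard error of the slope (difference between the group means)
(SE_beta1 <- sqrt(s2_p / n_L + s2_p / n_H))
```

These should reproduce the values above (34.81, 1.866, and 2.638), which match the standard errors reported by `lm()`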
---
# the linear model as a t-test

#### As before, the model we just fit has a corresponding "test" that you may encounter: the *two-sample t-test*

.small-code[

```r
t.test(Frogs ~ Development, data = frogdata)
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  Frogs by Development
## t = 5, df = 18, p-value = 1e-04
## alternative hypothesis: true difference in means between group Low and group High is not equal to 0
## 95 percent confidence interval:
##   7.554 18.646
## sample estimates:
##  mean in group Low mean in group High 
##               19.2                6.1
```
]

- Used to test whether the means of two populations are different

- Results are exactly the same as `lm`

---
# other ways to fit the model

#### By default, the linear model approach provides the mean of the reference group and the difference in the group means

--

- This version of the linear model formulation is usually referred to as the *effects parameterization*

- We can easily calculate the mean of group 2 from the linear model as `\(\beta_0 + \beta_1\)`

--

- We can also recode the model in `R` to provide the group means (the *means parameterization*)

```r
fit3 <- lm(Frogs ~ Development - 1, data = frogdata)
broom::tidy(fit3)
```

```
## # A tibble: 2 × 5
##   term            estimate std.error statistic       p.value
##   <chr>              <dbl>     <dbl>     <dbl>         <dbl>
## 1 DevelopmentLow      19.2      1.87     10.3  0.00000000573
## 2 DevelopmentHigh      6.1      1.87      3.27 0.00426
```

---
# summary

- The simplest linear model we can construct tests whether the mean of a sample is different from some null value (usually 0)

    + This model is the same as a one-sample t-test

--

- The same linear model can be extended to test whether the means of two groups differ from each other

    + The `lm()` function uses dummy variables, i.e., coding one treatment level as 0 and the other as 1

    + By default, the intercept is the mean of the reference group and the slope parameter is the difference between the group means

    + This model is the same as a two-sample t-test

--

- The alternative ways to fit this model (t-test, effects parameterization, means parameterization) provide the *exact same information*

    + In other words, they do not change your conclusions!

---
# looking ahead

<br/>

### **Next time**: Null hypothesis testing

<br/>

### **Reading**: [Fieberg chp. 1.10](https://statistics4ecologists-v3.netlify.app/01-linearregression#hypothesis-tests-and-p-values)