$$y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad \epsilon_i \sim \text{Normal}(0, \sigma)$$

So far, we have learned about linear models that contain only a single predictor variable X:
1) Single continuous variable = "simple" regression (lecture 2)
2) Single categorical predictor w/ one level = one-sample t-test (lecture 4)
3) Single categorical predictor w/ two levels = two-sample t-test (lecture 4)
4) Single categorical predictor w/ > 2 levels = ANOVA (lecture 7)
More often than not, your models will need to contain more than one predictor
Models with more than one predictor go by many names (blocking, ANCOVA, factorial, etc.) but are all forms of multiple regression of the form:

$$y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i, \quad \epsilon_i \sim \text{Normal}(0, \sigma)$$

*Note that this model contains only two predictors, but multiple regression models often contain many predictors
Interpretation of the intercept β0, residual error ϵi, and residual standard deviation σ remains the same as before
Interpretation of the slope coefficients β1, β2, etc. changes slightly
Slopes are the expected change in the response variable yi for a unit change in the corresponding predictor variable while holding all other predictors constant
This is a subtle difference but an important one
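To see the difference, here is a minimal simulation sketch (invented data, not from the lecture): x1 and x2 are correlated, so the single-predictor slope absorbs part of x2's effect, while the multiple regression recovers each partial slope.

# Simulated example: two correlated predictors with true partial slopes 3 and -2
set.seed(123)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)              # x2 is correlated with x1
y  <- 1 + 3 * x1 - 2 * x2 + rnorm(n)

coef(lm(y ~ x1))       # slope on x1 is biased: it absorbs part of x2's effect
coef(lm(y ~ x1 + x2))  # slopes are close to the true partial effects (3 and -2)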
One of the most common reasons for using multiple regression models is to control for extraneous sources of variation (i.e., sources of variation that influence the response variable but are not of interest in and of themselves)
Why would we want to control for extraneous variation?
Perhaps we are raising desert tortoises (Gopherus agassizii) for release into the wild and are interested in whether different diets influence their weight gain
A tortoise's final weight, however, is not influenced only by diet. For example, it may also be influenced by its starting body size
As we will see, accurately measuring the effect of diet requires taking into account each individual's initial body size. Multiple regression allows us to do this
Let's start by fitting a model we've already learned about:
fit.lm <- lm(weight ~ diet, data = dietdata)
summary(fit.lm)
## 
## Call:
## lm(formula = weight ~ diet, data = dietdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -76.39 -19.29  -0.37  19.78  54.61 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   261.19       7.69   33.94   <2e-16 ***
## dietLow         4.56      10.88    0.42    0.677    
## dietMed        11.02      10.88    1.01    0.316    
## dietHigh       25.28      10.88    2.32    0.024 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.8 on 56 degrees of freedom
## Multiple R-squared:  0.0989, Adjusted R-squared:  0.0506 
## F-statistic: 2.05 on 3 and 56 DF,  p-value: 0.117
Conclusions?
Now let's fit a slightly different model:
fit.lm2 <- lm(weight ~ diet + length, data = dietdata)
summary(fit.lm2)
## 
## Call:
## lm(formula = weight ~ diet + length, data = dietdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38.25 -12.18  -2.53  12.15  49.23 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  191.364      8.862   21.59  < 2e-16 ***
## dietLow       13.727      6.878    2.00  0.05091 .  
## dietMed       25.226      6.974    3.62  0.00065 ***
## dietHigh      30.784      6.833    4.51  3.5e-05 ***
## length         2.789      0.297    9.38  5.2e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.6 on 55 degrees of freedom
## Multiple R-squared:  0.654, Adjusted R-squared:  0.628 
## F-statistic: 25.9 on 4 and 55 DF,  p-value: 4.14e-12
What changed?
$$y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i$$
In any statistical test, our goal is to detect a signal ( β ) in the presence of noise (residual variation ϵ)
Our ability to do that depends on the strength of the signal (usually beyond our control) and the amount of noise (partially within our control)
In the first model, where did all the variation in weight caused by variation in length end up?
We can see this clearly by looking at the ANOVA tables for the two models
anova(fit.lm)
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value Pr(>F)
## diet       3   5459    1820    2.05   0.12
## Residuals 56  49734     888
anova(fit.lm2)
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## diet       3   5459    1820    5.23   0.003 ** 
## length     1  30615   30615   88.07 5.2e-13 ***
## Residuals 55  19119     348                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note the residual sum of squares and mean square error of the two models
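You can also pull these quantities directly from the fitted models; a quick check using the models fit above:

# Residual sums of squares drop when length is added to the model
sum(resid(fit.lm)^2)   # ~49734 (diet only)
sum(resid(fit.lm2)^2)  # ~19119 (diet + length)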
When calculating the sums of squares for multiple regression models, the order that we write the model matters
We wrote the model formula with diet first and length second (weight ~ diet + length). R calculates the sums of squares for diet first, then length
If we wrote weight ~ length + diet, R calculates the length SS first, then the diet SS
This sequential method of calculation is called Type I sums of squares
anova(fit.lm2)
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## diet       3   5459    1820    5.23   0.003 ** 
## length     1  30615   30615   88.07 5.2e-13 ***
## Residuals 55  19119     348                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit.lm3 <- lm(weight ~ length + diet, data = dietdata)
anova(fit.lm3)
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## length     1  27882   27882   80.21 2.5e-12 ***
## diet       3   8192    2731    7.86 0.00019 ***
## Residuals 55  19119     348                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
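Type I SS amounts to comparing a sequence of nested models, adding one term at a time. A sketch of that equivalence (fm0 and the other object names here are just illustrative); each row of the output should match a row of the Type I table for fit.lm2:

# Sequential (Type I) SS as nested model comparisons
fm0     <- lm(weight ~ 1, data = dietdata)             # intercept only
fm.diet <- lm(weight ~ diet, data = dietdata)          # add diet
fm.full <- lm(weight ~ diet + length, data = dietdata) # add length
anova(fm0, fm.diet, fm.full)  # Sum of Sq column: ~5459 for diet, ~30615 for length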
Type I SS is only one of several ways of calculating sums of squares
Type II and Type III SS use slightly different methods that do not depend on order
For balanced designs, all three approaches will give the same answers
For unbalanced designs, the correct approach depends on the hypotheses being tested
Other software programs (SAS, SPSS) default to Type III and so may give different results than R (though all of these programs can calculate SS using each method)
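In R, Type II and Type III SS are most easily obtained from the car package (an assumption here: car is installed); a sketch:

library(car)  # install.packages("car") if needed

Anova(fit.lm2, type = "II")   # Type II: does not depend on the order of terms

# Type III tests are only meaningful with sum-to-zero contrasts
fit.sum <- lm(weight ~ diet + length, data = dietdata,
              contrasts = list(diet = contr.sum))
Anova(fit.sum, type = "III")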
In 1960, Herbert Stoddard established an experiment at Tall Timbers Research Station to study the effects of fire frequency on longleaf pine forests in the Red Hills region of north Florida and southwest Georgia. The experiment originally consisted of eighty-four 0.5-ha plots, each of which was assigned one of four treatments: a 1-year, 2-year, or 3-year fire interval, or unburned. Stoddard recognized that soil type likely influenced the response to fire, so treatment plots were allocated across the three primary soil types found at Tall Timbers
We will use a made-up data set from this real experiment, with the goal of understanding whether understory plant richness (measured as the number of understory vascular plants recorded on each plot) differs among burn treatments.
Fire frequency (number of species per plot):

| Replicate | Unburned | 1-year | 2-year | 3-year | Soil |
|---|---|---|---|---|---|
| 1 | 30 | 43 | 36 | 37 | Loamy sand |
| 2 | 28 | 37 | 30 | 27 | Sandy loam |
| 3 | 18 | 27 | 24 | 21 | Upland |
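For reference, a data frame matching the table could be built by hand as below (a sketch only; the firedata object loaded next is the version used for the analyses, and its values or factor coding may differ slightly):

# Hand-built version of the table above (hypothetical coding of factor levels)
firetab <- data.frame(
  species  = c(30, 43, 36, 37,    # replicate 1 (loamy sand)
               28, 37, 30, 27,    # replicate 2 (sandy loam)
               18, 27, 24, 21),   # replicate 3 (upland)
  interval = factor(rep(c("Unburned", "1", "2", "3"), times = 3)),
  soil     = factor(rep(c("loamy sand", "sandy loam", "upland"), each = 4))
)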
What are the experimental units? What are the observational units?
Fit the model with treatment only:
data(firedata)
fm1 <- lm(species ~ interval, data = firedata)
anova(fm1)
## Analysis of Variance Table
## 
## Response: species
##           Df Sum Sq Mean Sq F value Pr(>F)
## interval   3    173    57.6     1.1    0.4
## Residuals  8    419    52.4
Conclusions?
Because the soil types differ in a variety of ways that may influence plant communities, we probably want to control for this variability in our analysis
We do that by adding it to the model:
data(firedata)
fm2 <- lm(species ~ interval + soil, data = firedata)
anova(fm2)
## Analysis of Variance Table
## 
## Response: species
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## interval   3    173    57.6    13.4 0.00456 ** 
## soil       2    394   196.8    45.7 0.00023 ***
## Residuals  6     26     4.3                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusions?
Note that in this case, each treatment is applied exactly once within each soil type
Soil type itself is not an experimental treatment and is not replicated
This design is often referred to as a "randomized complete block" design (RCBD)
The soil types are called "blocks" and should be identified as a source of variability before the experiment is conducted
"Complete" refers to each treatment being represented in each block
"Randomized" means that treatments are assigned randomly to each experimental unit
Again, blocks must be identified during the design phase. You should not "search" for blocks after the fact (why?)
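Random assignment of treatments within blocks is easy to do in R; a sketch for the fire example (the object names and labels here are hypothetical):

# Randomly assign the four treatments to the four plots within each soil block
set.seed(42)
treatments <- c("Unburned", "1-year", "2-year", "3-year")
blocks     <- c("loamy sand", "sandy loam", "upland")
design <- data.frame(
  soil     = rep(blocks, each = 4),
  interval = as.vector(replicate(3, sample(treatments)))  # shuffled within block
)
design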
As these examples show, one reason to use a multiple regression model is to increase power to detect treatment effects
Remember that $F = \frac{MS_{\text{treatment}}}{MS_{\text{error}}}$
By explaining some of the residual variation, including additional predictors in the model reduces $MS_{\text{error}}$ and thereby increases F (a quick numeric check follows this list)
When one predictor is a factor and the other(s) is continuous, this model is often referred to as an ANCOVA (Analysis of Covariance)
When both predictors are factors and each treatment is included within each block, this model is often referred to as a randomized complete block design
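A quick numeric check of the F-statistics, using the mean squares from the two tortoise ANOVA tables above:

# F = MS_treatment / MS_error
1820 / 888  # diet in fit.lm  (weight ~ diet):          F ~ 2.05
1820 / 348  # diet in fit.lm2 (weight ~ diet + length): F ~ 5.23
# Same signal (MS_diet = 1820); less noise once length explains some variation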
In experimental settings, ANCOVA and blocking are used when there is extraneous variation that we cannot control during the design phase
If we are only interested in the treatment effect and not the blocks/continuous predictor, the ANOVA table (possibly in combination with multiple comparisons) is often sufficient for interpreting the model
However, in observational studies, we are often interested in the effects of both predictors
In this case, it helps to understand the model structure and parameter interpretation in more detail
Understanding how multiple regression models are structured will also help when you need to include different combinations of factors and continuous predictors
Luckily, you already have the tools for exploring these models in detail...
broom::tidy(fit.lm2)
## # A tibble: 5 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   191.       8.86      21.6  1.56e-28
## 2 dietLow        13.7      6.88       2.00 5.09e- 2
## 3 dietMed        25.2      6.97       3.62 6.48e- 4
## 4 dietHigh       30.8      6.83       4.51 3.50e- 5
## 5 length          2.79     0.297      9.38 5.16e-13
Intercept = predicted weight of the control group when length = 0

dietLow = difference between control and low diet

dietMed = difference between control and medium diet

dietHigh = difference between control and high diet

length = predicted increase in weight for a one unit increase in length
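One way to make these differences concrete is to predict the weight of each diet group at the average length; a sketch using predict() (this assumes diet is stored as a factor in dietdata, and nd is just an illustrative name):

# Predicted weight for each diet group, holding length at its mean
nd <- expand.grid(diet   = levels(dietdata$diet),
                  length = mean(dietdata$length))
cbind(nd, weight = predict(fit.lm2, newdata = nd))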
summary(fm2)
## 
## Call:
## lm(formula = species ~ interval + soil, data = firedata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5000 -0.9167  0.0833  0.9375  2.2500 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       32.08       1.47   21.87  6.0e-07 ***
## interval1         10.33       1.69    6.10  0.00088 ***
## interval2          4.67       1.69    2.75  0.03310 *  
## interval3          2.67       1.69    1.57  0.16656    
## soilsandy loam    -6.25       1.47   -4.26  0.00532 ** 
## soilupland       -14.00       1.47   -9.54  7.6e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.07 on 6 degrees of freedom
## Multiple R-squared:  0.956, Adjusted R-squared:  0.92 
## F-statistic: 26.3 on 5 and 6 DF,  p-value: 0.000518
This scenario (two factors) is a little more tricky. But, as before, we'll use the design matrix to help understand what each parameter means
model.matrix(fm2)
##    (Intercept) interval1 interval2 interval3 soilsandy loam soilupland
## 1            1         0         0         0              1          0
## 2            1         0         0         0              0          0
## 3            1         0         0         0              0          1
## 4            1         1         0         0              1          0
## 5            1         1         0         0              0          0
## 6            1         1         0         0              0          1
## 7            1         0         1         0              1          0
## 8            1         0         1         0              0          0
## 9            1         0         1         0              0          1
## 10           1         0         0         1              1          0
## 11           1         0         0         1              0          0
## 12           1         0         0         1              0          1
## attr(,"assign")
## [1] 0 1 1 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$interval
## [1] "contr.treatment"
## 
## attr(,"contrasts")$soil
## [1] "contr.treatment"
Intercept = predicted number of species in the unburned treatment in loamy sand

interval1 = difference between unburned and 1-year fire interval

interval2 = difference between unburned and 2-year fire interval

interval3 = difference between unburned and 3-year fire interval

soilsandy loam = difference between loamy sand and sandy loam soils

soilupland = difference between loamy sand and upland soils

Again, because there is no interaction between interval and soil, the model assumes that the effects of fire frequency are the same in every soil type (and that the soil differences are the same for every fire frequency)
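Under this additive structure, any prediction is just the baseline plus the relevant offsets. A quick check using the coefficients above (the factor level names are inferred from the coefficient labels, so treat them as assumptions):

# Predicted richness for a 1-year interval plot on upland soil
32.08 + 10.33 - 14.00  # intercept + interval1 + soilupland = ~28.4 species

# The same prediction from the fitted model (assumed level names)
predict(fm2, newdata = data.frame(interval = "1", soil = "upland"))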
We have looked at multiple regression examples with two factors and with one factor and one continuous covariate, but what about multiple continuous covariates?
The biomassdata object contains (made-up) biomass measurements (kg) as a function of rainfall (mm) and elevation (km)
data("biomassdata")head(biomassdata)
##   biomass rainfall elevation
## 1   96.12    166.0    4.4397
## 2  122.89    150.3    4.9462
## 3   69.14    195.3    5.8502
## 4   47.02    192.0    0.5679
## 5   20.56    211.6    2.5554
## 6   10.77    218.0    1.0106
fit.lm <- lm(biomass ~ rainfall + elevation, data = biomassdata)
broom::tidy(fit.lm)
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   320.      4.11        78.0 2.31e-51
## 2 rainfall       -1.44    0.0217     -66.4 3.97e-48
## 3 elevation       3.97    0.241       16.5 3.77e-21
Intercept = predicted biomass when rainfall = 0 and elevation = 0

rainfall = predicted change in biomass for a 1mm increase in rainfall while holding elevation constant

elevation = predicted change in biomass for a 1km increase in elevation while holding rainfall constant
When fitting models to continuous covariates, it is common to center or standardize covariates.
Centering is done by subtracting the mean
biomassdata$elevation.c <- biomassdata$elevation - mean(biomassdata$elevation)
biomassdata$rainfall.c  <- biomassdata$rainfall - mean(biomassdata$rainfall)
When interpreting centered data:
positive values indicate observations larger than the mean
negative values indicate observations smaller than the mean
units don't change
fit.lm <- lm(biomass ~ rainfall.c + elevation.c, data = biomassdata)
broom::tidy(fit.lm)
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    62.0     0.414      150.  1.24e-64
## 2 rainfall.c     -1.44    0.0217     -66.4 3.97e-48
## 3 elevation.c     3.97    0.241       16.5 3.77e-21
Intercept = predicted biomass when rainfall and elevation are at their means

rainfall.c = predicted change in biomass for a 1mm increase in rainfall while holding elevation constant

elevation.c = predicted change in biomass for a 1km increase in elevation while holding rainfall constant
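Standardizing goes one step further than centering: divide the centered covariate by its standard deviation, so that slopes become the predicted change per one standard deviation of the predictor. A sketch (the .s names are just illustrative):

# Standardize: subtract the mean AND divide by the standard deviation
biomassdata$rainfall.s  <- as.numeric(scale(biomassdata$rainfall))
biomassdata$elevation.s <- as.numeric(scale(biomassdata$elevation))

# Slopes are now per 1-SD change in each covariate (units of biomass per SD)
fit.lm.s <- lm(biomass ~ rainfall.s + elevation.s, data = biomassdata)
broom::tidy(fit.lm.s)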