---
title: "Lab 6 Linear Regression 1"
author: "Matt Defenderfer"
date: "8/15/2019"
output:
  html_document:
    toc: TRUE
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Overall Progression for Lab 6:
- Load Packages and Data
- Simple Linear Regression
    a. Reporting the Regression
    b. Plotting the Regression
- Multiple Linear Regression
    a. Interactions
- Reporting a Linear Regression
## Package and Data Setup
For this lab we will be using the following packages:

- `tidyverse`: contains `dplyr` and `ggplot2`
- `broom`: contains the `tidy` function to convert a summary table to a data frame
- `knitr`: contains the `kable` function to export tidied summary tables to text
```{r, message=FALSE}
library(knitr)
library(broom)
library(tidyverse)
theme_set(theme_bw())
```
We will also be using a new dataset. This dataset is the Framingham Heart Study Dataset and can be loaded from the file "framingham_small_n50.RData". Download this file from Canvas if needed. It includes the following variables:
- `gender`: factor with 2 levels (male, female)
- `age`: numeric
- `currentSmoker`: factor, 2 levels (No, Yes)
- `cigsPerDay`: numeric
- `BPMeds`: whether they are taking medication for blood pressure. factor, 2 levels (No, Yes)
- `prevalentStroke`: has had a prior stroke. factor, 2 levels (No, Yes)
- `prevalentHyp`: has had hypertension. factor, 2 levels (No, Yes)
- `diabetes`: has diabetes. factor, 2 levels (No, Yes)
- `totChol`: total cholesterol. numeric
- `sysBP`: systolic blood pressure. numeric
- `diaBP`: diastolic blood pressure. numeric
- `BMI`: body mass index. numeric
- `heartRate`: numeric
- `glucose`: numeric
- `TenYearCHD`: patient developed coronary heart disease within 10 years after the initial visit. factor, 2 levels (No, Yes)
```{r}
load("/Users/Matthew/Google Drive/Grad School/GR 770 Statistics/R Labs/Data/framingham_n50.RData")
```
## Simple Linear Regression
For any sort of basic linear model, the command you will use is `lm`. `lm` has the following form:

```r
lm(formula, data, weights, ...)
```
- `formula`: a formula expressing the dependent variable as a function of the predictor variables. For example, if I wanted to create a model of y as a function of x, my formula would be `y ~ x`.
- `data`: the data frame the variables can be found in
- `weights`: in some instances, you may want the model to be influenced by certain data points more than others. You can assign each data point a weight here that determines how much it influences the final model. By default, `lm` weights all data points equally, and you should not change this without a good reason.
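To make the `weights` argument concrete, here is a minimal sketch on simulated data (all variable names here are made up for illustration), fitting the same model with and without down-weighting some points:

```{r}
# Sketch of the weights argument (simulated data; names are made up)
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 2 * d$x + rnorm(20, sd = 3)

# unweighted fit: every point counts equally
m.unweighted <- lm(y ~ x, data = d)

# weighted fit: down-weight the last five points so they barely influence it
w <- c(rep(1, 15), rep(0.1, 5))
m.weighted <- lm(y ~ x, data = d, weights = w)

coef(m.unweighted)
coef(m.weighted)
```

The two sets of coefficients differ slightly because the weighted fit mostly ignores the down-weighted points.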
`lm` creates a different type of variable called a model object. Assign the output of `lm` to a variable and then use the `summary` function to create a table with the test statistics and some other information from that model object.
In our dataset, we have a few variables we could use as our dependent variable. For this first example, we can model systolic blood pressure as a function of age.
```{r}
sysBP.m <- lm(sysBP ~ age, data = fhs)
summary(sysBP.m)
sysBP.m.tidy <- tidy(sysBP.m)
```
The output is a table listing the function call, a summary of the residual errors, a summary of the coefficients for the terms, and error and variance-explained measures. While all of these are important, the coefficients and the error calculations are of particular interest.
The Estimate column of the coefficients table lists the value of the intercept (the value of sysBP when age = 0) and the slope for age. You can see that baseline sysBP according to the model is `r round(sysBP.m.tidy$estimate[1],3)` when age is 0. For each year a person ages, sysBP will change by `r round(sysBP.m.tidy$estimate[2],3)`. The table also lists the SEMs for the Intercept and age as well as the t-value test statistic and corresponding p-values. According to this simple model, age is a significant predictor of systolic blood pressure.
In the error section, you can find the residual standard error (RSE) calculation and the corresponding degrees of freedom. There are also two different versions of R^2 beneath that. For this model, a solid `r round(summary(sysBP.m)$r.squared * 100,3)`% of the variance was explained by age. For simple linear models, you can use the RSE and R^2 calculations to help you determine whether your model is good or not. There aren't hard and fast rules for what constitutes a good model; it depends on your error tolerance, the noise associated with your data, etc.
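The RSE and R^2 values can also be pulled straight out of the summary object rather than read off the printed output; here is a minimal sketch on simulated data (the component names `sigma` and `r.squared` come from `summary.lm`):

```{r}
# Extracting RSE and R^2 from a model summary (simulated data)
set.seed(42)
d <- data.frame(age = sample(30:70, 50, replace = TRUE))
d$sysBP <- 100 + 0.9 * d$age + rnorm(50, sd = 10)

s <- summary(lm(sysBP ~ age, data = d))

s$sigma          # residual standard error (RSE)
s$df[2]          # residual degrees of freedom
s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared
```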
### Reporting a Linear Regression Table
The `broom` package, and `tidy` in particular, are nice tools for exporting the coefficient summary as a data frame and then to a csv if desired.

Using `tidy` is simple; all you really need to enter is the model object as input:
```{r}
sysBP.tidy <- tidy(sysBP.m)
sysBP.tidy
```
You can see the coefficient results in a neat table. Now, if you want to output this data frame as a formatted text table for copy-pasting into a document, you can use the `kable` function. At its most basic, you only need to input the tidy data frame. This would output as such:
```{r}
kable(sysBP.tidy)
```
NOTE: If you are viewing this in the html file, you will notice that the table is formatted very nicely. R Markdown does this automatically when creating the file; when run in the console, `kable` outputs plain text.
However, you can format the table a little more, such as changing the maximum number of digits and the column labels:
```{r}
kable(sysBP.tidy, digits = 4, col.names = c("Predictor","Estimate","SEM","t-value","p-value"))
```
This table would be a proper way to report results on the test if there is a question on linear regression.
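To actually get the tidied table out to a csv, as mentioned earlier, base R's `write.csv` works directly on the tidied data frame. A self-contained sketch on simulated data (the file name below is made up for illustration):

```{r}
library(broom)

# Tidy a model and export the coefficient table to a csv
# (the file name is made up for illustration)
set.seed(7)
d <- data.frame(age = sample(30:70, 50, replace = TRUE))
d$sysBP <- 100 + 0.9 * d$age + rnorm(50, sd = 10)

m.tidy <- tidy(lm(sysBP ~ age, data = d))
write.csv(m.tidy, "sysBP_coefficients.csv", row.names = FALSE)
```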
### Plotting the Regression
Plotting the regression for two variables is essentially the same as plotting a correlation from last lab. We will plot the points using `geom_point` and add a regression line using `geom_smooth` with the method set to `"lm"`.

We can plot the regression between sysBP and age here:
```{r}
ggplot(fhs, aes(x = age, y = sysBP)) +
  geom_point() +
  geom_smooth(method = "lm")
```
The one difference from the previous correlation plot is that we have included the SEM ribbon around the regression line this time; it is included by default.
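If you would rather not show the ribbon, `geom_smooth` has an `se` argument that turns it off:

```{r}
# Same plot, but with the SEM ribbon suppressed
ggplot(fhs, aes(x = age, y = sysBP)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```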
## Multiple Linear Regression
Although making a model using only one predictor variable is valid, it doesn't always tell the whole story. Adding extra predictor variables to the model will necessarily reduce RMSE and increase variance explained. To add a second term to the model, we can use the generic formula:

```r
y ~ x + z
```
So here, we can model sysBP as a function of both age and glucose
```{r}
sysBP.m.2vars <- lm(sysBP ~ glucose + age, data = fhs)
summary(sysBP.m.2vars)
```
You can see in the model summary that age and glucose are both significant predictors of sysBP. Also notice that the Estimates for the Intercept and age have changed from the simple model; coefficients adjust when new terms are added to the model.
RSE has dropped by `r summary(sysBP.m)$sigma - summary(sysBP.m.2vars)$sigma` and variance explained has increased by `r round((summary(sysBP.m.2vars)$r.squared - summary(sysBP.m)$r.squared)*100,3)`%. Generally, we can see if one model is better than another by looking at RSE and R^2, but we need to be careful of overfitting.
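This RSE/R^2 comparison can be done programmatically rather than by reading two printed summaries; here is a minimal sketch on simulated data with nested one- and two-predictor models:

```{r}
# Comparing RSE and R^2 between nested models (simulated data)
set.seed(10)
d <- data.frame(age = sample(30:70, 50, replace = TRUE),
                glucose = rnorm(50, mean = 85, sd = 15))
d$sysBP <- 90 + 0.6 * d$age + 0.4 * d$glucose + rnorm(50, sd = 8)

m1 <- lm(sysBP ~ age, data = d)
m2 <- lm(sysBP ~ age + glucose, data = d)

summary(m1)$sigma - summary(m2)$sigma          # drop in RSE
summary(m2)$r.squared - summary(m1)$r.squared  # gain in R^2
```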
## Reporting a Linear Regression
Reporting our results from the main effects model, not including an interaction term:
Multiple regression analysis was used to investigate whether glucose and age were significantly associated with systolic blood pressure. Together, age and glucose accounted for 40.23% of the variance ($R^2$ = 0.4023, p < 0.00001, N = 50). The partial regression coefficients for age and glucose were both significant [age: B = 0.5939 ± 0.2440, $\beta$ = 0.2865, t = 2.433, p = 0.019; glucose: B = 0.4615 ± 0.1110, $\beta$ = 0.4895, t = 4.158, p = 0.0001].
Explanation of the inserted variables:

- $R^2$: Multiple R-squared term, non-adjusted
- p: taken from the very bottom of the summary, next to the F-statistic
- N: sample size

For each independent variable:

- B: Estimate for that variable ± Std. Error for that variable
- $\beta$: standardized slope. See note below
- t: t value from the table for that variable
- p: p value from the table for that variable
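Each of these numbers can be pulled from the model object rather than retyped by hand. A sketch, assuming the two-predictor model `sysBP.m.2vars` fit earlier; the component names come from `summary.lm`:

```{r}
s <- summary(sysBP.m.2vars)

s$r.squared                   # R2, non-adjusted
nobs(sysBP.m.2vars)           # N
coef(s)["age", "Estimate"]    # B for age
coef(s)["age", "Std. Error"]  # its standard error
coef(s)["age", "t value"]     # t for age
coef(s)["age", "Pr(>|t|)"]    # p for age

# overall model p value, computed from the F statistic
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)
```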
### Note on Standardized Slope
Use the function `lm.beta` from the `QuantPsyc` package. All you need to do is pass in the `lm` object and it will return the standardized slopes for each variable:
```{r}
library(QuantPsyc)
lm.beta(sysBP.m.2vars)
```
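If `QuantPsyc` isn't available, standardized slopes can also be obtained with base R by z-scoring every variable with `scale` and refitting the model; a sketch on simulated data:

```{r}
# Standardized (beta) slopes without QuantPsyc: z-score everything, refit
set.seed(5)
d <- data.frame(age = sample(30:70, 50, replace = TRUE),
                glucose = rnorm(50, mean = 85, sd = 15))
d$sysBP <- 90 + 0.6 * d$age + 0.4 * d$glucose + rnorm(50, sd = 8)

d.z <- as.data.frame(scale(d))   # every column now has mean 0, sd 1
m.z <- lm(sysBP ~ age + glucose, data = d.z)

coef(m.z)[-1]  # standardized slopes for age and glucose
```

The intercept of the fully standardized model is 0 by construction, so only the slopes are of interest.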