---
title: "Lab 6 Linear Regression 1"
author: "Matt Defenderfer"
date: "8/15/2019"
output: 
  html_document:
    toc: TRUE
---

knitr::opts_chunk$set(echo = TRUE)

Overall Progression for Lab 6:

  1. Load Packages and Data
  2. Simple Linear Regression
     a. Reporting the Regression
     b. Plotting the Regression
  3. Multiple Linear Regression
     a. Interactions
  4. Reporting a Linear Regression

Package and Data Setup

For this lab we will be using the following packages:

  1. tidyverse: contains dplyr and ggplot2
  2. broom: contains the tidy function to convert a summary table to a data frame
  3. knitr: contains the kable function to export tidied summary tables to text
  4. QuantPsyc: contains the lm.beta function for standardized slopes (loaded later, in the note at the end of the lab)
library(knitr)
library(broom)
library(tidyverse)

theme_set(theme_bw())

We will also be using a new dataset. This dataset comes from the Framingham Heart Study and can be loaded from the file "framingham_n50.RData". Download this file from Canvas if needed. It includes the following variables:

  1. gender: factor with 2 levels (male, female)
  2. age: numeric
  3. currentSmoker: factor, 2 levels (No, Yes)
  4. cigsPerDay: numeric
  5. BPMeds: whether they are taking medication for blood pressure. factor, 2 levels (No, Yes)
  6. prevalentStroke: has had a prior stroke. factor, 2 levels (No, Yes)
  7. prevalentHyp: has had hypertension. factor, 2 levels (No, Yes)
  8. diabetes: has diabetes. factor, 2 levels (No, Yes)
  9. totChol: total cholesterol. numeric
  10. sysBP: systolic blood pressure. numeric
  11. diaBP: diastolic blood pressure. numeric
  12. BMI: body mass index. numeric
  13. heartRate: numeric
  14. glucose: numeric
  15. TenYearCHD: Patient developed coronary heart disease within 10 years after initial visit. factor, 2 levels (No, Yes)
load("/Users/Matthew/Google Drive/Grad School/GR 770 Statistics/R Labs/Data/framingham_n50.RData")

Simple Linear Regression

For any sort of basic linear model, the command you will use is lm. lm has the following form:

lm(formula, data, weights, ...)

  • formula: a formula expressing the dependent variable as a function of the predictor variables. For example, if I wanted to create a model of y as a function of x, my formula would be y ~ x.
  • data: the data frame the variables can be found in
  • weights: in some instances, you may want the model to be influenced more by certain data points than by others. You can assign each data point a weight here that determines how much it influences the final model. By default, lm weights all data points equally, and you should not change this without a good reason (see the sketch after this list).
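
Purely for illustration, a weighted fit would look like the following. The weights here are random numbers with no scientific meaning; in practice you would use something principled, such as inverse measurement variance:

wts <- runif(nrow(fhs))                      # made-up weights, illustration only
lm(sysBP ~ age, data = fhs, weights = wts)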

lm returns a different kind of object, called a model object. Assign the output of lm to a variable and then use the summary function to create a table with the test statistics and some other information from that model object.

In our dataset, we have a few variables we could use as our dependent variable. For this first example, we can model systolic blood pressure as a function of age.

sysBP.m <- lm(sysBP ~ age, data = fhs)
summary(sysBP.m)
sysBP.m.tidy <- tidy(sysBP.m)

The output is a table listing the function call, a summary of the residuals, a summary of the coefficients for the terms, and error and variance-explained measures. While all of these are important, the ones of particular interest are the coefficients and the error calculations.

The Estimate column of the coefficients table lists the value of the intercept (the value of sysBP when age = 0) and the slope for age. You can see that baseline sysBP according to the model is `r round(sysBP.m.tidy$estimate[1],3)` when age is 0. For each year a person ages, sysBP changes by `r round(sysBP.m.tidy$estimate[2],3)`. The table also lists the SEMs for the Intercept and age, as well as the t-value test statistic and corresponding p-values. According to this simple model, age is a significant predictor of systolic blood pressure.
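
As a quick sketch of putting those coefficients to use, we can predict sysBP for a hypothetical 50-year-old (the age value is arbitrary):

predict(sysBP.m, newdata = data.frame(age = 50))
coef(sysBP.m)[1] + coef(sysBP.m)[2]*50   # the same prediction by hand: intercept + slope * 50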

In the error section, you can find the residual standard error (RSE) and the corresponding degrees of freedom. There are also two different versions of R^2 beneath that. For this model, a solid `r round(summary(sysBP.m)$r.squared * 100,3)`% of the variance is explained by age. For simple linear models, you can use the RSE and R^2 calculations to help you judge whether your model is good or not. There aren't hard and fast rules for what constitutes a good model; it depends on your error tolerance, the noise associated with your data, etc.
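
These error measures can also be pulled out of the summary object directly, which will be handy when we compare models later in this lab:

s <- summary(sysBP.m)
s$sigma          # residual standard error (RSE)
s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared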

Reporting a Linear Regression Table

The broom package, and its tidy function in particular, makes it easy to export the coefficient summary as a data frame and then to a csv if desired.

Using tidy is simple: all you really need to enter is the model object as input:

sysBP.tidy <- tidy(sysBP.m)
sysBP.tidy
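
If you do want that csv as well, write.csv gets it done (the filename here is just an example):

write.csv(sysBP.tidy, "sysBP_coefficients.csv", row.names = FALSE)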

You can see the coefficient results in a neat table. Now, if you want to output this data frame as a formatted text table for copy-pasting into a document, you can use the kable function. At its most basic, you only need to input the tidy data frame:

kable(sysBP.tidy)

NOTE: If you are viewing this in the HTML file, you will notice that the table is formatted very nicely. This is done automatically by R Markdown when the file is knit; run in the console, kable outputs plain text.

However, you can format the table a little more, such as changing the maximum number of digits and the column labels:

kable(sysBP.tidy, digits = 4, col.names = c("Predictor","Estimate","SEM","t-value","p-value"))

This table would be a proper way to report results on the test if there is a question on linear regression.

Plotting the Regression

Plotting the regression for two variables is essentially the same as plotting a correlation from last lab. We will plot the points using geom_point and add a regression line using geom_smooth with method = "lm".

We can plot the regression between sysBP and age here:

ggplot(fhs, aes(x = age, y = sysBP)) +
  geom_point() +
  geom_smooth(method = "lm")

The one difference here vs. the previous correlation plot is that we have included the standard-error ribbon around the regression line this time; geom_smooth includes it by default.
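
If you prefer the bare regression line, the ribbon can be turned off with se = FALSE:

ggplot(fhs, aes(x = age, y = sysBP)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)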

Multiple Linear Regression

Although making a model using only one predictor variable is valid, it doesn't always tell the whole story. Adding extra predictor variables can never decrease the variance explained (R^2) and will usually reduce the RSE. To add a second term to the model, we can use the generic formula

y ~ x + z

So here, we can model sysBP as a function of both age and glucose

sysBP.m.2vars <- lm(sysBP ~ glucose + age, data = fhs)
summary(sysBP.m.2vars)

You can see in the model summary that both age and glucose are significant predictors of sysBP. Also notice that the Estimates for both the Intercept and age have changed from the simple model; coefficients adjust when new terms are added to the model.

RSE has dropped by `r round(summary(sysBP.m)$sigma - summary(sysBP.m.2vars)$sigma, 3)` and variance explained has increased by `r round((summary(sysBP.m.2vars)$r.squared - summary(sysBP.m)$r.squared)*100,3)` percentage points. Generally, we can see if one model is better than another by looking at RSE and R^2, but we need to be careful of overfitting.
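
A more formal way to compare two nested models, beyond eyeballing RSE and R^2, is an F-test via anova. This isn't required for the lab write-up, just a sketch:

anova(sysBP.m, sysBP.m.2vars)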

Reporting a Linear Regression

Here is an example of reporting results from the main-effects model, not including an interaction term:

Multiple regression analysis was used to investigate whether glucose and age were significantly associated with systolic blood pressure. Together, age and glucose accounted for 40.23% of the variance (R^2 = 0.4023, p < 0.00001, N = 50). The partial regression coefficients for age and glucose were both significant [age: B = 0.5939 ± 0.2440, $\beta$ = 0.2865, t = 2.433, p = 0.019; glucose: B = 0.4615 ± 0.1110, $\beta$ = 0.4895, t = 4.158, p = 0.0001].

Explanation of the inserted variables:

  • R^2: Multiple R-squared term, non-adjusted
  • p: taken from the very bottom of the summary output, next to the F-statistic
  • N: sample size

For each independent variable:

  • B: Estimate for that variable ± Std. Error for that variable
  • $\beta$: standardized slope. See the note below
  • t: t value from the table for that variable
  • p: p value from the table for that variable
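
Most of these numbers can be pulled programmatically instead of read off the summary by eye. As a sketch, broom's glance function (broom is already loaded) collects the model-level statistics, and tidy collects the per-term ones:

glance(sysBP.m.2vars)   # r.squared, overall p-value, and other model-level stats
tidy(sysBP.m.2vars)     # per-term estimates, SEs, t values, and p values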

Note on Standardized Slope

Use the function lm.beta from the QuantPsyc package. All you need to do is pass in the lm object, and it will return the standardized slopes for each predictor:

library(QuantPsyc)
lm.beta(sysBP.m.2vars)