title: "Lab_7_Linear_Regression_2_With_Code"
author: "Matt Defenderfer"
date: "8/19/2019"
output: 
  html_document:
    toc: TRUE
knitr::opts_chunk$set(echo = TRUE)

Overall Progression for Lab 7:

  1. Load Packages and Data
  2. Review our Simple Model
  3. Checking Assumptions
     a. Linearity
     b. Homoscedasticity of residuals
     c. Normality of residuals
     d. Independence
     e. Multicollinearity (for multiple linear model)

Loading Packages and Data

For this lab we will be loading the tidyverse, pastecs, broom, and car packages for various assumption tests. We will also be using the same dataset as last lab.

library(pastecs)
library(broom)
library(car)
library(tidyverse)

theme_set(theme_bw())
load("~/Google Drive/Grad School/GR 770 Statistics/R Labs/Data/framingham_n50.RData")

Reviewing Our Simple Model

Let's review our simple model of systolic blood pressure (sysBP) as a function of age alone. We will recalculate this using lm

sysBP.m <- lm(sysBP ~ age, data = fhs)
summary(sysBP.m)

We know that we have a significant main effect of age on systolic blood pressure, and that's great for us, but we need to make sure that our model passes all of the assumptions.
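As an optional aside (not required for the lab), since broom is already loaded, tidy() and glance() return the coefficient table and the model-level summaries as data frames, which can be handy if you want to manipulate or plot them later:

# Optional: coefficient estimates as a data frame (same numbers as summary())
tidy(sysBP.m)

# Optional: one-row model summary (R-squared, F statistic, etc.)
glance(sysBP.m)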

Checking Assumptions

Linearity

There are 4 assumptions that need to be checked for a linear regression: linearity, homoscedasticity, independence, and normality. Linearity is the easiest since it is based on the dependent and independent variables themselves. You can assess linearity using a scatterplot, something we did last lab. We can recreate that plot here, again adding a regression line using geom_smooth

ggplot(fhs, aes(age, sysBP)) +
  geom_point() +
  geom_smooth(method= "lm")

You can see the general increasing trend between age and sysBP, which bodes well for our use of a linear model. If there were a lack of linearity, you could try transforming the data with a sqrt, log, or squared transform, for example (a quick sketch of this is shown below).
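As a sketch only (we don't need this for our data, and the model name sysBP.log is just for illustration), a transform can be applied directly inside the lm formula:

# Sketch: fit sysBP against a log-transformed predictor
# (not needed here since age and sysBP already look linear)
sysBP.log <- lm(sysBP ~ log(age), data = fhs)
summary(sysBP.log)

# The same idea works for sqrt(age) or a squared term, written as I(age^2)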

Homoscedasticity of Residuals

Next, we will move to assessing the residuals. Residuals are the point-by-point error between the actual dependent variable and what the linear model predicts given the value of the predictor variable. The residuals are calculated automatically by the lm command but are annoying to access in their base form. However, the augment command (from broom) extracts the residuals and other point-by-point information used to create the model.

augment has the following form:

model.a <- augment(model, ...)

It will spit out a bunch of variables including the input dependent and independent variables, residuals, and standardised residuals. Let's augment sysBP.m and store in sysBP.a to have easier access to the residuals

sysBP.a <- augment(sysBP.m)
head(sysBP.a,10)

You will see that the first two columns are our dependent and independent variables, kept for reference and to help with plotting later on. The .resid column has the raw residuals, while .std.resid has the standardised (z-transformed) residuals.
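If you want to see every column augment() added (an optional aside), glimpse() from the tidyverse prints the full set of column names and types:

# Optional: show all columns in the augmented data frame
# (fitted values, residuals, leverage, Cook's distance, standardised residuals, etc.)
glimpse(sysBP.a)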

First off, we need to make sure the average of our residuals is basically 0.

mean(sysBP.a$.resid)

A number on the order of e-16 is close enough to 0 for government work, so that's good. We've already passed one test.
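If you prefer a programmatic check over eyeballing the exponent (purely optional), base R's all.equal() treats anything within floating-point tolerance as equal:

# Optional sanity check: TRUE if the mean residual is 0 within floating-point tolerance
isTRUE(all.equal(mean(sysBP.a$.resid), 0))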

Residual Plots

For assessing homoscedasticity, we should plot our residuals. For this we will be using the .fitted and the .resid variables from our augmented data frame. Make a scatterplot using geom_point with .fitted on the x axis and .resid on the y axis. Also add a horizontal line at 0 using geom_hline.

ggplot(sysBP.a, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  theme_bw()

We are looking for a few things using this graph:

  • Randomly spread points above and below the 0 line.
  • Similar variance across all points
  • No pattern in the residuals

Based on these 3 criteria, our residuals pass this check. They are nicely spread above and below 0, there isn't a noticeable change in the variance at any point along the fitted values, and there isn't a discernible pattern.
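If you want a formal test to back up the visual check (this goes beyond what the lab asks for), the car package we loaded earlier provides ncvTest(), a score test for non-constant error variance; a non-significant p-value is consistent with homoscedasticity:

# Optional: score test for non-constant variance from the car package
# A non-significant result is consistent with homoscedastic residuals
ncvTest(sysBP.m)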

Normality

Normality will be tested in the same way we have done previously, using histograms, Q-Q plots, and statistical tests. Really, you can test normality on either the observed data or the residuals, but seeing as we have already done normality on observed data before, let's test it on the residuals for kicks and giggles.

Histogram

We will plot the .resid variable from sysBP.a like we have previously, adding the normal distribution on top. We will use 15 bins this time since we have a fairly low sample size.

ggplot(sysBP.a,aes(x = .resid)) +
  geom_histogram(aes(y = ..density..), bins = 15, fill = "gray80", color = "black") +
  stat_function(fun = dnorm, args = list(mean = mean(sysBP.a$.resid), sd = sd(sysBP.a$.resid)), color = "red")

You can see that the residuals are fairly normal, slightly skewed to the right but nothing major.

Q-Q plot

ggplot(sysBP.a,aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line()

So again, as you can see in the Q-Q plot, almost all of the points fall right along the line, with only 3-4 at the tail contributing to any skew.

Boxplot

ggplot(sysBP.a,aes(y = .resid)) +
  geom_boxplot()

According to the boxplot, we have 2 outliers; however, the whiskers are approximately even in length, although the median line is not centered in the box.
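If you want to see which observations those outliers correspond to (optional; the cutoff of 2 standardised residuals is just a common rule of thumb), you can filter the augmented data frame:

# Optional: flag observations with large standardised residuals (|z| > 2 as a rough cutoff)
sysBP.a %>%
  filter(abs(.std.resid) > 2)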

Statistical Tests

Here, we will calculate skew and kurtosis and perform the Shapiro-Wilk (SW) test for normality

stat.desc(sysBP.a$.resid, basic = FALSE, norm = TRUE)

You can see that the skewness metric is 0.999, indicating a highly skewed distribution, different from what the histogram might suggest. The skew.2SE value doesn't meet our rule-of-thumb criterion, though: skew.2SE is the skewness divided by twice its standard error, so an absolute value greater than 1 (i.e., skewness more than roughly 1.96 standard errors from 0) would indicate significant skew for a sample this small. The SW test, however, did come out highly significant.
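The SW statistic and p-value that stat.desc reports (normtest.W and normtest.p) come from the same Shapiro-Wilk test you can run directly in base R, if you prefer the standalone output:

# Equivalent standalone Shapiro-Wilk test on the residuals
shapiro.test(sysBP.a$.resid)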

Our last test is the KS test which is less stringent than the SW test.

ks.test(sysBP.a$.resid, y = pnorm, mean = mean(sysBP.a$.resid), sd = sd(sysBP.a$.resid))

The KS test comes out non-significant. So we have conflicting information about the skewness of this distribution: the plots look only slightly skewed, the skewness metric and the SW test indicate a highly skewed distribution, but the KS test and the skew.2SE metric indicate the opposite. Overall, this seems borderline, but we can call it normal since most of the skewness seems to be driven by 2-3 points out of 50. To be clear, I think we could have said either skewed or normal here and been correct.

Independence

We don't talk about independence in this class

Multicollinearity

Multicollinearity occurs when predictor variables in a multiple linear regression are highly correlated with one another. This causes problems with both parameter estimation and prediction, so we need to make sure our predictors carry fairly unique information. We will test for multicollinearity by calculating the variance inflation factor (VIF) using the vif function from the car package.

In order to use the function, you only need to pass the model into the vif function

vif(model)

So we can recreate our multiple linear regression from the last lab. Remember, we made a model using the main effects of age and glucose on systolic blood pressure. We can make that again, storing in sysBP.m.2

sysBP.m.2 <- lm(sysBP ~ age + glucose, data = fhs)
vif(sysBP.m.2)

Variables with VIF values above 10 are generally considered problematic and should be removed from the model. We can see that neither of our predictors comes anywhere close to that, so multicollinearity is not an issue in our simple multiple linear model.
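To see where that number comes from (an optional aside; r2.age is just an illustrative name), each predictor's VIF is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all of the other predictors:

# Optional: compute the VIF for age by hand
# With only two predictors, regressing age on glucose gives the R^2 we need
r2.age <- summary(lm(age ~ glucose, data = fhs))$r.squared
1 / (1 - r2.age)  # should match vif(sysBP.m.2)["age"]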