For this lab we will be loading the tidyverse, pastecs, broom, and car packages for various assumption tests. We will also be using the same dataset as last lab.
library(pastecs)
library(broom)
library(car)
library(tidyverse)
theme_set(theme_bw())
load("~/Google Drive/Grad School/GR 770 Statistics/R Labs/Data/framingham_n50.RData")
Let’s review our simple model of systolic blood pressure (sysBP) as a function of age alone. We will recalculate it using lm:
sysBP.m <- lm(sysBP ~ age, data = fhs)
summary(sysBP.m)
##
## Call:
## lm(formula = sysBP ~ age, data = fhs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.293 -9.533 1.273 5.580 57.010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 89.8743 12.9313 6.950 8.77e-09 ***
## age 0.8853 0.2705 3.272 0.00198 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.57 on 48 degrees of freedom
## Multiple R-squared: 0.1824, Adjusted R-squared: 0.1653
## F-statistic: 10.71 on 1 and 48 DF, p-value: 0.001981
We know that we have a significant main effect of age on systolic blood pressure, and that’s great for us, but we need to make sure that our model passes all of the assumptions.
There are 4 assumptions that need to be checked for a linear regression: linearity, homoscedasticity, independence, and normality. Linearity is the easiest since it is based on the dependent and independent variables themselves. You can assess linearity using a scatterplot, something we did last lab. We can recreate that plot here, again adding a regression line using geom_smooth:
ggplot(fhs, aes(age, sysBP)) +
geom_point() +
geom_smooth(method = "lm")
You can see the general increasing trend between age and sysBP, which bodes well for our use of a linear model. If there were a lack of linearity, you could try transforming the data with a sqrt, log, square, or similar transform.
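If, for example, you wanted to try a log transform of the outcome (purely an illustrative sketch here, since our data already look reasonably linear), you could refit the model on the transformed scale and re-check the scatterplot:
# Illustrative only: refit with a log-transformed outcome and
# look for a straighter relationship on the new scale
sysBP.log.m <- lm(log(sysBP) ~ age, data = fhs)
ggplot(fhs, aes(age, log(sysBP))) +
  geom_point() +
  geom_smooth(method = "lm")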
Next, we will move on to assessing the residuals. Residuals are the point-by-point errors between the observed values of the dependent variable and the values the linear model predicts from the predictor variable. The residuals are calculated automatically by the lm command but are annoying to access in their base form. However, the augment command (from the broom package) extracts the residuals and other point-by-point diagnostics from a fitted model. augment has the following form:
model.a <- augment(model, ...)
It will spit out a bunch of variables, including the input dependent and independent variables, residuals, and standardised residuals. Let’s augment sysBP.m and store the result in sysBP.a so we have easier access to the residuals:
sysBP.a <- augment(sysBP.m)
head(sysBP.a,10)
## # A tibble: 10 x 9
## sysBP age .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 112. 36 122. 3.72 -9.24 0.0572 15.7 0.0114 -0.612
## 2 137 52 136. 2.57 1.09 0.0273 15.7 0.0000709 0.0711
## 3 112. 53 137. 2.72 -24.3 0.0305 15.3 0.0395 -1.59
## 4 140. 51 135. 2.44 5.48 0.0246 15.7 0.00160 0.356
## 5 137 47 131. 2.20 5.52 0.0200 15.7 0.00131 0.358
## 6 108. 53 137. 2.72 -28.3 0.0305 15.2 0.0536 -1.85
## 7 131 45 130. 2.27 1.29 0.0213 15.7 0.0000764 0.0837
## 8 113 37 123. 3.51 -9.63 0.0508 15.7 0.0108 -0.635
## 9 132 48 132. 2.21 -0.367 0.0202 15.7 0.00000585 -0.0238
## 10 123 36 122. 3.72 1.26 0.0572 15.7 0.000210 0.0831
You will see that the first two columns are our dependent and independent variables, kept for reference and to help with plotting later on. The .resid column has the raw residuals, while .std.resid is the z-transformed (standardised) residual.
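If you prefer base R, the same quantities can be pulled straight from the model object; this is just an optional cross-check of what augment gives us:
# Base R equivalents of two of the augment() columns
head(resid(sysBP.m))      # matches the .resid column
head(rstandard(sysBP.m))  # matches the .std.resid column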
First off, we need to make sure the average of our residuals is basically 0.
mean(sysBP.a$.resid)
## [1] -1.151856e-16
A number on the order of e-16 is close enough to 0 for government work, so that’s good. We already passed one test.
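If you want R to confirm that a number on the order of e-16 is numerically zero, an optional quick check is:
# all.equal() compares within floating-point tolerance,
# so this returns TRUE even though the mean is not exactly 0
isTRUE(all.equal(mean(sysBP.a$.resid), 0))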
For assessing homoscedasticity, we should plot our residuals. For this we will be using the .fitted and .resid variables from our augmented data frame. Make a scatterplot using geom_point with .fitted on the x axis and .resid on the y axis, and add a horizontal line at 0 using geom_hline:
ggplot(sysBP.a, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0) +
theme_bw()
We are looking for a few things in this graph: the residuals should be scattered fairly evenly above and below the 0 line, their variance should stay roughly constant across the range of fitted values, and there should be no discernible pattern in the points.
Based on these 3 criteria, our residuals have passed the first test. All of them are nicely distributed, there isn’t a noticeable change in the variance at any given location, and there isn’t a discernible pattern.
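If you would like a formal test to back up the visual check, the car package we already loaded has an ncvTest function that tests the null hypothesis of constant error variance. This is an optional supplement to the plot, not a replacement for it:
# Score test for non-constant error variance;
# a small p-value would suggest heteroscedasticity
ncvTest(sysBP.m)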
Normality will be tested in the same way we have done previously, using histograms, Q-Q plots, and statistical tests. Really, you can test normality on either the observed data or the residuals, but seeing as we have already done normality on observed data before, let’s test it on the residuals for kicks and giggles.
We will plot the .resid variable from sysBP.a like we have previously, adding the normal distribution curve on top. We will use 15 bins this time since we have a fairly low sample size:
ggplot(sysBP.a,aes(x = .resid)) +
geom_histogram(aes(y = ..density..), bins = 15, fill = "gray80", color = "black") +
stat_function(fun = dnorm, args = list(mean(sysBP.a$.resid), sd(sysBP.a$.resid)), color = "red")
You can see that the residuals are fairly normal, slightly skewed to the right but nothing major.
ggplot(sysBP.a,aes(sample = .resid)) +
geom_qq() +
geom_qq_line()
So again, as you can see in the Q-Q plot, almost all of the points fall along the line, with only 3-4 at the tail contributing to any skew.
ggplot(sysBP.a,aes(y = .resid)) +
geom_boxplot()
According to the boxplot, we have 2 outliers; however, the whiskers of the boxplot are approximately even in length, although the median line is not in the middle of the box.
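If you want to see which observations are behind those outliers, one option is to filter the augmented data frame on the standardized residuals. Note this uses a cutoff on .std.resid rather than the boxplot’s IQR rule, so the rows flagged may not match the boxplot exactly:
# Rows with unusually large standardized residuals;
# |.std.resid| > 2 is a common, somewhat arbitrary cutoff
sysBP.a %>%
  filter(abs(.std.resid) > 2)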
Here, we will calculate skew and kurtosis and perform the Shapiro-Wilk (SW) test for normality:
stat.desc(sysBP.a$.resid, basic = FALSE, norm = TRUE)
## median mean SE.mean CI.mean.0.95 var
## 1.272710e+00 -1.151856e-16 2.178805e+00 4.378473e+00 2.373596e+02
## std.dev coef.var skewness skew.2SE kurtosis
## 1.540648e+01 -1.337535e+17 9.992540e-01 1.484331e+00 2.558828e+00
## kurt.2SE normtest.W normtest.p
## 1.932917e+00 9.264829e-01 4.077272e-03
You can see that the skewness metric is 0.999, indicating a highly skewed distribution, different from what the histogram might suggest. The skew.2SE value doesn’t meet the criterion for our rule of thumb, though, where for small sample sizes a value greater than 1.96 indicates significant skew. The SW test, however, did come out highly significant.
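The normtest.W and normtest.p values reported by stat.desc are just the Shapiro-Wilk statistic and p-value, so you can reproduce them directly in base R if you prefer:
# Shapiro-Wilk test on the residuals; should match
# normtest.W = 0.926 and normtest.p = 0.004 from above
shapiro.test(sysBP.a$.resid)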
Our last test is the Kolmogorov-Smirnov (KS) test, which is less stringent than the SW test.
ks.test(sysBP.a$.resid, y = pnorm, mean = mean(sysBP.a$.resid), sd = sd(sysBP.a$.resid))
##
## One-sample Kolmogorov-Smirnov test
##
## data: sysBP.a$.resid
## D = 0.11811, p-value = 0.4537
## alternative hypothesis: two-sided
The KS test comes out non-significant. So we have conflicting information about the skewness of this distribution: the plots look only slightly skewed, the skewness metric and the SW test indicate a highly skewed distribution, but the KS test and the skew.2SE metric indicate the opposite. Overall, this seems borderline, but we can say it is normal since most of the skewness seems to be driven by 2-3 points out of 50. To be clear, I think we could have said either skewed or normal here and been correct.
We don’t talk about independence in this class
Multicollinearity occurs when predictor variables in a multiple linear regression are highly correlated with each other. This causes problems with both parameter estimation and prediction, so we need to make sure our predictors are fairly unique. We will test for multicollinearity by calculating the variance inflation factor (VIF) using the vif function from the car package.
To use the function, you only need to pass the model into vif:
vif(model)
So we can recreate our multiple linear regression from the last lab. Remember, we made a model using the main effects of age and glucose on systolic blood pressure. We can make that again, storing it in sysBP.m.2:
sysBP.m.2 <- lm(sysBP ~ age + glucose, data = fhs)
vif(sysBP.m.2)
## age glucose
## 1.089871 1.089871
Variables with VIF values above 10 should be removed from the model. We can see that neither of our predictors comes anywhere near that threshold, so multicollinearity is not an issue in our simple multiple linear model.
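As a side note on what vif is doing under the hood, the VIF for a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on the other predictors in the model. Here is a quick sketch for age; with only two predictors the two VIFs come out identical, which is why age and glucose show the same value above:
# R-squared from regressing age on the remaining predictor
r2.age <- summary(lm(age ~ glucose, data = fhs))$r.squared
1 / (1 - r2.age)  # should match vif(sysBP.m.2)["age"]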