---
title: "Lab_7_Linear_Regression_2_With_Code"
author: "Matt Defenderfer"
date: "8/19/2019"
output:
  html_document:
    toc: TRUE
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Overall Progression for Lab 7:
1. Load Packages and Data
2. Review our Simple Model
3. Checking Assumptions
    a. Linearity
    b. Homoscedasticity of residuals
    c. Normality of residuals
    d. Independence
    e. Multicollinearity (for multiple linear model)
## Loading Packages and Data
For this lab we will be loading the `tidyverse`, `pastecs`, `broom`, and `car` packages for various assumption tests. We will also be using the same dataset as last lab.
```{r Library Setup, message = FALSE}
library(pastecs)
library(broom)
library(car)
library(tidyverse)
theme_set(theme_bw())
```
```{r Load Data}
load("~/Google Drive/Grad School/GR 770 Statistics/R Labs/Data/framingham_n50.RData")
```
## Reviewing Our Simple Model
Let's review our simple model of systolic blood pressure (sysBP) as a function of age alone. We will recalculate it using `lm`:
```{r sysBP age}
sysBP.m <- lm(sysBP ~ age, data = fhs)
summary(sysBP.m)
```
We know that we have a significant main effect of age on systolic blood pressure, and that's great for us, but we need to make sure that our model passes all of the assumptions.
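As a quick aside, since `broom` is loaded anyway, `tidy` returns the same coefficient table as `summary` but as a data frame, which can be handy for plotting or filtering estimates later. This step is optional:
```{r tidy coefficients}
# Coefficient table (estimate, std. error, t statistic, p value) as a data frame
tidy(sysBP.m)
```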
## Checking Assumptions
### Linearity
There are 4 assumptions that need to be checked for a linear regression: linearity, homoscedasticity, independence, and normality. Linearity is the easiest to check since it is based on the dependent and independent variables themselves. You can assess linearity using a scatterplot, something we did last lab. We can recreate that plot here, again adding a regression line using `geom_smooth`:
```{r Check linearity}
ggplot(fhs, aes(age, sysBP)) +
geom_point() +
geom_smooth(method= "lm")
```
You can see the general increasing trend between age and sysBP, which bodes well for our use of a linear model. If the relationship were not linear, you could try transforming the data (e.g., square root, log, or squared transforms), as sketched below.
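For illustration only (our data do not need it), here is what refitting with a log-transformed predictor might look like; `sysBP.log.m` is just a made-up name:
```{r log transform example, eval = FALSE}
# Hypothetical: refit the model with a log-transformed predictor,
# then re-check linearity on the transformed scale
sysBP.log.m <- lm(sysBP ~ log(age), data = fhs)
summary(sysBP.log.m)

ggplot(fhs, aes(log(age), sysBP)) +
  geom_point() +
  geom_smooth(method = "lm")
```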
### Homoscedasticity of Residuals
Next, we will move on to assessing the residuals. Residuals are the point-by-point errors between the observed dependent variable and what the linear model predicts given the value of the predictor variable. The residuals are automatically calculated by the `lm` command but are awkward to access in their base form. However, the `augment` command extracts the residuals and other information at each point used to create the model.
`augment` has the following form:
`model.a <- augment(model, ...)`
It will spit out a bunch of variables including the input dependent and independent variables, residuals, and standardized residuals. Let's augment `sysBP.m` and store the result in `sysBP.a` so we have easier access to the residuals.
```{r augment sysBP.m}
sysBP.a <- augment(sysBP.m)
head(sysBP.a,10)
```
You will see that the first two columns are our dependent and independent variables, kept for reference and to help with plotting later on. The `.resid` column has the raw residuals, while `.std.resid` holds the standardized (z-transformed) residuals.
First off, we need to make sure the average of our residuals is basically 0.
```{r avg of residuals}
mean(sysBP.a$.resid)
```
A number on the order of 1e-16 is close enough to 0 for government work, so that's good. We have already passed one test.
#### Residual Plots
For assessing homoscedasticity, we should plot our residuals. For this we will use the `.fitted` and `.resid` variables from our augmented data frame. Make a scatterplot using `geom_point` with `.fitted` on the x axis and `.resid` on the y axis, and add a horizontal line at 0 using `geom_hline`.
```{r residual plot}
ggplot(sysBP.a, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0) +
theme_bw()
```
We are looking for a few things using this graph:
- Randomly spread points above and below the 0 line.
- Similar variance across all points
- No pattern in the residuals
Based on these 3 criteria, our residuals have passed the first test. All of them are nicely distributed, there isn't a noticeable change in the variance at any given location, and there isn't a discernible pattern. A formal test to back up the visual check is sketched below.
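As an optional supplement to the plot (not part of the original lab workflow), the `car` package we already loaded provides `ncvTest`, a score test for non-constant error variance. A non-significant result is consistent with homoscedasticity:
```{r ncv test}
# Score test for non-constant variance against the fitted values;
# a large p-value means no evidence that the variance changes with the fit
ncvTest(sysBP.m)
```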
### Normality
Normality will be tested in the same way we have done previously, using histograms, Q-Q plots, and statistical tests. Really, you can test normality on either the observed data or the residuals, but seeing as we have already done normality on observed data before, let's test it on the residuals for kicks and giggles.
#### Histogram
We will plot the `.resid` variable from `sysBP.a` as we have previously, adding the normal distribution on top. We will use 15 bins this time since we have a fairly small sample size.
```{r resid hist}
ggplot(sysBP.a,aes(x = .resid)) +
geom_histogram(aes(y = ..density..), bins = 15, fill = "gray80", color = "black") +
  stat_function(fun = dnorm, args = list(mean = mean(sysBP.a$.resid), sd = sd(sysBP.a$.resid)), color = "red")
```
You can see that the residuals are fairly normal, slightly skewed to the right but nothing major.
#### Q-Q plot
```{r resid qq}
ggplot(sysBP.a,aes(sample = .resid)) +
geom_qq() +
geom_qq_line()
```
So again, as you can see in the Q-Q plot, almost all of the points fall along the line, with only 3-4 at the tail contributing to any skew.
#### Boxplot
```{r resid boxplot}
ggplot(sysBP.a,aes(y = .resid)) +
geom_boxplot()
```
According to the boxplot, we have 2 outliers; however, the whiskers are approximately even in length, although the median line is not in the middle of the box.
#### Statistical Tests
Here, we will calculate skewness and kurtosis and perform the Shapiro-Wilk (SW) test for normality.
```{r skew + kurtosis + SW}
stat.desc(sysBP.a$.resid, basic = FALSE, norm = TRUE)
```
You can see that the skewness metric is 0.999, indicating a highly skewed distribution, which differs from what the histogram might suggest. The skew.2SE value does not meet our rule-of-thumb criterion, though, where for small sample sizes a value greater than 1.96 indicates significant skew. The SW test did come out highly significant.
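For reference, the SW result above comes from `stat.desc` with `norm = TRUE` (the `normtest.W` and `normtest.p` rows). If you want to run the test on its own, base R's `shapiro.test` does the same thing directly on the residuals:
```{r shapiro resid}
# Shapiro-Wilk test on the residuals; W and p should match the
# normtest.W / normtest.p values reported by stat.desc above
shapiro.test(sysBP.a$.resid)
```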
Our last test is the KS test, which is less stringent than the SW test.
```{r}
ks.test(sysBP.a$.resid, y = pnorm, mean = mean(sysBP.a$.resid), sd = sd(sysBP.a$.resid))
```
The KS test comes out non-significant. So we have a variety of conflicting information about the skewness of this distribution. The plots look only slightly skewed; the skewness metric and the SW test indicate a highly skewed distribution, but the KS test and the skew.2SE metric indicate the opposite. Overall, this seems borderline, but we can call it normal since most of the skewness seems to be driven by 2-3 points out of 50 (see the sketch below). To be clear, I think we could have said either skewed or normal here and been correct.
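As an optional check on that last claim, we can pull the observations with the largest standardized residuals out of the augmented data frame. The |z| > 2 cutoff here is just a common rule of thumb, not something from the original lab:
```{r flag large residuals}
# Observations with unusually large standardized residuals (|z| > 2);
# these are the points most likely driving the right tail of the histogram
sysBP.a %>%
  filter(abs(.std.resid) > 2)
```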
### Independence
We don't talk about independence in this class.
### Multicollinearity
Multicollinearity occurs when predictor variables in a multiple linear regression are highly correlated. This causes problems with both parameter estimation and prediction, so we need to make sure our predictors are fairly unique. We will test for multicollinearity by calculating the variance inflation factor (VIF) using the `vif` function from the `car` package.
To use it, you only need to pass the model to the `vif` function:
`vif(model)`
So we can recreate our multiple linear regression from the last lab. Remember, we made a model using the main effects of age and glucose on systolic blood pressure. We can make that again, storing it in `sysBP.m.2`.
```{r}
sysBP.m.2 <- lm(sysBP ~ age + glucose, data = fhs)
vif(sysBP.m.2)
```
A common rule of thumb is that variables with VIF values above 10 should be removed from the model. We can see that neither of our predictors approaches that threshold, so multicollinearity is not an issue in our simple multiple linear model. For some intuition about where VIF comes from, see the sketch below.
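The VIF for a predictor is 1 / (1 - R^2), where the R^2 comes from regressing that predictor on all of the other predictors in the model. Here is a minimal sketch for age; it should closely match the `vif` output above (up to any rows dropped for missing values):
```{r vif by hand}
# R^2 from regressing age on the other predictor (glucose)
r2.age <- summary(lm(age ~ glucose, data = fhs))$r.squared

# VIF for age = 1 / (1 - R^2)
1 / (1 - r2.age)
```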