Hey there! This is the second guide in my Modeling Guide with R series. In the previous guide, we performed some preliminary explorations on our dataset and took a look at a few of the variables. We cleaned the dataset up a bit and exported it so we could read it again in later guides.
This page will focus on both simple and multiple linear regression. Although I can’t cover every detail and use case of linear regression, we will explore the key ideas to focus on when creating and analyzing a linear regression model. We’ll start by importing our data and identifying variables to use in the model, then move on to creating, interpreting, and testing the model. For both simple and multiple linear regression, we’ll use train/test validation and look at several candidate models.
We will be using the same dataset for both the simple and multiple regression cases, so let’s get that set up first. As mentioned in the previous guide, I prefer to use the tidyverse family of packages. We’ll also be using the tidymodels collection of packages to set up the models and perform our train/test validation, and GGally to make a scatterplot matrix. The performance package helps us check the model after we’ve created it, and the other packages are helpful for rendering good tables.
The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train set and a test set. We will train our models using the training set and test their performance using the test set. We do this to simulate how the model will perform in the “real world”.
I will be setting the seed for the random number generator so we get the same results each time I start a new R session and split the data. You may or may not wish to do this in your work. I also want a roughly equal share of each store in the split sets, so I will take a stratified sample to make the split (strata = store).
# Load the packages for modeling, plotting, and tables
library(tidyverse)
library(tidymodels)
library(GGally)
library(performance)
library(knitr)
library(kableExtra)

# Import the data cleaned in the previous guide
retail <- read_csv(here::here("data/retail_clean.csv"))

# Split into training (80%) and testing sets, stratified by store
set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)
train <- training(retail_split)
test <- testing(retail_split)
Alright! Now that we have that set up, we’ll use the train data for all the model creation and will only use the test data when we are analyzing our models.
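If you want to double-check that the stratification worked, a quick sketch like the one below compares the share of each store (the column we stratified on) in the two sets; the proportions should come out roughly equal.

# Compare the share of each store in the training and testing sets;
# they should be roughly equal because we stratified on store
train %>% count(store) %>% mutate(prop = n / sum(n))
test %>% count(store) %>% mutate(prop = n / sum(n))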
Although I’ll avoid any complex math throughout this guide, it is important to think about how regression works. The formula for simple linear regression for a population is
\[ y = \beta_0 + \beta_1 x_1 + \varepsilon \]
which states that we can represent the relationship between our predictor variable and our response variable as a straight line plus some natural error (\(\varepsilon\)). The problem is, we don’t know the information about our population (i.e., we don’t have data from all transactions from all stores). Instead, we have a subset of that data (some transactions from some stores) and are looking to predict the population values using our sample. As a result, we are creating a model of the form
\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x_1 \] There are a couple of differences here. First, we’ve lost the error term: the fitted line predicts the average response rather than any single observation, so there is no error term left to write down. You also now see hats above a few of the values, which indicates that they are estimates of the true population values.
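To make the notation concrete, suppose (with purely made-up numbers) that the estimates came out to \(\hat{\beta_0} = 7\) and \(\hat{\beta_1} = 0.02\). For a transaction with \(x_1 = 5\), the model would then predict \[ \hat{y} = 7 + 0.02 \times 5 = 7.1 \]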
When might we want to use SLR? Well, first we need to make sure we have the correct variable types: a quantitative (continuous) response variable and a quantitative predictor variable.
Second, we have to meet certain criteria. See, simple linear regression can be somewhat limiting, as the mathematical formulas and matrix algebra behind it require us to make certain assumptions about our data and the population we are hoping to model. This is what we should pay attention to:

- a roughly linear relationship between the predictor and the response
- errors with a roughly constant spread across the fitted values
- errors that are roughly normally distributed
- no overly influential observations driving the fit
These will be explored later on when we investigate our variables.
Given all of that information, let’s choose the variables we want to use in the model. Suppose we want a way to predict the rating of a transaction. The rating variable is continuous, so it is acceptable to use as a response variable. To try to predict rating, we will consider a few candidate models (each with one predictor variable). Let’s start with total and unit_price as possible predictors.
At this point, it’s a good idea to write down your predictions about what might happen. Doing this before looking at any results helps keep bias out of the analysis. Right now, I am thinking that unit_price will have the most effect on rating, since when items cost more, people are more likely to be grumpy and give a lower rating.
Using the formula above, we can create two possible models that we want to use to predict rating:
\[ \widehat{\text{rating}} = \hat{\beta_0} + \hat{\beta_1} * \text{total} \]
\[ \widehat{\text{rating}} = \hat{\beta_0} + \hat{\beta_1} * \text{unit_price} \] In each case, we are trying to find the values for \(\hat{\beta_0}\) and \(\hat{\beta_1}\) using our predictor variable.
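Before fitting anything, it can help to glance at the pairwise relationships between these variables. This is what we loaded GGally for; a minimal sketch using the training data from above might look like this:

# Scatterplot matrix of the response and the two candidate predictors
train %>%
  select(rating, unit_price, total) %>%
  ggpairs()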
Let’s build the model! We will use the tidymodels family of packages for this. We first set up the regression engine, then fit the models with our data. In the first guide, we saw that total is right-skewed due to some outliers on the high end, so we are going to take the log of total.
# Keep only the variables we need and add the log of total
train_subset <- train %>%
  select(rating, unit_price, total) %>%
  mutate(log_total = log(total))

# Specify a linear regression model that uses the lm engine
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# Fit one candidate model per predictor
slr_mod_price <- lm_spec %>%
  fit(rating ~ unit_price, data = train_subset)

slr_mod_total <- lm_spec %>%
  fit(rating ~ log_total, data = train_subset)
Above, I fit two different models: one with unit_price and one with the log of total. Let’s take a look at the output for both of them.
tidy(slr_mod_price) %>%
  kable() %>%
  kable_styling(c("striped", "responsive"))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 6.987286 | 0.1422717 | 49.1122698 | 0.0000000 |
unit_price | 0.000548 | 0.0023023 | 0.2380066 | 0.8119372 |
tidy(slr_mod_total) %>%
  kable() %>%
  kable_styling(c("striped", "responsive"))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 6.9220710 | 0.366991 | 18.8616934 | 0.0000000 |
log_total | 0.0176607 | 0.066701 | 0.2647747 | 0.7912515 |
We can use this output to create a regression line equation using the above formulas:
\[ \widehat{\text{rating}} = 6.99 + 0.000548 * \text{unit price} \]
\[ \widehat{\text{rating}} = 6.92 + 0.0177 * \log{\text{(total)}} \] In both cases, the estimated slope is very close to zero and the p-value is large (about 0.81 for unit_price and 0.79 for the log of total), so we have very little evidence of a linear relationship between rating and either predictor.
Let’s see how the models perform! My guess is not very well based on the information so far.
model_performance(slr_mod_price) %>%
kable()
AIC | AICc | BIC | R2 | R2_adjusted | RMSE | Sigma |
---|---|---|---|---|---|---|
3137.269 | 3137.299 | 3151.319 | 7.11e-05 | -0.0011835 | 1.716958 | 1.71911 |
model_performance(slr_mod_total) %>%
kable()
AIC | AICc | BIC | R2 | R2_adjusted | RMSE | Sigma |
---|---|---|---|---|---|---|
3137.255 | 3137.286 | 3151.305 | 8.8e-05 | -0.0011666 | 1.716943 | 1.719096 |
We can get a good idea of this by looking at \(R^2\), the proportion of the variance in rating that the model explains. Whether we use unit_price or the log of total, we explain less than 1% of the variance in rating. Not so good! There is very little difference between the two models. The AIC, or Akaike information criterion, gives us a relative measure of model error, and we want to see it decrease as we look at different models. The same goes for the BIC, or Bayesian information criterion. We also want to see \(R^2\) increase.
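If you prefer to see the candidates side by side, the performance package also has compare_performance(); assuming it accepts our parsnip fits the same way model_performance() did above, a sketch would be:

# Put the two candidate models into one comparison table
compare_performance(slr_mod_price, slr_mod_total) %>%
  kable() %>%
  kable_styling(c("striped", "responsive"))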
Since the two models are so similar, I will finish the SLR section with only one of them: the model with log(total), which does seem to be the slightly “better” model here.
The performance package offers us a nice set of plots to help answer whether we have met the assumptions.
# Build the diagnostic plots individually (panel = FALSE) so we can show them one at a time
diagnostics <- plot(check_model(slr_mod_total, panel = FALSE))
First, can we assume linearity between the response and the predictor?
diagnostics[[2]] +
my_theme
Well, this is a great example of how most real-life data is not going to be perfectly linear. The points here look more like a random scatter than a clear linear pattern, and the reference line in the plot is not flat for the lower fitted values, so I wouldn’t say that we have satisfied this condition.
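Another way to eyeball linearity is to plot the raw variables against each other with a straight-line fit layered on top; a quick sketch using the training subset from earlier:

# rating versus log(total) with a linear fit on top
ggplot(train_subset, aes(x = log_total, y = rating)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE)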
cor(train$rating, log(train$total)) %>%
  round(5) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
x |
---|
0.00938 |
We can also see that the correlation between the two variables is very close to zero, indicating that there is little evidence of a linear relationship between rating and the log of total.
diagnostics[[3]] +
my_theme
Using the third plot, do we see any curvature in the reference line? Are the errors more spread out (vertically) at some fitted values than others? I do not see much evidence of either issue, other than perhaps toward the lower end of the fitted values. At nearly every fitted value, the residuals span from about -2.5 to 2.5, and the line is very close to flat.
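For a more formal take on constant variance, performance also provides check_heteroscedasticity(), which runs a statistical test on the residuals; assuming it accepts the parsnip fit the same way check_model() did, the call is simply:

# Statistical test for non-constant error variance
check_heteroscedasticity(slr_mod_total)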
We can make a histogram of the errors to check for the normality assumption.
# Add predictions and residuals to the training data
train_aug <- augment(slr_mod_total, new_data = train_subset)

ggplot(data = train_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.25, fill = "#099392", color = "black") +
  xlab("Residuals") +
  ylab("Count") +
  labs(title = "Residual Histogram") +
  my_theme
At first glance, this looks OK! It’s not perfect by any means, but there is a single peak in the middle with lower tails on either side. However, a histogram is not the best tool for checking this assumption since we are mostly eyeballing it.
diagnostics[[5]] +
my_theme
Let’s use this plot, which is very similar to a traditional Q-Q plot: it plots our errors against what we would expect from a perfectly normal distribution of errors. If our errors were close to normally distributed, they would lie flat along the line, but here we see some pretty wonky tails. I would not consider this assumption to be met.
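If we want a number to go with the picture, a Shapiro-Wilk test on the residuals is one common (if imperfect) complement to eyeballing the Q-Q plot; using the augmented training data from above:

# Shapiro-Wilk test of normality on the training residuals
shapiro.test(train_aug$.resid)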
It’s also a good idea to look at any extreme observations. It is possible that one or a few observations are driving the analysis and are skewing our results.
diagnostics[[4]] +
my_theme
We can look at the plot above to find observations with high leverage, that is, with unusually extreme \(x\) values (log_total values). The plot uses Cook’s \(d\), or Cook’s distance, to quantify how influential a point is, and we want all values to stay inside the Cook’s \(d\) cutoffs (the dashed lines). I do not see any points outside of the dashed lines, so we are OK here.
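If you want the actual Cook’s distance values rather than just the plot, one option is to pull the underlying lm object out of the parsnip fit (it lives in the fit element) and use base R’s cooks.distance(); a sketch:

# Largest Cook's distance values, computed from the underlying lm fit
cooks.distance(slr_mod_total$fit) %>%
  sort(decreasing = TRUE) %>%
  head()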
We can run the model on the testing dataset we created earlier to see how accurate the model is.
# Prepare the test data the same way as the training data
test_subset <- test %>%
  mutate(log_total = log(total)) %>%
  select(rating, log_total)

# Add predictions and residuals
test_aug <- augment(slr_mod_total, new_data = test_subset)

head(test_aug) %>%
  kable() %>%
  kable_styling(c("striped", "responsive"))
.pred | .resid | rating | log_total |
---|---|---|---|
7.035840 | -2.9358402 | 4.1 | 6.441929 |
7.013923 | 2.8860768 | 9.9 | 5.200925 |
7.021534 | -1.0215340 | 6.0 | 5.631873 |
6.984960 | -0.2849597 | 6.7 | 3.560932 |
7.030404 | 0.5695961 | 7.6 | 6.134109 |
7.019795 | 0.6802047 | 7.7 | 5.533421 |
This has added two new columns to our dataset: .pred and .resid. Now we have our predictions and errors!
mean(test_aug$.resid) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
x |
---|
-0.224745 |
It looks like, on average, we over-predict rating by about 0.225 points. We can also plot our actual data against the predicted data.
test_aug %>%
  ggplot(aes(x = .pred, y = rating)) +
  geom_point() +
  labs(title = "Predicted Rating vs. Actual Rating",
       x = "Predicted Rating",
       y = "Actual Rating") +
  my_theme
We didn’t do very well. Notice how the predicted rating only ranges from about 6.97 to 7.06, whereas the actual rating ranges from 4 to 10. This produces some of our huge errors and leads to a poor model and failed model assumptions.
All in all, simple linear regression was not a good choice for predicting rating. With only one variable, we don’t get much information to help us predict rating accurately. Moreover, we can’t be certain that we meet the linearity assumption, so our results might be in question. My original prediction was not correct: neither unit_price nor total alone tells us much about rating.