Hey there! This is the third guide in my Modeling Guide with R series. In the previous guide, we performed some preliminary explorations on our dataset and took a look at a few of the variables. We cleaned the dataset up a bit and exported it so we could read it again in later guides.
This page will focus on both simple and multiple linear regression. Although I cannot cover all of the details and finer use cases of linear regression, we will explore some of the key ideas to focus on when creating and analyzing a linear regression model. We’ll start by importing our data and identifying variables to use in the model, then move to creating, interpreting, and testing the model. For both simple and multiple linear regression, we’ll use train/test validation and we’ll also look at several candidate models.
We will be using the same dataset for both the simple and multiple
regression cases so let’s get that set up first. As mentioned in the
previous guide, I prefer to use the tidyverse
family of
packages. We’ll also be using the tidymodels
collection of
packages to set up the models and perform our train/test validation and
GGally
to make a scatterplot matrix.
The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train and test set. We will train our models using the training set and will test its performance using the test set. We do this to simulate how the model will perform in the “real world”.
I will be setting the seed for the random number generator so that we get the same results each time a new R session is started and the data is split. You may or may not wish to do this in your own work. I want to make sure we get a roughly equal proportion of each store in the split sets, so I will take a stratified sample to make the split (strata = store).
library(tidyverse)
library(tidymodels)
library(GGally)
library(performance)
library(knitr)
library(kableExtra)
retail <- read_csv(here::here("data/retail_clean.csv"))
set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)
train <- training(retail_split)
test <- testing(retail_split)
Alright! Now that we have that set, we’ll use the train
data for all the model creation and will only use the test
data when we are analyzing our models.
Earlier, we weren’t able to predict rating with one variable alone. What if we could use multiple variables in a regression model? Maybe then we can better predict rating.
When can we use multiple linear regression (MLR)? We can use MLR when we want to predict a quantitative response, like rating, using two or more predictor variables, which (unlike in SLR) can also be categorical.
We also still have to meet the regression assumptions:
- the relationship between the response and the predictors is linear
- the errors are independent
- the errors are normally distributed
- the errors have constant variance
The tricky part about working with MLR is that you want to find the
right balance of variables. We do want to choose variables we
think might have an effect on rating
but we don’t
want to include any unnecessary variables or variables that don’t
provide much benefit to the model.
As a result, MLR is an iterative process. I like to start out by creating a full model that has every single predictor variable that I think could be useful in predicting my response variable. I then use the output from the model to try and pull out the variables that are not pulling their weight.
So let’s start with that full model. These are the variables that I think might be useful (remember that in MLR we can use categorical variables): store, customer_type, gender, total, unit_price, and pay_method.
It is also important to think about how these predictor variables interact with each other. We should consider adding interaction terms if we think two variables influence each other. I think it is possible that a customer’s member status might influence their pay method, and perhaps gender and purchase total interact. However, we shouldn’t rely solely on instinct. Before jumping right into making a model, we should plot the variables and visualize any possible interactions.
train_subset <- train %>%
select(rating, store, customer_type, gender, total, unit_price, pay_method)
test_subset <- test %>%
select(rating, store, customer_type, gender, total, unit_price, pay_method)
ggpairs(train_subset)
This is a large plot that shows the relationships between each pair of variables. There isn’t a lot that stands out here, but I do want to note a few things:
- rating, our response variable, is approximately normal
- total is skewed right
- unit_price and total may be related
- customer_type and gender may be related
- store and gender may be related
- store and pay_method may be related

Based on those observations I would like to make a few changes. I will be taking the log of total to help correct the skew and I will be adding interaction variables between the related pairs mentioned above.
Because we are still working with linear regression, we can use the same lm_spec engine from SLR to drive our model building. This time, we’ll fit the model with a different formula, one that includes the log of total and some interaction terms.
train_subset <- train_subset %>%
mutate(log_total = log(total))
test_subset <- test_subset %>%
mutate(log_total = log(total))
lm_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
mlr_mod_full <- lm_spec %>%
fit(rating ~ store + customer_type + gender + log_total + unit_price + pay_method + unit_price*log_total + customer_type*gender + store*gender + store*pay_method, data = train_subset)
Once we have done that, we can look at the output:
tidy(mlr_mod_full) %>%
mutate(p.value = format.pval(p.value, digits = 4)) %>%
kable() %>%
kable_styling(c("striped", "responsive"))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 5.7370356 | 0.8000191 | 7.1711235 | 1.726e-12 |
storeB | 0.4076251 | 0.3028263 | 1.3460690 | 0.17867 |
storeC | 0.2997884 | 0.2909336 | 1.0304357 | 0.30312 |
customer_typeNormal | 0.1256181 | 0.1741896 | 0.7211577 | 0.47103 |
genderMale | 0.4039030 | 0.2456883 | 1.6439652 | 0.10059 |
log_total | 0.1875390 | 0.1572396 | 1.1926951 | 0.23335 |
unit_price | 0.0187970 | 0.0141310 | 1.3301941 | 0.18384 |
pay_methodCredit card | 0.0192168 | 0.2636199 | 0.0728958 | 0.94191 |
pay_methodEwallet | 0.0897699 | 0.2521303 | 0.3560458 | 0.72190 |
log_total:unit_price | -0.0034288 | 0.0025746 | -1.3317756 | 0.18332 |
customer_typeNormal:genderMale | -0.0946899 | 0.2452180 | -0.3861457 | 0.69949 |
storeB:genderMale | -0.5277196 | 0.2992870 | -1.7632560 | 0.07825 |
storeC:genderMale | -0.3982474 | 0.3012989 | -1.3217686 | 0.18663 |
storeB:pay_methodCredit card | -0.1173402 | 0.3729626 | -0.3146166 | 0.75314 |
storeC:pay_methodCredit card | 0.1632780 | 0.3710021 | 0.4400999 | 0.65999 |
storeB:pay_methodEwallet | -0.4414774 | 0.3663241 | -1.2051553 | 0.22851 |
storeC:pay_methodEwallet | -0.1200978 | 0.3607408 | -0.3329201 | 0.73928 |
We see that R has created dummy variables for us for each of our categorical variables. However, because of this, we have a very large number of predictor variables. Our goal is to reduce the number of predictor variables down to the essentials. We want to predict rating as accurately as possible with as few variables as we can.
One way to judge this is to look at the p-value for each predictor, which essentially tests whether the addition of that variable contributes something to the model (by comparing a model with the variable to one without). However, I would caution against relying solely on p-values. They are great for getting an idea of the tendencies of the model and can point us toward where to go next, but in my opinion they should not, in most cases, drive decision making on their own. I will not be ignoring p-values, but I also will not be using them as a hard and fast rule for including or excluding a variable.
For instance, look at our model. Not a single variable is significant at the typical cutoff of 0.05. To me, this does not mean that we should scratch every variable and try again. It does not mean that these variables are useless at predicting rating. It indicates to me that maybe we do not have enough information in the data to predict rating accurately, or maybe we have too many variables.
I think it’s also important to look at the estimates. Some of the estimates have larger values, like Store B. Here, if a certain transaction was done at Store B and all other variables were held constant, we would predict the rating of the transaction to increase by about 0.408. Other estimates have very small values, such as the interaction between unit_price and the log of total. With this term, for every one-unit increase in the product of unit_price and the log of total, we would expect the predicted rating to decrease by 0.00343 points. This does not necessarily mean that this interaction variable has a marginally small effect on rating. Remember that the values for the log of total and unit price can be pretty high and can influence the prediction. The highest unit_price value in train_subset is 99.96, with a corresponding total of 735. Multiplying the log of the total by the unit price gives 659.7231. If we insert that into our model, we can say that for this observation we would expect the predicted rating to decrease by \(659.7231*0.00343 = 2.263\) points!
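If you want to double-check that back-of-the-envelope calculation, it only takes a couple of lines of R (using the rounded coefficient from the table above):
# Value of the interaction term for this observation, roughly 659.72
log(735) * 99.96
# Its contribution to the predicted rating, roughly a 2.26 point decrease
log(735) * 99.96 * 0.00343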
So it’s important to consider the practical piece of your model. Even if a term doesn’t meet the cutoff for its p-value, it may be close enough to still merit inclusion in the final model, especially if you think that it makes sense to include in the context of the dataset.
In the end, it is up to you which variables to use. There are variable selection procedures out there (forward selection, backward selection, and stepwise selection) that can help you with this if you choose to use them. You may also be interested in dimension reduction techniques such as Principal Components Analysis (PCA) and other factor analysis procedures. These are all out of the scope of this guide but can be useful in certain cases. I would recommend treating each method with some skepticism as I believe that no method will be able to tell you with 100% certainty which variables to include (and some of the methods produce models that are tricky to interpret). With the model here, I will be using my judgement by looking at the p-values, values for the estimates, and thinking about the context of the data to determine if I think a variable is useful.
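If you are curious what one of those automated procedures would suggest, here is a minimal sketch of backward selection using base R’s step() on a plain lm fit. It is purely illustrative (main effects only, and not how I will be choosing variables here):
# Fit a plain lm so step() can refit the model as it drops terms
full_lm <- lm(rating ~ store + customer_type + gender + log_total + unit_price + pay_method,
              data = train_subset)
# Backward selection by AIC; trace = FALSE suppresses the step-by-step output
step(full_lm, direction = "backward", trace = FALSE)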
I’ll detail my thoughts below:
- I will be dropping customer_type and pay_method, along with most of the interactions that include them. They don’t seem significant and their estimates are pretty small.

With all this being said, let’s make a reduced model!
mlr_mod_reduced <- lm_spec %>%
fit(rating ~ store + gender + log_total + unit_price + store*gender + log_total*unit_price, data = train_subset)
tidy(mlr_mod_reduced) %>%
mutate(p.value = format.pval(p.value, digits = 4)) %>%
kable() %>%
kable_styling(c("striped", "responsive"))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 5.7255485 | 0.7818191 | 7.3233673 | 5.969e-13 |
storeB | 0.2086592 | 0.2138560 | 0.9756993 | 0.32951 |
storeC | 0.3082288 | 0.2123498 | 1.4515141 | 0.14703 |
genderMale | 0.3533736 | 0.2093694 | 1.6877995 | 0.09184 |
log_total | 0.2111048 | 0.1555520 | 1.3571331 | 0.17513 |
unit_price | 0.0206734 | 0.0139899 | 1.4777356 | 0.13988 |
storeB:genderMale | -0.5281460 | 0.2975677 | -1.7748764 | 0.07630 |
storeC:genderMale | -0.4069470 | 0.2988545 | -1.3616893 | 0.17368 |
log_total:unit_price | -0.0037972 | 0.0025461 | -1.4913614 | 0.13627 |
I’m liking this model better. We have similar results (comparable p-values and decently sized estimates) and we are using fewer variables. Let’s stick with this model for the rest of the guide.
Now let’s assess the model and see if regression was a good choice. We’ll start with overall performance and then check the assumptions.
model_performance(mlr_mod_full) %>%
kable()
AIC | AICc | BIC | R2 | R2_adjusted | RMSE | Sigma |
---|---|---|---|---|---|---|
3157.101 | 3157.978 | 3241.402 | 0.0127147 | -0.0074855 | 1.706068 | 1.724512 |
model_performance(mlr_mod_reduced) %>%
kable()
AIC | AICc | BIC | R2 | R2_adjusted | RMSE | Sigma |
---|---|---|---|---|---|---|
3144.37 | 3144.65 | 3191.204 | 0.0086671 | -0.0013717 | 1.709562 | 1.719272 |
We will look at the AIC and BIC as we did with SLR, and I’d also like to introduce the Root Mean Square Error (RMSE). The RMSE is the standard deviation of our prediction errors. We’d like this to be small, as that means our errors are less spread out (and therefore smaller).
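For reference, the RMSE is calculated as \[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}\] where \(y_i\) is an observed rating and \(\hat{y}_i\) is the model’s prediction for that observation.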
We can’t interpret any of these statistics alone; we use them to compare between models, which is why I ran the model performance function on both our full model and our reduced model. We do see both the AIC and the BIC decrease slightly, which is good. Our RMSE actually increases here, but only by a very small amount. My biggest takeaway is that the reduced model and the full model are very similar in terms of their performance. In cases like this, we would prefer the model with the fewest variables. Why use more when we can do similarly with fewer?
Another thing to note is that our RMSE is pretty large compared to the range of the response variable, rating. Most of our errors are around 1.7, with some extending beyond that (bigger errors) and some falling below it. Given that rating only spans from 4 to 10, an error of 1.7 could mean the difference between a neutral customer and a satisfied customer. Because the error is so big compared to the response, I don’t have much confidence in the accuracy of this model.
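If you’d like to confirm that range for yourself, a one-line check on the training data will do it:
# Span of the response variable in the training data
range(train_subset$rating)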
Now on to the assumptions. Most of our assumptions are the same as SLR; however, we do have one more which involves collinearity (correlation between predictors).
diagnostics <- plot(check_model(mlr_mod_reduced, panel = FALSE))
## Model has interaction terms. VIFs might be inflated.
## You may check multicollinearity among predictors of a model without
## interaction terms.
First, can we assume linearity between the response and the predictor?
diagnostics[[2]] +
my_theme
We are looking for a random scatter of points (and not a pattern). We do see a random scatter here with no general pattern. This plot looks much better than the one we saw for our SLR model. Even though there is a little lift of the line towards the lower end of the predicted values, I’m not concerned.
diagnostics[[3]] +
my_theme
As with SLR, we want to see a flat line which would indicate that our errors have a similar variability (spread along the y-axis) for all predicted values (the x-axis). We don’t want a rainbow or a cone shape and we want to see the reference line be flat. It is true that we have a larger clump of points above the solid line which is bowing the line up slightly but I don’t see much reason for concern here.
We can make a histogram of the errors to check for the normality assumption.
train_aug <- augment(mlr_mod_reduced, new_data = train_subset)
ggplot(data = train_aug, aes(x = .resid)) +
geom_histogram(binwidth = 0.25, fill = "#099392", color = "black") +
xlab("Residuals") +
my_theme
Once again, histograms can be slightly deceiving. This plot does show a decent bell-curve shape, with a peak in the center and fewer observations in the two tails. However, it would be nice to see smaller bumps around -2 and 2. Let’s also look at the diagnostic plot.
diagnostics[[6]] +
my_theme
Not so great! The tail ends of the dots stray far from the line and the prediction intervals (the shaded grey part) have a very large range. Based on this plot, I would not say we have met this assumption and should question the results of our model.
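If you would like a numeric companion to the plot, one option (not part of the workflow above) is a Shapiro-Wilk test on the training residuals; just keep in mind that with several hundred observations it will flag even mild departures from normality:
# Shapiro-Wilk test of normality on the residuals from the reduced model
shapiro.test(train_aug$.resid)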
We also should make sure that our predictor variables are not correlated with each other. Recall earlier how we did see some slight correlation between some of our variables and we created an interaction term to account for this. Using interaction terms is one way of helping to meet this assumption. Another way is to simply remove one of the correlated variables from the model.
We can use variance inflation factors to look at collinearity. However, you may have noticed above when we generated our diagnostic plots that a warning message was printed. It said
Model has interaction terms. VIFs might be inflated. You may check multicollinearity among predictors of a model without interaction terms.
R has recognized that we have an interaction term. Naturally, we can expect the interaction term and the variables that make up that interaction term to be correlated. In fact, if we look at the plot here,
diagnostics[[5]] +
my_theme
check_collinearity(mlr_mod_reduced)
## Model has interaction terms. VIFs might be inflated.
## You may check multicollinearity among predictors of a model without
## interaction terms.
## # Check for Multicollinearity
##
## Moderate Correlation
##
## Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
## log_total 5.44 [ 4.81, 6.17] 2.33 0.18 [0.16, 0.21]
##
## High Correlation
##
## Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
## unit_price 36.92 [32.27, 42.25] 6.08 0.03 [0.02, 0.03]
we see that the log of the total and, especially, the unit price have some potential for high correlation with the other terms in the model (including the interaction built from them). This is a problem, but let’s try checking this assumption on the model without any interactions.
mlr_mod_noint <- lm_spec %>%
fit(rating ~ store + gender + log(total) + unit_price, data = train_subset)
plot(check_model(mlr_mod_noint, panel = FALSE))[[5]] +
my_theme
check_collinearity(mlr_mod_noint)
## # Check for Multicollinearity
##
## Low Correlation
##
## Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
## log(total) 1.58 [1.45, 1.75] 1.26 0.63 [0.57, 0.69]
## unit_price 1.59 [1.46, 1.76] 1.26 0.63 [0.57, 0.69]
Removing our interactions has helped with our VIFs so I would
consider this assumption to be met although I do want to question
store
due to the large margin of error on its VIF.
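If you wanted to keep the interaction terms while taming the inflated VIFs, one common option (not used in this guide) is to center the numeric predictors before forming the interactions. Here is a rough sketch; the _c variable names are my own:
# Center the numeric predictors around their training means
train_centered <- train_subset %>%
  mutate(log_total_c = log_total - mean(log_total),
         unit_price_c = unit_price - mean(unit_price))
# Refit the reduced model using the centered predictors in the interaction
mlr_mod_centered <- lm_spec %>%
  fit(rating ~ store * gender + log_total_c * unit_price_c, data = train_centered)
check_collinearity(mlr_mod_centered)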
It’s also a good idea to look at any extreme observations. It is possible that one or a few observations are driving the analysis and are skewing our results.
diagnostics[[4]] +
my_theme
We can look at the plot above to find observations with high
leverage, or significantly different \(x\) values (log_total
values).
I do not see any points outside of the dashed lines so we are ok here.
The plot uses Cook’s \(d\), or Cook’s
distance, to quantify how influential a point is and we want to keep all
values inside of the cutoff values for Cook’s \(d\) (the dashed lines).
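If you want the actual Cook’s distance values rather than just the plot, you can pull the underlying lm object out of the parsnip fit; a quick sketch:
# Extract the lm object and list the largest Cook's distances
cooks_d <- cooks.distance(extract_fit_engine(mlr_mod_reduced))
head(sort(cooks_d, decreasing = TRUE))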
Let’s also take a quick peek at our testing data. How well does the model perform?
test_aug <- augment(mlr_mod_reduced, new_data = test_subset)
ggplot(data = test_aug, aes(x = .pred, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Errors vs. Predicted",
subtitle = "MLR Reduced Model Testing Dataset") +
xlab("Fitted values") +
ylab("Errors") +
my_theme
It looks like we still see a big range in our errors (between -3 and 3). We also see again that the fitted values only range from about 6.8 to 7.3, which doesn’t represent the rating variable accurately.
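To put a single number on that test-set error (something the comparison above didn’t include), we can use yardstick’s rmse(), which is loaded with tidymodels:
# RMSE of the reduced model on the held-out test set
rmse(test_aug, truth = rating, estimate = .pred)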
All in all, regression was not the best model to use to predict rating. The variables in the dataset simply do not hold enough information to make good predictions. We saw very low \(R^2\) values, large errors relative to the response, a violated assumption, and predictions that do not reflect the full range of possible values for our response. Although we won’t move
forward with this example, it would be a good idea to explore the
variables more. Perhaps we could create a variable of our own to use in
the model. Perhaps we need to collect more data (more observations and
more variables) to improve our accuracy. Perhaps linear regression just
isn’t the best model choice for predicting rating. There definitely is
more work that can be done here to try and predict a customer’s rating
of a transaction (and hence their satisfaction).