Hey there! This is the second guide in my Modeling Guide with R series. In the previous guide, we performed some preliminary explorations on our dataset and took a look at a few of the variables. We cleaned the dataset up a bit and exported it so we could read it again in later guides.
This page will focus on both simple and multiple linear regression. Although I can’t cover every detail and use case of linear regression, we will explore the key ideas to focus on when creating and analyzing a linear regression model. We’ll start by importing our data and identifying variables to use in the model, then move on to creating, interpreting, and testing the model. For both simple and multiple linear regression, we’ll use train/test validation and look at several candidate models.
We will be using the same dataset for both the simple and multiple regression cases, so let’s get that set up first. As mentioned in the previous guide, I prefer to use the tidyverse family of packages. We’ll also be using the tidymodels collection of packages to set up the models and perform our train/test validation, and GGally to make a scatterplot matrix. The performance package helps us check the model after we’ve created it, and the other packages are helpful for rendering good tables.
The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train set and a test set. We will train our models using the training set and test their performance using the test set. We do this to simulate how the model will perform in the “real world”.
I will be setting the seed for the random number generator so we get the same results each time I start a new R session and split the data. You may or may not wish to do this in your work. I also want a roughly equal share of each store in the split sets, so I will take a stratified sample to make the split (strata = store).
# Load the packages for modeling, plotting, and tables
library(tidyverse)
library(tidymodels)
library(GGally)
library(performance)
library(knitr)
library(kableExtra)

# Import the data cleaned in the previous guide
retail <- read_csv(here::here("data/retail_clean.csv"))

# Split into training (80%) and testing sets, stratified by store
set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)
train <- training(retail_split)
test <- testing(retail_split)
Alright! Now that we have that set up, we’ll use the train data for all the model creation and will only use the test data when we are analyzing our models.
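If you want to double-check that the stratification worked, a quick sketch like the one below compares the share of each store (the column we stratified on) in the two sets; the proportions should come out roughly equal.

# Compare the share of each store in the training and testing sets;
# they should be roughly equal because we stratified on store
train %>% count(store) %>% mutate(prop = n / sum(n))
test %>% count(store) %>% mutate(prop = n / sum(n))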
Although I’ll avoid any complex math throughout this guide, it is important to think about how regression works. The formula for simple linear regression for a population is
\[ y = \beta_0 + \beta_1 x_1 + \varepsilon \]
which states that we can represent the relationship between our predictor variable and our response variable as a straight line plus some natural error (\(\varepsilon\)). The problem is, we don’t know the information about our population (i.e., we don’t have data from all transactions from all stores). Instead, we have a subset of that data (some transactions from some stores) and are looking to predict the population values using our sample. As a result, we are creating a model of the form
\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x_1 \] There are a couple of differences here. First, we’ve lost the error term: the fitted line predicts the average response rather than any single observation, so there is no error term left to write down. You also now see hats above a few of the values, which indicates that they are estimates of the true population values.
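To make the notation concrete, suppose (with purely made-up numbers) that the estimates came out to \(\hat{\beta_0} = 7\) and \(\hat{\beta_1} = 0.02\). For a transaction with \(x_1 = 5\), the model would then predict \[ \hat{y} = 7 + 0.02 \times 5 = 7.1 \]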
When might we want to use SLR? Well, first we need to make sure we have the correct variable types: a quantitative (continuous) response variable and a quantitative predictor variable.
Second, we have to meet certain criteria. See, simple linear regression can be somewhat limiting, as the mathematical formulas and matrix algebra behind it require us to make certain assumptions about our data and the population we are hoping to model. This is what we should pay attention to:

- a roughly linear relationship between the predictor and the response
- errors with a roughly constant spread across the fitted values
- errors that are roughly normally distributed
- no overly influential observations driving the fit
These will be explored later on when we investigate our variables.
Given all of that information, let’s choose the variables we want to use in the model. Suppose we want a way to predict the rating of a transaction. The rating variable is continuous, so it is acceptable to use as a response variable. To try to predict rating, we will consider a few candidate models (each with one predictor variable). Let’s start with total and unit_price as possible predictors.
At this point, it’s a good idea to write down your predictions about what might happen. Doing this before looking at any results helps keep bias out of the analysis. Right now, I am thinking that unit_price will have the most effect on rating, since when items cost more, people are more likely to be grumpy and give a lower rating.
Using the formula above, we can create two possible models that we want to use to predict rating:
\[ \widehat{\text{rating}} = \hat{\beta_0} + \hat{\beta_1} * \text{total} \]
\[ \widehat{\text{rating}} = \hat{\beta_0} + \hat{\beta_1} * \text{unit_price} \] In each case, we are trying to find the values for \(\hat{\beta_0}\) and \(\hat{\beta_1}\) using our predictor variable.
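Before fitting anything, it can help to glance at the pairwise relationships between these variables. This is what we loaded GGally for; a minimal sketch using the training data from above might look like this:

# Scatterplot matrix of the response and the two candidate predictors
train %>%
  select(rating, unit_price, total) %>%
  ggpairs()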
Let’s build the model! We will use the tidymodels family of packages for this. We first set up the regression engine, then fit the models with our data. In the first guide, we saw that total is right-skewed due to some outliers on the high end, so we are going to take the log of total.
# Keep only the variables we need and add the log of total
train_subset <- train %>%
  select(rating, unit_price, total) %>%
  mutate(log_total = log(total))

# Specify a linear regression model that uses the lm engine
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# Fit one candidate model per predictor
slr_mod_price <- lm_spec %>%
  fit(rating ~ unit_price, data = train_subset)

slr_mod_total <- lm_spec %>%
  fit(rating ~ log_total, data = train_subset)
Above, I fit two different models: one with unit_price and one with the log of total. Let’s take a look at the output for both of them.
tidy(slr_mod_price) %>%
  kable() %>%
  kable_styling(c("striped", "responsive"))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 6.987286 | 0.1422717 | 49.1122698 | 0.0000000 |
unit_price | 0.000548 | 0.0023023 | 0.2380066 | 0.8119372 |
tidy(slr_mod_total) %>%
  kable() %>%
  kable_styling(c("striped", "responsive"))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 6.9220710 | 0.366991 | 18.8616934 | 0.0000000 |
log_total | 0.0176607 | 0.066701 | 0.2647747 | 0.7912515 |
We can use this output to create a regression line equation using the above formulas:
\[ \widehat{\text{rating}} = 6.99 + 0.000548 * \text{unit price} \]
\[ \widehat{\text{rating}} = 6.92 + 0.0177 * \log{\text{(total)}} \] In both cases, the estimated slope is very close to zero and the p-value is large (about 0.81 for unit_price and 0.79 for the log of total), so we have very little evidence of a linear relationship between rating and either predictor.
Let’s see how the models perform! My guess is not very well based on the information so far.
model_performance(slr_mod_price) %>%
kable()
AIC | AICc | BIC | R2 | R2_adjusted | RMSE | Sigma |
---|---|---|---|---|---|---|
3137.269 | 3137.299 | 3151.319 | 7.11e-05 | -0.0011835 | 1.716958 | 1.71911 |
model_performance(slr_mod_total) %>%
kable()
AIC | AICc | BIC | R2 | R2_adjusted | RMSE | Sigma |
---|---|---|---|---|---|---|
3137.255 | 3137.286 | 3151.305 | 8.8e-05 | -0.0011666 | 1.716943 | 1.719096 |
We can get a good idea of this by looking at \(R^2\), the proportion of the variance in rating that the model explains. Whether we use unit_price or the log of total, we explain less than 1% of the variance in rating. Not so good! There is very little difference between the two models. The AIC, or Akaike information criterion, gives us a relative measure of model error, and we want to see it decrease as we look at different models. The same goes for the BIC, or Bayesian information criterion. We also want to see \(R^2\) increase.
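If you prefer to see the candidates side by side, the performance package also has compare_performance(); assuming it accepts our parsnip fits the same way model_performance() did above, a sketch would be:

# Put the two candidate models into one comparison table
compare_performance(slr_mod_price, slr_mod_total) %>%
  kable() %>%
  kable_styling(c("striped", "responsive"))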
Since the two models are so similar, I will finish the SLR section with only one of them: the model with log(total), which does seem to be the slightly “better” model here.
The performance package offers us a nice set of plots to help answer whether we have met the assumptions.
# Build the diagnostic plots individually (panel = FALSE) so we can show them one at a time
diagnostics <- plot(check_model(slr_mod_total, panel = FALSE))
First, can we assume linearity between the response and the predictor?
diagnostics[[2]] +
my_theme
Well, this is a great example of how most real-life data is not going to be perfectly linear. The points here look more like a random scatter than a clear linear pattern, and the reference line in the plot is not flat for the lower fitted values, so I wouldn’t say that we have satisfied this condition.
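Another way to eyeball linearity is to plot the raw variables against each other with a straight-line fit layered on top; a quick sketch using the training subset from earlier:

# rating versus log(total) with a linear fit on top
ggplot(train_subset, aes(x = log_total, y = rating)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE)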
cor(train$rating, log(train$total)) %>%
  round(5) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
x |
---|
0.00938 |
We can also see that the correlation between the two variables is very close to zero, indicating that there is little evidence of a linear relationship between rating and the log of total.
diagnostics[[3]] +
my_theme
Using the third plot, do we see any curvature in the reference line? Are the errors more spread out (vertically) at some fitted values than others? I do not see much evidence of either issue, other than perhaps toward the lower end of the fitted values. At nearly every fitted value, the residuals span from about -2.5 to 2.5, and the line is very close to flat.
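For a more formal take on constant variance, performance also provides check_heteroscedasticity(), which runs a statistical test on the residuals; assuming it accepts the parsnip fit the same way check_model() did, the call is simply:

# Statistical test for non-constant error variance
check_heteroscedasticity(slr_mod_total)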
We can make a histogram of the errors to check for the normality assumption.
# Add predictions and residuals to the training data
train_aug <- augment(slr_mod_total, new_data = train_subset)

ggplot(data = train_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.25, fill = "#099392", color = "black") +
  xlab("Residuals") +
  ylab("Count") +
  labs(title = "Residual Histogram") +
  my_theme
At first glance, this looks OK! It’s not perfect by any means, but there is a single peak in the middle with lower tails on either side. However, a histogram is not the best tool for checking this assumption since we are mostly eyeballing it.
diagnostics[[5]] +
my_theme
Let’s use this plot, which is very similar to a traditional Q-Q plot: it plots our errors against what we would expect from a perfectly normal distribution of errors. If our errors were close to normally distributed, they would lie flat along the line, but here we see some pretty wonky tails. I would not consider this assumption to be met.
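If we want a number to go with the picture, a Shapiro-Wilk test on the residuals is one common (if imperfect) complement to eyeballing the Q-Q plot; using the augmented training data from above:

# Shapiro-Wilk test of normality on the training residuals
shapiro.test(train_aug$.resid)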
It’s also a good idea to look at any extreme observations. It is possible that one or a few observations are driving the analysis and are skewing our results.
diagnostics[[4]] +
my_theme
We can look at the plot above to find observations with high leverage, that is, with unusually extreme \(x\) values (log_total values). The plot uses Cook’s \(d\), or Cook’s distance, to quantify how influential a point is, and we want all values to stay inside the Cook’s \(d\) cutoffs (the dashed lines). I do not see any points outside of the dashed lines, so we are OK here.
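If you want the actual Cook’s distance values rather than just the plot, one option is to pull the underlying lm object out of the parsnip fit (it lives in the fit element) and use base R’s cooks.distance(); a sketch:

# Largest Cook's distance values, computed from the underlying lm fit
cooks.distance(slr_mod_total$fit) %>%
  sort(decreasing = TRUE) %>%
  head()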
We can run the model on the testing dataset we created earlier to see how accurate the model is.
# Prepare the test data the same way as the training data
test_subset <- test %>%
  mutate(log_total = log(total)) %>%
  select(rating, log_total)

# Add predictions and residuals
test_aug <- augment(slr_mod_total, new_data = test_subset)

head(test_aug) %>%
  kable() %>%
  kable_styling(c("striped", "responsive"))
.pred | .resid | rating | log_total |
---|---|---|---|
7.035840 | -2.9358402 | 4.1 | 6.441929 |
7.013923 | 2.8860768 | 9.9 | 5.200925 |
7.021534 | -1.0215340 | 6.0 | 5.631873 |
6.984960 | -0.2849597 | 6.7 | 3.560932 |
7.030404 | 0.5695961 | 7.6 | 6.134109 |
7.019795 | 0.6802047 | 7.7 | 5.533421 |
This has added two new columns to our dataset: .pred and .resid. Now we have our predictions and errors!
mean(test_aug$.resid) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
x |
---|
-0.224745 |
It looks like, on average, we over-predict rating by about 0.225 points. We can also plot our actual data against the predicted data.
test_aug %>%
  ggplot(aes(x = .pred, y = rating)) +
  geom_point() +
  labs(title = "Predicted Rating vs. Actual Rating",
       x = "Predicted Rating",
       y = "Actual Rating") +
  my_theme
We didn’t do very well. Notice how the predicted rating only ranges from about 6.97 to 7.06, whereas the actual rating ranges from 4 to 10. This produces some of our huge errors and leads to a poor model and failed model assumptions.
All in all, simple linear regression was not a good choice for predicting rating. With only one variable, we don’t get much information to help us predict rating accurately. Moreover, we can’t be certain that we meet the linearity assumption, so our results might be in question. My original prediction was not correct: neither unit_price nor total alone tells us much about rating.