Hey there! This is the fourth guide in my Modeling Guide with R series. In the previous guide, we looked at multiple linear regression. This page will focus on logistic regression. Although I cannot cover all of the details and fine use cases of regression, we will explore some of the key ideas to focus on when creating and analyzing a regression model. We’ll start by importing our data and identifying variables to use in the model, then move to creating, interpreting, and testing the model. As with SLR and MLR, we’ll use train/test validation and we’ll also look at a couple of candidate models.
We will be using the same dataset we have been using, which covers transaction data. As mentioned in the previous guide, I prefer to use the tidyverse family of packages. We’ll also be using the tidymodels collection of packages to set up the models and perform our train/test validation.
The code below imports our packages and the data we cleaned in the first guide. It also splits the data into a train and test set. We will train our models using the training set and test their performance using the test set. We do this to simulate how the model will perform in the “real world”.
I will be setting the seed for the random number generator so we can get the same results each time I start a new R session and split the data. You may or may not wish to do this in your work. I want to make sure we get a roughly equal amount of each store in the split sets, so I will take a stratified sample to make the split (strata = store).
library(tidyverse)
library(tidymodels)
library(GGally)
library(gvsu215)
library(car)
library(performance)
library(knitr)
library(kableExtra)
retail <- read_csv(here::here("data/retail_clean.csv"))
set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)
train <- training(retail_split)
test <- testing(retail_split)
Alright! Now that we have that set, we’ll use the train
data for all the model creation and will only use the test
data when we are analyzing our models.
Although most of the processes for logistic regression will be the same as or very similar to SLR and MLR, there are a few key differences. The key piece is what our response variable looks like. Remember, for SLR and MLR, our response variable had to be continuous. But what if you want to predict a variable that isn’t continuous? Well, logistic regression may come to the rescue.
In order to get valid results from logistic regression, we must meet certain assumptions: a binary response, independent observations, little to no multicollinearity among the predictors, a linear relationship between any continuous predictors and the log odds, and no empty cells among combinations of the categorical variables. We will check these as we go.
First, let’s decide what we want to predict. Again, we need a variable that has two levels. We have two options: customer_type and gender. I would like to focus on customer type and see if we can predict the probability that a certain customer is a member or not given information about their transaction. Do members shop differently than non-members? I am inclined to say so.
We should also define our “base” level for customer type, or which type we want to compare everything to. I am going to create a true binary variable for customer type which takes on a value of 1 if a customer is a member and 0 if they are not. We also need to coerce this variable into an R factor in order for the code later on to work.
train <- train %>%
mutate(member = as.factor(ifelse(customer_type == "Member", 1, 0)))
test <- test %>%
mutate(member = as.factor(ifelse(customer_type == "Member", 1, 0)))
Technically, we can split up logistic regression into simple logistic regression (one predictor variable) and multiple logistic regression (more than one predictor variable). However, I am going to jump right into multiple logistic regression to create a more detailed model. As I did with MLR, I am going to start by choosing every variable that I think might play a role in predicting the member status of a customer and use the regression output to narrow things down.
I would like to start with:

- store
- gender
- product_class
- pay_method
- rating
- total
Before beginning, we should look at our variables to see if we need to apply any transformations or add any interaction variables.
train_subset <- train %>%
select(member, store, gender, product_class, pay_method, rating, total)
test_subset <- test %>%
select(member, store, gender, product_class, pay_method, rating, total)
ggpairs(train_subset) +
my_theme
My main observation here is that total is skewed right, which is causing some outliers with the other variables. The values for total can also get quite large. If we’re not careful, these large values could influence our results.
Here is that distribution up close:
train_subset %>%
ggplot(aes(total)) +
geom_histogram(binwidth = 20, color = "black", fill = "#099392") +
labs(x = "Total ($)",
y = "Count",
title = "Distribution of Transaction Totals") +
my_theme
Previously, we used a log transformation to help correct this. Let’s try something different here and take the square root of total to help bring down those extreme values.
train_subset %>%
ggplot(aes(sqrt(total))) +
geom_histogram(color = "black", fill = "#099392") +
labs(x = "Total",
y = "Count",
title = "Distribution of the Square Root of Transaction Totals") +
my_theme
That looks better! This is not ideal, as it affects our interpretation of results, but it does help to fix the skewness. I don’t see much evidence for creating an interaction variable for this model. However, I originally considered making one for store vs. gender and/or pay method vs. gender. I don’t think they’ll be necessary, but I will include them in our first model nonetheless.
Before we actually build the model, let’s take a look at how logistic regression differs from SLR and MLR.
Instead of predicting the value of a continuous response, we are measuring (predicting) the probability of being in a certain class, here whether a customer is a member or not. Since we chose to have “Member” be represented by a 1, we are essentially finding the odds of a certain customer being a member, given the values for the other variables. We can define the odds of a “success” (a customer being a member) as:
\[ \frac{Pr(Y_i = 1)}{Pr(Y_i = 0)} = \frac{p_i}{1-p_i} \] which is the probability of a given person being a member divided by one minus that same probability.
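For example, if a given customer has a 0.75 probability of being a member, their odds of being a member are \( \frac{0.75}{1 - 0.75} = 3 \), or 3 to 1.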
The math behind predicting this can get pretty involved. However, I would like to show that if we take the log of the above odds, we get a formula like this (which is derived from the more complicated logistic equation):
\[ \begin{equation*} \log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 X_1 + ... \end{equation*} \]
This should look familiar from linear regression! We have a list of predictors, and we would like to estimate the parameters that allow us to predict how likely it is that a customer is a member. This response is formally called the “log odds” or “logit”.
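Base R has helper functions for moving between these two scales, which can make the logit feel less abstract. A quick demonstration (not part of the modeling workflow):

# qlogis() converts a probability to the log odds (logit) scale;
# plogis() is its inverse, converting log odds back to a probability.
p <- 0.75
qlogis(p)          # log(0.75 / 0.25) = log(3), about 1.099
plogis(qlogis(p))  # back to 0.75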
To build the model, we use similar code as before. This time, we are
using the logistic_reg()
function rather than the
linear_reg()
function.
log_spec <- logistic_reg() %>%
set_engine("glm")
log_mod_full <- log_spec %>%
  fit(member ~ store + gender + product_class + pay_method + rating +
        store*gender + pay_method*gender,
      data = train_subset)
tidy(log_mod_full) %>%
kable() %>%
kable_styling(c("striped", "responsive"))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.1695912 | 0.3994688 | -0.4245419 | 0.6711706 |
storeB | 0.2453408 | 0.2515595 | 0.9752794 | 0.3294217 |
storeC | 0.2404212 | 0.2501605 | 0.9610677 | 0.3365181 |
genderMale | 0.0203514 | 0.3171467 | 0.0641702 | 0.9488347 |
product_classFashion accessories | 0.0552697 | 0.2425452 | 0.2278738 | 0.8197443 |
product_classFood and beverages | 0.3305222 | 0.2451912 | 1.3480184 | 0.1776525 |
product_classHealth and beauty | 0.1288623 | 0.2520721 | 0.5112120 | 0.6092026 |
product_classHome and lifestyle | 0.2538179 | 0.2497428 | 1.0163173 | 0.3094783 |
product_classSports and travel | 0.3327949 | 0.2463026 | 1.3511623 | 0.1766434 |
pay_methodCredit card | 0.2423094 | 0.2484766 | 0.9751796 | 0.3294712 |
pay_methodEwallet | -0.0155909 | 0.2475790 | -0.0629733 | 0.9497878 |
rating | -0.0226503 | 0.0419418 | -0.5400406 | 0.5891690 |
storeB:genderMale | -0.3077068 | 0.3501435 | -0.8788020 | 0.3795086 |
storeC:genderMale | -0.2153436 | 0.3516872 | -0.6123157 | 0.5403289 |
genderMale:pay_methodCredit card | 0.0563912 | 0.3549803 | 0.1588573 | 0.8737813 |
genderMale:pay_methodEwallet | -0.0468723 | 0.3479201 | -0.1347214 | 0.8928321 |
The tricky part about logistic regression is the interpretation. Since we are predicting the log odds of obtaining a customer that is a member, each value in the estimate column above can be interpreted as the change in the log odds of finding a member for a unit increase in the respective \(X\) variable.
So if a customer shops at Store C, their predicted log odds of being a member increase by 0.240, and therefore their predicted odds of being a member are multiplied by \(e^{0.240} \approx 1.271\) (a 27.1% increase), holding the other variables constant.
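Exponentiating the entire estimate column lets us read every coefficient on the odds scale at once. A quick sketch using the broom output we already have:

# Convert each estimate from the log odds scale to an odds ratio.
tidy(log_mod_full) %>%
  mutate(odds_ratio = exp(estimate)) %>%
  select(term, estimate, odds_ratio)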
One thing to notice here is that none of the p-values for the estimates are statistically significant at the typical 0.05 level. I mentioned this in the previous guide, but it is important to note the difference between statistical significance and practical significance. It may be that we don’t meet the 0.05 requirement, but sometimes the p-value may be small enough to still mean something to us. At this moment, we are looking at the different variables that may help us predict if someone is a member or not. We don’t necessarily need statistical significance as long as we are still getting good insight into our question.
We can also explore our variables a little bit further. Using the following plot, we can look at the distribution of each of our variables separated by the member status of a customer. I will be dropping the interaction variables since they didn’t seem to have much of an effect, as expected.
ggbivariate(train, outcome = "customer_type", explanatory = c("store", "gender", "product_class", "pay_method", "rating")) +
scale_fill_brewer(palette = "Dark2") +
my_theme
My biggest takeaway here is that we have a pretty even split between members and non-members. In other words, both members and non-members engage in similar transactions. It may not be possible to accurately predict member status using the information provided in this dataset.
With that being said, I would like to at least try a reduced model to see what the results are. I will use the p-values from above as well as the parameter estimates to guide my decision making and to create a reduced model. Some of these variables don’t seem to be very useful for predicting member status, so let’s remove them from the model and try again, this time with just product_class and store.
log_mod_reduced <- log_spec %>%
  fit(member ~ product_class + store, data = train_subset)
tidy(log_mod_reduced) %>%
kable() %>%
kable_styling(c("striped", "responsive"))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.2541328 | 0.1980084 | -1.2834442 | 0.1993365 |
product_classFashion accessories | 0.0570579 | 0.2401610 | 0.2375820 | 0.8122053 |
product_classFood and beverages | 0.3517674 | 0.2436186 | 1.4439266 | 0.1487596 |
product_classHealth and beauty | 0.1242045 | 0.2495210 | 0.4977716 | 0.6186450 |
product_classHome and lifestyle | 0.2477315 | 0.2481198 | 0.9984347 | 0.3180686 |
product_classSports and travel | 0.3583982 | 0.2447623 | 1.4642709 | 0.1431199 |
storeB | 0.1041356 | 0.1736868 | 0.5995593 | 0.5488000 |
storeC | 0.1333407 | 0.1742284 | 0.7653210 | 0.4440804 |
This model isn’t much better, but we do see some evidence that certain product classes and stores are more associated with members than others. For instance, if a customer purchases a food/beverage item, the log odds of them being a member increase by about 0.352 relative to the baseline product class. We can also see this with sports and travel goods. Perhaps members get a slight discount on items in these areas so they purchase more. I would be interested to explore this idea more, especially if I had more data and if I could explore purchase tendencies over time.
In general, the reduced model doesn’t perform much better than the full model, but there are some small improvements, such as smaller p-values and larger coefficient estimates. Let’s stick with this reduced model moving forward.
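If we wanted a more formal comparison, a likelihood ratio test on the two nested models is one option. Here is a sketch, pulling out the underlying glm objects with extract_fit_engine() (the same trick we will use later for the collinearity check):

# Likelihood ratio test: does the full model fit significantly better
# than the reduced model? A large p-value favors the simpler model.
anova(extract_fit_engine(log_mod_reduced),
      extract_fit_engine(log_mod_full),
      test = "Chisq")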
I’m not a huge fan of this model for use in inference and predicting future values. We just don’t have enough information about the customers and transactions to make confident predictions. However, for educational purposes, I would like to see how well the model performs when making predictions. We should also check our assumptions.
log_aug <- augment(log_mod_reduced, new_data = train_subset) %>%
  mutate(id = row_number())

ggplot(data = log_aug, aes(x = .pred_class, y = member)) +
  geom_jitter() +
  labs(x = "Predicted Member Status",
       y = "Actual Member Status",
       title = "Member Status: Predicted vs. Actual",
       subtitle = "Training Data") +
  my_theme
This plot shows us how well the model did. It compares the actual member status with what the model would predict a customer to be. I did add a little bit of jitter to the plot so we could see each observation better (otherwise they all would lie on top of each other).
The ideal situation is 100% prediction accuracy where all of the points lie at (0, 0) meaning we predicted a customer to not be a member and they actually weren’t or at (1, 1) meaning we predicted a customer to be a member and they actually were. In the real world, we will have some error and will misclassify some observations, which appear in (0, 1) and (1, 0). Looking at this plot, we see that we did in fact predict the member status correctly for many customers!
My main concern is that there are roughly equal numbers of observations in each quadrant. The model seems to do little better than just guessing the member status, in which case we would expect to get about 50% accuracy. Let’s see the actual numbers of our predictions.
log_aug %>%
rename("Predicted" = .pred_class,
"Actual" = member) %>%
tbl_2var(Predicted~Actual,
caption = "Predicted Member Status vs. Actual Member Status")
| Predicted | Actual: 0 | Actual: 1 | Total |
|---|---|---|---|
| 0 | 214 | 199 | 413 |
| 1 | 183 | 203 | 386 |
| Total | 397 | 402 | 799 |
We see that we got \(214 + 203 = 417\) observations correctly predicted which means that we incorrectly predicted 382. We were \(\frac{417}{799} * 100 = 52.19\%\) accurate. Not much better than just randomly guessing.
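Rather than tallying by hand, yardstick (loaded with tidymodels) can compute the same numbers directly from the augmented data. A quick sketch:

# Confusion matrix and overall accuracy on the training data.
log_aug %>%
  conf_mat(truth = member, estimate = .pred_class)

log_aug %>%
  accuracy(truth = member, estimate = .pred_class)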
It’s also a good idea to check our prediction accuracy with the testing dataset, although I don’t have much hope.
log_aug_test <- augment(log_mod_reduced, new_data = test)

ggplot(data = log_aug_test, aes(x = .pred_class, y = member)) +
  geom_jitter() +
  labs(x = "Predicted Member Status",
       y = "Actual Member Status",
       title = "Member Status: Predicted vs. Actual",
       subtitle = "Testing Data") +
  my_theme
log_aug_test %>%
rename("Predicted" = .pred_class,
"Actual" = member) %>%
tbl_2var(Predicted~Actual,
caption = "Predicted Member Status vs. Actual Member Status")
| Predicted | Actual: 0 | Actual: 1 | Total |
|---|---|---|---|
| 0 | 53 | 47 | 100 |
| 1 | 49 | 52 | 101 |
| Total | 102 | 99 | 201 |
We see results similar to the training data. We did correctly predict some observations, but there are still quite a few that we missed. To quantify it, we correctly predicted \(53 + 52 = 105\) out of 201 customer member statuses, which gives an accuracy rate of \(\frac{105}{201} * 100 = 52.24\%\).
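The same yardstick approach works on the testing data, and a metric set makes it easy to report sensitivity and specificity alongside accuracy. A sketch:

# Accuracy, sensitivity, and specificity on the testing data.
# Note: yardstick treats the first factor level (here, 0) as the
# "event" by default.
class_metrics <- metric_set(accuracy, sens, spec)

log_aug_test %>%
  class_metrics(truth = member, estimate = .pred_class)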
model_performance(log_mod_full) %>%
kable()
AIC | AICc | BIC | R2_Tjur | RMSE | Sigma | Log_loss | Score_log | Score_spherical | PCP |
---|---|---|---|---|---|---|---|---|---|
1129.119 | 1129.814 | 1204.052 | 0.0130321 | 0.4967318 | 1 | 0.6865573 | -280.8741 | 0.0019635 | 0.5065354 |
model_performance(log_mod_reduced) %>%
kable()
AIC | AICc | BIC | R2_Tjur | RMSE | Sigma | Log_loss | Score_log | Score_spherical | PCP |
---|---|---|---|---|---|---|---|---|---|
1119.157 | 1119.339 | 1156.624 | 0.0055728 | 0.4985954 | 1 | 0.6903358 | -279.2792 | 0.0062764 | 0.5028059 |
Our full and reduced models are very similar. The AIC and BIC values differ very little between the two, although they have decreased with the reduced model. We also see a slight increase in the RMSE, which is not ideal, but an increase that small isn’t concerning. I do want to look at Tjur’s \(R^2\) value, which gives us a “typical” \(R^2\) for regression. This number is technically called the coefficient of discrimination but can be interpreted like a standard \(R^2\). So we explain about 1.3% of the variance in customer_type in our full model and only 0.6% with the reduced model, giving further evidence that our model is not great.
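To make Tjur’s \(R^2\) a little less mysterious, we can compute it by hand: it is simply the mean predicted probability among actual members minus the mean among non-members. A sketch, assuming log_aug still carries the .pred_1 probability column from augment():

# Tjur's R^2 (coefficient of discrimination): average predicted probability
# of membership for actual members minus the average for non-members.
log_aug %>%
  group_by(member) %>%
  summarize(mean_pred = mean(.pred_1)) %>%
  summarize(r2_tjur = diff(mean_pred))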
diagnostics <- plot(check_model(log_mod_reduced, panel = FALSE, type = "discrete_dots"))
The first assumption is a binary response. We met this when we chose variables earlier: our response variable, member, is binary.
Next is independence of observations. We are assuming this to be true, trusting the person/service that collected this data. It is possible, however, that one person is represented more than once or that one transaction influenced another.
The third assumption is little to no multicollinearity among the predictors. We used output from ggbivariate to explore this a bit above. I don’t think we have evidence to say that the predictors are correlated with each other.
However, we can use variance inflation factors (VIFs) to get a better idea of how correlated the predictor variables are.
log_mod_reduced %>%
extract_fit_engine() %>%
check_collinearity()
## # Check for Multicollinearity
##
## Low Correlation
##
## Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
## product_class 1.01 [1.00, 5.21] 1.01 0.99 [0.19, 1.00]
## store 1.01 [1.00, 5.21] 1.01 0.99 [0.19, 1.00]
The output states that we have low correlation between our predictors, which is great! We can also verify this by looking at the VIF column. All values are much less than 10, the common cutoff for evidence of multicollinearity.
The next assumption is a linear relationship between any continuous predictors and the log odds, which we can check by plotting each such variable against the log odds. I am going to skip a formal check since I am not confident in the model and since I will not be using it for any inference, but a sketch of what it might look like follows.
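Here is roughly what that check could look like for a continuous predictor, sketched with rating from the full model (the reduced model has only categorical predictors, so there is nothing to check there). qlogis() converts the fitted probabilities back to the log odds scale:

# Plot a continuous predictor against the log odds of the fitted values.
# A roughly linear pattern would support the assumption.
augment(log_mod_full, new_data = train_subset) %>%
  mutate(log_odds = qlogis(.pred_1)) %>%
  ggplot(aes(x = rating, y = log_odds)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  labs(x = "Rating", y = "Log Odds of Membership") +
  my_theme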
We might also want to check on our errors.
diagnostics[[5]] +
my_theme
We definitely don’t meet this condition, as our residuals do not lie along the reference line. We should question the validity of our results.
Finally, we need no empty cells: every combination of the categorical variables should contain at least one observation. This is met; we have actually already verified it by looking at tables and counts. There are no combinations of variables with a zero count. For instance, the number of members who went to Store A is non-zero, as is the number of non-members who bought a food/beverage item.
As we saw throughout this guide, using logistic regression to predict a customer’s member status did not turn out as well as we might have hoped. Our error rate was just about as good as guessing randomly (a 50/50 chance). That being said, I do think we gained some valuable insight into customer tendencies, even if we are not able to accurately predict member status. We were able to look at some of the product classes that members are more likely to purchase and found that members were more likely to make purchases using a credit card. It is also worth noting that the range of transaction ratings is roughly the same regardless of whether someone is a member or not. This could be useful to the company, as it allows them to focus on improving the quality of all transactions without needing to worry about whether someone is a member. All in all, it is important to remember that logistic regression is best used when the response variable is binary (has two levels), with predictor variables that can be either continuous or categorical.