Hey there! This is the fifth guide in my Introduction to Modeling with R series. In the previous guide, we looked at logistic regression which involved a binary response variable. This page will focus on multinomial regression, an extension of logistic regression. Although, I cannot cover all of the details and fine use cases of regression, we will explore some of the key ideas to focus on when creating and analyzing a regression model. We’ll start by importing our data and identifying variables to use in the model then move to creating, interpreting, and testing the model. As before, we’ll use train/test validation and we’ll also look at a few candidate models.
I again will start by importing our dataset as well as the packages
we will be using. As mentioned in the previous guide, I prefer to use
the tidyverse
family of packages. We’ll also be using the
tidymodels
collection of packages to set up the models and
perform our train/test validation.
The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train and test set. We will train our models using the training set and will test its performance using the test set. We do this to simulate how the model will perform in the “real world”.
I will be setting the seed for the random generator so we can get the
same results each time I start a new R session and split the data. You
may or may not wish to do this in your work. I want to make sure we get
a roughly equal amount of each store in the split sets so I will take a
stratified sample to make the split (strata = store
).
library(tidyverse)
library(tidymodels)
library(lubridate)
library(performance)
library(GGally)
library(gvsu215)
library(car)
library(knitr)
library(kableExtra)
retail <- read_csv(here::here("data/retail_clean.csv"))
set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)
train <- training(retail_split)
test <- testing(retail_split)
Alright! Now that we have that all set, we’ll use the
train
data for all the model creation and will only use the
test
data when we are analyzing our models.
Although most of the processes for multinomial regression will be the same as or very similar to SLR, MLR, ad logistic regression, there are a few key differences. The key piece is what our response variable looks like. Remember, for SLR and MLR, our response variable had to be continuous. In logistic regression, the response variable had to be binary (or had to have two levels). What if the response variable has more than 2 levels?
In order to get valid results from multinomial regression, we must meet certain assumptions.
First, let’s decide what we want to predict. Again, we need a
variable that has more than two levels. We have a few choices:
store
, product_class
, or
pay_method
. I am most interested in looking to see if we
can predict which store a customer will shop at. This can be useful for
a company to help in product ordering. If, for instance, we find that we
can predict a customer’s destination based on the product class, we
could instruct stores to order more products of certain classes to avoid
out of stocks.
Technically, we can split up multinomial regression into simple multinomial regression (one predictor variable) and multiple multinomial regression (more than one predictor variable). However, I am going to jump right into multiple multinomial regression and dive into a model as predicting with one variable is not very exciting. As I did previously, I am going to start by choosing every variable that I think might play a role in predicting the gender of a customer and use the regression output to narrow things down.
I would like to start with:
customer_type
gender
product_class
qty
total
pay_method
rating
This time, I also want to create a new variable hour
that describes the hour of the day a customer visited a store. The
following code creates a subset of the data (and also codes our response
variable as a factor).
train_subset <- train %>%
mutate(store = factor(store),
qty = factor(qty),
hour = factor(hour(time))) %>%
select(store, customer_type, gender, product_class, qty, total, pay_method, rating, hour)
test_subset <- test %>%
mutate(store = factor(store),
qty = factor(qty),
hour = factor(hour(time))) %>%
select(store, customer_type, gender, product_class, qty, total, pay_method, rating, hour)
Now that we are predicting a response variable with multiple levels, we have to rethink our model equation. Essentially, we are trying to predict the likelihood of a certain store being shopped at given the values of our predictor variables.
Like binomial regression, multinomial regression also requires us to compare against a “base” level. Since our response variable has three levels, we are essentially creating two logistic regression equations which allow us to compare to the “base” level. Recall that we have three stores: A, B, and C. If we choose store A to be our comparison level, we are actually running logistic regression with the following odds ratio
\[ \frac{Pr(Y_i = \text{Store B})}{Pr(Y_i = \text{Store A})} = \frac{p_{\text{Store B}}}{p_{\text{Store A}}} \] and
\[ \frac{Pr(Y_i = \text{Store C})}{Pr(Y_i = \text{Store A})} = \frac{p_{\text{Store C}}}{p_{\text{Store A}}} \] This can make interpretation tricky as all of our calculations are being done in comparison to Store A.
The math behind predicting this can get pretty involved. However, I would like to show that if we take the log of the above odds, we get a formula like this (which is derived from the more complicated logistic equation):
\[ \begin{equation*} \log\left(\frac{Pr(Y_i = \text{Store B})}{Pr(Y_i = \text{Store A})}\right) = \beta_0 + \beta_1 X_1 + ... \end{equation*} \] where the predictions for the beta coefficients are derived from the data from Store B.
This should look pretty familiar. We had a similar situation with logistic regression when we took the log of the odds. Here too we can take the log of the odds and we have a linear relationship between the predictors and the log odds, or logit.
Before officially building the model, let’s make a few plots to visualize our variables.
train_subset %>%
ggplot(aes(x = store, fill = customer_type)) +
geom_bar() +
scale_fill_brewer(palette = "Dark2") +
labs(title = "Visits to Stores by Member Status",
x = "Store",
y = "Count",
fill = "Member Status") +
my_theme
It looks like we have very similar results per store here. Store A seems to attract more non-members but this may be because it has more customers overall. We may not see an effect here later on in the model.
train_subset %>%
ggplot(aes(x = store, fill = gender)) +
geom_bar() +
scale_fill_brewer(palette = "Dark2") +
labs(title = "Visits to Stores by Gender",
x = "Store",
y = "Count",
fill = "Gender") +
my_theme
Similar to the above plot, we see more males visiting store A but, again, this could be due to the fact that more customers visit store A (although not by much).
train_subset %>%
ggplot(aes(x = store, fill = product_class)) +
geom_bar() +
scale_fill_brewer(palette = "Dark2") +
labs(title = "Store vs. Product Class",
x = "Store",
y = "Count",
fill = "Product Class") +
my_theme
We may end up seeing some sort of effect in the model with this variable. I’m just eyeballing the plot here but it looks like Store A sells more Home/Lifestyle and Electronic goods and less Fashion Accessories while Store B sells more Sports/Travel items and Store C sells more Food/Beverages.
train_subset %>%
ggplot(aes(x = store, fill = qty)) +
geom_bar() +
scale_fill_brewer(palette = "Set3") +
labs(title = "Store Location vs. Quantity Sold",
subtitle = "One Unique Item Per Transaction",
x = "Store",
y = "Count",
fill = "Quantity") +
my_theme
It is hard to tell if there is a pattern here. I do see that Store A has more products sold in the middle areas (4, 5, 6, 7) while Store B is more even across the board and Store C has more on the higher end. However this is just a generality and I’m not sure if we will keep this variable.
train_subset %>%
ggplot(aes(x = total, fill = store)) +
geom_boxplot() +
scale_fill_brewer(palette = "Dark2") +
labs(title = "Visits to Stores by Transaction Total",
x = "Total",
y = "Store",
fill = "Store") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
my_theme
Totals seem to be very similar across stores and skewed right with the higher totals being less common than lower totals. Store C does seem to have higher purchase totals but this may be because of a few higher values rather than an general tendency.
train_subset %>%
ggplot(aes(x = store, fill = pay_method)) +
geom_bar() +
scale_fill_brewer(palette = "Dark2") +
labs(title = "Store vs. Pay Method",
x = "Store",
y = "Count",
fill = "Pay Method") +
my_theme
Once again, results are similar across stores; however, I do see that Store A has more customers paying with Ewallet, Store B more with credit cards, and Store C more with cash. We will see if this difference is large enough to pop up in our model, but this may be useful information regardless.
train_subset %>%
ggplot(aes(x = rating, fill = store)) +
geom_boxplot() +
scale_fill_brewer(palette = "Dark2") +
labs(title = "Visits to Stores by Transaction Rating",
x = "Rating",
y = "Store",
fill = "Store") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
my_theme
It looks like all stores have high and low ratings but, on average Store B is rated lower than the other two. It is unlikely that this will have a big effect in the model but I think this helps the company realize they should focus on increasing customer satisfaction at all stores, especially Store B.
train_subset %>%
ggplot(aes(x = store, fill = hour)) +
geom_bar() +
scale_fill_brewer(palette = "Set3") +
labs(title = "Store vs. Hour Visited",
x = "Store",
y = "Count",
fill = "Hour") +
my_theme
It looks like all stores are open between the hours of 10am and 8pm. Store A attracts customers right at open and near the early afternoon but becomes less busy as the day goes on. Store B starts off pretty slow, gets a small rush early afternoon and finishes the day strong with the most customers arriving around 6 and 7 in the evening. Store C has roughly the same amount of traffic all day with small peaks at noon and 5pm. This information can be vital to the company to help them better prepare for rushes and make sure the shelves are stocked in preparation.
To wrap up, it’s a good idea to look at all of the variables together (as we have done before) to look for any potential interaction terms or transformations (e.g., square root, log, square, cube, etc.).
ggpairs(train_subset) +
my_theme
We are using the total
variable in this model so we will
want to apply a transformation to it here, such as the square root. I’d
also like to explore a potential interaction between product class and
payment method.
Now that we have an understanding of what the multinomial model looks like, let’s use R to build one!
train_subset <- train_subset %>%
mutate(sqrt_total = sqrt(total))
test_subset <- test_subset %>%
mutate(sqrt_total = sqrt(total))
multi_spec <- multinom_reg() %>%
set_engine("nnet")
multi_mod_full <- multi_spec %>%
fit(store ~ customer_type + gender + product_class + qty + sqrt_total + pay_method + rating + hour + product_class*pay_method, data = train_subset) %>%
repair_call(data = train_subset) # this is necessary because we are not using cross-validation
tidy(multi_mod_full) %>%
kable() %>%
kable_styling(c("striped", "responsive"), fixed_thead = TRUE)
y.level | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
B | (Intercept) | 0.1230925 | 0.6639102 | 0.1854054 | 0.8529111 |
B | customer_typeNormal | -0.1022425 | 0.1806509 | -0.5659671 | 0.5714161 |
B | genderMale | -0.2173637 | 0.1826443 | -1.1900926 | 0.2340100 |
B | product_classFashion accessories | 0.5175410 | 0.5569480 | 0.9292448 | 0.3527622 |
B | product_classFood and beverages | -0.9130753 | 0.5954924 | -1.5333115 | 0.1251991 |
B | product_classHealth and beauty | -0.7073235 | 0.5587781 | -1.2658398 | 0.2055704 |
B | product_classHome and lifestyle | -0.9997502 | 0.5450608 | -1.8341992 | 0.0666244 |
B | product_classSports and travel | 0.0762935 | 0.4935285 | 0.1545878 | 0.8771463 |
B | qty2 | 0.4222929 | 0.4121049 | 1.0247221 | 0.3054943 |
B | qty3 | 0.2278580 | 0.4128217 | 0.5519527 | 0.5809808 |
B | qty4 | 0.2963050 | 0.4139034 | 0.7158796 | 0.4740657 |
B | qty5 | -0.3161466 | 0.4301869 | -0.7349051 | 0.4623973 |
B | qty6 | 0.3926758 | 0.4539686 | 0.8649846 | 0.3870472 |
B | qty7 | -0.1393764 | 0.4667160 | -0.2986321 | 0.7652208 |
B | qty8 | 0.5507095 | 0.5071525 | 1.0858855 | 0.2775297 |
B | qty9 | 0.6045089 | 0.5060219 | 1.1946298 | 0.2322317 |
B | qty10 | 0.0715738 | 0.4979165 | 0.1437465 | 0.8857006 |
B | sqrt_total | -0.0042119 | 0.0197610 | -0.2131419 | 0.8312163 |
B | pay_methodCredit card | -0.4048292 | 0.5226017 | -0.7746419 | 0.4385513 |
B | pay_methodEwallet | -0.4762357 | 0.5294266 | -0.8995311 | 0.3683699 |
B | rating | -0.0167801 | 0.0526306 | -0.3188279 | 0.7498570 |
B | hour11 | 0.2861487 | 0.3999898 | 0.7153900 | 0.4743681 |
B | hour12 | -0.2638694 | 0.4159754 | -0.6343389 | 0.5258597 |
B | hour13 | 0.2386122 | 0.3928605 | 0.6073714 | 0.5436045 |
B | hour14 | 0.4756167 | 0.4222767 | 1.1263154 | 0.2600320 |
B | hour15 | 0.2044245 | 0.4001754 | 0.5108372 | 0.6094651 |
B | hour16 | -0.1895534 | 0.4403074 | -0.4305024 | 0.6668302 |
B | hour17 | -0.0257888 | 0.4319457 | -0.0597039 | 0.9523915 |
B | hour18 | 0.2614289 | 0.3980759 | 0.6567313 | 0.5113537 |
B | hour19 | 0.7217362 | 0.3941421 | 1.8311575 | 0.0670770 |
B | hour20 | 0.3407303 | 0.4357730 | 0.7818986 | 0.4342742 |
B | product_classFashion accessories:pay_methodCredit card | -0.2332280 | 0.7788595 | -0.2994481 | 0.7645982 |
B | product_classFood and beverages:pay_methodCredit card | 1.2963956 | 0.7825805 | 1.6565652 | 0.0976074 |
B | product_classHealth and beauty:pay_methodCredit card | 1.5547850 | 0.7905627 | 1.9666814 | 0.0492199 |
B | product_classHome and lifestyle:pay_methodCredit card | 1.0702993 | 0.7905656 | 1.3538399 | 0.1757875 |
B | product_classSports and travel:pay_methodCredit card | 0.4670240 | 0.7256503 | 0.6435937 | 0.5198390 |
B | product_classFashion accessories:pay_methodEwallet | -0.0881054 | 0.7555997 | -0.1166033 | 0.9071745 |
B | product_classFood and beverages:pay_methodEwallet | 1.0499598 | 0.8043774 | 1.3053074 | 0.1917882 |
B | product_classHealth and beauty:pay_methodEwallet | 1.1756288 | 0.7756213 | 1.5157253 | 0.1295888 |
B | product_classHome and lifestyle:pay_methodEwallet | 1.0617632 | 0.7560930 | 1.4042759 | 0.1602367 |
B | product_classSports and travel:pay_methodEwallet | 0.0704697 | 0.7370418 | 0.0956116 | 0.9238291 |
C | (Intercept) | 0.3740907 | 0.6443545 | 0.5805666 | 0.5615326 |
C | customer_typeNormal | -0.1035452 | 0.1813948 | -0.5708279 | 0.5681163 |
C | genderMale | -0.3161325 | 0.1837407 | -1.7205360 | 0.0853351 |
C | product_classFashion accessories | -0.0406714 | 0.5478844 | -0.0742335 | 0.9408246 |
C | product_classFood and beverages | -0.0241859 | 0.4875570 | -0.0496064 | 0.9604361 |
C | product_classHealth and beauty | -0.5513391 | 0.5070656 | -1.0873133 | 0.2768984 |
C | product_classHome and lifestyle | -1.0130234 | 0.5046434 | -2.0074045 | 0.0447066 |
C | product_classSports and travel | -0.9204218 | 0.5189377 | -1.7736654 | 0.0761185 |
C | qty2 | 0.0864847 | 0.4047179 | 0.2136913 | 0.8307878 |
C | qty3 | -0.4552864 | 0.4236196 | -1.0747527 | 0.2824854 |
C | qty4 | -0.3009137 | 0.4107135 | -0.7326609 | 0.4637653 |
C | qty5 | -0.7120384 | 0.4271518 | -1.6669447 | 0.0955254 |
C | qty6 | -0.3797564 | 0.4609512 | -0.8238537 | 0.4100227 |
C | qty7 | -0.2263241 | 0.4472796 | -0.5060014 | 0.6128557 |
C | qty8 | 0.0489264 | 0.4954365 | 0.0987542 | 0.9213335 |
C | qty9 | -0.2185088 | 0.5059906 | -0.4318437 | 0.6658550 |
C | qty10 | -0.1735220 | 0.4873218 | -0.3560728 | 0.7217860 |
C | sqrt_total | 0.0186506 | 0.0197400 | 0.9448127 | 0.3447545 |
C | pay_methodCredit card | -1.2569920 | 0.5514493 | -2.2794334 | 0.0226413 |
C | pay_methodEwallet | -0.6660243 | 0.5080869 | -1.3108470 | 0.1899094 |
C | rating | 0.0258671 | 0.0527362 | 0.4905001 | 0.6237800 |
C | hour11 | -0.2368580 | 0.4189176 | -0.5654046 | 0.5717986 |
C | hour12 | 0.1739025 | 0.3855400 | 0.4510623 | 0.6519446 |
C | hour13 | 0.1750183 | 0.3875151 | 0.4516425 | 0.6515265 |
C | hour14 | 0.3370850 | 0.4201652 | 0.8022676 | 0.4223982 |
C | hour15 | 0.1315039 | 0.3923755 | 0.3351482 | 0.7375133 |
C | hour16 | 0.0809160 | 0.4134694 | 0.1957000 | 0.8448450 |
C | hour17 | 0.1459642 | 0.4111131 | 0.3550465 | 0.7225548 |
C | hour18 | -0.3088176 | 0.4141024 | -0.7457517 | 0.4558174 |
C | hour19 | 0.2246624 | 0.4036445 | 0.5565847 | 0.5778112 |
C | hour20 | 0.2964526 | 0.4234660 | 0.7000622 | 0.4838884 |
C | product_classFashion accessories:pay_methodCredit card | 1.0148245 | 0.7965850 | 1.2739688 | 0.2026745 |
C | product_classFood and beverages:pay_methodCredit card | 0.2753167 | 0.7682468 | 0.3583702 | 0.7200663 |
C | product_classHealth and beauty:pay_methodCredit card | 1.8845297 | 0.7843329 | 2.4027167 | 0.0162738 |
C | product_classHome and lifestyle:pay_methodCredit card | 1.7588866 | 0.7890985 | 2.2289824 | 0.0258151 |
C | product_classSports and travel:pay_methodCredit card | 1.6970046 | 0.7735810 | 2.1936998 | 0.0282570 |
C | product_classFashion accessories:pay_methodEwallet | 0.2511211 | 0.7436477 | 0.3376882 | 0.7355981 |
C | product_classFood and beverages:pay_methodEwallet | 0.1895658 | 0.7175306 | 0.2641920 | 0.7916320 |
C | product_classHealth and beauty:pay_methodEwallet | 0.4636049 | 0.7537622 | 0.6150546 | 0.5385187 |
C | product_classHome and lifestyle:pay_methodEwallet | 0.7819534 | 0.7280234 | 1.0740773 | 0.2827881 |
C | product_classSports and travel:pay_methodEwallet | 0.5345029 | 0.7740678 | 0.6905117 | 0.4898725 |
model_performance(multi_mod_full) %>%
kable()
## Can't calculate log-loss.
AIC | AICc | BIC | R2 | R2_adjusted | RMSE | Sigma |
---|---|---|---|---|---|---|
1843.07 | 1862.081 | 2227.105 | 0.043475 | 0.0423356 | 0.4604025 | 1.530294 |
This gives us a huge output. Remember, we are really running two regression models at the same time, comparing Store B to Store A and Store C to Store A. So, for example, if you purchase a home/lifestyle good, we would predict your log odds of shopping at Store B to be 1.00 compared to Store A, holding all other variables constant.
We will certainly not be using this model as our final model. There are way too many variables and too information to consider. Moreover, it doesn’t look like all of these variables will have equal use in predicting the odds of shopping at a certain store. As with logistic regression, I will not be relying solely on p-values to determine which variables to keep. I am going to be looking at them to get an idea of how influential a variable might be, but I will also be thinking about which variables might have practical effect on the results.
Looking through, I see the following:
hour
doesn’t say much compared to
Store A. However, we do have a highly significant p-value here which
makes me think that hour
plays some role.You may have noticed that a lot of our observations from looking at plots above have appeared in the results here. This is why we run regression. Sure, we can visualize trends but it’s nice to have a statistical test to tell us how big of an effect the things we have observed are having on the outcome.
I now want to build a reduced model by removing some of the
variables. I think we can remove customer_type
as there
didn’t seem to be much of a difference between stores. I also want to
remove qty
. After thinking about it, it probably doesn’t
make sense to predict which store someone will shop at based on a
variable that is determined while they are shopping. I also am taking
out the rating and log_total
variables. We saw earlier that
ratings and totals were across the board for all stores and now we see
that it makes a marginal difference in the log odds. I will be keeping
the interaction in the model and therefore should also keep the two
variables that make up that interaction: product class and pay
method.
multi_mod_reduced <- multi_spec %>%
fit(store ~ gender + product_class + pay_method + hour + product_class*pay_method, data = train_subset) %>%
repair_call(data = train_subset) # this is necessary because we are not using cross-validation
tidy(multi_mod_reduced) %>%
kable() %>%
kable_styling(c("striped", "responsive"), fixed_thead = TRUE)
y.level | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
B | (Intercept) | 0.0324376 | 0.4418813 | 0.0734079 | 0.9414815 |
B | genderMale | -0.2181609 | 0.1792318 | -1.2171993 | 0.2235284 |
B | product_classFashion accessories | 0.4593263 | 0.5475887 | 0.8388162 | 0.4015724 |
B | product_classFood and beverages | -0.8620287 | 0.5859534 | -1.4711556 | 0.1412490 |
B | product_classHealth and beauty | -0.7111274 | 0.5530655 | -1.2857923 | 0.1985156 |
B | product_classHome and lifestyle | -0.8941556 | 0.5392373 | -1.6581859 | 0.0972799 |
B | product_classSports and travel | 0.0567547 | 0.4868340 | 0.1165792 | 0.9071936 |
B | pay_methodCredit card | -0.3808298 | 0.5140045 | -0.7409076 | 0.4587495 |
B | pay_methodEwallet | -0.4487375 | 0.5224286 | -0.8589451 | 0.3903708 |
B | hour11 | 0.3106743 | 0.3936037 | 0.7893075 | 0.4299323 |
B | hour12 | -0.1583406 | 0.4081229 | -0.3879728 | 0.6980361 |
B | hour13 | 0.2742805 | 0.3868920 | 0.7089330 | 0.4783660 |
B | hour14 | 0.5601527 | 0.4138768 | 1.3534284 | 0.1759188 |
B | hour15 | 0.2418397 | 0.3940304 | 0.6137589 | 0.5393746 |
B | hour16 | -0.1627922 | 0.4351202 | -0.3741315 | 0.7083065 |
B | hour17 | 0.0052286 | 0.4265025 | 0.0122591 | 0.9902189 |
B | hour18 | 0.2705395 | 0.3932515 | 0.6879553 | 0.4914809 |
B | hour19 | 0.7615430 | 0.3852854 | 1.9765687 | 0.0480904 |
B | hour20 | 0.3738848 | 0.4307943 | 0.8678964 | 0.3854511 |
B | product_classFashion accessories:pay_methodCredit card | -0.1761060 | 0.7691999 | -0.2289470 | 0.8189101 |
B | product_classFood and beverages:pay_methodCredit card | 1.1505224 | 0.7718011 | 1.4906981 | 0.1360408 |
B | product_classHealth and beauty:pay_methodCredit card | 1.5352941 | 0.7806878 | 1.9665915 | 0.0492303 |
B | product_classHome and lifestyle:pay_methodCredit card | 1.0415556 | 0.7813768 | 1.3329749 | 0.1825401 |
B | product_classSports and travel:pay_methodCredit card | 0.4056389 | 0.7134346 | 0.5685720 | 0.5696466 |
B | product_classFashion accessories:pay_methodEwallet | -0.0546555 | 0.7459788 | -0.0732669 | 0.9415938 |
B | product_classFood and beverages:pay_methodEwallet | 1.0396773 | 0.7937828 | 1.3097756 | 0.1902718 |
B | product_classHealth and beauty:pay_methodEwallet | 1.1450831 | 0.7681274 | 1.4907463 | 0.1360281 |
B | product_classHome and lifestyle:pay_methodEwallet | 0.9440606 | 0.7445859 | 1.2679001 | 0.2048336 |
B | product_classSports and travel:pay_methodEwallet | 0.0665056 | 0.7240331 | 0.0918544 | 0.9268137 |
C | (Intercept) | 0.5089231 | 0.4190068 | 1.2145939 | 0.2245210 |
C | genderMale | -0.2890749 | 0.1802066 | -1.6041308 | 0.1086852 |
C | product_classFashion accessories | -0.0206193 | 0.5404320 | -0.0381534 | 0.9695654 |
C | product_classFood and beverages | 0.0164166 | 0.4761684 | 0.0344765 | 0.9724972 |
C | product_classHealth and beauty | -0.5296649 | 0.5018916 | -1.0553372 | 0.2912711 |
C | product_classHome and lifestyle | -0.9393242 | 0.4982635 | -1.8851956 | 0.0594034 |
C | product_classSports and travel | -0.9334757 | 0.5128043 | -1.8203350 | 0.0687080 |
C | pay_methodCredit card | -1.2504315 | 0.5448502 | -2.2950005 | 0.0217331 |
C | pay_methodEwallet | -0.6574219 | 0.5018185 | -1.3100789 | 0.1901691 |
C | hour11 | -0.1735148 | 0.4136739 | -0.4194483 | 0.6748886 |
C | hour12 | 0.2236018 | 0.3804245 | 0.5877691 | 0.5566873 |
C | hour13 | 0.1796662 | 0.3828059 | 0.4693403 | 0.6388264 |
C | hour14 | 0.4229653 | 0.4137510 | 1.0222703 | 0.3066530 |
C | hour15 | 0.1710339 | 0.3867843 | 0.4421945 | 0.6583485 |
C | hour16 | 0.1099455 | 0.4087028 | 0.2690108 | 0.7879214 |
C | hour17 | 0.1695118 | 0.4057528 | 0.4177711 | 0.6761145 |
C | hour18 | -0.2904513 | 0.4100855 | -0.7082701 | 0.4787776 |
C | hour19 | 0.2716885 | 0.3956510 | 0.6866874 | 0.4922798 |
C | hour20 | 0.3381913 | 0.4182171 | 0.8086500 | 0.4187165 |
C | product_classFashion accessories:pay_methodCredit card | 1.0770044 | 0.7893959 | 1.3643400 | 0.1724606 |
C | product_classFood and beverages:pay_methodCredit card | 0.2050424 | 0.7584138 | 0.2703569 | 0.7868857 |
C | product_classHealth and beauty:pay_methodCredit card | 1.8646395 | 0.7757981 | 2.4035115 | 0.0162385 |
C | product_classHome and lifestyle:pay_methodCredit card | 1.6587394 | 0.7787565 | 2.1299846 | 0.0331729 |
C | product_classSports and travel:pay_methodCredit card | 1.8085889 | 0.7639346 | 2.3674656 | 0.0179104 |
C | product_classFashion accessories:pay_methodEwallet | 0.2596970 | 0.7356182 | 0.3530323 | 0.7240642 |
C | product_classFood and beverages:pay_methodEwallet | 0.1482658 | 0.7080370 | 0.2094040 | 0.8341328 |
C | product_classHealth and beauty:pay_methodEwallet | 0.4181683 | 0.7464316 | 0.5602232 | 0.5753272 |
C | product_classHome and lifestyle:pay_methodEwallet | 0.7349214 | 0.7168877 | 1.0251555 | 0.3052898 |
C | product_classSports and travel:pay_methodEwallet | 0.5216228 | 0.7628310 | 0.6837986 | 0.4941024 |
model_performance(multi_mod_reduced) %>%
kable()
## Can't calculate log-loss.
## Can't calculate proper scoring rules for ordinal, multinomial or
## cumulative link models.
AIC | AICc | BIC | R2 | R2_adjusted | RMSE | Sigma |
---|---|---|---|---|---|---|
1813.663 | 1822.912 | 2085.298 | 0.0328826 | 0.0317433 | 0.4629543 | 1.513619 |
This still gives us a large model with many variables. Part of this
is due to how many levels the hour
variable and the
interaction variable contains. I hesitate to remove hour
since some of the hours seem to play a role in predicting the log odds
of going to a certain store. Before moving forward with this model
(which does seem to be a slight improvement from the full model as we
have lowered the AIC and BIC), I would like to fit one more model
without hour
to see what happens.
multi_mod_nohour <- multi_spec %>%
fit(store ~ gender + product_class + pay_method + product_class*pay_method, data = train_subset) %>%
repair_call(data = train_subset) # this is necessary because we are not using cross-validation
tidy(multi_mod_nohour) %>%
kable() %>%
kable_styling(c("striped", "responsive"), fixed_thead = TRUE)
y.level | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
B | (Intercept) | 0.2785966 | 0.3604863 | 0.7728355 | 0.4396198 |
B | genderMale | -0.1954871 | 0.1758289 | -1.1118031 | 0.2662228 |
B | product_classFashion accessories | 0.4572940 | 0.5408788 | 0.8454648 | 0.3978514 |
B | product_classFood and beverages | -0.8899751 | 0.5806029 | -1.5328464 | 0.1253137 |
B | product_classHealth and beauty | -0.6739486 | 0.5483441 | -1.2290615 | 0.2190487 |
B | product_classHome and lifestyle | -0.9709108 | 0.5327123 | -1.8225800 | 0.0683670 |
B | product_classSports and travel | -0.0245208 | 0.4810784 | -0.0509705 | 0.9593491 |
B | pay_methodCredit card | -0.4584183 | 0.5083734 | -0.9017353 | 0.3671975 |
B | pay_methodEwallet | -0.4450654 | 0.5185653 | -0.8582630 | 0.3907472 |
B | product_classFashion accessories:pay_methodCredit card | -0.0397798 | 0.7565471 | -0.0525808 | 0.9580659 |
B | product_classFood and beverages:pay_methodCredit card | 1.3344234 | 0.7633087 | 1.7482093 | 0.0804278 |
B | product_classHealth and beauty:pay_methodCredit card | 1.5116425 | 0.7718664 | 1.9584251 | 0.0501801 |
B | product_classHome and lifestyle:pay_methodCredit card | 1.1559684 | 0.7708827 | 1.4995386 | 0.1337340 |
B | product_classSports and travel:pay_methodCredit card | 0.4938071 | 0.7056887 | 0.6997520 | 0.4840822 |
B | product_classFashion accessories:pay_methodEwallet | -0.0915018 | 0.7382811 | -0.1239389 | 0.9013637 |
B | product_classFood and beverages:pay_methodEwallet | 1.0967486 | 0.7882034 | 1.3914537 | 0.1640879 |
B | product_classHealth and beauty:pay_methodEwallet | 1.0916698 | 0.7622508 | 1.4321661 | 0.1520963 |
B | product_classHome and lifestyle:pay_methodEwallet | 1.0178712 | 0.7368889 | 1.3813089 | 0.1671840 |
B | product_classSports and travel:pay_methodEwallet | 0.1440879 | 0.7157792 | 0.2013021 | 0.8404623 |
C | (Intercept) | 0.6010893 | 0.3408282 | 1.7636135 | 0.0777971 |
C | genderMale | -0.2712891 | 0.1772239 | -1.5307703 | 0.1258262 |
C | product_classFashion accessories | 0.0447680 | 0.5356529 | 0.0835765 | 0.9333932 |
C | product_classFood and beverages | 0.0491046 | 0.4715041 | 0.1041446 | 0.9170546 |
C | product_classHealth and beauty | -0.5121843 | 0.4974753 | -1.0295673 | 0.3032132 |
C | product_classHome and lifestyle | -0.9671051 | 0.4924182 | -1.9639912 | 0.0495311 |
C | product_classSports and travel | -0.9112194 | 0.5086908 | -1.7913033 | 0.0732446 |
C | pay_methodCredit card | -1.2345571 | 0.5413370 | -2.2805701 | 0.0225739 |
C | pay_methodEwallet | -0.6430008 | 0.4989594 | -1.2886837 | 0.1975081 |
C | product_classFashion accessories:pay_methodCredit card | 0.9859724 | 0.7793577 | 1.2651090 | 0.2058322 |
C | product_classFood and beverages:pay_methodCredit card | 0.1927078 | 0.7525470 | 0.2560741 | 0.7978936 |
C | product_classHealth and beauty:pay_methodCredit card | 1.8507962 | 0.7688588 | 2.4071992 | 0.0160754 |
C | product_classHome and lifestyle:pay_methodCredit card | 1.7259267 | 0.7706736 | 2.2395041 | 0.0251231 |
C | product_classSports and travel:pay_methodCredit card | 1.8075592 | 0.7599656 | 2.3784749 | 0.0173844 |
C | product_classFashion accessories:pay_methodEwallet | 0.2305947 | 0.7303134 | 0.3157475 | 0.7521941 |
C | product_classFood and beverages:pay_methodEwallet | 0.1419276 | 0.7040976 | 0.2015737 | 0.8402500 |
C | product_classHealth and beauty:pay_methodEwallet | 0.4735933 | 0.7417994 | 0.6384385 | 0.5231883 |
C | product_classHome and lifestyle:pay_methodEwallet | 0.7935966 | 0.7103644 | 1.1171684 | 0.2639223 |
C | product_classSports and travel:pay_methodEwallet | 0.4951504 | 0.7572145 | 0.6539103 | 0.5131696 |
model_performance(multi_mod_nohour) %>%
kable()
## Can't calculate log-loss.
## Can't calculate proper scoring rules for ordinal, multinomial or
## cumulative link models.
AIC | AICc | BIC | R2 | R2_adjusted | RMSE | Sigma |
---|---|---|---|---|---|---|
1790.054 | 1793.954 | 1968.022 | 0.0235453 | 0.022406 | 0.4654612 | 1.50079 |
Well…I’m not upset about this one! We’ve lowered the AIC and BIC once
again although our RMSE has increased only slightly. The \(R^2\) has gone down but it wasn’t very
large at all to begin with. At this point, I’m not really sure which
model is “better”. It is very nice to remove hour
and lose
a large number of levels seen in the model. But there were a few levels
of hour
that seemed to play a role in the model and I
hesitate just dropping it.
After thinking about it, I am going to move forward with both of the models. I will look at our model accuracy on both of the models first and then pick the model with the highest prediction accuracy.
Let’s see how we did!
multi_aug <- augment(multi_mod_reduced, new_data = train_subset)
ggplot(data = multi_aug, aes(x = .pred_class, y = store)) +
geom_point() +
geom_jitter() +
labs(x = "Predicted Store",
y = "Actual Store ",
title = "Store Visited: Predicted vs. Actual",
subtitle = "Reduced Model | Training Data") +
my_theme
Well, the reduced model didn’t do terribly but also not the best. We want to see the most dots as possible along the diagonal from the bottom left to the top right. This would mean we predicted a customer would go to a certain store and they actually did. We do see that we correctly predicted a nice handful of observations. That being said, there are quite a few that we did not predict correctly. We have about 30-40 observations in all possible combinations of stores meaning that we made some wrong predictions in every way possible.
Let’s see it in table form.
multi_aug %>%
rename("Predicted" = .pred_class,
"Actual" = store) %>%
tbl_2var(Predicted~Actual,
caption = "Predicted Store Visited vs. Actual Store Visited \n Reduced Model")
Actual | ||||
---|---|---|---|---|
Predicted | A | B | C | Total |
A | 117 | 85 | 78 | 280 |
B | 80 | 122 | 72 | 274 |
C | 75 | 58 | 112 | 245 |
Total | 272 | 265 | 262 | 799 |
Officially, we correctly predicted 351 out of the 799 observations for a score of 43.93%. Not great but better than 33% which is where I would expect to be if we were just guessing randomly!
We should also check with the testing dataset.
multi_aug_test <- augment(multi_mod_reduced, new_data = test_subset)
ggplot(data = multi_aug_test, aes(x = .pred_class, y = store)) +
geom_point() +
geom_jitter(width = 0.35, height = 0.35) +
labs(x = "Predicted Store",
y = "Actual Store ",
title = "Store Visited: Predicted vs. Actual",
subtitle = "Reduced Model | Testing Data") +
my_theme
multi_aug_test %>%
rename("Predicted" = .pred_class,
"Actual" = store) %>%
tbl_2var(Predicted~Actual,
caption = "Predicted Store Visited vs. Actual Store Visited (Testing) \n No Hour Model")
Actual | ||||
---|---|---|---|---|
Predicted | A | B | C | Total |
A | 25 | 26 | 17 | 68 |
B | 22 | 28 | 23 | 73 |
C | 21 | 13 | 26 | 60 |
Total | 68 | 67 | 66 | 201 |
We see similar results. We have correctly predicted 79 out of the 201 observations for a score of 39.3%. This is worse than the training data (which is expected) but is also approaching the 33% realm which is what would expect to get if we just guessed randomly.
multi_aug_nohour <- augment(multi_mod_nohour, new_data = train_subset)
ggplot(data = multi_aug_nohour, aes(x = .pred_class, y = store)) +
geom_point() +
geom_jitter() +
labs(x = "Predicted Store",
y = "Actual Store ",
title = "Store Visited: Predicted vs. Actual",
subtitle = "No Hour Model | Training Data") +
my_theme
Just looking at this plot, I don’t see any striking evidence that we
did any better or worse by taking hour
out of the model. We
still have plenty of observations where we incorrectly predict their
store
value.
multi_aug_nohour %>%
rename("Predicted" = .pred_class,
"Actual" = store) %>%
tbl_2var(Predicted~Actual,
caption = "Predicted Store Visited vs. Actual Store Visited \n No Hour Model")
Actual | ||||
---|---|---|---|---|
Predicted | A | B | C | Total |
A | 113 | 82 | 76 | 271 |
B | 85 | 111 | 83 | 279 |
C | 74 | 72 | 103 | 249 |
Total | 272 | 265 | 262 | 799 |
We have less correct predictions here with a score of 327 out of 799 which gives a 40.93% accuracy.
multi_aug_test_nohour <- augment(multi_mod_nohour, new_data = test_subset)
ggplot(data = multi_aug_test_nohour, aes(x = .pred_class, y = store)) +
geom_point() +
geom_jitter(width = 0.35, height = 0.35) +
labs(x = "Predicted Store",
y = "Actual Store ",
title = "Store Visited: Predicted vs. Actual",
subtitle = "No Hour Model | Testing Data") +
my_theme
multi_aug_test_nohour %>%
rename("Predicted" = .pred_class,
"Actual" = store) %>%
tbl_2var(Predicted~Actual,
caption = "Predicted Store Visited vs. Actual Store Visited (Testing) \n No Hour Model")
Actual | ||||
---|---|---|---|---|
Predicted | A | B | C | Total |
A | 29 | 27 | 17 | 73 |
B | 19 | 21 | 25 | 65 |
C | 20 | 19 | 24 | 63 |
Total | 68 | 67 | 66 | 201 |
With the testing data, we predict 74/201 correct for a score of
36.82%. Based on these results, I think I am willing to sacrifice the
extra variables and keep the model that includes hour
. We
will continue to move on and check model assumptions with this
model.
Even though our model is not performing as well as I’d hoped, it still is good to check our assumptions. It is possible that part of the reason why our model is not satisfactory is due to not meeting some assumptions.
diagnostics <- plot(check_model(multi_mod_reduced, panel = FALSE, residual_type = "normal"))
We meet this since we chose a response variable (store
)
with more than two levels: A, B, and C.
Rose Werth has a good explanation of what this assumption really means. Essentially “this assumption states that the relative likelihood of being in one category compared to the base category would not change if you added any other categories.” We should be confident that there are no confounding variables out that that might change our outcome if they were considered.
I honestly don’t think we meet this assumption here. For instance, what if I told you the weather on a certain day or hour. Or what if I gave you traffic information about a certain day our hour. Would this influence your choice of store? Maybe, maybe not. But it is possible that knowing this information might change the outcome. For example, what if you were planning on going to Store B but I told you there was a car accident on the way there and you decided to go to Store C instead to avoid the traffic. That’s a change in the outcome.
I think this violation plays a role in our models lack of major success. Because there are so many other variables out there that we could collect and consider, we aren’t able to make an accurate prediction.
As before, let’s look at variance inflation factors to look at multicollinearity.
multi_mod_reduced %>%
extract_fit_engine() %>%
check_collinearity()
## Model has interaction terms. VIFs might be inflated.
## You may check multicollinearity among predictors of a model without
## interaction terms.
## # Check for Multicollinearity
##
## Low Correlation
##
## Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
## gender 1.37 [ 1.27, 1.50] 1.17 0.73 [0.67, 0.78]
##
## High Correlation
##
## Term VIF VIF 95% CI Increased SE Tolerance
## product_class 1907.92 [ 1678.42, 2168.82] 43.68 5.24e-04
## pay_method 61.15 [ 53.86, 69.45] 7.82 0.02
## hour 20.28 [ 17.91, 23.00] 4.50 0.05
## product_class:pay_method 1.53e+05 [1.34e+05, 1.74e+05] 390.70 6.55e-06
## Tolerance 95% CI
## [0.00, 0.00]
## [0.01, 0.02]
## [0.04, 0.06]
## [0.00, 0.00]
This is not good at all, but remember that we have an interaction term in our model so we do expect some correlation there. We should check VIFs without any interactions.
multi_spec %>%
fit(store ~ gender + product_class + pay_method + hour, data = train_subset) %>%
repair_call(data = train_subset) %>%
extract_fit_engine() %>%
check_collinearity()
## # Check for Multicollinearity
##
## Low Correlation
##
## Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
## gender 1.37 [ 1.27, 1.50] 1.17 0.73 [0.67, 0.79]
## product_class 4.46 [ 3.97, 5.02] 2.11 0.22 [0.20, 0.25]
## pay_method 1.81 [ 1.65, 2.00] 1.34 0.55 [0.50, 0.61]
##
## High Correlation
##
## Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
## hour 18.33 [16.14, 20.84] 4.28 0.05 [0.05, 0.06]
We have evidence for low correlation between predictors.
gender
, product_class
, and
pay_method
all have VIF values of less than 5.
product_class
is getting close to showing evidence of
correlation with the other predictors but not bad.
What strikes me more is the massive VIF for hour
. At
18.33, it is clearly correlated with other predictors. At this point, it
should be dropped or combined into another variable. I’m not sure we
could get a better model with the data we have, but curiosity strikes me
to test something. I am going to create a new variable that merges some
of the hour
s together and then I’ll run a new model on
that. I won’t include the interaction for demonstration purposes.
train_subset <- train_subset %>%
mutate(time_of_day = case_when(hour %in% c(10, 11, 12, 13, 14) ~ "Midday",
hour %in% c(15, 16, 17, 18) ~ "Afternoon",
TRUE ~ "Evening"))
multi_mod_reduced2 <- multi_spec %>%
fit(store ~ gender + product_class + pay_method + time_of_day, data = train_subset) %>%
repair_call(data = train_subset) # this is necessary because we are not using cross-validation
tidy(multi_mod_reduced2) %>%
kable() %>%
kable_styling(c("striped", "responsive"), fixed_thead = TRUE)
y.level | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
B | (Intercept) | -0.2183208 | 0.2842962 | -0.7679344 | 0.4425261 |
B | genderMale | -0.1910915 | 0.1758398 | -1.0867362 | 0.2771534 |
B | product_classFashion accessories | 0.3385894 | 0.2997552 | 1.1295533 | 0.2586645 |
B | product_classFood and beverages | -0.0727605 | 0.3032538 | -0.2399326 | 0.8103825 |
B | product_classHealth and beauty | 0.1715636 | 0.3087674 | 0.5556403 | 0.5784568 |
B | product_classHome and lifestyle | -0.2571213 | 0.3039655 | -0.8458898 | 0.3976142 |
B | product_classSports and travel | 0.1813320 | 0.2933302 | 0.6181838 | 0.5364542 |
B | pay_methodCredit card | 0.2052460 | 0.2175893 | 0.9432723 | 0.3455416 |
B | pay_methodEwallet | 0.0230733 | 0.2140427 | 0.1077976 | 0.9141563 |
B | time_of_dayEvening | 0.5343003 | 0.2550518 | 2.0948697 | 0.0361826 |
B | time_of_dayMidday | 0.1138453 | 0.1966973 | 0.5787843 | 0.5627347 |
C | (Intercept) | 0.1442794 | 0.2765776 | 0.5216598 | 0.6019072 |
C | genderMale | -0.2736713 | 0.1761162 | -1.5539250 | 0.1202023 |
C | product_classFashion accessories | 0.3474356 | 0.2981897 | 1.1651496 | 0.2439585 |
C | product_classFood and beverages | 0.1151221 | 0.2945517 | 0.3908384 | 0.6959167 |
C | product_classHealth and beauty | 0.1815095 | 0.3070232 | 0.5911916 | 0.5543921 |
C | product_classHome and lifestyle | -0.2175378 | 0.2996763 | -0.7259091 | 0.4678945 |
C | product_classSports and travel | -0.2023096 | 0.3051752 | -0.6629293 | 0.5073758 |
C | pay_methodCredit card | -0.1645312 | 0.2156844 | -0.7628331 | 0.4455629 |
C | pay_methodEwallet | -0.3188419 | 0.2110679 | -1.5106129 | 0.1308871 |
C | time_of_dayEvening | 0.2880832 | 0.2605779 | 1.1055548 | 0.2689192 |
C | time_of_dayMidday | 0.0551912 | 0.1950934 | 0.2828963 | 0.7772563 |
multi_aug2 <- augment(multi_mod_reduced2, new_data = train_subset)
ggplot(data = multi_aug2, aes(x = .pred_class, y = store)) +
geom_point() +
geom_jitter() +
labs(x = "Predicted Store",
y = "Actual Store ",
title = "Store Visited: Predicted vs. Actual",
subtitle = "Training Data: Reduced Model 2") +
my_theme
multi_mod_reduced2 %>%
extract_fit_engine() %>%
check_collinearity()
## # Check for Multicollinearity
##
## Low Correlation
##
## Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
## gender 1.35 [1.25, 1.49] 1.16 0.74 [0.67, 0.80]
## product_class 4.26 [3.79, 4.80] 2.06 0.23 [0.21, 0.26]
## pay_method 1.79 [1.63, 1.98] 1.34 0.56 [0.50, 0.61]
## time_of_day 1.91 [1.74, 2.12] 1.38 0.52 [0.47, 0.58]
The results don’t look much better. However, we have corrected the massively large VIF value! I would trust this second reduced model more than the original reduced model.
We can check this by plotting each predictor against the log odds. I am going to skip this assumption since I am not confident with the model and since I will not be using the model for any inference.
We may have clusters in the data depending on how the data was collected and on whom it was collected. It is very possible that we have multiple individuals represented here in this dataset. We may also have clusters of different ethnicities or other variables. I am not confident we meet this assumption either due to the variables we have versus those we do not have.
diagnostics[[3]] +
my_theme
Ideally, we’d like to see our errors normally distributed as well. According to this Q-Q Plot, we do not meet this assumption which questions our model results further.
I think we are seeing a pattern over these guides of not quite being able to predict our outcomes as well as we’d like. Here, we had a few points of interest but it wasn’t enough to give us great prediction accuracy Even though our models are not “successful” per se, I would claim that we still are extracting valuable information about our data. Just knowing that this collection of variables does not provide enough information to accurately predict the store a customer visits is useful. We now know that we should 1) try a different outcome variable, 2) use different predictors, and/or 3) collect new data to help supplement what we have here. We also found some interesting patterns when looking at visualizations and the regression output.
The moral of the story is don’t fear when your results aren’t significant or aren’t what you’d hoped. Think about what you were able to gain from the process. A model that doesn’t work well is still progress and can tell you a lot about your data. Why didn’t the model work? What could you do to change the model or update the data? Every pattern, trend, failure, and success allows us to learn something and make changes and suggestions for the future.