Hey there! This is the sixth guide in my Introduction to Modeling with R series. In the previous guide, we looked at multinomial regression, which involved a multi-level response variable. This page will focus on Poisson regression, a generalized linear model. Although I cannot cover all of the details and fine use cases of regression, we will explore some of the key ideas to focus on when creating and analyzing a regression model. We’ll start by importing our data and identifying variables to use in the model, then move to creating, interpreting, and testing the model. As before, we’ll use train/test validation, and we’ll also look at a few candidate models.
I again will start by importing our dataset as well as the packages we will be using. As mentioned in the previous guide, I prefer to use the tidyverse family of packages. We’ll also be using the tidymodels collection of packages to set up the models and perform our train/test validation.
The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train and test set. We will train our models using the training set and will test their performance using the test set. We do this to simulate how the models will perform in the “real world”.
I will be setting the seed for the random number generator so we can get the same results each time I start a new R session and split the data. You may or may not wish to do this in your work. I want to make sure we get a roughly equal proportion of each store in the split sets, so I will take a stratified sample to make the split (strata = store).
library(tidyverse)
library(tidymodels)
library(poissonreg)
library(GGally)
library(performance)
library(knitr)
library(kableExtra)
retail <- read_csv(here::here("data/retail_clean.csv"))
set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)
train <- training(retail_split)
test <- testing(retail_split)
Alright! Now that we have that set, we’ll use the train
data for all the model creation and will only use the test
data when we are analyzing our models.
As you move forward in modeling, you may come across the term “generalized linear models”. According to Penn State University (“6.1 - Introduction to GLMs STAT 504” n.d.), this concept refers to the wide variety of models we can create as long as we can specify three components: a random component (the distribution we assume the response follows), a systematic component (the linear combination of predictors), and a link function that connects the two.
All of the models we have covered so far are a specific case of a generalized linear model. SLR and MLR, where the response variable is continuous, are the simplest cases: we assume the response comes from a normal distribution and use the identity link, so the linear predictor models the expected value (mean) of the response directly. We have also seen logistic regression and multinomial regression for when our response is categorical with two or more levels, where we assume the response comes from a binomial or multinomial distribution, respectively. These two use the log odds (logit) as the link function, which is why we have been predicting the log odds of an outcome rather than the outcome itself.
Generalized linear models also use maximum likelihood estimation to calculate the estimates (as opposed to the least squares method traditionally used in SLR and MLR). Here, we are choosing coefficient values that make the data we already have as likely as possible. There are many other kinds of generalized linear models. As long as we can specify the three components above, we can create a model. What if we want to assume our response variable comes from a Poisson distribution? A negative binomial? A gamma? All we need to complete the model is a link function. To demonstrate another example of a generalized linear model, we will explore Poisson regression in this guide.
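To make this concrete, here is a small sketch (using a simulated toy dataset, not our retail data) of how base R’s glm() expresses these pieces: you choose a family (the distribution you assume the response follows) and a link function, and the rest of the model-fitting code stays the same. Later in this guide we will fit the Poisson model through tidymodels instead, but the same family-plus-link idea is at work under the hood.
set.seed(8)
toy <- tibble(x = rnorm(200),
              y = rpois(200, lambda = exp(0.5 + 0.3 * x)))  # simulated count response
glm(y ~ x, data = toy, family = poisson(link = "log"))      # a Poisson GLM
# Swapping the family changes the model: gaussian(link = "identity") gives SLR/MLR,
# binomial(link = "logit") gives logistic regression.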
As indicated above, we are interested in Poisson regression if we want to assume our response variable follows a Poisson distribution. (“Poisson” should be pronounced “pwah-soh” because it’s a French name, but most people pronounce it “poy-sahn”.) A Poisson distribution is commonly used for variables that are counts, as it gives the probability that an event occurs a certain number of times in a period of time. In other words, we are counting occurrences of something and want to know how many will occur.
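For a quick feel for what this distribution looks like, here is a tiny sketch using dpois(); the rate of 3 events per period is just an illustrative value, not something from our data.
# If an event happens 3 times per period on average (lambda = 3), dpois() gives
# the probability of seeing exactly k events in one period.
dpois(0:5, lambda = 3)
# The probabilities over all possible counts sum to 1.
sum(dpois(0:30, lambda = 3))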
The formula for Poisson regression is simpler than logistic and multinomial regression:
\[ \log{\lambda} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ ... \] Instead of predicting the log odds of an occurrence, we are predicting the log of the average number of times (\(\lambda\)) some event occurs. Yes, we are predicting the average number of occurrences. This can make model interpretation a little tricky, but we’ll work through it below once we get model output.
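To see where the interpretation comes from, we can undo the log: \[ \lambda = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ ...} = e^{\beta_0} \cdot e^{\beta_1 x_1} \cdot e^{\beta_2 x_2} \cdot \ ... \] A one-unit increase in \(x_1\) therefore multiplies the predicted mean count \(\lambda\) by \(e^{\beta_1}\), which is why we will exponentiate the coefficients when we interpret the fitted model.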
The assumptions for Poisson regression are similar to those for logistic and multinomial regression: the response is a count, the observations are independent, the mean and variance of the response are equal, the log of the mean is a linear function of the predictors, and the predictors are not strongly correlated with one another. We will check each of these after we fit a model.
We only have one variable that is a count variable, and that is qty, the number of a specific item purchased. That makes choosing the response variable easy!
As for the predictors, I once again am going to collect all the variables I think might play a role in predicting how many of a certain item a customer will buy.
store
customer_type
gender
product_class
pay_method
unit_price
I also am going to create a new variable that we used in the previous guide: time_of_day, which will take on one of the values “Midday”, “Afternoon”, or “Evening”.
train_subset <- train %>%
mutate(hour = hour(time),
time_of_day = case_when(hour %in% c(10, 11, 12, 13, 14) ~ "Midday",
hour %in% c(15, 16, 17, 18) ~ "Afternoon",
TRUE ~ "Evening")) %>%
select(qty, store, customer_type, gender, product_class, pay_method, unit_price, time_of_day)
test_subset <- test %>%
mutate(hour = hour(time),
time_of_day = case_when(hour %in% c(10, 11, 12, 13, 14) ~ "Midday",
hour %in% c(15, 16, 17, 18) ~ "Afternoon",
TRUE ~ "Evening")) %>%
select(qty, store, customer_type, gender, product_class, pay_method, unit_price, time_of_day)
It’s usually a good idea to take a look at our variables and how they interact before we make a model. So let’s do that now!
train_subset %>%
ggplot(aes(x = qty, fill = store)) +
geom_bar() +
scale_fill_brewer(palette = "Dark2") +
labs(x = "Quantity Sold in One Transaction",
y = "Count",
title = "Distribution of Quantities Sold vs. Store Visited",
fill = "Store") +
my_theme
This variable might not have much of an effect on quantity sold. For the most part, each store’s count at a given quantity is roughly proportional to the total number of transactions at that quantity, with only a few exceptions.
train_subset %>%
ggplot(aes(x = qty, fill = customer_type)) +
geom_bar() +
scale_fill_brewer(palette = "Dark2") +
labs(x = "Quantity Sold in One Transaction",
y = "Count",
title = "Distribution of Quantities Sold vs. Member Status",
fill = "Member Status") +
my_theme
There doesn’t appear to be a huge difference between member statuses in the quantities sold. It does seem like more members buy 10 of an item and more non-members buy 8 or 9 of an item, but these differences are not extraordinary.
train_subset %>%
ggplot(aes(x = qty, fill = gender)) +
geom_bar() +
scale_fill_brewer(palette = "Dark2") +
labs(x = "Quantity Sold in One Transaction",
y = "Count",
title = "Distribution of Quantities Sold by Gender",
fill = "Gender") +
my_theme
We could end up seeing an effect here. Just by looking, I see that more males buy 1, 4, or 7 of an item at a time, while more females buy 5, 6, 8, or 9.
train_subset %>%
ggplot(aes(x = qty, fill = product_class)) +
geom_bar() +
scale_fill_brewer(palette = "Dark2") +
labs(x = "Quantity Sold in One Transaction",
y = "Count",
title = "Distribution of Quantities Sold vs. Product Class",
fill = "Product Class") +
my_theme
There is a lot going on here, so it is hard to tell what role product class might end up playing in the model. However, I do notice that if someone is going to buy a fashion accessory, they tend to buy 1 or 2 of them. On the other hand, if someone is going to buy a food or beverage, they have a higher tendency to buy 3 or 5 of them. Still, none of these observations stand out as a solid pattern.
train_subset %>%
ggplot(aes(x = qty, fill = pay_method)) +
geom_bar() +
scale_fill_brewer(palette = "Dark2") +
labs(x = "Quantity Sold in One Transaction",
y = "Count",
title = "Distribution of Quantities Sold vs. Payment Method") +
my_theme
Once again, I see a few things that may have an effect, but nothing that stands out.
train_subset %>%
ggplot(aes(x = unit_price, y = factor(qty), fill = factor(qty))) +
geom_boxplot() +
scale_fill_brewer(palette = "Paired") +
labs(x = "Unit Price",
y = "Quantity Sold",
title = "Distribution of Quantities Sold vs. Unit Price",
fill = "Quantity") +
my_theme
It looks like there won’t be as much of an effect here as I was originally thinking. I thought that as unit price increased, it would be less likely that someone would buy many of that item. From this plot (and the plots above), it looks like this company sells quantities of 10 fairly often, but there doesn’t seem to be a pattern between the price of an item and how many are sold. The unit prices range from about $10 to $100 regardless of the quantity sold.
train_subset %>%
ggplot(aes(x = qty, fill = time_of_day)) +
geom_bar() +
scale_fill_brewer(palette = "Dark2") +
labs(x = "Quantity Sold in One Transaction",
y = "Count",
title = "Distribution of Quantities Sold by Time Period",
fill = "Time of Day") +
my_theme
Overall, most goods are sold midday and the fewest in the evening. But we do see that quantities of 1, 3, and 10 are more likely to be sold in the afternoon.
I’ll also take a look at all of the variables together just for good measure.
train_subset %>%
ggpairs() +
my_theme
I don’t see any need for interaction variables or variable transformations.
Let’s try to build a model with all of the above variables. I do anticipate having to remove some variables because they contribute little to predicting the number of goods sold. I have a good idea of which variables these will be, but I would like to see what the p-values look like to help guide my decision making.
poi_spec <- poisson_reg()
poi_mod_full <- poi_spec %>%
fit(qty ~ store + customer_type + gender + product_class + pay_method + unit_price + time_of_day, data = train_subset) %>%
repair_call(data = train_subset) # repair the underlying model call so downstream packages (like performance) can work with the fit
tidy(poi_mod_full, exp = TRUE) %>%
mutate(p.value = format.pval(p.value, digits = 2)) %>%
kable() %>%
kable_styling(c("striped", "responsive"))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 5.9031001 | 0.0624539 | 28.4286060 | < 2e-16 |
storeB | 1.0142098 | 0.0373039 | 0.3782379 | 0.70525 |
storeC | 1.0391716 | 0.0372186 | 1.0323833 | 0.30189 |
customer_typeNormal | 1.0031840 | 0.0304003 | 0.1045685 | 0.91672 |
genderMale | 0.9371310 | 0.0306640 | -2.1175365 | 0.03421 |
product_classFashion accessories | 0.8367606 | 0.0514901 | -3.4611963 | 0.00054 |
product_classFood and beverages | 0.8812484 | 0.0516387 | -2.4480801 | 0.01436 |
product_classHealth and beauty | 0.9409788 | 0.0521437 | -1.1666740 | 0.24334 |
product_classHome and lifestyle | 0.9307007 | 0.0518888 | -1.3840682 | 0.16634 |
product_classSports and travel | 0.9365300 | 0.0510866 | -1.2835805 | 0.19929 |
pay_methodCredit card | 1.0100251 | 0.0374698 | 0.2662202 | 0.79007 |
pay_methodEwallet | 1.0066169 | 0.0370091 | 0.1782019 | 0.85856 |
unit_price | 0.9996562 | 0.0005745 | -0.5984694 | 0.54953 |
time_of_dayEvening | 1.0655001 | 0.0441046 | 1.4384955 | 0.15029 |
time_of_dayMidday | 1.0542168 | 0.0344765 | 1.5314229 | 0.12566 |
Alright, here is our model output! Not all of our p-values are significant (as expected), but let’s dig into what the estimates are telling us. The interpretation of this model can be a little strange. First, you may have noticed that I used exp = TRUE in the code. This exponentiates the values in the estimate column to help with interpretation. Remember, we were predicting the log of the mean number of goods sold, so exponentiating the estimates lets us interpret them in terms of the mean number of goods sold. However, the results still don’t give us the expected change as a raw count; they give it as a relative risk, or percent change, where the baseline is 1. If a parameter’s estimate is 1 (such as what we have for unit_price), we would predict the mean number of goods sold not to change at all based on the price of one item, holding all other variables constant.
Here are some interpretations from our data. Holding all other variables constant,
we predict males to buy about 6% fewer of an item on average than females (estimate 0.94);
we predict someone buying fashion accessories to buy about 16% fewer of an item on average than someone buying electronic accessories, the baseline product class (estimate 0.84);
we predict about 7% more of an item to be sold on average in the evening than in the afternoon, the baseline time of day (estimate 1.07).
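If you’d rather read these percent changes straight off a table, here is a quick sketch that converts the exponentiated estimates we already have (an estimate of 0.94 becomes roughly a 6% decrease, 1.07 roughly a 7% increase, and so on).
# The exponentiated estimates are multiplicative effects on the mean count,
# so (estimate - 1) * 100 expresses each one as a percent change.
tidy(poi_mod_full, exp = TRUE) %>%
  mutate(pct_change = (estimate - 1) * 100) %>%
  select(term, estimate, pct_change)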
We can use this information, as well as the p-values, to derive a reduced model. I want to avoid the variables whose estimates are at or very close to 1, as this means they have little effect on predicting our response. Looking at the output, I’d say we keep gender, product_class, and time_of_day. I will say that it’s interesting that unit_price doesn’t seem to have a role here, as I would expect the price of an item to have some say in how many of that item are purchased.
poi_mod_reduced <- poi_spec %>%
fit(qty ~ gender + product_class + time_of_day, data = train_subset) %>%
repair_call(data = train_subset) # repair the underlying model call so downstream packages (like performance) can work with the fit
tidy(poi_mod_reduced, exp = TRUE) %>%
mutate(p.value = format.pval(p.value, digits = 2)) %>%
kable() %>%
kable_styling(c("striped", "responsive"))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 5.9418974 | 0.0440945 | 40.413864 | < 2e-16 |
genderMale | 0.9345872 | 0.0305029 | -2.217834 | 0.02657 |
product_classFashion accessories | 0.8381205 | 0.0512081 | -3.448544 | 0.00056 |
product_classFood and beverages | 0.8812673 | 0.0514381 | -2.457211 | 0.01400 |
product_classHealth and beauty | 0.9425170 | 0.0519899 | -1.138708 | 0.25482 |
product_classHome and lifestyle | 0.9290891 | 0.0516756 | -1.423314 | 0.15465 |
product_classSports and travel | 0.9336724 | 0.0507896 | -1.351253 | 0.17661 |
time_of_dayEvening | 1.0669125 | 0.0439675 | 1.473112 | 0.14072 |
time_of_dayMidday | 1.0543886 | 0.0344618 | 1.536805 | 0.12434 |
I like this model better, but let’s look at model performance before we make any final decisions.
model_performance(poi_mod_full) %>%
kable()
AIC | AICc | BIC | R2_Nagelkerke | RMSE | Sigma | Score_log | Score_spherical |
---|---|---|---|---|---|---|---|
4079.757 | 4080.371 | 4150.008 | 0.0344889 | 2.883914 | 1 | -2.534266 | 0.0310775 |
model_performance(poi_mod_reduced) %>%
kable()
AIC | AICc | BIC | R2_Nagelkerke | RMSE | Sigma | Score_log | Score_spherical |
---|---|---|---|---|---|---|---|
4069.234 | 4069.463 | 4111.385 | 0.0322932 | 2.885788 | 1 | -2.53519 | 0.0310738 |
There isn’t much that differs between the two models. Following the pattern we’ve seen before, we have slightly reduced the AIC and the BIC, while our \(R^2\) has barely decreased and our RMSE has barely increased. This is, to me, evidence to prefer the model with fewer variables, as we get almost the same performance as the full model.
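Since we set aside a test set at the start, here is a sketch of how we could also check the reduced model on data it has never seen, using the test_subset we built earlier and yardstick’s rmse() (loaded with tidymodels).
# Sketch: predict the quantity sold for the held-out test set and compute the RMSE
# between the predicted and actual quantities.
predict(poi_mod_reduced, new_data = test_subset) %>%
  bind_cols(test_subset %>% select(qty)) %>%
  rmse(truth = qty, estimate = .pred)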
diagnostics <- plot(check_model(poi_mod_reduced, panel = FALSE, type = "discrete_dots"))
diagnostics[[4]] +
my_theme
The good news is that we don’t seem to have any influential observations or high leverage points.
Our response is in fact a count variable. All values of the variable are positive integers. We also use valid predictors as they are all either continuous or categorical.
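As a quick sanity check (just a sketch), we can confirm this in code:
# TRUE if every quantity in the training data is a positive whole number,
# as a count variable should be.
all(train_subset$qty > 0 & train_subset$qty == round(train_subset$qty))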
If our response variable qty truly has a population Poisson distribution, we should expect to see the mean and the variance be the same. This may not hold exactly here since we only have a sample and not the entire population, but we would expect the two to be close.
mean(train_subset$qty) %>%
kable() %>%
kable_styling(full_width = FALSE)
x |
---|
5.485607 |
var(train_subset$qty) %>%
kable() %>%
kable_styling(full_width = FALSE)
x |
---|
8.480682 |
Without running any statistical tests (such as a dispersion test or goodness of fit test), it does look like the variance and mean are different: the variance is noticeably larger than the mean, which suggests overdispersion. I can’t say for certain, but we may be violating this assumption.
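If we did want a formal check, the performance package we loaded earlier provides check_overdispersion(). Here is a sketch; I’m pulling the underlying glm object out of the parsnip fit first, just to be safe.
# Dispersion test: a significant result suggests the variance exceeds the mean
# (overdispersion), which would violate the Poisson assumption.
poi_mod_reduced %>%
  extract_fit_engine() %>%   # the underlying glm object inside the parsnip fit
  check_overdispersion()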
Since we assume a linear relationship between the predictors and the log of the response variable, we should make sure we see linearity there.
train_subset %>%
mutate(log_qty = log(qty)) %>%
select(-qty) %>%
ggpairs() +
my_theme
First, we see that the log of the response variable has a wonky, left-skewed distribution, which distorts the plots involving it. But I don’t see evidence of any non-linear relationships here, except perhaps with unit_price.
We would like to see our observations be independent. We might also violate this assumption since we cannot confirm that a customer isn’t in the dataset multiple times. It may also be that the items one customer purchases have some dependency on what the previous customer purchased (for instance, if I see someone buy strawberry yogurt and I decide I want that as well).
We ideally don’t want to see our predictors be correlated. We can use a diagnostic plot to look at this.
diagnostics[[5]] +
my_theme
Our VIFs are very low so I see no reason for concern here.
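If you prefer a table of numbers over a plot, check_collinearity() from the performance package reports the same information; a quick sketch, again on the underlying glm object.
# VIF-style collinearity check; values near 1 indicate little correlation among predictors.
poi_mod_reduced %>%
  extract_fit_engine() %>%
  check_collinearity()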
Poisson regression can be a very useful tool if you are looking to predict counts of something or if you want to predict how often an event occurs in a specific time frame. This kind of problem doesn’t arise frequently but it is useful to know that it exists. Of course, you could use another form of generalized linear model if you have reason to suspect that your response variable follows a different distribution. As we saw, the main drawback to using Poisson regression is the interpretation as we have to remember we are predicting a relative risk rather than an actual count.
This guide wraps up all of the modeling that the series will cover. We will go over bootstrapping next which is a form of resampling and which uses a model but is not a model in and of itself. I feel like it might be helpful to create a little summary of the models we have covered here and when to use them.
Model Name | Type of Response | Type of Predictors | Number of Predictors |
---|---|---|---|
Simple Linear Regression | Continuous | Continuous | 1 |
Multiple Linear Regression | Continuous | Continuous, Categorical | 2+ |
Simple Logistic Regression | Binary | Continuous | 1 |
Multiple Logistic Regression | Binary | Continuous, Categorical | 2+ |
Multinomial Regression | Categorical w/ 3+ Levels | Continuous, Categorical | 2+ |
Poisson Regression | Discrete Count | Continuous, Categorical | 2+ |
Generalized Linear Model | Follows a Distribution | Continuous, Categorical | 2+ |