Hey there! Welcome to my Modeling Guide with R. Over the course of the next few guides, you’ll get to see a few modeling concepts in action. The dataset used here was downloaded from Kaggle (Paye n.d.). It is unclear if this dataset was generated randomly or is based on real sales data; however, it has plenty of variables for us to explore.
The guide you are reading dives into a generalized exploratory data analysis (EDA) of the dataset. Before we can jump to modeling, we have to make sure we have a good understanding of what’s going on in the dataset. It’s not a good idea to jump right into inference; if we can’t explain what the variables are trying to say, we won’t be able to provide any insight. Moreover, we may need to do some data cleaning and we also may have questions to bring back to the data provider.
Let’s begin importing some packages and reading in the dataset. I
rely heavily on the tidyverse
family of packages and draw
on other packages as needed. Here, we will be using the
lubridate
package to work with dates and times.
library(tidyverse)
library(lubridate)
library(knitr)
library(kableExtra)
retail <- read_csv(here::here("data/retail_raw.csv"))
dim(retail)
## [1] 1000 17
head(retail) %>%
kable() %>%
kable_styling(c("striped", "responsive")) %>%
scroll_box(width = "100%", box_css = "border: 2px solid black; padding: 5px; ")
Invoice ID | Branch | City | Customer type | Gender | Product line | Unit price | Quantity | Tax 5% | Total | Date | Time | Payment | cogs | gross margin percentage | gross income | Rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
750-67-8428 | A | Yangon | Member | Female | Health and beauty | 74.69 | 7 | 26.1415 | 548.9715 | 1/5/2019 | 13:08:00 | Ewallet | 522.83 | 4.761905 | 26.1415 | 9.1 |
226-31-3081 | C | Naypyitaw | Normal | Female | Electronic accessories | 15.28 | 5 | 3.8200 | 80.2200 | 3/8/2019 | 10:29:00 | Cash | 76.40 | 4.761905 | 3.8200 | 9.6 |
631-41-3108 | A | Yangon | Normal | Male | Home and lifestyle | 46.33 | 7 | 16.2155 | 340.5255 | 3/3/2019 | 13:23:00 | Credit card | 324.31 | 4.761905 | 16.2155 | 7.4 |
123-19-1176 | A | Yangon | Member | Male | Health and beauty | 58.22 | 8 | 23.2880 | 489.0480 | 1/27/2019 | 20:33:00 | Ewallet | 465.76 | 4.761905 | 23.2880 | 8.4 |
373-73-7910 | A | Yangon | Normal | Male | Sports and travel | 86.31 | 7 | 30.2085 | 634.3785 | 2/8/2019 | 10:37:00 | Ewallet | 604.17 | 4.761905 | 30.2085 | 5.3 |
699-14-3026 | C | Naypyitaw | Normal | Male | Electronic accessories | 85.39 | 7 | 29.8865 | 627.6165 | 3/25/2019 | 18:30:00 | Ewallet | 597.73 | 4.761905 | 29.8865 | 4.1 |
It looks like we have a dataset that has 1000 observations with 17 variables. This is not a ton of observations but is enough to present the modeling concepts.
Now that this is done, let’s look at the variable names and the kinds
of variables we have in this dataset. For this, we can use the
summary()
function which will give us a brief understanding
of the variables.
summary(retail)
## Invoice ID Branch City Customer type
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Gender Product line Unit price Quantity
## Length:1000 Length:1000 Min. :10.08 Min. : 1.00
## Class :character Class :character 1st Qu.:32.88 1st Qu.: 3.00
## Mode :character Mode :character Median :55.23 Median : 5.00
## Mean :55.67 Mean : 5.51
## 3rd Qu.:77.94 3rd Qu.: 8.00
## Max. :99.96 Max. :10.00
## Tax 5% Total Date Time
## Min. : 0.5085 Min. : 10.68 Length:1000 Length:1000
## 1st Qu.: 5.9249 1st Qu.: 124.42 Class :character Class1:hms
## Median :12.0880 Median : 253.85 Mode :character Class2:difftime
## Mean :15.3794 Mean : 322.97 Mode :numeric
## 3rd Qu.:22.4453 3rd Qu.: 471.35
## Max. :49.6500 Max. :1042.65
## Payment cogs gross margin percentage gross income
## Length:1000 Min. : 10.17 Min. :4.762 Min. : 0.5085
## Class :character 1st Qu.:118.50 1st Qu.:4.762 1st Qu.: 5.9249
## Mode :character Median :241.76 Median :4.762 Median :12.0880
## Mean :307.59 Mean :4.762 Mean :15.3794
## 3rd Qu.:448.90 3rd Qu.:4.762 3rd Qu.:22.4453
## Max. :993.00 Max. :4.762 Max. :49.6500
## Rating
## Min. : 4.000
## 1st Qu.: 5.500
## Median : 7.000
## Mean : 6.973
## 3rd Qu.: 8.500
## Max. :10.000
Here, I’m just looking at general features. I’m making sure that each
of the character variables have length 1000 (which is how many rows we
have) and that the numerical summaries have values that make sense (for
example the Rating
variable has a max of 10 which is
logical). I’m also looking at any adjustments that I need to make to the
data to make it easier to work with. Of course, to remain ethical
analysts, we should never “edit” the data. Here, I’m talking about
editing variable names, ensuring that each variable has the correct type
(e.g., the Date
variable should be encoded as a date), and
checking for input errors.
After looking at the summary, I think I want to take the time to rename the variables. This isn’t always a necessary step, but since I will be using this data over the course of multiple guides I want to make sure I am comfortable with it. The code below:
City
variable is also removed as the
same information can be conveyed with the Store ID and helps promote
anonymity.date
variable into a date objectretail <- retail %>%
rename(id = `Invoice ID`,
store = Branch,
customer_type = `Customer type`,
gender = Gender,
product_class = `Product line`,
qty = Quantity,
date = Date,
time = Time,
pay_method = Payment,
rating = Rating,
total = Total) %>%
select(-c(`Tax 5%`, cogs, `gross margin percentage`, `gross income`, `Unit price`, City)) %>%
mutate(date = mdy(date),
subtotal = total / 1.05,
unit_price = subtotal / qty
)
head(retail) %>%
kable() %>%
kable_styling(c("striped", "responsive")) %>%
scroll_box(width = "100%", box_css = "border: 2px solid black; padding: 5px; ")
id | store | customer_type | gender | product_class | qty | total | date | time | pay_method | rating | subtotal | unit_price |
---|---|---|---|---|---|---|---|---|---|---|---|---|
750-67-8428 | A | Member | Female | Health and beauty | 7 | 548.9715 | 2019-01-05 | 13:08:00 | Ewallet | 9.1 | 522.83 | 74.69 |
226-31-3081 | C | Normal | Female | Electronic accessories | 5 | 80.2200 | 2019-03-08 | 10:29:00 | Cash | 9.6 | 76.40 | 15.28 |
631-41-3108 | A | Normal | Male | Home and lifestyle | 7 | 340.5255 | 2019-03-03 | 13:23:00 | Credit card | 7.4 | 324.31 | 46.33 |
123-19-1176 | A | Member | Male | Health and beauty | 8 | 489.0480 | 2019-01-27 | 20:33:00 | Ewallet | 8.4 | 465.76 | 58.22 |
373-73-7910 | A | Normal | Male | Sports and travel | 7 | 634.3785 | 2019-02-08 | 10:37:00 | Ewallet | 5.3 | 604.17 | 86.31 |
699-14-3026 | C | Normal | Male | Electronic accessories | 7 | 627.6165 | 2019-03-25 | 18:30:00 | Ewallet | 4.1 | 597.73 | 85.39 |
Specific data explorations will occur within the respective guides. However, I do think it is important to look at some of the variables to anticipate any potential problems we might have. It also helps to get a general visual understanding of what we are working with here.
Let’s start by creating some basic one-variable plots! I’ll follow each plot with a few things that I notice and that I may want to pay attention to later.
retail %>% ggplot(aes(store, label = store)) +
geom_bar(fill = "#099392") +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
labs(x = "Store ID",
y = "Count",
title = "Counts of Transactions by Store") +
my_theme
retail %>% ggplot(aes(customer_type, label = customer_type)) +
geom_bar(fill = "#099392") +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
labs(x = "Customer Type",
y = "Count",
title = "Counts of Each Customer Type") +
my_theme
retail %>% ggplot(aes(gender, label = gender)) +
geom_bar(fill = "#099392") +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
labs(x = "Gender",
y = "Count",
title = "Counts of Each Gender",
subtitle = "Represented by This Dataset") +
my_theme
retail %>% ggplot(aes(product_class, label = product_class)) +
geom_bar(fill = "#099392") +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
labs(x = "Product Class",
y = "Count",
title = "Counts of Each Product Class") +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
my_theme
retail %>% ggplot(aes(qty)) +
geom_bar(fill = "#099392") +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
labs(x = "Number of Items",
title = "Distribution of The Quantity of an Item Sold",
subtitle = "In a Single Transaction") +
my_theme
qty
variable
retail %>% ggplot(aes(total)) +
geom_histogram(fill = "#099392", color = "black", binwidth = 50) +
labs(x = "Transaction Total",
y = "Count",
title = "Distribution of Transaction Totals") +
my_theme
retail %>% ggplot(aes(pay_method, label = pay_method)) +
geom_bar(fill = "#099392") +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
labs(x = "Payment Method",
y = "Count",
title = "Counts of Each Payment Method") +
my_theme
retail %>% ggplot(aes(rating)) +
geom_histogram(fill = "#099392", color = "black", binwidth = 0.5) +
labs(x = "Rating",
y = "Count",
title = "Distribution of Transaction Rating") +
scale_x_continuous(breaks = seq(0, 10, 1)) +
my_theme
retail %>% ggplot(aes(unit_price)) +
geom_histogram(fill = "#099392", color = "black", binwidth = 5) +
labs(x = "Price",
y = "Count",
title = "Distribution of Unit Price",
subtitle = "The Price of a Single Item") +
my_theme
The last step here will be to export our edited dataset so we can import it into other guides later on.
write_csv(retail, here::here("data/retail_clean.csv"))
That will concludes this preliminary guide on preparing our data! Over the course of the next guides, we will explore the data a little bit more and create some predictive models.