I first load the necessary packages.
library(baseballr)
library(tidyverse)
baseballr is a package for collecting Major League Baseball (MLB) statistics from multiple online sources. It also provides data on some pre-selected trends and a few built-in calculations.
As of October 2022, baseballr can fetch data from several sources, including Baseball Savant, Baseball Reference, and FanGraphs.
This package is especially useful when MLB data from different sources needs to be joined, for instance when combining statistics from Baseball Reference and FanGraphs.
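As a rough sketch of the kind of join described above, the snippet below merges two small tables on a shared player key. The column names (`player_id`, `bref_stat`, `fg_stat`) and all values are hypothetical placeholders, not real output from either site. In practice the two sources identify players differently, so an ID lookup step would likely be needed first. Base R's `merge()` keeps the example dependency-free; `dplyr::left_join()` would be the tidyverse equivalent.

```r
# Toy stand-in for a Baseball Reference table (hypothetical values)
bref <- data.frame(
  player_id = c("p01", "p02", "p03"),
  bref_stat = c(5, 2, 7)
)

# Toy stand-in for a FanGraphs table (hypothetical values)
fg <- data.frame(
  player_id = c("p01", "p03"),
  fg_stat = c(1.2, 0.8)
)

# Left join: keep every row of bref; fg_stat is NA where no match exists
combined <- merge(bref, fg, by = "player_id", all.x = TRUE)
combined
```

Using `all.x = TRUE` keeps players who appear in only the first table, which makes missing matches visible as `NA` instead of silently dropping rows.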
Below are a few ways in which the baseballr package might be used to grab data.
The Savant database is a large, searchable repository of MLB data extending back to 2008. It can be searched on the web using a large number of custom filters. The web interface automatically builds aggregate summaries from the selected filters, but the underlying raw data is pitch-by-pitch, which gives the researcher a great deal of freedom.
The package retrieves the raw data for a query, which can either name a specific batter or pitcher or request all of the raw data within a given date range.
# Search for all data for Max Scherzer in June 2021
scherzer <- statcast_search_pitchers(
  start_date = '2021-06-01',
  end_date = '2021-06-30',
  pitcherid = 453286)
head(scherzer)
## # A tibble: 6 × 92
## pitch_type game_date release_…¹ relea…² relea…³ playe…⁴ batter pitcher events
## <chr> <date> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 FF 2021-06-27 95 -3.26 5.59 Scherz… 542932 453286 "stri…
## 2 CH 2021-06-27 84.3 -3.38 5.22 Scherz… 542932 453286 ""
## 3 FF 2021-06-27 94.6 -3.16 5.52 Scherz… 542932 453286 ""
## 4 SL 2021-06-27 85.8 -3.39 5.19 Scherz… 542932 453286 ""
## 5 FF 2021-06-27 95.5 -3.18 5.5 Scherz… 542932 453286 ""
## 6 CH 2021-06-27 83.2 -3.29 5.41 Scherz… 542932 453286 ""
## # … with 83 more variables: description <chr>, spin_dir <lgl>,
## # spin_rate_deprecated <lgl>, break_angle_deprecated <lgl>,
## # break_length_deprecated <lgl>, zone <dbl>, des <chr>, game_type <chr>,
## # stand <chr>, p_throws <chr>, home_team <chr>, away_team <chr>, type <chr>,
## # hit_location <int>, bb_type <chr>, balls <int>, strikes <int>,
## # game_year <int>, pfx_x <dbl>, pfx_z <dbl>, plate_x <dbl>, plate_z <dbl>,
## # on_3b <dbl>, on_2b <dbl>, on_1b <dbl>, outs_when_up <int>, inning <dbl>, …
The above chunk searches for all of the pitch-by-pitch data for Max Scherzer in June 2021. The result is a large data frame containing attributes that can be pulled out for use.
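As one example of pulling attributes out, the sketch below averages release speed by pitch type. To stay runnable without a network call, it retypes only the six rows printed above into a small data frame named `scherzer_head` (a name introduced here for illustration). With the full query result, `scherzer` could be used in its place.

```r
# The six rows shown in the printed output above (pitch type and release speed)
scherzer_head <- data.frame(
  pitch_type = c("FF", "CH", "FF", "SL", "FF", "CH"),
  release_speed = c(95, 84.3, 94.6, 85.8, 95.5, 83.2)
)

# Mean release speed (MPH) for each pitch type
speed_by_pitch <- aggregate(release_speed ~ pitch_type,
                            data = scherzer_head, FUN = mean)
speed_by_pitch
```

Six pitches are far too few to say anything about a pitcher; the point is only the shape of the group-and-summarize step.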
From here, we might plot some data!
scherzer_plot <- scherzer %>%
  ggplot(aes(x = release_speed, y = release_spin_rate, color = pitch_type)) +
  geom_point() +
  labs(title = 'Max Scherzer: Release Speed vs. Ball Spin Rate',
       subtitle = 'Broken down by pitch type',
       x = 'Release Speed (MPH)',
       y = 'Release Spin Rate (RPM)') +
  guides(color = guide_legend(title = "Pitch Type")) +
  scale_color_brewer(palette = "Dark2")
scherzer_plot
Baseball Reference is another source of baseball data. The baseballr package can scrape aggregate player performance data as well as historical standings as of any date. There is also a function to calculate “team consistency”. Baseball Reference is more likely to be used for “typical” statistics such as batting average, ERA, and number of home runs.
bref_batter <- bref_daily_batter("2021-06-01", "2021-06-30")
head(bref_batter)
## # A tibble: 6 × 30
## bbref_id season Name Age Level Team G PA AB R H X1B
## <chr> <int> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 547989 2021 Jose Al… 31 Maj-… Hous… 27 129 101 24 26 13
## 2 660670 2021 Trea Tu… 28 Maj-… Wash… 28 123 113 24 39 27
## 3 642715 2021 DJ LeMa… 32 Maj-… New … 26 123 113 12 33 24
## 4 571431 2021 Freddie… 31 Maj-… Atla… 28 122 108 20 33 23
## 5 656180 2021 Marcus … 30 Maj-… Toro… 26 122 110 24 29 15
## 6 501303 2021 Jonatha… 24 Maj-… Cinc… 27 121 99 24 30 21
## # … with 18 more variables: X2B <dbl>, X3B <dbl>, HR <dbl>, RBI <dbl>,
## # BB <dbl>, IBB <dbl>, uBB <dbl>, SO <dbl>, HBP <dbl>, SH <dbl>, SF <dbl>,
## # GDP <dbl>, SB <dbl>, CS <dbl>, BA <dbl>, OBP <dbl>, SLG <dbl>, OPS <dbl>
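As a quick sanity check on one of the "typical" statistics, batting average is hits divided by at-bats. The sketch below recomputes it from the `AB` and `H` values of the six rows printed above. The data frame name `bref_head` and the column `BA_check` are introduced here for illustration; with the full result, the same expression could be compared against the `BA` column.

```r
# AB and H for the six rows shown in the printed output above
bref_head <- data.frame(
  bbref_id = c("547989", "660670", "642715", "571431", "656180", "501303"),
  AB = c(101, 113, 113, 108, 110, 99),
  H  = c(26, 39, 33, 33, 29, 30)
)

# Batting average = hits / at-bats, rounded to three decimals
bref_head$BA_check <- round(bref_head$H / bref_head$AB, 3)
bref_head
```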
bref_batter %>%
  filter(PA >= 15) %>%
  ggplot(aes(x = BA, y = OBP)) +
  geom_point() +
  labs(title = 'Batting Average vs. On-Base Percentage for Batters in June 2021',
       subtitle = 'Minimum 15 Plate Appearances')