Plotting Baseball Statistics using R: Queing an SQL Database

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Introduction

This tutorial will help you link R to a locally hosted SQL database, create new variables in R from that database, and plot in R using SQL data. Social scientists please note that you can implement this code with any type of data that you would like to work with in R but exceeds the memory storage capacity of the R workspace. This is a great technique for analyzing large sample size datasets with R.

Steps needed prior to running code below:

  1. Install MySQL. This will allow to to locally host a database.

  2. Install some type of SQL GUI if desired (personally, I prefer SQLyog).

  3. Download Sean Lahman’s baseball database and load it into SQL.

##this calls the packages required. use install.packages('RMySQL') if not installed
library(RMySQL)
library(lattice)
con <- dbConnect(MySQL(),
user = 'username', #replace this with your SQL username
password = 'password', #replace this with your SQL password
host = 'localhost', #replace with your hostname or IP
dbname = 'baseball2') #replace with database name


Congratulations! You’ve successfully connected a database to your R workspace using RMySQL. The next code will show you each of the tables in the database you are connected to. Look at all that data!

dbListTables(con) #lists all tables in data
##  [1] "allstarfull"         "appearances"         "awardsmanagers"
##  [4] "awardsplayers"       "awardssharemanagers" "awardsshareplayers"
##  [7] "batting"             "battingpost"         "collegeplaying"
## [10] "fielding"            "fieldingof"          "fieldingpost"
## [13] "halloffame"          "managers"            "managershalf"
## [16] "master"              "pitching"            "pitchingpost"
## [19] "salaries"            "schools"             "seriespost"
## [22] "teams"               "teamsfranchises"     "teamshalf"

Now that this has been done, you can generate a probability density plot of average for each team from 1985 to 2015. I utilize the lattice package in this case but you can do the same using ggplot2, or any other preferred visualization package. Note that this next chunk of code features SQL syntax directly. If you would like to learn how to work directly in SQL, CodeAcademy currently features a useful beginner’s tutorial.

res <- dbSendQuery(con, "select * from batting where yearID=1984 and AB > 250");
batting1985<-fetch(res, n=-1);
attach(batting1985);
AVG<-H/AB;##compute Batting Avg using hits and at bats
densityplot(~AVG|teamID, plot.points=TRUE)##plot avg by team using lattice

Bonus Round: Sneak preview of the pitchRx package

Much thanks to the open source help provided by Carson Sievert. Please see his website for further documentation.

library(pitchRx)
library(plyr)
library(dplyr)
library(RSQLite)
library(ggplot2)
library(viridis)

The following code will scrape pitchFX from any dates specified. Here, I ask for all pitches thrown on June 20th, 2015, and store the result locally.

june19 <- scrape(start = "2015-06-19", end = "2015-06-19")#scrapes pitchFX
pitches <- plyr::join(june19$pitch, june19$atbat,
                     by = c("num", "url"), type = "inner") ##uses plyr to join data into more readable format
##specify the pitcher you are interested in
elias = subset(pitches, pitcher_name == "Roenis Elias")

# convert characters to numbers
elias$px = as.numeric(elias$px)
elias$pz = as.numeric(elias$pz)
elias$start_speed = as.numeric(elias$start_speed)

strikes <- subset(elias, des == "Called Strike")
strikeFX(strikes, geom = "tile") +
  facet_grid(pitcher_name ~ stand) +
  coord_equal() +
  theme_bw() +
  viridis::scale_fill_viridis() +
  labs(title = "Roenis Elias vs HOU 6/20/15
 4 H, 2 HR allowed, 10 K, 0 BB")

Neat. We can see from this density plot that, when Roenis Elias dominated the Houston Astros, he pitched to much different locations against left-handed batters versus right-handed batters. Perhaps we would want to load more games to make visualizations comparing Elias across games. I will not go that far in this turotial. But, if you want to load more than a few games, you will likely have to store the result in a local SQLite database. This call will locally store every pitch thrown in the MLB from 2012 onward. Note: This will take a while to load onto your machine if you specify more than a few weeks of data.

db <- src_sqlite("pitchfx.sqlite3", create = T)
scrape(start = "2012-01-01", end = Sys.Date(), connect = db$con)

Thank you for reading! Please email me at Ian.Ruginski@utah.edu if you have any questions or suggestions. I plan to add more tutorials in the future using RMarkdown covering statistical and visualization techniques using R.