Over the last month I’ve been learning the programming language R by analyzing hockey statistics, which has been a tad more difficult than I thought it would be. Let’s take a look at what is out there and what I’ve been able to accomplish so far this summer:
Hockey Stats as a Business Problem
After spending some time researching the field of hockey stats and analytics, I’ve learned there is a lot more out there than I expected. Not only are there sites that let users access NHL data for free, there are also people visualizing and modeling both teams and individual players, building predictive models of which teams will or will not win each night, and even comparing amateur leagues against the NHL to estimate how high a player should be drafted. Some helpful sites I’ve found:
HockeyAbstract.com: has a comprehensive list of where to find team stats, individual stats, payment data, predictions, and visualizations. (Link here: http://www.hockeyabstract.com/thoughts/completelistofhockeyanalyticsdataresources)
HockeyGraphs.com https://hockey-graphs.com/interactive-nhl-graphs/ &
Both of these have some really awesome visuals that I never would have guessed were available or even worth looking at (e.g. heat maps of where shots are taken).
Money Puck http://moneypuck.com/predictions.htm
Predicts which teams will win each night and which teams are most likely to win the Stanley Cup.
In my last article I mentioned that NHL.com has data on its site, but that I would likely need to learn how to scrape web data to get at it. After some research, it turns out there is a tool for R called NHLscrapr, which accesses play-by-play data, by game, from the NHL’s API. However, the package is buggy and no longer works. After realizing this, I worked my way through several modules on DataCamp.com to see if I could learn to do this on my own. Once it became clear I’d need to get much more acquainted with Java and XML, I decided it would be worthwhile to find another solution for the time being, since my goals for the summer were limited to learning data manipulation, visualization, and modeling. Too bad there aren’t any free SQL servers I could plug into!
With web scraping out of the question for now, I did some more research on what else is out there. It turns out there is a lot, and after tinkering around I found a couple of sites that let you export team and individual data as CSV files, which is good enough for my project goals.
Sites with Exportable Data:
Xtra Hockey Stats http://xtrahockeystats.com/teams.php
Firstline Stats http://firstlinestats.com/teams/
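Once a table is exported, getting it into R only takes a few lines. Here is a minimal sketch using base R. The column names (Team, GP, GF, GA) and the numbers are placeholders made up for illustration, not the actual layout of either site's export, and the temp file simply stands in for a downloaded CSV.

```r
# Stand-in for a downloaded export. The column names and numbers below are
# made-up placeholders, not the real schema or data from either site.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Team,GP,GF,GA",
  "AAA,82,250,230",
  "BBB,82,260,205",
  "CCC,82,228,254"
), csv_path)

# Read the file into a data frame, keeping text columns as plain strings.
teams <- read.csv(csv_path, stringsAsFactors = FALSE)

# Derive a simple per-team metric: goal differential (goals for minus against).
teams$GoalDiff <- teams$GF - teams$GA

# Inspect the structure, then sort teams from best to worst differential.
str(teams)
teams_sorted <- teams[order(-teams$GoalDiff), ]
print(teams_sorted)
```

With a real export, `csv_path` would point at the downloaded file, and `str(teams)` is a quick way to check which columns actually came through and what types R assigned to them.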
After getting data from these sites, I needed to figure out how to start exploring it. It turns out a good tool for this is ggplot2, a visualization package for R. After playing around with player-level data I found on Corsica.Hockey, I was able to stand up a couple of graphs and ask several questions, such as:
1. Do teams that make the Stanley Cup Playoffs have offensive players who score more than those on other teams? This can be easily visualized in a box-and-whisker plot, as seen below along with the code that was used to generate it.
2. Is a specific player overrated or underrated? On reddit.com/r/wildhockey I was seeing a lot of hate toward one player in particular (the Minnesota Wild’s Nino Niederreiter), so I decided to plot him against the rest of the team and the league to see how he compared.
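The original plots and their code are not reproduced in this text, so here is a minimal sketch of how both questions might be approached with ggplot2. Everything in the data frame is an assumption: the column names (`player`, `playoffs`, `points`) are hypothetical and the values are randomly generated toy numbers, not real NHL statistics.

```r
library(ggplot2)

# Toy data only: players, groups, and point totals are randomly generated
# placeholders, not real NHL statistics.
set.seed(42)
skaters <- data.frame(
  player   = paste("Player", 1:60),
  playoffs = rep(c("Made playoffs", "Missed playoffs"), each = 30),
  points   = c(rpois(30, lambda = 45), rpois(30, lambda = 38))
)

# Question 1: compare the scoring distributions of the two groups.
p_box <- ggplot(skaters, aes(x = playoffs, y = points)) +
  geom_boxplot() +
  labs(x = NULL, y = "Points", title = "Scoring by playoff status (toy data)")

# Question 2: overlay one player of interest on the same distributions.
focus <- subset(skaters, player == "Player 7")
p_focus <- p_box +
  geom_point(data = focus, colour = "red", size = 3) +
  geom_text(data = focus, aes(label = player), vjust = -1)

print(p_box)
print(p_focus)
```

Layering the highlighted point on top of the box plot keeps the league-wide distribution visible, so it is easy to see whether the player sits above or below the median for his group.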
Overall, once I got my hands on some data to analyze, I found it rather straightforward to stand up the graphs using ggplot2 and then reproduce them with whatever variables I chose. The one challenge I’m running into is formatting, which should get easier over time as I make more and more of these.
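One way to make that kind of reuse explicit is to wrap the plot in a small function that takes column names as strings. This is only a sketch under assumptions: `plot_by_group()` is a hypothetical helper, the column names are invented, and the data is randomly generated. The `theme_minimal()` and `theme()` calls are just one example of where formatting tweaks could live.

```r
library(ggplot2)

# Hypothetical helper: draws the same box plot for any grouping column and
# any numeric column, both passed as strings.
plot_by_group <- function(data, group_col, value_col) {
  ggplot(data, aes(x = .data[[group_col]], y = .data[[value_col]])) +
    geom_boxplot() +
    labs(x = NULL, y = value_col) +
    theme_minimal() +                                  # shared base styling
    theme(axis.text = element_text(size = 11))         # one place for tweaks
}

# Toy data only: randomly generated placeholders, not real NHL statistics.
set.seed(1)
skaters <- data.frame(
  playoffs = rep(c("Made playoffs", "Missed playoffs"), each = 25),
  goals    = rpois(50, lambda = 15),
  assists  = rpois(50, lambda = 22)
)

# Same plot, different variable each time.
p_goals   <- plot_by_group(skaters, "playoffs", "goals")
p_assists <- plot_by_group(skaters, "playoffs", "assists")
print(p_goals)
print(p_assists)
```

Keeping the styling inside the function means a formatting fix only has to be made once for every graph built this way.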
Next steps include making more exploratory diagrams, adding in some statistics to identify components that predict teams making the Stanley Cup Play-Offs, and eventually using these to create predictive models for the 2018–2019 season.
Notes: Throughout this entire process I will be using the programming language R, accompanied by RStudio.
This article is part of a multi-part series on my first data science project and how I applied it to the sport of hockey. Follow me on Twitter: Josh Heurung or connect with me on LinkedIn: JRHeurung