My First Data Science Project — Hockey Analytics

5 min read · Jun 4, 2018

Over the last few years, I’ve become more and more interested in data science, advanced analytics, and data mining. This past semester in my graduate program, it finally came together for me what can be done analytically with data, and since I’m rather passionate about it, I’m putting together my first data science project from scratch. Over the next several posts, I’ll outline the process I plan to follow and then share my results, so that anyone interested in following in my footsteps can see what it takes to go from idea to outcome.

Knowing that this is likely going to be a tedious project with a steep learning curve, I decided to pick a problem in hockey: predicting which teams will make the Stanley Cup playoffs at the end of the season. I’ve been a lifelong fan of the sport, playing it as a kid and now as an adult, and I’ve even spent time as an official, so I already have significant interest in and knowledge of it. Plus, something similar was done in baseball, as seen in the hit movie Moneyball, where Billy Beane, the Oakland Athletics’ general manager, used analytics to turn the team around and put together a record-breaking 20 consecutive wins in one season. Surely this could be applied to hockey, right? We’re about to find out.

Coming from a background in business analysis, I already have some knowledge of what this analytic process should look like. Over the next few months, I’m going to follow CRISP-DM (the Cross-Industry Standard Process for Data Mining) as closely as possible, breaking the project into the following steps:

1. Gain Understanding Of The Business Problem

In the sport of hockey, each team plays a predefined number of regular-season games (typically from October through April). At the conclusion of the regular season, 16 teams move on to play in the Stanley Cup playoffs, where the winner takes all. To be a Cup contender, a team first needs to qualify for the playoffs, which is what I’m initially interested in predicting.

2. Data Understanding

Hockey, like many sports, involves moving an object (in hockey, a puck) from the defensive end into the offensive end and then putting it into the opponent’s net. At the end of the game, the team with the most goals wins. Since I already have a background in the sport, the majority of the publicly recorded statistics are relatively straightforward. After spending some time researching hockey analytics, though, it’s clear that people are tracking far more complicated statistics. Where I might be familiar with shots taken (attempts to put a puck in the net), others are tracking statistics such as Corsi, which counts shots + missed shots + blocked shots. That is rather eye-opening and makes me feel like Alice about to tumble down the rabbit hole.
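
To make the Corsi definition concrete, here is a minimal sketch of the arithmetic. It is written in Python purely for illustration, the function names are my own, and the game numbers are made up rather than real NHL data. The second function, a "Corsi for" percentage, is a common companion measure that goes slightly beyond the definition above.

```python
def corsi(shots_on_goal: int, missed_shots: int, blocked_shots: int) -> int:
    """Corsi counts every shot attempt: on goal, missed, or blocked."""
    return shots_on_goal + missed_shots + blocked_shots


def corsi_for_pct(corsi_for: int, corsi_against: int) -> float:
    """Share of all shot attempts in a game that belonged to one team."""
    total = corsi_for + corsi_against
    if total == 0:
        raise ValueError("no shot attempts recorded")
    return 100.0 * corsi_for / total


# Made-up single-game numbers, for illustration only.
home = corsi(shots_on_goal=30, missed_shots=12, blocked_shots=8)
away = corsi(shots_on_goal=25, missed_shots=9, blocked_shots=6)

print(home, away, round(corsi_for_pct(home, away), 1))
```

A team that only looks even on shots on goal can look quite different once missed and blocked attempts are counted, which is the appeal of the statistic.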

3. Data Availability and Understanding

This is likely going to be the most difficult part of the project. In my work environment, I typically have access to a number of self-service tools, or I can query a relational database to get what I need. After doing some research on NHL.com, it looks like a lot of statistics are already published on the web (such as Figure 1 below). I could attempt to copy and paste all of this data into an Excel file and do my analysis there, but I can’t easily copy all of it at once. According to Figure 2, there are more than 450 pages of team data, meaning I would need to copy and paste over 450 times just to get the data, and that is just for the teams! There is no way I’m going to do that over 1,000 times; it is too time-consuming and opens up too many possibilities for human error. I will likely need to figure out how to use a scripting tool to help with this. Fortunately, it looks like R has something available, so that will be my tool of choice moving forward.

Figure 2. Number of Pages Available for All Team Data
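
The idea of replacing 450+ rounds of copy and paste with a script can be sketched roughly as follows. This is only an illustration in Python (the project itself will use R), and everything here is an assumption: `collect_all_pages` and `fake_fetch_page` are hypothetical names, and the fake fetcher returns made-up rows instead of downloading anything from a real site.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def collect_all_pages(fetch_page: Callable[[int], List[Row]], max_pages: int) -> List[Row]:
    """Call fetch_page(1), fetch_page(2), ... and pool the rows.

    Stops early as soon as a page comes back empty.
    """
    rows: List[Row] = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_page(page)
        if not page_rows:
            break
        rows.extend(page_rows)
    return rows


def fake_fetch_page(page: int) -> List[Row]:
    """Hypothetical stand-in for a real downloader: 3 pages of 2 made-up rows."""
    if page > 3:
        return []
    return [{"team": f"Team {page}-{i}", "page": str(page)} for i in (1, 2)]


all_rows = collect_all_pages(fake_fetch_page, max_pages=450)
print(len(all_rows))
```

Passing the page fetcher in as a function keeps the loop itself easy to test; a real scraper would swap in a function that downloads and parses one page of the stats table.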

Once I finally have all of this pulled and aggregated in some normalized manner, I’ll need to clean it: figure out how to deal with missing values, determine which variables are significant in predicting playoff contenders, visualize trends to test for independence, and so on, before moving on to the fun of model building.
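
As one example of what dealing with missing values could look like, here is a small sketch that fills gaps in a column with that column’s mean. Mean imputation is just one simple option, not necessarily the approach I will end up using, and the team names and goal totals are made up. It is written in Python for illustration only.

```python
from statistics import mean


def impute_mean(rows, column):
    """Return copies of rows with None in `column` replaced by the column mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    if not observed:
        raise ValueError(f"no observed values in column {column!r}")
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


# Made-up rows; Team B is missing its goal total.
rows = [
    {"team": "Team A", "goals_for": 240},
    {"team": "Team B", "goals_for": None},
    {"team": "Team C", "goals_for": 260},
]

cleaned = impute_mean(rows, "goals_for")
print(cleaned[1])
```

The function returns new rows rather than editing the originals, so the raw pull stays untouched if a different cleaning strategy turns out to be better.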

4. Model Building

This phase will involve using my data to actually predict our playoff contenders. To do this, I’m planning to use a number of supervised models, including regression, decision trees, and neural networks. Further, I’ll likely build this out in two phases: phase one will look at team data only (e.g. how many goals a team scores versus how many it lets in), and phase two will incorporate player data (e.g. the number of all-stars a team has).
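
To give a feel for phase one, here is a minimal sketch of one such supervised model: a one-feature logistic regression that maps a team’s goal differential to a playoff probability. It is written in plain Python for illustration (the project itself will use R), the season totals are made up, and the choice of feature, learning rate, and number of epochs are all assumptions rather than results.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


# Made-up season totals: (goals_for, goals_against, made_playoffs).
seasons = [
    (270, 210, 1), (255, 230, 1), (240, 235, 1), (238, 241, 1),
    (236, 240, 0), (245, 242, 0), (225, 250, 0), (210, 265, 0),
]

# Single feature: goal differential, scaled so values stay near [-1, 1].
xs = [(gf - ga) / 100.0 for gf, ga, _ in seasons]
ys = [made for _, _, made in seasons]
w, b = fit_logistic(xs, ys)


def playoff_probability(goals_for: int, goals_against: int) -> float:
    """Predicted chance of qualifying, given season goal totals."""
    return sigmoid(w * (goals_for - goals_against) / 100.0 + b)


print(round(playoff_probability(260, 220), 2))
```

Decision trees and neural networks would slot into the same shape: features in, a playoff probability out, so the models can be compared on the same data.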

5. Share Results and Load Into Production

My hope is that once I have something built, I can share it with all readers and, ideally, load it into production so that fans can follow their favorite team throughout the 2018–2019 season.

Hopefully this and the next several articles will give you an opportunity to see what an analytical process looks like from start to finish, as I plan on sharing the problems I encounter, and my solutions to them, along the way.

Notes: Throughout this entire process I will be using the programming language R, accompanied by RStudio.

This article is part of a multi-part series on my first data science project and how I applied it to the sport of hockey. Follow me on Twitter: Josh Heurung or connect with me on LinkedIn: JRHeurung

Written by Josh Heurung
