This is the first, non-technical, part of this series. See the second part for more detail.
I was recently looking for a good machine learning task to try out, and I thought that doing something NFL-related would be interesting, because the NFL season is about to start (finally!).
Why was I looking for a good machine learning task to try out? I have mostly done my data analysis work in R, but recently, I have been moving over to Python. As part of that process, trying as many real-world problems out as possible helps. I have also been developing a lightweight, modular, machine learning framework with my company, Equirio.
We are going to start with a high-level, nontechnical overview of what we will be doing, and then follow that up with some technical details in a second post.
High Level Overview
Machine learning description
In machine learning, the goal is to learn from data and a known outcome to predict an unknown outcome for future data. For example, let’s say that we have data for how hot it has been for the past 10 days, and we want to predict how hot it will be tomorrow. The data (how hot it has been for the past 10 days), and the outcome (how hot it will be tomorrow), will be somewhat correlated. So, we can take data and outcomes from the past (ie, how hot it was for the 10 days before today, and how hot it was today), and use it to predict how hot it will be tomorrow. This will not be a perfect prediction, and we will have some error, as we do not have all of the needed information. Maybe there is a cold front coming in from the north, but in our simple model, we don’t have that information.
How does this apply to NFL data?
In my case, I wanted to predict something NFL-related. One of the main problem in doing this kind of analysis, believe it or not, is easy access to data. It is harder to get detailed per-game data such as who did what on what play, or even season-level statistics per player.
What is fairly easy to get is box score data, such as the below from Pro football reference.
You can see that it is very generic information: for each game, we have the winner, the loser, who was at home, how many points each team had, how many turnovers each team had, and how many yards each team gained.
Given this basic information, one of the simplest things we can predict is a teams win/loss record in a given season.
Before we get started
Before we get started, we have to define an error metric and a baseline. For example, if the algorithm predicts that the Washington Redskins will win 6 games next year, and they actually win 7 games, was the algorithm good?
To answer this, we need some baseline to measure against. First, we will define the error metric. The error metric will just be the mean of the absolute value of all the predictions minus all of the actual results.
Let’s take the 2010 season (unfortunately, these tables may look bad in an RSS reader):
|1440||14||new orleans saints||2010||11|
|1442||15||new england patriots||2010||14|
|1443||4||tampa bay buccaneers||2010||10|
|1445||2||st. louis rams||2010||7|
|1451||13||new york giants||2010||10|
|1452||15||green bay packers||2010||14|
|1457||0||st. louis cardinals||2010||0|
|1460||14||san francisco 49ers||2010||6|
|1468||0||los angeles rams||2010||0|
|1469||8||san diego chargers||2010||9|
|1471||8||new york jets||2010||13|
|1474||7||kansas city chiefs||2010||10|
|1477||0||los angeles raiders||2010||0|
We can see each team, along with how many games it won in 2010 (total_wins), and how many it won in 2011 (next_year_wins). Let’s say that we predict that each team will win the same amount of games in 2011 as it won in 2010. Thankfully, we already know how many games each team won, so we can use our error metric to calculate the error.
Once we remove the 2012 season (we don’t know what the wins will be next year), and any teams with 0 victories (teams that do not exist anymore), we can calculate the total error for all of the seasons. The error comes out to be 3.1. So, if we just assume that teams will win as many games next year, the actual number will, on average be +/- 3.1 games away.
Let’s try another baseline. It’s well known that teams tend to regress towards the mean. So, let’s just go with 8 as the number of victories for every team (in a 16 game season, 8 would be average). If we do this, we actually get a better result. The error is now only 2.8. Let’s use this as our baseline. If our system can reduce the error, than we can say that our system is potentially useful.
Training and Prediction
So, we take as much past data as we can (I used data from 1980 to now), convert per-game data into per-season data by calculating a lot of features for each team, such as how many points the team scored in their last 5 games of the season, or opponent record for the season. A feature is basically a decision criteria. If I want to know if it will be sunny tomorrow, one data point that I might want is if it is sunny or not today.
We can then train our machine learning model, and evaluate its error. Our error here is 2.6, which is better than the baseline.
We can then use our model to predict how teams will perform in future seasons (in this case, 2013).
After our training, we get predictions, which come out to:
|1532||green bay packers||2012||12||10.87|
|1554||kansas city chiefs||2012||2||6.35|
|1557||los angeles raiders||2012||0||0.00|
|1548||los angeles rams||2012||0||0.00|
|1522||new england patriots||2012||13||11.44|
|1520||new orleans saints||2012||7||8.10|
|1531||new york giants||2012||9||8.20|
|1551||new york jets||2012||6||6.65|
|1549||san diego chargers||2012||7||7.81|
|1540||san francisco 49ers||2012||14||9.72|
|1537||st. louis cardinals||2012||0||0.00|
|1525||st. louis rams||2012||7||6.36|
|1523||tampa bay buccaneers||2012||7||7.25|