Use Python and sklearn to model NFL game outcomes and build a pre-game win probability model. Super Bowl prediction at the end of the post!
If you have any questions about the code here, feel free to reach out to me on Twitter or on Reddit.
This post was originally written for Open Source Football here, and has been adapted to the Fantasy Football Data Pros blog to share with my readers.
In this post, we're going to cover predicting NFL game outcomes and pre-game win probability using a logistic regression model in Python. And since it's almost Super Bowl Sunday, at the end of the post we'll use the model to come up with a Super Bowl prediction!
Previous posts on Open Source Football have covered engineering EPA to maximize its predictive value, and this post will partly build upon those written by Jack Lichtenstien and John Goldberg.
The goal of this post is to provide you with an introduction to modeling in Python and a baseline model to work from. Python's de facto machine learning library, sklearn, is built upon a streamlined API which allows ML code to be iterated upon easily. Switching to a more complex model wouldn't take much tweaking of the code I'll provide here, as every supervised algorithm is implemented via sklearn in more or less the same fashion.
As with any machine learning task, the bulk of the work will be in the data munging, cleaning, and feature extraction/engineering process.
For our features, we will be using exponentially weighted rolling EPA. For those who don't know what EPA is - it stands for expected points added, and it's a statistic that quantifies how many points a particular play contributed to a team's expected points. It's based on a lot of things - field position, yards gained, down and distance, etc. - and is generally a good way to quantify and measure offensive and defensive efficiency/performance. A 30-yard play on 3rd and 13, for example, might be worth around 3.2 expected points added.
We will be taking average EPA per play for each team (split into rushing and passing, and then further split into offense and defense) for each week and then calculating a moving average (we will also lag the data back one period, so that for any given game we only use data from before that game). Moreover, instead of calculating a simple moving average, we will be calculating an exponential moving average, which weighs more recent games more heavily.
The window size for the rolling average will be 10 for all teams before week 10 of the season, which means that early in a season, some prior-season data will be used. From week 10 onward, the window expands to cover the entire current season - and only the current season. This dynamic window size idea was Jack Lichtenstien's, and his post on Open Source Football on the topic is linked above. His post showed that using a dynamic window size was slightly more predictive than using a static 10-game window.
EPA will be split into defense and offense for both teams, and then further split into passing and rushing. This means in total, we'll have 8 features:
1. Home team passing offense EPA/play
2. Home team passing defense EPA/play
3. Home team rushing offense EPA/play
4. Home team rushing defense EPA/play
5. Away team passing offense EPA/play
6. Away team passing defense EPA/play
7. Away team rushing offense EPA/play
8. Away team rushing defense EPA/play
The target will be a home team win.
Each of these features will be lagged one period, and then an exponential moving average will be calculated.
We're going to be using Logistic Regression as our model. Logistic Regression is used to model the probability of a binary outcome. The probability we are attempting to model here is the probability a home team wins given the features we've laid out above. We'll see that our LogisticRegression object has a predict_proba method which shows us the predicted probability of a 1 (home team win) or 0 (away team win). This means the model can be used as a pre-game win probability model as well.
We'll be training the model with data from 1999 - 2019, and leaving 2020 out so we can analyze it further at the end of the post.
To start, let's install the nflfastpy module, a Python package I manage which allows us easy access to nflfastR data.
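If you're following along in a notebook, the install step is just a pip command (assuming the package is published on PyPI under the same name; the `!` prefix runs shell commands from a Jupyter cell):

```python
# install nflfastpy (run once; the ! prefix is for Jupyter notebooks)
!pip install nflfastpy
```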
Next, we import some stuff we'll need for this notebook and also set the base styling for our matplotlib visualizations.
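Here's a minimal version of that setup; the specific matplotlib style is my own choice, so swap in whatever you prefer:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# base styling for all matplotlib visualizations in this post
plt.style.use('ggplot')
```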
The code block below will pull nflfastR data from the [nflfastR-data](https://github.com/guga31bb/nflfastR-data) repository and concatenate the separate yearly DataFrames into a single DataFrame we'll call data. This code block will take anywhere from 2-5 minutes to run.
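Since the original block isn't reproduced here, a sketch of the download step might look like the following. The file-path pattern is an assumption based on how the nflfastR-data repository organized its compressed yearly CSVs; check the repo if the layout has changed:

```python
BASE_URL = ('https://github.com/guga31bb/nflfastR-data/blob/master/'
            'data/play_by_play_{season}.csv.gz?raw=true')

# pull 1999 through 2020 and stack the yearly DataFrames into one
data = pd.concat([
    pd.read_csv(BASE_URL.format(season=season), compression='gzip', low_memory=False)
    for season in range(1999, 2021)
])
```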
The code below is going to calculate a rolling EPA with a static window and a dynamic window. I've included both, although we'll only be using the rolling EPA with the dynamic window in our model.
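The full feature-engineering code handles all four offense/defense x rushing/passing splits; here's a hedged sketch of the core idea for just the passing offense split. The helper name dynamic_window_ewma and the intermediate frame epa_data are my own, and the four splits would then be merged into a single team-week frame (call it team_features) with the suffixed column names you see in the table below:

```python
# average EPA/play for each team's passing offense, by season and week
epa_data = (
    data[(data['pass'] == 1) & data.epa.notnull()]
    .groupby(['posteam', 'season', 'week'], as_index=False)['epa'].mean()
    .rename(columns={'posteam': 'team'})
)

# lag one period so each game's features only use data from before that game
epa_data['epa_shifted'] = epa_data.groupby('team')['epa'].shift()

# static 10-game window EWMA
epa_data['ewma'] = epa_data.groupby('team')['epa_shifted'].transform(
    lambda x: x.ewm(min_periods=1, span=10).mean()
)

def dynamic_window_ewma(x):
    """EWMA with a dynamic window: a span of 10 before week 10,
    expanding to cover the full current season afterwards."""
    values = np.zeros(len(x))
    for i, (_, row) in enumerate(x.iterrows()):
        epa = x.epa_shifted[:i + 1]
        span = row.week if row.week > 10 else 10
        values[i] = epa.ewm(min_periods=1, span=span).mean().values[-1]
    return pd.Series(values, index=x.index)

epa_data['ewma_dynamic_window'] = epa_data.groupby(
    'team', group_keys=False
).apply(dynamic_window_ewma)
```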
| | team | season | week | epa_rushing_offense | epa_shifted_rushing_offense | ewma_rushing_offense | ewma_dynamic_window_rushing_offense | epa_passing_offense | epa_shifted_passing_offense | ewma_passing_offense | ewma_dynamic_window_passing_offense | epa_rushing_defense | epa_shifted_rushing_defense | ewma_rushing_defense | ewma_dynamic_window_rushing_defense | epa_passing_defense | epa_shifted_passing_defense | ewma_passing_defense | ewma_dynamic_window_passing_defense |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ARI | 2000 | 1 | -0.345669 | -0.068545 | -0.109256 | -0.109256 | 0.018334 | 0.056256 | -0.172588 | -0.172588 | 0.202914 | 0.363190 | 0.080784 | 0.080784 | -0.069413 | 0.269840 | 0.119870 | 0.119870 |
1 | ARI | 2000 | 2 | -0.298172 | -0.345669 | -0.153707 | -0.153707 | 0.587000 | 0.018334 | -0.136691 | -0.136691 | -0.110405 | 0.202914 | 0.103747 | 0.103747 | 0.311383 | -0.069413 | 0.084280 | 0.084280 |
2 | ARI | 2000 | 4 | -0.334533 | -0.298172 | -0.180702 | -0.180702 | -0.271663 | 0.587000 | -0.001460 | -0.001460 | -0.018524 | -0.110405 | 0.063730 | 0.063730 | 0.500345 | 0.311383 | 0.126717 | 0.126717 |
3 | ARI | 2000 | 5 | -0.041279 | -0.334533 | -0.209303 | -0.209303 | 0.069246 | -0.271663 | -0.051697 | -0.051697 | 0.012054 | -0.018524 | 0.048437 | 0.048437 | 0.058499 | 0.500345 | 0.196184 | 0.196184 |
4 | ARI | 2000 | 6 | -0.038473 | -0.041279 | -0.178191 | -0.178191 | 0.101830 | 0.069246 | -0.029303 | -0.029303 | 0.086308 | 0.012054 | 0.041700 | 0.041700 | -0.063633 | 0.058499 | 0.170690 | 0.170690 |
We can plot EPA for the Green Bay Packers alongside our moving averages. The static-window EWMA and dynamic-window EWMA track each other closely, with slight divergences toward the end of each season.
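A sketch of the plotting code, using the single-split frame from the earlier sketch (the original figure may have been built from the merged frame instead):

```python
gb = epa_data[epa_data.team == 'GB'].reset_index(drop=True)

plt.figure(figsize=(12, 6))
plt.plot(gb.index, gb['epa_shifted'], alpha=0.3, label='EPA/play (lagged)')
plt.plot(gb.index, gb['ewma'], label='static window EWMA')
plt.plot(gb.index, gb['ewma_dynamic_window'], label='dynamic window EWMA')
plt.xlabel('Game number (1999-2020)')
plt.ylabel('Passing offense EPA/play')
plt.title('Green Bay passing offense EPA/play, rolling averages')
plt.legend()
plt.show()
```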
Now that we have our features compiled, we can begin to merge in game result data to come up with our target variable.
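Sketching the merge step: final scores can be recovered from the play-by-play data itself (nflfastR carries home_score and away_score columns on every row), and the team-week features get joined on twice, once per side. team_features here is the assumed combined feature frame described earlier; the table below shows the merged result:

```python
# one row per game, with the final score
games = (
    data[['game_id', 'season', 'week', 'home_team', 'away_team',
          'home_score', 'away_score']]
    .drop_duplicates(subset='game_id')
)

# target variable: did the home team win?
games['home_team_win'] = (games.home_score > games.away_score).astype(int)

# join the lagged EWMA features for the home team, then the away team
df = (
    games
    .merge(team_features, left_on=['home_team', 'season', 'week'],
           right_on=['team', 'season', 'week'])
    .merge(team_features, left_on=['away_team', 'season', 'week'],
           right_on=['team', 'season', 'week'], suffixes=('_home', '_away'))
)
```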
| | season | week | home_team | away_team | home_score | away_score | home_team_win | epa_rushing_offense_home | epa_shifted_rushing_offense_home | ewma_rushing_offense_home | ... | ewma_passing_offense_away | ewma_dynamic_window_passing_offense_away | epa_rushing_defense_away | epa_shifted_rushing_defense_away | ewma_rushing_defense_away | ewma_dynamic_window_rushing_defense_away | epa_passing_defense_away | epa_shifted_passing_defense_away | ewma_passing_defense_away | ewma_dynamic_window_passing_defense_away |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000 | 1 | NYG | ARI | 21 | 16 | 1 | 0.202914 | 0.162852 | -0.108683 | ... | -0.172588 | -0.172588 | 0.202914 | 0.363190 | 0.080784 | 0.080784 | -0.069413 | 0.269840 | 0.119870 | 0.119870 |
1 | 2000 | 1 | PIT | BAL | 0 | 16 | 0 | -0.499352 | -0.643877 | -0.128608 | ... | -0.115204 | -0.115204 | -0.499352 | -0.241927 | -0.167554 | -0.167554 | -0.216112 | -0.208568 | -0.197401 | -0.197401 |
2 | 2000 | 1 | WAS | CAR | 20 | 17 | 1 | -0.006129 | -0.498218 | -0.076112 | ... | 0.227056 | 0.227056 | -0.006129 | 0.060270 | 0.064942 | 0.064942 | 0.147979 | -0.368611 | 0.001883 | 0.001883 |
3 | 2000 | 1 | MIN | CHI | 30 | 27 | 1 | 0.283000 | -0.115150 | -0.058772 | ... | -0.079775 | -0.079775 | 0.283000 | -0.130394 | -0.045563 | -0.045563 | 0.124496 | 0.365886 | 0.106074 | 0.106074 |
4 | 2000 | 1 | LA | DEN | 41 | 36 | 1 | -0.106372 | -0.692095 | -0.263223 | ... | -0.011590 | -0.011590 | -0.106372 | -0.110182 | -0.063446 | -0.063446 | 0.400226 | -0.078273 | -0.089393 | -0.089393 |
5 rows × 39 columns
We'll set some variables to isolate our target and features. Here, we are only including the dynamic window columns. I've found that the dynamic window does produce a slightly better model score. You're welcome to use the EWMA features with the static window size.
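In code, something like:

```python
target = 'home_team_win'
features = [col for col in df.columns if 'ewma_dynamic_window' in col]

# train on 1999-2019 only; 2020 is held out for the analysis below
X = df.loc[df.season < 2020, features]
y = df.loc[df.season < 2020, target]
```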
Here, we finally train and test our model. In sklearn, every supervised algorithm is implemented in the same fashion. First, you import your model class (which we did earlier in the code). You then instantiate the class, providing any model hyperparameters at instantiation. Here, since we're just doing logistic regression, we have none (although sklearn does allow us to provide C as a hyperparameter, which controls regularization). Then, you call the fit method, which trains your model. For evaluating model accuracy, you have a couple of options. Notice here we did not split our data into train and test sets, as we'll be using 10-fold cross-validation to train and test instead, using the cross_val_score function we brought in earlier.
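A minimal version of the training and evaluation step:

```python
clf = LogisticRegression()

# 10-fold cross-validation; the default scoring for a classifier is accuracy
accuracy = cross_val_score(clf, X, y, cv=10).mean()

# negative log loss, for judging the model as a win probability model
neg_log_loss = cross_val_score(clf, X, y, cv=10, scoring='neg_log_loss').mean()

print(f'accuracy: {accuracy:.3f}, negative log loss: {neg_log_loss:.3f}')

# fit on all of 1999-2019 so we can generate 2020 predictions later
clf.fit(X, y)
```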
We see our model has about a 63.5% accuracy score. Not terrible, considering that if you head over to nflpickwatch.com, you'll see the best experts tend to cluster around 68%.
We also found the negative log loss. This is important because the model could have value as a win probability model as opposed to a straight pick-em model.
The model can definitely be improved. Some possible improvements and things I'd like to explore in future iterations of this model:
1. Using John Goldberg's idea of opponent-adjusted EPA
2. Weighing EPA by win probability
3. Engineering other features beyond EPA, including special teams performance
4. Switching to a more complex model and tuning its hyperparameters
Something fun we can do now is see how the model would have predicted this past season. Recall that we left 2020 out of our training data, so we're free to see how the model would have done in 2020 (since none of the 2020 data was used for training).
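A sketch of how the 2020 results table below could be built from the fitted model:

```python
df_2020 = df[df.season == 2020].reset_index(drop=True)

# predicted probability of a home team win for every 2020 game
proba = clf.predict_proba(df_2020[features])[:, 1]

results = df_2020[['home_team', 'away_team', 'week']].copy()
results['predicted_winner'] = np.where(proba >= 0.5,
                                       df_2020.home_team, df_2020.away_team)
results['actual_winner'] = np.where(df_2020.home_team_win == 1,
                                    df_2020.home_team, df_2020.away_team)
# report the probability assigned to whichever team was picked
results['win_probability'] = np.where(proba >= 0.5, proba, 1 - proba)
results['correct_prediction'] = (
    results.predicted_winner == results.actual_winner
).astype(int)

results.sort_values('win_probability', ascending=False).head(10)
```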
| | home_team | away_team | week | predicted_winner | actual_winner | win_probability | correct_prediction |
|---|---|---|---|---|---|---|---|
0 | KC | NYJ | 8 | KC | KC | 0.874193 | 1 |
1 | KC | DEN | 13 | KC | KC | 0.845162 | 1 |
2 | LA | NYJ | 15 | LA | NYJ | 0.833220 | 0 |
3 | LA | NYG | 4 | LA | LA | 0.832061 | 1 |
4 | SF | PHI | 4 | SF | PHI | 0.819993 | 0 |
5 | MIA | NYJ | 6 | MIA | MIA | 0.802636 | 1 |
6 | BAL | CLE | 1 | BAL | BAL | 0.794298 | 1 |
7 | KC | CAR | 9 | KC | KC | 0.793608 | 1 |
8 | BAL | CIN | 5 | BAL | BAL | 0.779855 | 1 |
9 | IND | JAX | 17 | IND | IND | 0.773426 | 1 |
These are the 10 games from the 2020 season the model was most confident about. No surprise that the KC-NYJ game was the most lopsided game of the season.
We can also view how our model would have done this season, week by week. There doesn't seem to be a clear trend here: you would expect the model to get better as the season went on, but the data doesn't bear that out.
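The week-by-week accuracy is a quick groupby on the results frame:

```python
# fraction of correct picks in each week of the 2020 season
results.groupby('week')['correct_prediction'].mean().plot(kind='bar')
plt.ylabel('Accuracy')
plt.show()
```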
The model would have done the best in week 2 of the 2020 season, with an 87.5% accuracy score!
| | home_team | away_team | week | predicted_winner | actual_winner | win_probability | correct_prediction |
|---|---|---|---|---|---|---|---|
5341 | TB | CAR | 2 | TB | TB | 0.763676 | 1 |
5339 | HOU | BAL | 2 | BAL | BAL | 0.743131 | 1 |
5345 | TEN | JAX | 2 | TEN | TEN | 0.740741 | 1 |
5344 | GB | DET | 2 | GB | GB | 0.733028 | 1 |
5353 | ARI | WAS | 2 | ARI | ARI | 0.708931 | 1 |
5338 | DAL | ATL | 2 | DAL | DAL | 0.694735 | 1 |
5351 | CHI | NYG | 2 | CHI | CHI | 0.645196 | 1 |
5343 | PIT | DEN | 2 | PIT | PIT | 0.640730 | 1 |
5349 | SEA | NE | 2 | SEA | SEA | 0.636207 | 1 |
5346 | LAC | KC | 2 | KC | KC | 0.596137 | 1 |
5342 | CLE | CIN | 2 | CLE | CLE | 0.556421 | 1 |
5340 | MIA | BUF | 2 | BUF | BUF | 0.556119 | 1 |
5352 | NYJ | SF | 2 | SF | SF | 0.553112 | 1 |
5350 | LV | NO | 2 | NO | LV | 0.532511 | 0 |
5348 | IND | MIN | 2 | IND | IND | 0.520469 | 1 |
5347 | PHI | LA | 2 | PHI | LA | 0.515421 | 0 |
We can also view how our model would have done in the playoffs by filtering to games played after week 17 (the final week of the 2020 regular season).
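In code, that filter is just:

```python
# the 2020 regular season ran through week 17, so weeks 18+ are the playoffs
results[results.week > 17]
```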
| | home_team | away_team | week | predicted_winner | actual_winner | win_probability | correct_prediction |
|---|---|---|---|---|---|---|---|
5578 | TEN | BAL | 18 | BAL | BAL | 0.551507 | 1 |
5579 | NO | CHI | 18 | NO | NO | 0.755613 | 1 |
5580 | PIT | CLE | 18 | PIT | CLE | 0.513341 | 0 |
5581 | BUF | IND | 18 | BUF | BUF | 0.632864 | 1 |
5582 | SEA | LA | 18 | SEA | LA | 0.589866 | 0 |
5583 | WAS | TB | 18 | TB | TB | 0.614521 | 1 |
5584 | BUF | BAL | 19 | BUF | BUF | 0.527067 | 1 |
5585 | KC | CLE | 19 | KC | KC | 0.587855 | 1 |
5586 | GB | LA | 19 | GB | GB | 0.689657 | 1 |
5587 | NO | TB | 19 | NO | TB | 0.597603 | 0 |
5588 | KC | BUF | 20 | KC | KC | 0.530803 | 1 |
5589 | GB | TB | 20 | GB | TB | 0.593343 | 0 |
Of course, an NFL game prediction post a week out from the Super Bowl isn't complete without a Super Bowl prediction.
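Here's a hedged sketch of how that prediction could be generated: pull each team's most recent dynamic-window EWMA values and run them through the fitted model. I'm treating Tampa Bay as the home team since Super Bowl LV is being played in Tampa, and team_features plus the split column names follow the earlier sketches:

```python
split_cols = ['ewma_dynamic_window_passing_offense',
              'ewma_dynamic_window_passing_defense',
              'ewma_dynamic_window_rushing_offense',
              'ewma_dynamic_window_rushing_defense']

# each team's most recent feature values
latest = team_features.sort_values(['season', 'week']).groupby('team').tail(1)

tb = latest.loc[latest.team == 'TB', split_cols].iloc[0]  # "home" team
kc = latest.loc[latest.team == 'KC', split_cols].iloc[0]  # "away" team

row = {f'{col}_home': val for col, val in tb.items()}
row.update({f'{col}_away': val for col, val in kc.items()})

# match the training column order before predicting
X_sb = pd.DataFrame([row])[features]
clf.predict_proba(X_sb)  # columns are [P(home loss), P(home win)]
```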
The goal of this post was to provide you with a baseline model to work from in Python that can be easily iterated upon.
The model came out to be around 63.5% accurate and can definitely be improved upon. The model may also have more value as a pre-game win probability model.
In future posts, I would like to incorporate EPA adjusted for opponent and also look at how win probability affects the predictability of EPA, as it's an interesting idea and I think there could be some value there. Trying out other features besides EPA could certainly help as well. Turnover rates, time of possession, average number of plays run, average starting field position, special teams performance, QB-specific play, etc., could all prove useful. We saw in the visualization of feature importance that passing EPA per play had the most predictive power for both home team and away team wins, so exploring QB-specific features could help improve the model. Of course, that would require roster data for each week.
As always - thank you for reading!