Analyzing the Predictive Power of NFL Statistics

By Isabel Pantle '23

Introduction

An essential beauty of sports, especially football, is the unpredictability of the game; no one expected the winless Jets to beat a hot 9-5 Rams team in Week 15 of the 2020 season, yet it happened nonetheless. But in an era of statistics and analysis, how unpredictable is the game?

In this paper, I will analyze the predictive power of different statistics from NFL games on game outcomes. In particular, I will explore whether some statistics (or combinations of statistics) are more predictive of the winner of any given NFL game than others. For example, point differential (the difference in points scored) always predicts who wins a game because that is how the winner is determined, but how well do other statistics predict the outcome? And how well can multiple statistics, used together as predictor variables, predict outcomes?

The predictor variables I will analyze are yards gained by a team’s offense, yards allowed by a team’s defense, and turnover differential. Since the response variable (winning or losing the game) is binary, I will use logistic regression to analyze the association between my predictor variables and the outcome of the game. First, I will fit a logistic regression model for each statistic as a single predictor variable. Then, I will create logistic regression models with pairs of predictor variables, and finally, I will create a logistic regression with many predictor variables to see if the accuracy of the models improves with more variables.

Methods

I am using data from Pro Football Reference, which has many different statistics from games in current and past seasons (Pro Football Reference, 2020 Weekly League Schedule). I will use the data from the 2020 season to build the models. Then, since there were minimal rule changes from 2020 to 2021, I will use games from the 2021 season to test the accuracy of the model (NFL Football Operations, 2021).

I will test the accuracy of these models in two ways. First, I will compare their pseudo $R^2$ values. I chose the pseudo $R^2$ over the ordinary $R^2$ because the ordinary $R^2$ measures how well the model predicts the fraction of games with certain statistics that will end in a win or loss, while the pseudo $R^2$ measures the model’s ability to predict the outcome of specific games. While both are important, the pseudo $R^2$ is more useful in this case, since our ultimate goal is to predict the outcome based on the statistics from a specific game.
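The text does not say which pseudo $R^2$ is reported, so the sketch below assumes McFadden's version, $1 - LL_{model} / LL_{null}$, where the null model predicts the overall win rate for every game. The toy outcomes and probabilities are hypothetical and are only there to show how the measure behaves.

```python
import math


def log_likelihood(outcomes, probabilities):
    """Bernoulli log-likelihood of 0/1 outcomes under predicted win probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(outcomes, probabilities)
    )


def mcfadden_pseudo_r2(outcomes, probabilities):
    """McFadden pseudo R^2 (an assumed choice): 1 - LL(model) / LL(null)."""
    win_rate = sum(outcomes) / len(outcomes)
    ll_model = log_likelihood(outcomes, probabilities)
    # The null model ignores all statistics and predicts the overall win rate.
    ll_null = log_likelihood(outcomes, [win_rate] * len(outcomes))
    return 1.0 - ll_model / ll_null


# Hypothetical toy data, not taken from the 2020 season.
outcomes = [1, 1, 0, 0]
uninformative = [0.5, 0.5, 0.5, 0.5]  # no better than the null model
sharper = [0.9, 0.8, 0.2, 0.1]        # confident and correct

r2_uninformative = mcfadden_pseudo_r2(outcomes, uninformative)
r2_sharper = mcfadden_pseudo_r2(outcomes, sharper)
print(r2_uninformative, r2_sharper)
```

A model that does no better than guessing the overall win rate scores 0, and the value rises toward 1 as the predicted probabilities for individual games move closer to the actual outcomes.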

The second way I plan on testing these models is by using data that was not used to create the model (that is, statistics from games not in the original sample) and seeing how well the model predicts the outcomes of those games. To do this, I will take statistics from games from the 2021 season and plug them into my most successful model, as evaluated by the pseudo $R^2$ values described above. After doing this for many games, I will calculate the percentage of game outcomes that the model predicts correctly, counting any output over 50% as a predicted win and any output under 50% as a predicted loss.

Results

The first model fit to the data has one predictor variable, yards gained. Figure 1 in the Appendix shows the fitted regression and the plotted points, where the yards gained by the winning team are plotted at y = 1 and the yards gained by the losing team at y = 0. The fitted coefficients are $\beta_0 = -2.7659$ and $\beta_1 = 0.0077$, where $\beta_1$ is the coefficient associated with yards gained. Using the rule of fourths (the divide-by-4 rule), each additional yard gained changes the probability of winning by at most $0.0077 / 4 = 0.001925$. Additionally, we can determine when there is a 50% chance of winning, since the predicted probability

$$P(\text{win}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

equals $1 / (1 + 1) = 1/2$ when $\beta_0 + \beta_1 x = 0$. So there is a 50% chance of winning a game when the offense gains $2.7659 / 0.0077 \approx 359$ yards. This is also the point at which this interpretation of the coefficient is most accurate, since the logistic curve is steepest where the probability of winning is 50%.
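As a quick check of the arithmetic for the yards-gained model, the following minimal sketch plugs the two reported coefficients into the logistic function. The function and variable names are illustrative and are not taken from the original notebook.

```python
import math

BETA_0 = -2.7659  # intercept reported in the text
BETA_1 = 0.0077   # coefficient on yards gained reported in the text


def win_probability(yards_gained):
    """Predicted probability of winning from yards gained alone."""
    return 1.0 / (1.0 + math.exp(-(BETA_0 + BETA_1 * yards_gained)))


# Divide-by-4 rule: the steepest slope of the logistic curve is beta_1 / 4.
max_change_per_yard = BETA_1 / 4

# The curve crosses 50% where beta_0 + beta_1 * x = 0.
break_even_yards = -BETA_0 / BETA_1

print(f"Upper bound on change per yard: {max_change_per_yard:.6f}")
print(f"Yards for a 50% win probability: {break_even_yards:.1f}")
print(f"P(win) at 384 yards: {win_probability(384):.3f}")
print(f"P(win) at 334 yards: {win_probability(334):.3f}")
```

The last two lines evaluate the model at the average yardage of winning and losing teams quoted in the Results, which shows how small the separation is when yards gained is the only predictor.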

In the graph of this regression (Figure 1), the x-axis is yards gained in a single game, and points where y = 1 correspond to the yards gained by the winner, while points where y = 0 correspond to the loser. From the graph, it does not seem as though the model does a very good job of capturing the variation, which is confirmed by the low pseudo $R^2$ value of 0.06594. However, before we dismiss yards gained as a valuable predictor, notice that there does seem to be a difference in the clustering of the yards gained by winning and losing teams.

Indeed, the average number of yards gained by the winning team is 384 yards, compared to an average of 334 yards by the losing team. Thus, it does seem as though there is some relationship between yards gained and the probability of winning, but yards gained is not the only predictor variable that matters. When yards allowed was the only predictor variable, the pseudo $R^2$ value was similarly low (Figure 2 in the Appendix); thus, we move forward to a model with multiple predictors. The next iteration of the model looks at yards gained in combination with yards allowed by the defense to try to increase our model’s accuracy (Figure 3 in the Appendix). When comparing this model to the first iteration, one thing of note is that $\beta_1 = -\beta_2$. Additionally, $\beta_0 = 5.581 \times 10^{-15} \approx 0$. Thus, our fitted model is:

$$P(\text{win}) = \frac{1}{1 + e^{-\beta_1 (x_1 - x_2)}}$$

where $x_1$ is yards gained and $x_2$ is yards allowed.

So $\beta_1$ is now the coefficient of yardage differential, that is, yards gained minus yards allowed. This indicates that if both teams’ yardage increases by 1, which would leave the differential unchanged, neither of them increases their probability of winning. This seems reasonable and accounts for the fact that some games are more offensive than others, and two teams that are exchanging 80-yard drives may not be gaining an advantage if they are both gaining and allowing the same number of yards.

Thus, this model introduces an aspect of comparative success. In a defensive game, 200 yards gained may give a team a large advantage if their opponent only gains 100 yards, but that same team that gains 200 yards will likely be blown out if their opponent gains 600 yards. However, this model indicates that two teams that both gain 10,000 yards are equally likely to win the game, even though the most yards gained in a single game is 722. Thus, this yardage differential certainly has advantages, but this model has limitations in that it will only be accurate for a reasonable domain of yards from zero to 750. Additionally, the model still has a low pseudo $R^2$ value of 0.1399, so we look to add additional predictor variables.
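The comparative behavior of the yardage-differential model can be illustrated with a short sketch. The text does not report the fitted coefficient for this model, so `BETA_DIFF` below is a hypothetical placeholder; only the structure (zero intercept, equal and opposite coefficients) comes from the text.

```python
import math

# HYPOTHETICAL value: the fitted coefficient of this model is not reported.
BETA_DIFF = 0.01


def win_probability(yards_gained, yards_allowed):
    """Win probability when beta_0 is 0 and beta_1 = -beta_2."""
    differential = yards_gained - yards_allowed
    return 1.0 / (1.0 + math.exp(-BETA_DIFF * differential))


defensive_game = win_probability(200, 100)    # +100 differential
shootout = win_probability(500, 400)          # same +100 differential
blowout_loss = win_probability(200, 600)      # -400 differential
even_game = win_probability(10000, 10000)     # far outside the realistic range

print(defensive_game, shootout, blowout_loss, even_game)
```

Whatever value the coefficient takes, the first two probabilities are identical because only the differential enters the model, and any even game comes out at exactly 50%, including the unrealistic 10,000-yard case.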

The next model uses turnovers as its single predictor variable. Looking at the coefficients obtained from the model as shown in Figure 4, the coefficient $\beta_1 = -0.8718$ indicates that for each additional turnover, the upper bound of the change in probability of winning the game is a decrease of $0.8718 / 4 \approx 0.218$. In our model, $\beta_0 = 1.0777$, which means that if a team has no turnovers, their probability of winning is $1 / (1 + e^{-1.0777}) \approx 0.75$. This makes sense, as the average number of turnovers for a winning team is 0.8327, while the average number of turnovers for the losing team is 1.747. Thus, having no turnovers gives a team a large advantage. However, even if a team has one turnover, their probability of winning is still greater than 50%:

$$\frac{1}{1 + e^{-(1.0777 - 0.8718)}} \approx 0.55$$
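The turnover-only model can be evaluated directly from the two reported coefficients. This is a minimal sketch; the names are illustrative, and the negative sign on the turnover coefficient follows from the text describing its effect as a decrease in win probability.

```python
import math

BETA_0 = 1.0777   # intercept reported in the text
BETA_1 = -0.8718  # turnover coefficient; negative because turnovers hurt


def win_probability(turnovers):
    """Predicted probability of winning from a team's own turnovers alone."""
    return 1.0 / (1.0 + math.exp(-(BETA_0 + BETA_1 * turnovers)))


probabilities = {t: win_probability(t) for t in range(4)}
for turnovers, probability in probabilities.items():
    print(f"{turnovers} turnovers: P(win) = {probability:.3f}")
```

Under this model a team is still favored after one turnover and drops below 50% at the second.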

While these results are interesting and align with common understanding of football, this model still has a low pseudo $R^2$ value of 0.1293, though this is a higher pseudo $R^2$ value than the models with yards gained or allowed as the only predictor variable.

In the final iteration of the model, there are four predictor variables: yards gained, yards allowed, turnovers lost, and turnovers recovered. $\beta_1$ is the coefficient associated with yards gained, $\beta_2$ is associated with yards allowed, $\beta_3$ is associated with turnovers lost, and $\beta_4$ is associated with turnovers recovered. Again, we notice that $\beta_1 = -\beta_2$ and $\beta_3 = -\beta_4$; thus, this model incorporates the idea of comparative yardage and comparative turnovers, referred to as turnover differential. Based on these coefficients, the model indicates that the upper bound for the increase in win probability is $0.0126 / 4 \approx 0.0032$ per additional yard gained when the probability of winning is 50%. The coefficient for yards allowed has the same magnitude with the opposite sign, which indicates that the upper bound for the decrease in win probability is also about 0.0032 per additional yard allowed. For turnovers, the upper bound of the change in win probability is an increase of $1.0805 / 4 = 0.270125$ per turnover recovered, and a decrease of 0.270125 per turnover lost.
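A minimal sketch of the four-variable model, using the two coefficient magnitudes given in the text. The intercept of this model is not reported, so `INTERCEPT = 0.0` is an assumption made here for illustration, by analogy with the near-zero intercept of the yardage-differential model.

```python
import math

BETA_YARDS = 0.0126      # reported magnitude for yards gained / yards allowed
BETA_TURNOVERS = 1.0805  # reported magnitude for turnovers recovered / lost
INTERCEPT = 0.0          # ASSUMPTION: the intercept is not reported in the text


def win_probability(yards_gained, yards_allowed, turnovers_lost, turnovers_recovered):
    """Win probability from yardage differential and turnover differential."""
    yardage_differential = yards_gained - yards_allowed
    turnover_differential = turnovers_recovered - turnovers_lost
    z = (
        INTERCEPT
        + BETA_YARDS * yardage_differential
        + BETA_TURNOVERS * turnover_differential
    )
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical game: one team's statistics and the mirrored view for its opponent.
team = win_probability(384, 334, turnovers_lost=1, turnovers_recovered=2)
opponent = win_probability(334, 384, turnovers_lost=2, turnovers_recovered=1)
even_game = win_probability(350, 350, turnovers_lost=1, turnovers_recovered=1)

print(f"team: {team:.3f}, opponent: {opponent:.3f}, even game: {even_game:.3f}")
```

With a zero intercept, the two teams in a game always receive probabilities that sum to 1, which is what the equal and opposite coefficients imply.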

This is the model we are most confident in, as it has the highest pseudo $R^2$ value of 0.4150 and incorporates the idea of turnover differential, which is widely regarded as an important statistic in football analytics. We will use this model to predict games from the 2021 season to determine its accuracy. To do this, data from the 2021 season was downloaded as a CSV file and uploaded into Colab (Pro Football Reference, 2021 Weekly League Schedule). For each line in the data frame, the statistics of the winner were entered into the model, and if the model gave the winning team a win probability of 50% or higher, this was counted as an accurate prediction (Figure 6 in the Appendix). Out of the 150 games in the 2021 season so far, the model correctly predicted 132, or 88%.
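The evaluation procedure can be sketched as follows. This is not the original Colab code: the column names (`YdsW`, `TOW`, `YdsL`, `TOL`), the three sample rows, and the zero intercept are all assumptions made for illustration, with `TOW` and `TOL` read as the turnovers committed by the winner and the loser.

```python
import csv
import io
import math

BETA_YARDS = 0.0126      # reported magnitude for yards gained / yards allowed
BETA_TURNOVERS = 1.0805  # reported magnitude for turnovers recovered / lost
INTERCEPT = 0.0          # ASSUMPTION: the intercept is not reported in the text

# HYPOTHETICAL rows standing in for the downloaded 2021 schedule file.
SAMPLE_CSV = """YdsW,TOW,YdsL,TOL
400,1,300,2
350,0,360,1
280,3,390,0
"""


def winner_probability(row):
    """Model's win probability for the team that actually won the game."""
    yardage_differential = float(row["YdsW"]) - float(row["YdsL"])
    # The winner recovers the loser's turnovers and loses its own.
    turnover_differential = float(row["TOL"]) - float(row["TOW"])
    z = (
        INTERCEPT
        + BETA_YARDS * yardage_differential
        + BETA_TURNOVERS * turnover_differential
    )
    return 1.0 / (1.0 + math.exp(-z))


def count_correct(csv_text):
    """Count games where the winner's predicted probability is at least 50%."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    correct = sum(1 for row in rows if winner_probability(row) >= 0.5)
    return correct, len(rows)


correct, total = count_correct(SAMPLE_CSV)
print(f"{correct} of {total} winners predicted correctly ({correct / total:.0%})")
```

Applied to the real 2021 file, the same count divided by the number of games gives the accuracy figure; 132 correct out of 150 games is 88%.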

Discussion

One limitation of the model is that it can only make predictions about the outcome of a game after the game has concluded (and thus the outcome is already known), since it requires statistics from the full game. Another iteration of the model could work with statistics from halftime, so that it could make predictions before the game is over. Additionally, the model could potentially be more accurate with more predictor variables, such as red zone conversion rate, that is, how often a team scores when it is within 20 yards of its opponent’s end zone. However, it is interesting that the model was able to predict outcomes with such accuracy without any statistics directly related to scoring (such as red zone conversion rate or number of scoring drives). Such statistics would likely make the model more accurate, and could be important predictor variables for an iteration that uses halftime statistics.

Though there are definitely limitations of this model, it was able to predict the outcome of games when given statistics from 2021 games with impressive accuracy. This answers the question of unpredictability of games; with the yards gained, yards allowed, turnovers recovered, and turnovers lost, we could accurately predict 88% of games from the 2021 season.

Appendix  

Works Cited

“2020 NFL Weekly League Schedule.” Pro Football Reference, https://www.pro-football-reference.com/years/2020/games.htm.

“2021 NFL Weekly League Schedule.” Pro Football Reference, https://www.pro-football-reference.com/years/2021/games.htm.

“2021 Rules Changes and Points of Emphasis.” NFL Football Operations, https://operations.nfl.com/the-rules/rules-changes/2021-rules-changes-and-points-of-emphasis/.

Gough, Christina. “NFL Revenue by Year.” Statista, 8 Sept. 2021, https://www.statista.com/statistics/193457/total-league-revenue-of-the-nfl-since-2005/.

Gough, Christina. “Most Popular pro Sports Leagues in the U.S. 2019.” Statista, 14 Sept. 2020, https://www.statista.com/statistics/1074271/sports-leagues-fans/.

“Which NFL Team Has the Most Offensive Yards in a Game 2021.” StatMuse, https://www.statmuse.com/nfl/ask/which-nfl-team-has-the-most-offensive-yards-in-a-game-2021.