Creating a Rushing Yards Over Expected Model

Introduction

As I continue my search for new ways to compare and evaluate players, it is evident that “traditional” counting stats are simply not enough anymore. To get the full scope of a player, you need more. And if not all rushing attempts are created equal, why should we assume their outcomes are?

Next Gen Stats has figured this out and uses tracking data to evaluate players relative to their surroundings but, unfortunately, it keeps its data private. This is where my new model, found exclusively at brotofantasy.com, comes in. Using publicly available stats, I attempt to predict the number of yards the “average” running back would get in any given rushing scenario.

Instead of using speed, direction and the position of defenders, I used situational data to get my expected rushing yards (xRY) number, then subtracted it from the actual result of the play to get rushing yards over expected (RYOE).

In simple terms, the xRY of a given play can be viewed as the average yards gained by all the players in the past decade who had a rush attempt in the same game situation at the moment of handoff… but extended to all possible scenarios.
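As a rough illustration of that “average of similar situations” idea, the sketch below groups a handful of made-up plays by down and distance, uses each group's mean as a stand-in for xRY, and subtracts it from the actual gain to get RYOE. The play data, the two grouping keys and the helper names are all hypothetical; the actual model uses 14 features and a gradient boosting algorithm rather than a lookup table.

```python
from collections import defaultdict

# Made-up plays: (down, yards to go, yards gained). Illustration only.
plays = [
    (1, 10, 3), (1, 10, 5), (1, 10, 7),
    (3, 1, 2), (3, 1, 0), (3, 1, 4),
]

def expected_yards_table(plays):
    """Average yards gained for each (down, distance) situation."""
    totals = defaultdict(lambda: [0, 0])  # situation -> [yards sum, play count]
    for down, togo, yards in plays:
        totals[(down, togo)][0] += yards
        totals[(down, togo)][1] += 1
    return {situation: s / n for situation, (s, n) in totals.items()}

xry_table = expected_yards_table(plays)

def ryoe(down, togo, yards):
    """Rushing yards over expected: actual gain minus the situation average."""
    return yards - xry_table[(down, togo)]

print(xry_table[(1, 10)])  # 5.0
print(ryoe(1, 10, 7))      # 2.0
```

By construction, RYOE computed this way averages out to zero over the plays used to build the table, which is the same sanity check applied to the real model in the results.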

Below, I describe what went into making the model and then present the results.

The Model

The data used for the model included all rushing plays since 2010 (provided by nflfastR), excluding QB scrambles and plays that resulted in penalties, to ensure we had a representative sample. Then, all plays were split on whether the rush was up the middle or to the outside, in order to calculate the average yards allowed by defensive units in those situations.
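The filtering and splitting steps above can be sketched as follows. This is a minimal illustration on made-up rows, not the code used for the model: the field names (defteam, run_location, yards_gained, qb_scramble, penalty) are assumptions chosen for the example and should be checked against the actual nflfastR columns.

```python
from collections import defaultdict

# Made-up rows; field names are assumptions for illustration only.
plays = [
    {"defteam": "PHI", "run_location": "middle", "yards_gained": 3, "qb_scramble": 0, "penalty": 0},
    {"defteam": "PHI", "run_location": "left", "yards_gained": 6, "qb_scramble": 0, "penalty": 0},
    {"defteam": "PHI", "run_location": "right", "yards_gained": 1, "qb_scramble": 0, "penalty": 0},
    {"defteam": "PHI", "run_location": "middle", "yards_gained": 5, "qb_scramble": 0, "penalty": 0},
    {"defteam": "PHI", "run_location": "middle", "yards_gained": 12, "qb_scramble": 1, "penalty": 0},
    {"defteam": "PHI", "run_location": "left", "yards_gained": 9, "qb_scramble": 0, "penalty": 1},
]

def clean(plays):
    """Drop QB scrambles and plays that resulted in a penalty."""
    return [p for p in plays if not p["qb_scramble"] and not p["penalty"]]

def direction(play):
    """Collapse the run location into 'middle' or 'outside'."""
    return "middle" if play["run_location"] == "middle" else "outside"

def yards_allowed(plays):
    """Average yards allowed for each (defense, direction) pair."""
    totals = defaultdict(lambda: [0, 0])  # key -> [yards sum, play count]
    for p in plays:
        key = (p["defteam"], direction(p))
        totals[key][0] += p["yards_gained"]
        totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

rushes = clean(plays)
allowed = yards_allowed(rushes)
print(len(rushes))                 # 4
print(allowed[("PHI", "middle")])  # 4.0
print(allowed[("PHI", "outside")]) # 3.5
```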

Once we had all the info necessary, 80% of the data was put through a 14-feature, extreme gradient boosting algorithm to generate the model, with the rest held back for testing. The general idea of an extreme gradient boosting algorithm is to build an ensemble of decision trees one after another, where each new tree “learns” from the mistakes the previous ones made. This tends to produce very accurate results.
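To make the “models on top of models” idea concrete, here is a minimal sketch of plain gradient boosting for squared error, written from scratch with one-split trees (stumps) on a single made-up feature and an 80/20 split. It is not XGBoost itself, which adds regularization and other refinements, and it is not the model described here; the data, the single feature and every function name are hypothetical.

```python
import random

random.seed(0)

# Made-up data: x is yards to go, y is yards gained. Illustration only.
xs = [random.randint(1, 10) for _ in range(200)]
ys = [2.0 + 0.4 * x + random.gauss(0, 1) for x in xs]

# 80/20 split: shuffle the row indices and train on the first 80%.
indices = list(range(len(xs)))
random.shuffle(indices)
cut = len(indices) * 8 // 10
train_x = [xs[i] for i in indices[:cut]]
train_y = [ys[i] for i in indices[:cut]]
test_x = [xs[i] for i in indices[cut:]]
test_y = [ys[i] for i in indices[cut:]]

def fit_stump(x_vals, residuals):
    """Find the single split on x that best fits the current residuals."""
    best = None
    # Skip the largest value: splitting there would leave the right side empty.
    for threshold in sorted(set(x_vals))[:-1]:
        left = [r for x, r in zip(x_vals, residuals) if x <= threshold]
        right = [r for x, r in zip(x_vals, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def fit_boosted(x_vals, y_vals, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to what is still wrong."""
    base = sum(y_vals) / len(y_vals)
    preds = [base] * len(y_vals)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(y_vals, preds)]
        threshold, left_mean, right_mean = fit_stump(x_vals, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(x_vals, preds)]
    return base, stumps, learning_rate

def predict(model, x):
    """Sum the base value and every stump's scaled correction."""
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (lm if x <= t else rm) for t, lm, rm in stumps)

def mse(preds, actual):
    return sum((p - a) ** 2 for p, a in zip(preds, actual)) / len(actual)

model = fit_boosted(train_x, train_y)
baseline = sum(train_y) / len(train_y)

train_mse_baseline = mse([baseline] * len(train_y), train_y)
train_mse_boosted = mse([predict(model, x) for x in train_x], train_y)
test_mse_boosted = mse([predict(model, x) for x in test_x], test_y)

print(train_mse_baseline, train_mse_boosted, test_mse_boosted)
```

Each round only nudges the predictions by a fraction (the learning rate) of the stump's correction, which is why many small trees are stacked rather than one large one.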

Among the 14 features used were field position, Vegas win probability, time remaining, down, distance, and whether the play was run out of shotgun or from under center.

Each one adds a level of specificity so that the model can better differentiate between plays. 

The Results

So how did the model perform? To be perfectly honest: very, very well.

When tested against data the model had never seen before, I found very straightforward relationships. While the average yards per play in the test data (24,800 plays) sat at 4.208, the average expected yards from our model came out at… 4.200. That gives us an average rushing yards over expected (RYOE) of 0.008: virtually 0, right where we want it. That is including all the breakaway plays where RBs gain 20+ yards, even though the model’s maximum predicted value sits at ~10 yards.

For that same reason, the relationship between xRY and actual yards on a play-by-play basis will be underwhelming at best. There simply will never be an instance where the model predicts 70+ yards (something Miles Sanders did 3 times this past season).

However, if we group by player and season to analyze those results, we get the following:

Screen Shot 2021-02-24 at 8.15.38 PM.png

That’s a 94.8% correlation (0.898 R^2) on unseen data, for those keeping track at home. Ninety-four!!
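For readers who want to reproduce this kind of check, the sketch below sums actual and expected yards by player and season and then computes the Pearson correlation between the two, with R^2 taken as the square of that correlation (valid for a simple one-variable linear fit). The plays are made up, and whether the comparison above used season totals or per-carry averages is an assumption on my part; totals are used here purely for illustration.

```python
from collections import defaultdict
from math import sqrt

# Made-up test-set plays: (player, season, actual yards, xRY). Illustration only.
plays = [
    ("Back A", 2020, 5, 4.1), ("Back A", 2020, 2, 3.8), ("Back A", 2020, 9, 4.5),
    ("Back B", 2020, 3, 4.0), ("Back B", 2020, 1, 3.5),
    ("Back C", 2019, 7, 4.4), ("Back C", 2019, 4, 4.2),
    ("Back C", 2019, 6, 4.3), ("Back C", 2019, 0, 3.9),
]

def totals_by_player_season(plays):
    """Sum actual and expected yards for each (player, season)."""
    sums = defaultdict(lambda: [0.0, 0.0])  # key -> [actual total, xRY total]
    for player, season, yards, xry in plays:
        sums[(player, season)][0] += yards
        sums[(player, season)][1] += xry
    return dict(sums)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

totals = totals_by_player_season(plays)
actual = [t[0] for t in totals.values()]
expected = [t[1] for t in totals.values()]

r = pearson_r(expected, actual)
print(r, r ** 2)  # correlation and R^2 for the toy data
```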

Taking it one step further, if we take rushers with at least 20 carries in that same test data, the root mean squared error (RMSE) between yards per carry and xRY/Att is 1.13. Not half bad.
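As a quick sketch of how that number is put together: keep only rushers at or above the carry threshold, take the difference between yards per carry and xRY per attempt for each one, and compute the square root of the mean of the squared differences. The rusher summaries below are made up and the function name is hypothetical.

```python
from math import sqrt

# Made-up per-rusher summaries: (name, carries, actual yards, expected yards).
rushers = [
    ("Back A", 100, 450, 420.0),
    ("Back B", 40, 140, 168.0),
    ("Back C", 12, 90, 50.0),  # below the carry threshold, so excluded
]

def rmse_ypc_vs_xry(rushers, min_carries=20):
    """RMSE between yards per carry and xRY per attempt for qualifying rushers."""
    errors = [
        yards / carries - xry / carries
        for _, carries, yards, xry in rushers
        if carries >= min_carries
    ]
    return sqrt(sum(e * e for e in errors) / len(errors))

print(rmse_ypc_vs_xry(rushers))  # about 0.54 for the toy data
```

Because the errors are squared before averaging, RMSE punishes a few large misses more heavily than a plain average of absolute errors would.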

Furthermore, when compared against nflfastR’s expected points added (EPA), RYOE had a very interesting nonlinear relationship with a 72.7% correlation (0.528 R^2), supporting the idea that rushing for more yards than expected does lead to a higher point expectancy.

Screen Shot 2021-02-24 at 8.15.48 PM.png

I am sure one question still remains: how did the model compare to the NGS version of RYOE? The answer, again, is very well. For the 51 RBs that met the criteria, there was an 80.4% correlation (0.646 R^2) to the 2020 NGS data, as you can see below.

Screen Shot 2021-02-24 at 8.15.58 PM.png

Again, on a different portion of the data, notice how our average RYOE was virtually 0, as any over/under expectation metric should be. 

Finally, to tie the whole results section together, let's look at the 2020 leaders in RYOE: 

Screen Shot 2021-02-24 at 8.16.14 PM.png

Outside of some shockers (I’m looking at you, Darrell Henderson), I think the league leaders are about as expected.

Limitations

While this model is very accurate and compares well with the one created with tracking data, it is not without its limitations. In many cases, the situation (and therefore the model) may call for a 5-yard gain, but in reality a defender is already in the backfield at the moment of handoff, making those 5 yards impossible to gain. That is something we have no way of knowing from situational data.

We have big projects lined up that take advantage of the model’s strengths, but specific play-by-play analysis isn’t one of them. One must know what the model is capable of and what it’s not to take full advantage of it.

A huge thanks to Ben Baldwin and Sebastian Carl for their work on nflfastR, and to Tej Seth for the idea and for setting the foundation for this project to be built upon.

For this and much more Fantasy Football and Football Analytics content, you can find us at @BRotoFFCasanova and @BRotoFantasy on Twitter and, of course, at brotofantasy.com.

By Santiago Casanova (@BRotoFFCasanova)