Hi all, in this part of the intermediate series, I'm going to take you through part of a project we cover in the Python course offered on this site, only this post looks at the current season. The course offers a more in-depth analysis of this project along with an hour-long video explanation.

Rushing TD Regression Candidates

We are going to be doing a TD regression analysis for RBs. For each RB this season, we'll look at their number of carries and how far each carry was from the endzone. Each carry gets an expected touchdown value based on how often, historically, a rush from that yardline has scored. Adding up the expected TD values of all of an RB's carries gives an expected TD number for the season so far, which we then compare to the TDs the player actually scored. If a player scored more TDs than expected, we can say they "over performed" so far this season and are due for negative regression. If their expected TD value is higher than their actual TD total, the reverse holds and they are due for positive regression. This helps us assess mid-season whether players should have scored more TDs to this point, and gives us insight when making roster moves in fantasy football.
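As a rough sketch of the idea, here is how one player's expected TDs could be added up by hand. The probabilities are the first five rows of the probability table computed later in this post; the list of carries and the actual TD total are made up for illustration.

```python
# TD probability by distance from the endzone (in yards), taken from the
# rushing_df_probs table shown later in this post.
td_probability = {1: 0.560390, 2: 0.428058, 3: 0.336910, 4: 0.304251, 5: 0.206349}

# Hypothetical carries for one RB, recorded as yards from the endzone.
carries = [5, 3, 1, 4]

# Each carry contributes the historical TD probability for its yardline.
expected_tds = sum(td_probability[yardline] for yardline in carries)

# Hypothetical actual TD total, used only to show the comparison.
actual_tds = 1

print(f"expected: {expected_tds:.2f}, actual: {actual_tds}")
print("positive regression candidate:", expected_tds > actual_tds)
```

With these made-up carries the expected total is higher than the actual total, so this imaginary RB would be flagged as a positive regression candidate.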

There is actually quite a bit of statistics that goes into this. Since we have the probability of a score from a given yardline, does it make sense to simply add up the probabilities of all of a player's rush attempts in a single drive? One thing you may notice is that the TD probability for a rush from the 1 yardline is greater than 50%. This means that if a player attempts 2 rushes from the 1 yardline on the same drive, their expected touchdowns add up to more than 1. The arithmetic is technically correct, but a player cannot score more than 1 touchdown on a drive. To get around this, we cap expected touchdowns at 1 per drive.
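To make the problem concrete, here is a small sketch of two rushes from the 1 yardline on the same drive. The probability is the 1 yardline value from the table later in this post; the drive itself is made up.

```python
# Historical TD probability for a rush from the 1 yardline
# (from the probability table later in this post).
P_TD_FROM_1 = 0.560390

# Two hypothetical rushes from the 1 on the same drive.
drive_attempts = [P_TD_FROM_1, P_TD_FROM_1]

uncapped = sum(drive_attempts)  # adds up to more than 1
capped = min(uncapped, 1.0)     # a drive can produce at most one TD

print(f"uncapped: {uncapped:.5f}, capped: {capped}")
```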

No analysis is perfect, and this one treats every RB as if they have the same talent level. Obviously some RBs have that special ability to break off a 60 yard TD at any moment, while others simply do not. RBs with that ability have a higher probability of scoring a TD per handoff than an average RB, so this sort of analysis underestimates their expected TDs. Conversely, it overestimates expected TDs for RBs without that ability. The analysis is still very valuable, since its on-average approach lets us assess the opportunity-based value of a player compared to their actual performance.

Feel free to scroll to the bottom to get a list of RB expected and actual touchdowns scored.

The code

With that out of the way, let's start work on the project. For those unfamiliar with my content, you'll need to run this code in a notebook environment, and the easiest way to do that is to set up a Google Colab runtime.

To start, we're going to be installing the nflfastpy library, which is maintained by Fantasy Football Data Pros. It pulls from the nflfastR data package and exposes 20 years of play-by-play data.

Importing our libraries as always, and setting the style for our visualizations.

In this first block of code, we are going to be pulling data from the past 5 years of the NFL, excluding the most recent season, and concatenating the data together into one big DataFrame. We will be using this data to find the probability of scoring a touchdown X yards from the endzone, for each value of X 0 - 100, and putting this all into a DataFrame we'll call rushing_df_probs.
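A minimal sketch of the load-and-concatenate step might look like the following. The helper name load_seasons and the stand-in loader are my own, added so the example runs without a network connection, and the season range 2016 through 2020 is an assumption about what "the past 5 years" means here. With nflfastpy installed, you would pass its play-by-play loader instead (assumed to be nfl.load_pbp_data).

```python
import pandas as pd

def load_seasons(load_one_season, seasons):
    """Load each season with the supplied function and stack the results."""
    frames = [load_one_season(season) for season in seasons]
    # ignore_index gives the combined DataFrame one clean 0..n-1 index.
    return pd.concat(frames, ignore_index=True)

# Stand-in loader (hypothetical): returns two made-up plays per season so
# the sketch runs offline. Swap in the real play-by-play loader in practice.
def fake_loader(season):
    return pd.DataFrame({
        "yardline_100": [1.0, 40.0],
        "rush_touchdown": [1.0, 0.0],
    })

# Assumed range: the five seasons before 2021.
pbp_df = load_seasons(fake_loader, range(2016, 2021))
print(pbp_df.shape)
```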

Now we're going to take our big DataFrame and transform it the way I described. Like I said earlier, this post will not be as in depth as my course, so if the code here confuses you a bit, it's covered in more depth in the course. I've added some comments to each line to explain what we're doing.

	yardline_100	probability_of_touchdown
0	1.0	0.560390
3	2.0	0.428058
5	3.0	0.336910
7	4.0	0.304251
9	5.0	0.206349
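One way to build a table like this is sketched below on made-up rush attempts. It takes the mean of the 0/1 touchdown flag for each yardline, which is not necessarily the exact code used to produce the output above. The column names yardline_100 and rush_touchdown follow the nflfastR play-by-play data.

```python
import pandas as pd

# Made-up historical rush attempts: four from the 1 and five from the 2.
rushing_df = pd.DataFrame({
    "yardline_100":   [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0],
    "rush_touchdown": [1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
})

# The mean of a 0/1 touchdown flag is the share of rushes that scored.
rushing_df_probs = (
    rushing_df.groupby("yardline_100")["rush_touchdown"]
    .mean()
    .reset_index()
    .rename(columns={"rush_touchdown": "probability_of_touchdown"})
)
print(rushing_df_probs)
```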

Now we can plot the DataFrame above to visualize the probability of scoring a rushing touchdown X yards from the endzone. As anyone could have expected, the farther you get from the endzone, the less probable a TD becomes. We're more concerned with the values in the DataFrame and using them to calculate expected TDs than with any sort of revelation from visualizing this data.

Now we'll load in 2021 data. We're going to use the rushing_df_probs DataFrame to calculate expected TD values for each RB in the 2021 season, then compare that calculated value to actual TDs scored.

Below, we're going to filter, clean, and merge our data a bit. I've added comments above each operation.

	rusher_id	drive_id	rusher_player_name	rush_attempt	yardline_100	probability_of_touchdown
0	00-0032764	2021_01_ARI_TEN_1	D.Henry	1.0	75.0	0.003111
1	00-0035228	2021_01_ARI_TEN_2	K.Murray	1.0	23.0	0.018553
2	00-0034681	2021_01_ARI_TEN_2	C.Edmonds	1.0	9.0	0.081081
3	00-0032764	2021_01_ARI_TEN_3	D.Henry	1.0	80.0	0.000969
4	00-0032764	2021_01_ARI_TEN_5	D.Henry	1.0	75.0	0.003111
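The filter, drive id, and merge steps can be sketched roughly as follows. The plays are made up, the probabilities are copied from the table above, and building drive_id from the nflfastR columns game_id and drive is my assumption, based on how the ids in that table look.

```python
import pandas as pd

# Made-up 2021 plays; the third row is not a rush and should be dropped.
pbp_2021 = pd.DataFrame({
    "game_id": ["2021_01_ARI_TEN"] * 4,
    "drive": [1, 2, 2, 3],
    "rusher_id": ["00-0032764", "00-0035228", None, "00-0032764"],
    "rusher_player_name": ["D.Henry", "K.Murray", None, "D.Henry"],
    "rush_attempt": [1.0, 1.0, 0.0, 1.0],
    "yardline_100": [75.0, 23.0, 50.0, 80.0],
})

# TD probabilities for these yardlines, copied from the table above.
rushing_df_probs = pd.DataFrame({
    "yardline_100": [23.0, 75.0, 80.0],
    "probability_of_touchdown": [0.018553, 0.003111, 0.000969],
})

# Keep rush attempts only.
rushes = pbp_2021.loc[pbp_2021["rush_attempt"] == 1].copy()

# Build a drive identifier that is unique across games (assumed construction).
rushes["drive_id"] = rushes["game_id"] + "_" + rushes["drive"].astype(int).astype(str)

# Attach the historical TD probability for each carry's yardline.
rushes = rushes.merge(rushing_df_probs, how="left", on="yardline_100")
print(rushes[["rusher_player_name", "drive_id", "yardline_100", "probability_of_touchdown"]])
```

A left merge keeps every rush attempt even if a yardline were missing from the probability table, in which case the probability would show up as NaN rather than the row silently disappearing.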

Now, we have a DataFrame that contains each rush attempt for the season by an RB, with each play assigned an expected touchdown value. All that's left to do now is to group by rusher id and add up the actual touchdowns and the expected values, right? Recall that at the beginning of this post we discussed the problem I originally ran into when running this sort of analysis. Some RBs would receive an expected TD value greater than 1 on a given drive, which doesn't make sense.

I was considering different ways to approach this problem. I considered using conditional probability, but since we already know the outcome of the events, it isn't the most appropriate tool. In my opinion, the best thing we can do is cap the per-drive TD maximum at 1. This made the most sense to me, since if an RB gets 3 rushes from the 1 yardline, I would say he should have scored based on his opportunity.

So, we need to limit expected TDs to 1 per drive, or at the very least, very close to 1. That was the reason for assigning a drive id column above. First, we will add up expected values for each RB and drive without this cap.

We can see here we have 18 instances where an RB was assigned an expected TD value greater than 1 on a single drive. Let's cap all drives at 0.999.
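Here is a minimal sketch of the per-drive sum and the 0.999 cap, using made-up carries and placeholder ids (rb_a, g1_1, and so on). It uses clip to apply the cap, which is one of several ways to do it.

```python
import pandas as pd

# Made-up carries: rb_a gets two rushes from the 1 on the same drive.
rushes = pd.DataFrame({
    "rusher_id": ["rb_a", "rb_a", "rb_a", "rb_b"],
    "drive_id": ["g1_1", "g1_1", "g1_4", "g1_2"],
    "rush_touchdown": [0.0, 1.0, 0.0, 0.0],
    "probability_of_touchdown": [0.560390, 0.560390, 0.003111, 0.081081],
})

# Sum actual and expected TDs per rusher and drive, before any cap.
per_drive = rushes.groupby(["rusher_id", "drive_id"], as_index=False).agg(
    actual_touchdowns=("rush_touchdown", "sum"),
    expected_touchdowns=("probability_of_touchdown", "sum"),
)

# Count the drives where the uncapped sum exceeds 1.
n_over = int((per_drive["expected_touchdowns"] > 1).sum())

# Cap every drive at 0.999 expected TDs.
per_drive["expected_touchdowns"] = per_drive["expected_touchdowns"].clip(upper=0.999)

print(n_over)
print(per_drive)
```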

Now that we've gotten that out of the way, let's groupby again, this time grouping only by rusher id. We'll assign a column called positive_regression_candidate, which will be True when expected touchdowns > actual touchdowns, indicating a player may be due for positive regression. The delta column we assign here will be the absolute difference between expected and actual touchdowns.

	rusher_id	rusher_player_name	actual_touchdowns	expected_touchdowns	positive_regression_candidate	delta
73	00-0032764	D.Henry	10.0	7.339330	False	2.660670
209	00-0036223	J.Taylor	6.0	7.021710	True	1.021710
186	00-0035657	D.Harris	6.0	5.897961	False	0.102039
189	00-0035664	D.Henderson	5.0	5.003799	True	0.003799
108	00-0033856	L.Fournette	4.0	4.867322	True	0.867322
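The season-level rollup can be sketched like this on made-up per-drive totals. Taking the absolute value for delta is an inference from the table above, where every delta is positive whether the player is over or under their expected value. Grouping on rusher_id alongside the name matters because display names are not unique (the appendix has two J.Taylor rows, for example).

```python
import pandas as pd

# Made-up per-drive totals, already capped at 0.999 per drive.
per_drive = pd.DataFrame({
    "rusher_id": ["id_a", "id_a", "id_a", "id_b", "id_b"],
    "rusher_player_name": ["RB A", "RB A", "RB A", "RB B", "RB B"],
    "actual_touchdowns": [1.0, 1.0, 0.0, 0.0, 0.0],
    "expected_touchdowns": [0.999, 0.25, 0.1, 0.999, 0.5],
})

# Season totals per rusher.
season = per_drive.groupby(
    ["rusher_id", "rusher_player_name"], as_index=False
)[["actual_touchdowns", "expected_touchdowns"]].sum()

# True when the player has scored fewer TDs than their opportunities suggest.
season["positive_regression_candidate"] = (
    season["expected_touchdowns"] > season["actual_touchdowns"]
)

# Size of the gap, regardless of direction (assumed from the table above).
season["delta"] = (season["expected_touchdowns"] - season["actual_touchdowns"]).abs()

season = season.sort_values("expected_touchdowns", ascending=False)
print(season)
```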

All that's left now is to visualize the results, but I encourage you to mess around with this data because it's interesting and useful on its own. I've added some comments to the code below to guide you through it once again.

The Henry injury is devastating, but let's use him as an example to interpret this plot. Unsurprisingly, Derrick Henry finds himself at the top of the list in terms of both expected and actual touchdowns. The plot suggests that Henry is over performing, but as I mentioned at the beginning of the post, we are not taking into account each individual rusher's ability, and this is Derrick Henry we are talking about. If an average rusher had gotten the opportunities Henry has gotten this season, I might think the TD production was not sustainable, but there is zero reason to suggest Henry was over performing and would have regressed to the mean if he had stayed healthy.

Someone like Sam Darnold, on the other hand, is a huge regression candidate, since he ran in 5 TDs in the first 4 weeks of the season. He has yet to score a rushing touchdown since then (this analysis is taking place after week 7), which means we are already seeing a sort of regression to the mean. It's safe to say we do not expect Darnold to be the leading TD rusher (or anywhere near it, like he was after week 4).

I anticipate Miles Sanders to get it going, at least more than he has so far this season, since he is still without a TD. Don't expect him to be a top 10 RB (or even top 15) since Hurts will continue to sap his value, but with Sanders under performing his expected TD value by nearly 3, I'd wager he will find the endzone sooner rather than later.

That's all for this post. Thank you for reading, you guys are awesome!

Appendix

Here are all the RBs from this season that have at least 1 expected touchdown.

	rusher_id	rusher_player_name	actual_touchdowns	expected_touchdowns	positive_regression_candidate	delta
73	00-0032764	D.Henry	10.0	7.339330	False	2.660670
209	00-0036223	J.Taylor	6.0	7.021710	True	1.021710
186	00-0035657	D.Harris	6.0	5.897961	False	0.102039
189	00-0035664	D.Henderson	5.0	5.003799	True	0.003799
108	00-0033856	L.Fournette	4.0	4.867322	True	0.867322
102	00-0033553	J.Conner	8.0	4.865928	False	3.134072
83	00-0033045	E.Elliott	5.0	4.715967	False	0.284033
92	00-0033293	A.Jones	3.0	4.640432	True	1.640432
117	00-0033906	A.Kamara	2.0	4.557704	True	2.557704
235	00-0036389	J.Hurts	4.0	4.183223	True	0.183223
259	00-0036893	N.Harris	3.0	4.103959	True	1.103959
105	00-0033699	A.Ekeler	5.0	3.922145	False	1.077855
116	00-0033897	J.Mixon	5.0	3.900342	False	1.099658
148	00-0034791	N.Chubb	4.0	3.681646	False	0.318354
198	00-0035831	J.Robinson	5.0	3.407411	False	1.592589
120	00-0033923	K.Hunt	5.0	3.320475	False	1.679525
265	00-0036924	Mi.Carter	3.0	3.311899	True	0.311899
149	00-0034796	L.Jackson	2.0	3.289903	True	1.289903
11	00-0027966	M.Ingram	1.0	3.248947	True	2.248947
211	00-0036251	Z.Moss	3.0	3.147690	True	0.147690
218	00-0036275	D.Swift	3.0	3.134956	True	0.134956
130	00-0034301	Da.Williams	4.0	3.050942	False	0.949058
223	00-0036328	A.Gibson	3.0	3.009944	True	0.009944
173	00-0035243	M.Sanders	0.0	2.891143	True	2.891143
79	00-0032972	D.Booker	2.0	2.780498	True	0.780498
192	00-0035700	J.Jacobs	5.0	2.615657	False	2.384343
114	00-0033893	D.Cook	2.0	2.585961	True	0.585961
66	00-0032426	A.Collins	2.0	2.579420	True	0.579420
273	00-0036997	J.Williams	1.0	2.455934	True	1.455934
59	00-0032144	M.Gordon	3.0	2.444273	False	0.555727
158	00-0034857	J.Allen	2.0	2.398869	True	0.398869
52	00-0031806	M.Brown	1.0	2.370514	True	1.370514
121	00-0033948	J.Williams	2.0	2.288534	True	0.288534
172	00-0035228	K.Murray	2.0	2.277217	True	0.277217
183	00-0035537	T.Johnson	1.0	2.227564	True	1.227564
89	00-0033280	C.McCaffrey	1.0	2.185380	True	1.185380
193	00-0035710	D.Jones	2.0	2.160797	True	0.160797
241	00-0036555	C.Hubbard	2.0	2.136616	True	0.136616
159	00-0034869	S.Darnold	5.0	2.040697	False	2.959303
136	00-0034414	B.Scott	3.0	1.984017	False	1.015983
84	00-0033077	D.Prescott	0.0	1.946874	True	1.946874
262	00-0036906	K.Herbert	1.0	1.881730	True	0.881730
155	00-0034844	S.Barkley	2.0	1.849759	False	0.150241
56	00-0032063	M.Davis	1.0	1.842768	True	0.842768
145	00-0034681	C.Edmonds	1.0	1.839042	True	0.839042
77	00-0032950	C.Wentz	1.0	1.718546	True	0.718546
152	00-0034816	R.Jones	1.0	1.711764	True	0.711764
205	00-0036096	J.Taylor	2.0	1.666519	False	0.333481
31	00-0030578	C.Patterson	2.0	1.617028	False	0.382972
32	00-0030874	D.Williams	2.0	1.572870	False	0.427130
70	00-0032602	J.McKissic	1.0	1.560870	True	0.560870
191	00-0035685	D.Montgomery	3.0	1.543478	False	1.456522
242	00-0036567	E.Mitchell	3.0	1.532005	False	1.467995
104	00-0033594	C.Carson	3.0	1.495154	False	1.504846
276	00-0037012	T.Lance	1.0	1.355950	True	0.355950
268	00-0036971	T.Lawrence	2.0	1.324875	False	0.675125
156	00-0034845	S.Michel	1.0	1.307333	True	0.307333
72	00-0032741	P.Barber	1.0	1.269330	True	0.269330
96	00-0033357	T.Hill	3.0	1.264787	False	1.735213
264	00-0036919	K.Gainwell	2.0	1.233825	False	0.766175
28	00-0030513	L.Murray	4.0	1.227994	False	2.772006
42	00-0031345	J.Garoppolo	3.0	1.227769	False	1.772231
74	00-0032780	J.Howard	2.0	1.200711	False	0.799289
215	00-0036265	A.Dillon	0.0	1.188196	True	1.188196
41	00-0031285	D.Freeman	2.0	1.188181	False	0.811819
146	00-0034750	R.Penny	0.0	1.117933	True	1.117933
177	00-0035311	M.Gaskin	0.0	1.114066	True	1.114066
0	00-0019596	T.Brady	1.0	1.109926	True	0.109926
175	00-0035261	T.Pollard	1.0	1.103899	True	0.103899
161	00-0034972	A.Mattison	0.0	1.082761	True	1.082761
34	00-0031045	C.Hyde	0.0	1.070865	True	1.070865
174	00-0035250	D.Singletary	1.0	1.040830	True	0.040830
185	00-0035628	D.Johnson	2.0	1.016086	False	0.983914
