In this part of the beginner series, we split data into training and testing sets and begin work on an ML model to predict Fantasy Football points.
If you have any questions about the code here, feel free to reach out to me on Twitter or on Reddit.
If you like Fantasy Football and have an interest in learning how to code, check out our Ultimate Guide on Learning Python with Fantasy Football Online Course. Here is a link to purchase for 15% off. The course includes 15 chapters of material, 14 hours of video, hundreds of data sets, lifetime updates, and a Slack channel invite to join the Fantasy Football with Python community.
In this part of the beginner series, we are going to begin using Linear Regression to predict Fantasy Football output based on one feature: Usage. There isn't going to be much code in this part, as I want to lay the groundwork for how Linear Regression works before we actually implement it.
Linear Regression is a machine learning algorithm. Machine Learning isn't actually that hard to implement (Python and scikit-learn provide us with a bunch of tools to easily implement models) if you understand the underlying concepts.
Machine Learning algorithms are classified into two categories: those that "train" or learn with data that is labeled, and those that don't. The former is known as supervised machine learning, and the latter is known as unsupervised machine learning. If you're on desktop, check out the new FFDP portal here to see an unsupervised machine learning algo in action.
For the purpose of this series, we are going to focus on supervised machine learning for now. Supervised machine learning algorithms can be categorized further into regression algorithms and classification algorithms. Regression algorithms, which we'll be focusing on right now, are used to predict a continuous output, meaning a value that can fall anywhere within an interval. Weight, stock prices, and Fantasy Football points are all continuous outputs and thus can be predicted using Regression algorithms.
The Regression algorithm we are going to focus on at first is simple linear regression (SLR). The way SLR works is it takes in one explanatory variable, also known as an independent variable, and outputs a continuous value.
In our case, Usage is our explanatory variable. It is going to be used to predict Fantasy Football performance.
The first thing we need to do before using SLR to predict Fantasy Points is train the algorithm. Basically, we split our data into training data and testing data. We'll use a function to do this in the code. We then take the training data and feed it to the algorithm. Once the model is trained, we give it the testing data, which it uses to predict Fantasy Football points. From there, we evaluate how the predicted values fared against the actual values by calculating something called root mean squared error, which tells us how our model did. Training and evaluation are reserved for future parts of this series.
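To make root mean squared error a little more concrete, here is a minimal sketch in plain Python. The numbers are made up purely for illustration (they are not model output), and the helper name `compute_rmse` is my own, not something from scikit-learn.

```python
import math

# Hypothetical actual and predicted fantasy point totals (illustration only).
actual = [300.0, 250.0, 200.0, 150.0]
predicted = [290.0, 260.0, 190.0, 160.0]


def compute_rmse(y_true, y_pred):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


rmse = compute_rmse(actual, predicted)
print(rmse)  # every prediction is off by 10 points, so this prints 10.0
```

The result is in the same units as what we are predicting, so an RMSE of 10 here reads roughly as "off by about 10 fantasy points".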
A couple notes before we begin:
(1) Usage and Fantasy Football points happen at the same time: by the time you know a player's usage, you already know his points. That makes this model a bit tricky to use for actual forecasting. The key in future parts is that we'll be using prior usage to predict future usage, and then using that to predict Fantasy Football performance. In my course, we actually implement a model like this as one of the capstone projects, and it works out pretty well.
(2) The details of how Linear Regression learns will be covered in future parts. For now, it's enough to know that the algorithm picks the line that minimizes what's known as the sum of the squared residuals, where a residual is the gap between an actual value and the value the line predicts.
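As a rough preview of note (2), the sketch below compares the sum of squared residuals for a guessed line against the least-squares line. It uses only the five Usage and FantasyPoints pairs shown in the table in this post, the function names are my own, and the closed-form formulas are the standard ones for the one-feature case rather than anything specific to this series.

```python
# Usage and FantasyPoints for the five running backs shown in this post.
usage = [429.0, 372.0, 365.0, 347.0, 327.0]
points = [469.2, 311.7, 259.4, 255.2, 294.6]


def sum_squared_residuals(slope, intercept, xs, ys):
    """Add up (actual - predicted)^2 for the line y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Closed-form slope and intercept for simple (one-feature) linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# An arbitrary guess at a line, just to have something to compare against.
guess_ssr = sum_squared_residuals(1.0, -50.0, usage, points)

best_slope, best_intercept = least_squares_line(usage, points)
best_ssr = sum_squared_residuals(best_slope, best_intercept, usage, points)

print(guess_ssr, best_ssr)  # the least-squares line never has the larger value
```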
Here's the code to bring us back to where we were last time.
| | Player | Tgt | RushingAtt | FantasyPoints | Usage | UsageRank | FantasyPointsRank |
|---|---|---|---|---|---|---|---|
| 0 | Christian McCaffrey | 142.0 | 287.0 | 469.2 | 429.0 | 1.0 | 1.0 |
| 4 | Ezekiel Elliott | 71.0 | 301.0 | 311.7 | 372.0 | 2.0 | 3.0 |
| 28 | Leonard Fournette | 100.0 | 265.0 | 259.4 | 365.0 | 3.0 | 7.0 |
| 8 | Nick Chubb | 49.0 | 298.0 | 255.2 | 347.0 | 4.0 | 8.0 |
| 2 | Derrick Henry | 24.0 | 303.0 | 294.6 | 327.0 | 5.0 | 5.0 |
What we want to do is use Usage to predict FantasyPoints. What we need to do first, though, is separate our data into the training set we'll use to fit the algorithm and the testing set we'll use to check it. This is very important. It's not good to test the model on the same data you used to train it, because you want to see how it performs on data it has not seen before. Doing otherwise is known as a "data leakage" issue.
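The split itself might look something like the sketch below. Treat it as a reconstruction based on the walkthrough in this post rather than the exact notebook cell: the DataFrame name `df` and the five-row stand-in data (copied from the table above) are assumptions, and `random_state` is something I added only so the split comes out the same on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the DataFrame built in the previous part (assumed name: df).
# Only the five rows shown in the table are included here.
df = pd.DataFrame(
    {
        "Player": [
            "Christian McCaffrey",
            "Ezekiel Elliott",
            "Leonard Fournette",
            "Nick Chubb",
            "Derrick Henry",
        ],
        "Usage": [429.0, 372.0, 365.0, 347.0, 327.0],
        "FantasyPoints": [469.2, 311.7, 259.4, 255.2, 294.6],
    }
)

# X is the explanatory variable, y is what we want to predict.
X = df["Usage"].values
y = df["FantasyPoints"].values

# Hold back 20% of the rows for testing; random_state is optional (my addition).
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Peek at (up to) the first 20 values to make sure everything looks right.
print(x_train[:20])
print(y_test[:20])
```

With only five rows, 20% works out to a single test row; on the full data set the same call holds back a proportionally larger slice.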
This is all we are going to be doing today. Next time, we'll be talking more about how Linear Regression fits the model. Again, you'll see that making a LinearRegression model only takes a couple lines of code because of the tools that scikit-learn already provides for us. The important part is understanding what the algorithm does and why you should use it. After that, implementing models is actually really easy.
Okay, so let's run through what we just did here.
We imported a function from the sklearn library known as train_test_split to split our data into training and testing sets. sklearn is the name you import scikit-learn under, and scikit-learn is the de facto machine-learning library for Python.
Before that, though, we set our X and our Y variables. Usage is our X as it's our independent or explanatory variable. It is being used here to explain Fantasy Football performance. Y is set to be our FantasyPoints column as it's what's being explained or predicted by our model.
train_test_split doesn't strictly require numpy arrays (it also accepts lists and Pandas objects), but numpy arrays are what we pass it here. The way we do this in Pandas is by tacking on .values at the end there. Pandas then takes our DataFrame column (or Series) and converts it to a numpy array.
From there, we input it into our little function, and we also set test_size=0.2. We are telling sklearn here that we'd like to save 20% of our data for testing. The remaining 80% will be used to train the algorithm.
train_test_split gives us back a train piece and a test piece for each array we pass in. Since we passed in two arrays, we get back 4 values: the X training data, the X testing data, the Y training data, and the Y testing data, in that order. In Python, when a function gives us back multiple values we can "unpack" the return values of the function using this notation or syntax.
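Here is a small sketch of that conversion on a throwaway Series (the three values are just the top of the Usage column). I've also included .to_numpy(), which the Pandas docs now recommend over .values; for a plain numeric column like this one they give the same array.

```python
import numpy as np
import pandas as pd

# A tiny Series standing in for the Usage column.
usage = pd.Series([429.0, 372.0, 365.0], name="Usage")

usage_array = usage.values  # Series -> numpy array

print(type(usage))
print(type(usage_array))

# .to_numpy() is the newer spelling of the same conversion.
same_array = usage.to_numpy()
print(np.array_equal(usage_array, same_array))
```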
For example, this works in Python (this is not part of the source, so don't type it into your notebook).
The way you find these things out is simply by looking through the documentation. The scikit-learn docs page for the train_test_split function lists what it accepts and what it returns.
At the end there, I just printed our x_train and y_test values to make sure everything looked good. I used slicing notation to output the first 20 values of each array ([:20] means grab everything from the start up to, but not including, index 20, which is the first 20 items in the array/list).
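A minimal illustration of unpacking, using a made-up helper called `min_and_max` (it has nothing to do with scikit-learn):

```python
def min_and_max(numbers):
    """Return two values at once; Python packs them into a tuple."""
    return min(numbers), max(numbers)


# One name per returned value, in the same order the function returns them.
lowest, highest = min_and_max([327.0, 429.0, 365.0])

print(lowest, highest)  # 327.0 429.0
```

train_test_split works the same way, just with four names on the left-hand side instead of two.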
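If slicing is new to you, here is a quick sketch on an ordinary list. The values are arbitrary; numpy arrays slice the same way.

```python
values = list(range(100, 130))  # 30 items: 100, 101, ..., 129

first_twenty = values[:20]  # items at index 0 through 19

print(len(first_twenty))  # 20
print(first_twenty[0], first_twenty[-1])  # 100 119

# Slicing past the end is safe: you simply get whatever is there.
short = [1, 2, 3]
print(short[:20])  # [1, 2, 3]
```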
And that's it for this part of the series!
In the next part, we'll be feeding our training data into our algorithm and then using it to predict Fantasy Points. In the part after that, we'll go over how to evaluate our model's performance.
Thanks for reading!