Any system of prospect evaluation, from traditional courtside scouting to complex machine learning, follows the same basic framework. The evaluator assesses a player’s skillset, then projects how well those tools will translate to the next level. These projections are typically based on the outcomes of past prospects with a similar physical profile and talents. The difference comes in how evaluators go about this process, and it is clear that there is no perfect approach. Statistical models are objective and use precise weights, but necessarily work at a low resolution. They are limited to the data plugged into them, which is never more than a tiny subset of all possible information. Traditional scouting has much better access to the context and detail needed to compile a full description of players, but it struggles to systematically weight different attributes and can fall prey to basic human biases.
I see quite a bit of mud-flinging between quant-based and eye-test evaluators, but it seems silly. Objective and subjective specialists are wonderfully suited to mutualistic cooperation. Statistical models can tell a simple story about prospects while clearly stating underlying assumptions and gaps in the information accounted for. Traditional scouting fills those gaps and identifies why model assumptions and omitted information might be particularly problematic in specific situations. Ideally, scouts can even generate testable hypotheses that ultimately improve both ends of evaluation. This process is only possible if modelers make their approach explicit and understandable. Authoritatively assigning numbers to players through some murky process does nothing but kill discussion. This is the main reason I try to make my methods as clear and accessible as possible.
Previously, I described the data I am working with for creating a model to project international prospects. Today I am going to outline the process of actually creating the model and highlight some interesting findings along the way.
My draft model is built using linear regression, which takes a set of variables and finds a way to weight each of them such that they do the best possible job of explaining some other variable. In this case, that other variable is NBA performance. As noted in the introductory article, I am trying to predict a player’s “NBA wins peak,” a simple metric that indicates how many wins a player is responsible for during the high point of his career. This is called the “dependent variable,” while the other variables used to explain it are the “independent variables.” The independent variables are drawn from information about players that is already available at the time we want to make predictions. In this case, I am using international statistics and basic biometrics (i.e., height, weight, and age).
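To make the idea of "finding the weights" concrete, here is a minimal sketch of ordinary least squares in Python with NumPy. Everything in it is an assumption for illustration: the data are simulated, the three predictors are unnamed stand-ins, and the variable names are hypothetical. It is not the data or code behind the actual draft model.

```python
import numpy as np

# Simulated stand-in data: every number here is made up for illustration.
rng = np.random.default_rng(0)
n_players = 200
X = rng.normal(size=(n_players, 3))        # three hypothetical independent variables
true_weights = np.array([2.0, -1.0, 0.5])  # weights used to simulate the outcome
y = 4.0 + X @ true_weights + rng.normal(scale=0.5, size=n_players)  # dependent variable

# Add a column of ones so the regression also estimates an intercept.
design = np.column_stack([np.ones(n_players), X])

# Ordinary least squares: choose the weights that minimize squared prediction error.
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, weights = coef[0], coef[1:]

# A prediction is the intercept plus each variable times its weight.
predictions = design @ coef
print("intercept:", round(float(intercept), 2))
print("weights:", np.round(weights, 2))
```

Because the outcome was simulated from known weights, the fitted weights should land close to them, which is a quick way to confirm the procedure is doing what the paragraph above describes.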
One important thing to keep in mind when choosing independent variables is the difference between description and prediction. The goal is to generate predictions that are as accurate as possible. This is often consistent with doing the best job of describing previous outcomes, but not always. For an extreme example, I could include a variable in my model that captures whether or not a player was born in Würzburg, Germany. Thanks to Dirk Nowitzki, the regression algorithm would likely consider this a highly significant predictor of success. However, I think we can all agree that being born in Würzburg is not actually a useful variable to consider when evaluating prospects. This is why ideally we should build models from some sort of theoretical basis. We may not be able to intuit the precise weights of different attributes, but we should have some good ideas about which variables are important, whether they should have positive or negative effects, and how they might relate to one another.
That said, we also want to do a bit of exploration. There may be factors that are surprisingly important or unimportant and we do not want to completely ignore new information. For these reasons, I start simple, then experiment and build from there if I am convinced additions make sense and improve model performance.
After organizing all of my data, the first model I fit included age, field-goal attempts (FGA), shooting efficiency (eFG), two-point vs. three-point bias (2PA/FGA), free throws (FT), rebounds (TRB), assists (AST), turnovers (TOV), steals (STL), blocks (BLK), personal fouls (PF), and height. In most cases these variables are exactly as you see them in the box score, but I should note that age is not simply “age.” Because the dataset includes players on either side of expected peak age (~27 years old), I use a more complicated age curve based on DSMok’s work in the APBRmetrics forum. This more complicated age variable better captures the relationship between age, production, and NBA potential.
Here are the weights, or coefficients, I found for each of the above variables.
These are the values that, on average, give the best predictions of NBA performance when applied to players’ production (per 100 possessions). For an example of how this works, I will use the numbers from Ricky Rubio’s 18-year-old season:
-11.68 - 3.711 + 14.77*(0.09) + 0.46*(5.44) + 0.64*(-0.26) + 6.65*(0.24) + 12.66*(0.43) + 7.05*(-0.43) + 6.63*(0.07) + 4.96*(0.60) + 0.27*(0.41) + 5.67*(-0.12) + 76*(0.19) = 9.6
The first term is the intercept, which is a constant across all players that is estimated along with the weights for the different variables. The second term is the age adjustment, which I am whistling past for the sake of simplicity. The values in parentheses are the coefficients from the list above, while the others are the actual numbers posted by Rubio in the corresponding box-score statistic (or his height in inches for the final term). Adding all of these values together, we get a projected score of 9.6. That tells us that Ricky Rubio looked like a borderline star at 18, at least according to this simple model.
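The arithmetic in the Rubio example can be checked with a few lines of Python. This is only a sketch: the numbers are copied from the equation above in the order they appear, and reading the first number of each pair as the statistic and the second as the coefficient is an assumption (it does not change the total either way). The pairs are deliberately left unlabeled, since the equation itself does not name them.

```python
# Intercept and age adjustment, copied from the equation above.
intercept = -11.68
age_adjustment = -3.711

# (statistic, coefficient) pairs in the order they appear in the equation.
# The final pair is height in inches and its coefficient.
terms = [
    (14.77, 0.09),
    (0.46, 5.44),
    (0.64, -0.26),
    (6.65, 0.24),
    (12.66, 0.43),
    (7.05, -0.43),
    (6.63, 0.07),
    (4.96, 0.60),
    (0.27, 0.41),
    (5.67, -0.12),
    (76, 0.19),
]

# The projection is the intercept, plus the age term, plus each statistic times its weight.
projection = intercept + age_adjustment + sum(value * coef for value, coef in terms)
print(round(projection, 1))  # 9.6
```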
Unstandardized coefficients like the ones above are nice for demonstrating the model’s mechanics, but they are not as useful for displaying the relative importance of different attributes. Below are the same coefficients, but standardized to reveal their relative importance. To help make these even easier to appreciate, I fit the same variables to my NCAA data and am placing the two outputs side-by-side. This gives a comparison of how the collection of different statistics in either the NCAA or international competition carries different information about a player’s NBA potential:
Age, height, and defensive production are given nearly identical weight in the two models. The many differences between NCAA and international competition do not seem to matter when evaluating these attributes for a player’s NBA potential. This is not the case with the other variables included in the model. Scoring, in terms of both volume and efficiency, appears to be more important for collegiate players, as does rebounding. Meanwhile, passing efficiency is weighted more heavily for international prospects. Assists are given equal value in both the NCAA and international models, but international prospects committing turnovers are punished much more harshly.
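For readers unfamiliar with standardized coefficients, the sketch below shows one common convention: multiply each raw slope by the standard deviation of its predictor and divide by the standard deviation of the outcome. Whether the model described here uses exactly this convention is an assumption. The data are simulated, and the names `height` and `efg` are hypothetical stand-ins on deliberately different scales.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
# Two made-up predictors on very different scales (inches vs. a shooting percentage).
height = rng.normal(78, 3, size=n)
efg = rng.normal(0.50, 0.05, size=n)
y = 0.2 * height + 6.0 * efg + rng.normal(scale=1.0, size=n)

X = np.column_stack([height, efg])
design = np.column_stack([np.ones(n), X])
raw = np.linalg.lstsq(design, y, rcond=None)[0][1:]  # unstandardized slopes

# Standardized slope = raw slope * sd(predictor) / sd(outcome).
standardized = raw * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Equivalent route: z-score every variable first, then fit the same regression.
Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
yz = (y - y.mean()) / y.std(ddof=1)
refit = np.linalg.lstsq(np.column_stack([np.ones(n), Xz]), yz, rcond=None)[0][1:]

print("raw:", np.round(raw, 3))
print("standardized:", np.round(standardized, 3))
```

The raw slopes depend on the units of each variable, so they cannot be compared directly. The standardized slopes are expressed in standard deviations, which is what makes a side-by-side comparison of two models meaningful.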
Two other factors that show a strong difference are 2-point bias and free-throw makes. The models differ wildly in the importance they place on these two factors. However, this is a good example of where we need to be careful about over-interpreting regression outputs. Players who play closer to the basket offensively also draw more fouls, and thus collect more free throws. This means that the two independent variables are correlated with one another. Correlated variables probably carry shared information about NBA potential, and because only one of them will receive credit for adding that information, they can complicate interpretation. If we remove either FT or 2PA/FGA from both models, the other suddenly appears more important than it did before, as it now adds more unique information.
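The effect of correlated predictors is easy to reproduce on simulated data. In the sketch below, a hypothetical hidden trait (`inside`) drives both a two-point-bias variable and a free-throw variable, as well as the outcome. All names and numbers are assumptions chosen to illustrate the point, not values from the actual model.

```python
import numpy as np

def ols_slopes(X, y):
    """Fit a regression with an intercept and return only the slopes."""
    design = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(design, y, rcond=None)[0][1:]

rng = np.random.default_rng(2)
n = 500
inside = rng.normal(size=n)  # hypothetical "plays near the basket" trait
two_pt_bias = inside + rng.normal(scale=0.5, size=n)
free_throws = inside + rng.normal(scale=0.5, size=n)
y = inside + rng.normal(scale=1.0, size=n)

# With both correlated predictors in the model, they split the shared information.
both = ols_slopes(np.column_stack([two_pt_bias, free_throws]), y)

# Drop two-point bias and free throws absorbs the credit it was sharing.
ft_only = ols_slopes(free_throws.reshape(-1, 1), y)

print("FT slope with 2PA/FGA in the model:", round(float(both[1]), 2))
print("FT slope on its own:", round(float(ft_only[0]), 2))
```

Neither fit is wrong. The slope on free throws simply answers a different question depending on what else is in the model, which is why a single coefficient should not be read as the importance of a skill.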
This is a nice model, but there is room for improvement. The next step is to take what we learned above and find beneficial tweaks. For example, I dropped the useless two-point-bias variable and added an interaction effect between assists and turnovers. Interaction effects are more complex terms that account for the fact that the relationship between one independent variable and the dependent variable may be different depending on some third variable. In this case I am trying to more accurately capture the relationship between distribution efficiency and NBA success. These two changes improve the overall performance of the model. The first version explained about 30% of the variation in NBA performance, while the new version explains about 32%. This is an example of one step in the process of building a better projection model.
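An interaction effect is usually implemented as the product of two variables added as an extra column. The sketch below fits a model with and without an assists-by-turnovers product on simulated data and compares the share of variation explained (R²). The data, the size of the interaction, and the helper name `r_squared` are all assumptions for illustration.

```python
import numpy as np

def r_squared(X, y):
    """Share of the variation in y explained by a regression on X (with intercept)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ coef
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(3)
n = 400
ast = rng.normal(size=n)
tov = rng.normal(size=n)
# Made-up relationship in which the value of assists depends on turnovers.
y = ast - tov + 0.8 * ast * tov + rng.normal(scale=1.0, size=n)

main_effects = np.column_stack([ast, tov])
with_interaction = np.column_stack([ast, tov, ast * tov])  # product term = interaction

r2_main = r_squared(main_effects, y)
r2_inter = r_squared(with_interaction, y)
print("R^2 without interaction:", round(float(r2_main), 3))
print("R^2 with interaction:", round(float(r2_inter), 3))
```

One caution when reading such comparisons: R² measured on the same data used for fitting can never go down when a term is added, so the size of the gain matters more than its direction.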
I cannot reasonably go through every step, but many of the decisions worked along these lines. I also use some automatic model-selection tools like stepwise AIC, which help identify patterns I would miss picking through the data on my own.
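For a sense of what stepwise AIC does, here is a hand-rolled forward-selection loop in Python. It is a simplified sketch, not the tool used for this model: it only adds variables (never removes them), it uses the Gaussian form of AIC up to an additive constant, and the data and variable names are made up.

```python
import numpy as np

def aic(cols, X, y):
    """Gaussian AIC up to a constant: n * log(SSE / n) + 2 * (number of coefficients)."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef = np.linalg.lstsq(design, y, rcond=None)[0]
    sse = float(np.sum((y - design @ coef) ** 2))
    return n * np.log(sse / n) + 2 * design.shape[1]

def forward_stepwise(X, y, names):
    """Add one variable at a time, keeping it only if it lowers the AIC."""
    chosen, remaining = [], list(range(X.shape[1]))
    best = aic(chosen, X, y)  # intercept-only model as the starting point
    while remaining:
        score, j = min((aic(chosen + [k], X, y), k) for k in remaining)
        if score >= best:
            break  # no remaining variable improves the model enough
        best = score
        chosen.append(j)
        remaining.remove(j)
    return [names[j] for j in chosen]

rng = np.random.default_rng(4)
n = 300
names = ["ast", "stl", "noise_a", "noise_b"]  # hypothetical column names
X = rng.normal(size=(n, 4))
y = 1.5 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)  # only the first two matter

selected = forward_stepwise(X, y, names)
print(selected)
```

AIC rewards fit but charges a penalty for every extra coefficient, so a variable is kept only when it improves the fit by more than that penalty. Pure-noise columns can still slip in by chance, which is one reason automatic selection works best as a complement to judgment rather than a replacement for it.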
Fast-forwarding to the final product, I made a number of noteworthy changes. I included minutes-per-game and weight, and discriminated between offensive and defensive rebounding. I also included several interaction effects: one between steals and fouls, another between free-throw makes and attempts, and another between the two size variables (height and weight). Finally, I added a selection of team variables. After adjusting for the extra possessions it gives players, pace is actually a positive predictor of NBA success. The NBA game operates at a faster pace than most international competition, and this may be the reason that succeeding at a faster pace is a good sign. Players with teammates who take more shots inside than outside also project better. I assume this is due to the challenge of scoring efficiently without ideal floor-spacing. Players whose teammates collect more rebounds also look better. This is probably just a sloppy way of accounting for boxing out and “rebound-stealing” among teammates.
(To view the coefficients for the current international prospect model, follow this link. I continuously tinker, especially as I add more data, and any changes will be accounted for here.)
In addition to specific variables added to the model, there are some other tricks running under the hood. The regression gives more weight to observations with more minutes played, as well as observations on younger players (since projecting young players is the primary goal). In addition, to deal with the problem of some players having played many more seasons than others, I actually fit 100 models, with one randomly selected observation of each player used in each regression. The coefficients of the actual projection model are the averages across these 100 repeats. In its current form, the model explains about 45% of the variation in NBA performance of international players. This is much better than the first example, but there clearly remains a lot of important information the model is not capturing. That said, perfection is an unreasonable goal. I just want to make a tool that is useful. In the next article I will start looking at where the model would have done well or poorly with past data, and try applying it to players who have yet to spend time on an NBA court.
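The two tricks described above, weighting observations and averaging over repeated fits with one observation per player, can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the data are simulated, there is a single predictor, each observation is treated as a player-season, and the weighting formula is a made-up example of "more minutes and younger ages count for more."

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up player-season table: player id, minutes, age, one predictor, one outcome.
n_rows = 600
player = rng.integers(0, 150, size=n_rows)
minutes = rng.uniform(200, 2500, size=n_rows)
age = rng.integers(17, 30, size=n_rows)  # ages 17 through 29
x = rng.normal(size=n_rows)
y = 2.0 + 1.5 * x + rng.normal(size=n_rows)

# Hypothetical weighting scheme: more minutes and younger ages get larger weights.
weights = (minutes / minutes.max()) * (30 - age) / 13.0

def weighted_fit(rows):
    """Weighted least squares on the given rows; returns [intercept, slope]."""
    design = np.column_stack([np.ones(len(rows)), x[rows]])
    root_w = np.sqrt(weights[rows])
    # Scaling each row by sqrt(weight) turns weighted least squares into ordinary.
    return np.linalg.lstsq(design * root_w[:, None], y[rows] * root_w, rcond=None)[0]

fits = []
for _ in range(100):
    # Shuffle the rows, then keep the first row seen for each player,
    # which gives one random season per player for this fit.
    order = rng.permutation(n_rows)
    _, first = np.unique(player[order], return_index=True)
    fits.append(weighted_fit(order[first]))

# The final coefficients are the averages across the 100 repeated fits.
final_coef = np.mean(fits, axis=0)
print(np.round(final_coef, 2))
```

Sampling one row per player keeps long-tenured players from dominating any single regression, while averaging across many draws still lets every season contribute to the final coefficients.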