NBA Shot Success Prediction

Shot success predictions in the NBA using ML and Deep Learning

Introduction

The National Basketball Association (NBA) is one of the most popular sports leagues in the world, with millions of fans across the globe. Numerous factors have contributed to the NBA’s rise in popularity over the years.

First, the level of play is very high, with phenomenal athleticism and skills sometimes described as superhuman (we will come back to this point later). Second, iconic players such as Michael Jordan have embodied American basketball on and off the court. The sport also benefits from massive media coverage: the 2021 playoffs, for example, were broadcast in over 215 countries and territories, including major markets such as the United States, Canada, China, and Europe, as well as smaller markets in Africa, Asia, and South America. Lastly, one of the major drivers of the NBA’s worldwide popularity is the introduction of new rules that have made the game ever more spectacular.

Until the late 1970s, teams relied heavily on their tall players (like Wilt Chamberlain) to score near the basket, which led to slower, more defensive play. The introduction of the 3-point line in 1979 radically changed the game by expanding the range of scoring opportunities and adding a new strategic element. Today, the three-point shot is a fundamental part of basketball and has helped make the game more exciting and entertaining for fans. The figure below illustrates this trend: the total number of attempted 3-point shots has been multiplied by 8 since 1997 (the totals for 2011 and 2020 are lower because less data is available for those years).


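import matplotlib.pyplot as plt
import seaborn as sns

# df_all is the shot data frame, assumed to be loaded beforehand.
# The variable 'Shot Type' contains the type of shot: '2PT Field Goal' or '3PT Field Goal'. Here we only select 3PT Field Goal and group them by year.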
threes_by_year = df_all[df_all['Shot Type'] == '3PT Field Goal'].groupby('year').size().reset_index(name='Number of Shots')

fig, ax = plt.subplots(figsize=(15, 10))
sns.barplot(x='year', y='Number of Shots', data=threes_by_year)
plt.title('Number of 3-point Shots Taken by Year',fontsize = 25)
plt.xlabel('Year',fontsize = 20)
plt.ylabel('Number of Shots',fontsize = 20)
ax.tick_params(axis='x', labelsize=13)
ax.tick_params(axis='y', labelsize=13)
plt.gca().set_facecolor("white")
#ax.set_facecolor("white")
ax.spines['bottom'].set_color('black')
ax.spines['top'].set_color('black')
ax.spines['right'].set_color('black')
ax.spines['left'].set_color('black')
plt.grid(color='grey',linewidth=0.5)
plt.show()

The statistics are even more striking when comparing the share of 3-point and 2-point shots per game (figure below, left): 3-point attempts have grown from 15% of all shots in 1997 to about 40% in 2020. The figure on the right shows the proportion of shots taken from each shooting zone per game; shots from behind the 3-point line have multiplied over the years, indicating an overall move away from the basket.


import pandas as pd
import matplotlib.pyplot as plt

# calculate the total number of shots and the number of 3PTS shots for each year
fig, ax = plt.subplots(figsize=(15, 10))
total_shots = df_all.groupby('year').size()
three_pt_shots = df_all[df_all['Shot Type'] == '3PT Field Goal'].groupby('year').size()
two_pt_shots = df_all[df_all['Shot Type'] == '2PT Field Goal'].groupby('year').size()
# calculate the percentage of 3PTS shots for each year
three_pt_pct = (three_pt_shots / total_shots) * 100
two_pt_pct = (two_pt_shots / total_shots) * 100
# plot the percentage of 3PTS shots per game as a function of year
plt.plot(three_pt_pct.index, three_pt_pct,linewidth = 5,label="3PTS %")
plt.plot(two_pt_pct.index, two_pt_pct,linewidth = 5,label="2PTS %",color='red')

plt.xlabel('Year',fontsize=30)
plt.ylabel('% of Shots per game',fontsize=30)
plt.title('Evolution of % of Shots per game in the NBA',fontsize=30)
plt.legend(fontsize=25)
ax.tick_params(axis='x', labelsize=13)
ax.tick_params(axis='y', labelsize=15)
plt.gca().set_facecolor("white")
#ax.set_facecolor("white")
ax.spines['bottom'].set_color('black')
ax.spines['top'].set_color('black')
ax.spines['right'].set_color('black')
ax.spines['left'].set_color('black')
plt.grid(color='grey',linewidth=0.5)
plt.show()


import pandas as pd
import matplotlib.pyplot as plt

# Calculate the percentage of shots per game taken at each Shot Zone Basic
shots_by_zone = df_all.groupby(['year', 'Shot Zone Basic'])['Shot Made Flag'].count()
total_shots_by_year = df_all.groupby(['year'])['Shot Made Flag'].count()
percent_shots_by_zone = shots_by_zone / total_shots_by_year * 100

# Plot the evolution as a function of year
fig, ax = plt.subplots(figsize=(15, 10))
for zone in percent_shots_by_zone.index.levels[1]:
    ax.plot(percent_shots_by_zone[:, zone], label=zone,linewidth = 5)
ax.legend(fontsize = 15,loc='center left',bbox_to_anchor=(0, 0.575))
ax.set_xlabel('Year',fontsize = 20)
ax.set_ylabel('% of Shots Per Game',fontsize = 20)
ax.tick_params(axis='x', labelsize=13)
ax.tick_params(axis='y', labelsize=13)
plt.gca().set_facecolor("white")
#ax.set_facecolor("white")
ax.spines['bottom'].set_color('black')
ax.spines['top'].set_color('black')
ax.spines['right'].set_color('black')
ax.spines['left'].set_color('black')
plt.grid(color='grey',linewidth=0.5)
plt.show()

In the 90s, defense in the NBA was characterized by physical play, with players often relying on brute force and aggressive tactics to disrupt their opponents. Teams often played traditional man-to-man defense under the basket. Figure 2 (right) shows the percentage of shots per game in different areas of the court. We can see that the percentage of shots behind the 3-point line has drastically increased, going from 11% in 1997 to 30% nowadays. This new trend of long-range shooting is directly linked to the emergence of a previously unknown profession: the Data Scientist.

Data scientists have become essential to many NBA teams. For the past decade, every game has been closely scrutinized using dozens of cameras installed around the court, allowing every player’s movement to be meticulously analyzed. With this much data available, coaches need help analyzing and making sense of it. This is where data scientists come in: they help coaches identify patterns, trends, and ideas that may not be immediately apparent. For example, the Houston Rockets discovered that 3-point shots were ultimately the most « rewarding » type of shot in a game, much more so than mid-range shots (see Figure 2, right). Their best player, James Harden, was therefore tasked with shooting as many 3-pointers as possible. This new style of play elevated their performance, transforming them from a mid-table team into a championship contender. Since then, this style of play has spread throughout the NBA, and coaches must constantly adapt to their opponents’ new offensive and defensive strategies. Data scientists now use machine learning algorithms to assist coaches with decisions on player rotations, player development, and game strategy depending on the opponent.

In this Data Science project, we propose to work as NBA Data Scientists to analyze the probability of a player making a successful shot based on several parameters. One of the primary objectives of this project is to compare the shooting performance of twenty of the best NBA players of the 21st century listed below:

Tim Duncan
Kobe Bryant
Allen Iverson
Steve Nash
Ray Allen
Paul Pierce
Pau Gasol
Tony Parker
Manu Ginobili
Dwyane Wade
LeBron James
Chris Paul
Kevin Durant
Russell Westbrook
Stephen Curry
James Harden
Kawhi Leonard
Damian Lillard
Anthony Davis
Giannis Antetokounmpo

We want to determine if certain players have shooting preferences based on specific positions on the court and game situations, and provide a player-by-player analysis. Additionally, it would be interesting to conduct a position-by-position and zone-by-zone analysis to identify any general trends.

In the next phase, the objective is to build a model that predicts the success or failure of a shot and the probability of a basket for one of the ten currently active players. This model would consider various parameters such as shot distance, shooting angle, shot type, opponent team ranking, time remaining, score at the moment of the shot, and many others.

3. Data Visualization

Once the data frame is constructed, the next step is to make initial observations of the dataset’s values using graphs. Some observations allowed us to discard certain variables as they had little impact on shot success.

Firstly, we plotted a heat map representing the correlations between the different numerical variables in the dataset. We can observe that no variable in the data frame is correlated with the ‘Shot Made Flag’ target variable above 0.2, and the variables ‘Y Location’ and ‘Shot Distance’ are strongly correlated with each other.

Indeed, the variables ‘X Location’ and ‘Y Location’ on the one hand, and ‘Shot Distance’ and ‘angle_tir’ (the shooting angle) on the other, are correlated with each other because they all describe the player’s position at the time of the shot. Since no single variable is strongly correlated with the target, a rule based on one variable cannot predict the outcome; this motivates using an ML or DL model that combines several of these variables to predict whether a shot will go in the basket or not.
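The correlation check described above can be sketched as follows. This is only an illustration on synthetic data: the column names follow those quoted in the text, while the generated values, the random seed, and the assumption that locations are stored in tenths of a foot are ours.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the shot data frame (the real values are not reproduced here).
rng = np.random.default_rng(0)
n = 5_000
x = rng.uniform(-250, 250, n)   # lateral position
y = rng.uniform(0, 300, n)      # position along the basket axis
df = pd.DataFrame({
    "X Location": x,
    "Y Location": y,
    "Shot Distance": np.hypot(x, y) / 10,  # assumed: locations in tenths of a foot
})
# Illustrative relationship only: shots get harder as distance grows.
p_made = np.clip(0.65 - 0.012 * df["Shot Distance"], 0.05, 0.95)
df["Shot Made Flag"] = (rng.random(n) < p_made).astype(int)

corr = df.corr()  # Pearson correlation matrix of the numerical columns
target_corr = corr["Shot Made Flag"].drop("Shot Made Flag")
print(target_corr.sort_values())
print(corr.loc["Y Location", "Shot Distance"])
```

The `corr` matrix can then be passed to a heat map function such as `sns.heatmap(corr, annot=True)` to obtain a figure of the kind discussed here.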

In the rest of this section, we plotted numerous graphs to observe the impact of each of these variables on the target variable ‘Shot Made Flag’. We will review the 13 columns that we kept in our final data frame presented earlier.

One of the initial objectives was to divide the court into zones. One of the datasets already had three variables that divided the court into zones.

Figure 12 presents the three types of mappings: based on distance (shot_zone_range), shot zone area (shot_zone_area), and key zones (shot_zone_basic).

The shooting success rate is relatively consistent in all the different zone locations, except for the « backcourt » position, which corresponds to the other half of the court, where the success rate drops to less than 4%. This is expected as it is obviously very difficult to score from such a great distance from the basket.
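A success rate per zone of this kind can be computed with a simple group-by. The sketch below uses a handful of made-up rows; only the column names 'Shot Zone Basic' and 'Shot Made Flag' come from the project.

```python
import pandas as pd

# Tiny illustrative sample; the rows are made up.
shots = pd.DataFrame({
    "Shot Zone Basic": ["Restricted Area", "Restricted Area", "Mid-Range", "Mid-Range",
                        "Above the Break 3", "Above the Break 3", "Backcourt", "Backcourt"],
    "Shot Made Flag": [1, 1, 1, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 flag is the success rate; count gives the number of attempts.
zone_stats = (
    shots.groupby("Shot Zone Basic")["Shot Made Flag"]
    .agg(attempts="count", success_rate="mean")
    .sort_values("success_rate", ascending=False)
)
print(zone_stats)
```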

3.2 Shot Type: « df.shot_type »

The shot type has only two parameters: 2 points and 3 points. The success rate for 2-point and 3-point shots is shown in the figure below:

Figure 14 – Success percentage of 2-point shots (left) and 3-point shots (right)

We can observe that the success rate for 2-point shots is nearly 50%, while for 3-point shots, it is around 35%. However, as mentioned earlier, the importance of the 3-point shot in the current gameplay promises an improvement in this statistic as players increasingly work on their long-distance shooting.

Figure 15 – Players’ shooting success rates based on shot type

If we look at each player’s success rate for 2-point and 3-point shots in the figure above, we can see that each player has their own strengths. Players like Giannis Antetokounmpo or LeBron James excel in 2-point shooting, while Stephen Curry and Steve Nash stand out in 3-point shooting.

The success percentage is higher for 2-point shots, and 3-point shooting results can be expected to improve.

3.3 Action Type: « df.action_type »

The action type is a broad variable containing 70 different types of shots, ranging from dunks to finger rolls (a delicate rolling of the ball off the fingertips into the basket). The lines not provided correspond to actions without data or with a 0% success rate. We can observe that the slam dunk, which involves forcefully dunking the ball into the hoop, has the highest success rate. In contrast, a complex action like the « turn around fadeaway, » which involves turning around and shooting while taking a step back, is one of the most challenging actions to execute and thus has the lowest success rate. These numerous actions and their high success rates once again demonstrate the spectacular nature of the NBA. In the context of establishing a predictive model, one may question the relevance of the action type column since, in the end, we want to predict the success of a shot without specifying the specific type of shot the player will take.

The above figure highlights that certain players have their own « signature shots » with a higher success rate. This is particularly the case for Tim Duncan with the layup, and for Allen Iverson with the Pullup Jump Shot, the Fadeaway Jump Shot, and the Step Back Jump Shot.

3.4 Distance and Shooting Angle: « df.distance, df.shooting_angle »

The shooting distance is one of the most important variables in our model. It also shows one of the stronger correlations with the target variable ‘shot_made_flag,’ as observed in the correlation table in Figure 11. Figure 18 (left) shows the success rate of a shot as a function of distance. As expected, the success rate decreases as the distance (in feet) increases.

It is also interesting to examine the success rate from the perspective of the shooting angle on the court. The angle is calculated from the axis that runs from the basket and divides the court in half: this axis is at 0°. The two corners are at 90°. Figure 18 on the right shows the success rate of shots based on the angle. At 0° and 90°, the success rate is the highest. At 0°, the player is aligned with the basket, so it is expected to have a higher success rate. However, the high success rate at 90° can be surprising. This can be explained by the fact that players in the corners are often left unguarded by the defense, giving them more time to set up their shots. For the rest of the angles, the success rate is approximately between 40% and 45%.

Figure 18 – (left) Success rate based on distance (right) Success rate based on shooting angle relative to the basket

The shooting distance strongly impacts the success rate: the shorter the distance, the higher the chances of a successful shot. The shooting angle also influences the shot’s success.
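The angle definition given above (0° on the axis through the basket, 90° in the corners) can be sketched as follows. The coordinate convention (basket at the origin, x lateral, y along the axis), the function name `shooting_angle`, and the sample rows are assumptions made for illustration; the project's own computation of ‘angle_tir’ may differ.

```python
import numpy as np
import pandas as pd

def shooting_angle(x, y):
    """Angle in degrees between the shot position and the basket axis.

    Assumes the basket is at the origin, x is the lateral offset and y the
    offset along the axis that splits the court in half.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # abs(x) folds the left and right sides together; clipping y at 0 keeps
    # positions level with or behind the basket at 90 degrees.
    return np.degrees(np.arctan2(np.abs(x), np.clip(y, 0, None)))

# Made-up shots: straight on, from the corner, and two at 45 degrees.
shots = pd.DataFrame({"x": [0, 220, -100, 100], "y": [100, 0, 100, 100], "made": [1, 1, 0, 1]})
shots["angle"] = shooting_angle(shots["x"], shots["y"])

# Success rate per 30-degree slice of the court.
shots["angle_bin"] = pd.cut(shots["angle"], bins=[0, 30, 60, 90], include_lowest=True)
rate_by_angle = shots.groupby("angle_bin", observed=True)["made"].mean()
print(rate_by_angle)
```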

3.5 Player Position and Age: « df.position, df.age »

The player’s position is a factor that has a significant impact on shooting success for several reasons. Firstly, different positions have different roles and responsibilities on the court. Even though the game has been evolving with more « hybrid » roles, point guards are generally smaller than their teammates but quick and skilled in dribbling. They aim to bring the ball up the court, dictate tactics, and distribute the game. They often penetrate the defense to finish with a layup or a pass. Shooting guards are fast and agile players capable of scoring from both 2-point and 3-point range, while also driving into the paint for layups or dunks. Small forwards are versatile players with the mission to shoot from the wings and assist the center in defensive actions and rebounds. Centers and power forwards are typically the tallest players on the court and are responsible for rebounding, defense near the basket, and scoring close to the basket in offense. However, power forwards operate further from the basket than centers, allowing for more versatility in shot selection.

Figure 19 – (left) Success rate based on player position (right) Success rate based on player age

Figure 19 (left) presents a boxplot of the success rate for our players based on their positions. We can observe a significant disparity in the point guard position. This can be explained by differences in shooting accuracy for that position in our list of players. It is also interesting to note that shooting guards, small forwards, and centers display a high shooting accuracy.

Age is often said to be just a number, and this holds true when observing the shooting success rate based on age in Figure 19 (right). We observe a relatively consistent success rate around 50%, with a slight decrease at the extremes: 18-19 years old and 39-40 years old. This can be explained by the fact that young players entering the league need some time to adapt from their high school/college background to the professional league. Additionally, as players approach their forties, their shooting percentage decreases along with their declining athletic ability.

Returning to the player’s position, in Figure 20, we plotted the shooting success based on the player’s position in each zone of the court, which is represented in the ‘shot_zone_area’ column described in Figure 12. This figure allows us to see whether players have shooting preferences based on their position. Naturally, for the « Back court » zone, which is the other half of the court, the success rate is lower, even reaching zero for power forwards and centers. As predicted earlier, the center of the court is often chosen by power forwards and centers, resulting in a higher success rate compared to other positions. The point guard position proves to be very versatile and is represented in all zones of the court with a very good success rate, both in the center and on the sides.

Figure 20 – Shooting success based on player position

In Figure 21, we represented the best five players from our list for each zone on the court. It is interesting to see that the player with the most TOP 1 rankings is Steve Nash, an iconic player for the Dallas Mavericks and the Phoenix Suns, who played as a point guard. This confirms the versatility of a player in that position, often taking shots from the sides due to their smaller stature. Conversely, the central zone is dominated by small forwards and power forwards, who are often taller and more powerful, featuring versatile players like LeBron James, Anthony Davis, and Giannis Antetokounmpo. Furthermore, while Stephen Curry is known for making many shots from the other half of the court, he ranks only third in the « Back Court » category. However, this can be explained by his higher number of attempts in that zone, which affects his success rate.

Figure 21 – Shooting success based on player position

3.6 Court Location: « df.x_location and df.y_location »

The « x_location » and « y_location » parameters in a basketball dataset are crucial for creating a predictive model of shot success, as they represent the shooter’s location on the court at the time of the shot. The shooter’s location significantly impacts the probability of a successful shot. This is because different areas on the court have different levels of difficulty associated with them, based on the distance from the basket and the shooting angle, as we have seen before. We have observed that shots taken closer to the basket are generally easier to make than those taken from further away. Shots taken from the sides of the court, at angles between 5 and 40 degrees, are often more challenging than those taken from the center (or even from the corner, as we have also seen). Additionally, although we don’t have this data in our dataset, the distances between the shooter and the defenders are also directly linked to the shooter’s location on the court and therefore closely related to their shooting success.

In this section, we created an interactive figure where shooting positions are represented as hexagons, with the color corresponding to the success rate and the size representing the frequency of shots. To achieve this, each horizontal line on the court is divided into 70 hexagons. The choice of hexagons is strategic as they efficiently fill the space on the court, leaving few empty spaces. Furthermore, the figure includes two dropdown menus where we can choose the player’s name and the year. Below, the success and frequency of shots for all players and years in the data frame are represented. On the left, we have 2-point shots, and on the right, 3-point shots.
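A static version of such a hexagon map can be sketched with matplotlib's `hexbin`. This is not the interactive figure described above (no dropdown menus, and hexagon size is not scaled by frequency); the shot coordinates are randomly generated and the relationship between distance and success is invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up shot positions and outcomes.
rng = np.random.default_rng(1)
n = 20_000
x = rng.uniform(-250, 250, n)
y = rng.uniform(-50, 420, n)
made = (rng.random(n) < np.clip(0.7 - 0.001 * np.hypot(x, y), 0.05, 0.95)).astype(float)

fig, ax = plt.subplots(figsize=(8, 7))
# gridsize=70 puts 70 hexagons across the width of the court.
# With C given, each hexagon is colored by the mean of `made`, i.e. the success rate.
hexes = ax.hexbin(x, y, C=made, gridsize=70, reduce_C_function=np.mean, cmap="RdYlGn")
fig.colorbar(hexes, ax=ax, label="Success rate")
rates = hexes.get_array()  # one success rate per non-empty hexagon

# Summing a column of ones over the same grid gives the number of attempts per hexagon.
_, ax_counts = plt.subplots()
counts = ax_counts.hexbin(x, y, C=np.ones(n), gridsize=70, reduce_C_function=np.sum).get_array()
```

The `counts` array could be used to filter out hexagons with very few attempts, whose success rate is not meaningful.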

These figures summarize what we have observed in our previous figures:

The success rate for 2-point shots ranges between 30% and 40%.
The success rate for 3-point shots is around 30%.
The success rate for shots under the basket is high, above 80%.
The greater the distance, the lower the success rate.

Below, we have represented heatmaps for two players with completely different profiles and positions. Stephen Curry, the point guard for the Golden State Warriors, is known for his exceptional shooting accuracy from beyond the arc. He often takes long-distance shots and shoots from over a meter behind the 3-point line, which allows him to evade the defense. Figure 23 represents his favorite zones and his shooting percentage. For 2-point shots, it can be observed that he does not frequently shoot from the two corners of the key. These data give the opposing team valuable information for minimizing Stephen Curry’s influence on the game. In his case, coaches will certainly try to defend him away from the basket.

The success rate remains higher for 2-point shots. Each player has a different playing style depending on their position, which can be visualized through the heatmaps.

Figure 23 – Shooting success and frequency based on court position for point guard Stephen Curry in 2015

Figure 24 – Shooting success and frequency based on court position for center Tim Duncan in 2015

Tim Duncan, on the other hand, is a center who spent his entire career with the San Antonio Spurs, forming one of the most prolific trios in NBA history with Tony Parker and Manu Ginobili, earning four NBA championship rings. Figure 24 represents his shooting zones, and it is immediately noticeable that 3-point shooting is not his strong suit. Tim Duncan is more of a back-to-the-basket player, and when he receives the ball, it often results in a dunk or a mid-range shot, as suggested by the left side of Figure 24. The preferred shooting zones for all players in our dataset can be found in the Appendix, at the end of the manuscript.

By incorporating these spatial parameters into a predictive model, it becomes possible to assess the impact of the shooter’s location on the shooting probability. This analysis can help identify the areas on the court where a particular player is most effective and areas where they may need to improve their shooting ability. The predictive model can also assist coaches and analysts in making strategic decisions about which players to use in specific situations based on their shooting success rates at different locations.

4. Machine Learning

4.1 Model Architecture

Before building the machine learning (ML) model, we imported the necessary libraries: pandas; train_test_split, preprocessing, and GridSearchCV from scikit-learn; and the libraries specific to each model (linear_model, DecisionTreeClassifier, XGBoost). Once the data frame (DF) was imported, we separated the data into two DFs: a ‘target’ DF with the ‘Shot Made Flag’ variable, and a ‘features’ DF with the remaining variables. We then split the data into a training set and a test set using the train_test_split function, with 20% of the total data as the test set. Finally, we standardized the x_train and x_test variables using the StandardScaler class from the preprocessing submodule.
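The preparation steps above can be sketched as follows. The data frame is a random stand-in, and two details are assumptions not stated in the text: the fixed `random_state`, and the choice to fit the scaler on the training set only and reuse it on the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in for the real data frame.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Shot Distance": rng.uniform(0, 30, 1000),
    "angle_tir": rng.uniform(0, 90, 1000),
    "Shot Made Flag": rng.integers(0, 2, 1000),
})

target = df["Shot Made Flag"]
features = df.drop(columns="Shot Made Flag")

# 20% of the rows are held out as the test set.
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learn mean and std on the training data
x_test_scaled = scaler.transform(x_test)        # apply the same transformation to the test data
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step.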

4.2 Logistic Regression

The first model we created was one of the simplest to implement and works well when the output values are binary, as is the case in our study: logistic regression. We applied a hyperparameter search using the GridSearchCV function on three parameters: the ‘C’ value, the solver, and the penalty. The hyperparameters found were C: 0.1, penalty: l2, and solver: liblinear. The best score of the model is 0.623, and the confusion matrix is presented below.

Figure 25 – Confusion matrix for Logistic Regression
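A hyperparameter search of this kind can be sketched with GridSearchCV. The candidate values in `param_grid`, the number of folds, and the synthetic data are assumptions; the text only names the three parameters searched and the values finally retained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the shot features.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear", "saga"],  # both solvers support l1 and l2 penalties
}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3, scoring="accuracy")
search.fit(x_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
# Rows are the true classes, columns the predicted classes.
cm = confusion_matrix(y_test, search.predict(x_test))
print(cm)
```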

Data separation: training set and test set. A best score of 0.623 for logistic regression.

4.3 Decision Tree

In the second step, we trained a decision tree. We optimized the ‘criterion’ parameter and the maximum tree depth. The hyperparameters we obtained were criterion: entropy and max_depth: 10. The best score of the model is 0.634, and the confusion matrix is presented below. Compared to logistic regression, this model predicts class 1 (successful shot) slightly better. It is not feasible to plot the tree to explain the model, as the tree is too deep and involves too many variables.

Figure 26 – (left) Confusion matrix for Decision Tree (right) Importance of different model parameters

We can see, however, that some variables carry more weight than others in this model’s predictions. This is the case for two action types, Jump Shot and Dunk Shot, as well as the remaining time and the leading variable, which indicates whether the player’s team is behind, leading, or tied with the opponent.
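When the tree is too deep to plot, the `feature_importances_` attribute gives a compact view of which variables it relies on. The sketch below tunes the two parameters named above on synthetic data; the candidate values and the feature names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])  # hypothetical names

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"criterion": ["gini", "entropy"], "max_depth": [3, 5, 10, None]},
    cv=3,
)
search.fit(X, y)

# Importances of the refitted best tree, one value per feature, summing to 1.
importances = pd.Series(search.best_estimator_.feature_importances_, index=X.columns)
print(search.best_params_)
print(importances.sort_values(ascending=False))
```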

4.4 Random Forest

4.4.1 First attempt

In the third step, we trained a Random Forest using the same hyperparameter search approach. The search selected 800 trees in the forest (n_estimators) and the ‘gini’ criterion. The graph below illustrates this choice. The impact of a small parameter change on the model’s performance is minimal, only 0.0004 in this case. The best score of the model is 0.623, and the confusion matrix is presented below. We can see that the model predicts class 1 (successful shot) slightly better but predicts class 0 (missed shot) slightly worse compared to the first two models.

Figure 27 – (left) Confusion matrix for Random Forest (right) Hyperparameter tuning

4.4.2 Second attempt

Since the action_type variable is particularly important, a second attempt was made by keeping the data for this variable intact, i.e., without grouping them into categories as done in the data visualization step. This resulted in 70 unique values instead of 8.

This significantly increased the size of the data frame after encoding the categorical variables, but it yielded better results. Without any parameter tuning, we obtained a score of 0.652, and after applying a GridSearch method, the score improved to 0.665. Furthermore, to maintain a minimized data frame size without impacting the model’s performance, we found that using the Variance Threshold method (t=0.01) allowed us to achieve a nearly similar score while reducing the dataframe to 23 variables.
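The effect of the variance threshold on one-hot encoded action types can be sketched as follows. The action counts are invented; only the threshold value of 0.01 comes from the text. A 0/1 column with a share p of ones has variance p(1 - p), so very rare action types fall below the threshold and are dropped.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# 100 made-up shots: three common action types and one that occurs only once.
actions = pd.DataFrame({
    "action_type": ["Jump Shot"] * 60 + ["Layup Shot"] * 30 + ["Dunk Shot"] * 9 + ["Hook Bank Shot"],
})
encoded = pd.get_dummies(actions, columns=["action_type"], dtype=float)

selector = VarianceThreshold(threshold=0.01)
reduced = selector.fit_transform(encoded)
kept = encoded.columns[selector.get_support()]

print(encoded.shape, "->", reduced.shape)
print(list(kept))
```

Here the « Hook Bank Shot » column has variance 0.01 × 0.99 = 0.0099, which is below 0.01, so it is removed while the three frequent action types are kept.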

Figure 28 – Confusion matrix for the second attempt

A best score of 0.665 for the random forest.

4.5 XGBoost

Finally, we wanted to test one last model: XGBoost, known as one of the best-performing ML models. After optimizing the model’s parameters (max_depth = 9, learning_rate = 0.01, gamma = 1, n_estimators = 100, eval_metric = ‘aucpr’, and n_jobs = 3), we trained our model on the training data. Two different scores appeared and raised a question: 0.72, read from the best_score_ attribute, and 0.67, obtained by scoring the model on the test data. After manually calculating the score from the model’s predictions and the ‘Shot Made Flag’ values, the obtained score was indeed 0.67. The score method returns the mean accuracy on the test data, while best_score_ holds the best value of the evaluation metric reached at the time of early stopping. Early stopping halts training once the evaluation metric stops improving and retains the best value reached. Because the evaluation metric here is ‘aucpr’, that value is presumably an area under the precision-recall curve rather than an accuracy, so the two numbers are not directly comparable. The confusion matrix below shows that the model is slightly less accurate in predicting class 0 than the first two models but performs better than all of them in predicting class 1.

Figure 29 – (left) Confusion matrix for XGBoost, (right) XGBoost report
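The manual check of the test score, and the reason an accuracy and a precision-recall metric can differ on the same predictions, can be illustrated on a toy example. The labels and probabilities below are made up, and scikit-learn's `average_precision_score` is used as a stand-in for XGBoost's ‘aucpr’ metric, which is computed slightly differently.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
proba = np.array([0.9, 0.4, 0.6, 0.45, 0.55, 0.2, 0.8, 0.3])  # made-up predicted probabilities
y_pred = (proba >= 0.5).astype(int)

manual_accuracy = (y_pred == y_true).mean()     # share of correct predictions
accuracy = accuracy_score(y_true, y_pred)       # same number via scikit-learn
aucpr = average_precision_score(y_true, proba)  # summary of the precision-recall curve

print(manual_accuracy, accuracy, aucpr)
```

On this toy data the accuracy is 0.75 while the average precision is 0.95: the same model yields two different numbers because the two metrics measure different things.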

After building the different models, we decided to keep the best-performing one, XGBoost, and to interpret it as well as possible. To do so, we use the SHAP library to understand the predictions made by the model on our dataset, both globally and locally. This lets us see which variables have a significant impact on our model and which have less influence.

Before attempting to interpret the model’s predictions globally and locally, we first examined the relative importance of the variables on the model’s predictions. The two graphs below demonstrate this.

A best score of 0.67 for XGBoost, which appears to be the best-performing model.

Figure 30 – (left) Ranking of average global variable importance (right) Impact of variables on the model

The first graph represents the global importance of variables calculated using SHAP values. The importance is calculated by averaging the absolute values of the SHAP values. From this graph, we can draw some interpretations: the variables that seem to have the most impact on the model’s predictions are the action types Jump Shot and Layup Shot, the distance of the shot taken, and the leading variable discussed earlier. Regarding the action variables, it is logical that they have a significant impact, as they are the two most frequent action types in our dataset. In contrast, many of the other shot action variables appear to have little to no importance in the model, as do the home variable and the 3-point shot variable.
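The averaging step can be written in a few lines. The matrix below is a hypothetical set of SHAP values (one row per shot, one column per variable), not output from the actual model.

```python
import numpy as np

# Hypothetical SHAP matrix: one row per shot, one column per variable.
feature_names = ["Jump Shot", "Shot Distance", "leading"]
shap_values = np.array([
    [-0.40, -0.10,  0.05],
    [ 0.30,  0.20, -0.05],
    [-0.50, -0.30,  0.10],
    [ 0.40,  0.00, -0.10],
])

# Global importance: mean of the absolute SHAP values per variable.
global_importance = np.abs(shap_values).mean(axis=0)
ranking = [feature_names[i] for i in np.argsort(global_importance)[::-1]]

# The signed mean can be close to zero even for an important variable,
# because positive and negative contributions cancel out.
signed_mean = shap_values.mean(axis=0)

print(dict(zip(feature_names, global_importance)))
print(ranking)
```

Taking absolute values is what makes this a measure of importance rather than of direction: a variable that pushes predictions strongly both up and down still ranks high.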

However, these observations from the left graph are not necessarily reflected in the right graph. The right graph represents the positive or negative SHAP values of each variable in their order of importance (according to the left graph) for each data point in our dataset. Each data point, representing a shot from our dataset, is represented as a point on the right graph, and the color of the variables indicates their impact, with high variable values shown in red and low values shown in blue. We can make several observations, some of which are logical based on this graph:

The action types « Jump Shot » and « Layup Shot » have the most influence on the predictions.
As expected, the distance of the shot negatively impacts the model’s predictions when it becomes large.
The frequently occurring Jump Shot (over 120,000 entries) has a negative impact on the model. This observation, combined with the previous graph showing that the jump_shot variable has the most impact on the model’s decisions, should be taken into account when analyzing the model’s performance.
A Dunk shot is generally predicted as successful in most cases due to its high importance in the predictions.
Leading in the score has a positive impact, while being behind has a negative impact.
A low opponent’s win percentage (w_pct) indicates a higher probability of scoring the basket.
A shot with a low remaining time, which can be considered rushed, has a negative impact on the model’s prediction and is generally predicted as a miss.
A Layup Shot or a Tip Shot is more likely to be predicted as a miss by the model.
The interpretation of other variables is less clear as many data points are centered around zero or do not show distinct observations.

After conducting a global analysis of the model, we can now perform a local interpretation of the model. We will compare four randomly selected data points from the dataset:

One representing a successful shot and correctly predicted as successful by the model.
Another representing a successful shot but incorrectly predicted as a miss by the model.
A missed shot correctly predicted as a miss.
And a missed shot where the model predicted the opposite.

In the following graphs, variables colored in red have a positive impact, meaning they contribute to the model’s prediction being higher than the base value (which is -0.1058 in this case). Variables in blue contribute to a lower prediction compared to the base value and have a negative impact.
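To make the base value more concrete, the sketch below converts it to a probability. It assumes that the base value is expressed in log-odds, which is the usual raw output scale for a binary gradient-boosted classifier but is not stated in the text; the three contributions are hypothetical.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

base_value = -0.1058  # from the text, assumed to be in log-odds
# Hypothetical SHAP contributions for one shot (red if positive, blue if negative).
contributions = {"leading": 0.30, "Shot Distance": -0.12, "period": 0.05}

# SHAP values are additive: the raw output is the base value plus all contributions.
raw_output = base_value + sum(contributions.values())
base_probability = sigmoid(base_value)
shot_probability = sigmoid(raw_output)

print(round(base_probability, 3), round(shot_probability, 3))
```

Under this assumption, the base value corresponds to an average predicted success probability of about 47%, and the contributions of each variable move an individual shot above or below that level.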

Figure 31 – Successful shot with incorrect prediction

Figure 32 – Successful shot with correct prediction

We can see that when the model correctly predicts a successful shot, certain variables such as period, leading, and score_margin had a positive impact on the model’s decision. Conversely, in Figure 31, the model relied on the leading variable and the shot’s distance to predict it as a miss, even though it was actually successful.

Another way to graphically interpret the model’s decisions locally is shown below. For a missed shot, the model predicted a make, relying mainly on the Jump Shot and leading variables, even though the shot’s distance pointed toward a miss. The three graphs convey similar information but differ in how easy they are to read.

5. Deep Learning

5.1 Model Architecture

Once the ML model was optimized, we explored Deep Learning (DL) as an approach. As discussed in the module, adding neural network layers might improve performance compared to the ML model. To begin, we imported the necessary libraries: pandas, TensorFlow, train_test_split and preprocessing from scikit-learn, and several Keras components (to_categorical, Model, Input, Dense, Dropout, callbacks).

After importing the data frame (DF), we separated the data into two DFs: a ‘target’ DF with the ‘Shot Made Flag’ variable and a ‘features’ DF with the remaining variables. We then split the data into a training set and a test set using the train_test_split function, with 20% of the data set aside for testing. Finally, we standardized x_train and x_test using StandardScaler from the preprocessing submodule. For the DL model to work, we one-hot encoded y_train and y_test into two columns, labeled ‘0’ and ‘1’, indicating a missed and a made shot respectively, with binary values inside (0 for false and 1 for true).

Next, we focused on modeling. For simplicity, we decided to concentrate our DL work on Dense Neural Networks with stacked dense layers. The model was built with the Keras functional API rather than the Sequential API, although it was not complex and did not involve multiple inputs. The initial models were simply a sequence of dense layers, starting with two dense layers and then progressing to three to improve performance (see Figure 34). The final layer was necessarily a two-neuron layer with a ‘softmax’ activation function, returning the probability of belonging to each of the two classes. A hyperparameter search similar to GridSearchCV was conducted to optimize our model and will be explained in an upcoming section.
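The data preparation steps can be sketched as follows. This is a minimal illustration on synthetic data, not the report's actual code: only the 'Shot Made Flag' column name comes from the text, while the feature names and the random seed are hypothetical. The one-hot step uses NumPy to keep the sketch light; Keras's to_categorical produces the same two-column layout for binary labels. Fitting the scaler on the training set only is our assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the report's data frame (feature names are hypothetical).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "shot_distance": rng.uniform(0, 30, size=1000),
    "seconds_remaining": rng.integers(0, 720, size=1000),
    "leading": rng.integers(0, 2, size=1000),
    "Shot Made Flag": rng.integers(0, 2, size=1000),
})

# Separate the target from the features.
target = df["Shot Made Flag"]
features = df.drop(columns="Shot Made Flag")

# Hold out 20% of the data for testing.
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# Standardize: fit on the training set, then apply the same transform to the test set.
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# One-hot encode the labels: column 0 = missed shot, column 1 = made shot.
y_train_cat = np.eye(2)[y_train.to_numpy()]
y_test_cat = np.eye(2)[y_test.to_numpy()]

print(x_train_scaled.shape, y_train_cat.shape)
```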

Figure 34 – Neural Network diagram

Once the basic model was constructed, we implemented callbacks to limit certain learning biases. First, we applied EarlyStopping to prevent the model from continuing to train without further improvement: the loss was monitored, and training stopped if it did not decrease for five consecutive epochs. In parallel, we introduced ReduceLROnPlateau to reduce the learning rate of the model’s optimizer (we chose Adam). The optimizer updates the network’s weights from the gradients computed by backpropagation. We chose to multiply the learning rate by 0.1 whenever the loss had not decreased for three consecutive epochs (i.e., lr = lr * 0.1). Next, we compiled the model with the ‘categorical_crossentropy’ loss function, suitable for a classification problem with one-hot encoded targets, and the metric [‘accuracy’]. Then, we trained our model for 20 epochs with a batch size of 200. We obtained the two graphs presented below:

One representing the accuracy metric values during different epochs for the training and validation data.
The second graph displaying the loss values during different epochs for the training and validation data.

Figure 35 – (left) Variation of accuracy metric with the number of epochs (right) Variation of loss with the number of epochs
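The model construction, callback, compilation, and training steps described above can be sketched as follows. This is a hedged illustration on synthetic data: the layer sizes, the number of input features, and the choice to monitor the validation loss (rather than the training loss) are our assumptions, while the patience values, the reduction factor, the optimizer, the loss, the number of epochs, and the batch size follow the text.

```python
import numpy as np
from tensorflow import keras

# Synthetic data standing in for the standardized features and one-hot targets.
rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 10)).astype("float32")
labels = (x[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)
y = keras.utils.to_categorical(labels, num_classes=2)

# Functional-API model: three dense layers (hypothetical sizes) and a softmax output.
inputs = keras.Input(shape=(10,))
h = keras.layers.Dense(32, activation="relu")(inputs)
h = keras.layers.Dense(16, activation="relu")(h)
h = keras.layers.Dense(8, activation="relu")(h)
outputs = keras.layers.Dense(2, activation="softmax")(h)
model = keras.Model(inputs=inputs, outputs=outputs)

# Stop training after five epochs without improvement of the monitored loss.
early_stopping = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)

# Multiply the learning rate by 0.1 after three epochs without improvement.
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.1, patience=3
)

model.compile(
    optimizer=keras.optimizers.Adam(),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(
    x,
    y,
    validation_split=0.2,  # assumption: part of the training data is used for validation
    epochs=20,
    batch_size=200,
    callbacks=[early_stopping, reduce_lr],
    verbose=0,
)

# history.history holds the per-epoch values behind accuracy and loss curves.
print(sorted(history.history.keys()))
```

The per-epoch lists stored in history.history ("accuracy", "val_accuracy", "loss", "val_loss") are what curves like those in Figure 35 are drawn from.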

We then obtained the model’s score and its confusion matrix. These results will be presented and discussed in an upcoming section. After this initial phase, we sought to optimize certain model parameters using the keras_tuner module. First, we created a model_builder function that incorporated all the previously explained steps (the model layers, callbacks, model compilation, and training) while varying several parameters: the number of neurons and the activation functions in the first two dense layers, and the learning rate of the Adam optimizer. The remaining parameters were fixed.

Once this function was created, we applied the Hyperband function from the keras_tuner module and used the search function to run a series of trials, varying the aforementioned parameters. The best score was then recorded, and the get_best_hyperparameters function allowed us to determine which parameters led to this score.

Furthermore, we explored whether modifying the batch_size and epochs parameters could significantly influence the model’s performance. For a batch_size larger than 1000, both accuracy values and the loss function deteriorated. An optimal batch_size value appeared to be around 200 or 300.

Finally, during our experiments, we noticed overfitting: the model kept improving on the training data over the epochs even though its performance on the test data had reached a plateau. To address this issue, we modified the model’s structure by inserting a Dropout layer with a 20% dropout rate between each pair of dense layers. The final Deep Learning model had the structure described in the accompanying figure:

Figure 37 – Summary of the final Deep Learning model
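As an illustration of this structure, the sketch below builds a dense network with a Dropout layer after each hidden dense layer. Only the 20% dropout rate and the two-neuron softmax output come from the text; the number of input features and the layer sizes are hypothetical.

```python
from tensorflow import keras


def build_dropout_model(n_features, hidden_units=(64, 32, 16), dropout_rate=0.2):
    """Stack dense layers, with a Dropout layer between consecutive dense layers."""
    inputs = keras.Input(shape=(n_features,))
    h = inputs
    for units in hidden_units:
        h = keras.layers.Dense(units, activation="relu")(h)
        # Randomly zero 20% of the activations during training to limit overfitting.
        h = keras.layers.Dropout(dropout_rate)(h)
    outputs = keras.layers.Dense(2, activation="softmax")(h)
    return keras.Model(inputs=inputs, outputs=outputs)


model = build_dropout_model(n_features=12)  # 12 input variables is a placeholder
n_dense = sum(isinstance(layer, keras.layers.Dense) for layer in model.layers)
n_dropout = sum(isinstance(layer, keras.layers.Dropout) for layer in model.layers)
model.summary()
```

Dropout is only active during training; at prediction time the layers pass their inputs through unchanged.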

The metric we used to evaluate the model’s score was accuracy_score, which was found to be 0.660. The following confusion matrix provides a better understanding of the model’s performance and how it predicts shots compared to reality.
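For reference, here is a small sketch of how an accuracy score and a confusion matrix are derived from softmax outputs. The six probability rows and the true labels are made up for illustration and do not come from our data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical softmax outputs for six shots: columns are P(missed), P(made).
probabilities = np.array([
    [0.70, 0.30],
    [0.40, 0.60],
    [0.55, 0.45],
    [0.20, 0.80],
    [0.90, 0.10],
    [0.35, 0.65],
])
y_true = np.array([0, 1, 1, 1, 0, 0])  # 0 = missed, 1 = made

# The predicted class is the column with the highest probability.
y_pred = probabilities.argmax(axis=1)

accuracy = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(f"accuracy: {accuracy:.3f}")
print(matrix)
```

In this toy case four of the six shots are classified correctly, and the matrix shows one missed shot predicted as made and one made shot predicted as missed.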

Conclusions and Perspectives

The sheer volume of information in the various databases first had to be sorted before it could be exploited meaningfully. The data visualization work allowed us to clean the data. We then worked on machine learning models and also experimented with deep learning.

Here is a summary of the results for the machine learning models:

Logistic regression: best score of 0.621.
Decision tree: best score of 0.655.
Random forest: best score of 0.665.
XGBoost: best score of 0.67.

Among all the models presented in the last two sections, the one chosen is XGBoost for its superior performance and the possibility of interpretability, albeit complex. However, despite all our efforts to improve the model, we were never able to achieve a score higher than 0.67, which remains rather low.

Upon analyzing the different results, we found that the variable ‘action_type_jump_shot’ was by far the one with the most impact on the model’s decisions.

Limitations:

Limited results: no model scored above 0.67.
Machine learning results are relatively similar at first glance.
XGBoost is interpretable, but only through a complex analysis.
A significant presence of ‘jump shot’ entries that can bias the results.
Missing information that could have improved the model.
The need to understand not only the vocabulary but also the rules specific to the discipline.

48.6% of the shots in our data frame are jump shots. This high proportion explains why this variable has a significant influence on the model’s predictions. Among all these jump shots, 63.4% are missed. The model, trained on our training data, tends to generalize a jump shot as a missed shot, as observed in Figure 30 of this report. This observation could partly explain the model’s predictions, namely that the model predicts missed shots (class 0) better than successful shots (class 1).

One possible solution could be to penalize the model for jump shots.

Additionally, important information for accurate predictions was missing. This includes defensive data related to the shot, such as the distance from the nearest defender, the angle relative to the shooter, the defensive pressure applied, the specific defender, and their height. Another relevant data point that we could not calculate was the remaining time on the possession. This data could have helped us understand that some shots are missed due to time constraints and are taken under suboptimal conditions.

Furthermore, a historical record of the shooter’s shooting success based on their position could provide additional information. We have seen that each player has their preferences in terms of shooting, whether it’s the type of action or their location on the court.

All these proposals could potentially improve the model’s performance, making it usable for coaches to plan offensive and defensive strategies before a game to maximize their chances of victory.

Published by Adrian Chmielewski on 05/05/2023
