Building an NBA Dataset for Machine Learning and Deep Learning Projects

Apr 28, 2023

Introduction:

Basketball enthusiasts and data scientists, rejoice! In this article, we’ll walk through building an NBA dataset from the 2018 season. The dataset includes player information, each player’s season averages, and a target column indicating whether the player was still in the league four years later. With this dataset, you can explore various machine learning and deep learning projects, such as predicting player performance, analyzing team dynamics, and more.

We’ll be using Python, the Pandas library for data manipulation, and the balldontlie API to fetch the data. We'll then save the dataset as a CSV file, which you can easily import into your own projects or share with others on platforms like Kaggle.com.

Step 1: Fetch Season Averages for 2018

First, let’s define a function get_season_averages(season) that retrieves season averages for all players in the given season. The function pages through the balldontlie players endpoint and, for each page of player IDs, requests the matching season averages. To avoid hitting the API rate limit, it pauses for one second after each iteration. The output is a Pandas DataFrame, which the function also saves to a CSV file.
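A fixed pause reduces the chance of being throttled but does not handle the case where a request is rejected anyway. The following is a minimal sketch of a retry wrapper, assuming the API signals throttling with HTTP status 429 (the standard "Too Many Requests" code). The helper name get_with_retry and its parameters are illustrative and are not part of the script in this article.

```python
import time


def get_with_retry(get, url, params=None, max_retries=3, pause=1.0):
    """Hypothetical helper: fetch JSON, backing off when throttled (HTTP 429).

    `get` is any callable with the same signature as requests.get.
    """
    for attempt in range(max_retries + 1):
        response = get(url, params=params)
        if response.status_code != 429:
            # Surface other HTTP errors instead of parsing an error page
            response.raise_for_status()
            return response.json()
        # Throttled: wait longer after each failed attempt (pause, 2 * pause, ...)
        time.sleep(pause * (attempt + 1))
    raise RuntimeError(f"Still rate-limited after {max_retries} retries: {url}")


# Usage (assumes the requests library is installed):
# import requests
# data = get_with_retry(requests.get,
#                       "https://www.balldontlie.io/api/v1/players",
#                       params={"per_page": 100})
```

Passing the request function in as an argument keeps the helper easy to try out with a stand-in before pointing it at the live API.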

Inside get_season_averages, once the season averages are collected, we add a new column called 4_years to the DataFrame. This column has a value of 1 for players who were still in the league four years later (i.e., have stats for the 2022 season) and 0 for those who weren't. This will be our target. Note that this check makes one API request per player.

# 16 minute runtime
import pandas as pd
import requests
import time

def get_season_averages(season):
    # Get the total number of pages for the players' endpoint
    response = requests.get("https://www.balldontlie.io/api/v1/players", params={'per_page': 100})
    data = response.json()
    total_pages = data['meta']['total_pages']
    
    season_averages = []

    # Fetch season averages for all players
    for page in range(1, total_pages + 1):
        print(f"Fetching players from page {page} of {total_pages}")
        response = requests.get("https://www.balldontlie.io/api/v1/players", params={'page': page, 'per_page': 100})
        data = response.json()
        
        player_ids = [player['id'] for player in data['data']]
        
        print(f"Fetching season averages for {season}")
        stats_response = requests.get("https://www.balldontlie.io/api/v1/season_averages", params={'season': season, 'player_ids[]': player_ids})
        stats_data = stats_response.json()

        season_averages.extend(stats_data['data'])
        
        # Sleep for a short duration to avoid hitting the rate limit
        time.sleep(1)

    # Convert the list of season averages to a pandas DataFrame
    df = pd.DataFrame(season_averages)

    # Check if players have stats for the 2022 season
    print("Checking for 2022 season stats")
    df['4_years'] = 0
    for index, row in df.iterrows():
        player_id = row['player_id']
        # len(df) is the number of players being checked
        print(f"Checking 2022 season stats for player {index + 1} of {len(df)} (ID: {player_id})")
        response = requests.get("https://www.balldontlie.io/api/v1/season_averages", params={'season': 2022, 'player_ids[]': player_id})
        data = response.json()
        if data['data']:
            df.at[index, '4_years'] = 1

        # Sleep for a short duration to avoid hitting the rate limit
        time.sleep(1)

    # Save the DataFrame to a CSV file
    df.to_csv(f'season_{season}_averages.csv', index=False)

    return df

# Example usage
season_2018_df = get_season_averages(2018)
print(season_2018_df.head())
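The per-player loop above issues one request for every row. Since the season_averages endpoint is already called with a list of player IDs earlier in the function, one alternative is to collect the IDs that have 2022 stats in a few batched requests and then label all rows at once. The sketch below shows only the labeling step; the IDs and values are made up for illustration, and the names df_2018 and ids_with_2022_stats are hypothetical.

```python
import pandas as pd

# Made-up player IDs and averages for illustration only, not real API output
df_2018 = pd.DataFrame({"player_id": [1, 2, 3, 4], "pts": [10.2, 4.1, 18.7, 6.3]})

# IDs assumed to have come back from batched 2022 season_averages requests
ids_with_2022_stats = {2, 3}

# 1 if the player also has 2022 stats, otherwise 0
df_2018["4_years"] = df_2018["player_id"].isin(ids_with_2022_stats).astype(int)
print(df_2018)
```

This produces the same 0/1 target column without a row-by-row loop.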

Step 2: Enrich Player Data

Next, we’ll create a function called enrich_player_data(df) that enriches the DataFrame obtained from the previous step with additional player information, such as first name, last name, position, height, weight, and team details. We'll again use the balldontlie API to fetch this information.

import pandas as pd
import requests
import time

def enrich_player_data(df):
    player_details = []

    for player_id in df['player_id']:
        print(f"Fetching player details for ID {player_id}")
        response = requests.get(f"https://www.balldontlie.io/api/v1/players/{player_id}")
        data = response.json()
        
        player = {
            'player_id': data['id'],
            'first_name': data['first_name'],
            'last_name': data['last_name'],
            'position': data['position'],
            'height_feet': data['height_feet'],
            'height_inches': data['height_inches'],
            'weight_pounds': data['weight_pounds'],
            'team_id': data['team']['id'],
            'team_abbreviation': data['team']['abbreviation'],
            'team_city': data['team']['city'],
            'team_conference': data['team']['conference'],
            'team_division': data['team']['division'],
            'team_full_name': data['team']['full_name'],
            'team_name': data['team']['name']
        }

        player_details.append(player)

        # Sleep for a short duration to avoid hitting the rate limit
        time.sleep(1)

    player_details_df = pd.DataFrame(player_details)
    enriched_df = df.merge(player_details_df, on='player_id')

    return enriched_df

# Example usage
enriched_season_2018_df = enrich_player_data(season_2018_df)
print(enriched_season_2018_df.head())
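The hand-written dictionary in enrich_player_data flattens the nested team object into team_-prefixed columns. As a sketch of an alternative, pandas can do that flattening itself with json_normalize. The record below is a placeholder shaped like the fields read in this step; its values are made up and do not come from the API.

```python
import pandas as pd

# Placeholder record with the same fields used above; values are made up
sample_player = {
    "id": 1,
    "first_name": "Test",
    "last_name": "Player",
    "position": "G",
    "height_feet": 6,
    "height_inches": 5,
    "weight_pounds": 200,
    "team": {
        "id": 10,
        "abbreviation": "TST",
        "city": "Testville",
        "conference": "East",
        "division": "Atlantic",
        "full_name": "Testville Testers",
        "name": "Testers",
    },
}

# sep="_" turns the nested team dict into team_id, team_city, and so on
flat_df = pd.json_normalize([sample_player], sep="_")
flat_df = flat_df.rename(columns={"id": "player_id"})
print(flat_df.columns.tolist())
```

The resulting column names match the ones built by hand above, so the merge on player_id would work the same way.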

Step 3: Visualizing the Data

Once we have our dataset ready, we can create a few visualizations to gain insights into the data. We’ll use the Seaborn library to create a count plot of the 4_years column, which shows how many players did and did not remain in the league after four years, and a box plot of points per game by position.

import seaborn as sns
import matplotlib.pyplot as plt

def visualize_data(df):
    # Bar plot of 4_years
    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='4_years')
    plt.title("Count of Players with Stats in 2022")
    plt.show()

    # Other visualizations can be added here
    # For example, a box plot of points scored by players in different positions
    plt.figure(figsize=(8, 6))
    sns.boxplot(data=df, x='position', y='pts')
    plt.title("Points Scored by Players in Different Positions")
    plt.show()

# Example usage
visualize_data(enriched_season_2018_df)
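Alongside the count plot, it can help to read the class balance as numbers, since a lopsided target affects how a classifier should be evaluated. The following is a small sketch using made-up labels; toy_df and class_share are hypothetical names, and the real DataFrame would be used in place of toy_df.

```python
import pandas as pd

# Made-up labels for illustration only
toy_df = pd.DataFrame({"4_years": [1, 0, 1, 1, 0, 1, 1, 0]})

# normalize=True returns the fraction of rows in each class instead of counts
class_share = toy_df["4_years"].value_counts(normalize=True).sort_index()
print(class_share)
```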

Conclusion:

Congratulations! You’ve successfully built an NBA dataset from the 2018 season, enriched it with player information, and created a target column for players who remained in the league after four years. This dataset can now be used for various machine learning and deep learning projects, such as predicting player performance, analyzing team dynamics, and more. Additionally, you can export the dataset to a CSV file and share it with other data scientists and sports analysts on platforms like Kaggle.com.
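Note that the CSV written in Step 1 is saved before enrichment, so it does not contain the player and team columns. Here is a minimal sketch of exporting the enriched DataFrame and reloading it to confirm the file is intact. The stand-in data and the file name season_2018_enriched.csv are assumptions made for this example.

```python
import os
import tempfile

import pandas as pd

# Stand-in for enriched_season_2018_df; values are made up
enriched_df = pd.DataFrame(
    {"player_id": [1, 2], "pts": [10.2, 4.1], "4_years": [0, 1]}
)

# Hypothetical file name, written to a temporary directory for this demo
out_path = os.path.join(tempfile.mkdtemp(), "season_2018_enriched.csv")
enriched_df.to_csv(out_path, index=False)

# Reload to confirm the columns and row count survive the round trip
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```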

Happy data exploring and good luck with your projects!

Here is the final dataset: https://github.com/fenago/datasets/blob/main/2018NBA.csv

Here it is on kaggle.com:

https://www.kaggle.com/datasets/ernestolee/2018-nba-data-with-4-year-target?datasetId=3194996