Fairness by Counterfactuals#

This notebook demonstrates how to apply the concept of “fairness” to a multiclass classification problem. Usually, fairness and explainability focus on the descriptive side of the model, identifying the variables that most influence the final prediction/classification. We will use a library named DiCE (interpretml/DiCE), which provides actionable explanations through counterfactuals generated from the real data of a Machine Learning (ML) model. This way it is also possible to understand how much a variable should change to reach the desired outcome.

Case Study: NBA Players Salary Expectations#

We will be implementing DiCE with an official NBA database reporting players’ stats and salaries for the 2022/2023 season. The aim is to suggest which stats (and, most importantly, by how much) a player needs to improve in order to expect a better salary. Let’s jump right in.

Install the necessary libraries#

DiCE can be easily installed (check the package documentation). Other than that, only pandas and scikit-learn need to be loaded: standard libraries for ML classification, nothing really fancy. DiCE claims to work with any ML model, so you could also just build your own.

import dice_ml
from dice_ml import Dice

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

import pandas as pd

Load and preprocess the NBA dataset#

For multiclass classification, we will try to predict an NBA player’s salary range from his stats. We start from the assumption that better stats mean a better salary, disregarding intangible skills such as leadership, locker-room chemistry and charisma: it’s a shame, but this is a toy example to play around with. In all honesty, it would have been better to average the stats of the past five seasons, which is when the current contract is usually earned, but that would have made it complicated to account for younger players, so for the moment this should do.

url = 'https://raw.githubusercontent.com/antoniocollesei/nba-fairness-salary/main/stats_salary_NBA_2223.csv'

df = pd.read_csv(url)
df = df.dropna()
# remove columns that have a '.' or an 'X' in the column name (so, percentages)
df = df.loc[:,~df.columns.str.contains(r'\.')]
df = df.loc[:,~df.columns.str.contains('X')]
# reset the index (it will be useful later, trust me)
df = df.reset_index(drop=True)
df.head(5)
Player Pos Age G GS MP FG FGA FT FTA ORB DRB TRB AST STL BLK TOV PF PTS salary
0 Aaron Gordon PF 26 75 75 31.7 5.8 11.1 2.3 3.1 1.7 4.2 5.9 2.5 0.6 0.6 1.8 2.0 15.0 19690909.0
1 Aaron Holiday PG 25 63 15 16.2 2.4 5.4 0.9 1.1 0.4 1.6 1.9 2.4 0.7 0.1 1.1 1.5 6.3 1836090.0
2 Aaron Nesmith SF 22 52 3 11.0 1.4 3.5 0.4 0.5 0.3 1.4 1.7 0.4 0.4 0.1 0.6 1.3 3.8 3804360.0
3 Aaron Wiggins SG 23 50 35 24.2 3.1 6.7 1.2 1.7 1.0 2.5 3.6 1.4 0.6 0.2 1.1 1.9 8.3 1563518.0
4 Admiral Schofield SF 24 38 1 12.3 1.4 3.4 0.3 0.4 0.4 1.9 2.3 0.7 0.1 0.1 0.6 1.5 3.8 506508.0
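The column-filtering step above is easy to get wrong, because `.` is a regex metacharacter in `str.contains`. A minimal sketch on a toy DataFrame (the column names here are invented for illustration) shows the behavior:

```python
import pandas as pd

# Toy frame mimicking the raw export: percentage columns contain '.' or 'X'
toy = pd.DataFrame({'PTS': [10.0], 'FG.1': [0.5], 'X3P': [1.2], 'AST': [3.0]})

# Same filtering logic as in the notebook: drop any column whose name
# contains a literal dot (note the escaped regex) or an 'X'
toy = toy.loc[:, ~toy.columns.str.contains(r'\.')]
toy = toy.loc[:, ~toy.columns.str.contains('X')]

print(list(toy.columns))  # → ['PTS', 'AST']
```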

Prepare the target#

Although you could make this work with regression, we stick with classification to keep it easily understandable. Therefore, we subdivide the continuous salary into four classes. We tried to make the classes as balanced as possible without undermining the purpose, but it is clear that most players are not superstars and earn far less than their star colleagues.

outcome_name = "salary"
continuous_features = df.drop(outcome_name, axis=1).select_dtypes(include=['float64', 'int64']).columns
target = df[outcome_name]
# factorize target into 4 classes
target_cat = pd.cut(target, bins=[0, 5e6, 1.5e7, 2.5e7, target.max()], labels=["5m-", "5-15m", "15-25m", "25m+"])
# substitute target with factorized version
df[outcome_name] = target_cat

target_cat.value_counts()
5m-       226
5-15m     132
25m+       50
15-25m     41
Name: salary, dtype: int64
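To see exactly how `pd.cut` assigns labels (bins are right-inclusive by default), here is a minimal sketch with a few made-up salaries, one per class:

```python
import pandas as pd

# Illustrative salaries (in dollars), one landing in each of the four bins
sample = pd.Series([1_000_000, 8_000_000, 20_000_000, 40_000_000])
labels = pd.cut(sample,
                bins=[0, 5e6, 1.5e7, 2.5e7, 5e7],
                labels=["5m-", "5-15m", "15-25m", "25m+"])

print(list(labels))  # → ['5m-', '5-15m', '15-25m', '25m+']
```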

Multi-Class Modeling#

Here we build the ML model (a Random Forest Classifier). Note that we also introduce the blocks to scale continuous variables and one-hot encode the categorical ones.

# save vector of players and drop from df
players = df['Player']
df = df.drop(['Player'], axis=1)

# Split data into train and test
datasetX = df.drop(outcome_name, axis=1)
x_train, x_test, y_train, y_test = train_test_split(datasetX,
                                                    target_cat,
                                                    test_size=0.3,
                                                    random_state=42,
                                                    stratify=target_cat)

# Create the same dataset but with the player names (it will be used later to select the player)
datasetX['Player'] = players
x_player_train, x_player_test, y_player_train, y_player_test = train_test_split(datasetX,
                                                                                target_cat,
                                                                                test_size=0.3,
                                                                                random_state=42,
                                                                                stratify=target_cat)

# Implementing transformers for variables
categorical_features = x_train.columns.difference(continuous_features)

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

transformations = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, continuous_features),
        ('cat', categorical_transformer, categorical_features)])

# Create a pipeline with the transformations and the classifier
clf = Pipeline(steps=[('preprocessor', transformations),
                           ('classifier', RandomForestClassifier())])
model = clf.fit(x_train, y_train)
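Before generating counterfactuals, it is worth checking that the pipeline actually learns something, e.g. with `model.score(x_test, y_test)`. Since the NBA data is loaded above, here is a self-contained sketch of the same pipeline pattern on synthetic stand-in data (feature names and the class rule are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the stats: two numeric features, three salary-like classes
rng = np.random.default_rng(42)
X = pd.DataFrame({'PTS': rng.normal(10, 5, 300), 'AST': rng.normal(3, 2, 300)})
y = pd.cut(X['PTS'] + X['AST'], bins=3, labels=['low', 'mid', 'high'])

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=42)

# Same shape as the notebook's pipeline: preprocessing step + classifier
pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('classifier', RandomForestClassifier(random_state=42))])
pipe.fit(Xtr, ytr)

print(round(pipe.score(Xte, yte), 2))  # held-out accuracy
```

On the real data you would simply call `model.score(x_test, y_test)` with the objects defined above.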

Build DiCE model#

d = dice_ml.Data(dataframe=df,
                 continuous_features=list(continuous_features),
                 outcome_name=outcome_name)

# We provide the type of model as a parameter (model_type)
m = dice_ml.Model(model=model, backend="sklearn", model_type='classifier')

exp_genetic = Dice(d, m, method="genetic")

Here we list the possible players to analyze through DiCE: for simplicity we will print a small pool of players from the test dataset.

players = x_player_test['Player'].values
print(players[1:20])
['Saben Lee' 'Isaac Okoro' 'Dewayne Dedmon' 'Sam Merrill'
 'Kevin Porter Jr.' 'Devin Cannady' 'Doug McDermott' 'PJ Dozier'
 'Richaun Holmes' 'Naz Reid' 'Kai Jones' 'Max Strus' 'Marvin Bagley III'
 'Mike Muscala' 'Aaron Nesmith' 'Jrue Holiday' 'Marcus Smart'
 'Derrick Favors' 'Cam Thomas']

Now we are ready to select an NBA player who desperately needs his contract to be upgraded. What should he do in terms of stats to improve his salary? Let’s take Naz Reid, who has just been named Sixth Man of the Year for the 2023/24 season with the Minnesota Timberwolves. We want to see, for example, what his contribution in terms of points scored (‘PTS’) should be to upgrade his salary.

player_name = 'Naz Reid'

x_player_test = x_player_test.reset_index(drop=True)
x_test = x_test.reset_index(drop=True)
player_row = x_player_test[x_player_test['Player'].str.contains(player_name)].iloc[0]
player_row = player_row.drop(['Player'])
player_index = x_test[x_test.eq(player_row).all(1)].index[0]

# Generate counterfactuals for the player
query_instances = x_test[player_index:player_index+1]
genetic = exp_genetic.generate_counterfactuals(query_instances, 
                                               total_CFs=3,
                                               #features_to_vary=['Age', 'PTS', 'ORB', 'DRB', 'STL', 'BLK'],
                                               desired_class=1
                                               )
genetic.visualize_as_dataframe(show_only_changes=True)
Query instance (original outcome : 5m-)

Pos Age G GS MP FG FGA FT FTA ORB DRB TRB AST STL BLK TOV PF PTS salary
0 C 22 77 6 15.8 3.0 6.2 1.5 1.9 1.3 2.6 3.9 0.9 0.5 0.9 1.1 2.2 8.3 5m-
Diverse Counterfactual set (new outcome: 25m+)
Pos Age G GS MP FG FGA FT FTA ORB DRB TRB AST STL BLK TOV PF PTS salary
0 PG 31 37 37 25.6 4.0 10.0 1.6 - 0.4 - 3.0 3.5 0.7 0.2 1.3 1.0 11.6 25m+
0 - 35 69 69 29.1 3.9 8.2 1.2 1.4 1.6 6.1 7.7 3.4 0.7 1.3 0.9 1.9 10.2 25m+
0 - 23 58 58 29.5 7.6 12.0 1.8 2.4 2.6 7.7 10.2 1.4 0.7 0.7 1.6 2.4 17.2 25m+
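The interesting quantity is the *gap* between the query instance and each counterfactual. A minimal sketch, using a few values copied by hand from the output above (only three stats, for brevity), shows how you could compute it:

```python
import pandas as pd

# Query instance and the third counterfactual, copied from the DiCE output above
query = pd.Series({'MP': 15.8, 'DRB': 2.6, 'PTS': 8.3})
cf    = pd.Series({'MP': 29.5, 'DRB': 7.7, 'PTS': 17.2})

# How much each stat would need to change to reach the 25m+ class
delta = (cf - query).round(1)

print(delta.to_dict())  # → {'MP': 13.7, 'DRB': 5.1, 'PTS': 8.9}
```

With the actual DiCE result you could do the same on `genetic.cf_examples_list[0].final_cfs_df` against the query row, instead of typing the values in.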

And as a reference these are the league averages per single stat.

df.mean(axis=0, numeric_only=True).to_frame().T
Age G GS MP FG FGA FT FTA ORB DRB TRB AST STL BLK TOV PF PTS
0 25.837416 52.5902 26.585746 21.711136 3.581514 7.8049 1.475724 1.906682 0.926058 3.032517 3.960356 2.211359 0.69265 0.429621 1.168597 1.785523 9.747439

Considerations#

A few key points can be drawn from this simple analysis, both for the player and for the league itself. It appears that Naz Reid, in order to secure a 25m+ contract, should clearly improve his scoring efficiency: his mere 8.3 points and 3.0 field goals per night are not enough to be considered an elite player, so he should focus on buckets! And that makes complete sense. But other interesting take-aways actually come from noticeable trends at the league level. Let’s see some of them:

  • the NBA seems to care more about defensive rebounds (DRB) than offensive ones (ORB), since our Naz Reid should grab 3x more DRB, while on the ORB side he would just need to double his stats;

  • steals (STL) and blocks (BLK) are not highly rewarded, since his defensive attitude could work just as it is (his stats are pretty close to the league average, although his blocks are closer to those of a medium-high-level player);

  • the league rewards mature players, and that actually makes sense: young players coming into the league have a capped salary until they reach five years of service in the NBA; only after that can they be granted a max extension, if their performance is elite.

Conclusions#

DiCE seems a pretty powerful tool to improve visibility in ML classification tasks. Not only does it suggest personalized actions to fall into a different category, it can also surface domain take-aways for analyzing high-level trends. In conclusion, if some NBA player wants a better contract, he can send me his stats, and I can definitely help him improve his game!