A Biased Employment Status Algorithm

Here we will examine some employment status data from Missouri and demonstrate that an algorithm trained on it exhibits bias by ethnicity.
Author

Jay-U Chung

Published

April 4, 2023

Bias Audit

In this blog post, we will audit our algorithm for allocative bias. We will examine employment status data from Missouri, focusing on the various self-reported ethnicities. We will then train an algorithm to predict employment status and examine any biases it might exhibit with respect to these groups.

Examining the Data

Our data can be obtained with the folktables package.

from folktables import ACSDataSource, ACSEmployment, BasicProblem, adult_filter
import numpy as np

STATE = "MO"

data_source = ACSDataSource(survey_year='2018', 
                            horizon='1-Year', 
                            survey='person')

acs_data = data_source.get_data(states=[STATE], download=True)

acs_data.head()
RT SERIALNO DIVISION SPORDER PUMA REGION ST ADJINC PWGTP AGEP ... PWGTP71 PWGTP72 PWGTP73 PWGTP74 PWGTP75 PWGTP76 PWGTP77 PWGTP78 PWGTP79 PWGTP80
0 P 2018GQ0000034 4 1 400 2 29 1013097 77 27 ... 123 128 134 70 77 74 70 70 16 139
1 P 2018GQ0000067 4 1 1001 2 29 1013097 73 42 ... 70 130 72 125 73 77 70 69 127 128
2 P 2018GQ0000097 4 1 600 2 29 1013097 85 20 ... 86 84 85 12 88 13 90 83 13 14
3 P 2018GQ0000113 4 1 1002 2 29 1013097 78 26 ... 80 79 6 75 78 6 77 76 78 78
4 P 2018GQ0000159 4 1 200 2 29 1013097 55 37 ... 11 57 101 56 57 108 54 13 13 13

5 rows × 286 columns

Here we select the possible features, most notably race (RAC1P) and sex (SEX).

possible_features=['AGEP', 'SCHL', 'MAR', 'RELP', 'DIS', 'ESP', 'CIT', 'MIG', 'MIL', 'ANC', 'NATIVITY', 'DEAR', 'DEYE', 'DREM', 'SEX', 'RAC1P', 'ESR']
acs_data[possible_features].head()
AGEP SCHL MAR RELP DIS ESP CIT MIG MIL ANC NATIVITY DEAR DEYE DREM SEX RAC1P ESR
0 27 17.0 5 16 2 NaN 1 3.0 4.0 4 1 2 2 2.0 2 1 6.0
1 42 19.0 5 16 2 NaN 1 3.0 4.0 1 1 2 2 2.0 1 2 6.0
2 20 19.0 5 17 2 NaN 1 1.0 4.0 2 1 2 2 2.0 1 1 1.0
3 26 17.0 5 16 1 NaN 1 1.0 4.0 1 1 2 2 1.0 1 2 6.0
4 37 16.0 5 16 1 NaN 1 3.0 4.0 1 1 2 1 1.0 1 2 6.0
target_features = ["ESR", "RAC1P"]
features_to_use = [f for f in possible_features if f not in target_features]
EmploymentProblem = BasicProblem(
    features=features_to_use,
    target='ESR',
    target_transform=lambda x: x == 1,
    group='RAC1P',
    preprocess=lambda x: x,
    postprocess=lambda x: np.nan_to_num(x, -1),
)

features, label, group = EmploymentProblem.df_to_numpy(acs_data)

We have now transformed our data so that it can be processed by a classification algorithm. Our task is to predict the employment status of civilians at work (excluding military). Of course, we split the data into training and testing sets, in this case with an 80-20 split.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features, label, group, test_size=0.2, random_state=0)

Let’s now examine some demographic characteristics of our training data.

import pandas as pd
df = pd.DataFrame(X_train, columns = features_to_use)
df["group"] = group_train
df["label"] = y_train
df_nums = df.groupby(["group"]).size().reset_index(name = "Number")
df_nums["group"] = ['White alone','Black or African American alone','American Indian alone','Alaska Native alone','American Indian and Alaska Native tribes specified, or American Indian or Alaska Native, not specified and no other races','Asian alone','Native Hawaiian and Other Pacific Islander alone',' Some Other Race alone','Two or More Races']
df_nums = df_nums.rename(index = {0:1,1:2,2:3,3:4,4:5,5:6,6:7,7:8,8:9})
df_nums['Percentage'] = df_nums['Number']/df.shape[0]*100.0
df_nums
group Number Percentage
1 White alone 43713 87.545061
2 Black or African American alone 3838 7.686454
3 American Indian alone 150 0.300409
4 Alaska Native alone 5 0.010014
5 American Indian and Alaska Native tribes speci... 27 0.054074
6 Asian alone 762 1.526075
7 Native Hawaiian and Other Pacific Islander alone 51 0.102139
8 Some Other Race alone 316 0.632861
9 Two or More Races 1070 2.142914

So we can see that the overwhelming majority of our data is White, followed by Black, Two or More Races, and Asian. For my analysis, I will focus only on these four groups, as the others have too small a sample size.
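
As a quick illustration (a sketch only, not used in the later code; the analysis below simply reports metrics for these group codes individually), the four largest groups correspond to RAC1P codes 1, 2, 6, and 9:

# Sketch: restrict attention to the four largest groups by RAC1P code
# (1 = White alone, 2 = Black or African American alone, 6 = Asian alone, 9 = Two or More Races).
main_groups = [1, 2, 6, 9]
in_main = np.isin(group_train, main_groups)
print("Share of training rows in the four largest groups:", round(in_main.mean(), 3))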

df_employed = (df.groupby(["group", "label"]).size()
                 .groupby("group", group_keys=False)
                 .apply(lambda x: 100 * x / x.sum())
                 .reset_index(name="Percentage")
                 .replace({1: 'White alone', 2: 'Black or African American alone',
                           3: 'American Indian alone', 4: 'Alaska Native alone',
                           5: 'American Indian and Alaska Native tribes specified, or American Indian or Alaska Native, not specified and no other races',
                           6: 'Asian alone', 7: 'Native Hawaiian and Other Pacific Islander alone',
                           8: 'Some Other Race alone', 9: 'Two or More Races'}))
df_employed = df_employed.rename(columns={'label': 'Employed'})

print("Overall employment rate: ",(df["label"] == 1.0).mean())
df_employed['Diff. with Overall Avg.'] = df_employed['Percentage'] - (df["label"] == 1.0).mean()*np.ones(df_employed.shape[0])*100
df_employed[df_employed["Employed"] == True]
Overall employment rate:  0.4467275494672755
group Employed Percentage Diff. with Overall Avg.
1 White alone True 45.439572 0.766817
3 Black or African American alone True 39.577905 -5.094850
5 American Indian alone True 46.000000 1.327245
7 Alaska Native alone True 60.000000 15.327245
9 American Indian and Alaska Native tribes speci... True 40.740741 -3.932014
11 Asian alone True 49.343832 4.671077
13 Native Hawaiian and Other Pacific Islander alone True 52.941176 8.268422
15 Some Other Race alone True 37.341772 -7.330983
17 Two or More Races True 29.906542 -14.766213

So in comparison to the overall average, White individuals are employed at about the same rate, Black individuals at a lower rate, individuals of Two or More Races at a much lower rate, and Asian individuals at a slightly higher rate.

import seaborn as sb
df_plot = df.groupby(["group","SEX","label"]).size().groupby(["group","SEX"], group_keys=False).apply(lambda x: 100 * x / x.sum()).reset_index(name = "Percentage")
df_plot['SEX'] = df_plot['SEX'].replace([1.0,2.0],['Male','Female'])

sb.barplot(x="group",
           y="Percentage",
           hue="SEX",
           data=df_plot[df_plot["label"]==True])
[Figure: bar plot of employment rate (Percentage, y-axis) by racial group code (group, x-axis), with bars split by SEX.]

Plotting the employment rates by sex, we can see clear disparities between male and female employment within the groups.

For White and Asian individuals, males are more likely to be employed than females. For Black individuals and those from Two or More races, females are more likely to be employed.
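
For reference, here is a small sketch (reusing the df_plot frame computed above) that prints the numbers behind the bar plot for the four largest groups:

# Sketch: print the employment rates by sex underlying the bar plot, restricted to the
# four largest groups (RAC1P codes 1, 2, 6, 9). Group codes are left numeric here.
employed_by_sex = df_plot[df_plot["label"] == True]
print(employed_by_sex[employed_by_sex["group"].isin([1, 2, 6, 9])])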

Training an Algorithm to Predict Employment Status

The main classification algorithms we have on hand from scikit-learn are Logistic Regression, the Support Vector Machine, the Decision Tree Classifier, and the Random Forest Classifier.

Each model has a parameter that must be tuned: for Logistic Regression, the degree of the polynomial features; for the Support Vector Machine, the regularization parameter; and for the Decision Tree and Random Forest, the maximum depth. I define a function below that takes a model, tunes this parameter using cross-validation, and outputs the optimal value.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

def train_opt_model(input_model, kwargs, params, X_train, y_train, X_test, y_test):
    # Tune the single hyperparameter named in kwargs over the candidate values in params,
    # choosing the value with the best 5-fold cross-validation score on the training data.
    keys = list(kwargs.keys())
    best_score = 0.0
    best_param = params[0]
    
    for param in params:
        kwargs[keys[0]] = param #should be only one argument in this case
        model = make_pipeline(StandardScaler(), input_model(**kwargs))
        train_score = cross_val_score(model, X_train, y_train, cv=5).mean()
        if (train_score > best_score):
            best_score = train_score
            best_param = param
                              
    print("Best model parameter: ",best_param)    
    kwargs[keys[0]] = best_param
    best_model = make_pipeline(StandardScaler(), input_model(**kwargs))
    best_model.fit(X_train,y_train)
    print("Best model training score: ",best_model.score(X_train,y_train))
    print("Best model test score: ",best_model.score(X_test,y_test))
    return kwargs
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

def poly_LR(degree, **kwargs):
    plr = Pipeline([("poly", PolynomialFeatures(degree = degree)),
                    ("LR", LogisticRegression(**kwargs))])
    return plr

params = [i for i in range(3,4)]
kwargs = {"degree":params[0],"max_iter":5000}
kwargs = train_opt_model(poly_LR, kwargs, params, X_train, y_train, X_test, y_test)
model = make_pipeline(StandardScaler(), poly_LR(**kwargs))
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
print("Overall employment accuracy: ",(y_hat == y_test).mean())
print("Employment accuracy for white: ",(y_hat == y_test)[group_test == 1].mean())
print("Employment accuracy for black:  ",(y_hat == y_test)[group_test == 2].mean())
from sklearn.svm import SVC
params = [0.1]
kwargs = {"C":params[0]}
kwargs = train_opt_model(SVC, kwargs, params, X_train, y_train, X_test, y_test)
model = make_pipeline(StandardScaler(), SVC(**kwargs))
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
print("Overall employment accuracy: ",(y_hat == y_test).mean())
print("Employment accuracy for white: ",(y_hat == y_test)[group_test == 1].mean())
print("Employment accuracy for black:  ",(y_hat == y_test)[group_test == 2].mean())

confusion_matrix_SVC = confusion_matrix(y_test, y_hat, normalize='true')
print("False positive rate: ", confusion_matrix_SVC[0][1])
print("False negative rate: ", confusion_matrix_SVC[1][0])
CM = confusion_matrix(y_test, y_hat)
print("PPV: ", CM[1][1] / (CM[1][1] + CM[0][1]))  # positive predictive value

confusion_matrix_SVC = confusion_matrix(y_test[group_test == 1], y_hat[group_test == 1], normalize='true')
print("White False positive rate: ", confusion_matrix_SVC[0][1])
print("White False negative rate: ", confusion_matrix_SVC[1][0])
CM = confusion_matrix(y_test[group_test == 1], y_hat[group_test == 1])
print("White PPV: ", CM[1][1] / (CM[1][1] + CM[0][1]))

confusion_matrix_SVC = confusion_matrix(y_test[group_test == 2], y_hat[group_test == 2], normalize='true')
print("Black False positive rate: ", confusion_matrix_SVC[0][1])
print("Black False negative rate: ", confusion_matrix_SVC[1][0])
CM = confusion_matrix(y_test[group_test == 2], y_hat[group_test == 2])
print("Black PPV: ", CM[1][1] / (CM[1][1] + CM[0][1]))

A spoiler for these two algorithms: I tried running them on my machine, and they both took an extraordinarily long time! Testing with a few parameter values also yields training scores similar to the algorithm we will use anyway.
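
If one did want a rough read on these slower models, a minimal sketch (not part of the runs in this post; the subsample size and C value are illustrative, not tuned) would be to fit them on a random subsample of the training data first:

# Hypothetical sketch: fit the SVM on a 5,000-row random subsample just to gauge behavior.
rng = np.random.default_rng(0)
idx = rng.choice(X_train.shape[0], size=5000, replace=False)
quick_svc = make_pipeline(StandardScaler(), SVC(C=0.1))
quick_svc.fit(X_train[idx], y_train[idx])
print("Subsampled SVC test score:", round(quick_svc.score(X_test, y_test), 3))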

from sklearn.tree import DecisionTreeClassifier
params = [i for i in range(5,15)]
kwargs = {"max_depth":params[0]}
kwargs = train_opt_model(DecisionTreeClassifier, kwargs, params, X_train, y_train, X_test, y_test)
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(**kwargs))
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
print("Overall employment accuracy: ",(y_hat == y_test).mean())
print("Employment accuracy for white: ",(y_hat == y_test)[group_test == 1].mean())
print("Employment accuracy for black:  ",(y_hat == y_test)[group_test == 2].mean())
Best model parameter:  9
Best model training score:  0.8429263798766322
Best model test score:  0.8404357577699455
Overall employment accuracy:  0.8403556552387056
Employment accuracy for white:  0.8408361602640506
Employment accuracy for black:   0.841031149301826
from sklearn.ensemble import RandomForestClassifier
params = [i for i in range(5,15)]
kwargs = {"max_depth":params[0]}
kwargs = train_opt_model(RandomForestClassifier, kwargs, params, X_train, y_train, X_test, y_test)
model = make_pipeline(StandardScaler(), RandomForestClassifier(**kwargs))
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
print("Overall employment accuracy: ",(y_hat == y_test).mean())
print("Employment accuracy for white: ",(y_hat == y_test)[group_test == 1].mean())
print("Employment accuracy for black:  ",(y_hat == y_test)[group_test == 2].mean())
Best model parameter:  14
Best model training score:  0.8644356324601458
Best model test score:  0.8415571932073054
Overall employment accuracy:  0.8416372957385453
Employment accuracy for white:  0.8411112129824884
Employment accuracy for black:   0.8431793770139635

We can see that the training scores for the Decision Tree and Random Forest algorithms are pretty similar. The Random Forest does have slightly better training and test scores, but I will opt to use the Decision Tree because it is faster. Moreover, I think it would not be ethical to have randomness in an algorithm that predicts employment status.

Analyzing our Employment Classification Algorithm

We can now define a function that will make some statistics easier to calculate, and then run our Decision Tree with a max depth of 9.

def get_statistics(y_hat, y_test, group_test=None, group_test_val=None):
    # Restrict to a single group if a group value is supplied; otherwise use everyone.
    if group_test_val is None:
        y_t = y_test
        y_h = y_hat
    else:
        y_t = y_test[group_test == group_test_val]
        y_h = y_hat[group_test == group_test_val]

    CM = confusion_matrix(y_t, y_h)

    p = (1 * y_t).mean()                     # prevalence: fraction of the group actually employed
    PPV = CM[1][1] / (CM[1][1] + CM[0][1])   # positive predictive value

    FPR = CM[0][1] / (CM[0][1] + CM[0][0])   # false positive rate
    FNR = CM[1][0] / (CM[1][0] + CM[1][1])   # false negative rate

    # Calibration: fraction actually employed among those given score 0 / score 1.
    calibration_0 = ((y_t == 1) * (y_h == 0)).sum() / (y_h == 0).sum()
    calibration_1 = ((y_t == 1) * (y_h == 1)).sum() / (y_h == 1).sum()

    parity = (y_h > 0).mean()                # fraction given a positive prediction (threshold 0)

    return p, PPV, FPR, FNR, calibration_0, calibration_1, parity
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth = 9))
model.fit(X_train, y_train)
y_hat = model.predict(X_test)

p, PPV, FPR, FNR, calibration_0, calibration_1, parity = get_statistics(y_hat,y_test)
print("Overall accuracy: ",round((y_hat == y_test).mean(),3))
print("Overall prevalence: ",round(p,3))
print("Overall False positive rate: ",round(FPR,3))
print("Overall False negative rate: ",round(FNR,3))
print("Overall Calibration (Score 0): ",round(calibration_0,3))
print("Overall Calibration (Score 1): ",round(calibration_1,3))
print("Overall Parity: ",round(parity,3),"\n")

white_p, white_PPV, white_FPR, white_FNR, white_calibration_0, white_calibration_1, white_parity = get_statistics(y_hat,y_test,group_test,1)
print("Accuracy for White: ",round((y_hat == y_test)[group_test == 1].mean(),3))
print("White prevalence: ",round(white_p,3))
print("White False positive rate: ",round(white_FPR,3))
print("White False negative rate: ",round(white_FNR,3))
print("Calibration (Score 0): ",round(white_calibration_0,3))
print("Calibration (Score 1): ",round(white_calibration_1,3))
print("Parity: ",round(white_parity,3),"\n")

black_p, black_PPV, black_FPR, black_FNR, black_calibration_0, black_calibration_1, black_parity = get_statistics(y_hat,y_test,group_test,2)
print("Accuracy for Black:  ",round((y_hat == y_test)[group_test == 2].mean(),3))
print("Black prevalence: ",round(black_p,3))
print("Black False positive rate: ",round(black_FPR,3))
print("Black False negative rate: ",round(black_FNR,3))
print("Calibration (Score 0): ",round(black_calibration_0,3))
print("Calibration (Score 1): ",round(black_calibration_1,3))
print("Parity: ", round(black_parity,3),"\n")

two_p, two_PPV, two_FPR, two_FNR, two_calibration_0, two_calibration_1, two_parity = get_statistics(y_hat,y_test,group_test,9)
print("Accuracy for Two or More Races:  ",round((y_hat == y_test)[group_test == 9].mean(),3))
print("Two or More Races prevalence: ",round(two_p,3))
print("Two or More Races False positive rate: ",round(two_FPR,3))
print("Two or More Races False negative rate: ",round(two_FNR,3))
print("Calibration (Score 0): ",round(two_calibration_0,3))
print("Calibration (Score 1): ",round(two_calibration_1,3))
print("Parity: ", round(two_parity,3),"\n")

asian_p, asian_PPV, asian_FPR, asian_FNR, asian_calibration_0, asian_calibration_1, asian_parity = get_statistics(y_hat,y_test,group_test,6)
print("Accuracy for Asian:  ",round((y_hat == y_test)[group_test == 6].mean(),3))
print("Asian prevalence: ",round(asian_p,3))
print("Asian False positive rate: ",round(asian_FPR,3))
print("Asian False negative rate: ",round(asian_FNR,3))
print("Calibration (Score 0): ",round(asian_calibration_0,3))
print("Calibration (Score 1): ",round(asian_calibration_1,3))
print("Parity: ", round(asian_parity,3))
Overall accuracy:  0.841
Overall prevalence:  0.453
Overall False positive rate:  0.177
Overall False negative rate:  0.138
Overall Calibration (Score 0):  0.122
Overall Calibration (Score 1):  0.801
Overall Parity:  0.488 

Accuracy for White:  0.841
White prevalence:  0.462
White False positive rate:  0.178
White False negative rate:  0.136
Calibration (Score 0):  0.125
Calibration (Score 1):  0.806
Parity:  0.495 

Accuracy for Black:   0.841
Black prevalence:  0.376
Black False positive rate:  0.17
Black False negative rate:  0.14
Calibration (Score 0):  0.092
Calibration (Score 1):  0.752
Parity:  0.43 

Accuracy for Two or More Races:   0.825
Two or More Races prevalence:  0.326
Two or More Races False positive rate:  0.151
Two or More Races False negative rate:  0.226
Calibration (Score 0):  0.114
Calibration (Score 1):  0.713
Parity:  0.354 

Accuracy for Asian:   0.844
Asian prevalence:  0.482
Asian False positive rate:  0.212
Asian False negative rate:  0.095
Calibration (Score 0):  0.101
Calibration (Score 1):  0.798
Parity:  0.546

To explain these quantities: accuracy measures the fraction of correct predictions; prevalence is the fraction of the group that is actually employed; calibration measures the fraction actually employed among those given a particular score (within the given group); and statistical parity measures the fraction of individuals given a positive score.
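
For reference, here is a short sketch (using the overall test-set predictions from the decision tree above) that writes these quantities out explicitly in terms of the confusion matrix entries; it simply restates what get_statistics computes.

# The quantities returned by get_statistics, written out in terms of the overall
# test-set confusion matrix (rows: actual, columns: predicted).
TN, FP, FN, TP = confusion_matrix(y_test, y_hat).ravel()
n = TN + FP + FN + TP
print("Prevalence (TP + FN) / n:          ", round((TP + FN) / n, 3))
print("False positive rate FP / (FP + TN):", round(FP / (FP + TN), 3))
print("False negative rate FN / (FN + TP):", round(FN / (FN + TP), 3))
print("Calibration at score 0 FN/(FN+TN): ", round(FN / (FN + TN), 3))
print("Calibration at score 1 TP/(TP+FP): ", round(TP / (TP + FP), 3))
print("Statistical parity (TP + FP) / n:  ", round((TP + FP) / n, 3))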

Notice that the overall employment accuracy is similar amongst all groups.

Comparing the White and Black populations, we can see that the false positive and false negative rates closely agree. However, the scores are not calibrated across these two groups: the calibration for score 0 is roughly 35 percent larger in relative terms for White individuals (0.125 vs. 0.092), and the calibration for score 1 (the positive predictive value) is about 7 percent larger for White individuals (0.806 vs. 0.752). The statistical parity, however, is fairly similar (0.495 vs. 0.430). So, while there is error rate balance and approximate statistical parity for these groups, the algorithm is not calibrated.

Notice that for individuals of Two or More Races and for Asian individuals, the prevalences differ significantly from both the White and Black populations. For individuals of Two or More Races, the false negative rate is significantly higher and the calibration lower than for the White and Black groups, and the parity is also significantly lower. For Asian individuals, the false positive rate is much higher, the false negative rate is much lower, and the statistical parity is significantly higher than in the White and Black groups. Since these groups are not as large as the others, these trends may not be as reliable.

Perhaps overall it could be argued that the calibration is similar enough amongst groups. However, clearly the algorithm is not error rate balanced.
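
To make the comparison concrete, here is a small sketch that gathers the per-group rates computed above into one table; it only reuses variables already defined for the same four groups as before.

# Sketch: collect the per-group error rates and calibration at score 1 computed above.
rates = pd.DataFrame({
    "FPR": [white_FPR, black_FPR, asian_FPR, two_FPR],
    "FNR": [white_FNR, black_FNR, asian_FNR, two_FNR],
    "Calibration (Score 1)": [white_calibration_1, black_calibration_1,
                              asian_calibration_1, two_calibration_1],
}, index=["White", "Black", "Asian", "Two or More Races"])
print(rates.round(3))
print("Largest FPR gap:", round(rates["FPR"].max() - rates["FPR"].min(), 3))
print("Largest FNR gap:", round(rates["FNR"].max() - rates["FNR"].min(), 3))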

Conclusions

Ultimately, the benefit of using these employment status labels would depend on how they are used. Perhaps someone would use this algorithm to predict whether individuals of certain backgrounds need unemployment aid. Perhaps a company would use it to decide whether an individual is worth hiring.

In any case, this algorithm does not work to combat structural discrimination. Notice that for individuals who are Black or of Two or More Races, the prevalences are markedly lower than for White and Asian individuals. This is likely to be self-perpetuating, as communities with higher unemployment are more likely to remain so. Even if the error rates were exactly the same amongst groups, this would reinforce the existing distribution of employment. In fact, the algorithm would be likely to falsely predict that an Asian individual is employed and that an individual of Two or More Races is not employed. I also find it interesting that Asian individuals have a much higher false positive rate; perhaps it is a manifestation of the Model Minority myth (which may stem from higher education rates amongst Asian groups). Using the examples above, an Asian individual could then mistakenly receive less unemployment aid, and a company using this as a hiring proxy could underhire those of Two or More Races.

I think that the idea of using solely demographic information to determine whether or not an individual is employed is also flawed. It misses any indication of the state of the economy and of the opportunities a person may have to find a job. An extra feature accounting for the wealth of the surrounding community could help address this. There are also no features that account for time: the data are specific to the year the survey was taken, yet unemployment is generally not a chronic condition and can fluctuate over time. Some record of past employment or time spent unemployed would probably also be an important feature to add.

Overall, this algorithm to determine employment status based on demographic characteristics seems to insufficiently capture the factors that would influence employment and, more concerningly, would likely reinforce existing discrimination.