Here we examine employment status data from Missouri and demonstrate that an algorithm trained on it exhibits bias by ethnicity.
Author
Jay-U Chung
Published
April 4, 2023
Bias Audit
In this blog post, we will audit our algorithm for allocative bias. We will examine data on employment status in Missouri, focusing on the various self-reported ethnicities. We will then train an algorithm to predict employment status and examine any biases it might exhibit with respect to these groups.
Examining the Data
Our data can be obtained with the folktables package.
```python
from folktables import ACSDataSource, ACSEmployment, BasicProblem, adult_filter
import numpy as np

STATE = "MO"

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=[STATE], download=True)
acs_data.head()
```
|   | RT | SERIALNO | DIVISION | SPORDER | PUMA | REGION | ST | ADJINC | PWGTP | AGEP | ... | PWGTP71 | PWGTP72 | PWGTP73 | PWGTP74 | PWGTP75 | PWGTP76 | PWGTP77 | PWGTP78 | PWGTP79 | PWGTP80 |
|---|----|----------|----------|---------|------|--------|----|--------|-------|------|-----|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| 0 | P | 2018GQ0000034 | 4 | 1 | 400 | 2 | 29 | 1013097 | 77 | 27 | ... | 123 | 128 | 134 | 70 | 77 | 74 | 70 | 70 | 16 | 139 |
| 1 | P | 2018GQ0000067 | 4 | 1 | 1001 | 2 | 29 | 1013097 | 73 | 42 | ... | 70 | 130 | 72 | 125 | 73 | 77 | 70 | 69 | 127 | 128 |
| 2 | P | 2018GQ0000097 | 4 | 1 | 600 | 2 | 29 | 1013097 | 85 | 20 | ... | 86 | 84 | 85 | 12 | 88 | 13 | 90 | 83 | 13 | 14 |
| 3 | P | 2018GQ0000113 | 4 | 1 | 1002 | 2 | 29 | 1013097 | 78 | 26 | ... | 80 | 79 | 6 | 75 | 78 | 6 | 77 | 76 | 78 | 78 |
| 4 | P | 2018GQ0000159 | 4 | 1 | 200 | 2 | 29 | 1013097 | 55 | 37 | ... | 11 | 57 | 101 | 56 | 57 | 108 | 54 | 13 | 13 | 13 |

5 rows × 286 columns
Here we select the possible features, most notably race (RAC1P) and sex (SEX).
We then transform the data so that it can be processed by a classification algorithm. Our task is to predict the employment status of civilians at work (excluding military). Of course, we split the data into training and testing sets, in this case with an 80-20 split; a sketch of this setup is shown below.
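The exact preprocessing code is not shown above, so here is a minimal sketch of how the features, labels, and group memberships might be extracted and split. It assumes the predefined `ACSEmployment` problem imported from folktables above (which targets ESR == 1 and groups by RAC1P); the `random_state` is my own choice. The `df` DataFrame and `group_test` array are used throughout the analysis below.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Extract the feature matrix, boolean employment label, and RAC1P group code.
features, label, group = ACSEmployment.df_to_numpy(acs_data)

# Collect everything into one DataFrame for the exploratory tables below.
df = pd.DataFrame(features, columns=ACSEmployment.features)
df["label"] = label
df["group"] = group

# 80-20 train-test split, keeping the group codes aligned with their rows.
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features, label, group, test_size=0.2, random_state=0)
```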
```python
df_nums = df.groupby(["group"]).size().reset_index(name="Number")
df_nums["group"] = ['White alone',
                    'Black or African American alone',
                    'American Indian alone',
                    'Alaska Native alone',
                    'American Indian and Alaska Native tribes specified, or American Indian or Alaska Native, not specified and no other races',
                    'Asian alone',
                    'Native Hawaiian and Other Pacific Islander alone',
                    'Some Other Race alone',
                    'Two or More Races']
df_nums = df_nums.rename(index={0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9})
df_nums['Percentage'] = df_nums['Number'] / df.shape[0] * 100.0
df_nums
```
|   | group | Number | Percentage |
|---|-------|--------|------------|
| 1 | White alone | 43713 | 87.545061 |
| 2 | Black or African American alone | 3838 | 7.686454 |
| 3 | American Indian alone | 150 | 0.300409 |
| 4 | Alaska Native alone | 5 | 0.010014 |
| 5 | American Indian and Alaska Native tribes speci... | 27 | 0.054074 |
| 6 | Asian alone | 762 | 1.526075 |
| 7 | Native Hawaiian and Other Pacific Islander alone | 51 | 0.102139 |
| 8 | Some Other Race alone | 316 | 0.632861 |
| 9 | Two or More Races | 1070 | 2.142914 |
So we can see that an overwhelming majority (about 87.5%) of our data is White alone, followed by Black, Two or More Races, and Asian. For my analysis, I will focus only on these four groups, as the others have sample sizes too small to support reliable conclusions.
```python
df_employed = (df.groupby(["group", "label"]).size()
                 .groupby("group", group_keys=False)
                 .apply(lambda x: 100 * x / x.sum())
                 .reset_index(name="Percentage")
                 .replace({1: 'White alone',
                           2: 'Black or African American alone',
                           3: 'American Indian alone',
                           4: 'Alaska Native alone',
                           5: 'American Indian and Alaska Native tribes specified, or American Indian or Alaska Native, not specified and no other races',
                           6: 'Asian alone',
                           7: 'Native Hawaiian and Other Pacific Islander alone',
                           8: 'Some Other Race alone',
                           9: 'Two or More Races'}))
df_employed = df_employed.rename(columns={'label': 'Employed'})
print("Overall employment rate: ", (df["label"] == 1.0).mean())
df_employed['Diff. with Overall Avg.'] = (df_employed['Percentage']
                                          - (df["label"] == 1.0).mean() * np.ones(df_employed.shape[0]) * 100)
df_employed[df_employed["Employed"] == True]
```
Overall employment rate: 0.4467275494672755
|    | group | Employed | Percentage | Diff. with Overall Avg. |
|----|-------|----------|------------|-------------------------|
| 1  | White alone | True | 45.439572 | 0.766817 |
| 3  | Black or African American alone | True | 39.577905 | -5.094850 |
| 5  | American Indian alone | True | 46.000000 | 1.327245 |
| 7  | Alaska Native alone | True | 60.000000 | 15.327245 |
| 9  | American Indian and Alaska Native tribes speci... | True | 40.740741 | -3.932014 |
| 11 | Asian alone | True | 49.343832 | 4.671077 |
| 13 | Native Hawaiian and Other Pacific Islander alone | True | 52.941176 | 8.268422 |
| 15 | Some Other Race alone | True | 37.341772 | -7.330983 |
| 17 | Two or More Races | True | 29.906542 | -14.766213 |
So, in comparison to the overall average, White individuals are employed at about the same rate, Black individuals at a lower rate, individuals of Two or More Races at a much lower rate, and Asian individuals at a slightly higher rate.
```python
import seaborn as sb

df_plot = (df.groupby(["group", "SEX", "label"]).size()
             .groupby(["group", "SEX"], group_keys=False)
             .apply(lambda x: 100 * x / x.sum())
             .reset_index(name="Percentage"))
df_plot['SEX'] = df_plot['SEX'].replace([1.0, 2.0], ['Male', 'Female'])
sb.barplot(x="group", y="Percentage", hue="SEX", data=df_plot[df_plot["label"] == True])
```
Plotting the employment rates by sex, we can see clear disparities between male and female employment within each group.
For White and Asian individuals, males are more likely to be employed than females; for Black individuals and those of Two or More Races, females are more likely to be employed.
Training an Algorithm to Predict Employment Status
The main classification algorithms we have on hand from scikit-learn are Logistic Regression, the Support Vector Machine, the Decision Tree Classifier, and the Random Forest Classifier.
Each model has a parameter that must be tuned: for Logistic Regression, the degree of the polynomial features; for the Support Vector Machine, the regularization parameter; and for the Decision Tree and Random Forest, the maximum depth. I define a function below that takes a model, tunes this parameter using cross-validation, and outputs the optimal value.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

def train_opt_model(input_model, kwargs, params, X_train, y_train, X_test, y_test):
    keys = list(kwargs.keys())
    best_score = 0.0
    best_param = params[0]
    for param in params:
        kwargs[keys[0]] = param  # should be only one argument in this case
        model = make_pipeline(StandardScaler(), input_model(**kwargs))
        train_score = cross_val_score(model, X_train, y_train, cv=5).mean()
        if train_score > best_score:
            best_score = train_score
            best_param = param
    print("Best model parameter: ", best_param)
    kwargs[keys[0]] = best_param
    best_model = make_pipeline(StandardScaler(), input_model(**kwargs))
    best_model.fit(X_train, y_train)
    print("Best model training score: ", best_model.score(X_train, y_train))
    print("Best model test score: ", best_model.score(X_test, y_test))
    return kwargs
```
A spoiler for the first two algorithms, Logistic Regression and the Support Vector Machine: I tried running them on my machine, and they both took an extraordinarily long time! Testing with a few parameter values also yielded training scores similar to the algorithms we will use anyway, so I omit them below.
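For reference, here is a rough sketch of what one of those omitted tuning runs might have looked like; the grid of values for the regularization parameter C is my own assumption, and this is the kind of call that proved too slow in practice:

```python
from sklearn.svm import SVC

# Hypothetical tuning run for the Support Vector Machine (not executed here):
# sweep the regularization parameter C using the same helper as below.
params = [0.01, 0.1, 1.0, 10.0]  # assumed grid
kwargs = {"C": params[0]}
kwargs = train_opt_model(SVC, kwargs, params, X_train, y_train, X_test, y_test)
```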
```python
from sklearn.tree import DecisionTreeClassifier

params = [i for i in range(5, 15)]
kwargs = {"max_depth": params[0]}
kwargs = train_opt_model(DecisionTreeClassifier, kwargs, params, X_train, y_train, X_test, y_test)

model = make_pipeline(StandardScaler(), DecisionTreeClassifier(**kwargs))
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
print("Overall employment accuracy: ", (y_hat == y_test).mean())
print("Employment accuracy for white: ", (y_hat == y_test)[group_test == 1].mean())
print("Employment accuracy for black: ", (y_hat == y_test)[group_test == 2].mean())
```
Best model parameter: 9
Best model training score: 0.8429263798766322
Best model test score: 0.8404357577699455
Overall employment accuracy: 0.8403556552387056
Employment accuracy for white: 0.8408361602640506
Employment accuracy for black: 0.841031149301826
```python
from sklearn.ensemble import RandomForestClassifier

params = [i for i in range(5, 15)]
kwargs = {"max_depth": params[0]}
kwargs = train_opt_model(RandomForestClassifier, kwargs, params, X_train, y_train, X_test, y_test)

model = make_pipeline(StandardScaler(), RandomForestClassifier(**kwargs))
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
print("Overall employment accuracy: ", (y_hat == y_test).mean())
print("Employment accuracy for white: ", (y_hat == y_test)[group_test == 1].mean())
print("Employment accuracy for black: ", (y_hat == y_test)[group_test == 2].mean())
```
Best model parameter: 14
Best model training score: 0.8644356324601458
Best model test score: 0.8415571932073054
Overall employment accuracy: 0.8416372957385453
Employment accuracy for white: 0.8411112129824884
Employment accuracy for black: 0.8431793770139635
We can see that the scores for the Decision Tree and Random Forest algorithms are quite similar. The Random Forest does have slightly better training and test scores, but I will opt for the Decision Tree because it is faster. Moreover, I think it would not be ethical to have randomness in an algorithm that predicts employment status.
Analyzing our Employment Classification Algorithm
We can now define a function that makes some statistics easier to calculate, and run our Decision Tree with a max depth of 9.
```python
def get_statistics(y_hat, y_test, group_test=None, group_test_val=None):
    if group_test_val is None:
        y_t = y_test
        y_h = y_hat
    else:
        y_t = y_test[group_test == group_test_val]
        y_h = y_hat[group_test == group_test_val]
    CM = confusion_matrix(y_t, y_h)
    p = (1 * y_t).mean()  # mean of how many in group are employed
    PPV = CM[1][1] / (CM[1][1] + CM[0][1])
    FPR = CM[0][1] / (CM[0][1] + CM[0][0])
    FNR = CM[1][0] / (CM[1][0] + CM[1][1])
    calibration_0 = ((y_t == 1) * (y_h == 0)).sum() / (y_h == 0).sum()
    calibration_1 = ((y_t == 1) * (y_h == 1)).sum() / (y_h == 1).sum()
    parity = (y_h > 0).mean()  # threshold taken to be 0
    return p, PPV, FPR, FNR, calibration_0, calibration_1, parity
```
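The driver code that produced the printout below is not shown, so here is a minimal sketch of how it might look, assuming the RAC1P codes used above (1 = White, 2 = Black, 6 = Asian, 9 = Two or More Races); the exact print formatting differed slightly for the overall rows:

```python
# Hypothetical driver for the statistics printed below; the group codes
# follow the RAC1P encoding used earlier in this post.
groups = {None: "Overall", 1: "White", 2: "Black", 9: "Two or More Races", 6: "Asian"}
for code, name in groups.items():
    correct = (y_hat == y_test) if code is None else (y_hat == y_test)[group_test == code]
    p, PPV, FPR, FNR, cal0, cal1, parity = get_statistics(y_hat, y_test, group_test, code)
    print(f"Accuracy for {name}: {round(correct.mean(), 3)}")
    print(f"{name} prevalence: {round(p, 3)}")
    print(f"{name} False positive rate: {round(FPR, 3)}")
    print(f"{name} False negative rate: {round(FNR, 3)}")
    print(f"Calibration (Score 0): {round(cal0, 3)}")
    print(f"Calibration (Score 1): {round(cal1, 3)}")
    print(f"Parity: {round(parity, 3)}")
```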
Overall accuracy: 0.841
Overall prevalence: 0.453
Overall False positive rate: 0.177
Overall False negative rate: 0.138
Overall Calibration (Score 0): 0.122
Overall Calibration (Score 1): 0.801
Overall Parity: 0.488
Accuracy for White: 0.841
White prevalence: 0.462
White False positive rate: 0.178
White False negative rate: 0.136
Calibration (Score 0): 0.125
Calibration (Score 1): 0.806
Parity: 0.495
Accuracy for Black: 0.841
Black prevalence: 0.376
Black False positive rate: 0.17
Black False negative rate: 0.14
Calibration (Score 0): 0.092
Calibration (Score 1): 0.752
Parity: 0.43
Accuracy for Two or More Races: 0.825
Two or More Races prevalence: 0.326
Two or More Races False positive rate: 0.151
Two or More Races False negative rate: 0.226
Calibration (Score 0): 0.114
Calibration (Score 1): 0.713
Parity: 0.354
Accuracy for Asian: 0.844
Asian prevalence: 0.482
Asian False positive rate: 0.212
Asian False negative rate: 0.095
Calibration (Score 0): 0.101
Calibration (Score 1): 0.798
Parity: 0.546
To explain these statistics: the accuracy measures the fraction of predictions that were correct; the prevalence is the fraction of the group that is actually employed; the calibration measures the fraction actually employed among those given a particular score (within the given group); and statistical parity measures the fraction of the group given a score above 0.
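In symbols, writing $Y$ for the true label and $\hat{Y}$ for the predicted score, the quantities computed in `get_statistics` are:

$$
\begin{aligned}
\text{prevalence} &= P(Y = 1), \\
\text{FPR} = P(\hat{Y} = 1 \mid Y = 0), &\qquad \text{FNR} = P(\hat{Y} = 0 \mid Y = 1), \\
\text{calibration}(s) = P(Y = 1 \mid \hat{Y} = s), &\qquad \text{parity} = P(\hat{Y} = 1),
\end{aligned}
$$

with all probabilities taken within the group under consideration.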
Notice that the overall employment accuracy is similar amongst all groups.
Comparing the White and Black populations, we can see that the false positive and false negative rates closely agree. However, the scores are not calibrated: the calibration for score 0 is about a third larger for White individuals (0.125 vs. 0.092), and the calibration for score 1 is about 7 percent larger for White individuals (0.806 vs. 0.752). The statistical parity, however, is quite similar. So, while there is error rate balance and statistical parity for these two groups, the algorithm is not calibrated between them.
Notice that for individuals of Two or More Races and for Asian individuals, the prevalences differ significantly from both the White and Black populations. For individuals of Two or More Races, the false negative rate is significantly higher and the calibration lower than for the White and Black groups; the parity is also significantly lower. For Asian individuals, the false positive rate is much higher, the false negative rate much lower, and the statistical parity significantly higher than in the White and Black groups. Since these groups are not as large as the others, these trends may not be as reliable.
Perhaps, overall, it could be argued that the calibration is similar enough amongst groups. However, the algorithm is clearly not error rate balanced: the false negative rate ranges from 0.095 for Asian individuals to 0.226 for individuals of Two or More Races.
Conclusions
Ultimately, the benefit of using these predicted employment labels would depend on their usage. Perhaps someone would use this algorithm to predict whether individuals of certain backgrounds need unemployment aid. Perhaps a company would use it to decide whether an individual is worth hiring.
In any case, this algorithm does nothing to combat structural discrimination. Notice that for individuals who are Black or of Two or More Races, the prevalences are markedly lower than for White and Asian individuals. This is likely to self-perpetuate, as communities with higher unemployment are more likely to remain so. Even if the error rates were exactly the same amongst groups, the algorithm would reinforce the existing distribution of employment. In fact, it would likely falsely predict that an Asian individual is employed and that an individual of Two or More Races is not. I also find it interesting that Asian individuals have a much higher false positive rate; perhaps it is a manifestation of the Model Minority myth (which may stem from higher rates of education amongst Asian groups). Using the examples above, an Asian individual could then mistakenly receive less unemployment aid, and a company using this as a hiring proxy could underhire those of Two or More Races.
I think the idea of using solely demographic information to determine whether or not an individual is employed is also flawed. It misses any indication of the state of the economy and the opportunities a person may have to find a job; an extra feature accounting for the wealth of the surrounding community could help address this. There are also no features that address time: the data is specific to the year the survey was taken, yet unemployment is generally not chronic and can fluctuate over time. Some record of past employment or time spent unemployed would probably also be an important feature to add.
Overall, this algorithm for determining employment status from demographic characteristics seems to capture too few of the factors that influence employment and, more concerningly, would likely reinforce existing discrimination.