A Biased Employment Status Algorithm

Here we will examine some employment status data from Missouri and demonstrate that an algorithm trained on it exhibits bias by ethnicity.
Author

Jay-U Chung

Published

April 4, 2023

Bias Audit

In this blog post, we will audit our algorithm for allocative bias. We will examine employment status data from Missouri, focusing on the various self-reported ethnicities. We will then train an algorithm to predict employment status and examine any biases it might exhibit with respect to these groups.

Examining the Data

Our data can be obtained with the folktables package.

from folktables import ACSDataSource, ACSEmployment, BasicProblem, adult_filter
import numpy as np

STATE = "MO"

data_source = ACSDataSource(survey_year='2018', 
                            horizon='1-Year', 
                            survey='person')

acs_data = data_source.get_data(states=[STATE], download=True)

acs_data.head()
RT SERIALNO DIVISION SPORDER PUMA REGION ST ADJINC PWGTP AGEP ... PWGTP71 PWGTP72 PWGTP73 PWGTP74 PWGTP75 PWGTP76 PWGTP77 PWGTP78 PWGTP79 PWGTP80
0 P 2018GQ0000034 4 1 400 2 29 1013097 77 27 ... 123 128 134 70 77 74 70 70 16 139
1 P 2018GQ0000067 4 1 1001 2 29 1013097 73 42 ... 70 130 72 125 73 77 70 69 127 128
2 P 2018GQ0000097 4 1 600 2 29 1013097 85 20 ... 86 84 85 12 88 13 90 83 13 14
3 P 2018GQ0000113 4 1 1002 2 29 1013097 78 26 ... 80 79 6 75 78 6 77 76 78 78
4 P 2018GQ0000159 4 1 200 2 29 1013097 55 37 ... 11 57 101 56 57 108 54 13 13 13

5 rows × 286 columns

Here we select the possible features, most notably race (RAC1P) and sex (SEX).

possible_features=['AGEP', 'SCHL', 'MAR', 'RELP', 'DIS', 'ESP', 'CIT', 'MIG', 'MIL', 'ANC', 'NATIVITY', 'DEAR', 'DEYE', 'DREM', 'SEX', 'RAC1P', 'ESR']
acs_data[possible_features].head()
AGEP SCHL MAR RELP DIS ESP CIT MIG MIL ANC NATIVITY DEAR DEYE DREM SEX RAC1P ESR
0 27 17.0 5 16 2 NaN 1 3.0 4.0 4 1 2 2 2.0 2 1 6.0
1 42 19.0 5 16 2 NaN 1 3.0 4.0 1 1 2 2 2.0 1 2 6.0
2 20 19.0 5 17 2 NaN 1 1.0 4.0 2 1 2 2 2.0 1 1 1.0
3 26 17.0 5 16 1 NaN 1 1.0 4.0 1 1 2 2 1.0 1 2 6.0
4 37 16.0 5 16 1 NaN 1 3.0 4.0 1 1 2 1 1.0 1 2 6.0
target_features = ["ESR", "RAC1P"]
features_to_use = [f for f in possible_features if f not in target_features]
EmploymentProblem = BasicProblem(
    features=features_to_use,
    target='ESR',
    target_transform=lambda x: x == 1,
    group='RAC1P',
    preprocess=lambda x: x,
    postprocess=lambda x: np.nan_to_num(x, -1),
)

features, label, group = EmploymentProblem.df_to_numpy(acs_data)

We have now transformed our data so that it can be processed by a classification algorithm. Our task is to predict the employment status of civilians at work (excluding military). Of course, we split the data into training and testing sets, in this case with an 80-20 split.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features, label, group, test_size=0.2, random_state=0)

Let’s now examine some demographic characteristics of our training data.

import pandas as pd
df = pd.DataFrame(X_train, columns = features_to_use)
df["group"] = group_train
df["label"] = y_train
df_nums = df.groupby(["group"]).size().reset_index(name = "Number")
df_nums["group"] = ['White alone','Black or African American alone','American Indian alone','Alaska Native alone','American Indian and Alaska Native tribes specified, or American Indian or Alaska Native, not specified and no other races','Asian alone','Native Hawaiian and Other Pacific Islander alone',' Some Other Race alone','Two or More Races']
df_nums = df_nums.rename(index = {0:1,1:2,2:3,3:4,4:5,5:6,6:7,7:8,8:9})
df_nums['Percentage'] = df_nums['Number']/df.shape[0]*100.0
df_nums
group Number Percentage
1 White alone 43713 87.545061
2 Black or African American alone 3838 7.686454
3 American Indian alone 150 0.300409
4 Alaska Native alone 5 0.010014
5 American Indian and Alaska Native tribes speci... 27 0.054074
6 Asian alone 762 1.526075
7 Native Hawaiian and Other Pacific Islander alone 51 0.102139
8 Some Other Race alone 316 0.632861
9 Two or More Races 1070 2.142914

So we can see that the overwhelming majority of our data is White, followed by Black, Two or More Races, and Asian. For my analysis, I will focus only on these four groups, as the others have too small a sample size.
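
As a quick illustration (a sketch only, not used in the later code; the analysis below simply reports metrics for these group codes individually), the four largest groups correspond to RAC1P codes 1, 2, 6, and 9:

# Sketch: restrict attention to the four largest groups by RAC1P code
# (1 = White alone, 2 = Black or African American alone, 6 = Asian alone, 9 = Two or More Races).
main_groups = [1, 2, 6, 9]
in_main = np.isin(group_train, main_groups)
print("Share of training rows in the four largest groups:", round(in_main.mean(), 3))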

df_employed = (df.groupby(["group", "label"]).size()
                 .groupby("group", group_keys=False)
                 .apply(lambda x: 100 * x / x.sum())
                 .reset_index(name="Percentage")
                 .replace({1: 'White alone', 2: 'Black or African American alone',
                           3: 'American Indian alone', 4: 'Alaska Native alone',
                           5: 'American Indian and Alaska Native tribes specified, or American Indian or Alaska Native, not specified and no other races',
                           6: 'Asian alone', 7: 'Native Hawaiian and Other Pacific Islander alone',
                           8: 'Some Other Race alone', 9: 'Two or More Races'}))
df_employed = df_employed.rename(columns={'label': 'Employed'})

print("Overall employment rate: ",(df["label"] == 1.0).mean())
df_employed['Diff. with Overall Avg.'] = df_employed['Percentage'] - (df["label"] == 1.0).mean()*np.ones(df_employed.shape[0])*100
df_employed[df_employed["Employed"] == True]
Overall employment rate:  0.4467275494672755
group Employed Percentage Diff. with Overall Avg.
1 White alone True 45.439572 0.766817
3 Black or African American alone True 39.577905 -5.094850
5 American Indian alone True 46.000000 1.327245
7 Alaska Native alone True 60.000000 15.327245
9 American Indian and Alaska Native tribes speci... True 40.740741 -3.932014
11 Asian alone True 49.343832 4.671077
13 Native Hawaiian and Other Pacific Islander alone True 52.941176 8.268422
15 Some Other Race alone True 37.341772 -7.330983
17 Two or More Races True 29.906542 -14.766213

So in comparison to the overall average, White individuals are employed at about the same rate, Black individuals at a lower rate, individuals of Two or More Races at a much lower rate, and Asian individuals at a slightly higher rate.

import seaborn as sb
df_plot = df.groupby(["group","SEX","label"]).size().groupby(["group","SEX"], group_keys=False).apply(lambda x: 100 * x / x.sum()).reset_index(name = "Percentage")
df_plot['SEX'] = df_plot['SEX'].replace([1.0,2.0],['Male','Female'])

sb.barplot(x="group",
           y="Percentage",
           hue="SEX",
           data=df_plot[df_plot["label"]==True])
[Figure: bar plot of employment rate (Percentage, y-axis) by racial group code (group, x-axis), with bars split by SEX.]

Plotting the employment rates by sex, we can see clear disparities between male and female employment within the groups.

For White and Asian individuals, males are more likely to be employed than females. For Black individuals and those from Two or More races, females are more likely to be employed.
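
For reference, here is a small sketch (reusing the df_plot frame computed above) that prints the numbers behind the bar plot for the four largest groups:

# Sketch: print the employment rates by sex underlying the bar plot, restricted to the
# four largest groups (RAC1P codes 1, 2, 6, 9). Group codes are left numeric here.
employed_by_sex = df_plot[df_plot["label"] == True]
print(employed_by_sex[employed_by_sex["group"].isin([1, 2, 6, 9])])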

Training an Algorithm to Predict Employment Status

The main classification algorithms we have on hand from scikit-learn are Logistic Regression, the Support Vector Machine, the Decision Tree Classifier, and the Random Forest Classifier.

Each model has a parameter that must be tuned: for Logistic Regression, the degree of the polynomial features; for the Support Vector Machine, the regularization parameter; and for the Decision Tree and Random Forest, the maximum depth. I define a function below that takes a model, tunes this parameter using cross-validation, and outputs the optimal value.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

def train_opt_model(input_model, kwargs, params, X_train, y_train, X_test, y_test):
    # Tune the single hyperparameter named in kwargs over the candidate values in params,
    # choosing the value with the best 5-fold cross-validation score on the training data.
    keys = list(kwargs.keys())
    best_score = 0.0
    best_param = params[0]
    
    for param in params:
        kwargs[keys[0]] = param #should be only one argument in this case
        model = make_pipeline(StandardScaler(), input_model(**kwargs))
        train_score = cross_val_score(model, X_train, y_train, cv=5).mean()
        if (train_score > best_score):
            best_score = train_score
            best_param = param
                              
    print("Best model parameter: ",best_param)    
    kwargs[keys[0]] = best_param
    best_model = make_pipeline(StandardScaler(), input_model(**kwargs))
    best_model.fit(X_train,y_train)
    print("Best model training score: ",best_model.score(X_train,y_train))
    print("Best model test score: ",best_model.score(X_test,y_test))
    return kwargs
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

def poly_LR(degree, **kwargs):
    plr = Pipeline([("poly", PolynomialFeatures(degree = degree)),
                    ("LR", LogisticRegression(**kwargs))])
    return plr

params = [i for i in range(3,4)]
kwargs = {"degree":params[0],"max_iter":5000}
kwargs = train_opt_model(poly_LR, kwargs, params, X_train, y_train, X_test, y_test)
model = make_pipeline(StandardScaler(), poly_LR(**kwargs))
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
print("Overall employment accuracy: ",(y_hat == y_test).mean())
print("Employment accuracy for white: ",(y_hat == y_test)[group_test == 1].mean())
print("Employment accuracy for black:  ",(y_hat == y_test)[group_test == 2].mean())
from sklearn.svm import SVC
params = [0.1]
kwargs = {"C":params[0]}
kwargs = train_opt_model(SVC, kwargs, params, X_train, y_train, X_test, y_test)
model = make_pipeline(StandardScaler(), SVC(**kwargs))
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
print("Overall employment accuracy: ",(y_hat == y_test).mean())
print("Employment accuracy for white: ",(y_hat == y_test)[group_test == 1].mean())
print("Employment accuracy for black:  ",(y_hat == y_test)[group_test == 2].mean())

confusion_matrix_SVC = confusion_matrix(y_test, y_hat, normalize='true')
print("False positive rate: ", confusion_matrix_SVC[0][1])
print("False negative rate: ", confusion_matrix_SVC[1][0])
CM = confusion_matrix(y_test, y_hat)
print("PPV: ", CM[1][1] / (CM[1][1] + CM[0][1]))  # positive predictive value

confusion_matrix_SVC = confusion_matrix(y_test[group_test == 1], y_hat[group_test == 1], normalize='true')
print("White False positive rate: ", confusion_matrix_SVC[0][1])
print("White False negative rate: ", confusion_matrix_SVC[1][0])
CM = confusion_matrix(y_test[group_test == 1], y_hat[group_test == 1])
print("White PPV: ", CM[1][1] / (CM[1][1] + CM[0][1]))

confusion_matrix_SVC = confusion_matrix(y_test[group_test == 2], y_hat[group_test == 2], normalize='true')
print("Black False positive rate: ", confusion_matrix_SVC[0][1])
print("Black False negative rate: ", confusion_matrix_SVC[1][0])
CM = confusion_matrix(y_test[group_test == 2], y_hat[group_test == 2])
print("Black PPV: ", CM[1][1] / (CM[1][1] + CM[0][1]))

A spoiler for these two algorithms: I tried running them on my machine, and they both took an extraordinarily long time! Testing with a few parameter values also yields training scores similar to the algorithm we will use anyway.
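
If one did want a rough read on these slower models, a minimal sketch (not part of the runs in this post; the subsample size and C value are illustrative, not tuned) would be to fit them on a random subsample of the training data first:

# Hypothetical sketch: fit the SVM on a 5,000-row random subsample just to gauge behavior.
rng = np.random.default_rng(0)
idx = rng.choice(X_train.shape[0], size=5000, replace=False)
quick_svc = make_pipeline(StandardScaler(), SVC(C=0.1))
quick_svc.fit(X_train[idx], y_train[idx])
print("Subsampled SVC test score:", round(quick_svc.score(X_test, y_test), 3))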

from sklearn.tree import DecisionTreeClassifier
params = [i for i in range(5,15)]
kwargs = {"max_depth":params[0]}
kwargs = train_opt_model(DecisionTreeClassifier, kwargs, params, X_train, y_train, X_test, y_test)
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(**kwargs))
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
print("Overall employment accuracy: ",(y_hat == y_test).mean())
print("Employment accuracy for white: ",(y_hat == y_test)[group_test == 1].mean())
print("Employment accuracy for black:  ",(y_hat == y_test)[group_test == 2].mean())
Best model parameter:  9
Best model training score:  0.8429263798766322
Best model test score:  0.8404357577699455
Overall employment accuracy:  0.8403556552387056
Employment accuracy for white:  0.8408361602640506
Employment accuracy for black:   0.841031149301826
from sklearn.ensemble import RandomForestClassifier
params = [i for i in range(5,15)]
kwargs = {"max_depth":params[0]}
kwargs = train_opt_model(RandomForestClassifier, kwargs, params, X_train, y_train, X_test, y_test)
model = make_pipeline(StandardScaler(), RandomForestClassifier(**kwargs))
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
print("Overall employment accuracy: ",(y_hat == y_test).mean())
print("Employment accuracy for white: ",(y_hat == y_test)[group_test == 1].mean())
print("Employment accuracy for black:  ",(y_hat == y_test)[group_test == 2].mean())
Best model parameter:  14
Best model training score:  0.8644356324601458
Best model test score:  0.8415571932073054
Overall employment accuracy:  0.8416372957385453
Employment accuracy for white:  0.8411112129824884
Employment accuracy for black:   0.8431793770139635

We can see that the training scores for the Decision Tree and Random Forest algorithms are pretty similar. The Random Forest does have slightly better training and test scores, but I will opt to use the Decision Tree because it is faster. Moreover, I think it would not be ethical to have randomness in an algorithm that predicts employment status.

Analyzing our Employment Classification Algorithm

We can now define a function that will make some statistics easier to calculate, and then run our Decision Tree with a max depth of 9.

def get_statistics(y_hat, y_test, group_test=None, group_test_val=None):
    # Restrict to a single group if a group value is supplied; otherwise use everyone.
    if group_test_val is None:
        y_t = y_test
        y_h = y_hat
    else:
        y_t = y_test[group_test == group_test_val]
        y_h = y_hat[group_test == group_test_val]

    CM = confusion_matrix(y_t, y_h)

    p = (1 * y_t).mean()                     # prevalence: fraction of the group actually employed
    PPV = CM[1][1] / (CM[1][1] + CM[0][1])   # positive predictive value

    FPR = CM[0][1] / (CM[0][1] + CM[0][0])   # false positive rate
    FNR = CM[1][0] / (CM[1][0] + CM[1][1])   # false negative rate

    # Calibration: fraction actually employed among those given score 0 / score 1.
    calibration_0 = ((y_t == 1) * (y_h == 0)).sum() / (y_h == 0).sum()
    calibration_1 = ((y_t == 1) * (y_h == 1)).sum() / (y_h == 1).sum()

    parity = (y_h > 0).mean()                # fraction given a positive prediction (threshold 0)

    return p, PPV, FPR, FNR, calibration_0, calibration_1, parity
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth = 9))
model.fit(X_train, y_train)
y_hat = model.predict(X_test)

p, PPV, FPR, FNR, calibration_0, calibration_1, parity = get_statistics(y_hat,y_test)
print("Overall accuracy: ",round((y_hat == y_test).mean(),3))
print("Overall prevalence: ",round(p,3))
print("Overall False positive rate: ",round(FPR,3))
print("Overall False negative rate: ",round(FNR,3))
print("Overall Calibration (Score 0): ",round(calibration_0,3))
print("Overall Calibration (Score 1): ",round(calibration_1,3))
print("Overall Parity: ",round(parity,3),"\n")

white_p, white_PPV, white_FPR, white_FNR, white_calibration_0, white_calibration_1, white_parity = get_statistics(y_hat,y_test,group_test,1)
print("Accuracy for White: ",round((y_hat == y_test)[group_test == 1].mean(),3))
print("White prevalence: ",round(white_p,3))
print("White False positive rate: ",round(white_FPR,3))
print("White False negative rate: ",round(white_FNR,3))
print("Calibration (Score 0): ",round(white_calibration_0,3))
print("Calibration (Score 1): ",round(white_calibration_1,3))
print("Parity: ",round(white_parity,3),"\n")

black_p, black_PPV, black_FPR, black_FNR, black_calibration_0, black_calibration_1, black_parity = get_statistics(y_hat,y_test,group_test,2)
print("Accuracy for Black:  ",round((y_hat == y_test)[group_test == 2].mean(),3))
print("Black prevalence: ",round(black_p,3))
print("Black False positive rate: ",round(black_FPR,3))
print("Black False negative rate: ",round(black_FNR,3))
print("Calibration (Score 0): ",round(black_calibration_0,3))
print("Calibration (Score 1): ",round(black_calibration_1,3))
print("Parity: ", round(black_parity,3),"\n")

two_p, two_PPV, two_FPR, two_FNR, two_calibration_0, two_calibration_1, two_parity = get_statistics(y_hat,y_test,group_test,9)
print("Accuracy for Two or More Races:  ",round((y_hat == y_test)[group_test == 9].mean(),3))
print("Two or More Races prevalence: ",round(two_p,3))
print("Two or More Races False positive rate: ",round(two_FPR,3))
print("Two or More Races False negative rate: ",round(two_FNR,3))
print("Calibration (Score 0): ",round(two_calibration_0,3))
print("Calibration (Score 1): ",round(two_calibration_1,3))
print("Parity: ", round(two_parity,3),"\n")

asian_p, asian_PPV, asian_FPR, asian_FNR, asian_calibration_0, asian_calibration_1, asian_parity = get_statistics(y_hat,y_test,group_test,6)
print("Accuracy for Asian:  ",round((y_hat == y_test)[group_test == 6].mean(),3))
print("Asian prevalence: ",round(asian_p,3))
print("Asian False positive rate: ",round(asian_FPR,3))
print("Asian False negative rate: ",round(asian_FNR,3))
print("Calibration (Score 0): ",round(asian_calibration_0,3))
print("Calibration (Score 1): ",round(asian_calibration_1,3))
print("Parity: ", round(asian_parity,3))
Overall accuracy:  0.841
Overall prevalence:  0.453
Overall False positive rate:  0.177
Overall False negative rate:  0.138
Overall Calibration (Score 0):  0.122
Overall Calibration (Score 1):  0.801
Overall Parity:  0.488 

Accuracy for White:  0.841
White prevalence:  0.462
White False positive rate:  0.178
White False negative rate:  0.136
Calibration (Score 0):  0.125
Calibration (Score 1):  0.806
Parity:  0.495 

Accuracy for Black:   0.841
Black prevalence:  0.376
Black False positive rate:  0.17
Black False negative rate:  0.14
Calibration (Score 0):  0.092
Calibration (Score 1):  0.752
Parity:  0.43 

Accuracy for Two or More Races:   0.825
Two or More Races prevalence:  0.326
Two or More Races False positive rate:  0.151
Two or More Races False negative rate:  0.226
Calibration (Score 0):  0.114
Calibration (Score 1):  0.713
Parity:  0.354 

Accuracy for Asian:   0.844
Asian prevalence:  0.482
Asian False positive rate:  0.212
Asian False negative rate:  0.095
Calibration (Score 0):  0.101
Calibration (Score 1):  0.798
Parity:  0.546

To explain these quantities: accuracy measures the fraction of correct predictions; prevalence is the fraction of the group that is actually employed; calibration measures the fraction actually employed among those given a particular score (within the given group); and statistical parity measures the fraction of individuals given a positive score.
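
For reference, here is a short sketch (using the overall test-set predictions from the decision tree above) that writes these quantities out explicitly in terms of the confusion matrix entries; it simply restates what get_statistics computes.

# The quantities returned by get_statistics, written out in terms of the overall
# test-set confusion matrix (rows: actual, columns: predicted).
TN, FP, FN, TP = confusion_matrix(y_test, y_hat).ravel()
n = TN + FP + FN + TP
print("Prevalence (TP + FN) / n:          ", round((TP + FN) / n, 3))
print("False positive rate FP / (FP + TN):", round(FP / (FP + TN), 3))
print("False negative rate FN / (FN + TP):", round(FN / (FN + TP), 3))
print("Calibration at score 0 FN/(FN+TN): ", round(FN / (FN + TN), 3))
print("Calibration at score 1 TP/(TP+FP): ", round(TP / (TP + FP), 3))
print("Statistical parity (TP + FP) / n:  ", round((TP + FP) / n, 3))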

Notice that the overall employment accuracy is similar amongst all groups.

Comparing the White and Black populations, we can see that the false positive and false negative rates closely agree. However, the scores are not calibrated across these two groups: the calibration for score 0 is roughly 35 percent larger in relative terms for White individuals (0.125 vs. 0.092), and the calibration for score 1 (the positive predictive value) is about 7 percent larger for White individuals (0.806 vs. 0.752). The statistical parity, however, is fairly similar (0.495 vs. 0.430). So, while there is error rate balance and approximate statistical parity for these groups, the algorithm is not calibrated.

Notice that for individuals of Two or More Races and for Asian individuals, the prevalences differ significantly from both the White and Black populations. For individuals of Two or More Races, the false negative rate is significantly higher and the calibration lower than for the White and Black groups, and the parity is also significantly lower. For Asian individuals, the false positive rate is much higher, the false negative rate is much lower, and the statistical parity is significantly higher than in the White and Black groups. Since these groups are not as large as the others, these trends may not be as reliable.

Perhaps overall it could be argued that the calibration is similar enough amongst groups. However, clearly the algorithm is not error rate balanced.
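
To make the comparison concrete, here is a small sketch that gathers the per-group rates computed above into one table; it only reuses variables already defined for the same four groups as before.

# Sketch: collect the per-group error rates and calibration at score 1 computed above.
rates = pd.DataFrame({
    "FPR": [white_FPR, black_FPR, asian_FPR, two_FPR],
    "FNR": [white_FNR, black_FNR, asian_FNR, two_FNR],
    "Calibration (Score 1)": [white_calibration_1, black_calibration_1,
                              asian_calibration_1, two_calibration_1],
}, index=["White", "Black", "Asian", "Two or More Races"])
print(rates.round(3))
print("Largest FPR gap:", round(rates["FPR"].max() - rates["FPR"].min(), 3))
print("Largest FNR gap:", round(rates["FNR"].max() - rates["FNR"].min(), 3))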

Conclusions

Ultimately, the benefit of using these employment status labels would depend on how they are used. Perhaps someone would use this algorithm to predict whether individuals of certain backgrounds need unemployment aid. Perhaps a company would use it to decide whether an individual is worth hiring.

In any case, this algorithm does not work to combat structural discrimination. Notice that for individuals who are Black or of Two or More Races, the prevalences are markedly lower than for White and Asian individuals. This is likely to be self-perpetuating, as communities with higher unemployment are more likely to remain so. Even if the error rates were exactly the same amongst groups, this would reinforce the existing distribution of employment. In fact, the algorithm would be likely to falsely predict that an Asian individual is employed and that an individual of Two or More Races is not employed. I also find it interesting that Asian individuals have a much higher false positive rate; perhaps it is a manifestation of the Model Minority myth (which may stem from higher education rates amongst Asian groups). Using the examples above, an Asian individual could then mistakenly receive less unemployment aid, and a company using this as a hiring proxy could underhire those of Two or More Races.

I think that the idea of using solely demographic information to determine whether or not an individual is employed is also flawed. It misses any indication of the state of the economy and of the opportunities a person may have to find a job. An extra feature accounting for the wealth of the surrounding community could help address this. There are also no features that account for time: the data are specific to the year the survey was taken, yet unemployment is generally not a chronic condition and can fluctuate over time. Some record of past employment or time spent unemployed would probably also be an important feature to add.

Overall, this algorithm to determine employment status based on demographic characteristics seems to insufficiently capture the factors that would influence employment and, more concerningly, would likely reinforce existing discrimination.