Determining Race from Chest X-Rays

We used deep learning to guess a patient’s race based on their chest X-ray.
Authors

Trong Le, Jay-U Chung, Kent Canonigo

Published

May 10, 2023

The GitHub repository for the project code can be found here.

1. Abstract

A series of prior results, including those by Gichoya et al. (2022), have shown that deep convolutional neural networks can predict a patient’s self-reported race from chest radiographs with high accuracy. Since this result raises significant ethical concerns for medical imaging algorithms, we aim to reproduce it and investigate the implications such an algorithm could have for race-based medicine and for the racial inequalities reinforced by algorithms. We use a subset of the chest radiographs from the CheXpert dataset, aiming to classify images as Black, White, or Asian. We primarily train and test on a subset of the data with equal proportions of all three races. In particular, we compare the results of pretrained and untrained ResNet18 models and the EfficientNetB0 model. Our best models achieve around 70% accuracy, displaying some racial bias and minimal gender bias. We therefore conclude that, on a smaller scale, we have confirmed that it is indeed possible to train neural networks to classify race from chest radiographs.

2. Introduction

Deep neural networks are increasingly popular in medicine as diagnostic tools. While at times surpassing the accuracy of experts, results such as those by Seyyed-Kalantari et al. (2021) show concerning rates of underdiagnosis for patients who are Black, Hispanic, younger, or belong to lower socioeconomic status groups. Problematically, this reinforces a history of minority and economically vulnerable groups receiving inadequate medical care, especially when many publicly available datasets disproportionately represent White patients.

As Seyyed-Kalantari et al. (2021) suggest, this may be a matter of confounding variables such as bias amplification or differing disease prevalence. However, a paper by Gichoya et al. (2022) investigated a more direct question: can race be inferred from chest X-rays? Clinically speaking, this is not expected; it is an implicit assumption that chest radiographs contain no information about one’s demographic characteristics beyond those most relevant to physiology, such as age or biological sex. Many models specifically exclude such characteristics so that classification is based solely on the image. However, deep neural networks are often a black box, capable of picking up on surprising pixel-level patterns.

Indeed, Gichoya et al. (2022) found that, using self-reported race labels for Black, White, and Asian patients, it is possible to classify chest radiographs into these three categories with high accuracy (AUCs of 0.91–0.99). To the extent that they investigated, this was not based on potentially race-related characteristics such as bone or breast density or disease prevalence, and performance remained high even on highly degraded versions of the images. Moreover, this pattern could not be replicated with algorithms that did not use the image data: a “logistic regression model (AUC 0·65), a random forest classifier (0·64), and an XGBoost model (0·64) to classify race on the basis of age, sex, gender, disease, and body habitus performed much worse than the race classifiers trained on imaging data”. So, as they conclude, “medical AI systems can easily learn to recognise self-reported racial identity from medical images, and that this capability is extremely difficult to isolate”; the problem may be prevalent in a large range of algorithms and would be difficult to correct for. Moreover, the fact that they obtained these results by training on a variety of popular, publicly available medical imaging datasets, including MIMIC-CXR, CheXpert, the National Lung Cancer Screening Trial, RSNA Pulmonary Embolism CT, and the Digital Hand Atlas, further suggests that the finding could be largely applicable to other AI projects.

This paper is not a standalone result. A prior paper by Yi et al. (2021) demonstrated that age and sex can be determined from chest radiographs for both Chinese and American populations. A paper by Adleberg et al. (2022), training on the MIMIC-CXR dataset, created a deep learning model that extracts self-reported information such as age, gender, race, and ethnicity with high accuracy, and even insurance status at moderate accuracy.

While the question of whether their results are reproducible has been more adequately answered elsewhere, we are interested in whether it is possible to reproduce their results on a smaller scale. Moreover, we aim to examine the ethical implications of their work beyond the problems of bias it poses for deep neural networks. Gichoya et al. (2022) “emphasise that the ability of AI to predict racial identity is itself not the issue of importance”, but is this enough? It does not seem adequate to stop at this conclusion when racial classification itself is a goal long rooted in the painful histories of eugenics, slavery, and colonization. To this end, we expand below on the definition of race and its use in medicine.

Race in Medicine

The most concerning question we face is the implications of this model. The direct uses of this algorithm are limited. However, its main value is as a demonstration: anyone using an AI algorithm may unknowingly be relying on procedures similar to this one that classify self-reported race and use it as a proxy for other classification tasks.

We cannot ignore that there may still be potential users of this algorithm. The very goal of racial classification contains an implicit assumption that race exists. We must therefore address two central questions: what race represents in medicine, and how race has been used in clinical practice.

Does Race Exist?

Whether race exists as a biological phenomenon, and not as a social construct, is a hotly debated issue. As Cerdeña, Plaisime, and Tsai (2020) note, “race was developed as a tool to divide and control populations worldwide. Race is thus a social and power construct with meanings that have shifted over time to suit political goals, including to assert biological inferiority of dark-skinned populations.”

One justification for the biological reality of races is the assumption that different races are genetically distinct from one another and can be fit into genetic groups. However, Maglo, Mersha, and Martin (2016) note that humans are not distinct by evolutionary criteria and that genetic similarities between “human races, understood as continental clusters, have no taxonomic meaning”, with “tremendous diversity within groups”. Whether race defines a genetic profile is therefore unclear at best, with correlations between race and disease being confounded by variables such as the association between race and socioeconomic status.

What is Race-based Medicine?

It is possible that some may be interested in using this algorithm to deduce the race of an individual and use that as part of medical decisions. There are some correlations between disease prevalence and race. Maglo, Mersha, and Martin (2016) note that “Recent studies showed that ancestry mapping has been successfully applied for disease in which prevalence is significantly different between the ancestral populations to identify genomic regions harboring diseases susceptibility loci for cardiovascular disease (Tang et al. (2005)), multiple sclerosis (Reich et al. (2005)), prostate cancer (Freedman et al. (2006)), obesity (Cheng et al. (2009)), and asthma (Vergara et al. (2009))”.

These practices would be characteristic of race-based medicine, which Cerdeña, Plaisime, and Tsai (2020) describe as “the system by which research characterizing race as an essential, biological variable, translates into clinical practice, leading to inequitable care”. Notably, race-based medicine has come under heavy criticism.

The Harms of Race-based Medicine

As stated above, race is not an accurate proxy for genetics. Cerdeña, Plaisime, and Tsai (2020) note that in medical practice, race is used as an inaccurate guideline for care: “Black patients are presumed to have greater muscle mass … On the basis of the understanding that Asian patients have higher visceral body fat than do people of other races, they are considered to be at risk for diabetes at lower body-mass indices”. As they note, race-based medicine can be founded more on racial stereotypes and generalizations than on evidence.

Moreover, race-based medicine can lead to ineffective treatments. Apeles (2022) summarizes a study of race-based prescribing for Black patients with high blood pressure. As background: “Practice guidelines have long recommended that Black patients with high blood pressure and no comorbidities be treated initially with a thiazide diuretic or a calcium channel blocker (CCB) instead of an angiotensin converting enzyme inhibitor (ACEI) and/or angiotensin receptor blocker (ARB). By contrast, non-Black patients can be prescribed any of those medicines regardless of comorbidities.” The study found that this race-based approach showed no benefit for Black patients. In addition, the authors found that “other factors may be more important than considerations of race, such as dose, the addition of second or third drugs, medication adherence, and dietary and lifestyle interventions. Follow-up care was important, and the Black patients who had more frequent clinical encounters tended to have better control of their blood pressure.”

In addition, Vyas, Eisenstein, and Jones (2020) argue that race is ill-suited as a correction factor for medical algorithms. They examined algorithms such as the American Heart Association (AHA) Get with the Guidelines–Heart Failure Risk Score, which predicts the likelihood of death from heart failure; the Vaginal Birth after Cesarean (VBAC) calculator, which predicts the likelihood of a successful vaginal birth for someone with a previous cesarean section; and the STONE score, which predicts the likelihood of kidney stones in patients with flank pain, all of which use race to adjust their predictions. However, they find that these adjustments were not sufficiently evidence-based: “Some algorithm developers offer no explanation of why racial or ethnic differences might exist. Others offer rationales, but when these are traced to their origins, they lead to outdated, suspect racial science or to biased data”. Using race in this way can discourage racial minorities from receiving the proper treatment based on their scores, exacerbating already existing problems of unequal health outcomes.

Conclusion

It is clear, then, that anyone who intends to use race for diagnosis could harm racial minority groups. Race is inherently a complex social and economic phenomenon and cannot be treated as a clear biological variable. Hence anyone intending to use or create such algorithms runs the risk of building dangerous biases into treatment, biases that could worsen the existing disparities in care for vulnerable populations.

3. Values Statement

The potential users of this project are scholars and researchers interested in exploring the classification of race and its intersection with other socially constructed identities (gender, ethnicity, sexuality, etc.). A large body of literature treats race as a proxy for categorizing and describing certain social, cultural, and biological characteristics of individuals or groups, and race has played a pervasive historical role in the medical field. Those who are harmed, and still affected, by this project are the hidden bodies: the individuals historically marginalized in society, whose very identities are constantly contested. In pursuing this project, we acknowledge that the technology and results could further harm and perpetuate the racist ideologies that currently exist by appearing to validate physiological differences across racial groups.

What is your personal reason for working on this problem?

Jay-U: My personal motivation for working on this project was my interest in Gichoya et al. (2022)’s paper. I thought it was very surprising that they were able to identify race from chest X-rays when I would expect there to be nothing identifying about the images. I think it is ethically problematic, so I wanted to explore their methods and verify whether they were reproducible, as well as to read more about race and its use in medicine.

Kent: My personal reason for working on this project was to engage in an ethical conversation about the implications of classifying race through AI. Jay-U provided the initial literature for this project, and it motivated me to practice bias auditing in case any of the algorithms proved to be flawed.

4. Materials and Methods

Our Data

We used the CheXpert dataset collected by Irvin et al. (2019). The dataset contains images collected at Stanford Hospital between October 2002 and July 2017. It contains 224,316 frontal and lateral chest radiographs of 65,240 patients. Each radiograph is labeled with information such as age, gender, race, ethnicity, and medical conditions, but we are primarily concerned with race and gender. A structured datasheet for the CheXpert dataset is provided by Garbin et al. (2021).

One limitation of the dataset is the limited variety of X-ray devices used to capture the images, since all scans come from a single institution, Stanford Hospital. Thus, models trained on this dataset can only be said to be valid for patients living around the Stanford area and for scans coming from this hospital. It is always possible that our model is specializing to features specific to these Stanford images, so it may perform worse when evaluated on scans from different institutions.

Note that our actual data comes from Kaggle, as the original 11 GB version of the dataset with smaller images was unavailable.

The first of the two relevant dataframes is df_patients, which contains a path to each patient's images, their Sex (male or female), Age, and Frontal/Lateral (indicating whether the scan is taken from the front or the side). The remaining columns are disease-related and were not relevant to our analysis.

import pandas as pd
df_patients = pd.read_csv('../data/train.csv')
df_patients
[Output: 223,414 rows × 19 columns. The first five columns are Path, Sex, Age, Frontal/Lateral, and AP/PA; the remaining 14 are disease labels (No Finding, Enlarged Cardiomediastinum, Cardiomegaly, Lung Opacity, Lung Lesion, Edema, Consolidation, Pneumonia, Atelectasis, Pneumothorax, Pleural Effusion, Pleural Other, Fracture, Support Devices), each encoded as 1.0, 0.0, -1.0, or NaN.]

The second dataframe, df_race, links the patient ID to gender, age, primary race, and ethnicity.

df_race = pd.read_excel('../data/chexpert_race.xlsx')
df_race
PATIENT GENDER AGE_AT_CXR PRIMARY_RACE ETHNICITY
0 patient24428 Male 61 White Non-Hispanic/Non-Latino
1 patient48289 Female 39 Other Hispanic/Latino
2 patient33856 Female 81 White Non-Hispanic/Non-Latino
3 patient41673 Female 42 Unknown Unknown
4 patient48493 Male 71 White Non-Hispanic/Non-Latino
... ... ... ... ... ...
65396 patient65702 Male 1 Other Hispanic/Latino
65397 patient04979 Female 27 Other Hispanic/Latino
65398 patient11445 Female 29 Unknown Unknown
65399 patient23235 Female 41 Other, Hispanic Hispanic/Latino
65400 patient05143 Male 24 White Non-Hispanic/Non-Latino

65401 rows × 5 columns

For our purposes, we combined the two dataframes to match the image links with the patient race. This was achieved by using the patient ID within each of the image paths to combine the two.
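As a rough sketch of this merge (assuming the column names shown above; the exact code in our repository may differ), the patient ID can be pulled out of the image path with a regular expression:

# Hypothetical sketch: pull the patient ID (e.g. "patient00001") out of each
# image path, then join with the race table on that ID.
df_patients["PATIENT"] = df_patients["Path"].str.extract(r"(patient\d+)", expand=False)
df_merged = df_patients.merge(
    df_race[["PATIENT", "PRIMARY_RACE"]], on="PATIENT", how="inner"
)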

df_race['PRIMARY_RACE'].unique()
array(['White', 'Other', 'Unknown', 'White, non-Hispanic', 'Asian', nan,
       'Black or African American', 'Black, non-Hispanic',
       'Other, Hispanic', 'Race and Ethnicity Unknown',
       'Asian, non-Hispanic', 'Pacific Islander, non-Hispanic',
       'Native Hawaiian or Other Pacific Islander', 'Other, non-Hispanic',
       'Patient Refused', 'White, Hispanic', 'Black, Hispanic',
       'Asian, Hispanic', 'American Indian or Alaska Native',
       'Native American, Hispanic', 'Native American, non-Hispanic',
       'Pacific Islander, Hispanic', 'Asian - Historical Conv',
       'White or Caucasian'], dtype=object)

There are quite a few self-reported race labels, but we focused only on White, Black, and Asian. As in Gichoya et al.’s code, we decided to label any race containing “White” as White (so this includes “White, non-Hispanic” and “White, Hispanic”), and similarly for Black and Asian.

Whether this approach is valid is somewhat questionable, but perhaps the actual ethnographic definitions of race are similarly ill-defined.
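A minimal sketch of this label collapsing (hypothetical helper names; df_merged is the merged dataframe from the sketch above):

def simplify_race(label):
    # Map any label containing "White", "Black", or "Asian" to that group;
    # everything else (Other, Unknown, Pacific Islander, NaN, ...) is dropped.
    if not isinstance(label, str):
        return None
    for group in ("White", "Black", "Asian"):
        if group in label:
            return group
    return None

df_merged["RACE"] = df_merged["PRIMARY_RACE"].map(simplify_race)
df_merged = df_merged.dropna(subset=["RACE"])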

After combining our dataframes and removing any patients not identifying as White, Black, or Asian (including the non-reported or NaN entries), we created the following figures. As seen below, images from White patients make up the vast majority of this data, and we are concerned that this may lead to a racial bias in the model’s classifications.

[Figure: racial imbalance in the training data]

Interestingly, more images also belong to male than to female patients, which could lead to gender bias in our algorithm.

[Figures: gender balance in the training data]

Our Method

Data Subsetting

We trained our model using 10,000 frontal chest X-rays, such as the one shown in the figure below. The only input features come from the image itself, and the only target is race.

We created the train and test datasets by keeping only the Black, White, and Asian populations. In total this is 90,000 images, with 72,000 in the train set and 18,000 in the test set. We created the equal-proportion training set by taking the first 6,000 images from each race group (this includes almost all of the Black patients, the smallest group) in the training set and randomizing the order. A similar process was used for the test set, though in total it contains only 3,000 images.
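A rough sketch of this balanced subsetting (assuming a hypothetical df_train containing the training rows with the RACE column from the earlier sketch):

# Take the first 6,000 images per race group, then shuffle the rows.
balanced_train = (
    df_train.groupby("RACE", group_keys=False)
    .head(6000)
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)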

Aside from reducing potential bias, subsetting the data makes it easier to train and evaluate an algorithm. Since we are not using a very large amount of data, a large model can easily overfit to the demographic composition of the training set. Attempts to use the unbalanced training set essentially resulted in an algorithm that would always guess White. By using a balanced set, it is easier for the model to learn image features without this frequency bias.

We also created training sets that included only the frontal scans, which is something Gichoya et al. did. Whether it makes a meaningful difference is not clear, but since the lateral scans are less common, it is possible that including them hinders learning.

For actual training, we only used 10,000 images due to limited computing power. Our training was done on the standard T4 GPUs available in Google Colab. Again, this is a significant difference from Gichoya et al., who use more than 100,000 images.

[Figure: example frontal chest X-ray from the dataset]

Image Transforms

We also used the image transforms that Gichoya et al. implemented. These include a resize to 224 by 224, normalization to the ImageNet mean and standard deviation, random horizontal flips, and random rotations of at most 15 degrees. We did not include a random zoom as in the paper, however. In general, these augmentations are a good way of preventing overfitting. Some transforms, like the random rotations, may also prevent the model from latching onto incidental features such as a patient’s lean or posture.
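A minimal sketch of this pipeline in torchvision (our exact transform code may differ):

import torchvision.transforms as T

# Resize, augment, and normalize to the ImageNet statistics described above.
train_transforms = T.Compose([
    T.Resize((224, 224)),
    T.RandomHorizontalFlip(),
    T.RandomRotation(15),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])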

Models

As for the models themselves, we used ResNet and EfficientNet. These are popular deep learning architectures for image classification, and both achieve accuracies of around 70-80% when evaluated on the ImageNet database. Specifically, we used pretrained EfficientNetB0 and ResNet18 models. We chose these particular variants because they have a low number of parameters (which means faster training and a smaller chance of overfitting), and because they are pretrained on the 224 by 224 image size that we used.
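A sketch of how these backbones can be loaded and given a three-class head (White, Black, Asian), using the recent torchvision weights API; our repository code may differ:

import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained backbones and swap in a 3-class output layer.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = nn.Linear(resnet.fc.in_features, 3)

effnet = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
effnet.classifier[1] = nn.Linear(effnet.classifier[1].in_features, 3)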

We also implemented ResNet18 models on our own but achieved a lower accuracy. A problem with deep neural networks is that performance can degrade as more layers are added, even though the extra layers could in principle learn an identity mapping. ResNet combats this issue with skip connections, which save the output from previous layers and add it to the current output, avoiding this degradation.
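To illustrate the idea, here is a generic basic residual block (not our exact implementation):

import torch.nn as nn

class BasicBlock(nn.Module):
    # A basic residual block: two 3x3 convolutions, with the input added back
    # to the output (the skip connection) before the final ReLU.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)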

To gauge what the ResNet18 model is “looking” at, we extracted the filters from the first convolutional layer and visualized the feature maps at three convolutional layers: layer 0, layer 8, and the final layer, layer 16.

To optimize a model, we trained it on 10,000 images in a loop, using different learning rates for the Adam optimizer and different \(\gamma\) values for the exponential scheduler. We trained all the parameters of our pretrained models: since there is no reason to expect the ImageNet database to contain X-ray-like images, we assumed it would be best to fine-tune every layer.
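A sketch of that setup (model, train_loader, and num_epochs are hypothetical names standing in for our actual objects; the learning rate and \(\gamma\) shown are the values reported in the results):

import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.735)
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # decay the learning rate once per epoch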

We note that the training process for the self-implemented ResNet18 model was different: we trained first with an exponential learning rate scheduler for 30 epochs, then for 15 epochs with a reduce-on-plateau scheduler, as the loss did not decrease otherwise.

In the same loop, we then validated the model on 2,500 other images from the training set to find the optimal hyperparameters. Cross-entropy loss was used for all models.

We define accuracy as the proportion of correct guesses, and we analyze the confusion matrices of our model.

As mentioned before, there may be a gender bias in our model because there are more male than female patients in our training dataset. We inspected this by splitting our test set into male and female subsets and testing the model on each. Gender bias is then examined by looking at the score and confusion matrix for each gendered subset.
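A rough sketch of this per-gender evaluation (male_loader, female_loader, and model are hypothetical names for our test DataLoaders and trained network):

import numpy as np
import torch
from sklearn.metrics import confusion_matrix

def evaluate(model, loader):
    # Collect predictions and labels, then compute accuracy and the confusion matrix.
    preds, labels = [], []
    model.eval()
    with torch.no_grad():
        for images, y in loader:
            preds.append(model(images).argmax(dim=1).cpu().numpy())
            labels.append(y.numpy())
    preds, labels = np.concatenate(preds), np.concatenate(labels)
    return (preds == labels).mean(), confusion_matrix(labels, preds)

for name, loader in [("male", male_loader), ("female", female_loader)]:
    accuracy, cm = evaluate(model, loader)
    print(name, accuracy)
    print(cm)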

5. Results

Loss and Accuracy History

Our best models achieved an accuracy of about 70-75% on the balanced test set. The pretrained ResNet and EfficientNet models obtained similar accuracies and losses, so we display only the EfficientNet results.

After optimizing the pretrained EfficientNetB0 model with an initial learning rate of 0.001 and an exponential scheduler with \(\gamma = 0.735\), we achieved an accuracy of 74% on the balanced test set.

As we can see in the figures below, the training score and loss gradually improved, while the validation score and loss plateaued after a few epochs. This is likely a sign of overfitting. We tried to address it by altering the scheduler type and varying the Adam learning rate from 0.001 to 0.01, but the overfitting persisted.

[Figures: EfficientNetB0 loss and score history]

We also trained our own implementation of ResNet18 and obtained comparable results. The learning rate was again set to 0.001, with the exponential scheduler at \(\gamma = 0.735\). The issue of overfitting remained, and this model achieved a score of 68% when tested on unseen data.

Confusion Matrix

Shown below are the confusion matrices for the EfficientNetB0 model with \(\gamma = 0.735\), computed over a subset of 2,500 male and female patients with both frontal and lateral images. The horizontal axis represents the predicted classes (White, Black, Asian), and the vertical axis represents the true labels.

[Figures: confusion matrix for male patients; confusion matrix for female patients]

From the confusion matrices above, we notice that for male patients the model classified the White class best, with 78% of White images correctly predicted, whereas for female patients the model classified the Asian class best, at 79%. However, across all classes (White, Black, Asian), the true-positive rates are relatively similar and can be interpreted as roughly equivalent.

For male patients, 19% of the Black class and 18% of the Asian class were predicted as White. For female patients, 17% of the Black class and 15% of the Asian class were predicted as White. These rates are similar to the misclassification of White patients as Asian: 12% of the male White class and 16% of the female White class were predicted as Asian. In contrast, the model classifies Asian and White patients as Black at significantly lower rates, and confusions between the Black and Asian classes are especially rare.

Thus, although we trained on an equal subset of each class (White, Black, Asian), the confusion matrices suggest that there may still be representational bias, with the White class more likely to be predicted than the Black and Asian classes. So, although the model classifies each race at a moderate accuracy (up to about 80%), it may do so at the cost of misclassifying a patient’s true race in a way that favors the White class. This would have serious implications if an algorithm like this were commercialized, and it calls into question the validity of similar algorithms that report “high” accuracy on the true races. The causes are not clear from our current analysis, but perhaps the number of distinct patients represented among the images is not equal: White patients have more images per patient on average, which could mean a smaller number of distinct patients during training.

For comparison, we also show the confusion matrices for the self-implemented ResNet18. Notice that the correct predictions remain much the same even on the unequal test set. Of course, the accuracy of 71% there is not quite good enough to beat the baseline accuracy of 78% (what the algorithm would achieve by always guessing White).

[Figures: self-implemented ResNet18 confusion matrices on the equal and unequal test sets]

Interestingly, the correct predictions for Black and Asian are noticeably worse here than for EfficientNetB0. The false predictions, while more frequent than for EfficientNetB0, follow the same general trends discussed above.

Visualized Feature Map

Using our self-implemented ResNet18 model, we show below the first convolutional layer’s filters and the feature maps from three convolutional layers: layer 0, layer 8, and layer 16 (the last layer).

The patient used for this demonstration is a White, non-Hispanic female. Shown below is the patient’s frontal-view image used for this experiment.

Frontal-view image of patient

And here are the first convolutional layer’s filters:

First convolutional layer filter of the self-implemented ResNet-18 neural network model

Then, we passed the image through the convolutional layers of the model and visualized the resulting feature maps. For simplicity, we only show three of the layers. It is interesting to see how varied the regions the model “looks” at are: the whiter patches of each map correspond to the parts of the image that are most strongly activated.

Feature maps from the first convolutional layer (layer 0) of self-implemented ResNet18 model
Feature maps from the ninth convolutional layer (layer 8) of self-implemented ResNet18 model
Feature maps from the last convolutional layer (layer 16) of self-implemented ResNet18 model

Looking across the three layers, we observe that the last layer is highly abstracted: the human eye can no longer recognize the original image (the patient’s frontal view). This last layer is of particular importance, as it is used to deduce what the model is actually classifying based on the “features” it has learned. We also observe that the model focuses on different aspects of the image, as the filters used to create each feature map vary.

What exactly it is observing is unclear; perhaps we can say that it is able to distinguish between the bones and the lungs in an X-ray.
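For reference, a minimal sketch of how feature maps like these can be captured with forward hooks (model and img are hypothetical names for the trained network and a preprocessed image tensor):

import torch

feature_maps = {}

def save_output(name):
    def hook(module, inputs, output):
        # Store the layer's output (its feature maps) under the layer's name.
        feature_maps[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        module.register_forward_hook(save_output(name))

with torch.no_grad():
    model(img.unsqueeze(0))  # img: a (3, 224, 224) tensor after the transforms

# feature_maps now maps layer names to (1, channels, H, W) tensors,
# which can be plotted channel by channel with matplotlib.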

6. Conclusion

The models our project produced can classify race with around 70-75% accuracy based only on chest X-rays. Given that this was measured on the balanced test set, this seems to affirm that chest X-rays can be used to predict race.

We also investigated the ethical issues that these models could pose. We speculate that if the results of this project were used in bad faith, existing racial inequalities would be reinforced and worsened. While this is not a conclusion supported by the algorithm or any other literature, some may interpret this algorithm as evidence for racial essentialism. This would be a problematic conclusion given the ethical and practical issues of racial essentialism, and its byproduct, race-based medicine.

Right now, our model is still in its infancy, and we do not know for sure what the model is looking at to make its decisions. If we had more time, the first thing we would do is test our algorithm on the MIMIC-CXR dataset: training and validating on multiple chest X-ray datasets would show more convincingly that medical imaging AI can detect race. With access to better resources, we could also have trained on a larger subset of the data. Looking into regularization methods could have helped with overfitting, as could reducing the number of layers (and thus parameters) of the model, or training only the linear layers of the more complex models.

7. Group Contributions Statement

Trong: I downloaded the CheXpert dataset to a shared drive to make the pathways in our Google Colab consistent; visualized racial and gender imbalances in the training data; trained ResNet18 models with different schedulers; visualized training and validation losses for each model; and investigated gender bias for the pretrained EfficientNetB0 model. I tried to implement the code from here, but it didn’t work. I also wrote the project presentation script and the blog post, and finalized the ethics research.

Jay-U: I found the paper by Gichoya et al. on classifying race using chest X-rays. I found the CheXpert dataset and created the subsets. I created the dataset function for the dataloader (including the transforms). I identified the algorithms to use and experimented with them, using the ResNet and EfficientNet architectures. I also did a self-implementation of the ResNet18/ResNet34 algorithms and trained them. I implemented the training and testing functions and the optimization loop, and added the learning rate scheduler. I also created a way to save and load models, having trained the pretrained ResNet18 and EfficientNet models with the exponential and plateau schedulers.

I also worked on the research, finding and writing up the introduction (the results similar to the work we’re doing). I also found the papers on race-based medicine and finished my write-up. I made edits to the project presentation and to all sections of the blog post.

Kent: I aided in exploratory data visualizations for the CheXpert dataset. I also implemented code to visualize the activation maps for the self-implemented ResNet18 model.

I helped format the project presentation and aided in editing a portion of the sections. I helped write the Values Statement, Methods (added the datasheet), and Results (interpreting the confusion matrix and auditing for bias) sections, and created the bibliography.

8. Personal Reflection

I learned a lot about image classification. This was also through the class, but I learned about the ResNet architecture and how to implement it. I learned how to use PyTorch, how to create a dataset loader, how to use pretrained models, how to optimize models, and how to save and load models. I also learned, along the way, about some subtleties in overfitting (though we were not really able to implement corrections). I didn’t have the time to show these, but I did experiment a lot: I tried training just the linear layer and training the EfficientNetB4 model (before realizing that EfficientNetB0 would be better). I even tried a two-step training process in which I trained first on the equal set and then on the unequal set (which some papers suggest can help address the issue of unequal proportions).

I also learned about race-based medicine and its drawbacks. I had previously known about race-based medicine and had even read scientific papers about whether race exists as a biological factor, but it was a struggle at first to specifically articulate how medicine and race are related, so this was a part in which I learned a lot.

I think that I did grow in terms of trying to manage a group. While it wasn’t entirely successful, I did set weekly meetings and make a to-do list, things that I thought would help keep people on the same page.

I think overall I fell short of what I hoped to achieve. I am proud that I was able to get my implementations to work decently well on the balanced set, even if my goal was to make them work on the unbalanced set. Overall, it would have been nice to reach the goals I mentioned in the conclusion, or to try different metrics like the AUC metric that Gichoya et al. used. I also think the analysis and writing in this blog post are not as complete as I would have hoped. I do feel like I missed out on the enriching parts of group work. In terms of coding or writing, it would have been more fulfilling if others had engaged with what I posted and made corrections or shown me their own work.

With this project, I am more confident in my abilities to learn about new topics in machine learning. It was initially quite daunting, and I was able to make a project that at least worked. I have also thought a lot more about ethics in computing and in medicine, and I hope that this goal of socially conscientious computing will stick with me.

References

Adleberg, Jason, Amr Wardeh, Florence X Doo, Brett Marinelli, Tessa S Cook, David S Mendelson, and Alexander Kagen. 2022. “Predicting Patient Demographics from Chest Radiographs with Deep Learning.” Journal of the American College of Radiology 19 (10): 1151–61.
Apeles, Linda. 2022. “Race-Based Prescribing for Black People with High Blood Pressure Shows No Benefit.” Patient Care.
Cerdeña, Jessica P, Marie V Plaisime, and Jennifer Tsai. 2020. “From Race-Based to Race-Conscious Medicine: How Anti-Racist Uprisings Call Us to Act.” The Lancet 396 (10257): 1125–28.
Cheng, Ching-Yu, WH Linda Kao, Nick Patterson, Arti Tandon, Christopher A Haiman, Tamara B Harris, Chao Xing, et al. 2009. “Admixture Mapping of 15,280 African Americans Identifies Obesity Susceptibility Loci on Chromosomes 5 and x.” PLoS Genetics 5 (5): e1000490.
Freedman, Matthew L, Christopher A Haiman, Nick Patterson, Gavin J McDonald, Arti Tandon, Alicja Waliszewska, Kathryn Penney, et al. 2006. “Admixture Mapping Identifies 8q24 as a Prostate Cancer Risk Locus in African-American Men.” Proceedings of the National Academy of Sciences 103 (38): 14068–73.
Garbin, Christian, Pranav Rajpurkar, Jeremy Irvin, Matthew P Lungren, and Oge Marques. 2021. “Structured Dataset Documentation: A Datasheet for CheXpert.” arXiv Preprint arXiv:2105.03020.
Gichoya, Judy Wawira, Imon Banerjee, Ananth Reddy Bhimireddy, John L Burns, Leo Anthony Celi, Li-Ching Chen, Ramon Correa, et al. 2022. “AI Recognition of Patient Race in Medical Imaging: A Modelling Study.” The Lancet Digital Health 4 (6): e406–14.
Irvin, Jeremy, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, et al. 2019. “CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison.” In Proceedings of the AAAI Conference on Artificial Intelligence, 33 (01): 590–97.
Maglo, Koffi N, Tesfaye B Mersha, and Lisa J Martin. 2016. “Population Genomics and the Statistical Values of Race: An Interdisciplinary Perspective on the Biological Classification of Human Populations and Implications for Clinical Genetic Epidemiological Research.” Frontiers in Genetics 7: 22.
Reich, David, Nick Patterson, Philip L De Jager, Gavin J McDonald, Alicja Waliszewska, Arti Tandon, Robin R Lincoln, et al. 2005. “A Whole-Genome Admixture Scan Finds a Candidate Locus for Multiple Sclerosis Susceptibility.” Nature Genetics 37 (10): 1113–18.
Seyyed-Kalantari, Laleh, Haoran Zhang, Matthew BA McDermott, Irene Y Chen, and Marzyeh Ghassemi. 2021. “Underdiagnosis Bias of Artificial Intelligence Algorithms Applied to Chest Radiographs in Under-Served Patient Populations.” Nature Medicine 27 (12): 2176–82.
Tang, Hua, Tom Quertermous, Beatriz Rodriguez, Sharon LR Kardia, Xiaofeng Zhu, Andrew Brown, James S Pankow, et al. 2005. “Genetic Structure, Self-Identified Race/Ethnicity, and Confounding in Case-Control Association Studies.” The American Journal of Human Genetics 76 (2): 268–75.
Vergara, Candelaria, Luis Caraballo, Dilia Mercado, Silvia Jimenez, Winston Rojas, Nicholas Rafaels, Tracey Hand, et al. 2009. “African Ancestry Is Associated with Risk of Asthma and High Total Serum IgE in a Population from the Caribbean Coast of Colombia.” Human Genetics 125: 565–79.
Vyas, Darshali A, Leo G Eisenstein, and David S Jones. 2020. “Hidden in Plain Sight—Reconsidering the Use of Race Correction in Clinical Algorithms.” New England Journal of Medicine. Mass Medical Soc.
Yi, Paul H, Jinchi Wei, Tae Kyung Kim, Jiwon Shin, Haris I Sair, Ferdinand K Hui, Gregory D Hager, and Cheng Ting Lin. 2021. “Radiology ‘Forensics’: Determination of Age and Sex from Chest Radiographs Using Deep Learning.” Emergency Radiology 28: 949–54.