We are releasing our approach towards bias-free Artificial Intelligence (AI) in a series of public analyses. With this analysis, our goal is to begin a study of fairness and accuracy and to take steps to improve both. To give more context to our numbers, we have included an introduction to data bias in machine learning models and ways to mitigate it. If you are only interested in seeing the demographics of the Ouva dataset, scroll down to the section titled Data Analysis.
The Need to Reduce AI Bias
For AI solutions to perform equitably for everyone, the data they learn from must be diverse in many respects. Some bias is easy to detect and mitigate on the surface. If an AI learns only from people with lighter skin, it will not be accurate when it sees a person with a different skin color. As Gender Shades showed in 2018, commercial facial recognition solutions performed best on lighter-skinned males and worst on darker-skinned females. As John Smith puts it: "Images must reflect the distribution of features in faces we see in the world."
The consequences of using a biased AI system range from miscategorizing a patient's age to a deadly fall that goes undetected. For example, Larrazabal et al. showed that an X-ray algorithm trained on male data could not read female X-rays well. Data diversity is so crucial that Moderna delayed vaccine trials to ensure minority representation.
Some Biases Are Harder To Detect
There is an intricate web of biases that the authors of these systems introduce. This subtlety makes bias harder to quantify, as the source of the bias is more complicated to identify. For example, when we ask a person to annotate a video with the skin color of the actor in it, they will use the label they perceive, which stems from their own biases. Therefore, two people from different cultures may not use the same label for a person's skin color. This subjectivity can affect the diversity of the data or the analysis of its outcomes.
How Does a Machine Learn?
To completely understand how a machine learning system is biased, we must know how it learns from and reacts to different inputs. As much as possible, we need to strive towards building "Explainable AI." This idea applies even more to healthcare as the outcomes affect patients' health and how caregivers do their work. As Spatharou et al. urge, "practitioners want to understand how AI works, where the underlying data come from, and what biases exist in the algorithms."
Learning From Public Models
Today, AI relies on machine learning models that ingest data such as images, videos, and other signals. During the learning phase, we split the dataset into two parts: Training & Validation. The model learns from the former and is evaluated on how accurately it recognizes the events in the latter.
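As a minimal sketch, the split can be as simple as shuffling the labeled samples and holding out a fraction for validation. The function name, fraction, and seed below are illustrative, not Ouva's actual pipeline:

```python
import random

def train_validation_split(samples, validation_fraction=0.2, seed=42):
    """Shuffle the samples and hold out a fraction for validation."""
    rng = random.Random(seed)           # fixed seed for a reproducible split
    shuffled = samples[:]               # copy so the original list is untouched
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    cut = len(shuffled) - n_val
    return shuffled[:cut], shuffled[cut:]

dataset = list(range(100))              # stand-in for 100 labeled video frames
train, validation = train_validation_split(dataset)
print(len(train), len(validation))      # 80 20
```

The model then fits its parameters on `train` only; accuracy on `validation` estimates how well it generalizes to data it has never seen.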
Fine-tuning with Custom Datasets
Like many products and teams today, we rely on open-source machine learning models as our starting point. Researchers have trained these models on large, public datasets. However, because of their generic nature, such models do not recognize specialized environments, such as hospital rooms. We fine-tune these models with our own datasets so that they understand the processes of inpatient care.
How To Detect Bias
As we have built one of the most comprehensive patient room datasets, we need the right metrics to analyze its diversity correctly. Below, we begin our evaluation by defining diversity metrics for the dataset. We started our analysis with age, gender, and skin color. However, we believe this decision merits additional discussion.
Problems with Assigning Gender
It is important to note that most recent research places gender on a non-binary spectrum. As gender equality and inclusivity move beyond the male/female separation, retaining a binary classification may cause unforeseen issues in practice. Jessica Cameron, in "Gender (mis)measurement," recommends using inclusive gender labels and avoiding categories such as "Other" that frame non-binary individuals as atypical.
Since Ouva AI learns from imagery, we seek diversity in perceivable differences. While we provide a binary gender analysis as a starting point for the discussion, we know it includes our own biases. Our subsequent analysis will explore attributes that affect model prediction accuracy: hair color and style, weight, height, disabilities, and others that may relate to gender identity.
Biases in Analysing Skin Color
Skin color is one attribute that brings many aspects of bias to the surface during the preparation and analysis of the dataset. For one, it is hard to classify under changing light conditions. More importantly, because skin color is subjective and, unlike ethnicity, not self-reported, its definition includes the biases of the analyzers. We have annotated skin color as darker or lighter.
In the next analysis, we will improve the skin color measure by using a continuous metric such as the Fitzpatrick Skin Type Scale, which ranges from 1 (lightest) to 6 (darkest). We intentionally did not categorize people by perceived ethnicity (e.g., Caucasian) because we believe this attribute should be self-reported.
There is no universally agreed-upon definition of fairness in AI. However, to evaluate bias fully, we should also analyze the system's output and how it behaves for diverse people. Researchers have proposed different statistical methods that attempt to quantify bias. For example, the IBM AI Fairness 360 toolkit recommends the following measures to compare the fairness of AI models:
- The difference in prediction mistake rates across groups (lack of Statistical Parity),
- The ratio of favorable outcomes between groups (Disparate Impact),
- Inequality measures such as the Theil Index.
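As an illustration, these three measures can be computed from per-group outcomes in a few lines of plain Python. The function names and the binary favorable/unfavorable encoding below are our own simplifications for exposition, not the AIF360 API:

```python
import math

def favorable_rate(outcomes):
    """Fraction of favorable (1) outcomes in a group."""
    return sum(outcomes) / len(outcomes)

def statistical_parity_difference(privileged, unprivileged):
    """Difference in favorable-outcome rates; 0 means parity."""
    return favorable_rate(unprivileged) - favorable_rate(privileged)

def disparate_impact(privileged, unprivileged):
    """Ratio of favorable-outcome rates; 1 means parity."""
    return favorable_rate(unprivileged) / favorable_rate(privileged)

def theil_index(benefits):
    """Theil inequality index over per-person benefit scores; 0 = perfect equality."""
    mean = sum(benefits) / len(benefits)
    # A zero benefit contributes nothing, since x*log(x) -> 0 as x -> 0.
    return sum((b / mean) * math.log(b / mean) for b in benefits if b > 0) / len(benefits)

# Toy example: the privileged group receives favorable outcomes 3 times in 4,
# the unprivileged group only 1 time in 4.
print(statistical_parity_difference([1, 1, 1, 0], [1, 0, 0, 0]))  # -0.5
print(disparate_impact([1, 1, 1, 0], [1, 0, 0, 0]))               # ~0.333
```

A statistical parity difference near 0 and a disparate impact ratio near 1 indicate parity; the toy values above would flag the model for review.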
We will be doing a thorough fairness analysis of the Ouva models in the next version of this review.
Like most AI platforms, we use existing machine learning models as a starting point. Since the bias in their datasets may influence our final model, we review two of them as well.
COCO is a large-scale object detection, segmentation, and captioning dataset that influences the person detection and silhouette segmentation models in Ouva. In Understanding and Evaluating Racial Biases in Image Captioning, Zhao et al. analyzed the demographics of the 2014 version of the COCO dataset.
PEdesTrian Attribute (PETA)
The PEdesTrian Attribute dataset (PETA) is a dataset for recognizing attributes of people, in particular clothing style. This dataset is the largest and most diverse pedestrian attribute dataset to date. At Ouva, we use models that researchers trained using the PETA dataset. These baseline models allow us to create new custom models that detect staff uniforms, patient gowns, and accessories such as gloves and masks.
No exhaustive diversity survey of this dataset is available. However, it is one of the largest, covering more than 60 attributes across 19,000 images. The binary attributes cover an exhaustive set of demographics (e.g., gender and age range), appearance (e.g., hairstyle), upper and lower body clothing style (e.g., casual or formal), and accessories.
Ouva Patient Room Dataset
Our patient room dataset consists of scenes recorded in mock-up patient rooms and at our partner hospital sites with volunteer patients. This dataset allows us to fine-tune preexisting models researchers created with public datasets. Hence, the resulting model works with high accuracy in a patient room as opposed to other general environments.
How to Mitigate Bias in AI Models
Increase Diversity with Synthetic Healthcare Dataset
To further increase diversity, we have created a synthetic (simulated) healthcare data platform. It allows our developers to recreate behaviors from various points of view and swap actors in and out of various environments. It uses the original data as the basis for a simulation environment that keeps the essential statistical relationships, behaviors, and activities while using simulated actors to retain privacy and increase diversity. With this platform, we can create thousands of images and rebalance our dataset within minutes.
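To illustrate the rebalancing idea in miniature: given group labels, underrepresented groups can be topped up until each group matches the largest one. In our platform the added samples are newly rendered synthetic scenes; the sketch below merely duplicates existing samples to show the bookkeeping, and all names are illustrative:

```python
import random
from collections import Counter

def rebalance_by_oversampling(samples, group_of, seed=0):
    """Duplicate samples from underrepresented groups until every group
    appears as often as the largest one. A stand-in for generating new
    synthetic scenes for those groups."""
    rng = random.Random(seed)
    buckets = {}
    for sample in samples:
        buckets.setdefault(group_of(sample), []).append(sample)
    target = max(len(bucket) for bucket in buckets.values())
    balanced = []
    for bucket in buckets.values():
        balanced.extend(bucket)                              # keep originals
        balanced.extend(rng.choices(bucket, k=target - len(bucket)))  # top up
    return balanced

# Toy dataset: 8 lighter-skin samples, only 2 darker-skin samples.
samples = [("lighter", i) for i in range(8)] + [("darker", i) for i in range(2)]
balanced = rebalance_by_oversampling(samples, group_of=lambda s: s[0])
print(Counter(group for group, _ in balanced))  # both groups now at 8
```

In practice, generating genuinely new synthetic scenes is preferable to duplication, since duplicates add no new visual variation for the model to learn from.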
Reduce Bias Using Algorithms
Beyond diversifying data, we can use a variety of algorithms to reduce bias in the AI models. The choice of the method depends on whether we need to fix the data, the model, or the predictions. For example, we can:
- Reweigh the examples in each group and label differently to ensure fairness before classification.
- Train a model for maximum accuracy while ensuring that the subject's protected attributes cannot be predicted from its output, leading to a fairer classifier.
- Adjust model outcomes to increase accuracy for unprivileged subjects within limits.
IBM AIF360 recommends the earliest intervention possible because of its flexibility. In summary:
- When we are allowed to modify the training dataset, we use the pre-processing step.
- If not, we then change the learning algorithm, intervening in-processing.
- If none of the options are possible, then we intervene in the post-processing step.
Diverse Teams Lead to Better AI
Approaching bias from a dataset and algorithmic point of view is not enough: we also need to consider the diversity of the teams that build AI. Michael Li of Harvard Business Review identifies two issues that propagate bias: a lack of diversity among the developers and funders of AI tools, and the framing of problems from the perspective of majority groups.
Form Diverse Partnerships
Team diversity is paramount. However, with the limited resources of small teams, it is equally important to organize AI development around diverse partnerships. At Ouva, we partnered with groups from Europe to the US to test ideas and reduce bias early. Through such collaborations, we could quickly blend the ideas and cultural expectations of healthcare researchers, nurses, user experience designers, and other stakeholders of different nationalities. In the right collaborative environment, diverse voices and honest, direct feedback are heard, and AI solutions like Ouva that change how care is delivered can succeed.
Towards a Bias-Free AI
As builders of technologies that are already changing patient care, at Ouva, we feel the weight of our responsibility. We are only beginning to do our part by pulling back the curtain on the decisions that may affect the lives of patients and caregivers. This analysis is the first of many in our attempt to stay accountable to a broader audience. We have gained so much from public discourse and hope to contribute back in return.