ARTICLE OPEN
Human visual explanations mitigate bias in AI-based assessment of surgeon skills

Dani Kiyasseh^1, Jasper Laca^2, Taseen F. Haque^2, Maxwell Otiato^2, Brian J. Miles^3, Christian Wagner^4, Daniel A. Donoho^5, Quoc-Dien Trinh^6, Animashree Anandkumar^1 and Andrew J. Hung^2
Artificial intelligence (AI) systems can now reliably assess surgeon skills through videos of intraoperative surgical activity. With such systems informing future high-stakes decisions such as whether to credential surgeons and grant them the privilege to operate on patients, it is critical that they treat all surgeons fairly. However, it remains an open question whether surgical AI systems exhibit bias against surgeon sub-cohorts, and, if so, whether such bias can be mitigated. Here, we examine and mitigate the bias exhibited by a family of surgical AI systems, SAIS, deployed on videos of robotic surgeries from three geographically-diverse hospitals (USA and EU). We show that SAIS exhibits an underskilling bias, erroneously downgrading surgical performance, and an overskilling bias, erroneously upgrading surgical performance, at different rates across surgeon sub-cohorts. To mitigate such bias, we leverage a strategy, TWIX, which teaches an AI system to provide a visual explanation for its skill assessment that otherwise would have been provided by human experts. We show that whereas baseline strategies inconsistently mitigate algorithmic bias, TWIX can effectively mitigate the underskilling and overskilling bias while simultaneously improving the performance of these AI systems across hospitals. We discovered that these findings carry over to the training environment where we assess medical students' skills today. Our study is a critical prerequisite to the eventual implementation of AI-augmented global surgeon credentialing programs, ensuring that all surgeons are treated fairly.
npj Digital Medicine (2023) 6:54; https://doi.org/10.1038/s41746-023-00766-2
INTRODUCTION
The quality of a surgeon's intraoperative activity (skill-level) can now be reliably assessed through videos of surgical procedures and artificial intelligence (AI) systems^1-3. With these AI-based skill assessments on the cusp of informing high-stakes decisions on a global scale, such as the credentialing of surgeons^4,5, it is critical that they are unbiased, reliably reflecting the true skill-level of all surgeons equally^6,7. However, it remains an open question whether such surgical AI systems exhibit a bias against certain surgeon sub-cohorts. Without an examination and mitigation of these systems' algorithmic bias, they may unjustifiably rate surgeons differently, erroneously delaying (or hastening) the credentialing of surgeons, and thus placing patients' lives at risk^8,9.
A surgeon typically masters multiple skills (e.g., needle handling and driving) necessary for surgery^10-12. To reliably automate the assessment of such skills, multiple AI systems (one for each skill) are often developed (Fig. 1a). To test the robustness of these systems, they are typically deployed on data from multiple hospitals^13. We argue that the bias of any one of these systems, which manifests as a discrepancy in its performance across surgeon sub-cohorts (e.g., novices vs. experts), is akin to one of many light bulbs in an electric circuit connected in series (Fig. 1b). With a single defective light bulb influencing the entire circuit, just one biased AI system is enough to disadvantage a surgeon sub-cohort. Therefore, the deployment of multiple AI systems across multiple hospitals, a common feat in healthcare, necessitates that we examine and mitigate the bias of all such systems collectively. Doing so will ethically guide the impending implementation of AI-augmented global surgeon credentialing programs^14,15.
Previous studies have focused on algorithmic bias exclusively against patients, demonstrating that AI systems systematically underestimate the pain level of Black patients^16 and falsely predict that female Hispanic patients are healthy^17. The study of bias in video-based AI systems has also gained traction, in the context of automated video interviews^18, algorithmic hiring^19, and emotion recognition^20. Previous work has not, however, investigated the bias of AI systems applied to surgical videos^21, thereby overlooking its effect on surgeons. Further, previous attempts to mitigate such bias are either ineffective^22-24 or limited to a single AI system deployed in a single hospital^25-27, casting doubt on their wider applicability. As such, previous studies neither attempt to mitigate, nor demonstrate the effectiveness of a strategy for mitigating, the bias exhibited by multiple AI systems across multiple hospitals.
In this study, we examine the bias exhibited by a family of surgical AI systems, SAIS^3, developed to assess the binary skill-level (low vs. high skill) of multiple surgical activities from videos. Through experiments on data from three geographically-diverse hospitals, we show that SAIS exhibits an underskilling bias, erroneously downgrading surgical performance, and an overskilling bias, erroneously upgrading surgical performance, at different rates across surgeon sub-cohorts. To mitigate such bias, we leverage a strategy, TWIX^28, that teaches an AI system to complement its skill assessments with a prediction of the importance of video frames, as provided by human experts (Fig. 1c). We show that TWIX can mitigate the underskilling and overskilling bias across hospitals and simultaneously improve the performance of AI systems for all surgeons. Our findings inform the ethical implementation of impending AI-augmented global surgeon credentialing programs.
^1 Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA. ^2 Center for Robotic Simulation and Education, Catherine & Joseph Aresty Department of Urology, University of Southern California, Los Angeles, CA, USA. ^3 Department of Urology, Houston Methodist Hospital, Houston, TX, USA. ^4 Department of Urology, Pediatric Urology and Uro-Oncology, Prostate Center Northwest, St. Antonius-Hospital, Gronau, Germany. ^5 Division of Neurosurgery, Center for Neuroscience, Children's National Hospital, Washington, DC, USA. ^6 Center for Surgery & Public Health, Department of Surgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA. Email: danikiy@hotmail.com; ajhung@gmail.com
RESULTS
SAIS exhibits underskilling bias across hospitals
With skill assessment, we refer to the erroneous downgrading of surgical performance as underskilling. An underskilling bias is exhibited when such underskilling occurs at different rates across surgeon sub-cohorts. For binary skill assessment (low vs. high skill), which is the focus of our study, this bias is reflected by a discrepancy in the negative predictive value (NPV) of SAIS (see Methods, Fig. 6). We, therefore, present SAIS' NPV for surgeons who have performed a different number of robotic surgeries during their lifetime (expert caseload >100), those operating on prostate glands of different volumes, and those operating on prostate cancer of different severity (Gleason score) (Fig. 2). Note that membership in these groups is fluid, as surgeons often have little say over, for example, the characteristics of the prostate gland they operate on. Please refer to the Methods section for our motivation behind selecting these groups and sub-cohorts.

We found that SAIS exhibits an underskilling bias across hospitals (see Methods for a description of the data and Table 2 for the number of video samples). This is evident by, for example, the discrepancy in the negative predictive value across the two surgeon sub-cohorts operating on prostate glands of different volumes (≤49 ml and >49 ml).
Fig. 1 Mitigating bias of multiple surgical AI systems across multiple hospitals. (a) Multiple AI systems assess the skill-level of multiple surgical activities (e.g., needle handling and needle driving) from videos of intraoperative surgical activity. These AI systems are often deployed across multiple hospitals. (b) To examine bias, we stratify these systems' performance (e.g., AUC) across different sub-cohorts of surgeons (e.g., novices vs. experts). The bias of one of many AI systems is akin to a light bulb in an electric circuit connected in series: similar to how one defective light bulb leads to a defective circuit, one biased AI system is sufficient to disadvantage a surgeon sub-cohort. (c) To mitigate bias, we teach an AI system, through a strategy referred to as TWIX, to complement its skill assessments with predictions of the importance of video frames based on ground-truth annotations provided by human experts.
For example, when assessing the skill-level of needle handling at USC (Fig. 2a), SAIS achieved an NPV of 0.71 and 0.75 for the two sub-cohorts, respectively. Such an underskilling bias consistently appeared across hospitals, with NPVs of 0.80 and 0.93 at St. Antonius Hospital (SAH), and of 0.73 and 0.88 at Houston Methodist Hospital (HMH). These findings extend to when SAIS assessed the skill-level of the second surgical activity, needle driving (see Fig. 2b).
Overskilling bias. While our emphasis has been on the underskilling bias, we demonstrate that SAIS also exhibits an overskilling bias, whereby it erroneously upgrades surgical performance (see Supplementary Note 2).
Multi-class skill assessment. Although the emphasis of this study is on binary skill assessment, a decision driven primarily by the need to inspect the fairness of a previously-developed and soon-to-be-deployed AI system (SAIS), there has been a growing number of studies focused on multi-class skill assessment^15. As such, we conducted a confined experiment to examine whether such a setup, in which needle handling is identified as either low, intermediate, or high skill, also results in algorithmic bias (see Supplementary Note 3). We found that both the underskilling and overskilling bias extend to this setting.
Underskilling bias persists even after controlling for potential confounding factors
Confounding factors may be responsible for the apparent underskilling bias^29,30. It is possible that the underskilling bias against surgeons with different caseloads (Fig. 2b) is driven by SAIS' dependence on caseload, as a proxy, for skill assessment. For example, SAIS may have latched onto the effortlessness of expert surgeons' intraoperative activity, as opposed to the strict skill assessment criteria (see Methods), as predictive of high-skill activity. However, after controlling for caseload, we found that SAIS' outputs remain highly predictive of skill-level (odds ratio = 2.27), suggesting that surgeon caseload, or experience, plays a relatively smaller role in assessing skill^31 (see Methods). To further check whether SAIS was latching onto caseload-specific features in surgical videos, we retrained it on data with an equal number of samples from each class (low vs. high skill) and surgeon caseload group (novice vs. expert) and found that the underskilling bias still persists. This suggests that SAIS is unlikely to be dependent on unreliable caseload-specific features.
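The adjusted analysis behind the reported odds ratio is detailed in the Methods; as a reading aid, one standard way to estimate how predictive SAIS' outputs remain after controlling for caseload is a logistic regression with both terms as covariates. Below is a minimal sketch of such an analysis; the column names and simulated data are ours, not the study's.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical per-video-sample table: ground-truth skill (0 = low, 1 = high),
# SAIS' binary skill prediction, and surgeon caseload group (0 = novice, 1 = expert).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "skill": rng.integers(0, 2, 500),
    "sais_pred": rng.integers(0, 2, 500),
    "expert": rng.integers(0, 2, 500),
})

# Regress ground-truth skill on SAIS' prediction while adjusting for caseload;
# exponentiated coefficients are adjusted odds ratios.
X = sm.add_constant(df[["sais_pred", "expert"]])
fit = sm.Logit(df["skill"], X).fit(disp=0)
print(np.exp(fit.params))  # the "sais_pred" entry is analogous to the reported 2.27
```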
Examining bias across multiple AI systems and hospitals prevents misleading bias findings
With multiple AI systems deployed on the same group of surgeons across hospitals, we claim that examining the bias of only one of these AI systems can lead to misleading bias findings. Here, we provide evidence in support of this claim by focusing on the surgeon caseload group (the same argument applies to the other groups).
Multiple AI systems. We found that, had we examined bias for only needle handling, we would have erroneously assumed that SAIS disadvantaged novice surgeons exclusively. While SAIS did exhibit an underskilling bias against novice surgeons at USC when assessing the skill-level of needle handling, it exhibited this bias against expert surgeons when assessing the skill-level of the second surgical activity of needle driving. For example, SAIS achieved an NPV of 0.71 and 0.75 for novice and expert surgeons, respectively, for needle handling (Fig. 2a), whereas it achieved an NPV of 0.85 and 0.75 for these two sub-cohorts for needle driving (Fig. 2b).
Multiple hospitals. We also found that, had we examined bias on data only from USC, we would have erroneously assumed that SAIS disadvantaged expert surgeons exclusively. While SAIS did exhibit an underskilling bias against expert surgeons at USC when assessing the skill-level of needle driving, it exhibited this bias against novice surgeons, to an even greater extent, at HMH.
Fig. 2 SAIS exhibits an underskilling bias across hospitals. SAIS is tasked with assessing the skill-level of (a) needle handling and (b) needle driving. A discrepancy in the negative predictive value across surgeon sub-cohorts reflects an underskilling bias. Note that SAIS is always trained on data from USC and deployed on data from St. Antonius Hospital and Houston Methodist Hospital. To examine bias, we stratify SAIS' performance based on the total number of robotic surgeries performed by a surgeon during their lifetime (caseload), the volume of the prostate gland, and the severity of the prostate cancer (Gleason score). The results are an average, and error bars reflect the standard error, across ten Monte Carlo cross-validation folds.
For example, SAIS achieved an NPV of 0.85 and 0.75 for novice and expert surgeons, respectively, at USC, whereas it achieved an NPV of 0.57 and 0.80 for these two sub-cohorts at HMH (Fig. 2b).
TWIX mitigates underskilling bias across hospitals
Although we demonstrated, in a previous study, that SAIS was able to generalize to data from different hospitals, we are acutely aware that AI systems are not perfect. They can, for example, depend on unreliable features as a shortcut to performing a task, otherwise known as spurious correlations^32. We similarly hypothesized that SAIS, as a video-based AI system, may be latching onto unreliable temporal features (i.e., video frames) to perform skill assessment. At the very least, SAIS could be focusing on frames which are irrelevant to the task at hand and which could hinder its performance.

To test this hypothesis, we opted for an approach that directs an AI system's focus onto frames deemed relevant (by human experts) while performing skill assessment. The intuition is that, by learning to focus on features deemed most relevant by human experts, an AI system is less likely to latch onto unreliable features in a video when assessing surgeon skill. To that end, we leverage a strategy entitled training with explanations (TWIX)^28 (see Methods). We present the performance of SAIS for the disadvantaged surgeon sub-cohorts before and after adopting TWIX when assessing the skill-level of needle handling (Fig. 3a) and needle driving (Fig. 3b).
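The exact TWIX objective is specified in the original report^28; conceptually, it augments the usual classification loss with an explanation loss that supervises the model's per-frame importance scores using expert annotations. A minimal sketch in PyTorch, with hypothetical tensor names and a simple mean-squared-error form for the explanation term:

```python
import torch.nn.functional as F

def twix_loss(logits, labels, attention, human_importance, lam=1.0):
    # logits:           (batch, 2) skill predictions (low vs. high)
    # labels:           (batch,) integer ground-truth skill labels
    # attention:        (batch, frames) model-assigned frame importance
    # human_importance: (batch, frames) expert-annotated frame importance
    # lam:              hypothetical weight trading off the two objectives
    cls_loss = F.cross_entropy(logits, labels)
    exp_loss = F.mse_loss(attention, human_importance)  # explanation supervision
    return cls_loss + lam * exp_loss
```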
We found that TWIX mitigates the underskilling bias exhibited by SAIS. This is evident by the improvement in SAIS' worst-case negative predictive value for the disadvantaged surgeon sub-cohorts after having adopted TWIX. For example, when SAIS was tasked with assessing the skill-level of needle handling at USC (Fig. 3a), the worst-case NPV increased by 2% for the disadvantaged surgeon sub-cohort (novices) in the surgeon caseload group (see Fig. 2 to identify the disadvantaged sub-cohorts). This finding was even more pronounced when SAIS was tasked with assessing the skill-level of needle driving at USC (Fig. 3b), with improvements in the worst-case NPV of up to 32%.
We also observed that TWIX, despite being adopted while SAIS was trained on data exclusively from USC, also mitigates bias when SAIS is deployed on data from other hospitals. This is evident by the improvements in SAIS' performance for the disadvantaged surgeon sub-cohorts at SAH and, occasionally, at HMH. In cases where we observed a decrease in worst-case performance, we found that this was associated with an overall decrease in the performance of SAIS (Fig. 4). We hypothesize that this reduction in performance is driven by the variability in the execution of surgical activity by surgeons across hospitals.
Overskilling bias. Empirically, we discovered that while various strategies mitigated the underskilling bias, they exacerbated the overskilling bias (more details in a forthcoming section). In contrast, we found that TWIX avoids this negative unintended effect. Specifically, we found that TWIX also mitigates the overskilling bias (see Supplementary Note 4).
Deploying TWIX with multiple AI systems and hospitals prevents misleading findings about its effectiveness
As with examining algorithmic bias, it is equally critical to measure the effectiveness of a bias mitigation strategy across multiple AI systems and hospitals in order to avoid misleading findings. We now provide evidence in support of this claim.
Multiple AI systems. We found that, had we not adopted TWIX for needle driving skill assessment, we would have underestimated its effectiveness. Specifically, while TWIX mitigated the underskilling bias at USC when SAIS assessed the skill-level of needle handling (system 1), the magnitude of this mitigation increased when SAIS assessed the skill-level of the distinct activity of needle driving (system 2). For example, for the disadvantaged surgeon sub-cohort in the caseload group, the worst-case NPV improved by 2% for needle handling (Fig. 3a) and by 20% for needle driving (Fig. 3b), reflecting a 10-fold increase in the effectiveness of TWIX as a bias mitigation strategy.
Fig. 3 TWIX mitigates the underskilling bias across hospitals. We present the average performance of SAIS on the most disadvantaged sub-cohort (worst-case NPV) before and after adopting TWIX, indicating the percent change. An improvement (↑) in the worst-case NPV is considered bias mitigation. SAIS is tasked with assessing the skill-level of (a) needle handling and (b) needle driving. Note that SAIS is trained on data from USC and deployed on data from St. Antonius Hospital and Houston Methodist Hospital. Results are an average across ten Monte Carlo cross-validation folds.
Multiple hospitals. We found that, had we not adopted TWIX and deployed SAIS in other hospitals, we would have overestimated its effectiveness. Specifically, while TWIX mitigated the underskilling bias at USC when SAIS assessed the skill-level of needle driving, the magnitude of this mitigation decreased when SAIS was deployed on data from SAH. For example, for the disadvantaged surgeon sub-cohort in the prostate volume group, the worst-case NPV improved by 19% at USC but only by 1% at SAH (Fig. 3b).
Baseline bias mitigation strategies induce collateral damage
A strategy for mitigating a particular type of bias can exacerbate another, leading to collateral damage and eroding its effectiveness. To investigate this, we adapted two additional strategies that have, in the past, proven effective in mitigating bias^33,34: training an AI system with additional data (TWAD) and pre-training an AI system first with surgical videos (VPT) (see Methods for an in-depth description). We compare their ability to mitigate bias to that of TWIX (Table 1 and Supplementary Note 5).
We found that while the baseline strategies were effective in mitigating the underskilling bias, even more so than TWIX, they dramatically worsened the overskilling bias exhibited by SAIS. For example, VPT almost negated its improvement in the underskilling bias (7.7%) by exacerbating the overskilling bias (7.0%). In contrast, TWIX consistently mitigated both the underskilling and overskilling bias, albeit more moderately, resulting in an average improvement in the worst-case performance of 3.0% and 4.0%, respectively. The observed consistency in TWIX's effect on bias is an appealing property whose implications we discuss later.
TWIX can improve AI system performance while mitigating bias across hospitals
Trustworthy AI systems must exhibit both robust and fair behavior^35. Although it has been widely documented that mitigating algorithmic bias can come at the expense of AI system performance^36, recent work has cast doubt on this trade-off^37-39. We explored this trade-off in the context of TWIX, and present SAIS' performance for all surgeons across hospitals (Fig. 4). This is reflected by the area under the receiver operating characteristic curve (AUC), before and after having adopted TWIX.
We found that TWIX can improve the performance of AI systems while mitigating bias. This is evident by the improvement in the performance of SAIS both for the disadvantaged surgeon sub-cohorts (see earlier Fig. 3) and, on average, for all surgeons. For example, when tasked with assessing the skill-level of needle driving at USC (Fig. 3b), TWIX improved the worst-case NPV by 32%, 19%, and 20% for the surgeon groups of caseload, prostate volume, and Gleason score, respectively, thus mitigating the underskilling bias, and also improved SAIS' performance from AUC = 0.821 to 0.843 (Fig. 4b).
Fig. 4 TWIX can improve AI system performance while mitigating bias across hospitals. The performance (AUC) of SAIS before and after having adopted TWIX when assessing the skill-level of (a) needle handling and (b) needle driving. Note that SAIS is trained on data from USC and deployed on data from St. Antonius Hospital and Houston Methodist Hospital. The results are an average across ten Monte Carlo cross-validation folds and the shaded area represents one standard error.
Table 1. Baseline strategies mitigate bias inconsistently.

Bias          | TWAD | VPT  | TWIX (ours)
--------------|------|------|------------
Underskilling | 3.7% | 7.7% | 3.0%
Overskilling  | 6.7% | 7.0% | 4.0%

We report the change in the AI system's bias (negative percent change in worst-case performance), averaged across the surgeon groups, as a result of adopting distinct mitigation strategies. An improvement in the worst-case performance corresponds to a reduction in bias. Results are shown for the needle handling skill assessment system deployed on data from USC. TWAD involves training an AI system with additional data, and VPT involves pre-training the AI system with surgical videos (see Methods).
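As a reading aid for Table 1 (and the percent changes quoted throughout), the reported quantity is the percent change in worst-case, i.e., minimum, sub-cohort performance. A minimal sketch with illustrative NPV values, not figures from the study:

```python
def worst_case_change(perf_before, perf_after):
    """Percent change in worst-case (minimum) performance across sub-cohorts.
    perf_before / perf_after map sub-cohort name -> NPV (or PPV)."""
    before, after = min(perf_before.values()), min(perf_after.values())
    return 100 * (after - before) / before

# Illustrative: worst-case NPV rises from 0.71 to 0.78, a ~9.9% improvement.
print(worst_case_change({"novice": 0.71, "expert": 0.75},
                        {"novice": 0.78, "expert": 0.76}))
```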
Deployment of SAIS in a training environment
Our study informs the future implementation of AI-augmented surgeon credentialing programs. We can, however, begin to assess the skills of surgical trainees in a training environment today. To foster a fair learning environment for surgical trainees, it is critical that these AI-based skill assessments reflect the true skill-level of all trainees equally. To measure this, and as a proof of concept, we deployed SAIS on video samples of the needle handling activity performed by medical students, without prior robotic experience, on a robot otherwise used in surgical procedures (see Methods) (Fig. 5).
We discovered that our findings from when SAIS was deployed on video samples of live surgical procedures transferred to the training environment. Specifically, we first found that SAIS exhibits an underskilling bias against male medical students (Fig. 5a). Consistent with earlier findings, we also found that TWIX mitigates this underskilling bias (Fig. 5b) and simultaneously improves SAIS' ability to assess the skill-level of needle handling (Fig. 5c).
DISCUSSION
Recently-developed surgical AI systems can reliably assess multiple surgeon skills across hospitals. The impending deployment of such systems for the purpose of credentialing surgeons and training medical students necessitates that they do not disadvantage any particular sub-cohort. However, until now, it has remained an open question whether such surgical AI systems exhibit algorithmic bias.
In this study, we examined and mitigated the bias exhibited by a family of surgical AI systems, SAIS, that assess the skill-level of multiple surgical activities through video. To prevent misleading bias findings, we demonstrated the importance of examining the collective bias exhibited by all AI systems deployed on the same group of surgeons and across multiple hospitals. We then leveraged a strategy, TWIX, which not only mitigates such bias for the majority of surgeon groups and hospitals, but can also improve the performance of AI systems for all surgeons.
As it pertains to the study and mitigation of algorithmic bias, previous work is limited in three main ways. First, it has not examined the algorithmic bias of AI systems applied to the data modality of surgical videos^6,40, nor against surgeons^41,42, thereby overlooking an important stakeholder within medicine. Second, previous work has not studied bias in the real clinical setting characterized by multiple AI systems deployed on the same group of surgeons and across multiple hospitals, with a single exception^43. Third, previous work has not demonstrated the effectiveness of a bias mitigation strategy across multiple stakeholders and hospitals^33.
When it comes to bias mitigation, we found that TWIX mitigated algorithmic bias more consistently than baseline strategies that have, in the past, proven effective in other scientific domains and with other AI systems. This consistency is reflected by a simultaneous decrease in algorithmic bias of different forms (underskilling and overskilling), of multiple AI systems (needle handling and needle driving skill assessment), and across hospitals. We do appreciate, however, that it is unlikely for a single bias mitigation strategy to be effective all the time. This reactive approach to bias mitigation might call for more of a preventative approach, where AI systems are purposefully designed to exhibit minimal bias. While appealing in theory, we believe this is impractical at the present moment for several reasons. First, it is difficult to determine, during the design stage of an AI system, whether it will exhibit any algorithmic bias upon deployment on data, and if so, against whom. Mitigating bias becomes challenging when it cannot first be quantified. Second, the future environment in which an AI system will be deployed is often unknown. This ambiguity makes it difficult to design an AI system specialized to the data in that environment ahead of time. In some cases, it may even be undesirable to do so, as a specialized system might be unlikely to generalize to novel data.
From a practical standpoint, we believe TWIX confers several benefits. Primarily, TWIX is a simple add-on to almost any AI system that processes temporal information and does not require any amendments to the latter's underlying architecture. This is particularly appealing in light of the broad availability and common practice of adapting open-source AI systems. In terms of resources, TWIX only requires the availability of ground-truth importance labels (e.g., the importance of frames in a video), which we have demonstrated can be acquired with relative ease in this study. Furthermore, TWIX's benefits can extend beyond just mitigating algorithmic bias. Most notably, when performing inference on an unseen video sample, an AI system equipped with TWIX can be viewed as explainable, as it highlights the relative importance of video frames, thereby instilling trust in domain experts. It can also be leveraged as a personalized educational tool for medical students, directing them toward the surgical activity in the video that can be improved upon. These additional capabilities would be missing from other bias mitigation strategies.
We demonstrated that, to prevent misleading bias findings, it is crucial to examine and mitigate the bias of multiple AI systems across multiple hospitals. Without such an analysis, stakeholders within medicine would be left with an incomplete and potentially incorrect understanding of algorithmic bias.
Fig. 5 SAIS can be used today to assess the skill-level of surgical trainees. (a) SAIS exhibits an underskilling bias against male medical students when assessing the skill-level of needle handling. (b) TWIX improves the worst-case NPV, and thus mitigates the underskilling bias. (c) TWIX simultaneously improves SAIS' ability to perform skill assessment.
For example, at the national level, medical boards augmenting their decision-making with AI systems akin to those introduced here may introduce unintended disparities in how surgeons are credentialed. At the local hospital level, medical students subjected to AI-augmented surgical training, a likely first application of such AI systems, may receive unreliable learning signals. This would hinder their professional development and perpetuate existing biases in the education of medical students^44-47. Furthermore, the alleviation of bias across multiple hospitals implies that surgeons looking to deploy an AI system in their own operating room can be less reticent to do so. As such, we recommend that algorithmic bias, akin to AI system performance, also be examined across multiple hospitals and across multiple AI systems deployed on the same group of stakeholders. Doing so increases the transparency of AI systems, leading to more informed decision-making at various levels of operation within healthcare and contributing to the ethical deployment of surgical AI systems.
There are important challenges that our work does not yet address. A topic that is seldom discussed, and which we do not claim to have an answer for, is that of identifying an acceptable level of algorithmic bias. Akin to the ambiguity of selecting a performance threshold that AI systems should surpass before being deployed, it is equally unclear whether a discrepancy in performance across groups (i.e., bias) of 10 percentage points is significantly worse than one of 5 percentage points. As with model performance, this is likely to be context-specific and dependent on how costly a particular type of bias is. In our work, we have suggested that any performance discrepancy is indicative of algorithmic bias, an assumption that the majority of previous work also makes. In a similar vein, we have only considered algorithmic bias at a single snapshot in time, when the model is trained and deployed on a static and retrospectively-collected dataset. However, as AI systems are likely to be deployed over extended periods of time, where the distribution of data is likely to change, it is critical to continuously monitor and mitigate the bias exhibited by such systems over time. Analogous to continual learning approaches that allow models to perform well on new unseen data while maintaining strong performance on data observed in the past^48, we believe continual bias mitigation is an avenue worth exploring.
Our study has been limited to examining the bias of AI systems which only assess the quality of two surgical skills (needle handling and needle driving). Although these skills form the backbone of suturing, an essential activity that almost all surgeons must master, they are but a subset of all skills required of a surgeon. It is imperative for us to proactively assess the algorithmic bias of surgical AI systems once they become capable of reliably assessing a more exhaustive set of surgical skills. Another limitation is that we examine and mitigate algorithmic bias exclusively through a technical lens. However, we acknowledge that the presence and perpetuation of bias is dependent on a multitude of additional factors, ranging from the social context in which an AI system is deployed, to the decisions that it will inform, and the incentives surrounding its use. In this study, and for illustration purposes, we assumed that an AI system would be used to either provide feedback to surgeons about their performance or to inform decisions such as surgeon credentialing. To truly determine whether algorithmic biases, as we have defined them, translate into tangible biases that negatively affect surgeons and their clinical workflow, a prospective deployment of an AI system would be required.
Although we leveraged a bias mitigation strategy (TWIX), our work does not claim to address the key open question of how much bias mitigation is sufficient. Indeed, the presence of a performance discrepancy across groups is not always indicative of algorithmic bias. Some have claimed that this is the case only if the discrepancy is unjustified and harmful to stakeholders^49. Therefore, to address this open question, which is beyond the scope of our work, researchers must appreciate the entire ecosystem in which an AI system is deployed. Moving forward, and once data become available, we look to examine (a) bias against surgeon groups which we had excluded in this study due to sample size constraints (e.g., those belonging to a particular race, sex, and ethnicity) and (b) intersectional bias^50: that which is exhibited against surgeons who belong to multiple groups at the same time (e.g., expert surgeons who are female). Doing so could help outline whether a variant of Simpson's paradox^51 is at play; bias, although absent at the individual group level, may be present when simultaneously considering multiple groups. We leave this to future work, as the analysis would require a sufficient number of samples from each intersectional group. We must also emphasize that a single bias mitigation strategy is unlikely to be a panacea. As a result, we encourage the community to develop bias mitigation strategies that achieve the desired effect across multiple hospitals, AI systems, and surgeon groups. Exploring the interplay of these elements, although rarely attempted in the context of algorithmic bias in medicine, is critical to ensure that AI systems deployed in clinical settings have the intended positive effect on stakeholders.
The credentialing of a surgeon is often considered a rite of
passage. With time, such a decision is likely to be supported by AI-
based skill assessments. In preparation for this future, our study
introduces safeguards to enable fair decision-making.
METHODS
Ethics approval
All datasets (data from USC, SAH, and HMH) were collected under Institutional Review Board (IRB) approval from the University of Southern California, in which written informed consent was obtained from all participants (HS-17-00113). Moreover, the datasets were de-identified prior to model development.
Description of surgical procedure and activities
In this study, we focused on robot-assisted radical prostatectomy (RARP), a surgical procedure in which the prostate gland is removed from a patient's body in order to treat cancer. With a surgical procedure often composed of sequential steps that must be executed by a surgeon, we observed the intraoperative activity of surgeons during one particular step of the RARP procedure: the vesico-urethral anastomosis (VUA). In short, the VUA is a reconstructive suturing step in which the bladder and urethra, separated by the removal of the prostate, must now be connected to one another through a series of stitches. This connection creates a tight link that should allow for the normal flow of urine postoperatively. To perform a single stitch in the VUA step, a surgeon must first grab the needle with one of the robotic arms (needle handling), push that needle through the tissue (needle driving), and then withdraw that needle on the other side of the tissue in preparation for the next stitch (needle withdrawal).
Surgical video samples and annotations
In assessing the skill-level of suturing activity, SAIS was trained and
evaluated on video samples associated with ground-truth skill
assessment annotations. We now outline how these video
samples and annotations were generated, and defer a description
of SAIS to the next section.
Video samples. We collected videos of entire robotic surgical procedures from three geographically-diverse hospitals, in addition to videos of medical students performing suturing activities in a laboratory environment.
Live robotic surgical procedures: An entire video of the VUA step (on the order of 20 min) from one surgical case was split into video samples depicting either one of the two suturing activities: needle handling and needle driving. With each VUA step consisting of around 24 stitches, this resulted in approximately 24 video samples depicting needle handling and another 24 depicting needle driving. To obtain these video samples, a trained medical fellow identified the start and end times of the respective suturing sub-phases. Each video sample can span 5-30 s in duration. Please refer to Table 2 for a summary of the number of video samples.
Training environment: To mimic the VUA step in a laboratory environment, we presented medical students with a realistic gel-like model of the bladder and urethra, and asked them to perform a total of 16 stitches while using a robot otherwise used in live surgical procedures. To obtain video samples, we followed the same strategy described above. As such, each participant's video resulted in 16 video samples for each of the activities of needle handling, needle driving, etc. For this dataset, we focused only on needle handling (see Table 2 for the number of video samples). Note that since these video samples depict suturing activities, we adopted the same annotation strategy (described next) for these video samples and those of live surgical procedures.
Skill assessment annotations. A team of trained human raters (TFH, MO, and others) was tasked with viewing each video sample and annotating it with either a binary low-skill or high-skill assessment. It is worthwhile to note that, to minimize potential bias in the annotations, these raters were not privy to the clinical meta-information (e.g., surgeon caseload) associated with the surgical videos. The raters followed the strict guidelines outlined in our team's previously-developed skill assessment tool^52, which we outline in brief below. To ensure the quality of the annotations, the raters first went through a training process in which they annotated the same set of video samples. Once their agreement level exceeded 80%, they were allowed to begin annotating the video samples for this study. In the event of disagreements in the annotations, we followed the same strategy adopted in the original study^3, where the lowest of all scores is considered the final annotation.
Needle handling skill assessment: The skill-level of needle handling is assessed by observing the number of times a surgeon had to reposition their grasp of the needle. Fewer repositions imply a higher skill-level, as they are indicative of improved surgeon dexterity and intent.
Needle driving skill assessment: The skill-level of needle driving is assessed by observing the smoothness with which a surgeon pushes the needle through the tissue. Smoother driving implies a higher skill-level, as it is less likely to cause physical trauma to the tissue.
SAIS is an AI system for skill assessment
SAIS was recently developed to decode the intraoperative activity of surgeons based exclusively on surgical videos^3. Specifically, it demonstrated state-of-the-art performance in assessing the skill-level of surgical activity, such as needle handling and driving, across multiple hospitals. In light of these capabilities, we used SAIS as the core AI system whose potential bias we attempted to examine and mitigate across hospitals.
Components of SAIS. We outline the basic components of SAIS here and refer readers to the original study for more details^3. In short, SAIS takes two data modalities as input: RGB frames and optical flow, which measures motion in the field of view over time and is derived from neighboring RGB frames. Spatial information is extracted from each of these frames through a vision transformer pre-trained in a self-supervised manner on ImageNet. To capture the temporal information across frames, SAIS learns the relationship between subsequent frames through an attention mechanism. Greater attention, or importance, is placed on frames deemed more important for the ultimate skill assessment. Repeating this process for all data modalities, SAIS arrives at modality-specific video representations. SAIS aggregates these representations to arrive at a single video representation that summarizes the content of the video sample. This video representation is then used to output a probability distribution over the two skill categories (low vs. high skill).
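As a rough, illustrative sketch of this pipeline (not SAIS' actual implementation; the layer sizes, attention module, and mean-based aggregation are placeholder choices), assuming frame-level features have already been extracted by a pre-trained vision transformer:

```python
import torch
import torch.nn as nn

class SkillAssessorSketch(nn.Module):
    """Per-modality temporal attention over frame features, aggregation into a
    single video representation, then classification into low vs. high skill."""

    def __init__(self, feat_dim=384, n_classes=2):
        super().__init__()
        self.temporal = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, rgb_feats, flow_feats):
        # rgb_feats, flow_feats: (batch, frames, feat_dim) features per frame
        reps = []
        for feats in (rgb_feats, flow_feats):
            attended = self.temporal(feats)      # relate subsequent frames
            reps.append(attended.mean(dim=1))    # modality-specific video representation
        video_rep = torch.stack(reps).mean(dim=0)  # aggregate across modalities
        return self.classifier(video_rep)          # logits over the two skill categories
```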
Training and evaluating SAIS. As in the original study^3, SAIS is trained on data exclusively from USC using tenfold Monte Carlo cross-validation (see Supplementary Note 1). Each fold consisted of a training, validation, and test set, ensuring that surgical videos were not shared across the sets. When evaluated on data from other hospitals, SAIS is deployed on all such video samples. This is repeated for all ten of the SAIS models. As such, we report evaluation metrics as an average and standard deviation across the ten Monte Carlo cross-validation folds.
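A minimal sketch of the grouping constraint (no surgical video shared across splits) using scikit-learn; the split proportions and array contents are illustrative, and the validation set is omitted for brevity:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
sample_idx = np.arange(1000)              # one entry per video sample
video_ids = rng.integers(0, 78, 1000)     # parent surgical video of each sample

# Ten random train/test splits in which no video straddles the two sets.
splitter = GroupShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
for fold, (train, test) in enumerate(splitter.split(sample_idx, groups=video_ids)):
    assert set(video_ids[train]).isdisjoint(video_ids[test])
```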
Skill assessment evaluation metrics
When SAIS decodes surgical skills, we report the positive predictive value (PPV), defined as the proportion of AI-based high-skill assessments which are correct, and the negative predictive value (NPV), defined as the proportion of AI-based low-skill assessments which are correct (see Fig. 6). The motivation for doing so stems from the expected use of these AI systems, where their low- or high-skill assessment predictions would inform decision-making (e.g., surgeon credentialing). As such, we were interested in seeing what proportion of the AI-based assessments, Ŷ, matched the ground-truth assessment, Y, for a given set of S surgeon sub-cohorts, {s_i}, i = 1, ..., S (Eqs. (1) and (2)).
Table 2. Total number of videos and video samples associated with each of the hospitals and tasks.

Task             | Activity | Details         | Hospital | Videos | Video samples | Surgeons | Generalizing to
-----------------|----------|-----------------|----------|--------|---------------|----------|----------------
Skill assessment | Suturing | Needle handling | USC      | 78     | 912           | 19       | Videos
                 |          |                 | SAH      | 60     | 240           | 18       | Hospitals
                 |          |                 | HMH      | 20     | 184           | 5        | Hospitals
                 |          |                 | LAB      | 69     | 328           | 38       | Modality
                 |          | Needle driving  | USC      | 78     | 530           | 19       | Videos
                 |          |                 | SAH      | 60     | 280           | 18       | Hospitals
                 |          |                 | HMH      | 20     | 220           | 5        | Hospitals

Note that we train our model, SAIS, exclusively on the USC data, following a tenfold Monte Carlo cross-validation setup. For an exact breakdown of the number of video samples in each fold and training, validation, and test split, please refer to Supplementary Tables 1-6. The data from the remaining hospitals are exclusively used for inference. SAIS is always trained and evaluated on a class-balanced set of data whereby each category (e.g., low skill and high skill) contains the same number of samples. This prevents SAIS from being negatively affected by a sampling bias during training, and allows for a more intuitive appreciation of the evaluation results.
For a surgeon sub-cohort s_i, these metrics are defined as:

PPV_{s_i} = P(Y = high | s_i, Ŷ = high)    (1)

NPV_{s_i} = P(Y = low | s_i, Ŷ = low)    (2)
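Concretely, Eqs. (1) and (2) amount to conditional accuracies over the model's low- and high-skill predictions. A minimal sketch (function and variable names are ours; the cohort masks are hypothetical):

```python
import numpy as np

def ppv_npv(y_true, y_pred):
    """PPV: fraction of high-skill (1) predictions that are correct, Eq. (1).
    NPV: fraction of low-skill (0) predictions that are correct, Eq. (2)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ppv = (y_true[y_pred == 1] == 1).mean()
    npv = (y_true[y_pred == 0] == 0).mean()
    return ppv, npv

# Per-sub-cohort evaluation; `cohorts` maps a name (e.g., "novice") to a
# boolean mask over video samples.
def metrics_by_cohort(y_true, y_prob, cohorts, tau=0.5):
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) > tau).astype(int)  # threshold discussed next
    return {name: ppv_npv(y_true[m], y_pred[m]) for name, m in cohorts.items()}
```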
Choosing a threshold for evaluation metrics. SAIS outputs the probability, p ∈ [0, 1], that a video sample depicts a high-skill activity. As with any probabilistic output, to make a definitive prediction (skill assessment), we had to choose a threshold, τ, on this probability. Whereas p ≤ τ indicates a low-skill assessment, p > τ indicates a high-skill assessment. While this threshold is often informed by previously-established clinical evidence^53 or a desired error rate, we did not have such prior information in this setting. We also balanced the number of video samples from each skill category during the training of SAIS. As such, we chose a threshold τ = 0.5 for our experiments. Changing this threshold did not affect the relative model performance values across surgeon sub-cohorts, and therefore left the bias findings unchanged.
Quantifying the different types of bias
To examine and mitigate the bias exhibited by surgical AI systems, we first require a definition of bias. Although many exist in the literature, we adopt the definition most commonly used in recent studies^16,17,33: a discrepancy in the performance of an AI system for different members, or sub-cohorts, of a group (e.g., surgeons with different experience levels). The choice of performance metric ultimately depends on the type of bias we are interested in examining. In this study, we focus on two types of bias: underskilling and overskilling.
Underskilling. In the context of skill assessment, underskilling occurs when an AI system erroneously downgrades surgical performance, predicting a skill to be of lower quality than it actually is. Using this logic with binary skill assessment (low vs. high skill), underskilling can be quantified by the proportion of AI-based low-skill predictions (Ŷ = low) which should have been classified as high skill (Y = high). This is equivalently reflected by the negative predictive value of the AI-based predictions (see Fig. 6). While it is also possible to examine the proportion of high-skill assessments which an AI system predicts to be low-skill, amounting to the true positive rate, we opt to focus on how AI-based low-skill predictions directly inform the decision-making of an end-user.
Overskilling. In the context of skill assessment, overskilling occurs when an AI system erroneously upgrades surgical performance, predicting a skill to be of higher quality than it actually is. Using this logic with binary skill assessment, overskilling can be quantified by the proportion of AI-based high-skill predictions (Ŷ = high) which should have been classified as low skill (Y = low). This is equivalently reflected by the positive predictive value of the AI-based predictions (see Fig. 6).
Underskilling and overskilling bias. Adopting the established definitions of bias^16,17,33, and leveraging our descriptions of underskilling and overskilling, we define an underskilling bias as a discrepancy in the negative predictive value of AI-based predictions across sub-cohorts of surgeons (s_1 and s_2, for example, when dealing with two sub-cohorts; see Fig. 6). This concept naturally extends to the multi-class skill assessment setting (Supplementary Note 3). A larger discrepancy implies a larger bias. We similarly define an overskilling bias as a discrepancy in the positive predictive value of AI-based predictions across sub-cohorts of surgeons. Given our study's focus on RARP surgical procedures, we examine bias exhibited against groups of (a) surgeons with different robotic caseloads (total number of robotic surgeries performed in their lifetime), and those operating on prostate glands of (b) different volumes and (c) different cancer severity. We motivate this choice of groups in the next section.
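Reusing the ppv_npv helper from the sketch above, the two biases reduce to metric gaps between sub-cohorts (the masks selecting, e.g., novice and expert samples are hypothetical):

```python
def skill_bias(y_true, y_pred, mask_s1, mask_s2):
    """Underskilling bias = NPV gap between sub-cohorts s1 and s2;
    overskilling bias = PPV gap between the same sub-cohorts."""
    ppv1, npv1 = ppv_npv(y_true[mask_s1], y_pred[mask_s1])
    ppv2, npv2 = ppv_npv(y_true[mask_s2], y_pred[mask_s2])
    return {"underskilling_bias": abs(npv1 - npv2),
            "overskilling_bias": abs(ppv1 - ppv2)}
```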
Motivation behind surgeon groups and sub-cohorts
We examined algorithmic bias against several surgeon groups. These included the volume of the prostate gland, the severity of the prostate cancer (Gleason score), and the surgeon caseload. We chose these groups after consultation with a urologist (AH) about their relevance, and based on the completeness of the clinical meta-information associated with the surgical cases. It may seem counter-intuitive at first to identify surgeon groups based on, for example, the volume of the prostate gland on which they operate. After all, few surgeons make the decision to operate on patients based on such a factor. Although a single surgeon may not have a say over the volume of the prostate gland on which they operate, institution- or geography-specific patient demographics may naturally result in these groups. For example, we found that, in addition to differences in the prostate volumes of patients within a hospital, there exists a difference in the distribution of such volumes across hospitals. Therefore, defining surgeon groups based on these factors still provides meaningful insight into algorithmic bias.
Defining surgeon sub-cohorts. In order to quantify bias as a discrepancy in model performance across sub-cohorts, we discretized continuous surgeon groups, where applicable, into two sub-cohorts. To define novice and expert surgeons, we built on previous literature, which uses surgeon caseload, the total number of robotic surgeries performed by a surgeon during their lifetime, as a proxy^54-56. As such, we define experts as having completed >100 robotic surgeries. As for prostate volume, we used the population median in the USC data to define the sub-cohorts of prostate volume ≤49 ml and >49 ml. We used the population median (a) in
Fig. 6 Visual definition of underskilling and overskilling bias in the context of binary skill assessment. An underskilling bias is reflected by a discrepancy in the negative predictive value of AI-based predictions across sub-cohorts of surgeons (e.g., s_1 = novice and s_2 = expert), whereas an overskilling bias is reflected by a discrepancy in the positive predictive value.