Education
Using Real-time Feedback To Improve Surgical Performance on a
Robotic Tissue Dissection Task
Jasper A. Laca a, Rafal Kocielnik b, Jessica H. Nguyen a, Jonathan You a, Ryan Tsang a, Elyssa Y. Wong a, Andrew Shtulman c, Anima Anandkumar b, Andrew J. Hung a,*

a Center for Robotic Simulation and Education, Catherine and Joseph Aresty Department of Urology, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA; b Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA; c Thinking Lab, Department of Psychology, Occidental College, Los Angeles, CA, USA
Article info
Article history:
Accepted September 26, 2022
Associate Editor:
M. Carmen Mir
Keywords:
Surgical education
Feedback
Robotic surgery
Learning
Mentoring
Abstract

Background: There is no standard for the feedback that an attending surgeon provides to a training surgeon, which may lead to variable outcomes in teaching cases.

Objective: To create and administer standardized feedback to medical students in an attempt to improve performance and learning.

Design, setting, and participants: A cohort of 45 medical students was recruited from a single medical school. Participants were randomly assigned to two groups. Both completed two rounds of a robotic surgical dissection task on a da Vinci Xi surgical system. The first round was the baseline assessment. In the second round, one group received feedback and the other served as the control (no feedback).

Outcome measurements and statistical analysis: Video from each round was retrospectively reviewed by four blinded raters and given a total error tally (primary outcome) and a technical skills score (Global Evaluative Assessment of Robotic Surgery [GEARS]). Generalized linear models were used for statistical modeling. According to their initial performance, each participant was categorized as either an innate performer (error tally below the median) or an underperformer (error tally above the median).

Results and limitations: In round 2, the intervention group had a larger decrease in error rate than the control group, with a risk ratio (RR) of 1.51 (95% confidence interval [CI] 1.07–2.14; p = 0.02). The intervention group also had a greater increase in GEARS score in comparison to the control group, with a mean group difference of 2.15 (95% CI 0.81–3.49; p < 0.01). The interaction effect between innate performers versus underperformers and the intervention was statistically significant for the error rates, at F(1,38) = 5.16 (p = 0.03). Specifically, the intervention had a statistically significant effect on the error rate for underperformers (RR 2.23, 95% CI 1.37–3.62; p < 0.01) but not for innate performers (RR 1.03, 95% CI 0.63–1.68; p = 0.91).
https://doi.org/10.1016/j.euros.2022.09.015
2666-1683/© 2022 The Author(s). Published by Elsevier B.V. on behalf of European Association of Urology. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

* Corresponding author. University of Southern California Institute of Urology, 1441 Eastlake Avenue, Los Angeles, CA 90089, USA. Tel. +1 323 865 3700; Fax: +1 323 865 0120. E-mail address: andrew.hung@med.usc.edu (A.J. Hung).
EUROPEAN UROLOGY OPEN SCIENCE 46 (2022) 15–21
available at www.sciencedirect.com
journal homepage: www.eu-openscience.europeanurology.com
Conclusions: Real-time feedback improved performance globally compared to the control. The benefit of real-time feedback was stronger for underperformers than for trainees with innate skill.

Patient summary: We found that real-time feedback during a training task using a surgical robot improved the performance of trainees when the task was repeated. This feedback approach could help in training doctors in robotic surgery.
1. Introduction
A surgeon's formal training period involves learning from many mentors who provide feedback during surgery. The effectiveness of this feedback in improving the performance of a trainee surgeon ultimately dictates surgical outcomes. One of the main challenges with the current status quo is that there is no established standard for feedback delivered by mentors. One mentor's methodology may differ significantly from that of another. Trainee surgeons subjected to variable feedback may produce variable surgical outcomes, both during and after their formal training. This is compounded by the fact that some trainee surgeons are naturally gifted, while others need more help in their training [1].
Despite the challenges of the status quo, the benefits of surgical mentorship are hard to dispute. However, there is no means to continue mentorship beyond formal training, which represents another challenge. Young surgeons who have finished their formal training may lose out on the potential benefits of mentorship before they have achieved surgical mastery, a milestone that is likely to be reached well after training.

Research has shown that surgical mentoring in robotic surgery can improve a trainee surgeon's task acquisition [2]. Without standardization, however, there is no guarantee that mentoring will consistently have a positive effect. Recent research has shown that the process of providing feedback can be further automated with an auto-mentor [3]. Automated feedback may solve the problem of mentor inconsistency and maintain the benefits of efficacious mentoring for as long as an individual might need it.
In a previous study, we used feedback that was tailored to the individual trainee surgeon on the basis of their performance [4]. This summative feedback was provided weekly, covering each training session from the previous week. Individuals who received feedback showed accelerated task acquisition in comparison to a control group with no feedback. While the feedback improved long-term performance, it did not improve performance on the immediate task.

In the present study, we explored the effect of standardized real-time feedback on a simulated dissection task. The goal of the feedback is to aid the trainee surgeon's learning and to prevent or reduce errors during the training procedure.

We hypothesize that: (H1) feedback leads to improvements in surgical performance, measured as the error rate (H1a) and technical skills score (H1b), in comparison to a control; and (H2) participants who initially perform worse show a greater improvement after the intervention in comparison to those who initially perform well, measured as the error rate (H2a) and technical skills score (H2b).
2. Materials and methods
A group of novice medical students without any surgical experience completed a simulated dissection procedure on a da Vinci robotic system (Intuitive Surgical, Sunnyvale, CA, USA). The task involved removing the premarked top of a clementine's skin and then exposing and removing a single segment of the interior fruit.

The study was designed as a two-task repetition spread over two separate sessions. The first session consisted of a brief training period, during which the participant was given a standardized introductory course on how to use the da Vinci surgical robot. This involved standardized instruction from a proctor, followed by two practice tasks. The practice tasks served as an opportunity for the participant to test all of the introductory skills needed to complete the experimental task.

The participant then completed a baseline/control clementine task (round 1). After all the participants had completed the first session, they were randomized into two groups (group 1 and group 2) with equal average performance scores. In the second session, participants completed the task once more (round 2), during which group 1 received feedback while group 2 did not. Participants in each group were individually categorized as either an innate performer (IP; round-1 error tally below the median) or an underperformer (UP; round-1 error tally above the median) for further analyses.
Endoscope video and audio (during the feedback round only) were recorded for the task in each round. Video was recorded by feeding the output of the endoscope into an external screen-grabber (OBS; https://obsproject.com/). Audio of the training sessions was recorded using Tobii eye-tracking glasses (https://www.tobiipro.com/) and retrospectively synced with video of the experimental task.

Feedback consisted of seven prerecorded voice messages that were manually triggered by a proctor using an online soundboard (https://blerp.com). The seven pieces of feedback corresponded to seven risky behaviors commonly observed for novices, as identified from previous research on this dissection task [5]. Feedback was constructed according to the following standardized template: [warning/call to attention] + [risk factor] + [proposed mitigation] + [longer-term impact] [6]. Each piece of feedback was triggered by a specific risky behavior that could cause an error if not corrected in time. For example, if the participant did not appropriately grip the skin of the fruit, they were given feedback relevant to that behavior in an attempt to avoid a potential skin tear (error).
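To make the template concrete, the sketch below composes a feedback message from the four slots. This is an illustrative reconstruction in Python, not software used in the study; the function name is hypothetical and the slot wording paraphrases feedback phrase 4 in Table 1.

```python
# Illustrative sketch (not study software): composing one feedback message
# from the standardized four-slot template described above.

def build_feedback(warning: str, risk_factor: str,
                   mitigation: str, impact: str) -> str:
    """[warning/call to attention] + [risk factor] +
    [proposed mitigation] + [longer-term impact]."""
    return f"{warning}: {risk_factor}; {mitigation} {impact}."

# Slot wording paraphrased from feedback phrase 4 in Table 1 (insufficient grip).
message = build_feedback(
    warning="Caution",                                  # call to attention
    risk_factor="you are gripping too little skin",     # observed risky behavior
    mitigation="grab as much skin as you can",          # proposed mitigation
    impact="to prevent it from tearing accidentally",   # longer-term impact
)
print(message)
```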
All feedback was recorded by the same voice actor (J.A.L.). Participants were instructed to stop what they were doing and listen to the entirety of the recorded feedback message before returning to the task. Each piece of feedback was preceded by an alert tone, which participants were told was their cue to stop and listen (Fig. 1).
Audio-free video of the tasks was retrospectively reviewed by four blinded raters. Each rater provided a total error tally and a technical skill score for each video using the previously validated Global Evaluative Assessment of Robotic Surgery (GEARS) [7]. Intraclass correlation coefficients (ICCs) were measured for the raters' first ten videos to assess inter-rater reliability. The raters achieved an ICC of 0.84 (95% confidence interval [CI] 0.783–0.892) for error identification and classification, and 0.81 (95% CI 0.742–0.862) for GEARS scores.
After completion of the feedback round, group 1 was asked to complete a System Usability Scale (SUS) questionnaire [8] to assess how useful the participants perceived the feedback to be.
2.1. Statistical analysis
Four hypotheses were tested in the study: two on the overall intervention effect (H1a and H1b) and two on interaction effects (H2a and H2b). To prevent inflation of the experiment-wise error rate (α) by multiple hypothesis testing, we used a fixed sequential method to assign the α value. In this procedure, we started by testing the overall group effect on the primary outcome. If the null hypothesis was rejected, the full experiment-wise error rate (α) was carried to the next test, for the interaction between group and participant type. If the null hypothesis was rejected again, we moved to the secondary outcome (GEARS) and followed the same sequence, passing on the experiment-wise error rate (α). Under this testing chain, if any test failed to reject the null hypothesis, we stopped subsequent hypothesis testing and reported only the descriptive result. In an exploratory analysis, we used scatter plots to illustrate the pattern of correlation between GEARS, error, and SUS scores by the IP and UP participant types. Pearson or Spearman correlation was used for the descriptive analysis, depending on data normality.
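The fixed-sequence (gatekeeping) logic can be summarized in a few lines. The sketch below is illustrative only; it uses the testing order described above and the p-values reported in Section 3 purely to demonstrate the stopping rule.

```python
# Sketch of the fixed-sequence (gatekeeping) procedure: hypotheses are
# tested in a prespecified order, each at the full experiment-wise alpha,
# and confirmatory testing stops at the first non-rejection.
ALPHA = 0.05

# Prespecified order, with the p-values reported in Section 3.
sequence = [
    ("H1a: group effect on error rate", 0.021),
    ("H2a: group x performer-type interaction (errors)", 0.029),
    ("H1b: group effect on GEARS score", 0.002),
    ("H2b: group x performer-type interaction (GEARS)", 0.215),
]

for label, p in sequence:
    if p < ALPHA:
        print(f"{label}: null rejected (p = {p}); alpha carried forward")
    else:
        print(f"{label}: null not rejected (p = {p}); stop here")
        break  # remaining hypotheses are reported descriptively only
```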
Generalized linear models were used for statistical modeling. We used a Gaussian distribution with an identity link function to model continuous outcomes (GEARS scores); the intervention effect is presented as the absolute difference between groups. For count outcomes (errors), we first standardized the measure by the total task length (number of errors/10 min) to derive error rates and modeled these using a log link function; thus, the intervention effect is reported as a risk ratio (RR; a multiplicative difference represented as the ratio of the error count/10 min between groups). The difference in intervention effect by participant type was tested using the interaction term in the model. The stratum-specific intervention effect was estimated using a post hoc contrast test. Model integrity was examined using residual plots and normality tests for residuals. Data analysis was performed using SPSS version 28.0.1.0 (SPSS Inc., Chicago, IL, USA).
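For readers who want to reproduce this modeling approach outside SPSS, a minimal sketch in Python/statsmodels follows. The data frame, values, and column names are invented, and the Poisson family is an assumption (a natural choice for a log-link count model; the paper specifies only the link function).

```python
# Minimal sketch of the two generalized linear models described above,
# in Python/statsmodels rather than SPSS. All data are illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per participant: round-2 outcomes with round-1 (baseline) covariates.
df = pd.DataFrame({
    "errors_r2":     [4, 7, 2, 9, 3, 6],      # round-2 error counts
    "minutes_r2":    [15.0, 22.9, 11.7, 20.0, 14.1, 18.3],
    "gears_r2":      [16, 13, 18, 11, 17, 14],
    "gears_r1":      [12, 11, 15, 10, 14, 12],
    "error_rate_r1": [5.1, 6.3, 3.2, 7.0, 4.4, 5.8],
    "group":         [1, 0, 1, 0, 1, 0],      # 1 = feedback, 0 = control
})

# Error rate: count outcome with a log link; the offset converts counts to
# a rate per 10 minutes, so exp(coef) is a multiplicative effect (RR).
rate_model = smf.glm(
    "errors_r2 ~ group + error_rate_r1", data=df,
    family=sm.families.Poisson(),              # log link by default
    offset=np.log(df["minutes_r2"] / 10.0),
).fit()
print(np.exp(rate_model.params["group"]))      # RR for the group effect

# GEARS: continuous outcome, Gaussian family with identity link, so the
# group coefficient is an absolute mean difference between groups.
gears_model = smf.glm("gears_r2 ~ group + gears_r1", data=df,
                      family=sm.families.Gaussian()).fit()
print(gears_model.params["group"])             # mean GEARS difference
```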
3. Results
3.1. Participant demographics
A total of 45 medical students were recruited, of whom 13 were first-year and 32 were second-year students. The median age of the participants was 24 yr (range 21–34 yr) and 23/45 (51%) identified as female. There were no significant differences in demographic variables (gender, education, hand dominance, age, medical school year) between group 1 and group 2, or between the IP and UP groups.

In round 1, participants took between 4 min 47 s and 60 min 39 s to complete the task; the median time was 26 min 36 s (interquartile range [IQR] 31 min–32 min 23 s). In round 2, the median time was 15 min 3 s (IQR 11 min 44 s–22 min 59 s). A total of 60 instances of feedback were delivered to the intervention group (group 1). The median number of feedback instances was 3 (range 1–5). Among the five feedback categories (Table 1), the most frequently delivered was related to scissor usage, accounting for 28/60 instances (47%). The least common types of feedback were related to ideal tissue exposure and force (Table 1). Fisher's exact test revealed no statistically significant difference in feedback type between the IP and UP groups (two-tailed p = 0.369). There was a statistically significant negative correlation between technical skills (GEARS score) and the error rate in both study rounds, with Spearman correlation coefficients of r(41) = −0.69 (p < 0.001) for round 1 and r(41) = −0.83 (p < 0.001) for round 2.
3.2. Intervention effect: change from baseline to round 2 between the intervention and control groups

We first estimated the impact of feedback delivery in comparison to the control (Fig. 2A). Comparison of the intervention (feedback) and control groups revealed a decrease in error rates in the feedback round, with a count RR of 1.51 (95% CI 1.07–2.14; p = 0.021). That is, the decrease in error rate in the intervention group was 1.51 times greater than in the control group (ie, repetition of the same task without feedback). Hypothesis H1a is supported.
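To unpack what a count RR of 1.51 means here, consider the illustrative arithmetic below. The rates are invented (not study data) and, for simplicity, both groups are assumed to start from the same baseline error rate.

```python
# Illustrative arithmetic only (invented rates): interpreting a count RR
# of 1.51 as the ratio of fold-decreases in error rate between groups.
baseline_rate = 6.0                      # errors per 10 min, both groups (assumed)
control_round2 = 4.0                     # control improves to 4.0 -> 1.5-fold decrease
fold_control = baseline_rate / control_round2

rr = 1.51                                # reported count risk ratio
fold_intervention = fold_control * rr    # ~2.3-fold decrease
intervention_round2 = baseline_rate / fold_intervention
print(fold_intervention, intervention_round2)  # ~2.3-fold, ~2.6 errors/10 min
```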
Similarly, we found an increase in technical skills score, with a mean difference between the groups of 2.15 (95% CI 0.81–3.49; p = 0.002; Table 2). This represents an increase in total GEARS score (scale 0–25) of more than 2 points. Hypothesis H1b is supported.

Fig. 1 – Example of the feedback delivery process.
3.3. Intervention effect for the IP and UP groups
The interaction effect between performance group (IP vs UP) and condition (feedback vs no feedback) was statistically significant for the error rate, with F(1,38) = 5.162 (p = 0.029; hypothesis H2a is supported), but did not reach statistical significance for technical skills, with F(1,38) = 1.588 (p = 0.215; hypothesis H2b is not supported). For consistency, we performed a contrast test for both outcomes.

For the UP group there was a statistically significant difference between the intervention and the control in terms of a decrease in error rate, with an RR of 2.23 (95% CI 1.37–3.62; p = 0.002). This indicates that UP participants reduced their error rate in the intervention by 2.23-fold in comparison to the control (task repetition without feedback). Similarly, for the UP group there was a statistically significant increase in technical skills score, with a mean difference of 2.99 (95% CI 0.90–5.06; p = 0.006; Table 2).

For the IP group the differences were not statistically significant: for the decrease in error rate, the RR was 1.03 (95% CI 0.63–1.68; p = 0.914), and for the increase in technical skills score, the mean difference was 1.32 (95% CI −0.36 to 2.99; p = 0.119; Table 2). To confirm that the IP and UP groups received a similar amount of feedback, we ran a Mann-Whitney U test to compare differences in feedback instance counts. The test revealed no statistically significant (at α = 0.05) difference between the IP group (median 2.0, range 1–4; n = 10) and the UP group (median 3.0, range 2–5; n = 11), with U = 32.00 and z = −1.72 (two-tailed test, p = 0.087).
3.4. Usability of feedback
We further analyzed whether SUS scores were correlated with performance. For this analysis, we could only include participants who received feedback (group 1). A one-tailed Spearman correlation test between the decrease in error count and the SUS score, controlling for baseline error counts (round 1), was statistically significant: ρ(18) = 0.530, p = 0.008. A one-tailed Pearson correlation test between the increase in technical skills and the SUS score, controlling for baseline technical skills (round 1), was also statistically significant: r(18) = 0.503, p = 0.012.
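These "controlling for baseline" tests are partial correlations. A minimal sketch with the pingouin library follows; the library choice and data are assumptions for illustration, and pingouin returns a two-sided p-value by default, whereas the paper reports one-tailed tests.

```python
# Sketch of a partial correlation (performance change vs SUS score,
# controlling for baseline), using pingouin; the data are invented.
import pandas as pd
import pingouin as pg

df = pd.DataFrame({
    "sus":            [68, 85, 72, 90, 55, 80, 75, 95, 60, 88],
    "error_decrease": [1, 4, 2, 5, 0, 3, 2, 6, 1, 5],
    "errors_r1":      [5, 8, 6, 9, 4, 7, 6, 10, 5, 9],
})

# Spearman partial correlation; the two-sided p-value can be halved for a
# directional (one-tailed) test when the effect has the hypothesized sign.
res = pg.partial_corr(data=df, x="sus", y="error_decrease",
                      covar="errors_r1", method="spearman")
print(res[["n", "r", "p-val"]])
```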
We further assessed these correlations within the IP and UP subgroups separately (Fig. 3). For the UP group, usability was significantly correlated with a decrease in error rate: ρ(8) = 0.674, p = 0.016. Usability also showed a marginally significant correlation with an increase in technical skills: r(8) = 0.547, p = 0.051. For the IP subgroup, usability was not significantly correlated with a decrease in error count: ρ(7) = 0.162, p = 0.339. In addition, usability was not significantly correlated with an increase in technical skills score: r(7) = 0.123, p = 0.376.

We further assessed whether the UP and IP groups differed in their average SUS score. The difference was not statistically significant, with a mean difference of 3.07 (95% CI −7.57 to 13.71; p = 0.572).
4. Discussion
Our study shows that real-time feedback led to a performance improvement (measured as technical skills and errors) for the simulated dissection task for which it was provided, and was particularly helpful for participants who initially struggled with the task (UP group). Furthermore, the performance of UP participants who received feedback improved to a level similar to that of the IP participants. Broadly speaking, this serves as proof of concept that standardized, real-time surgical feedback can be a useful aid in surgical training.

SUS results were positively correlated with performance. The more receptive an individual was to the feedback (ie, the more useful they found it), the more efficacious the feedback appeared to be. While the UP and IP groups had statistically similar SUS scores, only the UP group had a statistically significant correlation between SUS score and performance. Regardless of how usable the IP participants perceived the feedback to be, they performed well. By contrast, the more usable the feedback was for UP participants, the better their performance. One interpretation is that the usability scores measure how timely, understandable, and nondistracting the feedback was, but do not necessarily measure how "valuable" or "indispensable" the feedback was for accomplishing the task.
Table 1 – Feedback delivery analysis: number of feedback instances by category during round 2 and the feedback phrases recorded for specific targeted behaviors

Feedback category        Targeted behavior                          Instances, n (%)
Scissor usage            1. Misuse of scissors (peel)               15 (25)
                         2. Misuse of scissors (fruit)              11 (18)
                         3. Failure to follow lines                 2 (3)
Retraction               4. Insufficient grip                       22 (40)
Ideal surgical gesture   5. Disregard for surgical plane (peel)     4 (6)
Tissue exposure          6. Disregard for surgical plane (fruit)    3 (5)
Force sensitivity        7. Too much force                          3 (5)
All                                                                 60 (100)

Specific feedback phrases for targeted behaviors
1. "Be mindful: avoid using the pointed end of the scissors when rotating the object, use the soft edge instead, this will reduce the risk of unintended skin punctures."
2. "Be careful when using the scissors on the internal tissue, use the forceps to separate wedges to reduce the risk of unintentional punctures."
3. "Be cautious: before attempting to remove any skin, make sure you have cut along the dotted lines. This will improve the efficiency of removing tissue."
4. "Caution: if you're planning on pulling skin, make sure you are grabbing as much skin as you can to prevent it from tearing accidentally."
5. "Remember: when removing the skin, use a peeling action after you have cut the lines; this will reduce unnecessary cutting and punctures."
6. "Be aware: when you're removing the wedge, make sure you fully separate it from the rest of the fruit, otherwise it may come apart in pieces."
7. "Be cautious with how much force you are applying with your tools; be as gentle as possible when appropriate, otherwise you risk damaging the object."
The IP participants may have understood the feedback, but it was not as useful or necessary for them. An alternative interpretation of these results is that the better someone does on a task, the more highly they will rate the intervention retrospectively.
One of the main limitations was the sample size and the representativeness of the study population. We had a limited pool of student participants who had no experience with a surgical robot. The 45 participants were introductory medical students, with no specialization towards a specific medical specialty. This was beneficial because the students represented a blank slate (without bias) at the beginning of the study. However, their results can only be generalized to similar populations and not necessarily to surgeons at more advanced stages of training. Another limitation is the simulated task itself; while peeling and handling a clementine is a relatively inexpensive and safe simulated model for tissue dissection, it cannot replicate live tissue dissection. Thus, our findings will have to be validated in future work with more life-like, complex tissue dissection tasks.

Fig. 2 – (A) Model-estimated mean error rates and technical skill scores (GEARS) for rounds 1 and 2 for the feedback and control groups. (B) Model-estimated mean error rates for underperformers and innate performers. (C) Model-estimated mean technical skill scores (GEARS) for underperformers and innate performers. For all plots, error bars represent the 95% confidence interval for the estimates. GEARS = Global Evaluative Assessment of Robotic Surgery.
This study provides a step towards future automation of real-time feedback. However, because the feedback was not truly automated, our results are limited by the human element and do not represent results for an automated system. All of the feedback presented during the study was triggered by a human proctor. To mitigate this variability, only one proctor was used throughout the study. Nonetheless, the conditions for provision of specific feedback were anchored to objective behaviors, and predefined feedback was given for specific behaviors; this was designed to reduce feedback variability.

We propose that future studies should replicate our design for a variety of different surgical tasks and procedures, targeting participants of varying experience. Computer vision models can be trained to recognize errors and the behavioral patterns that lead to them, which can then serve as cues for provision of targeted real-time feedback via an automated system.
5. Conclusions
The results indicate that our real-time feedback system was capable of improving trainee performance in comparison to a control without feedback. Underperformers, in contrast to innate performers, benefited the most from the feedback, highlighting an ideal target audience for a future truly automated feedback system.
Author contributions: Andrew J. Hung had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study concept and design: Laca, Kocielnik, Nguyen, Hung.
Acquisition of data: Laca.
Analysis and interpretation of data: Kocielnik, Laca.
Drafting of the manuscript: Laca, Hung.
Critical revision of the manuscript for important intellectual content: Hung, Nguyen, Wong, You, Tsang, Shtulman.
Statistical analysis: Kocielnik.
Obtaining funding: Hung.
Administrative, technical, or material support: Hung, Nguyen, Laca.
Supervision: Anandkumar.
Other: None.
Financial disclosures: Andrew J. Hung certifies that all conflicts of interest, including specific financial interests and relationships and affiliations relevant to the subject matter or materials discussed in the manuscript (eg, employment/affiliation, grants or funding, consultancies, honoraria, stock ownership or options, expert testimony, royalties, or patents filed, received, or pending), are the following: Andrew J. Hung is a paid consultant for Intuitive Surgical. Anima Anandkumar is a paid employee of Nvidia. The remaining authors have nothing to disclose.
Table 2 – Impact of feedback on the error rate and technical skills in round 2 a

Group                        Decrease in error rate b     Increase in technical skills c
                             Count risk ratio (95% CI)    Mean difference (95% CI)
All participants (n = 43)    1.51 (1.07–2.14) *           2.15 (0.81–3.49) **
Underperformers (n = 22)     2.23 (1.37–3.62) **          2.99 (0.90–5.06) **
Innate performers (n = 21)   1.03 (0.63–1.68)             1.32 (−0.36 to 2.99)

CI = confidence interval.
a The analysis was performed for all 43 participants, with an interaction contrast for underperformers and innate performers. Feedback had a significant impact on reducing the error rate and improving technical skills for underperformers, but not for innate performers. All models were controlled for performance at baseline (round 1).
b Higher values are better. Interaction p = 0.03.
c Higher values are better. Interaction p = 0.2.
* p < 0.05. ** p < 0.01.
Fig. 3 – Empirical correlation of the System Usability Scale (SUS) score with the error count and technical skills score (GEARS) for underperformers and innate performers. The difference in slope shows that the feedback was more important in helping underperformers than innate performers. GEARS = Global Evaluative Assessment of Robotic Surgery.
Funding/Support and role of the sponsor: This work was supported by the National Science Foundation under grant #2030859 to the Computing Research Association for the CIFellows Project. The sponsor played a role in analysis and interpretation of the data and review of the manuscript.
References

[1] El Boghdady M, Ewalds-Kvist BM. The innate aptitude's effect on the surgical task performance: a systematic review. Updates Surg 2021;73:2079–93. https://doi.org/10.1007/s13304-021-01173-6.
[2] Hanly EJ, Miller BE, Kumar R, et al. Mentoring console improves collaboration and teaching in surgical robotics. J Laparoendosc Adv Surg Tech 2006;16:445–51. https://doi.org/10.1089/lap.2006.16.445.
[3] Kopp KJ, Britt MA, Millis K, Graesser AC. Improving the efficiency of dialogue in tutoring. Learn Instruct 2012;22:320–30. https://doi.org/10.1016/j.learninstruc.2011.12.002.
[4] Ma R, Nguyen JH, Cowan A, et al. Using customized feedback to expedite the acquisition of robotic suturing skills. J Urol 2022;208:414–24.
[5] Nguyen JH, Chen J, Marshall SP, et al. Using objective robotic automated performance metrics and task-evoked pupillary response to distinguish surgeon expertise. World J Urol 2020;38:1599–605. https://doi.org/10.1007/s00345-019-02881-w.
[6] Duan P, Yocius M, Miltner M, Engler J, Schnell T, Uijt De Haag M. Human-in-the-loop evaluation of an information management and notification system to improve aircraft state awareness. In: Proceedings of the 2015 AIAA Infotech @ Aerospace Conference. Reston, VA: American Institute of Aeronautics and Astronautics; 2015. https://doi.org/10.2514/6.2015-0794.
[7] Goh AC, Goldfarb DW, Sander JC, Miles BJ, Dunkin BJ. Global evaluative assessment of robotic skills: validation of a clinical assessment tool to measure robotic surgical skills. J Urol 2012;187:247–52. https://doi.org/10.1016/j.juro.2011.09.032.
[8] Peres SC, Pham T, Phillips R. Validation of the System Usability Scale (SUS): SUS in the wild. Proc Hum Factors Ergonom Soc Annu Meet 2013;57:192–6. https://doi.org/10.1177/1541931213571043.