On 2021-08-06 07:17:04, user disqus_UQJEvw3dWd wrote:
Dear Dr Austen El-Osta,
We read with interest the preprint article “What is the suitability of clinical vignettes in benchmarking the performance of online symptom checkers? An audit study”. Studies addressing the suitability of different evaluation methods are useful, and vignette methods in particular have known advantages as well as known shortcomings (Fraser et al., 2018; Jungmann et al., 2019). Further detailed analysis of the overall utility of vignette methodologies is certainly important. Whilst the approach taken here for exploring vignette methodologies is interesting and warrants reading and careful consideration, two aspects of the study's conduct and reporting are deeply worrying.
We ask the authors to correct aspects of the paper in which an unequal and unbalanced methodology was applied to the funder's symptom checker (Healthily), as compared with that applied to the symptom checkers of the funder's competitors (Ada and Babylon).
We also ask that the authors report results in a balanced manner in the abstract. All outcome measures should be reported fairly, irrespective of whether the funder’s symptom checker performed well in any particular measure. Please see below for a detailed description of these aspects.
We do not state that the selective application of methodology and the selective reporting of results have been deliberately conducted to bias the study to the benefit of the funder. However, the degree of differential treatment of the funder's symptom checker is so large that an independent reader could draw that conclusion. We suggest rectifying the highlighted issues in the preprint before submitting the manuscript for peer review.
Should these issues not be addressed in any future peer review process, we will, in due course, also write to the editor of the publishing peer-reviewed journal.
Major concern 1, bias towards the study funder: The paper not only assesses the utility of a methodology; it also applies that methodology to report on the relative performance of different symptom checkers (i.e. benchmarking).
This approach would be fair if the same methodology were applied to all the symptom checkers; however, the study presents a grossly unmatched analysis. One approach was used for the funder's symptom checker (Healthily) and a second for the symptom checkers of two main competitors of the funder. This gives the appearance of fundamental bias in testing and reporting based on study funding. Although some degree of bias may be introduced into studies for a multitude of reasons, the deliberate application of fundamentally different testing methodologies to the products of the funder compared with those applied to its competitors is unacceptable. The Healthily symptom checker was tested with 6 inputters (4 professional non-doctor and 2 lay), whilst, with no rational justification, the Ada and Babylon symptom checkers were tested with a testing group of fundamentally different make-up (not just a different number of testers, but a systematic and deliberate choice to use a different type of tester population, i.e. 4 professional non-doctor inputters).
The number of tests also differed greatly (n=816 for Healthily vs n=272 for Ada and Babylon). Additionally, only one professional non-doctor inputter recorded the consultation outcome and triage recommendation using the Ada and Babylon symptom checkers for all 139 vignettes, in contrast to the approach the authors adopted for Healthily.
Major concern 2, bias towards the study funder: There is also an important bias in the selection of results reported in the abstract.<br />
With respect to condition suggestion: The results section reports that “Ada consistently performed better than Healthily and Babylon in providing the correct consultation outcome in D1, D2 and D3” (i.e. in the provision of correct condition suggestions). The difference in performance was large: “The correct consultation outcome for Ada against the RCGP Standard at any disposition was 54.0% compared to 37.4% for Healthily and 28.1% for Babylon”. The abstract acknowledges that condition suggestion (referred to as disposition/diagnosis) is a main outcome measure; however, this measure is not reported in the abstract. This looks like selective reporting in the abstract to avoid negative messages about the funder's symptom checker.
With respect to ‘triage recommendation’:<br />
It is reported in the results that “In benchmarking against the original RCGP standard, Healthily provided an appropriate triage recommendation 43.3% (95% CI 39.2%, 47.6%) of the time, whereas Ada and Babylon were correct 61.2% (95% CI 52.5%, 69.3%) and 57.6% (95% CI 48.9%, 65.9%) of the time respectively (p<0.001)”. Again, this is omitted from the abstract, which reports only the aspects in which the funder's symptom checker performed relatively well.
We would welcome changes to this study to remove bias towards the funder in the methodology and in the reporting of results.
Yours faithfully,
On behalf of Ada Health GmbH<br />
Dr. Stephen Gilbert<br />
Clinical Evaluation Director<br />
Ada Health GmbH<br />
Karl-Liebknecht-Str. 1<br />
10178 Berlin, DE <br />
+49 (0) 152 0713 0836
REFERENCES
Fraser, H., Coiera, E., Wong, D., 2018. Safety of patient-facing digital symptom checkers. The Lancet 392, 2263–2264. https://doi.org/10.1016/S01...
Jungmann, S.M., Klan, T., Kuhn, S., Jungmann, F., 2019. Accuracy of a Chatbot (Ada) in the Diagnosis of Mental Disorders: Comparative Case Study With Lay and Expert Users. JMIR Formative Research 3, e13863. https://doi.org/10.2196/13863