Bowen Wang♣♢ (equal contribution), Jiuyang Chang♡ (equal contribution), Yiming Qian♠ (corresponding author), Guoxin Chen♠, Junhao Chen♢,
Zhouqiang Jiang♢, Jiahao Zhang♢, Yuta Nakashima♢♣, Hajime Nagahara♢♣
♣Premium Research Institute for Human Metaverse Medicine (WPI-PRIMe), Osaka University,
♡Department of Cardiology, The First Affiliated Hospital of Dalian Medical University,
♢Institute of Datability Science (IDS), Osaka University,
♠Agency for Science, Technology and Research (A*STAR),
{wang, n-yuta, nagahara}@ids.osaka-u.ac.jp
changjiuyang@firsthosp-dmu.com
qiany@ihpc.a-star.edu.sg, gx.chen.chn@gmail.com
{junhao, zhouqiang, jiahao}@is.ids.osaka-u.ac.jp
Abstract
Large language models (LLMs) have recently showcased remarkable capabilities across a wide range of tasks and applications, including those in the medical domain. Models like GPT-4 excel in medical question answering but may suffer from a lack of interpretability when handling complex tasks in real clinical settings. We thus introduce the diagnostic reasoning dataset for clinical notes (DiReCT), aiming at evaluating the reasoning ability and interpretability of LLMs compared to human doctors. It contains 511 clinical notes, each meticulously annotated by physicians, detailing the diagnostic reasoning process from observations in a clinical note to the final diagnosis. Additionally, a diagnostic knowledge graph is provided to offer essential knowledge for reasoning, which may not be covered in the training data of existing LLMs. Evaluations of leading LLMs on DiReCT reveal a significant gap between their reasoning ability and that of human doctors, highlighting the critical need for models that can reason effectively in real-world clinical scenarios. (Code is available at https://github.com/wbw520/DiReCT; data will be released through PhysioNet.)
1 Introduction
Recent advancements in large language models (LLMs) [Zhao etal., 2023] have ushered in new possibilities and challenges for a wide range of natural language processing (NLP) tasks [Min etal., 2023]. In the medical domain, these models have demonstrated remarkable prowess [Anil etal., 2023, Han etal., 2023], particularly in medical question answering (QA) [Jin etal., 2021]. Leading-edge models, such as GPT-4 [OpenAI, 2023a], exhibit profound proficiency in understanding and generating text [Bubeck etal., 2023], even achieving high scores on United States Medical Licensing Examination (USMLE) questions [Nori etal., 2023].
Despite these advancements, interpretability remains critical, particularly in medical NLP tasks [Liévin etal., 2024]. Some studies assess this capability over medical QA [Pal etal., 2022, Li etal., 2023, Chen etal., 2024] or natural language inference (NLI) [Jullien etal., 2023]. While putting more attention on interpretability, they use relatively simple tasks as testbeds, taking short text as input. However, tasks in real clinical settings can be more complex [Gao etal., 2023a]. As shown in Figure 1, a typical diagnosis requires comprehending and combining various information, such as health records, physical examinations, and laboratory tests, and then reasoning about possible diseases in a step-by-step manner following established guidelines. This observation suggests that both perception, or reading (e.g., finding necessary information in a medical record), and reasoning (determining the disease based on the observations) should be counted when evaluating interpretability in LLM-based medical NLP tasks.
For a more comprehensive evaluation of LLMs for supporting diagnosis in a more realistic setting, we propose a Diagnostic Reasoning dataset for Clinical noTes (DiReCT). The task is essentially to predict the diagnosis from a clinical note of a patient, which is a collection of various medical records written in natural language. Our dataset contains 511 clinical notes spanning 25 disease categories, sampled from a publicly available database, MIMIC-IV [Johnson etal., 2023]. Each clinical note underwent fine-grained annotation by professional physicians. The annotators (i.e., the physicians) were responsible for identifying the text, or the observation, in the note that leads to a certain diagnosis, as well as the explanation. The dataset also provides a diagnostic knowledge graph based on existing diagnostic guidelines to facilitate more consistent annotations and to supply a model with essential knowledge for reasoning that might not be encompassed in its training data.
To underscore the challenge offered by our dataset, we evaluate a simple AI-agent based baseline [Xi etal., 2023, Tang etal., 2023] that utilizes the knowledge graph to decompose the diagnosis into a sequence of diagnoses from a smaller number of observations. Our experimental findings indicate that current state-of-the-art LLMs still fall short of aligning well with human doctors.
Contribution. DiReCT offers a new challenge in diagnosis from a complex clinical note with explicit knowledge of established guidelines. This challenge aligns with the realistic medical scenario that doctors experience. On the application side, the dataset facilitates the development of models to support doctors in diagnosis, which is error-prone [Middleton etal., 2013, Liu etal., 2022]. On the technical side, the dataset can benchmark models’ ability to read long text and find the observations necessary for multi-evidence entailment tree reasoning. As shown in Figure 3, this is not trivial because of variations in writing; superficial matching does not help, and medical knowledge is vital. Meanwhile, reasoning itself is facilitated by the knowledge graph, so the model does not necessarily need to have internalized the diagnostic guidelines. With this choice, the knowledge graph explains the reasoning process, which is also beneficial when deploying such a diagnosis assistant system in practice.
2 Related Works
Natural language explanation. Recent advancements in NLP have led to significant achievements [Min etal., 2023]. However, existing models often lack explainability, posing potential risks [Danilevsky etal., 2020, Gurrapu etal., 2023]. Numerous efforts have been made to address this challenge. One effective approach is to provide a human-understandable plain text explanation alongside the model’s output [Camburu etal., 2018, Rajani etal., 2019]. Another strategy involves identifying evidence within the input that serves as a rationale for the model’s decisions, aligning with human reasoning [DeYoung etal., 2020]. Expanding on this concept, [Jhamtani and Clark, 2020] introduces chain-structured explanations, given that a diagnosis can demand multi-hop reasoning. This idea is further refined by ProofWriter [Tafjord etal., 2021] through a proof stage for explanations, and by [Zhao etal., 2021] through retrieval from a corpus. [Dalvi etal., 2021] proposes the entailment tree, offering more detailed explanations and facilitating inspection of the model’s reasoning. More recently, [Zhang etal., 2024] employed cumulative reasoning to tap into the potential of LLMs to provide explanation via a directed acyclic graph. Although substantial progress has been made, interpreting NLP tasks in medical domains remains an ongoing challenge [Liévin etal., 2024].
Benchmarks of interpretability in the medical domain. Several datasets are designed to assess a model’s reasoning together with its interpretability in medical NLP (Table 1). MedMCQA [Pal etal., 2022] and other medical QA datasets [Li etal., 2023, Chen etal., 2024] provide plain text as explanations for QA tasks. NLI4CT [Jullien etal., 2023] uses clinical trial reports, focusing on NLI supported by multi-hop reasoning. N2C2 [Gao etal., 2022] proposes a summarization (Sum) task for a diagnosis based on multiple pieces of evidence in the input clinical note. NEJM CPC [Zack etal., 2023] interprets clinicians’ diagnostic reasoning as plain text for reasoning about clinical diagnosis (CD). DR.BENCH [Gao etal., 2023b] aggregates publicly available datasets to assess the diagnostic reasoning of LLMs. Utilizing a multi-evidence entailment tree explanation, DiReCT introduces a more rigorous task to assess whether LLMs can align with doctors’ reasoning in real clinical settings.
Table 1: Comparison of benchmarks for interpretability in the medical domain (lengths are in tokens "t" or words "w").

| Dataset | Task | Data Source | Length | Explanation | # Cases |
|---|---|---|---|---|---|
| MedMCQA [Pal etal., 2022] | QA | Examination | 9.93 t | Plain Text | 194,000 |
| ExplainCPE [Li etal., 2023] | QA | Examination | 37.79 w | Plain Text | 7,000 |
| JAMA Challenge [Chen etal., 2024] | QA | Clinical Cases | 371 w | Plain Text | 1,524 |
| Medbullets [Chen etal., 2024] | QA | Online Questions | 163 w | Plain Text | 308 |
| N2C2 [Gao etal., 2022] | Sum | Clinical Notes | 785.46 t | Evidences | 768 |
| NLI4CT [Jullien etal., 2023] | NLI | Clinical Trial Reports | 10-35 t | Multi-hop | 2,400 |
| NEJM CPC [Zack etal., 2023] | CD | Clinical Cases | - | Plain Text | 2,525 |
| DiReCT (Ours) | CD | Clinical Notes | 1074.6 t | Entailment Tree | 511 |
3 A Benchmark for Clinical Note Diagnosis
This section first details clinical notes (Section 3.1). We also describe the knowledge graph that encodes existing guidelines (Section 3.2). Our task definition, which takes a clinical note and the knowledge graph as input, is given in Section 3.4. We then present our annotation process for clinical notes (Section 3.3) and the evaluation metrics (Section 3.5).
3.1 Clinical Notes
Clinical notes used in DiReCT are stored in the SOAP format [Weed, 1970]. A clinical note comprises four components: In the subjective section, the physician records the patient’s chief complaint, the history of present illness, and other subjective experiences reported by the patient. The objective section contains structured data obtained through examinations (inspection, auscultation, etc.) and other measurable means. The assessment section involves the physician’s analysis and evaluation of the patient’s condition, which may include a summary of the current status. Finally, the plan section outlines the physician’s proposed treatment and management plan, which may include prescribed medications, recommended therapies, and further investigations. A clinical note also includes a primary discharge diagnosis (PDD) in the assessment section.
DiReCT’s clinical notes are sourced from the MIMIC-IV dataset [Johnson etal., 2023] (PhysioNet Credentialed Health Data License 1.5.0), which encompasses over 40,000 patients admitted to the intensive care units. Each note contains clinical data for a patient. To construct DiReCT, we curated a subset of 511 notes whose PDDs fell within one of 25 disease categories in 5 medical domains.
In our task, a note $X$ is an excerpt of six clinical data fields $x_1, \dots, x_6$ from the subjective and objective sections: chief complaint, history of present illness, past medical history, family history, physical exam, and pertinent results (we excluded data, such as review of systems and social history, because they are often missing in the original clinical notes and are less relevant to the diagnosis). We also identified the PDD $d^\ast$ associated with $X$ (all clinical notes in DiReCT are related to only one PDD, and there is no secondary discharge diagnosis). The pairs $(X, d^\ast)$ for all patients collectively form DiReCT. We manually removed from $X$ any descriptions that disclose the PDD.
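For illustration, the structure of one sample can be sketched as a small data class (a minimal sketch; the field and class names below are ours, not those of the released files):

```python
from dataclasses import dataclass

@dataclass
class DiReCTSample:
    """One DiReCT sample: six clinical data fields plus the annotated PDD.

    The names are illustrative; the released JSON keeps the original section
    labels (Input1-Input6 in the annotation tool described in Appendix A.3).
    """
    chief_complaint: str
    history_of_present_illness: str
    past_medical_history: str
    family_history: str
    physical_exam: str
    pertinent_results: str
    pdd: str  # primary discharge diagnosis (a leaf node of the knowledge graph)
```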
3.2 Diagnostic Knowledge Graph
Existing knowledge graphs for the medical domain, e.g., UMLS KG [Bodenreider, 2004], lack the ability to provide specific clinical decision support (e.g., diagnostic threshold, context-specific data, dosage information, etc.), which are critical for accurate diagnosis.
Our knowledge graph $\mathcal{G}$ is a collection of graphs $\mathcal{G}_c$, one for each disease category $c$. $\mathcal{G}_c$ is based on the diagnostic criteria in existing guidelines (refer to the supplementary material for details). $\mathcal{G}_c$'s nodes are either premises $p \in \mathcal{P}_c$ (a medical statement, e.g., that headache is a symptom of a given diagnosis) or diagnoses $d \in \mathcal{D}_c$ (e.g., Suspected Stroke). $\mathcal{G}_c$ consists of two different types of edges. One is premise-to-diagnosis edges, where $p \in \mathcal{P}_c$, $d \in \mathcal{D}_c$, and the edge goes from $p$ to $d$; such an edge represents a premise necessary to make diagnosis $d$. We refer to them as supporting edges $\mathcal{E}^{\mathrm{s}}_c$. The other is diagnosis-to-diagnosis edges, where $d, d' \in \mathcal{D}_c$ and the edge goes from $d$ to $d'$, representing the diagnostic flow. These edges are referred to as procedural edges $\mathcal{E}^{\mathrm{p}}_c$.
A disease category is defined according to an existing guideline, which starts from a certain diagnosis; therefore, the procedural graph $\mathcal{G}^{\mathrm{p}}_c = (\mathcal{D}_c, \mathcal{E}^{\mathrm{p}}_c)$ has only one root node and arbitrarily branches toward multiple leaf nodes that represent PDDs (i.e., the clinical notes in DiReCT are chosen to cover all leaf nodes of $\mathcal{G}^{\mathrm{p}}_c$). Thus, $\mathcal{G}^{\mathrm{p}}_c$ is a tree. We denote the set of the leaf nodes (or PDDs) as $\mathcal{L}_c$. The knowledge graph for category $c$ is denoted by $\mathcal{G}_c = (\mathcal{P}_c \cup \mathcal{D}_c, \mathcal{E}^{\mathrm{s}}_c \cup \mathcal{E}^{\mathrm{p}}_c)$.
Figure 2 shows a part of $\mathcal{G}_c$, where $c$ is Acute Coronary Syndromes (ACS). Premises in $\mathcal{P}_c$ and diagnoses in $\mathcal{D}_c$ are given in the blue and gray boxes, respectively, while PDDs in $\mathcal{L}_c$ are the diagnoses without outgoing edges (i.e., STEMI-ACS, NSTEMI-ACS, and UA). The black and red arrows are edges in $\mathcal{E}^{\mathrm{s}}_c$ and $\mathcal{E}^{\mathrm{p}}_c$, respectively; that is, the black arrows indicate the supporting edges.
$\mathcal{G}$ serves two essential functions: (1) It serves as the gold standard for annotation, guiding doctors toward a precise and uniform interpretation of clinical notes. (2) Our task also allows a model to use it so that the output of an LLM can be closely aligned with the reasoning processes of medical professionals.
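As a rough sketch of how $\mathcal{G}_c$ can be represented in code (the class and method names are ours and only illustrate the structure above, not the released file format):

```python
from dataclasses import dataclass, field

@dataclass
class CategoryGraph:
    """Illustrative container for one diagnostic knowledge graph G_c."""
    root: str                                            # root diagnosis, e.g., "Suspected ACS"
    procedural_edges: set = field(default_factory=set)   # (diagnosis, diagnosis) pairs
    supporting_edges: set = field(default_factory=set)   # (premise, diagnosis) pairs

    def children(self, diagnosis: str) -> list:
        """Diagnoses reachable from `diagnosis` via one procedural edge."""
        return [d2 for d1, d2 in self.procedural_edges if d1 == diagnosis]

    def premises(self, diagnosis: str) -> list:
        """Premises that support `diagnosis` via supporting edges."""
        return [p for p, d in self.supporting_edges if d == diagnosis]

    def leaves(self) -> list:
        """PDDs: diagnoses with no outgoing procedural edge."""
        sources = {d1 for d1, _ in self.procedural_edges}
        nodes = {self.root} | {d2 for _, d2 in self.procedural_edges}
        return [d for d in nodes if d not in sources]
```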
3.3 Data Annotation
Let $d^\ast \in \mathcal{L}_c$ denote the PDD of disease category $c$ associated with note $X$. We can find a subgraph $\mathcal{G}^\ast$ of $\mathcal{G}_c$ that contains all ancestors of $d^\ast$, including premises in $\mathcal{P}_c$. We also denote the set of supporting edges in $\mathcal{G}^\ast$ as $\mathcal{E}^\ast$. Our annotation process is, for each supporting edge $(p, d) \in \mathcal{E}^\ast$, to extract the observation $o$ in $X$ (highlighted text in the clinical note in Figure 3) and to provide a rationalization $r$ of this deduction, i.e., why $o$ supports $d$ or corresponds to $p$ (all annotations strictly follow the procedural flow in $\mathcal{G}^{\mathrm{p}}_c$, and each observation is related to only one diagnostic node). If $X$ does not provide sufficient observations for the PDD (which may happen when a certain test is omitted), the annotators were asked to add plausible observations to $X$. This choice compromises the fidelity of our dataset to the original clinical notes, but we chose it for the completeness of the dataset. The resulting triplets $(o, r, d)$ form the explanation $E$ for $X$. This annotation process was carried out by 9 clinical physicians and subsequently verified for accuracy and completeness by three senior medical experts.
Table 2: Statistics of DiReCT per medical domain.

| Medical domain | # cat. | # samples | # diagnoses | # PDDs | Avg. # obs. | Length |
|---|---|---|---|---|---|---|
| Cardiology | 7 | 184 | 27 | 16 | 8.7 | 1156.6 t |
| Gastroenterology | 4 | 103 | 11 | 7 | 4.3 | 1026.0 t |
| Neurology | 5 | 77 | 17 | 11 | 11.9 | 1186.3 t |
| Pulmonology | 5 | 92 | 26 | 17 | 10.7 | 940.7 t |
| Endocrinology | 4 | 55 | 20 | 14 | 6.9 | 1063.5 t |
| Overall | 25 | 511 | 101 | 65 | 8.5 | 1074.6 t |
Table 2 summarizes statistics of our dataset. The second and third columns ("# cat." and "# samples") show the numbers of disease categories and samples in the respective medical domains. "# diagnoses" and "# PDDs" are the total numbers of diagnoses (diseases) and PDDs, summed over all diagnostic categories in the medical domain, respectively. "Avg. # obs." is the average number of annotated observations per note. "Length" is the average number of tokens in $X$.
3.4 Task Definition
We propose two tasks with different levels of supplied external knowledge. The first task is, given $X$ and the procedural graphs $\mathcal{G}^{\mathrm{p}} = \{\mathcal{G}^{\mathrm{p}}_c\}$, to predict the associated PDD and generate an explanation that traces the model's diagnostic procedure from the observations in $X$ to the predicted PDD, i.e., letting $f$ denote a model:

$(\hat{d}, \hat{E}) = f(X, \mathcal{G}^{\mathrm{p}})$,   (1)
where $\hat{d}$ and $\hat{E}$ are predictions for the PDD and explanation, respectively. With this task, the knowledge of the specific diagnostic procedures in existing guidelines can be used for prediction, facilitating interpretability. The second task takes the full knowledge graph $\mathcal{G}$ as input instead of $\mathcal{G}^{\mathrm{p}}$, i.e.,:

$(\hat{d}, \hat{E}) = f(X, \mathcal{G})$.   (2)
This task allows for the use of broader knowledge of premises for prediction. One may also try a task without any external knowledge.
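Both tasks therefore share a single input/output signature, which can be summarized as follows (the type aliases are ours; an explanation is treated as a set of deduction triplets as defined in Section 3.3):

```python
from typing import Callable, Dict, Set, Tuple

# A deduction links a verbatim observation, its rationalization, and the
# diagnosis node it supports.
Deduction = Tuple[str, str, str]

# A model f maps a clinical note X and the supplied external knowledge
# (the procedural graphs for task (1), the full knowledge graph for task (2),
# or nothing at all) to a predicted PDD and an explanation.
DiagnosisModel = Callable[[str, Dict], Tuple[str, Set[Deduction]]]
```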
3.5 Evaluation Metrics
We designed three metrics to quantify the predictive performance over our benchmark.
(1) Accuracy of diagnosis Acc evaluates whether a model correctly identifies the diagnosis: $\mathrm{Acc} = 1$ if $\hat{d} = d^\ast$, and $\mathrm{Acc} = 0$ otherwise. The average over all notes is reported.
(2) Completeness of observations Comp evaluates whether a model extracts all and only the necessary observations for the prediction. Let $O$ and $\hat{O}$ denote the sets of observations in the ground-truth explanation $E$ and the predicted explanation $\hat{E}$, respectively. The metric is defined as $\mathrm{Comp} = |O \cap \hat{O}| / |O \cup \hat{O}|$, where the numerator is the number of observations that are common to both $O$ and $\hat{O}$ (we find the common observations with an LLM; refer to the supplementary material for more detail). This metric simultaneously evaluates the correctness of each observation and the coverage. To supplement it, we also report the precision and recall, given by $\mathrm{Pre} = |O \cap \hat{O}| / |\hat{O}|$ and $\mathrm{Rec} = |O \cap \hat{O}| / |O|$.
(3) Faithfulness of explanations Faith evaluates whether the diagnostic flow toward the PDD is fully supported by observations with faithful rationalizations. This involves establishing a one-to-one correspondence between deductions in the prediction and in the ground truth; we reuse the correspondences established for computing Comp. Let $o$ and $\hat{o}$ denote corresponding observations. This correspondence is considered successful if the rationalizations $r$ and $\hat{r}$ as well as the diagnoses $d$ and $\hat{d}$ associated with $o$ and $\hat{o}$ match. Let $m$ denote the number of successful matches. We use the ratios of $m$ to $|O \cap \hat{O}|$ and to $|O|$ as evaluation metrics $\mathrm{Faith}_\cap$ and $\mathrm{Faith}_{\mathrm{all}}$, respectively, to see whether failures come from observations or from explanations and diagnoses.
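A per-sample computation of these metrics from matched deductions can be sketched as follows; the normalizations follow our reading of the definitions above, and the predicates `same_obs` and `same_rat` stand in for the LLM-based comparisons described in the supplementary material:

```python
def evaluate_sample(gt, pred, same_obs, same_rat):
    """Per-sample metrics from ground-truth and predicted deductions.

    Each deduction is an (observation, rationalization, diagnosis) triplet;
    `same_obs` and `same_rat` are (LLM-backed) boolean predicates. The
    normalizations are our reconstruction of Comp, Pre, Rec, and Faith.
    """
    # Greedy one-to-one correspondence between predicted and ground-truth observations.
    matched, used = [], set()
    for o_hat, r_hat, d_hat in pred:
        for i, (o, r, d) in enumerate(gt):
            if i not in used and same_obs(o, o_hat):
                matched.append(((o, r, d), (o_hat, r_hat, d_hat)))
                used.add(i)
                break
    n_common = len(matched)
    union = len(gt) + len(pred) - n_common        # size of the (fuzzy) union
    pre = n_common / len(pred) if pred else 0.0
    rec = n_common / len(gt) if gt else 0.0
    comp = n_common / union if union else 0.0
    # A correspondence is faithful if rationalization and diagnosis also match.
    m = sum(1 for (o, r, d), (o_hat, r_hat, d_hat) in matched
            if same_rat(r, r_hat) and d == d_hat)
    faith_common = m / n_common if n_common else 0.0
    faith_all = m / len(gt) if gt else 0.0
    return {"Comp": comp, "Pre": pre, "Rec": rec,
            "Faith_common": faith_common, "Faith_all": faith_all}
```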
4 Baseline
Figure 4 shows an overview of our baseline with three LLM-based modules: narrowing-down, perception, and reasoning (refer to the supplementary material for more details). The narrowing-down module takes $X$ as input and predicts the disease category $\hat{c}$.
Let $d_t$ be the diagnosis reached after $t$ iterations over $\mathcal{G}^{\mathrm{p}}_{\hat{c}}$, where $t$ corresponds to the depth of node $d_t$ and so is less than or equal to the depth of $\mathcal{G}^{\mathrm{p}}_{\hat{c}}$. $d_0$ is the root node of $\mathcal{G}^{\mathrm{p}}_{\hat{c}}$. For $d_t$, we apply the perception module $f_{\mathrm{per}}$ to extract all observations $\hat{O}_t$ in $X$ and explanations $\hat{E}_t$ to support the deduction from $d_t$ as

$(\hat{O}_t, \hat{E}_t) = f_{\mathrm{per}}(X, d_t, \mathcal{G}_{\hat{c}})$.   (3)
$\mathcal{G}_{\hat{c}}$ is supplied to facilitate the model's extraction of all observations for the following reasoning process (we used only pairs of an observation and a premise; we abuse $\hat{E}_t$ to mean this for notational simplicity).
Diagnosis $d_t$ identifies the set of its children $\mathcal{C}(d_t)$ and so the set of premises that support each child. Therefore, our reasoning module $f_{\mathrm{rea}}$ iteratively and greedily identifies the next step's diagnosis (i.e., $d_{t+1}$) from $\hat{O}_t$, making a rationalization for each deduction. That is, $f_{\mathrm{rea}}$ verifies whether there exist observations in $\hat{O}_t$ that support one child $d \in \mathcal{C}(d_t)$. If $d$ is fully supported, it is identified as $d_{t+1}$ for the $(t+1)$-th iteration, i.e.,

$d_{t+1} = f_{\mathrm{rea}}(\hat{O}_t, \mathcal{C}(d_t), \mathcal{G}_{\hat{c}})$.   (4)
Otherwise, the reasoning module fails. This process is repeated until a leaf node in $\mathcal{L}_{\hat{c}}$ is found or the module fails. In our annotation, each observation contributes to deducing only one diagnosis. Therefore, if an observation in $\hat{O}_t$ is already included in the preceding sets of explanations $\hat{E}_0, \dots, \hat{E}_{t-1}$, the corresponding explanation in the preceding sets is removed.
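Putting the three modules together, the baseline's greedy descent over the procedural graph can be sketched as follows (`narrow_down`, `perceive`, and `reason` stand for the prompted LLM calls in Tables 7-9; this is a schematic of the control flow under our reading of the method, not the released implementation):

```python
def run_baseline(note, knowledge_graph, narrow_down, perceive, reason):
    """Greedy diagnostic descent: narrowing-down, then perception and reasoning per step."""
    category = narrow_down(note)                  # predicted disease category c_hat
    graph = knowledge_graph[category]             # G_{c_hat}, e.g., a CategoryGraph
    diagnosis = graph.root                        # d_0: root of the procedural graph
    explanation = []                              # accumulated (observation, reason, diagnosis)
    used_observations = set()
    while graph.children(diagnosis):              # stop once a leaf node (PDD) is reached
        # Perception: extract observations and reasons relevant to this step.
        observations = perceive(note, diagnosis, graph.premises(diagnosis))
        # Each observation may support only one diagnostic node (simplified handling).
        observations = [(o, r) for o, r in observations if o not in used_observations]
        # Reasoning: pick the fully supported child diagnosis, if any.
        result = reason(observations, graph.children(diagnosis))
        if result is None:                        # reasoning module fails
            break
        diagnosis, support = result               # d_{t+1} and its supporting pairs
        explanation += [(o, r, diagnosis) for o, r in support]
        used_observations |= {o for o, _ in support}
    return diagnosis, explanation
```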
5 Experiments
5.1 Experimental Setup
We assess the reasoning capabilities of 7 recent LLMs from diverse families and model sizes, including 5 openly accessible instruction-tuned models: LLama3 8B and 70B [AI@Meta, 2024], Zephyr 7B [Tunstall etal., 2023], Mistral 7B [Jiang etal., 2023], and Mixtral 8×7B [Jiang etal., 2023]. We have also obtained access to private instances of GPT-3.5 turbo [OpenAI, 2023b] and GPT-4 turbo [OpenAI, 2023a], which are high-performance closed-source models (these two models are housed on a HIPAA-compliant instance within Microsoft Azure AI Studio; no data is transferred to either Microsoft or OpenAI, and this secure environment enables us to safely conduct experiments with the MIMIC-IV dataset in compliance with the Data Use Agreement). Each LLM is used to implement our baseline's narrowing-down, perception, and reasoning modules. The temperature is set to 0. For computing the evaluation metrics, we use LLama3 8B with few-shot prompts to make correspondences between predicted and ground-truth observations as well as to verify matches between predicted and ground-truth explanations (refer to the supplementary material for more details).
5.2 Results
Comparison among LLMs. Table 3 shows the performance of our baseline built on top of various LLMs. We first evaluate the task variant that takes the procedural graphs $\mathcal{G}^{\mathrm{p}}$, i.e., only the procedural flow, as external knowledge instead of $\mathcal{G}$. Comparison between the $\mathcal{G}^{\mathrm{p}}$ and $\mathcal{G}$ settings demonstrates the importance of supplying premises to the model and LLMs' capability to make use of extensive external knowledge that may be superficially different from statements in $X$. Subsequently, some models are evaluated with our task using $\mathcal{G}$. In addition to the metrics in Section 3.5, we also adopt the accuracy of the disease category, $\mathrm{Acc}_c$, which is 1 when the predicted category matches the ground-truth category, as our baseline's performance depends on it.
Table 3: Performance of our baseline with different LLMs under the $\mathcal{G}^{\mathrm{p}}$ (procedural flow only) and $\mathcal{G}$ (full knowledge graph) settings: Diagnosis ($\mathrm{Acc}_c$, Acc), Observation (Comp, Rec, Pre), and Explanation ($\mathrm{Faith}_\cap$, $\mathrm{Faith}_{\mathrm{all}}$).

| Task | Model | $\mathrm{Acc}_c$ | Acc | Comp | Rec | Pre | $\mathrm{Faith}_\cap$ | $\mathrm{Faith}_{\mathrm{all}}$ |
|---|---|---|---|---|---|---|---|---|
| With $\mathcal{G}^{\mathrm{p}}$ | Zephyr 7B | 0.274 | 0.151 | 0.123 | 0.115 | 0.092 | 0.071 | 0.014 |
| | Mistral 7B | 0.507 | 0.306 | 0.211 | 0.317 | 0.173 | 0.230 | 0.062 |
| | Mixtral 8×7B | 0.413 | 0.237 | 0.147 | 0.266 | 0.124 | 0.144 | 0.029 |
| | LLama3 8B | 0.576 | 0.321 | 0.253 | 0.437 | 0.219 | 0.232 | 0.071 |
| | LLama3 70B | 0.752 | 0.540 | 0.277 | 0.537 | 0.256 | 0.395 | 0.112 |
| | GPT-3.5 turbo | 0.679 | 0.455 | 0.389 | 0.351 | 0.275 | 0.331 | 0.103 |
| | GPT-4 turbo | 0.772 | 0.572 | 0.446 | 0.491 | 0.371 | 0.475 | 0.199 |
| With $\mathcal{G}$ | LLama3 8B | 0.576 | 0.344 | 0.235 | 0.394 | 0.199 | 0.327 | 0.087 |
| | LLama3 70B | 0.735 | 0.581 | 0.262 | 0.501 | 0.236 | 0.463 | 0.125 |
| | GPT-3.5 turbo | 0.652 | 0.413 | 0.347 | 0.279 | 0.232 | 0.374 | 0.121 |
| | GPT-4 turbo | 0.781 | 0.614 | 0.431 | 0.458 | 0.353 | 0.633 | 0.247 |
Table 4: Performance when only the PDD candidate list $\mathcal{L}$ or no external knowledge is supplied.

| Task | Model | Acc | Comp | Rec | Pre | $\mathrm{Faith}_\cap$ | $\mathrm{Faith}_{\mathrm{all}}$ |
|---|---|---|---|---|---|---|---|
| With $\mathcal{L}$ | LLama3 8B | 0.070 | 0.154 | 0.330 | 0.135 | 0.020 | 0.004 |
| | LLama3 70B | 0.502 | 0.257 | 0.509 | 0.237 | 0.138 | 0.034 |
| | GPT-3.5 turbo | 0.223 | 0.164 | 0.149 | 0.116 | 0.091 | 0.025 |
| | GPT-4 turbo | 0.636 | 0.461 | 0.482 | 0.378 | 0.186 | 0.074 |
| No Knowledge | LLama3 8B | 0.023 | 0.137 | 0.258 | 0.119 | 0.018 | 0.006 |
| | LLama3 70B | 0.037 | 0.246 | 0.504 | 0.227 | 0.022 | 0.007 |
| | GPT-3.5 turbo | 0.059 | 0.161 | 0.148 | 0.113 | 0.036 | 0.011 |
| | GPT-4 turbo | 0.074 | 0.410 | 0.443 | 0.324 | 0.047 | 0.019 |
With $\mathcal{G}^{\mathrm{p}}$, we can see that GPT-4 achieves the best performance on most metrics, especially those related to observations and explanations, surpassing LLama3 70B by a large margin. In terms of accuracy (at both the category and diagnosis levels), LLama3 70B is comparable to GPT-4. LLama3 70B also has a higher Rec but low Comp and Pre, which means that this model tends to extract many observations. Models with high diagnostic accuracy thus do not necessarily excel at finding essential information in long text (i.e., observations) and generating reasons (i.e., explanations).
When $\mathcal{G}$ is given, all models show better diagnostic accuracy (except GPT-3.5) and better explanations, while observations are slightly degraded. GPT-4 with $\mathcal{G}$ improves its Acc, $\mathrm{Faith}_\cap$, and $\mathrm{Faith}_{\mathrm{all}}$ scores. This suggests that premises and supporting edges are beneficial for diagnosis and explanation. The lower observational performance may indicate that the models lack the ability to associate premises with text in $X$, which are often superficially different though semantically consistent.
LLMs may undergo inherent challenges for evaluation when no external knowledge is supplied: they may have the knowledge to diagnose but cannot make observations and explanations consistent with what our task expects through $\mathcal{G}$. To explore this, we evaluate two settings: (1) giving only the list of PDD candidates $\mathcal{L}$ and (2) supplying no knowledge to the model (shown in Table 4). The prompts used for this setup are detailed in the supplementary material. We do not evaluate the accuracy of disease category prediction as it is basically the same as in Table 3. We can clearly see that with $\mathcal{L}$, GPT-4's diagnostic and observational scores are comparable to those of the task with $\mathcal{G}$, though explanatory performance is much worse. Without any external knowledge, the diagnostic accuracy is also inferior (we understand this comparison is unfair, as the prompts differ; we intend to give a rough idea about the challenge without external knowledge). The deteriorated performance can be attributed to inconsistent wording of diagnosis names, which makes evaluation tough. High observational scores imply that observations in $X$ can be identified without relying on external knowledge; there can be some cues to spot them.
Performance in individual domains. Figure 5 summarizes the performance of LLama3 70B, GPT-3.5, and GPT-4 across different medical domains, evaluated using Acc and the two faithfulness metrics. Neurology gives the best diagnostic accuracy, where GPT-4 achieved an accuracy of 0.806; LLama3 also performed well (0.786). In terms of the two faithfulness scores, GPT-4's results were 0.458 and 0.340, respectively, with the smallest difference between the two scores among all domains. This smaller gap indicates that in Neurology, the common observations in prediction and ground truth lead to the correct diagnoses with faithful rationalizations. However, GPT-4 yields a higher diagnostic accuracy score but a lower explanatory score, suggesting that the observations captured by the model or their rationalizations differ from those of human doctors.
For Cardiology and Endocrinology, the diagnostic accuracy of the models is relatively low (GPT-4 achieved 0.458 and 0.468, respectively). Nevertheless, the faithfulness scores are relatively high; Endocrinology shows lower diagnostic accuracy but higher explanatory performance. A smaller gap may imply that in these two domains, successful predictions are associated with observations similar to those of human doctors, and the reasoning process may be analogous. Conversely, in Gastroenterology, a higher Acc is accompanied by lower faithfulness scores (especially for LLama3), potentially indicating a significant divergence of the reasoning process from that of human doctors. Overall, DiReCT demonstrates that the degree of alignment between the model's diagnostic reasoning and that of human doctors varies across medical domains.
Table 5: Agreement between automatic evaluation and human judgments.

| Model | Observation | Rationalization |
|---|---|---|
| LLama3 8B | 0.887 | 0.801 |
| GPT-4 turbo | 0.902 | 0.836 |
Reliability of automatic evaluation. We randomly pick 100 samples from DiReCT and their predictions by GPT-4 on the task with $\mathcal{G}$ to assess the consistency of our automated metrics for the observational and explanatory performance (Section 3.5) with human judgments. Three physicians joined this experiment. For each predicted observation $\hat{o}$, they are asked to find a similar observation in the ground truth $O$. For the explanatory metrics, they verify whether each predicted rationalization $\hat{r}$ for $\hat{o}$ aligns with the ground-truth $r$ corresponding to $o$. A prediction and a ground truth are deemed aligned, for both assessments, if at least two specialists agree. We compare LLama3's and GPT-4's judgments to explore whether there is a gap between these LLMs. As summarized in Table 5, GPT-4 achieves the best results, with LLama3 8B displaying similar performance. From these results, we argue that our automated evaluation metrics are consistent with human judgments, and LLama3 is sufficient for this evaluation, allowing for a cost-efficient option.
A prediction example. Figure 6 shows a sample generated by GPT-4. The ground-truth PDD of the input clinical note is Hemorrhagic Stroke. In this figure, purple, orange, and red indicate explanations only in the ground truth, only in the prediction, and common to both, respectively; therefore, red is a successful prediction of an explanation, while purple and orange are a false negative and a false positive. GPT-4 treats the observation of amaurosis fugax as the criterion for diagnosing Ischemic Stroke. However, this observation only supports Suspected Stroke. Conversely, the observation thalamic hematoma, which is the key indicator of Hemorrhagic Stroke, is regarded as a less important clue. Such observation-diagnosis correspondence errors lead to the model's misdiagnosis. More samples are available in the supplementary material.
6 Conclusion and Limitations
We proposed DiReCT as the first benchmark for evaluating the diagnostic reasoning ability of LLMs with interpretability by supplying external knowledge as a graph. Our evaluations reveal a notable disparity between current leading-edge LLMs and human experts, underscoring the urgent need for AI models that can perform reliable and interpretable reasoning in clinical environments. DiReCT can be easily extended to more challenging settings by removing the knowledge graph from the input, facilitating evaluations of future LLMs.
Limitations. DiReCT encompasses only a subset of disease categories and considers only one PDD, omitting the inter-diagnostic relationships due to their complexity—a significant challenge even for human doctors. Additionally, our baseline may not use optimal prompts, chain-of-thought reasoning, or address issues related to hallucinations in task responses. Our dataset is solely intended for model evaluation but not for use in clinical environments. The use of the diagnostic knowledge graph is also limited to serving merely as a part of input. Future work will focus on constructing a more comprehensive disease dataset and developing an extensive diagnostic knowledge graph.
Acknowledgments and Disclosure of Funding
This work was supported by World Premier International Research Center Initiative (WPI), MEXT, Japan. This work is also supported by JSPS KAKENHI 24K20795 and Dalian Haichuang Project for Advanced Talents.
References
- Zhao etal. [2023]WayneXin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen.A survey of large language models.arXiv preprint arXiv:2303.18223, 2023.
- Min etal. [2023]Bonan Min, Hayley Ross, Elior Sulem, Amir PouranBen Veyseh, ThienHuu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, and Dan Roth.Recent advances in natural language processing via large pre-trained language models: A survey.ACM Computing Surveys, 56(2):1–40, 2023.
- Anil etal. [2023]Rohan Anil, AndrewM Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, etal.Palm 2 technical report.arXiv preprint arXiv:2305.10403, 2023.
- Han etal. [2023]Tianyu Han, LisaC Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander Löser, Daniel Truhn, and KenoK Bressem.Medalpaca–an open-source collection of medical conversational ai models and training data.arXiv preprint arXiv:2304.08247, 2023.
- Jin etal. [2021]DiJin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits.What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021.
- OpenAI [2023a]OpenAI.GPT-4 Technical Report.CoRR, abs/2303.08774, 2023a.doi: 10.48550/arXiv.2303.08774.URL https://doi.org/10.48550/arXiv.2303.08774.
- Bubeck etal. [2023]Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, YinTat Lee, Yuanzhi Li, Scott Lundberg, etal.Sparks of artificial general intelligence: Early experiments with gpt-4.arXiv preprint arXiv:2303.12712, 2023.
- Nori etal. [2023]Harsha Nori, Nicholas King, ScottMayer McKinney, Dean Carignan, and Eric Horvitz.Capabilities of gpt-4 on medical challenge problems.arXiv preprint arXiv:2303.13375, 2023.
- Liévin etal. [2024]Valentin Liévin, ChristofferEgeberg Hother, AndreasGeert Motzfeldt, and Ole Winther.Can large language models reason about medical questions?Patterns, 5(3), 2024.
- Pal etal. [2022]Ankit Pal, LogeshKumar Umapathi, and Malaikannan Sankarasubbu.MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering.In Conference on health, inference, and learning, pages 248–260. PMLR, 2022.
- Li etal. [2023]Dongfang Li, Jindi Yu, Baotian Hu, Zhenran Xu, and Min Zhang.ExplainCPE: A free-text explanation benchmark of chinese pharmacist examination.arXiv preprint arXiv:2305.12945, 2023.
- Chen etal. [2024]Hanjie Chen, Zhouxiang Fang, Yash Singla, and Mark Dredze.Benchmarking large language models on answering and explaining challenging medical questions.arXiv preprint arXiv:2402.18060, 2024.
- Jullien etal. [2023]Mael Jullien, Marco Valentino, Hannah Frost, Paul O’Regan, Donal Landers, and André Freitas.Semeval-2023 task 7: Multi-evidence natural language inference for clinical trial data.arXiv preprint arXiv:2305.02993, 2023.
- Gao etal. [2023a]Yanjun Gao, Ruizhe Li, John Caskey, Dmitriy Dligach, Timothy Miller, MatthewM Churpek, and Majid Afshar.Leveraging a medical knowledge graph into large language models for diagnosis prediction.arXiv preprint arXiv:2308.14321, 2023a.
- Johnson etal. [2023]Alistair E.W. Johnson, Lucas Bulgarelli, LuShen, Alvin Gayles, Ayad Shammout, Steven Horng, TomJ. Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, Li-weiH. Lehman, LeoA. Celi, and RogerG. Mark.MIMIC-IV, a freely accessible electronic health record dataset.Scientific data, 10(1):1, 2023.
- Xi etal. [2023]Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, QiZhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, and Tao Gui.The rise and potential of large language model based agents: A survey, 2023.
- Tang etal. [2023]Xiangru Tang, Anni Zou, Zhuosheng Zhang, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein.Medagents: Large language models as collaborators for zero-shot medical reasoning.arXiv preprint arXiv:2311.10537, 2023.
- Middleton etal. [2013]Blackford Middleton, Meryl Bloomrosen, MarkA Dente, Bill Hashmat, Ross Koppel, JMarc Overhage, ThomasH Payne, STrent Rosenbloom, Charlotte Weaver, and Jiajie Zhang.Enhancing patient safety and quality of care by improving the usability of electronic health record systems: recommendations from amia.Journal of the American Medical Informatics Association, 20(e1):e2–e8, 2013.
- Liu etal. [2022]Jinghui Liu, Daniel Capurro, Anthony Nguyen, and Karin Verspoor.“note bloat” impacts deep learning-based nlp models for clinical prediction tasks.Journal of biomedical informatics, 133:104149, 2022.
- Danilevsky etal. [2020]Marina Danilevsky, Kun Qian, Ranit Aharonov, Yannis Katsis, Ban Kawas, and Prithviraj Sen.A survey of the state of explainable AI for natural language processing.In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 447–459, 2020.
- Gurrapu etal. [2023]Sai Gurrapu, Ajay Kulkarni, Lifu Huang, Ismini Lourentzou, and FerasA Batarseh.Rationalization for explainable nlp: A survey.Frontiers in Artificial Intelligence, 6, 2023.
- Camburu etal. [2018]Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom.e-snli: Natural language inference with natural language explanations.Advances in Neural Information Processing Systems, 31, 2018.
- Rajani etal. [2019]NazneenFatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher.Explain yourself! leveraging language models for commonsense reasoning.In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932–4942, Florence, Italy, 2019.
- DeYoung etal. [2020]Jay DeYoung, Sarthak Jain, NazneenFatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and ByronC. Wallace.ERASER: A benchmark to evaluate rationalized NLP models.In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458, 2020.
- Jhamtani and Clark [2020]Harsh Jhamtani and Peter Clark.Learning to explain: Datasets and models for identifying valid reasoning chains in multihop question-answering.In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, page 137–150, 2020.
- Tafjord etal. [2021]Oyvind Tafjord, BhavanaDalvi Mishra, and Peter Clark.Proofwriter: Generating implications, proofs, and abductive statements over natural language.In Findings of the Association for Computational Linguistics: ACL-IJCNLP, page 3621–3634, 2021.
- Zhao etal. [2021]Chen Zhao, Chenyan Xiong, Jordan Boyd-Graber, and Hal DauméIII.Multi-step reasoning over unstructured text with beam dense retrieval.In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4635–4641, 2021.
- Dalvi etal. [2021]Bhavana Dalvi, Peter Jansen, Oyvind Tafjord, Zhengnan Xie, Hannah Smith, Leighanna Pipatanangkura, and Peter Clark.Explaining answers with entailment trees.In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7358–7370, 2021.
- Zhang etal. [2024]Yifan Zhang, Jingqin Yang, Yang Yuan, and Andrew Chi-Chih Yao.Cumulative reasoning with large language models.In ICLR 2024 Workshop on Bridging the Gap Between Practice and Theory in Deep Learning, 2024.URL https://openreview.net/forum?id=XAAYyRxTlQ.
- Gao etal. [2022]Yanjun Gao, Dmitriy Dligach, Timothy Miller, Samuel Tesch, Ryan Laffin, MatthewM. Churpek, and Majid Afshar.Hierarchical annotation for building a suite of clinical natural language processing tasks: Progress note understanding.In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5484–5493, Marseille, France, 2022. European Language Resources Association.
- Zack etal. [2023]Travis Zack, Gurpreet Dhaliwal, Rabih Geha, Mary Margaretten, Sara Murray, and JulianC Hong.A clinical reasoning-encoded case library developed through natural language processing.Journal of General Internal Medicine, 38(1):5–11, 2023.
- Gao etal. [2023b]Yanjun Gao, Dmitriy Dligach, Timothy Miller, John Caskey, Brihat Sharma, MatthewM Churpek, and Majid Afshar.Dr. bench: Diagnostic reasoning benchmark for clinical natural language processing.Journal of Biomedical Informatics, 138:104286, 2023b.
- Weed [1970]L.L. Weed.Medical Records, Medical Education, and Patient Care: The Problem-oriented Record as a Basic Tool.Press of Case Western Reserve University, 1970.ISBN 9780815191889.
- Bodenreider [2004]Olivier Bodenreider.The unified medical language system (umls): integrating biomedical terminology.Nucleic acids research, 32(suppl_1):D267–D270, 2004.
- AI@Meta [2024]AI@Meta.Llama 3 model card.2024.URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
- Tunstall etal. [2023]Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, etal.Zephyr: Direct distillation of lm alignment.arXiv preprint arXiv:2310.16944, 2023.
- Jiang etal. [2023]AlbertQ Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, DevendraSingh Chaplot, Diego delas Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, etal.Mistral 7b.arXiv preprint arXiv:2310.06825, 2023.
- OpenAI [2023b]OpenAI.Introducing ChatGPT and Whisper APIs.2023b.URL https://openai.com/blog/introducing-chatgpt-and-whisper-apis.
- Byrne etal. [2024]RobertA Byrne, Xavier Rossello, JJCoughlan, Emanuele Barbato, Colin Berry, Alaide Chieffo, MarcJ Claeys, Gheorghe-Andrei Dan, MarcR Dweck, Mary Galbraith, etal.2023 esc guidelines for the management of acute coronary syndromes: developed by the task force on the management of acute coronary syndromes of the european society of cardiology (esc).European Heart Journal: Acute Cardiovascular Care, 13(1):55–161, 2024.
- Members etal. [2022]WritingCommittee Members, EricM Isselbacher, Ourania Preventza, James Hamilton BlackIII, JohnG Augoustides, AdamW Beck, MichaelA Bolen, AlanC Braverman, BruceE Bray, MayaM Brown-Zimmerman, etal.2022 acc/aha guideline for the diagnosis and management of aortic disease: a report of the american heart association/american college of cardiology joint committee on clinical practice guidelines.Journal of the American College of Cardiology, 80(24):e223–e393, 2022.
- Joglar etal. [2024]JoséA Joglar, MinaK Chung, AnastasiaL Armbruster, EmeliaJ Benjamin, JaniceY Chyou, EdmondM Cronin, Anita Deswal, LeeL Eckhardt, ZacharyD Goldberger, Rakesh Gopinathannair, etal.2023 acc/aha/accp/hrs guideline for the diagnosis and management of atrial fibrillation: a report of the american college of cardiology/american heart association joint committee on clinical practice guidelines.Circulation, 149(1):e1–e156, 2024.
- Ommen etal. [2020]SteveR Ommen, Seema Mital, MichaelA Burke, SharleneM Day, Anita Deswal, Perry Elliott, LaurenL Evanovich, Judy Hung, JoséA Joglar, Paul Kantor, etal.2020 aha/acc guideline for the diagnosis and treatment of patients with hypertrophic cardiomyopathy: executive summary: a report of the american college of cardiology/american heart association joint committee on clinical practice guidelines.Journal of the American College of Cardiology, 76(25):3022–3055, 2020.
- Heidenreich etal. [2022]PaulA Heidenreich, Biykem Bozkurt, David Aguilar, LarryA Allen, JoniJ Byun, MonicaM Colvin, Anita Deswal, MarkH Drazner, ShannonM Dunlay, LindaR Evers, etal.2022 aha/acc/hfsa guideline for the management of heart failure: a report of the american college of cardiology/american heart association joint committee on clinical practice guidelines.Journal of the American College of Cardiology, 79(17):e263–e421, 2022.
- Su etal. [2021]Lilly Su, Rea Mittal, Devyani Ramgobin, Rahul Jain, and Rohit Jain.Current management guidelines on hyperlipidemia: the silent killer.Journal of lipids, 2021(1):9883352, 2021.
- Unger etal. [2020]Thomas Unger, Claudio Borghi, Fadi Charchar, NadiaA Khan, NeilR Poulter, Dorairaj Prabhakaran, Agustin Ramirez, Markus Schlaich, GeorgeS Stergiou, Maciej Tomaszewski, etal.2020 international society of hypertension global hypertension practice guidelines.Hypertension, 75(6):1334–1357, 2020.
- Shah etal. [2021]ShailjaC Shah, MBlanca Piazuelo, ErnstJ Kuipers, and Dan Li.Aga clinical practice update on the diagnosis and management of atrophic gastritis: expert review.Gastroenterology, 161(4):1325–1332, 2021.
- Gyawali etal. [2024]CPrakash Gyawali, Rena Yadlapati, Ronnie Fass, David Katzka, John Pandolfino, Edoardo Savarino, Daniel Sifrim, Stuart Spechler, Frank Zerbib, MarkR Fox, etal.Updates to the modern diagnosis of gerd: Lyon consensus 2.0.Gut, 73(2):361–371, 2024.
- Kavitt etal. [2019]RobertT Kavitt, AnnaM Lipowska, Adjoa Anyane-Yeboa, and IanM Gralnek.Diagnosis and treatment of peptic ulcer disease.The American journal of medicine, 132(4):447–456, 2019.
- Barkun etal. [2019]AlanN Barkun, Majid Almadi, ErnstJ Kuipers, Loren Laine, Joseph Sung, Frances Tse, GrigoriosI Leontiadis, NeenaS Abraham, Xavier Calvet, FrancisKL Chan, etal.Management of nonvariceal upper gastrointestinal bleeding: guideline recommendations from the international consensus group.Annals of internal medicine, 171(11):805–822, 2019.
- McKhann etal. [1984]Guy McKhann, David Drachman, Marshall Folstein, Robert Katzman, Donald Price, and EmanuelM Stadlan.Clinical diagnosis of alzheimer’s disease: Report of the nincds-adrda work group* under the auspices of department of health and human services task force on alzheimer’s disease.Neurology, 34(7):939–939, 1984.
- Igaku-Shoin-Ltd. [2018]Igaku-Shoin-Ltd.Clinical practice guidelines for epilepsy 2018.2018.
- Lipton etal. [2001]RichardB Lipton, Seymour Diamond, Michael Reed, MerleL Diamond, and WalterF Stewart.Migraine diagnosis and treatment: results from the american migraine study ii.Headache: The Journal of Head and Face Pain, 41(7):638–645, 2001.
- Lublin [2005]FredD Lublin.Clinical features and diagnosis of multiple sclerosis.Neurologic clinics, 23(1):1–15, 2005.
- Kleindorfer etal. [2021]DawnO Kleindorfer, Amytis Towfighi, Seemant Chaturvedi, KevinM co*ckroft, Jose Gutierrez, Debbie Lombardi-Hill, Hooman Kamel, WalterN Kernan, StevenJ Kittner, EnriqueC Leira, etal.2021 guideline for the prevention of stroke in patients with stroke and transient ischemic attack: a guideline from the american heart association/american stroke association.Stroke, 52(7):e364–e467, 2021.
- Qaseem etal. [2011]Amir Qaseem, TimothyJ Wilt, StevenE Weinberger, NicolaA Hanania, Gerard Criner, Thys vander Molen, DarcyD Marciniuk, Tom Denberg, Holger Schünemann, Wisia Wedzicha, etal.Diagnosis and management of stable chronic obstructive pulmonary disease: a clinical practice guideline update from the american college of physicians, american college of chest physicians, american thoracic society, and european respiratory society.Annals of internal medicine, 155(3):179–191, 2011.
- Gupta etal. [2013]Dheeraj Gupta, Ritesh Agarwal, AshutoshNath Aggarwal, VNMaturu, Sahajal Dhooria, KTPrasad, InderpaulS Sehgal, LakshmikantB Yenge, Aditya Jindal, Navneet Singh, etal.Guidelines for diagnosis and management of chronic obstructive pulmonary disease: Joint ics/nccp (i) recommendations.Lung India, 30(3):228–267, 2013.
- Olson and Davis [2020]Gregory Olson and AndrewM Davis.Diagnosis and treatment of adults with community-acquired pneumonia.Jama, 323(9):885–886, 2020.
- Konstantinides etal. [2020]StavrosV Konstantinides, Guy Meyer, Cecilia Becattini, Héctor Bueno, Geert-Jan Geersing, Veli-Pekka Harjola, MennoV Huisman, Marc Humbert, CatrionaSian Jennings, David Jiménez, etal.2019 esc guidelines for the diagnosis and management of acute pulmonary embolism developed in collaboration with the european respiratory society (ers) the task force for the diagnosis and management of acute pulmonary embolism of the european society of cardiology (esc).European heart journal, 41(4):543–603, 2020.
- Lewinsohn etal. [2017]DavidM Lewinsohn, MichaelK Leonard, PhilipA LoBue, DavidL Cohn, CharlesL Daley, EdDesmond, Joseph Keane, DeborahA Lewinsohn, AnnM Loeffler, GeraldH Mazurek, etal.Official american thoracic society/infectious diseases society of america/centers for disease control and prevention clinical practice guidelines: diagnosis of tuberculosis in adults and children.Clinical Infectious Diseases, 64(2):e1–e33, 2017.
- Charmandari etal. [2014]Evangelia Charmandari, NicolasC Nicolaides, and GeorgeP Chrousos.Adrenal insufficiency.The Lancet, 383(9935):2152–2167, 2014.
- ElSayed etal. [2023]NuhaA ElSayed, Grazia Aleppo, VanitaR Aroda, RaveendharaR Bannuru, FlorenceM Brown, Dennis Bruemmer, BillyS Collins, Kenneth Cusi, MarisaE Hilliard, Diana Isaacs, etal.4. comprehensive medical evaluation and assessment of comorbidities: Standards of care in diabetes—2023.Diabetes Care, 46(Suppl 1):s49, 2023.
- Tritos and Miller [2023]NicholasA Tritos and KarenK Miller.Diagnosis and management of pituitary adenomas: a review.Jama, 329(16):1386–1398, 2023.
- AlexanderErik etal. [2017]KAlexanderErik, NPearceElizabeth, ABrentGregory, SBrownRosalind, AGrobmanWilliam, HLazarusJohn, JMandelSusan, PPeetersRobin, etal.2017 guidelines of the american thyroid association for the diagnosis and management of thyroid disease during pregnancy and the postpartum.Thyroid, 2017.
Appendix A Details of DiReCT
A.1 Data Statistics
Table 6: Disease categories in DiReCT and the guidelines used to construct the diagnostic knowledge graphs.

| Domain | Category | # samples | # diagnoses | # PDDs | References |
|---|---|---|---|---|---|
| Cardiology | Acute Coronary Syndromes | 65 | 6 | 3 | [Byrne etal., 2024] |
| | Aortic Dissection | 14 | 3 | 2 | [Members etal., 2022] |
| | Atrial Fibrillation | 10 | 3 | 2 | [Joglar etal., 2024] |
| | Cardiomyopathy | 9 | 5 | 4 | [Ommen etal., 2020] |
| | Heart Failure | 52 | 6 | 3 | [Heidenreich etal., 2022] |
| | Hyperlipidemia | 2 | 2 | 1 | [Su etal., 2021] |
| | Hypertension | 32 | 2 | 1 | [Unger etal., 2020] |
| Gastroenterology | Gastritis | 27 | 5 | 3 | [Shah etal., 2021] |
| | Gastroesophageal Reflux Disease | 41 | 2 | 1 | [Gyawali etal., 2024] |
| | Peptic Ulcer Disease | 28 | 3 | 2 | [Kavitt etal., 2019] |
| | Upper Gastrointestinal Bleeding | 7 | 2 | 1 | [Barkun etal., 2019] |
| Neurology | Alzheimer | 10 | 2 | 1 | [McKhann etal., 1984] |
| | Epilepsy | 8 | 3 | 2 | [Igaku-Shoin-Ltd., 2018] |
| | Migraine | 4 | 3 | 2 | [Lipton etal., 2001] |
| | Multiple Sclerosis | 27 | 6 | 4 | [Lublin, 2005] |
| | Stroke | 28 | 3 | 2 | [Kleindorfer etal., 2021] |
| Pulmonology | Asthma | 13 | 7 | 5 | [Qaseem etal., 2011] |
| | COPD | 19 | 6 | 4 | [Gupta etal., 2013] |
| | Pneumonia | 20 | 4 | 2 | [Olson and Davis, 2020] |
| | Pulmonary Embolism | 35 | 5 | 3 | [Konstantinides etal., 2020] |
| | Tuberculosis | 5 | 3 | 2 | [Lewinsohn etal., 2017] |
| Endocrinology | Adrenal Insufficiency | 20 | 4 | 3 | [Charmandari etal., 2014] |
| | Diabetes | 13 | 4 | 2 | [ElSayed etal., 2023] |
| | Pituitary | 12 | 4 | 3 | [Tritos and Miller, 2023] |
| | Thyroid Disease | 10 | 6 | 4 | [AlexanderErik etal., 2017] |
Table 6 provides a detailed breakdown of the disease categories included in DiReCT. The column labeled "# samples" indicates the number of data points, and "# diagnoses" and "# PDDs" denote the total numbers of diagnoses (diseases) and PDDs per category, respectively. The existing guidelines listed under "References" were used as the foundation for constructing the diagnostic knowledge graphs. As some premises may not be included in the referenced guidelines, physicians incorporated their own knowledge during annotation to complete the knowledge graph.
A.2 Structure of Knowledge Graph
The entire knowledge graph $\mathcal{G}$ is stored in separate JSON files, each corresponding to a specific disease category $c$. Each file comprises the procedural graph $\mathcal{G}^{\mathrm{p}}_c$ and the corresponding premises for each disease. As illustrated in Figure 7, the procedural graph is stored under the key "Diagnostic" in a dictionary structure; a key with an empty list as its value indicates a leaf diagnostic node in $\mathcal{L}_c$. The premises for each disease are saved under the key "Knowledge", indexed by the corresponding disease name. For all root nodes (e.g., Suspected Heart Failure), we further divide the premises into "Risk Factors", "Symptoms", and "Signs". Note that individual premises are separated by ";".
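For illustration, a heavily abridged and hypothetical category file following this layout, together with a small helper that enumerates the leaf diagnoses, might look like the snippet below (the real files are larger and use the exact disease names and criteria from the guidelines):

```python
# Hypothetical, abridged example of one category file (keys as in Appendix A.2).
example = {
    "Diagnostic": {
        "Suspected Heart Failure": {
            "HFrEF": [],   # an empty list marks a leaf diagnostic node (a PDD)
            "HFpEF": [],
        }
    },
    "Knowledge": {
        "Suspected Heart Failure": {          # root nodes split their premises
            "Risk Factors": "Hypertension; Prior myocardial infarction",
            "Symptoms": "Dyspnea; Fatigue",
            "Signs": "Peripheral edema; Elevated jugular venous pressure",
        },
        # Non-root diagnoses keep their premises as a single ';'-separated string.
        "HFrEF": "Symptoms and/or signs of HF; LVEF <= 40%",
        "HFpEF": "Symptoms and/or signs of HF; LVEF >= 50%",
    },
}

def leaf_nodes(diagnostic):
    """Collect diagnoses whose value is empty (leaf PDD nodes)."""
    leaves = []
    for name, children in diagnostic.items():
        if not children:
            leaves.append(name)
        else:
            leaves.extend(leaf_nodes(children))
    return leaves

print(leaf_nodes(example["Diagnostic"]))  # ['HFrEF', 'HFpEF']
```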
A.3 Annotation and Tools
We have developed proprietary software for annotation purposes. As depicted in Figure 8, annotators are presented with the original text as observations and are required to provide rationales $r$ to explain why a particular observation $o$ supports a diagnosis $d$. The left section of the figure, labeled Input1 to Input6, corresponds to the different parts of the clinical note, specifically the chief complaint, history of present illness, past medical history, family history, physical exam, and pertinent results, respectively. Annotators add the raw text into the first layer by left-clicking and dragging to select the original text, then right-clicking to add it. After each observation, a white box is used to record the rationale. Finally, a connection is made from each rationale to a diagnosis, represented as a gray box. The annotation process strictly follows the knowledge graph. Both the final annotation and the raw clinical note are saved in a JSON file. We provide the code to compile these annotations and detailed instructions for using our tool on GitHub.
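A simplified, hypothetical annotation record in this spirit is shown below; the exact keys of the released JSON may differ, but each entry links a verbatim observation, the annotator's rationale, and the diagnostic node it supports:

```python
# Hypothetical, abridged annotation record (key names are illustrative only).
annotation = {
    "note": {"input1": "Chief complaint: ...", "input5": "Physical exam: ..."},
    "chain": [
        {
            "observation": "left-sided facial droop",
            "rationale": "Facial droop is a typical sign of stroke",
            "diagnosis": "Suspected Stroke",
        },
        {
            "observation": "thalamic hematoma on non-contrast head CT",
            "rationale": "A high-density area on non-contrast CT indicates hemorrhage",
            "diagnosis": "Hemorrhagic Stroke",
        },
    ],
}

# Flatten to (observation, rationale, diagnosis) triplets for evaluation.
triplets = [(e["observation"], e["rationale"], e["diagnosis"]) for e in annotation["chain"]]
```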
A.4 Access to DiReCT
The implementation code and annotation tool are available at https://github.com/wbw520/DiReCT. The data will be released through PhysioNet for safety reasons, in accordance with the license of MIMIC-IV (PhysioNet Credentialed Health Data License 1.5.0); we will use the same license for DiReCT. The download link will be accessible via GitHub. We confirm that the GitHub link and data link will remain accessible and that we bear all responsibility in case of violation of rights.
Appendix B Implementation of Baseline Method
B.1 Prompt Settings
Table 7 (input prompt for the narrowing-down module): Suppose you are one of the greatest AI scientists and medical expert. Let us think step by step. You will review a clinical ’Note’ and your ’Response’ is to diagnose the disease that the patient have for this admission. All possible disease options are in a list structure: {disease_option}. Note that you can only choose one disease from the disease options and directly output the origin name of that disease. Now, start to complete your task. Don’t output any information other than your ’Response’. ’Note’: {note} Your ’Response’:
Table 8 (input prompt for the perception module): Suppose you are one of the greatest AI scientists and medical expert. Let us think step by step. You will review a part of clinical "Note" from a patient. The disease for which the patient was admitted to hospital this time is {disease}. Your task is to extract the original text as confidence "Observations" that lead to {disease}. Here are some premise for the diagnosis of this disease category. You can refer them for your task. Premise are: {premise} Note that you also need to briefly provide the "Reason" for your extraction. Note that both "Observations" and "Reason" should be string. Note that your "Response" should be a list structure as following : [["Observation", "Reason"], ……, ["Observation", "Reason"]] Note that if you can’t find any "Observation" your "Response" should be: []. Now, start to complete your task. Note that you should not output any information other than your "Response". "Note": {note} Note that you should not output any information other than your "Response". Your "Response":
Table 9 (input prompt for the reasoning module): Suppose you are one of the greatest AI scientists and medical expert. Let us think step by step. You will receive a list of "Observations" from a clinical "Note". These "Observations" are possible support to diagnose {disease}. Based on these "Observations", you need to diagnose the "Disease" from the following options: {disease_option}. Here are some golden standards to discriminate diseases. You can refer them for your task. Golden standards are: {premise} Note that you can only choose one "Disease" from the disease options and directly output the name in disease options. Note that you also required to select the "Observations" that satisfy the golden standard to diagnose the "Disease" you choose. Note that you also required to provide the "Reason" for your choice. Note that your "Response" should be a list structure as following :[["Observation", "Reason", "Disease"], ……, ["Observation", "Reason", "Disease"]] Note that if you can’t find any "Observation" to support a disease option, your "Response" should be: None Now, start to complete your task. Note that you should not output any information other than your "Response". "Observations": {observation} Note that you should not output any information other than your "Response". Your "Response":
In this section, we present the prompts used for each module (Tables 7-9 for the narrowing-down, perception, and reasoning modules, respectively).
In Table 7, {disease_option} lists the names of all disease categories, and {note} is the content of the whole clinical note. The model's response is the name of the predicted disease category $\hat{c}$.
In Table 8, {disease} is the disease category name predicted by the narrowing-down module. The content marked in blue is the premise, which is only provided in the $\mathcal{G}$ setting; in this module, {premise} is filled with all premise information in the knowledge graph. Unlike narrowing-down, the prompt is applied to each clinical data field $x_i$ separately via {note}, and the outputs are combined to form $\hat{O}_t$ and $\hat{E}_t$.
In Table 9, {disease} is the disease category name and {disease_option} consists of the children nodes $\mathcal{C}(d_t)$. Similarly, the premise marked in blue is only available in the $\mathcal{G}$ setting; it provides the premises that are the criteria for diagnosing each child node. {observation} is the set of observations $\hat{O}_t$ extracted in the previous step. We provide all the prompts and the complete implementation code on GitHub.
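As a sketch of how these templates are used at inference time (placeholder names as in Tables 8 and 9; the lenient parsing below is a simplification, not the released code):

```python
import ast

def fill_prompt(template, **slots):
    """Fill the {placeholder} slots of a prompt template (e.g., Table 8 or 9)."""
    return template.format(**slots)

def parse_list_response(response):
    """Parse a response expected as a list structure such as
    [["Observation", "Reason"], ...]; return [] when parsing fails."""
    try:
        parsed = ast.literal_eval(response.strip())
        return parsed if isinstance(parsed, list) else []
    except (ValueError, SyntaxError):
        return []
```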
B.2 Details of Automatic Evaluation
The automatic evaluation is realized with LLama3 8B. We show the prompts for this implementation in Table 10 (for observations) and Table 11 (for rationalizations); a schematic of the matching loop is sketched after Table 11. Note that we do not use few-shot samples for the evaluation of observations. In Table 10, {gt_observation} and {pred_observation} are the ground-truth and predicted observations, respectively. As this is a simple similarity comparison task to discriminate whether the model finds observations similar to the human-annotated ones, LLama3 itself has such ability. We do not require an exact match because the lengths of the extracted raw text may differ (as long as the observation expresses the same description). In Table 11, {gt_reasoning} and {pred_reasoning} are the ground-truth and predicted rationales, respectively. We require a rationale to be complete (the content of the expression can be understood from the rationale alone) and meaningful; therefore, we provide five samples for this evaluation. We also provide all the prompts and the complete implementation code on GitHub.
Table 10 (prompt for observation matching): Suppose you are one of the greatest AI scientists and medical expert. Let us think step by step. You will receive two "Observations" extracted from a patient’s clinical note. Your task is to discriminate whether they textually description is similar? Note that "Response" should be one selection from "Yes" or "No". Now, start to complete your task. Don’t output any information other than your "Response". "Observation 1": {gt_observation} "Observation 2": {pred_observation} Your "Response":
Table 11 (prompt for rationalization matching): Suppose you are one of the greatest AI scientists and medical expert. Let us think step by step. You will receive two "Reasoning" for the explanation of why an observation cause a disease. Your task is to discriminate whether they explain a similar medical diagnosis premise? Note that "Response" should be one selection from "Yes" or "No". Here are some samples: Sample 1: "Reasoning 1": Facial sagging is a classic symptom of stroke "Reasoning 2": Indicates possible facial nerve palsy, a common symptom of stroke "Response": Yes Sample 2: "Reasoning 1": Family history of Diabetes is an important factor "Reasoning 2": Patient’s mother had a history of Diabetes, indicating a possible genetic predisposition to stroke "Response": Yes Sample 3: "Reasoning 1": headache is one of the common symptoms of HTN "Reasoning 2": Possible symptom of HTN "Response": No Sample 4: "Reasoning 1": Acute bleeding is one of the typical symptoms of hemorrhagic stroke "Reasoning 2": The presence of high-density areas on Non-contrast CT Scan is a golden standard for Hemorrhagic Stroke "Response": No Sample 5: "Reasoning 1": Loss of strength on one side of the body, especially when compared to the other side, is a common sign of stroke "Reasoning 2": Supports ischemic stroke diagnosis "Response": No Now, start to complete your task. Don’t output any information other than your "Response". "Reasoning 1": {gt_reasoning} "Reasoning 2": {pred_reasoning} Your "Response":
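The observation-matching step of the automatic evaluation can be sketched as follows (`ask_llm` is a placeholder for the LLama3 8B call; `obs_prompt` is the template of Table 10 with {gt_observation} and {pred_observation} slots):

```python
def is_yes(answer):
    return answer.strip().lower().startswith("yes")

def match_observations(gt_obs, pred_obs, obs_prompt, ask_llm):
    """Greedy one-to-one matching of predicted to ground-truth observations.

    For each predicted observation, ask the evaluator LLM (Table 10) whether it
    describes the same finding as a yet-unmatched ground-truth observation.
    """
    pairs, used = [], set()
    for o_hat in pred_obs:
        for i, o in enumerate(gt_obs):
            if i in used:
                continue
            prompt = obs_prompt.format(gt_observation=o, pred_observation=o_hat)
            if is_yes(ask_llm(prompt)):
                pairs.append((o, o_hat))
                used.add(i)
                break
    return pairs
```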
B.3 Prediction Samples
Figures 9 and 10 show two samples generated by GPT-4. The ground-truth PDDs of the input clinical notes are Gastroesophageal Reflux Disease (GERD) and Heart Failure (HF), respectively. In these figures, purple, orange, and red indicate explanations that appear only in the ground truth, only in the prediction, and in both, respectively; thus, red marks a successfully predicted explanation, while purple and orange mark a false negative and a false positive, respectively.
In Figure 9, we can observe that GPT-4 finds the key observation for the diagnosis of GERD, consistent with the human annotation in both the observation and the rationale. However, it still fails to identify all observations and to establish accurate relationships between observations and diseases. In Figure 10, the model's predictions do not align well with those of the human doctor. Key observations, such as the relationship between BNP and LVEF, are incorrectly identified, leading to a final misdiagnosis.
B.4 Experiments for No Extra Knowledge
Input Prompt Suppose you are one of the greatest AI scientists and medical expert. Let us think step by step. You will review a clinical ’Note’ and your ’Response’ is to diagnose the disease that the patient have for this admission. All possible disease options are in a list structure: {disease_options}. Note that you can only choose one disease from the disease options and directly output the origin name of that disease. Now, start to complete your task. Don’t output any information other than your ’Response’. ’Note’: {note} Your ’Response’:
Input Prompt Suppose you are one of the greatest AI scientists and medical expert. Let us think step by step. You will review a clinical ’Note’ and your ’Response’ is to diagnose the disease that the patient have for this admission. Note that you can only give one disease name and directly output the name of that "Disease". Now, start to complete your task. Don’t output any information other than your ’Response’. ’Note’: {note} Your ’Response’:
We show the prompts used for the two settings without extra knowledge in Table 12 and Table 13, respectively. In both, {note} is the text of the whole clinical note; in Table 12, {disease_options} lists the names of all leaf nodes.
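A minimal sketch of both settings is given below; the template strings and the query_llm() wrapper are placeholders of our own, not the released code.

```python
# Simplified stand-ins for the no-extra-knowledge prompts.
WITH_OPTIONS_TEMPLATE = "... {disease_options} ... {note} ..."   # Table 12
OPEN_TEMPLATE = "... {note} ..."                                 # Table 13

def diagnose_without_knowledge(note: str, leaf_diseases, query_llm,
                               use_options: bool = True) -> str:
    if use_options:
        # Constrained setting: the model must pick one name among all leaf nodes.
        prompt = WITH_OPTIONS_TEMPLATE.format(
            disease_options=list(leaf_diseases), note=note)
    else:
        # Fully open setting: the model may output any disease name.
        prompt = OPEN_TEMPLATE.format(note=note)
    return query_llm(prompt).strip()
```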
B.5 Experimental Settings
All experiments are run with a temperature of 0. All open-source models are deployed on a local server with 4 NVIDIA A100 GPUs.
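For reference, the following sketch shows deterministic decoding for an open-source model, assuming Hugging Face transformers; the checkpoint name and token budget are placeholders, and only temperature 0 (i.e., greedy decoding) reflects the setting described above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"   # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # temperature = 0 corresponds to greedy decoding (do_sample=False).
    output_ids = model.generate(**inputs, do_sample=False,
                                max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```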
Appendix C Failed Attempts on DiReCT
In this section, we discuss some unsuccessful attempts during the experiments.
Extracting observations from the whole clinical note. We try to diagnose the disease and extract the observations and corresponding rationales in a single pass using the prompt shown in Table 14, where {note} is filled with the entire content of the clinical note (a sketch of parsing its list-structured output follows the prompt). We find that even though the model can make the correct diagnosis, it extracts only a few observations (no more than 4), which reduces completeness and faithfulness.
Input Prompt Suppose you are one of the greatest AI scientists and medical expert. Let us think step by step. You will review a clinical ’Note’, and your ’Response’ is to diagnose the disease that the patient has for this admission. All possible disease options are in a list structure: {disease_options}. Note that you can only choose one disease from the disease options and directly output the origin name of that disease. Note that you also need to extract original text as confidence "Observations" that lead to the "Disease" you selected. Note that you should extract all necessary "Observation". Note that you also need to briefly provide the "Reason" for your extraction. Note that both "Observations" and "Reason" should be string. Note that your "Response" should be a list structure as following :[["Observation", "Reason", "Disease"], ……, ["Observation", "Reason", "Disease"]] Now, start to complete your task. Don’t output any information other than your ’Response’. ’Note’ :{note} Your ’Response’:
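The sketch below shows how the list-structured "Response" required by this prompt can be parsed; the use of ast.literal_eval and the silent fallback to an empty list are our assumptions, not necessarily what the released code does.

```python
import ast
from typing import List, Tuple

def parse_triplets(response: str) -> List[Tuple[str, str, str]]:
    """Parse [["Observation", "Reason", "Disease"], ...] into tuples."""
    try:
        parsed = ast.literal_eval(response.strip())
    except (ValueError, SyntaxError):
        return []   # the model did not follow the required list structure
    return [tuple(item) for item in parsed
            if isinstance(item, (list, tuple)) and len(item) == 3]
```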
End-to-end prediction. We also try to output the whole reasoning process in one step (without iteration) when given the observations. The prompt is shown in Table 15. We find that, with such a prompt, the model cannot correctly recognize the relations between observations, rationales, and the diagnosis.
Input Prompt Suppose you are one of the greatest AI scientists and medical expert. Let us think step by step. You will receive a list of "Observations" from a clinical "Note" for the diagnosis of stroke. Here is the diagnostic route of stroke in a tree structure:
-Suspected Stroke
-Hemorrhagic Stroke
-Ischemic Stroke
Here are some premise for the diagnosis of this disease. You can refer them for your task. Premise are: {premise}
Based on these "Observations", starting from the root disease, your target is to diagnose one of the leaf disease. Note that you also required to provide the "Reason" for your reasoning. Note that your "Response" should be a list structure as following :[["Observation", "Reason", "Disease"], ……, ["Observation", "Reason", "Disease"]] Note that if you can’t find any "Observation" to support a disease option, your "Response" should be: None
Now, start to complete your task. Note that you should not output any information other than your "Response". "Observations": {observation} Note that you should not output any information other than your "Response". Your "Response":
Appendix D Ethical Considerations
Utilizing real-world EHRs, even in de-identified form, poses inherent risks to patient privacy. Therefore, it is essential to implement rigorous data-protection and privacy measures to safeguard sensitive information, in accordance with regulations such as HIPAA. We strictly adhere to the Data Use Agreement of the MIMIC dataset, ensuring that the data is not shared with any third party. All experiments are run on a private server, and GPT models are accessed through a private instance.
AI models are susceptible to replicating and even intensifying the biases inherent in their training data. These biases, if not addressed, can have profound implications, particularly in sensitive domains such as healthcare. Unconscious biases in healthcare systems can result in significant disparities in the quality of care and health outcomes among different demographic groups. Therefore, it is imperative to rigorously examine AI models for potential biases and implement robust mechanisms for ongoing monitoring and evaluation. This involves analyzing the model’s performance across various demographic groups, identifying any disparities, and making necessary adjustments to ensure equitable treatment for all. Continual vigilance and proactive measures are essential to mitigate the risk of biased decision-making and to uphold the principles of fairness and justice in AI-driven healthcare solutions.