Clinical information extraction from notes of Veterans with lymphoid malignancies: Natural language processing study

Abstract:

BACKGROUND: Clinical natural language processing (cNLP) techniques are commonly developed and used to extract information from clinical notes to facilitate clinical decision-making and research. However, they are less established for rare diseases such as lymphoid malignancies due to the lack of annotated data as well as the heterogeneity and complexity of how clinical information is documented. In addition, there is increasing evidence that cNLP techniques may be prone to biases embedded in clinical documentation or model development. These biases can result in disparities in performance when extracting clinical information or predicting patient outcomes.

OBJECTIVE: This study aims to report the development and validation of a cNLP pipeline that extracts clinical information such as performance status, staging, and diagnosis, as well as less common information such as substance use and military environmental exposures, from the clinical notes of veterans with lymphoid malignancies.

METHODS: We developed a rule-based cNLP pipeline that integrates domain expertise. We tested and compared the performance of the cNLP pipeline on notes from 2 veteran patient cohorts: one of non-Hispanic White veterans and the other of non-Hispanic Black veterans.

RESULTS: Overall, our pipeline achieved promising performance on our study data, especially for extracting entities that have standard clinical documentation, such as performance status. We also found that while the pipeline has robust performance across the 2 patient groups, the false-positive and false-negative rates were significantly associated with race for detecting the primary diagnosis (P=.001 for both); the false-negative rate was significantly associated with race for identifying substance use (P=.02).

CONCLUSIONS: The system exhibits satisfactory and comparable performance for most clinical entities of interest except for (1) the primary diagnosis and (2) substance use. Future work will address the challenges encountered in developing and deploying the cNLP pipeline on Department of Veterans Affairs data for rare cancers and enhance the performance of cNLP systems to avoid biases.
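
To make the idea of a rule-based extractor concrete, the following is a minimal sketch of how one well-standardized entity, performance status, might be pulled from free text with a regular expression. The abstract does not describe the authors' actual rules, so the pattern, the name `ECOG_PATTERN`, and the function `extract_ecog` are all hypothetical illustrations, and the sketch assumes the score is written as an ECOG value (0 to 5) directly after the word "ECOG".

```python
import re

# Hypothetical rule, not the authors' pipeline: matches phrases such as
# "ECOG 1", "ECOG PS: 2", or "ECOG performance status of 0".
ECOG_PATTERN = re.compile(
    r"\bECOG(?:\s+(?:PS|performance\s+status))?"  # keyword, optional qualifier
    r"\s*(?:is|of|[:=-])?\s*"                     # optional connector
    r"([0-5])\b",                                 # single-digit ECOG score
    re.IGNORECASE,
)


def extract_ecog(note: str) -> list[int]:
    """Return every ECOG score mentioned in the note, in order of appearance."""
    return [int(match.group(1)) for match in ECOG_PATTERN.finditer(note)]


if __name__ == "__main__":
    example = "Pt seen today. ECOG PS: 1. Plan to start therapy next week."
    print(extract_ecog(example))  # prints [1]
```

A single pattern like this works only because performance status tends to be documented in a standard form. Entities such as primary diagnosis or substance use are written far more variably, which is consistent with the abstract's report that those were the entities where performance differed between cohorts.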
