Using Named Entity Recognition and Large Language Models to enhance de-identification of semi-structured EMR data – preliminary results and lessons learned
Published: 15 September 2023
Introduction: Stringent regulatory and ethical requirements must be met to enable scientific analysis of electronic medical record (EMR) data; these typically require de-identification. In addition to patients' personal health data, such datasets also contain personally identifiable information (PII) of third parties (e.g. healthcare workers). For semi-structured datasets, complete de-identification remains an unresolved challenge [1].
We describe the experimental integration of neural networks in a de-identification process and the initial evaluation results on semi-structured test datasets from a laboratory information system.
State of the art: To date, the training of neural networks on German medical texts has been limited by the scarcity of annotated corpora and their restricted public availability [2], although such networks tend to achieve good results in the de-identification of medical documents [3].
Recently, pre-trained Large Language Models (LLMs) have become available under open source licenses; these can be specialized via few-shot learning without fine-tuning. Low-Rank Adaptation (LoRA) and quantization make LLMs executable even in CPU-constrained environments.
Concept: As part of the de-identification process of an experimental ETL pipeline, two inference steps are inserted. Test records are structurally and semantically analogous to OBX segments of HL7-v2 ORU messages and contain fictitious pieces of PII created specifically for this evaluation. We opted to err on the side of caution and withheld any dataset classified as PII from further processing, regardless of inference scores.
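This conservative policy amounts to a one-line gate over the combined detector output. A minimal sketch, assuming each detector reports its findings as labelled tuples (the tuple layout and detector names are illustrative, not the authors' implementation):

```python
def should_withhold(findings):
    """Withhold a record if ANY detector reported PII, regardless of the
    inference score attached to the finding (conservative policy)."""
    # findings: list of (detector, label, score) tuples, e.g.
    # [("flair", "PER", 0.51)] -- even a low-confidence hit withholds.
    return len(findings) > 0
```

Because no score threshold is applied, a single low-confidence hit from any method is enough to exclude the record from further processing.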
Implementation: We created an annotated corpus of 100 OBX records, 27 of which contained 46 pieces of fictitious PII. First, we used a combination of regular expressions and other rule sets to identify PII such as phone numbers or personal names (e.g., doctors' names are often prefixed with “Dr.”). Afterwards, two open source models were used for inference:
1. “Flair-ner-german” [4], a BiLSTM-CRF with string embeddings, pretrained on German texts and able to classify a set of labeled tokens: Persons (PER), Locations (LOC), Organisations (ORG) and Miscellaneous (MISC).
2. “MedAlpaca 7b LoRA 8bit” [5], an LLM specialized on medical datasets. Few-shot learning was used to instruct the model to classify names, locations, and phone numbers by providing a prompt with nine example strings and an instruction to extract all PII from the tenth string. Five of the example strings contained PII and four did not. The tenth string was dynamically injected by the ETL process.
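The few-shot prompt layout described above could be assembled as sketched below. The instruction wording and example pairs are invented placeholders, not the prompt actually used; the real prompt contained nine examples (five with PII, four without):

```python
# Hypothetical example pairs; the genuine corpus examples are not reproduced.
EXAMPLES = [
    ("Hb 4,2 g/dl telefonisch an Dr. Heidelberger", "PER: Heidelberger"),
    ("Kalium 5,9 mmol/l, Kontrolle empfohlen", "NONE"),
    # ... seven further example pairs in the same style ...
]

def build_prompt(record: str) -> str:
    """Build a few-shot classification prompt ending with the record
    to classify (the 'tenth string' injected by the ETL process)."""
    lines = ["Extract all names, locations and phone numbers from the last "
             "input. Answer NONE if it contains no PII.", ""]
    for text, label in EXAMPLES:
        lines.append(f"Input: {text}")
        lines.append(f"PII: {label}")
    lines.append(f"Input: {record}")  # dynamically injected record
    lines.append("PII:")
    return "\n".join(lines)
```

The prompt ends with an open `PII:` line so that the model's completion is the classification itself.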
Lessons learned: The combination of deterministic methods (regular expressions, fuzzy string matching) and classifications from the two models withheld 25 out of 27 PII-positive test records (sensitivity: 0.92).
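The deterministic part of this combination can be illustrated with a small rule set. The patterns below are hypothetical stand-ins for the rules actually used, covering the two cases mentioned above (phone numbers and “Dr.”-prefixed names):

```python
import re

# Illustrative patterns, not the authors' actual rule set:
# German-style phone numbers and personal names prefixed with "Dr.".
PHONE_RE = re.compile(r"(?:\+49\s?|\b0)\d{2,5}[\s/-]?\d{3,8}\b")
NAME_RE = re.compile(r"\bDr\.\s*(?:med\.\s*)?[A-ZÄÖÜ][a-zäöüß]+")

def rule_based_pii(text: str):
    """Return a list of (label, match) pairs found by the rules."""
    hits = [("PHONE", m.group()) for m in PHONE_RE.finditer(text)]
    hits += [("PER", m.group()) for m in NAME_RE.finditer(text)]
    return hits
```

Such rules are cheap and precise for well-formed PII, but, as the next paragraph shows, they cannot catch abbreviated or implicit references to persons.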
Abbreviations are challenging. Within the fictional substring "Hb 4,2 g/dl telefonisch an DÄ Heidelberger um 22:34 (DS)" - which translates to "Haemoglobin concentration of 4.2 g/dl was reported by telephone to the on-call physician, Heidelberger, at 10:34 p.m. by DS" - "Heidelberger" was classified as a PER token, but the abbreviation "DS" was not. In datasets where only abbreviations were present, this resulted in false-negative classifications. Pre-processing against a known list of abbreviations or additional training of LLMs could potentially mitigate this problem.
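A sketch of the suggested pre-processing: expand abbreviations found in a curated list and flag unknown short all-caps tokens (such as the initials "DS") for conservative review. The lookup table here is illustrative, not an actual curated list:

```python
import re

# Illustrative table of known, non-PII abbreviations.
KNOWN_ABBREVIATIONS = {
    "DÄ": "Dienstärztin/Dienstarzt",  # on-call physician (role, not a name)
    "EKG": "Elektrokardiogramm",
}

def flag_unknown_abbreviations(text, known=KNOWN_ABBREVIATIONS):
    """Return short all-caps tokens not covered by the known list;
    these may be initials of a person and warrant withholding."""
    tokens = re.findall(r"\b[A-ZÄÖÜ]{2,4}\b", text)
    return [t for t in tokens if t not in known]
```

Applied to the fictional substring above, "DÄ" would be recognized as a known role abbreviation while "DS" would be flagged, turning the false negative into a conservative withhold.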
In addition, 12 records that did not contain PII were withheld (specificity: 0.83), mainly due to proper names of procedures. For example, "Cockroft-Gault" (a method used to estimate glomerular filtration rate) was classified as a PER token by the NER model and was therefore a false positive, as we chose to exclude any instance of suspected PII.
Interestingly, the LLM did not classify "Cockroft-Gault" as PII, probably due to its specialization in medical texts. Future research will need to focus on automated interpretation of inference scores to deal with disagreements between different methods of automated de-identification.
In summary, we are encouraged by these preliminary results and will continue our research on automatic de-identification. In addition to developing new prompts to instruct the LLM, we will expand the validation corpus to include additional data domains.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Stubbs A, Kotfila C, Uzuner O. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. J Biomed Inform. 2015 Dec;58(Suppl):S11–9.
2. Kittner M, Lamping M, Rieke DT, Götze J, Bajwa B, Jelas I, et al. Annotation and initial evaluation of a large annotated German oncological corpus. JAMIA Open. 2021 Apr;4(2):ooab025.
3. Richter-Pechanski P, Amr A, Katus HA, Dieterich C. Deep Learning Approaches Outperform Conventional Strategies in De-Identification of German Medical Reports. Stud Health Technol Inform. 2019 Sep 3;267:101–9.
4. Akbik A, Blythe D, Vollgraf R. Contextual String Embeddings for Sequence Labeling. In: Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics; 2018 [cited 2023 Apr 29]. p. 1638–49. Available from: https://aclanthology.org/C18-1139
5. Han T, Adams LC, Papaioannou JM, Grundmann P, Oberhauser T, Löser A, et al. MedAlpaca – An Open-Source Collection of Medical Conversational AI Models and Training Data [Preprint]. arXiv. 2023 Apr 14. DOI: 10.48550/arXiv.2304.08247