Abstract
Electronic health record (EHR) de-identification is crucial for publishing or sharing medical data without violating patient privacy. Protected health information (PHI) is abundant in EHRs, and privacy regulations worldwide mandate de-identification before downstream tasks are performed. The ever-growing volume of data generated in healthcare and the advent of generative artificial intelligence have increased the demand for de-identified EHRs and highlighted privacy issues with large language models (LLMs), especially the transmission of data to cloud-based LLMs. In this study, we benchmark ten LLMs for de-identifying EHRs in English and German. We then compare the de-identification performance of in-context learning and full model fine-tuning and analyze the limitations of LLMs for this task. Our experimental evaluation shows that LLMs effectively de-identify EHRs in both languages. Moreover, in-context learning in a one-shot setting boosts de-identification performance without the costly full fine-tuning of the LLMs.
Original language | English |
---|---|
Article number | 112 |
Journal | Information |
Volume | 16 |
Issue number | 2 |
Publication status | Published - 6 Feb 2025 |
Keywords
- de-identification
- generative AI
- German NLP
- GPT
- LLMs
- PHI
- privacy
ASJC Scopus subject areas
- Information Systems