Large Language Models for Electronic Health Record De-Identification in English and German

Samuel Sousa, Michael Jantscher, Mark Kröll, Roman Kern

Research output: Contribution to journal › Article › peer-review

Abstract

Electronic health record (EHR) de-identification is crucial for publishing or sharing medical data without violating the patient’s privacy. Protected health information (PHI) is abundant in EHRs, and privacy regulations worldwide mandate de-identification before downstream tasks are performed. The ever-growing data generation in healthcare and the advent of generative artificial intelligence have increased the demand for de-identified EHRs and highlighted privacy issues with large language models (LLMs), especially data transmission to cloud-based LLMs. In this study, we benchmark ten LLMs for de-identifying EHRs in English and German. We then compare de-identification performance for in-context learning and full model fine-tuning and analyze the limitations of LLMs for this task. Our experimental evaluation shows that LLMs effectively de-identify EHRs in both languages. Moreover, in-context learning with a one-shot setting boosts de-identification performance without the costly full fine-tuning of the LLMs.

Original language: English
Article number: 112
Journal: Information
Volume: 16
Issue number: 2
Publication status: Published - 6 Feb 2025

Keywords

  • de-identification
  • generative AI
  • German NLP
  • GPT
  • LLMs
  • PHI
  • privacy

ASJC Scopus subject areas

  • Information Systems
