Exploring the Capabilities of GPT4-Vision as OCR Engine

Alex Ghiriti, Wolfgang Göderle, Roman Kern

Research output: Chapter in Book/Report/Conference proceeding > Conference paper > peer-review

Abstract

Many museums and libraries have made efforts to digitize their assets, and many historic documents are now available as digital images. However, these documents are not directly accessible to retrieval systems that rely on written text rather than images. This study examines the optical character recognition (OCR) ability of the novel GPT4-Vision in cases where established methods, such as Tesseract, may have difficulties. We find that GPT4-Vision provides excellent results even in cases where humans struggle. We also identify a number of key limitations, including a long runtime that implies high energy requirements, the lack of handling of rotated images, the need for layout hints, and limits on image size. Even with these limitations, large language models and vision transformers are expected to play an important role in making historical documents more accessible, either for further processing or directly to users.
Original language: English
Title of host publication: Linking Theory and Practice of Digital Libraries - 28th International Conference on Theory and Practice of Digital Libraries, TPDL 2024, Proceedings
Editors: Apostolos Antonacopoulos, Annika Hinze, Nicholas Vanderschantz, Benjamin Piwowarski, Mickaël Coustaty, Giorgio Maria Di Nunzio, Francesco Gelati
Pages: 3-12
Number of pages: 10
ISBN (Electronic): 978-3-031-72440-4
Publication status: Published - 25 Sept 2024
Event: 28th International Conference on Theory and Practice of Digital Libraries, TPDL 2024 - Ljubljana, Slovenia
Duration: 24 Sept 2024 - 27 Sept 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15178 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 28th International Conference on Theory and Practice of Digital Libraries, TPDL 2024
Country/Territory: Slovenia
City: Ljubljana
Period: 24/09/24 - 27/09/24

Keywords

  • GPT4-Vision
  • Historic Documents
  • OCR
  • OCR Benchmark
  • Tesseract Comparison
  • Vision Transformer

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
