TY - GEN
T1 - Exploring the Capabilities of GPT4-Vision as OCR Engine
AU - Ghiriti, Alex
AU - Göderle, Wolfgang
AU - Kern, Roman
PY - 2024/9/25
Y1 - 2024/9/25
N2 - Many museums and libraries have made efforts to digitize their assets, and many historic documents are now available as digital images. However, these documents are not directly accessible to retrieval systems that rely on written text rather than images. In this study, the novel GPT4-Vision is examined for its optical character recognition (OCR) ability in cases where established methods, such as Tesseract, may have difficulties. We find that GPT4-Vision provides excellent results even in cases where humans struggle. We also identify a number of key limitations, including the long runtime (implying high energy requirements), the lack of handling of rotated images, the necessity for layout hints, and limitations regarding image size. Even with these limitations, large language models and vision transformers are expected to play an important role in making historical documents more accessible for further processing or directly to users.
AB - Many museums and libraries have made efforts to digitize their assets, and many historic documents are now available as digital images. However, these documents are not directly accessible to retrieval systems that rely on written text rather than images. In this study, the novel GPT4-Vision is examined for its optical character recognition (OCR) ability in cases where established methods, such as Tesseract, may have difficulties. We find that GPT4-Vision provides excellent results even in cases where humans struggle. We also identify a number of key limitations, including the long runtime (implying high energy requirements), the lack of handling of rotated images, the necessity for layout hints, and limitations regarding image size. Even with these limitations, large language models and vision transformers are expected to play an important role in making historical documents more accessible for further processing or directly to users.
KW - GPT4-Vision
KW - Historic Documents
KW - OCR
KW - OCR Benchmark
KW - Tesseract Comparison
KW - Vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=85206217227&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-72440-4_1
DO - 10.1007/978-3-031-72440-4_1
M3 - Conference paper
SN - 978-3-031-72439-8
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 3
EP - 12
BT - Linking Theory and Practice of Digital Libraries - 28th International Conference on Theory and Practice of Digital Libraries, TPDL 2024, Proceedings
A2 - Antonacopoulos, Apostolos
A2 - Hinze, Annika
A2 - Vanderschantz, Nicholas
A2 - Piwowarski, Benjamin
A2 - Coustaty, Mickaël
A2 - Di Nunzio, Giorgio Maria
A2 - Gelati, Francesco
T2 - 28th International Conference on Theory and Practice of Digital Libraries, TPDL 2024
Y2 - 24 September 2024 through 27 September 2024
ER -