Creating a Dataset for Keyphrase Extraction in Physics Publications and Patents

André Rattinger, Christian Gütl

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review


Extracting keyphrases and entities can be an important first step in many Natural Language Processing (NLP) and Information Retrieval (IR) tasks. There are many datasets for training models on standard entities, but it is hard to find data that can be used for more domain-specific applications. The types of keyphrases someone wants to extract vary enormously between fields, which makes otherwise successful algorithms perform poorly on them. One field where this is the case is Physics, specifically the processing of physics publications and patents. In contrast to news articles or social media, typical entities such as Organization, Location or Person are not helpful when extracting important information from publications or patents. There are few dataset annotations for specific domains, and even when they exist they are not easily transferable. This work contributes an annotated dataset to facilitate information retrieval and extraction in Physics. The dataset spans physics patents as well as publications. It covers both document types to enable future work that links them, such as tracking inventions from their first emergence in a publication to their adaptation in a patent.
Original language: English
Title of host publication: Proceedings of the 3rd International Open Search Symposium #ossym2021
Subtitle of host publication: OSSYM 2021
ISBN (Electronic): 978-92-9083-633-9
Publication status: Published - 2022
Event: 3rd International Open Search Symposium: OSSYM 2021 - Virtual, Austria
Duration: 11 Oct 2021 to 13 Oct 2021


Conference: 3rd International Open Search Symposium
Abbreviated title: OSSYM 2021

Fields of Expertise

  • Information, Communication & Computing

