Evaluating OpenAI Large Language Models for Generating Logical Abstractions of Technical Requirements Documents

Alexander Perko*, Franz Wotawa

*Corresponding author for this work

Research output: Conference paper in book/report/conference proceeding (peer-reviewed)

Abstract

Since the advent of Large Language Models (LLMs) a few years ago, they have not only reached the mainstream but have become a commodity. Their application areas expand steadily because of sophisticated model architectures and enormous training corpora. However, accessible chatbot user interfaces and human-like responses may lead users to overestimate their abilities. This study contributes to demonstrating the strengths and weaknesses of LLMs. In this work, we bridge methods from sub-symbolic and symbolic AI. In particular, we evaluate the capability of LLMs to convert textual requirements documents into a logical representation, enabling analysis and reasoning. This task demonstrates a use case close to industry, as requirements analysis is key in requirements and systems engineering. Our experiments evaluate the popular model family behind OpenAI's ChatGPT: GPT-3.5 and GPT-4. The underlying goal of testing for the correct abstraction of meaning is not trivial, as the relationship between input and output semantics is not directly measurable. Thus, it is necessary to approximate translation correctness through quantifiable criteria. Most notably, we define consistency-based metrics for the plausibility and stability of translations. Our experiments give insights into the syntactic validity, semantic plausibility, and stability of translations, as well as into parameter configurations for LLM translations. We use real-world requirements and test the LLMs' performance both out of the box and after pre-training. Experimentally, we demonstrate a strong relation between ChatGPT parameters and the stability of translations. Finally, we show that even the best model configurations produce syntactically faulty (5%) or semantically implausible (7%) output and are not stable in their results.
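The abstract's notion of a consistency-based stability metric can be illustrated with a minimal sketch. The function below is not the paper's implementation; it merely assumes one plausible reading: repeat the translation of the same requirement several times and report the fraction of outputs that agree with the most frequent translation. The example requirement and its logical forms are hypothetical.

```python
from collections import Counter

def translation_stability(translations):
    """Fraction of repeated LLM outputs that match the most frequent
    translation of the same input requirement (1.0 = fully stable)."""
    if not translations:
        raise ValueError("need at least one translation")
    most_common_count = Counter(translations).most_common(1)[0][1]
    return most_common_count / len(translations)

# Hypothetical example: five repeated translations of one requirement
# into a logical form; four out of five agree.
outputs = [
    "door_open -> alarm",
    "door_open -> alarm",
    "door_open -> alarm",
    "door_open -> alarm",
    "alarm -> door_open",
]
print(translation_stability(outputs))  # 0.8
```

A pairwise-agreement variant (average agreement over all output pairs) would penalize scattered disagreements more strongly; which aggregation matches the paper's metric is not specified here.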

Original language: English
Title of host publication: Proceedings - 2024 IEEE 24th International Conference on Software Quality, Reliability and Security, QRS 2024
Publisher: IEEE (Institute of Electrical and Electronics Engineers)
Pages: 238-249
Number of pages: 12
ISBN (Electronic): 9798350365634
DOIs
Publication status: Published - 26 Sept 2024
Event: 24th IEEE International Conference on Software Quality, Reliability and Security, QRS 2024 - Cambridge, United Kingdom
Duration: 1 Jul 2024 - 5 Jul 2024

Publication series

Name: IEEE International Conference on Software Quality, Reliability and Security, QRS
ISSN (Print): 2693-9177

Conference

Conference: 24th IEEE International Conference on Software Quality, Reliability and Security, QRS 2024
Country/Territory: United Kingdom
City: Cambridge
Period: 1/07/24 - 5/07/24

Keywords

  • ChatGPT
  • large language models
  • logical abstraction
  • NLP
  • requirements engineering
  • symbolic AI

ASJC Scopus subject areas

  • Software
  • Safety, Risk, Reliability and Quality
