TY - GEN
T1 - Evaluating OpenAI Large Language Models for Generating Logical Abstractions of Technical Requirements Documents
AU - Perko, Alexander
AU - Wotawa, Franz
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024/9/26
Y1 - 2024/9/26
N2 - Since the advent of Large Language Models (LLMs) a few years ago, they have not only reached the mainstream but have become a commodity. Their application areas are steadily expanding thanks to sophisticated model architectures and enormous training corpora. However, accessible chatbot user interfaces and human-like responses can lead users to overestimate their abilities. This study contributes to demonstrating the strengths and weaknesses of LLMs. In this work, we bridge methods from sub-symbolic and symbolic AI. In particular, we evaluate the capabilities of LLMs to convert textual requirements documents into their logical representation, enabling analysis and reasoning. This task represents a use case close to industry practice, as requirements analysis is key in requirements and systems engineering. Our experiments evaluate the popular model family behind OpenAI's ChatGPT: GPT-3.5 and GPT-4. The underlying goal of testing for the correct abstraction of meaning is not trivial, as the relationship between input and output semantics is not directly measurable. Thus, it is necessary to approximate translation correctness through quantifiable criteria. Most notably, we define consistency-based metrics for the plausibility and stability of translations. Our experiments give insights into the syntactic validity, semantic plausibility, and stability of LLM translations, as well as suitable parameter configurations. We use real-world requirements and test the LLMs' performance out of the box and after pre-training. Experimentally, we demonstrate a strong relationship between ChatGPT parameters and the stability of translations. Finally, we show that even the best model configurations produce syntactically faulty (5%) or semantically implausible (7%) output and are not stable in their results.
KW - ChatGPT
KW - large language models
KW - logical abstraction
KW - NLP
KW - requirements engineering
KW - symbolic AI
UR - http://www.scopus.com/inward/record.url?scp=85206369494&partnerID=8YFLogxK
U2 - 10.1109/QRS62785.2024.00032
DO - 10.1109/QRS62785.2024.00032
M3 - Conference paper
AN - SCOPUS:85206369494
T3 - IEEE International Conference on Software Quality, Reliability and Security, QRS
SP - 238
EP - 249
BT - Proceedings - 2024 IEEE 24th International Conference on Software Quality, Reliability and Security, QRS 2024
PB - IEEE Institute of Electrical and Electronics Engineers
T2 - 24th IEEE International Conference on Software Quality, Reliability and Security, QRS 2024
Y2 - 1 July 2024 through 5 July 2024
ER -