Detecting non-natural language artifacts for de-noising bug reports

Thomas Hirsch; Birgit Gertraud Hofer

doi:10.1007/s10515-022-00350-0

Institute of Software Technology (7160)

Research output: Contribution to journal › Article › peer-review

Abstract

Textual documents produced in the software engineering process are a popular target for natural language processing (NLP) and information retrieval (IR) approaches. However, issue tickets often contain artifacts such as code snippets, log outputs and stack traces. These artifacts not only inflate issue ticket sizes but also introduce noise that can be a real problem for some NLP approaches and therefore has to be removed during their pre-processing. In this paper, we present a machine-learning-based approach to classify textual content into natural language and non-natural language artifacts at line level. We show how data from GitHub issue trackers can be used for automated training set generation, and present a custom pre-processing approach for the task of artifact removal. The training sets are automatically created from Markdown-annotated issue tickets and project documentation files. We use these generated training sets to train a Markdown-agnostic model that is able to classify un-annotated content. We evaluate our approach on issue tickets from projects written in C++, Java, JavaScript, PHP, and Python. Our approach achieves ROC-AUC scores between 0.92 and 0.96 for language-specific models. A multi-language model trained on the issue tickets of all languages achieves ROC-AUC scores between 0.92 and 0.95. The provided models are intended to be used as noise-reduction pre-processing steps for NLP and IR approaches working on issue tickets.
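The abstract does not spell out how labels are derived from Markdown annotations. One plausible reading is that lines inside fenced code blocks count as artifacts and all other lines count as natural language. The following is a minimal sketch under that assumption, not the authors' implementation: the function name `label_lines`, the 0/1 label encoding, and the choice to skip empty lines are all hypothetical.

```python
from typing import List, Tuple

# Markdown code fence marker, built programmatically to keep this listing simple.
FENCE = "`" * 3


def label_lines(markdown_text: str) -> List[Tuple[str, int]]:
    """Label each non-empty line: 1 = artifact (inside a fence), 0 = natural language.

    Assumption: fenced blocks are the only annotation considered here.
    The fence marker lines themselves are dropped, so a model trained on
    the output never sees the Markdown annotation.
    """
    labeled: List[Tuple[str, int]] = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence  # toggle on an opening or closing fence
            continue
        if not line.strip():
            continue  # empty lines carry no signal for either class
        labeled.append((line, 1 if in_fence else 0))
    return labeled


# Hypothetical issue ticket used only for illustration.
ticket = "\n".join([
    "The app crashes on startup.",
    FENCE + "java",
    "java.lang.NullPointerException",
    "    at com.example.Main.main(Main.java:5)",
    FENCE,
    "Any ideas?",
])

for text, label in label_lines(ticket):
    print(label, text)
```

Dropping the fence lines is what would let a classifier trained on such pairs be applied to un-annotated tickets, since the Markdown markers never appear as features.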

Original language	English
Article number	52
Number of pages	29
Journal	Automated Software Engineering
Volume	29
Issue number	2
DOIs	https://doi.org/10.1007/s10515-022-00350-0
Publication status	Published - 24 Aug 2022

Keywords

NLP
Bug reports
Issue tickets
Data cleaning
Artifact removal
De-noising

ASJC Scopus subject areas

Software

Fields of Expertise

Information, Communication & Computing

Treatment code (detailed classification)

Basic - Fundamental (basic research)

Access to Document

10.1007/s10515-022-00350-0

Licence: CC BY 4.0

Cite this

@article{d4d21e4a7f9641ac9717ee4400c8099b,

title = "Detecting non-natural language artifacts for de-noising bug reports",

abstract = "Textual documents produced in the software engineering process are a popular target for natural language processing (NLP) and information retrieval (IR) approaches. However, issue tickets often contain artifacts such as code snippets, log outputs and stack traces. These artifacts not only inflate issue ticket sizes but also introduce noise that can be a real problem for some NLP approaches and therefore has to be removed during their pre-processing. In this paper, we present a machine-learning-based approach to classify textual content into natural language and non-natural language artifacts at line level. We show how data from GitHub issue trackers can be used for automated training set generation, and present a custom pre-processing approach for the task of artifact removal. The training sets are automatically created from Markdown-annotated issue tickets and project documentation files. We use these generated training sets to train a Markdown-agnostic model that is able to classify un-annotated content. We evaluate our approach on issue tickets from projects written in C++, Java, JavaScript, PHP, and Python. Our approach achieves ROC-AUC scores between 0.92 and 0.96 for language-specific models. A multi-language model trained on the issue tickets of all languages achieves ROC-AUC scores between 0.92 and 0.95. The provided models are intended to be used as noise-reduction pre-processing steps for NLP and IR approaches working on issue tickets.",

keywords = "NLP, Bug reports, Issue tickets, Data cleaning, Artifact removal, De-noising",

author = "Hirsch, Thomas and Hofer, {Birgit Gertraud}",

year = "2022",

month = aug,

day = "24",

doi = "10.1007/s10515-022-00350-0",

language = "English",

volume = "29",

journal = "Automated Software Engineering",

issn = "0928-8910",

publisher = "Springer Science+Business Media B.V.",

number = "2",

}

TY - JOUR

T1 - Detecting non-natural language artifacts for de-noising bug reports

AU - Hirsch, Thomas

AU - Hofer, Birgit Gertraud

PY - 2022/8/24

Y1 - 2022/8/24

AB - Textual documents produced in the software engineering process are a popular target for natural language processing (NLP) and information retrieval (IR) approaches. However, issue tickets often contain artifacts such as code snippets, log outputs and stack traces. These artifacts not only inflate issue ticket sizes but also introduce noise that can be a real problem for some NLP approaches and therefore has to be removed during their pre-processing. In this paper, we present a machine-learning-based approach to classify textual content into natural language and non-natural language artifacts at line level. We show how data from GitHub issue trackers can be used for automated training set generation, and present a custom pre-processing approach for the task of artifact removal. The training sets are automatically created from Markdown-annotated issue tickets and project documentation files. We use these generated training sets to train a Markdown-agnostic model that is able to classify un-annotated content. We evaluate our approach on issue tickets from projects written in C++, Java, JavaScript, PHP, and Python. Our approach achieves ROC-AUC scores between 0.92 and 0.96 for language-specific models. A multi-language model trained on the issue tickets of all languages achieves ROC-AUC scores between 0.92 and 0.95. The provided models are intended to be used as noise-reduction pre-processing steps for NLP and IR approaches working on issue tickets.

KW - NLP

KW - Bug reports

KW - Issue tickets

KW - Data cleaning

KW - Artifact removal

KW - De-noising

UR - http://www.scopus.com/inward/record.url?scp=85137060613&partnerID=8YFLogxK

U2 - 10.1007/s10515-022-00350-0

DO - 10.1007/s10515-022-00350-0

M3 - Article

SN - 0928-8910

VL - 29

JO - Automated Software Engineering

JF - Automated Software Engineering

IS - 2

M1 - 52

ER -

Projects

FWF - AMADEUS - Automated Debugging in Use
