PyChemFlow: an automated pre-processing pipeline in Python for reproducible machine learning on chemical data

Mario Lovrić; Tomislav Duricic; Hussain Hussain; Bono Lučić; Roman Kern

doi:10.26434/chemrxiv-2023-3zpw0

Corresponding author: Mario Lovrić

Know-Center GmbH Research Center for Data-Driven Business & Big Data Analytics (98770)

Publication: Working paper › Preprint

Abstract

PyChemFlow is a Python library for automated and reproducible data pre-processing. Based on open-source code, PyChemFlow has simple requirements that rely on pandas, scikit-learn and joblib. The library's backbone is built up of transformer objects, which are fully constructed during the PyChemFlow fitting process using training data and can be conveniently stored using joblib. The user can run the library with a one-line command after splitting data into train and validation sets or while working with additional data. This is especially useful when reproducibility is critical. PyChemFlow also offers the ability to persistently store metadata, in addition to providing customizable and configurable data manipulation steps.
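The workflow described in the abstract (fit transformer objects on training data, store them with joblib, then reuse them on validation or additional data) can be illustrated with a minimal sketch. It does not use PyChemFlow's own API, which this record does not document. A plain scikit-learn `Pipeline` stands in for it, and the column names, values, and file name are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical descriptor table; column names and values are made up.
data = pd.DataFrame({
    "mol_weight": [180.2, 46.1, None, 78.1, 151.2, 194.2, 60.1, 122.1],
    "logp": [1.2, -0.3, 0.5, 2.1, None, -0.1, 0.2, 1.9],
})

# Split first, so that nothing is learned from the validation rows.
train, valid = train_test_split(data, test_size=0.25, random_state=0)

# Stand-in for a pre-processing pipeline built from transformer objects
# (an assumption for illustration, not PyChemFlow's actual steps).
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Medians, means, and variances are estimated on the training data only.
preprocess.fit(train)

# Persist the fitted transformers so the same steps can be replayed later.
path = os.path.join(tempfile.mkdtemp(), "preprocess.joblib")
joblib.dump(preprocess, path)

# Reload and apply the stored transformers to the validation data.
restored = joblib.load(path)
valid_processed = restored.transform(valid)
```

Because the reloaded object carries the statistics learned during fitting, validation or later data is transformed exactly as the training data was, which is the reproducibility property the abstract emphasizes.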

Original language	English
Number of pages	3
DOIs	https://doi.org/10.26434/chemrxiv-2023-3zpw0
Publication status	Published - 20 July 2023

Access to document

10.26434/chemrxiv-2023-3zpw0 (License: CC BY-NC-ND 4.0)

Cite this

@techreport{24ee2dbd3d604baa86b47504dd2901c9,
  title = "PyChemFlow: an automated pre-processing pipeline in Python for reproducible machine learning on chemical data",
  abstract = "PyChemFlow is a Python library for automated and reproducible data pre-processing. Based on open-source code, PyChemFlow has simple requirements that rely on pandas, scikit-learn and joblib. The library's backbone is built up of transformer objects, which are fully constructed during the PyChemFlow fitting process using training data and can be conveniently stored using joblib. The user can run the library with a one-line command after splitting data into train and validation sets or while working with additional data. This is especially useful when reproducibility is critical. PyChemFlow also offers the ability to persistently store metadata, in addition to providing customizable and configurable data manipulation steps.",
  author = "Mario Lovri{\'c} and Tomislav Duricic and Hussain Hussain and Bono Lu{\v c}i{\'c} and Roman Kern",
  year = "2023",
  month = jul,
  day = "20",
  doi = "10.26434/chemrxiv-2023-3zpw0",
  language = "English",
  type = "WorkingPaper",
}

TY  - UNPB
T1  - PyChemFlow: an automated pre-processing pipeline in Python for reproducible machine learning on chemical data
AU  - Lovrić, Mario
AU  - Duricic, Tomislav
AU  - Hussain, Hussain
AU  - Lučić, Bono
AU  - Kern, Roman
PY  - 2023/7/20
Y1  - 2023/7/20
N2  - PyChemFlow is a Python library for automated and reproducible data pre-processing. Based on open-source code, PyChemFlow has simple requirements that rely on pandas, scikit-learn and joblib. The library's backbone is built up of transformer objects, which are fully constructed during the PyChemFlow fitting process using training data and can be conveniently stored using joblib. The user can run the library with a one-line command after splitting data into train and validation sets or while working with additional data. This is especially useful when reproducibility is critical. PyChemFlow also offers the ability to persistently store metadata, in addition to providing customizable and configurable data manipulation steps.
U2  - 10.26434/chemrxiv-2023-3zpw0
DO  - 10.26434/chemrxiv-2023-3zpw0
M3  - Preprint
BT  - PyChemFlow: an automated pre-processing pipeline in Python for reproducible machine learning on chemical data
ER  -