PyChemFlow: an automated pre-processing pipeline in Python for reproducible machine learning on chemical data

Mario Lovrić*, Tomislav Duricic, Hussain Hussain, Bono Lučić, Roman Kern

*Corresponding author for this work

Research output: Working paper (Preprint)

Abstract

PyChemFlow is a Python library for automated and reproducible data pre-processing. Based on open-source code, PyChemFlow has simple requirements: pandas, scikit-learn and joblib. The library's backbone consists of transformer objects, which are fully constructed from the training data during the PyChemFlow fitting process and can be conveniently stored using joblib. After splitting the data into training and validation sets, or when working with additional data, the user can run the library with a one-line command. This is especially useful when reproducibility is critical. In addition to customizable and configurable data manipulation steps, PyChemFlow can persistently store metadata.
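
The fit-on-training-data, store, and reuse pattern described in the abstract can be illustrated with plain scikit-learn and joblib. The following is a minimal sketch and not the PyChemFlow API: the column names, the synthetic descriptor table, the choice of pre-processing steps and the file name are all assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical descriptor table (synthetic values, illustrative column names).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 4)), columns=["mw", "logp", "tpsa", "hbd"])
data["constant"] = 1.0               # an uninformative, zero-variance column
data.loc[::10, "logp"] = np.nan      # a few missing values

# Split first, so that every transformer is fitted on training data only.
train, valid = train_test_split(data, test_size=0.25, random_state=0)

# Assumed pre-processing steps: impute, drop constant columns, scale.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("drop_constant", VarianceThreshold()),
    ("scale", StandardScaler()),
])
preprocess.fit(train)

# Persist the fitted transformers with joblib and load them back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocess.joblib")  # hypothetical file name
    joblib.dump(preprocess, path)
    restored = joblib.load(path)

# The restored object applies exactly the same transformation to unseen data.
valid_from_fitted = preprocess.transform(valid)
valid_from_restored = restored.transform(valid)
```

Because the stored object carries everything learned from the training set (medians, retained columns, scaling parameters), validation or later data is transformed identically without refitting, which is the property that makes the workflow reproducible.
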
Original language: English
Number of pages: 3
Publication status: Published - 20 Jul 2023
