LIMA: Fine-grained Lineage Tracing and Reuse in Machine Learning Systems

Arnab Phani, Benjamin Rath, Matthias Boehm

Publication: Contribution to book/report/conference proceedings › Conference paper › Peer-reviewed

Abstract

Machine learning (ML) and data science workflows are inherently exploratory. Data scientists pose hypotheses, integrate the necessary data, and run ML pipelines of data cleaning, feature engineering, model selection and hyper-parameter tuning. The repetitive nature of these workflows, and their hierarchical composition from building blocks exhibits high computational redundancy. Existing work addresses this redundancy with coarse-grained lineage tracing and reuse for ML pipelines. This approach allows using existing ML systems, but views entire algorithms as black boxes, and thus, fails to eliminate fine-grained redundancy and to handle internal non-determinism. In this paper, we introduce LIMA, a practical framework for efficient, fine-grained lineage tracing and reuse inside ML systems. Lineage tracing of individual operations creates new challenges and opportunities. We address the large size of lineage traces with multi-level lineage tracing and reuse, as well as lineage deduplication for loops and functions; exploit full and partial reuse opportunities across the program hierarchy; and integrate this framework with task parallelism and operator fusion. The resulting framework performs fine-grained lineage tracing with low overhead, provides versioning and reproducibility, and is able to eliminate fine-grained redundancy. Our experiments on a variety of ML pipelines show performance improvements up to 12.4x.
Original language: English
Title: LIMA: Fine-grained Lineage Tracing and Reuse in Machine Learning Systems
Pages: 1426-1439
Number of pages: 14
Publication status: Published - 2021

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

ASJC Scopus subject areas

  • Software
  • Information systems
