LIMA: Fine-grained Lineage Tracing and Reuse in Machine Learning Systems

Arnab Phani, Benjamin Rath, Matthias Boehm

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

Abstract

Machine learning (ML) and data science workflows are inherently exploratory. Data scientists pose hypotheses, integrate the necessary data, and run ML pipelines of data cleaning, feature engineering, model selection, and hyper-parameter tuning. The repetitive nature of these workflows and their hierarchical composition from building blocks lead to high computational redundancy. Existing work addresses this redundancy with coarse-grained lineage tracing and reuse for ML pipelines. This approach allows using existing ML systems, but it views entire algorithms as black boxes and thus fails to eliminate fine-grained redundancy and to handle internal non-determinism. In this paper, we introduce LIMA, a practical framework for efficient, fine-grained lineage tracing and reuse inside ML systems. Lineage tracing of individual operations creates new challenges and opportunities. We address the large size of lineage traces with multi-level lineage tracing and reuse, as well as lineage deduplication for loops and functions; exploit full and partial reuse opportunities across the program hierarchy; and integrate this framework with task parallelism and operator fusion. The resulting framework performs fine-grained lineage tracing with low overhead, provides versioning and reproducibility, and is able to eliminate fine-grained redundancy. Our experiments on a variety of ML pipelines show performance improvements of up to 12.4x.
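
The core idea of operation-level lineage tracing with reuse can be illustrated with a small sketch. This is not LIMA's implementation; it is a toy model under stated assumptions: every name here (`Traced`, `LineageCache`, `leaf`, `apply`, and the helper functions) is hypothetical, lineage is represented as a nested tuple of an opcode and the input lineages, and all operations are assumed deterministic. It covers only full reuse of identical intermediates, not partial reuse, deduplication, or multi-level tracing.

```python
class Traced:
    """A value paired with its lineage (hypothetical helper type)."""
    def __init__(self, value, lineage):
        self.value = value
        self.lineage = lineage  # hashable nested tuple: (opcode, input lineages...)


class LineageCache:
    """Toy reuse cache keyed by lineage; assumes deterministic operations."""
    def __init__(self):
        self.cache = {}
        self.hits = 0

    def leaf(self, name, value):
        # Inputs and literals are the leaves of every lineage trace.
        return Traced(value, ("input", name))

    def apply(self, opcode, fn, *inputs):
        # The lineage of an output is the opcode plus the lineage of each input.
        key = (opcode,) + tuple(x.lineage for x in inputs)
        if key in self.cache:
            self.hits += 1  # identical lineage: reuse the cached intermediate
            return Traced(self.cache[key], key)
        value = fn(*(x.value for x in inputs))
        self.cache[key] = value
        return Traced(value, key)


def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_diag(g, lam):
    # Add lam to the diagonal entries of g.
    return [[v + lam if i == j else v for j, v in enumerate(row)]
            for i, row in enumerate(g)]


# A hyper-parameter loop that recomputes t(X) and t(X) X in every iteration.
cache = LineageCache()
X = cache.leaf("X", [[1.0, 2.0], [3.0, 4.0]])
results = []
for lam in (0.1, 1.0, 10.0):
    L = cache.leaf(f"lambda={lam}", lam)
    Xt = cache.apply("t", transpose, X)            # same lineage each iteration
    G = cache.apply("mm", matmul, Xt, X)           # same lineage each iteration
    A = cache.apply("add_diag", add_diag, G, L)    # lineage differs by lambda
    results.append(A.value)
```

In this sketch the transpose and the matrix product are computed once and served from the cache in the two later iterations, while the lambda-dependent step runs every time. A coarse-grained approach that treats the whole loop body as a black box keyed only by its inputs would see three distinct calls and reuse nothing.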
Original language: English
Title of host publication: LIMA: Fine-grained Lineage Tracing and Reuse in Machine Learning Systems
Pages: 1426-1439
Number of pages: 14
Publication status: Published - 2021

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

Keywords

  • lineage tracing
  • lineage-based reuse
  • ML systems
  • reuse of intermediates

ASJC Scopus subject areas

  • Software
  • Information Systems
