Machine Learning Pipelines in 3 simple pictures

Roman Kazinnik
4 min read · Dec 3, 2020

I’ll do a side-by-side comparison of architectural patterns for Data Pipelines and Machine Learning Pipelines and illustrate the principal differences. My main goal is to show the value of deploying dedicated tools and platforms for Machine Learning, such as Kubeflow and Metaflow. Using non-Machine Learning tools, such as Airflow for orchestration, tends to be suboptimal.

Machine learning pipeline products have been popular for quite some time now. But why do we need pipelines in the first place? Here are some of the main reasons, in my opinion:

Why we Need Pipelines

  • Pipelines help you maintain discipline in your work. You can plan ahead and be clear about the targets you have to hit, which supports systematic and timely progress.
  • Pipelines provide automation. They replace manual processes and eliminate many opportunities for human error.
  • Almost anyone can run the pipelines from anywhere, at any time, which enhances reproducibility.
  • Pipelines increase the visibility of the process. You can easily review and debug where needed, and teams can spot disconnected code more easily.

For a more detailed take on these points, see this excellent post by Alan Marazzi on the “whys” of pipelines.

Figure 1: Three patterns of Pipeline Architectures

Machine Learning Platform

If you’re specifically looking for Machine Learning platforms, I recommend Metaflow and Kubeflow. These two are the best for development, experimentation, and production deployment.

My own analysis of the two platforms is that Kubeflow helps organize machine learning workflows on Kubernetes. Metaflow, on the other hand, is a Python library that helps scientists run real-life data science projects.

Metaflow and Kubeflow both aid data scientists in their projects and provide unique, innovative support for machine learning.
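To give a feel for what workflow code looks like on such a platform, here is a minimal Metaflow sketch, with entirely hypothetical step contents (the flow name and the toy training logic are mine, not taken from either project’s documentation):

```python
from metaflow import FlowSpec, step


class TrainingFlow(FlowSpec):
    """A toy pipeline: each @step runs as its own task, and every
    attribute assigned to self is persisted as a tracked artifact."""

    @step
    def start(self):
        # Hypothetical tiny dataset kept as flow state.
        self.data = [[0.0, 1.0], [1.0, 0.0]]
        self.labels = [0, 1]
        self.next(self.train)

    @step
    def train(self):
        # Placeholder for real training (scikit-learn, XGBoost, ...).
        self.model_summary = {"n_samples": len(self.data)}
        self.next(self.end)

    @step
    def end(self):
        print("Trained:", self.model_summary)


if __name__ == "__main__":
    TrainingFlow()
```

Running `python training_flow.py run` executes the steps in order, and Metaflow versions every `self.*` artifact per run, which is exactly the kind of experiment tracking discussed below.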

In my opinion, SCALE is the most important reason to adopt MLOps: it benefits every aspect of your data science work, including the outcomes you get from a Machine Learning Platform.

Pipelines: Data Transformation vs. Machine Learning

Figure 1 highlights the differences between the Data Transformation process and the Machine Learning process. These differences are fundamental, and using Data Pipelines for Machine Learning tasks leads to technical debt. Specifically, the goal of Data Pipelines (Figure 1, left) is to move terabytes of big, homogeneous data between computationally simple transformations. The goal of a Machine Learning Pipeline is to connect computationally complex components whose inputs and outputs are small (heterogeneous) metadata. For example, consider a popular ML component, “Feature Selection”, that trains XGBoost and outputs a list of feature importances. Compare that to a Map-Reduce “Filter” component whose inputs and outputs are terabyte-scale data files, file chunks, or real-time streams.
As with any debt, it accumulates, in this case as Data Pipelines are applied to more ML products over a longer period of time.
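To make the contrast concrete, here is a rough sketch of such a Feature Selection component, under my own assumptions about its interface (the function name and toy data are illustrative, not taken from any particular platform). The computation is heavy, but the output is only a small piece of metadata:

```python
import json

import numpy as np
from xgboost import XGBClassifier


def select_features(X: np.ndarray, y: np.ndarray, feature_names: list[str]) -> dict:
    """Computationally heavy step whose output is tiny metadata."""
    model = XGBClassifier(n_estimators=100, max_depth=4)
    model.fit(X, y)
    importances = dict(zip(feature_names, model.feature_importances_.tolist()))
    # The component's entire output: a small JSON-serializable dict,
    # not terabytes of transformed rows as in a Data Pipeline "Filter".
    return {"feature_importances": importances}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    y = (X[:, 0] > 0).astype(int)
    print(json.dumps(select_features(X, y, ["f0", "f1", "f2"]), indent=2))
```

The entire artifact passed downstream is the returned dictionary, a few hundred bytes, while a Data Pipeline “Filter” step would pass along the transformed data itself.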

Batch Architecture

Figure 1 shows that Batch Architectures are designed for offline, non-real-time deployments. A few extra seconds of run time are of no importance; robustness and as few failures as possible are the ultimate goal.

Specifically, Figure 1 illustrates the Persistent Batch Architecture (left), where components’ inputs and outputs are uploaded to and downloaded from cloud storage. Notice how, for the price of a small overhead in cloud storage costs and an I/O-related run-time slowdown, the Persistent Batch Architecture provides two critical new capabilities: experiment tracking and debugging. These capabilities, of marginal importance for data processing, are highly important for the experiment-driven, iterative process of Machine Learning model development.
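As a minimal sketch of what “persistent” means in practice, assume the artifacts live in an S3 bucket and that boto3 is available (the bucket name, key layout, and wrapper function are hypothetical): every component downloads its input artifact, does its work, and uploads its output, so any intermediate result can later be inspected, tracked, or replayed.

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "ml-pipeline-artifacts"  # hypothetical bucket name


def run_persistent_step(step_name: str, input_key: str, step_fn) -> str:
    """Download the input artifact, run the step, persist the output."""
    s3.download_file(BUCKET, input_key, "/tmp/input.json")
    with open("/tmp/input.json") as f:
        payload = json.load(f)

    result = step_fn(payload)

    output_key = f"{step_name}/output.json"
    with open("/tmp/output.json", "w") as f:
        json.dump(result, f)
    s3.upload_file("/tmp/output.json", BUCKET, output_key)
    return output_key  # the persisted artifact enables tracking and replay
```

Because each step’s inputs and outputs stay in the bucket, a failed run can be reproduced by pointing the same `step_fn` at the persisted input, which is exactly the debugging workflow described below.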

Comparison of the Three Architectures

Use Case: IDE Debugging of Components with Live Data

Twenty years ago, a live product was built from libraries compiled from source. Debugging was straightforward: one would run the product with a debug build of the deployed library in an IDE.

Figure 2: A Persistent component allows for IDE debugging

When isolation and Docker containers were introduced, the loss of IDE debugging for live products was regarded as a small cost. The lack of a debugger was still a pain, and logs became the cure.

The Persistent Pipeline design brings back the IDE debugger. In the Persistent Architecture, one finds the inputs that caused a component to fail and debugs it locally with the preferred IDE. For example, a Python component can be debugged in PyCharm by selecting a Python interpreter created from the Docker image’s requirements file.
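A rough sketch of that workflow, continuing the hypothetical S3 layout from the earlier example (the bucket, keys, and module path are all placeholders of mine): recreate the container’s environment locally, download the exact input that broke the component, and call the component function directly so PyCharm can step through it.

```python
import json

import boto3

# First recreate the container's environment locally, e.g.
#   pip install -r requirements.txt   # the same file the Docker image uses

s3 = boto3.client("s3")
BUCKET = "ml-pipeline-artifacts"  # hypothetical bucket name

# Pull the exact input that made the component fail in production.
s3.download_file(BUCKET, "feature_selection/failed/input.json",
                 "/tmp/failed_input.json")

with open("/tmp/failed_input.json") as f:
    payload = json.load(f)

# Set a breakpoint inside the component and run this file from the IDE.
from my_pipeline.components import run_component  # hypothetical module

run_component(payload)
```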

Final Thoughts

Components and pipelines have been the most powerful and intuitive additions to software product architectures thus far. We can expect more hybrid pipelines, such as TensorFlow Extended, that combine Machine Learning and Data components while building on time-tested batch architectures.

Enjoyed or hated it? Let me know with a comment, get in touch on Twitter, or follow me on Medium.
