From Data Pipeline to Machine Learning Architecture in 3 simple pictures
In this post, I’ll do a side-by-side comparison of architectural patterns for the Data Pipeline and Machine Learning Pipeline and illustrate the principal differences. My main goal is to show the value behind dedicated tools and platforms for Machine Learning products and also shed more light on the new and fast-growing range of Machine Learning Pipeline products.
Listed below are several (great) reasons for designing and deploying pipelines:
- Discipline — use of pipelines facilitates planning ahead and putting clear goals in place before the work starts
- Automation — replace manual processes and eliminate any chance of human error
- Reproducibility — can be run by anyone, anywhere, anytime
- Visibility — use of pipelines provides the opportunity to review and debug the whole process vs. having your team review disconnected code snippets (yikes!)
For a more detailed take on these points, please see this excellent post by Alan Marazzi on the ‘whys’ of pipelines.
Different goals: Data Transformation vs. ML
Figure 1 highlights the differences between the Data Transformation process and the Machine Learning process. These differences are fundamental, and utilizing Data Pipelines for Machine Learning tasks leads to technical debt. Specifically, the goal of a Data Pipeline (Figure 1, left) is to move terabytes of big, homogeneous data between computationally simple transformations. The goal of a Machine Learning Pipeline is to connect computationally complex components whose inputs and outputs are small, heterogeneous metadata. For example, consider the popular ML component “Feature Selection”, which trains XGBoost and outputs a short list of feature importances. Compare it to a MapReduce “Filter” component, whose inputs and outputs are terabyte-scale data files, file chunks, or real-time streams.
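To make the small-metadata pattern concrete, here is a minimal sketch of a hypothetical feature-selection component. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost, and all names (`select_features`, `top_k`) are illustrative assumptions, not an established API. The point is that the component's output is a compact piece of metadata, not the data it was trained on:

```python
import json
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def select_features(X, y, feature_names, top_k=3):
    """Train a boosted model and return only compact metadata:
    the top_k feature names ranked by importance."""
    model = GradientBoostingClassifier(n_estimators=50, random_state=0)
    model.fit(X, y)
    ranked = sorted(zip(feature_names, model.feature_importances_),
                    key=lambda p: p[1], reverse=True)
    # The component consumes a (potentially huge) training set but
    # emits only small, heterogeneous metadata for the next component.
    return {"selected": [name for name, _ in ranked[:top_k]]}

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
names = [f"f{i}" for i in range(8)]
print(json.dumps(select_features(X, y, names)))
```

Downstream components read this tiny JSON payload rather than re-scanning the raw data, which is exactly what makes a dedicated ML pipeline different from a data-movement pipeline.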
As with any debt, it accumulates, in this case by applying Data Pipelines to multiple ML products over a prolonged period.
Figure 1 also shows that Batch Architectures are designed for offline (non-real-time) deployments: seconds added to run-times are of no importance, and robustness with as few failures as possible is the ultimate goal.
Specifically, Figure 1 illustrates the Persistent Batch Architecture (left), where the inputs and outputs of components are uploaded to and downloaded from cloud storage. Notice how, at the price of a small overhead in cloud disk space and IO-related run-time slowdown, the Persistent Batch Architecture provides two new critical functionalities: experiment tracking and debugging. These functions, of marginal importance for data processing, are highly important for the experiment-driven, iterative process of Machine Learning model development.
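A minimal sketch of the persistence idea, using the local filesystem as a stand-in for a cloud bucket; the directory layout, step names, and pickle format are all assumptions for illustration:

```python
import pickle
from pathlib import Path

STORE = Path("pipeline_runs")  # stand-in for a cloud storage bucket

def persistent_step(run_id, name, fn, inputs):
    """Run one pipeline component, persisting its inputs and outputs
    so the run can be tracked and any failure replayed locally."""
    step_dir = STORE / run_id / name
    step_dir.mkdir(parents=True, exist_ok=True)
    # Persist inputs BEFORE running: if fn raises, the exact failing
    # inputs are already saved for local debugging.
    (step_dir / "inputs.pkl").write_bytes(pickle.dumps(inputs))
    outputs = fn(**inputs)
    (step_dir / "outputs.pkl").write_bytes(pickle.dumps(outputs))
    return outputs

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

out = persistent_step("run-001", "normalize", normalize,
                      {"values": [1, 2, 5]})
print(out)  # [0.125, 0.25, 0.625]
```

Because every step's artifacts survive the run, comparing two experiments or replaying a failed component becomes a matter of reading files back, which is the tracking-and-debugging payoff described above.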
IDE Debugging Components with Live Data Use Cases
Twenty years ago, a live product was built from libraries compiled from code sources. Debugging was straightforward: one would run the product with a debug version of the deployed library in an IDE.
When isolation and Docker containers were introduced, losing IDE debugging of live products was regarded as a small cost. Still, having no debugger was a pain, and logs became the cure.
Persistent Pipeline design brings the IDE debugger back. In the Persistent Architecture, one finds the inputs that caused a component to fail and debugs it locally with the preferred IDE. For example, a Python component can be debugged in PyCharm by selecting an interpreter created from the Docker image's requirements file.
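A self-contained sketch of the local-replay workflow: a hypothetical component fails on a malformed record, the pipeline has persisted that exact input, and the developer loads it back to reproduce the failure under the IDE debugger (the component name, file path, and record format are illustrative assumptions):

```python
import pickle
from pathlib import Path

def parse_record(record):
    # Hypothetical component that breaks on malformed records.
    key, value = record.split("=")
    return {key: float(value)}

# In production, the persistent pipeline saved the input that
# caused the component to fail.
failed = Path("failed_input.pkl")
failed.write_bytes(pickle.dumps("threshold=oops"))

# Locally: load the exact failing input and step through the
# component in the IDE debugger (e.g., PyCharm with an interpreter
# built from the Docker image's requirements file).
record = pickle.loads(failed.read_bytes())
try:
    parse_record(record)
except ValueError as e:
    print("reproduced failure:", e)
```

The key property is determinism: replaying the persisted input reproduces the exact production failure on the developer's machine, with breakpoints instead of log archaeology.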
Final thoughts and future work
Pipelines and components are intuitive and powerful tools for software product architectures. I foresee new hybrid pipelines combining Data and Machine Learning components being proposed soon, built on time-tested Batch Architectures.