Machine Learning as a Flow: Kubeflow vs. Metaflow
Both Kubeflow (2018, Google) and Metaflow (2019, Netflix) are great Machine Learning platforms for experimentation, development, and production deployment. Having used both, I compare them below.
For starters, Kubeflow is a project that helps you deploy machine learning workflows on Kubernetes. Metaflow, on the other hand, is a Python library that helps data scientists build and manage real-life data science projects. Both were developed to boost the productivity of data scientists by giving them state-of-the-art machine learning tooling.
I recommend these videos, which illustrate the purpose of a Machine Learning Platform:
3-min: https://youtu.be/sdbBcPuvw40 “Spell: Next-Generation Machine Learning Platform”,
33-min: https://youtu.be/lu5zHvpQeSI “Managing ML in Production with Kubeflow and DevOps — David Aronchick, Microsoft”
Also 10-min Metaflow read: https://towardsdatascience.com/learn-metaflow-in-10-mins-netflixs-python-r-framework-for-data-scientists-2ef124c716e4
If you ask me why one would need MLOps, my one-word answer will be: “SCALE”
Adding “SCALE” To Your ML
By adding Scale to every aspect of DS work, the Machine Learning Platform improves outcomes.
Similarities Between Kubeflow and Metaflow:
Both provide the building blocks of a Machine Learning Platform.
At a high level, both Kubeflow and Metaflow help with the following:
- Collaboration: keep track of the experiments and give the team access to them.
- Resume a run: if a run failed (or was stopped intentionally), you can restart the workflow from where it failed or stopped.
- Hybrid runs: run one step of your workflow (such as data loading and aggregation) on high-memory CPUs and another, compute-intensive step (such as model training) on a low-memory GPU.
- Inspecting experiments and using metadata: data scientists can compare runs while tuning hyperparameters on the same model and data.
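The "resume a run" idea can be illustrated with a minimal plain-Python sketch. It is not the API of Kubeflow or Metaflow; the `run_pipeline` helper, the JSON checkpoint file, and the step names are all hypothetical, and the sketch assumes every step result is JSON-serializable.

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(steps, state_file):
    """Run steps in order, skipping any that a previous run already completed."""
    state_path = Path(state_file)
    # Load the checkpoint left by an earlier run, if there is one.
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    executed = []
    for name, func in steps:
        if name in state:
            continue  # finished in an earlier run; reuse its stored result
        state[name] = func(state)
        # Persist after every step so a failure loses at most one step of work.
        state_path.write_text(json.dumps(state))
        executed.append(name)
    return state, executed


def load(state):
    return [1, 2, 3, 4]


def train_failing(state):
    raise RuntimeError("simulated training failure")


def train_fixed(state):
    return sum(state["load"]) / len(state["load"])


checkpoint = Path(tempfile.mkdtemp()) / "run_state.json"

# First attempt: "load" succeeds, then "train" fails.
try:
    run_pipeline([("load", load), ("train", train_failing)], checkpoint)
except RuntimeError:
    pass

# Second attempt resumes: "load" is skipped and only "train" is executed.
final_state, executed = run_pipeline(
    [("load", load), ("train", train_fixed)], checkpoint
)
```

The real platforms persist far more than this (code versions, artifacts, metadata), but the principle is the same: completed steps are not recomputed.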
Kubeflow versus Metaflow: Comparative Analysis
Goal
… highly subjectively
Metaflow provides Python-level Machine Learning workflows for running ML experiments reproducibly and at Scale. Kubeflow provides Docker-image-level Machine Learning workflows for the same purpose.
Distributed computation
Both offer support for pipelines and components running in parallel.
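As a rough illustration of parallel components, here is a fan-out/join pattern in plain Python. It uses a thread pool from the standard library instead of either platform, and the `train` function and its scoring formula are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def train(learning_rate):
    """Stand-in for a training step; returns a made-up validation score."""
    return {"learning_rate": learning_rate, "score": 1.0 - abs(learning_rate - 0.1)}


def fan_out_and_join(learning_rates):
    # Fan out: one independent task per hyperparameter value.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(train, learning_rates))
    # Join: gather every branch result and keep the best one.
    return max(results, key=lambda r: r["score"])


best = fan_out_and_join([0.01, 0.1, 0.5])
```

On a real platform each branch would typically run as its own job or container rather than as a thread, but the pipeline shape (split, run in parallel, join) is the same.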
I like that the Kubeflow architecture has Kubernetes (k8s) ‘under the hood’. This solves problems such as cloud deployment and cloud migration: k8s is open-source and will install on any cloud. Also, DevOps teams are often familiar with k8s or willing to adopt it, and a significant number of third-party tools are available for k8s cluster monitoring.
Data exchange between ML pipeline components
Kubeflow doesn’t support direct Python communication between components; it operates on Docker images instead. In my personal opinion, the strongest Metaflow feature is that the data passed between components is just Python objects, which can be accessed, monitored, and debugged.
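The difference between the two hand-off styles can be sketched in plain Python. This is an illustration, not code for either library: it assumes that isolated, container-style steps exchange data as files, and all function and file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


# Style 1: steps share one Python process, so the hand-off is an ordinary object.
def preprocess_in_memory(rows):
    return [r * 2 for r in rows]


def train_in_memory(features):
    return {"mean": sum(features) / len(features)}


# Style 2: steps are isolated (as separate containers would be), so each one
# reads its input from a file and writes its output to a file.
def preprocess_to_file(input_path, output_path):
    rows = json.loads(Path(input_path).read_text())
    Path(output_path).write_text(json.dumps([r * 2 for r in rows]))


def train_from_file(input_path, output_path):
    features = json.loads(Path(input_path).read_text())
    model = {"mean": sum(features) / len(features)}
    Path(output_path).write_text(json.dumps(model))


rows = [1, 2, 3]

# In-memory hand-off: the intermediate list can be inspected in a debugger.
model_a = train_in_memory(preprocess_in_memory(rows))

# File-based hand-off: every intermediate result goes through serialization.
workdir = Path(tempfile.mkdtemp())
(workdir / "rows.json").write_text(json.dumps(rows))
preprocess_to_file(workdir / "rows.json", workdir / "features.json")
train_from_file(workdir / "features.json", workdir / "model.json")
model_b = json.loads((workdir / "model.json").read_text())
```

Both styles produce the same result; the trade-off is convenience of inspection in the first versus isolation between steps in the second.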
Cloud
Metaflow currently locks you into AWS.
Kubeflow allows deployment on any cloud, such as GCP, Azure, or anything else that runs k8s.
Model Serving
via external tooling:
Kubeflow — AI Hub, or TFX at Google Cloud Platform
Metaflow — Sagemaker/AWS for (M)
Horizontal scaling
Kubeflow builds upon the Kubernetes cluster, with powerful support for scaling and monitoring.
Metaflow runs on AWS ECS (through AWS Batch).
Costs
This is highly subjective, but I think that similar computational loads on Google Cloud Platform may cost less compared to AWS.
Summary
By deploying and using a Machine Learning Platform with Kubeflow or Metaflow, one gets support for the most common Machine Learning scenarios, such as managing code, data, and dependencies for the experiments. Popular Machine Learning use cases are described here: https://towardsdatascience.com/learn-metaflow-in-10-mins-netflixs-python-r-framework-for-data-scientists-2ef124c716e4
Major Kubeflow Advantages
Kubeflow can run multiple versions of the same code simultaneously on one cluster: one is free to choose the version, language, and packages for any step of one’s workflow. In Kubeflow, this is done using components implemented as independent Docker images. This is not possible in Metaflow.
Additionally, Kubeflow does not lock you into a particular cloud provider. It abstracts over open-source Kubernetes and Airflow or other orchestration tools. However, using TensorFlow Extended (TFX) will lock you into Apache Beam, which is provided as Dataflow on Google Cloud Platform. Kubeflow also drives Data Science teams towards Docker containerization and familiarity with Kubernetes clusters. Metaflow, on the other hand, currently locks you into AWS Sagemaker/S3/Batch.
Major Metaflow Advantages
Metaflow is a Python package that delivers Machine Learning Platform functionality, such as tracking and reproducibility, from Day One.
ML pipelines, components, and inter-component messaging in Metaflow are simply Python functions and objects that can be monitored, debugged, and inspected as source code. In Kubeflow, creating an ML pipeline is more like creating a batch file of consecutively running commands. For example, in Metaflow, adding loops, ‘if’-s, and other statements to an ML pipeline is just Python code. In Kubeflow, ‘if’-s and loops are supported, but they require complex machinery, and no other statements are supported.
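To make the point about ordinary control flow concrete, here is a small plain-Python sketch of a pipeline that uses a loop and an ‘if’. It does not use the Metaflow or Kubeflow APIs; the `train` stand-in and its accuracy numbers (in percent) are invented for illustration.

```python
def train(depth):
    """Stand-in for model training; returns a made-up accuracy in percent."""
    return min(50 + 10 * depth, 95)


def pipeline(target_accuracy, max_depth):
    history = []
    # A plain loop: try deeper models until one is good enough.
    for depth in range(1, max_depth + 1):
        accuracy = train(depth)
        history.append((depth, accuracy))
        # A plain 'if': stop early once the target is reached.
        if accuracy >= target_accuracy:
            return {"status": "deploy", "depth": depth, "history": history}
    # The loop ended without reaching the target.
    return {"status": "needs_review", "depth": None, "history": history}


result = pipeline(target_accuracy=80, max_depth=5)
```

When the pipeline is ordinary code, branching like this needs no special constructs, and a debugger can step through it line by line.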