Machine Learning as a Flow: Kubeflow vs. Metaflow

Roman Kazinnik
Nov 9, 2020

Both Kubeflow (2018, Google) and Metaflow (2019, Netflix) are great Machine Learning platforms for experimentation, development, and production deployment. Having used both of these, here is my comparative analysis.

For starters, Kubeflow is a project that helps you deploy machine learning workflows on Kubernetes. Metaflow, on the other hand, is a Python library that helps data scientists build and manage real-life data science projects. Both were developed to boost the productivity of data scientists by giving them state-of-the-art machine learning tooling.

To get a feel for the purpose of a Machine Learning Platform, I recommend watching these videos:

3-min: https://youtu.be/sdbBcPuvw40 “Spell: Next-Generation Machine Learning Platform”,

33-min: https://youtu.be/lu5zHvpQeSI “Managing ML in Production with Kubeflow and DevOps — David Aronchick, Microsoft”

Also 10-min Metaflow read: https://towardsdatascience.com/learn-metaflow-in-10-mins-netflixs-python-r-framework-for-data-scientists-2ef124c716e4

If you ask me why one would need MLOps, my one-word answer will be: “SCALE”.

A Machine Learning Platform facilitates better Machine Learning by providing SCALE (of both compute and data) and FLOW (pipelines).

Adding “SCALE” To Your ML

By adding scale to every aspect of data science work, a Machine Learning Platform improves outcomes.

Similarities Between Kubeflow and Metaflow:

Both provide the building blocks of a Machine Learning Platform.

At a high level, both Kubeflow and Metaflow help with the following:

  • Collaboration: keep track of experiments and share access to them.
  • Resume a run: when a run fails (or is stopped intentionally), you can restart the workflow from where it failed or stopped.
  • Hybrid runs: run one step of your workflow (such as data loading and aggregation) on high-memory CPUs and another, compute-intensive step (such as model training) on a low-memory GPU.
  • Inspecting experiments and using metadata: data scientists can compare runs while tuning hyperparameters on the same model and data.
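To make the "resume a run" idea concrete, here is a minimal sketch in plain Python. It is not the API of either Kubeflow or Metaflow; the `run_pipeline` helper, the step names, and the JSON checkpoint files are all assumptions made up for illustration. Each finished step saves its output, so a second run only executes the step that failed.

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(steps, checkpoint_dir):
    """Run named steps in order, skipping any step whose output is already saved."""
    checkpoint_dir = Path(checkpoint_dir)
    data = None
    executed = []
    for name, func in steps:
        saved = checkpoint_dir / f"{name}.json"
        if saved.exists():
            # Step finished in an earlier run: reuse its stored output.
            data = json.loads(saved.read_text())
            continue
        data = func(data)  # may raise; earlier checkpoints stay on disk
        saved.write_text(json.dumps(data))
        executed.append(name)
    return data, executed


# Hypothetical steps standing in for real data loading and training.
def load(_):
    return [1, 2, 3, 4]


def aggregate(rows):
    return {"total": sum(rows)}


def make_train(fail):
    def train(stats):
        if fail:
            raise RuntimeError("simulated training failure")
        return {"model_weight": stats["total"] / 10}
    return train


workdir = tempfile.mkdtemp()

# First run: the last step fails after the first two have been checkpointed.
first_run_failed = False
try:
    run_pipeline(
        [("load", load), ("aggregate", aggregate), ("train", make_train(True))],
        workdir,
    )
except RuntimeError:
    first_run_failed = True

# Second run: only the failed step executes again.
result, executed = run_pipeline(
    [("load", load), ("aggregate", aggregate), ("train", make_train(False))],
    workdir,
)
print(result, executed)
```

Real platforms also keep the code version, parameters, and metadata for each run, which this toy version leaves out.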
Metaflow vs. Kubeflow

Kubeflow versus Metaflow Comparative analysis

Goal

… highly subjectively

Metaflow provides Python-level Machine Learning workflows for running ML experiments at scale and reproducibly. Kubeflow provides Docker-image-level Machine Learning workflows for running ML experiments at scale and reproducibly.

Distributed computation

Both offer support for pipelines and components running in parallel.

I like the Kubeflow architecture, which provides Kubernetes (k8s) ‘under the hood’. This solves problems such as cloud deployment and cloud migration: k8s is open-source and will install on any cloud. DevOps teams are also often familiar with k8s or willing to adopt it, and a significant number of third-party tools are available for k8s cluster monitoring.

Data exchange between ML pipeline components

Kubeflow doesn’t support Python-level communication between components; it operates on Docker images instead. In my personal opinion, the strongest Metaflow feature is that the data passed between components is just Python objects, which can be accessed, monitored, and debugged.
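The contrast can be sketched in plain Python. This is an illustration only, not the API of either tool: the function names, the JSON format, and the file names are assumptions. The first style hands a live object from one step to the next inside one process. The second mimics isolated components that only agree on file paths and a serialization format.

```python
import json
import tempfile
from pathlib import Path


# Style 1: steps in one Python process hand each other live objects.
def prepare():
    return {"features": [0.5, 1.5, 2.0], "label": "demo"}


def train(dataset):
    # The object can be inspected or debugged here like any Python value.
    return sum(dataset["features"]) / len(dataset["features"])


in_memory_result = train(prepare())


# Style 2: isolated components exchange data through files on shared storage.
def prepare_component(output_path):
    Path(output_path).write_text(json.dumps(prepare()))


def train_component(input_path, output_path):
    dataset = json.loads(Path(input_path).read_text())
    Path(output_path).write_text(json.dumps({"mean": train(dataset)}))


workdir = Path(tempfile.mkdtemp())
prepare_component(workdir / "dataset.json")
train_component(workdir / "dataset.json", workdir / "model.json")
file_based_result = json.loads((workdir / "model.json").read_text())["mean"]

print(in_memory_result, file_based_result)
```

Both styles compute the same value. The difference is where you look when something goes wrong: a Python debugger in the first case, serialized files and container logs in the second.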

Cloud

Metaflow locks you into AWS.

Kubeflow can be deployed on any cloud: GCP, Azure, or anything else that runs k8s.

Model Serving

Both rely on external tooling:

Kubeflow: AI Hub, or TFX at Google Cloud Platform

Metaflow: SageMaker on AWS

Horizontal scaling

Kubeflow builds upon a Kubernetes cluster, with powerful support for scaling and monitoring.

Metaflow runs on AWS Batch, which schedules its containers on ECS.

Costs

This is highly subjective, but I think that similar computational loads on Google Cloud Platform may cost less compared to AWS.

Summary

By deploying a Machine Learning Platform with Kubeflow or Metaflow, one gets support for the most common Machine Learning scenarios, such as managing code, data, and dependencies for experiments. Popular Machine Learning use cases are described here: https://towardsdatascience.com/learn-metaflow-in-10-mins-netflixs-python-r-framework-for-data-scientists-2ef124c716e4

Major Kubeflow Advantages

Kubeflow can run multiple versions of the same code simultaneously on one cluster: one is free to choose the version, language, and packages for any step of one’s workflow. In Kubeflow, this is done with components implemented as independent Docker images. Metaflow does not offer the same per-step freedom.

Additionally, Kubeflow does not lock you into a particular cloud provider. It sits on top of open-source Kubernetes and orchestration tools such as Airflow. However, using TensorFlow Extended (TFX) will lock you into Apache Beam, which Google Cloud Platform provides as Dataflow. Kubeflow also drives Data Science teams towards Docker containerization and familiarity with Kubernetes clusters. Metaflow, on the other hand, currently locks you into AWS (SageMaker, S3, Batch).

Major Metaflow Advantages

Metaflow is a Python package that delivers Machine Learning Platform functionality such as tracking and reproducibility from day one.

ML pipelines, components, and inter-component messaging in Metaflow are simply Python functions and objects that can be monitored and debugged like any other source code. In Kubeflow, creating an ML pipeline is more like writing a batch file of commands that run one after another. For example, in Metaflow, adding loops, ‘if’-s, and other statements to an ML pipeline is just Python code. In Kubeflow, ‘if’-s and loops are supported but require complex machinery, and no other statements are supported.
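As a small illustration of what "just Python code" means here, the sketch below puts a loop and two conditionals into a pipeline-like function. It is not a Metaflow flow, and the `evaluate` function is a hypothetical stand-in for training and scoring a model; the point is only that ordinary language statements drive the control flow.

```python
def evaluate(learning_rate):
    # Hypothetical stand-in for training and scoring a model.
    return 1.0 - abs(learning_rate - 0.1)


def sweep(learning_rates, good_enough=0.95):
    """Try each learning rate in turn and stop once the score is good enough."""
    best_rate, best_score = None, float("-inf")
    for rate in learning_rates:        # ordinary loop over candidate settings
        score = evaluate(rate)
        if score > best_score:         # ordinary conditional
            best_rate, best_score = rate, score
        if best_score >= good_enough:  # stop early once the target is met
            break
    return best_rate, best_score


best_rate, best_score = sweep([0.5, 0.2, 0.1, 0.05])
print(best_rate, best_score)
```

In a container-based pipeline, the same early-exit logic has to be expressed through the pipeline tool's own conditional and loop constructs rather than through the language itself.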
