Distributed Hyperparameter Search in Kubeflow/Kubernetes: Keras Tuner vs. Katib

Roman Kazinnik
Nov 19, 2020

Although Katib is Kubeflow's built-in Hyperparameter Search (HS) tool, here is why I choose Keras Tuner for distributed HS:

  • Keeps the codebase independent of Kubeflow
  • Keeps Kubeflow experiments independent of codebase decisions
  • Keras (TensorFlow)-trained models are also HS-optimized with Keras
  • Keeps HS part of the codebase, with source-code versioning and CI/CD
  • Avoids running hundreds of Kubeflow experiments for a single HS

The goal of HS is to produce a list of model architectures and model parameters sorted by model accuracy score. An asynchronous HS produces such a list progressively, updating it as new trials are computed. Keras Tuner makes HS distributed over a set of CPUs or GPUs without any code change. Progressive mode can also be achieved with Keras Tuner by incrementally increasing the number of evaluated models.
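
As a minimal sketch of what this looks like in code (the model, dataset, and hyperparameters below are illustrative and not from this article; the package is imported as keras_tuner in current releases, kerastuner in older ones):

import keras_tuner as kt
import tensorflow as tf

# Illustrative search space: two hyperparameters over a small MNIST classifier.
def build_model(hp):
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(hp.Int("units", 32, 256, step=32), activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(hp.Choice("lr", [1e-2, 1e-3, 1e-4])),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

(x_train, y_train), (x_val, y_val) = tf.keras.datasets.mnist.load_data()
x_train, x_val = x_train / 255.0, x_val / 255.0

tuner = kt.RandomSearch(build_model, objective="val_accuracy",
                        max_trials=20, directory="hs_results", project_name="demo")
tuner.search(x_train, y_train, epochs=2, validation_data=(x_val, y_val))

tuner.results_summary()                             # trials ranked by validation accuracy
best_models = tuner.get_best_models(num_models=3)   # the top of the sorted list

Because the trial state lives under directory/project_name, re-running the same script with a larger max_trials evaluates additional models on top of the trials already recorded, which is one way to get the progressive behavior described above.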

Here I illustrate how to run HS in Kubeflow/Kubernetes:

  1. Deployment
    Keras Tuner HS deploys a 'chief' and 'workers' using all-reduce (or simply 'reduce'): the 'chief' sends out requests to compute individual trials, and the 'workers' compute trials independently and exit upon instruction from the 'chief' (a short sketch of this role split follows this list).
  2. Creation
    A Kubernetes StatefulSet creates a distributed pool of pods/applications. These pods share the k8s context but do not provide a Chief-Workers architecture (see my figure).
  3. Chief-Workers
    Kubeflow TFJob implements a Chief-Workers deployment, which is exactly what Keras Tuner needs: https://www.kubeflow.org/docs/components/training/tftraining/
  4. I didn't try it, but TFJob should also support another distributed Hyperparameter Search framework, Ray.
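
To make the Chief-Workers split concrete: in Keras Tuner the same tuning script runs on every pod, and three environment variables (the names below come from the Keras Tuner distributed tuning guide; the hostname value is illustrative) decide whether a process acts as the chief or as a worker:

import os

# Role selection: "chief" runs the oracle service that hands out trials;
# any other id ("tuner0", "tuner1", ...) makes the process a worker.
os.environ["KERASTUNER_TUNER_ID"] = "chief"        # "tuner0" on the first worker pod, etc.

# Every pod must be able to reach the chief's oracle service.
os.environ["KERASTUNER_ORACLE_IP"] = "chief-host"  # illustrative: a resolvable address of the chief pod
os.environ["KERASTUNER_ORACLE_PORT"] = "8000"      # Keras Tuner's default port

In practice these variables are set in the pod spec rather than in Python; once they are set, the unmodified tuner.search(...) call behaves as chief or worker automatically.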

Here is how to scale HS horizontally:

  1. Workers run independent trials on GPUs or CPUs, on a single node or across multiple nodes (see the sketch after this list for multi-GPU workers).
  2. For me, the best feature of Keras Tuner is that a Docker component that runs Keras Tuner on a single node requires no code change and no Docker image change to run distributed; only a new TFJob YAML file needs to be created.
  3. This Keras Tuner distributed sample shows how to add Keras Tuner to a Keras model with less than ten lines of Python code, and how to run Keras Tuner in distributed mode without code change:
    https://keras-team.github.io/keras-tuner/tutorials/distributed-tuning/
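
On point 1 above: each worker can additionally parallelize its own trial across the GPUs of its node by passing a tf.distribute strategy to the tuner (Keras Tuner exposes a distribution_strategy argument for this). A sketch, reusing the build_model hypermodel from the earlier example:

import keras_tuner as kt
import tensorflow as tf

# Each trial trained by this worker is data-parallel across the node's GPUs;
# the trials themselves stay independent across workers.
tuner = kt.RandomSearch(
    build_model,                                     # hypermodel from the earlier sketch
    objective="val_accuracy",
    max_trials=20,
    distribution_strategy=tf.distribute.MirroredStrategy(),
    directory="hs_results",
    project_name="demo_gpu",
)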

Explanation of the TFJob YAML and how to run it on a Kubeflow cluster (more information: https://www.kubeflow.org/docs/components/training/tftraining/)

The first step is to dockerize the Keras Tuner sample so that it can be run as "python run_tuning.py". Explaining and editing the YAML file:

  1. image: the Docker image with Keras Tuner
  2. PS and Worker: the PS replica acts as the Keras Tuner chief, and the Worker replicas act as the Keras Tuner workers
  3. For Keras Tuner, add the following to both PS and Worker, under 'containers':
    ports:
      - containerPort: 8000
        name: chief-service
    resources:
      limits:
        cpu: "1"
    workingDir: /
  4. port=8000 is the Keras Tuner default and can be changed. I selected 'cpu' on the Chief; choose 'cpu' or 'gpu' on the workers. (A sketch of how run_tuning.py can derive its Keras Tuner role from the TFJob environment follows this list.)
  5. Run in k8s cluster:
    kubectl apply -f file.yaml
  6. Delete running pods from k8s:
    kubectl delete -f file.yaml
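
As mentioned in item 4 above, here is a sketch of how run_tuning.py could derive its Keras Tuner role from the environment provided by the TFJob operator. TFJob injects a TF_CONFIG variable describing the cluster and the pod's own task; the mapping below (PS replica = Keras Tuner chief, Worker replicas = tuners, port 8000) is assumed glue code, not an API of Keras Tuner or Kubeflow:

import json
import os

# TF_CONFIG is injected by the TFJob operator, e.g. (hostnames are hypothetical):
# {"cluster": {"ps": ["myjob-ps-0:2222"], "worker": ["myjob-worker-0:2222"]},
#  "task": {"type": "worker", "index": 0}}
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
task = tf_config.get("task", {})
cluster = tf_config.get("cluster", {})

if task.get("type") == "ps":                # the PS replica plays the Keras Tuner chief
    os.environ["KERASTUNER_TUNER_ID"] = "chief"
else:                                       # each Worker replica becomes an independent tuner
    os.environ["KERASTUNER_TUNER_ID"] = "tuner%d" % task.get("index", 0)

# Point every process at the chief's oracle service on the port from the YAML above.
chief_hosts = cluster.get("ps", [])
if chief_hosts:
    os.environ["KERASTUNER_ORACLE_IP"] = chief_hosts[0].split(":")[0]
os.environ["KERASTUNER_ORACLE_PORT"] = "8000"

# ...then build the tuner and call tuner.search() exactly as in the single-node script.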

Your Turn

So, what are your views on this? Do you feel stuck anywhere? Let me know in the comments below, and I will try my best to answer ASAP.
