Distributed Hyperparameter Search in Kubeflow/Kubernetes: Keras Tuner vs. Katib

Roman Kazinnik
3 min read · Nov 19, 2020


Although Katib is Kubeflow's built-in Hyperparameter Search (HS) tool, here is why I choose Keras Tuner for distributed HS:

  • Keeps the codebase independent of Kubeflow
  • Keeps Kubeflow experiments independent of codebase decisions
  • Models trained with Keras (TensorFlow) are also HS-optimized within Keras, in the same framework
  • Keeps HS part of the codebase, with source-code versioning and CI/CD
  • Avoids running hundreds of Kubeflow experiments for a single HS

The goal of HS is to produce a list of model architectures and model parameters sorted by model accuracy score. Asynchronous HS produces such a list progressively, updating it as new trials complete. Keras Tuner makes HS distributed across a set of CPUs or GPUs without any code change. The progressive mode can also be achieved with Keras Tuner by incrementally increasing the number of evaluated models.
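
To make the sorted list concrete, here is a minimal sketch (not code from this article) of how a completed Keras Tuner search exposes it; it assumes a tuner object whose search() has already finished:

    # Assumes `tuner` is any keras-tuner Tuner whose search() has completed.
    tuner.results_summary()                                   # trials sorted by the objective score
    best_hps = tuner.get_best_hyperparameters(num_trials=10)  # top-10 hyperparameter sets
    best_models = tuner.get_best_models(num_models=3)         # top-3 trained models
    # Progressive mode: re-create the tuner with a larger max_trials and the same
    # directory/project_name; completed trials are reloaded and only new trials run.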

Here I illustrate how to run HS in Kubeflow/Kubernetes:

  1. Deployment
    Keras-Tuner HS deploys a 'chief' and 'workers' using All-reduce (or simply 'reduce'): the 'chief' sends out requests to compute individual trials, and the 'workers' compute trials independently and exit upon instructions from the 'chief' (see the environment-variable sketch after this list).
  2. Creation
    A Kubernetes StatefulSet creates a distributed pool of pods/applications. These pods share the k8s context but do not by themselves provide a Chief-Workers architecture (see my figure).
  3. Chief-Workers
    Kubeflow TFJob implements the Chief-Workers deployment that Keras-Tuner needs: https://www.kubeflow.org/docs/components/training/tftraining/
  4. I haven't tried it, but TFJob should also support another distributed Hyperparameter Search framework, Ray.
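
Under the hood, Keras-Tuner tells the chief and the workers apart with three environment variables; this is the mechanism described in the distributed-tuning tutorial linked below, and the host name here is only a placeholder:

    import os

    # On the chief pod: runs the oracle service that hands out trials.
    os.environ["KERASTUNER_TUNER_ID"] = "chief"
    os.environ["KERASTUNER_ORACLE_IP"] = "0.0.0.0"          # address the chief listens on
    os.environ["KERASTUNER_ORACLE_PORT"] = "8000"

    # On each worker pod: a unique tuner id plus the chief's address.
    # os.environ["KERASTUNER_TUNER_ID"] = "tuner0"          # tuner1, tuner2, ... on other workers
    # os.environ["KERASTUNER_ORACLE_IP"] = "chief-host"     # placeholder: a resolvable chief address
    # os.environ["KERASTUNER_ORACLE_PORT"] = "8000"

In a TFJob these values would normally be set in the pod spec (env entries) rather than in Python; the snippet only illustrates what each role needs.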

Here is how to scale HS horizontally:

  1. Workers run independent trials on GPU or CPU, on a single node or on multiple nodes.
  2. For me, the best feature of Keras-Tuner is that a Docker component that runs Keras-Tuner on a single node requires no code change and no Docker image change to run distributed; only a new TFJob YAML file needs to be created.
  3. This Keras-Tuner distributed sample shows how to add Keras-Tuner to a Keras model with fewer than ten lines of Python code and run Keras-Tuner in distributed mode without code change (a minimal sketch follows this list):
    https://keras-team.github.io/keras-tuner/tutorials/distributed-tuning/
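
For reference, a minimal sketch in the spirit of that sample; the model, data, and names (build_model, x_train, y_train) are illustrative, not the tutorial's exact code:

    import keras_tuner as kt             # older releases: import kerastuner as kt
    from tensorflow import keras

    def build_model(hp):
        # Hyperparameters are declared inline with hp.* calls.
        model = keras.Sequential([
            keras.layers.Dense(hp.Int("units", 32, 256, step=32), activation="relu"),
            keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(
            optimizer=keras.optimizers.Adam(hp.Choice("lr", [1e-2, 1e-3, 1e-4])),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"],
        )
        return model

    tuner = kt.RandomSearch(build_model, objective="val_accuracy",
                            max_trials=20, directory="results", project_name="hs_demo")
    tuner.search(x_train, y_train, epochs=5, validation_split=0.2)  # x_train/y_train assumed

Exactly the same script runs on the chief and on every worker; the environment variables shown earlier decide the role.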

Explanation of the TFJob YAML and how to run it on a Kubeflow cluster (more information: https://www.kubeflow.org/docs/components/training/tftraining/)

The first step is to dockerize the Keras-Tuner sample, which can be run as "python run_tuning.py". Explaining and editing the YAML file:

  1. image: Docker image with Keras-tuner
  2. PS and Worker: these replica types play the Chief and Worker roles
  3. For Keras-Tuner, add the following to both PS and Worker under 'containers':
    ports:
    - containerPort: 8000
      name: chief-service
    resources:
      limits:
        cpu: "1"
    workingDir: /
  4. port=8000 is selected by Keras-Tuner by default and can be changed. I selected 'cpu' for the Chief; choose 'cpu' or 'gpu' for the workers (see the TF_CONFIG sketch after this list).
  5. Run in k8s cluster:
    kubectl apply -f file.yaml
  6. Delete running pods from k8s:
    kubectl delete -f file.yaml
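
Finally, the TFJob controller injects a TF_CONFIG environment variable into every replica. One possible way to wire it to the Keras-Tuner variables shown earlier is glue code like the sketch below in run_tuning.py; this is my own assumption about the wiring, not code from the Kubeflow or Keras-Tuner docs:

    import json
    import os

    # TF_CONFIG looks roughly like:
    # {"cluster": {"ps": ["host:port"], "worker": ["host:port", ...]},
    #  "task": {"type": "worker", "index": 0}}
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    cluster = tf_config.get("cluster", {})

    if task.get("type") == "ps":
        # The PS replica plays the Keras-Tuner chief.
        os.environ["KERASTUNER_TUNER_ID"] = "chief"
    else:
        # Worker replicas need unique tuner ids.
        os.environ["KERASTUNER_TUNER_ID"] = "tuner%d" % task.get("index", 0)

    # Point every pod at the chief's oracle service (port 8000 from the YAML above).
    chief_host = cluster.get("ps", ["localhost:8000"])[0].split(":")[0]
    os.environ["KERASTUNER_ORACLE_IP"] = chief_host
    os.environ["KERASTUNER_ORACLE_PORT"] = "8000"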

Your Turn

So, what are your views on this? Do you feel stuck anywhere? Let me know in the comments below, and I will try my best to answer them ASAP.
