Distributed Hyperparameter Search in Kubeflow/Kubernetes: Keras Tuner vs. Katib
Although Katib is Kubeflow's built-in Hyperparameter Search (HS) component, here is why I chose Keras Tuner for distributed HS:
- Keeps codebase independent of Kubeflow
- Keeps Kubeflow experiments independent of the codebase decisions
- Models trained with Keras (TensorFlow) are also HS-optimized with Keras itself, so training and tuning stay in one framework
- Keeps HS as part of the codebase, under source code versioning and CI/CD
- Prevents running hundreds of Kubeflow experiments for a single HS
HS’s goal is to produce a list of model architectures and model parameters sorted by model accuracy score. Asynchronous HS produces such a list progressively, updating it as new trials are computed. Keras Tuner makes HS distributed across a set of CPUs or GPUs without any code change. The progressive mode can also be achieved with Keras Tuner by incrementally increasing the number of evaluated models.
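As a minimal sketch of what this looks like in code (assuming a build_model(hp) function such as the one shown later, placeholder dataset variables, and a recent keras-tuner release where the package imports as keras_tuner; older releases use kerastuner):

import keras_tuner as kt

# build_model(hp) builds and compiles a Keras model from the hyperparameters in hp.
tuner = kt.RandomSearch(
    build_model,
    objective='val_accuracy',    # trials are ranked by this metric
    max_trials=20,               # number of models to evaluate
    directory='hs_results',      # placeholder path; trial state is persisted here
    project_name='my_model_hs',  # placeholder project name
    overwrite=False,             # keep earlier trials so the search can resume
)

# x_train, y_train, x_val, y_val are placeholders for the training data.
tuner.search(x_train, y_train, epochs=5, validation_data=(x_val, y_val))

# The ranked list of architectures/parameters described above:
tuner.results_summary()
best_hps = tuner.get_best_hyperparameters(num_trials=5)

Re-running the same script with a larger max_trials and the same directory/project_name should continue the search rather than start over, which is one way to get the progressive behaviour described above.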
Here I illustrate how to run HS in Kubeflow/Kubernetes:
- Deployment
Keras Tuner HS deploys a ‘chief’ and ‘workers’: the chief sends out requests to compute individual trials, and the workers compute trials independently and exit on instructions from the chief.
- Creation
A Kubernetes StatefulSet creates a distributed pool of pods/applications. These pods share the k8s context but do not provide a chief-workers architecture (see my figure).
- Chief-Workers
Kubeflow TFJob implements a chief-workers deployment, which is exactly what Keras Tuner needs (see the sketch below): https://www.kubeflow.org/docs/components/training/tftraining/ - I didn’t try it, but TFJob should also support another distributed Hyperparameter Search framework, Ray.
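As a sketch of how the chief/worker roles reach the pods: Keras Tuner's distributed mode is controlled by three environment variables (documented in the distributed-tuning tutorial linked below). The concrete values here, such as the chief-service hostname, are placeholders that would normally be injected through the TFJob pod spec rather than hard-coded:

import os

# Illustrative values only; in a TFJob these would come from the pod spec.
os.environ['KERASTUNER_TUNER_ID'] = 'chief'           # 'chief' on the chief pod, e.g. 'tuner0' on a worker
os.environ['KERASTUNER_ORACLE_IP'] = 'chief-service'  # hostname/IP where the chief is reachable (placeholder)
os.environ['KERASTUNER_ORACLE_PORT'] = '8000'         # must match the containerPort exposed on the chief

# The same tuning script is then started on every pod: the process whose
# KERASTUNER_TUNER_ID is 'chief' runs the oracle and hands out trials, while
# the other processes request trials, train models, and report results back.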
Here is how to scale HS horizontally:
- Workers run independent trials on GPUs or CPUs, both single-node and multi-node.
- For me, the best feature of Keras Tuner is that the Docker component running Keras Tuner on a single node requires no code change and no Docker image change to go distributed; only a new TFJob YAML file needs to be created to run the Keras Tuner component in distributed mode.
- This Keras Tuner distributed sample shows how to add Keras Tuner to a Keras model with less than ten lines of Python code (sketched after the link below) and run it in distributed mode without code change:
https://keras-team.github.io/keras-tuner/tutorials/distributed-tuning/
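For reference, a hypermodel function of roughly the size the tutorial describes might look like this sketch; the layer sizes and hyperparameter ranges are illustrative, while hp.Int and hp.Choice are the standard Keras Tuner hooks for declaring a search space:

import tensorflow as tf

def build_model(hp):
    # hp.Int / hp.Choice declare the search space; Keras Tuner picks
    # concrete values for each trial.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(
            units=hp.Int('units', min_value=32, max_value=256, step=32),
            activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            hp.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])
    return model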
Explanation of the TFJob YAML and how to run it on a Kubeflow cluster (more information: https://www.kubeflow.org/docs/components/training/tftraining/)
The first step is to dockerize the Keras Tuner sample so that it can be run as “python run_tuning.py”. Explaining and editing the YAML file:
- image: Docker image with Keras-tuner
- PS and Worker: these replica types play the roles of the Chief and the Workers
- For Keras Tuner, add to both PS and Worker under ‘containers’:
ports:
  - containerPort: 8000
    name: chief-service
resources:
  limits:
    cpu: '1'
workingDir: /
- port=8000 is the Keras Tuner default and can be changed. Request ‘cpu’ on the Chief; choose ‘cpu’ or ‘gpu’ on the Workers.
- Run in the k8s cluster:
kubectl apply -f file.yaml
- Delete the running pods from k8s:
kubectl delete -f file.yaml
Your Turn
So, what are your views on this? Do you feel stuck anywhere? Let me know in the comments below, and I will try my best to answer them ASAP.