Awesome
Read this in other languages: 中文.
Fabric for Deep Learning (FfDL)
This repository contains the core services of the FfDL (Fabric for Deep Learning) platform. FfDL is an operating system "fabric" for Deep Learning. It is a collaboration platform for:
- Framework-independent training of Deep Learning models on distributed hardware
- Open Deep Learning APIs
- Running Deep Learning hosting in user's private or public cloud
To know more about the architectural details, please read the design document. If you are looking for demos, slides, collaterals, blogs, webinars and other materials related to FfDL, please find them here
Prerequisites
kubectl
: The Kubernetes command line interface (https://kubernetes.io/docs/tasks/tools/install-kubectl/)helm
: The Kubernetes package manager (https://helm.sh)docker
: The Docker command-line interface (https://www.docker.com/)S3 CLI
: The command-line interface to configure your Object Storage- An existing Kubernetes cluster (e.g., Kubeadm-DIND for local testing or Follow the appropriate instructions for standing up your Kubernetes cluster using IBM Cloud Public or IBM Cloud Private). The minimum capacity requirement for FfDL is 4GB Memory and 3 CPUs. <!-- For Minikube, use the command `make minikube` to start Minikube and set up local network routes. Minikube **v0.25.1** is tested with Travis CI. -->
Usage Scenarios
- If you are getting started and want to setup your own FfDL deployment, please follow the steps below.
- If you have a FfDL deployment up and running, you can jump to FfDL User Guide to use FfDL for training your deep learning models.
- If you want to leverage Jupyter notebooks to launch training on your FfDL cluster, please follow these instructions
- If you have FfDL configured to use GPUs, and want to train using GPUs, follow steps here
- To invoke Adversarial Robustness Toolbox to find vulnerabilities in your models, follow the instructions here
- To deploy your trained models, follow the integration guide with Seldon
- If you have used FfDL to train your models, and want to use a GPU enabled public cloud hosted service for further training and serving, please follow instructions here to train and serve your models using Watson Studio Deep Learning service.
Steps
1. Quick Start
There are multiple installation paths for installing FfDL into an existing Kubernetes cluster. Below are the steps for quick install. If you want to follow more detailed step by step instructions , please visit the detailed installation guide
- You need to initialize tiller with
helm init
before running the following commands.
1.1 Installation using Kubernetes Cluster
To install FfDL to any proper Kubernetes cluster, make sure kubectl
points to the right namespace,
then deploy the platform services:
export NAMESPACE=default # If your namespace does not exist yet, please create the namespace `kubectl create namespace $NAMESPACE` before running the make commands below
export SHARED_VOLUME_STORAGE_CLASS="ibmc-file-gold" # Change the storage class to what's available on your Cloud Kubernetes Cluster.
helm install ibmcloud-object-storage-plugin --name ibmcloud-object-storage-plugin --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE # Configure s3 driver on the cluster
helm install ffdl-helper --name ffdl-helper --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE,shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS --wait # Deploy all the helper micro-services for ffdl
helm install ffdl-core --name ffdl-core --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE,lcm.shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS --wait # Deploy all the core ffdl services.
1.2 Installation using Kubeadm-DIND
If you have Kubeadm-DIND installed on your machine, use these commands to deploy the FfDL platform:
export SHARED_VOLUME_STORAGE_CLASS=""
export NAMESPACE=default
./bin/s3_driver.sh # Copy the s3 drivers to each of the DIND node
helm install ibmcloud-object-storage-plugin --name ibmcloud-object-storage-plugin --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE,cloud=false
helm install ffdl-helper --name ffdl-helper --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE,shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS,localstorage=true --wait
helm install ffdl-core --name ffdl-core --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE,lcm.shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS --wait
# Forward the necessary microservices from the DIND cluster to your localhost.
./bin/dind-port-forward.sh
2. Test
To submit a simple example training job that is included in this repo (see etc/examples
folder):
Note: For PUBLIC_IP, put down one of your Cluster Public IP that can access your Cluster's NodePorts. You can check your Cluster Public IP with
kubectl get nodes -o wide
. For IBM Cloud, you can get your Public IP withbx cs workers <cluster_name>
.
export PUBLIC_IP=<Cluster Public IP> # Put down localhost if you are running with Kubeadm-DIND
make test-push-data-s3
make test-job-submit
3. Monitoring
The platform ships with a simple Grafana monitoring dashboard. The URL is printed out when running the status
make target.
4. Development
Please refer to the developer guide for more details.
5. Clean Up
If you want to remove FfDL from your cluster, simply use the following commands.
helm delete --purge ffdl-core ffdl-helper
If you want to remove the storage driver from your cluster, run:
helm delete --purge ibmcloud-object-storage-plugin
For Kubeadm-DIND, you need to kill your forwarded ports. Note that the below command will kill all the ports that are created with kubectl
.
kill $(lsof -i | grep kubectl | awk '{printf $2 " " }')
6. Troubleshooting
- FfDL has only been tested under Mac OS and Linux
-
If
glide install
fails with an error complaining about non-existing paths (e.g., "Without src, cannot continue"), make sure to follow the standard Go directory layout (see Prerequisites section). -
To remove FfDL on your Cluster, simply run
make undeploy
-
When using the FfDL CLI to train a model, make sure your directory path doesn't have slashes
/
at the end. -
If your job is stuck in pending stage, you can try to redeploy the plugin with
helm install storage-plugin --set dind=true,cloud=false
for Kubeadm-DIND andhelm install storage-plugin
for general Kubernetes Cluster. Also, double check your training job manifest file to make sure you have the correct object storage credentials.
7. References
Based on IBM Research work in Deep Learning.
-
B. Bhattacharjee et al., "IBM Deep Learning Service," in IBM Journal of Research and Development, vol. 61, no. 4, pp. 10:1-10:11, July-Sept. 1 2017. https://arxiv.org/abs/1709.05871
-
Scott Boag, et al. Scalable Multi-Framework Multi-Tenant Lifecycle Management of Deep Learning Training Jobs, In Workshop on ML Systems at NIPS'17, 2017. http://learningsys.org/nips17/assets/papers/paper_29.pdf