Read this in other languages: 中文.


Fabric for Deep Learning (FfDL)

This repository contains the core services of the FfDL (Fabric for Deep Learning) platform. FfDL is an operating system "fabric" for Deep Learning. It is a collaboration platform for:

- Framework-independent training of Deep Learning models on distributed hardware
- Open Deep Learning APIs
- Running Deep Learning hosting in user's private or public cloud

*Figure: FfDL architecture*

To learn more about the architectural details, please read the design document. If you are looking for demos, slides, collateral, blogs, webinars, and other materials related to FfDL, please find them here.

Prerequisites

Usage Scenarios

Steps

  1. Quick Start
  2. Test
  3. Monitoring
  4. Development
  5. Clean Up
  6. Troubleshooting
  7. References

1. Quick Start

There are multiple paths for installing FfDL into an existing Kubernetes cluster. Below are the steps for a quick install. If you prefer more detailed step-by-step instructions, please visit the detailed installation guide.

1.1 Installation on a Kubernetes Cluster

To install FfDL on any suitable Kubernetes cluster, make sure `kubectl` points to the right namespace, then deploy the platform services:

```shell
export NAMESPACE=default # If your namespace does not exist yet, create it with `kubectl create namespace $NAMESPACE` before running the make commands below
export SHARED_VOLUME_STORAGE_CLASS="ibmc-file-gold" # Change the storage class to one available on your cloud Kubernetes cluster

helm install ibmcloud-object-storage-plugin --name ibmcloud-object-storage-plugin --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE # Configure the S3 driver on the cluster
helm install ffdl-helper --name ffdl-helper --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE,shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS --wait # Deploy all the helper microservices for FfDL
helm install ffdl-core --name ffdl-core --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE,lcm.shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS --wait # Deploy all the core FfDL services
```
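
Once the charts are installed, a quick sanity check confirms the platform came up; this is a minimal sketch using the release names from the commands above:

```shell
# All FfDL pods should reach Running/Ready
kubectl get pods -n $NAMESPACE

# Each release should report STATUS: DEPLOYED
helm status ffdl-helper
helm status ffdl-core
```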

1.2 Installation using Kubeadm-DIND

If you have Kubeadm-DIND installed on your machine, use these commands to deploy the FfDL platform:

```shell
export SHARED_VOLUME_STORAGE_CLASS=""
export NAMESPACE=default

./bin/s3_driver.sh # Copy the S3 drivers to each of the DIND nodes
helm install ibmcloud-object-storage-plugin --name ibmcloud-object-storage-plugin --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE,cloud=false
helm install ffdl-helper --name ffdl-helper --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE,shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS,localstorage=true --wait
helm install ffdl-core --name ffdl-core --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE,lcm.shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS --wait

# Forward the necessary microservices from the DIND cluster to your localhost.
./bin/dind-port-forward.sh
```
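
To confirm the forwards are in place, you can list the running `kubectl port-forward` processes; which services are forwarded depends on what `dind-port-forward.sh` sets up:

```shell
# The [k] pattern keeps grep from matching its own process
ps aux | grep "[k]ubectl port-forward"
```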

2. Test

To submit the simple example training job included in this repo (see the `etc/examples` folder):

Note: For `PUBLIC_IP`, use one of your cluster's public IPs that can reach the cluster's NodePorts. You can look up the public IPs with `kubectl get nodes -o wide`. On IBM Cloud, you can get your public IP with `bx cs workers <cluster_name>`.

```shell
export PUBLIC_IP=<Cluster Public IP> # Use localhost if you are running with Kubeadm-DIND
make test-push-data-s3
make test-job-submit
```
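
If you would rather not copy the IP by hand, a `jsonpath` query can pull the first node's external address; this assumes your nodes report an `ExternalIP` (not all providers do):

```shell
# May print nothing on clusters whose nodes have no ExternalIP
export PUBLIC_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="ExternalIP")].address}')
echo $PUBLIC_IP
```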

3. Monitoring

The platform ships with a simple Grafana dashboard for monitoring. Its URL is printed when you run `make status`.
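
If you want to look the URL up by hand, the following sketch assumes the chart exposes Grafana as a NodePort service named `grafana` in your namespace (the service name is an assumption; check `kubectl get svc` if it differs):

```shell
# Assumed service name "grafana"; verify with `kubectl get svc -n $NAMESPACE`
GRAFANA_PORT=$(kubectl get svc grafana -n $NAMESPACE -o jsonpath='{.spec.ports[0].nodePort}')
echo "Grafana dashboard: http://$PUBLIC_IP:$GRAFANA_PORT"
```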

4. Development

Please refer to the developer guide for more details.

5. Clean Up

If you want to remove FfDL from your cluster, simply use the following command.

```shell
helm delete --purge ffdl-core ffdl-helper
```

If you want to remove the storage driver from your cluster, run:

```shell
helm delete --purge ibmcloud-object-storage-plugin
```
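
Afterwards, you can verify nothing was left behind; both lists should come back without FfDL entries:

```shell
# No ffdl-* or storage-plugin releases should remain
helm list
# No FfDL pods should remain in the namespace
kubectl get pods -n $NAMESPACE
```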

For Kubeadm-DIND, you also need to kill the forwarded ports. Note that the command below kills every port-forward created with kubectl.

```shell
kill $(lsof -i | grep kubectl | awk '{printf $2 " "}')
```
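
A slightly more targeted alternative, if your system has `pkill`, is to match on the full command line so other kubectl invocations survive:

```shell
# Kill only kubectl port-forward processes
pkill -f "kubectl port-forward"
```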

6. Troubleshooting

- The default Minikube driver under Mac OS is VirtualBox, which is known for having issues with networking. We generally recommend Mac OS users install Minikube using the xhyve driver.
- When testing locally with Minikube, make sure to point the `docker` CLI to Minikube's Docker daemon: `eval $(minikube docker-env)`
- If you run into DNS name resolution issues using Minikube, make sure the system uses only `10.0.0.10` as its single nameserver. Using multiple nameservers can result in problems, particularly under Mac OS.

7. References

Based on IBM Research work in Deep Learning.