SLIDES: Distributed deep learning with containers on heterogeneous GPU clusters 

File uploaded by slimbaltagi on Mar 20, 2018

A presentation by Dong Meng at the 2018 Strata Data Conference in San Jose.

"There have been years of active research and development in deep learning, and organizations have begun to explore methods for training and serving deep learning models on a cluster in a distributed fashion. Many build a dedicated GPU HPC cluster that works well in a research or development setting, but data must then be moved constantly between clusters. This creates overhead both in managing the data used to train deep learning models and in managing the models themselves as they move between research/development and production.


Dong Meng outlines the topics that must be addressed to successfully apply distributed deep learning, such as consistency, fault tolerance, communication, resource management, and programming libraries. He offers an overview of a converged data platform that serves as the data infrastructure, providing a distributed filesystem, key-value storage, and streams, with Kubernetes as the orchestration layer managing the containers that train and deploy deep learning models on GPU clusters. Along the way, Dong demonstrates a simple distributed deep learning training program and explains how to leverage pub/sub capability to build global real-time deep learning applications on NVIDIA GPUs.
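The Kubernetes piece of this stack can be pictured with a minimal pod spec. The image name and command below are hypothetical placeholders; `nvidia.com/gpu` is the resource name exposed by NVIDIA's Kubernetes device plugin, which lets the scheduler place the container on a node with a free GPU:

```yaml
# Minimal sketch of a pod requesting one NVIDIA GPU for training.
# Image and command are hypothetical; only the resource limit matters here.
apiVersion: v1
kind: Pod
metadata:
  name: dl-train
spec:
  containers:
  - name: trainer
    image: example/tensorflow-gpu:latest   # hypothetical training image
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 1   # schedule onto a node with an available GPU
  restartPolicy: OnFailure
```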


For consistency, most DL libraries introduce a parameter-server and worker architecture to enable synchronization, while the checkpoint-reload strategy is used to provide fault tolerance. By designing the volume topology in the distributed filesystem, you can move GPU computing closer to the data; colocating the deep learning model, the data, and the applications addresses possible communication congestion. For resource management, Kubernetes orchestrates the containers that train and deploy deep learning models with GPUs.
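The parameter-server synchronization and checkpoint-reload pattern can be sketched in plain Python. This is a toy linear model with hand-computed gradients, not any real DL library's API; `worker_grad`, `train`, and the checkpoint path are illustrative names:

```python
import os
import pickle
import tempfile

def worker_grad(param, shard):
    # Each "worker" computes a gradient on its own data shard:
    # here, the gradient of mean squared error for y = param * x.
    return sum(2 * (param * x - y) * x for x, y in shard) / len(shard)

def train(shards, steps=200, lr=0.01, checkpoint_path=None, param=0.0):
    for step in range(steps):
        # Synchronization point: the "server" averages worker gradients
        # before applying a single update, keeping all workers consistent.
        grads = [worker_grad(param, shard) for shard in shards]
        param -= lr * sum(grads) / len(grads)
        # Checkpoint-reload fault tolerance: persist state every step so
        # training can resume from the last checkpoint after a failure.
        if checkpoint_path:
            with open(checkpoint_path, "wb") as f:
                pickle.dump({"step": step, "param": param}, f)
    return param

# Data from y = 3x, split across two "workers".
shards = [[(1, 3), (2, 6)], [(3, 9), (4, 12)]]
ckpt = os.path.join(tempfile.mkdtemp(), "model.ckpt")
param = train(shards, checkpoint_path=ckpt)  # param converges toward 3.0

# Recover after a simulated failure by reloading the checkpoint.
with open(ckpt, "rb") as f:
    state = pickle.load(f)
```

In a real system the checkpoint would land on the distributed filesystem, so any replacement container scheduled by Kubernetes can pick up where the failed one left off.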


You’ll learn how to use the converged data platform as the data infrastructure, providing a distributed filesystem, key-value storage, and streams to store data and build the data pipeline. With deep learning libraries like TensorFlow or Apache MXNet housed in persistent application client containers (PACCs), you can persist models to the distributed filesystem, give DL frameworks full access to the vast data stored there, and serve models to score the data coming in through streams. Furthermore, you can manage model versions and library dependencies through container images and customize the machine learning server for production."
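The stream-scoring loop described above can be sketched with a queue standing in for a real pub/sub stream (such as MapR Streams or Kafka); `score` is a hypothetical stand-in for a model loaded from the distributed filesystem:

```python
from queue import Queue

def score(record):
    # Stand-in "model": flag records whose value exceeds a threshold.
    return {"id": record["id"], "alert": record["value"] > 0.5}

def serve(input_stream, output_stream):
    # Consume records as they arrive, score each one, and publish the
    # result back to an output stream for downstream consumers.
    while not input_stream.empty():
        record = input_stream.get()
        output_stream.put(score(record))

# Simulate a stream of incoming records and run the scoring loop.
incoming, outgoing = Queue(), Queue()
for i, v in enumerate([0.2, 0.9, 0.7]):
    incoming.put({"id": i, "value": v})
serve(incoming, outgoing)
results = [outgoing.get() for _ in range(3)]
```

A production version would replace the queues with stream consumers and producers and run the loop continuously inside a container, which is what makes global real-time scoring possible.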