
Scalable Deep Learning - inovex


Page 1:

Page 2:

Scalable Deep Learning

Big Data, NLP, Machine Perception

Page 3:

Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, Kurt Keutzer: “ImageNet Training in Minutes” (2018)

Facebook: “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour” (2017)

Page 4:

› Scaling options

› Technology stack

› Frameworks + comparison

› Learnings

› Outlook

Page 5:

› physical limit

› more resources (CPU/GPU/RAM)

Icons by Freepik / CC 3.0 BY

Page 6:

› compute cluster

› limited parallelizability

Icons by Freepik / CC 3.0 BY

Page 7:

› periodic synchronization

› same network, different data

Machine 1 / Machine 2
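The scheme on this slide, where every machine trains a copy of the same network on its own data shard and the weights are synchronized periodically, can be sketched as a toy simulation in plain Python (hypothetical code, no real framework involved):

```python
# Toy sketch of synchronous data parallelism: each worker holds a copy
# of the same (here: scalar) model, trains on its own data shard, and
# every `sync_every` steps the replicas' weights are averaged.

def gradient(w, x):
    # gradient of the squared error (w - x)^2 for one sample
    return 2.0 * (w - x)

def train_data_parallel(shards, steps=100, lr=0.1, sync_every=10):
    weights = [0.0 for _ in shards]          # one model replica per worker
    for step in range(steps):
        for i, shard in enumerate(shards):
            x = shard[step % len(shard)]     # each worker sees different data
            weights[i] -= lr * gradient(weights[i], x)
        if (step + 1) % sync_every == 0:     # periodic synchronization
            avg = sum(weights) / len(weights)
            weights = [avg for _ in weights]
    return weights

# Two workers, disjoint data shards drawn from the same distribution:
final = train_data_parallel([[1.0, 3.0], [2.0, 4.0]])
```

After the final synchronization both replicas hold identical weights near the overall data mean; the sync period trades communication cost against replica drift.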

Page 8:

Technology Stack

Page 9:

› container virtualization

› lightweight

› most widespread container format

› availability of registries

Page 10:

› hardware abstraction

› runs containers

› service discovery & networking

› configuration management

Page 11:

› package manager

› templating functionality

› convenience

Page 12:

helm init

helm install --set worker=2

Icons by Freepik / CC 3.0 BY

Page 13:

Service name: worker-0

Service name: worker-1

Icons by Freepik / CC 3.0 BY

Page 14:

{{ range $i := until (int $worker) }}
kind: Service
apiVersion: v1
metadata:
  name: worker-{{ $i }}
spec:
  selector:
    job: worker
    task: {{ $i }}
{{ end }}

Page 15:

Deep Learning Frameworks

Page 16:

› autumn 2015, Google

› “library for high performance numerical computation”

› ML/ DL support

› TensorBoard

Page 17:

› Parameter Server

Carnegie Mellon University, Baidu, Google: “Scaling Distributed Machine Learning with the Parameter Server” (2014)

Worker Worker Worker

Parameter Server
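The worker/parameter-server pattern in the diagram can be illustrated with a minimal in-process simulation (hypothetical toy code, not the TensorFlow API): the server owns the weights, workers pull them, compute gradients on their own data, and push the gradients back.

```python
# Minimal in-process simulation of the parameter-server pattern
# (hypothetical toy code): the server owns the weights; workers
# push gradients and pull updated weights.

class ParameterServer:
    def __init__(self, w, lr=0.1):
        self.w = w
        self.lr = lr

    def push(self, grads):
        # aggregate the workers' gradients, then apply one update
        avg = sum(grads) / len(grads)
        self.w -= self.lr * avg

    def pull(self):
        return self.w

def worker_gradient(w, x):
    # gradient of (w - x)^2 on this worker's sample
    return 2.0 * (w - x)

server = ParameterServer(w=0.0)
data = [1.0, 2.0, 3.0]                 # one sample per worker
for step in range(200):
    w = server.pull()                  # workers pull current weights
    grads = [worker_gradient(w, x) for x in data]
    server.push(grads)                 # server averages and updates
```

In this toy run the weight converges to the mean of the workers' targets; in a real deployment push/pull happen over the network and the server itself can be sharded.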

Page 18:

› multi CPU/GPU, multi node

› infrastructure: no prerequisites

› IP addresses/hostnames + port

› Parameter Server
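Wiring this up in TensorFlow only requires each process to know every participant's hostname and port. A sketch of building that cluster description (the `tensorflow-worker-*`/`tensorflow-ps-*` hostnames are assumptions matching the Kubernetes services created via Helm; the dict is what `tf.train.ClusterSpec` consumes in TF 1.x):

```python
# Build the cluster description each process needs. The hostnames are
# assumptions matching the Kubernetes service names created above.
def make_cluster(num_workers, num_ps, port=5000):
    return {
        "worker": ["tensorflow-worker-%d:%d" % (i, port) for i in range(num_workers)],
        "ps": ["tensorflow-ps-%d:%d" % (i, port) for i in range(num_ps)],
    }

cluster = make_cluster(num_workers=2, num_ps=1)

# In TF 1.x this dict would then be passed on, e.g.:
#   spec = tf.train.ClusterSpec(cluster)
#   server = tf.train.Server(spec, job_name="worker", task_index=0)
```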

Page 19:

worker_hosts

worker_hosts={{ range $k := until (int $worker) }}tensorflow-worker-{{ $k }}:5000{{ if lt (int $k | add 1) (int $worker) }}{{ print "," }}{{ end }}{{ end }}

worker_hosts=tensorflow-worker-0:5000,tensorflow-worker-1:5000
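The join logic of the Helm template can be cross-checked in plain Python; this hypothetical helper produces exactly the rendered line shown:

```python
# Plain-Python equivalent of the Helm template above: join the worker
# service names (with port 5000) into a comma-separated list.
def make_worker_hosts(num_workers, port=5000):
    return ",".join(
        "tensorflow-worker-%d:%d" % (k, port) for k in range(num_workers)
    )

print(make_worker_hosts(2))
# tensorflow-worker-0:5000,tensorflow-worker-1:5000
```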

Page 20:

› Distributed (Deep) Machine Learning Community (DMLC)

› “A flexible and efficient library for deep learning.”

› Amazon’s framework of choice

› (TensorBoard Support)

Page 21:

› distributed KVStore

GPU 1 GPU 2 GPU 1 GPU 2

T. Chen et al.: “MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems” (2015)
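MXNet's distributed KVStore exposes this as push/pull on named keys: every device pushes its partial gradient, the store aggregates, and every device pulls the result. A minimal in-process imitation of these semantics (toy code, not the real MXNet API):

```python
# Toy imitation of KVStore-style semantics (not the real MXNet API):
# devices push contributions for a key, the store aggregates them,
# and pull returns the aggregated value to every device.

class ToyKVStore:
    def __init__(self):
        self.store = {}

    def init(self, key, value):
        self.store[key] = value

    def push(self, key, values):
        # sum the contributions from all devices
        self.store[key] = sum(values)

    def pull(self, key):
        return self.store[key]

kv = ToyKVStore()
kv.init("weight", 0.0)
kv.push("weight", [0.5, 1.5])   # two GPUs push partial gradients
total = kv.pull("weight")       # every device pulls the sum
```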

Page 22:

› multi CPU/GPU, multi node

› infrastructure: SSH / MPI / YARN / SGE

› hostfile with IP addresses/hostnames

› distributed KVStore

Page 23:

hosts

data:
  hosts: |-
{{- range $i := until (int $worker) }}
    mxnet-worker-{{ $i }}
{{- end }}

Page 24:

› spring 2015, Microsoft

› “A deep learning tool that balances efficiency, performance and flexibility”

› TensorBoard Support

Page 25:

› 1-Bit SGD / Block-Momentum SGD / Parameter Server

› multi CPU/GPU, multi node

› infrastructure: (Open)MPI

› hostfile with IP addresses/hostnames

Microsoft: “Scalable Training of Deep Learning Machines by Incremental Block Training with Intra-block Parallel Optimization and Blockwise Model-Update Filtering” (2016)

Microsoft: “1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs” (2014)
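The 1-bit SGD idea referenced above compresses each gradient element to its sign and carries the quantization error into the next step (error feedback). A toy element-wise sketch, heavily simplified from the paper:

```python
# Toy sketch of 1-bit gradient quantization with error feedback:
# transmit only the sign of (gradient + carried error) per element,
# keep the residual locally and add it to the next gradient.

def quantize_1bit(grad, residual, scale=1.0):
    out = []
    new_residual = []
    for g, r in zip(grad, residual):
        v = g + r                          # add carried-over error
        q = scale if v >= 0 else -scale    # 1 bit: only the sign is sent
        out.append(q)
        new_residual.append(v - q)         # remember what was lost
    return out, new_residual

grad = [0.3, -1.2, 0.8]
residual = [0.0, 0.0, 0.0]
q1, residual = quantize_1bit(grad, residual)
# q1 == [1.0, -1.0, 1.0]; residual now carries the quantization error
q2, residual = quantize_1bit(grad, residual)
# q2 == [-1.0, -1.0, 1.0]: the first element flips sign because the
# carried error outweighs the small positive gradient
```

The residual feedback is what keeps the compressed updates unbiased over time, which is why the scheme works despite the extreme compression.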

Page 26:

hosts

data:
  hosts: |-
{{- range $i := until (int $worker) }}
    cntk-worker-{{ $i }}
{{- end }}

Page 27:

VS.

Page 28:

CNN

› MNIST dataset

› LeNet-5

› simpler networks

LSTM

› PTB dataset

› 200 neurons, 2 layers

› more complex networks

Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals: “Recurrent Neural Network Regularization” (2015)

Page 29:

Page 30:

Page 31:

Page 32:

› TensorFlow, MXNet, CNTK

› 3 diverse CNNs

› CNTK scales more efficiently

› reveal bottlenecks

S. Shi, X. Chu: “Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs” (2017)

Page 33:

› install the DevicePlugin

› base image: nvidia/cuda

› use GPU resources

resources:
  limits:
    nvidia.com/gpu: {{ $numGpus }}

Page 34:

› kubeflow
  › similar to our platform
  › ksonnet for templating (vs. Helm alone)
  › Argo for workflow orchestration (possibly as well)

› Horovod (Uber)
  › MPI for TensorFlow (better performance)
  › better scaling mechanism

› TensorFlow Operator (Jakob Karalus, data2day ’17)
  › jinja2 for templating

Page 35:

Page 36:

Sebastian Jäger, @se_jaeger

Hans-Peter Zorn

@data_hpz

inovex GmbH

Ludwig-Erhard-Allee 6

76131 Karlsruhe