DATA SCIENCE AND MACHINE LEARNING IN THE KUBERNETES ECOSYSTEM
Hans-Peter Zorn, Stefan Igel
Heidelberg, 26 September 2018


Agenda

● Use case: analysis of imaging mass spectrometry data
● Data science workflows & ML platforms
● K8S as the basis for ML platforms
● Tools & components for DS workflows
● Outlook

Use Case: EMQ
Project Setup

› Expert system for quality assessment and evaluation of 3-dimensional mass spectrometry data
› R&D project of Hochschule Mannheim and inovex
› Duration: 01.11.2017 - 31.10.2019

MALDI Mass Spectrometry
Basic workflow & application

[Figure: data acquisition workflow with the labels Kidney tissue slice, Microscopic image, Mass spectrometer (Bruker Rapiflex MALDI-TOF/TOF)]

Typical applications
• Clinical diagnostics
• Pharmaceutical monitoring
• Histological research

Image sources: Nature Reviews Cancer 10, 639-646 (09/2010); Molecular Oncology 4, Issue 6, 529-538 (12/2010); University of Michigan: Histology and Virtual Microscopy Learning Resources


MSI Datacubes

A state-of-the-art MALDI imaging dataset comprises a large number of spectra (up to 100k). Each raw spectrum holds the intensities of small m/z bins (usually 10k – 100k values) and describes up to hundreds of different molecules.

Data generation time: sample preparation (30 – 90 min), data acquisition (at 2 pixels / sec ~ 14 h; with the current next-generation MALDI system at up to 50 pixels / sec ~ 30 – 50 min), data analysis (~ 1 h) → total time ~ 2 – 3.5 h per tissue sample.

Jones, Emrys A., et al. Journal of proteomics 75.16 (2012): 4962-4989.
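The timing and size figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes one spectrum per pixel and, for the size estimate, dense storage with 4 bytes per intensity value. Both are assumptions made for illustration and are not taken from the slide or the cited paper; the function names are hypothetical.

```python
def acquisition_hours(n_pixels, pixels_per_sec):
    """Acquisition time in hours for a given scan rate."""
    return n_pixels / pixels_per_sec / 3600.0


def datacube_gib(n_spectra, bins_per_spectrum, bytes_per_value=4):
    """Dense datacube size in GiB (assumption: 4-byte intensity values)."""
    return n_spectra * bins_per_spectrum * bytes_per_value / 2**30


def total_hours(prep_min, acquisition_min, analysis_min=60):
    """Total turnaround per tissue sample in hours."""
    return (prep_min + acquisition_min + analysis_min) / 60.0


if __name__ == "__main__":
    n = 100_000  # upper bound on spectra per dataset, from the slide
    print(f"acquisition at 2 px/s:  {acquisition_hours(n, 2):.1f} h")
    print(f"acquisition at 50 px/s: {acquisition_hours(n, 50) * 60:.0f} min")
    print(f"total, best case:  {total_hours(30, 30):.1f} h")
    print(f"total, worst case: {total_hours(90, 50):.1f} h")
    print(f"datacube, 10k bins:  {datacube_gib(n, 10_000):.1f} GiB")
    print(f"datacube, 100k bins: {datacube_gib(n, 100_000):.1f} GiB")
```

With 100k pixels, 2 pixels / sec gives roughly 14 h and 50 pixels / sec roughly half an hour, which is consistent with the ranges quoted on the slide.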

Data Science / Machine Learning Platforms
Goal: professionalizing data science

1. Support data science team processes
2. Democratization of data
3. Democratization of machine learning

Data Science / Machine Learning Platforms
support machine learning workflows:

Workflow steps: Manage Data, Train Models, Evaluate Models, Deploy Models, Make Predictions, Monitor Predictions

› Scalable
› Reliable
› Reproducible
› Easy-to-use
› Flexible
› Automated
› Offline and online

https://eng.uber.com/michelangelo/

EMQ Machine Learning Platform

[Figure: platform overview with the stages Ingest, Explore, (Pre-)Process, Train, and Infer; the artifacts Raw Data, Prep. Data Set, Training Set, Model, and Result; Control; and the layers Monitoring, Logging, Metadata, and Runtime Environment]

EMQ Machine Learning Platform
Runtime Environment

[Figure: EMQ Machine Learning Platform overview diagram]

Scalable? Sounds like Big Data ...
Is there anything beyond Hadoop?

Layer                  | Hadoop Stack               | Kubernetes Stack
Distributed Serving    | HBase                      | elastic, Cassandra, Druid, ...
Distributed Processing | MapReduce, Tez, Spark, ... | Spark, Tensorflow, ...
Processing Core Unit   | JVM                        | Docker
Distributed Storage    | HDFS                       | S3, NFS, Ceph, Quobyte, ...
Cluster Management     | YARN, Zookeeper            | CoreOS, Kubernetes
Operating System       | Linux Kernel               | Linux Kernel


Ingredients for K8S Solutions
Bare Metal, Public & Private Cloud

› everything you need to build and scale
› build, ship and run any app, anywhere
› container orchestration, automated management, deployment, scaling
› package manager for K8S Apps

https://www.inovex.de/fileadmin/files/Vortraege/2017/big-data-in-der-cloud-zorn-kreiling-29.09.2017.pdf

Packaging
Docker, because…

● Most widely used container format
● Lightweight
● Resource limitation
● Availability of registries

https://www.inovex.de/fileadmin/files/Vortraege/2017/big-data-in-der-cloud-zorn-kreiling-29.09.2017.pdf

Deployment
Kubernetes, because of…

● Hardware abstraction
● Container scheduling and management
● Service discovery & networking
● Configuration management
● Monitoring
● Load balancing
● Rolling upgrades

https://www.inovex.de/fileadmin/files/Vortraege/2017/big-data-in-der-cloud-zorn-kreiling-29.09.2017.pdf

Dependency Management
Helm, for...

● Package manager
● Convenience
● Numerous templates
● Templating functionality

https://www.inovex.de/fileadmin/files/Vortraege/2017/big-data-in-der-cloud-zorn-kreiling-29.09.2017.pdf

Continuous Integration
Terraform, because ...

› Infrastructure as Code
› Cloud provider agnostic
› Software Defined Networking
› Disposable environments

Continuous Integration
Gitlab-CI, because

• Integration with Gitlab
• Easy-to-define CI pipelines
• Integrated Docker registry

https://www.inovex.de/fileadmin/files/Vortraege/2017/big-data-in-der-cloud-zorn-kreiling-29.09.2017.pdf

CI / CD Pipeline

[Figure: pipeline diagram with the labels git push, Gitlab, docker push, helm install, kubectl, Service, Deployment / StatefulSet, docker pull, Pod, Pod]

https://www.inovex.de/fileadmin/files/Vortraege/2017/big-data-in-der-cloud-zorn-kreiling-29.09.2017.pdf
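The commands a CI job might run for the build and deploy part of this pipeline can be sketched as follows. This is only an illustration: the registry host, image name, release name, chart path, and the `image.tag` chart value are hypothetical, and `helm upgrade --install` is used here in place of the plain `helm install` shown in the diagram because its argument order is the same across Helm versions.

```python
import shlex


def pipeline_commands(registry, image, tag, release, chart):
    """Return the build, push, and deploy commands as argument lists."""
    image_ref = f"{registry}/{image}:{tag}"
    return [
        # Build the image from the Dockerfile in the current directory.
        ["docker", "build", "-t", image_ref, "."],
        # Push it to the registry so the cluster can pull it later.
        ["docker", "push", image_ref],
        # Install or update the release; assumes the chart reads image.tag.
        ["helm", "upgrade", "--install", release, chart,
         "--set", f"image.tag={tag}"],
    ]


if __name__ == "__main__":
    # All values below are placeholders, not taken from the talk.
    for cmd in pipeline_commands("registry.example.com", "emq/preprocess",
                                 "1.0.0", "preprocess", "./chart"):
        print(shlex.join(cmd))
```

Returning argument lists instead of shell strings keeps quoting explicit; the lists can be passed to `subprocess.run` unchanged.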

EMQ Machine Learning Platform
Ingest & Store

[Figure: EMQ Machine Learning Platform overview diagram]

Ingest & Store

[Figure: ingest and storage diagram with the labels Online - Streaming, Offline - Batch, Msg, Stream Processing, File Transfer, Distributed File System, Data Lake, NoSQL DB, Runtime Environment]

Kubernetes on OpenStack
Kubernetes in the cloud
Kubernetes alongside Hadoop
HDFS, Kubernetes
(managed) Kubernetes
Kubernetes alongside MapR-FS

EMQ Machine Learning Platform
(Pre-)Processing

[Figure: EMQ Machine Learning Platform overview diagram]

(Pre-)Processing
Standardized Data Processing

• integrate legacy algorithms
• different programming languages (C++, R, Python, ...)
• different base images

(Pre-)Processing
Orchestrate data processing steps

● reproducible
● flexible
● scalable

(Pre-)Processing
argo Architecture

› Kubernetes API extension (CRD)
› Batch job pattern
› Data handling via buckets (S3)
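To make the three points concrete, the following sketch assembles an Argo `Workflow` manifest as a Python dictionary: a custom resource (`argoproj.io/v1alpha1`), containers that run to completion one after another, and output artifacts stored in an S3 bucket. Step names, images, commands, bucket, and endpoint are hypothetical, S3 credentials are omitted, and the exact fields should be checked against the Argo version in use.

```python
import json


def argo_workflow(name_prefix, steps, bucket, endpoint):
    """Build a sequential Argo Workflow manifest from a list of step dicts."""
    templates = []
    sequence = []
    for step in steps:
        templates.append({
            "name": step["name"],
            # Each step may use its own base image and language runtime.
            "container": {"image": step["image"], "command": step["command"]},
            "outputs": {"artifacts": [{
                "name": "result",
                "path": "/data/out",
                # Hypothetical bucket layout; credentials are omitted here.
                "s3": {"endpoint": endpoint, "bucket": bucket,
                       "key": f"{step['name']}/out.tgz"},
            }]},
        })
        # One inner list per step: the outer list runs sequentially.
        sequence.append([{"name": step["name"], "template": step["name"]}])
    templates.append({"name": "main", "steps": sequence})
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": name_prefix},
        "spec": {"entrypoint": "main", "templates": templates},
    }


if __name__ == "__main__":
    # Placeholder steps illustrating different languages and base images.
    wf = argo_workflow("preprocess-", [
        {"name": "normalize", "image": "python:3.6",
         "command": ["python", "normalize.py"]},
        {"name": "peak-picking", "image": "r-base",
         "command": ["Rscript", "peaks.R"]},
    ], bucket="emq-data", endpoint="s3.example.com")
    print(json.dumps(wf, indent=2))
```

The JSON output is also valid YAML, so it can be saved to a file and submitted like any other Kubernetes manifest.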

EMQ Machine Learning Platform
Explore & Analyze

[Figure: EMQ Machine Learning Platform overview diagram]

Train Models
Jupyter Hub

› Jupyter notebooks
  › Language of choice (Python, R, Scala, ...)
  › Notebooks can be shared (git, ...)
  › Big data integration (Apache Spark)
  › pandas, scikit-learn, ggplot2, TensorFlow
› Jupyter Hub
  › Multi-user Hub for data science workgroups
  › Spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server

Train Models
Jupyter Hub

› multi-user Hub (tornado process)
› configurable http proxy (node-http-proxy)
› multiple single-user Jupyter notebook servers (Python/Jupyter/tornado)
› REST API for administration of the Hub and its users

https://github.com/jupyterhub/jupyterhub
https://jupyterhub.readthedocs.io/en/stable/
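As an illustration of the administration REST API, the sketch below prepares an authenticated request for the Hub's user list without sending it. The Hub URL and token are placeholders; the `/hub/api` prefix and the `Authorization: token ...` header follow the JupyterHub documentation, but the details may vary by version and deployment.

```python
import urllib.request


def hub_api_request(hub_url, token, path="/users"):
    """Prepare (but do not send) a request against the JupyterHub REST API."""
    url = hub_url.rstrip("/") + "/hub/api" + path
    # JupyterHub expects an API token in the Authorization header.
    return urllib.request.Request(
        url, headers={"Authorization": f"token {token}"})


if __name__ == "__main__":
    # Placeholder URL and token; replace with values from a real deployment.
    req = hub_api_request("https://hub.example.com", "0123456789abcdef")
    print(req.get_method(), req.full_url)
    # Sending would be: urllib.request.urlopen(req).read()
```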

EMQ Machine Learning Platform
Model Training & Inference

[Figure: EMQ Machine Learning Platform overview diagram]

Deep Learning
Tensorflow

› Autumn 2015, Google
› “library for high performance numerical computation”
› ML / DL support
› TensorBoard

https://www.inovex.de/fileadmin/files/Vortraege/2018/skalieren-von-deep-learning-frameworks-m3-26.04.2018.pdf

Deep Learning
Scaling Tensorflow

› Parameter server
› multi CPU / GPU, multi node
› Infrastructure: no prerequisites
› IP addresses / hostnames + port

Carnegie Mellon University, Baidu, Google: “Scaling Distributed Machine Learning with the Parameter Server” (2014)

[Figure: three Workers connected to a Parameter Server]
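The "hostnames + port" requirement can be illustrated with the cluster description that distributed TensorFlow consumes. The sketch below builds a `TF_CONFIG`-style JSON document for one task; the host names and port 2222 are placeholders, and whether `TF_CONFIG` or an explicit `tf.train.ClusterSpec` is used depends on the TensorFlow API and version in use.

```python
import json


def tf_config(ps_hosts, worker_hosts, task_type, task_index):
    """Return a TF_CONFIG-style JSON string for one task in the cluster."""
    return json.dumps({
        # Every task gets the same cluster map of "host:port" addresses.
        "cluster": {"ps": ps_hosts, "worker": worker_hosts},
        # The task entry tells this process which role it plays.
        "task": {"type": task_type, "index": task_index},
    })


if __name__ == "__main__":
    # Placeholder host names, e.g. Kubernetes service names.
    ps = ["ps-0:2222"]
    workers = ["worker-0:2222", "worker-1:2222", "worker-2:2222"]
    for i in range(len(workers)):
        print(f"worker {i}: TF_CONFIG={tf_config(ps, workers, 'worker', i)}")
```

On Kubernetes, each pod would typically receive its own value as an environment variable, with stable service names standing in for IP addresses.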

Deep Learning
Apache MXNet

› Distributed (Deep) Machine Learning Community (DMLC)
› “A flexible and efficient library for deep learning.”
› Amazon's framework of choice
› (TensorBoard support)

https://www.inovex.de/fileadmin/files/Vortraege/2018/skalieren-von-deep-learning-frameworks-m3-26.04.2018.pdf

Deep Learning
Scaling Apache MXNet

› Distributed KVStore
› multi CPU / GPU, multi node
› Infrastructure: SSH / MPI / YARN / SGE
› Hostfile with IP addresses / hostnames

T. Chen et al.: “MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems” (2015)

[Figure: two groups, each with GPU 1 and GPU 2]
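A hostfile of the kind mentioned above is a plain list of addresses, one per line. The sketch below writes such a file's content and assembles a launch command in the style of MXNet's `tools/launch.py` with the SSH launcher. The host addresses and the training script are hypothetical, and the launcher options shown are an assumption based on the MXNet distributed training documentation that should be verified for the installed version.

```python
def hostfile_text(hosts):
    """One IP address or hostname per line."""
    return "\n".join(hosts) + "\n"


def launch_command(n_workers, hostfile, train_cmd):
    """Assemble a launch.py call for the SSH launcher (assumed options)."""
    return ["python", "tools/launch.py",
            "-n", str(n_workers),      # number of worker processes
            "--launcher", "ssh",       # start the workers via SSH
            "-H", hostfile,            # file with the worker hosts
            ] + list(train_cmd)


if __name__ == "__main__":
    hosts = ["10.0.0.11", "10.0.0.12"]  # placeholder addresses
    print(hostfile_text(hosts), end="")
    # train.py is a hypothetical training script.
    print(" ".join(launch_command(len(hosts), "hosts",
                                  ["python", "train.py"])))
```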

Deep Learning
GPU Support with Kubernetes

› Install the device plugin
› Base image: nvidia/cuda
› Use GPU resources

https://www.inovex.de/fileadmin/files/Vortraege/2018/skalieren-von-deep-learning-frameworks-m3-26.04.2018.pdf

resources:
  limits:
    nvidia.com/gpu: {{ $numGpus }}
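The same GPU limit can be set when a pod manifest is generated programmatically instead of through a Helm template. The sketch below builds a minimal pod as a Python dictionary; the pod name and image tag handling are hypothetical, and it assumes the NVIDIA device plugin is already installed so that the `nvidia.com/gpu` resource is known to the cluster.

```python
import json


def gpu_pod(name, image, num_gpus):
    """Minimal pod manifest requesting GPUs through resource limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                # GPUs are requested via limits, as in the template above.
                "resources": {"limits": {"nvidia.com/gpu": num_gpus}},
            }],
        },
    }


if __name__ == "__main__":
    # Placeholder name; a real image reference would also pin a tag.
    print(json.dumps(gpu_pod("train-gpu", "nvidia/cuda", 2), indent=2))
```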

Train Models
Distributed Machine Learning

3 ways to run Spark on k8s:
● Spark in standalone mode: https://github.com/helm/charts/tree/master/stable/spark
● Spark operator on Kubernetes: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
● Using spark-submit: https://spark.apache.org/docs/2.3.0/running-on-kubernetes.html

Train Models
Distributed Machine Learning

spark-submit:
● Spark creates a Spark driver running within a k8s pod.
● The driver creates executors running within k8s pods, connects to them, and executes application code.

https://spark.apache.org/docs/2.3.0/running-on-kubernetes.html
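A submission of this kind might look like the following sketch, which assembles the `spark-submit` arguments in the form shown in the Spark 2.3 documentation linked above. The API server address, container image, and jar path are placeholders; the helper function itself is hypothetical.

```python
def spark_submit_args(api_server, image, app_jar, main_class,
                      executors=2, name="spark-app"):
    """Argument list for submitting a Spark application to Kubernetes."""
    return [
        "spark-submit",
        # The k8s:// prefix selects the Kubernetes scheduler backend.
        "--master", f"k8s://{api_server}",
        # Cluster mode: the driver itself runs in a pod.
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
        # Number of executor pods the driver should create.
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


if __name__ == "__main__":
    # Placeholder values; local:// refers to a path inside the image.
    args = spark_submit_args(
        "https://k8s.example.com:6443",
        "registry.example.com/spark:2.3.0",
        "local:///opt/spark/examples/jars/examples.jar",
        "org.apache.spark.examples.SparkPi",
        executors=5, name="spark-pi")
    print(" ".join(args))
```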

EMQ Machine Learning Platform
Logging & Monitoring

[Figure: EMQ Machine Learning Platform overview diagram]

Logging & Monitoring

[Figure: logging stack with the layers Collecting logs, Buffering and transformation, Database, Frontend]

Logging & Monitoring

[Figure: monitoring stack with the layers Collecting metrics, Database, Frontend]

EMQ Machine Learning Platform
Metadata Management

[Figure: EMQ Machine Learning Platform overview diagram]

Metadata
… data about data

● about the environment
● about the data
● about the workflows
● about the models
● about the domain
● ...

EMQ Machine Learning Platform
Putting it all together

[Figure: EMQ Machine Learning Platform overview diagram]

Outlook

› Platform hardening
› Adaptation and extension for new use cases
› NLP / semantic search
› IIoT
› Metadata
› Model management
› Dissemination

Workflow steps: Manage Data, Train Models, Evaluate Models, Deploy Models, Make Predictions, Monitor Predictions

The Team
… without whom none of this would exist at our company

› Sebastian Schmidt
› Alexander Grizschancew
› Sebastian Jäger
› Alexander Lontke
› Julien Heitmann
› Marcel Hofmann
› Kevin Exel
› David Waidner
› Matthias Schwartz
› Stanislav Frolov
› David Schmidt
› Daniel Bäurer
› Nils Domrose
› Hans-Peter Zorn
› Stefan Igel

Thank you

Hans-Peter Zorn
Head of Machine Perception & [email protected]

Dr. Stefan Igel
Head of Big Data [email protected]