DATA SCIENCE AND MACHINE LEARNING IN THE KUBERNETES ECOSYSTEM

Hans-Peter Zorn, Stefan Igel
Heidelberg, 26 September 2018

Agenda

● Use case: analysis of imaging mass spectrometry data
● Data science workflows & ML platforms
● K8S as the foundation for ML platforms
● Tools & components for DS workflows
● Outlook

Use Case: EMQ
Project Setup

› Expert system for quality assessment and analysis of 3-dimensional mass spectrometry data
› R&D project by Hochschule Mannheim and inovex
› Duration: 01.11.2017 - 31.10.2019

MALDI Mass Spectrometry
Basic workflow & application

[Figure: data acquisition. Labels: Kidney tissue slice; Mass spectrometer (Bruker Rapiflex MALDI-TOF/TOF); Microscopic image]

Typical applications
• Clinical diagnostics
• Pharmaceutical monitoring
• Histological research

Image sources: Nature Reviews Cancer 10, 639-646 (09/2010); Molecular Oncology 4, Issue 6, 529-538 (12/2010); University of Michigan: Histology and Virtual Microscopy Learning Resources

MSI Datacubes

A state-of-the-art MALDI imaging dataset comprises a large number of spectra (up to 100k). Each raw spectrum holds the intensities of small m/z bins (usually 10k – 100k per spectrum) and describes up to hundreds of different molecules.

Data generation time: sample preparation (30 – 90 min), data acquisition (2 pixels/s ≈ 14 h; with the next-generation MALDI system currently up to 50 pixels/s ≈ 30 – 50 min), data analysis (≈ 1 h) → total time ≈ 2 – 3.5 h per tissue sample.

Jones, Emrys A., et al. Journal of proteomics 75.16 (2012): 4962-4989.
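The timing figures for data generation can be checked with a few lines of arithmetic. This is a minimal sketch that assumes one spectrum per pixel and the upper bound of 100k spectra per dataset; the function name and constants are illustrative and not taken from the slides.

```python
def acquisition_hours(num_pixels: int, pixels_per_second: float) -> float:
    """Hours needed to scan num_pixels at a constant acquisition rate."""
    return num_pixels / pixels_per_second / 3600.0


# Assumption: one spectrum per pixel, upper bound of 100k spectra per dataset.
NUM_PIXELS = 100_000

slow_hours = acquisition_hours(NUM_PIXELS, 2)          # 2 pixels/s
fast_minutes = acquisition_hours(NUM_PIXELS, 50) * 60  # 50 pixels/s

# Total per tissue sample with the faster system, in minutes (min, max):
# sample preparation + data acquisition + data analysis.
total_minutes = (30 + 30 + 60, 90 + 50 + 60)

print(f"2 pixels/s:  {slow_hours:.1f} h")
print(f"50 pixels/s: {fast_minutes:.0f} min")
print(f"total:       {total_minutes[0] / 60:.1f} to {total_minutes[1] / 60:.1f} h")
```

For 100k pixels this gives roughly 13.9 h and 33 min, which is consistent with the ≈ 14 h and 30 – 50 min quoted on the slide.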

Data Science / Machine Learning Platforms
Goal: professionalizing data science

1. Support data science team processes
2. Democratization of data
3. Democratization of machine learning

Data Science / Machine Learning Platforms
support machine learning workflows:

Manage Data → Train Models → Evaluate Models → Deploy Models → Make Predictions → Monitor Predictions

› Scalable
› Reliable
› Reproducible
› Easy-to-use
› Flexible
› Automated
› Offline and online

https://eng.uber.com/michelangelo/

EMQ Machine Learning Platform

[Diagram: platform components. Steps: Ingest, (Pre-)Process, Explore, Train, Infer, Control. Artifacts: Raw Data, Prep. Data Set, Training Set, Model, Result. Cross-cutting: Monitoring, Logging, Metadata. Foundation: Runtime Environment]

EMQ Machine Learning Platform: Runtime Environment

Scalable? Sounds like Big Data ...
Is there anything beyond Hadoop?

Layer                    Hadoop Stack                  Kubernetes Stack
Distributed Serving      HBase                         elastic, Cassandra, Druid, ...
Distributed Processing   MapReduce, Tez, Spark, ...    Spark, TensorFlow, ...
Processing Core Unit     JVM                           Docker
Distributed Storage      HDFS                          S3, NFS, Ceph, Quobyte, ...
Cluster Management       YARN, Zookeeper               CoreOS, Kubernetes
Operating System         Linux Kernel                  Linux Kernel


Ingredients for K8S Solutions
Bare Metal, Public & Private Cloud

› everything you need to build and scale
› build, ship and run any app, anywhere
› container orchestration, automated management, deployment, scaling
› package manager for K8S apps

https://www.inovex.de/fileadmin/files/Vortraege/2017/big-data-in-der-cloud-zorn-kreiling-29.09.2017.pdf

Packaging
Docker, because…

● Most widely used container format
● Lightweight
● Resource limitation
● Availability of registries


Deployment
Kubernetes, because of…

● Hardware abstraction
● Container scheduling and management
● Service discovery & networking
● Configuration management
● Monitoring
● Load balancing
● Rolling upgrades


Dependency Management
Helm, for...

● Package manager
● Convenience
● Numerous ready-made templates
● Templating functionality


Continuous Integration
Terraform, because ...

› Infrastructure as Code
› Cloud provider agnostic
› Software Defined Networking
› Disposable environments

Continuous Integration
Gitlab-CI, because ...

• Integration with Gitlab
• CI pipelines that are easy to define
• Integrated Docker registry


CI / CD Pipeline

[Diagram labels: git push; Gitlab; docker push; helm install; kubectl; Service; Deployment / StatefulSet; docker pull; Pod, Pod]

https://www.inovex.de/fileadmin/files/Vortraege/2017/big-data-in-der-cloud-zorn-kreiling-29.09.2017.pdf
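The pipeline steps named in the diagram (build and push an image, then roll it out with Helm) can be sketched as a list of shell commands that a CI job would run. This is only an illustration: the image name, release name, chart path and the `image` value key are hypothetical. The sketch uses `helm upgrade --install` instead of the plain `helm install` shown on the slide, because it also works when the release already exists.

```python
import shlex
from typing import List


def pipeline_commands(image: str, release: str, chart_dir: str) -> List[List[str]]:
    """Commands a CI job could run after a git push (build, push, deploy)."""
    return [
        ["docker", "build", "-t", image, "."],
        ["docker", "push", image],
        # "image" is a hypothetical value key; it depends on the chart in use.
        ["helm", "upgrade", "--install", release, chart_dir, "--set", f"image={image}"],
    ]


# Hypothetical names, for illustration only.
commands = pipeline_commands(
    image="registry.example.com/emq/preprocess:1.0",
    release="emq-preprocess",
    chart_dir="./chart",
)

for cmd in commands:
    # Print instead of executing, so the sketch has no side effects.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Kubernetes then pulls the pushed image (`docker pull` in the diagram) when it starts the pods of the deployment.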

EMQ Machine Learning Platform: Ingest & Store

Ingest & Store

[Diagram labels: Online - Streaming; Offline - Batch; Msg; File Transfer; Stream Processing; Distributed File System; Data Lake; NoSQL DB; Runtime Environment]

Kubernetes on OpenStack
Kubernetes in the cloud
Kubernetes alongside Hadoop
Kubernetes alongside MapR-FS

[Diagram labels: HDFS; Kubernetes; (managed) Kubernetes]

EMQ Machine Learning Platform: (Pre-)Processing

(Pre-)Processing
Standardized Data Processing

• integrate legacy algorithms
• different programming languages (C++, R, Python, ...)
• different base images

(Pre-)Processing
Orchestrate data processing steps

● reproducible
● flexible
● scalable

(Pre-)Processing
argo Architecture

› Kubernetes API extension (CRD)
› Batch job pattern
› Data handling via buckets (S3)
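Because argo extends the Kubernetes API with a custom resource, a processing pipeline is described as a `Workflow` manifest. The following sketch builds such a manifest as a Python dictionary with two sequential container steps that use different base images. The step names, images and commands are hypothetical, and S3 artifact handling is left out to keep the example short.

```python
import json
from typing import Any, Dict, List


def container_template(name: str, image: str, command: List[str]) -> Dict[str, Any]:
    """One processing step, packaged as a container with its own base image."""
    return {"name": name, "container": {"image": image, "command": command}}


def build_workflow(steps: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Chain the given templates into a sequential Argo Workflow manifest."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "emq-preprocess-"},  # hypothetical name
        "spec": {
            "entrypoint": "pipeline",
            "templates": [
                {
                    "name": "pipeline",
                    # Outer list: sequential stages; inner list: parallel steps.
                    "steps": [
                        [{"name": s["name"], "template": s["name"]}] for s in steps
                    ],
                },
                *steps,
            ],
        },
    }


# Hypothetical images: a Python step and an R step with different base images.
workflow = build_workflow([
    container_template("normalize", "registry.example.com/emq/normalize:latest",
                       ["python", "normalize.py"]),
    container_template("peak-picking", "registry.example.com/emq/peaks:latest",
                       ["Rscript", "peaks.R"]),
])

print(json.dumps(workflow, indent=2))
```

JSON is also valid YAML, so the printed manifest can be saved to a file and created on a cluster that has the argo CRD installed, for example with `kubectl create -f`.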

EMQ Machine Learning Platform: Explore & Analyze

Train Models
Jupyter Hub

› Jupyter notebooks
  › Language of choice (Python, R, Scala, ...)
  › Notebooks can be shared (git, ...)
  › Big data integration (Apache Spark)
  › pandas, scikit-learn, ggplot2, TensorFlow
› Jupyter Hub
  › Multi-user Hub for data science workgroups
  › spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server

https://jupyter-notebook.readthedocs.io/

Train Models
Jupyter Hub

› multi-user Hub (tornado process)
› configurable HTTP proxy (node-http-proxy)
› multiple single-user Jupyter notebook servers (Python/Jupyter/tornado)
› REST API for administration of the Hub and its users

https://github.com/jupyterhub/jupyterhub
https://jupyterhub.readthedocs.io/en/stable/
http://petstore.swagger.io/?url=https://raw.githubusercontent.com/jupyter/jupyterhub/master/docs/rest-api.yml#/default
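The Hub's REST API is called with an API token in the `Authorization` header. The sketch below builds a request that lists the Hub's users, using only the Python standard library. The Hub URL and token are placeholders, and the helper names are made up for this example.

```python
import json
import urllib.request


def hub_request(hub_url: str, token: str, path: str) -> urllib.request.Request:
    """Build an authenticated GET request against the JupyterHub REST API."""
    return urllib.request.Request(
        url=f"{hub_url.rstrip('/')}/hub/api/{path.lstrip('/')}",
        headers={"Authorization": f"token {token}"},
        method="GET",
    )


def list_users(hub_url: str, token: str) -> list:
    """Send the request and decode the JSON list of users (needs a running Hub)."""
    with urllib.request.urlopen(hub_request(hub_url, token, "users")) as response:
        return json.load(response)


# Placeholder values; the request is only built here, not sent.
request = hub_request("http://hub.example.com/", "my-api-token", "users")
print(request.get_method(), request.full_url)
```

Calling `list_users(...)` against a real Hub requires a token with admin rights, since the user list is an administrative endpoint.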

EMQ Machine Learning Platform: Model Training & Inference

Deep Learning
TensorFlow

› Autumn 2015, Google
› “library for high performance numerical computation”
› ML/DL support
› TensorBoard

https://www.inovex.de/fileadmin/files/Vortraege/2018/skalieren-von-deep-learning-frameworks-m3-26.04.2018.pdf

Deep Learning
Scaling TensorFlow

› Parameter server
› multi CPU/GPU, multi node
› Infrastructure: no prerequisites
› IP addresses/hostnames + port

[Diagram: Worker, Worker, Worker; Parameter Server]

Carnegie Mellon University, Baidu, Google: “Scaling Distributed Machine Learning with the Parameter Server” (2014)
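Since distributed TensorFlow only needs host names (or IP addresses) and ports, the cluster layout can be passed to each process as a small JSON document in the `TF_CONFIG` environment variable. The following sketch generates that document without importing TensorFlow. The host names and port 2222 are hypothetical, and the function name is made up for this example.

```python
import json
from typing import List


def tf_config(ps_hosts: List[str], worker_hosts: List[str],
              task_type: str, task_index: int) -> str:
    """Render the TF_CONFIG JSON for one process in a parameter-server cluster."""
    cluster = {"ps": ps_hosts, "worker": worker_hosts}
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"no {task_type} with index {task_index}")
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Hypothetical addresses: one parameter server and three workers.
PS_HOSTS = ["ps-0.example.com:2222"]
WORKER_HOSTS = [f"worker-{i}.example.com:2222" for i in range(3)]

# Each process gets the same cluster description but its own task entry.
config_for_worker_1 = tf_config(PS_HOSTS, WORKER_HOSTS, "worker", 1)
print(config_for_worker_1)
```

On Kubernetes, the addresses would typically be stable service or pod DNS names, and the rendered string would be set as an environment variable of the corresponding container.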

Deep Learning
Apache MXNet

› Distributed (Deep) Machine Learning Community (DMLC)
› “A flexible and efficient library for deep learning.”
› Amazon's framework of choice
› (TensorBoard support)


Deep Learning
Scaling Apache MXNet

› Distributed KVStore
› multi CPU/GPU, multi node
› Infrastructure: SSH / MPI / YARN / SGE
› Hostfile with IP addresses/hostnames

[Diagram labels: GPU 1, GPU 2; GPU 1, GPU 2]

T. Chen et al.: “MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems” (2015)
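In contrast to the TensorFlow setup, distributed MXNet expects a hostfile that lists the participating machines. A minimal sketch of generating such a file follows, assuming the common format of one IP address or host name per line; the addresses and the helper name are hypothetical.

```python
from typing import Iterable


def render_hostfile(hosts: Iterable[str]) -> str:
    """One host per line, blanks and duplicates removed, order preserved."""
    cleaned = (host.strip() for host in hosts)
    unique = list(dict.fromkeys(host for host in cleaned if host))
    return "\n".join(unique) + "\n"


# Hypothetical worker addresses, e.g. collected from the cluster.
hostfile = render_hostfile(["10.0.0.11", "10.0.0.12", "10.0.0.11", " "])
print(hostfile, end="")
```

In a Kubernetes setting the list would have to be produced after the pods are scheduled, because their addresses are not known in advance.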

Deep Learning
GPU Support with Kubernetes

› Install the device plugin
› Base image: nvidia/cuda
› Use GPU resources

    resources:
      limits:
        nvidia.com/gpu: {{ $numGpus }}
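To show where the `nvidia.com/gpu` limit sits in a complete manifest, the sketch below assembles a minimal Pod specification as a Python dictionary. The pod name, image and command are hypothetical; the image is assumed to be built on top of `nvidia/cuda`, and the cluster is assumed to have the NVIDIA device plugin installed.

```python
import json
from typing import Any, Dict, List


def gpu_pod(name: str, image: str, command: List[str], num_gpus: int) -> Dict[str, Any]:
    """Minimal Pod manifest that requests num_gpus NVIDIA GPUs via limits."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                "command": command,
                # GPUs are requested through limits, as in the snippet above.
                "resources": {"limits": {"nvidia.com/gpu": num_gpus}},
            }],
        },
    }


# Hypothetical training job with two GPUs.
pod = gpu_pod(
    name="emq-train",
    image="registry.example.com/emq/train-gpu:latest",
    command=["python", "train.py"],
    num_gpus=2,
)
print(json.dumps(pod, indent=2))
```

In a Helm chart the number would come from a template value such as `$numGpus`, as in the original snippet.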

Train Models
Distributed Machine Learning

3 ways to run Spark on k8s:
● Spark in standalone mode: https://github.com/helm/charts/tree/master/stable/spark
● Spark operator on Kubernetes: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
● Using spark-submit: https://spark.apache.org/docs/2.3.0/running-on-kubernetes.html

Train Models
Distributed Machine Learning

spark-submit:
● Spark creates a Spark driver running within a k8s pod.
● The driver creates executors running within k8s pods, connects to them, and executes application code.

https://spark.apache.org/docs/2.3.0/running-on-kubernetes.html
https://kubernetes.io/docs/concepts/workloads/pods/pod/
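The spark-submit variant can be illustrated by assembling the command line that targets the Kubernetes API server in cluster mode. The options follow the Spark 2.3 documentation linked above; the API server address, container image and job name are placeholders, and the helper function is made up for this example.

```python
import shlex
from typing import List


def spark_submit_command(api_server: str, image: str, app_jar: str,
                         main_class: str, executors: int, name: str) -> List[str]:
    """Build a spark-submit call that runs driver and executors in k8s pods."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


# Placeholder values, for illustration only.
command = spark_submit_command(
    api_server="https://k8s.example.com:6443",
    image="registry.example.com/emq/spark:2.3.0",
    app_jar="local:///path/to/examples.jar",
    main_class="org.apache.spark.examples.SparkPi",
    executors=3,
    name="spark-pi",
)
print(" ".join(shlex.quote(part) for part in command))
```

The `local://` scheme refers to a file that is already present inside the container image, so the jar does not have to be uploaded at submit time.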

EMQ Machine Learning Platform: Logging & Monitoring

Logging & Monitoring

› Collecting logs
› Buffering and transformation
› Database
› Frontend

Logging & Monitoring

› Collecting metrics
› Database
› Frontend

EMQ Machine Learning Platform: Metadata Management

Metadata
… data about data

● about the environment
● about the data
● about the workflows
● about the models
● about the business domain
● ...

EMQ Machine Learning Platform: Putting it all together

Outlook

› Platform hardening
› Adaptation and extension for new use cases
  › NLP / semantic search
  › IIoT
› Metadata
› Model management
› Dissemination

Manage Data → Train Models → Evaluate Models → Deploy Models → Make Predictions → Monitor Predictions

The Team
… without whom none of this would exist at our company

› Sebastian Schmidt
› Alexander Grizschancew
› Sebastian Jäger
› Alexander Lontke
› Julien Heitmann
› Marcel Hofmann
› Kevin Exel
› David Waidner
› Matthias Schwartz
› Stanislav Frolov
› David Schmidt
› Daniel Bäurer
› Nils Domrose
› Hans-Peter Zorn
› Stefan Igel

Thank you

Hans-Peter Zorn
Head of Machine Perception & [email protected]

Dr. Stefan Igel
Head of Big Data [email protected]
