DATA SCIENCE AND MACHINE LEARNING IN THE KUBERNETES ECOSYSTEM
Hans-Peter Zorn, Stefan Igel
Heidelberg, 26 September 2018
Agenda
● Use case: analysis of imaging mass spectrometry data
● Data science workflows & ML platforms
● K8S as the basis for ML platforms
● Tools & components for DS workflows
● Outlook
Use Case: EMQ
Project Setup
› Expert system for quality assessment and analysis of 3-dimensional mass spectrometry data
› R&D project of Hochschule Mannheim and inovex
› Duration: 01.11.2017 - 31.10.2019
MALDI Mass Spectrometry: Basic workflow & application
Data acquisition
[Figure labels: Kidney tissue slice; Microscopic image; Mass spectrometer (Bruker Rapiflex MALDI-TOF/TOF)]
Typical applications
• Clinical diagnostics
• Pharmaceutical monitoring
• Histological research
Image sources: Nature Reviews Cancer 10, 639-646 (09/2010); Molecular Oncology 4, Issue 6, 529-538 (12/2010); University of Michigan: Histology and Virtual Microscopy Learning Resources
MSI Datacubes
A state-of-the-art MALDI imaging dataset comprises a large number of spectra (up to 100k). Each raw spectrum holds the intensities of typically 10k – 100k small m/z bins and describes up to hundreds of different molecules.
Data generation time: sample preparation (30 – 90 min), data acquisition (about 14 h at 2 pixels/s; about 30 – 50 min at up to 50 pixels/s with the next-generation MALDI system), data analysis (about 1 h) → total time about 2 – 3.5 h per tissue sample.
Jones, Emrys A., et al. Journal of proteomics 75.16 (2012): 4962-4989.
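The acquisition times quoted above can be checked with a short calculation. This is only a sketch: it assumes one spectrum per pixel and a dataset at the stated upper bound of 100k spectra, and the helper names are illustrative.

```python
# Back-of-the-envelope check of the acquisition times on the slide.
# Assumption: one spectrum per pixel, 100,000 pixels per dataset.
NUM_PIXELS = 100_000


def acquisition_seconds(num_pixels, pixels_per_second):
    """Time to scan all pixels at a constant rate, in seconds."""
    return num_pixels / pixels_per_second


def total_hours(prep_min, acquisition_min, analysis_min=60):
    """Sum of preparation, acquisition and analysis time, in hours."""
    return (prep_min + acquisition_min + analysis_min) / 60


current_h = acquisition_seconds(NUM_PIXELS, 2) / 3600     # current system
next_gen_min = acquisition_seconds(NUM_PIXELS, 50) / 60   # next generation

fastest_h = total_hours(prep_min=30, acquisition_min=30)
slowest_h = total_hours(prep_min=90, acquisition_min=50)

print(f"2 px/s:  {current_h:.1f} h")
print(f"50 px/s: {next_gen_min:.0f} min")
print(f"total:   {fastest_h:.1f} to {slowest_h:.1f} h per tissue sample")
```

At 2 pixels/s the scan takes close to 14 h, at 50 pixels/s roughly half an hour, which is consistent with the total of a few hours per sample stated on the slide.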
Data Science / Machine Learning Platforms
Goal: professionalizing data science
1. Support data science team processes
2. Democratization of data
3. Democratization of machine learning
› Scalable
› Reliable
› Reproducible
› Easy-to-use
› Flexible
› Automated
› Offline and online
Data Science / Machine Learning Platforms support machine learning workflows:
https://eng.uber.com/michelangelo/
Manage Data
Train Models
Evaluate Models
Deploy Models
Make Predictions
Monitor Predictions
EMQ Machine Learning Platform
[Diagram labels: Raw Data; Ingest; Prep. Data Set; (Pre-)Process; Explore; Training Set; Train; Model; Infer; Control; Result; Monitoring, Logging, Metadata; Runtime Environment]
EMQ Machine Learning Platform: Runtime Environment
Scalable? Sounds like Big Data ... Is there anything beyond Hadoop?
Layer | Hadoop Stack | Kubernetes Stack
Distributed Serving | HBase | elastic, Cassandra, Druid, ...
Distributed Processing | MapReduce, Tez, Spark, ... | Spark, Tensorflow, ...
Processing Core Unit | JVM | Docker
Distributed Storage | HDFS | S3, NFS, Ceph, Quobyte, ...
Cluster Management | YARN, Zookeeper | CoreOS, Kubernetes
Operating System | Linux Kernel (shared by both stacks)
Ingredients for K8S Solutions: Bare Metal, Public & Private Cloud
› everything you need to build and scale
› build, ship and run any app, anywhere
› container orchestration, automated management, deployment, scaling
› package manager for K8S apps
https://www.inovex.de/fileadmin/files/Vortraege/2017/big-data-in-der-cloud-zorn-kreiling-29.09.2017.pdf
Packaging: Docker, because…
● Most widely used container format
● Lightweight
● Resource limits
● Availability of registries
Deployment: Kubernetes, because of…
● Hardware abstraction
● Container scheduling and management
● Service discovery & networking
● Configuration management
● Monitoring
● Load balancing
● Rolling upgrades
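Several of these points show up directly in a Deployment manifest. The following is a minimal sketch, not the project's actual configuration: the application name, the image and the resource values are placeholders chosen for illustration. It builds an apps/v1 Deployment with a rolling-update strategy and container resource limits as a plain dictionary and prints it as JSON.

```python
import json


def make_deployment(name, image, replicas=2):
    """Build an apps/v1 Deployment manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels of the pod template.
            "selector": {"matchLabels": labels},
            # Rolling upgrade: start one new pod before stopping an old one.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        # Placeholder limits; real values depend on the workload.
                        "resources": {
                            "limits": {"cpu": "500m", "memory": "512Mi"},
                        },
                    }],
                },
            },
        },
    }


# Hypothetical application name and image, for illustration only.
manifest = make_deployment(
    "emq-preprocess", "registry.example.com/emq/preprocess:1.0"
)
print(json.dumps(manifest, indent=2))
```

Kubernetes accepts JSON as well as YAML, so the printed manifest could be saved to a file and applied with kubectl apply -f.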
Dependency Management: Helm, for...
● Package manager
● Convenience
● Numerous ready-made charts
● Templating functionality
Continuous Integration: Terraform, because ...
› Infrastructure as Code
› Cloud provider agnostic
› Software Defined Networking
› Disposable environments
Continuous Integration: Gitlab-CI, because
• Integration with Gitlab
• CI pipelines that are easy to define
• Integrated Docker registry
CI / CD Pipeline
[Diagram labels: git push; Gitlab; docker push; helm install; kubectl; Service; Deployment / StatefulSet; docker pull; Pod, Pod]
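One plausible reading of the pipeline is: a git push triggers a CI job, the job builds and pushes an image, and Helm rolls the release out to the cluster, where the pods pull the image. The sketch below only assembles the corresponding commands and does not execute them. The image name, release name, chart directory and the chart value key "image" are hypothetical; helm upgrade --install is used here in place of a plain helm install so the same command works for the first and for later deployments.

```python
import shlex


def pipeline_commands(image, release, chart_dir):
    """Return the shell commands a CI job might run after a git push."""
    return [
        # Build the image from the repository's Dockerfile.
        ["docker", "build", "-t", image, "."],
        # Publish it to the registry the cluster pulls from.
        ["docker", "push", image],
        # Install or upgrade the Helm release; "image" is a chart-specific
        # value key and purely an assumption here.
        ["helm", "upgrade", "--install", release, chart_dir,
         "--set", "image=" + image],
    ]


commands = pipeline_commands(
    image="registry.example.com/emq/preprocess:1.0",
    release="emq-preprocess",
    chart_dir="./chart",
)
for command in commands:
    print(" ".join(shlex.quote(part) for part in command))
```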
EMQ Machine Learning Platform: Ingest & Store
Ingest & Store
[Diagram labels: Online - Streaming; Offline - Batch; Msg; File Transfer; Stream Processing; Distributed File System; Data Lake; NoSQL DB; Runtime Environment]
Kubernetes on OpenStack
Kubernetes in the cloud
Kubernetes alongside Hadoop
HDFS, Kubernetes
(managed) Kubernetes
Kubernetes alongside MapR-FS
EMQ Machine Learning Platform: (Pre-)Processing
(Pre-)Processing: Standardized Data Processing
• integrate legacy algorithms
• different programming languages (C++, R, Python, ...)
• different base images
(Pre-)Processing: Orchestrate data processing steps
● reproducible
● flexible
● scalable
(Pre-)Processing: argo Architecture
› Kubernetes API extension (CRD)
› Batch job pattern
› Data handling via buckets (S3)
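Because argo is a Kubernetes API extension, a processing pipeline is itself a Kubernetes resource of kind Workflow. The sketch below builds such a manifest for three sequential container steps with different base images, in the spirit of the mixed C++ / R / Python processing mentioned earlier. All step names, images and commands are hypothetical, and the S3 artifact handling is left out.

```python
import json


def make_workflow(steps):
    """Build an Argo Workflow manifest that runs container steps in order.

    steps: list of (name, image, command) tuples; command is a list of strings.
    """
    container_templates = [
        {"name": name, "container": {"image": image, "command": command}}
        for name, image, command in steps
    ]
    # In Argo, "steps" is a list of lists: the outer entries run one after
    # another, the entries inside one inner list run in parallel.
    pipeline = {
        "name": "pipeline",
        "steps": [[{"name": name, "template": name}] for name, _, _ in steps],
    }
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "emq-preprocess-"},
        "spec": {
            "entrypoint": "pipeline",
            "templates": [pipeline] + container_templates,
        },
    }


# Hypothetical steps: one image per language, as placeholders.
workflow = make_workflow([
    ("convert", "registry.example.com/emq/convert-cpp:1.0",
     ["/app/convert"]),
    ("normalize", "registry.example.com/emq/normalize-r:1.0",
     ["Rscript", "normalize.R"]),
    ("extract-features", "registry.example.com/emq/features-py:1.0",
     ["python", "features.py"]),
])
print(json.dumps(workflow, indent=2))
```

Since the manifest uses generateName, it would be submitted with kubectl create -f rather than kubectl apply, and each submission creates a new, separately tracked run of the pipeline.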
EMQ Machine Learning Platform: Explore & Analyze
Train Models: Jupyter Hub
› Jupyter notebooks
  › Language of choice (Python, R, Scala, ...)
  › Notebooks can be shared (git, ...)
  › Big data integration (Apache Spark)
  › pandas, scikit-learn, ggplot2, TensorFlow
› Jupyter Hub
  › Multi-user Hub for Data Science Workgroups
  › spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server.
Train Models: Jupyter Hub
› multi-user Hub (tornado process)
› configurable http proxy (node-http-proxy)
› multiple single-user Jupyter notebook servers (Python/Jupyter/tornado)
› REST API for administration of the Hub and its users.
https://github.com/jupyterhub/jupyterhub https://jupyterhub.readthedocs.io/en/stable/
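As an illustration of the REST API, the sketch below prepares a request that lists the Hub's users. The /hub/api/users endpoint and the token-based Authorization header follow the JupyterHub documentation linked above; the Hub URL and the token are placeholders, and the request is only constructed, not sent.

```python
import urllib.request


def list_users_request(hub_url, api_token):
    """Prepare a GET request against the JupyterHub users endpoint."""
    return urllib.request.Request(
        url=hub_url.rstrip("/") + "/hub/api/users",
        headers={"Authorization": "token " + api_token},
        method="GET",
    )


# Placeholder URL and token, for illustration only.
request = list_users_request("http://hub.example.com/", "<api-token>")
print(request.get_method(), request.full_url)

# Against a running Hub, the request would be sent like this:
#   with urllib.request.urlopen(request) as response:
#       users = json.load(response)
```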
EMQ Machine Learning Platform: Model Training & Inference
Deep Learning: TensorFlow
› Autumn 2015, Google
› “library for high performance numerical computation”
› ML / DL support
› TensorBoard
https://www.inovex.de/fileadmin/files/Vortraege/2018/skalieren-von-deep-learning-frameworks-m3-26.04.2018.pdf
Deep Learning: Scaling TensorFlow
› Parameter server
› multi CPU / GPU, multi node
› Infrastructure: no prerequisites
› IP addresses / hostnames + port
Carnegie Mellon University, Baidu, Google: “Scaling Distributed Machine Learning with the Parameter Server” (2014)
[Diagram labels: Worker, Worker, Worker; Parameter Server]
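The parameter server idea can be illustrated without any framework. The following single-process sketch is a simplification under stated assumptions: it is synchronous, uses a toy linear model, and all class and function names are made up for this example. Three workers each compute a gradient on their own data shard, and the server averages the gradients and updates the shared weights.

```python
class ParameterServer:
    """Holds the shared weights and applies averaged worker gradients."""

    def __init__(self, weights, learning_rate):
        self.weights = list(weights)
        self.learning_rate = learning_rate

    def pull(self):
        """Workers fetch a copy of the current weights."""
        return list(self.weights)

    def push(self, gradients):
        """Average the workers' gradients and take one gradient step."""
        count = len(gradients)
        for i in range(len(self.weights)):
            average = sum(gradient[i] for gradient in gradients) / count
            self.weights[i] -= self.learning_rate * average


def worker_gradient(weights, shard):
    """Gradient of the mean squared error of y = w0 + w1 * x on one shard."""
    w0, w1 = weights
    g0 = g1 = 0.0
    for x, y in shard:
        error = (w0 + w1 * x) - y
        g0 += 2 * error
        g1 += 2 * error * x
    return [g0 / len(shard), g1 / len(shard)]


# Toy data generated from y = 1 + 2x, split evenly across three workers.
data = [(k / 10, 1 + 2 * k / 10) for k in range(30)]
shards = [data[i::3] for i in range(3)]

server = ParameterServer([0.0, 0.0], learning_rate=0.1)
for _ in range(2000):
    weights = server.pull()
    server.push([worker_gradient(weights, shard) for shard in shards])

print(server.weights)  # approaches [1.0, 2.0]
```

In a real deployment the workers and the server run as separate processes on different nodes, which is why only IP addresses or hostnames and ports need to be known.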
Deep Learning: Apache MXNet
› Distributed (Deep) Machine Learning Community (DMLC)
› “A flexible and efficient library for deep learning.”
› Amazon's framework of choice
› (TensorBoard support)
Deep Learning: Scaling Apache MXNet
› distributed KVStore
› multi CPU / GPU, multi node
› Infrastructure: SSH / MPI / YARN / SGE
› Hostfile with IP addresses / hostnames
T. Chen et al.: “MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems” (2015)
[Diagram labels: GPU 1, GPU 2; GPU 1, GPU 2]
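The push/pull semantics of a key-value store for gradients can be sketched in a few lines. ToyKVStore below is a made-up, single-process stand-in and not the MXNet API: values pushed from several devices under one key are summed, and an optional updater decides how the merged value changes what is stored.

```python
class ToyKVStore:
    """Single-process stand-in for a distributed key-value store."""

    def __init__(self):
        self._store = {}
        self._updater = None

    def init(self, key, value):
        self._store[key] = list(value)

    def set_updater(self, updater):
        """updater(stored, merged) returns the new stored value."""
        self._updater = updater

    def push(self, key, values):
        """Sum the per-device vectors element-wise, then store or update."""
        merged = [sum(column) for column in zip(*values)]
        if self._updater is None:
            self._store[key] = merged
        else:
            self._store[key] = self._updater(self._store[key], merged)

    def pull(self, key):
        return list(self._store[key])


kv = ToyKVStore()
kv.init("weight", [1.0, 1.0])

# Without an updater, the summed push replaces the stored value.
kv.push("weight", [[1.0, 2.0], [3.0, 4.0]])   # e.g. from GPU 1 and GPU 2
after_plain_push = kv.pull("weight")

# With an updater, the summed gradients drive a gradient step instead.
kv.set_updater(
    lambda stored, merged: [s - 0.1 * m for s, m in zip(stored, merged)]
)
kv.push("weight", [[1.0, 1.0], [1.0, 1.0]])
after_update = kv.pull("weight")

print(after_plain_push, after_update)
```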
Deep Learning: GPU Support with Kubernetes
› Install the device plugin
› Base image: nvidia/cuda
› Use GPU resources
resources:
  limits:
    nvidia.com/gpu: {{ $numGpus }}
Train Models: Distributed Machine Learning
3 ways to run Spark on k8s:
● Spark in standalone mode: https://github.com/helm/charts/tree/master/stable/spark
● Spark operator on Kubernetes: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
● Using spark-submit: https://spark.apache.org/docs/2.3.0/running-on-kubernetes.html
Train Models: Distributed Machine Learning
spark-submit:
● Spark creates a Spark driver running within a k8s pod.
● The driver creates executors running within k8s pods, connects to them, and executes application code.
https://spark.apache.org/docs/2.3.0/running-on-kubernetes.html
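A submission in cluster mode could be assembled as follows. The options mirror the example in the Spark 2.3 documentation linked above; the API server address and the image name are placeholders, and the command is only built and printed, not executed.

```python
import shlex


def spark_submit_command(api_server, image, main_class, app_jar,
                         executors=2, name="spark-job"):
    """Assemble a spark-submit call that targets a Kubernetes cluster."""
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to talk to the Kubernetes API server.
        "--master", "k8s://" + api_server,
        # Cluster mode: the driver itself runs in a pod.
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
        # Number of executor pods the driver requests.
        "--conf", "spark.executor.instances=" + str(executors),
        # Image used for the driver and executor pods.
        "--conf", "spark.kubernetes.container.image=" + image,
        app_jar,
    ]


# Placeholder API server and image; class and jar follow the SparkPi
# example from the Spark documentation.
command = spark_submit_command(
    api_server="https://k8s-apiserver.example.com:6443",
    image="registry.example.com/spark:2.3.0",
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="local:///path/to/examples.jar",
    executors=5,
    name="spark-pi",
)
print(" ".join(shlex.quote(part) for part in command))
```

The local:// scheme indicates that the jar is already present inside the container image rather than on the submitting machine.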
EMQ Machine Learning Platform: Logging & Monitoring
Logging & Monitoring
[Logging diagram labels: collecting logs; buffering and transformation; database; frontend]
Logging & Monitoring
[Monitoring diagram labels: collecting metrics; database; frontend]
EMQ Machine Learning Platform: Metadata Management
Metadata… data about data
● about the environment
● about the data
● about the workflows
● about the models
● about the domain
● ...
EMQ Machine Learning Platform: Putting it all together
Outlook
› Platform hardening
› Adaptation and extension for new use cases
  › NLP / semantic search
  › IIoT
› Metadata
› Model management
› Wider adoption
Manage Data
Train Models
Evaluate Models
Deploy Models
Make Predictions
Monitor Predictions
The team… without whom none of this would exist
› Sebastian Schmidt
› Alexander Grizschancew
› Sebastian Jäger
› Alexander Lontke
› Julien Heitmann
› Marcel Hofmann
› Kevin Exel
› David Waidner
› Matthias Schwartz
› Stanislav Frolov
› David Schmidt
› Daniel Bäurer
› Nils Domrose
› Hans-Peter Zorn
› Stefan Igel
Thank you
Hans-Peter Zorn
Head of Machine Perception & [email protected]
Dr. Stefan Igel
Head of Big Data [email protected]