Big Data Technologies: An Overview
Prof. Dr. Jens Albrecht, jens.albrecht@th-nuernberg.de



Big Data Landscape 2016


System Architecture in Transition

Yesterday and today
▸ Structured data of moderate size (S-XL)
▸ "General purpose" RDBMS

Today and tomorrow
▸ Polystructured data ({json}, <xml/>) in all sizes (S-XXXL)
▸ Purpose-optimized specialists: appliance, in-memory RDBMS, Hadoop, streaming, NoSQL


Business Cases

Benefits of Big Data Technologies

Scalability

Throughput (Velocity)

Computing Power

Agility

Data Volume

Data Exploration

Schema-on-Read

Integration on Demand

Efficient Development

Data Virtualization

Real-time Decisions

Simplified Data Access

Advanced Analytics

Cost Efficiency


BI/Big Data Architecture

Data sources
▸ Classic data sources: OLTP systems
▸ Big data sources: documents, server logs, sensor data, social, clickstream, GPS

Data storage
▸ Enterprise DWH, data marts, in-memory RDBMS
▸ Data lake, NoSQL

Analysis
▸ Business intelligence
▸ Predictive analytics, operational analytics, exploratory analysis


Data Lake: Challenges

▸ Many heterogeneous data sources
▸ Many heterogeneous users
▸ Scalable on-the-fly integration that is easy to develop and operate


Big Data Processing

Pipeline stages: Data Sources → Ingest → Store → Process → Analyze → Access & Visualize

Tools shown in the diagram: Flume, Kafka, Sqoop, NiFi, Hadoop HDFS, NoSQL, Spark, MapReduce, Hive, Storm, Flink, Samza, Spark ML, Impala, Sparkling Water, H2O, Drill, Datameer, SOLR, Looker, Zeppelin

Governance: Atlas, Waterline, Ranger, Sentry


Batch vs. Stream

Batch: Data Producers (Clickstream, Transactions, Machine Logs, Sensor Data) → Batches of Data → Batch Processing

Stream: Data Producers (Clickstream, Transactions, Machine Logs, Sensor Data) → Streams of Data → Stream Processing
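A minimal sketch of the difference, in plain Python and without any framework: the batch function sees a complete, bounded data set and returns one result, while the streaming function consumes events one at a time and emits an updated result after each event. The readings and function names are illustrative assumptions, not taken from the slides.

```python
from typing import Iterable, Iterator, List


def batch_total(events: List[int]) -> int:
    """Batch: the whole data set is available before processing starts."""
    return sum(events)


def streaming_totals(events: Iterable[int]) -> Iterator[int]:
    """Stream: process each event on arrival and emit a running total."""
    total = 0
    for value in events:
        total += value
        yield total


# Hypothetical sensor readings, used only for illustration.
readings = [3, 5, 2, 7]

batch_result = batch_total(readings)
stream_results = list(streaming_totals(iter(readings)))
```

The last value emitted by the streaming version equals the batch result; the difference is that intermediate results are available while data is still arriving.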


Batch Processing & Analysis


Hadoop

Batch processing framework: YARN + MapReduce on top of distributed storage (HDFS)

Components
▸ HDFS, HBase
▸ YARN, ZooKeeper
▸ MapReduce, Hive, Pig, Sqoop, …

Strengths and limits
▸ Mature base technology
▸ Extensive ecosystem
▸ Broad compatibility
▸ MapReduce is slow and cumbersome


Apache Spark – Swiss Army Knife of Big Data

☛ Agility and scalability with and without Hadoop
▸ Efficient development through a powerful API (largely the same across Scala, Java, and Python)
▸ In-memory execution and SQL-like query optimization
▸ Unified system for batch and stream processing

Workloads: Batch Processing, Machine Learning, Data Streaming, Graph Processing, SQL

Languages: Java, Python, Scala, R


MapReduce vs. Spark

Hadoop MapReduce word count in Java (listing not shown)

… and in Spark:

# `spark` is assumed to be an existing SparkContext
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Source: https://spark.apache.org/examples.html
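For readers without a cluster at hand, the three steps of the Spark pipeline (flatMap, map, reduceByKey) can be mimicked in plain Python. This is only a local, single-process sketch of the data flow. The helper name `word_count` and the sample lines are assumptions for illustration, and no Spark API is involved.

```python
from collections import defaultdict
from typing import Dict, Iterable


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # flatMap: split every line into words and flatten the result
    words = (word for line in lines for word in line.split(" "))
    # map: turn each word into a (word, 1) pair
    pairs = ((word, 1) for word in words)
    # reduceByKey: sum the values of all pairs that share a key
    counts: Dict[str, int] = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


# Hypothetical input, standing in for the lines of a text file.
sample_lines = ["big data", "big spark"]
result = word_count(sample_lines)
```

In Spark the same three steps run in parallel across partitions, and the reduce step combines partial sums per key; the sketch only shows the logical order of operations.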


Spark RDDs and DataFrames

http://de.slideshare.net/databricks/spark-sql-deep-dive-melbroune


SQL for Big Data

▸ Hive (native Hadoop): HiveQL on MR / Tez on HDFS
▸ Hadoop SQL engines: distributed SQL engine on HDFS
▸ Format-agnostic SQL engines: distributed SQL engine on HDFS, NoSQL, and Hive
▸ RDBMS with Hadoop access: RDBMS on relational storage plus Hadoop

Products named in the diagram: Stinger, Big Insights


Databases as a Lego Kit!?

Classic monolithic system: SQL on top of a single stack
▸ SQL processor
▸ Distributed execution
▸ Storage management
▸ Data dictionary

Building-block kit: SQL on top of interchangeable layers
▸ SQL processor
▸ Execution engine: MapReduce, Spark
▸ File formats and storage: CSV, Parquet, Kudu, JSON, Avro, ???

Characteristics of the kit
▸ Generic execution engine
▸ Metadata sharing via the Hive repository (metastore) or self-describing file formats
▸ Operator push-down through intelligent files
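The idea of operator push-down through intelligent files can be sketched as follows. The sketch assumes a file split into chunks that each carry minimum and maximum statistics, in the spirit of the per-row-group statistics that formats such as Parquet keep. The `Chunk` class, the helper names, and the data are hypothetical and do not correspond to any real file-format API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """A block of values plus the metadata a self-describing file would carry."""
    values: List[int]
    min_value: int
    max_value: int


def make_chunk(values: List[int]) -> Chunk:
    return Chunk(values=values, min_value=min(values), max_value=max(values))


def scan_greater_than(chunks: List[Chunk], threshold: int) -> Tuple[List[int], int]:
    """Return matching values and the number of chunks actually read."""
    matches: List[int] = []
    chunks_read = 0
    for chunk in chunks:
        # Push-down: decide from metadata alone whether the chunk can match.
        if chunk.max_value <= threshold:
            continue  # skipped without reading its values
        chunks_read += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, chunks_read


# Hypothetical data file split into three chunks.
file_chunks = [make_chunk([1, 4, 9]), make_chunk([12, 15]), make_chunk([20, 31, 25])]
matches, chunks_read = scan_greater_than(file_chunks, threshold=14)
```

Because the filter is evaluated against the chunk metadata first, the first chunk is never read. Any execution engine that understands the file format can reuse this behavior, which is what makes the storage layer interchangeable.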


Streaming


Requirements for Streaming Frameworks

▸ Low latency
▸ High throughput: scale-out, absorb backpressure
▸ Fault tolerant: no message lost, exactly-once delivery, preserve order
▸ Powerful computation model and API
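To make the delivery requirement concrete, the following sketch contrasts a consumer that simply applies every delivered message with one that deduplicates by message id. Combining redelivery with deduplication is one common way to obtain an exactly-once effect on the result; it is not a description of how any particular framework implements its guarantee. The message ids and amounts are hypothetical.

```python
from typing import Iterable, Set, Tuple

Message = Tuple[str, int]  # (message id, amount)


def apply_at_least_once(messages: Iterable[Message]) -> int:
    """Naive consumer: redelivered messages are counted again."""
    total = 0
    for _msg_id, amount in messages:
        total += amount
    return total


def apply_effectively_once(messages: Iterable[Message]) -> int:
    """Deduplicating consumer: each message id changes the state only once."""
    seen: Set[str] = set()
    total = 0
    for msg_id, amount in messages:
        if msg_id in seen:
            continue  # duplicate caused by a retry, ignore it
        seen.add(msg_id)
        total += amount
    return total


# Hypothetical delivery log: "m2" is redelivered after a simulated failure.
delivered = [("m1", 10), ("m2", 5), ("m2", 5), ("m3", 1)]

naive_total = apply_at_least_once(delivered)
dedup_total = apply_effectively_once(delivered)
```

With at-least-once delivery no message is lost, but the naive total is too high because the retry is counted twice. The deduplicating consumer yields the correct total at the cost of keeping state about the messages it has already seen.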


Lambda and Kappa Architecture

Lambda
▸ Streaming data feeds two parallel paths
▸ Speed layer (Kafka, Storm) → speed table
▸ Batch layer (Hadoop, Spark) → batch table
▸ Serving layer combines the speed table and the batch table

Kappa
▸ Streaming data → message buffer and broker (Kafka) → stream processor (Flink, Spark) → serving layer
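How a Lambda serving layer might answer a query from both tables can be sketched in a few lines of Python. The sketch assumes a simple count per key, with the batch table covering all events up to the last batch run and the speed table covering everything since. The function names and events are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def build_batch_table(history: List[str]) -> Dict[str, int]:
    """Batch layer: recompute counts over all events up to the last batch run."""
    return dict(Counter(history))


def build_speed_table(recent: List[str]) -> Dict[str, int]:
    """Speed layer: incrementally count events that arrived after the batch run."""
    table: Dict[str, int] = {}
    for key in recent:
        table[key] = table.get(key, 0) + 1
    return table


def serve(key: str, batch_table: Dict[str, int], speed_table: Dict[str, int]) -> int:
    """Serving layer: merge both views at query time."""
    return batch_table.get(key, 0) + speed_table.get(key, 0)


# Hypothetical page-view events.
history = ["home", "cart", "home"]  # already covered by the batch run
recent = ["home", "checkout"]       # arrived since the batch run

batch_table = build_batch_table(history)
speed_table = build_speed_table(recent)
home_views = serve("home", batch_table, speed_table)
checkout_views = serve("checkout", batch_table, speed_table)
```

In a Kappa architecture the batch path disappears: a single stream processor maintains the one table the serving layer reads, and reprocessing is done by replaying the stream from the broker.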


Streaming Frameworks

                      Storm           Flink            Spark Streaming
Delivery guarantees   at least once   exactly once     exactly once
Latency               very low        low              high
Throughput            medium          high             high
Processing model      stream          stream           micro-batch
Resource management   YARN            YARN             YARN
Functionality         stream-only     stream & batch   stream & batch

Sources:
https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared
https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/
http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
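The two processing models in the comparison can be illustrated with a small sketch. It assumes a trivial operation (doubling each value) and, for simplicity, forms micro-batches by event count, whereas Spark Streaming forms them by time interval. All names and values are hypothetical.

```python
from typing import Iterable, Iterator, List


def per_event(events: Iterable[int]) -> Iterator[int]:
    """Stream model: one output per incoming event."""
    for event in events:
        yield event * 2


def micro_batch(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Micro-batch model: buffer events, then process each small batch at once."""
    buffer: List[int] = []
    for event in events:
        buffer.append(event)
        if len(buffer) == batch_size:
            yield [e * 2 for e in buffer]
            buffer = []
    if buffer:  # flush the incomplete last batch
        yield [e * 2 for e in buffer]


events = [1, 2, 3, 4, 5]  # hypothetical event values
stream_out = list(per_event(events))
batch_out = list(micro_batch(events, batch_size=2))
```

In the micro-batch version an event is not processed until its batch is complete, which is the source of the added latency; processing many events together is in turn what allows high throughput.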


Conclusion


Decisions

▸ Volume / capacity vs. throughput / latency
▸ Consistency vs. availability / performance
▸ SQL vs. NoSQL (API)
▸ On-premise vs. cloud
▸ Batch vs. stream
▸ Structure vs. agility


Conclusion

1. Scalability is a given; agility and efficiency in development and operations are what is needed
2. Open source effect 1: Big data technologies are in constant flux and will remain so
3. Open source effect 2: Open interfaces and broad compatibility
4. Open source effect 3: Powerful tools for little money


Thank you!