Big Data Technologies - BICCnet...

Prof. Dr. Jens Albrecht

Big Data Technologies: An Overview


Big Data Landscape 2016


System Architecture in Transition

Yesterday and today

Structured data, moderate size (S-XL)

"General purpose" RDBMS

Today and tomorrow

Polystructured data in all sizes (S-XXXL): {json}, <xml/>

Purpose-optimized specialists: In-Memory, RDBMS, Hadoop, Streaming, NoSQL, Appliance


Business Cases

Benefits of Big Data Technologies

Scalability

Throughput (Velocity)

Computing Power

Agility

Data Volume

Data Exploration

Schema-on-Read

Integration on Demand

Efficient Development

Data Virtualization

Real-time Decisions

Simplified Data Access

Advanced Analytics

Cost Efficiency


BI/Big Data Architecture

Analysis: Business Intelligence, Predictive Analytics, Operational Analytics, Explorative Analysis

Data storage: Enterprise DWH, Data Marts, In-Memory RDBMS, Data Lake, NoSQL

Data sources:

Classic data sources: OLTP systems

Big data sources: documents, server logs, sensor data, social, clickstream, GPS


Data Lake: Challenges

Data Lake

Many heterogeneous data sources

Many heterogeneous users

Scalable on-the-fly integration, easy to develop and operate


Big Data Processing

Stages: Ingest, Store, Process, Analyze, Access & Visualize

Data Sources

Tools shown: Hadoop HDFS, Flume, Kafka, Spark, Map Reduce, Hive, Storm, Spark ML, Impala, Sparkling Water, SOLR, Looker, H2O, Zeppelin, Flink, Sqoop, Drill, Datameer, NoSQL, Nifi, Samza

Governance: Atlas, Waterline, Ranger, Sentry


Batch vs. Stream

Data producers: Clickstream, Transactions, Machine Logs, Sensor Data

Batch processing: Data Producers → Batches of Data → Batch Processing

Stream processing: Data Producers → Streams of Data → Stream Processing
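The contrast between the two processing styles can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the function names and the sample sensor readings are made up, and no real framework is involved. The batch function sees the complete data set and returns one result; the stream function emits an updated result after every incoming event.

```python
from typing import Iterable, Iterator, List


def batch_total(events: List[float]) -> float:
    """Batch: the whole data set is available before processing starts."""
    return sum(events)


def stream_totals(events: Iterable[float]) -> Iterator[float]:
    """Stream: emit an updated running result after every incoming event."""
    running = 0.0
    for value in events:
        running += value
        yield running


# Hypothetical sample data standing in for sensor readings.
sensor_readings = [1.0, 2.0, 3.5]

print(batch_total(sensor_readings))          # a single result at the end
print(list(stream_totals(sensor_readings)))  # one result per event
```

The stream version never needs the full list in memory, which is why it also works on unbounded inputs.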



Batch Processing & Analysis


Hadoop

Batch processing framework

YARN + Map Reduce

Distributed Storage (HDFS)

Components

▸ HDFS, HBase

▸ YARN, Zookeeper

▸ Map Reduce, Hive, Pig, Sqoop, …

Strengths and limitations

▸ mature base technology

▸ extensive ecosystem

▸ broad compatibility

▸ MapReduce is slow and cumbersome


Apache Spark – Swiss Army Knife of Big Data

☛ Agility and scalability with and without Hadoop

▸ Efficient development through a powerful API (identical for Scala, Java, Python)

▸ In-memory execution and SQL-like query optimization

▸ Unified system for batch and stream processing

Apache Spark

Workloads: Batch Processing, Machine Learning, Data Streaming, Graph Processing, SQL

Languages: Java, Python, Scala, R


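Map Reduce vs. Spark

Hadoop MapReduce word count in Java (code not reproduced here)

… and in Spark:

    from pyspark import SparkContext

    # In the original example, `spark` is a SparkContext; the app name is an assumption.
    spark = SparkContext(appName="WordCount")

    file = spark.textFile("hdfs://...")
    # Split lines into words, pair each word with 1, then sum the counts per word.
    counts = file.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs://...")

Source: https://spark.apache.org/examples.html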
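To make the word count comparison concrete, the following plain-Python sketch mimics the three phases of the MapReduce model (map, shuffle, reduce) on a single machine. It is not Hadoop code and not the Java example from the slide; the function names and the sample input are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all values that belong to the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the grouped values per key into one count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


print(word_count(["to be or not", "to be"]))
```

In the Spark version, `flatMap` and `map` play the role of the map phase, while `reduceByKey` covers both shuffle and reduce.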


Spark RDDs and DataFrames

http://de.slideshare.net/databricks/spark-sql-deep-dive-melbroune


SQL for Big Data

Hive (native Hadoop): HiveQL on MR / Tez on HDFS

Hadoop SQL engines: distributed SQL engine on HDFS

Format-agnostic SQL engines: distributed SQL engine on HDFS, NoSQL, Hive

RDBMS with Hadoop access: relational RDBMS plus Hadoop

Products named on the slide: Stinger, Big Insights


Databases as a Lego Construction Kit!?

Classic monolithic system: SQL on top of a SQL processor, distributed execution, storage management, and data dictionary, all in one system

Construction kit: SQL on top of a SQL processor, an execution engine (MapReduce, Spark), and a storage layer (CSV, Parquet, Kudu, JSON, Avro, ???)

• Generic execution engine
• Metadata sharing via the Hive repository or self-describing file formats
• Operator push-down through intelligent files
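The construction-kit idea can be sketched as a tiny generic engine with pluggable format readers. This is a toy illustration, not how Hive, Spark, or Parquet are implemented: the reader functions, the `READERS` registry, and the sample data are all hypothetical. Both formats are self-describing (CSV header, JSON keys), and the filter is handed to the reader, which stands in for operator push-down.

```python
import csv
import io
import json


def read_csv(text, predicate=None):
    """CSV reader: column names come from the header row (self-describing)."""
    for row in csv.DictReader(io.StringIO(text)):
        if predicate is None or predicate(row):
            yield row


def read_json_lines(text, predicate=None):
    """JSON-lines reader: field names come from each record's keys."""
    for line in text.splitlines():
        if line.strip():
            row = json.loads(line)
            if predicate is None or predicate(row):
                yield row


# Hypothetical registry: the engine only knows this interface, not the formats.
READERS = {"csv": read_csv, "jsonl": read_json_lines}


def select(fmt, text, columns, where=None):
    """Generic engine: projects columns; the filter is pushed into the reader."""
    for row in READERS[fmt](text, predicate=where):
        yield {c: row[c] for c in columns}


csv_data = "name,city\nAda,Berlin\nBob,Munich\n"
json_data = '{"name": "Cy", "city": "Berlin"}\n{"name": "Di", "city": "Hamburg"}\n'


def in_berlin(row):
    return row["city"] == "Berlin"


print(list(select("csv", csv_data, ["name"], in_berlin)))
print(list(select("jsonl", json_data, ["name"], in_berlin)))
```

Adding a new storage format only requires registering another reader; the query side stays unchanged.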



Streaming


Requirements for Streaming Frameworks

Low latency

High throughput: scale-out, absorb backpressure

Fault tolerant: no message lost, exactly-once delivery, preserve order

Powerful computation model and API
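The "no message lost" and "exactly-once" requirements pull in different directions: redelivering after a failure avoids loss but can produce duplicates. One common way to get exactly-once results on top of at-least-once delivery is an idempotent consumer that remembers message ids. The sketch below illustrates that idea only; the class names and messages are made up, and real frameworks typically rely on checkpointing or transactions rather than this simple in-memory set.

```python
class NaiveSink:
    """Counts every delivery, so redelivered messages are counted twice."""

    def __init__(self):
        self.total = 0

    def process(self, msg_id, amount):
        self.total += amount
        return True


class DedupSink:
    """Applies each message id at most once, so redelivery is harmless."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def process(self, msg_id, amount):
        if msg_id in self.seen_ids:
            return False  # duplicate delivery, ignore
        self.seen_ids.add(msg_id)
        self.total += amount
        return True


# Simulated at-least-once delivery: message 2 arrives twice after a lost ack.
delivered = [(1, 10), (2, 20), (2, 20), (3, 5)]

naive, dedup = NaiveSink(), DedupSink()
for msg_id, amount in delivered:
    naive.process(msg_id, amount)
    dedup.process(msg_id, amount)

print(naive.total)  # 55, the duplicate is counted twice
print(dedup.total)  # 35, each message is applied once
```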


Lambda and Kappa Architecture

Lambda

Streaming Data → Speed Layer (Kafka, Storm) → Speed Table

Streaming Data → Batch Layer (Hadoop, Spark) → Batch Table

Serving Layer: Speed Table + Batch Table

Kappa

Streaming Data → Message Buffer and Broker (Kafka) → Stream Processor (Flink, Spark) → Serving Layer
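How a Lambda serving layer answers a query from a batch table and a speed table can be sketched as follows. This is a simplified, single-process illustration; the function names and the page-view events are assumptions, and in practice the two layers would run on separate systems such as those named on the slide.

```python
from collections import Counter


def batch_view(events):
    """Batch layer: recomputed periodically over the full historical data."""
    return Counter(e["page"] for e in events)


def speed_view(events):
    """Speed layer: incremental counts for events the batch view lacks."""
    view = Counter()
    for e in events:
        view[e["page"]] += 1
    return view


def serve(batch_table, speed_table, page):
    """Serving layer: merge both tables to answer a query."""
    return batch_table[page] + speed_table[page]


# Hypothetical page-view events.
historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
recent = [{"page": "/home"}, {"page": "/checkout"}]

batch_table = batch_view(historical)
speed_table = speed_view(recent)

print(serve(batch_table, speed_table, "/home"))      # 3
print(serve(batch_table, speed_table, "/checkout"))  # 1
```

In a Kappa setup there would be only one code path: the stream processor builds the serving table and historical data is reprocessed by replaying it from the message broker.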


Streaming Frameworks

                     Storm          Flink           Spark Streaming
Delivery Guarantees  at least once  exactly once    exactly once
Latency              very low       low             high
Throughput           medium         high            high
Processing Model     stream         stream          micro-batch
Resource Management  YARN           YARN            YARN
Functionality        stream-only    stream & batch  stream & batch

https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared

https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/

http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
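The "Processing Model" row explains much of the latency difference: a stream model handles each event on arrival, while a micro-batch model buffers events and processes them in small groups. The sketch below illustrates this with plain Python. It assumes count-based batches for simplicity, whereas a real micro-batch engine typically cuts batches by time interval; the function names and event values are made up.

```python
def per_record(events, handle):
    """Stream model: each event is handled as soon as it arrives."""
    return [handle([event]) for event in events]


def micro_batch(events, handle, batch_size):
    """Micro-batch model: events are buffered and handled in small groups."""
    results = []
    for start in range(0, len(events), batch_size):
        results.append(handle(events[start:start + batch_size]))
    return results


events = [3, 1, 4, 1, 5]

print(per_record(events, sum))      # [3, 1, 4, 1, 5], one result per event
print(micro_batch(events, sum, 2))  # [4, 5, 5], one result per batch
```

With micro-batching, the first event of a batch waits until the batch is complete, which adds latency but lets the engine amortize overhead across the group.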



Conclusion


Decisions

Volume / capacity vs. throughput / latency

Consistency vs. availability / performance

SQL vs. NoSQL (API)

On-premise vs. cloud

Batch vs. stream

Structures vs. agility


Conclusion

1. Scalability is a given; agility and efficiency in development and operations are required

2. Open source effect 1: Big data technologies are in constant flux and will remain so

3. Open source effect 2: Open interfaces and broad compatibility

4. Open source effect 3: Powerful tools for little money



Thank you!
