97
Ein Unternehmen der Daimler AG Lecture @DHBW: Data Warehouse 05 Data Engineering Andreas Buckenhofer

Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

  • Upload
    others

  • View
    10

  • Download
    3

Embed Size (px)

Citation preview

Page 1: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Ein Unternehmen der Daimler AG

Lecture @DHBW: Data Warehouse

05 Data Engineering

Andreas Buckenhofer

Page 2: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Daimler TSS GmbH

Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99

[email protected] / Internet: www.daimler-tss.com

Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle

© Daimler TSS I Template Revision

Andreas BuckenhoferSenior DB Professional

Since 2009 at Daimler TSS

Department: Machine Learning Solutions

Business Unit: AnalyticsDHBWDOAG

xing

Contact/Connect

vcard

• Oracle ACE Associate

• DOAG responsible for InMemory DB

• Lecturer at DHBW

• Certified Data Vault Practitioner 2.0

• Certified Oracle Professional

• Certified IBM Big Data Architect

• Over 20 years experience with

database technologies

• Over 20 years experience with Data

Warehousing

• International project experience

Page 3: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Daimler TSS Data Warehouse / DHBW 3

Change Log

Date Changes

30.10.2019 Initial version

Page 4: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Daimler TSS Data Warehouse / DHBW 4

What you will learn today

• Understand concepts behind

• ETL & ELT

• Data integration

• Data Engineering

• Data Pipelines

Page 5: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Warehouse /

DHBWDaimler TSS 5

Data Engineering

Page 6: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Over 75%• of time is spent for

• say they least enjoy

DATA PREPARATION

Data Consumers

Page 7: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

The machine learning pipeline

Daimler TSS Data Warehouse / DHBW 7

Page 8: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data scientist vs Data Engineering vs software engineering

Source: 2018 Enterprise Almanah

Daimler TSS Data Warehouse / DHBW 8

Page 9: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Scientist vs Data Engineer

Source: https://www.oreilly.com/ideas/data-engineers-vs-data-scientistsDaimler TSS Data Warehouse / DHBW 9

Page 10: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Standard Data Warehouse architecture

Data Warehouse

FrontendBackend

External data sources

Internal data sources

Staging

Layer

(Input

Layer)

OLTP

OLTP

Core

Warehouse

Layer

(Storage

Layer)

Mart Layer

(Output

Layer)

(Reporting

Layer)

Integration

Layer

(Cleansing

Layer)

Aggregation

Layer

Metadata Management

Security

DWH Manager incl. Monitor

Daimler TSS Data Warehouse / DHBW 10

? ? ? ?

Daimler TSS Data Warehouse / DHBW 10

?

Page 11: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

ETL Process

Extract – Transform - Load

Other term: Data integration (better, more neutral)

Similar: Data Engineering, Data Pipelines

Daimler TSS Data Warehouse / DHBW 11

Page 12: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Tasks of the ETL Process - Extract

• Capture and copy data from source systems (e.g. operational systems) or from source table within DWH

• Many different types of sources

• Relational, hierarchical DBMSs

• Flat files

• NoSQL DBs

• Other internal/external sources … CSV, XML, JSON

Daimler TSS Data Warehouse / DHBW 12

Page 13: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Tasks of the ETL Process - Transform

Most complex part of data integration

• Filter data

• Integrate data

• Check and cleanse data

Daimler TSS Data Warehouse / DHBW 13

Page 14: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Tasks of the ETL Process - Load

Insert data into the target table

Daimler TSS Data Warehouse / DHBW 14

Page 15: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

ETL vs ELT

ETL often used for data integration in general (for ETL and ELT)

But: if ELT is mentioned, it is differentiated from ETL

Source

DB

Target

DB

ETL Server

Source

DB

Target

DB

ELT Server

Data flow

Daimler TSS Data Warehouse / DHBW 15

Page 16: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

ETL vs ELT

ETL ELT

Data is transferred to ETL server and transferred

back to DB. High network bandwidth required

Data remains in the DB except for cross

Database loads (e.g. source to target)

Transformations are performed in the ETL Server Transformations are performed (in the source

or) in the target

Proprietary code is executed in the ETL server Generated code, e.g. SQL, PL/SQL, SQLT

Typically used for

• source to target transfer

• Compute intensive transformations

• Small amount of data

Typically used for

• High amounts of data

Daimler TSS Data Warehouse / DHBW 16

Page 17: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

ETL/ELT tool vs manual ETL/ELT

ETL Tool Manual ETL

Informatica, Talend, Oracle ODI, etc. SQL, PL/SQL, SQLT, etc.

Separate license No additional license

Workflow, error handling, and restart/recovery

functionality included

Workflow, error handling, and restart/recovery

functionality must be implemented manually

Impact analysis and where-used (lineage)

functionality available

Impact analysis and where-used (lineage)

functionality difficult

Faster development, easier maintenance Slower development, more difficult maintenance

Additional (Tool-) Know How required Know How often available

Daimler TSS Data Warehouse / DHBW 17

Page 18: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

ETL/ELT tool vs manual ETL/ELT

Extract servicesLoad

services

Operations management services

Scheduler Control Repository Management

Connectors

Sorter

Connector

Sorter

Bulk Loader

Data Profiling servicesSource analysis

Data Quality servicesData cleansing

Data Transformation and Integration services

Data mapping Business rules

Slowly Changing Dimensions

Datatype conversion

Lookups

Job Monitoring Auditing Error Handling

Security

Daimler TSS Data Warehouse / DHBW 18

Page 19: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Mapping - Informatica

Source Target

Filter

Lookup

Daimler TSS Data Warehouse / DHBW 19

Page 20: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Mapping with Transformations - Informatica

Sorter

Aggregator Transformation

Union Transformation

Daimler TSS Data Warehouse / DHBW 20

Page 21: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Mapping

• Specification between source and target columns

• Source tables + columns

• Target table + columns

• Join rules

• Filter criteria

• Transformation rules

Daimler TSS Data Warehouse / DHBW 21

Page 22: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Workflow - Informatica

Decision & coordination step

Session containing Mapping

Daimler TSS Data Warehouse / DHBW 22

Page 23: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Job monitoring - Informatica

Daimler TSS Data Warehouse / DHBW 23

Page 24: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Extract

Page 25: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data change detection

• Extracts from source systems

• Initial extract for setting up the data warehouse

• Initial Load

• Periodical extracts for adding new/changed information to the data warehouse

• Incremental Load

• Question: How to determine what is new or what has changed in the source systems?

Task of data change detection („monitoring“)

Daimler TSS Data Warehouse / DHBW 25

OLTP OLAP

OLTP OLAP

Page 26: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data change detection: net effect of changes

• Discovery of all changes vs. determining the net effect at extract/load time only

• Example: an attribute value can get changed in two ways:

• by one update operation

• by one delete and one insert operation

• The net effect of both is the same

• However, history information is lost if the net effect is recorded only

Daimler TSS Data Warehouse / DHBW 26

5 UPDATE 6

5 DELETE INSERT 6

Page 27: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Exercise Data change detection

• Which techniques can be used to identify changes in a source system (RDBMS)?

• E.g. in OLTP system

• new products are inserted

• customer address changes

• Product is deleted because it is out of stock

• How would you identify such changes? List advantages / disadvantages of possible solutions

• Think about making changes in the source system. Think also about other solutions without any change in the source system.

Daimler TSS Data Warehouse / DHBW 27

Page 28: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data change detection techniques

• Depend on characteristics of the data sources

• The following techniques are based on modern relational DBMS

• Types of techniques

• Based on DBMS

• Trigger-based

• Log-based discovery

• Controlled by application

• Timestamp-based discovery

• Snapshot-based discovery

Daimler TSS Data Warehouse / DHBW 28

Page 29: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Trigger-based

• Active monitoring mechanisms

• Based on (database) triggers

• Example:

• If new record is inserted in sales transaction table then insert transaction id and timestamp in change table

• Advantage:

• Triggers do not change operational applications

• Disadvantage:

• Performance impact on operation systems if triggers are used extensively

• Triggers have to be implemented for every table in the source systems

Daimler TSS Data Warehouse / DHBW 29

Page 30: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Trigger-based

• Sample Trigger Code, Oracle

CREATE [OR REPLACE] TRIGGER <trigger_name>

{BEFORE|AFTER} {INSERT|DELETE|UPDATE}

ON <table_name>

[REFERENCING [NEW AS <new_row_name>] [OLD AS <old_row_name>]]

[FOR EACH ROW [WHEN (<trigger_condition>)]]

<trigger_body>

• Trigger is created for each source table in OLTP DB and stores insert/update/delete changes in a “log/journal table”

• trigger body contains insert statements into log/journal table

Daimler TSS Data Warehouse / DHBW 30

Page 31: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Log-based

• Log-based discovery

• Also often referenced as CDC (Change Data Capture)

• Usage of database transaction logs to determine changes

• DBMSs write transaction logs in order to be able to undo partially executed transactions

• This information can be used to determine all changes

• Log reader identifies insert, update, delete, truncates and writes the changes as inserts into staging layer

• Transaction Log files can be transferred to other systems to avoid additional load on source systems

Daimler TSS Data Warehouse / DHBW 31

Page 32: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Daimler TSS Data Warehouse / DHBW 32

Transcation LogsRDBMS & NoSQL use transaction logs

Insert

Update

Commit

Delete

Create table

Create index

And others

OLTP

DB

Database

Files

Transaction

LogsLog Buffer

Data Buffer

commit

Checkpoint

Triggered

e.g. by time

or size

Essential for backup &

recovery; used for

replication & failover

Sequential

writes (and

reads for

most DBs)

Random

writes (and

reads for

most DBs)

Page 33: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Log-based (sample product architecture IIDR)

Fro

nte

nd

Standard

Reports

AdHoc

Reports

IIDR

ReplEngine

Source

Datastore

Source

OLTP

DB

IIDR ReplEngine

DWH

Datastore

DWH

DWH DB

Staging Layer

Core DWH Layer

Mart Layer

Transaction

Logs

Database

Files

initialincremental

Page 34: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Timestamp-based

• Timestamp-based discovery

• Every data item in a table is associated with timestamp information about its validity period

• Changed data can be determined from this timestamp information

Daimler TSS Data Warehouse / DHBW 34

Page 35: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Timestamp-based

• Sample customer table in OLTP

• Each table gets Change timestamp

• Delta process reads latest data only (e.g. ChangeTimestamp >= <yesterday>)

• Problem: it is not possible to identify deleted rows

CustomerID Name Department Change

Timestamp

1 Miller DWH 15.01.2015

17:00:01

2 Powell DB 22.03.2016

08:30:22

Daimler TSS Data Warehouse / DHBW 35

Page 36: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Snapshot-based

• Data comparison

• Comparison of snapshots of the operational data at different points in time

• Compute difference between two latest snapshots

• E.g. unload all data from a table into a file and diff newest file content with latest file content

• Can be very complex

• Sometimes the only possibility, for instance for legacy applications

• High performance impact on source

Daimler TSS Data Warehouse / DHBW 36

Page 37: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data change detection techniques comparison

Trigger-based Log-based

discovery

Timestamp-

based

discovery

Snapshot-

based

discovery

Performance

impact on source

system

High Low Medium High

Performance

impact on target

system

Low Low Low High

Load on network Low Low Low High

Data loss if

nologging

operations

No Yes No No

Daimler TSS Data Warehouse / DHBW 37

Page 38: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data change detection techniques comparison

Trigger-

based

Log-based

discovery

Timestamp-

based

discovery

Snapshot-

based

discovery

Identify

DELETE

operations

Yes Yes No Yes

Identify ALL

changes

(changes

between

extractions)

Yes Yes No No

Daimler TSS Data Warehouse / DHBW 38

Page 39: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Transport – direct access

Direct Access

• Source writes data into target or

• Target reads data from source

• Security concerns

• High coupling / dependencies

Source Target

Daimler TSS Data Warehouse / DHBW 39

Page 40: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Transport – file transfer

File transfer (or other transport medium)

• csv, json, xml, binary, etc

• Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise service bus), SOA (service oriented architecture), etc

• Often high amounts of data, therefore bulk transfer of compressed data most widely used

• Better decoupling of source and target

Source Targetfiles

Daimler TSS Data Warehouse / DHBW 40

Page 41: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Transform

Page 42: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data quality concerns all

Source: https://twitter.com/markmadsen/status/1059579065164738560?s=21

Daimler TSS Data Warehouse / DHBW 42

Page 43: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Marriott – Starwood integration of loyalty programs: feedback on social media

Sources: https://www.flyertalk.com/forum/starwood-starwood-preferred-guest/ + https://www.facebook.com/marriottrewards/ + https://twitter.com/MarriottRewards/with_repliesDaimler TSS Data Warehouse / DHBW 43

Page 44: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data quality issues

Source: https://flowingdata.com/2019/04/09/bad-data-from-a-faulty-sensor-on-the-boeing-737-max/

Daimler TSS Data Warehouse / DHBW 44

Page 45: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Prerequisite of transformation: Understanding The Data

• Profile Existing Data Sources, Extracted Data

• Analyze data structure, content, and quality

• Find data relationships across systems

• Often badly documented or missing foreign keys

• Uncover data issues that can affect subsequent transformation steps

• Missing values

• Duplicates

• Inconsistencies

Daimler TSS Data Warehouse / DHBW 45

Page 46: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

After 3 years and 9 months half the customer records in a database are incorrect

Half live of data

Source: NoCOUG 1/2019, Journal Bud Walker: The half-live of data, p. 20

Daimler TSS Data Warehouse / DHBW 46

Page 47: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Quality issues

CustomerNo Name Birthdate Age Gender Zip code

1 Miller, Tom 33.01.2001 15 M NULL

1 John Mayor 15.01.2001 15 M 98144

2 Mrs. Bush 31.10.1988 22 Q 00000

3 Martin 31.10.1988 22 M 75890

PK / Unique Key violated Data not uniform Not valid

Inconsistent Wrong value

Unknown / missing

FK violated

Daimler TSS Data Warehouse / DHBW 47

Page 48: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Quality issues and possible solutions in the source RDBMS

Issue Solution

Wrong data e.g. 31.02.2016 Proper data type definition

Wrong values, e.g. number out of range Check constraint

Missing values NOT NULL constraint

Violated references FOREIGN KEY constraint

Duplicates PRIMARY or UNIQUE KEY constraint

Inconsistent data ACID transactions, business logic, additional checks

Daimler TSS Data Warehouse / DHBW 48

Page 49: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Quality issues and possible solutions in the source RDBMS

Issue Solution

Wrong data e.g. 31.02.2016 Proper data type definition

Wrong values, e.g. number out of range Check constraint

Missing values NOT NULL constraint

Violated references FOREIGN KEY constraint

Duplicates PRIMARY or UNIQUE KEY constraint

Inconsistent data ACID transactions, business logic, additional checks

Daimler TSS Data Warehouse / DHBW 49

Page 50: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Quality issues: correct data in the source

Correcting the data

• In the source systems

• Common master data management across all operational applications

• Dedicated systems are “master” of e.g. customer data

• Correcting the data at the source is best approach but slow and often not feasible

Daimler TSS Data Warehouse / DHBW 50

Page 51: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Quality issues: workarounds in DWH

Correcting the data

• Automatically during ETL

• E.g., address of a customer if a correct reference table exists

• Manually after ETL is finished

• ETL stored bad data in error log tables or files

• ETL flags bad data (e.g. invalid)

Daimler TSS Data Warehouse / DHBW 51

Page 52: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Quality issues: missing data

• Column is null

• Reject data

• Use default values

• Missing values can represent

• an unknown value Iike date of birth of a customer

• a missing value like engine_id for a car (logical not null constraint)

• Dimension tables can include some dummy values:

DimensionTable_X Description

-1 Unknown

-2 MissingDaimler TSS Data Warehouse / DHBW 52

Page 53: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Quality issues: missing data

• Data is inaccuratee.g. wrong date 32.12.2015 or wrong number 55U

• Reject data

• Replace with value that represents „Invalid“

• Dimension tables can include some dummy values:

DimensionTable_X Description

-1 Unknown

-2 Missing

-3 InvalidDaimler TSS Data Warehouse / DHBW 53

Page 54: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Quality issues: conflicting data

• Data has conflicts, e.g. wrong postal code 80995 Stuttgart

• Reject data

• Replace one of the values with a value that represents „Invalid“ or with corrected valueWhich value to replace? Rules necessary

Daimler TSS Data Warehouse / DHBW 54

Page 55: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Quality issues: inconsistent data

• Data is inconsistent, e.g. unlikely high price for a product

• Can be discovered by statistical and data mining methods

Daimler TSS Data Warehouse / DHBW 55

Page 56: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Quality issues: duplicates

• Data is duplicated, e.g. „Martin Miller” vs “Miller, Martin” vs “M.Miller”

• Multiple representations for one entity

• Different keys

• Different encodings

• Duplicate detection can be very difficult / tricky

• Products are available for e.g. address duplicate detection address validation (Kingstreet = does this address actually exist?)address harmonization (Kingstr, Kingstreet, King Street, etc)

• Standardize / Harmonize data during ETL flow: “unification”

Daimler TSS Data Warehouse / DHBW 56

Page 57: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Transform - Unification of data

• Unification of data types

• Character string date „20.01.2006“ 20.01.2006

• Character string number „12345“ 12345

• Unification of encodings

• For instance for gender F and M

• Lookup-tables contain the mapping from old to new encodings

• Combination of different attributes to one attribute

• day, month, year date

Daimler TSS Data Warehouse / DHBW 57

Page 58: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Transform - Unification of data

• Split of one attribute into two or more

• Name first name, last name (“Herr Prof. Dr. Hans M. vom und zum Stein”)

• Unification of names can become very challenging “Herr Prof. Dr. Hans M. vomund zum Stein” or “Werner Martin” or “Mariae Gloria … Wilhelmine HubertaGräfin von Schönburg-Glauchau“

• Product name - „Cola, 0.33 l“ Product short name - „Cola“, size in liters - 0.33

Daimler TSS Data Warehouse / DHBW 58

Page 59: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Transform - Unification of data

• Unification of dates and timestamps

• Rules for representing incomplete date information If only month and year are known

• Dates and timestamps with regard to one specific timezoneImportant for multi-national organizationsUTC Coordinated Universal Time without daylight saving zone

• What can happen if clock is changed to wintertime if no UTC is used?- Update arrives at 02:15 in staging layer (CDC / log-based monitor)- Clock is changed to wintertime: -1h- Update of the same row arrives at 02:10 in staging layer (CDC / log-based)- How can batch load running the next night discover which update is the most recent one?

Daimler TSS Data Warehouse / DHBW 59

Page 60: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Transform - Unification of data

• Computation of derived values

• Profit = sales price – purchase price Without clear definition, different interpretations possible

• Net or gross sales price?

• Net or gross purchase price?

• Aggregations

• Revenue of the year computed from revenues of the dayWithout clear definition, different interpretations possible

• Calendar year?

• Fiscal year?

Daimler TSS Data Warehouse / DHBW 60

Page 61: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Load

Page 62: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Load

• Efficient load operations are important

• bulk load: Single row processing vs set based processing

• Online load

• Data warehouse (especially Data Mart) is still accessible

• Offline load

• Data warehouse (especially Data Mart) is offline

• For updates that require the recomputation of a cube

• Offline load is often a Tool limit because the Tool locks data structures. But

offline load could be faster.

Daimler TSS Data Warehouse / DHBW 62

Page 63: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Bulk processing

• Specific Bulk load operations provided by RDBMS, e.g. External tables in Oracle or LOAD command in DB2

• Single row vs set based processing

Single row processing Set based processing

Cursor curs = SELECT * FROM <source>

WHILE NOT EOF(curs)

FETCH NEXT ROW INTO myRoW;

INSERT INTO <target> VALUES(myRow);

LOOP

INSERT INTO <target>

SELECT * from <source>

Error handling easy All or nothing if there are errors

Slow for high amounts of data Performs well for small and high amounts of data

More coding Less code = less errors

Daimler TSS Data Warehouse / DHBW 63

Page 64: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Engineering/ETLSummary

Page 65: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Example for Data integration in Data Vault 2.0 architecture

Source: Hans Hultgren: Modeling the agile Data Warehouse with Data Vault, New Hamilton 2012, p. 224

Hard

Rules

only

Soft

Rules

Raw

Data Vault

Business

Data Vault

ETL (E)T(L) ETL

ETL, „M

on

ito

ring“

Daimler TSS Data Warehouse / DHBW 65

Page 66: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

ETL-Job parallelism for loading data into Core Warehouse Layer

HU

B lo

ad

ed

LIN

K u

nd

HU

B-

SA

Tlo

ad

ed

LIN

K-S

AT

load

ed

Da

ta V

au

lt

Lo

ad

Cla

ssic

al

Lo

ad

?

? ?

Integration of new JobsTime Windows for Loads, e.g 00:00-06:00

• Complex

• Many dependencies

• Many sequential jobs

• Systematic / Methodic

• Few, well defined dependencies

• Massive parallel

Daimler TSS Data Warehouse / DHBW 66

Page 67: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Standard Data Warehouse architecture

Data Warehouse

FrontendBackend

External data sources

Internal data sources

Staging

Layer

(Input

Layer)

OLTP

OLTP

Core

Warehouse

Layer

(Storage

Layer)

Mart Layer

(Output

Layer)

(Reporting

Layer)

Integration

Layer

(Cleansing

Layer)

Aggregation

Layer

Metadata Management

Security

DWH Manager incl. Monitor

Daimler TSS Data Warehouse / DHBW 67

? ? ? ??

Page 68: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Daimler TSS GmbH

Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99

[email protected] / Internet: www.daimler-tss.com

Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle

© Daimler TSS I Template Revision

Page 69: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Learn how to replace code

• Machine learning will no doubt change software development in significant ways

• Software developers will put much more effort into data collection and preparation

• Developers will have to do more than just collect data; they’ll have to build data pipelines and the infrastructure to manage those pipelines. We’ve called this “Data Engineering”

• Data engineers will be responsible for maintaining the data pipeline: ingesting data, cleaning data, feature engineering, and model discovery

Source: https://www.oreilly.com/ideas/what-machine-learning-means-for-software-development

Daimler TSS Data Warehouse / DHBW 69

Page 70: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Replication-based

• Replication techniques

• Data replication

• Target tables not necessarily on local system

• Uses typically Transaction Logs

• Log reader identifies insert, update, delete, truncates and writes the changes into replicated tables (insert remains insert, update remains update, etc)

• Useful for 1:1 copies (e.g. ODS, Operational Data Store) but still challenge to detect changes for loading the data mart

Daimler TSS Data Warehouse / DHBW 70

Page 71: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Replication-based (sample product architecture IIDR)

Fro

nte

nd

Standard

Reports

AdHoc

Reports

IIDR

ReplEngine

Source

Datastore

Source

OLTP

DBIIDR ReplEngine

DWH

Datastore

DWH

DWH DB

Staging Layer

Core Layer

Mart Layer

Transaction

Logs

Daimler TSS Data Warehouse / DHBW 71

Page 72: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Extraction intervals

Extraction intervals

• Periodically – in regular intervals

• Every day, week, etc.

• Instantly / Continuous

• Every change is directly propagated into the data warehouse

• „real time data warehouse“

• Depends on the requirements on timeliness of the data warehouse data

Daimler TSS Data Warehouse / DHBW 72

Page 73: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Extraction intervals

Triggered by a specific request

• Addition of a new product

• Query which involves more recent data

Triggered by specific events

• Number of changes in operational data exceeds threshold

Daimler TSS Data Warehouse / DHBW 73

Page 74: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data scientist sexiest job of the 21st century?

Source: https://www.simplilearn.com/what-skills-do-i-need-to-become-a-data-scientist-articleDaimler TSS Data Warehouse / DHBW 74

Page 75: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Exercise: load Data Vault table

Draw a flow diagram how to load a HUB, LINK and SAT table and describe the SQL statements

Daimler TSS Data Warehouse / DHBW 75

Page 76: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Exercise: load HUB table

Source

data exist

Load distinct

business keys

Does

business

Key exist in

HUB?

Insert row into

HUB

Conflict if PK

HashKey

collision!

no

Reject

data

Data loaded into

HUB

yes

Daimler TSS Data Warehouse / DHBW 76

Page 77: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Exercise: load HUB table

INSERT INTO core.fahrzeug (vehicle_hk, fin, loaddate, recordsource)

SELECT DISTINCT f.fahrzeug_hashkey

, f.fin_bk

, f.loaddate

, f.recordsource

FROM staging.fahrzeugdaten f

WHERE f.fin_bk NOT in (SELECT fin FROM core.hub_fahrzeug)

AND f.loaddate = <date to load>;

Daimler TSS Data Warehouse / DHBW 77

Page 78: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Exercise: load LINK table

Source

data exist

Load distinct

business keys

Does Hash

Key

relationship

exist in

HUB?

Insert row into

LINK

Conflict if PK

HashKey

collision!

no

Reject

data

Data loaded into

LINK

yes

Daimler TSS Data Warehouse / DHBW 78

Page 79: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Exercise: load LINK table

INSERT INTO core.link_verbaut (verbaut_hk, motor_hk, vehicle_hk, loaddate, recordsource)

SELECT DISTINCT h.verbaut_hk

, f.motor_hashkey

, f.fahrzeug_hashkey

, f.loaddate

, f.recordsource

FROM staging.fahrzeugdaten f

WHERE (f.motor_hashkey, f.fahrzeug_hashkey) NOT in (SELECT motor_hk, vehicle_hk FROM core.link_verbaut v)

AND loaddate = <date to load>;

Daimler TSS Data Warehouse / DHBW 79

Page 80: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Exercise: load SAT table

Source

data exist

Load

distinct

source

data

MD5-

HASH

Diff

identical?

Insert row into

SAT

no

Reject

data

Data loaded into

SAT

yes

Load

current/

latest row

from SAT

table

Daimler TSS Data Warehouse / DHBW 80

Page 81: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Exercise: load SAT table

INSERT INTO core.sat_fahrzeug_text (vehicle_hk, loaddate, recordsource, md5_hash, codeleiste, kommentar)

SELECT DISTINCT f.fahrzeug_hashkey

, f.loaddate

, f.recordsource

, f.md5hash

, f.codeleiste

, f.kommentar

FROM staging.fahrzeugdaten f

LEFT OUTER JOIN (select s.vehicle_hk, s.md5_hash from s_fahrzeug s JOIN (select i.VEHICLE_HK, max(i.loaddate) as loaddate from s_fahrzeug i GROUP BY i.VEHICLE_HK) m

ON s.vehicle_hk = m.vehicle_hk AND s.loaddate = m.loaddate) k ON f.fahrzeug_hashkey = k.vehicle_hk

WHERE (k.md5_hash is null OR f.md5hash <> k.md5_hash)

AND f.loaddate = <date to load>;

Daimler TSS Data Warehouse / DHBW 81

Page 82: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Bounded vs unbounded: Streaming ETL

Finite

Complete

consistent

Infinite

inomplete

inconsistent

Daimler TSS Data Warehouse / DHBW 82

Page 83: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Different Times in streaming ETL

• Event-time: the time the event was created in the origin

• Load-time (ingestion-time): the time the event was stored in a database or storage engine in general

• Processing-time: the time the event was consumed by e.g. a stream processor

t00:00 00:05 00:10 00:15 00:20 00:25

Daimler TSS Data Warehouse / DHBW 83

Page 84: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Window slices

t00:00 00:05 00:10 00:15 00:20 00:25

Daimler TSS Data Warehouse / DHBW 84

Page 85: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Window slices .. And late arrivals

t00:00 00:05 00:10 00:15 00:20 00:25

t = 00:01

Daimler TSS Data Warehouse / DHBW 85

Page 86: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Window slices

• Tumbling: Fixed size, gap-less

• Hopping: Fixed size, overlapping

• Session: Variable size, key-bound time

Daimler TSS Data Warehouse / DHBW 86

Page 87: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Window slicesTumbling: fixed-size, gap-less

t00:00 00:05 00:10 00:15 00:20 00:25

Daimler TSS Data Warehouse / DHBW 87

Page 88: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Advance

by

Window slicesHopping: fixed-size, overlapping

t00:00 00:05 00:10 00:15 00:20 00:25

Daimler TSS Data Warehouse / DHBW 88

Page 89: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Window slicesSession: variable size, key-bound timeout

t00:00 00:05 00:10 00:15 00:20 00:25

gap

of

Inactivity (timeout)

gap

of

Inactivity (timeout)Daimler TSS Data Warehouse / DHBW 89

Page 90: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Streaming tools

Source: https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/

Daimler TSS Data Warehouse / DHBW 90

Page 91: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Streaming ETLKafka Streams and Kafka KSQL

Source: https://docs.confluent.io/current/KSQL/docs/concepts/KSQL-and-kafka-streams.html

Daimler TSS Data Warehouse / DHBW 91

Page 92: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Streaming ETLKafka Streams and Kafka KSQL

Source: https://docs.confluent.io/current/KSQL/docs/concepts/KSQL-and-kafka-streams.html

Daimler TSS Data Warehouse / DHBW 92

Page 93: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

KSQL stream

The following example creates a stream that has three columns from the pageviews topic: viewtime, userid, and pageid

CREATE STREAM pageviews \

(viewtime BIGINT, \

userid VARCHAR, \

pageid VARCHAR) \

WITH (KAFKA_TOPIC='pageviews');

Daimler TSS Data Warehouse / DHBW 93

Page 94: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

KSQL table

This query computes the pageview count per region per minute:

CREATE TABLE pageviews_per_region_per_minute AS \

SELECT regionid, \

COUNT(*) \

FROM pageviews \

WINDOW TUMBLING (SIZE 1 MINUTE) \

GROUP BY regionid;

Daimler TSS Data Warehouse / DHBW 94

Page 95: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

KSQL joins

For example, to find orders that have shipped within the last hour from an orders stream and a shipments stream, you might run a query like:

SELECT o.order_id, o.total_amount, o.customer_name, s.shipment_id,

s.warehouse \

FROM new_orders o \

INNER JOIN shipments s \

WITHIN 1 HOURS \

ON o.order_id = s.order_id;

Daimler TSS Data Warehouse / DHBW 95

Page 96: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

ETL original vs general meaning

• ETL - Original meaning: load data from source systems into the Core Warehouse Layer

• General meaning: Loading data from source system into staging layer and/or between any layer

Daimler TSS Data Warehouse / DHBW 96

Page 97: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Buckenhofer-… · • csv, json, xml, binary, etc • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise

Data Warehouse /

DHBWDaimler TSS 97

Data pyramid and data quality

Source: By Matthew.viel - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=49310779 LinkedIn 11/2017: https://www.linkedin.com/feed/update/urn:li:activity:6334062387355746304

Daimler TSS Data Warehouse / DHBW 97