160
Ein Unternehmen der Daimler AG Lecture @DHBW: Data Warehouse 04 Modeling Andreas Buckenhofer

Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Bucken...13 Freak of Nature Anastacia Performer Founded Anastacia 1999 Pink Floyd 1964 CD_ID Album Founded Titels 11 Anastacia –Not

  • Upload
    others

  • View
    5

  • Download
    2

Embed Size (px)

Citation preview

  • Ein Unternehmen der Daimler AG

    Lecture @DHBW: Data Warehouse

    04 Modeling

    Andreas Buckenhofer

  • Daimler TSS GmbH

    Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99

    [email protected] / Internet: www.daimler-tss.com

    Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle

    © Daimler TSS I Template Revision

    Andreas BuckenhoferSenior DB Professional

    Since 2009 at Daimler TSS

    Department: Machine Learning Solutions

    Business Unit: AnalyticsDHBWDOAG

    xing

    Contact/Connect

    vcard

    • Oracle ACE Associate

    • DOAG responsible for InMemory DB

    • Lecturer at DHBW

    • Certified Data Vault Practitioner 2.0

    • Certified Oracle Professional

    • Certified IBM Big Data Architect

    • Over 20 years experience with

    database technologies

    • Over 20 years experience with Data

    Warehousing

    • International project experience

    mailto:[email protected]://twitter.com/ABuckenhoferhttps://de.linkedin.com/in/buckenhoferhttp://wwwlehre.dhbw-stuttgart.de/~buckenhofer/http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/https://www.doag.org/de/themen/datenbank/in-memory/http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/https://www.xing.com/profile/Andreas_Buckenhofer2https://apex.oracle.com/pls/apex/f?p=19297:3:16881272565360:https://www.doag.org/de/themen/datenbank/in-memory/http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/

  • Daimler TSS Data Warehouse / DHBW 3

    Change Log

    Date Changes

    08.03.2020 Initial version

  • Daimler TSS Data Warehouse / DHBW 4

    What you will learn today

    • Understand data modeling (logical models) and physical implementation (physical models)

    • Explain models for a DWH

    • Dimensional model

    • Data Vault model

    • Understand dimensions and facts

    • Understand ROLAP & MOLAP

  • Structuring Data

    Source: https://www.linkedin.com/feed/update/urn:li:activity:6507927816082845696/

    Daimler TSS Data Warehouse / DHBW 5

    https://www.linkedin.com/feed/update/urn:li:activity:6507927816082845696/

  • Conceptual – logical – physical level

    Different levels of abstraction:

    • Conceptual (domain) model

    • Focus on (main) entities and its business definitions!

    • No attributes

    • Logical data model

    • Relational data model (independent of a DBMS or technology)

    • Logic can't affect performance = no performance optimization on this level

    • Physical implementation

    • Representation of a data design for a specific DBMS

    • RDBMS are the closest to physical independanceDaimler TSS Data Warehouse / DHBW 6

  • Physical implementation in OLTP applications

    Requirements

    • Efficient update and delete operations

    • Efficient read operations

    • Avoid contradiction in the data – don’t store data twice or multiple times

    • Easy maintenance of the data model

    • As little redundancy as possible in the data model

    Daimler TSS Data Warehouse / DHBW 7

  • Codd‘s normal forms for DB relations: 1NF

    • First Normal Form (1NF):

    • A relation is in first normal form if

    • the domain of each attribute contains only atomic (simple, indivisible) values.

    • the value of any attribute in a tuple/row must be a single value from the domain of that attribute, i.e. no attribute values can be sets

    Daimler TSS Data Warehouse / DHBW 8

  • Codd‘s normal forms for DB relations: 1NF

    CD_ID Album Founded Titels

    11 Anastacia – Not that kind 1999 1. Not that kind, 2. I‘m outta love, 3 Cowboys & Kisses

    12 Pink Floyd – Wish you were here 1964 1. Shine on you crazy diamond

    13 Anastacia – Freak of Nature 1999 1. Paid my dues

    CD_ID Album Performer Founded Track Titels

    11 Not that kind Anastacia 1999 1 Not that kind

    11 Not that kind Anastacia 1999 2 I‘m outta love

    11 Not that kind Anastacia 1999 3 Cowboys & Kisses

    12 Wish you were here Pink Floyd 1964 1 Shine on you crazy diamond

    13 Freak of Nature Anastacia 1999 1 Paid my duesDaimler TSS Data Warehouse / DHBW 9

  • Codd‘s normal forms for DB relations: 2NF

    • Second Normal Form (2NF):

    • In 1st normal form

    • Every non-key attribute is fully dependent on the key. There are no dependencies between a partial key and a non-key field.

    Daimler TSS Data Warehouse / DHBW 10

  • Codd‘s normal forms for DB relations: 2NF

    CD_ID Album Performer Founded Track Titels

    11 Not that kind Anastacia 1999 1 Not that kind

    11 Not that kind Anastacia 1999 2 I‘m outta love

    11 Not that kind Anastacia 1999 3 Cowboys & Kisses

    12 Wish you were here Pink Floyd 1964 1 Shine on you crazy diamond

    13 Freak of Nature Anastacia 1999 1 Paid my duesCD_ID Track Titels

    11 1 Not that kind

    11 2 I‘m outta love

    11 3 Cowboys & Kisses

    12 1 Shine on you crazy diamond

    13 1 Paid my dues

    CD_ID Album Performer Founded

    11 Not that kind Anastacia 1999

    12 Wish you were here Pink Floyd 1964

    13 Freak of Nature Anastacia 1999Daimler TSS Data Warehouse / DHBW 11

  • Codd‘s normal forms for DB relations: 3NF

    • Third Normal Form (3FN):

    • In 2nd normal form

    • No functional dependencies between non key fields: a non-key attribute is dependent from a PK only

  • Codd‘s normal forms for DB relations: 3NF

    CD_ID Track Titels

    11 1 Not that kind

    11 2 I‘m outta love

    11 3 Cowboys & Kisses

    12 1 Shine on you crazy diamond

    13 1 Paid my dues

    CD_ID Album Performer Founded

    11 Not that kind Anastacia 1999

    12 Wish you were here Pink Floyd 1964

    13 Freak of Nature Anastacia 1999

    CD_ID Album Performer

    11 Not that kind Anastacia

    12 Wish you were here Pink Floyd

    13 Freak of Nature Anastacia

    Performer Founded

    Anastacia 1999

    Pink Floyd 1964

    Daimler TSS Data Warehouse / DHBW 13

  • Codd‘s normal forms - Summary from 1NF to 3NF

    CD_ID Track Titels

    11 1 Not that kind

    11 2 I‘m outta love

    11 3 Cowboys & Kisses

    12 1 Shine on you crazy diamond

    13 1 Paid my dues

    CD_ID Album Performer

    11 Not that kind Anastacia

    12 Wish you were here Pink Floyd

    13 Freak of Nature Anastacia

    Performer Founded

    Anastacia 1999

    Pink Floyd 1964

    CD_ID Album Founded Titels

    11 Anastacia – Not that kind 1999 1. Not that kind, 2. I‘m outta love, 3 Cowboys & Kisses

    12 Pink Floyd – Wish you were here 1964 1. Shine on you crazy diamond

    13 Anastacia – Freak of Nature 1999 1. Paid my dues

    n

    n

    Daimler TSS Data Warehouse / DHBW 14

  • Why database design?

    “Data modeling is the process of learning about the data, and regardless of technology,

    this process must be performed for a successful application.”

    • Learn about the data and promote collective data understanding

    • Derive security classification and measures

    • Design for performance

    • Accelerate development

    • Improve Software quality

    • Reduce maintenance costs

    • Generate code

    • NoSQL Schema-on-read: understand model versions after years

    Source quote: Steve Hoberman: Data Modeling for Mongo DB, Technics Publications 2014

    Daimler TSS Data Warehouse / DHBW 15

  • Importance of a good database design

    Daimler TSS Data Warehouse / DHBW 16

  • Measuring the quality of a data modeldata model scorecard

    Source: Steve Hoberman - Data Modeling Scorecard, Technics Publication 2015

    Daimler TSS Data Warehouse / DHBW 17

  • Exercise: use OLTP physical implementation for DWH?

    The diagram shows a typical OLTP data model

    • Customers and products have uniqueids and some descriptive attributes

    • A customer can place an order on a specific date

    • The order contains one or more products

    Daimler TSS Data Warehouse / DHBW 18

  • Exercise: use OLTP physical implementation for DWH?

    Now consider DWH requirements like non-volatile and time-variant data

    • Customer Bush marries and takesher husband’s last name

    • Product number 5 gets a priceincrease

    How would you solve such

    requirements in a data model

    for the Core Warehouse Layer?

    Daimler TSS Data Warehouse / DHBW 19

  • Exercise: use OLTP physical implementation for DWH?

    Possible solutions:

    • Add timestamp column as part of the primary key

    • For all tables, not only for specific tables (e.g. product, customer)

    • Composite keys can become inefficient and impractical

    • New tables with head and version data to avoid redundancy

    • Head table contains static data that does not change (e.g. customer id, birthdate)

    • Version table contains data that changes (e.g. last name, comments)

    • Store every change in log tables

    • Querying tables can become difficult and slow if history is required ("main" table + log tables)

    Daimler TSS Data Warehouse / DHBW 20

  • Bad physical implementations

    Source: Corr / Stagnitto: Agile Data Warehouse Design, DecisionOne Press, 2011, page 5

    Create a SQL statement for:

    How many

    "Order Transactions"

    have been created by

    "Person/Organisation"?

    Daimler TSS Data Warehouse / DHBW 21

  • Disadvantages of a direct physical implementation of 3NF for DWH

    • 3NF is inefficient for query processing

    • 3NF models are difficult to understand

    • 3NF gets even more complicated with history added

    • 3NF not suited for „new“ data sources (JSON, NoSQL, etc.)

    • DWH needs own data modeling approaches for the Core Warehouse Layer and the Mart Layer

    Daimler TSS Data Warehouse / DHBW 22

  • What are candidates for primary keys?

    Primary keysNatural keys, Sequences, Hash keys

    Natural Keys

    „intelligent“ keys that

    have a meaning to the

    business user

    VIN (vehicle identifier)

    ISO country codes, e.g.

    DE, US, UK

    Generated Keys

    System-generated,

    unique values, e.g.

    sequences (increments)

    1, 2, 3, 4, 5, etc.

    GUIDs (globally unique

    identifiers) contain e.g.

    MAC address +

    timestamp to make an

    identifier unique.

    Hash Keys

    (composite) Natural key

    run through a hash

    function, e.g.

    Md5(VIN)

    Daimler TSS Data Warehouse / DHBW 23

  • Natural keys

    Advantages Disadvantages

    Have a meaning: can be considered as master

    keys

    Varying length (can be short or very long)

    Same value across (OLTP) systems: valid across

    business processes

    Meaning can change over time, e.g. VIN

    standards changed

    Allow parallel loads in a DWH or Big Data system Can be composite (several fields) which would

    make joins slower and more complex

    [concatenation would be possible]

    Often sequence-driven in OLTP systems (e.g.

    customer number; collisions possible when

    integration into DWH is done)

    Daimler TSS Data Warehouse / DHBW 24

  • Generated keys

    Advantages Disadvantages

    Small byte size (sequences): less storage and

    faster joins

    Insert performance can be slow (hot spot on

    index)

    Always unique No business meaning

    Good B*Tree index clustering Data load into DWH cause lookups (Big Data

    systems often have no sequences but would fail

    performance-wise with *sequential* sequence

    generation)

    Daimler TSS Data Warehouse / DHBW 25

  • Hash keys

    Advantages Disadvantages

    Allow parallel loads in a DWH or Big Data system Computed value can be longer compared to

    natural keys

    Ability to join across platforms (e.g. RDBMS,

    NoSQL, Hadoop)

    Computed value should be stored as binary

    instead of char. Some systems only allow have

    char (e.g. Hadoop).

    Example: MD5 hash is binary(16) or char(32).

    Deterministic across systems or if data is

    reloaded

    Collisions may occur: collision strategy required

    Data is distributed Bad B*Tree index clustering

    Daimler TSS Data Warehouse / DHBW 26

  • Standard Data Warehouse architecture

    Data Warehouse

    FrontendBackend

    External data sources

    Internal data sources

    Staging

    Layer

    (Input

    Layer)

    OLTP

    OLTP

    Core

    Warehouse

    Layer

    (Storage

    Layer)

    Mart Layer

    (Output

    Layer)

    (Reporting

    Layer)

    Integration

    Layer

    (Cleansing

    Layer)

    Aggregation

    Layer

    Metadata Management

    Security

    DWH Manager incl. Monitor

    Daimler TSS Data Warehouse / DHBW 27

  • Data Models in the DWH

    Layer Characteristics Data Model

    Staging Layer Temporary storage

    Ingest of source data

    Normally 1:1 copy of source table structure –

    usually without constraints and indexes

    Core Warehouse

    Layer

    Historization / bitemporal data

    Integration

    Tool-independent

    Non-redundant data storage

    Historization

    3NF with historization

    Head and Version modelling

    Data Vault

    Anchor modeling

    Dimensional model with historization (possible)

    Data Mart Layer Performance for end user queries

    required, Tool-dependent

    Lots of joins necessary to answer

    complex questions

    Flat structures, esp. Dimensional model

    (ROLAP / MOLAP / HOLAP)

    Daimler TSS Data Warehouse / DHBW 28

  • Data Vault - Architecture, Methodology, Model

    Lecture:

    DWH Architectures

    Lecture:

    DWH Data Modeling

    Architecture

    • Multi-Tier

    • Scalable

    • Supports NoSQL

    Methodology

    • Repeatable

    • Measureable

    • Agile

    Model

    • Flexible

    • Hash based

    • Hub & Spoke

    Implementation: Automation,

    Pattern based, High speed

    Daimler TSS Data Warehouse / DHBW 29

  • Data Vault 2.0, definition by Dan Linstedt

    "Data Vault 2.0” is a system of business intelligence which includes: Modeling, Methodology, Architecture, and Implementation best practices. The components, also known as pillars of Data Vault 2.0 are identified as follows:

    • Data Vault 2.0 Modeling - Focused on Process and Data Models

    • Data Vault 2.0 Methodology – Following SCRUM and agile ways of working

    • Data Vault 2.0 Architecture – Includes NoSQL and big-data systems

    • Data Vault 2.0 Implementation – Pattern based automation and generation

    The term “Data Vault” is merely a marketing term chosen in 2001 to represent the system to the market. The true name for the Data Vault System of BI is: common foundational warehouse modeling, methodology, architecture, and implementation.

    Source: https://www.linkedin.com/pulse/defining-data-vault-10-20-business-dan-linstedt/

    Daimler TSS Data Warehouse / DHBW 30

    https://www.linkedin.com/pulse/defining-data-vault-10-20-business-dan-linstedt/

  • Data modeling: 3NF, Star schema, Data Vault

    Business Key

    Relationships

    Context

    and

    history

    3NF

    Star:

    Dimensions

    Star:

    Facts

    Data Vault:

    Hub

    Data Vault:

    Link

    Data Vault:

    Sat

    Vehicle identifier

    Manufacturer

    Model

    Type

    Plant

    Delivery Date

    Production Date

    Price

    Source system

    Load date/time

    Buyer

    Salesperson

    Engine

    Vehicle data

    Daimler TSS Data Warehouse / DHBW 31

  • Unique

    identification

    by

    Natural keys

    (Business Keys)

    HUB

    ,

  • Structure HUB tables

    Daimler TSS Data Warehouse / DHBW 33

  • HUB tables: typical characteristics

    Business Keys should be natural keys used by the business (e.g. Vehicle Identifier, Serial number)

    Business Keys should stand alone and have meaning to the business

    Business Keys should never change, have the same semantic meaning and the same granularity

    Focus on Business Keys (instead focus on source system surrogates) ensures that the result serves the needs of the business

    Daimler TSS Data Warehouse / DHBW 34

  • Tying Business Processes To Business Keys

    Time

    ProcurementSales

    $$Revenue

    DeliveryContractsFinance

    PlanningManufacturing

    CustomerContact

    Finance

    Sales Procurement

    SLS123 SLS123SLS123 *P123MFG

    *P123MFG

    Excel Spreadsheet

    Manual Process

    NO VISIBILITY!© Copyright 1990-2017, Dan Linstedt, all rights reserved

  • Data Warehouse /

    DHBWDaimler TSS 36

    Unique

    relationships

    between

    Business Keys

    (HUBs)

    LINK

  • Structure LINK tables

    Daimler TSS Data Warehouse / DHBW 37

  • LINK tables: typical characteristics

    A LINK models a relationship between 2 or more HUBs

    The relationship is always n:m

    The composed key must be unique. One of the foreign keys is driving key

    Daimler TSS Data Warehouse / DHBW 38

  • Daimler TSS Data Warehouse / DHBW 39

    Relationship difference by time or by location

    Source: https://datavaultalliance.com/news/relationships-data-driven-edw/

    https://datavaultalliance.com/news/relationships-data-driven-edw/

  • Candidates for LINKs

    • Relationships / Associations

    • Foreign Keys in OLTP systems

    • Hierarchies and Redefinitions

    • Hierarchical relationships are modeled by one link and two connections to HUBs: HAL (parent-child LINK) and SAL (same-as LINK)

    • Transactions and events are often modeled as link (could also be a Hub)

    • E.g. sales order or sensor data

    • Intense discussions about modeling as Hub or Link on conferences or social media (modeling solution depends from requirements, context, etc)

    Daimler TSS Data Warehouse / DHBW 40

  • Data Warehouse /

    DHBWDaimler TSS 41

    Descriptive,

    detailled,

    current

    and

    historized

    data

    SAT

  • Structure SAT tables

    Daimler TSS Data Warehouse / DHBW 42

  • SAT tables: typical characteristics

    Contains all non-key attributes

    Is connected to exactly one Hub or Link

    HUB or LINK tables can (should) have several SAT tables, e.g. by source system

    Can contain in the extreme case one column only (or any number of columns)

    Daimler TSS Data Warehouse / DHBW 43

  • SAT table design

    Different criteria to design SAT tables (separate data into different SAT tables)

    • Source system

    • Rate of change

    • Data types (e.g. separate CLOBS or other lengthy textual fields)

    Daimler TSS Data Warehouse / DHBW 44

  • SAT table design

    Rate of change in order to avoid redundant storage of data

    Data that change oftenData that do not change

    Delivery date SW-Version Controlunit

    Theft message

    Color CommentsInterior

    Daimler TSS Data Warehouse / DHBW 45

  • Daimler TSS

    How many rows are stored in the hub and link tables?

    vehicleid model product

    iondate

    engine color

    V1 SUV 15.01.13 E1 red

    V2 Cabrio 16.01.13 E2 blue

    V1 SUV 15.01.13 E1 red

    V3 Cabrio 17.01.13 E3 red

    Staging Data in table stg_vehicle from 15.01.2015

    V1 SUV 16.01.13 E4 red

    V4 Cabrio 17.01.13 E5 blue

    Staging Data Data in table stg_vehicle from 16.01.2015

    V1 SUV 16.01.13 E1 red

    Staging Data Data in table stg_vehicle from 17.01.2015

    Daimler TSS Data Warehouse / DHBW 46

  • How many rows are stored in the hub and link tables?

    • H_VEHICLE

    • 4 rows: V1, V2, V3, V4

    • H_ENGINE

    • 5 rows: E1, E2, E3, E4, E5

    • L_PLUGGED_IN_EFFECTIVITY

    • 5 rows: V1-E1, V2-E2, V3-E3, V1-E4, V4-E5

    Daimler TSS Data Warehouse / DHBW 47

  • Daimler TSS

    How many rows are stored in the first 3 sat tables?

    vehicleid model product

    iondate

    engine color

    V1 SUV 15.01.13 E1 red

    V2 Cabrio 16.01.13 E2 blue

    V1 SUV 15.01.13 E1 red

    V3 Cabrio 17.01.13 E3 red

    Staging Data Data in table stg_vehicle from 15.01.2015

    V1 SUV 16.01.13 E4 red

    V4 Cabrio 17.01.13 E5 blue

    Staging Data Data in table stg_vehicle from 16.01.2015

    V1 SUV 16.01.13 E1 red

    Staging Data Data in table stg_vehicle from 17.01.2015

    Daimler TSS Data Warehouse / DHBW 48

  • Daimler TSS

    How many rows are stored in the first 3 sat tables?

    vehicleid model product

    iondate

    engine color

    V1 SUV 15.01.13 E1 red

    V2 Cabrio 16.01.13 E2 blue

    V1 SUV 15.01.13 E1 red

    V3 Cabrio 17.01.13 E3 red

    Staging Data Data in table stg_vehicle from 15.01.2015

    V1 SUV 16.01.13 E4 red

    V4 Cabrio 17.01.13 E5 blue

    Staging Data Data in table stg_vehicle from 16.01.2015

    V1 SUV 16.01.13 E1 red

    Staging Data Data in table stg_vehicle from 17.01.2015

    5

    4

    6

    4

    5

    5

    Daimler TSS Data Warehouse / DHBW 49

  • Ensemble modeling

    Hans Hultgren: “An ensemble is a representation of a Core Business Concept including all of its parts – the business key, with context and relationships”

    Source:

    e.g. vehicle

    Ensemble

    Decomposition

    Entity

    Daimler TSS Data Warehouse / DHBW 50

  • Exercise Data Vault

    The following data model shows vehicle sales with entities

    • Person (sales_person and owner)

    • Vehicle

    • Production_plant

    Architect a Data Vault model for theCore Warehouse Layer

    Daimler TSS Data Warehouse / DHBW 51

  • Sample solution Data Vault

    Daimler TSS Data Warehouse / DHBW 52

  • Data Vault - advantages

    • Flexible / agile approach

    • Highly parallel data loads, Scalable

    • Automatable

    • Systematic approach that covers historization and integration

    • Full auditability

    • No updates or deletes on business data

    • Horizontal and vertical partitioning

    • Supports / Combines RDBMS and Hadoop/NoSQL technologies

    Daimler TSS Data Warehouse / DHBW 53

  • Data Vault - disadvantages

    • More Tables

    • More joins

    • Performance to load Data Mart can be a challenge

    • Logic to load Data Marts can be rather complex if many tables are involved

    • All relationships are modeled n:m (documentation necessary!). Data Vault assumes worst-case scenario for relationships

    • The same source table is used several times while loading HUBs, SATs, LINKs

    Daimler TSS Data Warehouse / DHBW 54

  • Data Vault - quotes

    Bill Inmon: “Over multiple years, Dan

    improved the Data Vault and evolved it

    into Data Vault 2.0. Today this System Of

    Business Intelligence includes not only a

    more sophisticated model, but an agile

    methodology, a reference architecture for

    enterprise data warehouse systems, and

    best practices for implementation.

    The Data Vault 2.0 System Of Business

    Intelligence is ground-breaking, again. It

    incorporates concepts from massively

    parallel architectures, Big Data, real-time

    and unstructured data.“Source: Linstedt / Olschimke: Building a Scalable Data Warehouse with Data Vault 2.0

    Barry Devlin: “The Data Vault approach,

    since the early 2000s, promises a much-

    improved balance, with a hybrid of the

    normalized and star schema forms

    above. Version 2.0 introduced in 2013 ,

    consisting of a data model, methodology,

    and systems architecture, provides a

    design basis for data warehouses that

    emphasizes core data quality,

    consistency, and agility.“Source: https://www.wherescape.com/media/3476/data-vault-thoughpoint-april-2017.pdf

    Daimler TSS Data Warehouse / DHBW 55

  • Data Vault 2.0 benefits according to Dan Linstedt

    The issues Data Vault 2.0 is built to solve include:

    • Global distributed Teams

    • Global distributed physical data warehouse components

    • „Lazy“ joining during query time across multi-country servers

    • Ingestion and query parsing of images, video, audio, documents (unstructured data)

    • Ingestion of real-time streaming (IOT) data

    • Cloud and On-premises seamless integration

    • Agile Team Delivery

    • Incorporation of Data Virtualization, and NoSQL platforms

    • Extremely large data sets (in to the Petabyte ranges and beyond)

    • Automation and Generation of 80% of the work products

    Source: https://www.linkedin.com/pulse/defining-data-vault-10-20-business-dan-linstedt/

    Daimler TSS Data Warehouse / DHBW 56

    https://www.linkedin.com/pulse/defining-data-vault-10-20-business-dan-linstedt/

  • Standard Data Warehouse architecture

    57

    Data Warehouse

    FrontendBackend

    External data sources

    Internal data sources

    Staging

    Layer

    (Input

    Layer)

    OLTP

    OLTP

    Core

    Warehouse

    Layer

    (Storage

    Layer)

    Mart Layer

    (Output

    Layer)

    (Reporting

    Layer)

    Integration

    Layer

    (Cleansing

    Layer)

    Aggregation

    Layer

    Metadata Management

    Security

    DWH Manager incl. Monitor

    Daimler TSS Data Warehouse / DHBW 57

  • Data Models in the DWH

    Layer Characteristics Data Model

    Staging Layer Temporary storage

    Ingest of source data

    Normally 1:1 copy of source table structure –

    usually without constraints and indexes

    Core Warehouse

    Layer

    Historization / bitemporal data

    Integration

    Tool-independent

    Non-redundant data storage

    Historization

    3NF with historization

    Head and Version modelling

    Data Vault

    Anchor modeling

    Dimensional model with historization (possible)

    Data Mart Layer Performance for end user queries

    required, Tool-dependent

    Lots of joins necessary to answer

    complex questions

    Flat structures, esp. Dimensional model

    (ROLAP / MOLAP / HOLAP)

    Daimler TSS Data Warehouse / DHBW 58

  • Dimensional modeling

    • Design technique to present data in a standard, intuitive framework

    • Easily understandable for end users

    • High performance end user access

    • Conceptual data model

    • Physical data model: Not necessarily relational, can also be stored in specialicedmulti-dimensional tools (“OLAP Cubes”)

    • Analysis / Reporting of numerical measures (metrics) by different attributes (context)

    Daimler TSS Data Warehouse / DHBW 59

  • Dimensional model

    Dimensions

    • Are entities that contain descriptive textual attributes for analysis

    • E.g. Car (model, manufacturer, etc), Time period (day, week, month, year)

    Facts

    • Contain key numerical figures – “Measures” – “Metrics”

    • E.g. Sales amount (for dimensions: product X in region y and time period z)

    Daimler TSS Data Warehouse / DHBW 60

  • Dimensional model: Star schema

    Fact Table

    Dim1_key (FK)

    Dim2_key (FK)

    Dim3_key (FK)

    Dim4_key (FK)

    Measure1

    Measure2

    Measure3

    Dimension_1

    Dim1_key (PK)

    Dim1_Attribute1

    Dim1_Attribute2

    Dim1_Attribute3

    Dimension_2

    Dim2_key (PK)

    Dim2_Attribute1

    Dim2_Attribute2

    Dim2_Attribute3

    Dimension_3

    Dim3_key (PK)

    Dim3_Attribute1

    Dim3_Attribute2

    Dim3_Attribute3

    Dimension_4

    Dim4_key (PK)

    Dim4_Attribute1

    Dim4_Attribute2

    Dim4_Attribute3

    n

    n

    n

    n

    Daimler TSS Data Warehouse / DHBW 61

  • Dimensional model example: Sales Star schema

    Sales Fact

    Time_key (FK)

    Product_key (FK)

    Supplier_key (FK)

    Customer_key (FK)

    Number_items

    Sales_amout

    Discount

    Time Dimension

    Time_key (PK)

    Date

    Day

    Month

    Quarter

    Year

    DayOfWeek

    Product Dimension

    Product_key (PK)

    Product_name

    Product_category

    Product_size

    Customer Dimension

    Customer_key (PK)

    Customer_name

    Customer_address

    Customer_nation

    Supplier Dimension

    Supplier_key (PK)

    Supplier_name

    Street

    City

    Country

    n

    n

    n

    n

    Daimler TSS Data Warehouse / DHBW 62

  • Dimensional model example: Star Schema with data

    Sales Fact

    T2, P1, C1, 1, 600EUR

    T3, P2, C3, 2, 700EUR

    T4, P2, C3, 1, 700EUR

    T4, P1, C3, 1, 600EUR

    T4, P3, C3, 1, 700EUR

    Time Dimension

    T1, 01.01., Q1, Monday

    T2, 02.01., Q1, Tuesday

    T3, 03.01., Q1, Wednesday

    T4, 04.01., Q1, Thursday

    Product Dimension

    P1, ipad Air, Tablet

    P2, Surface Pro 6, Tablet

    P3, iPhone 11, Smartphone

    Customer Dimension

    C1, Bush, New York, USA

    C2, Miller, London, GB

    C3, Brown, Seattle, USA

    Supplier Dimension

    Supplier_key (PK)

    Supplier_name

    Street

    City

    Country

    n

    n

    n

    n

    Daimler TSS Data Warehouse / DHBW 63

  • Daimler TSS Data Warehouse / DHBW 64

    Data Warehouse and Decision Support on Integrated Crop Big Data

    Source: Data Warehouse and Decision Support on Integrated Crop Big Data, 2020, https://arxiv.org/abs/2003.04470

    https://arxiv.org/abs/2003.04470

  • Denormalized Dimensions for Hierarchies

    Denormalized Dimensions: 1 Table with all hierarchy levels

    • Advantage:

    • Efficient aggregations

    • Performance

    • Disadvantage:

    • Complex updates if hierarchies change

    Daimler TSS Data Warehouse / DHBW 65

  • Hierarchies

    Hierarchies (from most detailed to aggregated level):

    • Productid Productname Productgroup Category Class

    • Date Month/Year Quarter/Year Year

    • Customer Company Industry

    • City County/Landkreis State Country Continent

    Purpose:

    • group and structure data

    • enable view on data at different levels of granularity

    • Hierarchies define aggregations on measures

    Daimler TSS Data Warehouse / DHBW 66

  • How to cover data changes in the mart?

    Data changes, e.g.

    • new employees

    • employees change departments

    • employees leave

    • whole department reorganisations, etc

    • How are the changes handled? Insert-only approach in the Core Warehouse Layer, but choices in the Mart Layer (reduce data amount to what end user needs)

    • What does the business want to see? (Reporting Scenarios)

    • How is data inserted / updated in dimensions? (Slowly Changing Dimensions)

    Daimler TSS Data Warehouse / DHBW 67

  • Slowly changing dimensions

    • Dimensions must absorb changes

    • Slowly changing dimensions according to Kimball / Ross (2002):

    • SCD Type 0

    • no changes, new data is ignored

    • SCD Type 1 - 3

    • See next slides

    • And some more SCD types

    • Rarely relevant

    Daimler TSS Data Warehouse / DHBW 68

  • Slowly changing dimensions – example baseline

    Changes:

    • New data added: Albert, DWH

    • Powell marries and has new name Parker

    ID Employee Organisation

    1 Miller DWH

    2 Powell DatabaseEmp

    loye

    e

    Dim

    en

    sion

    Daimler TSS Data Warehouse / DHBW 69

  • Slowly Changing Dimension Type 1

    • No History

    • Dimension attributes always contain current data

    Changes:

    • New data added: Albert, DWH

    • Powell marries and hasnew name Parker

    Em

    plo

    yee

    Dim

    en

    sion

    ID Employee Organisation

    1 Miller DWH

    2 Parker Database

    3 Albert DWH

    Em

    plo

    yee

    Dim

    en

    sion

    ID Employee Organisation

    1 Miller DWH

    2 Powell Database

    Daimler TSS Data Warehouse / DHBW 70

  • Slowly Changing Dimension Type 2

    • Full Historization • Dimension contains

    timestamps with NULLs or future date like 31.12.2999

    Changes:

    • New data added: Albert, DWH

    • Powell marries and has new nameParker

    Em

    plo

    yee

    Dim

    en

    sion

    ID Employee Organisation Valid From Valid To

    1 Miller DWH 01.01.2015 NULL

    2 Powell Database 21.12.2014 15.10.2016

    3 Albert DWH 05.03.2014 NULL

    2 Parker Database 15.10.2016 NULL

    Em

    plo

    yee

    Dim

    en

    sion

    ID Employee Organisation

    1 Miller DWH

    2 Powell Database

    Daimler TSS Data Warehouse / DHBW 71

  • Slowly Changing Dimension Type 3

    • Historization of latest change only

    • And storage of current value

    Changes:

    • New data added: Albert, DWH

    • Powell marries and hasnew name Parker

    Em

    plo

    yee

    Dim

    en

    sion

    I

    D

    Employ

    ee

    Name

    Previo

    us

    Name

    Organi

    sation

    Previous

    Organisati

    on

    1 Miller NULL DWH NULL

    2 Parker Powell Database NULL

    3 Albert NULL DWH NULL

    Em

    plo

    yee

    Dim

    en

    sion

    ID Employee Organisation

    1 Miller DWH

    2 Powell Database

    Daimler TSS Data Warehouse / DHBW 72

  • Reporting scenarios

    • As-is scenario

    • As-of scenario

    • As-posted scenario

    • As-posted with comparable data scenario

    Daimler TSS Data Warehouse / DHBW 73

  • Data Mart – example baseline

    Employee Organisation

    Miller DWH

    Rogers DWH

    Douglas Database

    Powell Database

    Em

    plo

    yee

    Dim

    en

    sio

    n 2

    01

    5

    Employee Organisation

    Miller DWH

    Rogers DWH

    Powell DWH

    Douglas Database

    Bush DatabaseEm

    plo

    yee

    Dim

    en

    sio

    n 2

    01

    6

    Employee Year #Pro-

    jects

    Miller 2015 10

    Rogers 2015 10

    Douglas 2015 10

    Powell 2015 10

    Miller 2016 10

    Rogers 2016 10

    Powell 2016 10

    Douglas 2016 10

    Bush 2016 10

    Facts

    Assumption: current year: 2016

    New employee

    Other department

    Daimler TSS Data Warehouse / DHBW 74

  • As-is scenario

    Reporting uses current structure

    Employee Organisation

    Miller DWH

    Rogers DWH

    Powell DWH

    Douglas Database

    Bush DatabaseEm

    plo

    yee

    Dim

    en

    sio

    n 2

    01

    6

    Employee Year #Pro-

    jects

    Miller 2015 10

    Rogers 2015 10

    Douglas 2015 10

    Powell 2015 10

    Miller 2016 10

    Rogers 2016 10

    Powell 2016 10

    Douglas 2016 10

    Bush 2016 10

    Facts

    Organisation #Projects ´15 #Projects ´16

    DWH 30 30

    Database 10 20

    Daimler TSS Data Warehouse / DHBW 75

  • As-of scenario

    Reporting uses structure as demanded

    e.g. requested for 2015

    Employee Organisation

    Miller DWH

    Rogers DWH

    Douglas Database

    Powell Database

    Em

    plo

    yee

    Dim

    en

    sio

    n 2

    01

    5

    Employee Year #Pro-

    jects

    Miller 2015 10

    Rogers 2015 10

    Douglas 2015 10

    Powell 2015 10

    Miller 2016 10

    Rogers 2016 10

    Powell 2016 10

    Douglas 2016 10

    Bush 2016 10

    Facts

    Organisation #Projects ´15 #Projects ´16

    DWH 20 20

    Database 20 20

    Daimler TSS Data Warehouse / DHBW 76

  • As-posted scenario

    Reporting uses „historical truth“

    Organisation #Projects ´15 #Projects ´16

    DWH 20 30

    Database 20 20

    Daimler TSS Data Warehouse / DHBW 77

  • As-posted with comparable data scenario

    Reporting uses „historical truth“ for

    identical dimension data

    Organisation #Projects ´15 #Projects ´16

    DWH 20 20

    Database 10 10

    Daimler TSS Data Warehouse / DHBW 78

  • Exercise star schema

    The following data model shows vehicle sales with entities

    • Person (sales_person and owner)

    • Vehicle

    • Production_plant

    Architect a Star Schema for theData Mart Layer in order to analyse sales data

    Daimler TSS Data Warehouse / DHBW 79

  • Sample solution Star Schema

    Daimler TSS Data Warehouse / DHBW 80

  • Dimensional model – implementation types

    Implementation types of dimensional models

    Star Schema = Relational model (ROLAP) consists of•Fact Tables

    •Dimension Tables

    Cube = Multidimensional model (MOLAP) consists of•Edges = Attributes

    •Cells = Measures (facts)

    Daimler TSS Data Warehouse / DHBW 81

  • ROLAP

    Dimensions

    • Relational table for each dimension like product, region, time period

    • Primary key (surrogates) identifies each dimension element

    • Additional fields contain descriptive information like product name

    • E.g. Dimensions: Product, Region, Time period (day, week, month, year)

    Facts

    • Relational table containing key figures – “Measures”

    • Stores foreign keys to dimension tables

    • The other fields contain the values of the key figures/measures

    • E.g. Sales amount (for product X in region y and time period z)Daimler TSS Data Warehouse / DHBW 82

  • ROLAP

    • Advantage: can use well-engineered, reliable and high-performance database systems and query languages

    • Memory amount depends mainly on the number of facts

    • One row per fact

    • Size of a row approx. (#dimensions + #measures) * column size

    • Aggregated totals are computed dynamically in general

    • Longer response times

    Daimler TSS Data Warehouse / DHBW 83

  • ROLAP Enhancements

    Used for accelerating data warehouse queries in general

    • Precomputation of aggregated values

    • Materialized views / query tables store data physically

    • Relational Columnar (in-memory) databases

    Daimler TSS Data Warehouse / DHBW 84

  • Dimensional model – implementation types

    Implementation types of dimensional models

    Star Schema = Relational model (ROLAP) consists of•Fact Tables

    •Dimension Tables

    Cube = Multidimensional model (MOLAP) consists of•Edges = Attributes

    •Cells = Measures (facts)

    Daimler TSS Data Warehouse / DHBW 85

  • Multidimensional data model

    Edges of a cube (“Dimension”)

    • Attributes like Product, Region, Time period (day, week, month, year)

    Cells of a cube (“Measures”)

    • Key Figures (i.e. sales amount, profit) – “measures”

    • For every combination of attribute values one value of each key figure, e.g. Sales amount for product X in region y and time period z

    • Can be NULL and is stored as empty cell

    Daimler TSS Data Warehouse / DHBW 86

  • MOLAP - Multidimensional Databases

    • A database specially designed to handle the organization of data in multiple dimensions

    • Good for DWH requirements only but not generally suited like an RDBMS

    • E.g. IBM Cognos TM1, Oracle Essbase, Microsoft Analysis Services, Oracle OLAP Option, IBM Cognos Powerplay

    • Holds data cells in blocks that constitute a virtual cube

    • Optimized to handle numeric data

    • Aggregated totals often precalculated

    • Not intended for textual data

    Daimler TSS Data Warehouse / DHBW 87

  • Multidimensional storage

    • Linearization of the cells in a cube into a one-dimensional array

    Memory amount: #(dim1) x #(dim2) x ... x #(dimN)

    Depends on the number of dimensions and their cardinality, not on the number of facts

    Example:

    • Cube with 2 dimensions with 3 and 1 dimension with 2 elements

    • Memory amount = size = 3*3*2 = 18 cells

    • The numbers in the cube cells indicate the position in the array

    Daimler TSS Data Warehouse / DHBW 88

  • Storage cells containing all combinations

    Cube with 3 dimensions

    • Product – 4 values – p1, p2, p3, p4

    • Store – 3 values – s1, s2, s3

    • Time (year) – 2 values - y1, y2

    Number of cells in the cube: 4 x 3 x 2 = 24

    Daimler TSS Data Warehouse / DHBW 89

  • MDX - OLAP Query Language

    • ROLAP = SQL is standard language

    • MOLAP = MDX - Multidimensional Expressions

    • De-facto industry standard developed by Microsoft

    • Very complex

    • SQL like syntax

    • Language elements

    • Scalar – data type „string“ or „number“

    • Dimension, Hierarchy, Level, Member

    • …

    Daimler TSS Data Warehouse / DHBW 90

  • MDX Sample Query

    SELECT { [Measures].[Store Sales] } ON COLUMNS, { [Date].[2002], [Date].[2003] } ON ROWS

    FROM Sales

    WHERE ( [Store].[USA].[CA] )

    • This query defines the following result set information:

    • The SELECT clause sets the query axes as the Store Sales (amount) member and the 2002 and 2003 members of the Date dimension

    • The FROM clause indicates that the data source is the Sales cube

    • The WHERE clause defines the "slicer axis" for member California of Store dimension

    Store Sales

    2002 95863,66

    2003 99764,01

    Daimler TSS Data Warehouse / DHBW 91

  • MOLAP - ROLAP

    MOLAP ROLAP

    Database type Multidimensional Relational

    Data storage Special storage engines for

    cube data

    Star schema – special relational

    data model

    Size 100s of Gigabytes 10s of Terabytes

    Query language MDX SQL

    Daimler TSS Data Warehouse / DHBW 92

  • MOLAP - ROLAP

    MOLAP ROLAP

    Advantages • special database products optimized for multidimensional analysis

    • short response times, e.g. no joins

    • suitable storage schema and query

    processing for multidimensional data

    • can use existing, well established DBMS

    • easy data import, update

    • user access, backup, security mechanisms

    from DBMS can be used

    Disadvantages • problems with sparsity (ratio occupied / not occupied cells): "null" is stored in a field with

    same length as any value

    • limited data volume: 5-6 dimensions

    • cube data read-only accessible only for end

    users

    • expensive update operation

    • Complex SQL queries for processing OLAP

    requests longer response times (solution:

    Materialized Views and In-memory columnar

    databases)

    Daimler TSS Data Warehouse / DHBW 93

  • HOLAP – Hybrid OLAP

    • Combines the advantages of ROLAP and MOLAP

    • Relational DBMS for storage of sparse, historic data

    • Data of highest granularity level

    • Multidimensional DBMS for efficient storage of dense data cubes

    • Multidimensional cache for aggregated totals

    • Complex architecture and maintenance processes

    • No uniform OLAP query processing

    Daimler TSS Data Warehouse / DHBW 94

  • Exercise: OLAP

    The following is a conceptual model used by a supermarket chain to analyze their business: (assume 365 days a year)

    Daimler TSS Data Warehouse / DHBW 95

  • Exercise: OLAP

    With each transaction, an average of 20 different articles are bought.

    The data warehouse collects sales transactions data over 2 years.

    There are 1000 stores with 2000 transactions per store and day.

    Questions:

    1. What are the columns of the ROLAP fact table?

    2. How many records are stored in the fact table?

    3. What is the size of the cube (number of cells) that stores the aggregated values at the most detailed level?

    Daimler TSS Data Warehouse / DHBW 96

  • Exercise: OLAP

    1. What are the columns of the ROLAP fact table?

    • Trans. No. (FK to dimension)

    • Date (FK to dimension)

    • Location (FK to dimension)

    • Article (FK to dimension)

    • No of articles (measure)

    • Article Price (measure)

    Daimler TSS Data Warehouse / DHBW 97

  • Exercise: OLAP

    2. How many records are stored in the fact table?

    • One record per transaction and article (with quantity and price)

    • 2 years * 365 days/year * 1000 stores * 2000 transactions/(store*day)* 20 articles/transaction = 29.200.000.000 articles/records

    Daimler TSS Data Warehouse / DHBW 98

  • Exercise: OLAP

    3. What is the size of the cube (number of cells) that stores the aggregated values at the most detailed level?

    • 2 years * 365 [days]/year * 2000 [transactions] * 1000 [stores] * 50000 [articles] = 73.000.000.000.000 cells

    Daimler TSS Data Warehouse / DHBW 99

  • What about Data Lakes?

    Daimler TSS

  • Daimler TSS Data Warehouse / DHBW 101

    Variety

    • Structured data like tables typically stored in relational databases

    • Unstructured data usually generated by humans e.g. natural language, voice, Wikipedia, Twitter posts, video, images

    • Semi-structured data has some structure in tags but it changes with documents E.g. HTML, XML, JSON files, server logs

    Unstructured data is a bad phrase, e.g. Tweets are structured, too.

    Better: data has low information density.

  • • Standard in many NoSQLdatabases, e.g. MongoDB

    • But also: ISO SQL Standard 2017, Part 6 with SQL Query-Functions:

    • JSON_EXISTS

    • JSON_VALUE

    • JSON_QUERY

    • JSON_TABLE

    {

    "customer": {

    "firstName": "Jim",

    "lastName": "Bush",

    "interests": [

    "Travel",

    "Sports",

    "Photography“

    ]

    }

    }

    {object}

    [array]

    Key: Value pair

    JSON – schemaless future for Database design?

    102Daimler TSS Data Warehouse / DHBW

  • 103

    JSON (Java Script Object Notation)

    • JSON was developed by Douglas Crockford in 2001 to exchange data

    • JSON is simpler and more compact compared to XML

    • JSON is replacing XML more and more to exchange data

    • JSON is used

    • for data exchange (often via RESTful APIs)

    • as a configuration file (for example, Node.js stores metadata in package.json)

    • as a primary storage format in databases like MongoDB

    • Standardized by IETF (Internet Engineering Task Force) and ECMA (European Computer Manufacturers Association)

    Daimler TSS Data Warehouse / DHBW

  • Daimler TSS Data Warehouse / DHBW 104

    Modeling of relationships

    Customers• Cust_num, name

    Invoice (Order)• Invoice_num,

    invoice_date

    LineItems• quantity

    Items• Item_num, Name,

    price

    n n

    n

    nship

    bill

  • Daimler TSS Data Warehouse / DHBW 105

    Modeling of relationshipsNormalized vs Nested

    Source: https://dzone.com/articles/json-data-modeling-for-rdbms-users

    https://dzone.com/articles/json-data-modeling-for-rdbms-users

  • Daimler TSS Data Warehouse / DHBW 106

    JSON modeling – some considerations

    • Nested objects

    • Relationship is 1-to-1 or 1-to-many

    • Data reads are mostly parent and child fields (both

    • Data writes are mostly parent and child fields (both)

    • Separate documents

    • Relationship is many-to-1 or many-to-many

    • Data reads are mostly parent fields

    • Data reads are mostly parent or child (not both)

    • Does not make much sense —> relational

    Source: https://dzone.com/articles/json-data-modeling-for-rdbms-users

    https://dzone.com/articles/json-data-modeling-for-rdbms-users

  • 107

    Schema-on-read vs schema-on-write

    Schema-on-read Schema-on-write

    No data structure in DB necessary Table in DB is created first

    Schema is applied while reading the data Schema is applied while writing the data

    Fast for writes, slower for reads Fast for reads, slower for writes

    Flexible for programmer as he just copies

    data in

    Flexible for user as he just reads data

    High effort for user as he tries to ensure

    data quality

    High effort for programmer as he has to

    ensure data quality

    Wrong data can be stored Wrong data can not be stored

    Security is critical as data is not known Security can be managed as data is knownDaimler TSS Data Warehouse / DHBW

  • 108

    Summary: Schema-on-read vs schema-on-write

    • Many argue that schema-on-read is great and flexible

    • Data just needs to be copied

    • Data structure can be applied later (schemaless is misnomer)

    • Sensor data has often different formats (new vs old versions in the field) where the format makes sense

    • But, there are enormous disadvantages

    • Data quality

    • Same work is done many times

    • Data security is at high risk

    Daimler TSS Data Warehouse / DHBW

  • Summary

    • Data modeling in the Core Warehouse Layer

    • Choices like Data Vault

    • Data modeling in the Mart Layer

    • Dimensional Modeling

    • ROLAP (Star Schema with fact and dimension tables)

    • MOLAP (Cubes)

    • Data modeling in the Data Lake

    • Persistent landing Zone normally JSON, csv

    • Standardized Zone

    • Schema-on-write: similar to DWH, esp. Dimensional model still relevant

    • Schema-on-read: JSON (csv)

    Daimler TSS Data Warehouse / DHBW 109

  • Daimler TSS GmbH

    Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99

    [email protected] / Internet: www.daimler-tss.com

    Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle

    © Daimler TSS I Template Revision

  • Exercise - recapture data modeling

    • Recapture data modeling topics

    • Which topics do you remember or do you find important?

    • Write down 1-2 topics on stick-it cards.

    Daimler TSS Data Warehouse / DHBW 111

  • Ensemble modeling – not just Data Vault 2.0

    Daimler TSS Data Warehouse / DHBW 112

  • 5 levels of CMMI

    1. You can't understand what you don't research

    2. You can't define what you don't understand (standards, context, concepts)

    3. You can't identify what you don't define (KPA's and structure)

    4. You can't measure what you don't identify (KPA's and KPI's)

    5. You can't optimize what you can't measure (KPI's and retrospective adaptation)

    Source: https://www.linkedin.com/pulse/data-vault-20-beyond-model-dan-linstedt/

    Daimler TSS Data Warehouse / DHBW 113

    https://www.linkedin.com/pulse/data-vault-20-beyond-model-dan-linstedt/

  • Daimler TSS

    Formal definition Data Vault (1.0) by Dan Linstedt

    „The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables thatsupport one or more functional areas of business.It is a hybrid approach encompassing the best ofbreed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meetthe needs of today’s enterprise data warehouses.“

    Source: http://www.vertabelo.com/blog/technical-articles/data-vault-series-agile-modeling-not-an-option-anymore

    Daimler TSS Data Warehouse / DHBW 114

    http://www.vertabelo.com/blog/technical-articles/data-vault-series-agile-modeling-not-an-option-anymore

  • OLAP – 12 criteria by Codd

    OnLine Analytical Processing

    • Term introduced by E. Codd in 1993 in a white paper for Arbor Essbase

    • 12 criteria for OLAP systems like

    • Multi-dimensionality

    • Transparency

    • Constant response-times

    • Multi-user support

    • Flexible definition of reports

    • No limits on dimensions and hierarchy levels

    Daimler TSS Data Warehouse / DHBW 115

  • OLAP – FASMI criteria

    FASMI – Fast Analysis of Shared Multidimensional Information

    Criteria by Pendse/Creeth (1995)

    • Fast

    • maximum response time for regular queries 5 seconds and complex queries not more 20 seconds

    • Analysis

    • intuitive analysis, easy/no programming

    • flexible: queries may contain arbitrary computations

    Daimler TSS Data Warehouse / DHBW 116

  • OLAP – FASMI criteria

    • Shared

    • Multi user capable: Shared usage and access control

    • Multidimensional

    • Multidimenional view on the data

    • regardless of the underlying data model

    • Full support of hierarchies

    • Information

    • User must be able to get all data without restrictions by the used OLAP system, no restriction in regards to scalability

    Daimler TSS Data Warehouse / DHBW 117

  • Elements of Scale: Composing and Scaling Data Platforms (Ben Stopford)

    • Sequential operations are best

    • Sequential operations can be predicted

    • Random operations are the main challenge

    • Append-only journal leads to sequential IO

    • But what about updates (in place)?

    • Indexes speed up read random IO read performance but not random IO write performance

    Source: http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/

    Daimler TSS Data Warehouse / DHBW 118

  • Elements of Scale: Composing and Scaling Data Platforms (Ben Stopford)

    Source: http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/

    Daimler TSS Data Warehouse / DHBW 119

  • Multidimensional operations - Selection

    Selection

    • Definition of a filter

    • Select data of a single cell with a condition for each dimension

    • For instance:

    • time = 'January 2006'

    • location = 'Stuttgart'

    • product = ‘ThinkPad T60'

    Daimler TSS Data Warehouse / DHBW 120

  • Example Selection

    Daimler TSS Data Warehouse / DHBW 121

  • Multidimensional operations - Slice

    Slice

    • Definition of a filter

    • Condition for one single dimension

    • Select a new cube with one fewer dimension

    • For instance

    • Product = ‘ThinkPad T60'

    Daimler TSS Data Warehouse / DHBW 122

  • Example Slice

    Daimler TSS Data Warehouse / DHBW 123

  • Multidimensional operations - Dice

    Dice

    • Definition of intervals/sets as filter

    • Pick specific values of multiple dimensions

    • Select a smaller cube

    • Conditions for instance

    • time = 1st quarter (January, February, March)

    • location = region south (Stuttgart, Frankfurt, Munich)

    Daimler TSS Data Warehouse / DHBW 124

  • Example Dice

    Daimler TSS Data Warehouse / DHBW 125

  • Multidimensional operations – Rotate/Pivot

    Rotate/Pivot

    • Rotate cube along its axes

    • Get different view on data cube

    • # of views on cube = (# of dimensions)!

    • 2 dimensions, 2 views (2! = 2*1)

    • 3 dimensions, 6 views (3! = 3*2*1)

    • 4 dimensions, 24 views (4! = 4*3*2*1)

    • ...

    Daimler TSS Data Warehouse / DHBW 126

  • Example Rotate/Pivot

    Daimler TSS Data Warehouse / DHBW 127

  • Multidimensional operations – Roll-up/Drill-down

    Roll-up & Drill-down

    • Prerequisites:

    • Hierarchies defined

    • Aggregated data for all hierarchy levels available

    • Roll up: change hierarchy level "upwards":

    • get less detailed data (= higher aggregation)

    • Drill down: change hierarchy level "downwards":

    • get more detailed data (= lower aggregation)

    Daimler TSS Data Warehouse / DHBW 128

  • Types of fact tables - accumulating

    Accumulating snapshots

    • Shows activity of a process/event over time

    • The data is not complete at the beginning and is updated as soon as new data arrived (e.g. delivery date can be unknown at the beginning)

    • The grain must (should) be the same for all rows

    • E.g. fact table for processing an order

    Daimler TSS Data Warehouse / DHBW 129

  • One or two fact tables?

    Sales Fact

    Quantity_ordered

    Quantity_shipped

    Time Dimension

    Product Dimension

    Customer Dimension

    Sales Fact

    Quantity_ordered

    Time Dimension

    Product Dimension

    Customer Dimension

    Shipment Fact

    Quantity_shipped

    Daimler TSS Data Warehouse / DHBW 130

  • One or two fact tables?

    • Reports get much more complicated to filter NULL

    • Avg(quantity): 100+50/2 but avg(shipped): 100/1

    • There may be even more columns like quantity_delivered or Delivery_company

    • 2 fact tables

    Time Product Customer Quantity

    ordered

    Quantity

    shipped

    1 A X 100 NULL

    1 B Y 50 NULL

    2 A X NULL 100

    Daimler TSS Data Warehouse / DHBW 131

  • One or two fact tables?

    Different processes must result into different fact tables

    • E.g. measures at different time

    • E.g. facts with different grain

    Daimler TSS Data Warehouse / DHBW 132

  • Precomputation of aggregated totals

    Query processing in the Mart Layer

    • SQL statements can become complex, e.g. many joins

    • SQL statements can become slow if many rows are aggregated

    • E.g. sum of sales amount for city X AND product Y AND year 2016 compared to city X AND product Y AND year 2015

    • If aggregated values are stored in Fact tables, new data from the Core Warehouse layer have to be integrated into such aggregated fact tables

    Daimler TSS Data Warehouse / DHBW 133

  • Materialized Views/Query Tables

    The DBMS takes care of solving these problems

    • The user defines views containing aggregated values for certain hierarchy levels

    • These views are materialized as tables

    • Update options

    • immediate

    • deferred

    • When performing a query against a fact table the DB optimizer takes advantage of these materialized views, i.e., no special queries have to be written for this by a user or application program

    • The user has not to rewrite the original query to use the materialized views

    Daimler TSS Data Warehouse / DHBW 134

  • Materialized Views / Materialized Query Tables

    Example statement Oracle to precompute values (similar DB2 and other RDBMS)CREATE MATERIALIZED VIEW sales_agg

    BUILD IMMEDIATE

    REFRESH FAST

    ON DEMAND

    AS

    SELECT p.productname, s.city, EXTRACT(MONTH FROM s.date)

    , sum(s.sales_amount)

    , sum(no_items)

    FROM product p

    JOIN sales s ON p.productid = s.productid

    GROUP by p.productname, s.city, EXTRACT(MONTH FROM s.date);

    Daimler TSS Data Warehouse / DHBW 135

  • Dimension and fact table types

    • Conformed dimension

    • Junk dimension

    • Role-Playing dimension

    • Degenerated dimension

    • Transactional fact

    • Periodic fact

    • Accumulating fact

    Daimler TSS Data Warehouse / DHBW 136

  • Dimension types: Conformed dimension

    • Dimension that is used in several fact tables

    • Fact tables can be connected by using conformed dimensions

    Sales

    Fact

    Inventory

    Fact

    Product Dimension

    Location Dimension

    Daimler TSS Data Warehouse / DHBW 137

  • Dimension types: Conformed dimension

    Kimball: Enterprise DWH Bus Matrix is a “design tool” to document the organization’s processes

    Dat

    e

    Produ

    ct

    Locati

    on

    Custom

    er

    Promotion

    Sales Fact X X X X X

    Inventory Fact X X X

    Customer Returns

    Fact

    X X X X

    Sales Forecast Fact X X X

    Daimler TSS Data Warehouse / DHBW 138

  • Dimension types: Junk dimension

    Collection of lookup data / codes that could also form it’s own dimension

    I

    D

    MartialSta

    tus

    Gender

    1 Single Male

    2 Single Female

    3 Married Male

    4 Married Female

    Daimler TSS Data Warehouse / DHBW 139

  • Dimension types: Role-playing dimension

    A single dimension is referenced several times by the same fact table

    • E.g. several dates in fact table reference Date Dimension

    ID OrderD

    ate

    DeliveryD

    ate

    ProductionDate

    1 .. .. ..

    2 .. .. ..

    3 .. .. ..

    4 .. .. ..

    Daimler TSS Data Warehouse / DHBW 140

  • Dimension types: Degenerated dimension

    • A dimension without own dimension table. Data are stored in the fact table only.

    • Used e.g. for drill-through in reports

    • E.g. OrderNumber in sales fact table

    ID OrderNumber

    1 A51273 .. ..

    2 72841 .. ..

    3 732GT5 .. ..

    4 624TR5K .. ..

    Daimler TSS Data Warehouse / DHBW 141

  • Types of fact tables - transactional

    Transactional

    • Most common

    • Usually one row per line/event in a transaction

    • Most detailed level

    • The grain must (should) be the same for all rows

    • Measures can usually be aggregated: “additive measure” (e.g. sum over sales amount)

    • E.g. fact table for sales data

    Daimler TSS Data Warehouse / DHBW 142

  • Types of fact tables – periodic snapshot

    Periodic snapshots

    • Picture of the time

    • Often computed from transactional fact table, e.g. aggregated by month

    • Measures can usually not be aggregated (e.g. sum over inventory does not make sense as inventory is already snapshot / sum for a day)

    • The grain must (should) be the same for all rows

    • E.g. fact table for inventory data (summed up for each day)

    Daimler TSS Data Warehouse / DHBW 143

  • example

    Sales in year y2

    Daimler TSS Data Warehouse / DHBW 144

  • example

    Sales of store s1 in year y2

    Daimler TSS Data Warehouse / DHBW 145

  • example

    Sales of product p2 in year y1

    Daimler TSS Data Warehouse / DHBW 146

  • Roll-up & Drill-down

    Daimler TSS Data Warehouse / DHBW 147

  • Conceptual and logical level

    Scott Ambler – Disciplined agile delivery

    • Do you need it?

    • What do you want to achieve?

    • What is the value?

    • Which representation do you use: 3NF/UML/Object model/ADAPT/Data Vault?

    Daimler TSS Data Warehouse / DHBW 148

  • Daimler TSS

    Data Modeling - What about data modeling training?

    Employees often get trained in SQL Server, Oracle, Cognos TM1, Tableau, or any other tool / product. What about data model training?

    Sources: http://www.dbdebunk.com/2017/06/this-week.html

    Daimler TSS Data Warehouse / DHBW 149

  • Relational model: Snowflake Schema with normalizedDimensions

    Sales Fact

    Time_key (FK)

    Product_key (FK)

    Location_key (FK)

    Branch_key (FK)

    Sales_amout

    Discount

    Time Dimension

    Time_key (PK)

    Date

    Day

    Month

    Quarter

    Year

    Product Dimension

    Product_key (PK)

    Product_name

    Supplier_Key (FK)

    Branch Dimension

    Branch_key (PK)

    Branch_name

    Location Dimension

    Location_key (PK)

    Street

    City_key (FK)

    City Dimension

    City_key (PK)

    City

    Country_Key (FK)

    Supplier Dimension

    Supplier_key (PK)

    Supplier_Name

    Country Dimension

    Country_key (PK)

    Country

    n

    n

    n

    n

    n

    Daimler TSS Data Warehouse / DHBW 150

  • Data Models for Hierarchies

    Normalized Dimensions

    • 1 table for each hierarchy level

    • Advantage:

    • Minimal updates for changes in the hierarchies

    • Disadvantage:

    • More complex queries when computing aggregations

    • Multiple joins

    Daimler TSS Data Warehouse / DHBW 151

  • Dimensional model – logical view

    Sales

    Inven

    toryStock

    #Items

    Price

    Store City Country

    Customer

    Product Productgroup

    Day Month Year

    Measure

    Fact table / Cube

    Dimension

    Daimler TSS Data Warehouse / DHBW 152

  • Sample product Hierarchy

    Dimensions can be organized in hierarchies

    • i.e. product hierarchy

    Daimler TSS Data Warehouse / DHBW 153

  • Exercise: Queries 1

    How many cabriolets (D_Model.model) have been

    Built in January and February 2016?

    Assume SCD1 and no history in fact tables

    Count

    01/2016

    02/2016

    Daimler TSS Data Warehouse / DHBW 154

  • Exercise: Queries 1

    How many cabriolets (D_Model.model) have been

    Built in January and February 2016?

    SELECT d.month, d.year, sum(f.count)

    FROM f_vehicle_built f

    JOIN d_model m on m.modelid = f.modelid

    JOIN d_production_date d on d.prod_date = f.prod_date

    WHERE m.model = ‘Cabriolet‘

    AND d.month IN (1, 2) AND d.year = 2016

    GROUP BY d.month, d.year

    Daimler TSS Data Warehouse / DHBW 155

  • Exercise: Queries 2

    How many different models (D_Model.model) have

    Currently a performance of 105PS (D_ENGINE.performance)?

    Assume SCD1 and no history in fact tables

    Model Count

    Cabriolet

    SUV

    Daimler TSS Data Warehouse / DHBW 156

  • Exercise: Queries 2

    How many different models (D_Model.model) have

    Currently a performance of 105PS (D_ENGINE.performance)?

    Select m.model, sum(f.count)

    FROM f_vehicle_built f

    JOIN d_model m on m.modelid = f.modelid

    JOIN d_engine e on e.engineid = engineid

    WHERE e.performance = 105

    GROUP BY m.model

    Daimler TSS Data Warehouse / DHBW 157

  • Exercise: Queries 3

    How many different models (D_Model.model) have

    Currently a performance of 105PS (D_ENGINE.performance)?

    Model Count

    Cabriolet

    SUV

    Daimler TSS Data Warehouse / DHBW 158

  • Exercise: Queries 3

    How many different models (D_Model.model) have

    Currently a performance of 105PS (D_ENGINE.performance)?

    CREATE VIEW v_vehicle_sat as

    SELECT h_vehicle_key, max(loaddate), model

    FROM s_vehicle_base

    GROUP BY h_vehicle_key;

    CREATE VIEW v_engine_sat as

    SELECT h_engine_key, max(loaddate), performance

    FROM s_engine

    GROUP BY h_engine_key;

    Daimler TSS Data Warehouse / DHBW 159

  • Exercise: Queries 3

    How many different models (D_Model.model) have

    Currently a performance of 105PS (D_ENGINE.performance)?

    SELECT model, count(*)

    FROM v_vehicle_sat v

    JOIN l_plugged_into l ON l.h_vehicle_key = v.h_vehicle_key

    JOIN v_engine_sat e ON l.h_engine_key = e.h_engine_key

    JOIN s_engine s ON s.h_engine_key = e.h_engine_key

    AND s.loaddate = e.loaddate

    WHERE s.performance = 105

    GROUP by model; Many other solutions possible, e.g. using with clause

    instead of views or using window functions – all

    depending from DB vendor/version

    Daimler TSS Data Warehouse / DHBW 160