Benchmarking Elastic Cloud Big Data Services under SLA Constraints

Nicolas Poggi, Victor Cuevas-Vicenttín, David Carrera, Josep Lluis Berral, Thomas Fenech, Gonzalo Gomez, Davide Brini, Alejandro Montero, Umar Farooq Minhas, Jose A. Blakeley, Donald Kossmann, Raghu Ramakrishnan and Clemens Szyperski

TPCTC - August 2019


Outline

1. Intro to TPCx-BB
   a. Limitations for cloud systems
   b. Contributions
2. Realistic workload generation
   a. Production datasets
   b. Job arrival rates
3. Elasticity Test
   a. Current metric
   b. SLA-based addition
4. Experimental evaluation
   a. Elasticity Test
   b. Load, Power, Throughput tests
   c. Metric evaluation
5. Conclusions
   a. Future directions


Benchmarking and TPCx-BB

• Benchmarks capture the solution to a problem and guide decisions.
• Widely used in development, configuration, and testing.
• TPCx-BB (BigBench) is the first standardized big data benchmark
  • Collaboration between industry and academia
  • Follows the retailer model of TPC-DS
  • Adds:
    • Semi-structured and unstructured data
    • SQL, UDF, ML, and NLP queries

[Figure: Retailer data model]


TPCx-BB benchmark workflow

• Similar to previous TPC database benchmarks:
  • Load Test (TLD):
    • Generates the DB
    • Imports raw data, metastore, stats, columnar
  • Power Test (TPT)
    • Runs queries sequentially
  • Throughput Test (TTT)
    • Runs queries concurrently
  • Includes a data refresh stage
• Produces a final performance metric
  • BB queries per minute

[Figure: benchmark workflow for a DB at a given SF. Load data, then sequential queries (Seq: q1 … q30), then concurrent users (User1: q15 q21 … q16; User2: q12 q18 … q2; UserN: …), then the metric]


Limitations of the concurrency test

Drawback 1:

• Constant-concurrency workloads at the same scale

Drawback 2:

• Does not consider QoS (isolation)
  • Query time degradation is not obvious from the final metric
• We found poor scalability under concurrency in BB [1]

Q4 from 10 to 100GB: over 15X slower

[Figure: Throughput Test streams. Stream1: q15 q21 … q16; Stream2: q12 q18 … q2; Stream3: q16 q30 … q19]

[1] Characterizing BigBench queries, Hive, and Spark in multi-cloud environments. TPCTC'17


Proposal and contributions

1. Build a realistic big data workload generator
   • Based on production workloads
2. Measure QoS in the form of per-query SLAs
   • Apply the results in a new metric
   • With minimal parameters
3. Extend TPCx-BB with a new concurrency test and metric
   • Implement a driver and evaluate differences


Realistic workload generation


Analyzing production big data workloads

• Cosmos cluster operated within Microsoft
  • Sample of 350,000 job submissions
  • Over a month of data in 2017
• Objectives:
  1. Model job submission patterns
  2. Workload characterization

[Figure: job submission pattern, annotated with Peaks and Valleys]


Modeling arrival rates

• Use a Hidden Markov Model (HMM) to model the temporal pattern in the workload
  • Probabilities between a finite number of states
• HMM allows scaling the workload

[Figure: arrival-rate pattern, annotated with Peaks and Valleys]

Fluctuations are captured by 4 states and the transitions between them.
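As a rough illustration of how a small state model can drive job arrivals, the sketch below walks a 4-state Markov chain and emits a number of job submissions per interval. The state names, transition probabilities, and emission ranges are hypothetical placeholders, not values fitted to the Cosmos trace, and the fitting step of the HMM is not shown. The `scale` parameter mimics the idea that the model allows scaling the workload.

```python
import random

# Hypothetical 4-state model: all numbers below are illustrative
# placeholders, not parameters learned from production data.
STATES = ["valley", "low", "high", "peak"]
TRANSITIONS = {
    "valley": [0.6, 0.3, 0.1, 0.0],
    "low":    [0.2, 0.5, 0.2, 0.1],
    "high":   [0.1, 0.2, 0.5, 0.2],
    "peak":   [0.0, 0.1, 0.3, 0.6],
}
# Inclusive (min, max) number of submissions emitted while in each state.
EMISSIONS = {"valley": (0, 1), "low": (1, 2), "high": (2, 3), "peak": (3, 4)}


def simulate_arrivals(n_intervals, scale=1, seed=0, start="valley"):
    """Return the number of job submissions for each time interval."""
    rng = random.Random(seed)
    state = start
    arrivals = []
    for _ in range(n_intervals):
        lo, hi = EMISSIONS[state]
        arrivals.append(scale * rng.randint(lo, hi))
        # Move to the next state according to the current state's row.
        state = rng.choices(STATES, weights=TRANSITIONS[state])[0]
    return arrivals


if __name__ == "__main__":
    print(simulate_arrivals(8))
```

Fixing the seed keeps a generated workload repeatable across runs, which matters when the same arrival sequence has to be replayed against different systems.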


Job input data size

• No general temporal pattern was found
  • A cumulative distribution is sufficient for modeling SF
• CDF used to generate random variates mapped to SF
  • 1, 10, 100, 1000 GB
  • Studied further in [2]
• Findings:
  • 55% < 1GB
  • 90% < 1TB

[Figure: CDF of the job's input data size]

[2] Big Data Data Management Systems performance analysis using Aloja and BigBench. Master thesis
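A minimal sketch of mapping random variates to scale factors through a CDF (inverse transform sampling). Only the 0.55 and 0.90 points come from the findings above; the 0.75 value for the 10 GB bucket and the way job sizes are assigned to the four scale factors are assumptions made for illustration.

```python
import bisect
import random

SCALE_FACTORS_GB = [1, 10, 100, 1000]
# Cumulative probability up to and including each scale factor.
# 0.55 and 0.90 follow the findings above; 0.75 is an assumed placeholder.
CUMULATIVE = [0.55, 0.75, 0.90, 1.00]


def scale_factor_for(u):
    """Map a uniform variate u in [0, 1) to a scale factor via the inverse CDF."""
    idx = bisect.bisect_right(CUMULATIVE, u)
    return SCALE_FACTORS_GB[min(idx, len(SCALE_FACTORS_GB) - 1)]


def sample_scale_factors(n, seed=0):
    """Draw n scale factors from the (partly assumed) distribution."""
    rng = random.Random(seed)
    return [scale_factor_for(rng.random()) for _ in range(n)]


if __name__ == "__main__":
    print(sample_scale_factors(10))
```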


Elasticity Test


Methodology for generating workloads

1. Set scale (max concurrent submissions)
   • Defaults to n
   • Total queries = n × queries per stream
2. Generate model (queries per interval)
   1. Assign queries to each batch randomly
      • Query repetition avoided within a batch
   2. Multiple scale factors can be set
      • Include all standard smaller SF
3. Define granularity
   1. Set time between batches
   2. Defaults to 60s

Elasticity Test sequence (time intervals and queries per batch):

t1: q17
t2: q7
t3: q15 q21
t4: q6 q9 q14
t5: q9 q14
t6: q11 q22 q21
t7: q16 q15
t8: q24
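The batch assignment step can be sketched as follows, assuming the batch sizes have already been produced by the arrival model. The function name `generate_batches` and its parameters are hypothetical; the only rules taken from the slides are that queries are assigned to batches at random, that a query is not repeated within a batch, and that each query is submitted at most once per stream.

```python
import random
from collections import Counter


def generate_batches(batch_sizes, query_ids, n_streams, seed=0):
    """Randomly assign queries to batches without repeats inside a batch.

    Each query may be used at most n_streams times overall.
    """
    rng = random.Random(seed)
    remaining = Counter({q: n_streams for q in query_ids})
    batches = []
    for size in batch_sizes:
        # Only queries that still have submissions left are candidates.
        candidates = sorted(q for q, left in remaining.items() if left > 0)
        if not candidates:
            break
        # random.sample draws distinct items, so a batch never repeats a query.
        batch = rng.sample(candidates, min(size, len(candidates)))
        for q in batch:
            remaining[q] -= 1
        batches.append(batch)
    return batches


if __name__ == "__main__":
    queries = [f"q{i}" for i in range(1, 15)]
    for t, batch in enumerate(generate_batches([1, 1, 2, 3, 2, 3, 2, 1], queries, 2), 1):
        print(f"t{t}:", " ".join(batch))
```

If a requested batch is larger than the number of queries still available, this sketch simply shrinks the batch; a real driver might instead reshuffle or extend the schedule.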

New SLA-aware benchmark metric

• Query-specific SLAs on concurrency
  • Sets a limit for query completion time
  • Measures:
    • Number of misses
    • Distance to SLA
    • Indirectly, isolation and dependencies
• Currently defined ad hoc
  • Uses Power Test times for the SUT(s)
  • Adds a 25% tolerance margin
• Benefits
  • Works on all SF and is future-proof with respect to technology

Example: q1 took 38s in isolation, so the SLA for q1 = 38s × 1.25 = 47.5s.
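The SLA derivation in the example can be written out as a small helper. The function and variable names are hypothetical; the only inputs taken from the slides are the Power Test time as the baseline and the 25% margin.

```python
SLA_MARGIN = 0.25  # 25% tolerance on top of the isolated (Power Test) time


def sla_from_power_time(power_time_s, margin=SLA_MARGIN):
    """Per-query SLA: the isolated run time plus the tolerance margin."""
    return power_time_s * (1 + margin)


def count_misses(run_times_s, slas_s):
    """Number of queries whose concurrent run time exceeded their SLA."""
    return sum(1 for q, t in run_times_s.items() if t > slas_s[q])


if __name__ == "__main__":
    power_times = {"q1": 38.0}
    slas = {q: sla_from_power_time(t) for q, t in power_times.items()}
    print(slas["q1"])  # 47.5, matching the example
```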

[Figure: Elasticity Test sequence over time, showing the number of queries per batch and the SLA distance]


Current TPCx-BB performance metric

[Equation: BBQpm formula, with annotations for the scale factor and the total number of queries]
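The formula itself is not present in the extracted slide text. As a hedged sketch, the function below computes BBQpm@SF the way the TPCx-BB specification is commonly summarized: SF × 60 × M divided by (T_LD + sqrt(T_PT × T_TT)), with T_LD = 0.1 × load time, T_PT = M × geometric mean of the Power Test query times, and T_TT = throughput time / number of streams. Treat these definitions as an assumption to be checked against the specification; the function and parameter names are hypothetical.

```python
import math


def bbqpm(sf, load_time_s, power_query_times_s, throughput_time_s, n_streams):
    """BBQpm@SF under the assumed TPCx-BB definitions described above."""
    m = len(power_query_times_s)  # total number of queries (M)
    t_ld = 0.1 * load_time_s  # load test term
    # Geometric mean of the per-query Power Test times.
    geo_mean = math.exp(sum(math.log(t) for t in power_query_times_s) / m)
    t_pt = m * geo_mean  # power test term
    t_tt = throughput_time_s / n_streams  # throughput test term
    return sf * 60 * m / (t_ld + math.sqrt(t_pt * t_tt))


if __name__ == "__main__":
    # Toy numbers only, not taken from the experiments.
    print(round(bbqpm(1000, 600.0, [30.0, 45.0, 60.0], 400.0, 2), 1))
```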


New SLA-aware benchmark metric

[Equation: BB++Qpm formula. Annotated terms:]

• Interval between each batch of queries
• SLA distance
• SLA factor
• Total execution time of the elasticity test


SLA distance

• Distance between the actual execution time and the specified SLA

Queries that complete within their SLA do not contribute to the sum
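One way to express this in code is sketched below. The slides state only that the distance measures how far the actual execution time is from the SLA and that queries finishing within their SLA add nothing to the sum. Dividing each overshoot by its SLA (a relative distance) is an assumption made here, and any further normalization used in the actual metric is not shown.

```python
def sla_distance(run_times_s, slas_s):
    """Sum of relative SLA overshoots; on-time queries contribute zero."""
    total = 0.0
    for query, elapsed in run_times_s.items():
        sla = slas_s[query]
        # max(0, ...) drops queries that completed within their SLA.
        total += max(0.0, elapsed - sla) / sla
    return total


if __name__ == "__main__":
    slas = {"q1": 47.5, "q2": 12.5}
    runs = {"q1": 95.0, "q2": 10.0}  # q1 misses its SLA, q2 meets it
    print(sla_distance(runs, slas))
```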


SLA factor

< 1 when fewer than 25% of the queries fail their SLA; > 1 when more than 25% of the queries fail their SLA

Number of queries that fail to meet their SLA
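A simple expression with the stated behaviour is the fraction of failed queries divided by the 25% threshold. This is only one formula that satisfies the property on the slide, chosen for illustration; it is not necessarily the exact definition used in the metric, and the names are hypothetical.

```python
def sla_factor(n_failed, n_total, threshold=0.25):
    """Below 1 when fewer than `threshold` of the queries fail, above 1 otherwise."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_failed / (threshold * n_total)


if __name__ == "__main__":
    # 28 queries in total, as in a 2-stream run of 14 queries.
    for failed in (3, 7, 14):
        print(failed, sla_factor(failed, 28))
```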


Experimental evaluation


• Experiments performed on Apache Hive (2.2/2.3) and Spark (2.1/2.2)
• Benchmark runs limited to the 14 SQL queries of TPCx-BB
  • Due to errors and scalability limitations
• Using a fixed scale factor
• Total of 512 cores and 2TB of RAM
  • 32 workers, each with 16 vCPUs and 64GB RAM
• Ran on 3 major cloud providers using block storage
  • Results anonymized
  • (Only results for Provider A at 10TB are presented)


Elasticity Test at 10TB and 2 streams

[Figure: Elasticity Test results for Provider A: Hive and Provider A: Spark]

Complete TPCx-BB test times at 10TB

Test                   Provider A: Hive   Provider A: Spark
Elasticity Time (s)    7,084              6,603
Throughput Time (s)    12,878             6,496
Power Time (s)         5,036              5,520
Load Time (s)          5,124              5,124
Total Time (s)         30,122             23,743

[Figure: stacked bar chart of the same test times, Time (s) per system]


Comparison of the two scores at 10TB

System               BBQpm (old)   BB++Qpm (new)
Provider A: Hive     1,352         295
Provider A: Spark    1,767         1,286

• Under BBQpm, the two systems differ by about 30%
• Hive gets a 4.3x lower score than Spark in the new metric
• Spark also gets a lower score in the new metric than in the old one

[Figure: bar chart of the metric scores, BBQpm vs. BB++Qpm]
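The annotations can be checked against the reported scores with a few lines of arithmetic. Reading "30% diff" as the Spark to Hive ratio under BBQpm and "4.3x" as the same ratio under BB++Qpm is an interpretation based on the numbers, not something stated explicitly on the slide.

```python
# Scores reported for the 10TB runs.
SCORES = {
    "Provider A: Hive": {"BBQpm": 1352, "BB++Qpm": 295},
    "Provider A: Spark": {"BBQpm": 1767, "BB++Qpm": 1286},
}


def ratio(metric, faster="Provider A: Spark", slower="Provider A: Hive"):
    """Score of the faster system divided by the score of the slower one."""
    return SCORES[faster][metric] / SCORES[slower][metric]


if __name__ == "__main__":
    print(f"BBQpm:   Spark/Hive = {ratio('BBQpm'):.2f}")    # about 1.31
    print(f"BB++Qpm: Spark/Hive = {ratio('BB++Qpm'):.2f}")  # about 4.36
```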


Summary and future directions


Summary

• The throughput test under TPC DB benchmarks provides limited signal
  • Closed-loop system (constant load)
  • Does not consider temporal patterns
  • Limited test of load balancers and schedulers (no queueing)
• Modeling a real-world big data cluster, we have produced:
  • A workload generator with job arrival rates
  • A multi-data-scale test
• Extended TPCx-BB with the Elasticity Test
  • Incorporating SLAs and proposing a new metric
• Evaluated its applicability to cloud big data systems
  • And how scores differ from the current metric


Conclusions and future work

• The Elasticity Test considers aspects crucial for the cloud
  • Dynamic workloads in accordance with real-world behavior
  • QoS at the query level (isolation)
• The ET can improve the development of elastic cloud systems
  • By rewarding systems that can keep QoS under concurrency
  • While saving costs in periods of low intensity

Future directions

• Test elastic DBaaS / QaaS under concurrency
• Specification of SLAs needs to be studied further
• Work with this community to gather feedback and define next steps


Thanks, questions?

Follow up / feedback : [email protected]

Benchmarking Elastic Cloud Big Data Services under SLA Constraints

TPCTC - August 2019


Extra slides


Elasticity Test at 1TB Hive: Prov A and B

SLA tester (sample)


Sample total queries and arrivals

Workload parameters:

• 10 TB scale factor
• 2 streams of 14 SQL queries
  • Total of 28 queries
• λbatch = 240 sec (4 min)


Experiments at 100GB with 8 streams (112 total queries)

[Figure: two panels. Fast system; slow system showing queueing and degraded performance]