Willersbau

Raum A104

Tel. +49 351 - 463 - 42483

Robert Schöne (robert.schoene@tu-dresden.de)

Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)

Leistungsanalyse von Rechnersystemen (Performance Analysis of Computer Systems)

Comparing systems using sample data

- BenchIT -


Contributions

Jupp Müller

Daniel Molka

Jens Domke

Dr. Stefan Pflüger

Daniel Reiche

BenchIT Team


Agenda

Implementation Guidelines and Feature Overview

BenchIT GUI – Measuring and Plotting

BenchIT Website

Case Study – Optimizing STREAM for Intel Core 2


Implementation Guidelines

Platform independent

– POSIX conformance

– ANSI C conformance

Uses only sh and cc

No makefiles

Minimized size of the sources

Plain text for

– Configuration data

– Results

GPL licence model


The BenchIT Concept – From Measurement to Analysis

[Figure: On the measurement side, users and user groups produce result files and send them to a server with a database and a WWW front end. On the analysis side, users and groups retrieve the stored results as plots (X axis, Y axis).]

BenchIT – Step by Step

[Figure, built up over several slides: workflow from kernel sources to compared results]

Console: use an editor to edit the kernel sources and the LOCALDEFS, then start the steps compile (kernel sources to executable), run (executable to result file), and create (result file to eps, png, ...)

Upload: result files go to the BenchIT database; the BenchIT-Website is used to compare results

BenchIT-GUI: edit, start, view/plot, create, and compare results from one interface

BenchIT – Different Solutions for Specialized Purposes

BenchIT measurement

– Scripts (COMPILE.sh, RUN.sh, reference_run.sh)

– BenchIT-GUI for

• Local Measurement

• Remote Measurement

- Compile and run on the remote system

- Cross-compile on the host system and only run on the remote system

BenchIT visualization of results and comparison of different runs

– BenchIT-Website

– BenchIT-GUI

Examples: Memory Latency

Measuring the latency to the different memory levels

Problem size: size of used memory

Benchmark: pointer chasing

ptr=first;
do{
    ptr=(void **) *ptr;
} while (ptr!=first);
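The loop above only works once the memory has been linked into a closed chain. The following is a minimal sketch of that setup, not BenchIT's kernel: `build_chain`, `chase`, and the fixed-stride linking are assumptions made for illustration. A real latency kernel would more likely link the slots in a random order so that hardware prefetchers cannot predict the next address.

```c
#include <assert.h>
#include <stdlib.h>

/* Link n slots into one cyclic chain: slot i points to slot (i + stride) % n.
   If stride and n share no common divisor, the chain visits every slot
   exactly once before returning to the start. Returns NULL on failure. */
static void **build_chain(size_t n, size_t stride)
{
    assert(n > 0);
    void **slots = malloc(n * sizeof *slots);
    if (slots == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        slots[i] = (void *)&slots[(i + stride) % n];
    return slots;
}

/* Follow the chain once around, as in the slide's loop, and return the
   number of dependent loads that were performed. */
static size_t chase(void **first)
{
    void **ptr = first;
    size_t hops = 0;
    do {
        ptr = (void **)*ptr;  /* each load depends on the previous one */
        hops++;
    } while (ptr != first);
    return hops;
}
```

Dividing the measured time for one call of `chase` by the returned hop count gives an average latency per load; the problem size is `n * sizeof(void *)` bytes.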

Examples: MPI Latency

Measuring the latency between different MPI nodes

Problem size: ID of sender-receiver pair

Benchmark: ping pong

if (myRank==receiver(ID)){
    MPI_Recv();  /* arguments omitted */
    MPI_Send();
}
if (myRank==sender(ID)){
    MPI_Send();
    MPI_Recv();
}

Examples: Floating Point Performance

Measuring floating-point performance when the data resides in different memory levels

Problem size: memory size

Benchmark: matrix multiplication

for (i=0;i<N;i++)
    for (j=0;j<N;j++)
        for (k=0;k<N;k++)
            c[i][j]=c[i][j]+a[i][k]*b[k][j];
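To turn this loop nest into a performance figure, one needs the operation count and the size of the working set. The sketch below is an illustration under assumptions: the flat row-major layout and the helper names `matmul`, `matmul_flops`, and `matmul_bytes` are not taken from BenchIT.

```c
#include <assert.h>
#include <stddef.h>

/* c += a * b for n x n matrices stored row-major in flat arrays. */
static void matmul(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* One multiplication and one addition per innermost iteration,
   so 2 * n^3 floating-point operations in total. */
static double matmul_flops(size_t n)
{
    return 2.0 * (double)n * (double)n * (double)n;
}

/* Bytes occupied by the three matrices. This working set decides
   which memory level the data is served from. */
static size_t matmul_bytes(size_t n)
{
    return 3 * n * n * sizeof(double);
}
```

Dividing `matmul_flops(n)` by the measured run time of `matmul` gives FLOP/s, which can then be plotted against `matmul_bytes(n)`.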

Examples: Bandwidth

Measuring the bandwidth of different memory levels

Problem size: memory size

Benchmark: STREAM like

for (i=0;i<N;i++)
    c[i]=a[i];
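A hedged sketch of how the copy loop could be timed and converted into a bandwidth. The helpers `now_seconds`, `copy_seconds`, and `copy_bytes` are hypothetical names; inside a BenchIT kernel the timer would be `bi_gettime()` rather than `clock_gettime`. The byte count follows the STREAM convention of counting one read and one write per element.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Current time in seconds from a monotonic POSIX clock. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Bytes moved by one pass of c[i] = a[i]: one read plus one write
   of a double per element. */
static double copy_bytes(size_t n)
{
    return 2.0 * (double)n * (double)sizeof(double);
}

/* Run the copy loop once and return the elapsed time in seconds. */
static double copy_seconds(size_t n, const double *a, double *c)
{
    assert(n > 0 && a != NULL && c != NULL);
    double start = now_seconds();
    for (size_t i = 0; i < n; i++)
        c[i] = a[i];
    return now_seconds() - start;
}
```

The bandwidth is `copy_bytes(n) / copy_seconds(n, a, c)`. For small `n` the elapsed time can fall below the timer resolution, which is the problem the case study later addresses with repetitions.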

Writing a measurement kernel

Naming convention

category.name.language.parallelLibs.otherLibs.ID

– numerical.matmul.C.0.0.double

– memory.latency.C.0.0.pointerchasing
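The naming convention can be illustrated by splitting a kernel name into its six dot-separated fields. This is a sketch only: `struct kernel_name`, `parse_kernel_name`, and `uses_parallel_libs` are hypothetical helpers, and reading a "0" in the parallelLibs field as "none" is an assumption based on the two examples above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { FIELD_LEN = 32 };  /* the "%31" widths below must stay FIELD_LEN - 1 */

struct kernel_name {
    char category[FIELD_LEN];
    char name[FIELD_LEN];
    char language[FIELD_LEN];
    char parallel_libs[FIELD_LEN];
    char other_libs[FIELD_LEN];
    char id[FIELD_LEN];
};

/* Split "category.name.language.parallelLibs.otherLibs.ID" into fields.
   Returns 0 if six non-empty fields were found, -1 otherwise.
   Extra trailing fields are not rejected by this sketch. */
static int parse_kernel_name(const char *full, struct kernel_name *out)
{
    assert(full != NULL && out != NULL);
    int fields = sscanf(full,
                        "%31[^.].%31[^.].%31[^.].%31[^.].%31[^.].%31[^.]",
                        out->category, out->name, out->language,
                        out->parallel_libs, out->other_libs, out->id);
    return fields == 6 ? 0 : -1;
}

/* Assumption: "0" in the parallelLibs field means no parallel library. */
static int uses_parallel_libs(const struct kernel_name *k)
{
    return strcmp(k->parallel_libs, "0") != 0;
}
```

For example, parsing `numerical.matmul.C.0.0.double` yields the category `numerical`, the language `C`, and the ID `double`.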

Clear Interface to program against:

– bi_getinfo

Used by benchit to get information about the measurement kernel

– bi_init

Called by benchit to initialize data for the measurement kernel

– bi_entry

Called n times by benchit to generate results

– bi_cleanup

Called by benchit to free allocated resources
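The call order of the four functions can be sketched with a small stand-alone mock. Everything here is hypothetical: the `sketch_` names, the struct layout, and the signatures are stand-ins chosen for illustration and do not reproduce BenchIT's actual interface. Storing the x value in `results[0]` is likewise an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the info struct; the real one differs. */
struct sketch_info {
    int max_problem_size;  /* number of entry calls, not a byte count */
    int num_functions;     /* measured values per entry call */
};

enum { ELEMENTS_PER_STEP = 1024 };  /* assumed growth per problem size */

static void sketch_getinfo(struct sketch_info *info)
{
    info->max_problem_size = 4;
    info->num_functions = 1;
}

/* Allocate once for the largest problem; entry calls use a prefix. */
static void *sketch_init(int max_problem_size)
{
    assert(max_problem_size > 0);
    return calloc((size_t)max_problem_size * ELEMENTS_PER_STEP,
                  sizeof(double));
}

/* One measurement. results[0] holds the x value, results[1] the value of
   the single function (a sum here, standing in for a real timing). */
static int sketch_entry(void *data, int id, double *results)
{
    const double *field = data;
    size_t n = (size_t)id * ELEMENTS_PER_STEP;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += field[i];
    results[0] = (double)n;
    results[1] = sum;
    return 0;
}

static void sketch_cleanup(void *data)
{
    free(data);
}

/* Driver mimicking the framework: getinfo, init, n entry calls, cleanup.
   Returns the number of successful entry calls, or -1 on failure. */
static int run_sketch_kernel(void)
{
    struct sketch_info info;
    sketch_getinfo(&info);
    void *data = sketch_init(info.max_problem_size);
    if (data == NULL)
        return -1;
    int calls = 0;
    for (int id = 1; id <= info.max_problem_size; id++) {
        double results[2];  /* num_functions + 1 entries */
        if (sketch_entry(data, id, results) == 0)
            calls++;
    }
    sketch_cleanup(data);
    return calls;
}
```

The point of the split is that allocation and teardown happen once, outside the measured region, while the entry function is called repeatedly with a growing ID.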


bi_getinfo

BenchIT passes an info struct, defined in the interface

The kernel should fill in the following information:

– X/Y axis settings

– Legend texts

– Outlier direction

– "maxproblemsize" (not the real problem size, but the number of bi_entry calls)

– Usage of parallel libraries

– Number of functions

– Definition of the "best" result

bi_init / bi_cleanup

bi_init

Called once before measurements start

"maxproblemsize" is passed

Should allocate large data fields; bi_entry may use only parts of them

Should initialize used libraries, devices, …

May return ONE pointer to its data

bi_cleanup

Called once after the measurement

The pointer returned by bi_init is passed

Should free resources

bi_entry

Called several times

Pointer returned by bi_init and ID passed

ID is the number of the measurement, which may also serve as its problem size

Result value vector passed (double[number of functions +1] )

Should do measurement

Can use:

– bi_gettime() gets current time in seconds as double

– dTimerOverhead means overhead for bi_gettime()

– dTimerGranularity means granularity of bi_gettime()

Results should be stored in result vector
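BenchIT supplies the timer overhead and granularity itself. The sketch below only illustrates how such values could be estimated and applied; `now_seconds`, `timer_overhead`, `timer_granularity`, and `net_time` are hypothetical names and this is not BenchIT's implementation.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Current time in seconds as a double, from a monotonic POSIX clock. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Estimate the cost of one timer call from back-to-back calls. */
static double timer_overhead(int calls)
{
    assert(calls > 0);
    double start = now_seconds();
    for (int i = 0; i < calls; i++)
        (void)now_seconds();
    return (now_seconds() - start) / calls;
}

/* Smallest nonzero step observed between two consecutive readings. */
static double timer_granularity(int samples)
{
    assert(samples > 0);
    double best = 0.0;
    for (int i = 0; i < samples; i++) {
        double t0 = now_seconds();
        double t1;
        do {
            t1 = now_seconds();
        } while (t1 == t0);  /* spin until the clock visibly advances */
        if (best == 0.0 || t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Duration of an interval with the timer overhead removed,
   clamped at zero so noise cannot yield a negative time. */
static double net_time(double start, double stop, double overhead)
{
    double t = stop - start - overhead;
    return t > 0.0 ? t : 0.0;
}
```

A measurement whose net time is not clearly larger than the granularity is unreliable; repeating the measured operation until it is would be one way to address that.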


If there's so much to write …

Why should I use BenchIT?

BenchIT stores information about the compile-time and run-time environment

BenchIT makes batch systems transparent to use

BenchIT selects the "best" result

BenchIT allows easy comparison

BenchIT provides tools for remote measurement

BenchIT GUI – Measuring and Plotting

BenchIT GUI – Start
BenchIT GUI – definition of local system
BenchIT GUI – select a kernel
BenchIT GUI – run kernel …
BenchIT GUI – run kernel … finished
BenchIT GUI – show result
BenchIT GUI – result with default settings
BenchIT GUI – changing settings (before)
BenchIT GUI – changing settings (after)
BenchIT GUI – result plot with new settings
BenchIT GUI – running on a remote machine
BenchIT GUI – define a remote machine
BenchIT GUI
BenchIT GUI – automatic generation of definitions
BenchIT GUI – switching local definitions
BenchIT GUI – loading definitions from remote machine
BenchIT GUI – new definitions loaded
BenchIT GUI
BenchIT GUI – changing some settings
BenchIT GUI – running pointerchasing on remote system
BenchIT GUI – selecting the target system
BenchIT GUI – pointerchasing running remote …
BenchIT GUI – pointerchasing running remote … done
BenchIT GUI – getting results from remote machine
BenchIT GUI – result from remote machine
BenchIT GUI – comparing both results
BenchIT GUI – comparing both results, better layout
BenchIT GUI – connecting to web server
BenchIT GUI – selecting results from web server
BenchIT GUI – getting results for Pentium M
BenchIT GUI – results from web server
BenchIT GUI – putting all together …
BenchIT GUI – … and another one
BenchIT GUI – exported to png

BenchIT Website

Analysis/Plot: 3 Different Analysis Paths, Stored Plots

Compare Different Architectures

Compare Different Processors

Kernels which run on both Systems

Compare their Memory Access Time

Select Additional Information

Compared Results

Compare a specific Kernel

Compare Memory Latencies (Pointerchasing)

Compare a Larger Set of Systems

Not Satisfying?

Compare Different Implementations

Compare Different Compilers

Compare Different Compiler Flags

Compare Different Processor Generations

Compare Different Libraries

Share ...

Share with specific user groups

Case Study – Optimizing STREAM for Intel Core 2

Intel Core 2 Duo Processor

[Figure: block diagram of the Intel Core 2 Duo with two cores (Core 0, Core 1). Labels recoverable from the diagram:

– Caches and TLBs per core: 32 KiB L1 instruction cache, 32 KiB L1 data cache, ITLB, DTLB

– Front end: branch prediction, fetch and predecode, instruction queue (18 x86 instructions), decode (4+1 x86 instructions; one complex decoder with 4 µops, three simple decoders with 1 µop each), microcode ROM; transfer labels 16 byte, 6 x86, 4+1 x86

– Out-of-order engine: rename/alloc, reorder buffer (96 entries), reservation station (32 entries), physical registers (alloc, free)

– Execution ports 0 to 5: three Int ALU / Int SIMD units, FP ADD, FP MUL, load, store address, store data

– Memory: memory order buffer, load/store buffers, data paths labeled 128 bit and 256 bit

– Shared between the cores: 4 MiB L2 cache (dynamically allocated), bus interface unit, FSB]

The STREAM Benchmark – Source Code Excerpt

# define N 2000000
# define NTIMES 10
# define OFFSET 0
...
static double a[N+OFFSET],
              b[N+OFFSET],
              c[N+OFFSET];
...
for (k=0; k<NTIMES; k++)
{
    times[0][k] = mysecond();
#pragma omp parallel for
    for (j=0; j<N; j++)
        c[j] = a[j];
    times[0][k] = mysecond() - times[0][k];
    ...
}

First Measurements

Unsatisfactory results, imprecise for small problem sizes

– STREAM is designed for large memory accesses

– STREAM is very simplistic

Only a single problem size is measured per run

– Recompilation for every measurement

– For cache accesses: compiling takes longer than measuring

Consequence: reimplementation in BenchIT

First Measurements - Reimplementation

Design of the benchmark untouched, but

– Dynamic memory allocation

– Variable problem size

– Using RDTSC (the x86 time-stamp counter) for timing

No optimizations done

Offset still 0 (STREAM default)


Derived STREAM Benchmark

[Plots: results of the derived STREAM benchmark, with the L1 cache and L2 cache regions marked. Annotation: bandwidth in L2 cache approx. 20 GB/s]

Optimizations – Reduce Overhead

Still unsatisfying results in the L1 cache

Too much overhead due to OpenMP

Solution:

Move time measurement into parallel region

Repeat every operation

Only increased timer accuracy

BUT:

Loops are moved into parallel regions too!

Repetitions apply to every single operation, not to the whole loop

Optimizations – Align Memory for SSE Access

Still relatively low cache performance

Previous measurements have shown

– 16-byte alignment is important for performance

– The compiler directive #pragma vector aligned helps the compiler make use of the alignment

Solution:

– Vectors are now 16-byte aligned

– Both parts of each vector have a length that is a multiple of 2

– The compiler directive was introduced
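A minimal sketch of how 16-byte aligned vectors with an even length could be allocated. It uses POSIX `posix_memalign`; the helper names `round_up`, `alloc_aligned`, and `is_aligned` are assumptions and this is not the allocation code of the actual benchmark.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Round an element count up to the next multiple of `multiple`. */
static size_t round_up(size_t n, size_t multiple)
{
    assert(multiple > 0);
    return (n + multiple - 1) / multiple * multiple;
}

/* Allocate n doubles starting at an address that is a multiple of
   `alignment` bytes. posix_memalign requires the alignment to be a power
   of two and a multiple of sizeof(void *). Returns NULL on failure;
   release the memory with free(). */
static double *alloc_aligned(size_t n, size_t alignment)
{
    void *p = NULL;
    if (posix_memalign(&p, alignment, n * sizeof(double)) != 0)
        return NULL;
    return p;
}

/* Check whether a pointer sits on an `alignment`-byte boundary. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment > 0);
    return (uintptr_t)p % alignment == 0;
}
```

With 8-byte doubles, an even element count keeps every 16-byte SSE access within the vector, for example `alloc_aligned(round_up(n, 2), 16)`.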


Optimizations – Align Memory for Better Cache Access

Still unstable behavior for small problem sizes

Better performance for vector lengths that are a multiple of 16 (8 for single-threaded runs)

8 * 8 bytes (double-precision floating point) = 64 bytes (cache line length)

Solution:

Align vectors at a 128-byte boundary for 2 threads
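The arithmetic behind these numbers can be sketched as follows, assuming 8-byte doubles and the 64-byte cache line stated above. The helpers `doubles_per_line`, `padded_length`, and `thread_offset_bytes` are hypothetical; they show one way to make every thread's part consist of whole cache lines, not necessarily the way the benchmark does it.

```c
#include <assert.h>
#include <stddef.h>

enum { CACHE_LINE_BYTES = 64 };  /* cache line length from the slide */

/* Doubles that fit into one cache line (8 when sizeof(double) is 8). */
static size_t doubles_per_line(void)
{
    return CACHE_LINE_BYTES / sizeof(double);
}

/* Smallest length >= n that splits into `threads` equal parts, each made
   of whole cache lines: a multiple of 8 for 1 thread, of 16 for 2. */
static size_t padded_length(size_t n, size_t threads)
{
    assert(threads > 0);
    size_t unit = threads * doubles_per_line();
    return (n + unit - 1) / unit * unit;
}

/* Byte offset at which thread t's part of the vector begins. */
static size_t thread_offset_bytes(size_t padded_n, size_t threads, size_t t)
{
    assert(threads > 0 && t < threads && padded_n % threads == 0);
    return t * (padded_n / threads) * sizeof(double);
}
```

If the vector itself starts on a cache line boundary and its length comes from `padded_length`, no cache line is shared between the two threads.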


Examination of Other Multicore-CPUs

                      Intel Xeon 5160   Intel Core Duo T2600   Intel Xeon 5060   AMD Opteron 285
Codename              Woodcrest         Yonah                  Dempsey           Italy
Compiler              icc 9.1-em64t     icc 9.1                icc 9.1-em64t     icc 9.1-em64t
Clock rate            3.0 GHz           2.167 GHz              3.2 GHz           2.6 GHz
L1 D-Cache per core   32 KiByte         32 KiByte              16 KiByte         64 KiByte
L2 Cache              4 MiByte shared   2 MiByte shared        2*2 MiByte        2*512 kByte
