
The Vampir Performance Analysis Tool

Hans–Christian Hoppe

Gesellschaft für Parallele Anwendungen und Systeme mbH

Pallas GmbH, Hermülheimer Straße 10, D-50321 Brühl, Germany

info@pallas.com
http://www.pallas.com

SCICOMP 2000 Tutorial, San Diego

© Pallas GmbH

Outline

Performance tools for parallel programming

Performance analysis for MPI

The Vampir tool

The Vampir roadmap


Why performance tools?

CPUs and interconnects are getting faster all the time

Compilers are improving

“Abundance of computing power”

Shouldn’t it be sufficient to just write an application and let the system do the rest?


Why performance tools?

In reality, severe performance bottlenecks remain:
– Slow memory access (instructions and data)
– Cache consistency effects
– Starvation of instruction units
– Contention of interconnection systems
– Adverse interaction with schedulers


Why performance tools?

The application programmer does the rest:
– Excessive sequential sections
– Bad load balance
– Non-optimized communication patterns
– Excessive synchronization

Performance analysis tools can:
– Help to diagnose system-level performance problems
– Help to identify user-level performance bottlenecks
– Assist users in improving their applications


Achieved performance vs. effort

[Chart: code performance vs. effort – debuggers and tools like KAP get a non-working code running; OpenMP and then MPI raise performance further, with performance tools supporting each stage.]


Performance tools – goals?

Holy grail:
– Automatic parallelisation and optimization
– One code version for sequential and parallel
– One code version for all platforms
– Automatic code verification
– Automatic performance verification
– Automatic detection of performance problems
– Integration of performance analysis and parallelisation


Event–based MPI Analysis

Record a trace of the application execution:
– Calls to MPI and user routines
– MPI communication events
– Source locations
– Values of performance registers or program variables

From a trace, a performance analysis tool can show:
– Protocol of execution over time
– Statistics for MPI routine execution
– Statistics for communication
– Dynamic calling tree

Important advantage: focus on any phase of the execution
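Recording these events is possible without modifying the MPI library itself: the MPI profiling interface lets a tracer intercept every MPI call and forward it to the real implementation through its PMPI_ alias. A minimal sketch of such a wrapper in C (the fprintf stands in for a real trace-record writer):

    /* Profiling-interface wrapper: intercept MPI_Send, timestamp it,
       then delegate to the real implementation via PMPI_Send. */
    #include <mpi.h>
    #include <stdio.h>

    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();              /* event entry time */
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        double t1 = MPI_Wtime();              /* event exit time  */

        /* A real tracer writes a compact binary record instead. */
        fprintf(stderr, "MPI_Send dest=%d tag=%d count=%d [%f,%f]\n",
                dest, tag, count, t0, t1);
        return rc;
    }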


Vampirtrace details

Vampirtrace™:
– Instrumentation library producing traces for Vampir and Dimemas
– Supports MPI-1 (incl. collective operations) and MPI-I/O
– Exploits the MPI profiling interface
– Works with vendor MPI implementations
– API for user-level instrumentation
– Capability to filter for event subsets

Developed, productized and marketed by Pallas

Available for IBM SP, PE 3.x
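The user-level instrumentation API lets an application mark its own routines so they appear as states in the trace. A minimal sketch, assuming the classic VT.h interface with VT_symdef/VT_begin/VT_end; exact names and signatures may differ between Vampirtrace releases:

    /* User-level instrumentation: define a symbol within an activity,
       then bracket the code region with begin/end events. */
    #include <mpi.h>
    #include <VT.h>

    #define SYM_SOLVE 1   /* arbitrary user-chosen symbol code */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);   /* Vampirtrace initializes here */

        /* Symbol "solve" grouped under activity "Application".   */
        VT_symdef(SYM_SOLVE, "solve", "Application");

        VT_begin(SYM_SOLVE);      /* enter the user-defined state */
        /* ... solver kernel shown as "solve" in Vampir ...       */
        VT_end(SYM_SOLVE);        /* leave the user-defined state */

        MPI_Finalize();           /* trace file is written here   */
        return 0;
    }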


Vampir details

Vampir™:
– Event-trace visualization tool
– Analyzes MPI and user routines
– Analyzes point-to-point, collective and MPI-I/O operations
– Focus on arbitrary execution phases
– Execution and communication statistics
– Filter processes, messages, and user/MPI routines

Jointly developed by TU Dresden and Pallas
Productized and marketed by Pallas

Available for IBM RS6000, AIX 4.2/AIX 4.3


Dimemas details

Dimemas:
– Event-based performance prediction tool
– Parameterized machine model
  • CPU performance
  • Communication and network performance
– Predicts performance on the modeled platform
– What-if analysis determines the influence of parameters

Jointly developed by UPC Barcelona and Pallas

Productized and marketed by Pallas

Available for IBM RS6000, AIX 4.2/AIX 4.3
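The effect of a parameterized machine model can be illustrated with the standard linear communication-cost model. This is only a conceptual sketch of what-if analysis, not Dimemas's actual model; the parameter values below are invented:

    /* Toy machine model: predicted time for an n-byte message is
       T(n) = latency + n / bandwidth.  What-if analysis replays the
       trace with different parameter sets. */
    typedef struct {
        double latency;      /* seconds per message  */
        double bandwidth;    /* bytes per second     */
        double cpu_factor;   /* CPU speed relative to the traced run */
    } machine_model;

    double predict_msg_time(const machine_model *m, double nbytes)
    {
        return m->latency + nbytes / m->bandwidth;
    }

    /* Example parameter sets (invented values):
       machine_model base = { 40e-6, 100e6, 1.0 };   current network
       machine_model fast = {  5e-6,   1e9, 1.0 };   what-if network */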


Vampir main window

Vampir 2.5 main window

– Tracefile loading can be interrupted at any time
– Tracefile loading can be resumed
– Tracefile can be loaded starting at a specified time offset
– Tracefile can be re-written


Summary chart

Aggregated profiling information:
– Execution time
– Number of calls

Inclusive or exclusive of called routines


Vampir state model

User specifies activities and symbol grouping
Look at all/any activities or all symbols

[Diagram: state model example for the summary chart – activities (Calculation, Tracing, MPI) grouping symbols such as MPI_Send, MPI_Recv, MPI_Wait, ssor, exchange]


Timeline display

To zoom, mark a region with the mouse


Timeline display – message details

Click on a message line

[Screenshot callouts: message send op, message receive op, message information]


Communication statistics

Message statistics for each process/node pair:
– Byte and message count
– Min/max/avg message length and bandwidth
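Such a matrix can be built in a single pass over the trace. A sketch with invented data structures, one accumulator per sender/receiver pair:

    /* Per-pair message statistics, accumulated from send events.
       Data structure and limits are invented for illustration. */
    #include <float.h>

    #define MAXPROC 512

    typedef struct {
        long   messages;           /* message count            */
        double bytes;              /* total volume             */
        double min_len, max_len;   /* message length extremes  */
    } pair_stats;

    static pair_stats stats[MAXPROC][MAXPROC];

    void account_message(int from, int to, double len)
    {
        pair_stats *s = &stats[from][to];
        if (s->messages == 0) { s->min_len = DBL_MAX; s->max_len = 0.0; }
        s->messages += 1;
        s->bytes    += len;
        if (len < s->min_len) s->min_len = len;
        if (len > s->max_len) s->max_len = len;
        /* avg length = bytes / messages, computed at display time */
    }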


Message histograms

Message statistics by length, tag or communicator:
– Byte and message count
– Min/max/avg bandwidth


Collective operations

For each process: mark operation locally

Connect start/stop points by lines

[Diagram callouts: start of op, data being sent, data being received, stop of op, connection lines]


Collective operations

Click on the collective operation display:
– See global timing info
– See local timing info


MPI-I/O operations

I/O transfers are shown as lines

Click on an I/O line to see detailed I/O information


Activity chart

Profiling information for all processes


Global calling tree

Display for each symbol:
– Number of calls
– Min/max execution time

Fold/unfold or restrict to subtrees


Process–local displays

– Timeline (showing calling levels)
– Activity chart
– Calling tree (showing number of calls)


Effects of zooming

Select one iteration

Updated summary

Updated message statistics


Compare traces

Compare profiling information:
– To check load balance (between processes)
– To evaluate scalability (different runs)
– To look at optimization effects (different code versions)

Compare processes 6 and 19

Comparison by routine


Coupling Vampir and Dimemas

Actual program run vs. ideal communication


Vampir/Vampirtrace roadmap

Ongoing developments:
– Scalability enhancements
– Functionality enhancements
– Instrumentation enhancements

Will first be available commercially on NEC and Compaq platforms:
– Earth Simulator
– ASCI machines

PathForward developments for ASCI machines


Scalability challenges

Scalability in processor count:
– ASCI-class machines have 1000s of processors
– High-end systems have 100s of processors
– Applications use most of them

Scalability in time:
– Need to analyze actual production runs (hours/days)

Scalability in detail:
– Record and analyze system-specific performance data
– Support for threaded and hybrid models


Scalability problems

Counter-based profiling tools scale reasonably well, but:
– Severely limited in the level of detail
– Can't focus on parts of the application run

Event-based tools have problems:
– Event traces get really large
– Display tools use huge amounts of memory
– Many displays do not scale

Example: Vampir tracefiles for NAS NPB-LU
– 128 processes: 3,000,000 records (120 Mbyte)
– 256 processes: 15,000,000 records (600 Mbyte)
– 512 processes: 150,000,000 records (6 Gbyte)
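All three runs come out at roughly 40 bytes per record (e.g. 120 Mbyte / 3,000,000 records), so the volume is driven entirely by the record count, and that count grows much faster than the process count: 4x the processes (128 to 512) yields 50x the trace data.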


Threaded programming models

Enhance Vampir to display:
– Thread fork/join
– Thread synchronization
– A timeline per thread / threads aggregated into a single timeline
– Subroutine/code block execution for each thread

Create an instrumentation library for thread packages (see the sketch below)

Integrate instrumentation capability into OpenMP systems
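One common way to build such an instrumentation library (a sketch of the general interposition technique, not the Pallas implementation) is to intercept the thread package's entry points through the dynamic linker:

    /* Interpose on pthread_create: record a fork event, then delegate
       to the real function looked up via RTLD_NEXT. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <dlfcn.h>
    #include <stdio.h>

    int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                       void *(*start)(void *), void *arg)
    {
        static int (*real_create)(pthread_t *, const pthread_attr_t *,
                                  void *(*)(void *), void *);
        if (!real_create)
            real_create = (int (*)(pthread_t *, const pthread_attr_t *,
                                   void *(*)(void *), void *))
                          dlsym(RTLD_NEXT, "pthread_create");

        fprintf(stderr, "trace: thread fork\n");  /* record fork event */
        return real_create(thread, attr, start, arg);
    }

Built as a shared object and activated with LD_PRELOAD, this records an event at every thread fork without recompiling the application.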


Cluster node display

Cluster information is already recorded

Enhance Vampir to:
– Show aggregate execution information per node
– Show communication volume per node


Cluster timeline display

– Display node-level information
– Show communication volume within nodes
– Show communication between nodes as usual
– Allow expanding nodes into processes

There may be more than two hierarchy levels ...



Structured tracefile format

Subdivide the tracefile into frames:
– Time intervals, thread/process/node subsets

Put frame data:
– All in one file (as today)
– In multiple files (one per frame ...)
– On a parallel filesystem (exploit parallelism)

Frame index file holds:
– Location of frame start/end
– Frame statistic data for immediate display
– "Frame thumbnail"
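A hypothetical layout for one frame-index entry (all field names invented to illustrate the concept):

    /* One entry of the frame index file: where the frame lives and
       the precomputed statistics shown before the frame is loaded. */
    typedef struct {
        double start_time, end_time;  /* interval covered by the frame   */
        long   file_offset;           /* where the frame data starts     */
        long   file_length;           /* ... and its length in bytes     */
        long   records;               /* number of events in the frame   */
        double time_in_mpi;           /* aggregate MPI time (statistic)  */
        double bytes_sent;            /* communication volume (statistic) */
    } frame_index_entry;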


Structured tracefile format

Vampir loads the frame index

Displays immediately available:
– Global profiling/communication statistics
– By-frame profiling/communication statistics
– Thumbnail timeline

User gets an overview of the application run:
– Can load particular frame data
– Can navigate between frames

User can refine instrumentation/tracing:
– Get detailed traces of interesting frames


Dynamic tracing control

What can be controlled:
– Definition of frames
– Data to be recorded per frame

Control methods (see the sketch below for the API approach):
– Instrumentation with the Vampirtrace API
– Binary instrumentation (atom) or use of a debugger
– Configuration file
– Interactive control agent (debugger)

Tracing the right data is an iterative process!
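For the API-based method, a sketch of per-phase control, assuming the VT_traceoff/VT_traceon calls of the classic Vampirtrace API (names may vary between releases):

    /* Trace only the iterations of interest: tracing is switched off
       during warm-up and re-enabled around the interesting frames. */
    #include <VT.h>

    void timestep_loop(int nsteps)
    {
        int i;
        VT_traceoff();                    /* skip the warm-up phase   */
        for (i = 0; i < nsteps; i++) {
            if (i == 100) VT_traceon();   /* start recording a frame  */
            if (i == 110) VT_traceoff();  /* ... and stop again       */
            /* ... compute and communicate ... */
        }
    }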


Cluster timeline display

For very large systems, one still can't look at the complete system (too many nodes)

Display "interesting" nodes only:
– Regarding communication volume/delays
– Regarding load imbalance
– Regarding execution times of particular code modules


Scalable Vampir structure

Scalable user interface, scalable internals

[Architecture diagram: Vampir DC (user interaction, trace data analysis, display handling) runs on a workstation; Vampir SC (trace data processing, trace data I/O) runs on the parallel system; the two exchange data and control, and the structured trace data may exploit a parallel filesystem.]


Access to Pallas tools

Download free evaluation copies from http://www.pallas.com
