

  • Rechen- und Kommunikationszentrum (RZ)

    Brainware als Faktor für

    energieeffizientes HPC

    Christian Bischof, Dieter an Mey, Christian Terboven

    [email protected] - HRZ, TU Darmstadt

    {anmey, terboven}@rz.rwth-aachen.de - RZ, RWTH Aachen

    20.09.2012, ZKI AK SC, Universität Düsseldorf

  • Brainware als Faktor für energieeffizientes HPC

    Christian Terboven | Rechen- und Kommunikationszentrum 2

    Motivation

    Definition of "Green HPC" from insidehpc.com:

    "Design and management techniques that contribute to the responsible, effective use of energy in the operation of high performance computing centers and equipment."

    But: the current situation hardly allows for an economical optimization of the total budget.

    There are different budgets for:

    Staff

    Hardware (mainly through applications – every X years), maintenance

    Power

    Building (mainly through applications – once per decade?!)

    Users in general don't pay for compute resources

  • Agenda

    Cost of Brainware versus Hardware

    Tuning Opportunities

    Success Stories

    Summary

  • Cost of Brainware versus Hardware

  • Understanding the Total Cost of Ownership

    Assumptions:

    2 Mio € HW investment per year

    5 years lifetime with 4 years maintenance through vendor

    Power: 850 kW, PUE = 1.5, 0.14 € per kWh => 1.5 Mio € per year

    ISV software provided by users

    Commercial batch system

    Free Linux distribution

                                   costs per year   percentage
    Building (5 Mio / 25 y)           200,000 €        3.72%
    HPC software                       50,000 €        0.93%
    ISV software                            0 €        0.00%
    Batch system                      100,000 €        1.86%
    Linux                                   0 €        0.00%
    Power                           1,500,000 €       27.93%
    Office space                            0 €        0.00%
    Staff (12 FTE)                    720,000 €       13.41%
    Hardware maintenance              800,000 €       14.90%
    Investment compute servers      2,000,000 €       37.24%
    Sum of costs                    5,370,000 €      100.00%

  • Does it pay off to hire more HPC Experts?

    Start tuning top user projects first

    15 projects account for 50% of the load

    64 projects account for 80% of the load

    Assumptions

    It takes 2 months to tune one project

    One analyst can handle 5 projects per year

    A project profits for 2 years

    As a consequence one HPC expert

    can take care of 10 projects at a time

    One FTE costs 60,000€

    Tuning can improve the code by 5, 10 or 20 percent

    [Chart: accumulated usage of top accounts (excl. JARA-HPC) and accumulated usage of top accounts, 0-100% versus account rank (1-188)]

  • Does it pay off? Yes!

    [Chart: ROI [€] versus # of tuned projects (10-160); savings with 5%, 10% and 20% improvement; 10 projects handled by one FTE (60,000 €/y)]

    For example, the break-even point: 7.5 HPC analysts improve the top 75 projects by 10% (TCO is 5.3 Mio €/y)

  • Tuning Opportunities

  • Opportunities for Tuning w/o Code Access

    Sanity Check

    Use HW Counters to Measure Performance

    To check for Performance Anomalies

    IO behavior

    System call statistics

    Hardware

    Choose the optimal hardware platform

    File system, IO parameters

    Parameterization

    Choose optimal number of threads / MPI processes

    Thread / Process Placement (NUMA)

    Mapping MPI topology to hardware topology

    MPI parameterization (buffers, protocols)

    Optimal libraries (MKL …)

  • Opportunities for Tuning w/ Code Changes

    Cache Tuning

    padding, blocking, loop based optimization techniques

    Inlining/outlining

    Help the compiler to perform optimizations …
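The blocking technique mentioned above can be illustrated with a matrix transpose (a generic sketch, not code from the talk): the naive version strides through memory on every store, while the blocked version works on cache-sized tiles.

```c
#include <stddef.h>

enum { BS = 32 }; /* tile edge; tune to the cache size of the machine */

/* Naive transpose: writes to b stride through memory with stride n,
 * missing the cache on nearly every store for large n. */
void transpose_naive(size_t n, const double *a, double *b) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            b[j * n + i] = a[i * n + j];
}

/* Blocked transpose: processing BS x BS tiles keeps both the source and
 * the destination tile cache-resident while they are worked on. */
void transpose_blocked(size_t n, const double *a, double *b) {
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS)
            for (size_t i = ii; i < ii + BS && i < n; i++)
                for (size_t j = jj; j < jj + BS && j < n; j++)
                    b[j * n + i] = a[i * n + j];
}
```

Both functions compute the same result; only the memory access pattern differs.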

    MPI optimization

    avoid global synchronization

    Hide / reduce communication overhead, use non-blocking communication

    Coalesce communications …

    OpenMP optimization

    Extend parallel regions

    Check for false sharing

    NUMA optimization: first touch, migration

    In vogue: Add OpenMP to an MPI code to improve scalability
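The first-touch point can be sketched as follows (a generic illustration, not code from the talk): on Linux, a memory page is placed on the NUMA node of the thread that first writes it, so initializing an array with the same parallel schedule the compute loops later use keeps accesses node-local. Without OpenMP the pragmas are simply ignored and the code runs serially.

```c
#include <stdlib.h>

/* First-touch initialization: each thread writes "its" chunk of x, so the
 * OS places those pages on that thread's NUMA node. The compute loop uses
 * the same static schedule and therefore mostly accesses local memory. */
double first_touch_sum(size_t n) {
    double *x = malloc(n * sizeof *x);
    if (!x) return -1.0;

    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        x[i] = 1.0;               /* the first write decides page placement */

    double s = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:s)
    for (size_t i = 0; i < n; i++)
        s += x[i];

    free(x);
    return s;
}
```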

    Of Course: Choosing the optimal Algorithm is crucial

    To be handled by or with the domain expert

  • Opportunities for Tuning w/o Code Changes

    Development Environment

    Choose the optimal compiler

    Choose optimal compiler options

    Autoparallelization

    Compiler Profile / Feedback

    Adapt dataset: Partitioning / Blocking – Load Balancing

    This list is not intended to be exhaustive, but rather to illustrate that the

    skill set of an HPC tuning expert is very different from that of an

    application scientist who develops a program, but both skill sets are

    needed.

    SimLabs

    Interdisciplinary Collaboration: HPC, Domain Expert, Numerical Expert, …

  • Teaching via Courses and Workshops

    Selected courses and workshops in Aachen, since 2011:

    Date Event

    03/2011 Parallel Programming in CES (en), 1 week, 75 part.

    05/2011 Visual Studio 2010 + Windows HPC Server workshop, 25 selected part.

    10/2011 AIXcelerate Tuning Workshop (en) (with Intel), 25 selected part.

    12/2011 Parallel Programming with MATLAB, 35 part.

    03/2012 Parallel Programming in CES (en), 1 week, 55 part.

    08-09/2012 Parallel Programming Summer Courses (en): MPI, OpenMP, Tools, …

    10/2012 Planned: Tuning for bigSMP HPC Workshop (en)

    10/2012 Planned: OpenACC Workshop (en)

    11/2012 Planned: Technical Cloud Computing with Microsoft Azure (en)

  • Success Stories

  • HPC Consulting may save real money: Combustion of “Biofuels”

    Primary breakup for Diesel sprays: hybrid CDPLit

    Adding a single OpenMP-parallelized kernel improves efficiency by approx. 10%

    Turns into a cost reduction equivalent to one FTE/yr

    Human effort: ~7 weeks

    [Chart: runtime for small test data set (16-512 s) versus nodes (1-64), for 8 PPN 1 TPP, 4 PPN 1 TPP and 4 PPN 2 TPP]

    Cluster of Excellence „Tailor-Made Fuels from Biomass“, Inst. F. Combustion Technology, RWTH Aachen University

  • Green IT = Using Resources more efficiently

    Code | Impact | Partner

    FLOWer | 1.8x higher efficiency of a hybrid Navier-Stokes solver simulating the landing of a space launch vehicle, by adding autoparallelization to MPI and carefully assigning threads to processes to adjust load imbalances | RWTH Laboratory of Mechanics

    Matlab | 150x speed-up of the numerical solution of the diffusion equation, by extracting the compute-intense kernel, transforming it to Fortran code and careful cache tuning | RWTH Institute of Physical Chemistry

    Gene-hunter | 14x speed-up through cache optimization plus scalable MPI parallelization of linkage analysis to identify genes which may cause diseases | Inst. f. Medical Biometry, Computer Science and Epidemiology (IMBIE) Bonn

    Dynmatt | 33x speed-up through I/O optimization, by implementing appropriate buffering and reducing metadata operations | RWTH Institute of Steel and Light Alloy Building

    FIRE | ~100x speed-up of image recognition software on a large SMP by nested parallelization, which saves a lot of I/O | RWTH Chair of Computer Science 6

    NestedCP | 10-50x speed-up for critical point extraction in flow simulation output data through nested parallelization with OpenMP, even with highly imbalanced work chunks | Virtual Reality Center Aachen

    TFS | 20x speed-up for simulation of human nasal flow for computer-aided surgery through nested parallelization with OpenMP | RWTH Aerodynamic Institute, Parallel Software Products

    Higher sophistication of parallelization leads to higher scalability, but does not save resources…

  • HECToR: Distributed CSE Success Stories

    Code | Domain | Effect | Effort | Saving

    CASTEP | Key materials science | 4x speed and 4x scalability | 8 PMs | 320k-480k £ (p.a.)
    NEMO | Oceanography | Speed and I/O performance | 6 PMs | 95k £ (p.a.)
    CASINO | Quantum Monte-Carlo | 4x performance and 4x scalability | 12 PMs | 760k £ (p.a.)
    CP2K | Materials science | 12% speed and scalability | 12 PMs | 1500k £ (in total)
    GLOMAP/TOMCAT | Atmospheric chemistry | 15% performance | ? |
    CITCOM | Geodynamic thermal convection | 30% performance | ? | significant
    EBL | Fluid turbulence | 4x scalability | 12 PMs |
    ChemShell | Catalytic chemistry | 8x performance | 9 PMs |
    Fluidity-ICOM | Ocean modelling | Scalability | ? |
    DL_POLY_3 | Molecular dynamics | 20x performance | 6 PMs |
    CARP | Heart modelling | 20x performance | 8 PMs |

  • HPC Consulting may save real money: Hydro-Dynamics with XNS

    XNS (M. Behr, CATS, RWTH)

    Simulation of hydro-dynamic forces of the Ohio Dam

    Parallelized with MPI

    Scales very well for the larger case

    Additional OpenMP parallelization: 9 parallel regions

    Human effort: ~6 weeks

    20-40% improvement

    [Chart: improvement of execution time in percent, best effort MPI only versus best effort hybrid, over # compute nodes (1-64); Nehalem EP cluster (2 processor chips, 4 cores each) with InfiniBand QDR; higher is better]

  • XNS: How much efficiency do we want to sacrifice? – Execution Time

    PPN = processes per node, TPP = threads per process

    [Chart: execution time [sec] (20-1280) versus # compute nodes (1-64) for PPN/TPP combinations from PPN1 TPP1 to PPN8 TPP1; lower is better]

  • XNS: How much efficiency do we want to sacrifice? – Efficiency

    PPN = processes per node, TPP = threads per process

    [Chart: efficiency [%] (0-100%) versus # compute nodes (1-64) for PPN/TPP combinations from PPN1 TPP1 to PPN8 TPP1; higher is better]

  • XNS: How much efficiency do we want to sacrifice? – Improvements versus Efficiency

    [Chart: parallelization efficiency (best effort) and relative improvement of the hybrid version (best efforts) versus # compute nodes (1-64)]

  • Summary

  • Summary

    A higher investment in brainware pays off, even if it is only for tuning

    A university can save real money by investing in brainware rather than in electricity for inefficiently used hardware

    HPC performance analysts are a rare species

    Needed: HW knowledge, tools, programming languages, compiler technologies + paradigms, (algorithms), OS effects

    It takes some time to hire anyone and get him/her up to speed

    Team work: more different brains create more synergy

    And now there are GPUs …

    They have much higher head room for tuning

  • The End – and an Invitation …

    German Heterogeneous Computing Group (GHCG)

    Independent interest group around high-performance computing with accelerators in the German-speaking region

    Goal: intensify the technical and scientific exchange on projects, hardware and algorithms

    User group meeting

    Date: October 1 + 2, 2012

    Location: Braunschweig (Haus der Wissenschaft)

    Register and participate (free of charge)!

    www.ghc-group.org (registration & further information)

    Everyone is welcome!

    Topics (among others):

    News in hardware and software

    CFD on heterogeneous architectures