
Michael M. Resch • Wolfgang Bez • Erich Focht • Hiroaki Kobayashi • Nisarg Patel
Editors

Sustained Simulation Performance 2014

Proceedings of the Joint Workshop on Sustained Simulation Performance, University of Stuttgart (HLRS) and Tohoku University, 2014


Editors

Michael M. Resch, Nisarg Patel
High Performance Computing Center Stuttgart (HLRS)
University of Stuttgart
Stuttgart, Germany

Wolfgang Bez
NEC High Performance Computing Europe GmbH
Düsseldorf, Germany

Erich Focht
NEC High Performance Computing Europe GmbH
Stuttgart, Germany

Hiroaki Kobayashi
Cyberscience Center
Tohoku University
Sendai, Japan

Front cover figure: Schematic view of the Integrated Earthquake Simulation of a Tokyo Metropolis Earthquake for seismic response analysis. The number of buildings analyzed exceeds 1,000,000. Illustrated by Muneo Hori, Earthquake Research Institute, The University of Tokyo, Tokyo, Japan

ISBN 978-3-319-10625-0        ISBN 978-3-319-10626-7 (eBook)
DOI 10.1007/978-3-319-10626-7
Springer Cham Heidelberg New York Dordrecht London

Library of Congress Control Number: 2014956566

Mathematics Subject Classification (2010): 68Wxx, 68W10, 68Mxx, 68U20, 76-XX, 86A10, 70Fxx, 92Cxx, 65-XX

© Springer International Publishing Switzerland 2015
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Preface

The field of high-performance computing is currently witnessing a significant paradigm shift. The ever larger raw number-crunching capabilities of modern processors are in principle available to computational scientists, and the knowledge needed to exploit modern processors efficiently is growing by leaps and bounds in the scientific community.

On the other hand, many areas of computational science have reached a saturation in terms of problem size. Scientists often no longer wish to solve larger problems; instead, they wish to solve smaller problems in a shorter time. The current architectures, however, are much more efficient for larger problems than they are for the more relevant smaller problems.

This series of workshops focuses on sustained simulation performance, i.e., high-performance computing for real application use cases, rather than on peak performance, which is obtained only for artificial problem sizes. The series was established in 2004, initially named the Teraflop Workshop, and renamed the Workshop on Sustained Simulation Performance in 2012. In general terms, the scope of the workshop series has shifted from optimization for vector computers such as the NEC SX-8, through efficient usage of large-scale systems, including the NEC SX-9 as well as cluster installations, to an emphasis on future challenges, productivity, and feasibility of current and future high-performance computing systems.

This book presents the combined results of the 18th and 19th installments of the series. The 18th workshop was held at the High-Performance Computing Center, Stuttgart, Germany, in October 2013. The 19th workshop was held in March 2014 in Sendai, Miyagi, Japan, and organized jointly with Tohoku University, Sendai, Japan.

The topics covered by the contributed papers include application-driven approaches towards the future of HPC systems (Part I); framework analysis, scalability, and the exploitation of performance and productivity on modern and existing hardware architectures (Part II); and application use-case studies in interdisciplinary fields (Part III).

We would like to thank all the contributors to and organizers of this book and the Sustained Simulation Performance project. We especially thank Prof. Hiroaki Kobayashi for the close collaboration over the past years and look forward to intensifying our cooperation in the future.

Stuttgart, Germany        Nisarg Patel
Stuttgart, Germany        José Gracia
Stuttgart, Germany        Michael Resch
August 2014


Contents

Part I Sustainability of Future HPC Systems: Application Driven Challenges

Feasibility Study of a Future HPC System for Memory-Intensive Applications: Final Report ..... 3
Hiroaki Kobayashi
1 Introduction ..... 3
2 System Architecture ..... 7
3 Performance Estimation ..... 10
4 Summary ..... 15
References ..... 16

The GASPI API: A Failure Tolerant PGAS API for Asynchronous Dataflow on Heterogeneous Architectures ..... 17
Christian Simmendinger, Mirko Rahn, and Daniel Gruenewald
1 Introduction ..... 17
2 GASPI Overview ..... 19
  2.1 History ..... 19
  2.2 Goals ..... 20
3 The GASPI Concepts ..... 21
  3.1 GASPI Execution Model ..... 21
  3.2 GASPI Groups ..... 21
  3.3 GASPI Segments ..... 22
4 GASPI One-Sided Communication ..... 23
  4.1 Basic Calls ..... 24
  4.2 Weak Synchronization ..... 25
  4.3 Extended Calls ..... 26
5 GASPI Passive Communication ..... 27
6 GASPI Global Atomics ..... 29
7 GASPI Collective Communication ..... 29
8 GASPI Failure Tolerance ..... 31
  8.1 GASPI Timeouts ..... 31
  8.2 GASPI Error Vector ..... 31
Conclusion ..... 31
References ..... 32

Characteristic Analysis of Applications for Designing a Future HPC System ..... 33
Osamu Watanabe, Takashi Soga, Youichi Shimomura, and Akihiro Musa
1 Introduction ..... 34
2 Social and Scientific Challenges ..... 35
  2.1 Natural Disaster Mitigation ..... 35
  2.2 High Productivity Engineering ..... 35
3 Application Requirements for the Future System ..... 37
4 Performance Estimation on our Designed System ..... 40
5 Potential of Overcoming the Challenges by Using our Designed System ..... 43
  5.1 Natural Disaster Mitigation ..... 43
  5.2 High Productivity Engineering ..... 43
6 Summary ..... 44
Appendix ..... 45
References ..... 45

Enhancing High Performance Computing with Cloud Concepts and Technologies ..... 47
Bastian Koller and Michael Gienger
1 Introduction ..... 47
2 Current Situation in HPC ..... 48
3 High Performance Computing and/or Clouds ..... 49
  3.1 High Performance Computing Compared with Clouds ..... 49
  3.2 Complementary Use of HPC and Cloud ..... 50
4 Cloud Based Access to HPC: Fortissimo as an Example ..... 51
  4.1 Introducing the Fortissimo Project ..... 51
  4.2 Realizing the One-Stop-Shop ..... 51
5 The Road to Further HPC-Cloud Solutions ..... 54
Conclusions ..... 55
References ..... 56

SX-ACE, Brand-New Vector Supercomputer for Higher Sustained Performance I ..... 57
Shintaro Momose
1 Introduction ..... 57
2 Architecture of SX-ACE ..... 59
3 Implementation ..... 62
4 Performance Evaluation ..... 64
Conclusions ..... 66
References ..... 67

SX-ACE, the Brand-New Vector Supercomputer for Higher Sustained Performance II ..... 69
Noritaka Hoshi and Shintaro Momose
1 Introduction ..... 69
2 Concept of Design ..... 70
  2.1 Big Core Concept ..... 70
  2.2 Reduction of Power and Space ..... 71
3 Architecture Overview ..... 72
4 Implementation ..... 74
5 Performance Evaluation ..... 76
Conclusion ..... 78
References ..... 79

Feasibility Study of a Future HPC System for Memory Intensive Applications: Conceptual Design of Storage System ..... 81
Ken'ichi Itakura, Akihiro Yamashita, Koji Satake, Hitoshi Uehara, Atsuya Uno, and Mitsuo Yokokawa
1 Introduction ..... 81
2 Objectives ..... 82
  2.1 Design Cycle ..... 82
  2.2 Requirements from Applications ..... 82
3 Design Concept and Result ..... 83
4 Storage System Performance ..... 85
5 Summary ..... 87
References ..... 88

Part II Exploitation of Existing HPC Systems: Potentiality, Performance and Productivity

Designing an HPC Refactoring Catalog Toward the Exa-scale Computing Era ..... 91
Ryusuke Egawa, Kazuhiko Komatsu, and Hiroaki Kobayashi
1 Introduction ..... 91
2 Performance Portability of HPC Applications ..... 92
3 Designing an HPC Refactoring Catalog ..... 93
  3.1 Design Concepts ..... 93
  3.2 Current Status of the HPC Refactoring Catalog ..... 96
  3.3 Ongoing and Future Work ..... 97
Conclusions ..... 98
References ..... 98

Endorsing Supercomputing Applications to Java Language ..... 99
Alexey Cheptsov and Bastian Koller
1 Introduction ..... 99
2 Related Work ..... 101
  2.1 MPI Bindings for Java ..... 101
  2.2 Native C Implementations of MPI ..... 102
  2.3 Non-MPI Based Approaches ..... 102
3 Design and Implementation ..... 104
  3.1 Objectives ..... 104
  3.2 Architecture ..... 105
  3.3 Configuration and Running ..... 107
4 Performance Evaluation ..... 109
  4.1 Basic Benchmarks ..... 109
  4.2 Pilot Application Scenario: Random Indexing Over Large Text Sets ..... 113
5 Future Work ..... 114
Conclusion ..... 116
References ..... 116

Performance Evaluation of an OpenMP Parallelization by Using Automatic Parallelization Information ..... 119
Kazuhiko Komatsu, Ryusuke Egawa, Hiroyuki Takizawa, and Hiroaki Kobayashi
1 Introduction ..... 119
2 OpenMP Parallelization by Using Automatic Parallelization Information ..... 121
3 Performance Evaluation ..... 122
  3.1 Experimental Environments ..... 122
  3.2 Performance of OpenMP Codes Parallelized by Using Automatic Parallelization Information ..... 123
Conclusions ..... 125
References ..... 126

EXTOLL and Data Movements in Heterogeneous Computing Environments ..... 127
Holger Fröning
1 Introduction ..... 127
2 EXTOLL ..... 128
  2.1 Communication Engines ..... 130
  2.2 Key Performance Characteristics ..... 132
  2.3 Additional Reading ..... 132
3 Global GPU Address Spaces ..... 133
  3.1 GPUs and Accelerated Clusters ..... 133
  3.2 A Thread-Collaborative Communication Model ..... 134
  3.3 Key Performance Characteristics ..... 136
  3.4 Additional Reading ..... 136
4 Related Work ..... 136
Conclusion ..... 137
References ..... 138

Requirements for Modern Network Infrastructures ..... 141
Jens Aßmann, Alexander Kiontke, and Sabine Roller
1 Motivation ..... 141
  1.1 MPLS Traffic Engineering in OSPF Networks: A Combined Approach ..... 142
  1.2 Enabling Software Defined Network (SDN) in Old School Networks with Software-Controlled Routing Protocols ..... 142
2 Requirements for Modern Network Development at the University ..... 143
  2.1 Collision Domain ..... 143
  2.2 Routing in the Core ..... 144
  2.3 Routing with Redundant ISP Connection ..... 146
  2.4 Optical Fibre ..... 147
  2.5 Optical Fibre with MPLS ..... 147
Conclusion ..... 149
3 Further Work ..... 149
References ..... 149

Interconnection Network: Design Space Exploration of Network for Supercomputers ..... 151
Kentaro Sano
1 Introduction ..... 151
2 Assumption for Design Space Exploration ..... 152
3 Preliminary Comparison Among Possible Topologies ..... 153
4 Detailed Evaluation ..... 158
Conclusions ..... 160
References ..... 161

Part III Computational Approach Towards Engineering and Multi-Physics Applications

Experiences in Developing HPC Software with Portable Efficiency ..... 165
Daniel Friedrich Harlacher, Harald Klimach, and Sabine Roller
1 Introduction ..... 165
2 Building Blocks in HPC Software Design ..... 167
  2.1 Implementation Language ..... 167
  2.2 Portability ..... 168
  2.3 Ease of Use ..... 169
  2.4 Maintaining a Scientific HPC Application ..... 170
Conclusions ..... 171
References ..... 171

Petascale Computations for Large-Scale Atomic and Molecular Collisions ..... 173
Brendan M. McLaughlin and Connor P. Ballance
1 Introduction ..... 173
2 Parallel R-matrix Photoionization ..... 175
3 Scalability ..... 175
4 X-ray and Inner-Shell Processes ..... 178
5 Heavy Atomic Systems ..... 180
  5.1 Kr and Xe Ions ..... 180
  5.2 Tungsten (W) Ions ..... 182
6 Future Directions and Emergence of GPUs ..... 184
References ..... 185

FPGA-Based Scalable Custom Computing Accelerator for Computational Fluid Dynamics Based on Lattice Boltzmann Method ..... 187
Kentaro Sano
1 Introduction ..... 187
2 Tightly-Coupled FPGA Cluster for Scalable Custom Computing ..... 189
  2.1 Architecture of Tightly-Coupled FPGA Cluster ..... 189
  2.2 Design and Implementation of a Cluster Node ..... 190
  2.3 Acceleration Framework on FPGA ..... 192
3 Case Study: Custom Computing with Lattice Boltzmann Method ..... 193
  3.1 Lattice Boltzmann Method ..... 193
  3.2 Architecture for Stream Computation ..... 194
  3.3 PE Design for Fully-Streamed Computation ..... 195
4 Performance Evaluation ..... 196
  4.1 Implementation of PEs ..... 196
  4.2 Resource Consumption ..... 197
  4.3 Computational Performance ..... 198
Conclusions ..... 200
References ..... 200

Application of HPC to Earthquake Hazard and Disaster Estimation ..... 203
Muneo Hori, Tsuyoshi Ichimura, Maddegedara L.L. Wijerathne, and Kouhei Fujita
1 Introduction ..... 203
2 Overview of HPC Application ..... 204
  2.1 Capability Computing ..... 204
  2.2 Capacity Computing ..... 205
3 Structure Seismic Response Analysis ..... 206
  3.1 Fault-Structure System of Nuclear Power Plant ..... 206
  3.2 Reinforced Concrete Pier ..... 209
4 Urban Area Seismic Response Analysis ..... 212
  4.1 Overview of Urban Area Seismic Response Analysis ..... 212
  4.2 Partial Reproduction of 2011 Great East Japan Earthquake Disaster ..... 214
  4.3 Partial Estimation of Tokyo Metropolis Earthquake ..... 216
Conclusion ..... 219
References ..... 219

Geometry Dependent Computational Study of Patient Specific Abdominal Aortic Aneurysm ..... 221
Nisarg Patel and Uwe Küster
1 Introduction ..... 221
2 Image Modeling ..... 223
  2.1 Image Acquisition and Segmentation ..... 223
  2.2 Image Processing ..... 223
3 Computational Modeling ..... 228
  3.1 Finite Element Model ..... 230
  3.2 Fluid Simulation Model ..... 230
4 Results ..... 233
References ..... 237


Part I
Sustainability of Future HPC Systems: Application Driven Challenges


Feasibility Study of a Future HPC System for Memory-Intensive Applications: Final Report

Hiroaki Kobayashi

Abstract In the last 2 years, we have been involved in a project entitled "a feasibility study of a future HPC system for memory-intensive applications." In this project, we have analyzed representative applications that will need exascale computing around 2020, and clarified the design specifications for a high-end computing system that will become available around 2018 and be best suited for these applications. This article reports the results of the conceptual design and the performance estimation of the system obtained through the project.

1 Introduction

According to projections of the trend in the Top500 ranking, many people expect that an exa-flop/s system will be available around 2019 and be ranked No. 1 in that time frame [5]. Under such expectations of the HPC community around the world, the US, Europe, China, and Japan have started several strategic HPC programs targeting the realization of exascale systems around 2020.

In Japan, MEXT (Ministry of Education, Culture, Sports, Science and Technology) organized a committee to discuss the HPC policy of Japan for the next 5 to 10 years of research and development on national leading supercomputers in the era after the K computer, which was the first 10-peta-flop/s LINPACK system in 2011. During the discussion, the committee decided to start a program entitled Feasibility Study of Future HPCI Systems last year. The objectives of this program are to

• discuss future high-end systems capable of satisfying the social and scientificdemands for HPC in the next 5–10 years in Japan, and

• investigate hardware and software technologies for developing future high-end systems available around 2018 that satisfy the social and scientific demands.

In this project, three teams, namely the University of Tokyo with Fujitsu (Project Leader: Professor Yutaka Ishikawa), the University of Tsukuba with Hitachi (Project Leader: Professor Mitsuhisa Sato), and Tohoku University with NEC (Project Leader: Hiroaki Kobayashi), started the feasibility studies of high-end computing systems for the exascale computing era as a 2-year national project in 2012. In this article, we present our system design approach to exascale computing, especially for memory-intensive applications.

H. Kobayashi (✉)
Tohoku University, Sendai 980-8578, Japan
e-mail: [email protected]

© Springer International Publishing Switzerland 2015
M.M. Resch et al. (eds.), Sustained Simulation Performance 2014,
DOI 10.1007/978-3-319-10626-7_1

Fig. 1 Memory requirements of applications: required memory bandwidth [Byte/Flop] versus required memory capacity [PB]. Quantum chemistry and nuclear physics fall into the computation-intensive region, while structural analysis, fluid dynamics, MD, weather, cosmo physics, and particle physics are memory-intensive


The system design philosophy of the Tohoku University team is to increase the productivity of high-performance computing. In the last decade, the peak performance of high-end computing systems has been boosted enormously by aggregating a huge number of nodes, each of which consists of multiple fine-grain cores, because the LINPACK benchmark for the Top500 ranking is computation-intensive, and inflating the peak computing performance, rather than enhancing the memory throughput, is the key factor for a higher position in the ranking. However, according to the application development roadmap report summarized in Japan in 2012 [3], many important applications in a wide variety of science and engineering fields are memory-intensive and need 0.5 B/F or more from an HPC system, as shown in Fig. 1. Here, B/F stands for bytes per flop, defined as the ratio of the memory throughput in bytes/s to the computing performance in flop/s of an HPC system. Therefore, if we continue to develop high-end computing systems by concentrating on increasing flop/s rates, simply targeting exa-flop/s in 2020 rather than memory bandwidth, their applicable areas will narrow; i.e., there is a high probability that only a few percent of an exa-flop/s peak performance would be effective in the execution of practical applications, because many arithmetic units stall waiting for the arrival of data and end up wasted during execution. Therefore, we rethink the design of high-performance computing systems around the quality of parallel processing, not the quantity of parallel processing, for the era of exascale computing around 2020. Our key message is to realize 100x more sustained performance with 10x peak performance for memory-intensive applications, compared with the K computer, by exploiting dormant computing capability through increased memory, network, and I/O throughputs.
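To make the definition concrete, the worked example below writes out the B/F ratio; the SX-9 numbers are taken from the labels in Fig. 2 (a 256 GB/s STREAM bandwidth against a peak of roughly 102.4 Gflop/s per CPU, the latter being our reading of the figure rather than a value stated in the text):

```latex
% B/F as defined above: memory throughput divided by compute performance.
\[
  \mathrm{B/F} =
  \frac{\text{memory throughput [bytes/s]}}
       {\text{computing performance [flop/s]}}
\]
% Worked example with the SX-9 figures labelled in Fig. 2:
\[
  \mathrm{B/F}_{\text{SX-9}} =
  \frac{256\ \mathrm{GB/s}}{102.4\ \mathrm{Gflop/s}} = 2.5
\]
```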


Fig. 2 Attainable performances of HPC processors in the roofline model, plotted against application B/F (memory access intensity, from 8 down to 0.01). Each processor's roofline is labelled with its STREAM bandwidth (256 GB/s for the SX-9 at 2.5 B/F, and from 10.0 to 72.95 GB/s for the others); large application B/F values correspond to memory-intensive applications, small values to computation-intensive applications


To make HPC much more productive, we have to improve the efficiency of computing as much as possible. The efficiency is defined as the ratio of the sustained performance of a real application to the peak performance of an HPC system. To this end, we have two design policies for future HPC systems. One is to make the memory throughput balanced with the floating-point performance so as to keep a high bytes-per-flop rate, i.e., 10x or more compared with those of competitive HPC systems. The other is to make the processing cores much more coarse-grain and high-performance, avoiding the excessive parallelism that results from the fine-grain, low-performance cores of the current trend in processor design. Figure 2 shows the attainable performances of several representative HPC processors in the roofline model [4] as a function of application B/F, which is defined as the ratio of the number of bytes of memory accesses to the number of floating-point operations in the hottest kernel of an application. The roofline model suggests that the peak performances of the individual processors are meaningful only for applications with an application B/F of 0.25 or lower; for applications with 0.25 or larger, i.e., memory-intensive applications, performance is severely limited by the memory throughput. Therefore, as the throughput of the commodity-based memory subsystem improves very slowly compared with the flop/s performance of an HPC system, the B/F rates of future HPC systems are getting smaller, and their applicable areas are narrowing accordingly. As a result, we have to keep the memory throughput of an HPC system balanced with its flop/s performance to make the system applicable to, and productive for, a wide variety of science and engineering applications.
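As a minimal illustration of the roofline relation just described, the Python sketch below caps attainable performance at either the compute peak or the STREAM bandwidth divided by the application B/F; the machine figures in the example are the SX-9-like values read off Fig. 2 and are used for illustration only.

```python
def attainable_gflops(peak_gflops: float, stream_bw_gbs: float,
                      app_bf: float) -> float:
    """Roofline model: sustained performance is bounded by the compute
    peak or by how fast memory can feed the cores, whichever is lower.

    app_bf is the application B/F: bytes of memory access per
    floating-point operation in the hottest kernel.
    """
    if app_bf <= 0.0:
        return peak_gflops  # no memory traffic: purely compute-bound
    return min(peak_gflops, stream_bw_gbs / app_bf)

# Assumed SX-9-like figures from Fig. 2: 256 GB/s STREAM, ~102.4 Gflop/s peak.
# The peak is reachable only up to the machine balance of 2.5 B/F.
for app_bf in (0.125, 0.25, 1.0, 2.5, 4.0, 8.0):
    print(f"app B/F {app_bf:6.3f}: "
          f"{attainable_gflops(102.4, 256.0, app_bf):7.1f} Gflop/s")
```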

Page 19: Wolfgang Bez Erich Focht - download.e-bookshelf.de

6 H. Kobayashi

Fig. 3 Sustained performance of large-grain cores and fine-grain cores

The second point in the design of future productive HPC systems is to make the parallel processing granularity larger in order to reduce the degree of parallel processing. Figure 3 shows a comparison in sustained performance between two implementations of a 1 Tflop/s processor: one with ten 100 Gflop/s large-grain cores and the other with a hundred 10 Gflop/s fine-grain cores. As the number of cores increases, overheads due to communications and synchronizations become significant. Therefore, a system with a smaller number of large-grain cores leads to a higher sustained performance even with a lower parallel efficiency, as Amdahl's law suggests. Of course, the larger cores have to achieve a high efficiency themselves, with intra-core vectorization and parallelization supported by a high memory throughput.
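The granularity argument can be made concrete with a simple Amdahl-style model, sketched below in Python; the 5 % communication/synchronization fraction is an assumed, illustrative value, not a figure from the study.

```python
def sustained_gflops(core_gflops: float, n_cores: int,
                     overhead_fraction: float) -> float:
    """Amdahl-style estimate: a fixed fraction of the work (serial code,
    communication, synchronization) does not speed up with more cores."""
    speedup = 1.0 / (overhead_fraction + (1.0 - overhead_fraction) / n_cores)
    return core_gflops * speedup

# Two implementations of the same 1 Tflop/s processor, assuming a 5%
# non-parallelizable fraction in both cases:
print(sustained_gflops(100.0, 10, 0.05))   # 10 x 100 Gflop/s cores -> ~690 Gflop/s
print(sustained_gflops(10.0, 100, 0.05))   # 100 x 10 Gflop/s cores -> ~168 Gflop/s
```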

Based on the design policies discussed above, we have come up with goals and corresponding approaches in the design of a future HPC system that realizes highly productive HPC, especially for memory-intensive applications, as shown in Fig. 4. In order to realize a high memory bandwidth balanced with the peak flop/s performance, we design a memory subsystem with a B/F rate of 1 up to 2 by aggressively introducing innovative device technologies such as 2.5D interposer technology, 3D die-stacking technology, and their combination. A B/F rate of up to 2 is a very challenging design specification, 10x or more beyond what is currently discussed for the design of future HPC systems. In addition to the design of the memory subsystem, the designs of highly efficient large-grain processing cores, sockets, nodes, and network are also important to make parallel processing much more efficient and productive. To improve vector processing capability, we design an advanced vector architecture with a large on-chip vector load/store buffer that reduces the penalties of handling short vectors and indirect memory accesses, and with these large-grain vector processing cores we construct a large node with a high-bandwidth, large shared memory. We try to reduce the total number of nodes connected via the network while satisfying the level of sustained performance that the target applications need.


Fig. 4 Goals and approaches of a future HPC system for memory-intensive applications:

• Memory issues: high bandwidth and balanced B/F; a memory bandwidth of 1–2 B/F at low power by using advanced device technologies such as 2.5D/3D die stacking
• Vector processing issues: an advanced vector architecture hiding the short-vector penalty, supported by a large on-chip vector load/store buffer at 4 B/F with a random access mechanism
• Node issues: large-grain nodes for reducing the total number of MPI processes; high-performance nodes composed of a small number of large cores of 256 Gflop/s each
• Network issues: well-balanced local/neighboring and global communication capability; a short-diameter, low-latency network with high-radix switches
• Storage/IO issues: a scalable storage system for data assimilation and a checkpointing/restart mechanism; a hierarchical distributed storage system with locally high bandwidth and a globally large shared storage capacity
• System software issues: compliance with standard programming models and tools; new functionalities for fault tolerance and resource-aware/power-aware job management; Linux with advanced functionalities


The design of the storage system is also important because there is an increasing demand to handle big data in HPC, such as data assimilation for the atmosphere-ocean coupled simulation. In addition, the storage system plays an important role in supporting a checkpoint/restart mechanism, which is mandatory to make the system much more dependable. Therefore, our approach to a scalable storage system for large data assimilation and efficient checkpointing/restarting is to design a hierarchical distributed storage system with locally high bandwidth and a globally large shared storage capacity. The last but not least issue is system software design. We investigate a Linux-based operating system with several advanced functions that provide a standard programming environment specially designed for our vector-parallel processing mechanism, together with high fault tolerance and effective resource-aware/power-aware job management.

2 System Architecture

Figure 5 shows the overview of the conceptual design of our system. A CPU socket consists of four vector-processing cores and a 32 MB VLSB (vector load/store buffer). Each core has a performance of 256 Gflop/s and is connected to the VLSB at a bandwidth of up to 1 TB/s, resulting in a B/F rate of 4 per core and a peak CPU performance of 1 Tflop/s.

Fig. 5 Overview of the system architecture: each node (4 Tflop/s) comprises four CPUs of four 256 Gflop/s cores each (1 Tflop/s per CPU), a 32 MB VLSB per CPU connected at ~1 TB/s (4 B/F), and a 2.5D/3D die-stacked shared memory of ~256 GB at ~2 TB/s (2 B/F); nodes are connected via a hierarchical network to a hierarchical distributed storage system

The VLSB is a software-controllable on-chip buffer that supports vector load/store operations and works very effectively when they exhibit locality of reference in their vector processing. Software-controllable means that the VLSB can selectively hold only vector data with high locality, to reduce pollution of the VLSB by vector data with low locality. The VLSB can also effectively assist gather/scatter operations to reduce the latency of list vector processing.
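The following toy Python model sketches one way the software-controlled retention policy described above could behave; the interface and the hit/miss bookkeeping are entirely hypothetical illustrations, not the actual hardware mechanism.

```python
class VLSB:
    """Toy model of a software-controllable vector load/store buffer:
    the compiler/runtime tags each vector stream, and only streams
    tagged as high-locality are retained, so low-locality streams
    cannot pollute the buffer. Hypothetical sketch only."""

    def __init__(self, capacity_bytes: int = 32 * 2**20):  # 32 MB as in the text
        self.capacity = capacity_bytes
        self.resident = {}   # stream id -> bytes held in the buffer
        self.used = 0

    def access(self, stream_id: int, nbytes: int, high_locality: bool) -> str:
        if stream_id in self.resident:
            return "hit"     # served from the on-chip buffer (4 B/F path)
        if high_locality and self.used + nbytes <= self.capacity:
            self.resident[stream_id] = nbytes   # retain for later reuse
            self.used += nbytes
        return "miss"        # served from the shared memory (2 B/F path)
```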

A node consists of four CPU sockets and a shared memory. The shared memory has four channels connecting the CPU sockets at a bandwidth of up to 2 TB/s each, so the node B/F is also up to 2. To realize such a high memory bandwidth, we design the memory subsystem with 2.5D and 3D die-stacking technologies. We design two types of memory subsystem, considering the tradeoff between performance and cost: a custom design and a commodity-based design, as shown in Fig. 6. The custom design is the aggressive one, realizing 2 B/F at the node. Figure 7 shows the design of the memory modules, which use four stacked DRAM devices and a custom-designed memory controller on a Si interposer connected through TSVs (Through-Silicon Vias). The commodity-based memory subsystem design is also examined to reduce cost while sacrificing performance. In the commodity-based design, we use Hybrid Memory Cube (HMC) devices of Micron [2] instead of DDR4 devices [1]. Both of them are expected to become available around 2018; however, HMC is more promising for obtaining a high bandwidth of the memory subsystem at the level of 1 B/F, even though the cost of HMC is higher than that of DDR4 due to the emerging 3D die-stacking technology used in HMC's high-bandwidth memory device design. The B/F of the commodity-based design is half that of the custom design, but we think it is still 10x higher than that of the scalar-based architectures expected to become available around 2018.
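As a quick sanity check of these numbers, the short sketch below derives the core-level and node-level B/F rates from the design figures stated above and in Fig. 5, taking 1 TB/s as 1,024 GB/s so that the ratios come out exactly.

```python
# Design figures from the text and Fig. 5.
CORE_GFLOPS   = 256.0    # per vector core
CORES_PER_CPU = 4
VLSB_BW_GBS   = 1024.0   # ~1 TB/s per core into the 32 MB VLSB
MEM_BW_GBS    = 2048.0   # ~2 TB/s per memory channel (one per CPU socket)
CPUS_PER_NODE = 4

cpu_gflops  = CORE_GFLOPS * CORES_PER_CPU   # 1,024 -> ~1 Tflop/s per socket
node_gflops = cpu_gflops * CPUS_PER_NODE    # ~4 Tflop/s per node

core_bf = VLSB_BW_GBS / CORE_GFLOPS                  # 4.0 B/F core-to-VLSB
node_bf = MEM_BW_GBS * CPUS_PER_NODE / node_gflops   # 2.0 B/F node-to-memory
print(core_bf, node_bf)
```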


Fig. 6 A memory subsystem

Fig. 7 Memory module design: four stacked DRAM devices and a memory controller on a Si interposer, with four host links and a high-bandwidth, low-power memory interface (128 GB/s)


We examined two topologies for the network system: a fat tree and a fat-tree/torus hybrid (FTT-Hybrid), as shown in Fig. 8. As the preliminary evaluation in Fig. 9 suggests, the fat tree has a shorter diameter but a relatively larger cable delay compared with the FTT-Hybrid. The FTT-Hybrid has a lower cable delay and lower cost; however, we have decided to use the fat tree because, in addition to its topological advantages, it provides ease and flexibility of job scheduling when partitioning the system among a lot of small and/or medium-sized jobs.

Figure 10 shows the hierarchical storage system. The storage system at the first layer is an SSD-based local storage providing a high bandwidth for handling big data and checkpointing/restarting; the second layer is composed of a large number of commodity hard disk drives forming a large global storage that holds big data such as data assimilation inputs and checkpointed snapshots of memory images. A detailed discussion of the storage system is available in [7] in this book.
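One common way to organize such a two-tier checkpoint path is sketched below; the mount points, the synchronous-then-asynchronous split, and the API are hypothetical illustrations, not the storage design of the project (see [7] for that).

```python
import shutil
import threading
from pathlib import Path

FAST_TIER = Path("/ssd/ckpt")      # hypothetical mount point: local SSD layer
SLOW_TIER = Path("/global/ckpt")   # hypothetical mount point: global HDD layer

def checkpoint(snapshot: bytes, name: str) -> None:
    """Write the snapshot to the high-bandwidth local tier first (on the
    application's critical path), then drain it to the large shared tier
    in the background, off the critical path."""
    local = FAST_TIER / name
    local.write_bytes(snapshot)
    threading.Thread(target=shutil.copy2,
                     args=(local, SLOW_TIER / name), daemon=True).start()
```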


Fig. 8 Examined networks. FTT-Hybrid: a global 2D torus of 16 x 16 groups, each group being a local 256-node two-stage full fat tree (32 nodes per switch, with 32 + 32 switches per group). Fat tree: a three-layer fat tree with 1,024 first-layer, 1,024 second-layer, and 512 third-layer switches of 64 links each, 1,024 nodes per island, and about 25,000 (of 32,768) nodes across 32 islands


3 Performance Estimation

To evaluate the design presented in Sect. 2, we have developed and analyzed target applications that will be needed around 2020 to satisfy the social and scientific demands for HPC at that time. In particular, after the 3.11 Great East Japan Earthquake in 2011, the people of Japan are seriously concerned about securing a safe homeland against natural disasters. Therefore, we examine ten applications in total that cover natural disaster prevention and mitigation areas such as earthquakes, tsunamis, typhoons, and their compound phenomena. In addition to applications for natural disaster prevention and mitigation, we also investigate some important applications in the engineering field. The details of the target applications analyzed in the project are available in [6] in this book.


Fig. 9 Preliminary evaluation of network topologies: fat tree, torus, Dragonfly, and the FTT-Hybrid (a hierarchical structure of fat tree and torus)

Fig. 10 A hierarchical storage system


Figure 11 summarizes the computation, memory, and I/O demands of our 14 target applications. All the applications are memory-intensive, needing a B/F of 2 or more, up to over 9. They also need exa-level computation that should be completed within several hours. Using these applications, we evaluate the designed system through a simulator we developed. In the evaluation, we analyze two types of system: the custom memory system with 2 B/F and the commodity memory system with 1 B/F. These two systems have the same computing performance of 100 Pflop/s and are equipped with a 250 PB storage system and a 40 GB/s fat-tree network.


Fig. 11 Target applications and their computing demands

Figure 12 shows the evaluation results. The x-axis lists the application names and the y-axis shows the execution time normalized by the individual expected execution times given in Fig. 11; values of 1.0 or smaller therefore mean that the computation is completed within the expected execution time of that application. As the experimental results show, the 2 B/F system satisfies the computing demands of 13 of the 14 applications. The 1 B/F system completed six applications within the expected execution time, but even for the applications it cannot complete within the expected time, it needs only a 30 % longer execution time on average. We think that the 2 B/F system is the best solution for highly productive HPC for memory-intensive applications in the 2018–2020 time frame; however, as our estimation suggests that the 1 B/F system can achieve a 20 % reduction in power consumption and a 30 % reduction in manufacturing cost, the 1 B/F system also offers a good trade-off between cost and performance.
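The pass/fail criterion used in Fig. 12 reduces to a normalized execution time, as in the trivial sketch below; the numbers are hypothetical stand-ins for the per-application values of Fig. 11.

```python
def normalized_time(simulated_s: float, expected_s: float) -> float:
    """Execution time normalized by the application's expected execution
    time; a demand is satisfied when the result is 1.0 or smaller."""
    return simulated_s / expected_s

print(normalized_time(3.0 * 3600, 4.0 * 3600))   # 0.75 -> demand satisfied
print(normalized_time(5.2 * 3600, 4.0 * 3600))   # 1.30 -> ~30% over expectation
```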

Readers of this article may think that the 100 Pflop/s performance of our base design is considerably lower than the expectations for exa-scale and/or exa-flop/s computing. To answer this question, we examine the most likely commodity system available around 2018 in comparison with our designed system. Figure 13 shows our assumptions for the commodity-based system to be available in 2018. The