
  • Parallel code for multicore systems

    An overview of programming models

    Multicore Briefing, 13.10.2011

  • Overview

    There are many models to choose from; remember that the compiler will not parallelize your code for you.

    Threading models for multicore processors (shared memory):
      POSIX threads
      Intel Threading Building Blocks
      OpenMP

    Threading models for GPGPUs (accelerators):
      CUDA
      OpenCL

    Parallel programming for distributed memory:
      MPI

    Overall goal: Exploit the parallelism built into the hardware!

  • POSIX threads


  • Why threads for parallel programs?

    Thread == lightweight process: an independent instruction stream
    In simulation we usually run one thread per (virtual or physical) core, but more is possible

    New processes are expensive to create (via fork()); threads share all the data of a process, so they are cheap

    Inter-process communication is slow and cumbersome; shared memory between threads provides an easy way to communicate and synchronize (a minimal sketch follows below)

    A threading model puts threads to use by making them accessible to the programmer, either explicitly or wrapped in some parallel paradigm

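    As a minimal illustration of synchronization through shared memory (an illustrative sketch, not from the slides; the counter and function names are made up), a mutex can protect concurrent updates to a shared variable:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                          /* shared by all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg) {
        for (int i = 0; i < 1000; i++) {
            pthread_mutex_lock(&lock);                /* serialize access to the shared counter */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t tids[4];
        for (int i = 0; i < 4; i++) pthread_create(&tids[i], NULL, work, NULL);
        for (int i = 0; i < 4; i++) pthread_join(tids[i], NULL);
        printf("counter = %ld\n", counter);           /* always 4000 thanks to the mutex */
        return 0;
    }

    Compile with -pthread; without the mutex the increments would race and the final value would be unpredictable.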

  • POSIX threads example: Matrix-vector multiply with 100 threads

    #include <pthread.h>

    static double a[100][100], b[100], c[100];
    static void *mult(void *cp);                      /* forward declaration */

    int main(int argc, char *argv[]) {
        pthread_t tids[100];
        ...
        for (int i = 0; i < 100; i++)
            pthread_create(tids + i, NULL, mult, (void *)(c + i));
        for (int i = 0; i < 100; i++)
            pthread_join(tids[i], NULL);
        ...
    }

    static void *mult(void *cp) {
        int i = (double *)cp - c;                     /* recover the row index from the pointer */
        double sum = 0;
        for (int j = 0; j < 100; j++)
            sum += a[i][j] * b[j];
        c[i] = sum;
        return NULL;
    }

    Adapted from material by J. Kleinder

    There are no shared resources that need protection here (well, not quite: a, b and c are shared, but a and b are only read, and each thread writes a different element of c)


  • POSIX threads pros and cons

    Pros:
      The most basic threading interface
      Straightforward, manageable API
      Dynamic creation and destruction of threads
      Reasonable synchronization primitives
      Full execution control

    Cons:
      The most basic threading interface
      Higher-level functionality (reductions, synchronization, work distribution, task queueing) must be done by hand
      Only available with a C API
      Only available on (near-)POSIX-compliant operating systems
      The compiler has no clue about threads


  • Intel Threading Building Blocks (TBB)


  • Intel Threading Building Blocks (TBB)

    Introduced by Intel in 2006

    C++ threading library
      Uses POSIX threads under the hood
      The programmer works with tasks rather than threads
      Task-stealing model
      Parallel C++ containers

    Commercial and open source variants exist


  • A simple parallel loop in TBB: Apply Foo() to every element of an array

    #include "tbb/tbb.h" using namespace tbb; class ApplyFoo { float *const my_a; public: void operator()( const blocked_range& r ) const { float *a = my_a; for( size_t i=r.begin(); i!=r.end(); ++i ) Foo(a[i]); } ApplyFoo( float a[] ) : my_a(a) {} }; void ParallelApplyFoo( float a[], size_t n ) { parallel_for(blocked_range(0,n), ApplyFoo(a)); }

    Adapted from the Intel TBB tutorial

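    With a C++11 compiler, the same loop can be written more compactly by passing a lambda to parallel_for instead of a hand-written function object (a sketch, not part of the original slides; Foo is assumed to be defined elsewhere):

    #include "tbb/tbb.h"

    void ParallelApplyFoo( float a[], size_t n ) {
        // the lambda captures the pointer a by value; TBB splits the range into tasks
        tbb::parallel_for( tbb::blocked_range<size_t>(0, n),
            [=]( const tbb::blocked_range<size_t>& r ) {
                for( size_t i = r.begin(); i != r.end(); ++i )
                    Foo(a[i]);
            } );
    }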

  • TBB pros and cons

    Pros:
      High-level programming model
      The task concept is often more natural for real-world problems than the thread concept
      Built-in parallel (thread-safe) containers
      Built-in work distribution (configurable, but not too finely)
      Available for Linux, Windows, MacOS

    Cons:
      C++ only
      Mapping of threads to resources (cores) is not part of the model
      The number-of-threads concept is only vaguely implemented
      Dynamic work sharing and task stealing introduce variability and are difficult to optimize under ccNUMA constraints
      The compiler has no clue about threads


  • OpenMP


  • Parallel Programming with OpenMP

    Easy and portable parallel programming of shared memory computers: OpenMP

    Standardized set of compiler directives and library functions: http://www.openmp.org/
      Fortran, C and C++ interfaces
      Supported by most/all commercial compilers, and by GCC starting with version 4.2
      Few free tools are available

    An OpenMP program can be written so that it also compiles and executes correctly on a single-processor machine, simply by ignoring the directives


  • Shared Memory Model used by OpenMP

    Central concept of OpenMP programming: threads

    Threads access a globally shared memory
    Data can be shared or private (a short example follows below):
      shared data is available to all threads (in principle)
      private data is visible only to the thread that owns it

    [Figure: several threads (T) working on a common shared-memory region, each with its own private data]

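    To make the shared/private distinction concrete, a minimal sketch (illustrative only, not from the slides; the variable names are made up): every thread sees the same shared array a, while a variable listed in a private clause exists once per thread:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        double a[256];                           /* shared: one copy, visible to all threads */
        double tmp;                              /* private below: one copy per thread */

        #pragma omp parallel private(tmp) shared(a)
        {
            int id = omp_get_thread_num();
            tmp = 2.0 * id;                      /* each thread works on its own tmp */
            a[id] = tmp;                         /* all threads write into the same shared array */
        }

        printf("a[0] = %f\n", a[0]);             /* element written by the master thread */
        return 0;
    }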

  • OpenMP Program Execution Fork and Join

    Program start: only the master thread runs

    Parallel region: a team of worker threads is generated (fork);
    the threads synchronize when leaving the parallel region (join)

    Only the master thread executes the sequential parts; worker threads usually sleep (a minimal example follows below)

    Task and data distribution via directives

    Usually optimal: one thread per core

    [Figure: fork-join execution pattern with threads 0 to 5]

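    A minimal fork-join illustration (a sketch, not from the slides): the block after the parallel directive is executed by the whole team, everything outside it by the master thread only:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        printf("sequential part: master thread only\n");

        #pragma omp parallel                 /* fork: a team of threads is created */
        {
            printf("parallel region: hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                    /* join: implicit barrier, workers go back to sleep */

        printf("sequential part again: master thread only\n");
        return 0;
    }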

  • Example: Numerical integration (sequential version)

    // function to integrate: the integral of 4/(1+x*x) from 0 to 1 is pi
    double f(double x) { return 4.0/(1.0+x*x); }

    w = 1.0/n; sum = 0.0;
    for (i = 1; i <= n; i++) {
        x = w*(i-0.5);        /* midpoint rule */
        sum += f(x);
    }
    pi = w*sum;

  • Example: Numerical integration in OpenMP

    ...
    pi = 0.0; w = 1.0/n;                    // sequential execution
    #pragma omp parallel private(x,sum)     // concurrent execution by a team of threads
    {
        sum = 0.0;
        #pragma omp for                     // worksharing among the threads
        for (i = 1; i <= n; i++) {
            x = w*(i-0.5);
            sum += f(x);
        }
        #pragma omp critical                // partial sums are added one thread at a time
        pi = pi + w*sum;
    }
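    The private partial sums and the critical update can also be expressed with OpenMP's built-in reduction clause. A sketch of the same computation (not from the slides), written as a complete program:

    #include <stdio.h>

    int main(void) {
        const int    n = 1000000;
        const double w = 1.0/n;
        double x, sum = 0.0;

        #pragma omp parallel for private(x) reduction(+:sum)
        for (int i = 1; i <= n; i++) {
            x = w*(i-0.5);
            sum += 4.0/(1.0+x*x);       /* each thread accumulates its own copy of sum */
        }                               /* the partial sums are combined automatically */

        printf("pi is approximately %.15f\n", w*sum);
        return 0;
    }

    Compiled without OpenMP support, the pragma is simply ignored and the program still produces the correct result, as noted above.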

  • OpenMP pros and cons

    Pros:
      High-level programming model
      Available for Fortran, C, C++
      Ideal for data parallelism, some support for task parallelism
      Built-in work distribution
      Directive concept is part of the language
      Good support for incremental parallelization

    Cons:
      Mapping of threads to resources (cores) not part of the model
      OpenMP parallelization may interfere with compiler optimization
      Parallel data structures are not part of the model
      Only limited synchronization facilities
      Model revolves around the parallel region concept


  • CUDA


  • NVIDIA CUDA

    Compute Unified Device Architecture: a hardware architecture plus software environment
    Convenient programming model for using NVIDIA GPUs as general-purpose compute devices
    Implements a Single Instruction Multiple Threads (SIMT) approach

    Programming model:
      Accelerator style: the main program runs on the host CPU, kernels are offloaded to the GPU
      Unified binary for host + device
      Supports multiple GPUs
      Data transfer to/from the device is explicit
      Kernel execution may be asynchronous to CPU code (see the stream sketch below)
      The latest devices (Fermi) allow multiple concurrent kernels


    [Figure: host CPU attached to GPU #1 and GPU #2 via PCIe links]
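    To illustrate the asynchronous kernel execution mentioned above, a minimal sketch using a CUDA stream (illustrative only; the kernel, buffer names and sizes are assumptions carried over from the following example, not code from the slides):

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // enqueue the copy and the kernel on the stream; both calls return immediately
    // (for real overlap, h_A should be pinned host memory, e.g. from cudaMallocHost)
    cudaMemcpyAsync(d_A, h_A, DATA_SZ, cudaMemcpyHostToDevice, stream);
    do_work_on_gpu<<<numBlocks, threadsPerBlock, 0, stream>>>(d_C, d_A, DATA_N);

    do_other_work_on_cpu();              // CPU work overlaps with the GPU activity

    cudaStreamSynchronize(stream);       // wait until the GPU work in this stream has finished
    cudaStreamDestroy(stream);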

  • A simple CUDA example: Host code


    // allocate memory on host
    h_A     = (float *)malloc(DATA_SZ);
    h_C     = (float *)malloc(DATA_SZ);
    h_C_GPU = (float *)malloc(RESULT_SZ);

    // allocate memory on CUDA device
    cudaMalloc((void **)&d_A, DATA_SZ);
    cudaMalloc((void **)&d_C, RESULT_SZ);

    // copy data to GPU memory for further processing
    cudaMemcpy(d_A, h_A, DATA_SZ, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, DATA_SZ, cudaMemcpyHostToDevice);
    cudaThreadSynchronize();

    // kernel call with <<<grid, block>>> launch configuration
    // (numBlocks and threadsPerBlock are placeholder sizes)
    do_work_on_gpu<<<numBlocks, threadsPerBlock>>>(d_C, d_A, DATA_N);
    cudaThreadSynchronize();

    // copy result back to host
    cudaMemcpy(h_C_GPU, d_C, RESULT_SZ, cudaMemcpyDeviceToHost);

  • A simple CUDA example: CUDA kernel


    __global__ void do_work_on_gpu( float *d_C, float *d_A, int elementN )
    {
        // grid-stride loop: each thread processes every (blockDim.x*gridDim.x)-th element
        for ( int pos = (blockIdx.x * blockDim.x) + threadIdx.x;
              pos < elementN;
              pos += blockDim.x * gridDim.x ) {
            d_C[pos] = 5.0f * d_A[pos];
        }
        __syncthreads();
    }

  • CUDA pros and cons

    Pros:
      Relatively straightforward programming model
      Low-level programming, explicit data management
      Compatible with many NVIDIA GPUs; code usually runs without changes
      Available for C, but wrappers for many languages exist, including scripting languages
      Directive-based compiler extensions available (e.g., PGI)
      Potential for overlapping GPU computation with CPU tasks

    Cons:
      Restricted to NVIDIA GPUs
        No support for multicore processors
        No support for AMD GPUs
      Low-level programming, explicit data management
      Powerful tools are just beginning to emerge
      Largely manual work distribution
      Not an open standard


  • OpenCL


  • OpenCL

    Open Computing Language: an open standard

    Convenient programming model for using any kind of accelerator: GPGPUs, multicore CPUs, ...

    Programming model similar to CUDA, but more flexible
    Pure kernel code is often portable from CUDA without major changes


  • A simple OpenCL example: Host code

    // Get platform (platform is NVIDIA Corp, Intel Corp or AMD Corp)
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    // Get devices
    std::vector<cl::Device> devices;
    platforms.front().getDevices( DEVTOQUERY, &devices );

    // Build context and command queue
    cl::Context      context( devices );
    cl::CommandQueue cmdQ( context, devices[0] );

    // Read kernel source and compile just in time
    cl::Program::Sources sourceCode;
    source_str  = (char*)malloc(MAX_SOURCE_SIZE);
    source_size = fread( source_str, 1, MAX_SOURCE_SIZE, fp );
    sourceCode.push_back(std::make_pair(source_str, source_size));
    cl::Program program = cl::Program( context, sourceCode );
    program.build( devices );
    cl::Kernel kernel(program, "VectorCopy");

    // Allocate buffer
    cl::Buffer D_A(context, CL_MEM_READ_WRITE, sizeof(REAL)*Vectorlength);

    // Copy data
    cmdQ.enqueueWriteBuffer( D_A, true, 0, sizeof(REAL)*Vectorlength, &H_A[0] );

    // Bind parameters to kernel
    cl::KernelFunctor kernel_func = kernel.bind( cmdQ, cl::NDRange(Globalsize), cl::NDRange(Workgroupsize) );

    // Call kernel
    event = kernel_func( D_A, D_B, D_C, scalar, Vectorlength, i );
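    The source of the "VectorCopy" kernel loaded above is not shown on the slides. As an illustration only (an assumption, not the original code, and with a simpler argument list than the kernel_func call above suggests), a minimal OpenCL C kernel of that name could look like this:

    __kernel void VectorCopy( __global const float *src,
                              __global float *dst,
                              const int n )
    {
        int i = get_global_id(0);     // one work-item per vector element
        if (i < n)
            dst[i] = src[i];
    }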

  • OpenCL pros and cons

    Pros:
      Relatively straightforward programming model
      Low-level programming, explicit data management
      Available for NVIDIA and AMD GPUs, and for multicore CPUs
      Potential for overlapping GPU computation with CPU tasks
      CUDA kernel code is largely re-usable
      Some support for modern SIMD instruction sets

    Cons:
      Available only for C(99)/C++
      Just-in-time kernel compilation
      Low-level programming, explicit data management
      Powerful tools are just beginning to emerge
      Largely manual work distribution, though more flexible than CUDA
      Best performance requires specialized code for each target architecture


  • MPI

    The Message Passing Interface


  • The message passing paradigm: A programming model

    Distributed memory architecture:
      Each process(or) can only access its dedicated address space;
      there is no global shared address space

    Data exchange and communication between processes is done by explicitly passing messages through a communication network

    Message passing library:
      Should be flexible, efficient and portable
      Hides the communication hardware and software layers from the application programmer


  • The message passing paradigm

    Widely accepted standard in HPC / numerical simulation: the Message Passing Interface (MPI)
      See http://www.mpi-forum.org for the standard documents
      Many free and commercial implementations: Intel MPI, OpenMPI, MVAPICH, ...

    Process-based approach: all variables are local!
    Same program on each processor/machine (SPMD)
      This is no restriction of the general message passing model, because processes can be distinguished by their rank (see later)

    The program is written in a sequential language (Fortran/C/C++)

    Data exchange between processes: send/receive messages via MPI library calls
      This is usually the most tedious, but also the most flexible, way of parallelization


  • MPI in a nutshell: Parallel execution

    Processes run throughout program execution: all variables are local

    Startup phase:
      launch the tasks
      establish a communication context (communicator) among all tasks

    Point-to-point data transfer:
      between pairs of tasks
      may be blocking or non-blocking
      explicit synchronization is needed for non-blocking transfers
      (see the send/receive sketch further below)

    Collective communication:
      between all tasks or a subgroup of tasks
      presently blocking-only
      reductions, scatter/gather operations (a reduction sketch follows this slide)
      efficiency is left to the library implementation

    Clean shutdown

    [Figure: five processes (ranks 0 to 4) executing the same program side by side]
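    As an illustration of a collective reduction (a sketch in C, not from the slides), every rank contributes a local value and rank 0 receives the sum:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local = rank + 1.0;            /* each process holds its own value */
        double total = 0.0;

        /* collective call: all ranks participate, the sum ends up on rank 0 */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %f\n", size, total);

        MPI_Finalize();
        return 0;
    }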

  • MPI in a nutshell: Hello World!

    program hello
      use mpi
      implicit none
      integer rank, size, ierror

      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

      write(*,*) 'Hello World! I am ',rank,' of ',size

      call MPI_FINALIZE(ierror)
    end program

    Output with 4 processes (the ordering is not deterministic):
      Hello World! I am 3 of 4
      Hello World! I am 1 of 4
      Hello World! I am 0 of 4
      Hello World! I am 2 of 4

  • MPI in a nutshell: Transmitting a message

    MPI requires the following information (a send/receive sketch follows below):
      Which process is sending the message
      Where the data is located on the sending process
      What kind of data is being sent
      How much data there is
      Which process(es) are receiving the message
      Where the data should be placed on the receiving process
      How much data the receiving process is prepared to accept

    Sender and receiver must pass their information to MPI separately
    This holds for point-to-point communication

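    A minimal point-to-point sketch in C (not from the slides) showing how these pieces of information map onto the MPI_Send and MPI_Recv argument lists:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank;
        double buf[10];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 10; i++) buf[i] = i;
            /* buffer, count, datatype, destination rank, tag, communicator */
            MPI_Send(buf, 10, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* buffer, max count, datatype, source rank, tag, communicator, status */
            MPI_Recv(buf, 10, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %f ... %f\n", buf[0], buf[9]);
        }

        MPI_Finalize();
        return 0;
    }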

  • MPI pros and cons

    Pros:
      Suitable for distributed-memory and shared-memory machines
      Supports massive parallelism
      Well supported; many free and commercial implementations
      Tremendous code base, huge experience in the field
      The standard supports Fortran and C; wrappers for other languages exist, including scripting languages
      Hybrid MPI+X models are supported, X ∈ {OpenMP, CUDA, OpenCL, TBB, ...} (a hybrid sketch follows below)

    Cons:
      The execution environment is crucial to set up
      Huge standard (500+ functions) with many obscure bits and pieces
      Incremental parallelization is next to impossible; most sequential code needs serious restructuring
      Performance properties are sometimes hard to understand and also implementation-dependent

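    As an illustration of the hybrid MPI+X point above, a minimal MPI+OpenMP sketch (not from the slides): MPI distributes processes across nodes, and each process spawns an OpenMP thread team:

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char *argv[]) {
        int rank, provided;

        /* request a thread support level suitable for OpenMP inside MPI processes */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            printf("MPI rank %d, OpenMP thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }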