
  • Parallel code for multicore systems

    An overview of programming models

    Multicore Briefing, 13.10.2011

  • Overview

    There are many models to choose from; remember that the compiler will not parallelize your code for you.

    Threading models for multicore processors (shared memory):
      POSIX threads
      Intel Threading Building Blocks
      OpenMP

    Threading models for GPGPUs (accelerators):
      CUDA
      OpenCL

    Parallel programming for distributed memory:
      MPI

    Overall goal: Exploit the parallelism built into the hardware!

  • POSIX threads


  • Why threads for parallel programs?

    Thread == lightweight process: an independent instruction stream
    In simulation we usually run one thread per (virtual or physical) core, but more is possible

    New processes are expensive to create (via fork()); threads share all the data of a process, so they are cheap

    Inter-process communication is slow and cumbersome; shared memory between threads provides an easy way to communicate and synchronize (a minimal sketch follows below)

    A threading model puts threads to use by making them accessible to the programmer, either explicitly or wrapped in some parallel paradigm

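    As a minimal illustration of synchronization through shared memory (an illustrative sketch, not from the slides; the counter and function names are made up), a mutex can protect concurrent updates to a shared variable:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                          /* shared by all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg) {
        for (int i = 0; i < 1000; i++) {
            pthread_mutex_lock(&lock);                /* serialize access to the shared counter */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t tids[4];
        for (int i = 0; i < 4; i++) pthread_create(&tids[i], NULL, work, NULL);
        for (int i = 0; i < 4; i++) pthread_join(tids[i], NULL);
        printf("counter = %ld\n", counter);           /* always 4000 thanks to the mutex */
        return 0;
    }

    Compile with -pthread; without the mutex the increments would race and the final value would be unpredictable.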

  • POSIX threads example: Matrix-vector multiply with 100 threads

    #include <pthread.h>

    static double a[100][100], b[100], c[100];
    static void *mult(void *cp);                      /* forward declaration */

    int main(int argc, char *argv[]) {
        pthread_t tids[100];
        ...
        for (int i = 0; i < 100; i++)
            pthread_create(tids + i, NULL, mult, (void *)(c + i));
        for (int i = 0; i < 100; i++)
            pthread_join(tids[i], NULL);
        ...
    }

    static void *mult(void *cp) {
        int i = (double *)cp - c;                     /* recover the row index from the pointer */
        double sum = 0;
        for (int j = 0; j < 100; j++)
            sum += a[i][j] * b[j];
        c[i] = sum;
        return NULL;
    }

    Adapted from material by J. Kleinder

    There are no shared resources that need protection here (well, not quite: a, b and c are shared, but a and b are only read, and each thread writes a different element of c)


  • POSIX threads pros and cons

    Pros:
      The most basic threading interface
      Straightforward, manageable API
      Dynamic creation and destruction of threads
      Reasonable synchronization primitives
      Full execution control

    Cons:
      The most basic threading interface
      Higher-level functionality (reductions, synchronization, work distribution, task queueing) must be done by hand
      Only available with a C API
      Only available on (near-)POSIX-compliant operating systems
      The compiler has no clue about threads


  • Intel Threading Building Blocks (TBB)


  • Intel Threading Building Blocks (TBB)

    Introduced by Intel in 2006

    C++ threading library
      Uses POSIX threads under the hood
      The programmer works with tasks rather than threads
      Task-stealing model
      Parallel C++ containers

    Commercial and open source variants exist


  • A simple parallel loop in TBB: Apply Foo() to every element of an array

    #include "tbb/tbb.h" using namespace tbb; class ApplyFoo { float *const my_a; public: void operator()( const blocked_range& r ) const { float *a = my_a; for( size_t i=r.begin(); i!=r.end(); ++i ) Foo(a[i]); } ApplyFoo( float a[] ) : my_a(a) {} }; void ParallelApplyFoo( float a[], size_t n ) { parallel_for(blocked_range(0,n), ApplyFoo(a)); }

    Adapted from the Intel TBB tutorial

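    With a C++11 compiler, the same loop can be written more compactly by passing a lambda to parallel_for instead of a hand-written function object (a sketch, not part of the original slides; Foo is assumed to be defined elsewhere):

    #include "tbb/tbb.h"

    void ParallelApplyFoo( float a[], size_t n ) {
        // the lambda captures the pointer a by value; TBB splits the range into tasks
        tbb::parallel_for( tbb::blocked_range<size_t>(0, n),
            [=]( const tbb::blocked_range<size_t>& r ) {
                for( size_t i = r.begin(); i != r.end(); ++i )
                    Foo(a[i]);
            } );
    }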

  • TBB pros and cons

    Pros:
      High-level programming model
      The task concept is often more natural for real-world problems than the thread concept
      Built-in parallel (thread-safe) containers
      Built-in work distribution (configurable, but not too finely)
      Available for Linux, Windows, MacOS

    Cons:
      C++ only
      Mapping of threads to resources (cores) is not part of the model
      The number-of-threads concept is only vaguely implemented
      Dynamic work sharing and task stealing introduce variability and are difficult to optimize under ccNUMA constraints
      The compiler has no clue about threads


  • OpenMP


  • Parallel Programming with OpenMP

    Easy and portable parallel programming of shared memory computers: OpenMP

    Standardized set of compiler directives and library functions: http://www.openmp.org/
      Fortran, C and C++ interfaces
      Supported by most/all commercial compilers, and by GCC starting with version 4.2
      Few free tools are available

    An OpenMP program can be written so that it also compiles and executes correctly on a single-processor machine, simply by ignoring the directives


  • Shared Memory Model used by OpenMP

    Central concept of OpenMP programming: threads

    Threads access a globally shared memory
    Data can be shared or private (a short example follows below):
      shared data is available to all threads (in principle)
      private data is visible only to the thread that owns it

    [Figure: several threads (T) working on a common shared-memory region, each with its own private data]

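    To make the shared/private distinction concrete, a minimal sketch (illustrative only, not from the slides; the variable names are made up): every thread sees the same shared array a, while a variable listed in a private clause exists once per thread:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        double a[256];                           /* shared: one copy, visible to all threads */
        double tmp;                              /* private below: one copy per thread */

        #pragma omp parallel private(tmp) shared(a)
        {
            int id = omp_get_thread_num();
            tmp = 2.0 * id;                      /* each thread works on its own tmp */
            a[id] = tmp;                         /* all threads write into the same shared array */
        }

        printf("a[0] = %f\n", a[0]);             /* element written by the master thread */
        return 0;
    }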

  • OpenMP Program Execution Fork and Join

    Program start: only the master thread runs

    Parallel region: a team of worker threads is generated (fork);
    the threads synchronize when leaving the parallel region (join)

    Only the master thread executes the sequential parts; worker threads usually sleep (a minimal example follows below)

    Task and data distribution via directives

    Usually optimal: one thread per core

    [Figure: fork-join execution pattern with threads 0 to 5]

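    A minimal fork-join illustration (a sketch, not from the slides): the block after the parallel directive is executed by the whole team, everything outside it by the master thread only:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        printf("sequential part: master thread only\n");

        #pragma omp parallel                 /* fork: a team of threads is created */
        {
            printf("parallel region: hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                    /* join: implicit barrier, workers go back to sleep */

        printf("sequential part again: master thread only\n");
        return 0;
    }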

  • Example: Numerical integration (sequential version)

    // function to integrate: the integral of 4/(1+x*x) from 0 to 1 is pi
    double f(double x) { return 4.0/(1.0+x*x); }

    w = 1.0/n; sum = 0.0;
    for (i = 1; i <= n; i++) {
        x = w*(i-0.5);        /* midpoint rule */
        sum += f(x);
    }
    pi = w*sum;

  • Example: Numerical integration in OpenMP

    ...
    pi = 0.0; w = 1.0/n;                    // sequential execution
    #pragma omp parallel private(x,sum)     // concurrent execution by a team of threads
    {
        sum = 0.0;
        #pragma omp for                     // worksharing among the threads
        for (i = 1; i <= n; i++) {
            x = w*(i-0.5);
            sum += f(x);
        }
        #pragma omp critical                // partial sums are added one thread at a time
        pi = pi + w*sum;
    }
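    The private partial sums and the critical update can also be expressed with OpenMP's built-in reduction clause. A sketch of the same computation (not from the slides), written as a complete program:

    #include <stdio.h>

    int main(void) {
        const int    n = 1000000;
        const double w = 1.0/n;
        double x, sum = 0.0;

        #pragma omp parallel for private(x) reduction(+:sum)
        for (int i = 1; i <= n; i++) {
            x = w*(i-0.5);
            sum += 4.0/(1.0+x*x);       /* each thread accumulates its own copy of sum */
        }                               /* the partial sums are combined automatically */

        printf("pi is approximately %.15f\n", w*sum);
        return 0;
    }

    Compiled without OpenMP support, the pragma is simply ignored and the program still produces the correct result, as noted above.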

  • OpenMP pros and cons

    Pros:
      High-level programming model
      Available for Fortran, C, C++
      Ideal for data parallelism, some support for task parallelism
      Built-in work distribution
      Directive concept is part of the language
      Good support for incremental parallelization

    Cons:
      Mapping of threads to resources (cores) not part of the model
      OpenMP parallelization may interfere with compiler optimization
      Parallel data structures are not part of the model
      Only limited synchronization facilities
      Model revolves around the parallel region concept


  • CUDA


  • NVIDIA CUDA

    Compute Unified Device Architecture: a hardware architecture plus software environment
    Convenient programming model for using NVIDIA GPUs as general-purpose compute devices
    Implements a Single Instruction Multiple Threads (SIMT) approach

    Programming model:
      Accelerator style: the main program runs on the host CPU, kernels are offloaded to the GPU
      Unified binary for host + device
      Supports multiple GPUs
      Data transfer to/from the device is explicit
      Kernel execution may be asynchronous to CPU code (see the stream sketch below)
      The latest devices (Fermi) allow multiple concurrent kernels


    [Figure: host CPU attached to GPU #1 and GPU #2 via PCIe links]
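    To illustrate the asynchronous kernel execution mentioned above, a minimal sketch using a CUDA stream (illustrative only; the kernel, buffer names and sizes are assumptions carried over from the following example, not code from the slides):

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // enqueue the copy and the kernel on the stream; both calls return immediately
    // (for real overlap, h_A should be pinned host memory, e.g. from cudaMallocHost)
    cudaMemcpyAsync(d_A, h_A, DATA_SZ, cudaMemcpyHostToDevice, stream);
    do_work_on_gpu<<<numBlocks, threadsPerBlock, 0, stream>>>(d_C, d_A, DATA_N);

    do_other_work_on_cpu();              // CPU work overlaps with the GPU activity

    cudaStreamSynchronize(stream);       // wait until the GPU work in this stream has finished
    cudaStreamDestroy(stream);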

  • A simple CUDA example: Host code


    // allocate memory on host
    h_A     = (float *)malloc(DATA_SZ);
    h_C     = (float *)malloc(DATA_SZ);
    h_C_GPU = (float *)malloc(RESULT_SZ);

    // allocate memory on CUDA device
    cudaMalloc((void **)&d_A, DATA_SZ);
    cudaMalloc((void **)&d_C, RESULT_SZ);

    // copy data to GPU memory for further processing
    cudaMemcpy(d_A, h_A, DATA_SZ, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, DATA_SZ, cudaMemcpyHostToDevice);
    cudaThreadSynchronize();

    // kernel call with <<<grid, block>>> launch configuration
    // (numBlocks and threadsPerBlock are placeholder sizes)
    do_work_on_gpu<<<numBlocks, threadsPerBlock>>>(d_C, d_A, DATA_N);
    cudaThreadSynchronize();

    // copy result back to host
    cudaMemcpy(h_C_GPU, d_C, RESULT_SZ, cudaMemcpyDeviceToHost);

  • A simple CUDA example: CUDA kernel


    __global__ void do_work_on_gpu( float *d_C, float *d_A, int elementN )
    {
        // grid-stride loop: each thread processes every (blockDim.x*gridDim.x)-th element
        for ( int pos = (blockIdx.x * blockDim.x) + threadIdx.x;
              pos < elementN;
              pos += blockDim.x * gridDim.x ) {
            d_C[pos] = 5.0f * d_A[pos];
        }
        __syncthreads();
    }

  • CUDA pros and cons

    Pros:
      Relatively straightforward programming model
      Low-level programming, explicit data management
      Compatible with many NVIDIA GPUs; code usually runs without changes
      Available for C, but wrappers for many languages exist, including scripting languages
      Directive-based compiler extensions available (e.g., PGI)
      Potential for overlapping GPU computation with CPU tasks

    Cons:
      Restricted to NVIDIA GPUs
        No support for multicore processors
        No support for AMD GPUs
      Low-level programming, explicit data management
      Powerful tools are just beginning to emerge
      Largely manual work distribution
      Not an open standard


  • OpenCL


  • OpenCL

    Open Computing Language: an open standard

    Convenient programming model for using any kind of accelerator: GPGPUs, multicore CPUs, ...

    Programming model similar to CUDA, but more flexible
    Pure kernel code is often portable from CUDA without major changes


  • A simple OpenCL example: Host code

    // Get platform (platform is NVIDIA Corp, Intel Corp or AMD Corp)
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    // Get devices
    std::vector<cl::Device> devices;
    platforms.front().getDevices( DEVTOQUERY, &devices );

    // Build context and command queue
    cl::Context      context( devices );
    cl::CommandQueue cmdQ( context, devices[0] );

    // Read kernel source and compile just in time
    cl::Program::Sources sourceCode;
    source_str  = (char*)malloc(MAX_SOURCE_SIZE);
    source_size = fread( source_str, 1, MAX_SOURCE_SIZE, fp );
    sourceCode.push_back(std::make_pair(source_str, source_size));
    cl::Program program = cl::Program( context, sourceCode );
    program.build( devices );
    cl::Kernel kernel(program, "VectorCopy");

    // Allocate buffer
    cl::Buffer D_A(context, CL_MEM_READ_WRITE, sizeof(REAL)*Vectorlength);

    // Copy data
    cmdQ.enqueueWriteBuffer( D_A, true, 0, sizeof(REAL)*Vectorlength, &H_A[0] );

    // Bind parameters to kernel
    cl::KernelFunctor kernel_func = kernel.bind( cmdQ, cl::NDRange(Globalsize), cl::NDRange(Workgroupsize) );

    // Call kernel
    event = kernel_func( D_A, D_B, D_C, scalar, Vectorlength, i );
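    The source of the "VectorCopy" kernel loaded above is not shown on the slides. As an illustration only (an assumption, not the original code, and with a simpler argument list than the kernel_func call above suggests), a minimal OpenCL C kernel of that name could look like this:

    __kernel void VectorCopy( __global const float *src,
                              __global float *dst,
                              const int n )
    {
        int i = get_global_id(0);     // one work-item per vector element
        if (i < n)
            dst[i] = src[i];
    }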

  • OpenCL pros and cons

    Pros:
      Relatively straightforward programming model
      Low-level programming, explicit data management
      Available for NVIDIA and AMD GPUs, and for multicore CPUs
      Potential for overlapping GPU computation with CPU tasks
      CUDA kernel code is largely re-usable
      Some support for modern SIMD instruction sets

    Cons:
      Available only for C(99)/C++
      Just-in-time kernel compilation
      Low-level programming, explicit data management
      Powerful tools are just beginning to emerge
      Largely manual work distribution, though more flexible than CUDA
      Best performance requires specialized code for each target architecture


  • MPI

    The Message Passing Interface


  • The message passing paradigm: A programming model

    Distributed memory architecture:
      Each process(or) can only access its dedicated address space;
      there is no global shared address space

    Data exchange and communication between processes is done by explicitly passing messages through a communication network

    Message passing library:
      Should be flexible, efficient and portable
      Hides the communication hardware and software layers from the application programmer


  • The message passing paradigm

    Widely accepted standard in HPC / numerical simulation: the Message Passing Interface (MPI)
      See http://www.mpi-forum.org for the standard documents
      Many free and commercial implementations: Intel MPI, OpenMPI, MVAPICH, ...

    Process-based approach: all variables are local!
    Same program on each processor/machine (SPMD)
      This is no restriction of the general message passing model, because processes can be distinguished by their rank (see later)

    The program is written in a sequential language (Fortran/C/C++)

    Data exchange between processes: send/receive messages via MPI library calls
      This is usually the most tedious, but also the most flexible, way of parallelization


  • MPI in a nutshell: Parallel execution

    Processes run throughout program execution: all variables are local

    Startup phase:
      launch the tasks
      establish a communication context (communicator) among all tasks

    Point-to-point data transfer:
      between pairs of tasks
      may be blocking or non-blocking
      explicit synchronization is needed for non-blocking transfers
      (see the send/receive sketch further below)

    Collective communication:
      between all tasks or a subgroup of tasks
      presently blocking-only
      reductions, scatter/gather operations (a reduction sketch follows this slide)
      efficiency is left to the library implementation

    Clean shutdown

    [Figure: five processes (ranks 0 to 4) executing the same program side by side]
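    As an illustration of a collective reduction (a sketch in C, not from the slides), every rank contributes a local value and rank 0 receives the sum:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local = rank + 1.0;            /* each process holds its own value */
        double total = 0.0;

        /* collective call: all ranks participate, the sum ends up on rank 0 */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %f\n", size, total);

        MPI_Finalize();
        return 0;
    }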

  • MPI in a nutshell: Hello World!

    program hello
      use mpi
      implicit none
      integer rank, size, ierror

      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

      write(*,*) 'Hello World! I am ',rank,' of ',size

      call MPI_FINALIZE(ierror)
    end program

    Output with 4 processes (the ordering is not deterministic):
      Hello World! I am 3 of 4
      Hello World! I am 1 of 4
      Hello World! I am 0 of 4
      Hello World! I am 2 of 4

  • MPI in a nutshell: Transmitting a message

    MPI requires the following information (a send/receive sketch follows below):
      Which process is sending the message
      Where the data is located on the sending process
      What kind of data is being sent
      How much data there is
      Which process(es) are receiving the message
      Where the data should be placed on the receiving process
      How much data the receiving process is prepared to accept

    Sender and receiver must pass their information to MPI separately
    This holds for point-to-point communication

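    A minimal point-to-point sketch in C (not from the slides) showing how these pieces of information map onto the MPI_Send and MPI_Recv argument lists:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank;
        double buf[10];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 10; i++) buf[i] = i;
            /* buffer, count, datatype, destination rank, tag, communicator */
            MPI_Send(buf, 10, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* buffer, max count, datatype, source rank, tag, communicator, status */
            MPI_Recv(buf, 10, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %f ... %f\n", buf[0], buf[9]);
        }

        MPI_Finalize();
        return 0;
    }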

  • MPI pros and cons

    Pros:
      Suitable for distributed-memory and shared-memory machines
      Supports massive parallelism
      Well supported; many free and commercial implementations
      Tremendous code base, huge experience in the field
      The standard supports Fortran and C; wrappers for other languages exist, including scripting languages
      Hybrid MPI+X models are supported, X ∈ {OpenMP, CUDA, OpenCL, TBB, ...} (a hybrid sketch follows below)

    Cons:
      The execution environment is crucial to set up
      Huge standard (500+ functions) with many obscure bits and pieces
      Incremental parallelization is next to impossible; most sequential code needs serious restructuring
      Performance properties are sometimes hard to understand and also implementation-dependent

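    As an illustration of the hybrid MPI+X point above, a minimal MPI+OpenMP sketch (not from the slides): MPI distributes processes across nodes, and each process spawns an OpenMP thread team:

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char *argv[]) {
        int rank, provided;

        /* request a thread support level suitable for OpenMP inside MPI processes */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            printf("MPI rank %d, OpenMP thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }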