
Simulation of fluid flows with Lattice Boltzmann on GPU's

Joaquim José da Costa e Simas

Master Thesis on:

Mechanical Engineering

Jury

President: Prof. Doutor Luís Rego da Cunha de Eça

Supervisors: Prof. Doutor José Carlos Fernandes Pereira

Mestre Ricardo José Nunes dos Reis

Examiner: Doutor Mário António Prazeres Lino da Silva

December, 2010


Abstract

In this work, two Lattice Boltzmann numerical schemes are implemented on three different parallel machines, each with its own programming model: a multiprocessor computer, a cluster of computers and a GPU. The latter is a graphics processor with a parallel architecture. This architecture allows it to achieve high computational power with very competitive performance in terms of energy efficiency.

Lattice Boltzmann methods are known for their efficient and simple numerical schemes. Despite their simplicity, these particle methods top almost any other CFD implementation when dealing with shape-shifting, irregular and disordered boundary conditions. This fact alone gives them a very competitive edge. But it is the combination of these characteristics with their highly parallel nature, and with the still growing fields of research and applications, that leads one to believe in the present and future importance of this method.

Notable results have been achieved in terms of simulation speed. The most promising of these were obtained using the GPU: speedups on the order of hundreds. A cheap, common graphics card can be used almost as a mini-cluster. This fact will lead to much work and research being devoted to Lattice Boltzmann and GPU's.

Keywords: Lattice Boltzmann, LBGK, MRT, OpenMP, MPI, GPU, CUDA


Resumo

In this work, two numerical schemes of the Lattice Boltzmann method are implemented on three different parallel computing platforms, each with its own model and its own way of thinking about parallelism: a computer with multiple processors, a cluster of computers and a GPU. The latter, a graphics processor, has an inherently parallel architecture. This architecture allows the GPU to reach very high computational power and a very competitive energy efficiency.

Lattice Boltzmann methods are recognized as simple but efficient numerical schemes. They are particle methods that, despite their simplicity, have been shown to be more effective than other CFD implementations when it comes to irregular geometries that change their shape over time and are randomly scattered throughout the fluid. This characteristic alone already grants them a special status. But it is the combination of this with the highly parallel nature of the scheme, and with the still growing body of knowledge and applications, that indicates that this method will be fundamental in the future of CFD.

Impressive results were obtained regarding the speed of the simulations. The implementation that stood out the most was the one on the GPU: speedups on the order of hundreds were reached. Cheap, GPU's are an almost commonplace component, present in any computer, and they hold enormous computational power, so they are almost like a mini-cluster. These facts are leading much research to focus on the union between particle models and GPU's.

Keywords: Lattice Boltzmann, LBGK, MRT, OpenMP, MPI, GPU, CUDA


Contents

1 Introduction
  1.1 The problem and its relevance
  1.2 Objectives
  1.3 Organization of the thesis

2 Lattice Boltzmann Equation (LBE)
  2.1 Introduction
  2.2 Numerical Methods
    2.2.1 LBGK
    2.2.2 Multi-Relaxation Time method
  2.3 Boundary Conditions (BC)

3 Parallel processing
  3.1 Classifications of parallel computing
    3.1.1 A primer on CPU architecture
    3.1.2 Classifying parallel machines
  3.2 Parallel Programming Models
  3.3 OpenMP
  3.4 MPI
  3.5 Massively Parallel Processors: GPU
    3.5.1 Introduction
    3.5.2 CUDA architecture
    3.5.3 CUDA programming model
    3.5.4 An example

4 Implementations of the LB methods
  4.1 Serial implementation
  4.2 Shared memory implementation: OpenMP
  4.3 Distributed memory implementation: MPI
  4.4 Massively Parallel Processors: CUDA

5 Results
  5.1 Test cases
    5.1.1 Poiseuille Flow
  5.2 Shared Memory parallelization: OpenMP
  5.3 Distributed memory: MPI
    5.3.1 Massively parallel processors: CUDA

6 Conclusion

.1 Terminology
.2 Rules of thumb for higher performances


List of Figures

2.1 Particles' direction of travel for the D2Q9 (left) and the D3Q19 (right).
2.2 Algorithm of the LBGK method.
2.3 Algorithm of MRT.
2.4 Periodic boundary conditions.
3.1 Simplified scheme of a CPU architecture.
3.2 Scheme of a UMA layout.
3.3 Scheme of a NUMA layout.
3.4 Distributed memory system.
3.5 Hybrid memory layout.
3.6 Representation of MPI's hierarchical chain.
3.7 Layout of a typical MPI program.
3.8 The GPU's architecture. The small squares are CUDA cores. Two rows of eight Streaming Processors can be seen. The remaining blocks represent memories.
3.9 The Tesla C1060 board.
3.10 Code running in the host with a single thread. Kernels are offloaded to the GPU and run with thousands of threads.
3.11 Availability of the different types of memory to the CUDA working structures.
3.12 Illustration of the organization of one bi-dimensional grid into blocks and one block into threads.
3.13 Memory layout in the device and its accessibility to the processors.
4.1 For a given node, each particle's PDF is stored one after the other, according to the numeration used for each direction. Nodes also follow a lattice numeration.
4.2 Domain divided in two parts, one white, one gray. Each part is assigned to a node of the cluster. In the case of the white sub-domain, there is an extra frame where data from the frontier cells of the neighbor sub-domain are made available: ghost cells.
4.3 Each direction of propagation of a particle has an array. For each array, nodes are assigned consecutively.
5.1 Longitudinal velocity profile obtained for the Poiseuille flow test case.
5.2 The error decreasing with the number of iterations. The criterion for convergence was set to $10^{-6}$. It took 9556 iterations on a 256 × 64 mesh.
5.3 Bidimensional plot of the horizontal velocity of a Poiseuille flow for a converged case. The maximum velocity is in the middle of the channel section, and the velocity profile is constant throughout the channel.
5.4 Bi-dimensional plot of the modulus of the velocity for a Reynolds number of 1000.
5.5 Horizontal and vertical velocity profiles for, respectively, a longitudinal and a transverse section. Obtained with SRT for Re=100.
5.6 Horizontal and vertical velocity profiles for, respectively, a longitudinal and a transverse section. Obtained with MRT for Re=1000.
5.7 A solution is considered converged, for the cavity, when the residue is below $10^{-7}$. A logarithmic scale is used both for the residue and for the number of iterations, for visualization purposes: oscillations in the residue were small enough to look like a straight line in a residue vs. iteration number plot.
5.8 Horizontal velocity profile of the top half nodes in the vicinity of the wall (x=1). The code is set to run 100 iterations for a Reynolds number of 5848.
5.9 Code scaling with the number of cores.
5.10 The increase in performance, up to a certain point, with the increase of the lattice dimensions.
5.11 Comparison between the three implementations. The filled nodes use the scale on the left, while the others use the scale on the right.
1 A 3 × 3 matrix stored in a row-major fashion.
2 The loop on the left accesses data non-contiguously in memory. The loop on the right uses a row-major accessing pattern.
3 The loop on the right is the unrolled version of the loop on the left.

List of Tables

4.1 General properties of the GPU.
5.1 Speedups obtained for the OpenMP implementation for up to four processors.
5.2 Speedups for different mesh sizes and cluster nodes.
5.3 Speedups for different mesh sizes and cluster nodes.
5.4 Comparison of the peak values in the three implementations.

Chapter 1

Introduction

1.1 The problem and its relevance

Computational Fluid Dynamics (CFD) is deeply and firmly rooted in the Navier-Stokes equations, but the sheer computational demand required to solve the equations numerically has always driven the quest for faster algorithms or for ways to decrease simulation time. Examples of both strategies include simplifications of the boundary layer and the use of parallel computing.

Lattice Boltzmann methods hold all the physical validity of the Navier-Stokes equations while bringing a simpler set of equations that are highly parallelizable. Adding these factors to the exceptionally fast simulations performed with low-cost Graphics Processing Units (GPU's) has generated high expectations regarding this method as a future leader in many fields of the CFD world.

Parallelism is an intuitive solution to increase the productivity of a given activity. In everyday life there are numerous examples of how a system could increase its productivity by simply performing in a parallel manner, ranging from biological systems to industrial production lines. The procedure is simple: divide the problem into smaller parts and introduce more active elements to work on these parts. But bear in mind that not all tasks can be parallelized. As Fred Brooks said, “Nine women can't make a baby in one month”.

Imagine a coffee house with a queue of people wanting to buy a coffee. This coffee house has only one employee, one coffee machine (which can make only one coffee at a time) and one cash register. In this case, the employee can only serve one client at a time. The rate of service would increase if the coffee machine could make two coffees at the same time, or if a second employee was introduced to the system, making coffees while the original employee handles the register. Better yet would be to have two of each: two employees, two single-coffee machines and two cash registers.

These kinds of examples, seemingly simple, provide useful analogies for analyzing the advantages and disadvantages of parallelism. And, as one introduces more realism, the trickier things become, and the more they resemble the typical problems of computational parallelism. In the example above, one could introduce constraints, such as the number of expected clients or the profit to be made. These constraints determine what level of parallelism is needed to face them.

Many scientific problems can take advantage of parallelization. Examples include the simulation of fluid flows, such as weather and climate forecasting, the interaction between atoms and molecules, code cracking, medical visualization, finance, and so on.

So it is no surprise that most of the groundwork of parallelism, both theory and designs, traces back to the 60's and 70's. This era was also characterized by Moore's law, which states that every couple of years the number of transistors on a die doubles. Back then it meant that some of the best computers could easily become obsolete in a matter of months. Moore's law was found to extend to computer memories, bandwidths and other digital components and devices.

But recently, adding more transistors to a single processor stopped translating into faster processors. Not that the technology to do so is unavailable, but physical and economic limitations, such as the required heat dissipation and the high energy consumption, made single-processor performance reach a plateau. In order to keep Moore's law alive, the architecture of a typical home computer has changed. Instead of having one CPU, we now have more than one, either on the same die (multi-core) or on different dies (multi-processor). In theory, this shift allowed for the revival of Moore's law but, in order to actually notice the increased performance, software has to be purposely written to take advantage of the multicore paradigm.

Today, servers and workstations can have processors with 8 or even 12 cores. Expensive, but due to their high-demand activity and desired reliability, it is a price worth paying. Nowadays a 4-core computer can be cheaply acquired, as is the case of AMD's quad-core Phenom CPU, and a six-core part by Intel is already rumored for next year. So, ranging from hardware design up to the applications, the use of parallelism has been a reality in recent years and will remain so, without any doubt, into the near future.

The state of the art in parallel computing is Supercomputers, Clusters, the Grid and, more recently, GPU's.

Supercomputers are a special type of computer whose architecture is different from the one commonly found in desktop or laptop computers. They have thousands of processors. Some of the top Supercomputers include the Cray Jaguar, with a peak speed of 2.3 petaflops, and IBM's Roadrunner, peaking at 1.7 petaflops.

A cluster is the connection, with a high speed network, of many separate computers, each with its own memory resources and processors. A cluster can be seen as a single computing machine.

If the cluster nodes, the computers, are spread across large distances, the cluster goes by the name of Grid. One of the biggest Grids in the world is the TeraGrid, with more than 1 petaflop of computing capability and 30 petabytes of distributed storage.


One advantage of the Grid is the possibility for regular computer users to sign up and, when their computers are idle, let the Grid use their resources to perform some type of work and store data. Examples of this include seti@home, to help process space radio signals which may reveal extraterrestrial intelligence; folding@home, to aid the study of protein folding, leading to better and innovative drugs; or lhc@home, to work on elementary particle physics, fundamental for unraveling, for example, the origins of the Universe.

The GPU's are processors that co-work with the CPU's, relieving them of the intensive work of graphics rendering. Built with that purpose, their architecture allows a much higher computational power and a greater energy efficiency. These characteristics are making them an attractive and cheaper way of 'competing' with Supercomputers and Clusters, as one can pack a bunch of cheap GPU's together and achieve exceptional computational power.

Numerous applications are already taking advantage of the GPU's. In terms of CFD, Euler methods have successfully been ported to the GPU, with achieved speedups around 12; a Burgers' equation solver reached 18. But Lattice Boltzmann hit one of the strongest speedups: a 341 times improvement over the CPU implementation [15].

Since Lattice Boltzmann's strong point is handling traditionally cumbersome boundary conditions well, the combination of this characteristic with a highly parallel implementation is very promising.

1.2 Objectives

At the present moment, Lattice Boltzmann methods are an active field of research and development. As some of the investigation being carried out in the laboratory includes scenarios where recent papers point to LB as the best choice for a fast numerical solution, LASEF decided to start from the very bottom with these methods. Thus, the main purpose of this thesis was to establish some groundwork related to Lattice Boltzmann methods and to their parallel implementations.

1.3 Organization of the thesis

Chapter 2 introduces, in a historical perspective, the Lattice Boltzmann methods and the theoretical background of two numerical implementations. Chapter 3 gives a brief description of the machines that are used for parallel computation, overviewing the programming models used in each machine. Following that, in chapter 4, all the developed implementations are described in terms of the modifications performed to adjust the serial code to each parallel machine. All the results obtained with the different parallel approaches are compared and contrasted in Chapter 5. The most relevant aspects of the present work are summarized in Chapter 6. Finally, the appendices contain a small glossary and a brief description of basic principles of code optimization.


Chapter 2

Lattice Boltzmann Equation (LBE)

2.1 Introduction

The Universe is composed of atoms and molecules interacting with each other. It is like a random dance, whose erratic behavior is better described within the field of statistical mechanics. In this sea of particles, the probability of finding one particle at the position $\vec{x}$, at the time $t$ and with the velocity $\vec{v}$, is given by $f$. This $f$ is paramount to the Boltzmann Equation (BE) (eq. 2.1) and it is called the probability density function (PDF). It is expressed in units of $\left[\frac{1}{\mathrm{m}^3}\left(\frac{\mathrm{m}}{\mathrm{s}}\right)^{-3}\right]$.

$$\left[\partial_t + \frac{\vec{p}}{m}\,\partial_{\vec{x}} + \vec{F}\,\partial_{\vec{p}}\right] f(\vec{x},\vec{p},t) = \frac{\partial f}{\partial t} = C \qquad (2.1)$$

The left side of the Boltzmann equation represents the movement described by the particles (coupled with the force field $\vec{F}$), while the right side represents the inter-particle collision operator. The letters $\vec{p}$ and $m$ are used to represent the momentum and the mass of the atoms/molecules.

Boltzmann considered that particles are point-like entities in a dilute gas. These particles can only interact through short range two-body potentials and, as a consequence, there is no relation between particles entering a collision (the molecular chaos assumption). After the collision, however, the particles are coupled by the conservation laws of mass and momentum [14]. As such, the collision operator calculated for a two-body collision is the following

$$C = C_{1,2} = \int \left(f_{1',2'} - f_{1,2}\right) g\,\sigma(g,\Omega)\, d\Omega\, d\vec{p}_2 \qquad (2.2)$$

Here the collision operator, $C$, was renamed to $C_{1,2}$ to reinforce the idea of a two-particle collision. Also, with $f_i$ the PDF of particle $i$, the molecular chaos assumption gives $f_{1,2} = f_1 f_2$. The other parameters are $\sigma$, the differential cross section, $\vec{\Omega}$, the solid angle, $\vec{g}$, the velocity of the molecules relative to the solid angle, and $\vec{p}$, the momentum.

Boltzmann later showed, through his famous H-Theorem [4], the unification between two previously unrelated fields, mechanics and thermodynamics, and that the BE could compute transport properties such as diffusivity, viscosity and thermal conductivity [14].

Following in the wake of this idea, it was later shown, through the Chapman-Enskog procedure [4], that the interaction between particles could lead to the mean macroscopic hydrodynamic properties. This means that we can simulate the fluid dynamics through the underlying statistical particle collisions.

In 1986, it was shown that a cellular automaton could reproduce the complexity of a fluid flow by simulating the underlying microscopic interactions. Cellular automata are simple evolving models defined by a regular lattice, where a certain number of particles take residence in each node and change their states through interactions between these particles [3]. The interactions are defined by a set of previously given rules, and the states tend to be boolean (zero or one); hence, the microscopic interactions are called boolean microdynamics [14].

This led to the Lattice Gas Cellular Automata (LGCA), where the boolean microdynamics gave rise to the desired macroscopic parameters. However, this approach had some crucial faults, namely the inability to achieve high Reynolds numbers, the statistical noise and the difficulty of moving to three dimensions [14].

The attempt to avoid these drawbacks led to the creation of the Lattice Boltzmann Equation (LBE) [3], a smooth transition, since the LBE was hidden inside the LGCA formulations. Despite having eliminated the aforementioned LGCA problems, it still required some tuning to achieve the desired behavior. The main equation of the LBE is as follows

$$f_i(\vec{x} + \vec{e}_i\,\Delta t,\, t + \Delta t) - f_i(\vec{x}, t) = C_{1,2} = A_{ij}\left(f_j - f_j^{eq}\right) \qquad (2.3)$$

The left side represents the streaming step, while the right side represents the collision operator. All the non-linearities of the Boltzmann Equation are placed in the collision operator (more precisely in the $f^{eq}$ term), while the streaming step is totally linear [1]. The matrix $A_{ij}$ is called the scattering matrix. Its entries are calculated in each iteration, as they depend on the boolean states of the lattice's nodes.

Experiments made with LBE models revealed that one can go as low in viscosity (or as high in Reynolds number) as the lattice allows and that, for numerical simulation of turbulence, the LBE is fairly competitive with spectral methods.

More recently, further simplifications and improvements on the LBE led to the creation of its most used models: the LBGK, MRT and Entropic models [1]. This last one holds the promise of very strong stability. Analysis of this statement is still to be done, but it seems to be a promising method.

This leads us to what is being done with Lattice Boltzmann (LB) nowadays, and why.

LB methods can be found in many different areas. Examples include the simulation of combustion inside porous media, the flow of blood inside arteries with the interaction between the blood and the vessel's wall, microfluidics (especially in the area of liquid crystals), and the cinema industry, for CG fluid flows and particle suspensions. A unique niche can be found in the linkage between molecular dynamics and the 'continuum' computational dynamics.


Overall, this method is simple and flexible, easy to implement and 'easily' adapted to non-uniform grids and to time-changing irregular geometries. But it is the fact that these methods are highly parallel that has led to their recent resurgence.

2.2 Numerical Methods

In general, every LB implementation is composed of a temporal loop where, at a given time and for every node of the lattice, there is a collision between particles, followed by the propagation of these particles to neighbor nodes. The propagation path is well defined by the lattice used.

Differences between implementations result from the number of particles present in each node (at any given time), the way the collision step is made, and what type of formulation is used for the boundary conditions [14]. Lattice Boltzmann schemes are all explicit and second-order accurate (more in the case of turbulence), with the possibility of dropping to first order on the boundaries if first-order boundary conditions are used [14].

Generally, the number of particles used per node and the type of space where they exist are represented as DnQm, where n stands for the number of dimensions and m for the number of particles. It is important to choose the number of particles for a given space properly, as some models may be unable to capture certain physical phenomena. This is due to the stability criterion of the LB: the lattice has to be able to support the fastest speed of physical information (Courant number). For this reason, most LB methods are limited to incompressible regimes (Mach numbers below 0.33) [18] [8]. Also, a wrong choice of the number of particles and their directions of propagation may lead, through the Chapman-Enskog procedure, to an equation different from the Navier-Stokes.

For two-dimensional simulations the D2Q9 model was used. In this model, every node has 9 particles, 8 of which move to the 8 neighboring nodes and one that remains at rest. For three-dimensional lattices, the D3Q19 model was used.

Even among DnQm models that can recover the Navier-Stokes equations, the number of particles and their propagation paths are still important in terms of numerical errors and the quality of the obtained results. For example, the D3Q15 model is very susceptible to short-wave oscillations [8]; the D3Q19 is not. That is due not only to the extra four particles, but also to the way the particles are free to move. Choosing the D3Q27 model would not bring much more stability and precision than the D3Q19, but it would bring more complexity and computational demand.

According to the coordinate system and the indexing given, it is necessary to define the translation vectors of the particles and their weights, $e$ and $W$ respectively. These weights can be seen as adimensional masses: the lighter the mass, the further the particle can travel (for the same time step).

Figure 2.1: Particles' direction of travel for the D2Q9 (left) and the D3Q19 (right).

For the D2Q9 model represented in figure 2.1,

$$\vec{e}_{i\,D2Q9} = \begin{cases} (0,0), & i = 0 \\ (\pm 1, 0),\ (0, \pm 1), & i = 1, 3, 5, 7 \\ (\pm 1, \pm 1), & i = 2, 4, 6, 8 \end{cases} \qquad (2.4)$$

$$W_{i\,D2Q9} = \begin{cases} 4/9, & i = 0 \\ 1/9, & i = 1, 3, 5, 7 \\ 1/36, & i = 2, 4, 6, 8 \end{cases} \qquad (2.5)$$

For the D3Q19, figure 2.1,

$$\vec{e}_{i\,D3Q19} = \begin{cases} (0,0,0), & i = 0 \\ (\pm 1, 0, 0),\ (0, \pm 1, 0),\ (0, 0, \pm 1), & i = 1, 2, \dots, 6 \\ (\pm 1, \pm 1, 0),\ (0, \pm 1, \pm 1),\ (\pm 1, 0, \pm 1), & i = 7, 8, \dots, 18 \end{cases} \qquad (2.6)$$

$$W_{i\,D3Q19} = \begin{cases} 1/3, & i = 0 \\ 1/18, & i = 1, 2, \dots, 6 \\ 1/36, & i = 7, 8, \dots, 18 \end{cases} \qquad (2.7)$$
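For illustration, the D2Q9 set above can be stored directly as constant arrays. This is a minimal sketch written for this text, not the thesis code; the names ex, ey and w are arbitrary, and the ordering of directions within each group of eq. (2.4) is an assumption (the actual assignment is fixed by figure 2.1).

/* D2Q9 lattice: index 0 is the rest particle, odd indices 1, 3, 5, 7 the
   axis directions and even indices 2, 4, 6, 8 the diagonals, as in eq. (2.4).
   The particular counter-clockwise order chosen here is an assumption. */
static const int ex[9] = { 0, 1, 1, 0, -1, -1, -1,  0,  1 };
static const int ey[9] = { 0, 0, 1, 1,  1,  0, -1, -1, -1 };

/* Weights from eq. (2.5): 4/9 for the rest particle, 1/9 for the axis
   directions and 1/36 for the diagonals. */
static const double w[9] = { 4.0/9.0,
                             1.0/9.0, 1.0/36.0, 1.0/9.0, 1.0/36.0,
                             1.0/9.0, 1.0/36.0, 1.0/9.0, 1.0/36.0 };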

In Lattice Boltzmann methods it is typical to work in lattice units, in which the lattice spacing and the time step are taken as unitary. As a result, the obtained hydrodynamic properties need to be scaled in order to be physically congruent with experimental values. The main scale factors are for time, space and mass; all other scale factors can be deduced from these three.

Let $N$ be the number of lattice nodes in one dimension of the domain and $L$ the physical dimension of the domain (in centimeters); then the lattice spacing, $\Delta x$, is given (in centimeters) as

$$S_l = \Delta x = \frac{L}{N-1} \qquad (2.9)$$

The two other scale factors are

$$S_t = \Delta t = \frac{c_s\, \Delta x}{C_s} \qquad (2.10)$$

$$S_m = m \qquad (2.11)$$

where $C_s$ and $c_s$ are, respectively, the physical and the lattice speed of sound, $m$ is the physical mass of the fluid molecules (in grams) and $\Delta t$ the time step (in seconds). The lattice speed of sound is a constant, $c_s = \frac{1}{\sqrt{3}}$, and holds valid for both the two- and the three-dimensional lattices used.
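As a small sketch of how these conversions can be coded (hypothetical helper functions, not taken from the thesis implementation):

#include <math.h>

/* Lattice speed of sound, cs = 1/sqrt(3), valid for both D2Q9 and D3Q19. */
#define CS_LATTICE (1.0 / sqrt(3.0))

/* Space scale factor, eq. (2.9): physical size of one lattice spacing. */
double scale_length(double L, int N) { return L / (N - 1); }

/* Time scale factor, eq. (2.10): physical duration of one lattice time step,
   given the physical speed of sound Cs of the fluid being simulated. */
double scale_time(double L, int N, double Cs)
{
    return CS_LATTICE * scale_length(L, N) / Cs;
}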

In the presented cases, two collision operators were used, both simplifications of the LBE model: the BGK and the MRT.

2.2.1 LBGK

The Lattice Bhatnagar-Gross-Krook (LBGK) method, sometimes called Single Relaxation Time (SRT), is the name of the model that uses the BGK collision operator. This collision operator was obtained by simplifying the scattering matrix $A_{ij}$, since only one of the numerous parameters present in the matrix is actually responsible for controlling viscosity and transport phenomena. All the other elements of this matrix are there to control the non-hydrodynamic modes (called ghost modes) that induce numerical instability [14]. This simplification reduces the scattering matrix to a single relaxation parameter, $\omega$:

$$C_{BGK} = -\omega\left(f - f^{eq}\right) \qquad (2.12)$$

This operator removes some of the mathematical complexity and computational demand of the LBE without compromising the physics: the Navier-Stokes equations can still be recovered from the LBGK model [1].

Nonetheless, all the modes, the rates of transfer of mass, momentum, heat, etc., take place in the same time step [14]. This implies, for example, that the Prandtl number is unitary and that short-wave oscillations and numerical instabilities will acquire enough weight to take their toll on the model's stability [18]. In theory, the LBGK can simulate any given Reynolds number as long as the lattice is tight enough to capture the small scales. For example, a 128 × 128 lattice cannot capture Reynolds numbers bigger than 2000, but if we increase the mesh to 200 × 200, then we can achieve Reynolds numbers up to 8000. This has to do with gradients that occur at singular points. These gradients create short-wave oscillations that are only loosely captured by the lattice. It is an aliasing effect that leads to numerical instabilities that strip the results of meaning.

So, generally, the LBGK model is given by


$$f_i(\vec{x} + \vec{c}_i\,\Delta t,\; t + \Delta t) - f_i(\vec{x}, t) = -\omega\left(f_i(\vec{x}, t) - f_i^{eq}(\vec{x}, t)\right) \qquad (2.13)$$

The terms on the left side represent the advection, and the ones on the right the collision. $\omega$ is defined as the inverse of the relaxation time scale and is linked with the kinematic viscosity:

$$\omega = \frac{1}{\tau}, \qquad 0 < \omega < 2 \qquad (2.14)$$

$$\nu = \frac{2\tau - 1}{6} \qquad (2.15)$$

$f^{eq}$ is the Maxwell-Boltzmann equilibrium distribution approximation, a function of the local conserved quantities: $\rho$, $\vec{v}$, etc. It is inside this term that all the non-linearities of the Navier-Stokes equations are hidden.

$$f_i^{eq} \approx \rho W_i \left(1 + \frac{\vec{e}_i\cdot\vec{u}}{c_s^2} + \frac{(\vec{e}_i\cdot\vec{u})^2}{2c_s^4} - \frac{u^2}{2c_s^2} + \frac{(\vec{e}_i\cdot\vec{u})^3}{2c_s^6} - \frac{(\vec{e}_i\cdot\vec{u})\,u^2}{2c_s^4}\right) + O[u^4] \qquad (2.16)$$

Here cs represents the speed of sound in lattice units.

Finally, conservation of mass (eq. 2.17) and momentum (eq. 2.18) have to be enforced, since they were implicitly included in the scattering matrix $A_{ij}$:

$$\rho = \sum_{i=0}^{m-1} f_i \qquad (2.17)$$

$$\rho u_\alpha = \sum_{i=0}^{m-1} f_i\, e_{i\alpha} \qquad (2.18)$$

Since this method only holds for incompressible situations (or, more precisely, cases with small density variations), the pressure, if needed, is obtained as

$$p = c_s^2\, \rho \qquad (2.19)$$

In case there are external forces acting upon the fluid, each particle receives a momentum input in accordance with its mass and its direction of propagation [14].


Figure 2.2: Algorithm of the LBGK method.
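To make the algorithm of figure 2.2 concrete, the following is a minimal sketch (not the thesis code) of one LBGK time step on a D2Q9 lattice of NX x NY nodes. It assumes a simple array layout f[(y*NX + x)*Q + i] and the hypothetical ex, ey and w arrays introduced earlier; periodic wrap-around is used only to keep the sketch short, and the equilibrium is truncated at second order.

#define NX 256
#define NY 64
#define Q  9

static const int    ex[Q] = { 0, 1, 1, 0, -1, -1, -1,  0,  1 };
static const int    ey[Q] = { 0, 0, 1, 1,  1,  0, -1, -1, -1 };
static const double w[Q]  = { 4.0/9.0, 1.0/9.0, 1.0/36.0, 1.0/9.0, 1.0/36.0,
                              1.0/9.0, 1.0/36.0, 1.0/9.0, 1.0/36.0 };

/* One LBGK step (eq. 2.13): collide every node towards its local equilibrium
   and stream the post-collision PDFs to the neighbors they point to.
   "f" holds the current PDFs, "ftmp" receives the streamed ones. */
void lbgk_step(const double *f, double *ftmp, double omega)
{
    for (int y = 0; y < NY; y++)
        for (int x = 0; x < NX; x++) {
            const double *fn = &f[(y * NX + x) * Q];

            /* Macroscopic density and velocity, eqs. (2.17)-(2.18). */
            double rho = 0.0, ux = 0.0, uy = 0.0;
            for (int i = 0; i < Q; i++) {
                rho += fn[i];
                ux  += fn[i] * ex[i];
                uy  += fn[i] * ey[i];
            }
            ux /= rho;
            uy /= rho;

            for (int i = 0; i < Q; i++) {
                /* Second-order equilibrium of eq. (2.16) with cs^2 = 1/3. */
                double eu  = ex[i] * ux + ey[i] * uy;
                double feq = rho * w[i] * (1.0 + 3.0 * eu + 4.5 * eu * eu
                                           - 1.5 * (ux * ux + uy * uy));

                /* Collision: relax towards the equilibrium (eq. 2.13)... */
                double fstar = fn[i] - omega * (fn[i] - feq);

                /* ...and streaming: move the value to the neighbor node. */
                int xn = (x + ex[i] + NX) % NX;
                int yn = (y + ey[i] + NY) % NY;
                ftmp[(yn * NX + xn) * Q + i] = fstar;
            }
        }
}

Writing the streamed values into a second array avoids overwriting PDFs that still have to propagate in the same step; the actual storage schemes used in this work are described in chapter 4.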


2.2.2 Multi-Relaxation Time method

With the purpose of having a model that could reduce the ghost modes and allow high Reynolds numbers in (relatively) small lattices, while maintaining the simplicity of the LBGK, the Multi-Relaxation Time (MRT) method was proposed [8]. It was obtained through simplifications made on the scattering matrix of the LBE, while maintaining the possibility to control how the ghost modes and other instabilities behave (although they could not be completely eliminated) [1]. Reynolds numbers as high as $10^6$ have been obtained for a 2D lid-driven cavity, although the accuracy of the results could not be verified [18] [14].

The collision term of the MRT model is given by:

$$C_{1,2} = -[M]^{-1}[S]\left(|R\rangle - |R^{eq}\rangle\right) \qquad (2.20)$$

To be in accordance with the literature, we use $[\ ]$ to represent a matrix and $|\,\rangle$ to represent a column vector.

$S$ is the relaxation matrix. It is diagonal and every element is given by $s_{ii} = \omega_i = \frac{1}{\tau_i}$. The possibility of having multiple relaxations is what gives this model its increased stability. Note that the entries of $S$ that act upon mass and momentum are taken to be zero to enforce the conservation of these properties. If all non-zero values of $S$ take the same $\omega$, the MRT model will have the same behavior as the LBGK (with the same relaxation) [8].

$$S_{2D} = \mathrm{diag}\,(0,\, s_2,\, s_3,\, 0,\, s_5,\, 0,\, s_7,\, s_8,\, s_9) \qquad (2.21)$$

$$S_{3D} = \mathrm{diag}\,(0,\, s_1,\, s_2,\, 0,\, s_4,\, 0,\, s_4,\, 0,\, s_4,\, s_9,\, s_{10},\, s_9,\, s_{10},\, s_{13},\, s_{13},\, s_{13},\, s_{16},\, s_{16},\, s_{16}) \qquad (2.22)$$

The fluid viscosity is obtained from:

$$\nu = \frac{1}{3}\left(\frac{1}{s_9} - \frac{1}{2}\right) \qquad (2.23)$$

$R$ is a column vector that stores the macroscopic properties of the node. The number of elements of the column is equal to the number of particles living in a node. The elements of $R$ are the density $\rho$, the part of the kinetic energy independent of the density $\varepsilon$, the momentum $j$, the energy flux independent of the mass flux $q$, the viscous stress tensor $p$ and an anti-symmetric third-order moment $m$ [8].

$$R_{2D} = (\rho,\, e,\, \varepsilon,\, j_x,\, q_x,\, j_y,\, q_y,\, p_{xx},\, p_{xy}) \qquad (2.24)$$

$$R_{3D} = (\rho,\, e,\, \varepsilon,\, j_x,\, q_x,\, j_y,\, q_y,\, j_z,\, q_z,\, p_{xx},\, \pi_{xx},\, p_{ww},\, \pi_{ww},\, p_{xy},\, p_{yz},\, p_{xz},\, m_x,\, m_y,\, m_z) \qquad (2.25)$$

$R$ is obtained by performing a matrix transformation of $f$ through the matrix $M$. This matrix depends upon the numbering given to the directions a particle can travel (fig. 2.1). The use of a different numbering makes it necessary to re-order the columns of $M$, and it is common in the literature to find different versions of these matrices. But keep in mind that its function is to summarize the conservation laws, as it is the matrix version of eq. 2.17 and eq. 2.18, among others.

$$|R\rangle = [M]\,|f\rangle \qquad (2.26)$$

$$M_{D2Q9} = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
-4 & -1 & 2 & -1 & 2 & -1 & 2 & -1 & 2 \\
4 & -2 & 1 & -2 & 1 & -2 & 1 & -2 & 1 \\
0 & 1 & 1 & 0 & -1 & -1 & -1 & 0 & 1 \\
0 & -2 & 1 & 0 & -1 & 2 & -1 & 0 & 1 \\
0 & 0 & 1 & 1 & 1 & 0 & -1 & -1 & -1 \\
0 & 0 & 1 & -2 & 1 & 0 & -1 & 2 & -1 \\
0 & 1 & 0 & -1 & 0 & 1 & 0 & -1 & 0 \\
0 & 0 & 1 & 0 & -1 & 0 & 1 & 0 & -1
\end{bmatrix} \qquad (2.27)$$

$$M_{D3Q19} = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
-30 & -11 & -11 & -11 & -11 & -11 & -11 & 8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 \\
12 & -4 & -4 & -4 & -4 & -4 & -4 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 1 & -1 & 0 & 0 & 0 & 0 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & -4 & 4 & 0 & 0 & 0 & 0 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & -1 & 0 & 0 & 1 & 1 & -1 & -1 & 0 & 0 & 0 & 0 & 1 & -1 & 1 & -1 \\
0 & 0 & 0 & -4 & 4 & 0 & 0 & 1 & 1 & -1 & -1 & 0 & 0 & 0 & 0 & 1 & -1 & 1 & -1 \\
0 & 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\
0 & 0 & 0 & 0 & 0 & -4 & 4 & 0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\
0 & 2 & 2 & -1 & -1 & -1 & -1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & -2 & -2 & -2 & -2 \\
0 & -4 & -4 & 2 & 2 & 2 & 2 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & -2 & -2 & -2 & -2 \\
0 & 0 & 0 & 1 & 1 & -1 & -1 & 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -2 & -2 & 2 & 2 & 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & -1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & -1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & -1 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & -1 & 1 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 & -1 & -1 & 1 & 1
\end{bmatrix} \qquad (2.28)$$

$R^{eq}$ is a column vector responsible for storing the equilibrium macroscopic properties (similar to $f^{eq}$ in the LBGK). For the bi-dimensional case, the equilibrium terms are given by:


$$\begin{aligned}
e^{eq} &= -2\rho + 3\left(j_x^2 + j_y^2\right) \\
\varepsilon^{eq} &= \rho - 3\left(j_x^2 + j_y^2\right) \\
q_x^{eq} &= -j_x \\
q_y^{eq} &= -j_y \\
p_{xx}^{eq} &= j_x^2 - j_y^2 \\
p_{xy}^{eq} &= j_x j_y
\end{aligned}$$

And for the three dimensional case:

$$\begin{aligned}
e^{eq} &= -11\rho + \frac{19}{\rho_0}\left(j_x^2 + j_y^2 + j_z^2\right) \\
\varepsilon^{eq} &= w_\varepsilon \rho + \frac{w_{\varepsilon j}}{\rho_0}\left(j_x^2 + j_y^2 + j_z^2\right) \\
q_x^{eq} &= -\frac{2}{3} j_x, \qquad q_y^{eq} = -\frac{2}{3} j_y, \qquad q_z^{eq} = -\frac{2}{3} j_z \\
p_{xx}^{eq} &= \frac{1}{\rho_0}\left[2 j_x^2 - \left(j_y^2 + j_z^2\right)\right] \\
p_{xy}^{eq} &= \frac{1}{\rho_0}\, j_x j_y, \qquad p_{yz}^{eq} = \frac{1}{\rho_0}\, j_y j_z, \qquad p_{xz}^{eq} = \frac{1}{\rho_0}\, j_x j_z \\
p_{ww}^{eq} &= \frac{1}{\rho_0}\left[j_y^2 - j_z^2\right] \\
\pi_{xx}^{eq} &= w_{xx}\, p_{xx}^{eq}, \qquad \pi_{ww}^{eq} = w_{xx}\, p_{ww}^{eq} \\
m_x^{eq} &= m_y^{eq} = m_z^{eq} = 0
\end{aligned}$$

Here $w_\varepsilon$, $w_{\varepsilon j}$ and $w_{xx}$ are weights used together with the relaxation matrix to provide stability. According to [8], optimal stability uses $w_\varepsilon = 0$, $w_{\varepsilon j} = 475/63$, $w_{xx} = 0$ and $s_1 = 1.19$, $s_2 = s_{10} = 1.4$, $s_4 = 1.2$, $s_{16} = 1.98$.

More concisely, this method consists in working in moment space for the collision and in velocity space for the propagation. That means that, in the collision phase, we start by calculating the macroscopic properties $R$ and $R^{eq}$, then a post-collision macroscopic property $R^{*}$ (eq. 2.29), and then the post-collision PDF as:


$$|f^{*}\rangle = [M]^{-1}\left(|R\rangle - [S]\left(|R\rangle - |R^{eq}\rangle\right)\right) = [M]^{-1}\, |R^{*}\rangle \qquad (2.29)$$

The streaming step is equal to that of the LBGK.

Figure 2.3: Algorithm of MRT.

Overall, this method is only 15% slower than the SRT [8] [17]. All this extra time takes place in the collision step.
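To illustrate the moment-space collision of eq. (2.29) for a single D2Q9 node, a generic sketch follows. It is a sketch only, assuming M, Minv and s hold the transformation matrix of eq. (2.27), its inverse and the diagonal of the relaxation matrix, and that Req was filled with the 2D equilibrium relations above; none of these names come from the thesis code.

#define Q 9

/* MRT collision for one node, following eq. (2.29):
     R  = M f               (project the PDFs onto the moments)
     R* = R - S (R - Req)   (relax each moment with its own rate; conserved
                             moments have s = 0 and pass through unchanged)
     f* = M^-1 R*           (return to velocity space before streaming)   */
void mrt_collide(double f[Q],
                 const double M[Q][Q], const double Minv[Q][Q],
                 const double s[Q], const double Req[Q])
{
    double R[Q], Rstar[Q];

    for (int k = 0; k < Q; k++) {          /* moments: R = M f */
        R[k] = 0.0;
        for (int i = 0; i < Q; i++)
            R[k] += M[k][i] * f[i];
    }

    for (int k = 0; k < Q; k++)            /* relaxation in moment space */
        Rstar[k] = R[k] - s[k] * (R[k] - Req[k]);

    for (int i = 0; i < Q; i++) {          /* back to PDFs: f* = M^-1 R* */
        f[i] = 0.0;
        for (int k = 0; k < Q; k++)
            f[i] += Minv[i][k] * Rstar[k];
    }
}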


2.3 Boundary Conditions (BC)

Boundary conditions have an important effect on the stability of the models [1]. Although complex boundary conditions are still an active field of research [14], the simpler boundary conditions used in this work are well matured. They are the no-slip, the sliding wall and the periodic BC's.

The no-slip boundary condition is replaced by the bounce-back boundary condition (BBBC). In the no-slip case, we ensure that the velocity tangential to the wall is zero (and the perpendicular component as well, if the wall is non-permeable). In the BBBC this is not needed. Since LB works on the mesoscale, the particles' velocity at the interface may be different from zero, even though macroscopically it is perceived as zero. Therefore, the BBBC only guarantees the conservation of the particles' energy: they collide elastically with the wall and return in the opposite direction. Forcing the macroscopic velocity on the wall to be zero would dampen the flow near the wall, as energy would not be conserved.

The wall where particles bounce back can be placed either on a node or in between nodes of the grid. If the wall is placed on a grid node, then it takes two iterations for the presence of the wall to be felt by the neighboring flow (one iteration to receive the particles, another to propagate them back). On-grid walls give rise to first-order accurate boundary conditions. If the wall is in between nodes, the particle collides with the wall at a mid point, which means it bounces back to the fluid nodes in the same iteration. This type of approach offers second-order accuracy.
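A minimal sketch of the on-grid bounce-back just described, for one solid D2Q9 node (hypothetical helper, not the thesis code); the opp table pairs each direction with its opposite under the numbering assumed in the earlier sketches.

#define Q 9

/* Opposite-direction table: 1<->5, 2<->6, 3<->7, 4<->8; the rest particle
   maps to itself. */
static const int opp[Q] = { 0, 5, 6, 7, 8, 1, 2, 3, 4 };

/* On-grid bounce-back: particles that arrived at a wall node keep their
   magnitude (energy is conserved) but reverse their direction, so they are
   propagated back into the fluid on the next streaming step. */
void bounce_back(double f[Q])
{
    double tmp[Q];
    for (int i = 0; i < Q; i++)
        tmp[i] = f[i];
    for (int i = 0; i < Q; i++)
        f[i] = tmp[opp[i]];
}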

A sliding wall is a special case of a bounce-back condition where the particles reflected from the wall receive the momentum of the wall, propagating this momentum to the fluid below.

The simplest way to apply a sliding wall BC is to use the macroscopic properties, such as the velocity of the wall, to calculate the equilibrium PDF's of the particles propagating out of the wall. This is a first-order approximation [8] [1].

$$f_i = W_i\, \rho_0\, \frac{\vec{e}_i \cdot \vec{U}_{wall}}{c_s^2} \qquad (2.30)$$

A second order approximation would take into account the interaction of the particles colliding with the wall and being reflected back [8] [1].

$$f_i^{*} = f_i + 2\, W_i\, \rho_0\, \frac{\vec{e}_i \cdot \vec{U}_{wall}}{c_s^2} \qquad (2.31)$$

This is a super-set of the first order approximation.

The periodic boundary condition, used for infinite domains, consists in taking the particles leaving the computational domain and making them re-enter through the opposite side. For example, suppose the periodic boundaries are the left and right edges of a bi-dimensional rectangular domain. A node placed on the right frontier will have its east, north-east and south-east particles leaving the domain. These particles are placed in the node of the left boundary that shares the same y value. The same occurs for the particles leaving through the left frontier.

Figure 2.4: Periodic boundary conditions.

In terms of the PDF's, this condition is translated in the following way:

$$\begin{aligned}
f_{in}(E) &= f_{out}(E) & f_{in}(W) &= f_{out}(W) \\
f_{in}(NE) &= f_{out}(NE) & f_{in}(NW) &= f_{out}(NW) \\
f_{in}(SE) &= f_{out}(SE) & f_{in}(SW) &= f_{out}(SW)
\end{aligned}$$
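In code, the relations above amount to wrapping the destination index during streaming; a small sketch (an assumption of this text, not the thesis implementation) is:

/* Periodic wrap of a node index: a particle streamed one cell past the right
   edge (x == nx) re-enters at x = 0, and one streamed past the left edge
   (x == -1) re-enters at x = nx - 1, with the same y and the same direction,
   which is exactly f_in(E) = f_out(E), f_in(NE) = f_out(NE), and so on. */
static inline int wrap(int x, int nx)
{
    if (x < 0)   return nx - 1;
    if (x >= nx) return 0;
    return x;
}

In the LBGK sketch shown earlier this is equivalent to the modulo wrap already used when computing the neighbor index.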


Chapter 3

Parallel processing

3.1 Classifications of parallel computing

Roughly speaking, parallel computing consists in dividing a problem into smaller ones, sending each of these parts to be solved on a different processor, and combining all of the results. If all goes well, the time it takes to solve the problem in parallel, $T_{par}$, should be less than the time it would take to solve the same problem sequentially, $T_{seq}$. The ratio between the two, the speedup $S$, should be higher than unity and is a way of comparing the performances.

$$S = \frac{T_{seq}}{T_{par}} \qquad (3.1)$$

The processors mentioned above are either CPU's or GPU's. The latter, more correctly called a co-processor, will be dealt with later. For now, we will stick with CPU-based parallel machines. Nonetheless, some of the concepts given here will be applicable to the GPU's.

3.1.1 A primer on CPU architecture

Given its pivotal nature in parallel machines, it is important to understand how the CPU works. This kind of processor is best understood with the von Neumann architecture model [13]. It consists of three major components: a central processing unit (or CPU), the main memory and the Input/Output systems.


Figure 3.1: Simplified scheme of a CPU architecture.

A program, which is a set of instructions, is stored in the global memory, the RAM. An input, such as a keystroke, tells the Control Unit (CU) to retrieve a certain instruction from memory, the fetch operation. It then decodes the instruction and executes it. If a logic or arithmetic operation is required, the ALU, the Arithmetic and Logic Unit, is summoned and fed with the data.

If intermediate values are present, they are stored in the nearby memory: first the registers, then the caches and, finally, the main memory. This memory hierarchy has to do with the time it takes to read/write to the memory. The register memory is the closest, therefore the fastest, memory; it is also the smallest. Caches have bigger storage than registers but are further away. The slowest memory, yet the biggest, is the main memory. This is where final values are recorded.

Finally, the result of the program is sent to the output, the display, for example.

All the connections between elements of the CPU, and between the CPU and the main memory, are made by buses, collections of physical wires.

This was an example of how a CPU works. Nowadays, CPU architectures are much more complex than this one, and they use far more complex instruction cycles than the fetch-decode-execute model [13], mainly because a CPU has to be able to perform well under many different kinds of workloads. This ends up making the CPU a compromise solution that balances those demands.

3.1.2 Classifying parallel machines

It is the way processors and memories are assembled together that determines the type of parallel machine. The most common ways to classify these machines are according to Flynn's taxonomy and to the memory architecture [6].

Flynn's taxonomy classifies a computer according to the number of concurrent instructions and the data streams available in the architecture. There are four possible classifications:

• Single Instruction, Single Data (SISD) - The processor executes one instruction set over a single data stream.

• Single Instruction, Multiple Data (SIMD) - Each processor executes the same instruction but on different data streams. This is the case of GPU's and the CPU's of Supercomputers.

• Multiple Instruction, Single Data (MISD) - The same data stream is sent to different processors, each one executing a different set of instructions over the data.

• Multiple Instruction, Multiple Data (MIMD) - Every processor may run a different set of instructions over different sets of data. It is a superset of SIMD. Most parallel computers fall into this category: Supercomputers, Clusters, Grids.

According to the memory architecture, parallel machines can be classified as shared, distributed or hybrid [6].

In the shared memory architecture, all processors have access to the same global memory resource. If the distribution of the processors around the memory is done in a way that makes the memory access times equal, shared memory machines are called UMA, Uniform Memory Access (fig. 3.2); they are called NUMA, Non Uniform Memory Access, otherwise (fig. 3.3). The former is the case of the Symmetrical Multiprocessor (SMP) machines found in desktop PC's, laptops and Supercomputer boards. The latter is made by linking two or more SMP's by means of a bus interconnect.

Since processors act independently, sharing access to the same pool of memory holds the danger of having one processor retrieve a certain value from memory unaware that another processor has just changed that value. This leads to faulty results in the first processor. It also leads to apparently non-deterministic behavior, as sometimes the first processor may retrieve the value before it was changed, other times after. These situations, where results are affected by timing, are called race conditions [10], an undesired event. Normally, this type of machine has built-in cache coherence (cc), which means that if one processor updates a certain value in memory, all other processors are notified. Nonetheless, it is up to the programmer to be careful with memory accesses. Tools such as locks/unlocks and semaphores are used to guarantee correct data accesses [6].
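As a tiny illustration (not from the thesis) of a race condition and of the kind of protection mentioned, consider several threads accumulating into one shared variable: without protection, the read-modify-write of sum can interleave and lose updates, while an atomic update (or, better, a reduction clause) makes it correct.

#include <omp.h>

/* Shared-memory sum of an array. The unprotected form "sum += a[i]" would be
   a race condition; the atomic directive serializes just that update.
   (An OpenMP "reduction(+:sum)" clause would be the more idiomatic fix.) */
double parallel_sum(const double *a, int n)
{
    double sum = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        sum += a[i];
    }
    return sum;
}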

The main advantage of this type of architecture is the quick memory access, due to the memory's proximity to the CPU's. The main disadvantage is the lack of scalability between CPU's and memories. What this means is that, if we increase the number of processors, we would expect an increase in parallel performance; but, due to the increased memory traffic, that may not be the case. Race conditions also become more probable.

Distributed memory systems consist in the connection, through a network, of processors, each with a local pool of memory. Each processor has free access to its local memory (registers, caches and main memory) but, to access the main memory of another processor, a protocol has to be established between the two.


Figure 3.2: Scheme of a UMA layout. Figure 3.3: Scheme of a NUMA layout.

Scalability of memory, which is the increase of main memory size with the increase of the number of processors, proves to be one of the main advantages of this type of architecture. But, since communication between processors has to go through the network's bandwidth, communication times are high [6]. This is the main disadvantage of this memory architecture, as it affects the overall time of a simulation. To achieve a good overall performance, strategies to hide the communication time have to be applied.

Figure 3.4: Distributed memory system.

The hybrid memory architecture is the combination of the two previous systems: locally a shared memory system, either UMA or NUMA, globally connected to other systems. It is the most commonly used; Grids, Clusters and Supercomputers all follow this memory layout.

Figure 3.5: Hybrid memory layout.

3.2 Parallel Programming Models

In theory, any parallel programming model can be applied to any type of parallel machine or memory architecture. Nonetheless, some models may prove to be a better fit for certain machines than for others.


The major parallel programming models are the following: shared memory, threads, message passing, data parallel and hybrid. Of these, only two were implemented on the CPU: threads and message passing, the former using OpenMP and the latter with MPI. Commonly one finds OpenMP and MPI used together in a hybrid approach; Clusters and Supercomputers use it.

3.3 OpenMP

OpenMP is commonly used as a mean to split tasks among different processors in a shared memory

architecture [5]. It acts upon the threads, hence being a threads-type parallel programming model.

Threads are defined as “execution states that are able to process an instruction stream”[5]. This

means that they are a sub-task of a program, just like a sub-routine.

It is normally associated that a sequential program running on only one processor, consists of only one

thread being executed at a time. Although not entirely false, CPU’s can execute more than one thread

’simultaneously’ using an interleaved method. It’s an instruction-level parallelism used to minimize CPU

idle times [10] [5].

But to us, multiple threads make more sense with more than one processor, each processor running

a thread.

Threads are generated at forks of the execution, where different sets of instructions, with local data, can run concurrently. Besides their own private data, they all share the resources available to the program. This is why one of the main problems with threads is coordinating their actions and avoiding data corruption due to race conditions. For that purpose, mechanisms like barriers and locks are used [5].

OpenMP consists of compiler directives, runtime library routines and environment variables, all meant to take the burden of the low-level details and work assignment off the programmer. The programmer has to identify where work can be divided and run concurrently, by putting a special comment, a pragma, in the code. Let's look at an example of a matrix, M, times a vector, v.

#pragma omp parallel for default(none) shared(m, n, a, b, c) private(i, j)
for (i = 0; i < m; i++) {
    a[i] = 0.0;
    for (j = 0; j < n; j++)
        a[i] += b[i*n + j] * c[j];
}

The two for loops are a sequential implementation of the multiplication. The only addition to the normal code is the first line. #pragma omp is the sentinel that tells the compiler the following is an


OpenMP directive. It is followed by the construct parallel, meaning the statement following this special comment will be executed by more than one thread. The thread that found the OpenMP directive forks into a team of threads, each thread being given an ID number. This alone does not tell the threads what to do, hence the construct for: a loop construct responsible for assigning some of the loop iterations to each thread. Next comes an indication of which variables can be modified by all threads and which ones are private to each thread.

Although not explicit, at the end of the parallel region there is a barrier responsible for synchronizing the threads and, in this case, joining them into a single thread that continues running the program.

It should be noted that we cannot make any assumptions on the order of the execution of threads.

They are not deterministic.

3.4 MPI

Since in parallel computation each processor is given only a small part of the whole problem, it is common that a processor needs data calculated by another processor. In a shared memory architecture this can be done through main memory. In the case of a distributed memory architecture, a form of communication must be established between the two: the processor in need of data has to receive it, and the processor responsible for producing the data must send it to the right processor. This communication problem can be solved with MPI.

MPI, which stands for Message Passing Interface, is the standard library specification for message passing between processors [10] [9]. It is a type of explicit parallelism: it is up to the programmer to identify how the parallelism should be done and to use the appropriate MPI functions. Although not mandatory, MPI is mostly used in distributed or hybrid memory architectures such as clusters and supercomputers.

In order to keep track of the processors and control their activities, MPI establishes a hierarchy. The main entities are communicators, groups and ranks [9]. A group is a set of processors. Each of the processors inside the group is given an identification number starting at zero, its rank. Each group has a communicator, which is a unique handle to the group.


Figure 3.6: Representation of MPI’s hierarchical chain.

In order for a message to be sent, the protocol establishes that each message has to define the communicator - to reach the group -, the rank - to reach the specific processor inside the group -, the type of data to be sent - integer, float, etc. -, the size of the data to be sent, a tag - an integer to define the order of reception, in case many messages reach the processor at the same time - and the source - the rank of the processor that sent the message. Only if the receiving processor posts a matching receive will it be able to receive the message.

All send and receive actions can be performed in a blocking or non-blocking way [9]. A blocking call means that only after the data has been safely sent or received will the processor continue to run the program. It may bring performance down, as the processor becomes idle, waiting. Non-blocking calls mean the opposite: the processor will continue to run the program even if the message has not been sent or received yet. This allows the processor to continue working. Sooner or later, though, it will indeed need to use the received data. This is done through the use of a barrier or wait call, which will only let the program advance after all pending sends and receives are complete.
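As an illustration, a minimal sketch of this non-blocking pattern follows (the buffer names, sizes and the use of MPI_Waitall here are illustrative, not code from this work):

#include <mpi.h>

void exchange_with(int peer, double *send_buf, double *recv_buf, int n)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    /* post the communications; both calls return immediately */
    MPI_Irecv(recv_buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... useful work that does not touch the buffers can be done here ... */

    /* 'barrier' for these two requests: block until both are complete */
    MPI_Waitall(2, reqs, stats);
}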

A typical code using MPI takes the following format


Figure 3.7: Layout of a typical MPI program.

Every processor receives a copy of the code and, based upon its position inside the hierarchy, executes a certain part of the problem, communicating with other nodes when necessary.

MPI can perform point-to-point communication - one processor sends data to another (specific)

processor - as well as collective communications - broadcast, scatter, gather, all to all, etc [9]. It also has

ways to synchronize the program between nodes.

Following is an example of a C MPI implementation where two processors send one character to each

other.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int numtasks, rank, dest, source, rc, tag = 1;
    char inmsg, outmsg = 'x';
    MPI_Status Stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        dest = 1;
        source = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
    else if (rank == 1) {
        dest = 0;
        source = 0;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

The program starts with the inclusion of the MPI library through the header #include "mpi.h". Next, all the variables are declared and the number of processes available is obtained with MPI_Comm_size. Here, MPI_COMM_WORLD is the default communicator and the number of processes is an argument given when launching the program. Then each processor retrieves its rank inside the communicator and, based upon the rank, executes its task. In this case the tasks are similar: each processor sends one character to, and receives one character from, the other processor, using MPI_Send and MPI_Recv. These are blocking calls, which means the program only continues after both functions return. The arguments of both functions are the address of the variable to send or receive, its size, the type of the variable, the rank of the processor that will receive the message or that sent it, the tag and the communicator.

3.5 Massively Parallel Processors: GPU

3.5.1 Introduction

A Graphics Processing Unit is a specialized processor with the main purpose of aiding the CPU in the arithmetic-intensive calculations of graphics rendering. With that purpose, the GPU's architecture evolved to perform in the SIMD paradigm, which means that a GPU can perform many more floating point operations per second than a CPU [10]. For example, the Pentium 4 has a peak computational power, in single precision, of 125 GFLOP/s, while the NVIDIA GTX480 GPU has 1375 GFLOP/s, or 1.375 Teraflops. In order to reach this peak power, around ten Pentium 4 processors would be needed, which would end up being more expensive, as the ratio between price and computational power is one of the main attractions of GPU's [12].

It was only natural that non-graphics applications found a way to take advantage of this condensed computational power. But, if one wanted to use a GPU for scientific computations, the code had to be transformed into elements known to the GPU, such as vertices, textures, shaders, etc. Although an enormous nuisance, numerous applications were developed and the trend of using GPU's was coined GPGPU, or General-Purpose computation on GPU's [10].

NVIDIA took the next step by 'opening' the architecture of the GPU: one no longer needed to convert the algorithm to fit into the device. This new architecture and its API (Application Programming Interface) were named Compute Unified Device Architecture, CUDA for short.


GPU's are still riding Moore's law and there is transparent scalability between devices, meaning that the same code will scale its performance with newer devices.

The increased interest in massively parallel processors led to the creation of the OpenCL standard. With OpenCL one can have a program that runs across different platforms, independent of architecture or vendor.

At the moment, many algorithms are being ported to the GPU to test whether or not they bring increased performance. An example of this is the GROMACS code used for molecular dynamics. In some cases, the transition to the GPU can be done very smoothly, but others may require a fresh way of viewing the problem. Years of optimizing codes to perform extremely efficiently on the CPU may be an obstacle, as a step back must be taken.

3.5.2 CUDA architecture

The GPU's architecture is represented in fig. 3.8. It is composed of various Streaming Multiprocessors, SM's for short. Each one of these SM's contains numerous scalar processors, commercially named CUDA cores, which are a type of ALU. Each of these cores has its own private memories - registers and caches -, a memory accessible to all cores inside an SM - shared memory - and a memory accessible to all cores of all SM's - global memory. As in the CPU, the nearer a memory bank is, the smaller and faster it is.

Figure 3.8: The GPU’s architecture. The small squares are CUDA cores. Two rows of eight Streaming

processors can be seen. The remaining blocks represent memories.

As can be seen in the figure, the high density of ALU's makes the device optimal for arithmetic-intensive operations, this being one of the key differences between the architectures of a CPU and a GPU. In single precision, the peak processing power of the Tesla is 933.12 GFLOPs (see table 4.1). For comparison, let's use the AMD Opteron quad-core CPU, a state-of-the-art CPU used in supercomputers such as the Cray Jaguar. The Opteron has a peak processing power of 33.6 GFLOPs (8.4 per core), roughly 30 times less than that of the GPU [2]. So, if processing power were our only criterion, the Jaguar would need


30 times fewer processors, which would mean a decrease in the volume occupied by the infrastructure. The Tesla supports double precision, but its peak computing power plummets to 77.76 GFLOPs, roughly a twelvefold drop. Nonetheless, that is still more than twice the Opteron's single precision figure.

Figure 3.9: The Tesla C1060 board.

Also, the relative power consumed by the GPU is lower than that of the CPU. The Tesla, at full load, consumes roughly 190 Watts, while the Opteron, in the same situation, consumes about 230. So, in general, the absolute energy consumption of the GPU is toe-to-toe with that of the CPU (although there are GPU's whose power consumption is well over 200 Watts), but it is in the ratio between the rate of floating point operations performed and the energy consumption that the scale tips to the GPU side: 933/190 vs 34/230.

Returning to the Jaguar, if we used GPU's instead of CPU's, the peak power consumption would drop from 10.5MW to 0.29MW, roughly 37 times less. Highly noticeable in the power bill. Note that speaking of replacing the CPU with GPU's may lead to some misconception: the GPU is not meant to replace the CPU, but only to aid it, so not all the nodes of the cluster could be GPU's. Also, a cluster of GPU's still suffers from drawbacks. At the moment two GPU's cannot communicate directly between themselves; they have to use the CPU as a middleman for message passing. Due to the relatively small bandwidth of the physical connections between GPU's and CPU's, communication will be the performance bottleneck. It is expected that in future devices direct communication will be possible.

As data has to be loaded from memory before any operation takes place, the throughput of the GPU is limited by the memory bandwidth [7]. This is an important bottleneck. For example, let's look at a loop iteration that takes two different numbers and performs a sum and a multiplication between the two. If we only take into account the two operations, the peak power of the iteration is 933.12 GFLOPs. But, before any operation can be performed, the numbers have to be retrieved from memory. Two memory fetches for two operations gives a compute to global memory access (CGMA) ratio of 1.0. Since the global memory access bandwidth is 102 GB/sec and each float occupies 4 bytes, the rate at which operands arrive, and therefore the throughput, is limited to a small fraction of the peak.
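As a back-of-the-envelope bound (an estimate, not a measured value):

\[
\text{throughput} \le \text{CGMA} \times \frac{\text{bandwidth}}{\text{bytes per operand}}
                 = 1.0 \times \frac{102\ \text{GB/s}}{4\ \text{B}}
                 \approx 25.5\ \text{GFLOP/s} \ll 933\ \text{GFLOP/s}.
\]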


3.5.3 CUDA programming model

The key working element of the GPU is the thread. A thread is an entity that takes the instructions to be performed, along with the information of which data the instructions should use. GPU threads are different from the previously mentioned CPU threads, as they are more lightweight [12].

In the GPU, threads are executed by CUDA cores. Each core can run multiple threads at the same time, taking advantage of the unproductive times of a thread. For example, if a thread has to retrieve data from global memory, which is a long-latency operation, instead of the core sitting idle, waiting for data to work on, it starts running another thread. This way the GPU can hide long-latency operations and achieve higher performance. The higher the number of threads, the easier it is to hide these operations. That is why GPU's have thousands of threads running at the same time, in contrast with the CPU.

Threads are organized into Blocks which, in turn, are organized into a Grid. This way, we have an abstract programming model that fits the hardware model of the GPU: the thread corresponds to the CUDA core, the Block to the SM and the Grid to the GPU (fig. 3.12).

The procedure is as follows. The CPU, called the host, runs the program, typically as one CPU thread. When it encounters a part of the code that is supposed to run on the GPU, a kernel, it hands it to the GPU. According to runtime parameters, the GPU will generate the Grid. The Grid is divided, sending a certain number of Blocks to each SM. In the SM, groups of 32 threads are taken from the Blocks; each 32-thread group is called a warp. The warp divides itself among the CUDA cores, filling long-latency operations with work from other threads [7].

It is up to the programmer to assign the size of the Grid and of the Blocks. A grid can be laid out in a one or two dimensional fashion, and each block in a one, two or three dimensional manner (fig. 3.12). This multidimensional disposition may prove useful to better fit the natural indices of certain problems, as the thread ID may be used as an index.

But the values chosen for the sizes of Blocks and Grid should not be taken lightly, as different values give different performances. A good block size has to take into account the resource usage and the occupancy of the SM [7].

For example, the maximum number of blocks that can be assigned to any SM is 8 (a fixed value, independent of the device generation). But, if each block places a high demand on the SM's resources, then fewer blocks can be assigned to the SM, as there are not enough resources to be shared among blocks. The extreme case is having a single block that consumes almost all of the SM's resources, leaving less than what another block needs. In that case the SM can only digest a single block, bringing performance down.

Another important issue is assigning work to all the threads inside the SM. An SM has a limit on the number of threads it can run; let's say 1024. If we create blocks with 8 × 8 = 64 threads each, 16 blocks would be needed to occupy all 1024 threads. But only 8 blocks are allowed per SM, leaving the occupancy at half: 512 threads. Having fewer threads running means less efficiency in hiding long-latency operations


Figure 3.10: Code running in the host with a single thread. Kernels are offloaded to the GPU and run with thousands of threads.

Figure 3.11: Availability of the different types of

memory to the CUDA working structures.

and a worse performance.

Using the occupancy calculator 1 helps to visualize which combination of parameters provides the best performance.

There are other tricks that can be used to improve performance. Using, whenever possible, registers and shared memory instead of global memory also brings performance up, as less time is spent accessing memory. Accessing contiguous memory positions, similar to row-major order in C, lets the hardware coalesce the accesses [7] [12].

One limitation of this programming model has to do with the communication between threads. Threads can only communicate among themselves through the memories: threads inside a block can communicate with each other through the shared memory, and threads of different blocks can only communicate through the global memory. But, for optimized use, it is always preferable to minimize accesses to the global memory, which implies a quest for a more fitting algorithm that minimizes communication between threads of different blocks. The scope of accessible memory is expressed in fig. 3.13.

1http://developer.download.nvidia.com/compute/cuda/CUDA_OCCUPANCY_CALCULATOR.xls


Figure 3.12: Illustration of the organization of one

bi-dimensional grid into blocks and one block into

threads.

Figure 3.13: Memory layout in the device and it’s

accessibility to the processors.


Another limitation refers to synchronization. Threads may be synchronized, but only within the same block. At this moment, there is no barrier that coordinates threads across blocks and, by hardware constraints, there may never be one. This is a huge nuisance which, once again, implies a new algorithm and extra care.

3.5.4 An example

To better understand how CUDA works, an example of a sum of two matrices will be used.

First the host code. We declare the two matrices A and B and initialize them. Then we give the sizes of the Grid and of the Blocks by storing them in a CUDA built-in type, dim3. Then we allocate the needed space in the device's memory. This is done with the cudaMalloc() function, whose arguments are the address of the host variable that will point to the allocated memory on the GPU, and the size to be allocated. We call them A_d and B_d to differentiate the variables living in the device from the ones living in the host. We then have to copy both matrices from the CPU to the GPU. By telling where to copy to, from where, what size, and specifying the direction of the data flow, the function cudaMemcpy() does the job.


We then call the kernel, which is the function that runs on the GPU. It is like calling a normal function but with the dimensions of the grid and blocks inside the triple angle brackets. The GPU then does its deed and, after the calculations finish, we copy the results from the GPU back to the CPU.

dim3 dimGrid(dim_grid_x, dim_grid_y, 1);
dim3 dimBlock(dim_block_x, dim_block_y, dim_block_z);

cudaMalloc((void **) &A_d, n_cols * n_rows * sizeof(float));
cudaMalloc((void **) &B_d, n_cols * n_rows * sizeof(float));
cudaMalloc((void **) &C_d, n_cols * n_rows * sizeof(float));

cudaMemcpy(A_d, A, n_cols * n_rows * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B, n_cols * n_rows * sizeof(float), cudaMemcpyHostToDevice);

mat_sum_kernel<<< dimGrid, dimBlock >>>(A_d, B_d, C_d, n_cols, n_rows);

cudaMemcpy(C, C_d, n_cols * n_rows * sizeof(float), cudaMemcpyDeviceToHost);

The GPU kernel consists of a function or sub-routine. It has the special keyword __global__ to tell the compiler that this function is called from the host but runs on the device. Since the GPU has thousands of threads running simultaneously, each entry of the output matrix, C_d, can be computed at the same time, so there is no need for 'for' loops. Also, the thread's number is used as the index to access the entries of matrices A and B.

__global__ void mat_sum_kernel(float *A_d, float *B_d, float *C_d, int n_cols, int n_rows)
{
    int trd_id_x = blockIdx.x * blockDim.x + threadIdx.x;
    int trd_id_y = blockIdx.y * blockDim.y + threadIdx.y;

    if (trd_id_x < n_cols && trd_id_y < n_rows)
        C_d[trd_id_y * n_cols + trd_id_x] = A_d[trd_id_y * n_cols + trd_id_x]
                                          + B_d[trd_id_y * n_cols + trd_id_x];
}

This is a first implementation. Optimizations could be performed to increase performance, such as coalescing the accesses to global memory. The example of a matrix multiplication found in [7] is recommended.


Chapter 4

Implementations of the LB methods

All the implementations were made using the C language, its extensions and libraries. OpenMP and OpenMPI were used, respectively, for the threads and the message-passing parallel programming models. All codes were compiled with gcc 4.3.2, except for the CUDA application, compiled with nvcc 2.2. In terms of machines, the OpenMP and CUDA implementations used a computer with four Intel Xeon processors and four GPU's (three Tesla C1060 and one Quadro FX1800). The properties of the GPU's can be found in table 4.1. For the OpenMPI application, a cluster of four AMD Opteron 252 processors was used.

To measure and compare performances we will basically use the speedup and the MLUPs. The speedup (eq. 3.1) should only be used to compare implementations on the same computer. To compare implementations across different architectures we will use the MLUPs 'unit'. MLUPs stands for Mega Lattice Updates per Second: the number of lattice nodes multiplied by the number of iterations, divided by the elapsed time of the temporal loop.
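Written out (with symbols chosen here for clarity):

\[
\text{MLUPs} = \frac{N_{nodes} \times N_{iter}}{t_{loop} \times 10^{6}},
\]

where $N_{nodes}$ is the number of lattice nodes, $N_{iter}$ the number of time iterations and $t_{loop}$ the elapsed time of the temporal loop in seconds.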

First, a few remarks about the code.

There are many possible arrangements for storing the PDF's of all particles of all nodes. A very organized way would be to create a structure where each node holds its own particles' properties. We did not implement it this way. Instead, and independently of the number of spatial dimensions, we used a one-dimensional array where, for each node, we stored the PDF's of the particles in the order of the numeration given (fig. 4.1). This minimizes memory accesses.

To access one specific particle of a certain node we calculate its location in the array. For example, in the bi-dimensional case, if we have L nodes in the horizontal direction and H in the vertical, and we number the nodes from left to right, bottom up, then the index of particle i of the node with coordinates x and y is:

idx = (y × L + x) × m + i (4.1)


Table 4.1: General properties of the GPU

                                                Tesla C1060
  Global memory [GB]                            4.29
  Number of CUDA cores                          240
  Number of multiprocessors                     30
  Shared memory per block [bytes]               16,384
  Number of 32-bit registers per block          16,384
  Clock rate [GHz]                              1.3
  Memory bandwidth [GB/sec]                     102
  Maximum power consumption [Watt]              187.8
  Warp size                                     32
  Maximum number of threads per block           512
  Maximum size of each dimension of a block     512 × 512 × 64
  Maximum size of each dimension of a grid      65535 × 65535 × 1

Figure 4.1: For a given node, each particle's PDF is stored one after the other, according to the numeration used for each direction. Nodes also follow a lattice numeration.

Here, idx is the location in the memory array and m the number of links per node (DnQm).
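As a concrete illustration, the indexing of eq. 4.1 can be wrapped in a small helper (a sketch; the names are illustrative, not the identifiers used in this work):

/* index of the PDF of particle i at node (x, y), for a lattice with L nodes
   per row, stored node by node with m directions per node (DnQm) */
static inline int pdf_idx(int x, int y, int i, int L, int m)
{
    return (y * L + x) * m + i;
}

/* example: read particle i of node (x, y) from the PDF array f      */
/*     float fi = f[pdf_idx(x, y, i, L, m)];                          */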

The propagation step and the boundary conditions can be applied in different ways. One option is to fix a node and propagate out all the particles living in that node (as if the node exploded). Or we can fix the node and propagate into it the particles coming from the surrounding nodes. We chose the latter. Also, to determine which particle comes from or goes to which node, we can take the node position and, with the aid of the translation versors ei, calculate the new node of each particle. This requires numerous calculations per iteration. A better way is to use a fixed offset per direction to obtain the index, in memory, of the new location. For example, if a particle moves to the node at its right, all we need to do is add, to the current node's address in memory, the offset needed to reach the memory entry of the node to the right, in this case m (DnQm). Since each of the directions has a unique constant offset, we avoid performing the translation calculation. We also take advantage of this method to perform the bounce-back and periodic boundary conditions. For a bi-dimensional situation this poses no problem but, for a three dimensional one, the number of offsets needed makes it a somewhat clumsy approach.
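A minimal sketch of this pull-style propagation with precomputed offsets follows (assuming a D2Q9 lattice stored as in eq. 4.1; the direction numbering, names and the handling of frontier nodes are illustrative, not the code used in this work):

/* direction vectors of an illustrative D2Q9 numbering */
#define Q 9
static const int ex[Q] = { 0, 1, 1, 0, -1, -1, -1,  0,  1 };
static const int ey[Q] = { 0, 0, 1, 1,  1,  0, -1, -1, -1 };

/* pull-style propagation over the bulk of an L x H lattice stored as in eq. 4.1;
   frontier nodes (periodic / bounce-back) are assumed to be handled elsewhere */
void propagate(const float *f_old, float *f_new, int L, int H)
{
    int off[Q];
    for (int i = 0; i < Q; i++)
        off[i] = -(ey[i] * L + ex[i]) * Q;   /* fixed memory offset per direction */

    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < L - 1; x++) {
            int idx = (y * L + x) * Q;
            for (int i = 0; i < Q; i++)
                f_new[idx + i] = f_old[idx + i + off[i]];   /* read from the upstream node */
        }
}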


Typically, to deal with generic fluid scenarios, each node is assigned a boolean value that tells whether it is a fluid node or an obstacle node. So, per node, one has to evaluate the node's nature and act accordingly. Since our test cases have the bulk free of obstacles, we avoided these conditionals. Independently of the approach chosen, the use of a secondary array to store the propagated PDF's is unavoidable.

4.1 Serial implementation

The serial implementation is a straightforward transcription of the method's flowchart. In pseudocode, one can describe the process as

initialize_pdfs(initial properties)

for t < tmax do

collision()

propagation()

boundary_conditions()

end for

As will be seen later, the collision step consumes about three quarters of the total time of the temporal

loop.

4.2 Shared memory implementation: OpenMP

The best approach to parallelizing LB with OpenMP would be to create multiple threads at the beginning of the temporal loop (fig. 2.2). Each thread would execute the collision and propagation of a certain number of nodes and, at the end of each iteration, the threads would synchronize through a barrier. After the temporal loop is over, all the threads would join into a single thread, responsible for finishing the program. This way we would reduce the overhead of creating and destroying threads, as there would be only one creation and one destruction. Unfortunately, this approach could not be implemented due to restrictions of OpenMP's parallel construct: only one if clause is supported and unpredictable behavior occurs when branching in or out of the region (like when calling certain functions), to name a few.

Of course, one could 'streamline' the code to try to fit within the construct's rules. But our second approach, although with a higher overhead, achieved the theoretical speedups, making a complete overhaul of the code unnecessary.

The second approach consists in creating the team of threads at the beginning of both spatial loops (fig. 2.2). Threads are joined at the end of each loop, meaning that, per iteration, there is overhead due to two creations and two destructions of threads. The impact of these overheads on the total time becomes negligible if the number of nodes in the mesh is high and the number of threads is not too high (they are normally in the order of the tens). This holds true for most cases. A sketch of this approach is given below.
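A minimal sketch of the second approach (the helper routines collide_node(), stream_node() and boundary_conditions() are hypothetical stand-ins for the actual collision, propagation and boundary code):

#include <omp.h>

void collide_node(float *f, int x, int y, int L);
void stream_node(const float *f, float *f_new, int x, int y, int L);
void boundary_conditions(float *f_new, int L, int H);

void time_loop(float *f, float *f_new, int L, int H, int tmax)
{
    for (int t = 0; t < tmax; t++) {
        /* first spatial loop: team of threads forked here, joined at the
           implicit barrier at the end of the parallel for */
        #pragma omp parallel for schedule(static)
        for (int y = 0; y < H; y++)
            for (int x = 0; x < L; x++)
                collide_node(f, x, y, L);

        /* second spatial loop: a new team is forked and joined again */
        #pragma omp parallel for schedule(static)
        for (int y = 0; y < H; y++)
            for (int x = 0; x < L; x++)
                stream_node(f, f_new, x, y, L);

        boundary_conditions(f_new, L, H);   /* serial part, small fraction of the work */
    }
}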

4.3 Distributed memory implementation: MPI

In LB, the collision step is completely local, independent of the neighboring nodes. It is in the streaming step that interaction between nodes occurs, and it is during this step that one needs to use the message-passing standard when using a distributed memory system.

The assignment of work to the nodes has to be done in a way that minimizes the amount of communication needed and, at the same time, maintains similar amounts of work for every processor (load balancing). Minimizing communication is the same as minimizing the number of nodes that have to share data across processors. So, for example, if we had a rectangular domain with height H and length L = 4*H, dividing along the length would imply that L nodes need to exchange data, while a transverse division means only H. Since both ways give the same number of nodes in each sub-domain, the latter provides the better solution, as there are fewer nodes exchanging data.

Load balancing is not as simple as assigning the same number of nodes to each processor, as some nodes require different treatment; examples of this are boundary or obstacle nodes. This leads to an unbalanced work load. The problem is even worse on irregular meshes. The typical solution in these cases is to assign a weight to each node, according to the work it demands from the processor, and then divide the domain in such a way that the sums of the weights of the nodes of each sub-domain hold identical values. For more on this see [3].

Independently of how the division is made, each sub-domain will have three sections: the bulk, the frame and the frame-buffer.

Figure 4.2: Domain divided in two parts, one white, one gray. Each part is assigned to a node of the cluster. In the case of the white sub-domain, there is an extra frame, where data from the frontier cells of the neighboring sub-domain are made available: ghost cells.

In the bulk everything proceeds normally, as all the data needed for propagation is in the sub-domain where these nodes live.


The frame consists of the lattice nodes in contact with the border of the domain division. These are the last nodes of the sub-domain to be updated but, in order to do so, they have to use data from nodes living in another sub-domain. This is why there is the frame-buffer. Sometimes called ghost cells, these nodes serve to hold data from the other domain and make it available to the frame nodes. At the end of every time iteration the ghost cells must be updated and the loops synchronized. This is where message passing is used.

The MPI implementation had only two nodes of the cluster available. But since each node had two cores, we used all four processors as four separate MPI nodes, ensuring that each core has its own memory bank and that cores do not communicate through shared memory. That way we could use up to four processors.

The code was written so that the domain is divided into a number of sub-domains matching the number of processors available. Since the lattice is regular and there are no obstacles in the midst of the fluid, a regular division suffices. One of the processors is the master, responsible for assigning the different sub-domains to the processors, for starting the temporal loop, for guaranteeing that all messages have been sent and received at the end of each iteration, and for finalizing the program.

According to the sub-domain that a processor holds, the proper work is assigned. For example, the processor responsible for the top sub-domain in the lid-driven cavity must apply the top boundary condition, as well as the two side bounce-backs. The bottom sub-domain, on the other hand, will have three bounce-backs to apply.

MPI_Init()
initialize_flow_properties()
allocate_vars()
if rank == 0 then
    initialize_bulk()
    initialize_BC()
    initialize_lid()
else
    initialize_bulk()
    initialize_BC()
end if
for t < tmax do
    if rank == 0 then
        collision()                      (loops over all the nodes of the sub-domain)
        MPI_non_blocking_send()
        MPI_non_blocking_receive()
        propagation()
        boundary_conditions()
        MPI_barrier()                    (guarantee that all sends and receives completed)
        propagation_frame()
        CF_frame()                       (boundary conditions on the frame)
    else if rank == num_processors - 1 then
        similar to rank 0
    else
        similar to rank 0
    end if
end for

4.4 Massively Parallel Processors: CUDA

Since a GPU can have hundreds of thousands of threads running 'simultaneously', we make a unique association between each thread and a node, using the thread's ID. Each thread then performs the collision and streaming phases of that node. The spatial loops that sweep across all the nodes are no longer needed, although one might still use a loop over all the directions of a node. An important fact, valid for all approaches, is that the temporal loop must be held by the CPU, as the GPU cannot synchronize all the threads at the end of each iteration.

Our first approach consisted of a direct conversion of the serial version: we used the same data structures as in the CPU, with the difference of replacing the indices used to access the arrays with the thread IDs.

First, and before entering the temporal loop, we copy all the needed data to the GPU's global memory. Then we perform the temporal loop, calling the kernel in every iteration and using a barrier to prevent the CPU from proceeding to the following time step. This is done because the GPU returns control to the CPU as soon as the grid is generated, meaning that the CPU would otherwise proceed with the program. So we have to guarantee that the GPU has finished its task within the same time step in which it was called. After the temporal loop is over, we copy the data back to the host.

initialize_flow_properties()
cuda_alloc_memory()
cuda_copy_host_to_device()
for t < tmax do
    kernel()
    cuda_barrier()
end for
cuda_copy_device_to_host()
store_macroscopic_properties()

The kernel itself, per thread:

node_index = threadID
collision(node_index)
propagate_outwards(node_index)
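A minimal sketch of this host-side structure (array names, kernel name and launch configuration are illustrative assumptions, not the actual code of this work):

#include <cuda_runtime.h>

__global__ void lb_step(float *f, float *f_new, int L, int H);

void run_on_gpu(float *f_host, int L, int H, int tmax)
{
    size_t bytes = (size_t)L * H * 9 * sizeof(float);
    float *f_d, *f_new_d;

    cudaMalloc((void **)&f_d, bytes);
    cudaMalloc((void **)&f_new_d, bytes);
    cudaMemcpy(f_d, f_host, bytes, cudaMemcpyHostToDevice);

    dim3 block(128, 1, 1);
    dim3 grid((L * H + block.x - 1) / block.x, 1, 1);

    for (int t = 0; t < tmax; t++) {
        lb_step<<<grid, block>>>(f_d, f_new_d, L, H);
        cudaThreadSynchronize();                        /* barrier: wait for the kernel */
        float *tmp = f_d; f_d = f_new_d; f_new_d = tmp; /* swap the two PDF arrays      */
    }

    cudaMemcpy(f_host, f_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(f_d);
    cudaFree(f_new_d);
}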

Subsequently we delved into more careful approaches, taking into account the "rules of thumb for higher speedups": coalesced memory accesses and low global memory usage.

First we reshaped the way the PDF's are stored in the array. To exploit coalescence, we store the values of a specific direction for all the domain nodes, then sweep the domain again for the following direction, and so on. That way, when threads access the global memory, they access consecutive data.
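A sketch of this direction-major layout and of a coalesced read and write, with a hypothetical collision kernel skeleton (names are illustrative):

#define Q 9

/* PDF of direction i at node 'node' (node = y*L + x), direction-major layout */
__device__ int soa_idx(int i, int node, int n_nodes)
{
    return i * n_nodes + node;
}

__global__ void collide(float *f, int n_nodes)
{
    int node = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per node */
    if (node >= n_nodes) return;

    float fi[Q];
    for (int i = 0; i < Q; i++)
        fi[i] = f[soa_idx(i, node, n_nodes)];   /* consecutive threads read consecutive words */

    /* ... collision performed on fi[] in registers ... */

    for (int i = 0; i < Q; i++)
        f[soa_idx(i, node, n_nodes)] = fi[i];   /* coalesced write-back */
}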

We copy the PDF's from global memory to shared memory and then perform the collision step. After that, we have a situation similar to an MPI sub-domain: as threads of different blocks cannot communicate through shared memory, the propagation step needs ghost cells in shared memory to store the values of neighboring PDF's. But, since there is no global synchronization between threads, we need to break the GPU code into two kernels, split just after the collision.

Although this was a good improvement over the first implementation, it hardly reached the performance of the solution proposed by [16], whose approach is highly efficient. Following in his footsteps, we separate each direction of the lattice into one-dimensional arrays. Each of these arrays holds the values of the PDF's of the particles that move in a certain direction, for all the nodes.

Figure 4.3: Each direction of propagation of a particle has an array. For each array, nodes are assigned

consecutively.

We make use of three kernels: one performs the collision step and part of the propagation, another completes the propagation and a third is responsible for the boundary conditions.

The first kernel uses one-dimensional blocks. We start by copying all the PDF's of each node to the registers of the thread associated with the node. We guarantee that this copy takes advantage of coalesced accesses since, for example, two threads of the same block, consecutive in their ID numbers, will access, at the same time, the two PDF's of a certain direction that are consecutive in memory; the hardware will notice this and coalesce the access. Then each thread performs the collision step, followed by part of the streaming step. This part is done with the aid of shared memory. We allocate arrays of shared memory, one for each direction and with the size of the block, and pass all the values from the registers to this shared memory, taking the opportunity to write them into the place to which the PDF would propagate.

For example, the PDF of a particle whose direction of travel is to the east moves to the following node; so, when writing to the shared memory, we write it to the following node. But the last node of the block cannot write to the first node of the following block, as threads cannot access the shared memory of a different block. So we store the particles leaving the block through the right frontier (the last node of the block) in the first node of the block, as this node could not receive its new PDF's and has those directions empty. This is done for all particles with horizontal components of propagation, so a PDF that moves in the north-east direction only moves to the node on the right, remaining in the same vertical position. The vertical movement is done in the following step: when copying the values from shared memory back to global memory, we take the opportunity to perform the vertical propagation. The particles that move in the north-east direction and have previously moved to the right node now move to the node above, completing the propagation of vertical and diagonal PDF's.

This ends the first kernel. We still have to exchange the PDF's that could not horizontally leave the block and were 'imprisoned' in it. The second kernel does this: it moves the PDF's stored at the beginning and at the end of each of the previously used blocks into their correct positions. It only uses global memory, as it is an unavoidable one-read-one-write operation.

The third kernel is a direct conversion of the serial boundary conditions.
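As an illustration of the in-block horizontal shift performed by the first kernel, a minimal sketch for the east-moving population only (block size and names are illustrative, and the actual implementation handles all directions at once; this is not the code used in this work):

#define BLOCK 128

__global__ void shift_east(float *f_e, int n_nodes)
{
    __shared__ float s[BLOCK];

    int node = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (node < n_nodes) ? f_e[node] : 0.0f;

    /* write into the slot the PDF propagates to; the value leaving through the
       right frontier wraps into slot 0, to be relocated by the second kernel */
    s[(threadIdx.x + 1) % BLOCK] = val;
    __syncthreads();

    if (node < n_nodes)
        f_e[node] = s[threadIdx.x];
}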


Chapter 5

Results

5.1 Test cases

In order to benchmark both the LGBK and the MRT codes, two cases were used: the Poiseuille flow and

the lid-driven cavity.

5.1.1 Poiseuille Flow

The Poiseuille flow consists of a fluid between two infinite plates, initially at rest, subjected to a pressure gradient between the inlet and the outlet. This makes the fluid move, reaching a steady state with a parabolic velocity profile. For this flow there is an exact solution of the Navier-Stokes equations (eq. 5.1).

u(y) = \frac{G H^2}{8\nu}\left[1 - \left(\frac{2y}{H}\right)^2\right],
\qquad -\frac{H}{2} < y < \frac{H}{2}    (5.1)

Here G = P/(ρL); H, L and P are, respectively, the height of the channel, its length and the pressure drop between the inlet and the outlet.

This flow is mainly used to ’calibrate the flow viscosity’ [14] : we select a viscosity and compare the

attained profile with that given by eq. 5.1.

The two plates are simulated with the first order bounce-back boundary condition, while the other two frontiers are considered periodic (infinite domain). The effect of the pressure is translated into a bias, g, applied to the PDF's during the collision step [14].

f_i' = f_i - \omega\,(f_i - f_i^{eq}) + g_i    (5.2)

According to [14], this bias, for a bi-dimensional case, is as follows


g_1 = 4g_2 = 4g_8 = \tfrac{1}{12}\, g \rho

g_5 = 4g_4 = 4g_6 = -\tfrac{1}{12}\, g \rho

g_0 = 4g_3 = 4g_7 = 0

The bias is calculated by considering the momentum that is needed to be equivalent to a given pressure drop: we replace the pressure at the inlet and outlet by a forcing term in every node of the domain (bulk and frontier nodes). Calculating the momentum added to each node according to the expressions given above, we can see that there is only a horizontal component, pulling in the direction of the lower pressure.

g = \frac{8\, U_0\, \nu}{H^2}    (5.3)

We followed Succi's suggestion [14] and used a mesh of 128 × 32 with ν = 0.1, expecting to obtain the parabolic profile with a maximum velocity of 0.1.

Figure 5.1: Longitudinal velocity profile obtained for the Poiseuille flow test case.

It can be seen that, near the walls, the profile drops abruptly instead of smoothly. This is the result of the first order accurate boundary condition; the effect is independent of the mesh size.

To know when we have reached the steady state solution, we have used the difference between the

analytical solution and the numerical one. Representing the velocity in the horizontal direction as u and

Nt as the total number of points, the error (or residue) used was the following

e = \frac{\sum |u_{exact} - u_{num}|}{N_t}    (5.4)

Depending on the mesh used, the convergence criterion may lead to smoother or more oscillatory curves. For example, if we set 10−6 as the criterion for a 128 × 32 mesh, down to about 2 × 10−4 the error would


Figure 5.2: The error (residue) decreasing with the number of iterations. The criterion for convergence was set to 10−6. It took 9556 iterations in a 256 × 64 mesh.

Figure 5.3: Bidimensional plot of the horizontal velocity of a Poiseuille flow for a converged case. The maximum velocity is in the middle of the channel section, and the velocity profile is constant throughout the channel.

decrease smoothly, as in the previous graphic. From that value onwards, the residue oscillates. The number of iterations needed to reach the expected residue increases exponentially when this behaviour appears. The same criterion for a 256 × 64 mesh led us to the smooth plot of residue versus iteration number presented earlier.

5.1.2 Lid-driven cavity flow

The lid-driven cavity is a rich testing environment with simple boundary conditions: three no-slip walls and a sliding wall at the top.

We used the cavity to benchmark the solution by comparing the velocity profiles with the ones obtained by Ghia 1, for Reynolds numbers of 100 and 1000. We use the Reynolds number of 100 for the LGBK method and 1000 for the MRT; this is done only to avoid repeating the same graphics twice.

Both boundary conditions were simulated with first order approximations. The lid velocity, in lattice units, is set to 0.1, the lattice is 128 × 128 and τ is set to 0.5384. For the MRT model we used the following relaxation rates

s_1 = s_4 = s_6 = 0, \quad s_2 = 1.1, \quad s_3 = 1.0, \quad s_5 = s_7 = 1.2, \quad s_8 = s_9 = \frac{1}{\tau}    (5.5)

For convergence, we have used the difference between the values of two consecutive iterations. This

holds as the flow is expected to reach a steady state situation. If n + 1 represents the iteration of t + ∆t

1 More accurate solutions for the cavity have been obtained since Ghia, such as the ones from Peyret and Botella. Nonetheless, since the objective is just to verify the solution and not its accuracy, Ghia's solution suffices.


Figure 5.4: Bi-dimensional plot of the moduli of the velocity for a Reynolds number of 1000.

Figure 5.5: Horizontal and vertical velocity profiles for, respectively, a longitudinal and a transverse

section. Obtained with SRT for a Re=100.

and n the iteration at t, then

e = \frac{\sum_i |u_i^{n+1} - u_i^{n}|}{N_t}    (5.6)


Figure 5.6: Horizontal and vertical velocity profiles for, respectively, a longitudinal and a transverse

section. Obtained with MRT for a Re=1000.

Figure 5.7: A solution is considered converged, for the cavity, when the residue is inferior to 10−7. We have used a logarithmic scale for both axes (LN(residue) versus LN(iteration)) for visualization purposes: oscillations in the residue were small enough to look like a straight line if the plot were residue vs iteration number.

In terms of stability, it is easy to see how much improvement the MRT method brings. For the exact same set of properties and iterations, we compare the velocity values in the vicinity of the left wall, where the gradients are higher, for a relatively high Reynolds number. Figure 5.8 shows the numerical instabilities that appeared in the velocity profile.

The numerical instabilities in the LGBK scheme propagate, causing errors that collapse the numerical stability. In less than 1000 iterations all PDF's are NaNs (Not a Number). These NaNs are caused by PDF values dropping below zero, which has no physical meaning and trashes the results. The MRT scheme manages to control these instabilities and exhibits a more stable behavior.

A few remarks are in order. First, in the bi-dimensional serial code, the collision step takes about 77% of the time and the streaming step and boundary conditions take about 21%. Since both steps can be


Figure 5.8: Horizontal velocity profile of the top half nodes in the vicinity of the wall (x = 1). The code is set to run 100 iterations for a Reynolds number of 5848.

parallelized, Lattice Boltzmann methods offer a parallelizable fraction of 98%. What this means is that, according to Amdahl's law (see appendices), if we keep the problem size constant, the maximum theoretical speedup that can be achieved is 50. But, to obtain it, more than 10000 processors would be needed; for a speedup of 47, 'only' 767 (a saturation curve).
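For reference, Amdahl's law with a parallel fraction p = 0.98 gives (a quick check of the numbers quoted above):

\[
S(N) = \frac{1}{(1-p) + p/N}, \qquad
S_{max} = \lim_{N\to\infty} S(N) = \frac{1}{1-p} = 50, \qquad
S(767) = \frac{1}{0.02 + 0.98/767} \approx 47.
\]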

Our MRT implementation was about 30% slower than the SRT. This means that some optimization can still be performed in order to reach the 15% overhead reported by [8].

For the three-dimensional case (3D lid-driven cavity), the collision and propagation took, respectively, 84% and 16% of the time. This means that not only is the overhead of initializing and finalizing the variables of the program negligible, but also that there is a slight edge of about 2% in the parallelizable fraction of 3D cases over 2D ones. This is in accordance with [14]. This little improvement translates, for the same problem size and according to Amdahl, into a nearly linear speedup with the number of processors.

Depending on the processor used, a serial implementation typically achieves values ranging from around 0.5 to 2 MLUPs. A 3D case offers lower MLUPs, as each node has more particles to operate on.

5.2 Shared Memory parallelization: OpenMP

To run the OpenMP tests we used the four CPU's of the Tesla machine, which has a shared memory architecture. We compare the elapsed time of the program for different mesh sizes and different numbers of processors. Each processor is given, at most, one thread, so we generate up to four threads. All the values were obtained with the same parameters and the same number of iterations (see table 5.1). Fixing a mesh size, we can see a linear evolution with the number of processors (fig. 5.9).


Table 5.1: Speedups obtained for the OpenMP implementation for up to four processors.

  Lattice    Serial (sec.)    2 procs    3 procs    4 procs
  32²        3.13             1.84       2.85       3.48
  64²        12.47            1.95       2.97       4.0
  128²       50.0             2.02       2.82       3.89
  256²       200.04           2.02       2.87       3.77
  512²       798.99           2.02       3.02       3.98
  1024²      3188.77          2.08       3.03       3.91

Figure 5.9: Code scaling with the number of cores.

As expected, increasing the number of cores reduces the amount of time needed to complete the program. The linear behavior, predicted by both Amdahl's and Gustafson's laws, is verified. Bigger meshes are closer to the speedup trend line; that is because the overhead of generating/destroying threads is hidden by the amount of work each thread has to perform.

Note that, for a fixed mesh size, linear behavior with the introduction of more processors is not expected ad infinitum. In an extreme situation, where we had as many processors as lattice nodes, each processor would be in charge of only one node. If the lattice stayed the same, there would be no way to increase the speedup beyond this point (Amdahl's law); and the overhead of generating so many threads would probably kill all the speedup anyway. But if we increase both the number of processors and the problem size, the linear behavior is expected to last as long as overheads and memory accesses allow it (Gustafson's law).

If one tries to use more threads per processor, in the hope that instruction-level parallelism can make


Table 5.2: Speedups for different mesh sizes and cluster nodes.

  Lattice    Serial [s]    2 nodes    3 nodes    4 nodes
  32²        11.7          1.80       1.83       2.13
  64²        42.7          1.92       2.79       3.14
  128²       187.2         2.08       3.28       4.28
  512²       2995.2        2.074      3.28       4.14
  1024²      11980.8       2.02       3.1        4.3

use of them and reduce CPU idle times, we are confronted with, at best, slightly worse results. For example, when using one processor, if we assign two threads to it there is a slight increase in the elapsed time. That is due to the extra overhead and to the swapping of the threads running on the processor. If we assign three threads to the processor, the elapsed time is better than with two threads, but still worse than with only one thread.

In terms of MLUPs, the total MLUPs achieved by the system is roughly the MLUPs of a single processor scaled by the total number of processors used. So, the top performance achieved when using the four CPU's is 6.56 MLUPs.

5.3 Distributed memory : MPI

The times to run a simulation, for different meshes, can be found in table 5.3.

With small meshes we can verify that, even with 4 nodes, the improvement is only a factor of two. That is due to the fine granularity present in small meshes. To achieve coarse granularity we have to use bigger meshes; only then is the speedup close to the number of processors used. Note that speedups slightly higher than the number of processors can occur due to caching effects in the processors.

We will now test the effect of communication with rectangular domains. For that purpose, with the

same mesh, we will split the domains longitudinally and transversely and compare the different elapsed

times. We will use a fixed lattice size, so that the number of nodes in each sub-domain remains constant.

That way, different wall clock times are the result of communication time (as long as this difference is

bigger than the small fluctuation of different runs of the same program in the same conditions).

We started with a 128 × 1024 lattice, where the longer direction has 8 times more nodes than the shorter one. When cutting along the longer direction, the runtime increases by only 0.6 seconds. We then used a 128 × 10240 mesh, where the difference between the numbers of frontier nodes is a factor of 80; it corresponds to a 5.3 second increase. We continued this type of procedure and the conclusion was that the way the domain is divided does influence the communication time but, in absolute terms, the amount of time required is negligible due


Table 5.3: Speedups for different mesh sizes and cluster nodes.

  Lattice    SRT      MRT     SRT
  32²        0.99     0.82    2.05
  64²        3.78     2.97    8.2
  128²       12.3     6.6     27
  256²       28.59    9.25    65.5
  512²       42.38    11.1    83.2
  1024²      47.45    11.5    84.56

to the high amounts of computational work per communication event.

With the four-node cluster we achieved a maximum of 3.88 MLUPs. It is lower than the value obtained with OpenMP, but that is because each individual processor of the cluster is 'slower' than those used for the OpenMP implementation: the same serial code achieves 1.64 MLUPs on a single processor of the Tesla machine and 0.88 MLUPs on a single processor of the cluster, roughly half. Once again, the MLUPs achieved by the cluster can be approximated by multiplying the number of processors in the cluster by the MLUPs of a single processor. The velocity fields predicted with the serial, OpenMP and MPI codes are virtually the same.

5.3.1 Massively parallel processors: CUDA

With the first approach, i.e., the literal porting of the serial code to the GPU, the highest speedup achieved for the SRT method was 27. This speedup is almost three times the one obtained, in a similarly crude implementation, by [11], who reached only 9. The two values differ due to the hardware that was used: we used a newer device with higher computational power. This is representative of the scalability that comes with newer generations of hardware.

In terms of MLUPs, the speedup of 27 translates into 45 MLUPs. Without much effort, a single GPU achieved the performance that would otherwise require about 51 cluster nodes or 27 of the shared-memory processors (45/0.88 ≈ 51 and 45/1.64 ≈ 27).

Comparing the SRT and the MRT implementations, the latter falls somewhat behind: its peak performance is 12 MLUPs, dropping to near half relative to the SRT.

The second implementation, using shared memory and coalesced accesses, achieved a maximum speedup of 44 for the SRT, peaking at 86 MLUPs, about double the previous implementation.
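To illustrate what a coalesced access pattern looks like, a minimal D2Q9 SRT kernel sketch is given below. It is not the kernel used in this work (in particular, it does not use shared memory): it assumes a structure-of-arrays layout, periodic boundaries and single precision, and all the names are illustrative.

#include <cuda_runtime.h>

#define Q 9

/* D2Q9 velocity set and weights, kept in constant memory */
__constant__ int   cx[Q] = { 0, 1, 0, -1, 0, 1, -1, -1, 1 };
__constant__ int   cy[Q] = { 0, 0, 1, 0, -1, 1, 1, -1, -1 };
__constant__ float w[Q]  = { 4.f/9.f, 1.f/9.f, 1.f/9.f, 1.f/9.f, 1.f/9.f,
                             1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f };

/* Structure-of-arrays index: for a fixed direction q, consecutive x values are
 * consecutive in memory, so consecutive threads of a warp touch consecutive
 * addresses and the loads and stores are coalesced. */
__device__ __forceinline__ int soa(int q, int x, int y, int nx, int ny)
{
    return (q * ny + y) * nx + x;
}

/* One SRT (BGK) collide and pull-stream step with periodic boundaries. */
__global__ void lbm_srt_step(const float *f_in, float *f_out,
                             int nx, int ny, float omega)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    float f[Q], rho = 0.f, ux = 0.f, uy = 0.f;

    /* pull: gather the distributions that stream into node (x, y) */
    for (int q = 0; q < Q; ++q) {
        int xs = (x - cx[q] + nx) % nx;
        int ys = (y - cy[q] + ny) % ny;
        f[q] = f_in[soa(q, xs, ys, nx, ny)];
        rho += f[q];
        ux  += f[q] * cx[q];
        uy  += f[q] * cy[q];
    }
    ux /= rho;
    uy /= rho;

    /* relax every distribution towards the local equilibrium */
    float usq = ux * ux + uy * uy;
    for (int q = 0; q < Q; ++q) {
        float cu  = cx[q] * ux + cy[q] * uy;
        float feq = w[q] * rho * (1.f + 3.f * cu + 4.5f * cu * cu - 1.5f * usq);
        f_out[soa(q, x, y, nx, ny)] = f[q] + omega * (feq - f[q]);
    }
}

With a layout like this, warps mapped along x read and write contiguous segments of global memory, which is the kind of access pattern the coalesced implementation aims for.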

But it was with the approach introduced by [16] that the maximum speedup was obtained. This highly optimized method allowed a maximum speedup of 110, or 180 MLUPs.


Table 5.4: Comparison of the peak values in the three implementations.

           OpenMP    MPI     CUDA
Speedup    4.0       4.28    110
MLUPs      6.56      3.88    180

Figure 5.10: The increase in performance, up to a certain point, with the increase of the lattice dimensions.

Independently of the implementation, the GPU achieves its peak performance for bigger lattices. We also see that, beyond a certain point, it saturates, as expected.

Overall, the maximum MLUPs achieved with the three parallel approaches are summarized in Table 5.4.

Figure 5.11 illustrates the discrepancy between values.

As can be seen, the GPU offers a great edge when it comes to Lattice Boltzmann methods and the like. A single GPU could outperform all the other approaches: to achieve the same performance as the GPU, we would need 110 CPUs or 205 cluster nodes (for example, approximately 28 machines, each with 4 processors).


Figure 5.11: Comparison between the three implementations (MLUPs versus Lattice² for MPI, OMP and CUDA). The filled nodes use the scale on the left, while the others use the scale on the right.


Chapter 6

Conclusion

In this work we implemented two Lattice Boltzmann numerical schemes: the LGBK and the MRT. These implementations were initially written in serial and then ported to three different parallel platforms: a shared-memory architecture, a distributed-memory architecture and a GPU, using, respectively, OpenMP, MPI and CUDA. All the codes were verified with two classical benchmarks: the Poiseuille flow and the lid-driven cavity.

The LGBK method was found to be the fastest but, for limited mesh sizes, should only be used in flows with low Reynolds numbers, typically laminar cases. The MRT method was slower than the LGBK but compensated with a much more stable nature, allowing computations at high Reynolds numbers on relatively small meshes; it is a better choice for turbulent simulations.

The parallel nature of the Lattice Boltzmann methods revealed itself in every parallel implementation, as high, scalable speedups were achieved (limited, of course, by the hardware available). It was found that, independently of the memory architecture used, the code scaled linearly with the number of processors added. It was also concluded that the GPU achieved the best performance of the three implementations: a single GPU card reached a performance that could only be equaled by twenty-eight machines, each with four processors. Moreover, in terms of efficiency, whether energetic, economic or in terms of room space, the GPU was by far the best solution.

Future work should extend the implementations to three dimensions, port them to the three parallel approaches, especially to the GPU, and build an implementation that could test how a cluster of GPUs would perform. It is also necessary to compare these parallel results with the ones achieved for more complex flows and geometries. That way, one could assess whether the main competitive advantages of the Lattice Boltzmann methods - the ability to adapt to dispersed media and to irregular, time-changing boundary conditions, and their highly parallel nature - remain a competitive edge in parallel implementations.


Bibliography

[1] C. K. Aidun and J. R. Clausen. Lattice-Boltzmann method for complex flows. Annual Review of Fluid Mechanics, 42(1):439–472, January 2010.

[2] Arthur S. Bland and Ricky A. Kendall. Jaguar: The world’s most powerful computer. 2009.

[3] Jorg Matthias Bernsdorf. Simulation of complex flows and multi-physics with the lattice-Boltzmann method. 2008.

[4] Carlo Cercignani. The Boltzmann Equation and its Applications. Springer-Verlag, 1988.

[5] Barbara Chapman, Gabriele Jost, and Ruud van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). The MIT Press, October 2007.

[6] David E. Culler and Jaswinder Pal Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1999.

[7] David Kirk and Wen-mei Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010.

[8] D. d’Humieres, I. Ginzburg, M. Krafczyk, P. Lallemand, and L. S. Luo. Multiple-relaxation-time lattice Boltzmann models in three dimensions. Philosophical Transactions: Mathematical, Physical and Engineering Sciences, 360(1792):437+, 2002.

[9] Gropp, Huss-Lederman, and Lumsdaine. MPI: The Complete Reference. MIT Press, 1998.

[10] Harry Henderson. Encyclopedia of Computer Science and Technology. Springer-Verlag, 2009.

[11] Liu Peng, Ken-ichi Nomura, and Takehiro Oyakawa. Parallel lattice Boltzmann simulation on emerging multi-core platforms. 2008.

[12] NVIDIA. NVIDIA CUDA C Programming Guide. 2010.

[13] Daniel Page. A Practical Introduction to Computer Architecture. Springer, 2009.

[14] S. Succi. The Lattice Boltzmann Equation for Fluid Dynamics and Beyond. Numerical Mathematics and Scientific Computation. Clarendon Press; Oxford University Press, August 2001.


[15] J. Tölke and M. Krafczyk. TeraFLOP computing on a desktop PC with GPUs for 3D CFD. Int. J. Comput. Fluid Dyn., 22(7):443–456, August 2008.

[16] Jonas Tölke. Implementation of a lattice Boltzmann kernel using the Compute Unified Device Architecture developed by NVIDIA. Comput. Vis. Sci., 13(1):29–39, 2009.

[17] J. S. Wu and Y. L. Shao. Simulation of lid-driven cavity flows by parallel lattice Boltzmann method using multi-relaxation-time scheme. International Journal for Numerical Methods in Fluids, 46(9):921–937, 2004.

[18] Zhou Xiao-Yang, Shi Bao-Chang, and Wang Neng-Chao. Numerical simulation of LBGK model for high Reynolds number flow. Chinese Physics, 13(5):712, 2004.


.1 Terminology

Although some terms may have already been mentioned, I believe it is important to have a short, non-exhaustive glossary of terms commonly found in the parallel-computing literature.

• Throughput - the rate at which data is successfully transmitted. A measure of productivity.

• Granularity - the ratio of computational work to communication. If a high amount of computational work is performed between communications, the granularity is called coarse; otherwise it is fine.

• Overhead - the non-productive time necessary to prepare or organize things so that work can be done. Making an analogy with an assembly line, it would be the time to start the machines. Whenever a computer program is starting or ending, some overhead is present. In parallel programs, other sources of overhead include synchronization and data communication.

• Bottleneck - a single part of a program that imposes the limit on the maximum speedup that can be obtained. No matter how much we improve or parallelize all the other parts of the program, the elapsed time will not decrease below what is dictated by the bottleneck.

• Synchronization - a point of synchronization is a location in the code where all parallel tasks are coordinated. The program will not continue until all the tasks reach this point. The entity used for marking these points is called a barrier.

• Race conditions - typically an undesired situation where the outcome of an operation depends entirely on which task reached and modified a shared resource first. The entities used to avoid race conditions include locks, semaphores and atomic operations.

• Amdahl's law - a theoretical way to predict the speedup achieved with a system of several processors. If f is the fraction of the code that can be run in parallel and Np is the number of processors, then the speedup S is obtained by:

S_Amdahl = 1 / (f/Np + (1 − f))    (1)

The fraction of code that cannot be run in parallel, 1 − f, is the bottleneck, limiting the maximum speedup. Amdahl's law is used to treat problems of fixed size.

• Gustafson's law - another theoretical prediction of the speedups that can be achieved with multiple processors. It differs from Amdahl's law in that it takes into account the possibility of increasing the problem size in order to take advantage of the computational resources available. With f the fraction of the code taking advantage of parallelism and Np the number of processors, the maximum speedup is

S_Gustafson = Np − (1 − f)(Np − 1)    (2)

A short numerical check of both laws is sketched right after this list.
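As a quick numerical check of the two laws (with illustrative values, not data from this work), the small program below evaluates both expressions for f = 0.95 and Np = 4.

#include <stdio.h>

/* Evaluate Amdahl's and Gustafson's laws for assumed example values. */
int main(void)
{
    double f  = 0.95;   /* assumed parallel fraction */
    double Np = 4.0;    /* assumed number of processors */

    double s_amdahl    = 1.0 / (f / Np + (1.0 - f));
    double s_gustafson = Np - (1.0 - f) * (Np - 1.0);

    printf("Amdahl:    %.2f\n", s_amdahl);     /* prints about 3.48 */
    printf("Gustafson: %.2f\n", s_gustafson);  /* prints 3.85 */
    return 0;
}

Even with 95% of the code parallelized, Amdahl's law keeps the four-processor speedup well below 4; Gustafson's estimate is higher because the problem is allowed to grow with the resources.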


.2 Rules of thumb for higher performances

Although every model has its own quirks, there are general precautions that can be taken to obtain an optimized implementation.

One of these precautions relates to the way arrays are stored in memory and how they are accessed. A one-dimensional array, a vector, is stored consecutively in memory, which poses no problem. But two- or three-dimensional arrays are also stored in a one-dimensional fashion, and each programming language has its preferred way of arranging multi-dimensional matrices in consecutive memory locations.

For example, in C, if we declare a 2D matrix, all elements will be placed linearly in memory in row-major order. That is, starting at the top row, each element of that row, beginning with the lowest column index, is placed one after the other in memory; once the first row is finished, the elements of the second row are placed, and so on (fig. 1). Fortran, for example, uses a column-major approach, putting all elements of a column sequentially.

Figure 1: A 3 × 3 matrix stored in a row major fashion.
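To make the row-major mapping concrete, the small check below (an illustrative snippet, not code from this work) verifies that, in C, element a[i][j] of a 3 × 3 matrix sits at linear offset i*N + j from the first element.

#include <assert.h>

int main(void)
{
    enum { N = 3 };
    double a[N][N] = { {0, 1, 2}, {3, 4, 5}, {6, 7, 8} };
    double *flat = &a[0][0];

    /* row-major storage: a[i][j] lives at linear offset i*N + j */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            assert(&a[i][j] == &flat[i * N + j]);
    return 0;
}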

Knowing how data is stored tells us how to access it in the most efficient way. Since main memory is slow, whenever a processor needs data from it, instead of fetching just the data needed, it fetches a whole chunk of contiguous data and places it in cache memory, avoiding further long-latency accesses to main memory. So, if we program array accesses with this contiguous pattern in mind, we can reduce the number of calls to main memory and hence the overall run time of a program.

for (int j = 0; j < n; j++)
    for (int i = 0; i < n; i++)
        sum += a[i][j];

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        sum += a[i][j];

Figure 2: The first loop accesses data non-contiguously in memory. The second loop uses a row-major access pattern.


For the same reasons, whenever possible, it is wise to unroll loops: instead of issuing one instruction and letting the loop apply it over a certain number of indices, we can obtain better performance if the loop sweeps over fewer indices but performs more than one instruction per iteration. It is better understood with an example.

/* original loop */
for (int i = 1; i < n; i++) {
    a[i] = b[i] + 1;
    c[i] = a[i] + a[i-1] + b[i-1];
}

/* unrolled by a factor of two (this assumes the iteration count n-1 is even,
   so the iterations pair up exactly; otherwise a clean-up iteration is needed) */
for (int i = 1; i < n; i += 2) {
    a[i]   = b[i] + 1;
    c[i]   = a[i] + a[i-1] + b[i-1];
    a[i+1] = b[i+1] + 1;
    c[i+1] = a[i+1] + a[i] + b[i];
}

Figure 3: The second loop is the unrolled version of the first.

It is also important to become familiar with the compiler, as there are optimization flags (for example, GCC's -O2 and -O3 levels) that, when active, can lead to better performance.
