
Page 1: GPU Sample Sort (algo2.iti.kit.edu/sanders/courses/paralg12/GpuSampleSort.pdf)

GPU Sample Sort
Nikolaj Leischner, Vitaly Osipov, Peter Sanders

Fakultät für Informatik, Institut für Theoretische Informatik, Algorithmik II

KIT – Universität des Landes Baden-Württemberg und nationales Großforschungszentrum in der Helmholtz-Gemeinschaft
www.kit.edu

Page 2: GPU Sample Sort

Overview

Introduction
Tesla architecture
Compute Unified Device Architecture (CUDA) model
Performance guidelines
Sample sort: algorithm overview
High-level GPU algorithm design
Flavor of implementation details
Experimental evaluation
Future trends

Page 3: GPU Sample Sort

Introduction: multi-way sorting algorithms

Sorting is important.

Divide-and-conquer approaches:
    recursively split the input into tiles until the tile size is M (e.g. the cache size)
    sort each tile independently
    combine intermediate results

Two-way approaches:
    two-way distribution (quicksort): log2(n/M) scans to partition the input
    two-way merge sort: log2(n/M) scans to combine intermediate results

Multi-way approaches:
    k-way distribution (sample sort): only logk(n/M) scans to partition
    k-way merge sort: only logk(n/M) scans to combine

Multi-way approaches are beneficial when memory bandwidth is the bottleneck!
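As a rough worked example (the concrete numbers are assumed for illustration, not taken from the slides): for an input with n/M = 2^21 and distribution degree k = 128,

    log2(n/M)   = 21 scans over the data                  (two-way distribution or merging)
    log128(n/M) = 21 / log2(128) = 21 / 7 = 3 scans       (128-way distribution or merging)

so the multi-way scheme moves the whole input across the memory interface about 7 times less often, which is exactly what helps when the sort is bandwidth-bound.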

Page 7: GPU Sample Sort

NVIDIA Tesla Architecture

30 Streaming Multiprocessors (SMs) with 8 Scalar Processors (SPs) each: 240 physical cores overall
16 KB shared memory per SM, similar to a CPU L1 cache
4 GB global device memory
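As a hedged aside (not part of the slides), these figures can be read off a device at runtime with the CUDA runtime API; on a Tesla C1060 the program below should report 30 SMs, 16 KB of shared memory per block and roughly 4 GB of global memory.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        std::printf("%s: %d SMs, %zu KB shared memory per block, %zu MB global memory\n",
                    prop.name, prop.multiProcessorCount,
                    prop.sharedMemPerBlock / 1024,
                    prop.totalGlobalMem / (1024 * 1024));
        return 0;
    }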

Page 8: GPU Sample Sort

Compute Unified Device Architecture (CUDA) Model

[Figures: the sample-sort data-flow diagram (input, thread blocks, bucket indices, prefix sum, output), which reappears on Page 11, and the CUDA model diagram: scalar processors (SPs) sharing on-chip memory within an SM; thread blocks TBlock(i, j) forming a grid that is virtualized onto the hardware; global memory.]

C code:

    int main() {
        // serial code
        Kernel<<<gridDim, blockDim>>>(...);   // parallel part: kernel launch
        // serial code
    }

Similar to the SPMD (single-program multiple-data) model:
a block of concurrent threads executes a scalar sequential program, a kernel;
thread blocks constitute a grid.
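A minimal self-contained sketch of this pattern (the kernel and all names are illustrative, not from the talk): the host runs serial code, launches a grid of thread blocks that all execute the same scalar kernel, then continues serially.

    #include <cuda_runtime.h>

    // Each thread increments one element: a scalar sequential program run by many threads.
    __global__ void incrementKernel(int* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main() {
        const int n = 1 << 20;
        int* d_data;
        cudaMalloc(&d_data, n * sizeof(int));            // serial host code
        cudaMemset(d_data, 0, n * sizeof(int));

        int threads = 256;                               // threads per block
        int blocks  = (n + threads - 1) / threads;       // thread blocks constitute the grid
        incrementKernel<<<blocks, threads>>>(d_data, n); // parallel part
        cudaDeviceSynchronize();

        cudaFree(d_data);                                // serial host code again
        return 0;
    }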

Page 9: GPU Sample Sort

Performance Guidelines

General pattern in GPU algorithm design:
    decompose the problem into many data-independent sub-problems
    solve the sub-problems with blocks of cooperating parallel threads

Performance guidelines (illustrated in the sketch after this list):
    conditional branching: threads should follow the same execution path
    shared memory: exploit the fast on-chip memory
    coalesced global memory operations: load/store requests to the same memory block are combined into fewer memory transactions
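A hedged sketch of these guidelines (the kernel name and tile size are assumptions, not the paper's code): adjacent threads touch adjacent global addresses so the hardware can coalesce the accesses, the data reordering happens in shared memory that the whole block cooperates on, and every thread follows the same execution path.

    #define TILE 256

    // Reverses each TILE-sized tile of `in` into `out`; assumes the input length is a
    // multiple of TILE. Thread i of a block always touches global element base + i, so
    // loads and stores are coalesced; the reversal itself uses fast on-chip shared memory.
    __global__ void reverseTiles(const int* in, int* out) {
        __shared__ int tile[TILE];

        int base = blockIdx.x * TILE;
        tile[threadIdx.x] = in[base + threadIdx.x];              // coalesced load
        __syncthreads();                                         // whole block waits for the tile
        out[base + threadIdx.x] = tile[TILE - 1 - threadIdx.x];  // coalesced store, reversed order
    }

Launched as reverseTiles<<<n / TILE, TILE>>>(d_in, d_out), no thread ever takes a divergent branch.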

Page 10: GPU Sample Sort

Algorithm Overview

SampleSort(e = 〈e1, . . . , en〉, k)
begin
    if n < M then return SmallSort(e)
    choose a random sample S = S1, . . . , S_{ak−1} of e
    Sort(S)
    〈s0, s1, . . . , sk〉 = 〈−∞, S_a, . . . , S_{a(k−1)}, ∞〉
    for 1 ≤ i ≤ n do
        find j ∈ {1, . . . , k} such that s_{j−1} ≤ e_i ≤ s_j
        place e_i in bucket b_j
    return Concatenate(SampleSort(b1, k), . . . , SampleSort(bk, k))
end

Algorithm 1: Serial Sample Sort
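A compact host-side C++ sketch of Algorithm 1 (the threshold M, the oversampling factor a, and std::sort standing in for SmallSort are assumptions for the example, not the authors' implementation):

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    static const std::size_t M = 1024;   // assumed small-case threshold
    static const int         a = 8;      // assumed oversampling factor

    void sampleSort(std::vector<int>& e, int k) {
        if (e.size() < M) {                           // n < M: sort directly
            std::sort(e.begin(), e.end());
            return;
        }
        // choose and sort a random sample of a*k - 1 elements
        std::vector<int> S(a * k - 1);
        for (int& s : S) s = e[std::rand() % e.size()];
        std::sort(S.begin(), S.end());

        // every a-th sample element becomes a splitter s_1 <= ... <= s_{k-1}
        std::vector<int> splitters(k - 1);
        for (int i = 0; i < k - 1; ++i) splitters[i] = S[a * (i + 1) - 1];

        // distribute each element into its bucket (binary search over the splitters)
        std::vector<std::vector<int>> buckets(k);
        for (int x : e) {
            int j = std::upper_bound(splitters.begin(), splitters.end(), x)
                    - splitters.begin();
            buckets[j].push_back(x);
        }
        // sort the buckets recursively and concatenate them
        e.clear();
        for (std::vector<int>& b : buckets) {
            sampleSort(b, k);
            e.insert(e.end(), b.begin(), b.end());
        }
        // note: inputs dominated by a single repeated key would need an extra
        // equality-bucket check to guarantee progress; omitted in this sketch
    }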

Page 11: GPU Sample Sort

High-Level GPU Algorithm Design

[Figure: the distribution step. Each thread block reads a tile of t · l input elements, computes a bucket index in 0 .. k−1 for each element, and writes its k bucket sizes; a prefix sum over the resulting k × p table yields global bucket offsets; a second pass over the input scatters the elements to their final positions in the output.]

Parameters:
    distribution degree k = 128
    threads per block t = 256
    elements per thread l = 8
    number of blocks p = n / (t · l)

Phase 1. Choose the splitters.
Phase 2. Each of the p thread blocks:
    computes the bucket indices id, 0 ≤ id ≤ k − 1, of its elements
    stores its bucket sizes in DRAM
Phase 3. A prefix sum over the k × p table yields the global offsets (see the sketch below).
Phase 4. As in Phase 2, recompute the bucket indices to obtain local offsets; local + global offsets give the final positions.
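A hedged host-side sketch of Phase 3 only (this is not the authors' code; it assumes the k × p counter table is stored bucket-major, i.e. all p per-block counts for bucket 0 first, then bucket 1, and so on, so that a single scan produces offsets in output order):

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // counts and offsets both have k * p entries, laid out bucket-major.
    void computeGlobalOffsets(const thrust::device_vector<unsigned int>& counts,
                              thrust::device_vector<unsigned int>& offsets)
    {
        // exclusive prefix sum: offsets[bucket * p + block] becomes the first output
        // position for the elements that thread block `block` assigns to `bucket`.
        thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
    }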

Page 15: GPU Sample Sort

Flavor of Implementation Details: computing element bucket indices

bt = 〈s_{k/2}, s_{k/4}, s_{3k/4}, s_{k/8}, s_{3k/8}, s_{5k/8}, s_{7k/8}, . . .〉

TraverseTree(e_i)
begin
    j := 1
    repeat log k times
        j := 2j + (e_i > bt[j])      // go left or right?
    j := j − k + 1                   // bucket index
end

[Figure: the splitters stored as a complete binary search tree: s_{k/2} at the root, s_{k/4} and s_{3k/4} below it, then s_{k/8}, s_{3k/8}, s_{5k/8}, s_{7k/8}; the leaves correspond to the buckets b_1 . . . b_8 in the k = 8 example.]

Store the tree in fast shared memory.
Use predicated instructions: no path divergence.
Unroll the loop.
(A CUDA sketch of this traversal follows below.)
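A hedged CUDA rendering of TraverseTree (the constants, the function name and the 1-based shared-memory layout of bt are assumptions for illustration, not the authors' code):

    #define K     128   // distribution degree, a power of two
    #define LOG_K 7     // log2(K)

    // bt points to the splitter tree in shared memory, stored breadth-first with
    // 1-based indexing: bt[1] = s_{K/2}, bt[2] = s_{K/4}, bt[3] = s_{3K/4}, ...
    __device__ int traverseTree(int key, const int* bt) {
        int j = 1;
        #pragma unroll
        for (int level = 0; level < LOG_K; ++level)
            j = 2 * j + (key > bt[j]);   // comparison yields 0 or 1: no divergent branch
        return j - K + 1;                // bucket index in 1 .. K, as on the slide
    }

The comparison compiles to a select/predicated instruction rather than a branch, so all threads of a warp stay on the same path, and the unrolled loop has no loop overhead; in a full kernel the block would first cooperatively copy the K − 1 splitters from global into shared memory before each thread classifies its l elements.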

Page 19: GPU Sample Sort

Experimental Evaluation

Hardware: NVIDIA Tesla C1060
    30 SMs × 8 SPs = 240 cores
    4 GB RAM

Data types:
    32- and 64-bit integers
    key-value pairs

Distributions:
    Uniform
    Gaussian
    Bucket Sorted
    Staggered
    Deterministic Duplicates

GPU sorting algorithms compared:
    CUDPP and Thrust radix sort
    Thrust merge sort
    quicksort
    hybrid sort
    bbsort

Page 20: GPU Sample Sort

Experimental Evaluation: uniform 32-bit integers

[Plot: sorting rate, sorted elements / time [µs], versus number of elements (2^17 to 2^28), for cudpp radix, thrust radix, quick, bbsort, hybrid (float) and sample sort.]

Page 21: GPU Sample Sort

Experimental Evaluation: uniform key-value pairs

[Plot: sorting rate, sorted elements / time [µs], versus number of elements (2^19 to 2^27), for thrust radix, cudpp radix, thrust merge and sample sort.]

Page 22: GPU Sample Sort

Experimental Evaluation: uniform 64-bit integers

[Plot: sorting rate, sorted elements / time [µs], versus number of elements (2^17 to 2^27), for thrust radix and sample sort.]

Page 23: GPU Sample Sort

Future Trends: Fermi architecture

For existing applications that use shared memory as a software-managed cache, code can be streamlined to take advantage of the hardware caching system, while still having access to at least 16 KB of shared memory for explicit thread cooperation. Best of all, applications that do not use shared memory automatically benefit from the L1 cache, allowing high-performance CUDA programs to be built with minimum time and effort.

Summary Table

GPU                                   G80                  GT200                Fermi
Transistors                           681 million          1.4 billion          3.0 billion
CUDA Cores                            128                  240                  512
Double Precision Floating Point       None                 30 FMA ops/clock     256 FMA ops/clock
Single Precision Floating Point       128 MAD ops/clock    240 MAD ops/clock    512 FMA ops/clock
Special Function Units (SFUs) / SM    2                    2                    4
Warp Schedulers (per SM)              1                    1                    2
Shared Memory (per SM)                16 KB                16 KB                Configurable 48 KB or 16 KB
L1 Cache (per SM)                     None                 None                 Configurable 16 KB or 48 KB
L2 Cache                              None                 None                 768 KB
ECC Memory Support                    No                   No                   Yes
Concurrent Kernels                    No                   No                   Up to 16
Load/Store Address Width              32-bit               32-bit               64-bit

Second Generation Parallel Thread Execution ISA

Fermi is the first architecture to support the new Parallel Thread eXecution (PTX) 2.0 instruction set. PTX is a low-level virtual machine and ISA designed to support the operations of a parallel thread processor. At program install time, PTX instructions are translated to machine instructions by the GPU driver.

The primary goals of PTX are:
    provide a stable ISA that spans multiple GPU generations
    achieve full GPU performance in compiled applications
    provide a machine-independent ISA for C, C++, Fortran, and other compiler targets
    provide a code distribution ISA for application and middleware developers
    provide a common ISA for optimizing code generators and translators, which map PTX to specific target machines
    facilitate hand-coding of libraries and performance kernels
    provide a scalable programming model that spans GPU sizes from a few cores to many parallel cores

What about memory bandwidth? No significant improvements.
Multi-way approaches are likely to become even more beneficial.
Multi-way merge sort?

Page 24: GPU Sample Sort

Thank you!