
Page 1: GPU Sample Sort (algo2.iti.kit.edu/sanders/courses/paralg12/GpuSampleSort.pdf)

GPU Sample Sort
Nikolaj Leischner, Vitaly Osipov, Peter Sanders

Fakultät für Informatik, Institut für Theoretische Informatik, Algorithmik II

KIT – Universität des Landes Baden-Württemberg und nationales Großforschungszentrum in der Helmholtz-Gemeinschaft
www.kit.edu

Page 2: GPU Sample Sort

Overview

Introduction
Tesla architecture
Compute Unified Device Architecture (CUDA) model
Performance guidelines
Sample sort: algorithm overview
High-level GPU algorithm design
Flavor of implementation details
Experimental evaluation
Future trends

Page 3: GPU Sample Sort

Introduction: multi-way sorting algorithms

Sorting is important.

Divide-and-conquer approaches:
    recursively split the input into tiles until the tile size is M (e.g. the cache size)
    sort each tile independently
    combine intermediate results

Two-way approaches:
    two-way distribution (quicksort): log2(n/M) scans to partition the input
    two-way merge sort: log2(n/M) scans to combine intermediate results

Multi-way approaches:
    k-way distribution (sample sort): only logk(n/M) scans to partition
    k-way merge sort: only logk(n/M) scans to combine

Multi-way approaches are beneficial when memory bandwidth is the bottleneck!
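As a rough worked example (the concrete numbers are assumed for illustration, not taken from the slides): for an input with n/M = 2^21 and distribution degree k = 128,

    log2(n/M)   = 21 scans over the data                  (two-way distribution or merging)
    log128(n/M) = 21 / log2(128) = 21 / 7 = 3 scans       (128-way distribution or merging)

so the multi-way scheme moves the whole input across the memory interface about 7 times less often, which is exactly what helps when the sort is bandwidth-bound.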

Page 7: GPU Sample Sort

NVIDIA Tesla Architecture

30 Streaming Multiprocessors (SMs) with 8 Scalar Processors (SPs) each: 240 physical cores overall
16 KB shared memory per SM, similar to a CPU L1 cache
4 GB global device memory
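As a hedged aside (not part of the slides), these figures can be read off a device at runtime with the CUDA runtime API; on a Tesla C1060 the program below should report 30 SMs, 16 KB of shared memory per block and roughly 4 GB of global memory.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        std::printf("%s: %d SMs, %zu KB shared memory per block, %zu MB global memory\n",
                    prop.name, prop.multiProcessorCount,
                    prop.sharedMemPerBlock / 1024,
                    prop.totalGlobalMem / (1024 * 1024));
        return 0;
    }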

Page 8: GPU Sample Sort

Compute Unified Device Architecture (CUDA) Model

[Figures: the sample-sort data-flow diagram (input, thread blocks, bucket indices, prefix sum, output), which reappears on Page 11, and the CUDA model diagram: scalar processors (SPs) sharing on-chip memory within an SM; thread blocks TBlock(i, j) forming a grid that is virtualized onto the hardware; global memory.]

C code:

    int main() {
        // serial code
        Kernel<<<gridDim, blockDim>>>(...);   // parallel part: kernel launch
        // serial code
    }

Similar to the SPMD (single-program multiple-data) model:
a block of concurrent threads executes a scalar sequential program, a kernel;
thread blocks constitute a grid.
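A minimal self-contained sketch of this pattern (the kernel and all names are illustrative, not from the talk): the host runs serial code, launches a grid of thread blocks that all execute the same scalar kernel, then continues serially.

    #include <cuda_runtime.h>

    // Each thread increments one element: a scalar sequential program run by many threads.
    __global__ void incrementKernel(int* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main() {
        const int n = 1 << 20;
        int* d_data;
        cudaMalloc(&d_data, n * sizeof(int));            // serial host code
        cudaMemset(d_data, 0, n * sizeof(int));

        int threads = 256;                               // threads per block
        int blocks  = (n + threads - 1) / threads;       // thread blocks constitute the grid
        incrementKernel<<<blocks, threads>>>(d_data, n); // parallel part
        cudaDeviceSynchronize();

        cudaFree(d_data);                                // serial host code again
        return 0;
    }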

Page 9: GPU Sample Sort

Performance Guidelines

General pattern in GPU algorithm design:
    decompose the problem into many data-independent sub-problems
    solve the sub-problems with blocks of cooperating parallel threads

Performance guidelines (illustrated in the sketch after this list):
    conditional branching: threads should follow the same execution path
    shared memory: exploit the fast on-chip memory
    coalesced global memory operations: load/store requests to the same memory block are combined into fewer memory transactions
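A hedged sketch of these guidelines (the kernel name and tile size are assumptions, not the paper's code): adjacent threads touch adjacent global addresses so the hardware can coalesce the accesses, the data reordering happens in shared memory that the whole block cooperates on, and every thread follows the same execution path.

    #define TILE 256

    // Reverses each TILE-sized tile of `in` into `out`; assumes the input length is a
    // multiple of TILE. Thread i of a block always touches global element base + i, so
    // loads and stores are coalesced; the reversal itself uses fast on-chip shared memory.
    __global__ void reverseTiles(const int* in, int* out) {
        __shared__ int tile[TILE];

        int base = blockIdx.x * TILE;
        tile[threadIdx.x] = in[base + threadIdx.x];              // coalesced load
        __syncthreads();                                         // whole block waits for the tile
        out[base + threadIdx.x] = tile[TILE - 1 - threadIdx.x];  // coalesced store, reversed order
    }

Launched as reverseTiles<<<n / TILE, TILE>>>(d_in, d_out), no thread ever takes a divergent branch.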

Page 10: GPU Sample Sort

Algorithm Overview

SampleSort(e = 〈e1, . . . , en〉, k)
begin
    if n < M then return SmallSort(e)
    choose a random sample S = S1, . . . , S_{ak−1} of e
    Sort(S)
    〈s0, s1, . . . , sk〉 = 〈−∞, S_a, . . . , S_{a(k−1)}, ∞〉
    for 1 ≤ i ≤ n do
        find j ∈ {1, . . . , k} such that s_{j−1} ≤ e_i ≤ s_j
        place e_i in bucket b_j
    return Concatenate(SampleSort(b1, k), . . . , SampleSort(bk, k))
end

Algorithm 1: Serial Sample Sort
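A compact host-side C++ sketch of Algorithm 1 (the threshold M, the oversampling factor a, and std::sort standing in for SmallSort are assumptions for the example, not the authors' implementation):

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    static const std::size_t M = 1024;   // assumed small-case threshold
    static const int         a = 8;      // assumed oversampling factor

    void sampleSort(std::vector<int>& e, int k) {
        if (e.size() < M) {                           // n < M: sort directly
            std::sort(e.begin(), e.end());
            return;
        }
        // choose and sort a random sample of a*k - 1 elements
        std::vector<int> S(a * k - 1);
        for (int& s : S) s = e[std::rand() % e.size()];
        std::sort(S.begin(), S.end());

        // every a-th sample element becomes a splitter s_1 <= ... <= s_{k-1}
        std::vector<int> splitters(k - 1);
        for (int i = 0; i < k - 1; ++i) splitters[i] = S[a * (i + 1) - 1];

        // distribute each element into its bucket (binary search over the splitters)
        std::vector<std::vector<int>> buckets(k);
        for (int x : e) {
            int j = std::upper_bound(splitters.begin(), splitters.end(), x)
                    - splitters.begin();
            buckets[j].push_back(x);
        }
        // sort the buckets recursively and concatenate them
        e.clear();
        for (std::vector<int>& b : buckets) {
            sampleSort(b, k);
            e.insert(e.end(), b.begin(), b.end());
        }
        // note: inputs dominated by a single repeated key would need an extra
        // equality-bucket check to guarantee progress; omitted in this sketch
    }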

Page 11: GPU Sample Sort

High-Level GPU Algorithm Design

[Figure: the distribution step. Each thread block reads a tile of t · l input elements, computes a bucket index in 0 .. k−1 for each element, and writes its k bucket sizes; a prefix sum over the resulting k × p table yields global bucket offsets; a second pass over the input scatters the elements to their final positions in the output.]

Parameters:
    distribution degree k = 128
    threads per block t = 256
    elements per thread l = 8
    number of blocks p = n / (t · l)

Phase 1. Choose the splitters.
Phase 2. Each of the p thread blocks:
    computes the bucket indices id, 0 ≤ id ≤ k − 1, of its elements
    stores its bucket sizes in DRAM
Phase 3. A prefix sum over the k × p table yields the global offsets (see the sketch below).
Phase 4. As in Phase 2, recompute the bucket indices to obtain local offsets; local + global offsets give the final positions.
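A hedged host-side sketch of Phase 3 only (this is not the authors' code; it assumes the k × p counter table is stored bucket-major, i.e. all p per-block counts for bucket 0 first, then bucket 1, and so on, so that a single scan produces offsets in output order):

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // counts and offsets both have k * p entries, laid out bucket-major.
    void computeGlobalOffsets(const thrust::device_vector<unsigned int>& counts,
                              thrust::device_vector<unsigned int>& offsets)
    {
        // exclusive prefix sum: offsets[bucket * p + block] becomes the first output
        // position for the elements that thread block `block` assigns to `bucket`.
        thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
    }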

Page 15: GPU Sample Sort

Flavor of Implementation Details: computing element bucket indices

bt = 〈s_{k/2}, s_{k/4}, s_{3k/4}, s_{k/8}, s_{3k/8}, s_{5k/8}, s_{7k/8}, . . .〉

TraverseTree(e_i)
begin
    j := 1
    repeat log k times
        j := 2j + (e_i > bt[j])      // go left or right?
    j := j − k + 1                   // bucket index
end

[Figure: the splitters stored as a complete binary search tree: s_{k/2} at the root, s_{k/4} and s_{3k/4} below it, then s_{k/8}, s_{3k/8}, s_{5k/8}, s_{7k/8}; the leaves correspond to the buckets b_1 . . . b_8 in the k = 8 example.]

Store the tree in fast shared memory.
Use predicated instructions: no path divergence.
Unroll the loop.
(A CUDA sketch of this traversal follows below.)
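A hedged CUDA rendering of TraverseTree (the constants, the function name and the 1-based shared-memory layout of bt are assumptions for illustration, not the authors' code):

    #define K     128   // distribution degree, a power of two
    #define LOG_K 7     // log2(K)

    // bt points to the splitter tree in shared memory, stored breadth-first with
    // 1-based indexing: bt[1] = s_{K/2}, bt[2] = s_{K/4}, bt[3] = s_{3K/4}, ...
    __device__ int traverseTree(int key, const int* bt) {
        int j = 1;
        #pragma unroll
        for (int level = 0; level < LOG_K; ++level)
            j = 2 * j + (key > bt[j]);   // comparison yields 0 or 1: no divergent branch
        return j - K + 1;                // bucket index in 1 .. K, as on the slide
    }

The comparison compiles to a select/predicated instruction rather than a branch, so all threads of a warp stay on the same path, and the unrolled loop has no loop overhead; in a full kernel the block would first cooperatively copy the K − 1 splitters from global into shared memory before each thread classifies its l elements.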

Page 19: GPU Sample Sort

Experimental Evaluation

Hardware: NVIDIA Tesla C1060
    30 SMs × 8 SPs = 240 cores
    4 GB RAM

Data types:
    32- and 64-bit integers
    key-value pairs

Distributions:
    Uniform
    Gaussian
    Bucket Sorted
    Staggered
    Deterministic Duplicates

GPU sorting algorithms compared:
    CUDPP and Thrust radix sort
    Thrust merge sort
    quicksort
    hybrid sort
    bbsort

Page 20: GPU Sample Sort

Experimental Evaluation: uniform 32-bit integers

[Plot: sorting rate, sorted elements / time [µs], versus number of elements (2^17 to 2^28), for cudpp radix, thrust radix, quick, bbsort, hybrid (float) and sample sort.]

Page 21: GPU Sample Sort

Experimental Evaluation: uniform key-value pairs

[Plot: sorting rate, sorted elements / time [µs], versus number of elements (2^19 to 2^27), for thrust radix, cudpp radix, thrust merge and sample sort.]

Page 22: GPU Sample Sort

Experimental Evaluation: uniform 64-bit integers

[Plot: sorting rate, sorted elements / time [µs], versus number of elements (2^17 to 2^27), for thrust radix and sample sort.]

Page 23: GPU Sample Sort

Future Trends: Fermi architecture

For existing applications that use shared memory as a software-managed cache, code can be streamlined to take advantage of the hardware caching system, while still having access to at least 16 KB of shared memory for explicit thread cooperation. Best of all, applications that do not use shared memory automatically benefit from the L1 cache, allowing high-performance CUDA programs to be built with minimum time and effort.

Summary Table

GPU                                   G80                  GT200                Fermi
Transistors                           681 million          1.4 billion          3.0 billion
CUDA Cores                            128                  240                  512
Double Precision Floating Point       None                 30 FMA ops/clock     256 FMA ops/clock
Single Precision Floating Point       128 MAD ops/clock    240 MAD ops/clock    512 FMA ops/clock
Special Function Units (SFUs) / SM    2                    2                    4
Warp Schedulers (per SM)              1                    1                    2
Shared Memory (per SM)                16 KB                16 KB                Configurable 48 KB or 16 KB
L1 Cache (per SM)                     None                 None                 Configurable 16 KB or 48 KB
L2 Cache                              None                 None                 768 KB
ECC Memory Support                    No                   No                   Yes
Concurrent Kernels                    No                   No                   Up to 16
Load/Store Address Width              32-bit               32-bit               64-bit

Second Generation Parallel Thread Execution ISA

Fermi is the first architecture to support the new Parallel Thread eXecution (PTX) 2.0 instruction set. PTX is a low-level virtual machine and ISA designed to support the operations of a parallel thread processor. At program install time, PTX instructions are translated to machine instructions by the GPU driver.

The primary goals of PTX are:
    provide a stable ISA that spans multiple GPU generations
    achieve full GPU performance in compiled applications
    provide a machine-independent ISA for C, C++, Fortran, and other compiler targets
    provide a code distribution ISA for application and middleware developers
    provide a common ISA for optimizing code generators and translators, which map PTX to specific target machines
    facilitate hand-coding of libraries and performance kernels
    provide a scalable programming model that spans GPU sizes from a few cores to many parallel cores

What about memory bandwidth? No significant improvements.
Multi-way approaches are likely to become even more beneficial.
Multi-way merge sort?

Page 24: GPU Sample Sort

Thank you!