
Bachelor Thesis

Engineering Faster Sorters for Small Sets of Items

Jasper Anton Marianczuk

Date: May 09, 2019

Supervisors: Prof. Dr. Peter Sanders, Dr. Timo Bingmann

Institute of Theoretical Informatics, Algorithmics
Department of Informatics

Karlsruhe Institute of Technology


I hereby declare that I have written this thesis independently, that I have used no sources or aids other than those stated, that I have marked all passages adopted verbatim or in content as such, and that I have observed the statutes of the Karlsruhe Institute of Technology for safeguarding good scientific practice in their currently valid version.

Karlsruhe, 09.05.2019


Abstract

Sorting a set of items is a task that can be useful by itself or as a building block for more complex operations. That is why a lot of effort has been put into finding sorting algorithms that sort large sets as efficiently as possible. But the more sophisticated and asymptotically faster the algorithms become, the less efficient they are for small sets of items, due to large constant factors.

A relatively simple sorting algorithm that is often used as a base case sorter is insertion sort, because it has small code size and small constant factors influencing its execution time.

This thesis aims to determine whether there is a faster way to sort these small sets of items, to provide an efficient base case sorter. We looked at sorting networks, at how they can improve the speed of sorting few elements, and at how to implement them efficiently by using conditional moves. Since sorting networks need to be implemented explicitly for each set size, providing networks for larger sizes becomes less efficient due to increased code size. To also enable the sorting of slightly larger base cases, we modified Super Scalar Sample Sort and created Register Sample Sort, which breaks those larger sets down into sizes that can in turn be sorted by sorting networks.

From our experiments we found that, when sorting only small sets, the sorting networks outperform insertion sort by at least 25% for any array size between 2 and 16. When integrating sorting networks as a base case sorter into quicksort, we achieved far smaller performance improvements over using insertion sort, because the networks have a larger code size and clutter the L1 instruction cache. The same effect occurs when including Register Sample Sort as a base case sorter for IPS4o. But on computers that have a larger L1 instruction cache of 64 KiB or more, we obtained speed-ups of 6.4% when using sorting networks as the base case sorter in quicksort, and of 9.2% when integrating Register Sample Sort as the base case sorter into IPS4o, each in comparison to using insertion sort as the base case sorter.

In conclusion, the desired improvement in speed could only be achieved under special circumstances, but the results clearly show the potential of using conditional moves in the field of sorting algorithms.


Zusammenfassung

Sorting a set of elements is a process that can be useful on its own or serve as a building block for more complex operations. For this reason, great effort has already been invested in designing sorting algorithms that sort large sets of elements efficiently. But the more sophisticated and asymptotically faster the algorithms are, the less efficient they become for sorting small sets, owing to high constant factors.

A relatively simple sorting algorithm that is often used as a base case sorter is insertion sort, because its code is short and it has small constant factors.

This bachelor thesis aims to find out whether there is a faster algorithm for sorting such small numbers of elements, so that it can be used as an efficient base case sorter. To this end we looked at sorting networks, at how they can speed up the sorting of small lists, and at how to implement them efficiently: by exploiting conditional move instructions. Because sorting networks must be implemented explicitly for each list size, the efficiency of sorting with them decreases for larger lists, owing to increased code size. To also enable the sorting of somewhat larger base cases, we modified Super Scalar Sample Sort and designed Register Sample Sort, which partitions a larger list into several small lists that can then be sorted by the sorting networks.

In our experiments we came to the conclusion that, when only small sets are sorted, the sorting networks are at least 25% faster than insertion sort for all lists containing between 2 and 16 elements. When integrating the sorting networks as a base case sorter into quicksort, we obtained far less speed-up over using insertion sort, because the code of the networks needs more space and evicts the quicksort code from the L1 instruction cache. The same effect occurs when using Register Sample Sort as a base case sorter for IPS4o. However, on machines with a larger L1 instruction cache of 64 KiB or more, we achieved improvements over insertion sort as the base case sorter of 6.4% with sorting networks in quicksort and of 9.2% with Register Sample Sort in IPS4o.

In summary, we achieved the desired improvement only under special circumstances, but the results clearly indicate that conditional move instructions have potential in the field of sorting algorithms.


Contents

1 Introduction . . . 8
  1.1 Motivation . . . 8
  1.2 Overview of the Thesis . . . 8

2 Sorting Networks . . . 9
  2.1 Preliminaries . . . 9
    2.1.1 Networks in Practice . . . 9
    2.1.2 Improving the Speed of Sorting through Sorting Networks . . . 10
    2.1.3 Compare-and-Swap . . . 10
    2.1.4 Assembly Code . . . 11
  2.2 Implementation of Sorting Networks . . . 13
    2.2.1 Providing the Network Frame . . . 13
    2.2.2 Implementing the Conditional Swap . . . 15

3 Register Sample Sort . . . 20
  3.1 Preliminaries . . . 20
  3.2 Implementing Sample Sort for Medium-Sized Sets . . . 20

4 Experimental Results . . . 23
  4.1 Environment . . . 23
  4.2 Generating Plots . . . 24
  4.3 Conducting the Measurements . . . 24
  4.4 Sorting One Set of 2-16 Items . . . 27
  4.5 Sorting Many Continuous Sets of 2-16 Items . . . 35
  4.6 Sorting a Large Set of Items with Quicksort . . . 38
  4.7 Sorting a Medium-Sized Set of Items with Sample Sort . . . 41
  4.8 Sorting a Large Set of Items with IPS4o . . . 45

5 Conclusion . . . 49
  5.1 Results and Assessment . . . 49
  5.2 Experiences and Hurdles . . . 49
  5.3 Possible Additions . . . 50


List of Figures

1  Sorting network by Bose and Nelson for 6 elements . . . 10
2  Best network with optimal length for 16 elements . . . 14
3  Bose Nelson network for 16 elements optimizing locality . . . 14
4  Bose Nelson network for 16 elements optimizing parallelism . . . 14
5  Single sort for array size = 8 on machine A . . . 32
6  Single sort for array size = 8 on machine B . . . 32
7  Single sort for array size = 8 on machine C . . . 33
8  Single sort of array sizes 2 to 16 on machine A . . . 33
9  Single sort of array sizes 2 to 16 on machine B . . . 34
10 Single sort of array sizes 2 to 16 on machine C . . . 34
11 Continuous sorting of array sizes 2 to 16 on machine A . . . 35
12 Continuous sorting of array sizes 2 to 16 on machine B . . . 36
13 Continuous sorting of array sizes 2 to 16 on machine C . . . 36
14 Sorting times of quicksort with different base cases on machine A . . . 38
15 Sorting times of quicksort with different base cases on machine B . . . 39
16 Sorting times of quicksort with different base cases on machine C . . . 39
17 Sample sort on machine A with 256 items and different configurations . . . 42
18 Sample sort on machine B with 256 items and different configurations . . . 42
19 Sample sort on machine C with 256 items and different configurations . . . 43
20 Sample sort 332 with different base cases on machine A . . . 43
21 Sample sort 332 with different base cases on machine B . . . 44
22 Sample sort 332 with different base cases on machine C . . . 44
23 Distribution of the size of the array passed to the base case sorter when executing IPS4o with parameter BaseCaseSize4 = 16 . . . 46
24 Distribution of the size of the array passed to the base case sorter when executing IPS4o with parameter BaseCaseSize4 = 32 . . . 46
25 Sorting times for IPS4o on machine A with different base cases and base case sizes . . . 47
26 Sorting times for IPS4o on machine B with different base cases and base case sizes . . . 48
27 Sorting times for IPS4o on machine C with different base cases and base case sizes . . . 48

List of Tables

1  Registers required by Register Sample Sort with three or seven splitters . . . 22
2  Hardware properties of the machines used . . . 23
3  Average number of CPU cycles per iteration of single array sorting on machine A . . . 28
4  Average number of CPU cycles per iteration of single array sorting on machine B . . . 29
5  Average number of CPU cycles per iteration of single array sorting on machine C . . . 30
6  Average number of CPU cycles per iteration of single array sorting across all machines . . . 31
7  Average number of CPU cycles per array of continuous sorting across all machines . . . 37
8  Average speed-ups of the fastest sorting network over the fastest insertion sort as base case in quicksort and unmodified std::sort . . . 40
9  Average speed-ups of the fastest sorting network over the fastest insertion sort as base case in sample sort and unmodified std::sort . . . 40
10 Average speed-ups of the fastest sorting network over the fastest insertion sort as base case in IPS4o and unmodified std::sort . . . 47


Algorithms

1 Register Sample Sort Classification(array, elementCount, predicate) . . . 22
2 MeasureSorting(arraySize, numberOfIterations, seed) . . . 26
3 MeasureSortingInRow(arraySize, numberOfArrays, seed) . . . 26


1 Introduction

1.1 Motivation

Sorting, that is, rearranging the elements in a set to be in a specific order, is one of the basic algorithmic problems. In school and university, basic sorting algorithms like bubble sort, insertion sort, and merge sort, as well as a simple variant of quicksort, are taught first. These algorithms are rated by the number of comparisons they require to sort a set of items. This number of comparisons is put into relation to the input size and considered on an asymptotic level. Only later does one realize that what looks good on paper does not have to work well in practice, so factors like average cases, cache effects, hardware setups, and constant factors need to be taken into consideration, too. A sophisticated choice of which sorting algorithm to use (for a particular use case) should be informed by all of these factors.

Complex sorting algorithms aim to sort a large number of items quickly, and many of them follow the divide-and-conquer approach to algorithm design. However, sorting small sets of items, e.g. with 16 elements or fewer, is usually fast enough that investing a lot of effort into optimizing sorting algorithms for those cases yields very small gains in terms of the absolute amount of time saved.

The complex sorters do not perform as well when sorting small sets of items: they have good asymptotic properties but larger constant factors that become more important for small sizes. Because of that, the base case of sorting small enough subsets is performed using a simpler algorithm, which is often insertion sort. It has a worst-case run-time of O(n²), but small constant factors that make it suitable for small n. If this sorter is executed many times as the base case of a larger sorter, though, the times do sum up to a substantial part of the overall sorting time.

The guiding question of this thesis is:

Is there a faster way to sort sets of up to 16 elements than insertion sort?

When sorting a set of uniformly distributed random numbers, the chance of any number being greater than another is on average 50%. Therefore, whenever a conditional branch is influenced by one element's relation to another, one in two of those branches will be mispredicted, which leads to an overall performance penalty.

This problem has already been examined by Michael Codish, Luís Cruz-Filipe, Markus Nebel and Peter Schneider-Kamp in "Optimizing sorting algorithms by using sorting networks" [CCNS17] in 2017, and this thesis has taken a great deal of inspiration from that work.

1.2 Overview of the Thesis

We will first look at sorting networks in section 2. Section 2.1 lays the groundwork on sorting networks and assembly code. After that, we look at different ways of implementing sorting networks efficiently in C++ in section 2.2. For that we focused on elements that consist of a key and an additional reference value. This enables the sorting of complex items, not being limited to integers.

In section 3 we take a small detour to look at Super Scalar Sample Sort and develop an efficient modified version for sets with 256 elements or fewer, by holding the splitters in general-purpose registers instead of an array. Section 4 then discusses the results and improvements of using sorting networks achieved in our experiments, measuring the performance of the sorting networks and sample sort individually, and also including them as base cases in quicksort and IPS4o [AWFS17]. Finally, we conclude the results of this thesis in section 5.


2 Sorting Networks

2.1 Preliminaries

Sorting algorithms can generally be classified into two groups: those whose behaviour depends on the input, e.g. quicksort, where the sorting speed depends on how well the chosen pivot partitions the set into equally-sized halves, and those whose behaviour is not influenced by the configuration of the input. The latter are also called data-oblivious.

One example of a data-oblivious sorting algorithm is the sorting network. A sorting network of size n consists of n so-called channels numbered 1 to n, each representing one of the inputs, and connections between the channels, called comparators. Where two channels are connected by a comparator, the values are to be compared, and if the channel with the lower number currently holds a value that is greater than the value of the channel with the higher number, the values are exchanged between the channels. The comparators are given in a fixed order that determines the sequence of executing these conditional swaps, so that in the end

(i) the channels contain a permutation of the original input, and
(ii) the values held by the channels are in nondecreasing order.

Sorting networks are data-oblivious because all the comparisons are always performed, and in the same order, no matter which permutation of an input is given.

For any sorting network, two metrics can be used to quantify it: the length and the depth. A network's length refers to the number of comparators it contains, and a network's depth describes the minimal number of levels the network can be divided into. Where two comparators are ordered one after the other and no channel is used by both comparators, they can be combined into a level; in other words, the result of the second comparator does not depend on the result of the first. Inductively, any comparator can be merged into a level that executes right before or after it, if its channels are not already used by any comparator in that level. Since all the comparators in a level are independent of one another, they can be executed in parallel.
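Both metrics can be made concrete in code. The following C++ fragment is our own illustrative sketch (the names ApplyNetwork and IntoLevels are not from the thesis): it executes a comparator list on an array and groups the comparators into levels with exactly the greedy rule described above, so the number of levels returned is the network's depth.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A comparator connects two channels (0-indexed here).
struct Comparator { std::size_t lo, hi; };

// Apply a sorting network to an array: every comparator is executed,
// in the given fixed order, regardless of the input values.
template <typename T>
void ApplyNetwork(const std::vector<Comparator>& net, std::vector<T>& a) {
    for (const Comparator& c : net)
        if (a[c.hi] < a[c.lo]) std::swap(a[c.lo], a[c.hi]);
}

// Greedily group comparators into levels: each comparator joins the
// earliest level in which neither of its channels is used yet.
std::vector<std::vector<Comparator>> IntoLevels(
        const std::vector<Comparator>& net, std::size_t n) {
    std::vector<std::vector<Comparator>> levels;
    std::vector<std::size_t> free_from(n, 0);  // first level each channel is free in
    for (const Comparator& c : net) {
        std::size_t lvl = std::max(free_from[c.lo], free_from[c.hi]);
        if (lvl == levels.size()) levels.emplace_back();
        levels[lvl].push_back(c);
        free_from[c.lo] = free_from[c.hi] = lvl + 1;
    }
    return levels;  // levels.size() is the network's depth
}
```

For the classic 4-element network {(0,1), (2,3), (0,2), (1,3), (1,2)}, this yields a length of 5 and a depth of 3.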

2.1.1 Networks in Practice

• Best known networks: For networks of up to size 16 there exist proven optimal lengths and proven optimal depths. For example, the network for 10 elements with optimal length 29 has depth 9, while the one with optimal depth 7 has length 31 [Knu98, CCFS14]. For networks of greater size there exist only currently known lowest values of length or depth. Those best networks are acquired through optimizations that were initially done by hand and nowadays are realized e.g. with the help of computers and evolutionary algorithms [Ber18].

• Recursive networks: For creating sorting networks there also exist algorithms that work in a recursive divide-and-conquer way: split the input into two parts, sort each part recursively, and merge the two parts together at the end. Representatives of this kind of approach are the constructions of R.J. Nelson and B.C. Bose [BN62] and the algorithm by K.E. Batcher [Bat68]. Bose and Nelson split the input sequence into first and second half, while Batcher partitions it into elements with an even index and elements with an odd index. The advantage of those recursive networks over the specially optimized ones is that they can easily be created even for large network sizes. While the generated networks may have more comparators than the best known networks, the number of comparators in a network acquired from either Bose-Nelson or Batcher for size n has an upper bound of O(n (log n)²) [Knu98].

Figure 1: Sorting network by Bose and Nelson for 6 elements

Sorting networks are usually depicted by using horizontal lines for the channels and vertical connections between these lines for the comparators. A network by Bose and Nelson for 6 elements displayed in this way can be seen in figure 1.
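To illustrate how such a recursive construction works, the following is our own sketch of Batcher's odd-even merge sort generator (not the thesis implementation, which uses Bose-Nelson networks): it emits the comparator list for a power-of-two input size n, with the comparator count bounded by O(n (log n)²).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Merge step of Batcher's odd-even merge: compares elements at
// distance r, recursing with a doubled stride on the even and odd
// subsequences first.
void OddEvenMerge(std::size_t lo, std::size_t n, std::size_t r,
                  std::vector<std::pair<std::size_t, std::size_t>>& out) {
    std::size_t m = r * 2;
    if (m < n) {
        OddEvenMerge(lo, n, m, out);      // even subsequence
        OddEvenMerge(lo + r, n, m, out);  // odd subsequence
        for (std::size_t i = lo + r; i + r < lo + n; i += m)
            out.emplace_back(i, i + r);
    } else {
        out.emplace_back(lo, lo + r);
    }
}

// Sort both halves recursively, then merge them.
void OddEvenMergeSort(std::size_t lo, std::size_t n,
                      std::vector<std::pair<std::size_t, std::size_t>>& out) {
    if (n > 1) {
        std::size_t m = n / 2;
        OddEvenMergeSort(lo, m, out);
        OddEvenMergeSort(lo + m, m, out);
        OddEvenMerge(lo, n, 1, out);
    }
}

// Execute a generated comparator list on an array.
template <typename T>
void ApplyComparators(
        const std::vector<std::pair<std::size_t, std::size_t>>& net,
        std::vector<T>& a) {
    for (const auto& c : net)
        if (a[c.second] < a[c.first]) std::swap(a[c.first], a[c.second]);
}
```

For n = 8 this generates 19 comparators, more than the optimal 16-comparator network for that size, but the construction works mechanically for any power of two.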

2.1.2 Improving the Speed of Sorting through Sorting Networks

An important question to ask is how sorting networks can improve the sorting speed on a set of elements (on average) if they cannot take any shortcuts for "good" inputs, like an insertion sort that would leverage an already sorted input and do one comparison per element. The answer to this question is branching. Because the compiler knows in advance which comparisons are going to be executed in which order, the control flow does not contain conditional branches, in particular getting rid of expensive branch mispredictions. On uniformly distributed random inputs, the chance that any number is smaller than another is 50% on average, making branches unpredictable. In the case of insertion sort, that means not knowing in advance how many elements the next one has to be compared with until it is inserted into the right place.

Even though with sorting networks the compiler knows in advance when to execute which comparator, when the compare-and-swap operation is implemented in a naive way (as seen in 2.1.3) the compiler might still generate branches. In that case, the sorting networks are no faster than insertion sort, or even slower.

2.1.3 Compare-and-Swap

For sorting networks, the basic operation used is to compare two values against each other. If they are in the wrong order (the "smaller" element occurs after the "bigger" one in the sequence), they are swapped. Intuitively, one might implement the operation in C++ like this:

void ConditionalSwap(TValueType& left, TValueType& right)
{
    if (left > right) { std::swap(left, right); }
}


Here TValueType is a template typename and can be instantiated with any type that implements the > operator. As suggested in [CCNS17], the same piece of code can be rewritten like this:

void ConditionalSwap2(TValueType& left, TValueType& right)
{
    TValueType temp = left;
    if (temp > right) { left = right; }
    if (temp > right) { right = temp; }
}

At first glance it looks like we now have two branches that can be taken. But the code executed if the condition is true now only consists of a single assignment each, which can be expressed in the x86 architecture through a conditional move instruction. In AT&T syntax (see section 2.1.4), a conditional move (cmov a,b) will write the value in register a into register b if a condition is met. If the condition is not met, no operation takes place (still taking the same number of CPU cycles as the move operation would have). Since the address of the next instruction no longer depends upon the previously evaluated condition, the control flow does not contain branches. The only downside of the conditional move is that it can take longer than a normal move instruction on certain architectures, and can only be executed once the comparison has been performed and its result is available.

When the elements to be sorted are only integers, some compilers do generate code with conditional moves for those operations. When the elements are more general (in this thesis we will look at pairs of an unsigned 64-bit integer key and an unsigned 64-bit reference value, which could be a pointer or an address in an array), gcc 7.3.0, the compiler used for the experiments, does not generate conditional moves. To force the usage of conditional moves, a feature of gcc was used that allows the programmer to specify small amounts of assembly code to be inserted into the regular machine code generated by gcc, called inline assembly [Fre19]. This mechanic and the notation are further explained in section 2.1.4.

2.1.4 Assembly Code

Assembly code represents the machine instructions executed by the CPU. It can be given as the actual opcodes or as human-readable text. There are two different conventions for the textual representation, the Intel (or MASM) syntax and the AT&T syntax. The main differences are:

                 Intel                              AT&T
Operand size     The size of the operand does       The size of the operand is appended to the
                 not have to be specified           instruction: b (byte = 8 bit), l (long = 32 bit),
                                                    q (quad-word = 64 bit)
Parameter order  The destination is written first,  The source is written first, then the
                 then the source of the value:      destination: movq src,dest
                 mov dest,src

In this thesis, only the AT&T syntax will be used.

The gcc C++ compiler has a feature that allows the programmer to write assembly instructions in between regular C++ code, called "inline assembly" (asm) [Fre19]. A set of assembly instructions to be executed must be given, followed by a definition of input and output variables and a list of clobbered registers. This extra information is there to communicate to the compiler what is happening inside the asm block. Gcc itself does not parse or optimize the given assembly statements; they are only added into the generated assembly code by the GNU Assembler after compilation. A variable being in the output list means that its value will be modified; a clobbered register is one where gcc cannot assume that the value it held before the asm block will be the same after the block. In this thesis, the clobbered registers will almost always be the condition-code registers (cc), which include the carry flag, zero flag and sign flag, which are modified during a compare instruction. This way of specifying the input, output and clobbered registers is also called extended asm.

Taking the code from 2.1.3, and assuming TValueType = uint64_t, the statement

if (temp > right) { left = right; }

can now be written as

__asm__(
    "cmpq %[temp],%[right]\n\t"    // performs right - temp internally
    "cmovbq %[right],%[left]\n\t"  // left = right, if right < temp
    : [left] "=&r"(left)                               // output
    : "0"(left), [right] "r"(right), [temp] "r"(temp)  // input
    : "cc"                                             // clobber
);

In extended asm, one can define C++ variables as input or output operands, and gcc will assign a register for that value (if it has the "r" modifier), and also write the value in an output register back to the given variable after the asm block. Note that the names in square brackets are symbolic names only valid in the context of the assembly instructions and independent from the names in the C++ code before. The link between the C++ names and the symbolic names happens in the input and output declarations.

With the conditional moves it is important to declare the input and output variables properly, because they perform a task that is a bit unusual: an output variable may be overwritten, but may also be left untouched. For the output register for left, two things must apply:

(i) if the condition is false, it must hold the value of left, and
(ii) if the condition is true, it must hold the value of right.

For optimization purposes, the compiler might reduce the number of registers used by placing the output of one operation into a register that previously held the input of some other operation. To prevent this, the declaration for the output [left] "=&r"(left) has the "&" modifier added to it, meaning it is an "early clobber" register and that no other input can be placed in that register. In combination with "0"(left) in the input line, it is also tied to an input, so that the previous value of left is loaded into the register beforehand, to comply with constraint (i). Because we already declared it as output, instead of giving it a new symbolic name we tie it to the output by referencing its index in the output list, which, since it is the first output variable, is "0". The "=" in the output declaration solely means that this register will be written to. Any output needs to have the "=" modifier.

We see that each assembly instruction is postfixed with \n\t. That is because the instruction strings are appended into a single instruction string during compilation, and \n\t tells the GNU assembler where one instruction ends and the next begins.
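Combining the two guarded assignments from ConditionalSwap2, a complete branch-free swap of two uint64_t values might look like the following sketch. This is our own illustration, assuming an x86-64 target and gcc-style extended asm (with a portable fallback for other targets); the variants actually measured in the thesis are described in section 2.2.2.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Branch-free conditional swap of two 64-bit keys: both guarded
// assignments from ConditionalSwap2 are expressed via cmov.
inline void ConditionalSwapCmov(uint64_t& left, uint64_t& right) {
#if defined(__x86_64__)
    uint64_t temp = left;
    __asm__("cmpq %[temp],%[right]\n\t"    // computes right - temp, sets flags
            "cmovbq %[right],%[left]\n\t"  // left = right, if right < temp
            : [left] "=&r"(left)
            : "0"(left), [right] "r"(right), [temp] "r"(temp)
            : "cc");
    __asm__("cmpq %[temp],%[right]\n\t"    // flags may be stale, compare again
            "cmovbq %[temp],%[right]\n\t"  // right = temp, if right < temp
            : [right] "=&r"(right)
            : "0"(right), [temp] "r"(temp)
            : "cc");
#else
    // Portable fallback for non-x86-64 targets.
    if (right < left) std::swap(left, right);
#endif
}
```

Either way, the pair ends up with the smaller value in left and the larger in right, without a conditional branch on the x86-64 path.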

The cmov instruction is postfixed with a b in this example, which stands for "below". So the cmov will be executed if right is below temp (unsigned comparison right < temp). Apart from below we will also see not equal (ne) and carry (c) as a postfix. In addition to that, both the cmp and the cmovb are postfixed with a q (quad-word) to indicate that the operands are 64-bit values.

When a subtraction minuend - subtrahend is performed and subtrahend is larger than minuend (interpreted as unsigned numbers), the operation causes an underflow, which results in the carry flag being set to 1. The check for the carry flag being 1 can be used as a condition by itself, and the carry flag influences other condition checks like below. This property of the comparison setting the carry flag will be used in section 3.2.
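The relation between unsigned comparison and the carry flag can also be observed from C++. The following illustrative helper (not from the thesis) uses the gcc/clang builtin __builtin_sub_overflow, whose boolean result corresponds to the carry flag after an unsigned subtraction:

```cpp
#include <cassert>
#include <cstdint>

// For unsigned operands, a - b underflows exactly when a < b; this is
// what the CPU's carry flag records after cmp and what cmovb tests.
bool UnsignedBelow(uint64_t a, uint64_t b) {
    uint64_t diff;
    return __builtin_sub_overflow(a, b, &diff);  // true iff a < b
}
```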

2.2 Implementation of Sorting Networks

2.2.1 Providing the Network Frame

The best networks for sizes of up to 16 elements were taken from John Gamble's website [Gam19] and are length-optimal. The Bose Nelson networks have been generated using the instructions from their paper [BN62]. For sizes of 8 and below, the best and the generated networks have the same number of comparators and levels. For sizes larger than 8, the generated networks are at a disadvantage because they have more comparators and/or levels. As a trade-off, their recursive structure makes it possible to leverage a different trait: locality. Instead of optimizing them to sort as parallel as possible, we can first sort the first half of the set, then the second half, and then apply the merger. This way, chances are higher that all n/2 elements of the first half fit into the processor's general-purpose registers. During this part of the sorting routine, no accesses to memory or cache are required. To determine if there is a visible speed-up, the networks were generated optimizing (a) locality and (b) parallelism.

As an extra idea, the Bose Nelson networks were generated in a way that one can pass the elements as separate parameters instead of as an array. That way one can sort elements that are not contiguously placed in memory. Because the networks were implemented as method calls to the smaller sorters and merge methods, there would be a large overhead in placing many elements onto the call stack for each method call. While we hoped this would make a difference by reducing code size, the overhead for the method calls was too large. That is why all the methods are declared inline, which results in the same flat sequence of swaps for each size that the networks optimizing locality have.

Examples of networks for 16 elements can be seen in figures 2, 3 and 4.

All networks are implemented so that they have an entry method that takes a pointer to an array A and an array size n as input and delegates the call to the specific method for that number of elements, which in turn executes all the comparators. To measure different implementations of the conditional swap, the network methods and the swap are templated, so that when calling the network with an array of a specific type, the respective specialized conditional-swap implementation is used.
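Such an entry method might be sketched as follows. This is our illustrative reconstruction, not the thesis code: the per-size sorters (NetworkSort2, NetworkSort3, ...) and the swap helper are hypothetical names, and the real implementation provides one case per size up to 16 with a templated conditional swap.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Illustrative conditional swap; the thesis templates this so that
// specialized (e.g. cmov-based) implementations can be substituted.
template <typename TValueType>
inline void ConditionalSwapDef(TValueType& left, TValueType& right) {
    if (right < left) std::swap(left, right);
}

// Hypothetical per-size network sorters, each a fixed comparator sequence.
template <typename TValueType>
inline void NetworkSort2(TValueType* a) {
    ConditionalSwapDef(a[0], a[1]);
}

template <typename TValueType>
inline void NetworkSort3(TValueType* a) {
    ConditionalSwapDef(a[1], a[2]);
    ConditionalSwapDef(a[0], a[2]);
    ConditionalSwapDef(a[0], a[1]);
}

// Entry method: delegates to the explicit network for the given size.
template <typename TValueType>
void SortNetwork(TValueType* a, std::size_t n) {
    switch (n) {
        case 0: case 1: return;
        case 2: NetworkSort2(a); return;
        case 3: NetworkSort3(a); return;
        // ... one case per supported size, up to 16 ...
        default: return;  // larger sizes are handled by the caller
    }
}
```

Because everything is templated and inlined, each case expands into the flat comparator sequence for that size and element type.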


Figure 2: Best network with optimal length for 16 elements

Figure 3: Bose Nelson network for 16 elements optimizing locality

Figure 4: Bose Nelson network for 16 elements optimizing parallelism


Our approach differs from the work in [CCNS17] in the type of elements that are sorted. While they measured the sorting of ints, which are usually 32-bit integers, we made the decision to sort elements that consist of a 64-bit integer key and a 64-bit integer reference value, enabling not only the sorting of numbers but also the sorting of complex elements, by giving a pointer or an array index into the original set as the reference value. This was implemented by creating structs that each contain a key and a reference value, with the following structure:

struct SortableRef
{
    uint64_t key, reference;
};

The structs also define the operators >, >=, ==, <, <= and != for reasons of usability, and templated methods uint64_t GetKey(TSortable) and uint64_t GetReference(TSortable) are available.
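To make the description above concrete, the following is a minimal, self-contained sketch of such a struct with the mentioned operators and accessors. It is an illustration, not the thesis code: the assumption that operator< compares only the keys (so that the reference value tags along as a payload) is ours.

```cpp
#include <cstdint>

// Hedged sketch: a key/reference pair as described in the text.
// Assumption: ordering is defined on the key alone; the reference
// value is an opaque payload (e.g. a pointer or index).
struct SortableRef {
    uint64_t key, reference;
};

inline bool operator<(const SortableRef& a, const SortableRef& b) { return a.key < b.key; }
inline bool operator>(const SortableRef& a, const SortableRef& b) { return b < a; }
inline bool operator<=(const SortableRef& a, const SortableRef& b) { return !(b < a); }
inline bool operator>=(const SortableRef& a, const SortableRef& b) { return !(a < b); }
inline bool operator==(const SortableRef& a, const SortableRef& b) { return a.key == b.key; }
inline bool operator!=(const SortableRef& a, const SortableRef& b) { return !(a == b); }

// Templated accessors as mentioned in the text.
template <typename TSortable>
uint64_t GetKey(const TSortable& s) { return s.key; }
template <typename TSortable>
uint64_t GetReference(const TSortable& s) { return s.reference; }
```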

2.2.2 Implementing the Conditional Swap

The ConditionalSwap is implemented as a templated method like this:

template <typename TValueType>
inline
void ConditionalSwap(TValueType& left, TValueType& right)
{
    // body
}

The following variants represent the body of one specialization of the template function for a specific struct. Each of them was given a three-letter abbreviation to name it in the results. We implemented the following approaches:

• using std::swap (Def)
• using inline if statements (QMa)
• using std::tie and std::tuple (Tie)
• using jmp and xchg (JXc)
• using four cmovs and temp variables (4Cm)
• using four cmovs split from one another and temp variables (4CS)
• using six cmovs and temp variables (6Cm)
• moving pointers with cmov instead of values (Cla)
• moving pointers and supporting a predicate (CPr)

The details of implementation can be seen in the following paragraphs.

using std::swap (Def) The default implementation for the template makes use of the defined < operator:

if (right < left)
    std::swap(left, right);

This is the intuitive way of writing the conditional swap that we already saw in section 2.1.3, without any inline assembly.


using inline if statements (QMa)

bool r = (left > right);
auto temp = left;
left = r ? right : left;
right = r ? temp : right;

Here it was attempted to convince the compiler to generate conditional moves by using inline if-statements with trivial values in the else part.
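The QMa idea can be seen in isolation in the following compilable sketch on plain integers (the thesis applies it to the key/reference structs). Whether the compiler actually lowers the ternaries to cmov depends on compiler and flags; inspecting the generated assembly (e.g. with -S) is the only way to be sure.

```cpp
#include <cstdint>

// Sketch of the QMa variant on plain integers: ternaries with a
// trivial else-branch, written in the hope that the compiler emits
// conditional moves instead of branches.
inline void ConditionalSwapQMa(uint64_t& left, uint64_t& right) {
    bool r = (left > right);
    uint64_t temp = left;
    left  = r ? right : left;   // take the smaller value
    right = r ? temp  : right;  // take the larger value
}
```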

using std::tie and std::tuple (Tie)

std::tie(left, right) = (right < left)
    ? std::make_tuple(right, left)
    : std::make_tuple(left, right);

This approach uses assignable tuples (tie).

using jmp and xchg (JXc)

__asm__(
    "cmpq %[left_key],%[right_key]\n\t"
    "jae %=f\n\t"
    "xchg %[left_key],%[right_key]\n\t"
    "xchg %[left_reference],%[right_reference]\n\t"
    "%=:\n\t"
    : [left_key] "=&r"(left.key), [right_key] "=&r"(right.key),
      [left_reference] "=&r"(left.reference),
      [right_reference] "=&r"(right.reference)
    : "0"(left.key), "1"(right.key), "2"(left.reference), "3"(right.reference)
    : "cc"
);

The %= generates a unique label for each instance of the asm statement, so that the jumps go where they belong.

using four cmovs and temp variables (4Cm)

uint64_t tmp = left.key;
uint64_t tmpRef = left.reference;
__asm__(
    "cmpq %[left_key],%[right_key]\n\t"
    "cmovbq %[right_key],%[left_key]\n\t"
    "cmovbq %[right_reference],%[left_reference]\n\t"
    "cmovbq %[tmp],%[right_key]\n\t"
    "cmovbq %[tmp_ref],%[right_reference]\n\t"
    : [left_key] "=&r"(left.key), [right_key] "=&r"(right.key),
      [left_reference] "=&r"(left.reference),
      [right_reference] "=&r"(right.reference)
    : "0"(left.key), "1"(right.key), "2"(left.reference), "3"(right.reference),
      [tmp] "r"(tmp), [tmp_ref] "r"(tmpRef)
    : "cc"
);


using four cmovs split from one another and temp variables (4CS)

uint64_t tmp = left.key;
uint64_t tmpRef = left.reference;
__asm__ volatile (
    "cmpq %[left_key],%[right_key]\n\t"
    :
    : [left_key] "r"(left.key), [right_key] "r"(right.key)
    : "cc"
);
__asm__ volatile (
    "cmovbq %[right_key],%[left_key]\n\t"
    : [left_key] "=&r"(left.key)
    : "0"(left.key), [right_key] "r"(right.key)
    :
);
__asm__ volatile (
    "cmovbq %[right_reference],%[left_reference]\n\t"
    : [left_reference] "=&r"(left.reference)
    : "0"(left.reference), [right_reference] "r"(right.reference)
    :
);
__asm__ volatile (
    "cmovbq %[tmp],%[right_key]\n\t"
    : [right_key] "=&r"(right.key)
    : "0"(right.key), [tmp] "r"(tmp)
    :
);
__asm__ volatile (
    "cmovbq %[tmp_ref],%[right_reference]\n\t"
    : [right_reference] "=&r"(right.reference)
    : "0"(right.reference), [tmp_ref] "r"(tmpRef)
    :
);

Because we split the asm blocks, they have to be declared volatile so that the optimizer does not move them around or out of order. Without declaring them volatile, some of the networks were not sorting correctly. The blocks were split because we hoped the compiler would be able to insert operations between the cmp instruction and the conditional moves that do not affect the condition codes and are unrelated to the current conditional swap, reducing the number of wait cycles that have to be performed. This was successful, as can be seen in the experimental results in section 4.4.


using six cmovs and temp variables (6Cm)

uint64_t tmp;
uint64_t tmpRef;
__asm__ (
    "cmpq %[left_key],%[right_key]\n\t"
    "cmovbq %[left_key],%[tmp]\n\t"
    "cmovbq %[left_reference],%[tmp_ref]\n\t"
    "cmovbq %[right_key],%[left_key]\n\t"
    "cmovbq %[right_reference],%[left_reference]\n\t"
    "cmovbq %[tmp],%[right_key]\n\t"
    "cmovbq %[tmp_ref],%[right_reference]\n\t"
    : [left_key] "=&r"(left.key), [right_key] "=&r"(right.key),
      [left_reference] "=&r"(left.reference),
      [right_reference] "=&r"(right.reference),
      [tmp] "=&r"(tmp), [tmp_ref] "=&r"(tmpRef)
    : "0"(left.key), "1"(right.key), "2"(left.reference), "3"(right.reference),
      "4"(tmp), "5"(tmpRef)
    : "cc"
);

moving pointers with cmov instead of values (Cla) This idea came from a result created by the clang compiler for the special code seen in the ConditionalSwap2 method in section 2.1.3. For the transformation to gcc, we took only the minimal necessary instructions concerning the conditional move into the asm block:

SortableRef_ClangVersion* leftPointer = &left;
SortableRef_ClangVersion* rightPointer = &right;
uint64_t rightKey = right.key;
SortableRef_ClangVersion tmp = left;
__asm__ volatile(
    "cmpq %[tmp_key],%[right_key]\n\t"
    "cmovbq %[right_pointer],%[left_pointer]\n\t"
    : [left_pointer] "=&r"(leftPointer)
    : "0"(leftPointer), [right_pointer] "r"(rightPointer),
      [tmp_key] "m"(tmp.key), [right_key] "r"(rightKey)
    : "cc"
);
left = *leftPointer;
leftPointer = &tmp;
__asm__ volatile(
    "cmovbq %[left_pointer],%[right_pointer]\n\t"
    : [right_pointer] "=&r"(rightPointer)
    : "0"(rightPointer), [left_pointer] "r"(leftPointer)
    :
);
right = *rightPointer;


moving pointers and supporting a predicate (CPr) Instead of performing the comparison inside the asm block, which requires knowledge of the datatype of the key, it can also be done via a predicate, using the result of that comparison inside the inline assembly:

SortableRef_ClangPredicate* leftPointer = &left;
SortableRef_ClangPredicate* rightPointer = &right;
SortableRef_ClangPredicate temp = left;
int predicateResult = (int) (right < temp);
__asm__ volatile(
    "cmp $0,%[predResult]\n\t"
    "cmovneq %[right_pointer],%[left_pointer]\n\t"
    : [left_pointer] "=&r"(leftPointer)
    : "0"(leftPointer), [right_pointer] "r"(rightPointer),
      [predResult] "r"(predicateResult)
    : "cc"
);
left = *leftPointer;
leftPointer = &temp;
__asm__ volatile(
    "cmovneq %[left_pointer],%[right_pointer]\n\t"
    : [right_pointer] "=&r"(rightPointer)
    : "0"(rightPointer), [left_pointer] "r"(leftPointer)
    :
);
right = *rightPointer;

For the Cla implementation the postfix b in cmovb was used to execute the conditional move if right_key was smaller than temp_key. If that is the case, the predicate returns true, or, as an int, a value not equal to zero. When comparing this result to 0, the cmov is to be executed if the result was any value other than zero, so the postfix here is ne (not equal). Note that while the knowledge of how to compare the elements is still present by doing the comparison directly (right < temp), the compiler now needs to take the result of the comparison and put it into an integer that is then used in the asm block. The only addition needed to make it completely independent of the sorted elements would be to pass a predicate to do the comparison, which would also involve modifying the network frame to take and pass the predicate. To measure on the same network frame, we took the shortcut of doing the comparison using the < operator.


3 Register Sample Sort

3.1 Preliminaries

Sample sort is a sorting algorithm that follows the divide-and-conquer principle. The input is separated into k subsets that each contain the elements within one interval of the total ordering, with the intervals being distinct from one another. That is done by first choosing a subset S of a · k elements and sorting S. Afterwards the splitters {s_0, s_1, ..., s_{k-1}, s_k} = {−∞, S_a, S_{2a}, ..., S_{(k−1)a}, ∞} are taken from S. The parameter a denotes the oversampling factor. Oversampling is used to get a better sample of splitters to achieve more evenly sized partitions, trading for the time that is required to sort the larger sample.

With the splitters, the elements e_i are then classified, placing them into buckets b_j, where j ∈ {1, ..., k} and s_{j−1} < e_i ≤ s_j. For k being a power of 2, this placement can be achieved by viewing the splitters as a binary tree, with s_{k/2} being the root, all s_l with l < k/2 representing the left subtree and those with l > k/2 the right one. To place an element, one only has to traverse this binary tree, resulting in a binary search instead of a linear one [SW04].

Quicksort is therefore a specialization of sample sort with fixed parameter k = 2, having only one splitter, the pivot, and splitting the input into two partitions.
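The sampling step described above can be sketched as follows. This is an illustrative implementation under our own naming (PickSplitters, sample, k, a are not from the thesis): sort the oversampled sample of a · k elements and keep every a-th element as a splitter, leaving the sentinels −∞ and ∞ implicit.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstddef>
#include <vector>

// Hedged sketch of splitter selection with oversampling factor a:
// sort the sample S of a*k elements, then take S_a, S_2a, ...,
// S_{(k-1)a} as the k-1 finite splitters.
std::vector<uint64_t> PickSplitters(std::vector<uint64_t> sample,
                                    std::size_t k, std::size_t a) {
    std::sort(sample.begin(), sample.end());      // sort the sample S
    std::vector<uint64_t> splitters;
    for (std::size_t j = 1; j < k; ++j)
        splitters.push_back(sample[j * a - 1]);   // S_{ja} with 1-based indexing
    return splitters;
}
```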

3.2 Implementing Sample Sort for Medium-Sized Sets

The motivation to look at sample sort was that we wanted to see how well the sorting networks perform when using them as a base case for the In-Place Parallel Super Scalar Samplesort (IPS4o) by Michael Axtmann, Sascha Witt, Daniel Ferizovic and Peter Sanders [AWFS17]. The problem that occurred is that IPS4o can go into the base case with sizes larger than 16, while the networks we looked at only sort sets of up to 16 elements. To close that gap, we created a sequential version of Super Scalar Sample Sort [SW04] that can reduce base case sizes of up to 256 down to blocks of 16 or fewer in an efficient manner.

Since the total size was expected not to be much greater than 256, not much effort was made to keep the algorithm in-place. The central idea was to place the splitters not into an array, as described in [SW04], but to hold them in general-purpose registers for the whole duration of the element classification.

The question now arose as to which splitter an element needs to be compared to after the first comparison with the middle splitter. When the splitters are organized as a binary heap in an array, that can be done by using array indices, the children of splitter j being at positions 2j and 2j + 1. If an element is smaller than s_j, it is afterwards compared to s_2j, otherwise to s_{2j+1}. But this way of accessing the splitters does not work when they are placed in registers. The solution was to create a copy of the left subtree, and to conditionally overwrite it with the right subtree, should the element be greater than the root node. The next comparison is then made against the root of the temporary tree, which now contains the correct splitters to compare that element against. For 3 splitters that requires 1 conditional move per element; for 7 splitters it would require 3 conditional moves after the first comparison and 1 more after the second comparison, per element.

After finding the correct splitters to compare to, we are left with one more problem: how to know which bucket the element is to be placed into at the end. In [SW04] this was done by making use of the calculated index determining the next splitter to compare to. We chose an approach similar to creating this index, using the correlation between binary numbers and the tree-like structure of the splitters. We will be viewing the splitters not as a binary heap but just as a list where the middle of the list represents the root node of the tree, its children being


the middle element of the left list and the middle element of the right list.

If an element e_i is larger than the first splitter s_{k/2} (with k − 1 being the number of splitters), it must be placed in a bucket b_j with j ≥ k/2 (assuming 0-based indexing for b). That also means that the index of that bucket, represented as a binary number, must have its bit at position l := log2(k/2) set to 1. That way, the result of the comparison (e_i > s_{k/2}) can be interpreted as an integer (1 for true, 0 for false) and added to j. If that was not the last comparison, j is then multiplied by 2 (meaning its bits are shifted left by one position). This means the bit from the first comparison makes its way "left" in the binary representation while the comparison traverses down the tree, and so forth with the other comparisons. After traversing the splitter tree to the end, e_i will have been compared to the correct splitters and j will hold the index of the bucket that e_i belongs into. These operations can be implemented without branches by making use of the way comparisons are done:

At the end of section 2.1.4 we explained that when comparing (unsigned) numbers (which is nothing but a subtraction), with the subtrahend being greater than the minuend, the operation causes an underflow and the carry flag is set. We also notice that when converting the result of the predicate (e_i > s_{k/2}) to an integer value, the integer will be 1 for true and 0 for false. So in assembly code, we can compare the result of evaluating the predicate to the value 0: cmp %[predResult],%[zero], where zero is just a register that holds the value 0. This trick is needed because the cmp instruction needs the second operand to be a register. This will execute 0 − predResult, which underflows if the predicate returned true. This way we can postfix the cmov needed for moving the next splitters with a c, checking for a set carry flag. The second instruction we make use of is rotate carry left (rcl), which performs a rotate left on j, but includes the carry flag as an additional bit after the least significant bit of the integer. This exactly takes the predicate result and puts it at the bottom of j, with the previous content being shifted one to the left beforehand. That means it performs two necessary operations at once.

In addition to the efficient classification, while looping over the elements we allow placing multiple elements into buckets per loop iteration, so that all the registers in the machine can be used. This additional parameter is called blockSize.

There is one downside to this approach: the keys of the splitters (since we only need a splitter's key for classifying an element) must be small enough to fit into a general-purpose register. Needing more than one register per key would mean either running out of registers or spending extra time to conditionally move the splitter keys around. For three splitters, the number of registers needed for block sizes 1 to 5 is shown in table 1. We can see that the trade-off for classifying multiple elements at the same time is the number of registers needed.

If we were to use 7 splitters instead of three, the number of registers required for classifying just 1 element at a time would go up to 15. Also, with 8 buckets, if we get recursive subproblems with sizes just over 16, classifying into 8 buckets again would be greatly inefficient, resulting in many buckets containing very few elements. This is why we decided to use only three splitters for this particular sorter.

Pseudocode for the classification is given, as an example for an array of integers and blockSize = 1, in algorithm 1. j is there called state, and the temporary subtree consists of one splitter, which we gave the name splitterx. For the branchless implementation we used cmovc for line 9 and the rcl instruction for line 10. At the last level of classification no more moving of splitters is required, so instead of doing another comparison against the predicate result and using rcl, we can just shift state left by one position and add the predicate's result to it (line 12). Alternatively we could use a bitwise OR or XOR after the shift, which would have the same result, but we decided that adding the predicate result was more readable.

For sorting the splitter sample, the same sorting method can be used as for the base case.


                          3 splitters            7 splitters
                          block size             block size
                          1   2   3   4   5      1   2   3   4   5
splitters                 3   3   3   3   3      7   7   7   7   7
buckets pointer           1   1   1   1   1      1   1   1   1   1
current element index     1   1   1   1   1      1   1   1   1   1
element count             1   1   1   1   1      1   1   1   1   1
state                     1   2   3   4   5      1   2   3   4   5
predicate result          1   2   3   4   5      1   2   3   4   5
splitterx                 1   2   3   4   5      3   6   9  12  15
sum                       9  12  15  18  21     15  20  25  30  35

Table 1: Registers required by Register Sample Sort with three or seven splitters

Algorithm 1: Register Sample Sort Classification(array, elementCount, predicate)
 1  int splitter0, splitter1, splitter2 ← determineSplitters()
 2  int state, predicateResult, splitterx
 3  int* b0, b1, b2, b3 ← allocateBuckets(elementCount)
 4  for 1 ≤ i ≤ elementCount do
 5      state ← 0
 6      predicateResult ← (int) predicate(splitter1 < array[i])
 7      splitterx ← splitter0
 8      if predicateResult > 0 then
 9          splitterx ← splitter2
10          state ← (state « 1) + 1
11      predicateResult ← (int) predicate(splitterx < array[i])
12      state ← (state « 1) + predicateResult
13      place array[i] in bucket b_state
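The classification above can be written down as a portable, runnable sketch. The ternary selecting splitterx stands in for the cmovc, and the shift-and-add state update stands in for the rcl trick; the variable names follow the pseudocode, while the container types and function signature are our own illustrative choices.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Portable sketch of Algorithm 1: classify each element into one of
// four buckets using three splitters (splitter0 < splitter1 < splitter2).
// Elements equal to a splitter go into the bucket to its left, matching
// s_{j-1} < e_i <= s_j from the text.
std::array<std::vector<uint64_t>, 4>
Classify(const std::vector<uint64_t>& array,
         uint64_t splitter0, uint64_t splitter1, uint64_t splitter2) {
    std::array<std::vector<uint64_t>, 4> buckets;
    for (uint64_t e : array) {
        int state = (int)(splitter1 < e);                 // root comparison
        uint64_t splitterx = state ? splitter2 : splitter0; // pick the subtree (cmov-like)
        state = (state << 1) + (int)(splitterx < e);      // shift in the second bit (rcl-like)
        buckets[state].push_back(e);
    }
    return buckets;
}
```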


4 Experimental Results

In the tests we ran, different sorting algorithms and conditional-swap implementations were compared. For details about the different sorters and swaps, refer to section 2.2. The names of the sorters are built from abbreviations that match the following format:

(i) It starts with an I or an N, indicating whether the algorithm used is insertion sort or a sorting network.
    • In case of sorting networks, whether it is a Best network or a Bose Nelson network (BoNe).
        – For a Bose Nelson network, whether it was optimized for Locality (L), Parallelism (P), or generated to take the items as single parameters M (see section 2.2).

(ii) Then follows the type of benchmark: -N for sorting one set of items ("normal sort", section 4.4), -I for sorting many contiguous sets of items ("inrow sort", section 4.5), -S for sorting with Sample Sort (section 4.7), -Q for sorting with quicksort (section 4.6), and -4 for sorting with IPS4o (section 4.8).
    • In case of Sample Sort, the parameters numberOfSplitters, oversamplingFactor and blockSize are appended as numbers.

(iii) Lastly, the name of the struct used for the template specialization is appended (see section 2.1.3 for the abbreviations of the conditional swaps), as well as a single K for elements that have only a key and KR for those that have a key and a reference value.

Where std::sort was run for comparison, the name in step (i) is StdSort. For example, when measuring sample sort with parameters 332 and a Bose Nelson network optimizing parallelism as the base case with conditional swap 4CS, the sorter name would be N BoNeP -S332 KR 4CS.

4.1 Environment

Machine Name          A                         B                         C
CPU                   2 x Intel Xeon 8-core     2 x Intel Xeon 12-core    AMD Ryzen 8-core
                      E5-2650 v2, 2.6 GHz       E5-2670 v3, 2.3 GHz       1800X, 3.6 GHz
RAM                   128 GiB DDR3              128 GiB DDR4              32 GB DDR4
L1 Cache (per core)   32 KiB I + 32 KiB D       32 KiB I + 32 KiB D       64 KiB I + 32 KiB D
L2 Cache (per core)   256 KiB                   256 KiB                   512 KiB
L3 Cache (total)      20 MiB                    30 MiB                    16 MiB [8 MiB]

Table 2: Hardware properties of the machines used

As compiler, the gcc C++ compiler in version 7.3.0 was used with the -O3 flag. The measurements were done with only essential processes running on the machine apart from the measurement. To prevent the process from being migrated to another core during execution, it was run with taskset 0x1. In total, three different machines were used to do the measurements. Their hardware properties can be seen in table 2. "I" and "D" refer to dedicated Instruction and Data caches. Also note that while the AMD Ryzen's L3 cache has a total size of 16 MiB, it is divided into two 8 MiB caches that are exclusive to 4 cores each. Since all measurements were done on a single core, the L3 cache size in brackets is the one available to the program. The operating system on all machines was Ubuntu 18.04.


4.2 Generating Plots

Due to the high number of dimensions in the measurements (machine the measurement is run on, type of network, conditional-swap implementation, array size), the results could not always be plotted two-dimensionally. We used box-plots where applicable to show more than just an average value for a measurement. The box encloses all values between the first quartile (1Q) and third quartile (3Q). The line in the middle shows the median. Further, the inter-quartile range (IQR) is calculated as the distance between first and third quartile. The lines (called whiskers) left and right of the boxes extend to the smallest value greater than 1Q − 1.5 · IQR and the greatest value smaller than 3Q + 1.5 · IQR, respectively. Values outside these ranges are called outliers and are shown as individual dots.
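The box-plot statistics described above can be sketched in a few lines. This is an illustration under one particular quartile convention (Tukey's median-of-halves; the thesis does not state which convention its plotting tool uses), and the names BoxStats and ComputeBoxStats are our own.

```cpp
#include <algorithm>
#include <vector>

// Hedged sketch of the box-plot quantities from the text: quartiles,
// IQR, and the whisker endpoints (extreme data points still inside
// the 1.5*IQR fences). Assumes a non-empty sample with at least one
// point inside each fence.
struct BoxStats { double q1, q3, iqr, lowerWhisker, upperWhisker; };

static double median(const std::vector<double>& s, std::size_t lo, std::size_t hi) {
    std::size_t n = hi - lo, mid = lo + n / 2;   // s[lo, hi) must be sorted
    return (n % 2) ? s[mid] : (s[mid - 1] + s[mid]) / 2.0;
}

BoxStats ComputeBoxStats(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    std::size_t n = v.size();
    BoxStats b;
    b.q1 = median(v, 0, n / 2);                  // median of the lower half
    b.q3 = median(v, (n + 1) / 2, n);            // median of the upper half
    b.iqr = b.q3 - b.q1;
    double lo = b.q1 - 1.5 * b.iqr, hi = b.q3 + 1.5 * b.iqr;
    b.lowerWhisker = *std::lower_bound(v.begin(), v.end(), lo);       // smallest value >= fence
    b.upperWhisker = *(std::upper_bound(v.begin(), v.end(), hi) - 1); // greatest value <= fence
    return b;
}
```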

4.3 Conducting the Measurements

Random Numbers In order to measure the time needed to sort some data, one has to have data first. For these measurements, the data consisted of pairs of a 64-bit unsigned integer key and a 64-bit unsigned integer reference value. Those were generated as uniformly distributed random numbers by a lightweight implementation of the std::minstd_rand generator from the C++ <random> library, which works as follows: first a seed is set, taken e.g. from the current time. When a new random number is requested, the generator calculates seed = seed · 48271 % 2147483647 and returns the current seed. The numbers generated this way do not use all 64 available bits, which is only for practicality with the permutation check, as will be seen below. For each measurement i, a new seed_i is taken from the current time. The same seed_i is then set before the execution of each sorter, to provide all sorters with the same random inputs.
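The generator recurrence described above is small enough to show in full. This is a sketch under our own naming (MinStdRand, Next); the constants 48271 and 2147483647 are those of the minstd "minimal standard" generator from the text.

```cpp
#include <cstdint>

// Lightweight minstd-style generator: seed = seed * 48271 % 2147483647,
// returning the updated seed as the next random number.
struct MinStdRand {
    uint64_t seed;
    explicit MinStdRand(uint64_t s) : seed(s) {}
    uint64_t Next() {
        seed = seed * 48271u % 2147483647u;
        return seed;
    }
};
```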

Measuring The actual measuring was done via Linux's PERF_EVENT interface, which allows fine-grained measurements. Here, the number of CPU cycles spent on sorting was the unit of measurement. That also means that the results do not depend on clock speeds (e.g. when overclocking), but only on the CPU's architecture.

Compilation When we started this project, it was only a single source file (.cpp) with an increasing number of headers that were all included in that single file. That is also due to the fact that templated methods cannot be placed in source files because they need to be visible to all including files at compile time. The increasing amount of code and the many different templates brought the compiler to a point where it took over a minute to compile the project. The problem we encountered was that the compiler only gives itself a limited amount of time for compiling a single source file. In order to stay within the time boundaries for a single file, the optimization became poor. We saw measurements being slower for no apparent reason. To solve that problem, we used code generation to create source files that each contain an acceptable number of methods that initiate part of a measurement in a wrapper method. This way, from the main source file we only need to call the correct wrapper methods to perform the measurements, and with this we achieved results that were more stable and reproducible.

For compilation, the flag -O3 was used to achieve high optimization and speed. That also means that, without using the sorted data in some way, the compiler would deem the result unimportant and skip the sorting altogether. That is why after each sort, to generate a side-effect, the set is checked for two properties: that it is sorted, and that it is a permutation of the previously generated set. The first can easily be done by checking for each value that the value before it is not greater than it.
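The sortedness check described above amounts to a single pass over the array. A minimal sketch, with signature and name assumed by us:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the sortedness side-effect check: for each value, verify
// that the value before it is not greater than it.
bool CheckSorted(const std::vector<uint64_t>& a) {
    for (std::size_t i = 1; i < a.size(); ++i)
        if (a[i - 1] > a[i]) return false;  // predecessor greater -> not sorted
    return true;
}
```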


Permutation Check The permutation check is done probabilistically: at design time, a (preferably large) prime number p is chosen. Before sorting, v = ∏_{i=1}^{n} (z − a_i) mod p is calculated for a number z and values a = {a_1, ..., a_n}. To check the permutation after sorting and obtaining a′ = {a′_1, ..., a′_n}, w = ∏_{i=1}^{n} (z − a′_i) mod p is calculated. If v ≠ w, a′ cannot be a permutation of a. If v = w, we claim that a′ is a permutation of a. To minimize the chances of a′ not being a permutation of a but v being equal to w, v = 0 was disallowed in the first step: if v is zero, z is incremented by one and the product calculated again, until v ≠ 0.

Benchmarks The benchmark seen in algorithm 2 was used for most of the measurements. To reduce the chance of cache misses at the beginning of the measurement, one warmup run of random generation, sorting and sorted-checking is done beforehand (lines 5 to 7). The array is then sorted numberOfIterations times and checked for the sorted and permutation properties. After that, only the generation of the random numbers and the sorted and permutation checking is measured, to later subtract this time from the previously measured one, resulting in the time needed for the sorting alone. Since this is not deterministic in time, and both measurements are subject to their own deviation, it can occasionally happen that the second measurement takes longer than the first, even though less work has been done. We get those negative times more often for the sorters with small array sizes, where the sorting itself takes relatively little time compared to the random generation and sorted checking. The negative times show up as outliers in the results.

The function simulateCheckSorted performs the permutation check like checkSorted, but since randomly generated arrays are rarely ordered, instead of checking for each element whether it is smaller than its predecessor, it checks for equality. That should never happen with the random number generator used, and the check thus runs for the same number of cycles.

The function MeasureSorting is called a total of numberOfMeasures times for each arraySize that is sorted.

For the measurements shown in section 4.5, the benchmark was slightly modified, as can be seen in algorithm 3. Here the goal was to look at cache and memory effects by creating an array that does not fit into the CPU's L3 cache, and then filling the cache with something else, in this case the reference array. We then split the original array into many blocks of size arraySize and sort each independently. Because we have to create the whole array at the beginning, we can generate the numbers before and check for correct sorting after measuring, so there is no need to do a second measurement like in the first benchmark (lines 15 to 21 in algorithm 2). Here, instead of giving a numberOfIterations parameter to indicate how often the sorting is to be executed, we provide a numberOfArrays value that says how many arrays of size arraySize are to be created contiguously. This parameter is chosen for each arraySize in a way that numberOfArrays × arraySize does not fit into the L3 cache of the machine the measurement is performed on.


Algorithm 2: MeasureSorting(arraySize, numberOfIterations, seed)
 1  foreach sorter do
 2      setSeed(seed)
 3      arr ← makeArray(arraySize)
 4      numberOfBadSorts ← 0
 5      arr ← generateRandomArray()
 6      sorter(arr)
 7      checkSorted(arr)                     // create side-effect
 8      startMeasuring()
 9      for i ← 0 to numberOfIterations do
10          arr ← generateRandomArray()
11          sorter(arr)
12          checkSorted(arr)                 // create side-effect
13      stopMeasuring()
14      outputResult()
15      setSeed(seed)
16      startMeasuring()
17      for i ← 0 to numberOfIterations do
18          arr ← generateRandomArray()
19          simulateCheckSorted(arr)         // create side-effect
20      stopMeasuring()
21      outputResult()

Algorithm 3: MeasureSortingInRow(arraySize, numberOfArrays, seed)
 1  foreach sorter do
 2      SetSeed(seed)
 3      arr ← makeArray(arraySize × numberOfArrays)
 4      arr ← GenerateRandomArray()
 5      compareArr ← makeArray(arraySize × numberOfArrays)
 6      compareArr ← CopyArray(arr)
 7      foreach currentArr in compareArr of size arraySize do
 8          sort(currentArr, arraySize)      // sort reference array
 9      // warmup on single array of size arraySize like in algorithm 2, lines 5 to 7
10      StartMeasuring()
11      foreach currentArr in arr of size arraySize do
12          sorter(currentArr, arraySize)
13      StopMeasuring()
14      CheckArraysForEquality(arr, compareArr)   // check correct sorting, create side-effect
15      OutputResult()


4.4 Sorting One Set of 2-16 items

The benchmark from algorithm 2 was used with parameters
• numberOfIterations = 100
• numberOfMeasures = 500
• arraySize ∈ {2, . . . , 16}.

The results seen in tables 3, 4 and 5 contain the name of the sorter and the average number of cycles per iteration, over the total of all measurements, for machines A, B and C. The algorithm that performed best in a column is marked in bold font, and for each column the value relative to the best in that column was calculated. For each row the geometric mean is calculated over the relative values, and from that the rank is determined. Table 6 contains the geometric mean and rank taking the results from all three machines into consideration.

Here it becomes visible that the implementations that have conditional branches and those that do not are clearly separated by rank: the former occupy the lower share of the ranks, while the latter get all the higher ranks. We see that the claim from section 2.2.2 for the 4CS conditional swap is true for machines A and B, but not for machine C. We also see in table 6 that the first three ranks have the same geometric mean, so the Bose Nelson networks can compete with the optimized networks that have fewer comparators, due to their locality.

The boxplots for array size 8 are given for each machine in figures 5, 6 and 7, showing that these higher-ranked implementations are not only faster on average, but that their distribution is almost entirely faster than any of the insertion sort implementations, together with a lower variance. To improve readability, the variants JXc, 6Cm and QMa are omitted. Also, one outlier with value −42.6 was removed from the dataset of machine B for the 'N BoNeL -N KR Cla' sorter, so that the plot has a scale similar to those of the other two machines, to improve comparability. The result set for machine A contains a lot of outliers that we did not want to exclude. To be able to compare it easily with the other two plots, we added an additional axis at the top that shows the CPU cycles per iteration as percentages, where the average of the best insertion sort is 100%.

To see a trend with increasing array size, we chose a few conditional-swap implementations that do best for more than one network and array size on all machines. Their average sorting times can be seen in figures 8, 9 and 10. For visibility reasons, we omitted the Bose Nelson Parameter networks in these plots. What we already saw from the tables is visible here as well: the 4Cm and 4CS implementations have good performance and are almost always faster on average than insertion sort (apart from arraySize = 2 on machine A).

These results indicate that there is potential in using sorting networks, showing an improvement of 32% of the best network over the best insertion sort, on average, over all array sizes. A problem with this way of measurement is that the same space in memory is sorted over and over again, which is rarely the case when sorting a base case. Because of this, the measurements probably reflect unrealistic conditions regarding cache accesses and cache misses. To get a bit closer to actual base case sorting, the next section takes a different approach so as to not sort the same space in memory twice.



4 Experimental Results

[Table body omitted: the per-sorter overall rank, geometric mean, and cycle counts for array sizes 2–16 were garbled in extraction.]
Table 3: Average number of CPU cycles per iteration of single array sorting on machine A



4.4 Sorting One Set of 2-16 items

[Table body omitted: the per-sorter overall rank, geometric mean, and cycle counts for array sizes 2–16 were garbled in extraction.]
Table 4: Average number of CPU cycles per iteration of single array sorting on machine B


[Table body omitted: the per-sorter overall rank, geometric mean, and cycle counts for array sizes 2–16 were garbled in extraction.]
Table 5: Average number of CPU cycles per iteration of single array sorting on machine C


[Table body omitted: the per-sorter overall rank, geometric mean, and cycle counts for array sizes 2–16 were garbled in extraction.]
Table 6: Average number of CPU cycles per iteration of single array sorting across all machines


[Boxplot omitted: CPU cycles per iteration (and percentage relative to 'I −N KR POp') per sorting algorithm, array size 8.]

Figure 5: Single sort for array size = 8 on machine A

[Boxplot omitted: CPU cycles per iteration (and percentage relative to 'I −N KR POp') per sorting algorithm, array size 8.]

Figure 6: Single sort for array size = 8 on machine B


[Boxplot omitted: CPU cycles per iteration (and percentage relative to 'I −N KR POp') per sorting algorithm, array size 8.]

Figure 7: Single sort for array size = 8 on machine C

[Line plot omitted: CPU cycles per element vs. array size (2–16) for the 4Cm, 4CS, CPr, POp and STL variants of I, N Best, N BoNeL and N BoNeP.]

Figure 8: Single sort of array sizes 2 to 16 on machine A


[Line plot omitted: CPU cycles per element vs. array size (2–16) for the 4Cm, 4CS, CPr, POp and STL variants of I, N Best, N BoNeL and N BoNeP.]

Figure 9: Single sort of array sizes 2 to 16 on machine B

[Line plot omitted: CPU cycles per element vs. array size (2–16) for the 4Cm, 4CS, CPr, POp and STL variants of I, N Best, N BoNeL and N BoNeP.]

Figure 10: Single sort of array sizes 2 to 16 on machine C


[Line plot omitted: CPU cycles per element vs. array size (2–16) for the 4Cm, 4CS, CPr, POp and STL variants of I, N Best, N BoNeL and N BoNeP.]

Figure 11: Continuous sorting of array sizes 2 to 16 on machine A

4.5 Sorting Many Continuous Sets of 2-16 Items

Here the benchmark shown in algorithm 3 was used. Instead of sorting a single array multiple times, multiple arrays are created adjacent to each other and sorted in series. The number of arrays is chosen such that they do not all fit into the CPU's L3 cache. Since the reference array is sorted before the measurement, the original array should not be present in the cache, causing a cache miss on every access.
The results are similar to the previous ones. A difference we can see when comparing figures 11, 12 and 13 to figures 8, 9 and 10 from the single-sort measurement is that the CPr swap, which operates on pointers and moves values around in memory, became worse compared to the 4Cm and 4CS implementations for array sizes greater than 2. With 4Cm and 4CS the values can probably be pre-loaded for the next conditional swap while the current one is finishing, whereas CPr accesses the element's reference value only once the destination address has been calculated, which leaves less opportunity for pre-loading.
The complete overview of the average values of each sorter across all three machines can be seen in table 7. We see speed-ups from using the sorting networks of 25% at array size 2, all the way up to 59% at array size 15.


Figure 12: Continuous sorting of array sizes 2 to 16 on machine B


Figure 13: Continuous sorting of array sizes 2 to 16 on machine C


Table 7: Average number of CPU cycles per array of continuous sorting across all machines


Figure 14: Sorting times of quicksort with different base cases on machine A

4.6 Sorting a Large Set of Items with Quicksort

After seeing the first two results, we wanted to know how the base case sorters perform when used inside a scalable sorting algorithm. For that we modified introsort, a quicksort implementation from the STL library, as follows: introsort calls insertion sort only once, right at the end. Since that is not possible with the sorting networks, they had to be called directly whenever the partitioning resulted in a partition of 16 elements or less. Also, we determined the pivot using the 3-element Bose Nelson parameter network instead of using if-else and std::swap. The sorters were measured using benchmark 2 with parameters

• numberOfIterations = 50
• numberOfMeasures = 200
• arraySize = 1024 × 16 = 16384 = 2^14.

To have a basis of comparison we also measured sorting with std::sort. These times can be taken from figures 14, 15 and 16. The QSort -Q KR Def sorter is just a direct copy of the STL sort, doing a final insertion sort at the end. It was measured to verify that our code copy performs as well as std::sort before making the modifications.
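The modified recursion can be sketched as follows. This is a simplified single-pivot sketch with hypothetical names: std::sort stands in for the explicit size-specific networks, and plain if-else comparisons replace the 3-element Bose Nelson pivot network the thesis actually uses.

```cpp
#include <algorithm>
#include <cassert>

constexpr long kBaseCaseLimit = 16;

// Stand-in for the explicit, size-specific sorting networks (2-16 elements).
void sort_network(int* first, int* last) {
    std::sort(first, last);
}

// Sketch of the modified introsort: instead of one final insertion-sort
// pass, every partition of 16 elements or less is handed to the network
// immediately, while its elements are still cache-warm.
void quicksort_with_network_base(int* first, int* last) {
    if (last - first <= kBaseCaseLimit) {
        sort_network(first, last);
        return;
    }
    // Median-of-three pivot; the thesis orders the three samples with a
    // 3-element Bose Nelson network instead of if-else and std::swap.
    int* mid = first + (last - first) / 2;
    if (*mid < *first) std::iter_swap(mid, first);
    if (*(last - 1) < *first) std::iter_swap(last - 1, first);
    if (*(last - 1) < *mid) std::iter_swap(last - 1, mid);
    int pivot = *mid;
    // Hoare partition around the pivot value.
    int* lo = first;
    int* hi = last - 1;
    while (true) {
        while (*lo < pivot) ++lo;
        while (*hi > pivot) --hi;
        if (lo >= hi) break;
        std::iter_swap(lo++, hi--);
    }
    quicksort_with_network_base(first, hi + 1);
    quicksort_with_network_base(hi + 1, last);
}
```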


Figure 15: Sorting times of quicksort with different base cases on machine B


Figure 16: Sorting times of quicksort with different base cases on machine C


              A: N Best -Q KR Cla   B: N Best -Q KR Cla   C: N Best -Q KR Cla
I -Q KR Def   1.76%                 2.1%                  8.76%
I -Q KR POp   3.99%                 2.58%                 6.47%
StdSort -Q    12.3%                 10.6%                 14%

Table 8: Average speed-ups of the fastest sorting network over the fastest insertion sort as base case in quicksort and unmodified std::sort

Speed-ups of including sorting networks in a sorting algorithm like quicksort can be seen in table 8. What is notable is that the variants with insertion sort at the base are faster than the one with the final insertion sort, which should come from the fact that they are already specialized for the item they sort and do not require a predicate for the sorting. Also, the base case is called right after the partitioning reaches a low enough level, which means that the elements are still present in the first- or second-level cache. That also explains why the Cla conditional swap performs best with quicksort, while we saw in the last section that this is not necessarily the case when we have a cache miss.

Recalling the results from the previous sections, we appeared to achieve great improvements in reducing the time needed for sorting sets of 2-16 items. By measuring only the sorting of the small sets we exploited the networks' strength: not containing conditional branches. The results from the measurements with quicksort highlight the networks' weakness: their larger code size.

When integrating the sorting networks into quicksort for sorting the base cases, every time a partition results in one part having 16 elements or less, we switch from the code for quicksort to the code for the sorting network. Thus, the code for quicksort is partly removed from the L1 instruction cache and replaced with the code for the sorting network. Because the network's code is just a flat sequence of conditional swaps, each line of code is accessed exactly once per sort. That means a lot of quicksort's code is evicted from the instruction cache without a compensating speed-up from the network's code now being cached, and the network's code is evicted again once quicksort regains the flow of control and loads its own code back into the instruction cache.

We can see that effect especially for machines A and B, which have 32 KiB of L1 instruction cache: there the speed-up is hardly over 2% for the best network base case over the best insertion sort base case. We got a much greater improvement on machine C, which has double the space in its L1 instruction cache. Here we achieved a speed-up of almost 6.5% when making use of the best networks.

It is no surprise that we do not see improvements similar to those in sections 4.4 or 4.5, because the partitioning that quicksort performs takes the same amount of time no matter which base case sorter is used, representing a part of the algorithm that cannot be optimized through sorting networks.

              A: N BoNeL -s332 4CS   B: N BoNeL -s332 4CS   C: N BoNeL -s332 4CS
I -s332 Def   17.4%                  17.5%                  29.2%
StdSort -s    43.6%                  43.49%                 51%

Table 9: Average speed-ups of the fastest sorting network over the fastest insertion sort as base case in sample sort and unmodified std::sort


4.7 Sorting a Medium-Sized Set of Items with Sample Sort

Sample sort was measured using benchmark 2 with parameters:
• numberOfIterations = 50
• numberOfMeasures = 200
• arraySize = 256.

The measurements were done with two different goals in mind. The first was to see which parameters work best for the machines used and the array size set. This can be seen in figures 17, 18 and 19 for the Bose Nelson networks optimizing locality. To be able to compare the results on the different machines, the configurations were ordered based on the times from machine A, and are in the same order in the other two plots. An oversampling factor of 3 and a block size of 2 performed best on machines A and B. That configuration also performs best when using the other networks or insertion sort as a base case.

On machine C block sizes larger than 2 performed better (on average), along with an oversampling factor of 3 or greater. We measured larger variances and got a lot more outliers, so choosing a "best" configuration here was not so easy. When looking at the other networks and insertion sort as base case, consistently well-performing parameters are an oversampling factor of 3 and a block size of 4, but with very little lead over other configurations. That is interesting to see because all three machines run x86 assembly instructions and have the same number of general-purpose registers available. What comes into play here is the size of the instruction cache: machine C has double the amount of L1 instruction cache of machines A and B. We can only assume that the instructions for classifying three elements need more space than the smaller 32 KiB instruction caches can provide, while the 64 KiB instruction cache of machine C fits the instructions for classifying four and/or almost five elements at once, considering that block size 5 also performs well.

The second goal was to see whether the results from sections 4.4, 4.5 and 4.6 would carry over to using sample sort with the sorting networks as base cases. These results can be seen in figures 20, 21 and 22 for the 332 configuration. All measurements were made with a base case limit of 16. Here, too, a single outlier was excluded from the dataset for scaling purposes: a value of 40177 measured on machine B for the 'N BoNeP -s332 KR 4Cm' sorter.

The achieved speed-ups of using the sorting networks are given in table 9. On the left we see sample sort with insertion sort as base case, and std::sort, which was also measured sorting 256 elements. On the top we see the best performing network 'N BoNeL -s332 4CS' as a base case for sample sort on all three machines. The number indicates the speed-up of sample sort with the network over sample sort with insertion sort and over std::sort.

Again we see that, due to machine C having a larger L1 instruction cache, the performance gain is almost double that of the other machines. Unlike in the previous section, though, we got much greater speed-ups as a result of using the sorting networks as a base case. That comes from the fact that sample sort has no unpredictable branches when classifying the elements, as opposed to quicksort having to deal with conditional branches during the partitioning, while both need to invest the same time to sort all the base cases. So with sample sort, the base case sorting takes up a larger share of the whole execution than it does with quicksort. We also see that with very few conditional branches we can get up to 50% faster than std::sort (for sets of up to 256 items at least).
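The branch-free classification that gives sample sort this edge can be sketched as follows. This is a minimal illustration under our own naming, shown for 3 splitters / 4 buckets (the "3xy" configurations above): the splitters are stored as an implicit binary search tree, and the comparison result is added to the index so that no conditional jump is needed.

```cpp
#include <cassert>
#include <cstdint>

// tree[1] holds the middle splitter, tree[2] and tree[3] its children;
// tree[0] is unused. For sorted splitters a1 < a2 < a3 this means
// tree = { -, a2, a1, a3 }.
inline int classify(uint64_t value, const uint64_t tree[4]) {
    int k = 1;
    k = 2 * k + (value > tree[k]); // level 1: compare against the root
    k = 2 * k + (value > tree[k]); // level 2: compare against a child
    return k - 4;                  // bucket index in 0..3
}
```

In Register Sample Sort the values tree[1..3] live in general-purpose registers, and blockSize elements are classified per loop iteration so that the independent comparisons can overlap in the pipeline.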


Figure 17: Sample sort on machine A with 256 items. -Sxyz has parameters x = numberOfSplitters, y = oversamplingFactor and z = blockSize


Figure 18: Sample sort on machine B with 256 items. -Sxyz has parameters x = numberOfSplitters, y = oversamplingFactor and z = blockSize


Figure 19: Sample sort on machine C with 256 items. -Sxyz has parameters x = numberOfSplitters, y = oversamplingFactor and z = blockSize


Figure 20: Sample sort 332 with different base cases on machine A


Figure 21: Sample sort 332 with different base cases on machine B


Figure 22: Sample sort 332 with different base cases on machine C


4.8 Sorting a Large Set of Items with IPS4o

With the efficient implementation of sample sort for medium-sized sets we can now include the new base case sorters into a complex sorting algorithm. The In-Place Parallel Super Scalar Samplesort (IPS4o) [AWFS17] was executed without introducing parallelism. The algorithm has many parameters that can be adjusted. The important parameter for us was the BaseCaseSize4¹: it tells IPS4o to aim for base case sizes that are smaller than or equal to BaseCaseSize4. Even though that is the goal, for a large-scale sorter like IPS4o it would be far less efficient to partition e.g. 32 elements into many buckets, which might each end up containing only a few elements, than to just use the base case sorter in these situations, even though the number of items is larger than the specified BaseCaseSize4.

That was the reason to develop Register Sample Sort, which can break those medium-sized sets down into sizes that can be sorted using the sorting networks.

We started measuring with the best combination of sample sort from section 4.7 as a base case for IPS4o, together with the default BaseCaseSize4 = 16, but that turned out to perform worse than just insertion sort.

The distribution of the base case array sizes can be seen in figure 23 for BaseCaseSize4 = 16 and figure 24 for BaseCaseSize4 = 32. From that it was evident that in most instances with parameter BaseCaseSize4 = 16 the base case sorter was being invoked on sets smaller than even 32 elements. That also meant that sample sort had to deal with a larger overhead than insertion sort, not justified by a larger number of items.

In addition, the size of the instruction cache, which had already had a great influence on the measurements of quicksort, seemed to be another factor in the bad performance of Register Sample Sort as a base case. That is why we decided to measure the following setups:

• Pure insertion sort as base case (I) with
  – BaseCaseSize4 = 16 and 32

• Register sample sort as base case (S+N) with
  – BaseCaseSize4 = 16, 32, and 64,
  – Configurations 331 and 332, and
  – Best networks and Bose Nelson networks (optimizing locality) as base case for Register Sample Sort, with the 4CS conditional swap and base case size 16

• A combination of the sorting networks and insertion sort (I+N):
  Since the base case sizes were often smaller than 16, we wanted to make use of that by using the sorting networks, while not having to rely on Register Sample Sort with its larger overhead for the slightly larger base cases. The solution was to use the Bose Nelson networks (optimizing locality) if the set had 16 elements or less, and insertion sort otherwise.

¹ We will use the 4 to distinguish IPS4o's BaseCaseSize from Register Sample Sort's base case size.
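The I+N combination reduces to a simple size dispatch. The sketch below uses hypothetical names, with std::sort standing in for the explicit Bose Nelson networks:

```cpp
#include <algorithm>
#include <cassert>

// Stand-in for the explicit Bose Nelson networks (optimizing locality).
void network_sort(int* first, int* last) {
    std::sort(first, last);
}

// Plain insertion sort for the slightly larger base cases.
void insertion_sort(int* first, int* last) {
    for (int* i = first + 1; i < last; ++i) {
        int value = *i;
        int* j = i;
        while (j > first && *(j - 1) > value) {
            *j = *(j - 1);
            --j;
        }
        *j = value;
    }
}

// I+N base case for IPS4o: networks for sets of 16 elements or less,
// insertion sort otherwise, avoiding Register Sample Sort's overhead.
void ips4o_base_case(int* first, int* last) {
    if (last - first <= 16)
        network_sort(first, last);
    else
        insertion_sort(first, last);
}
```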


Figure 23: Distribution of the size of the array passed to the base case sorter when executing IPS4o with parameter BaseCaseSize4 = 16


Figure 24: Distribution of the size of the array passed to the base case sorter when executing IPS4o with parameter BaseCaseSize4 = 32


Figure 25: Sorting times for IPS4o on machine A with different base cases and base case sizes

Figures 25, 26 and 27 display the results from the measurements with the above variants. The BaseCaseSize4 was appended after the -4, along with an underscore followed by the Register Sample Sort configuration. The benchmark from algorithm 2 was used with parameters

• numberOfIterations = 50
• numberOfMeasures = 200
• arraySize = 1024 × 32 = 32768 = 2^15.

As already seen in [AWFS17], we get a speed-up of over 59% over std::sort with unchanged IPS4o on all machines. On machine A, none of the variants we tried led to an improvement in sorting speed over the default use of insertion sort with BaseCaseSize4 = 16. For machine B, interestingly, using Register Sample Sort did not lead to an improvement, but the combination of insertion sort and Bose Nelson networks did manage to reduce the sorting time by 4.3%. For machine C we see the impact of the large L1 instruction cache in the visible improvement of 9.2% for having Register Sample Sort as a base case instead of insertion sort, though the combinations of insertion sort and the sorting networks also performed well. It is notable that, while Register Sample Sort by itself did well with blockSizes of 4 or 5, here it is beneficial to use blockSize = 1, which has a smaller impact on the instruction cache.

             A: S+N BoNeL 16_331 4CS   B: I+N 16   C: S+N BoNeL 16_331 4CS
I 16 Def     -3.4%                     4.3%        9.2%
StdSort -s   59.1%                     61.7%       65%

Table 10: Average speed-ups of the fastest sorting network over the fastest insertion sort as base case in IPS4o and unmodified std::sort


Figure 26: Sorting times for IPS4o on machine B with different base cases and base case sizes


Figure 27: Sorting times for IPS4o on machine C with different base cases and base case sizes


5 Conclusion

5.1 Results and Assessment

In this thesis we have seen that for sorting sets of up to 16 elements it can be viable to usesorting algorithms other than insertion sort. We looked at sorting networks in particular, payingspecial attention to the implementation of the conditional swap and giving multiple alternativeways of realizing that implementation.After seeing that the sorting networks outperform insertion sort each on their own for a specificarray size in section 4.4 and 4.5, we saw in section 4.6 that this improvement does not necessarilytransfer to sorting networks being used as base case sorter in quicksort. Because the networkshave a larger code size, the code for quicksort is removed from the instruction cache and theadvantage of not having conditional branches is impaired by that larger code size. But we alsosaw that for machines with larger instruction caches using sorting networks with quicksort canlead to visible improvements of about 6.4%.After that we integrated the sorting networks into a very advanced sorter like IPS4o, whichwas possible by adding an intermediate sorter into the procedure. For that we created RegisterSample Sort, which is an implementation of Super Scalar Sample Sort that holds the splittersin general-purpose registers instead of an array. When measuring IPS4o with Register SampleSort as a base case, we found that the instruction cache makes even more of a difference,because we now add the code size for Register Sample Sort on top of the code size for thesorting networks.We proposed an additional alternative to Register Sample Sort, using a combination of insertionsort and sorting networks: For base cases of 16 elements or less, we used the sorting network,for any size above that insertion sort.On one of the machines with a smaller instruction cache of 32 KiB we could not achieve aspeed-up with any of the variants, on the other the combination of insertion sort and sortingnetworks led to an improvement in sorting time of 4.3%. 
The only substantial improvement we achieved with IPS4o was on the machine with 64 KiB of L1 instruction cache, where using Register Sample Sort led to an improvement of 9.2% over plain insertion sort.

In closing, we want to mention that this particular implementation only compiles with the gcc C++ compiler due to compiler-dependent inline-assembly statements. This also means that the code is probably not as fast as it could be, because the inline assembly is not optimized by the compiler. The complete project is available on GitHub at https://github.com/JMarianczuk/SmallSorters.
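To make the conditional-swap idea summarized above concrete, the following is a minimal portable C++ sketch of a branchless comparator and a small sorting network. All names are ours, and the four-element network is only an illustration; the thesis variants use gcc inline assembly to force conditional moves, whereas this portable sketch merely gives the compiler the opportunity to emit them:

```cpp
#include <cstdint>

// Hypothetical item type mirroring the 64-bit key and reference value
// pair discussed in the thesis (names are illustrative).
struct Item {
    uint64_t key;
    uint64_t ref;
};

// Branchless conditional swap: one comparison, then conditional selects.
// Compilers commonly lower these ternaries to cmov instructions, but this
// is not guaranteed by portable C++.
inline void conditional_swap(Item &a, Item &b) {
    bool c = b.key < a.key;
    uint64_t k = c ? b.key : a.key;
    uint64_t r = c ? b.ref : a.ref;
    b.key = c ? a.key : b.key;
    b.ref = c ? a.ref : b.ref;
    a.key = k;
    a.ref = r;
}

// Standard optimal 5-comparator network for four elements: every
// comparator runs unconditionally, so there are no data-dependent branches.
inline void sort4(Item *v) {
    conditional_swap(v[0], v[1]);
    conditional_swap(v[2], v[3]);
    conditional_swap(v[0], v[2]);
    conditional_swap(v[1], v[3]);
    conditional_swap(v[1], v[2]);
}
```

Because the comparator sequence is fixed, the branch predictor plays no role; the price is that all five comparators execute even on already-sorted input.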

5.2 Experiences and Hurdles

The greatest hurdle we encountered during this project was, as mentioned in section 4.3, the fact that the compiler reduces its optimizations with increasing compilation effort when compiling only a single source file. That can lead to performance variations that happen for no apparent reason, and it is especially tricky when dealing with templated methods that cannot be moved from header files into source files. The solution was to use code generation and to gather all logically coherent method invocations in one wrapper method that is then placed in its own source file, so that different parts of the program do not influence each other over the decision which one gets optimized and which one does not.
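The wrapper pattern can be sketched in a few lines. All names here are hypothetical, and for brevity everything sits in one file; in the actual project the code generator emits each wrapper function into its own .cpp file, so every translation unit stays small enough that the compiler does not scale back its optimizations:

```cpp
// Stand-in for "a benchmark was executed"; the real measure() would run
// and time the given sorter on random inputs.
static int runs = 0;

template <typename Sorter>
void measure() { ++runs; }

// Dummy sorter types for illustration only.
struct Network4 {};
struct Network8 {};

// One wrapper per logically coherent group of template instantiations.
// Generated code places each such wrapper in its own translation unit,
// so optimization decisions do not leak between groups.
void measure_all_networks() {
    measure<Network4>();
    measure<Network8>();
}
```

The design point is isolation: because each wrapper is the only heavyweight function in its translation unit, the per-file compilation effort stays bounded regardless of how many benchmark configurations the generator produces.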

5.3 Possible Additions

In addition to the work in this thesis, we would like to explore further possibilities to implement the conditional swap for the sorting networks, as well as to see which of the C++ compilers generate conditional moves when using portable C++ code instead of compiler-dependent inline assembly. That also includes looking at conditional swaps for elements that differ from the 64-bit key and reference value pair we considered in this thesis. Furthermore, we would like to look at implementing sorting networks in a way that takes up less code space, and at what the trade-off for that decreased code size would be.

Apart from the sorting networks, we would also like to take another look at Register Sample Sort, to find out whether using seven splitters instead of three becomes more practical when the input size grows beyond 256 elements.
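The splitter classification at the core of Register Sample Sort can be sketched as follows. This is a portable illustration of the Super Scalar Sample Sort decision tree for three splitters, with names of our choosing; the thesis implementation keeps the splitters in general-purpose registers, and the same index-doubling scheme extends to seven splitters by adding one more step:

```cpp
#include <cstdint>

// Branchless bucket classification with three splitters s0 < s1 < s2.
// Each step doubles the index and adds a comparison result (bool -> 0/1),
// so there are no data-dependent branches to mispredict.
inline unsigned bucket_of(uint64_t key, uint64_t s0, uint64_t s1,
                          uint64_t s2) {
    unsigned b = 1;
    b = 2 * b + (key > s1);                  // b in {2, 3}
    b = 2 * b + (key > (b == 2 ? s0 : s2));  // b in {4, ..., 7}
    return b - 4;                            // bucket index 0..3
}
```

With seven splitters the tree gains one level: three comparisons per element instead of two, for eight buckets, which is the trade-off mentioned above for larger inputs.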

References

[AWFS17] Axtmann, Michael ; Witt, Sascha ; Ferizovic, Daniel ; Sanders, Peter: In-Place Parallel Super Scalar Samplesort (IPSSSSo). In: 25th Annual European Symposium on Algorithms, ESA 2017, September 4-6, 2017, Vienna, Austria. https://github.com/SaschaWitt/ips4o, 2017, 9:1–9:14

[Bat68] Batcher, Kenneth E.: Sorting Networks and Their Applications. In: American Federation of Information Processing Societies: AFIPS Conference Proceedings: 1968 Spring Joint Computer Conference, Atlantic City, NJ, USA, 30 April - 2 May 1968, 1968, 307–314

[Ber18] Bertdobbelaere: SorterHunter. https://github.com/bertdobbelaere/SorterHunter, 2018

[BN62] Bose, R. C. ; Nelson, R. J.: A Sorting Problem. In: J. ACM 9 (1962), Nr. 2, 282–296. http://dx.doi.org/10.1145/321119.321126. – DOI 10.1145/321119.321126

[CCFS14] Codish, Michael ; Cruz-Filipe, Luís ; Frank, Michael ; Schneider-Kamp, Peter: Twenty-Five Comparators Is Optimal When Sorting Nine Inputs (and Twenty-Nine for Ten). In: 26th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2014, Limassol, Cyprus, November 10-12, 2014, 2014, 186–193

[CCNS17] Codish, Michael ; Cruz-Filipe, Luís ; Nebel, Markus ; Schneider-Kamp, Peter: Optimizing sorting algorithms by using sorting networks. In: Formal Aspects of Computing 29 (2017), Nr. 3, 559–579. http://dx.doi.org/10.1007/s00165-016-0401-3. – DOI 10.1007/s00165-016-0401-3

[Fre19] Free Software Foundation: How to Use Inline Assembly Language in C Code.https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html, 2019

[Gam19] Gamble, John M.: Sorting network generator. http://pages.ripco.net/~jgamble/nw.html, 2019

[Knu98] Knuth, Donald E.: The Art of Computer Programming, Volume III: Sorting and Searching, 2nd Edition. Addison-Wesley, 1998. http://www.worldcat.org/oclc/312994415. – ISBN 0201896850

[SW04] Sanders, Peter ; Winkel, Sebastian: Super Scalar Sample Sort. In: Algorithms – ESA 2004, 12th Annual European Symposium, Bergen, Norway, September 14-17, 2004, Proceedings, 2004, 784–796
