Home

Awesome

Intro

This document describes a stable quicksort / mergesort hybrid named fluxsort. The sort is stable, adaptive, branchless, and has exceptional performance. A visualisation and benchmarks are available at the bottom.

Analyzer

Fluxsort starts out with an analyzer that handles fully in-order arrays and reverse-order arrays using n comparisons. It also splits the array in 4 segments and obtains a measure of presortedness for each segment, switching to quadsort if the segment is more than 50% ordered.

While not as adaptive as the bottom-up run detection used by quadsort, a top-down analyzer works well because quicksort significantly benefits from sorting longer ranges. This approach results in more robust overall adaptivity as fluxsort cannot be tricked into performing less efficient partitions on small sorted runs. If only random data is found the analyzer starts skipping ahead, only using n / 4 comparisons to analyze random data.

Increasing the segments from 4 to 16 is challenging due to register pressure and code size.

Partitioning

Partitioning is performed in a top-down manner similar to quicksort. Fluxsort obtains the quasimedian of 9 for partitions smaller than 2024 elements, and the median of 32, 64, 128, 256, 512, or 1024 for larger partitions, making the pivot selection an approximation of the cubic root of the partition size. The median element obtained will be referred to as the pivot. Partitions that grow smaller than 96 elements are sorted with quadsort.

For the quasimedian of 9 I developed a very efficient and elegant branchless median of 3 computation.

        int x = swap[0] > swap[1];
        int y = swap[0] > swap[2];
        int z = swap[1] > swap[2];

        return swap[(x == y) + (y ^ z)];

Fluxsort's cubic root median selection differs from traditional pseudomedian (median of medians) selection by utilizing a combination of quadsort and a binary search.

After obtaining a pivot the array is parsed from start to end. Elements equal or smaller than the pivot are copied in-place to the start of the array, elements greater than the pivot are copied to swap memory. The partitioning routine is called recursively on the two partitions in main and swap memory.

The flux partitioning scheme is partially in-place, giving it a performance advantage over mergesort for large arrays.

Worst case handling

To avoid run-away recursion fluxsort switches to quadsort for both partitions if one partition is less than 1/16th the size of the other partition. On a distribution of random unique values the observed chance of a false positive is 1 in 3,000 for the quasimedian of 9 and less than 1 in 10 million for the quasimedian of 32.

Combined with cbrt(n) median selection, this guarantees a worst case of n log n comparisons.

Branchless optimizations

Fluxsort uses a branchless comparison optimization. The ability of quicksort to partition branchless was first described in "BlockQuicksort: How Branch Mispredictions don't affect Quicksort" by Stefan Edelkamp and Armin Weiss.1 Since Fluxsort uses auxiliary memory, the partitioning scheme is simpler and faster than the one used by BlockQuicksort.

Median selection uses a branchless comparison technique that selects the quasimedian of 9 using 15 comparisons, and the median of 32 using 115 comparisons.

When sorting, branchless comparisons are primarily useful to take advantage of memory-level parallelism. After writing data, the process can continue without having to wait for the write operation to have actually finished, and the process will primarily stall when a cache line is fetched. Since quicksort partitions to two memory regions, part of the loop can continue, reducing the wait time for cache line fetches. This gives an overall performance gain, even though the branchless operation is more expensive due to a lack of support for efficient branchless operations in C / gcc.

When the comparison becomes more expensive (like string comparisons), the size of the type is increased, the size of the partition is increased, or the comparison accesses uncached memory regions, the benefit of memory-level parallelism is reduced, and can even result in slower overall execution. While it's possible to write to four memory regions at once, instead of two, the cost-benefit is dubious, though it might be a good strategy for future hardware.

Quadsort, as of September 2021, uses a branchless optimization as well, and writes to two distinct memory regions by merging both ends of an array simultaneously. For sorting strings and objects quadsort's overall branchless performance is better than fluxsort's, with the exception that fluxsort is faster on random data with low cardinality.

As a general note, branch prediction is awesome. Quadsort and fluxsort try to take advantage of branch prediction where possible.

Generic data optimizations

Fluxsort uses a method that mimicks dual-pivot quicksort to improve generic data handling. If after a partition all elements were smaller or equal to the pivot, a second sweep is performed, filtering out all elements equal to the pivot, next it carries on as usual. This typically only occurs when sorting tables with many identical values, like gender, age, etc. Generic data performance is improved slightly by checking if the same pivot is chosen twice in a row, in which case it performs a reverse partition as well. To my knowledge, pivot retention was first introduced by pdqsort. Generic data performance is further improved by defaulting to quadsort in some cases.

┌──────────────────────────────────┬───┬──────────────┐
│             E <= P               │ P │    E > P     | default partition
└──────────────────────────────────┴───┴──────────────┘

┌──────────────┬───┬───────────────────┐
│    P > E     │ P │    P <= E         |                reverse partition
└──────────────┴───┴───────────────────┘

┌──────────────┬───┬───────────────┬───┬──────────────┐
│    E < P     │ P │    E == P     │ P │     E > P    | 
└──────────────┴───┴───────────────┴───┴──────────────┘

This diagram and the 'second sweep' quicksort optimization for equal keys was described as early as 1985 by Lutz M Wegner.2

Since equal elements are copied back to the input array it is guaranteed that no more than n - 3 elements are copied to swap memory. Subsequently fluxsort can store a stack of previously used pivots at the end of swap memory.

Adaptive partitioning

Fluxsort performs low cost run detection while it partitions and switches to quadsort if a potential long run is detected. While the run detection is not fully robust it can result in significant performance gains at a neglible cost.

Large array optimizations

For partitions larger than 2048 elements fluxsort obtains the median of 32, 64, 128, 256, 512, or 1024. It does so by copying random elements to swap memory, sorting two halves with quadsort, and returning the center right element using a binary search.

Small array optimizations

For partitions smaller than 96 elements fluxsort uses quadsort's small array sorting routine. This routine uses branchless parity merges which gives a significant performance gain compared to the unguarded insertion sort used by most introsorts.

Big O

                 ┌───────────────────────┐┌───────────────────────┐
                 │comparisons            ││swap memory            │
┌───────────────┐├───────┬───────┬───────┤├───────┬───────┬───────┤┌──────┐┌─────────┐┌─────────┐
│name           ││min    │avg    │max    ││min    │avg    │max    ││stable││partition││adaptive │
├───────────────┤├───────┼───────┼───────┤├───────┼───────┼───────┤├──────┤├─────────┤├─────────┤
│fluxsort       ││n      │n log n│n log n││1      │n      │n      ││yes   ││yes      ││yes      │
├───────────────┤├───────┼───────┼───────┤├───────┼───────┼───────┤├──────┤├─────────┤├─────────┤
│quadsort       ││n      │n log n│n log n││1      │n      │n      ││yes   ││no       ││yes      │
├───────────────┤├───────┼───────┼───────┤├───────┼───────┼───────┤├──────┤├─────────┤├─────────┤
│quicksort      ││n log n│n log n│n²     ││1      │1      │1      ││no    ││yes      ││no       │
├───────────────┤├───────┼───────┼───────┤├───────┼───────┼───────┤├──────┤├─────────┤├─────────┤
│pdqsort        ││n      │n log n│n log n││1      │1      │1      ││no    ││yes      ││semi     │
└───────────────┘└───────┴───────┴───────┘└───────┴───────┴───────┘└──────┘└─────────┘└─────────┘

Data Types

The C implementation of fluxsort supports long doubles and 8, 16, 32, and 64 bit data types. By using pointers it's possible to sort any other data type, like strings.

Interface

Fluxsort uses the same interface as qsort, which is described in man qsort.

Fluxsort comes with the fluxsort_prim(void *array, size_t nmemb, size_t size) function to perform primitive comparisons on arrays of 32 and 64 bit integers. Nmemb is the number of elements. Size should be either sizeof(int) or sizeof(long long) for signed integers, and sizeof(int) + 1 or sizeof(long long) + 1 for unsigned integers. Support for additional primitive as well as custom types can be added to fluxsort.h and quadsort.h.

Fluxsort comes with the fluxsort_size(void *array, size_t nmemb, size_t size, CMPFUNC *cmp) function to sort elements of any given size. The comparison function needs to be by reference, instead of by value, as if you are sorting an array of pointers.

Memory

Fluxsort allocates n elements of swap memory, which is shared with quadsort. Recursion requires log n stack memory.

If memory allocation fails fluxsort defaults to quadsort, which can sort in-place through rotations.

If in-place stable sorting is desired the best option is to use blitsort, which is a properly in-place alternative to fluxsort. For in-place unstable sorting crumsort is an option as well.

Performance

Fluxsort is one of the fastest stable comparison sorts written to date.

To take full advantage of branchless operations the cmp macro can be uncommented in bench.c, which will double the performance on primitive types. Fluxsort, after crumsort, is faster than a radix sort for sorting 64 bit integers. An adaptive radix sort, like wolfsort, has better performance on 8, 16, and 32 bit types.

Fluxsort needs to be compiled using gcc -O3 for optimal performance.

Porting

People wanting to port fluxsort might want to have a look at piposort, which is a simplified implementation of quadsort. Fluxsort itself is relatively simple. Earlier versions of fluxsort have a less bulky analyzer. Fluxsort works without the analyzer, but will be less adaptive.

Variants

Credits

I was likely the first to write a partially inplace stable quicksort, and I introduced a few new things, like branchless pivot selection, increasing the pivot candidate selection from the pseudomedian of 9 to an approximation of sqrt(n), improving small partition sorting by utilizing branchless parity merges, top-down run detection for increased adaptivity, detecting emergent patterns, and an efficient switch between partitioning swap and main memory.

Special thanks to Control from the Holy Grail Sort Project for invaluable feedback and insights.

Visualization

In the visualization below eleven tests are performed on 256 elements. The various distributions are indexed on YouTube: https://youtu.be/TDJ5zpcZJ18.

fluxsort visualization

Benchmarks

The following benchmark was on WSL gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) using the wolfsort benchmark. The source code was compiled using g++ -O3 -w -fpermissive bench.c. The bar graph shows the best run out of 100 on 100,000 32 bit integers. Comparisons for timsort, fluxsort and std::stable_sort are inlined.

fluxsort vs stdstablesort

<details><summary>data table</summary>
NameItemsTypeBestAverageComparesSamplesDistribution
stablesort100000640.0060080.0060930100random order
fluxsort100000640.0020320.0020720100random order
timsort100000640.0077540.0078140100random order
NameItemsTypeBestAverageLoopsSamplesDistribution
stablesort100000320.0061130.0061520100random order
fluxsort100000320.0019060.0019160100random order
timsort100000320.0076300.0076960100random order
stablesort100000320.0039540.0039880100random % 100
fluxsort100000320.0006770.0006830100random % 100
timsort100000320.0056480.0056880100random % 100
stablesort100000320.0008040.0008590100ascending order
fluxsort100000320.0000440.0000440100ascending order
timsort100000320.0000450.0000450100ascending order
stablesort100000320.0014960.0015480100ascending saw
fluxsort100000320.0003050.0003070100ascending saw
timsort100000320.0008460.0008520100ascending saw
stablesort100000320.0012060.0012260100pipe organ
fluxsort100000320.0001930.0001950100pipe organ
timsort100000320.0004710.0004770100pipe organ
stablesort100000320.0009140.0009240100descending order
fluxsort100000320.0000560.0000560100descending order
timsort100000320.0001020.0001020100descending order
stablesort100000320.0016220.0016440100descending saw
fluxsort100000320.0003150.0003160100descending saw
timsort100000320.0009080.0009220100descending saw
stablesort100000320.0021540.0022150100random tail
fluxsort100000320.0006410.0006460100random tail
timsort100000320.0020050.0020330100random tail
stablesort100000320.0036330.0036780100random half
fluxsort100000320.0010960.0011080100random half
timsort100000320.0040420.0040720100random half
stablesort100000320.0010910.0011210100ascending tiles
fluxsort100000320.0002930.0002940100ascending tiles
timsort100000320.0009450.0010020100ascending tiles
stablesort100000320.0017470.0018870100bit reversal
fluxsort100000320.0017050.0017180100bit reversal
timsort100000320.0022820.0029200100bit reversal
</details>

The following benchmark was on WSL 2 gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) using the wolfsort benchmark. The source code was compiled using g++ -O3 -w -fpermissive bench.c. It measures the performance on random data with array sizes ranging from 10 to 10,000,000. It's generated by running the benchmark using 10000000 0 0 as the argument. The benchmark is weighted, meaning the number of repetitions halves each time the number of items doubles. A table with the best and average time in seconds can be uncollapsed below the bar graph.

fluxsort vs stdstablesort

<details><summary>data table</summary>
NameItemsTypeBestAverageComparesSamplesDistribution
stablesort10320.1277790.1281490.010random 10
fluxsort10320.0496790.0498830.010random 10
timsort10320.1410790.1443500.010random 10
stablesort100320.2427440.2432880.010random 100
fluxsort100320.1123590.1130830.010random 100
timsort100320.3411050.3417100.010random 100
stablesort1000320.3629020.3635860.010random 1000
fluxsort1000320.1375690.1380660.010random 1000
timsort1000320.4930470.4936220.010random 1000
stablesort10000320.4768590.4771680.010random 10000
fluxsort10000320.1605650.1607740.010random 10000
timsort10000320.6372240.6419500.010random 10000
stablesort100000320.6022760.6028380.010random 100000
fluxsort100000320.1874200.1879150.010random 100000
timsort100000320.7620780.7626480.010random 100000
stablesort1000000320.7317150.7340210.010random 1000000
fluxsort1000000320.2171960.2192070.010random 1000000
timsort1000000320.8955320.8965470.010random 1000000
stablesort10000000320.8910280.895325010random 10000000
fluxsort10000000320.2669100.269159010random 10000000
timsort10000000321.0817091.089263010random 10000000
</details>

The following benchmark was on WSL gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1). The source code was compiled using gcc -O3 bench.c. The bar graph shows the best run out of 100 on 100,000 32 bit integers. Comparisons for qsort, fluxsort and quadsort are not inlined. The stdlib qsort() in the benchmark is a mergesort variant.

fluxsort vs qsort

<details><summary>data table</summary>
NameItemsTypeBestAverageComparesSamplesDistribution
qsort100000640.0170330.0172271536381100random string
fluxsort100000640.0099280.0102121782460100random string
quadsort100000640.0107770.0109091684673100random string
qsort100000640.0153980.0155511536491100random double
fluxsort100000640.0076460.0078111781640100random double
quadsort100000640.0087020.0088181684633100random double
qsort100000640.0110140.0111241536491100random long
fluxsort100000640.0048590.0049651781640100random long
quadsort100000640.0060430.0061201684633100random long
qsort100000640.0108340.0109651536634100random int
fluxsort100000640.0045960.0046621790032100random int
quadsort100000640.0053650.0054221684734100random int
NameItemsTypeBestAverageComparesSamplesDistribution
qsort1000001280.0182190.0190141536491100random order
fluxsort1000001280.0111780.0112901781640100random order
quadsort1000001280.0111410.0112581684633100random order
NameItemsTypeBestAverageComparesSamplesDistribution
qsort100000640.0093280.0095051536491100random order
fluxsort100000640.0040250.0040951781640100random order
quadsort100000640.0041030.0041431684633100random order
NameItemsTypeBestAverageComparesSamplesDistribution
qsort100000320.0089320.0090941536634100random order
fluxsort100000320.0034030.0034531790032100random order
quadsort100000320.0033730.0034191684734100random order
qsort100000320.0066920.0068311532465100random % 100
fluxsort100000320.0013220.001352897246100random % 100
quadsort100000320.0027230.0028161415417100random % 100
qsort100000320.0022330.002371815024100ascending order
fluxsort100000320.0001740.00017599999100ascending order
quadsort100000320.0001590.00016199999100ascending order
qsort100000320.0030670.003177915020100ascending saw
fluxsort100000320.0005450.000549300011100ascending saw
quadsort100000320.0008970.000915379624100ascending saw
qsort100000320.0024800.002523884462100pipe organ
fluxsort100000320.0003810.000382200006100pipe organ
quadsort100000320.0004570.000462277113100pipe organ
qsort100000320.0024590.002535853904100descending order
fluxsort100000320.0001860.00018799999100descending order
quadsort100000320.0001640.00016699999100descending order
qsort100000320.0032890.003361953892100descending saw
fluxsort100000320.0005550.000561300013100descending saw
quadsort100000320.0009220.000930391547100descending saw
qsort100000320.0039430.0040051012073100random tail
fluxsort100000320.0011810.001207623604100random tail
quadsort100000320.0012030.001221592061100random tail
qsort100000320.0057910.0059861200713100random half
fluxsort100000320.0019740.0020041028725100random half
quadsort100000320.0020550.0020921006728100random half
qsort100000320.0040850.0042231209200100ascending tiles
fluxsort100000320.0015230.001557528889100ascending tiles
quadsort100000320.0020730.002130671244100ascending tiles
qsort100000320.0051430.0054291553378100bit reversal
fluxsort100000320.0032630.0033851798806100bit reversal
quadsort100000320.0031790.0032181727134100bit reversal
</details>

The following benchmark was on WSL gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) using the wolfsort benchmark. The source code was compiled using g++ -O3 -w -fpermissive bench.c. The bar graph shows the best run out of 100 on 100,000 32 bit integers. Comparisons for pdqsort, fluxsort and crumsort are inlined.

fluxsort vs pdqsort

<details><summary>data table</summary>
NameItemsTypeBestAverageComparesSamplesDistribution
pdqsort1000001280.0058540.0059540100random order
fluxsort1000001280.0085550.0086620100random order
crumsort1000001280.0082530.0083120100random order
NameItemsTypeBestAverageComparesSamplesDistribution
pdqsort100000640.0026600.0026830100random order
fluxsort100000640.0019500.0019730100random order
crumsort100000640.0018500.0018640100random order
NameItemsTypeBestAverageLoopsSamplesDistribution
pdqsort100000320.0026870.0027110100random order
fluxsort100000320.0018260.0018570100random order
crumsort100000320.0017990.0018150100random order
pdqsort100000320.0007850.0007920100random % 100
fluxsort100000320.0006560.0006690100random % 100
crumsort100000320.0005600.0005650100random % 100
pdqsort100000320.0000910.0000910100ascending order
fluxsort100000320.0000440.0000440100ascending order
crumsort100000320.0000440.0000440100ascending order
pdqsort100000320.0034650.0034820100ascending saw
fluxsort100000320.0003290.0003370100ascending saw
crumsort100000320.0006280.0006360100ascending saw
pdqsort100000320.0028400.0028620100pipe organ
fluxsort100000320.0002150.0002180100pipe organ
crumsort100000320.0003590.0003640100pipe organ
pdqsort100000320.0001950.0002000100descending order
fluxsort100000320.0000550.0000560100descending order
crumsort100000320.0000560.0000560100descending order
pdqsort100000320.0032360.0032750100descending saw
fluxsort100000320.0003290.0003310100descending saw
crumsort100000320.0006370.0006480100descending saw
pdqsort100000320.0025660.0025870100random tail
fluxsort100000320.0006240.0006290100random tail
crumsort100000320.0008790.0008880100random tail
pdqsort100000320.0026700.0026970100random half
fluxsort100000320.0010640.0010810100random half
crumsort100000320.0011990.0012160100random half
pdqsort100000320.0023160.0024250100ascending tiles
fluxsort100000320.0002930.0002990100ascending tiles
crumsort100000320.0015180.0015450100ascending tiles
pdqsort100000320.0026700.0026940100bit reversal
fluxsort100000320.0016920.0017370100bit reversal
crumsort100000320.0017850.0018050100bit reversal
</details>

The following benchmark was on WSL clang version 10 (10.0.0-4ubuntu1~18.04.2) using rhsort's wolfsort benchmark. The source code was compiled using clang -O3. The bar graph shows the best run out of 100 on 131,072 32 bit integers. Comparisons for quadsort, fluxsort and glidesort are inlined.

Some additional context is required for this benchmark. Glidesort is written and compiled in Rust which supports branchless ternary operations, subsequently fluxsort and quadsort are compiled using clang with branchless ternary operations in place for the merge and small-sort routines. Since fluxsort and quadsort are optimized for gcc there is a performance penalty, with some of the routines running 2-3x slower than they do in gcc.

fluxsort vs glidesort

<details><summary>data table</summary>
NameItemsTypeBestAverageComparesSamplesDistribution
quadsort131072320.0021620.0022140100random order
fluxsort131072320.0019470.0020340100random order
glidesort131072320.0030470.0031260100random order
quadsort131072320.0016280.0016970100random % 100
fluxsort131072320.0007990.0008290100random % 100
glidesort131072320.0010320.0010600100random % 100
quadsort131072320.0000580.0000620100ascending order
fluxsort131072320.0000580.0000600100ascending order
glidesort131072320.0000910.0000940100ascending order
quadsort131072320.0003420.0003490100ascending saw
fluxsort131072320.0003420.0003570100ascending saw
glidesort131072320.0003520.0003620100ascending saw
quadsort131072320.0002310.0002410100pipe organ
fluxsort131072320.0002240.0002440100pipe organ
glidesort131072320.0002280.0002380100pipe organ
quadsort131072320.0000740.0000770100descending order
fluxsort131072320.0000730.0000750100descending order
glidesort131072320.0001060.0001090100descending order
quadsort131072320.0003730.0003830100descending saw
fluxsort131072320.0003510.0003740100descending saw
glidesort131072320.0003640.0003750100descending saw
quadsort131072320.0006870.0007170100random tail
fluxsort131072320.0006610.0006920100random tail
glidesort131072320.0009440.0009800100random tail
quadsort131072320.0011920.0012320100random half
fluxsort131072320.0011390.0011860100random half
glidesort131072320.0016500.0016900100random half
quadsort131072320.0007730.0008010100ascending tiles
fluxsort131072320.0005690.0005950100ascending tiles
glidesort131072320.0025700.0026750100ascending tiles
quadsort131072320.0021900.0022580100bit reversal
fluxsort131072320.0019690.0020280100bit reversal
glidesort131072320.0027940.0028540100bit reversal
quadsort131072320.0011070.0011360100random % 4
fluxsort131072320.0002390.0002490100random % 4
glidesort131072320.0005020.0005290100random % 4
quadsort131072320.0010510.0010880100signal
fluxsort131072320.0011890.0012290100signal
glidesort131072320.0037450.0038520100signal
quadsort131072320.0019210.0019930100exponential
fluxsort131072320.0011230.0011520100exponential
glidesort131072320.0023320.0024340100exponential
quadsort131072320.0019150.001972010095% generic
fluxsort131072320.0002900.000302010095% generic
glidesort131072320.0008810.000916010095% generic
</details>

Footnotes

Footnotes

  1. BlockQuicksort: How Branch Mispredictions don't affect Quicksort, Stefan Edelkamp and Armin Weiß, April 2016

  2. Quicksort for Equal Keys, Lutz M. Wegner, April 1985