Awesome
GPUSorting
GPUSorting aims to bring state-of-the-art GPU sorting techniques from CUDA and make them available in portable compute shaders. All sorting algorithms included in GPUSorting utilize wave/warp/subgroup (referred to as "wave" hereon) level parallelism but are completely agnostic of wave size. Wave size specialization is entirely accomplished through runtime logic, instead of through shader compilation defines. This has a minimal impact on performance and significantly reduces the number of shader permutations. Although GPUSorting aims to be portable to any wave size supported by HLSL, [4, 128], due to hardware limitations, it has only been tested on wave sizes 4, 16, 32, and 64. You have been warned!
Device Radix Sort vs OneSweep
GPUSorting includes two sorting algorithms, both based on those found in the CUB library: DeviceRadixSort and OneSweep. The two algorithms are almost identical, except for the way that the inter-threadblock prefix sum of digit counts is performed. In DeviceRadixSort, the prefix sum is done through an older technique, "reduce-then-scan," whereas in OneSweep, it is accomplished using "chained-scan-with-decoupled-lookback." Because "chained-scan" relies on forward thread-progress guarantees, OneSweep is less portable than DeviceRadixSort, and DeviceRadixSort should be used whenever portability is a concern. Again, due to a lack of hardware, I cannot say exactly how portable OneSweep is, but as a general rule of thumb, OneSweep tends to run on anything that is not mobile, a software rasterizer, or Apple. Use OneSweep at your own risk; you have been warned!
As a measure of the quality of the code, GPUSorting has also been implemented in CUDA and benchmarked against Nvidia's CUB library, with the following results:
SplitSort
GPUSorting also introduces a novel hybrid radix-merge based segmented sort called SplitSort. Due to its unique radix-based property across all maximum segment lengths, SplitSort demonstrates significant speedups when sorting on 16-bit keys. On 32-bit keys, SplitSort shows modest speedups on maximum segment lengths less than 256, particularly when sorting with a 64-bit value. At this point, SplitSort is still very much a proof of concept. For a more complete write-up on how SplitSort works under the hood, see this thread on the Lindbender Org Zulip. Note that the following benchmarks were performed on Ubuntu using the benchmarking suite provided by Kobus et al..
Various Other Benchmarks
Thearling and Smith Benchmark:
GPUSorting vs Fidelity FX Parallel Sort
Automatic Tuning for Devices:
Getting Started
GPUSortingD3D12
Headless implementation in D3D12, currently demo only, but release as a package is planned.
Requirements:
- Visual Studio 2019 or greater
- Windows SDK 10.0.20348.0 or greater
The repository folder contains a Visual Studio 2019 project and solution file. Upon building the solution, NuGet will download and link the external dependencies. See the repository wiki for information on running tests.
GPUSortingCUDA
The purpose of this implementation is to benchmark the algorithms and demystify their implementation in the CUDA environment. It is not intended for production or use; instead, a proper implementation can be found in the CUB library.
- Visual Studio 2019 or greater
- Windows SDK 10.0.20348.0 or greater
- CUDA Toolkit 12.3.2
- Nvidia Graphics Card with Compute Capability 7.x or greater.
The repository folder contains a Visual Studio 2019 project and solution file; there are no external dependencies besides the CUDA toolkit. The use of sync primitives necessitates Compute Capability 7.x or greater. See the repository wiki for information on running tests.
GPUSortingUnity
Released as a Unity package.
Requirements:
- Unity 2021.3.35f1 or greater
Within the Unity package manager, add a package from git URL and enter:
https://github.com/b0nes164/GPUSorting.git?path=/GPUSortingUnity
See the repository wiki for information on running tests.
Strongly Suggested Reading / Bibliography
Andy Adinets and Duane Merrill. Onesweep: A Faster Least Significant Digit Radix Sort for GPUs. 2022. arXiv: 2206.01784 url: https://arxiv.org/abs/2206.01784
Dondragmer. CuteSort. https://gist.github.com/dondragmer/0c0b3eed0f7c30f7391deb11121a5aa1.
Duane Merrill and Michael Garland. “Single-pass Parallel Prefix Scan with De-coupled Lookback”. In: 2016. url: https://research.nvidia.com/publication/2016-03_single-pass-parallel-prefix-scan-decoupled-look-back
Kaixi Hou, Weifeng Liu, Hao Wang, and Wu-chun Feng. 2017. Fast segmented sort on GPUs. In Proceedings of the International Conference on Supercomputing (ICS '17). Association for Computing Machinery, New York, NY, USA, Article 12, 1–10. https://doi.org/10.1145/3079079.3079105
Kobus, R., Nelgen, J., Henkys, V., Schmidt, B. (2023). Faster Segmented Sort on GPUs. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds) Euro-Par 2023: Parallel Processing. Euro-Par 2023. Lecture Notes in Computer Science, vol 14100. Springer, Cham. https://doi.org/10.1007/978-3-031-39698-4_45
Oded Green, Robert McColl, and David A. Bader. 2012. GPU merge path: a GPU merging algorithm. In Proceedings of the 26th ACM international conference on Supercomputing (ICS '12). Association for Computing Machinery, New York, NY, USA, 331–340. https://doi.org/10.1145/2304576.2304621
Rafael F. Schmid, Flávia Pisani, Edson N. Cáceres, and Edson Borin. 2022. An evaluation of fast segmented sorting implementations on GPUs. Parallel Comput. 110, C (May 2022). https://doi.org/10.1016/j.parco.2021.102889
Saman Ashkiani et al. “GPU Multisplit”. In: SIGPLAN Not. 51.8 (Feb. 2016). issn: 0362-1340. doi: 10.1145/3016078.2851169. url: https://doi.org/10.1145/3016078.2851169.
Sean Baxter. Segmented Sort and Locality Sort. https://moderngpu.github.io/segsort.html