GPU Prefix Sums

GPUPrefixSums aims to bring state-of-the-art GPU prefix sum techniques from CUDA into portable compute shaders. It also contributes "Decoupled Fallback," a novel fallback technique for Chained Scan with Decoupled Lookback that should allow devices without forward-thread-progress guarantees to perform the scan without crashing. The D3D12 implementation includes an extensive survey of GPU prefix sums, ranging from the warp to the device level; all included algorithms utilize wave/warp/subgroup (hereafter "wave") level parallelism but are completely agnostic of wave size. As a measure of code quality, GPUPrefixSums has also been implemented in CUDA and benchmarked against Nvidia's CUB library. Although GPUPrefixSums aims to be portable to any wave size supported by HLSL, [4, 128], due to hardware limitations it has only been tested on wave sizes 4, 16, 32, and 64. You have been warned!

If you are interested in prefix sums for their use in radix sorting, check out GPUPrefixSums' sibling repository, GPUSorting!

Decoupled Fallback

In Decoupled Fallback, a thread block spins for a set number of cycles while waiting for the reduction of a preceding partition tile. If the maximum spin count is exceeded, the thread block is free to perform a fallback operation. Multiple thread blocks are allowed to perform fallbacks on the same deadlocking tile, but through the use of atomic compare-and-swap, only one thread block ends up broadcasting its reduction to device memory. Although this means potentially performing redundant calculations, the upside is that fallback performance is no longer limited by the latency of signal propagation between thread blocks.

As of writing (9/22/2024), Decoupled Fallback shows promising results on Apple M-series GPUs. However, the version included here is out of date; the most up-to-date development is occurring in Vello.

Survey

Prefix

A prefix sum, also called a scan, is a running total of a sequence of numbers: the n-th output element is the sum of the input elements up to position n. If the prefix sum is inclusive, the n-th element is included in that total; if it is exclusive, it is not. The prefix sum is one of the most important algorithmic primitives in parallel computing, underpinning everything from sorting, to compression, to graph traversal.

Basic Scans

<details open> <summary>

Kogge-Stone

</summary>

KoggesStoneImage

</details> <details> <summary>

Sklansky

</summary>

SklanskyFinal

</details> <details> <summary>

Brent-Kung

</summary>

BrentKungImage

</details> <details> <summary>

Reduce Scan

</summary>

ReduceScanFinal

</details> <details> <summary>

Raking Reduce-Scan

</summary>

RakingReduceScan

</details>

Warp-Synchronized Scans

<details open> <summary>

Warp-Sized-Radix Brent-Kung

</summary>

RadixBrentKung

</details> <details> <summary>

Warp-Sized-Radix Brent-Kung with Fused Upsweep-Downsweep

</summary>

RadixBKFused

</details> <details> <summary>

Warp-Sized-Radix Sklansky

</summary>

WarpSklansky

</details> <details> <summary>

Warp-Sized-Radix Serial

</summary>

RadixSerial

</details> <details> <summary>

Warp-Sized-Radix Raking Reduce-Scan

</summary>

WarpRakingReduce

</details>

Block-Level Scan Pattern

<details> <summary>

First Partition

</summary>

Block Level 1

</details> <details> <summary>

Second Partition and Onwards

</summary>

Block Level 2

</details>

Device Level Scan Pattern (Reduce-Then-Scan)

<details> <summary>

Reduce

</summary>

Device 1

</details> <details> <summary>

Scan Along the Intermediate Reductions

</summary>

Device 2

</details> <details> <summary>

Scan and Pass in Intermediate Values

</summary>

Device 3

</details>

Getting Started

GPUPrefixSumsD3D12

A headless implementation in D3D12; includes:

Requirements:

The repository folder contains a Visual Studio 2019 project and solution file. Upon building the solution, NuGet will download and link the following external dependencies:

See the repository wiki for information on running tests.

GPUPrefixSumsCUDA

GPUPrefixSumsCUDA includes:

The purpose of this implementation is to benchmark the algorithms and demystify their implementation in the CUDA environment. It is not intended for production use; a proper implementation can instead be found in the CUB library.

Requirements:

The repository folder contains a Visual Studio 2019 project and solution file; there are no external dependencies besides the CUDA toolkit. The use of sync primitives necessitates Compute Capability 7.x or greater. See the repository wiki for information on running tests.

GPUPrefixSumsUnity

Released as a Unity package; includes:

Requirements:

Within the Unity package manager, add a package from git URL and enter:

https://github.com/b0nes164/GPUPrefixSums.git?path=/GPUPrefixSumsUnity

See the repository wiki for information on running tests.

GPUPrefixSumsWGPU

WARNING: TESTING ONLY CURRENTLY, NOT FULLY PORTABLE

A barebones implementation (no vectorization, no wave intrinsics) to be used as a testbed.

Requirements:

Interesting Reading and Bibliography

Duane Merrill and Michael Garland. “Single-pass Parallel Prefix Scan with Decoupled Look-back”. 2016. url: https://research.nvidia.com/publication/2016-03_single-pass-parallel-prefix-scan-decoupled-look-back

Grimshaw, Andrew S. and Duane Merrill. “Parallel Scan for Stream Architectures.” (2012). url: https://libraopen.lib.virginia.edu/downloads/6t053g00z

Matt Pettineo. GPU Memory Pools in D3D12. Jul. 2022. url: https://therealmjp.github.io/posts/gpu-memory-pool/

Raph Levien. Prefix sum on portable compute shaders. Nov. 2021. url: https://raphlinus.github.io/gpu/2021/11/17/prefix-sum-portable.html

Tyler Sorensen, Hugues Evrard, and Alastair F. Donaldson. “GPU Schedulers: How Fair Is Fair Enough?”. In: 29th International Conference on Concurrency Theory (CONCUR 2018). Ed. by Sven Schewe and Lijun Zhang. Vol. 118. Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany: Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2018, 23:1–23:17. isbn: 978-3-95977-087-3. doi: 10.4230/LIPIcs.CONCUR.2018.23. url: http://drops.dagstuhl.de/opus/volltexte/2018/9561.

Vasily Volkov. “Understanding Latency Hiding on GPUs”. PhD thesis. EECS Department, University of California, Berkeley, Aug. 2016. url: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.html

Zhe Jia et al. Dissecting the NVidia Turing T4 GPU via Microbenchmarking. 2019. arXiv: 1903.07486. url: https://arxiv.org/abs/1903.07486