Home

Awesome

vulkan_radix_sort

Vulkan implementation of radix sort.

Reduce-then-scan GPU radix sort algorithm is implemented (Onesweep is abandoned.)

Requirements

Build and Test

$ cmake . -B build
$ cmake --build build --config Release -j
$ ./build/Release/bench.exe  # Windows
$ ./build/bench  # Linux

Test Environment

Benchmark Result

================ sort key value speed ================
[0] total time: CUB 4.20352ms (7.98246 GItems/s) vs. Vulkan 3.52461ms (9.52005 GItems/s)
[1] total time: CUB 4.17075ms (8.04518 GItems/s) vs. Vulkan 3.49389ms (9.60375 GItems/s)
[2] total time: CUB 4.16768ms (8.05111 GItems/s) vs. Vulkan 3.50896ms (9.5625 GItems/s)
[3] total time: CUB 4.12774ms (8.129 GItems/s) vs. Vulkan 3.7161ms (9.02948 GItems/s)
[4] total time: CUB 4.16666ms (8.05308 GItems/s) vs. Vulkan 3.45875ms (9.70131 GItems/s)
...

Use as a Library with CMake

Usage

  1. When creating VkDevice, enable VkPhysicalDeviceBufferAddressFeatures.

  2. When creating VmaAllocator, enable VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT flag.

  3. Create VkBuffer for keys and values, with VK_BUFFER_USAGE_STORAGE_BUFFER_BIT and VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT.

  4. Create VrdxSorter

    It creates shared resources: pipeline layouts, pipelines, etc.

    VrdxSorter sorter = VK_NULL_HANDLE;
    VrdxSorterCreateInfo sorterInfo = {};
    sorterInfo.physicalDevice = physicalDevice;
    sorterInfo.device = device;
    sorterInfo.pipelineCache = pipelineCache;
    vrdxCreateSorter(&sorterInfo, &sorter);
    
  5. Create a temporary storage buffer for sort.

    // request storage buffer request
    VrdxSorterStorageRequirements requirements;
    // for key-only
    vrdxGetSorterStorageRequirements(sorter, elementCount, &requirements);
    // for key-value
    vrdxGetSorterKeyValueStorageRequirements(sorter, elementCount, &requirements);
    
    // create or reuse buffer
    VkBufferCreateInfo bufferInfo = {VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO};
    bufferInfo.size = requirements.size;
    bufferInfo.usage = requirements.usage;
    // ...
    
  6. Record sort commands.

    This command binds pipeline, pipeline layout, and push constants internally.

    So, users must not expect previously bound targets retain after the sort command.

    Users must add proper execution barriers.

    One can use buffer memory barrier, but in general, global barriers are more efficient than per-resource, according to official synchronization examples:

    ... global memory barrier covers all resources. Generally considered more efficient to do a global memory barrier than per-resource barriers, per-resource barriers should usually be used for queue ownership transfers and image layout transitions - otherwise use global barriers.

    The sort command will read from key/value buffers (and elementCount buffer for indirect sort) in compute shader stage, and write to output key/value buffers in later compute shader stage.

    The second synchronization scope before sort command must include COMPUTE_SHADER stage (and TRANSFER for indirect sort) and SHADER_READ access (and TRANSFER_READ for indirect sort).

    The first synchronization scope after sort command must include COMPUTE_SHADER stage and SHADER_WRITE access.

    VkQueryPool queryPool;  // VK_NULL_HANDLE, or a valid timestamp query pool with size at least 8.
    
    // sort keys
    vrdxCmdSort(commandBuffer, sorter, elementCount,
                keysBuffer, 0,
                storageBuffer, 0,
                queryPool, 0);
    
    // sort keys with values
    vrdxCmdSortKeyValue(commandBuffer, sorter, elementCount,
                        keysBuffer, 0,
                        valuesBuffer, 0,
                        storageBuffer, 0,
                        queryPool, 0);
    
    // indirectBuffer contains elementCount, a single uint entry in GPU buffer.
    // maxElementCount is required for storage buffer offsets.
    // element count in the indirect buffer must not be greater than maxElementCount. Otherwise, undefined behavior.
    vrdxCmdSortKeyValueIndirect(commandBuffer, sorter, maxElementCount,
                                indirectBuffer, 0,
                                keysBuffer, 0,
                                valuesBuffer, 0,
                                storageBuffer, 0,
                                queryPool, 0);
    

TODO

References

Troubleshooting