Vulkan Device Generated Commands Sample

The "device generated cmds" sample demonstrates the use of the VK_NV_device_generated_commands and VK_EXT_device_generated_commands (DGC) extensions.

We recommend reading this blog post about the extension first.

VK_EXT_device_generated_commands differs slightly in detail, but is conceptually very similar.

| | NV | EXT |
|---|---|---|
| input stream | flexible mix or interleaved or separated | single interleaved |
| indirect shader binding | VkPipeline can be extended with a fixed set of immutable VkGraphicsShaderGroupCreateInfoNV that are indexed at runtime | VkIndirectExecutionSet is a table that is indexed at runtime and does allow dynamic updates of slots with VkPipeline or VkShaderEXT |
| draw calls | one draw per indirect command sequence | one draw, or passing an indirect buffer address and count (allows binning draw calls by state) |

The sample also lets you compare the impact of a few alternative ways to do the same work:

The content rendered in the sample is a CAD model made of many parts that each have few triangles. Such low complexity per draw call very often makes an application CPU bound, and it also serves as a stress test for the GPU's command processor.

The sample is derived from the threaded cadscene sample. Please refer to its readme, as it explains the scene's principal setup.

For simplicity, the draw call information for DGC is also created on the CPU and uploaded once. In real-world applications we would use compute shaders for this and implement something like scene traversal and occlusion culling on the device.

sample screenshot

Options

Beyond the key comparisons listed above, there are a few more options.

Scene Complexity

To reduce the number of commands, we do some level of redundant state filtering when generating the commands for each draw call. Because this scene has so few triangles per draw call, this is highly recommended.
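The idea of redundant state filtering could be sketched as follows. This is a minimal illustration and not the sample's actual code: `DrawItem`, `Command` and `buildCommands` are hypothetical names, and the state is reduced to two integer ids.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-draw description; the ids stand in for real bindings.
struct DrawItem {
  uint32_t shaderId;
  uint32_t materialId;
  uint32_t firstIndex;
  uint32_t indexCount;
};

enum class CommandType { BindShader, BindMaterial, Draw };

struct Command {
  CommandType type;
  uint32_t    a;  // shader/material id, or firstIndex for Draw
  uint32_t    b;  // indexCount for Draw, otherwise unused
};

// Emits a state command only if the state differs from the previous draw.
// With filterRedundant == false every draw re-binds all of its state.
std::vector<Command> buildCommands(const std::vector<DrawItem>& draws, bool filterRedundant)
{
  std::vector<Command> cmds;
  bool     first        = true;
  uint32_t lastShader   = 0;
  uint32_t lastMaterial = 0;

  for (const DrawItem& d : draws) {
    // assumption of this sketch: empty draws were removed earlier
    assert(d.indexCount > 0);

    if (!filterRedundant || first || d.shaderId != lastShader)
      cmds.push_back({CommandType::BindShader, d.shaderId, 0});
    if (!filterRedundant || first || d.materialId != lastMaterial)
      cmds.push_back({CommandType::BindMaterial, d.materialId, 0});
    cmds.push_back({CommandType::Draw, d.firstIndex, d.indexCount});

    lastShader   = d.shaderId;
    lastMaterial = d.materialId;
    first        = false;
  }
  return cmds;
}
```

The fewer state changes occur between consecutive draws, the more commands the filter removes, which is why it pays off most when draws are also sorted by state.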

A particular new feature that the DGC extension provides is the ability to switch shaders on the device. For that purpose the sample generates up to 128 artificial vertex/fragment shader combinations, which yield different polygon stippling patterns.

When max shadergroups is set to 1, indirect shader binds are disabled in the DGC renderers. Because this support is optional in the DGC extensions, the value is automatically set to 1 if the feature is absent.

Renderer Settings

Device Generated Commands

For an overview of this extension, we recommend reading this article.

There are a few principal steps:

  1. Define a sequence of commands you want to generate as IndirectCommandsLayoutEXT/NV.
  2. If you want the ability to change shaders, create your graphics pipelines with VK_PIPELINE_CREATE_2_INDIRECT_BINDABLE_BIT_EXT / VK_PIPELINE_CREATE_INDIRECT_BINDABLE_BIT_NV or your shader objects with VK_SHADER_CREATE_INDIRECT_BINDABLE_BIT_EXT.
  3. Create a preprocess buffer based on sizing information acquired by vkGetGeneratedCommandsMemoryRequirementsEXT/NV.
  4. Fill your input buffer(s) for the generation step and set up VkGeneratedCommandsInfoEXT/NV accordingly.
  5. Optionally use a separate preprocess step via vkCmdPreprocessGeneratedCommandsEXT/NV.
  6. Run the execution via vkCmdExecuteGeneratedCommandsEXT/NV.

Highlighted Files

Indirect Shader Binding

EXT_device_generated_commands

Similar to descriptor sets, there exists VkIndirectExecutionSetEXT which serves as a binding table with a fixed upper count.

| Type | Stored Objects | Function to register | Pipeline / Shader Flag |
|---|---|---|---|
| VK_INDIRECT_EXECUTION_SET_INFO_TYPE_PIPELINES_EXT | VkPipeline | vkUpdateIndirectExecutionSetPipelineEXT | VK_PIPELINE_CREATE_2_INDIRECT_BINDABLE_BIT_EXT |
| VK_INDIRECT_EXECUTION_SET_INFO_TYPE_SHADER_OBJECTS_EXT | VkShaderEXT | vkUpdateIndirectExecutionSetShaderEXT | VK_SHADER_CREATE_INDIRECT_BINDABLE_BIT_EXT |

The VK_INDIRECT_COMMANDS_TOKEN_TYPE_EXECUTION_SET_EXT token requires either a single uint32_t input for pipelines, or one uint32_t per shader stage for shader objects.
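To make the input size of the execution set token concrete, here is a small sketch that computes it from the set type and the stage mask. The helper names (`ExecutionSetType`, `executionSetTokenSize`, `stageSlot`) are hypothetical, and the stage constants are local copies of the corresponding VkShaderStageFlagBits values so that the snippet compiles without the Vulkan headers. The ordering of the per-stage indices from lowest to highest stage bit follows the pseudo code later in this document.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Local copies of VkShaderStageFlagBits values (sketch only).
constexpr uint32_t STAGE_VERTEX   = 0x00000001;
constexpr uint32_t STAGE_FRAGMENT = 0x00000010;
constexpr uint32_t STAGE_TASK     = 0x00000040;
constexpr uint32_t STAGE_MESH     = 0x00000080;

enum class ExecutionSetType { Pipelines, ShaderObjects };

// Number of bytes the execution set token reads per sequence.
size_t executionSetTokenSize(ExecutionSetType type, uint32_t shaderStages)
{
  assert(shaderStages != 0);
  if (type == ExecutionSetType::Pipelines)
    return sizeof(uint32_t);  // a single pipeline index
  // shader objects: one index per enabled stage
  return std::bitset<32>(shaderStages).count() * sizeof(uint32_t);
}

// Position of a stage's index within the token input (shader objects only).
uint32_t stageSlot(uint32_t shaderStages, uint32_t stageBit)
{
  assert((shaderStages & stageBit) != 0);
  // count the enabled stage bits below stageBit
  return static_cast<uint32_t>(std::bitset<32>(shaderStages & (stageBit - 1)).count());
}
```

For a vertex plus fragment combination this yields 8 bytes per sequence for shader objects, with the vertex shader index first and the fragment shader index second.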

typedef struct VkIndirectExecutionSetCreateInfoEXT
{
  VkStructureType                   sType;
  void const*                       pNext;
  VkIndirectExecutionSetInfoTypeEXT type;
  VkIndirectExecutionSetInfoEXT     info;
  // either pointer to VkIndirectExecutionSetPipelineInfoEXT or 
  //                   VkIndirectExecutionSetShaderInfoEXT

} VkIndirectExecutionSetCreateInfoEXT;

// for pipelines 

typedef struct VkIndirectExecutionSetPipelineInfoEXT
{
  VkStructureType sType;
  void const*     pNext;
  VkPipeline      initialPipeline;
  uint32_t        maxPipelineCount;
} VkIndirectExecutionSetPipelineInfoEXT;

// for shader objects

typedef struct VkIndirectExecutionSetShaderLayoutInfoEXT
{
  VkStructureType              sType;
  void const*                  pNext;
  uint32_t                     setLayoutCount;
  VkDescriptorSetLayout const* pSetLayouts;
} VkIndirectExecutionSetShaderLayoutInfoEXT;

typedef struct VkIndirectExecutionSetShaderInfoEXT
{
  VkStructureType                                  sType;
  void const*                                      pNext;

  uint32_t                                         shaderCount;
  VkShaderEXT const*                               pInitialShaders;
  VkIndirectExecutionSetShaderLayoutInfoEXT const* pSetLayoutInfos;

  // the size of the table
  uint32_t                                         maxShaderCount;

  uint32_t                                         pushConstantRangeCount;
  VkPushConstantRange const*                       pPushConstantRanges;
} VkIndirectExecutionSetShaderInfoEXT;

// Updating is similar to descriptor set writes.
// Developers must ensure that the `index` is not currently
// in use by the device.

typedef struct VkWriteIndirectExecutionSetPipelineEXT
{
  VkStructureType sType;
  void const*     pNext;
  uint32_t        index;
  VkPipeline      pipeline;
} VkWriteIndirectExecutionSetPipelineEXT;

typedef struct VkWriteIndirectExecutionSetShaderEXT
{
  VkStructureType sType;
  void const*     pNext;
  uint32_t        index;
  VkShaderEXT     shader;
} VkWriteIndirectExecutionSetShaderEXT;

NV_device_generated_commands

The ray tracing extension introduced the notion of "ShaderGroups" that are stored within a pipeline object. This extension makes use of the same principle to store multiple shader groups within a graphics pipeline object.

Each shader group can override a subset of the pipeline's state:

typedef struct VkGraphicsShaderGroupCreateInfoNV
{
  // A shadergroup, is a set of unique shader combinations (VS,FS,...) etc.
  // that all are stored within a single graphics pipeline that share
  // most of the state.
  // Must not mix mesh with traditional pipeline.
  VkStructureType sType;
  const void*     pNext;

  // overrides createInfo from original graphicsPipeline
  uint32_t                                          stageCount;
  const VkPipelineShaderStageCreateInfo*            pStages;
  const VkPipelineVertexInputStateCreateInfo*       pVertexInputState;
  const VkPipelineTessellationStateCreateInfo*      pTessellationState;
} VkGraphicsShaderGroupCreateInfoNV;


typedef struct VkGraphicsPipelineShaderGroupsCreateInfoNV
{
  // extends regular VkGraphicsPipelineCreateInfo
  // If bound via vkCmdBindPipeline will behave as if pGroups[0] is active,
  // otherwise bind using vkCmdBindPipelineShaderGroup with proper index
  VkStructureType                          sType;
  const void*                              pNext;

  // first shader group must match the pipeline's traditional shader stages
  uint32_t                                 groupCount;
  const VkGraphicsShaderGroupCreateInfoNV* pGroups;
  
  // we recommend importing shader groups from regular pipelines that were created with
  // `VK_PIPELINE_CREATE_INDIRECT_BINDABLE_BIT_NV` and are compatible in state
  uint32_t                                 pipelineCount;
  const VkPipeline*                        pPipelines; 
} VkGraphicsPipelineShaderGroupsCreateInfoNV;

You can bind a shader group using vkCmdBindPipelineShaderGroupNV(.... groupIndex). However, the primary use-case is to bind them indirectly on the device via VK_INDIRECT_COMMANDS_TOKEN_TYPE_SHADER_GROUP_NV which is a single uint32_t index into the array of shader groups.

Important Note: To make any graphics pipeline bindable by the device set the VK_PIPELINE_CREATE_INDIRECT_BINDABLE_BIT_NV flag. This is also true for imported pipelines.

To speed up creation of a pipeline that contains many pipelines, you can pass existing pipelines to be referenced via pPipelines. You must ensure that those referenced pipelines are alive as long as the referencing pipeline is alive. The referenced pipelines must match in all state, except for what can be overridden per shader group. The shader groups from such imported pipelines are virtually appended in order of the array.

With this mechanism you can easily collect existing pipelines (though don't forget the bindable flag and the state compatibility), which should ease the integration of this technology.

IndirectCommandsLayoutEXT/NV

The DGC extension allows you to generate some common graphics commands on the device based on a pre-defined sequence of command tokens. This sequence is encoded in the IndirectCommandsLayoutEXT/NV object.

The following pseudo code illustrates the kind of state changes you can make. You will see that there is no ability to change the descriptor set bindings, which is why this sample showcases the passing of bindings via push constants. This is somewhat similar to ray tracing, where you also manage all resources globally.

void cmdProcessSequence(cmd, pipeline, indirectCommandsLayout, pIndirectCommandsStreams, uint32_t s)
{
  for (uint32_t t = 0; t < indirectCommandsLayout.tokenCount; t++){
    token = indirectCommandsLayout.pTokens[t];

#if NV_device_generated_commands
    uint32_t streamIndex = token.stream;
#else
    // EXT_dgc has a single interleaved stream
    uint32_t streamIndex = 0;
#endif

    uint32_t stride   = indirectCommandsLayout.pStreamStrides[streamIndex];
    stream            = pIndirectCommandsStreams[streamIndex];
    uint32_t offset   = stream.offset + stride * s + token.offset;
    const void* input = stream.buffer.pointer( offset );

    // the token type selects the command, `input` is only the raw data
    switch(token.type){
#if NV_device_generated_commands
    VK_INDIRECT_COMMANDS_TOKEN_TYPE_SHADER_GROUP_NV:
      VkBindShaderGroupIndirectCommandNV* bind = input;

      // the pipeline must have been created with
      // VK_PIPELINE_CREATE_INDIRECT_BINDABLE_BIT_NV
      // and VkGraphicsPipelineShaderGroupsCreateInfoNV

      vkCmdBindPipelineShaderGroupNV(cmd, indirectCommandsLayout.pipelineBindPoint,
        pipeline, bind->groupIndex);
    break;

    VK_INDIRECT_COMMANDS_TOKEN_TYPE_STATE_FLAGS_NV:
      VkSetStateFlagsIndirectCommandNV* state = input;

      if (token.indirectStateFlags & VK_INDIRECT_STATE_FLAG_FRONTFACE_BIT_NV){
        if (state->data & (1 << 0)){
          setState(VK_FRONT_FACE_CLOCKWISE);
        } else {
          setState(VK_FRONT_FACE_COUNTER_CLOCKWISE);
        }
      }
    break;
#endif

#if EXT_device_generated_commands
    VK_INDIRECT_COMMANDS_TOKEN_TYPE_EXECUTION_SET_EXT:
      uint32_t* data = input;

      if (token.pExecutionSet->type == VK_INDIRECT_EXECUTION_SET_INFO_TYPE_PIPELINES_EXT) {
        // the pipeline must have been created with
        // VK_PIPELINE_CREATE_2_INDIRECT_BINDABLE_BIT_EXT (not the same as NV!)
        // and registered via vkUpdateIndirectExecutionSetPipelineEXT

        uint32_t pipelineIndex = *data;
        vkCmdBindPipeline(cmd, activePipelineBindPoint, indirectExecutionSet.pipelines[pipelineIndex]);
      }
      else if (token.pExecutionSet->type == VK_INDIRECT_EXECUTION_SET_INFO_TYPE_SHADER_OBJECTS_EXT) {

        // the shaders must have been created with
        // VK_SHADER_CREATE_INDIRECT_BINDABLE_BIT_EXT
        // and registered via vkUpdateIndirectExecutionSetShaderEXT

        // iterate in lowest to highest bit order
        for (shaderStageBit : iterateSetBits(token.pExecutionSet->shaderStages))
        {
          uint32_t shaderIndex = *data;
          vkCmdBindShadersEXT(cmd, 1, &shaderStageBit, &indirectExecutionSet.shaders[shaderIndex]);
          data++;
        }
      }

    break;

    VK_INDIRECT_COMMANDS_TOKEN_TYPE_SEQUENCE_INDEX_EXT:
      uint32_t sequenceIndex = s;

      vkCmdPushConstants(cmd,
        activePipelineLayout,
        token.pPushConstant->updateRange.stageFlags,
        token.pPushConstant->updateRange.offset,
        token.pPushConstant->updateRange.size, &sequenceIndex);
    break;
#endif

    // we focus on the EXT flavors here
    // NV variants are pretty similar

    VK_INDIRECT_COMMANDS_TOKEN_TYPE_PUSH_CONSTANT_EXT:
      uint32_t* data = input;

      vkCmdPushConstants(cmd,
        activePipelineLayout,
        token.pPushConstant->updateRange.stageFlags,
        token.pPushConstant->updateRange.offset,
        token.pPushConstant->updateRange.size, data);
    break;

    VK_INDIRECT_COMMANDS_TOKEN_TYPE_INDEX_BUFFER_EXT:
      VkBindIndexBufferIndirectCommandEXT* data = input;

      // NV: the indexType may optionally be remapped
      // from a custom uint32_t value, via
      // VkIndirectCommandsLayoutTokenNV::pIndexTypeValues
      
      // EXT: the input mode VkIndirectCommandsIndexBufferTokenEXT::mode
      // can be set to DXGI
      
      vkCmdBindIndexBuffer2KHR(cmd,
        deriveBuffer(data->bufferAddress),
        deriveOffset(data->bufferAddress),
        data->size,
        data->indexType);
    break;

    VK_INDIRECT_COMMANDS_TOKEN_TYPE_VERTEX_BUFFER_EXT:
      VkBindVertexBufferIndirectCommandEXT* data = input;

      vkCmdBindVertexBuffers2(cmd,
        token.pVertexBuffer->vertexBindingUnit, 1,
        &deriveBuffer(data->bufferAddress),
        &deriveOffset(data->bufferAddress),
        &data->size,
        &data->stride);
    break;

    // regular draws use an inlined draw indirect struct
    // directly from the input stream

    VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_INDEXED_EXT:
      vkCmdDrawIndexedIndirect(cmd,
        stream.buffer, offset, 1, 0);
    break;

    VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_EXT:
      vkCmdDrawIndirect(cmd,
        stream.buffer,
        offset, 1, 0);
    break;

    VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_MESH_TASKS_EXT:
      vkCmdDrawMeshTasksIndirectEXT(cmd,
        stream.buffer, offset, 1, 0);
    break;

#if EXT_device_generated_commands

    // NEW for EXT_dgc is that we can use gpu-sourced indirect buffer
    // and count. The indirect draw calls can be stored anywhere.

    VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_INDEXED_COUNT_EXT:
      VkDrawIndirectCountIndirectCommandEXT* data = input;

      vkCmdDrawIndexedIndirect(cmd,
        deriveBuffer(data->bufferAddress), deriveOffset(data->bufferAddress), data->count, data->stride);
    break;

    VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_COUNT_EXT:
      VkDrawIndirectCountIndirectCommandEXT* data = input;

      vkCmdDrawIndirect(cmd,
        deriveBuffer(data->bufferAddress), deriveOffset(data->bufferAddress), data->count, data->stride);
    break;

    VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_MESH_TASKS_COUNT_EXT:
      VkDrawIndirectCountIndirectCommandEXT* data = input;

      vkCmdDrawMeshTasksIndirectEXT(cmd,
        deriveBuffer(data->bufferAddress), deriveOffset(data->bufferAddress), data->count, data->stride);
    break;
#endif
    }
  }
}
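The offset computation `stream.offset + stride * s + token.offset` from the pseudo code above can be tried out on the CPU. The sketch below packs a hypothetical EXT-style interleaved sequence into a raw byte stream and reads one token input back. `Sequence`, `packSequences`, `tokenInputOffset` and `readDraw` are illustrative names, and `DrawIndexedCommand` is a local copy of the member layout of VkDrawIndexedIndirectCommand so that the snippet compiles without the Vulkan headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Same member layout as VkDrawIndexedIndirectCommand.
struct DrawIndexedCommand {
  uint32_t indexCount;
  uint32_t instanceCount;
  uint32_t firstIndex;
  int32_t  vertexOffset;
  uint32_t firstInstance;
};

// Hypothetical layout of one sequence in the single interleaved stream.
struct Sequence {
  uint32_t           shaderIndex;      // input of an EXECUTION_SET token (pipelines)
  uint64_t           materialAddress;  // input of a PUSH_CONSTANT token
  DrawIndexedCommand draw;             // input of a DRAW_INDEXED token
};

// Byte offset of a token's input for sequence s.
size_t tokenInputOffset(size_t streamOffset, size_t stride, uint32_t s, size_t tokenOffset)
{
  return streamOffset + stride * s + tokenOffset;
}

// Copies the sequences into a raw byte stream; stride is sizeof(Sequence).
std::vector<uint8_t> packSequences(const std::vector<Sequence>& sequences)
{
  std::vector<uint8_t> stream(sequences.size() * sizeof(Sequence));
  if (!sequences.empty())
    std::memcpy(stream.data(), sequences.data(), stream.size());
  return stream;
}

// Reads the draw of sequence s back, the way a DRAW_INDEXED token would.
DrawIndexedCommand readDraw(const std::vector<uint8_t>& stream, uint32_t s)
{
  size_t offset = tokenInputOffset(0, sizeof(Sequence), s, offsetof(Sequence, draw));
  assert(offset + sizeof(DrawIndexedCommand) <= stream.size());
  DrawIndexedCommand cmd;
  std::memcpy(&cmd, stream.data() + offset, sizeof(cmd));
  return cmd;
}
```

Using `sizeof` and `offsetof` on a single struct keeps the stream stride and the token offsets consistent with the data that is written, including any padding the compiler inserts.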

The sequence generation itself is also influenced by a few usage flags, as follows:

cmdProcessAllSequences(
    cmd, pipeline, indirectCommandsLayout, pIndirectCommandsStreams,
    uint32_t maxSequencesCount,
    sequencesCountBuffer, uint64_t sequencesCountOffset,
    sequencesIndexBuffer, uint64_t sequencesIndexOffset)
{
  // an optional device-side count can only reduce the number of sequences
  uint32_t sequencesCount = sequencesCountBuffer ?
    min(maxSequencesCount, sequencesCountBuffer.load_uint32(sequencesCountOffset)) :
    maxSequencesCount;

  for (uint32_t s = 0; s < sequencesCount; s++)
  {
    uint32_t sUsed = s;

#if NV_device_generated_commands
    if (indirectCommandsLayout.flags & VK_INDIRECT_COMMANDS_LAYOUT_USAGE_INDEXED_SEQUENCES_BIT_NV) {
      sUsed = sequencesIndexBuffer.load_uint32( sUsed * sizeof(uint32_t) + sequencesIndexOffset);
    }
#endif

    if (indirectCommandsLayout.flags & VK_INDIRECT_COMMANDS_LAYOUT_USAGE_UNORDERED_SEQUENCES_BIT_EXT) {
      sUsed = incoherent_implementation_dependent_permutation[ sUsed ];
    }

    cmdProcessSequence(cmd, pipeline, indirectCommandsLayout, pIndirectCommandsStreams, sUsed);
  }
}

Preprocess Buffer

The NVIDIA implementation of the DGC extension needs some device space to generate the commands prior to their execution. The preprocess buffer provides this space, and is sized via vkGetGeneratedCommandsMemoryRequirementsEXT/NV.

As you can see below, it depends on the pipeline (NV) or indirectExecutionSet (EXT), as well as the indirectCommandsLayout and the maximum number of sequences and/or draws you may generate. Most sizing information is based on the token type alone, however there can be some variable costs. In particular, the complexity of the shader group changes strongly influences the number of bytes. The more similar the shader groups are, the less memory will be required.

typedef struct VkGeneratedCommandsMemoryRequirementsInfoEXT
{
  VkStructureType             sType;
  void*                       pNext;

  VkIndirectExecutionSetEXT   indirectExecutionSet;

  VkIndirectCommandsLayoutEXT indirectCommandsLayout;

  uint32_t                    maxSequenceCount;
  uint32_t                    maxDrawCount;     // upper limit for the DRAW.._COUNT tokens
} VkGeneratedCommandsMemoryRequirementsInfoEXT;

typedef struct VkGeneratedCommandsMemoryRequirementsInfoNV
{
  VkStructureType                     sType;
  const void*                         pNext;

  VkPipelineBindPoint                 pipelineBindPoint;
  VkPipeline                          pipeline;
  
  VkIndirectCommandsLayoutNV          indirectCommandsLayout;

  uint32_t                            maxSequencesCount;
} VkGeneratedCommandsMemoryRequirementsInfoNV;

The sample shows the size of the buffer in the UI as "preprocessBuffer ... KB". At the time of writing, the size may be substantial for a very large number of draw calls. If you need to stay within a memory budget, you can split your execution into multiple passes and re-use the preprocess memory.
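Splitting the execution to stay within a memory budget could look roughly like the sketch below. It is not taken from the sample: `Pass` and `splitIntoPasses` are hypothetical names, and `sizeForCount` is a stand-in for querying vkGetGeneratedCommandsMemoryRequirementsEXT/NV with a given maximum sequence count. The sketch assumes that the required size grows monotonically with that count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

struct Pass {
  uint32_t firstSequence;
  uint32_t sequenceCount;
};

// Finds the largest per-pass sequence count whose preprocess size fits the
// budget, then cuts the total work into passes of at most that size.
std::vector<Pass> splitIntoPasses(uint32_t totalSequences, uint64_t budgetBytes,
                                  const std::function<uint64_t(uint32_t)>& sizeForCount)
{
  std::vector<Pass> passes;
  if (totalSequences == 0)
    return passes;

  // a single sequence must fit, otherwise splitting cannot help
  assert(sizeForCount(1) <= budgetBytes);

  // binary search, relies on sizeForCount being monotonic (assumption)
  uint32_t lo = 1;
  uint32_t hi = totalSequences;
  while (lo < hi) {
    uint32_t mid = lo + (hi - lo + 1) / 2;
    if (sizeForCount(mid) <= budgetBytes)
      lo = mid;
    else
      hi = mid - 1;
  }

  for (uint32_t first = 0; first < totalSequences;) {
    uint32_t count = std::min(lo, totalSequences - first);
    passes.push_back({first, count});
    first += count;
  }
  return passes;
}
```

Each pass would then preprocess and execute its own range of sequences. Re-using the same preprocess memory for the next pass requires that the previous pass no longer needs it, so appropriate synchronization between passes is assumed here.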

If you make use of the dedicated preprocess step through vkCmdPreprocessGeneratedCommandsEXT/NV, then you must ensure all inputs (all buffer content etc.) are the same at execution time. An implementation is allowed to split the workload required for execution into these two functions.

You can see the time it takes to preprocess in the UI as "Preproc. GPU [ms]" if you selected the preprocess,generate cmds renderer.

Note that the NVIDIA implementation uses an internal compute dispatch to build the preprocess buffer. It is therefore recommended to batch multiple explicit preprocessing steps prior to the execution calls. Otherwise the barriers between implicit preprocessing and execution can slow things down significantly.

Performance

Preliminary results from NVIDIA RTX 6000 Ada Generation, AMD Ryzen 9 7950X 16-Core Processor, Windows 11 64-bit.

At the time of writing, the EXT_device_generated_commands implementation was new, so future improvements to performance and preprocess memory may happen.

The settings are an extreme stress test of very tiny draw calls. Such draw calls with fewer than a few hundred triangles should either be avoided by design or be handled by a task shader emitting meshlets:

sorted once OFF

| renderer | shaders | preprocess [ms] | draw (GPU) [ms] | dgc execution size [MB] | sequences |
|---|---|---|---|---|---|
| re-used cmds | pipeline | | 6.9 | | |
| re-used cmds | shaderobjs | | 5.9 | | |
| threaded cmds (16 threads) | pipeline | 2.4 (CPU) | 7.1 | | |
| threaded cmds (16 threads) | shaderobjs | 2.2 (CPU) | 6.1 | | |
| preprocess,generated ext | pipeline | 0.4 (GPU) | 7.0 | 436 preprocess | 547 K |
| preprocess,generated ext | shaderobjs | 0.2 (GPU) | 6.3 | 180 preprocess | 547 K |
| preprocess,generated ext (binned) | pipeline | 0.3 (GPU) | 24.4 ! | 54 preprocess, 10 drawindirect | 70 K |
| preprocess,generated ext (binned) | shaderobjs | 0.1 (GPU) | 21.0 ! | 21 preprocess, 10 drawindirect | 70 K |
| preprocess,generated nv | pipeline | 0.2 (GPU) | 4.7 * | 145 preprocess | 547 K |

We can see that with so many shader changes, shader objects do better across the renderers that support them. They particularly help EXT_dgc to reduce its preprocess buffer.

NV_dgc is currently fastest due to the static nature of the pipeline table containing the shader groups, allowing more optimizations.

The binned renderers (DRAW.._COUNT tokens) perform particularly badly here, because without state sorting little binning takes place. This still results in a very high number of sequences (70 K), and it creates extra latency when each sequence launches a multi-draw-indirect with little work (approximately 7-8 draw calls). With so many sequences it is faster to inline the draw call data, at the cost of higher memory use. However, we recommend avoiding such a design in the first place and always doing a bit of state sorting / binning.

sorted once ON

When we do state sorting, the pushaddress bindings method is the fastest for most renderers, except binned which keeps inst.vertexattrib index. This sometimes increases tokens for EXT_dgc to 6 as well.

| renderer (sorted once) | shaders | preprocess [ms] | draw (GPU) [ms] | dgc execution size [MB] | sequences |
|---|---|---|---|---|---|
| re-used cmds | pipeline | | 1.5 | | |
| re-used cmds | shaderobjs | | 1.5 | | |
| threaded cmds (16 threads) | pipeline | 1.2 (CPU) | 1.8 | | |
| threaded cmds (16 threads) | shaderobjs | 1.2 (CPU) | 1.8 | | |
| preprocess,generated ext | pipeline | 0.2 (GPU) | 1.4 | 436 preprocess | 547 K |
| preprocess,generated ext | shaderobjs | 0.2 (GPU) | 1.3 | 196 preprocess | 547 K |
| preprocess,generated ext (binned) | pipeline | < 0.1 (GPU) | 1.5 | 0.2 preprocess, 10 drawindirect | 16 |
| preprocess,generated ext (binned) | shaderobjs | < 0.1 (GPU) | 1.5 | 0.1 preprocess, 10 drawindirect | 16 |
| preprocess,generated nv | pipeline | 0.2 (GPU) | 1.3 | 145 preprocess | 547 K |

Sorting by state speeds things up significantly for this scene.

This especially shows the benefit of using EXT_dgc with draw indirect calls binned into few sequences as it reduces memory cost substantially. In the test we used up to 16 shaders, hence we get 16 state buckets, equal to 16 sequences here. While it's not the very fastest technique here, in reality it should do best, because normally the individual draw calls will have more triangles.
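Binning draw calls by state into few sequences could be sketched as follows. This is an illustration under simplifying assumptions, not the sample's implementation: `Draw`, `Bin` and `binByShader` are hypothetical names, and the shader index is the only state key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Draw {
  uint32_t shaderIndex;  // state key used for binning
  uint32_t firstIndex;
  uint32_t indexCount;
};

// One bin maps to one DGC sequence: a shader bind plus a count-based draw
// token that references `drawCount` consecutive indirect draw commands.
struct Bin {
  uint32_t shaderIndex;
  uint32_t firstDraw;  // index into the sorted draw array
  uint32_t drawCount;
};

// Sorts the draws by shader in place and returns one bin per used shader.
std::vector<Bin> binByShader(std::vector<Draw>& draws)
{
  assert(draws.size() <= UINT32_MAX);

  // stable sort keeps the original order of the draws within a bin
  std::stable_sort(draws.begin(), draws.end(),
                   [](const Draw& a, const Draw& b) { return a.shaderIndex < b.shaderIndex; });

  std::vector<Bin> bins;
  for (size_t i = 0; i < draws.size(); i++) {
    if (bins.empty() || bins.back().shaderIndex != draws[i].shaderIndex)
      bins.push_back({draws[i].shaderIndex, static_cast<uint32_t>(i), 0});
    bins.back().drawCount++;
  }
  return bins;
}
```

With at most 16 distinct shaders this produces at most 16 bins regardless of the number of draws, which matches the 16 sequences reported for the binned renderers in the table above.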

However, we want to stress again that the main goal of this extension is not to generate so much work or to simply move things to the GPU, but to leverage it to reduce the actual work needed.

Examples are occlusion culling or other techniques that allow you to reduce the required workload on the device.

Recommendations

Acknowledgements

Special thanks to Mike Blumenkrantz for the long push to make VK_EXT_device_generated_commands a reality, as well as Patrick Doane for the initial kickstart.

Building

Make sure you have installed the Vulkan SDK. Always use 64-bit build configurations.

Ideally, clone this and other interesting nvpro-samples repositories into a common subdirectory. You will always need nvpro_core. The nvpro_core is searched either as a subdirectory of the sample, or in a common parent directory.

If you are interested in multiple samples, you can use build_all CMAKE as the entry point. It will also give you options to enable/disable individual samples when creating the solutions.