This is an incomplete and almost certainly incorrect attempt to rephrase Vulkan's requirements on execution dependencies in a more precise form.

The basic idea is: Every 'action command' and 'sync command' defines a collection of nodes in a dependency graph, defines some internal edges between those nodes, and defines some external edges between its own nodes and the nodes of other commands. In some cases we define internal nodes that are not strictly necessary but that reduce the number of external edges, making the behaviour easier to understand and visualize. Some sequences of n commands will still result in O(n^2) edges, so these rules should not be implemented literally - implementations may use any approach that gives observably equivalent behaviour.

The goal is to specify the rules in a pseudo-mathematical way so that they're unambiguous (albeit not necessarily intuitive) for human readers, and so that an algorithm (e.g. implemented in a Vulkan layer) could follow the rules to detect race conditions and (ideally) to render a visualization so the human gets some idea of what they've done wrong.

The (still unfinished) definition has ended up being quite verbose. To hopefully make it a bit easier to follow, we can draw some diagrams to illustrate parts of the dependency graph. The internal nodes and edges in action commands and pipeline barriers should look like:

(Some of the pipeline stages are omitted from these diagrams, for clarity.)

Pipeline barriers create external edges from a stage in a previous action command to the corresponding source stage in the pipeline barrier, and from a destination stage in the pipeline barrier to the corresponding stage in a subsequent action command.

E.g.

vkCmdDraw(...);
vkCmdPipelineBarrier(...,
    VK_PIPELINE_STAGE_VERTEX_SHADER_BIT | VK_PIPELINE_STAGE_TRANSFER_BIT,
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
    ...
);
vkCmdDraw(...);

creates an execution dependency graph like:

in which you can follow the arrows to see an execution dependency chain from e.g. draw 1's VERTEX_SHADER stage to draw 2's TRANSFER stage, meaning draw 1's VERTEX_SHADER must complete before draw 2's TRANSFER can start. However there is no execution dependency chain from draw 1's FRAGMENT_SHADER stage to any stage in draw 2, meaning the FRAGMENT_SHADER can run after or concurrently with any of draw 2's stages.

When there are multiple arrows leading into a node, all of those dependencies must complete before the node can start. It is often easiest to read the graphs by starting at the node you're interested in, then reading backwards along the arrows, following every possible backwards chain of arrows, and the nodes you pass through will be all the ones that your initial node depends on.
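That reading rule is just backwards reachability in the graph. As a minimal sketch (the node numbering and `collect_deps` name are made up for illustration, and a real tool would use a worklist rather than recursion):

```c
#include <stdbool.h>

/* A tiny hypothetical dependency graph: edge[i][j] is true when node j
 * has an arrow leading into node i (i.e. i depends directly on j). */
#define NODES 4

/* Walk backwards along every chain of arrows from 'start', marking each
 * node passed through; those are exactly the nodes 'start' depends on. */
static void collect_deps(bool edge[NODES][NODES], int start,
                         bool reached[NODES])
{
    for (int j = 0; j < NODES; j++) {
        if (edge[start][j] && !reached[j]) {
            reached[j] = true;
            collect_deps(edge, j, reached); /* follow the chain further back */
        }
    }
}
```

For the example above, with nodes 0 = draw 1's VERTEX_SHADER, 1 = the barrier's SRC, 2 = its DST, 3 = a stage of draw 2, starting from node 3 collects nodes 0, 1 and 2 as its dependencies.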

TODO: More diagrams, for events and subpasses and everything else.

Definitions

First we need to define a few terms.

An "execution dependency order" is a partial order over elements of the following types:

Action commands are:

Sync commands are:

By "command" we really mean a single execution of a command. A command might be recorded once but its command buffer might be submitted many times, and each time will count as a separate execution with its own separate position in the ordering.

extractStages(mask) converts a bitmask into a set of stages:
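A sketch of what extractStages amounts to, peeling off one set bit per iteration (the `extract_stages` name and plain `uint32_t` masks are illustrative; real code would use the VkPipelineStageFlagBits type):

```c
#include <stdint.h>

/* Split a pipeline-stage bitmask into its individual single-bit stages.
 * 'stages' receives one single-bit value per set bit in 'mask';
 * the return value is the number of stages found. */
static int extract_stages(uint32_t mask, uint32_t stages[32])
{
    int n = 0;
    while (mask != 0) {
        uint32_t bit = mask & (0u - mask); /* isolate the lowest set bit */
        stages[n++] = bit;
        mask &= ~bit;                      /* and clear it */
    }
    return n;
}
```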

We define "command order" as follows:

Execution dependency order

We define the execution dependency order '<' as follows:


NOTE: This is defining that a pipeline barrier looks like:

There are sets of source stages and destination stages. The stages included in srcStageMask/dstStageMask are connected to the barrier's internal SRC/DST. Image layout transitions happen in the middle of the pipeline barrier.

The active source stages are connected to the corresponding stages of earlier commands (either all earlier commands, or (if the barrier is inside a subpass) earlier commands in the current subpass). The active destination stages are connected similarly to following commands. This means you can construct a chain of execution dependencies through multiple pipeline barriers, as long as they have the appropriate bits set in srcStageMask/dstStageMask to make the connection.
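The "appropriate bits" condition for chaining through two barriers can be sketched as a predicate over the masks. This is a simplified model (the struct and function names are made up, the masks are plain `uint32_t`, and it assumes an action command between the barriers does work in every shared stage):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one pipeline barrier's stage masks. */
struct barrier { uint32_t src_mask, dst_mask; };

/* A chain from stage 'first' (in an early action command) to stage
 * 'last' (in a late one) through two barriers exists when: 'first' is
 * in the first barrier's srcStageMask, the barriers share at least one
 * stage linking dstStageMask to srcStageMask, and 'last' is in the
 * second barrier's dstStageMask. */
static bool chain_exists(uint32_t first, struct barrier b1,
                         struct barrier b2, uint32_t last)
{
    return (b1.src_mask & first) != 0
        && (b1.dst_mask & b2.src_mask) != 0
        && (b2.dst_mask & last) != 0;
}
```

If any of the three intersections is empty the chain is broken, which is exactly why a missing bit in an intermediate barrier's masks silently loses the dependency.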



NOTE: A pair of vkCmdSetEvent and vkCmdWaitEvents is very similar to a vkCmdPipelineBarrier split in half.



NOTE: This is defining that an action command looks like:

i.e. a bunch of stages between TOP and BOTTOM. Action commands will not do work in all of these stages.



NOTE: This is saying that commands can't start before their corresponding command buffer, and the command buffer won't be considered complete until all its commands are complete.

This is defined in terms of primary command buffers. For commands in secondary command buffers, it'll use the primary command buffer that executes that secondary command buffer.



NOTE: The definition of subpass SRC/DST stages is necessary because we might have execution dependency chains passing through a subpass which contains no commands. The SRC/DST stages give something for the dependency to be ordered relative to.


We also define the by-region execution dependency order '<_{region}' as follows:

We need some validity requirements for the event execution dependencies to make sense:

i.e. you must not have race conditions between two commands on the same event when the behaviour depends on the order they execute in. (TODO: These are somewhat stricter than the current spec requires. Maybe it needs to be defined differently, so we allow multiple valid execution orders instead of simply saying it's undefined if there's more than one valid order.)

Finally we can say:

(TODO: define what "completion" actually means)

Note that '<' is defined so that execution dependencies always go in the same direction as command order. (...unless there are bugs in the definition). That means an implementation could simply execute every command in command order, with no pipelining and no reordering, and would satisfy all the requirements above. Or it could choose to insert arbitrary sync points at which every previous command completes before any subsequent command starts, for example at the start/end of a command buffer, to limit the scope in which it has to track parallelism.

Memory dependencies

The concept we use for memory dependencies is that a device will have many internal caches - in the worst case a separate cache for every different access type in every pipeline stage. Cached writes must be flushed to main memory to make them available to other stages. Similarly caches must be invalidated before reading recently-written data through them, to make the changes in main memory visible to that stage, preventing it reading stale data from the cache; and must also be invalidated before writing near recently-written data, since a partial cache line write will essentially perform a read and we again need to avoid using stale data. No memory dependencies are needed for read-after-read and write-after-read.

(Implementations are not expected to literally use caches like this - e.g. if two stages have a shared cache then they could optimise away a flush/invalidate pair between those stages, or they could flush from independent L1 caches into a shared L2 cache instead of into main memory, or they might buffer memory accesses in something quite different to a data cache, etc. They just need to make sure the observable behaviour is compatible with what's described here, so that application developers can ignore those details and assume it's simple caches.)

(NOTE: I'm using the terms "flush" and "invalidate" instead of "make available" and "make visible", even though they're slightly more low-level than intended, because they're much more conventional terms and I find it much easier to remember which way round they go, and because they're easier to use in sentences.)
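The cache model above can be made concrete with a deliberately naive sketch: one cached copy per stage plus "main memory". It exists only to show why a write has to be flushed and the reader's cache invalidated before the new value becomes visible (all names here are invented for the model):

```c
#include <stdbool.h>

/* One stage's cached view of a single memory location. */
struct cache { int value; bool valid, dirty; };

static int mem_value; /* "main memory" */

static void cache_write(struct cache *c, int v)
{
    c->value = v; c->valid = true; c->dirty = true;
}

static void cache_flush(struct cache *c) /* make available */
{
    if (c->dirty) { mem_value = c->value; c->dirty = false; }
}

static void cache_invalidate(struct cache *c) /* make visible */
{
    c->valid = false;
}

static int cache_read(struct cache *c)
{
    if (!c->valid) { c->value = mem_value; c->valid = true; } /* refill */
    return c->value;
}
```

Without the flush/invalidate pair, a reading stage whose cache was already warm keeps returning the stale value no matter what the writing stage has done.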

We define four new groups of elements:

mem is either a range of a VkDeviceMemory, or an image subresource, or the special value "GLOBAL" (only used in FLUSH/INVALIDATE). access is one of the VkAccessFlagBits values. stage is one of the VkPipelineStageFlagBits values. c is an action command. b is a barrier command or (TODO: other things that create memory dependencies).

memOverlap(mem_1, mem_2) is the set of memory locations in the intersection between the two memory ranges, as defined by the current spec for memory aliasing. (TODO: might need to be more specific about cache line granularity for buffer accesses). If one is GLOBAL then the intersection is the other one.

memIsSubset(mem_1, mem_2) is true if (and only if) mem_2 is 'larger' than (or equal to) mem_1. That means either mem_2 is GLOBAL, or is a larger range of the same VkDeviceMemory, or is a larger image subresource range. (memIsSubset ignores aliasing.)
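For the simple case of byte ranges within one VkDeviceMemory allocation, the two predicates can be sketched as follows (GLOBAL is modelled with a flag; image subresources, aliasing rules and cache-line granularity are all omitted, and the names are invented for the sketch):

```c
#include <stdbool.h>
#include <stdint.h>

/* A memory range: either GLOBAL, or the half-open byte range
 * [begin, end) within one VkDeviceMemory allocation. */
struct mem_range { bool is_global; uint64_t begin, end; };

/* memOverlap: the intersection of the two ranges; GLOBAL intersects as
 * the other operand. An empty result has begin == end. */
static struct mem_range mem_overlap(struct mem_range a, struct mem_range b)
{
    if (a.is_global) return b;
    if (b.is_global) return a;
    struct mem_range r = { false,
                           a.begin > b.begin ? a.begin : b.begin,
                           a.end   < b.end   ? a.end   : b.end };
    if (r.begin > r.end) r.begin = r.end; /* clamp empty intersections */
    return r;
}

/* memIsSubset: true iff b covers a (ignoring aliasing). */
static bool mem_is_subset(struct mem_range a, struct mem_range b)
{
    if (b.is_global) return true;
    if (a.is_global) return false;
    return b.begin <= a.begin && a.end <= b.end;
}
```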

These elements all participate in the execution dependency order '<', extending the definition of '<' above.

Most READ and WRITE operations happen inside one of the stages of an action command. To represent them happening at the same time as a (c, stage), we define an equivalence relation '~' which means the READ/WRITE adopts the same position in execution dependency order as their corresponding command:

We need to define every possible memory access from every command:

FLUSH and INVALIDATE are created by memory barriers:

If we modify the earlier example to include some memory barriers like:

vkCmdDraw(...);
vkCmdPipelineBarrier(...,
    VK_PIPELINE_STAGE_VERTEX_SHADER_BIT | VK_PIPELINE_STAGE_TRANSFER_BIT,
    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT | VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
    pMemoryBarriers = { {
        srcAccessMask = 0,
        dstAccessMask = VK_ACCESS_SHADER_READ_BIT
    } },
    pImageMemoryBarriers = { {
        srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
        dstAccessMask = 0,
        image = ...,
        subresourceRange = ...
    } },
    ...
);
vkCmdDraw(...);

then we can illustrate it like:

Here draw 1's VERTEX_SHADER stage is writing to some image subresource img_1, draw 2's FRAGMENT_SHADER is reading from img_1, and the pipeline barrier is doing a flush of img_1 and a GLOBAL invalidate which both include the appropriate access types and stages that correspond to the WRITE/READ.

Having defined these operations, we can now define the rules that applications must follow. Violating these rules will result in unpredictable reads from memory.

Race conditions between writes and other read/write accesses are not permitted, because they would result in unpredictable behaviour depending on the scheduling of the commands:

Between a write and a subsequent memory access, the memory must be flushed and invalidated:

Additionally, you mustn't invalidate dirty memory, because that would leave the contents of RAM unpredictable (depending on whether the dirty lines were partially written back or not) - you must flush it first. This applies even when the invalidate is for a different stage or access type, because some implementations might share caches between stages and access types, so the invalidate would still touch the dirty cache lines:

And there must not be any race conditions between writes and invalidates, for the same reason:

(On the other hand, race conditions between writes and flushes are no problem.)

TODO: by-region memory dependencies.

TODO: cases where a stage in a command can be coherent with itself (Coherent in shaders, color attachment reads/writes, etc).

TODO

Transitions: The spec says:

Layout transitions that are performed via image memory barriers are automatically ordered against other layout transitions, including those that occur as part of a render pass instance.

but I have no idea what that even means?

Fences

Semaphores

Host events

Host accesses

QueueSubmit guarantees

Semaphores, fences, events: be clear about how they're referring to the object at the time the command is executed (it might get deleted later)

...