vGPU: Vulkan GPU Framework for Graphics and Compute, in Go

IMPORTANT: This is the cogentcore branch that was developed for the Cogent Core framework, which also includes the gosl "go as a shader language" package, which converts Go to the HLSL shader language for compilation on the GPU. Cogent Core has now switched to using WebGPU instead of Vulkan, so this Vulkan-based code is no longer being used or maintained.

The main branch of vgpu is the original v1 version of vGPU, which works with the original v1 version of goki.

Mac Installation prerequisite: https://vulkan.lunarg.com/sdk/home -- download the Vulkan SDK installer for the mac. Unfortunately there does not appear to be a full version of this on homebrew -- the molten-vk package is not enough by itself.

Major design problems with vGPU

In the process of rewriting the WebGPU version in Cogent Core, called gpu, much was improved about the overall design and implementation, particularly in these areas:

  1. vgpu uses shared memory buffers and a separate Memory management system for all values, which requires all values to be sync'd to / from the GPU together, and adds considerable complexity to the overall implementation. Everything uses dynamic offsets by default. Later, an even more complex mechanism for splitting the Storage memory into separate chunks was introduced. In gpu, each Value has its own buffer by default, and dynamic offset is only optionally enabled for specific Values, and the Value directly manages everything -- much simpler and cleaner.

  2. vgpu uses dynamically indexed Textures, which has a limit of 16 on the mac, so considerable complexity was introduced to pack a bunch of images into these 16 images, via the szalloc package. In gpu, we just call Bind to update which current texture is being used -- this is very fast and is the standard way of doing things, so you typically only need a single active texture variable at a time. Major "doh" moment there.

  3. vgpu has a lot of complexity around the Vulkan DescSet concept, which is hard to follow in relation to the VarSet. In gpu, variable sets are now VarGroup, and you just manage multiple instances using as many Value instances as you want for each Var. The DescSet was also conflated with the texture packing scheme, and only 3 desc sets are supported, so overall that system was just plain bad and unnecessary.

So, if anyone ever did want to actually use this vulkan-based version, it would be good to fix these issues and make it work more like the current Cogent Core gpu version. Basically, you'd start with gpu and just back-port all the vulkan impl from this vgpu version.

But so far, WebGPU is proving to be much simpler and just as performant as Vulkan, with significantly fewer hassles on the main Windows and Mac platforms (because it goes direct to native GPU frameworks there), and it works on the Web, which Vulkan never will, so there isn't much reason to go back to Vulkan at this point.

Overview

vGPU is a Vulkan-based framework for both Graphics and Compute Engine use of GPU hardware, in the Go language. It uses the basic cgo-based Go bindings to Vulkan in vulkan-go and was developed starting with the associated example code surrounding that project. Vulkan is a relatively new, essentially universally supported interface to GPU hardware across all types of systems from mobile phones to massive GPU-based compute hardware, and it provides high-performance "bare metal" access to the hardware, for both graphics and computational uses.

Vulkan is very low-level and demands a higher-level framework to manage the complexity and verbosity. While there are many helpful tutorials covering the basic API, many of the tutorials don't provide much of a pathway for how to organize everything at a higher level of abstraction. vGPU represents one attempt that enforces some reasonable choices that enable a significantly simpler programming model, while still providing considerable flexibility and high levels of performance. Everything is a tradeoff, and simplicity definitely was prioritized over performance in a few cases, but in practical use-cases, the performance differences should be minimal.

Platforms

Selecting a GPU Device

For systems with multiple GPU devices, by default the discrete device is selected, and if multiple of those are present, the one with the most RAM is used. To see what is available and their properties, use:

$ vulkaninfo --summary
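The default selection rule described above can be sketched in Go as follows. This is only an illustration: the Device struct and SelectDevice function are hypothetical names, not vgpu's own types, and the fallback to the largest-memory device when no discrete GPU is present is an assumption.

```go
package main

import "fmt"

// Device is a hypothetical summary of one physical GPU; the field names
// are illustrative and are not vgpu's own types.
type Device struct {
	Name     string
	Discrete bool   // true for a discrete (non-integrated) GPU
	MemBytes uint64 // size of device-local memory
}

// SelectDevice returns the index of the preferred device: a discrete
// device beats a non-discrete one, and more memory breaks ties.
// It returns -1 for an empty list.
func SelectDevice(devs []Device) int {
	best := -1
	for i, d := range devs {
		if best < 0 {
			best = i
			continue
		}
		b := devs[best]
		switch {
		case d.Discrete && !b.Discrete:
			best = i
		case d.Discrete == b.Discrete && d.MemBytes > b.MemBytes:
			best = i
		}
	}
	return best
}

func main() {
	devs := []Device{
		{Name: "integrated", Discrete: false, MemBytes: 16 << 30},
		{Name: "discrete-small", Discrete: true, MemBytes: 4 << 30},
		{Name: "discrete-large", Discrete: true, MemBytes: 8 << 30},
	}
	fmt.Println(devs[SelectDevice(devs)].Name)
}
```

Note that the integrated device loses even though it reports more memory, because the discrete check takes priority over the memory comparison.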

The following environment variables can be set to specifically select a particular device by name (deviceName):

vPhong and vShape

The vPhong package provides a complete rendering implementation with different pipelines for different materials, and support for 4 different types of light sources based on the classic Blinn-Phong lighting model. See the examples/phong example for how to use it. It does not assume any kind of organization of the rendering elements, and just provides name and index-based access to all the resources needed to render a scene.

vShape generates standard 3D shapes (sphere, cylinder, box, etc), with all the normals and texture coordinates. You can compose shape elements into more complex groups of shapes, programmatically. It separates the calculation of the number of vertex and index elements from actually setting those elements, so you can allocate everything in one pass, and then configure the shape data in a second pass, consistent with the most efficient memory model provided by vgpu. It only has a dependency on the math32 package and could be used for anything.
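The two-pass pattern (count first, then fill) might look roughly like the sketch below. The Shape interface, the Quad type, and the Build function are hypothetical and exist only to illustrate the idea; they are not the actual vShape API.

```go
package main

import "fmt"

// Shape is a hypothetical interface illustrating the two-pass pattern;
// it is not the actual vShape API.
type Shape interface {
	// N reports how many vertices and indices the shape needs.
	N() (nVertex, nIndex int)
	// Set writes the shape's data into the shared slices, starting at
	// vertex offset vtxOff and index offset idxOff.
	Set(pos []float32, idx []uint32, vtxOff, idxOff int)
}

// Quad is a unit square in the XY plane, shifted along X.
type Quad struct{ X float32 }

func (q Quad) N() (int, int) { return 4, 6 }

func (q Quad) Set(pos []float32, idx []uint32, vtxOff, idxOff int) {
	corners := [4][3]float32{{0, 0, 0}, {1, 0, 0}, {1, 1, 0}, {0, 1, 0}}
	for i, c := range corners {
		p := (vtxOff + i) * 3
		pos[p], pos[p+1], pos[p+2] = c[0]+q.X, c[1], c[2]
	}
	// Two triangles; indices are relative to the shape's first vertex.
	for i, v := range [6]uint32{0, 1, 2, 0, 2, 3} {
		idx[idxOff+i] = uint32(vtxOff) + v
	}
}

// Build sums the sizes of all shapes (pass 1), allocates once, and then
// lets each shape fill its own region (pass 2).
func Build(shapes []Shape) (pos []float32, idx []uint32) {
	nv, ni := 0, 0
	for _, s := range shapes {
		v, i := s.N()
		nv += v
		ni += i
	}
	pos = make([]float32, nv*3)
	idx = make([]uint32, ni)
	vo, io := 0, 0
	for _, s := range shapes {
		s.Set(pos, idx, vo, io)
		v, i := s.N()
		vo += v
		io += i
	}
	return pos, idx
}

func main() {
	pos, idx := Build([]Shape{Quad{X: 0}, Quad{X: 2}})
	fmt.Println(len(pos)/3, "vertices,", len(idx), "indices")
}
```

Because the sizes are known before any data is written, the destination slices could just as well be views into a single pre-allocated staging buffer.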

Basic Elements and Organization

Memory organization

Memory maintains a host-visible, mapped staging buffer, and a corresponding device-local memory buffer that the GPU uses to compute on (the latter of which is optional for unified memory architectures). Each Value records when it is modified, and a global Sync step efficiently transfers only what has changed. You must allocate and sync update a unique Value for each different value you will need for the entire render pass -- although you can dynamically select which Value to use for each draw command, you cannot in general update the actual data associated with these values during the course of a single rendering pass.
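As a rough illustration of the modified-flag idea, the following Go sketch simulates the staging and device buffers with plain byte slices. The Memory and Value types here are simplified, hypothetical stand-ins (not the vgpu types), and the copy call stands in for what would be a buffer copy command on real hardware.

```go
package main

import "fmt"

// Value is a hypothetical stand-in for one region of the staging buffer.
type Value struct {
	Offset, Size int
	Mod          bool // set when the staging data has changed
}

// Memory pairs a staging buffer with a simulated device buffer.
type Memory struct {
	Staging, Device []byte
	Vals            []*Value
}

// NewMemory lays out one Value per requested size, back to back.
func NewMemory(sizes ...int) *Memory {
	m := &Memory{}
	off := 0
	for _, sz := range sizes {
		m.Vals = append(m.Vals, &Value{Offset: off, Size: sz})
		off += sz
	}
	m.Staging = make([]byte, off)
	m.Device = make([]byte, off)
	return m
}

// Set copies data into the staging region of value i and marks it modified.
func (m *Memory) Set(i int, data []byte) {
	v := m.Vals[i]
	copy(m.Staging[v.Offset:v.Offset+v.Size], data)
	v.Mod = true
}

// Sync copies only the modified regions to the device buffer and
// returns the number of bytes transferred.
func (m *Memory) Sync() int {
	n := 0
	for _, v := range m.Vals {
		if !v.Mod {
			continue
		}
		end := v.Offset + v.Size
		n += copy(m.Device[v.Offset:end], m.Staging[v.Offset:end])
		v.Mod = false
	}
	return n
}

func main() {
	m := NewMemory(16, 64, 32)
	m.Set(1, []byte("camera matrix"))
	fmt.Println("bytes transferred:", m.Sync())
	fmt.Println("bytes transferred:", m.Sync())
}
```

The second Sync call transfers nothing, because no Value was modified since the first one.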

#version 450
#extension GL_EXT_nonuniform_qualifier : require

// must use mat4 -- mat3 alignment issues are horrible.
// each mat4 = 64 bytes, so full 128 byte total, but only using mat3.
// pack the tex index into [0][3] of mvp,
// and the fill color into [3][0-3] of uvp
layout(push_constant) uniform Mtxs {
	mat4 mvp;
	mat4 uvp;
};

layout(set = 0, binding = 0) uniform sampler2DArray Tex[]; //
layout(location = 0) in vec2 uv;
layout(location = 0) out vec4 outputColor;

void main() {
	int idx = int(mvp[3][0]);   // packing into unused part of mat4 matrix push constant
	int layer = int(mvp[3][1]);
	outputColor = texture(Tex[idx], vec3(uv,layer)); // layer selection as 3rd dim here
}
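On the host side, the packing that the shader's main function reads back could be sketched in Go like this. GLSL matrices are column-major and indexed as m[col][row], so mvp[3][0] and mvp[3][1] correspond to flat indices 12 and 13 of a 16-float array. The Mat4 type and the PackTex / UnpackTex helpers are hypothetical names for illustration, and the sketch assumes those two slots are otherwise unused by the transform.

```go
package main

import "fmt"

// Mat4 is a column-major 4x4 matrix, matching GLSL's layout: the GLSL
// element m[col][row] lives at flat index col*4 + row.
type Mat4 [16]float32

// PackTex stores the texture index and array layer in the slots that the
// shader reads as mvp[3][0] and mvp[3][1].
func PackTex(m *Mat4, texIdx, layer int) {
	m[3*4+0] = float32(texIdx) // mvp[3][0]
	m[3*4+1] = float32(layer)  // mvp[3][1]
}

// UnpackTex mirrors the shader's int(mvp[3][0]) and int(mvp[3][1]).
func UnpackTex(m *Mat4) (texIdx, layer int) {
	return int(m[12]), int(m[13])
}

func main() {
	var mvp Mat4
	PackTex(&mvp, 5, 2)
	idx, layer := UnpackTex(&mvp)
	fmt.Println("texture index:", idx, "layer:", layer)
}
```

Small integers survive the round trip through float32 exactly, which is what makes this kind of packing safe for texture indexes and layers.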

Naming conventions

Graphics Rendering

See https://developer.nvidia.com/vulkan-shader-resource-binding for a clear description of DescriptorSets etc.

Here's a widely used rendering logic, supported by the Cogent Core Scene (and tbd std Pipeline), and how you should organize the Uniform data into different sets at each level, to optimize the binding overhead:

for each view {
  bind view resources [Set 0]         // camera, environment...
  for each shader type (based on material type: textured, transparent..) {
    bind shader pipeline  
    bind shader resources [Set 1]    // shader control values (maybe none)
    for each specific material {
      bind material resources  [Set 2] // material parameters and textures
      for each object {
        bind object resources  [Set 3] // object transforms
        draw object [VertexInput binding to locations]
        (only finally calls Pipeline here!)
      }
    }
  }
}
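To make the benefit of this nesting concrete, here is a small Go sketch that sorts draw items by view, shader, and material, and rebinds a level only when its key changes. The Draw and BindCounts types and the Render function are hypothetical, IDs are assumed to be non-negative, and the counters stand in for the actual bind calls.

```go
package main

import (
	"fmt"
	"sort"
)

// Draw is a hypothetical draw item keyed by the nesting levels of the loop.
type Draw struct {
	View, Shader, Material, Object int
}

// BindCounts tallies how often each level would be bound.
type BindCounts struct{ View, Shader, Material, Object int }

// Render sorts draws in place by (View, Shader, Material) and rebinds a
// level only when its key changes, mirroring the nested loops above.
func Render(draws []Draw) BindCounts {
	sort.SliceStable(draws, func(i, j int) bool {
		a, b := draws[i], draws[j]
		if a.View != b.View {
			return a.View < b.View
		}
		if a.Shader != b.Shader {
			return a.Shader < b.Shader
		}
		return a.Material < b.Material
	})
	var bc BindCounts
	view, shader, material := -1, -1, -1
	for _, d := range draws {
		if d.View != view {
			view, shader, material = d.View, -1, -1
			bc.View++ // bind view resources [Set 0]
		}
		if d.Shader != shader {
			shader, material = d.Shader, -1
			bc.Shader++ // bind pipeline and shader resources [Set 1]
		}
		if d.Material != material {
			material = d.Material
			bc.Material++ // bind material resources [Set 2]
		}
		bc.Object++ // bind object resources [Set 3] and draw
	}
	return bc
}

func main() {
	draws := []Draw{
		{0, 0, 0, 0}, {0, 1, 0, 1}, {0, 0, 0, 2}, {0, 0, 1, 3}, {0, 1, 0, 4},
	}
	fmt.Printf("%+v\n", Render(draws))
}
```

For these five draws, sorting reduces the work to one view bind, two shader binds, and three material binds, while every object still gets its own bind and draw.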

It is common practice to use different DescriptorSets for each level in the swapchain, for maintaining high FPS rates by rendering the next frame while the current one is still cooking along itself -- this is the NDescs parameter mentioned above.

Because everything is all packed into big buffers organized by different broad categories, in Memory, we exclusively use the Dynamic mode for Uniform and Storage binding, where the offset into the buffer is specified at the time of the binding call, not in advance in the descriptor set itself. This has only very minor performance implications and makes everything much more flexible and simple: just bind whatever variables you want and that data will be used.
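A minimal sketch of how per-value dynamic offsets into one shared buffer might be computed is shown below. Vulkan requires each dynamic uniform offset to be a multiple of the device's minUniformBufferOffsetAlignment limit, so every value is padded up to that stride. The AlignUp and DynamicOffsets helpers are hypothetical names, and the alignment of 256 used in main is just an example value, not a property of any particular device.

```go
package main

import "fmt"

// AlignUp rounds n up to the next multiple of align (align must be > 0).
func AlignUp(n, align int) int {
	return (n + align - 1) / align * align
}

// DynamicOffsets returns the byte offset of each of n values of valSize
// bytes, packed with a stride that is a multiple of minAlign, plus the
// total buffer size needed. The offsets are what would be passed as the
// dynamic offset at bind time.
func DynamicOffsets(n, valSize, minAlign int) (offsets []uint32, total int) {
	stride := AlignUp(valSize, minAlign)
	offsets = make([]uint32, n)
	for i := range offsets {
		offsets[i] = uint32(i * stride)
	}
	return offsets, n * stride
}

func main() {
	// Three 80-byte uniform structs with an assumed alignment of 256.
	offs, total := DynamicOffsets(3, 80, 256)
	fmt.Println(offs, total)
}
```

The padding wastes some memory per value, which is the price for being able to select any value with a single offset at bind time.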

The examples and provided vPhong package retain the Y-is-up coordinate system from OpenGL, which is more "natural" for the physical world, where the Y axis is the height dimension, and up is up, after all. Some of the defaults reflect this choice, but it is easy to use the native Vulkan Y-is-down coordinate system too.
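One common way to switch between the two conventions is to negate the Y row of the projection matrix, so that clip-space Y is flipped. The Go sketch below shows that general technique with a hypothetical column-major Mat4 type; it is not taken from the vgpu or vPhong code, which may handle this differently.

```go
package main

import "fmt"

// Mat4 is a column-major 4x4 matrix: element (row, col) is at col*4 + row.
type Mat4 [16]float32

// Identity returns the 4x4 identity matrix.
func Identity() Mat4 {
	return Mat4{1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1}
}

// FlipY returns a copy of m with row 1 negated, which is the same as
// pre-multiplying by diag(1, -1, 1, 1): the output Y coordinate changes sign.
func FlipY(m Mat4) Mat4 {
	for col := 0; col < 4; col++ {
		m[col*4+1] = -m[col*4+1]
	}
	return m
}

// MulVec4 computes m * v for a column vector v.
func (m Mat4) MulVec4(v [4]float32) [4]float32 {
	var out [4]float32
	for row := 0; row < 4; row++ {
		for col := 0; col < 4; col++ {
			out[row] += m[col*4+row] * v[col]
		}
	}
	return out
}

func main() {
	proj := FlipY(Identity())
	fmt.Println(proj.MulVec4([4]float32{1, 2, 3, 1}))
}
```

Flipping Y also reverses the apparent winding order of triangles on screen, so the front-face setting of the pipeline may need to change along with it.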

Combining many pipeline renders per RenderPass

The various introductory tutorials all seem to focus on just a single simple render pass with one draw operation, but any realistic scene needs different settings for each object! As noted above, this requires dynamic binding, which is good for Uniforms and Vertex data, but you might not appreciate that this also requires that you pre-allocate and sync up to device memory all the Values that you will need for the entire render pass -- the dynamic binding only selects different offsets into memory buffers, but the actual contents of those buffers should not change during a single render pass (otherwise things will get very slow and lots of bad sync steps might be required, etc). The Memory system makes it easy to allocate, update, and dynamically bind these vals.

Here's some info on the logical issues:

This blog has a particularly clear discussion of the need for Texture arrays for managing textures within a render pass. This is automatically how Texture vars are managed.

GPU Accelerated Compute Engine

See examples/compute1 for a very simple compute shader, and compute.go for Compute* methods specifically useful for this case.

See the gosl repository for a tool that converts Go code into HLSL shader code, so you can effectively run Go on the GPU.

Here's how it works:

Gamma Correction (sRGB vs Linear) and Headless / Offscreen Rendering

It is hard to find this info very clearly stated:

Mac Platform

To have the mac use the libMoltenVK.dylib installed by brew install molten-vk, you need to change the LDFLAGS here:

github.com/goki/vulkan/vulkan_darwin.go

#cgo darwin LDFLAGS: -L/opt/homebrew/lib -Wl,-rpath,/opt/homebrew/lib -F/Library/Frameworks -framework Cocoa -framework IOKit -framework IOSurface -framework QuartzCore -framework Metal -lMoltenVK -lc++

However it does not find the libvulkan which is not included in molten-vk.

Platform properties

See MACOS.md file for full report of properties on Mac.

These are useful for deciding what kinds of limits are likely to work in practice:

This is a significant constraint! We need to work around it.

Note that this constraint is largely irrelevant because each dynamic descriptor can have an unlimited number of offset values used for it.