gl cadscene render techniques

This sample implements several scene rendering techniques that target mostly static data, of the kind often found in CAD or DCC applications. In this context, 'static' means that the vertex and index buffers for the scene's objects rarely change. Geometry may still be edited for a few objects, but matrix and material values are the properties that change most across frames. Imagine editing the wheel topology of a car or repositioning an engine; the rest of the assembly remains the same.

The principal OpenGL mechanisms that are used here are described in the SIGGRAPH 2014 presentation slides. It is highly recommended to go through the slides first.

The sample makes use of multiple OpenGL 4 core features, such as ARB_multi_draw_indirect, but also showcases OpenGL 3 style rendering techniques.

There are also several techniques built around the NV_command_list extension. Please refer to gl commandlist basic for an introduction to NV_command_list.

Note: This is just a sample to illustrate several techniques and possibilities for how to approach rendering. Its purpose is not to provide production-level, highly optimized implementations.

Scene Setup

The sample loads a cadscene file (csf). This file format is inspired by CAD applications' data organization, but (for simplicity) everything is stored in a single RAW file.

The scene is organized into geometries (vertex and index buffer ranges), materials, matrices (the node transform hierarchy), and objects whose parts each reference a geometry range, a material, and a matrix.

Shademodes

solid: triangle rendering only
solid with edges: triangles plus the line edges of the parts

(sample screenshot)

Strategies

These influence the number of drawcalls we generate for hardware and software. Using OpenGL's MultiDraw* functions, we can issue fewer software calls than hardware drawcalls, which helps trigger faster paths in the driver, as there is less validation overhead. A strategy is applied on a per-object level.

Imagine an object whose parts use two materials, red and blue:

material: r b b r
parts:    A B C D

Typically we do all rendering with basic state redundancy filtering, so we don't set up a matrix/material change if the same one is still active. To keep things simple for state redundancy filtering, you should not go too fine-grained, otherwise all the tracking causes too much memory hopping. In our case we track 3 indices: geometry (handles vertex/index buffer setup), material, and matrix.
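
To illustrate the idea (a minimal sketch only; the Part layout and the bindMaterial/bindMatrix helpers are assumptions, not the sample's actual interfaces), a "materialgroups"-style traversal could filter redundant state changes and batch consecutive parts that share material and matrix into a single glMultiDrawElements call:

// Minimal sketch: redundancy filtering plus batching of consecutive parts
// with identical material/matrix into one glMultiDrawElements call.
// Fewer software calls, same number of hardware drawcalls.
struct Part {
  int     materialIndex;
  int     matrixIndex;
  GLsizei indexCount;
  size_t  indexByteOffset;   // offset into the bound element buffer
};

void drawObject(const std::vector<Part>& parts) {
  std::vector<GLsizei>       counts;
  std::vector<const GLvoid*> offsets;
  int lastMaterial = -1;
  int lastMatrix   = -1;

  auto flush = [&]() {
    if (!counts.empty()) {
      glMultiDrawElements(GL_TRIANGLES, counts.data(), GL_UNSIGNED_INT,
                          offsets.data(), (GLsizei)counts.size());
      counts.clear();
      offsets.clear();
    }
  };

  for (const Part& part : parts) {
    // state redundancy filtering: only rebind when a tracked index changes
    if (part.materialIndex != lastMaterial || part.matrixIndex != lastMatrix) {
      flush();
      bindMaterial(part.materialIndex);  // assumed helper (e.g. a UBO range bind)
      bindMatrix(part.matrixIndex);      // assumed helper (e.g. a UBO range bind)
      lastMaterial = part.materialIndex;
      lastMatrix   = part.matrixIndex;
    }
    counts.push_back(part.indexCount);
    offsets.push_back((const GLvoid*)part.indexByteOffset);
  }
  flush();
}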

Renderers

Most renderers traverse the scene data every frame. The data is organized to be cache-friendly first and foremost: everything is stored in arrays, without too much memory hopping. Some renderers may implement additional caching for rendering.

Variants:

Techniques:

We are mostly looking into accelerating our matrix and material parameter switching performance.

A hybrid approach, in which a parameter index (as in "indexedmdi") is used for matrices and a uborange bind is used for materials, is not yet implemented, but it would be a good compromise.
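
For reference, a "uborange" bind boils down to glBindBufferRange into one large buffer of matrices or materials, with every entry padded to GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. The snippet below is a hedged sketch; the binding points, struct names, and alignUp helper are assumptions:

// Hypothetical "uborange" parameter switch: all matrices and materials live in
// one large UBO each; switching a parameter is a range bind at an aligned offset.
GLint uboAlignment = 0;
glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &uboAlignment);

// pad the element size so every entry starts at a legal UBO offset (assumed helper)
size_t matrixStride   = alignUp(sizeof(MatrixData),   (size_t)uboAlignment);
size_t materialStride = alignUp(sizeof(MaterialData), (size_t)uboAlignment);

// per drawcall (subject to the redundancy filtering described above):
glBindBufferRange(GL_UNIFORM_BUFFER, UBO_MATRIX, matrixBuffer,
                  matrixIndex * matrixStride, sizeof(MatrixData));
glBindBufferRange(GL_UNIFORM_BUFFER, UBO_MATERIAL, materialBuffer,
                  materialIndex * materialStride, sizeof(MaterialData));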

The following renderers make use of the NV_command_list extension. In principle they behave like "uborange"; however, all buffer bindings and drawcalls are encoded into binary tokens that are submitted in bulk. In preparation for drawing, the appropriate stateobjects are created and reused when rendering (one for lines and one for triangles). While stateobject capturing is not extremely expensive, it is still best to cache the stateobjects across frames.
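
The following is only a rough sketch of that flow, assuming pre-built token segment arrays and helpers that set up the full pipeline state; it is not the sample's actual tokenbase code. Stateobjects are captured once and reused, while the token buffer is replayed in bulk every frame:

// Hedged sketch: capture two state objects (triangles, lines) once, then replay
// a pre-built token buffer each frame via NV_command_list.
GLuint stateObjects[2];
glCreateStatesNV(2, stateObjects);

setupTriangleState();                             // assumed helper: program, fbo, raster state...
glStateCaptureNV(stateObjects[0], GL_TRIANGLES);  // capture current state for triangle draws

setupLineState();                                 // assumed helper
glStateCaptureNV(stateObjects[1], GL_LINES);      // capture current state for line draws

// per frame: one submission for many drawcalls; each segment of the token
// buffer references the state object (and fbo) it was encoded for
glDrawCommandsStatesNV(tokenBuffer,
                       segmentOffsets.data(), segmentSizes.data(),
                       segmentStates.data(),  segmentFbos.data(),
                       (GLuint)segmentOffsets.size());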

Performance

All timings are preliminary results for "Timer Draw" on a Windows 7 64-bit, Core i7-860, Quadro K5000 system.

Important Note About Timer Query Results: The GPU time reported below is measured via timer queries; however, those values can be skewed by CPU bottlenecks. The "begin" timestamp may be part of a different command submission to the GPU than the "end" timestamp, so a long delay on the CPU side between those submissions also increases the reported GPU time. That is why, in CPU-bottlenecked scenarios with tons of OpenGL commands, the GPU times below are close to the CPU times.
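
For reference, this is roughly how such a GPU time is obtained with timestamp queries (a generic sketch, not the sample's profiler code):

// Generic timestamp-query sketch. If the CPU stalls between the two
// glQueryCounter calls, "begin" and "end" may land in different GPU submissions,
// and the measured interval then includes that CPU-side delay.
GLuint queries[2];
glGenQueries(2, queries);

glQueryCounter(queries[0], GL_TIMESTAMP);   // "begin"
drawScene();                                // assumed helper issuing all GL commands
glQueryCounter(queries[1], GL_TIMESTAMP);   // "end"

GLuint64 timeBegin = 0, timeEnd = 0;
glGetQueryObjectui64v(queries[0], GL_QUERY_RESULT, &timeBegin);  // waits for results
glGetQueryObjectui64v(queries[1], GL_QUERY_RESULT, &timeEnd);
double gpuMicroseconds = double(timeEnd - timeBegin) / 1000.0;   // timestamps are in nanoseconds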

scene statistics:
geometries:    110
materials:      66
nodes:        5004
objects:      2497

tokenbuffer/glstream complexities:
type: solid              materialgroups | drawcall individual
commandsize:                     347292 | 1301692
statetoggles:                         1 | 1
tokens:                 
GL_DRAW_ELEMENTS_COMMAND_NV:      11103 |   68452
GL_ELEMENT_ADDRESS_COMMAND_NV:      807 |     807
GL_ATTRIBUTE_ADDRESS_COMMAND_NV:    807 |     807
GL_UNIFORM_ADDRESS_COMMAND_NV:     8988 |   11289
GL_POLYGON_OFFSET_COMMAND_NV:         1 |       1

type: solid w edges
commandsize:                     629644 | 2534412
statetoggles:                      4994 |    4994
tokens:
GL_DRAW_ELEMENTS_COMMAND_NV:      22281 |  136750
GL_ELEMENT_ADDRESS_COMMAND_NV:      807 |     807
GL_ATTRIBUTE_ADDRESS_COMMAND_NV:    807 |     807
GL_UNIFORM_ADDRESS_COMMAND_NV:    15457 |   20036
GL_POLYGON_OFFSET_COMMAND_NV:         1 |       1

As one can see from the statistics, the key difference is the number of drawcalls issued to the hardware:

shademode: solid

renderer                   materialgroups    |  drawcall individual
                          GPU time  CPU time |  GPU time  CPU time    (microseconds)
ubosub                        1550      1870 |      6000      7420
uborange                      1010      1890 |      3720      7660
uborange_bindless             1010      1200 |      2560      4900
indexedmdi                    1120      1200 |      2080      1100
tokenstream                    860       300 |      1520      1400
tokenbuffer                    780       <10 |      1230       <10
tokenlist                      780       <10 |       880       <10
tokenbuffer_cullsorted         540       120 |       760       120

The results are of course very scene dependent; this model was specifically chosen as it is made of many parts with very few triangles. If the complexity per drawcall were higher (say more triangles or complex shading), then the CPU impact would be lower and we would be GPU-bound. However the CPU time recovered by faster submission mechanisms can always be used elsewhere. So even if we are GPU-bound, time should not be wasted.

We can see that the "token" techniques do very well and are never CPU-bound, and the "indexedmdi" technique is also quite good. This technique is especially useful for very high-frequency parameters, for example when rendering "id-buffers" for selection, but also for matrix indices. For general use-cases, working with uborange binds is recommended.
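
One common way to implement such a parameter index (a hedged sketch; buffer names and the attribute location are assumptions, and the sample's own encoding may differ) is to reuse the baseInstance field of each indirect command as an index that an instanced vertex attribute picks up:

// Hypothetical "indexedmdi"-style setup: baseInstance of each indirect command
// carries per-draw parameter indices, read through an instanced attribute (divisor 1).
struct DrawElementsIndirectCommand {
  GLuint count;
  GLuint instanceCount;   // 1
  GLuint firstIndex;
  GLuint baseVertex;
  GLuint baseInstance;    // used as index into a buffer of matrix/material indices
};

// setup (once): integer attribute advancing per instance, so each draw reads the
// entry selected by its baseInstance from the assumed parameter-index buffer
glBindBuffer(GL_ARRAY_BUFFER, parameterIndexBuffer);
glVertexAttribIPointer(ATTR_PARAMETER_INDICES, 2, GL_INT, 0, nullptr);  // matrix + material index
glVertexAttribDivisor(ATTR_PARAMETER_INDICES, 1);
glEnableVertexAttribArray(ATTR_PARAMETER_INDICES);

// per frame: one software call issues all hardware drawcalls
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr,
                            drawCount, sizeof(DrawElementsIndirectCommand));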

shademode: solid with edges

Unless "sorted", around 5000 toggles are done between triangles/line rendering. The shader is manipulated through an immediate vertex attribute to toggle between lit/unlit rendering respectively.

renderer                   materialgroups    |  drawcall individual
                          GPU time  CPU time |  GPU time  CPU time    (microseconds)
ubosub                        2890      3350 |     13000     15000
uborange                      2150      3700 |     12500     15200
uborange_bindless             2150      2640 |      8300     10000
indexedmdi                    2340      2200 |      4050      2050
tokenstream                   1860      1250 |      3360      3200
tokenbuffer                   1750       450 |      2650       350
tokenlist                     1650       <10 |      1890       <10
tokenbuffer_cullsorted         770       120 |      1250       120

Compared to the "solid" results, the tokenbuffer and tokenlist techniques show a greater difference in CPU time.
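
As a rough sketch of the lit/unlit toggle mentioned above (the attribute location and draw helpers are assumptions, not the sample's actual code):

// Hypothetical lit/unlit toggle via an immediate (non-array) generic vertex
// attribute: no program change and no buffer rebinding between the two passes.
const GLuint ATTR_SHADEMODE = 7;                // assumed attribute location

glDisableVertexAttribArray(ATTR_SHADEMODE);     // shader reads the current immediate value
glVertexAttribI1i(ATTR_SHADEMODE, 1);           // 1 = lit triangle shading
drawObjectTriangles();                          // assumed helper

glVertexAttribI1i(ATTR_SHADEMODE, 0);           // 0 = unlit edge lines
drawObjectLines();                              // assumed helper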

Model Explosion View

The simple viewer allows you to add animation to the scene and artificially increase scene complexity via "clones".

(screenshot: explosion effect with scene clones)

To "emulate" typical interaction where users might move objects around or have animated scenes, the sample also implements the matrix transform system sketched on slide 30.

The effect works by first moving all object matrices a bit (xplode-animation.comp.glsl), and afterwards the transform hierarchy is updated via a system that is implemented in the transformsystem.cpp / hpp files.

The code is not particularly tuned; it naively assumes that upper levels of the hierarchy contain fewer nodes than lower levels (pyramid). It therefore uses leaf processing (which redundantly calculates matrices) instead of level-wise processing for the first 10 levels, to avoid dependencies (one small compute task waiting for the previous). Later levels are always processed level-wise. A better strategy would be to switch between the two approaches based on the actual number of nodes per level. The shaders for this are transform-leaves.comp.glsl and transform-level.comp.glsl.
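
A hedged sketch of the level-wise part (the program and buffer handles, uniform location, and work-group size of 256 are assumptions, not the actual transformsystem code):

// Hypothetical level-wise hierarchy update: nodes of a level only depend on the
// level above, so a memory barrier between dispatches is sufficient.
glUseProgram(transformLevelProgram);               // assumed: built from transform-level.comp.glsl
for (uint32_t level = firstLevelwiseLevel; level < numLevels; level++) {
  glUniform1ui(0, levelNodeCount[level]);          // assumed uniform: number of nodes in this level
  glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, levelNodeIndexBuffer[level]);
  glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, worldMatrixBuffer);

  GLuint groups = (levelNodeCount[level] + 255) / 256;   // 256 threads per work group (assumed)
  glDispatchCompute(groups, 1, 1);

  // make this level's matrices visible to the next level's dispatch
  glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}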

The hierarchy is managed by nodetree.cpp/hpp, which stores the tree as an array of 32-bit values. Each value represents a node and encodes the node's "level" in the hierarchy in 8 bits and its parent index in the remaining bits. This means any node can be traversed up to the root:

// sample traversal from node "idx" up to the root
// (illustrative layout: 8 bits for the level, the remaining bits for the parent index)
struct NodeEntry { uint32_t level : 8; uint32_t parent : 24; };

NodeEntry self = array[idx];
while (self.level != 0) {
  self = array[self.parent];
}
// self is now the root entry for the idx node

The nodetree also stores two node index lists for each level: one containing all nodes of the level, and one containing all leaves at that level. We feed these two index lists to the appropriate shader. When leaf processing is used, we append the leaves level-wise, which should minimize divergence within a warp (ideally most threads in a warp have the same number of levels to ascend in the hierarchy).

Many CAD applications tend to use double-precision matrices, and the system could be adjusted for this. For rendering, however, float matrices should be used. To account for large translation values, one could concatenate the view-projection matrix (double) and the object-world matrix (double) per frame and generate the float matrices used for the actual vertex transforms. To improve memory performance, it might be beneficial to use double only for storing the translations within the matrices.
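
A minimal sketch of that concatenation, assuming glm types (not necessarily how the sample organizes its matrices):

// Concatenate in double precision per frame, then downcast once to float for
// the vertex transforms; large translations largely cancel out in the product.
#include <glm/glm.hpp>

glm::mat4 makeWorldViewProjection(const glm::dmat4& viewProjectionD,
                                  const glm::dmat4& objectWorldD)
{
  glm::dmat4 combinedD = viewProjectionD * objectWorldD;  // double-precision math
  return glm::mat4(combinedD);                            // float matrix handed to the shader
}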

Note: Only the GPU matrices are updated. CPU techniques such as "ubosub" will not show animations.

Sample Highlights

This sample is a bit more complex than most others as it contains several subsystems. Don't hesitate to contact the author if something is unclear (commenting was not a priority ;) ).

csfviewer.cpp

The principal setup of the sample is in this main file. However, most of the interesting bits happen in the renderers.

renderer... and tokenbase...

Each renderer has its own file and is derived from the Renderer class in renderer.hpp.

The renderers may have additional functions; the "token" renderers using NV_command_list and the "indexedmdi" renderer, for instance, must create their own scene representation.

cadscene...

The "csf" (cadscene file) format is a simple binary format that encodes a scene as is typical for CAD. It closely matches the description at the beginning of the readme. It is not very sophisticated, and is meant for demo purposes.

Note: The geforce.csf.gz assembly binary file that ships with this sample may NOT be redistributed.

nodetree... and transform...

These files implement the matrix hierarchy updates described in the "Model Explosion View" section.

cull... and scan...

For files related to culling, it is best to refer to the gl occlusion culling sample, as it leverages the same system and focuses on just that topic.

renderertokensortcull.cpp implements RendererCullSortToken::CullJobToken::resultFromBits, which contains the details of how the occlusion results are handled in this sample. The implementation uses the "raster" "temporal" approach.

statesystem... nvtoken... and nvcommandlist...

These files contain helpers when using the NV_command_list extension. Please see gl commandlist basic for a smaller sample.

Building

Ideally, clone this and other interesting nvpro-samples repositories into a common subdirectory. You will always need nvpro_core, which is searched for either as a subdirectory of the sample or one directory up.

If you are interested in multiple samples, you can use the build_all CMake project as an entry point. It will also give you options to enable or disable individual samples when creating the solutions.

Related Samples

gl commandlist basic illustrates the core principle of the NV_command_list extension. gl occlusion culling uses the same occlusion system as this sample, but in a simpler usage scenario.

When using classic scenegraphs, there is typically a lot of overhead in traversing the scene. For this reason, it is highly recommended to use simpler representations for actual rendering. Consider using flattened hierarchies, arrays, memory-friendly data structures, data-oriented design patterns, and similar techniques. If you are still working with a classic scenegraph, then nvpro-pipeline may provide some acceleration strategies to avoid full scenegraph traversal. Some of these strategies are also described in this GTC 2013 presentation.