Home

Awesome

XSBench

Latest Github release Build Status Published in Annals of Nuclear Energy

XSBench is a mini-app representing a key computational kernel of the Monte Carlo neutron transport algorithm. Specifically, XSBench represents the continuous energy macroscopic neutron cross section lookup kernel. XSBench serves as a lightweight stand-in for full neutron transport applications like OpenMC, and is a useful tool for performance analysis on high performance computing architectures.

Table of Contents

  1. Understanding the FOM
  2. Selecting a Programming Model
  3. Compilation
  4. Running XSBench / Command Line Interface
  5. Feature Discussion
  6. Theory & Algorithms
  7. Optimized Kernels
  8. Citing XSBench
  9. Development Team

Understanding the XSBench Figure of Merit

The figure of merit (FOM) in XSBench is the number of macroscopic cross section lookups performed per second. This value includes only the time spent performing the cross section lookups themselves. It is intended to exclude any time spent on program initialization (e.g., generating randomized cross section data) and data movement to/from device memory spaces. In some of the GPU implementations of XSBench, timing data may be limited to host CPU timers, in which case they sometimes may include data movement costs to and from the device memory space. Thus, it is recommended to use device kernel timers (e.g., nsys, nvprof, rocprof, iprof, etc) to ascertain the true kernel runtime of the simulation loop alone. The true FOM can then be computing by dividing the number of lookups (reported by XSBench at the end of the simulation) by the kernel runtime.

In some cases, users may want to extend the runtime of XSBench (e.g., for investigating thermal effects that only arise after several seconds of execution). The best way of doing this while still representing a realistic cross section lookup kernel is to increase the number of lookups (in event-based mode), or particle histories (in history-based mode), as described in the running XSBench section below.

Selecting A Programming Model

XSBench has been implemented in multiple different languages to target a variety of computational architectures and accelerators. The available implementations can be found in their own directories:

  1. XSBench/openmp-threading This is the "default" version of XSBench that is appropriate for serial and multicore CPU architectures. The method of parallelism is via the OpenMP threading model.

  2. XSBench/openmp-offload This method of parallelism uses OpenMP 4.5 (or newer) to map program data to a remote accelerator memory space and run targeted kernels on the accelerator. This method of parallelism could be used for a wide variety of architectures (besides CPUs) that support OpenMP 4.5 targeting. NOTE: The Makefile will likely not work by default and will need to be adjusted to utilize your OpenMP accelerator compiler.

  3. XSBench/cuda This version of XSBench is written in CUDA for use with NVIDIA GPU architectures. NOTE: You will likely want to specify in the makefile the SM version for the card you are running on.

  4. XSBench/opencl This version of XSBench is written in OpenCL, and can be used for CPU, GPU, FPGA, or other architectures that support OpenCL. It was written with GPUs in mind, so if running on other architectures you may need to heavily re-optimize the code. You will also likely need to edit the makefile to supply the path to your OpenCL compiler.

  5. XSBench/sycl This version of XSBench is written in SYCL, and can be used for CPU, GPU, FPGA, or other architectures that support OpenCL and SYCL. It was written with GPUs in mind, so if running on other architectures you may need to heavily re-optimize the code. You will also likely need to edit the makefile to supply the path to your SYCL compiler.

  6. XSBench/hip This version of XSBench is written in HIP for use with GPU architectures. This version is derived from CUDA using an automatic conversion tool with only a few small manual changes.

Compilation

To compile XSBench with default settings, navigate to your selected source directory and use the following command:

make

You can alter compiler settings in the included Makefile. Alternatively, for the OpenMP threading version of XSBench you may specify a compiler via the CC environment variable and then making as normal, e.g.:

export CC=clang
make

Debugging, Optimization & Profiling

There are also a number of switches that can be set in the makefile. Here is a sample of the control panel at the top of the makefile:

OPTIMIZE = yes
DEBUG    = no
PROFILE  = no
MPI      = no
AML      = no

Running XSBench

To run XSBench with default settings, use the following command:

./XSBench

For non-default settings, XSBench supports the following command line options:

ArgumentDescriptionOptionsDefault
-t# of OpenMP threadsinteger valueSystem Default
-mSimulation methodhistory, eventhistory
-sProblem Sizesmall, large, XL, XXLlarge
-g# of gridpoints per nuclide (overrides -s defaults)integer value11,303
-GGrid search typeunionized, nuclide, hashunionized
-p# of particle histories (if running using "history" method)integer value500,000
-l# of Cross-section (XS) lookups. If using history based method, this is lookups per particle history. If using event-based method, this is total lookups.integer value(History: 34) (Event: 17,000,000)
-h# of hash bins (only used with hash-based grid search)integer value10,000
-bRead/Write binary filesread, write
-kOptimized kernel IDinteger value0

Feature Discussion

MPI Support

While XSBench is primarily used to investigate "on node parallelism" issues, some systems provide power & performance statistics batched in multi-node configurations. To accommodate this, XSBench provides an MPI mode which runs the code on all MPI ranks simultaneously. There is no decomposition across ranks of any kind, and all ranks accomplish the same work. This is a "weak scaling" approach -- for instance, if running the event-based model all MPI ranks will execute 17,000,000 cross section lookups regardless of how many ranks are used. There is only one point of MPI communication (a reduce) at the end, which aggregates the timing statistics and averages them across MPI ranks before printing them out. MPI support can be enabled with the makefile flag "MPI". If you are not using the mpicc wrapper on your system, you may need to alter the makefile to make use of your desired compiler.

AML Optimizations

AML is a memory management library featuring optimization abtractions. More information about the library can be found in the online documentation. The library can be downloaded and installed from the repository.

XSBench can be compiled to include these optimizations by toggling make AML=yes option. In order for pkg-config to find out the appropriate compilation flags, environment variable PKG_CONFIG_PATH must point to the install directory of aml.pc.

Current optimizations featured in XSBench are as follow:

Optimizations implementation and availability may depend on the programming model. In the current version the following programming models feature AML optimizations:

Verification Support

Legacy versions of XSBench had a special "Verification" compiler flag option to enable verification of the results. However, a much more performant and portable verification scheme was developed and is now used for all configurations -- therefore, it is not necessary to compile with or without the verification mode as it is always enabled by default. XSBench generates a hash of the results at the end of the simulation and displays it with the other data once the code has completed executing. This hash can then be verified against hashes that other versions or configurations of the code generate. For instance, running XSBench with 4 threads vs 8 threads (on a machine that supports that configuration) should generate the same hash number. Running on GPU vs CPU should not change the hash number. However, changing the model / run parameters is expected to generate a totally different hash number (i.e., increasing the number of particles, number of gridpoints, etc, will result in different hashes). However, changing the type of lookup performed (e.g., nuclide, unionized, or hash) should result in the same hash being generated. Changing the simulation mode (history or event) will generate different hashes.

Binary File Support

Instead of initializing the randomized synthetic cross section data structres in XSBench everytime it is run, you may optionally have XSBench generate a data set and write it to file. It can then be read on subsequent runs to speed up initialization. This process is controlled with the "-b (read, write)" command line argument. This feature may be extremely useful for users running on simulators where walltime minimization is critical for logistical purposes, or for users who are doing many sequential runs. Note that identical input parameters (problem size, solution method etc) must be used when reading and writing a binary file. No runtime checks are made to validate that the file correctly corresponds to the selected input parameters.

Algorithms

Transport Simulation Styles

History-Based Transport

The default simulation model used in XSBench is the "history-based" model. In this model, parallelism is expressed over independent particle histories, with each particle being simulated in a serial fashion from birth to death:

for each particle do		   // Independent
	while particle is alive do // Dependent
		Move particle to collision site
		Process particle collision

This method of parallelism is very memory efficient, as the total number of particles that must be kept in memory at once is equivalent to the total number of active threads being run in the simulation. However, as there are many different types of collision events, the history-based model means that there is no natural SIMD style parallelism available for work happening between different threads.

Event-Based Transport

An alternative simulation model is the "event-based" model. In this model, parallelism is instead expressed over different collision (or "event") types. To facilitate this, all particles in the simulation are stored in memory at once. Each event kernel is executed in parallel on vectors of particles that currently require that event to be executed:

Get vector of source particles
while any particles are alive do	     // Dependent
	for each living particle do          // Independent
		Move particle to collision site
	for each living particle do          // Independent
		Process particle collision
	Sort/consolidate surviving particles

This method of parallelism is requires more memory and requires an extra stream compaction kernel to sort and organize the particles periodically to ready them for the different event kernels. The benefit of this model is that kernels can potentially be execute in a SIMD manner and with higher cache efficiency due to the potential to sort particles by material and energy. On CPU architectures, the costs of sorting and buffering particles typically outweigh the benefits of the event-based model, but on accelerator architectures the tradeoff has been found to usually be more favorable.

Cross Section (XS) Lookup Methods

XSBench represents the macroscopic cross section lookup kernel. This kernel is responsible for adding together microscopic cross section data from all nuclides present in the material the neutron is travelling through, given a certain energy:

<p align="center"> <img src="docs/img/XS_equation.svg" alt="XS_Lookup_EQ" width="500"/> </p>

Macroscopic cross section data is typically required for multiple reaction channels "c", such as the total cross section, fission cross section, etc. This data is typically stored in point-wise data form for each nuclide. There are multiple ways of accessesing this data in an efficient manner which will be discusses in this section.

Nuclide Grid

This is the default "naive" method of performing macroscopic XS lookups. XS data is stored for a number of energy levels for each nuclide in the simulation problem. Different nuclides can have a different number of energy levels. For instance, U-238 usually has over 100k energy levels, whereas some other nuclides may only have a few thousand. The "Nuclide Grid" is composed of all nuclides in the problem, with a variable number of data points for each nuclide. Each data point is composed of the energy level and accompanying cross section data for multiple different reaction channels:

<p align="center"> <img src="docs/img/xs_point.png" alt="xs_point" width="300"/> </p>

These XS data points are arranged into the nuclide grid:

<p align="center"> <img src="docs/img/nuclide_grid.png" alt="nuclide_grid" width="350"/> </p>

When assembling a macroscopic cross section data point, we will be accessing and interpolating data from the nuclide grid for a neutron travelling through a given material (composed of some number of nuclides) and at a given energy level. This will involve performing a binary search for each nuclide:

Nuclide_Grid_Search( Energy E, Material M ):
	macroscopic XS = 0
	for each nuclide in M do:
		index = binary search to find E in nuclide grid
		interpolate data from grid[nuclide, index]
		macroscopic XS += data

This algorithm requires no extra memory usage beyond the minimum to represent the pointwise cross section data, but also requires that a binary search be run for each nuclide in the material the neutron is travelling through. In the case of depleted fuel, there can be 300+ nuclides, making this option computationally expensive.

Unionized Energy Grid

One way of speeding up the nuclide grid search is to form a separate acceleration structure to reduce the number of binary searches that need to be performed. In the Unionized Energy Grid (EUG) method, a second grid is created with columns corresponding to the union of all energy levels from the nuclide grid. For each energy level (column) in the unionized grid, each row stores an index corresponding to the closest location in the nuclide grid for each nuclide corresponding that energy level:

<p align="center"> <img src="docs/img/UEG.png" alt="UEG" width="500"/> </p>

A lookup using the UEG therefore requires only one single binary search on the unionized grid, allowing then for fast accesses using the indices stored at that energy level:

Unionized_Grid_Search( Energy E, Material M ):
	macroscopic XS = 0
	UEG_index = binary search to find E in unionized grid
	for each nuclide in M do:
		index = unionized_grid[nuclide, UEG_index]
		interpolate data from grid[nuclide, index]
		macroscopic XS += data

Logarithmic Hash Grid

An alternative to the unionized energy grid is the logarithmic hash grid. This method takes in account the fact that while nuclides will be tabulated on grids containing different numbers of energy points, the points within each nuclide's grid will in general be spaced in (roughly) uniform maner in log space . Therefore, the nuclide grid is augmented with a separate acceleration structure similar to the unionized grid. However, the number of columns is capped at some number of bins spaced evenly in log space, with each row therefore corresponding to an approximate location within each nuclide's grid for that energy level. While the unionized grid points exactly to the correct index in the nuclide grid, the logarithmic hash grid points to only an approximate location below the true point -- meaning that a fast binary or iterative search must still be performed over the constrained area (typically only 10 or so elements in size):

Logarithmic_Hash_Grid_Search( Energy E, Material M ):
	macroscopic XS = 0
	hash_index = grid_delta * (ln(E) - grid_minimum_energy)
	for each nuclide in M do:
		i_low  = unionized_grid[nuclide, hash_index]
		i_high = unionized_grid[nuclide, hash_index+1]
		index = binary search in range(i_low, i_high) to find E in nuclide grid
		interpolate data from grid[nuclide, index]
		macroscopic XS += data

Compared to the unionized energy grid method, the logarithmic hash method uses far less memory but typically results in competitive performance. More details on this method can be found in the following publication:

Forrest B Brown. New hash-based energy lookup algorithm for monte carlo codes. Trans. Am. Nucl. Soc., 111:659–662, 2014. http://permalink.lanl.gov/object/tr?what=info:lanl-repo/lareport/LA-UR-14-27037

Optimized Kernels

If using the event-based model, we will be executing the lookup kernel in XSBench across all particles at once. While SIMD execution is possible using this method, typically issues can arrise that greatly reduce SIMD efficiency. In particular, different materials in the simulation have very different numbers of nuclides in them. For instance, spent fuel has 300+ nuclides, while moderator regions only have 10 or so nuclides. This creates a significant load imbalance across lanes in a SIMD vector, as some particles may only need a few iterations to complete all nuclides while others would need hundreds. Furthermore, due to the random memory accesses involved in the binary search process, efficient SIMD execution of the event-based model is not possible without some optimizations.

One promising optimization for the event-based model is to perform a key-value sort of particles: first by material, and then by energy within each material. The first sort by material allows for adjacent particles in the vector to typically reside in the same type of material -- meaning that they will require the same number of nuclide lookup iterations. Then, the energy sort means that adjacent particles in the vector will be located close in energy space -- potentially allowing for many adjacent particles to access the same energy indices in each nuclide and therefore perform many or all of the same branching operations and read the same cache lines into memory at the same time. Once sorted, separate event kernels are then called for each material in the simulation. These two sorts can potentially boost both SIMD efficieny and cache efficiency, with effects being amplified as more particles are simulated at each event stage. The downside to this optimization is the introduction of the key-value particle sorting operations, which can be costly and potentially outweight any gains due to improved SIMD efficiency and cache performance.

We have implemented this optimization in both the OpenMP threading and CUDA models. They are not enabled by default, but must be enabled with the "-k 1" or "-k 6" flags if running with OpenMP and CUDA respectively. These optimizations have not yet been implemented in the other programming models due to the lack of an efficient parallel sorting function being easily available without having to create an external library dependency.

Citing XSBench

Papers citing the XSBench program in general should refer to:

J. R. Tramm, A. R. Siegel, T. Islam, and M. Schulz, “XSBench - The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis,” presented at PHYSOR 2014 - The Role of Reactor Physics toward a Sustainable Future, Kyoto. https://www.mcs.anl.gov/papers/P5064-0114.pdf

Bibtex Entry:

@inproceedings{Tramm:wy,
author = {Tramm, John R and Siegel, Andrew R and Islam, Tanzima and Schulz, Martin},
title = {{XSBench} - The Development and Verification of a Performance Abstraction for {M}onte {C}arlo Reactor Analysis},
booktitle = {{PHYSOR} 2014 - The Role of Reactor Physics toward a Sustainable Future},
address = {Kyoto},
year = 2014,
url = "https://www.mcs.anl.gov/papers/P5064-0114.pdf"
}

Development Team

Authored and maintained by John Tramm (@jtramm) with help from Ron Rahaman, Amanda Lund, and other contributors.