CUDA Core Compute Libraries (CCCL)

Welcome to the CUDA Core Compute Libraries (CCCL), where our mission is to make CUDA more delightful.

This repository unifies three essential CUDA C++ libraries (Thrust, CUB, and libcudacxx) into a single, convenient repository.

The goal of CCCL is to provide CUDA C++ developers with building blocks that make it easier to write safe and efficient code. Bringing these libraries together streamlines your development process and broadens your ability to leverage the power of CUDA C++. For more information about the decision to unify these projects, see the announcement here.

Overview

The concept for the CUDA Core Compute Libraries (CCCL) grew organically out of the Thrust, CUB, and libcudacxx projects that were developed independently over the years with a similar goal: to provide high-quality, high-performance, and easy-to-use C++ abstractions for CUDA developers. Naturally, there was a lot of overlap among the three projects, and it became clear the community would be better served by unifying them into a single repository.

The main goal of CCCL is to fill a role similar to the one the Standard C++ Library fills for Standard C++: provide general-purpose, speed-of-light tools to CUDA C++ developers, allowing them to focus on solving the problems that matter. Unifying these projects is the first step towards realizing that goal.

Example

This is a simple example demonstrating the use of CCCL functionality from Thrust, CUB, and libcudacxx.

It shows how to use Thrust/CUB/libcudacxx to implement a simple parallel reduction kernel. Each thread block computes the sum of a subset of the array using cub::BlockReduce. The sum of each block is then reduced to a single value using an atomic add via cuda::atomic_ref from libcudacxx.

It then shows how the same reduction can be done using Thrust's reduce algorithm and compares the results.

Try it live on Godbolt!

#include <thrust/execution_policy.h>
#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <thrust/reduce.h>
#include <cub/block/block_reduce.cuh>
#include <cuda/atomic>
#include <cuda/cmath>
#include <cuda/std/span>
#include <cassert>
#include <cstdio>

template <int block_size>
__global__ void reduce(cuda::std::span<int const> data, cuda::std::span<int> result) {
  using BlockReduce = cub::BlockReduce<int, block_size>;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  int const index = threadIdx.x + blockIdx.x * blockDim.x;
  int sum = 0;
  if (index < data.size()) {
    sum += data[index];
  }
  sum = BlockReduce(temp_storage).Sum(sum);

  if (threadIdx.x == 0) {
    cuda::atomic_ref<int, cuda::thread_scope_device> atomic_result(result.front());
    atomic_result.fetch_add(sum, cuda::memory_order_relaxed);
  }
}

int main() {

  // Allocate and initialize input data
  int const N = 1000;
  thrust::device_vector<int> data(N);
  thrust::fill(data.begin(), data.end(), 1);

  // Allocate output data
  thrust::device_vector<int> kernel_result(1);

  // Compute the sum reduction of `data` using a custom kernel
  constexpr int block_size = 256;
  int const num_blocks = cuda::ceil_div(N, block_size);
  reduce<block_size><<<num_blocks, block_size>>>(cuda::std::span<int const>(thrust::raw_pointer_cast(data.data()), data.size()),
                                                 cuda::std::span<int>(thrust::raw_pointer_cast(kernel_result.data()), 1));

  auto const err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    std::fprintf(stderr, "Error: %s\n", cudaGetErrorString(err));
    return -1;
  }

  int const custom_result = kernel_result[0];

  // Compute the same sum reduction using Thrust
  int const thrust_result = thrust::reduce(thrust::device, data.begin(), data.end(), 0);

  // Ensure the two solutions are identical
  std::printf("Custom kernel sum: %d\n", custom_result);
  std::printf("Thrust reduce sum: %d\n", thrust_result);
  assert(custom_result == thrust_result);
  return 0;
}

Getting Started

Users

Everything in CCCL is header-only. Therefore, users need only concern themselves with how they get the header files and how they incorporate them into their build system.

CUDA Toolkit

The easiest way to get started using CCCL is via the CUDA Toolkit which includes the CCCL headers. When you compile with nvcc, it automatically adds CCCL headers to your include path so you can simply #include any CCCL header in your code with no additional configuration required.

If compiling with another compiler, you will need to update your build system's include search path to point to the CCCL headers in your CTK install (e.g., /usr/local/cuda/include).

#include <thrust/device_vector.h>
#include <cub/cub.cuh>
#include <cuda/std/atomic>

GitHub

Users who want to stay on the cutting edge of CCCL development are encouraged to use CCCL from GitHub. Using a newer version of CCCL with an older version of the CUDA Toolkit is supported, but not the other way around. For complete information on compatibility between CCCL and the CUDA Toolkit, see our platform support.

Everything in CCCL is header-only, so cloning and including it in a simple project is as easy as the following:

git clone https://github.com/NVIDIA/cccl.git
nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub main.cu -o main

Note: Use -I and not -isystem to avoid collisions with the CCCL headers implicitly included by nvcc from the CUDA Toolkit. All CCCL headers use #pragma system_header so that warnings are still silenced as if using -isystem. See https://github.com/NVIDIA/cccl/issues/527 for more information.

Installation

A minimal build that only generates installation rules can be configured using the install CMake preset:

git clone https://github.com/NVIDIA/cccl.git
cd cccl
cmake --preset install -DCMAKE_INSTALL_PREFIX=/usr/local/
cd build/install
ninja install

To include experimental libraries in the installation, use the install-unstable preset and build directory.

To install only the experimental libraries, use the install-unstable-only preset and build directory.

Conda

CCCL also provides conda packages of each release via the conda-forge channel:

conda config --add channels conda-forge
conda install cccl

This will install the latest CCCL to the conda environment's $CONDA_PREFIX/include/ and $CONDA_PREFIX/lib/cmake/ directories. It is discoverable by CMake via find_package(CCCL) and can be used by any compilers in the conda environment. For more information, see this introduction to conda-forge.

If you want to use the same CCCL version that shipped with a particular CUDA Toolkit, e.g. CUDA 12.4, you can install CCCL with:

conda config --add channels conda-forge
conda install cuda-cccl cuda-version=12.4

The cuda-cccl metapackage installs the cccl version that shipped with the CUDA Toolkit corresponding to cuda-version. If you wish to update to the latest cccl after installing cuda-cccl, uninstall cuda-cccl before updating cccl:

conda uninstall cuda-cccl
conda install -c conda-forge cccl

Note: There are also conda packages with names like cuda-cccl_linux-64. Those packages contain the CCCL versions shipped as part of the CUDA Toolkit, but are designed for internal use by the CUDA Toolkit. Install cccl or cuda-cccl instead, for compatibility with conda compilers. For more information, see the cccl conda-forge recipe.

CMake Integration

CCCL uses CMake for all build and installation infrastructure, including tests as well as targets to link against in other CMake projects. Therefore, CMake is the recommended way to integrate CCCL into another project.

For a complete example of how to do this using CMake Package Manager, see our basic example project.

Other build systems should work, but only CMake is tested. Contributions to simplify integrating CCCL into other build systems are welcome.

Contributors

Interested in contributing to making CCCL better? Check out our Contributing Guide for a comprehensive overview of everything you need to know to set up your development environment, make changes, run tests, and submit a PR.

Platform Support

Objective: This section describes where users can expect CCCL to compile and run successfully.

In general, CCCL should work everywhere the CUDA Toolkit is supported; however, the devil is in the details. The sections below describe the details of support and testing for different versions of the CUDA Toolkit, host compilers, and C++ dialects.

CUDA Toolkit (CTK) Compatibility

Summary:

CCCL users are encouraged to capitalize on the latest enhancements and "live at head" by always using the newest version of CCCL. For a seamless experience, you can upgrade CCCL independently of the entire CUDA Toolkit. This is possible because CCCL maintains backward compatibility with the latest patch release of every minor CTK release from both the current and previous major version series. In some exceptional cases, the minimum supported minor version of the CUDA Toolkit release may need to be newer than the oldest release within its major version series. For instance, CCCL requires a minimum supported version of 11.1 from the 11.x series due to an unavoidable compiler issue present in CTK 11.0.

When a new major CTK is released, we drop support for the oldest supported major version.

| CCCL Version | Supports CUDA Toolkit Version |
|---|---|
| 2.x | 11.1 - 11.8, 12.x (only latest patch releases) |
| 3.x (Future) | 12.x, 13.x (only latest patch releases) |

Well-behaved code using the latest CCCL should compile and run successfully with any supported CTK version. Exceptions may occur for new features that depend on new CTK features, so those features would not work on older versions of the CTK. For example, C++20 support was not added to nvcc until CUDA 12.0, so CCCL features that depend on C++20 would not work with CTK 11.x.
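As a rough illustration of guarding dialect-dependent code, the sketch below selects an implementation based on the C++ dialect in use. It is plain standard C++ rather than CCCL API, and the names `kHasCxx20` and `midpoint_index` are hypothetical, chosen only for this example.

```cpp
// Hypothetical illustration: pick an implementation based on the C++ dialect
// the translation unit is compiled with. Nothing here is CCCL API.
#if __cplusplus >= 202002L
#  include <numeric> // std::midpoint (C++20)
#endif

// True when the compiler is in C++20 mode or newer.
// Note: MSVC reports an accurate __cplusplus only with /Zc:__cplusplus.
constexpr bool kHasCxx20 = (__cplusplus >= 202002L);

// Overflow-safe midpoint of two indices, assuming lo <= hi.
constexpr int midpoint_index(int lo, int hi) {
#if __cplusplus >= 202002L
  return std::midpoint(lo, hi); // C++20 path
#else
  return lo + (hi - lo) / 2;    // fallback for older dialects
#endif
}

static_assert(midpoint_index(2, 10) == 6, "both paths agree when lo <= hi");
```

Code that takes the C++20 path only when the dialect allows it keeps compiling on toolchains that stop at an older standard.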

Users can integrate a newer version of CCCL into an older CTK, but not the other way around. This means an older version of CCCL is not compatible with a newer CTK. In other words, CCCL is never forward compatible with the CUDA Toolkit.

The table below summarizes compatibility of the CTK and CCCL:

| CTK Version | Included CCCL Version | Desired CCCL | Supported? | Notes |
|---|---|---|---|---|
| CTK X.Y | CCCL MAJOR.MINOR | CCCL MAJOR.MINOR+n | Yes | Some new features might not work |
| CTK X.Y | CCCL MAJOR.MINOR | CCCL MAJOR+1.MINOR | Yes | Possible breaks; some new features might not be available |
| CTK X.Y | CCCL MAJOR.MINOR | CCCL MAJOR+2.MINOR | No | CCCL supports only two CTK major versions |
| CTK X.Y | CCCL MAJOR.MINOR | CCCL MAJOR.MINOR-n | No | CCCL isn't forward compatible |
| CTK X.Y | CCCL MAJOR.MINOR | CCCL MAJOR-n.MINOR | No | CCCL isn't forward compatible |
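The rows of the table can be read as a simple rule, sketched below in plain C++. This is a hypothetical model, not part of CCCL: the `Version` struct and `is_supported` function are invented for illustration, and the sketch assumes the first two rows are supported and the last three are not, as the Notes column suggests.

```cpp
// Hypothetical model of the compatibility table; not part of CCCL.
struct Version {
  int major_version;
  int minor_version;
};

// Returns true when `desired` CCCL can be used with a CTK that shipped
// `included` CCCL. Rule as read from the table: the desired CCCL must not be
// older than the included one, and may be at most one major version ahead.
constexpr bool is_supported(Version included, Version desired) {
  return (desired.major_version == included.major_version &&
          desired.minor_version >= included.minor_version) ||
         (desired.major_version == included.major_version + 1);
}

static_assert(is_supported({2, 2}, {2, 5}), "newer minor: supported");
static_assert(is_supported({2, 2}, {3, 0}), "next major: supported");
static_assert(!is_supported({2, 2}, {4, 0}), "two majors ahead: unsupported");
static_assert(!is_supported({2, 2}, {2, 1}), "older minor: not forward compatible");
static_assert(!is_supported({2, 2}, {1, 9}), "older major: not forward compatible");
```

The rule covers only the version relationship. As the Notes column says, individual new features may still be unavailable on an older CTK.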

For more information on CCCL versioning, API/ABI compatibility, and breaking changes, see the Versioning section below.

Operating Systems

Unless otherwise specified, CCCL supports all the same operating systems as the CUDA Toolkit, which are documented here:

Host Compilers

Unless otherwise specified, CCCL supports all the same host compilers as the CUDA Toolkit, which are documented here:

In the spirit of "You only support what you test", see our CI Overview for more information on exactly what we test.

C++ Dialects

GPU Architectures

Unless otherwise specified, CCCL supports all the same GPU architectures/Compute Capabilities as the CUDA Toolkit, which are documented here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability

Note that some features may only support certain architectures/Compute Capabilities.

Testing Strategy

CCCL's testing strategy strikes a balance between testing as many configurations as possible and maintaining reasonable CI times.

For CUDA Toolkit versions, testing is done against both the oldest and the newest supported versions. For instance, if the latest version of the CUDA Toolkit is 12.3, tests are conducted against 11.1 and 12.3. For each CUDA version, builds are completed against all supported host compilers with all supported C++ dialects.

The testing strategy and matrix are constantly evolving. The matrix defined in the ci/matrix.yaml file is the definitive source of truth. For more information about our CI pipeline, see here.

Versioning

Objective: This section describes how CCCL is versioned, API/ABI stability guarantees, and compatibility guidelines to minimize upgrade headaches.

Summary

Note: Prior to merging Thrust, CUB, and libcudacxx into this repository, each library was independently versioned according to semantic versioning. Starting with the 2.1 release, all three libraries synchronized their release versions in their separate repositories. Moving forward, CCCL will continue to be released under a single semantic version, with 2.2.0 being the first release from the NVIDIA/cccl repository.

Breaking Change

A Breaking Change is a change to explicitly supported functionality between released versions that would require a user to do work in order to upgrade to the newer version.

In the limit, any change has the potential to break someone somewhere. As a result, not all possible source breaking changes are considered Breaking Changes to the public API that warrant bumping the major semantic version.

The sections below describe the details of breaking changes to CCCL's API and ABI.

Application Programming Interface (API)

CCCL's public API is the entirety of the functionality intentionally exposed to provide the utility of the library.

In other words, CCCL's public API goes beyond just function signatures and includes (but is not limited to):

Moreover, CCCL's public API does not include any of the following:

In general, the goal is to avoid breaking anything in the public API. Such changes are made only if they offer users better performance, easier-to-understand APIs, and/or more consistent APIs.

Any breaking change to the public API will require bumping CCCL's major version number. In keeping with CUDA Minor Version Compatibility, API breaking changes and CCCL major version bumps will only occur coinciding with a new major version release of the CUDA Toolkit.

Anything not part of the public API may change at any time without warning.

API Versioning

The public APIs of all CCCL components share a unified semantic version of MAJOR.MINOR.PATCH.

Only the most recently released version is supported. As a rule, features and bug fixes are not backported to previously released versions or branches.

The preferred method for querying the version is to use CCCL_[MAJOR/MINOR/PATCH_]VERSION as described below. For backwards compatibility, the Thrust/CUB/libcudacxx version definitions are available and will always be consistent with CCCL_VERSION. Note that Thrust/CUB use a MMMmmmpp scheme whereas CCCL and libcudacxx use MMMmmmppp.

| | CCCL | libcudacxx | Thrust | CUB |
|---|---|---|---|---|
| Header | <cuda/version> | <cuda/std/version> | <thrust/version.h> | <cub/version.h> |
| Major Version | CCCL_MAJOR_VERSION | _LIBCUDACXX_CUDA_API_VERSION_MAJOR | THRUST_MAJOR_VERSION | CUB_MAJOR_VERSION |
| Minor Version | CCCL_MINOR_VERSION | _LIBCUDACXX_CUDA_API_VERSION_MINOR | THRUST_MINOR_VERSION | CUB_MINOR_VERSION |
| Patch/Subminor Version | CCCL_PATCH_VERSION | _LIBCUDACXX_CUDA_API_VERSION_PATCH | THRUST_SUBMINOR_VERSION | CUB_SUBMINOR_VERSION |
| Concatenated Version | CCCL_VERSION (MMMmmmppp) | _LIBCUDACXX_CUDA_API_VERSION (MMMmmmppp) | THRUST_VERSION (MMMmmmpp) | CUB_VERSION (MMMmmmpp) |
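To make the difference between the two concatenated schemes concrete, here is a small sketch in plain C++ that splits a concatenated value back into its components. The `SemVer` struct and both `decode_*` helpers are hypothetical names for this example, and the sample values are illustrative encodings of version 2.2.0 rather than values read from the CCCL headers.

```cpp
// Hypothetical helpers; sample values are illustrative, not read from CCCL headers.
struct SemVer {
  int major_version;
  int minor_version;
  int patch_version;
};

// CCCL and libcudacxx: MMMmmmppp (three digits each for minor and patch).
constexpr SemVer decode_mmmmmmppp(long v) {
  return {static_cast<int>(v / 1000000),
          static_cast<int>((v / 1000) % 1000),
          static_cast<int>(v % 1000)};
}

// Thrust and CUB: MMMmmmpp (three digits for minor, two for patch).
constexpr SemVer decode_mmmmmmpp(long v) {
  return {static_cast<int>(v / 100000),
          static_cast<int>((v / 100) % 1000),
          static_cast<int>(v % 100)};
}

// Version 2.2.0 encodes differently under the two schemes.
static_assert(decode_mmmmmmppp(2002000).major_version == 2, "MMMmmmppp major");
static_assert(decode_mmmmmmppp(2002000).minor_version == 2, "MMMmmmppp minor");
static_assert(decode_mmmmmmpp(200200).major_version == 2, "MMMmmmpp major");
static_assert(decode_mmmmmmpp(200200).minor_version == 2, "MMMmmmpp minor");
```

Because the two encodings differ by a factor of ten, comparing a Thrust or CUB concatenated version directly against a CCCL one gives misleading results. Comparing the separate major, minor, and patch macros avoids the problem.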

Application Binary Interface (ABI)

The Application Binary Interface (ABI) is a set of rules for:

A library's ABI includes, but is not limited to:

An ABI Breaking Change is any change that results in a change to the ABI of a function or type in the public API. For example, adding a new data member to a struct is an ABI Breaking Change as it changes the size of the type.
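The struct example can be checked directly. The sketch below uses two hypothetical types, `PointV1` and `PointV2`, standing in for the same struct before and after a release. They are not CCCL types.

```cpp
#include <cstddef> // offsetof

// Hypothetical types, not from CCCL: two "versions" of the same struct.
struct PointV1 {
  int x;
  int y;
};

struct PointV2 {
  int x;
  int y;
  int z; // new data member added in a later release
};

// Size is part of the ABI: code compiled against PointV1 allocates and copies
// fewer bytes than code compiled against PointV2 expects.
static_assert(sizeof(PointV2) > sizeof(PointV1), "adding a member grew the type");

// The existing members keep their offsets here, yet the change is still an
// ABI break because the overall size differs.
static_assert(offsetof(PointV2, y) == offsetof(PointV1, y), "y did not move");
```

Source code that only touches `x` and `y` compiles unchanged against either definition, which is why a change like this can go unnoticed until two separately compiled binaries disagree on the layout.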

In CCCL, the guarantees about ABI are as follows:

Who should care about ABI?

In general, CCCL users only need to worry about ABI issues when building or using a binary artifact (like a shared library) whose API directly or indirectly includes types provided by CCCL.

For example, consider if libA.so was built using CCCL version X and its public API includes a function like:

void foo(cuda::std::optional<int>);

If another library, libB.so, is compiled using CCCL version Y and uses foo from libA.so, then this can fail if there was an ABI break between version X and Y. Unlike with API breaking changes, ABI breaks usually do not require code changes and only require recompiling everything to use the same ABI version.

To learn more about ABI and why it is important, see What is ABI, and What Should C++ Do About It?.

Compatibility Guidelines

As mentioned above, not all possible source breaking changes constitute a Breaking Change that would require incrementing CCCL's API major version number.

Users are encouraged to adhere to the following guidelines in order to minimize the risk of disruptions from accidentally depending on parts of CCCL that are not part of the public API:

Portions of this section were inspired by Abseil's Compatibility Guidelines.

Deprecation Policy

We will do our best to notify users prior to making any breaking changes to the public API, ABI, or modifying the supported platforms and compilers.

As appropriate, deprecations will come in the form of programmatic warnings which can be disabled.
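One common way to implement a warning that can be switched off is a macro that expands to the standard `[[deprecated]]` attribute unless an opt-out macro is defined. The sketch below shows that general pattern only. The `MYLIB_*` macro names and the `old_sum` and `new_sum` functions are hypothetical and are not CCCL's actual macros or API.

```cpp
// Hypothetical macro names for illustration; these are not CCCL's actual macros.
// Defining the opt-out macro before the declarations silences the warning.
#define MYLIB_IGNORE_DEPRECATED_API // remove this line to see the warning

#ifdef MYLIB_IGNORE_DEPRECATED_API
#  define MYLIB_DEPRECATED_BECAUSE(msg)
#else
#  define MYLIB_DEPRECATED_BECAUSE(msg) [[deprecated(msg)]]
#endif

// Old entry point: still works, but warns at each call site unless opted out.
MYLIB_DEPRECATED_BECAUSE("use new_sum instead")
inline int old_sum(int a, int b) { return a + b; }

// Replacement entry point.
inline int new_sum(int a, int b) { return a + b; }
```

In a build, the opt-out would more typically be passed on the compiler command line (for example with `-D`) than defined in the source file.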

The deprecation period will depend on the impact of the change, but will usually last at least 2 minor version releases.

Mapping to CTK Versions

Coming soon!

CI Pipeline Overview

For a detailed overview of the CI pipeline, see ci-overview.md.

Related Projects

Projects that are related to CCCL's mission to make CUDA more delightful:

Projects Using CCCL

Does your project use CCCL? Open a PR to add your project to this list!