DM-Sim: Density Matrix Quantum Circuit Simulation Environment (Merged in NWQSim)
A Density Matrix Quantum Simulation Environment for single-GPU/CPU, single-node multi-GPU/CPU, and multi-node GPU/CPU clusters. It supports Intel/AMD/IBM CPUs and NVIDIA/AMD GPUs.
Current version
Latest version: 2.5
Version-2.5 Updates:
- Add AMD HIP GPU backend.
- Slightly restructure the code.
Version-2.4 Updates:
- Add CMake support for easier builds.
- Add QASMBench circuits for evaluation.
Version-2.3 Updates:
- Add AVX512 acceleration support for the CPU backend. Specify USE_AVX512 in config.hpp; requires AVX512F support on the CPU.
Version-2.2 Updates:
- Add a CPU backend for both OpenMP and MPI so that DM-Sim can support systems without GPUs.
- Restructure the source file layout.
Version-2.1 Updates:
- Support Q# code through QIR and the Bridge developed by the Microsoft Quantum team. The DM-Sim project is also listed as one of the public projects using QIR to support Q#/QDK.
- Enable dumping the current quantum circuit to a file stream.
Version-2.0 Updates:
- Provide python APIs, see the adder_n10_omp.py example in src.
- Provide C++/CUDA APIs, see the adder_n10_omp.cu example in src.
- Support fast single-GPU execution (up to 14 qubits on a V100 GPU).
- Support single-node-multi-GPU execution managed by OpenMP.
- Support multi-node cluster execution managed by MPI and mpi4py.
If you're looking for the implementation described in our SC-20 paper, please see the V1.0 release. DM-Sim is under active development and we will continuously add new features; please report any bugs and suggest features. Questions and suggestions are welcome.
About DM-Sim
Please see our SuperComputing (SC-20) paper for details. The paper was nominated for the Best Paper Award at SC-20.
In this repository you will find a CUDA/C++ implementation for simulating deep quantum circuits on a single GPU/CPU, on a single node with multiple GPUs/CPUs (e.g., NVIDIA DGX-1, DGX-2 and HGX), and on multi-node GPU/CPU clusters (such as the Summit supercomputer at ORNL) using full density matrices. Our DM-Sim simulator fully supports the OpenQASM intermediate-representation (IR) language (see the spec). OpenQASM can be generated by Qiskit, Cirq, ProjectQ and Scaffold (see below). It also supports Q#/QDK through QIR. For scale-up (i.e., single-node multi-GPU), we leverage fast intra-node interconnects such as NVLink, NV-SLI and NVSwitch (see our benchmarking and evaluation papers on modern GPU interconnects). The simulator is based on the Multi-GPU-BSP (MG-BSP) model; please see our SuperComputing-20 paper for details. Here is the video presentation on YouTube:
DM-Sim (OpenMP) simulates 1M general gates on 15 qubits, gate by gate, in 94 minutes on a DGX-2 (16 NVIDIA V100 GPUs) using the density operator -- on average 5.6 ms/gate. DM-Sim simulates an 8-qubit VQE-UCCSD circuit with 10808 gates in 249.3 ms on a single NVIDIA V100 GPU -- on average 0.023 ms/gate.
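For background, a density-matrix simulator applies each gate U to the density matrix rho as rho -> U rho U^dagger, rather than updating a state vector. A tiny NumPy sketch of this update (for illustration only, not DM-Sim's actual GPU kernel):

import numpy as np

# Apply a Hadamard to the density matrix of |0>: rho -> H rho H^dagger
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
rho = np.array([[1, 0], [0, 0]], dtype=complex)  # |0><0|
rho = H @ rho @ H.conj().T
print(rho)  # 0.5 in every entry: the |+><+| state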
Supported Gates
Gates | Meaning | Gates | Meaning |
---|---|---|---|
U3 | 3 parameter 2 pulse 1-qubit | CY | Controlled Y |
U2 | 2 parameter 1 pulse 1-qubit | SWAP | Swap |
U1 | 1 parameter 0 pulse 1-qubit | CH | Controlled H |
CX | Controlled-NOT | CCX | Toffoli |
ID | Idle gate or identity | CSWAP | Fredkin |
X | Pauli-X bit flip | CRX | Controlled RX rotation |
Y | Pauli-Y bit and phase flip | CRY | Controlled RY rotation |
Z | Pauli-Z phase flip | CRZ | Controlled RZ rotation |
H | Hadamard | CU1 | Controlled phase rotation |
S | sqrt(Z) phase | CU3 | Controlled U3 |
SDG | conjugate of sqrt(Z) | RXX | 2-qubit XX rotation |
T | sqrt(S) phase | RZZ | 2-qubit ZZ rotation |
TDG | conjugate of sqrt(S) | RCCX | Relative-phase CCX |
RX | X-axis rotation | RC3X | Relative-phase 3-controlled X |
RY | Y-axis rotation | C3X | 3-controlled X |
RZ | Z-axis rotation | C3XSQRTX | 3-controlled sqrt(X) |
CZ | Controlled phase | C4X | 4-controlled X |
W | W gate | RYY | 2-qubit YY rotation |
C1 | Arbitrary 1-qubit gate | C2 | Arbitrary 2-qubit gate |
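For illustration, the sketch below appends a few of these gates through the Python API introduced later in this README. The X and H calls follow the adder examples; the CX, U3 and CU1 signatures are assumptions to verify against the source.

from math import pi
import dmsim_py_omp_wrapper as dmsim_omp  # OpenMP Python wrapper (see Python APIs below)

sim = dmsim_omp.Simulation(3, 1)       # 3 qubits on 1 GPU
sim.append(sim.H(0))                   # Hadamard on qubit 0
sim.append(sim.X(1))                   # Pauli-X on qubit 1
sim.append(sim.CX(0, 2))               # assumed signature: CX(control, target)
sim.append(sim.U3(pi/2, 0.0, pi, 1))   # assumed signature: U3(theta, phi, lam, qubit)
sim.append(sim.CU1(pi/4, 0, 2))        # assumed signature: CU1(lam, control, target)
sim.upload()                           # upload the circuit to the GPU
sim.run()                              # simulate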
Package Structure
src: DM-Sim source file
- config.hpp: configurations, constants and macros.
- Makefile: for compilation.
- util_nvgpu.cuh: NVIDIA GPU utility functions.
- util_cpu.h: CPU utility functions.
- dmsim_nvgpu_omp.cuh: major DM-Sim source file for 1-GPU or 1-node-N-GPUs with NVIDIA GPU backend.
- dmsim_nvgpu_mpi.cuh: major DM-Sim source file for N-nodes with NVIDIA GPU backend.
- dmsim_cpu_omp.hpp: major DM-Sim source file for 1-CPU_core or 1-node-N-CPU_cores with CPU backend.
- dmsim_cpu_mpi.hpp: major DM-Sim source file for N-nodes with CPU backend.
- py_nvgpu_omp_wrapper.cu: PyBind11 wrapper for Single/OpenMP python APIs with NVIDIA GPU backend.
- py_nvgpu_mpi_wrapper.cu: PyBind11 wrapper for MPI python APIs via mpi4py with NVIDIA GPU backend.
- py_cpu_omp_wrapper.cpp: PyBind11 wrapper for Single/OpenMP python APIs with CPU backend.
- py_cpu_mpi_wrapper.cpp: PyBind11 wrapper for MPI python APIs via mpi4py with CPU backend.
- adder_n10_cpu_omp.cpp: A 10-qubit adder example using C++ APIs (CPU backend).
- adder_n10_cpu_mpi.cpp: A 10-qubit adder example using MPI C++ APIs (CPU backend).
- adder_n10_nvgpu_omp.cu: A 10-qubit adder example using C++ APIs (NVIDIA GPU backend).
- adder_n10_nvgpu_mpi.cu: A 10-qubit adder example using MPI C++/CUDA APIs (NVIDIA GPU backend).
- adder_n10_omp.py: A 10-qubit adder example using Python APIs for scaling-up (select backend inside).
- adder_n10_mpi.py: A 10-qubit adder example using Python APIs for scaling-out (select backend inside).
- qir_omp_wrapper.cu: Wraps DM-Sim to implement the QIR-Bridge API based on OpenMP (select backend inside).
- qir_mpi_wrapper.cu: Wraps DM-Sim to implement the QIR-Bridge API based on MPI (select backend inside).
- set_summit_qir_env.sh: The environment required for building Q#/QIR support on Summit.
- vqe.ll: The example VQE QIR code.
- vqe.qs: The example VQE Q# code.
- vqe_omp_driver.cu: The driver code for the Q# based VQE example for scaling-up.
- vqe_mpi_driver.cu: The driver code for the Q# based VQE example for scaling-out.
benchmark:
- OpenQASM-based benchmarks (.qasm). This folder contains some circuits from the paper; for more, please refer to our QASMBench.
tool: Supporting tools (support for other quantum languages will be added).
- dmsim_qasm.py: A script to convert OpenQASM code into Python-API-based Python code (OpenMP version).
- qelib1.inc: OpenQASM standard header file.
- randomtest_n14.py: Tests the single-GPU version using 100K Hadamard gates applied randomly to 14 qubits.
- vqe_uccsd_n8.qasm: An OpenQASM based VQE circuit example generated from Scaffold.
- run_dimsim_qasm.sh: Shows how to convert the vqe_uccsd_n8.qasm circuit into a Python script that can run on DM-Sim.
- vqe_uccsd_n8.py: The generated Python script that can run on DM-Sim.
summit: Files useful for running on the ORNL Summit supercomputer
- set_env.sh: Sets the environment for CUDA, the C/C++ compiler, MPI and Python. We also describe how to set up the Python-2.7 environment with pybind11 and mpi4py support on Summit.
- summit_dmsim.lsf: example lsf file for job submission on Summit (please update accordingly).
- Summit.txt: System information about a node of Summit HPC generated using the SC Author-Kit tool.
artifact: System configurations for the evaluation performed in our paper. These are generated using the SC Author-Kit tool:
- SLI.txt: For the SLI-system with two RTX2080 GPUs connected by NV-SLI bridge.
- dgx-1P.txt: For the Pascal architecture P100-DGX-1 with 8 GPUs connected by NVLink-V1.
- dgx-1V.txt: For the Volta architecture V100-DGX-1 with 8 GPUs connected by NVLink-V2.
- dgx-2.txt: For the Volta architecture DGX-2 with 16 GPUs connected by NVSwitch.
img: Images for the repo.
Configuration
You may need to update "src/Makefile" to configure your NVCC path and GPU architecture (e.g., -arch=sm_60 for P100, -arch=sm_70 for V100, and -arch=sm_80 for A100 GPUs). C++11 support (-std=c++11) is required.
CC = nvcc
# Optimization, target GPU architecture (sm_70 = V100), C++11, relocatable device code
FLAGS = -O3 -arch=sm_70 -std=c++11 -rdc=true
# Link against the math library
LIBS = -lm
Prerequisites
DM-Sim requires the following packages.
Dependency | Version | Comments |
---|---|---|
CUDA | 10.0 or later | For NVIDIA GPU backend |
GCC (or XL) | 5.2 or later (16.01 for xlc) | |
OpenMP | 4.0 | For single-node scale-up |
Spectrum-MPI | 10.3 | For NVIDIA GPU cluster scale-out RDMA |
Python | 3.4 | For Python-API |
Pybind11 | 2.5.0 | For Python-API |
mpi4py | 3.0.3 | For Python-API cluster scale-out |
ROCm | 1.6.0 | For AMD GPU backend |
To build the scale-up version, OpenMP is required. To build the scale-out version, MPI with GPUDirect support is required (we have only tested with IBM XL and Spectrum-MPI on Summit).
The QDK/QIR has additional dependency requirements. For the ORNL Summit HPC, please check the settings file: set_summit_qir_env.sh
Build
Please configure the Makefile for the targets, then use the following command for compilation:
make
The default Python version is Python-2.7. If you are using the simulator with another Python version, adjust the Makefile accordingly. Note: if you need Python-3, say Python-3.7, you may need to adjust the "-lpython3.7" linker flag in the compiler options before running make.
Execution
The NVIDIA GPU backend requires NVIDIA GPUs for execution. We have tested it on the Tesla-P100 (Pascal, CC-6.0), Tesla-V100 (Volta, CC-7.0) and RTX2080 (Turing, CC-7.5). To run on scale-up workstations (e.g., DGX-1 and DGX-2), all GPUs must be directly connected by NVLink, NVSwitch or NV-SLI for all-to-all communication (needed for the adjoint operation when transposing the density matrix). Therefore, on DGX-1 it can use up to 4 GPUs (despite 8 in total), and only those that are directly interconnected; see our TPDS evaluation paper on GPU interconnects for details. Scale-out GPU clusters require GPUDirect-RDMA support for direct GPU-memory access. On the ORNL Summit supercomputer, this can be enabled by --smpiargs="-gpu". See the example .lsf file.
Single GPU or single-node-multi-GPUs using C++/CUDA APIs
Writing a CUDA circuit code using DM-Sim C++/CUDA APIs can be simple:
#include "util.cuh"
#include "gate_omp.cuh"
using namespace DMSim;
int main()
{
int n_qubits = 10;
int n_gpus = 4;
sim.append(Simulation::X(0)); //add a Poly-X gate
sim.append(Simulation::H(1)); //add a Hadamard gate
sim.upload(); //upload to GPU
sim.sim(); //simulate
auto res = sim.measure(5); //measure with 5 repetitions
print_measurement(res, 10, 5); //print results
}
When you have the circuit driver, compile it and run it with the following command:
./adder_n10_omp
Single GPU or single-node-multi-GPUs using Python APIs
Writing a Python circuit code using the DM-Sim Python APIs can be even simpler:
import dmsim_py_omp_wrapper as dmsim_omp
n_qubits = 10
n_gpus = 4
sim = dmsim_omp.Simulation(n_qubits, n_gpus)
sim.append(sim.X(0)) #add an X gate
sim.append(sim.H(1)) #add an H gate
sim.upload() #upload to GPU
sim.run() #run
sim.clear_circuit() #clear existing circuit
sim.append(sim.H(0)) #add a new H gate
sim.upload() #upload to GPU
sim.run() #run new circuit on original states
res = sim.measure(10) #measure with 10 repetitions and return in a list
Then run it with the following command:
python adder_n10_omp.py
Scale-out
These are the execution commands on the ORNL Summit supercomputer (8 resource sets with 8 MPI ranks, 1 GPU per rank) with GPUDirect-RDMA enabled, using the Python APIs and the C++/CUDA APIs respectively.
jsrun -n8 -a1 -g1 -c1 --smpiargs="-gpu" python -m mpi4py adder_n10_mpi.py 10
jsrun -n8 -a1 -g1 -c1 --smpiargs="-gpu" ./adder_n10_mpi
For the Python version, the trailing 10 is the number of qubits. For the C++/CUDA version, the number of qubits is set in the code.
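For reference, here is a minimal sketch of what such an MPI Python driver might look like. It is an assumption-laden illustration: the module name dmsim_py_mpi_wrapper and the constructor arguments are modeled on the OpenMP example above, so check adder_n10_mpi.py in src for the actual code.

import sys
from mpi4py import MPI                     # MPI runtime, launched via "python -m mpi4py"
import dmsim_py_mpi_wrapper as dmsim_mpi   # assumed module name for the MPI wrapper

n_qubits = int(sys.argv[1])                # the trailing 10 on the jsrun command line
comm = MPI.COMM_WORLD
sim = dmsim_mpi.Simulation(n_qubits, comm.Get_size())  # one GPU per MPI rank
sim.append(sim.H(0))                       # build the circuit on every rank
sim.upload()                               # upload to the GPUs
sim.run()                                  # simulate collectively
res = sim.measure(10)                      # measure with 10 repetitions
if comm.Get_rank() == 0:                   # report once from the root rank
    print(res)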
Expected Output
Building and executing the adder example, which realizes a ripple-carry adder using 10 qubits in total (here on 4 GPUs), should print the following output:
============== DM-Sim ===============
nqubits:10, ngates:30, ngpus:4, comp:11.685 ms, comm:0.777 ms, sim:12.462 ms, mem:32.000 MB, mem_per_gpu:8.000 MB
=====================================
=============== Measurement (qubits=10, gates=30, tests=10) ================
Test-0: 1000000010
Test-1: 1000000010
Test-2: 1000000010
Test-3: 1000000010
Test-4: 1000000010
Test-5: 1000000010
Test-6: 1000000010
Test-7: 1000000010
Test-8: 1000000010
Test-9: 1000000010
The inputs are: carry-in cin = 0, A=0001, B=1111. The outputs are: B=B+A=0000, carry-out=1.
- "nqubits" is the number of qubits simulated.
- "ngates" is the number of gates executed.
- "ngpus" is the number of GPUs utilized.
- "comp" is the computation time (ms) in the simulation.
- "comm" is the communication time (ms) in the simulation.
- "sim" is total simulation latency (ms).
- "mem" is the total GPU memory cost for all GPUs (in MBs).
- "mem" is the per-GPU memory usage (in MBs).
The measurement measures all qubits at once; "repetition" refers to the number of repeated measurements. You can configure the number of repetitions when calling "measure()" in both the C++/CUDA and Python APIs. The default is 10 repetitions.
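As an illustration, here is a hedged sketch of tallying the repeated measurements, continuing the OpenMP Python example above; that measure() returns a list of per-repetition outcomes is an assumption to verify against the wrapper.

from collections import Counter
import dmsim_py_omp_wrapper as dmsim_omp

sim = dmsim_omp.Simulation(10, 4)   # 10 qubits on 4 GPUs
sim.append(sim.H(0))                # put qubit 0 into superposition
sim.upload()
sim.run()
res = sim.measure(10)               # 10 repetitions over all qubits
print(Counter(res))                 # frequency of each measured outcome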
More Configurations
To simulate more than 15 qubits, the density-matrix index exceeds the range of a normal unsigned integer, so you need to define IdxType as "unsigned long long" in "config.hpp". ValType is double by default.
When "CUDA_ERROR_CHECK" is defined, DM-Sim checks CUDA API errors and kernel execution errors.
Performance
DM-Sim is bound by GPU memory bandwidth, and possibly by interconnect bandwidth. We use the Roofline model to show these bounds. The real sustainable bandwidth is profiled using the Roofline Toolkit from LBNL. The following figure shows the Roofline model for the simulation on the SLI, DGX-1P, DGX-1V and DGX-2 systems. See the files in the artifact folder. AI stands for the arithmetic intensity of the DM simulation.
We show the performance of the simulation when increasing the number of qubits (256 gates):
We show the performance of the simulation when increasing the number of gates (14 qubits):
And the performance bounds on computation, memory access and communication:
Performance for deep circuits on DGX-2 using 16 GPUs and 15 qubits with the general 1-qubit gate (i.e., the C1 gate):
Gates | Computation | Communication | Simulation | Time/Gate |
---|---|---|---|---|
10K | 53.8s | 9.36ms | 53.8s | 5.38ms |
100K | 558.0s | 7.31ms | 558.0s | 5.58ms |
1M | 5645.5s | 7.21ms | 5645.5s | 5.65ms |
Performance on the ORNL Summit supercomputer; the numbers on the bars indicate the number of GPUs utilized. For benchmarks, please see QASMBench. Clearly, the communication overhead is much more significant than in the scale-up setting.
<img src="img/summit.png" width="500">

Support Tools
dmsim_qasm.py
To translate an OpenQASM file (e.g., vqe_uccsd_n8.qasm) into a DM-Sim Python file (e.g., vqe_uccsd_n8.py):
python dmsim_qasm.py -i vqe_uccsd_n8.qasm -o vqe_uccsd_n8.py
It outputs the target "vqe_uccsd_n8.py" and reports the number of qubits, the number of gates, and the number of CX/CNOT gates. Currently, it generates the OpenMP version of the Python code.
More Benchmarks
We have developed an OpenQASM-based benchmark suite called "QASMBench" that provides more real quantum circuit benchmarks. Please see our QASMBench paper for details.
OpenQASM
OpenQASM (Open Quantum Assembly Language) is a low-level quantum intermediate representation (IR) for quantum instructions, similar to traditional hardware description languages (HDLs) like Verilog and VHDL. OpenQASM is the open-source, unified low-level assembly language for the IBM quantum machines publicly available on the cloud, and it has been investigated and verified by many existing research works. Several popular quantum software frameworks use OpenQASM as one of their output formats, including Qiskit, Cirq, Scaffold and ProjectQ.
Qiskit
The Quantum Information Software Kit (Qiskit) is a Python-based quantum software framework developed by IBM. OpenQASM can be generated from Qiskit via:
QuantumCircuit.qasm()
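For instance, a minimal sketch using the legacy Qiskit API (QuantumCircuit.qasm() returns the OpenQASM 2.0 text; newer Qiskit releases moved this functionality to qiskit.qasm2):

from qiskit import QuantumCircuit

qc = QuantumCircuit(2)
qc.h(0)            # Hadamard on qubit 0
qc.cx(0, 1)        # CNOT: control 0, target 1
print(qc.qasm())   # OpenQASM 2.0 text, which dmsim_qasm.py can consume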
Cirq
Cirq is a quantum software framework from Google. OpenQASM can be generated from Cirq (not fully compatible) via:
cirq.Circuit.to_qasm()
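For instance, a minimal sketch (cirq.Circuit.to_qasm() returns the QASM string):

import cirq

q0, q1 = cirq.LineQubit.range(2)
circuit = cirq.Circuit([cirq.H(q0), cirq.CNOT(q0, q1)])
print(circuit.to_qasm())  # OpenQASM output; as noted above, not fully compatible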
Scaffold
Scaffold is a quantum programming language embedded in the C/C++ programming language based on the LLVM compiler toolchain. A Scaffold program can be compiled by Scaffcc to OpenQASM via the "-b" compiler option.
ProjectQ
ProjectQ is a quantum software platform developed by Steiger et al. from ETH Zurich. ProjectQ can generate OpenQASM when using IBM quantum machines as the backend:
IBMBackend.get_qasm()
Authors
Ang Li, Senior Computer Scientist, Pacific Northwest National Laboratory (PNNL)
Sriram Krishnamoorthy, Lab Fellow, Pacific Northwest National Laboratory (PNNL)
We are currently collaborating with the Microsoft Quantum team (Alan Geller, Bettina Heim, Irina Yatsenko, Guen Prawiroatmodjo, Martin Roetteler) on improving the pipeline from Q# to QIR to DM-Sim. Many thanks for their strong support.
Citation format
If you find DM-Sim useful, please cite our SC-20 paper:
- Ang Li, Omer Subasi, Xiu Yang, and Sriram Krishnamoorthy. "Density Matrix Quantum Circuit Simulation via the BSP Machine on Modern GPU Clusters." In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.
Bibtex:
@inproceedings{li2020density,
title={Density Matrix Quantum Circuit Simulation via the BSP Machine on Modern GPU Clusters},
author={Li, Ang and Subasi, Omer and Yang, Xiu and Krishnamoorthy, Sriram},
booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
year={2020}
}
License
This project is licensed under the BSD License; see the LICENSE file for details.
Acknowledgments
PNNL-IPID: 31919-E, ECCN: EAR99, IR: PNNL-SA-143160
This project is currently supported by the Quantum Science Center (QSC). It was originally supported by PNNL's Quantum Algorithms, Software, and Architectures (QUASAR) LDRD Initiative. The Pacific Northwest National Laboratory (PNNL) is operated by Battelle for the U.S. Department of Energy (DOE) under contract DE-AC05-76RL01830.
Contributing
Please contact us if you'd like to contribute to DM-Sim. See the contact information in our paper or on my webpage.