DM-Sim: Density Matrix Quantum Circuit Simulation Environment (Merged in NWQSim)

A Density Matrix Quantum Simulation Environment for Single-GPU/CPU, Single-Node-Multi-GPUs/CPUs and Multi-Nodes GPU/CPU Cluster. It supports Intel/AMD/IBM CPUs, NVIDIA/AMD GPUs.

Current version

Latest version: 2.5

Version-2.5 Updates:

Version-2.4 Updates:

Version-2.3 Updates:

Version-2.2 Updates:

Version-2.1 Updates:

Version-2.0 Updates:

If you're looking for the implementation we described in our SC-20 paper, please see the V1.0 release. DM-Sim is under active development, and we will continue to add new features. Please report any bugs and suggest features. Questions and suggestions are welcome.

About DM-Sim

Please see our SuperComputing (SC-20) paper for details. The paper was nominated for the Best Paper Award at SC-20.

In this repository you will find a CUDA/C++ implementation for simulating deep quantum circuits on a single GPU/CPU, a single node with multiple GPUs/CPUs (e.g., NVIDIA DGX-1, DGX-2 and HGX), and a multi-node GPU/CPU cluster (like the Summit supercomputer at ORNL) using full density matrices. DM-Sim fully supports the OpenQASM intermediate-representation (IR) language (see spec). OpenQASM can be generated by Qiskit, Cirq, ProjectQ and Scaffold (see below). DM-Sim also supports Q#/QDK through QIR. For scale-up (i.e., single-node-multi-GPUs), we leverage fast intra-node interconnects such as NVLink, NV-SLI and NVSwitch (see our benchmarking paper and evaluation paper about several modern GPU interconnects). The simulator is based on the Multi-GPU-BSP (MG-BSP) model; please see our SuperComputing-20 paper for details. Here is the video presentation on YouTube:

Watch the video

DM-Sim (OpenMP) simulates 1M general gates on 15 qubits, gate by gate, in 94 minutes on a DGX-2 (16 NVIDIA V100 GPUs) using the density operator, averaging 5.6 ms/gate. DM-Sim simulates an 8-qubit VQE-UCCSD circuit with 10808 gates in 249.3 ms on a single NVIDIA V100 GPU, averaging 0.023 ms/gate.
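To make "using the density operator" concrete, here is a minimal pure-Python sketch of the update rule a density-matrix simulator applies for a unitary gate, rho' = U rho U†. It is only an illustration on one qubit; the helper names (matmul, dagger, apply_gate) are hypothetical and do not reflect how DM-Sim implements gates on the GPU.

```python
import math

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dagger(u):
    """Conjugate transpose of a square complex matrix."""
    n = len(u)
    return [[u[j][i].conjugate() for j in range(n)] for i in range(n)]

def apply_gate(rho, u):
    """Density-matrix update for a unitary gate: rho' = U rho U^dagger."""
    return matmul(matmul(u, rho), dagger(u))

# One qubit prepared in |0>, i.e. rho = |0><0|.
rho = [[1 + 0j, 0j], [0j, 0j]]

# Hadamard gate.
h = 1 / math.sqrt(2)
H = [[h + 0j, h + 0j], [h + 0j, -h + 0j]]

rho = apply_gate(rho, H)

# The diagonal of rho holds the measurement probabilities of |0> and |1>.
probs = [rho[i][i].real for i in range(2)]
trace = sum(probs)
print(probs, trace)
```

An n-qubit density matrix has 2^n x 2^n entries, which is why the memory footprint grows as 4^n and why multi-GPU distribution becomes necessary quickly.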

Supported Gates

| Gates | Meaning | Gates | Meaning |
|-------|---------|-------|---------|
| U3 | 3 parameter 2 pulse 1-qubit | CY | Controlled Y |
| U2 | 2 parameter 1 pulse 1-qubit | SWAP | Swap |
| U1 | 1 parameter 0 pulse 1-qubit | CH | Controlled H |
| CX | Controlled-NOT | CCX | Toffoli |
| ID | Idle gate or identity | CSWAP | Fredkin |
| X | Pauli-X bit flip | CRX | Controlled RX rotation |
| Y | Pauli-Y bit and phase flip | CRY | Controlled RY rotation |
| Z | Pauli-Z phase flip | CRZ | Controlled RZ rotation |
| H | Hadamard | CU1 | Controlled phase rotation |
| S | sqrt(Z) phase | CU3 | Controlled U3 |
| SDG | conjugate of sqrt(Z) | RXX | 2-qubit XX rotation |
| T | sqrt(S) phase | RZZ | 2-qubit ZZ rotation |
| TDG | conjugate of sqrt(S) | RCCX | Relative-phase CCX |
| RX | X-axis rotation | RC3X | Relative-phase 3-controlled X |
| RY | Y-axis rotation | C3X | 3-controlled X |
| RZ | Z-axis rotation | C3XSQRTX | 3-controlled sqrt(X) |
| CZ | Controlled phase | C4X | 4-controlled X |
| W | W gate | RYY | 2-qubit YY rotation |
| C1 | Arbitrary 1-qubit gate | C2 | Arbitrary 2-qubit gate |
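The "sqrt" relations in the table can be checked numerically. The sketch below assumes the standard textbook matrices for Z, S, SDG and T (the table itself lists no matrices, so these definitions are an assumption) and verifies that S squared is Z, T squared is S, and SDG undoes S. The helper names are hypothetical and unrelated to the DM-Sim API.

```python
import cmath

def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def close(a, b, tol=1e-12):
    """Element-wise comparison of two 2x2 matrices."""
    return all(abs(a[i][j] - b[i][j]) < tol
               for i in range(2) for j in range(2))

# Standard single-qubit phase gates (assumed definitions).
I2 = [[1, 0], [0, 1]]
Z = [[1, 0], [0, -1]]
S = [[1, 0], [0, 1j]]
SDG = [[1, 0], [0, -1j]]
T = [[1, 0], [0, cmath.exp(1j * cmath.pi / 4)]]

s_squared_is_z = close(matmul2(S, S), Z)      # S = sqrt(Z)
t_squared_is_s = close(matmul2(T, T), S)      # T = sqrt(S)
sdg_inverts_s = close(matmul2(S, SDG), I2)    # SDG is the conjugate of S

print(s_squared_is_z, t_squared_is_s, sdg_inverts_s)
```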

Package Structure

src: DM-Sim source file

benchmark:

tool: Supporting tools (will add support for other quantum languages).

summit: The files that are useful for running on ORNL Summit supercomputer

artifact: System configuration for the evaluation performed in our paper.

These are generated by using

img: images for the Repo.

Configuration

You may need to update "src/Makefile" to configure your NVCC path and GPU architecture (e.g., -arch=sm_60 for P100, -arch=sm_70 for V100 and -arch=sm_80 for A100 GPUs). We need C++11 support (-std=c++11).

CC = nvcc
FLAGS = -O3 -arch=sm_70 -std=c++11 -rdc=true
LIBS = -lm

Prerequisites

DM-Sim requires the following packages.

| Dependency | Version | Comments |
|------------|---------|----------|
| CUDA | 10.0 or later | For NVIDIA GPU backend |
| GCC (or XL) | 5.2 or later (16.01 for xlc) | |
| OpenMP | 4.0 | For single-node scale-up |
| Spectrum-MPI | 10.3 | For NVIDIA GPU cluster scale-out RDMA |
| Python | 3.4 | For Python-API |
| Pybind11 | 2.5.0 | For Python-API |
| mpi4py | 3.0.3 | For Python-API cluster scale-out |
| ROCM | 1.6.0 | For AMD GPU backend |

Building the scale-up version requires OpenMP. Building the scale-out version requires MPI with GPUDirect support (we have only tested IBM XL and Spectrum-MPI on Summit).

The QDK/QIR path has additional dependency requirements. For ORNL Summit HPC, please check the setting file: set_summit_qir_env.sh

Build

Please configure the Makefile for the targets, then use the following command for compilation:

make 

The default Python version is Python-2.7. If you use the simulator with another Python version, adjust the Makefile accordingly. Note that if you need Python 3, say Python 3.7, you may need to remove "-lpython3.7" from the compiler options before running make.

Execution

DM-Sim requires NVIDIA GPUs for execution. We have tested it on Tesla-P100 (Pascal, CC-6.0), Tesla-V100 (Volta, CC-7.0) and RTX2080 (Turing, CC-7.5). To run on scale-up workstations (e.g., DGX-1 and DGX-2), all the GPUs must be directly connected by NVLink, NVSwitch or NV-SLI for all-to-all communication (needed by the adjoint operation, which transposes the density matrix). Therefore, on DGX-1, DM-Sim can use up to 4 GPUs (out of 8 in total), provided they are directly interconnected; see our TPDS evaluation paper on GPU interconnects for details. For scale-out GPU clusters, DM-Sim requires GPUDirect-RDMA support for direct GPU-memory access. On the ORNL Summit supercomputer, this can be enabled with --smpiargs="-gpu". See the example .lsf file.

Single GPU or single-node-multi-GPUs using C++/CUDA APIs

Writing a CUDA circuit code using DM-Sim C++/CUDA APIs can be simple:

#include "util.cuh"
#include "gate_omp.cuh"
using namespace DMSim;

int main()
{
    int n_qubits = 10;
    int n_gpus = 4;
    Simulation sim(n_qubits, n_gpus); //create the simulator (arguments assumed to mirror the Python API)
    sim.append(Simulation::X(0)); //add a Pauli-X gate
    sim.append(Simulation::H(1)); //add a Hadamard gate
    sim.upload(); //upload to GPU
    sim.sim(); //simulate
    auto res = sim.measure(5); //measure with 5 repetitions
    print_measurement(res, n_qubits, 5); //print results
    return 0;
}

When you have the circuit driver, compile and use the following command for execution:

./adder_n10_omp

Single GPU or single-node-multi-GPUs using Python APIs

Writing a Python circuit code using the DM-Sim Python APIs can be even simpler:

import dmsim_py_omp_wrapper as dmsim_omp
n_qubits = 10
n_gpus = 4
sim = dmsim_omp.Simulation(n_qubits, n_gpus)
sim.append(sim.X(0)) #add an X gate
sim.append(sim.H(1)) #add an H gate
sim.upload() #upload to GPU
sim.run() #run
sim.clear_circuit() #clear existing circuit
sim.append(sim.H(0)) #add a new H gate
sim.upload() #upload to GPU
sim.run() #run new circuit on original states
res = sim.measure(10) #measure with 10 repetitions and return in a list

When you have the circuit script, use the following command for execution:

python adder_n10_omp.py

Scale-out

This is the execution command on ORNL Summit supercomputer (8 resource sets with 8 MPI ranks, 1 GPU per rank) with GPUDirect-RDMA enabled using Python APIs and C++/CUDA APIs.

jsrun -n8 -a1 -g1 -c1 --smpiargs="-gpu" python -m mpi4py adder_n10_mpi.py 10 
jsrun -n8 -a1 -g1 -c1 --smpiargs="-gpu" ./adder_n10_mpi

For the Python version, 10 means the number of qubits used. For the C++/CUDA version, it is written in the code.

Expected Output

When built and executed, the adder example, which realizes a ripple-carry adder using 10 qubits in total on a single GPU, should print the following output:

============== DM-Sim ===============
nqubits:10, ngates:30, ngpus:4, comp:11.685 ms, comm:0.777 ms, sim:12.462 ms, mem:32.000 MB, mem_per_gpu:8.000 MB
=====================================

===============  Measurement (qubits=10, gates=30, tests=10) ================
Test-0: 1000000010
Test-1: 1000000010
Test-2: 1000000010
Test-3: 1000000010
Test-4: 1000000010
Test-5: 1000000010
Test-6: 1000000010
Test-7: 1000000010
Test-8: 1000000010
Test-9: 1000000010

The inputs are: carry-in cin = 0, A=0001, B=1111. The outputs are: B=B+A=0000, carry-out=1.
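The arithmetic behind these inputs and outputs can be reproduced with a classical ripple-carry adder. This is only a classical cross-check of the expected result, not the quantum circuit; the function name and the choice to write bit strings most-significant-bit first are assumptions made for illustration.

```python
def ripple_carry_add(a_bits, b_bits, carry_in=0):
    """Add two equal-length bit strings (MSB first).

    Returns (sum_bits, carry_out), propagating the carry from the
    least significant bit upward, one full adder per bit position.
    """
    carry = carry_in
    out = []
    for a_ch, b_ch in zip(reversed(a_bits), reversed(b_bits)):
        a, b = int(a_ch), int(b_ch)
        out.append(str(a ^ b ^ carry))          # sum bit of the full adder
        carry = (a & b) | (carry & (a ^ b))     # carry into the next position
    return "".join(reversed(out)), carry

total, carry_out = ripple_carry_add("0001", "1111", carry_in=0)
print(total, carry_out)  # 0000 1
```

With A=0001 and B=1111 the 4-bit sum wraps to 0000 and the carry-out is 1, matching the values stated above.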

A measurement covers all qubits at once. "Repetitions" refers to the number of repeated measurements. You can set this number when calling "measure()" in both the C++/CUDA API and the Python API. The default is 10.

More Configurations

To simulate more than 15 qubits, define IdxType as "unsigned long long" in "config.hpp": a 16-qubit density matrix already has 4^16 = 2^32 elements, which exceeds the range of a 32-bit unsigned integer. ValType is double by default.
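The 15-qubit threshold follows from counting density-matrix elements. The sketch below assumes a 32-bit "normal" unsigned integer and treats the element count (4^n for n qubits) as the quantity that must fit in the index type; the function names are hypothetical and not part of DM-Sim.

```python
UINT32_MAX = 2**32 - 1  # largest value of a 32-bit unsigned integer

def density_matrix_elements(n_qubits):
    """A density matrix on n qubits is 2^n x 2^n, i.e. 4^n elements."""
    return 4 ** n_qubits

def needs_64bit_index(n_qubits):
    """True when the element count no longer fits in 32 unsigned bits."""
    return density_matrix_elements(n_qubits) > UINT32_MAX

for n in (14, 15, 16):
    print(n, density_matrix_elements(n), needs_64bit_index(n))
```

For 15 qubits the count is 2^30 and still fits; for 16 qubits it is 2^32, one more than the 32-bit maximum.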

When "CUDA_ERROR_CHECK" is defined, DM-Sim checks for CUDA API errors and kernel execution errors.

Performance

DM-Sim is bounded by GPU memory access bandwidth, and possibly by interconnect bandwidth. We use the Roofline model to show the bound. The real sustainable bandwidth is profiled using the Roofline Toolkit from LBNL. The following figure shows the Roofline model for the simulation on SLI, DGX-1P, DGX-1V and DGX-2 systems. See the files in the artifact folder. AI stands for the arithmetic intensity of the DM simulation.

[Figure: Roofline model for SLI, DGX-1P, DGX-1V and DGX-2]
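For readers unfamiliar with the Roofline model, the bound is the smaller of the peak compute rate and the product of arithmetic intensity and sustained memory bandwidth. The sketch below illustrates that formula only; the peak and bandwidth figures are hypothetical placeholders, not measurements from the paper or the artifact folder.

```python
def roofline_bound(ai, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s = min(peak compute, AI * memory bandwidth).

    ai is the arithmetic intensity in FLOPs per byte moved.
    """
    return min(peak_gflops, ai * bandwidth_gbs)

# Hypothetical device figures, chosen only to illustrate the formula.
PEAK_GFLOPS = 7000.0     # assumed peak double-precision rate
BANDWIDTH_GBS = 900.0    # assumed sustained memory bandwidth

# Below this intensity a kernel is memory-bound, above it compute-bound.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GBS

low_ai_bound = roofline_bound(0.5, PEAK_GFLOPS, BANDWIDTH_GBS)
high_ai_bound = roofline_bound(20.0, PEAK_GFLOPS, BANDWIDTH_GBS)
print(ridge_point, low_ai_bound, high_ai_bound)
```

A kernel whose arithmetic intensity sits left of the ridge point is limited by memory bandwidth, which is the regime described above for DM-Sim.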

We show the performance of simulation by increasing the number of qubits (256 gates):

[Figure: simulation performance versus number of qubits (256 gates)]

We show the performance of simulation by increasing the number of gates (14 qubits):

[Figure: simulation performance versus number of gates (14 qubits)]

And performance bound on computation, memory access and communication:

[Figure: performance bounds on computation, memory access and communication]

Performance for deep circuits on DGX-2 using 16 GPUs and 15 qubits with general 1-qubit gates (i.e., the C1 gate):

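| Gates | Computation | Communication | Simulation | Time/Gate |
|-------|-------------|---------------|------------|-----------|
| 10K | 53.8s | 9.36ms | 53.8s | 5.38ms |
| 100K | 558.0s | 7.31ms | 558.0s | 5.58ms |
| 1M | 5645.5s | 7.21ms | 5645.5s | 5.65ms |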
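The Time/Gate column is simply the simulation time divided by the gate count. A short check of that arithmetic, using only the figures from the table (the variable names are illustrative):

```python
# (gate count, total simulation time in seconds) taken from the table.
rows = [(10_000, 53.8), (100_000, 558.0), (1_000_000, 5645.5)]

# Convert each total to milliseconds per gate.
ms_per_gate = [seconds / gates * 1000 for gates, seconds in rows]

for (gates, _), ms in zip(rows, ms_per_gate):
    print(f"{gates:>9} gates: {ms:.2f} ms/gate")

# The 1M-gate run expressed in minutes.
minutes_for_1m = rows[-1][1] / 60
print(f"{minutes_for_1m:.1f} minutes")
```

The per-gate cost stays nearly constant as the circuit deepens, so simulation time scales roughly linearly with the number of gates.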

The figure below shows performance on the ORNL Summit supercomputer; the numbers on the bars indicate the number of GPUs used. For the benchmarks, please see QASMBench. Clearly, the communication overhead is much more significant than in the scale-up case.

<img src="img/summit.png" width="500">

Support Tools

dmsim_qasm.py

To translate an OpenQASM (e.g., vqe_uccsd_n8.qasm) to a DM-Sim python file (e.g., vqe_uccsd_n8.py):

python dmsim_qasm.py -i vqe_uccsd_n8.qasm -o vqe_uccsd_n8.py

It outputs the target "vqe_uccsd_n8.py" and reports the number of qubits, the number of gates, and the number of CX/CNOT gates. Currently, it generates the OpenMP version python code.
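To illustrate the kind of statistics reported, here is a minimal sketch that counts qubits, gates and CX gates in an OpenQASM 2.0 string. It is not the logic of dmsim_qasm.py: the function name is hypothetical, it assumes one statement per line with no custom gate definitions, and it does not count measure or barrier statements as gates, which may differ from the actual tool.

```python
import re

# Statements that are not counted as gates in this sketch.
NON_GATE_PREFIXES = ("OPENQASM", "include", "qreg", "creg", "measure", "barrier")

def qasm_stats(qasm_text):
    """Return (n_qubits, n_gates, n_cx) for a simple OpenQASM 2.0 program."""
    n_qubits = n_gates = n_cx = 0
    for raw in qasm_text.splitlines():
        line = raw.split("//")[0].strip()   # drop comments and whitespace
        if not line:
            continue
        reg = re.match(r"qreg\s+\w+\[(\d+)\]", line)
        if reg:
            n_qubits += int(reg.group(1))   # sum sizes of all quantum registers
            continue
        if line.startswith(NON_GATE_PREFIXES):
            continue
        n_gates += 1
        name = re.match(r"[A-Za-z_]\w*", line)
        if name and name.group(0).lower() == "cx":
            n_cx += 1
    return n_qubits, n_gates, n_cx

example = """OPENQASM 2.0;
include "qelib1.inc";
qreg q[3];
creg c[3];
h q[0];
cx q[0],q[1];
cx q[1],q[2];
measure q -> c;
"""

stats = qasm_stats(example)
print(stats)  # (3, 3, 2)
```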

python dmsim_qasm_ass.py -i adder.qasm -o circuit.cuh -s omp

More Benchmarks

We have developed an OpenQASM based benchmark suite called "QASMBench" which provides more real quantum circuit benchmarks. Please see our QASMBench paper for details.

OpenQASM

OpenQASM (Open Quantum Assembly Language) is a low-level quantum intermediate representation (IR) for quantum instructions, similar to traditional hardware description languages (HDLs) like Verilog and VHDL. OpenQASM is the open-source unified low-level assembly language for IBM quantum machines publicly available on the cloud that have been investigated and verified by many existing research works. Several popular quantum software frameworks use OpenQASM as one of their output formats, including Qiskit, Cirq, Scaffold, ProjectQ, etc.

Qiskit

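The Quantum Information Software Kit (Qiskit) is a quantum software framework developed by IBM. It is based on Python. In Qiskit releases before 1.0, OpenQASM can be generated via the method below; Qiskit 1.0 and later provide qiskit.qasm2.dumps(circuit) instead: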

QuantumCircuit.qasm()

Cirq

Cirq is a quantum software framework from Google. OpenQASM can be generated from Cirq (not fully compatible) via:

cirq.Circuit.to_qasm()

Scaffold

Scaffold is a quantum programming language embedded in the C/C++ programming language based on the LLVM compiler toolchain. A Scaffold program can be compiled by Scaffcc to OpenQASM via the "-b" compiler option.

ProjectQ

ProjectQ is a quantum software platform developed by Steiger et al. from ETH Zurich. The official website is here. ProjectQ can generate OpenQASM when using IBM quantum machines as the backends:

IBMBackend.get_qasm()

Authors

Ang Li, Senior Computer Scientist, Pacific Northwest National Laboratory (PNNL)

Sriram Krishnamoorthy, Lab Fellow, Pacific Northwest National Laboratory (PNNL)

We are currently collaborating with the Microsoft Quantum team (Alan Geller, Bettina Heim, Irina Yatsenko, Guen Prawiroatmodjo, Martin Roetteler) on improving the pipeline from Q# to QIR to DM-Sim. Many thanks for their strong support.

Citation format

If you find DM-Sim useful, please cite our SC-20 paper:

Bibtex:

@inproceedings{li2020density,
    title={Density Matrix Quantum Circuit Simulation via the BSP Machine on Modern GPU Clusters},
    author={Li, Ang and Subasi, Omer and Yang, Xiu and Krishnamoorthy, Sriram},
    booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
    year={2020}
}

License

This project is licensed under the BSD License, see LICENSE file for details.

Acknowledgments

PNNL-IPID: 31919-E, ECCN: EAR99, IR: PNNL-SA-143160

This project is currently supported by the Quantum Science Center (QSC). It was originally supported by PNNL's Quantum Algorithms, Software, and Architectures (QUASAR) LDRD Initiative. The Pacific Northwest National Laboratory (PNNL) is operated by Battelle for the U.S. Department of Energy (DOE) under contract DE-AC05-76RL01830.

Contributing

Please contact us if you'd like to contribute to DM-Sim. See the contact in our paper or my webpage.