Home

Awesome

Pytorch Custom CUDA kernel for searchsorted

This repository is an implementation of the searchsorted function to work for pytorch CUDA Tensors. Initially derived from the great C extension tutorial, but totally changed since then because building C extensions is not available anymore on pytorch 1.0.

Warnings:

Description

Implements a function searchsorted(a, v, out, side) that works just like the numpy version except that a and v are matrices.

the output is of size as (nrows, ncols_v). If all input tensors are on GPU, a cuda version will be called. Otherwise, it will be on CPU.

Disclaimers

Installation

Just pip install ., in the root folder of this repo. This will compile and install the torchsearchsorted module.

be careful that sometimes, nvcc needs versions of gcc and g++ that are older than those found by default on the system. If so, just create symbolic links to the right versions in your cuda/bin folder (where nvcc is)

For instance, on my machine, I had gcc and g++ v9 installed, but nvcc required v8. So I had to do:

sudo apt-get install g++-8 gcc-8
sudo ln -s /usr/bin/gcc-8 /usr/local/cuda-10.1/bin/gcc
sudo ln -s /usr/bin/g++-8 /usr/local/cuda-10.1/bin/g++

be careful that you need pytorch to be installed on your system. The code was tested on pytorch v1.5

Usage

Just import the torchsearchsorted package after installation. I typically do:

from torchsearchsorted import searchsorted

Testing

Under the examples subfolder, you may:

  1. try python test.py with torch available.
Looking for 50000x1000 values in 50000x300 entries
NUMPY:  searchsorted in 4851.592ms
CPU:  searchsorted in 4805.432ms
  difference between CPU and NUMPY: 0.000
GPU:  searchsorted in 1.055ms
  difference between GPU and NUMPY: 0.000

Looking for 50000x1000 values in 50000x300 entries
NUMPY:  searchsorted in 4333.964ms
CPU:  searchsorted in 4753.958ms
  difference between CPU and NUMPY: 0.000
GPU:  searchsorted in 0.391ms
  difference between GPU and NUMPY: 0.000

The first run comprises the time of allocation, while the second one does not.

  1. You may also use the nice benchmark.py code written by @baldassarreFe, that tests searchsorted on many runs:
Benchmark searchsorted:
- a [5000 x 300]
- v [5000 x 100]
- reporting fastest time of 20 runs
- each run executes searchsorted 100 times

Numpy: 	4.6302046799100935
CPU: 	5.041533078998327
CUDA: 	0.0007955809123814106