D3DShot

D3DShot is a pure Python implementation of the Windows Desktop Duplication API. It leverages DXGI and Direct3D system libraries to enable extremely fast and robust screen capture functionality for your Python scripts and applications on Windows.

D3DShot:

TL;DR Quick Code Samples

Screenshot to Memory

import d3dshot

d = d3dshot.create()
d.screenshot()
Out[1]: <PIL.Image.Image image mode=RGB size=2560x1440 at 0x1AA7ECB5C88>

Screenshot to Disk

import d3dshot

d = d3dshot.create()
d.screenshot_to_disk()
Out[1]: './1554298682.5632973.png'

Screen Capture for 5 Seconds and Grab the Latest Frame

import d3dshot
import time

d = d3dshot.create()

d.capture()
time.sleep(5)  # Capture is non-blocking so we wait explicitly
d.stop()

d.get_latest_frame()
Out[1]: <PIL.Image.Image image mode=RGB size=2560x1440 at 0x1AA044BCF60>

Screen Capture the Second Monitor as NumPy Arrays for 3 Seconds and Grab the 4 Latest Frames as a Stack

import d3dshot
import time

d = d3dshot.create(capture_output="numpy")

d.display = d.displays[1]

d.capture()
time.sleep(3)  # Capture is non-blocking so we wait explicitly
d.stop()

frame_stack = d.get_frame_stack((0, 1, 2, 3), stack_dimension="last")
frame_stack.shape
Out[1]: (1080, 1920, 3, 4)

This is barely scratching the surface... Keep reading!

Requirements

Installation

pip install d3dshot

D3DShot leverages DLLs that are already available on your system so the dependencies are very light. Namely:

These dependencies will automatically be installed alongside D3DShot, so there is no need to worry about them!

Extra Step: Laptop Users

Windows has a quirk when using Desktop Duplication on hybrid-GPU systems. Please see the wiki article before attempting to use D3DShot on your system.

Concepts

Capture Outputs

The desired Capture Output is defined when creating a D3DShot instance. It defines the type of all captured images. By default, all captures will return PIL.Image objects. This is a good option if you mostly intend to take screenshots.

# Captures will be PIL.Image in RGB mode
d = d3dshot.create()
d = d3dshot.create(capture_output="pil")

D3DShot is however quite flexible! As your environment meets certain optional sets of requirements, more options become available.

If NumPy is available

# Captures will be np.ndarray of dtype uint8 with values in range (0, 255)
d = d3dshot.create(capture_output="numpy")

# Captures will be np.ndarray of dtype float64 with normalized values in range (0.0, 1.0)
d = d3dshot.create(capture_output="numpy_float")  

If NumPy and PyTorch are available

# Captures will be torch.Tensor of dtype uint8 with values in range (0, 255)
d = d3dshot.create(capture_output="pytorch")

# Captures will be torch.Tensor of dtype float64 with normalized values in range (0.0, 1.0)
d = d3dshot.create(capture_output="pytorch_float")

If NumPy and PyTorch are available + CUDA is installed and torch.cuda.is_available()

# Captures will be torch.Tensor of dtype uint8 with values in range (0, 255) on device cuda:0
d = d3dshot.create(capture_output="pytorch_gpu")

# Captures will be torch.Tensor of dtype float64 with normalized values in range (0.0, 1.0) on device cuda:0
d = d3dshot.create(capture_output="pytorch_float_gpu")

Trying to use a Capture Output for which your environment does not meet the requirements will result in an error.
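To avoid that error, you can check which optional packages are importable before choosing a capture output. The helper below is a minimal sketch and not part of D3DShot: `available_capture_outputs` is a hypothetical name, and it only covers the CPU outputs listed above. The GPU variants additionally depend on `torch.cuda.is_available()`, which this sketch deliberately does not probe.

```python
import importlib.util


def available_capture_outputs():
    """Return the CPU capture output names whose optional packages are importable.

    Hypothetical helper for illustration; D3DShot performs its own validation.
    """
    outputs = ["pil"]  # the default output needs no optional package
    has_numpy = importlib.util.find_spec("numpy") is not None
    has_torch = importlib.util.find_spec("torch") is not None
    if has_numpy:
        outputs += ["numpy", "numpy_float"]
    if has_numpy and has_torch:
        outputs += ["pytorch", "pytorch_float"]
    return outputs


print(available_capture_outputs())
```

The returned names can then be passed as the `capture_output` kwarg of `d3dshot.create()`.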

Singleton

Windows only allows 1 instance of Desktop Duplication per process. To stay within that limitation and avoid issues, the D3DShot class acts as a singleton. Any subsequent calls to d3dshot.create() will always return the existing instance.

d = d3dshot.create(capture_output="numpy")

# Attempting to create a second instance
d2 = d3dshot.create(capture_output="pil")
# Only 1 instance of D3DShot is allowed per process! Returning the existing instance...

# Capture output remains 'numpy'
d2.capture_output.backend
# Out[1]: <d3dshot.capture_outputs.numpy_capture_output.NumpyCaptureOutput at 0x2672be3b8e0>

d == d2
# Out[2]: True
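The behaviour shown above can be reproduced with a very small factory function. This is only a sketch of the general singleton pattern, assuming nothing about how D3DShot implements it internally; `Capturer` and this `create` are hypothetical stand-ins.

```python
class Capturer:
    """Hypothetical stand-in for a capture class; not D3DShot's real code."""

    _instance = None  # the single per-process instance, once created

    def __init__(self, capture_output):
        self.capture_output = capture_output


def create(capture_output="pil"):
    """Return the existing instance if there is one, otherwise build it."""
    if Capturer._instance is None:
        Capturer._instance = Capturer(capture_output)
    else:
        print("Only 1 instance is allowed per process! Returning the existing instance...")
    return Capturer._instance


first = create(capture_output="numpy")
second = create(capture_output="pil")  # kwargs are ignored: an instance already exists

print(first is second)        # True
print(second.capture_output)  # numpy
```

The practical consequence is the same as in the session above: decide on the capture output before the first call to the factory, because later calls cannot change it.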

Frame Buffer

When you create a D3DShot instance, a frame buffer is also initialized. It is meant as a thread-safe, first-in, first-out way to hold a certain quantity of captures and is implemented as a collections.deque.

By default, the size of the frame buffer is set to 60. You can customize it when creating your D3DShot object.

d = d3dshot.create(frame_buffer_size=100)
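To see what a bounded buffer of this kind does once it is full, here is a small standard-library sketch. It assumes new captures are pushed on the left so that index 0 is the most recent one, which is consistent with the quick sample that reads indices 0 to 3 as the 4 latest frames; the strings are stand-ins for real frames.

```python
from collections import deque

frame_buffer = deque(maxlen=3)  # tiny buffer so the eviction is easy to see

for frame_number in range(5):
    # Stand-in "frames"; real entries would be images, arrays or tensors.
    frame_buffer.appendleft(f"frame-{frame_number}")

print(list(frame_buffer))  # ['frame-4', 'frame-3', 'frame-2']
print(frame_buffer[0])     # frame-4
```

Once the buffer holds `maxlen` entries, each new capture silently drops the oldest one, so memory use stays bounded no matter how long a capture runs.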

Be mindful of RAM usage with larger values: you will be dealing with uncompressed images which use up to 100 MB each depending on the resolution.
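A quick back-of-the-envelope calculation shows where those numbers come from. The sketch below assumes 3 channels per pixel, 1 byte per value for the uint8 outputs and 8 bytes per value for the float64 outputs; `frame_bytes` is a hypothetical helper, not part of D3DShot.

```python
def frame_bytes(width, height, channels=3, bytes_per_value=1):
    """Size in bytes of one uncompressed frame."""
    return width * height * channels * bytes_per_value


uint8_frame = frame_bytes(2560, 1440)                       # uint8 values
float64_frame = frame_bytes(2560, 1440, bytes_per_value=8)  # float64 values

print(f"{uint8_frame / 1e6:.1f} MB per uint8 frame")        # 11.1 MB per uint8 frame
print(f"{float64_frame / 1e6:.1f} MB per float64 frame")    # 88.5 MB per float64 frame
print(f"{60 * uint8_frame / 1e6:.1f} MB for 60 uint8 frames")  # 663.6 MB for 60 uint8 frames
```

At 2560x1440, a default-sized buffer of float outputs would therefore need several gigabytes, which is worth keeping in mind before raising frame_buffer_size.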

The frame buffer can be accessed directly with d.frame_buffer, but using the utility methods instead is recommended.

The buffer is used by the following methods:

It is always automatically cleared before starting one of these operations.

Displays

When you create a D3DShot instance, your available displays will automatically be detected along with all their relevant properties.

d.displays
Out[1]: 
[<Display name=BenQ XL2730Z (DisplayPort) adapter=NVIDIA GeForce GTX 1080 Ti resolution=2560x1440 rotation=0 scale_factor=1.0 primary=True>,
 <Display name=BenQ XL2430T (HDMI) adapter=Intel(R) UHD Graphics 630 resolution=1920x1080 rotation=0 scale_factor=1.0 primary=False>]

By default, your primary display will be selected. At all times you can verify which display is set to be used for capture.

d.display
Out[1]: <Display name=BenQ XL2730Z (DisplayPort) adapter=NVIDIA GeForce GTX 1080 Ti resolution=2560x1440 rotation=0 scale_factor=1.0 primary=True>

Selecting another display for capture is as simple as setting d.display to another value from d.displays:

d.display = d.displays[1]
d.display
Out[1]: <Display name=BenQ XL2430T (HDMI) adapter=Intel(R) UHD Graphics 630 resolution=1080x1920 rotation=90 scale_factor=1.0 primary=False>

Display rotation and scaling are detected and handled for you by D3DShot:

Regions

All capture methods (screenshots included) accept an optional region kwarg. The expected value is a 4-length tuple of integers that is to be structured like this:

(left, top, right, bottom)  # values represent pixels

For example, if you want to only capture a 200px by 200px region offset by 100px from both the left and top, you would do:

d.screenshot(region=(100, 100, 300, 300))
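Because the tuple holds edge coordinates rather than a width and a height, it can help to compute the resulting size explicitly. `region_size` below is a hypothetical helper, and its validation rule is an assumption made for the sketch, not a description of how D3DShot checks regions.

```python
def region_size(region):
    """Return (width, height) in pixels for a (left, top, right, bottom) region."""
    left, top, right, bottom = region
    if right <= left or bottom <= top:
        raise ValueError(f"empty or inverted region: {region}")
    return right - left, bottom - top


print(region_size((100, 100, 300, 300)))  # (200, 200)
```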

If you are capturing a scaled display, the region will be computed against the full, non-scaled resolution.

If you go through the source code, you will notice that the region cropping happens after a full display capture. That might seem sub-optimal but testing has revealed that copying a region of the GPU D3D11Texture2D to the destination CPU D3D11Texture2D using CopySubresourceRegion is only faster when the region is very small. In fact, it doesn't take long for larger regions to actually start becoming slower than the full display capture using this method. To make things worse, it adds a lot of complexity by having the surface pitch not match the buffer size and treating rotated displays differently. It was therefore decided that it made more sense to stick to CopyResource in all cases and crop after the fact.
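Cropping after the fact is cheap to express. The sketch below uses plain nested lists as a stand-in frame so it runs without any dependency; `crop` is a hypothetical helper and not D3DShot's own code. With a NumPy capture output the equivalent would be a slice such as `frame[top:bottom, left:right]`.

```python
def crop(frame, region):
    """Crop a row-major frame (a sequence of pixel rows) to (left, top, right, bottom)."""
    left, top, right, bottom = region
    return [row[left:right] for row in frame[top:bottom]]


# A 4-row by 6-column stand-in frame whose "pixels" are (row, column) tuples.
full_frame = [[(r, c) for c in range(6)] for r in range(4)]
cropped = crop(full_frame, (1, 1, 4, 3))

print(len(cropped), len(cropped[0]))  # 2 3
print(cropped[0][0])                  # (1, 1)
```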

Usage

Create a D3DShot instance

import d3dshot

d = d3dshot.create()

create accepts 2 optional kwargs:

Do NOT import the D3DShot class directly and attempt to initialize it yourself! The create helper function initializes and validates a bunch of things for you behind the scenes.

Once you have a D3DShot instance in scope, you can start doing stuff with it!

List the detected displays

d.displays

Select a display for capture

Your primary display is selected by default, but if you have a multi-monitor setup, you can select another entry in d.displays:

d.display = d.displays[1]

Take a screenshot

d.screenshot()

screenshot accepts 1 optional kwarg:

Returns: A screenshot with a format that matches the capture output you selected when creating your D3DShot object

Take a screenshot and save it to disk

d.screenshot_to_disk()

screenshot_to_disk accepts 3 optional kwargs:

Returns: A string representing the full path to the saved image file

Take a screenshot every X seconds

d.screenshot_every(X)  # Where X is a number representing seconds

This operation is threaded and non-blocking. It will keep running until d.stop() is called. Captures are pushed to the frame buffer.

screenshot_every accepts 1 optional kwarg:

Returns: A boolean indicating whether or not the capture thread was started
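For readers unfamiliar with the "threaded and non-blocking" pattern, here is a generic standard-library sketch of a periodic grabber that pushes results to a bounded buffer until it is stopped. `PeriodicGrabber` and its methods are hypothetical and are not D3DShot's implementation; timestamps stand in for frames.

```python
import threading
import time
from collections import deque


class PeriodicGrabber:
    """Hypothetical sketch: call grab() every `interval` seconds on a worker thread."""

    def __init__(self, grab, buffer_size=60):
        self._grab = grab
        self.frame_buffer = deque(maxlen=buffer_size)
        self._stop_event = threading.Event()
        self._thread = None

    def start(self, interval):
        """Start capturing; return False if a capture is already running."""
        if self._thread is not None and self._thread.is_alive():
            return False
        self.frame_buffer.clear()  # start every run with an empty buffer
        self._stop_event.clear()
        self._thread = threading.Thread(target=self._run, args=(interval,), daemon=True)
        self._thread.start()
        return True

    def _run(self, interval):
        while not self._stop_event.is_set():
            self.frame_buffer.appendleft(self._grab())
            self._stop_event.wait(interval)  # sleeps, but wakes early on stop()

    def stop(self):
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()


grabber = PeriodicGrabber(grab=time.monotonic)  # timestamps stand in for frames
started = grabber.start(interval=0.01)
time.sleep(0.1)  # the caller is free to do other work here
grabber.stop()

print(started, len(grabber.frame_buffer) > 0)  # True True
```

Waiting on the event instead of calling `time.sleep` in the worker lets `stop()` take effect immediately, even with long intervals.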

Take a screenshot every X seconds and save it to disk

d.screenshot_to_disk_every(X)  # Where X is a number representing seconds

This operation is threaded and non-blocking. It will keep running until d.stop() is called.

screenshot_to_disk_every accepts 2 optional kwargs:

Returns: A boolean indicating whether or not the capture thread was started

Start a high-speed screen capture

d.capture()

This operation is threaded and non-blocking. It will keep running until d.stop() is called. Captures are pushed to the frame buffer.

capture accepts 2 optional kwargs:

Returns: A boolean indicating whether or not the capture thread was started

Grab the latest frame from the buffer

d.get_latest_frame()

Returns: A frame with a format that matches the capture output you selected when creating your D3DShot object

Grab a specific frame from the buffer

d.get_frame(X)  # Where X is the index of the desired frame. Needs to be < len(d.frame_buffer)

Returns: A frame with a format that matches the capture output you selected when creating your D3DShot object

Grab specific frames from the buffer

d.get_frames([X, Y, Z, ...])  # Where X, Y, Z are valid indices to desired frames

Returns: A list of frames with a format that matches the capture output you selected when creating your D3DShot object

Grab specific frames from the buffer as a stack

d.get_frame_stack([X, Y, Z, ...], stack_dimension="first|last")  # Where X, Y, Z are valid indices to desired frames

Only has an effect on NumPy and PyTorch capture outputs.

get_frame_stack accepts 1 optional kwarg:

Returns: A single array stacked on the specified dimension with a format that matches the capture output you selected when creating your D3DShot object. If the capture output is not stackable, returns a list of frames.
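The effect of stack_dimension on the resulting shape can be illustrated with plain NumPy. This sketch assumes the stack behaves like `np.stack` along the first or last axis, which matches the (1080, 1920, 3, 4) shape in the quick sample; the zero-filled arrays are stand-ins for real captures.

```python
import numpy as np

# Four stand-in 1080p RGB frames (zeros instead of real captures).
frames = [np.zeros((1080, 1920, 3), dtype=np.uint8) for _ in range(4)]

stacked_last = np.stack(frames, axis=-1)  # analogous to stack_dimension="last"
stacked_first = np.stack(frames, axis=0)  # analogous to stack_dimension="first"

print(stacked_last.shape)   # (1080, 1920, 3, 4)
print(stacked_first.shape)  # (4, 1080, 1920, 3)
```

A leading frame axis is usually the more convenient layout when the stack is fed to code that expects a batch or time dimension first.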

Dump the frame buffer to disk

The files will be named according to this convention: <frame buffer index>.png

d.frame_buffer_to_disk()

frame_buffer_to_disk accepts 1 optional kwarg:

Returns: None

Performance

Measuring the exact performance of the Windows Desktop Duplication API proves to be a little complicated because it will only return new texture data if the contents of the screen have changed. This is optimal for performance but it makes it difficult to express in terms of frames per second, the measurement people tend to expect for benchmarks. Ultimately the solution ended up being to run a high FPS video game on the display to capture, to make sure the screen contents are different at all times while benchmarking.
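The measurement problem can be made concrete with a small sketch. `measure_fps` and its `grab` callable are hypothetical: `grab` is assumed to return a frame, or None when nothing on screen changed since the previous call. This is not the benchmark code used for the numbers in this section.

```python
import time


def measure_fps(grab, duration=1.0):
    """Count how many new frames grab() yields per second over `duration` seconds."""
    new_frames = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        if grab() is not None:  # only frames with new content are counted
            new_frames += 1
    elapsed = time.perf_counter() - start
    return new_frames / elapsed


# A static "screen" never produces new frames, so it measures 0 FPS
# no matter how fast the capture code is.
print(measure_fps(lambda: None, duration=0.2))  # 0.0
```

This is why the benchmark needs constantly changing screen contents: otherwise the result reflects how often the screen updates rather than how fast frames can be captured.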

As always, remember that benchmarks are inherently flawed and highly depend on your individual hardware configuration and other circumstances. Use the numbers below as a relative indication of what to expect from D3DShot, not as some sort of absolute truth.

| Capture output | 2560x1440 on NVIDIA GTX 1080 Ti | 1920x1080 on Intel UHD Graphics 630 | 1080x1920 (vertical) on Intel UHD Graphics 630 |
| --- | --- | --- | --- |
| "pil" | 29.717 FPS | 47.75 FPS | 35.95 FPS |
| "numpy" | 57.667 FPS | 58.1 FPS | 58.033 FPS |
| "numpy_float" | 18.783 FPS | 29.05 FPS | 27.517 FPS |
| "pytorch" | 57.867 FPS | 58.1 FPS | 34.817 FPS |
| "pytorch_float" | 18.767 FPS | 28.367 FPS | 27.017 FPS |
| "pytorch_gpu" | 27.333 FPS | 35.767 FPS | 34.8 FPS |
| "pytorch_float_gpu" | 27.267 FPS | 37.383 FPS | 35.033 FPS |

The absolute fastest capture outputs appear to be "numpy" and unrotated "pytorch"; all averaging around 58 FPS. In Python land, this is FAST!

How is the "numpy" capture output performance that good?

NumPy arrays have a ctypes interface that can give you their raw memory address (X.ctypes.data). If you have the memory address and size of another byte buffer, which is what we end up with by processing what returns from the Desktop Duplication API, you can use ctypes.memmove to copy that byte buffer directly to the NumPy structure, effectively bypassing as much Python as possible.

In practice it ends up looking like this:

frame = np.empty((size,), dtype=np.uint8)  # destination array, kept so it can be used afterwards
ctypes.memmove(frame.ctypes.data, pointer, size)

This low-level operation is extremely fast, leaving everything else that would normally compete with NumPy in the dust.
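Here is a self-contained sketch of that copy that runs without a display. A small ctypes byte array stands in for the memory returned by the Desktop Duplication API; the names `source`, `pointer` and `frame` and the tiny 2x2 frame size are assumptions made for the example.

```python
import ctypes

import numpy as np

# Stand-in for the mapped texture memory: a ctypes byte buffer with known contents.
size = 12
source = (ctypes.c_ubyte * size)(*range(size))
pointer = ctypes.addressof(source)

# Allocate the destination array, then copy the raw bytes straight into it.
frame = np.empty((size,), dtype=np.uint8)
ctypes.memmove(frame.ctypes.data, pointer, size)

print(frame.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Reinterpret the flat bytes as a 2x2 image with 3 channels; no data is copied here.
frame = frame.reshape((2, 2, 3))
print(frame.shape)     # (2, 2, 3)
```

The copy itself is a single C-level call, so its cost does not grow with the number of Python-level operations per pixel.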

Why is the "pytorch" capture output slower on rotated displays?

Don't tell anyone but the reason it can compete with NumPy in the first place is only because... it is generated from a NumPy array built from the method above! If you sniff around the code, you will indeed find torch.from_numpy() scattered around. This pretty much matches the speed of the "numpy" capture output 1:1, except when dealing with a rotated display. Display rotation is handled by np.rot90() calls, which yield negative strides on that array. Negative strides are understood and perform well under NumPy but are still unsupported in PyTorch at the time of writing. To address this, an additional copy operation is needed to bring it back to a contiguous array, which imposes a performance penalty.
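The negative strides are easy to observe with NumPy alone. The sketch below uses a tiny stand-in frame and `np.ascontiguousarray` for the extra copy; whether D3DShot uses that exact call is an assumption.

```python
import numpy as np

frame = np.arange(2 * 3 * 3, dtype=np.uint8).reshape((2, 3, 3))  # tiny H x W x C frame
rotated = np.rot90(frame)  # a rotated view over the same memory, no pixel data copied

print(rotated.shape)                                   # (3, 2, 3)
print(np.shares_memory(rotated, frame))                # True
print(any(stride < 0 for stride in rotated.strides))   # True

# The additional copy: lay the rotated pixels out contiguously in new memory.
contiguous = np.ascontiguousarray(rotated)
print(contiguous.flags["C_CONTIGUOUS"])                # True
print(all(stride > 0 for stride in contiguous.strides))  # True
```

The rotation itself is free because it is only a view; the cost appears when that view has to be materialized into a contiguous array.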

Why is the "pil" capture output, being the default, not the fastest?

PIL has no ctypes interface like NumPy so a bytearray needs to be read into Python first and then fed to PIL.Image.frombytes(). This is still fast in Python terms, but it just cannot match the speed of the low-level NumPy method.

It remains the default capture output because:

  1. PIL Image objects tend to be familiar to Python users
  2. It's a way lighter / simpler dependency for a library compared to NumPy or PyTorch

Why are the float versions of capture outputs slower?

The data of the Direct3D textures made accessible by the Desktop Duplication API is formatted as bytes. To represent this data as normalized floats instead, a type cast and an element-wise division need to be performed on the array holding those bytes. This imposes a major performance penalty. Interestingly, you can see this performance penalty mitigated on GPU PyTorch tensors since the element-wise division can be massively parallelized on the device.
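In NumPy terms, the conversion amounts to something like the following sketch. The three-pixel array is a stand-in for a real frame, and the exact expression D3DShot uses is an assumption.

```python
import numpy as np

frame = np.array([[0, 128, 255]], dtype=np.uint8)  # stand-in pixel values
normalized = frame.astype(np.float64) / 255.0      # type cast, then element-wise division

print(normalized.dtype)                     # float64
print(normalized.min(), normalized.max())   # 0.0 1.0
print(normalized.nbytes // frame.nbytes)    # 8
```

Both steps touch every value in the frame, and the result occupies 8 times the memory of the byte data, which is consistent with the lower FPS of the float outputs in the table above.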

Crafted with ❤ by Serpent.AI 🐍