Awesome

nvidia-pstated

A daemon that automatically manages the performance states of NVIDIA GPUs.

How it works

flowchart TD
    A(Start) --> B
    subgraph For each GPU
    B("Check temperature[1]") -->|Below threshold| D
    B("Check temperature[1]") -->|Above threshold| C
    C("Enter low PState[2]") --> M
    D(Check utilization) -->|Is 0%| E
    D(Check utilization) -->|Is not 0%| J
    E(Check current PState) -->|High| F
    E(Check current PState) -->|Low| I
    F("Iterations counter exceeded threshold[3]") -->|Yes| G
    F("Iterations counter exceeded threshold[3]") -->|No| H
    G("Enter low PState[2]") --> H
    H(Increment iterations counter) --> M
    I(Do nothing) --> M
    J(Check current PState) -->|High| K
    J(Check current PState) -->|Low| L
    K(Reset iterations counter) --> M
    L("Enter high PState[4]") --> M
    end
    M(End) --> N
    N("Sleep[5]") --> A

1 - Threshold is controlled by option --temperature-threshold (default: 80 degrees C)
2 - Value is controlled by option --performance-state-low (default: 8)
3 - Threshold is controlled by option --iterations-before-switch (default: 30 iterations)
4 - Value is controlled by option --performance-state-high (default: 16)
5 - Value is controlled by option --sleep-interval (default: 100 milliseconds)

Installation

Prerequirements

Linux

Make sure the proprietary NVIDIA driver is installed.

You will need the following libraries:

libnvidia-api.so.1
libnvidia-ml.so.1

Packages that provide these libraries:

ArchLinux: nvidia-utils
Debian: libnvidia-api1 or libnvidia-tesla-api1 (depending on the GPU and driver installed)

On Debian derivatives, you can use apt search libnvidia-api.so.1 and apt search libnvidia-ml.so.1 to find the package you need.

Note that you MUST run this daemon at the host level, i.e. where the CUDA Driver is available. You can NOT run this daemon in a container.

nvidia-container-stack

Windows

Make sure the NVIDIA driver is installed.

Installation

Download the latest version of the executable for your OS from releases.

Building from source code

Prerequirements

CMake
CUDA toolkit

Building

# Configure
cmake -B build

# Build
cmake --build build

Misc

Managing only specific GPUs

You can use -i/--ids option to manage only specific GPUs.

Suppose you have 8 GPUs and you want to manage only the first 4 (as in nvidia-smi):

./nvidia-pstated -i 0,1,2,3

systemd service

Install nvidia-pstated in /usr/local/bin. Then save the following as /etc/systemd/system/nvidia-pstated.service.

[Unit]
Description=A daemon that automatically manages the performance states of NVIDIA GPUs
StartLimitInterval=0

[Service]
DynamicUser=yes
ExecStart=/usr/local/bin/nvidia-pstated
Restart=on-failure
RestartSec=1s

[Install]
WantedBy=multi-user.target

vGPU manager

If you are using a hypervisor (KVM) with a vGPU manager, you cannot run nvidia-pstated in virtual machines. Instead, you can run it at the hypervisor level.

To do this, you need to:

Extract libnvidia-api.so.1 from your guest driver (in my case Guest_Drivers/nvidia-linux-grid-535_535.183.06_amd64.deb/data.tar.xz/usr/lib/x86_64-linux-gnu/libnvidia-api.so.1) to some directory.
Download nvidia-pstated to the same directory.

Try running nvidia-pstated: LD_LIBRARY_PATH=. ./nvidia-pstated. You should get the following:

$ LD_LIBRARY_PATH=. ./nvidia-pstated
NvAPI_Initialize(): NVAPI_ERROR

Check dmesg, you should get the following message:

NVRM: API mismatch: the client has the version 535.183.06, but
NVRM: this kernel module has the version 535.183.04.  Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.

Use sed -i 's/535.183.06/535.183.04/g' libnvidia-api.so.1 (replace the values with what you got in dmesg) to replace the client version in libnvidia-api.so.1.
Run nvidia-pstated: LD_LIBRARY_PATH=. ./nvidia-pstated. Enjoy.