Prometheus Slurm Exporter

Prometheus collector and exporter for metrics extracted from the Slurm resource scheduling system.

Exported Metrics

State of the CPUs

State of the GPUs

NOTE: since version 0.19, GPU accounting must be explicitly enabled by adding the -gpus-acct option to the command line; otherwise it is not activated.

Be aware that:

State of the Nodes

Additional info about node usage

Since version 0.18, the following information is also extracted and exported for every node known by Slurm:

See the related test data to check the format of the information extracted from Slurm.

Status of the Jobs

State of the Partitions

Jobs information per Account and User

The following information about jobs is also extracted via squeue:

Scheduler Information

DBD Agent queue size: it is particularly important to keep track of it, since an increasing number of messages counted with this parameter almost always indicates three issues:

Share Information

Collects share statistics for every Slurm account. Refer to the man page of the sshare command for more information.

Installation

Prometheus Configuration for the SLURM exporter

It is strongly advisable to configure the Prometheus server with the following parameters:

scrape_configs:

  #
  # SLURM resource manager:
  #
  - job_name: 'my_slurm_exporter'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['slurm_host.fqdn:8080']

The configuration above can be used as-is with a fresh installation of Prometheus. We nevertheless highly recommend including at least the global section in the configuration. The official Prometheus documentation describes the available configuration options.
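As a hedged sketch of that recommendation, a configuration that includes a global section might look like the following. The values in the global section are illustrative assumptions, not settings prescribed by the exporter; adjust them to your site. The scrape job itself is the one shown earlier.

```yaml
# Assumed example values; tune them for your environment.
global:
  scrape_interval: 60s      # default for jobs that do not set their own interval
  scrape_timeout: 30s       # must not be longer than scrape_interval
  evaluation_interval: 60s  # how often rule files are evaluated

scrape_configs:
  # SLURM resource manager (job definition taken from this README)
  - job_name: 'my_slurm_exporter'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['slurm_host.fqdn:8080']
```

Settings given inside a job override the global defaults, so the Slurm job is still scraped every 30 seconds in this sketch.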

NOTE: the Prometheus server uses YAML for its configuration file, so indentation matters. Before reloading the Prometheus server, check the syntax with promtool (Prometheus 2.x and later use the check config subcommand; 1.x releases used check-config):

$ promtool check config prometheus.yml

Checking prometheus.yml
  SUCCESS: 1 rule files found
[...]
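The promtool output reports how many rule files it found. As an illustration of what such a file could contain, here is a minimal sketch of an alerting rule for the DBD Agent queue described under Scheduler Information. The metric name slurm_scheduler_dbd_queue_size, the group and alert names, and the time windows are all assumptions; check the metric name against the exporter's /metrics output before using the rule.

```yaml
groups:
  - name: slurm_scheduler            # hypothetical group name
    rules:
      - alert: SlurmDbdAgentQueueGrowing   # hypothetical alert name
        # Assumed metric name; verify it on the exporter's /metrics page.
        # delta() returns the change of a gauge over the given window.
        expr: delta(slurm_scheduler_dbd_queue_size[15m]) > 0
        # Fire only if the queue keeps growing for 30 minutes.
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "DBD Agent queue on {{ $labels.instance }} has been growing for 30 minutes"
```

For Prometheus to load the rule, the file has to be listed under the rule_files key of prometheus.yml.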

Grafana Dashboard

A dashboard is available to visualize the exported metrics in Grafana:

Status of the Nodes

Status of the Jobs

SLURM Scheduler Information

License

Copyright 2017-2020 Victor Penso, Matteo Dessalvi

This is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.