Home

Awesome

GPFS Prometheus exporter

Build Status GitHub release GitHub All Releases Go Report Card codecov

GPFS Prometheus exporter

The GPFS exporter collects metrics from the GPFS filesystem. The exporter supports the /metrics endpoint to gather GPFS metrics and metrics about the exporter.

Collectors

Collectors are enabled or disabled via --collector.<name> and --no-collector.<name> flags.

NameDescriptionDefault
mmgetstateCollect state via mmgetstateEnabled
mmpmonCollect metrics from mmpmon using fs_io_sEnabled
mountCheck status of GPFS mounts.Enabled
configCollect configs via 'mmdiag --config'Enabled
verbsTest if GPFS is using verbs interfaceDisabled
mmhealthTest node health through mmhealthDisabled
waiterCollect waiters via 'mmdiag --waiters'Disabled
mmdfCollect filesystem space for inodes, block and metadata.Disabled
mmcesCollect state of CESDisabled
mmrepquotaCollect fileset quota informationDisabled
mmlssnapshotCollect GPFS snapshot informationDisabled
mmlsfilesetCollect GPFS fileset informationDisabled
mmlsqosCollect GPFS I/O performance values of a file system, when you enable Quality of ServiceDisabled

mount

The default behavior of the mount collector is to collect mount statuses on GPFS mounts in /proc/mounts or /etc/fstab. The --collector.mount.mounts flag can be used to adjust which mount points to check.

mmhealth

The mmhealth statuses and events collected can be filtered with the following flags that all take a regex.

waiter

The waiter's seconds are stored in Histogram buckets defined by --collector.waiter.buckets which is a comma separated list of durations that are converted to seconds so 1s,5s,30s,1m would have buckets of []float64{1,5,30,60}.

The flag --collector.waiter.exclude defines a regular expression of waiter names to exclude.

The flag --collector.waiter.log-reason can enable logging of waiter reasons. The reason can produce very high cardinality so it is not included in metrics.

mmdf

Due to the time it can take to execute mmdf that is an executable provided that can be used to collect mmdf via cron. See gpfs_mmdf_exporter.

Flags:

mmces

The command used to collect CES states needs a specific node name. The --collector.mmces.nodename flag can be used to specify which CES node to check. The default is FQDN of those running the exporter.

mmrepquota

mmlssnapshot

The exporter gpfs_mmlssnapshot_exporter is provided to allow snapshot collection, including size (with --collector.mmlssnapshot.get-size) to be collected with cron rather than a Prometheus scrape through the normal exporter.

mmlsfileset

NOTE: This collector does not collect used inodes. To get used inodes look at using the mmrepquota collector.

mmlsqos

Displays the I/O performance values of a file system, when you enable Quality of Service for I/O operations (QoS) with the mmchqos command.

Flags:

Sudo

Ensure the user running gpfs_exporter can execute GPFS commands necessary to collect metrics. The following sudo config assumes gpfs_exporter is running as gpfs_exporter.

Defaults:gpfs_exporter !syslog
Defaults:gpfs_exporter !requiretty
# mmgetstate collector
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmgetstate -Y
# mmpmon collector
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmpmon -s -p
# config collector
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmdiag --config -Y
# mmhealth collector
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmhealth node show -Y
# verbs collector
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmfsadm test verbs status
# mmdf/mmlssnapshot collector if filesystems not specified
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmlsfs all -Y -T
# waiter collector
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmdiag --waiters -Y
# mmces collector
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmces state show *
# mmdf collector, each filesystem must be listed
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmdf project -Y
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmdf scratch -Y
# mmrepquota collector, filesystems not specified
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmrepquota -j -Y -a
# mmrepquota collector, filesystems specified
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmrepquota -j -Y project scratch
# mmlssnapshot collector, each filesystem must be listed
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmlssnapshot project -s all -Y
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmlssnapshot ess -s all -Y
# mmlsfileset collector, each filesystem must be listed
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmlsfileset project -Y
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmlsfileset ess -Y
# mmlsqos collector, each filesystem must be listed
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmlsqos mmfs1 -Y
gpfs_exporter ALL=(ALL) NOPASSWD:/usr/lpp/mmfs/bin/mmlsqos ess -Y

Install

Download the latest release

Add the user that will run gpfs_exporter

groupadd -r gpfs_exporter
useradd -r -d /var/lib/gpfs_exporter -s /sbin/nologin -M -g gpfs_exporter -M gpfs_exporter

Install compiled binaries after extracting tar.gz from release page.

cp /tmp/gpfs_exporter /usr/local/bin/gpfs_exporter

Add sudo rules, see sudo section

Add systemd unit file and start service. Modify the ExecStart with desired flags.

cp systemd/gpfs_exporter.service /etc/systemd/system/gpfs_exporter.service
systemctl daemon-reload
systemctl start gpfs_exporter

Build from source

To produce the gpfs_exporter, gpfs_mmdf_exporter, and gpfs_mmlssnapshot_exporter binaries:

make build

Or

go get github.com/treydock/gpfs_exporter/cmd/gpfs_exporter
go get github.com/treydock/gpfs_exporter/cmd/gpfs_mmdf_exporter
go get github.com/treydock/gpfs_exporter/cmd/gpfs_mmlssnapshot_exporter

TLS and basic auth

gpfs_exporter supports TLS and basic auth using exporter-toolkit. To use TLS and/or basic auth, users need to use --web.config.file CLI flag as follows

gpfs_exporter --web.config.file=web-config.yaml

A sample web-config.yaml file can be fetched from exporter-toolkit repository. The reference of the web-config.yaml file can be consulted in the docs.

Grafana

There is an example GPFS Performance dashboard. See the description on that dashboard for additional information on labels needed to utilize that dashboard.

Prometheus Configuration

This is an example scrape config with some metrics excluded for HPC compute nodes with label role=compute:

- job_name: gpfs
  scrape_timeout: 2m
  scrape_interval: 3m
  relabel_configs:
  - source_labels: [__address__]
    regex: "([^.]+)..*"
    replacement: "$1"
    target_label: host
  metric_relabel_configs:
  - source_labels: [__name__,role]
    regex: gpfs_(mount|health|verbs)_status;compute
    action: drop
  - source_labels: [__name__,collector,role]
    regex: gpfs_exporter_(collect_error|collector_duration_seconds);(mmhealth|mount|verbs);compute
    action: drop
  - source_labels: [__name__,role]
    regex: "^(go|process|promhttp)_.*;compute"
    action: drop
  file_sd_configs:
  - files:
    - "/etc/prometheus/file_sd_config.d/gpfs_*.yaml"

An example scrape target configuration:

- targets:
  - c0001.example.com:9303
  labels:
    host: c0001
    cluster: example
    environment: production
    role: compute