Overview

This repository contains the data needed to train a linear model (OLS) / XGBoost on the SPECPower dataset.

The models are built with dynamic variables designed to work in different cloud environments where some information may not be available.

Its purpose is to estimate the current power draw of the whole machine in Watts.

Currently the model supports the following variables:

Only the CPU utilization parameter is mandatory. All other parameters are optional.
The vHost ratio is assumed to be 1 if not given.

You are free to supply only the utilization, or as many of the additional parameters as the model supports. The model will then be retrained on the new configuration on the spot.

Typically the model gets more accurate the more parameters you can supply. Please see the Assumptions & Limitations section at the end to get an idea of how accurate the model will be under different circumstances.

Background

Typically in the cloud, especially when virtualized, it is not possible to access any energy metrics, either from the ILO / iDRAC controllers or from RAPL.

Therefore power draw must be estimated.

Several approaches to this problem already exist:

Cloud Carbon Footprint and Teads operate on billing data and are too coarse for fast-paced development that pushes changed code on a daily basis.

Teads could theoretically solve this, but is strictly limited to AWS EC2. It also provides no out-of-the-box interface for inline monitoring of emissions.

Therefore we created a model from the SPECPower dataset that can also be used in real time.

Discovery of the parameters

At least utilization is needed as an input parameter.

You need a small script that streams the CPU utilization as pure float numbers, line by line.

The solution we are using is a modified version of our CPU Utilization reporter from the Green Metrics Tool.

This one is tailored to read from the procfs. You might need something different in your case ...
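Such a reporter could be sketched in Python like this. This is a minimal illustration assuming a Linux procfs, not the actual Green Metrics Tool reporter (which is written in C):

```python
# Minimal sketch of a CPU-utilization streamer (assumes Linux /proc/stat).
# Prints overall utilization as one float per line, similar to what
# ols.py / xgb.py expect on stdin.
import time

def parse_cpu_line(line):
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq ..."
    fields = [int(v) for v in line.split()[1:]]
    idle = fields[3] + fields[4]  # idle + iowait count as idle time
    total = sum(fields)
    return idle, total

def utilization(prev, cur):
    # Utilization between two samples, in percent
    idle_delta = cur[0] - prev[0]
    total_delta = cur[1] - prev[1]
    if total_delta == 0:
        return 0.0
    return 100.0 * (1 - idle_delta / total_delta)

if __name__ == "__main__":
    with open("/proc/stat") as f:
        prev = parse_cpu_line(f.readline())
    while True:
        time.sleep(1)
        with open("/proc/stat") as f:
            cur = parse_cpu_line(f.readline())
        print(f"{utilization(prev, cur):.2f}", flush=True)
        prev = cur
```

The output of this script can be piped directly into the model, just like the C reporter described below.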

Hyperthreading

Whether Hyperthreading is enabled can be checked by comparing the core id entries with the processor entries in /proc/cpuinfo.

Without HT every logical processor maps to its own core id, so the number of processors equals the number of unique core ids. With HT enabled there are more logical processors than unique core ids (typically twice as many).

Alternatively, looking at lscpu might reveal some info (e.g. the Thread(s) per core line).
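The /proc/cpuinfo comparison can be sketched like this. Note the single-socket assumption: on multi-socket machines core ids repeat per physical package and the pair (physical id, core id) would have to be used instead:

```python
# Sketch: detect Hyperthreading by comparing "processor" entries with
# unique "core id" entries in /proc/cpuinfo (assumes Linux, single socket).
def ht_enabled(cpuinfo_text):
    processors = 0
    core_ids = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("processor"):
            processors += 1
        elif line.startswith("core id"):
            core_ids.add(line.split(":", 1)[1].strip())
    # With HT there are more logical processors than unique core ids
    return processors > len(core_ids)

if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print("HT enabled:", ht_enabled(f.read()))
```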

SVM / VT-X / VT-D / AMD-V ...

The presence of virtualization can be checked by looking at:

/dev/kvm

If that device node is present, it is a strong indicator that virtualization is enabled.

One can also install cpu-checker and run kvm-ok: sudo apt install cpu-checker -y && sudo kvm-ok

This performs additional checks to tell whether virtualization is on, even on AMD machines.

However, inside a vHost this might not work at all, as the device node is generally hidden.

In that case it must be checked whether you are already running inside a virtualized environment: sudo apt install virt-what -y && sudo virt-what

Also lscpu might provide some insight through lines like these:

Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
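The same checks can be sketched in Python. This is an illustrative combination, not a replacement for kvm-ok or virt-what: /dev/kvm presence is only meaningful on bare metal, while the hypervisor CPU flag in /proc/cpuinfo (one of the signals virt-what inspects) indicates running inside a guest:

```python
# Sketch: virtualization checks via procfs / device nodes (assumes Linux).
import os

def kvm_available(dev_path="/dev/kvm"):
    # The device node exists only when KVM virtualization support is exposed
    return os.path.exists(dev_path)

def running_in_vm(cpuinfo_text):
    # The "hypervisor" CPU flag is set for guests on most hypervisors
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "hypervisor" in line.split(":", 1)[1].split()
    return False
```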

Hardware prefetchers

There are actually many prefetchers to disable. Intel processors typically support 4 types of hardware prefetchers for prefetching data: 2 prefetchers associated with the L1 data cache (also known as the DCU: the DCU prefetcher and the DCU IP prefetcher) and 2 prefetchers associated with the L2 cache (the L2 hardware prefetcher and the L2 adjacent cache line prefetcher).

There is a Model Specific Register (MSR) on every core at address 0x1A4 that can be used to control these 4 prefetchers. Bits 0-3 in this register can be used to either enable or disable them. The other bits of this MSR are reserved.
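Decoding that register can be sketched as follows. The bit-to-prefetcher mapping follows Intel's documentation for MSR 0x1A4 (a set bit disables the prefetcher); reading the MSR itself requires root and the msr kernel module (modprobe msr):

```python
# Sketch: decode the four prefetcher control bits of MSR 0x1A4.
# A SET bit disables the corresponding prefetcher (per Intel docs).
import struct

MSR_MISC_FEATURE_CONTROL = 0x1A4
PREFETCHERS = [
    "L2 hardware prefetcher",             # bit 0
    "L2 adjacent cache line prefetcher",  # bit 1
    "DCU prefetcher",                     # bit 2
    "DCU IP prefetcher",                  # bit 3
]

def decode_prefetchers(msr_value):
    # Map each prefetcher name to True (enabled) / False (disabled)
    return {name: not ((msr_value >> bit) & 1)
            for bit, name in enumerate(PREFETCHERS)}

def read_msr(cpu=0, reg=MSR_MISC_FEATURE_CONTROL):
    # Requires root and "modprobe msr" so /dev/cpu/N/msr exists
    with open(f"/dev/cpu/{cpu}/msr", "rb") as f:
        f.seek(reg)
        return struct.unpack("<Q", f.read(8))[0]
```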

However, for some processors this setting is only available in the BIOS, as Intel does not necessarily disclose how to disable it from software. For servers it apparently is quite standard for this to be an available BIOS feature.

https://stackoverflow.com/questions/54753423/correctly-disable-hardware-prefetching-with-msr-in-skylake
https://stackoverflow.com/questions/55967873/how-can-i-verify-that-my-hardware-prefetcher-is-disabled
https://stackoverflow.com/questions/784041/how-do-i-programmatically-disable-hardware-prefetching
https://stackoverflow.com/questions/19435788/unable-to-disable-hardware-prefetcher-in-core-i7

Other variables

Other variables to be discovered, like the CPU make etc., can typically be found in these locations:

Information like the vHost ratio can sometimes be seen in /proc/stat, but this info is usually given in the machine selector of your cloud provider.

If you cannot find out specific parameters the best thing is: Write an email to your cloud provider and ask :)

Model Details / EDA

The EDA is currently only available on Kaggle, where you can see how we selected the subset of the available variables and their interactions in our Kaggle notebook.

In order to create some columns we inspected the SUT_BIOS and SUT_Notes fields and created some feature columns derived from them. Here is a quick summary:

Unclear data in SUT_BIOS / SUT_Notes

Some info we thought might be related to energy, but we could not make sense of it. If you can, please share your insight and create a Pull Request:

Interpolation for output

Like all tree-based models, our XGBoost model can only predict values it has seen during training.

Since the original SPECPower data only has information for every 10% of utilization, the model will by default return the same value for, say, 6% as for 7%.

To combat this behaviour we interpolate between the points where the model actually reports new data, which is:

The data is interpolated linearly. The interpolation is done directly when the xgb.py script starts, and all possible inferred values for utilization (0.00 - 100.00) are stored in a dict. This makes the model extremely performant at the cost of a small memory overhead.
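The lookup-table idea can be sketched like this; the anchor predictions below are made-up numbers, not real model output:

```python
# Sketch: linearly interpolate model predictions at the 10% anchor points
# and precompute every 0.01 utilization step into a dict.
def build_lookup(predictions, step=0.01):
    # predictions: {utilization_anchor: predicted_watts}
    anchors = sorted(predictions)
    table = {}
    steps = int(round((anchors[-1] - anchors[0]) / step))
    for i in range(steps + 1):
        x = round(anchors[0] + i * step, 2)
        # find the surrounding anchors and interpolate between them
        for lo, hi in zip(anchors, anchors[1:]):
            if lo <= x <= hi:
                frac = (x - lo) / (hi - lo)
                table[x] = predictions[lo] + frac * (predictions[hi] - predictions[lo])
                break
    return table
```

Each lookup at inference time is then a single dict access, which is what makes inline reporting cheap.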

Results

We have first compared the model against a machine from SPECPower that we did not include in the model training: Hewlett Packard Enterprise Synergy 480 Gen10 Plus Compute Module

This machine consists of 10 identical nodes, so the power values have to be divided by 10 to approximate the value that would have resulted if only one node had been tested individually.

An individual node has the following characteristics as model parameters:

hp_synergy_480_Gen10_Plus.png

This is the comparison chart:

Secondly we have bought a machine from the SPECPower dataset: FUJITSU Server PRIMERGY RX1330 M3

The machine has the following characteristics as model parameters:

This is the comparison chart for the SPEC data vs our modelling: fujitsu_TX1330_SPEC.png

This is the comparison chart where we compare the standard BIOS setup against the tuning settings from SPECPower: fujitsu_TX1330_measured.png

Summary

Installation

Tested on Python 3.10, but should work on older versions.

python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt

Re-build training data

If you want to rebuild the training data (spec_data*.csv) then you have to include the git submodule with the raw data.

git submodule update --init

Use

Call the Python file ols.py or xgb.py. Both are designed to accept streaming input on stdin.

A typical call with a streaming binary that reports CPU utilization could look like this:

$ ./static-binary | python3 ols.py --tdp 240 
191.939294374113
169.99632303510703
191.939294374113
191.939294374113
191.939294374113
191.939294374113
194.37740205685841
191.939294374113
169.99632303510703
191.939294374113
....

Since all possible outputs are inferred directly into a dict, the model is highly performant and well suited for inline reporting scenarios.

Demo reporter

If you want to use the demo reporter to read the CPU utilization there is a C reporter in the demo-reporter directory.

Compile it with gcc cpu-utilization.c

Then run it with ./a.out

Or feed it directly into the model: ./a.out | python3 xgb.py --tdp ....

Comparison with Interact DC variable selection

Run the interact_validation.py to see a K-Folds comparison of our variable selection against the one from Interact DC.

Without hyperparameter tuning, and when comparing only the variables available in the cloud, both selections perform about the same.
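The core of such a K-Folds comparison can be sketched as below. kfold_indices is an illustrative helper, not the actual interact_validation.py code; the point is that both variable selections are fitted and scored on identical train/test splits:

```python
# Sketch: generate K train/test index splits so that two variable
# selections can be evaluated on exactly the same folds.
def kfold_indices(n_samples, k=5):
    indices = list(range(n_samples))
    # distribute the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```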

Assumptions & Limitations

TODO

Credits

A similar model has been developed in academia from Interact DC and the paper can be downloaded on their official resources site.

Our model was initially developed independently, but we have taken some inspiration from the paper to tune the model afterwards.

A big thank you to Rich Kenny from Interact DC for providing insights into parameters and possible pitfalls during our model development.