GIFT-EVAL: A Benchmark for General Time Series Forecasting Model Evaluation

Paper | Blog Post | Train-Test Dataset | Pretrain Dataset | Leaderboard

[Figure: GIFT-Eval main figure]

| Benchmark | Freq. Range | Num. of Domain | Pretraining data | Num. of var. | Pred. Len. | Benchmark Methods | Prob. Forecasting |
|---|---|---|---|---|---|---|---|
| Monash (Godahewa et al., 2021) | Secondly ~ Yearly | 7 | No | Uni | Short | Stat./DL | No |
| TFB (Qiu et al., 2024) | Minutely ~ Yearly | 6 | No | Uni/Multi | Short | Stat./DL | No |
| LTSF (Zeng et al., 2022) | Minutely ~ Weekly | 5 | No | Multi | Long | Stat./DL | No |
| BasicTS+ (Shao et al., 2023) | Minutely ~ Daily | 3 | No | Multi | Short/Long | Stat./DL | No |
| GIFT-Eval (our work) | Secondly ~ Yearly | 7 | Yes | Uni/Multi | Short/Long | Stat./DL/FM | Yes |

GIFT-Eval is a comprehensive benchmark designed to evaluate general time series forecasting models across diverse datasets, promoting the advancement of zero-shot forecasting capabilities in foundation models.

To facilitate the effective pretraining and evaluation of foundation models, we also provide a non-leaking pretraining dataset, GiftEvalPretrain.

Installation

  1. Clone the repository and change the working directory to GIFT_Eval.
  2. Create and activate a virtual environment:
python3 -m venv myenv
source myenv/bin/activate
  3. Install the required packages:

If you just want to explore the dataset, you can install the required dependencies as follows:

pip install -e .

If you want to run baselines, you can install the required dependencies as follows:

pip install -e .[baseline]

Note: The specific instructions for installing the Moirai and Chronos models are available in their relevant notebooks.

  4. Get the train/test dataset from Hugging Face:
huggingface-cli download Salesforce/GiftEval --repo-type=dataset --local-dir PATH_TO_SAVE
  5. Set up the environment variable by adding the path to the data to a .env file:
echo "GIFT_EVAL=PATH_TO_SAVE" >> .env
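
The entry written to .env in the last step can be read back from Python. The following is a minimal sketch using only the standard library; the helper names read_env_file and gift_eval_path are hypothetical, and the repository's notebooks may load the variable differently.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict.

    Hypothetical helper for illustration; it skips blank lines and
    comments and does not handle quoting or variable expansion.
    """
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def gift_eval_path(path=".env"):
    """Return the GIFT_EVAL data directory, preferring the process environment."""
    return os.environ.get("GIFT_EVAL") or read_env_file(path).get("GIFT_EVAL")
```

Calling gift_eval_path() from the repository root should return whatever was substituted for PATH_TO_SAVE, or None if the variable is not set anywhere.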

Getting Started

Iterating the dataset

We provide a simple class, Dataset, to load each dataset in our benchmark following the gluonts interface. We highly recommend using this class to split the data into train/val/test for compatibility with the evaluation framework and the other baselines in the leaderboard. You don't have to stick to the gluonts interface, though: you can implement a wrapper class that loads the data iterator in a different format.

This class provides the following properties:

Please refer to dataset.ipynb for an example of how to iterate over the train/val/test splits of the dataset.
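
As an illustration of the wrapper idea mentioned above, here is a minimal sketch that re-exposes gluonts-style entries as plain Python tuples. The class name TargetListAdapter is hypothetical and not part of the repository. It assumes each entry is a dict with a one-dimensional "target" sequence and an optional "item_id"; multivariate targets would need extra handling.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple


class TargetListAdapter:
    """Hypothetical wrapper that yields (item_id, target values) tuples.

    Assumes each entry is a dict with a 1-D "target" sequence and,
    optionally, an "item_id"; other fields such as "start" are ignored.
    """

    def __init__(self, entries: Iterable[Dict[str, Any]]):
        self._entries = entries

    def __iter__(self) -> Iterator[Tuple[str, List[float]]]:
        for index, entry in enumerate(self._entries):
            # Fall back to the entry's position when no item_id is present.
            item_id = str(entry.get("item_id", index))
            yield item_id, [float(v) for v in entry["target"]]
```

Any iterable of such entries can be passed in, so the same adapter could sit on top of whichever split you iterate over.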

Running baselines

We provide examples of how to run the statistical, deep learning, and foundation model baselines in the naive.ipynb, feedforward.ipynb, moirai.ipynb, and chronos.ipynb notebooks. Each notebook wraps models available in different libraries to help you get started. You can either follow these examples or implement your own wrapper class to iterate over the splits of the dataset, as explained in the dataset.ipynb notebook.

Each of these notebooks generates a CSV file called all_results.csv under the results/<MODEL_NAME> folder, containing the results for your model on the GIFT-Eval benchmark. Regardless of the model you choose and how you run it, you can submit your results to the leaderboard by following the instructions in the Submitting your results section.

Sample output file

A sample output file is located at results/naive/all_results.csv.

The file contains the following columns:

The first column in the CSV file is the dataset config name, a combination of the prettified dataset name, the frequency, and the term. The sample notebooks (e.g. naive.ipynb) show how to build this name; please follow the same format to align with the leaderboard:

f"{dataset_name}/{freq}/{term}"
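
A small sketch of how the format string above composes. The function name config_name and the example values are hypothetical and chosen only for illustration; take the actual prettified names from the sample notebooks.

```python
def config_name(dataset_name: str, freq: str, term: str) -> str:
    """Join dataset name, frequency, and term with "/" in that order."""
    return f"{dataset_name}/{freq}/{term}"


# Hypothetical values, for illustration only.
print(config_name("example_dataset", "H", "short"))  # example_dataset/H/short
```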

Submitting your results

Evaluation

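from gluonts.model import evaluate_model

# predictor, dataset, metrics, and season_length are defined earlier,
# as in the sample notebooks.
res = evaluate_model(
    predictor,
    test_data=dataset.test_data,
    metrics=metrics,
    batch_size=512,
    axis=None,  # aggregate over all dimensions
    mask_invalid_label=True,  # ignore NaN values in the target
    allow_nan_forecast=False,  # forecasts must not contain NaN
    seasonality=season_length,
)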

We highly recommend evaluating your model with the gluonts evaluate_model function, as it is compatible with the evaluation framework and the other baselines in the leaderboard. The sample notebooks show its use with statistical, deep learning, and foundation models. If you decide to evaluate your model in a different way, please follow the conventions below for compatibility with the rest of the baselines in our leaderboard:

  1. Aggregate results over all dimensions (following axis=None).
  2. Do not count NaN values in the target towards the calculation (following mask_invalid_label=True).
  3. Make sure the prediction does not contain NaN values (following allow_nan_forecast=False).
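
To make the three conventions concrete, the sketch below applies them to a single example metric, mean absolute error, using only the standard library. It is an illustration under stated assumptions, not the gluonts implementation; the function names flatten and masked_mae are hypothetical.

```python
import math
from typing import List


def flatten(values) -> List[float]:
    """Flatten arbitrarily nested lists/tuples of numbers into one list."""
    if isinstance(values, (list, tuple)):
        flat = []
        for item in values:
            flat.extend(flatten(item))
        return flat
    return [float(values)]


def masked_mae(target, forecast) -> float:
    """Mean absolute error over all dimensions, skipping NaN targets."""
    y_true = flatten(target)
    y_pred = flatten(forecast)
    if len(y_true) != len(y_pred):
        raise ValueError("target and forecast must have the same number of values")
    # Convention 3: a NaN in the forecast is an error, not something to mask.
    if any(math.isnan(p) for p in y_pred):
        raise ValueError("forecast contains NaN values")
    # Convention 2: drop positions where the target is NaN.
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if not math.isnan(t)]
    if not pairs:
        raise ValueError("no valid target values to evaluate")
    # Convention 1: one scalar over every remaining value (like axis=None).
    return sum(abs(t - p) for t, p in pairs) / len(pairs)
```

For a target of [[1.0, nan], [3.0, 4.0]] and a forecast of [[2.0, 5.0], [3.0, 2.0]], the NaN position is skipped and the remaining errors 1, 0, and 2 average to 1.0.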

Submission

Submit your results to the leaderboard by creating a pull request that adds your results to the results/<YOUR_MODEL_NAME> folder. Your PR should contain only a folder with two files called all_results.csv and config.json. The config.json file should contain the following fields:

{
    "model": "YOUR_MODEL_NAME",
    "model_type": "one of statistical, deep-learning, or pretrained",
    "model_dtype": "float32, etc."
}

The final all_results.csv file should contain 98 lines (one for each dataset configuration) and 15 columns: 4 for dataset, model, domain and num_variates and 11 for the evaluation metrics.
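
Before opening a pull request, a quick local check of the folder can catch formatting slips. The following is a hedged sketch using only the standard library; check_submission is a hypothetical helper, not part of the benchmark tooling. It verifies the config.json fields and the column count described above, and leaves the line count for you to compare against the figure stated above.

```python
import csv
import json
from pathlib import Path

REQUIRED_CONFIG_KEYS = {"model", "model_type", "model_dtype"}
ALLOWED_MODEL_TYPES = {"statistical", "deep-learning", "pretrained"}
EXPECTED_COLUMNS = 15


def check_submission(folder):
    """Return a list of problems found in a results/<YOUR_MODEL_NAME> folder."""
    folder = Path(folder)
    problems = []

    config_path = folder / "config.json"
    if not config_path.exists():
        problems.append("config.json is missing")
    else:
        config = json.loads(config_path.read_text())
        missing = REQUIRED_CONFIG_KEYS - set(config)
        if missing:
            problems.append(f"config.json lacks fields: {sorted(missing)}")
        if config.get("model_type") not in ALLOWED_MODEL_TYPES:
            problems.append("model_type is not one of the allowed values")

    results_path = folder / "all_results.csv"
    if not results_path.exists():
        problems.append("all_results.csv is missing")
    else:
        with results_path.open(newline="") as handle:
            rows = list(csv.reader(handle))
        # Every line, including any header, should have the same 15 fields.
        for number, row in enumerate(rows, start=1):
            if len(row) != EXPECTED_COLUMNS:
                problems.append(f"line {number} has {len(row)} columns, expected 15")

    return problems
```

An empty list means none of these checks failed; it does not guarantee that the leaderboard will accept the submission.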

Citation

If you find this benchmark useful, please consider citing:

@article{aksu2024giftevalbenchmarkgeneraltime,
      title={GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation}, 
      author={Taha Aksu and Gerald Woo and Juncheng Liu and Xu Liu and Chenghao Liu and Silvio Savarese and Caiming Xiong and Doyen Sahoo},
      journal={arXiv preprint arXiv:2410.10393},
      year={2024},
}

This repository is intended for research purposes only.