18.06.2021: Awesome news! Our paper has been accepted at ECML/PKDD 2021. Stay tuned for an updated version. 🤩
This repository includes the code for our paper Very Fast Streaming Submodular Function Maximization, which introduces a new nonnegative submodular function maximization algorithm for streaming data. For our experiments, we also implemented existing state-of-the-art streaming algorithms for which we could not find an implementation. The code focuses on easy extensibility and accessibility. It is mainly written in header-only C++ with a Python interface via pybind11. Detailed documentation can be found here.
How to use this code
First, clone this repository recursively:
git clone --recurse-submodules git@github.com:sbuschjaeger/SubmodularStreamingMaximization.git
Using the Python Interface
If you use Anaconda to manage your Python packages, you can use the following commands to install all dependencies, including the Python interface `PySSM`:
source /opt/anaconda3/bin/activate
conda env create -f environment.yml --force
conda activate pyssm
Note the `--force` option, which overrides all existing environments called `pyssm`. Alternatively, you can just install this package via
pip install -e .
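If the installation succeeded, a quick import check should run without errors (a minimal sketch; the class names are the same ones used in the example below):

```python
# Sanity check: these imports should succeed once PySSM is installed.
from PySSM import RBFKernel, IVM, Greedy

print(RBFKernel, IVM, Greedy)
```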
Once installed, you can simply import the desired submodular function and optimizer via `PySSM`. For a detailed explanation of specific parameters / functions provided, please have a look at the documentation and the source code. The following example uses the Greedy optimizer to select a data summary by maximizing the Informative Vector Machine (the full examples can be found in `tests/main.py`):
import numpy as np

from PySSM import RBFKernel
from PySSM import IVM
from PySSM import Greedy
X = [
[0, 0],
[1, 1],
[0.5, 1.0],
[1.0, 0.5],
[0, 0.5],
[0.5, 1],
[0.0, 1.0],
[1.0, 0.]
]
K = 3
kernel = RBFKernel(sigma=1,scale=1)
ivm = IVM(kernel = kernel, sigma = 1.0)
greedy = Greedy(K, ivm)
greedy.fit(X)
# Alternatively, you can use the streaming interface:
# for x in X:
#     greedy.next(x)

fval = greedy.get_fval()
solution = np.array(greedy.get_solution())

print("Found a solution with fval = {}".format(fval))
print(solution)
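The streaming interface hinted at in the comment above consumes the data element by element instead of calling `fit()`. A minimal sketch, assuming the optimizer supports `next()` as indicated in the comment (streaming optimizers are the typical use case for this interface):

```python
import numpy as np

from PySSM import RBFKernel, IVM, Greedy

# Build the same objective and optimizer as above.
kernel = RBFKernel(sigma=1, scale=1)
ivm = IVM(kernel=kernel, sigma=1.0)
opt = Greedy(3, ivm)

# Feed the data one point at a time; assumes next() behaves as hinted above.
for x in [[0, 0], [1, 1], [0.5, 1.0], [1.0, 0.5]]:
    opt.next(x)

print("fval =", opt.get_fval())
print(np.array(opt.get_solution()))
```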
Using the C++ Interface
The C++ code is header-only, so simply include the desired functions in your project and you are good to go. However, you need a C++17 compiler (e.g. gcc-7 or clang-5). If you have trouble compiling, you can look at the `CMakeLists.txt` file, which compiles the Python bindings as well as the following test file. The following example uses the Greedy optimizer to select a data summary by maximizing the Informative Vector Machine (the full examples can be found in `tests/main.cpp`):
#include <iostream>
#include <vector>
#include <math.h>
#include "FastIVM.h"
#include "RBFKernel.h"
#include "Greedy.h"
std::vector<std::vector<double>> data = {
{0, 0},
{1, 1},
{0.5, 1.0},
{1.0, 0.5},
{0, 0.5},
{0.5, 1},
{0.0, 1.0},
{1.0, 0.0}
};
unsigned int K = 3;
FastIVM fastIVM(K, RBFKernel(), 1.0);
Greedy greedy(K, fastIVM);
greedy.fit(data);
auto solution = greedy.get_solution();
double fval = greedy.get_fval();
std::cout << "Found a solution with fval = " << fval << std::endl;
for (auto x : solution) {
for (auto xi : x) {
std::cout << xi << " ";
}
std::cout << std::endl;
}
How to reproduce the experiments in our paper
This repository combines code for multiple datasets and two different repositories. It is structured as follows:
- `experiment_runner`: Contains code for the `experiment_runner` package. This is a small tool which we use to run multiple experiments in our computing environment. Basically, the `experiment_runner` accepts a list of experiments and runs these either on multiple machines or in a multi-threaded environment. All results are written to a `*.jsonl` file. This code is sparsely documented. For a commented example please have a look at `experiments/kdd99/run.{cpp,py}`.
- `include/`: Contains the C++ implementation of all optimizers. You can find more documentation under `docs/html/index.html`, including some examples of how to use specific optimizers.
- `tests/run.{py,cpp}`: Contains sample code for C++ and Python.
- `DATASET/run.py`: Contains code to run the experiments on the specific DATASET.
- `DATASET/init.sh`: Some datasets must be downloaded manually. In this case you can use the provided `init.sh` to do so.
- `explore_results.ipynb`: A Jupyter notebook to plot the results of each experiment.
All experiments have been performed via the Python interface. Assuming that you followed the instructions above, simply navigate to the specific folder and execute the run file:
cd experiments/kddcup99
./run.py
In the `experiments` folder you can find code which runs experiments on various datasets via the Python interface. You will probably need to download the data first, which can be done using the `init.{sh,py}` scripts provided in each folder. Some notes on this:

- `creditfraud` is hosted on Kaggle, which requires the kaggle-api to be installed and configured with an API key. It might be easier to manually download this data-set from Kaggle.
- `fact-highlevel` is hosted by the FACT project page. To process these files, some additional packages are required which should be installed via the conda environment. If not, please make sure to have `pyfact` installed and working. Also note that these files are rather large (> 2 GB), so the download may take some time.
- `fact-lowlevel` requires even more additional tools and packages. Please contact sebastian.buschjaeger@tu-dortmund.de if you are interested in these files.
Once the data is downloaded, you can start the experiments by executing `run.py` in the respective folder. This file starts all experiments for a single data-set, including all hyperparameter configurations. This may take some time (usually a few hours per data-set) to finish. The experiments are currently configured to launch 15 threads via `joblib`, so make sure your hardware is strong enough, or reduce the number of cores by setting `n_cores` at the end of each file. After the experiments have finished, you can browse the results by using the `explore_results` Jupyter Notebook. Note that, depending on the actual experiments you ran, you might want to change the list of `datasets` used for plotting in the second cell of this notebook accordingly.
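If you prefer to inspect results outside of the notebook, recall that each run writes its results to a `*.jsonl` file (see the repository structure above). A minimal sketch for loading such a file with pandas; the file path below is hypothetical and depends on the experiment you ran:

```python
import pandas as pd

# Hypothetical path: point this at the *.jsonl file produced by your run.
results = pd.read_json("experiments/kddcup99/results.jsonl", lines=True)

print(results.head())
```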
Citing our Paper
@misc{buschjäger2020fast,
title={Very Fast Streaming Submodular Function Maximization},
author={Sebastian Buschjäger and Philipp-Jan Honysz and Katharina Morik},
year={2020},
eprint={2010.10059},
archivePrefix={arXiv},
primaryClass={cs.LG}
}