VFDT-split-time-prediction
Installation:
- Clone the adapted MOA repository from here
- MOA provides several algorithms derived from the Hoeffding tree. These have been adapted as well and can use the local split-time prediction. The main split-time prediction algorithms are located in HoeffdingTree.java.
- If you only want the modified files, so that you can integrate them into your local MOA version, they are located here
- Build MOA (the easiest way is to use an IDE such as IntelliJ)
Using the local split-time prediction
Select the type of split-time prediction you want to use in the properties of the Hoeffding tree, then run your experiments.
Datasets
Artificial:
RBF
Gaussian distributions with random initial positions, weights and standard deviations are generated in d-dimensional space. The weight controls the partitioning of the examples among the Gaussians.
This dataset was generated using MOA with the following parameters: 10 million instances, 100 dimensions, 50 Gaussians, 50 classes, 100 centroids.
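To make the description above concrete, here is a minimal sketch of how such a stream could be sampled. It is not MOA's implementation: the function names, the dictionary layout, and the isotropic Gaussian sampling around each center are assumptions chosen for brevity.

```python
import random

def make_centroids(n_centroids, n_dims, n_classes, rng):
    """Create centroids with a random position, class, spread, and weight."""
    return [
        {
            "center": [rng.random() for _ in range(n_dims)],
            "label": rng.randrange(n_classes),
            "std": rng.random(),
            "weight": rng.random(),
        }
        for _ in range(n_centroids)
    ]

def sample_instance(centroids, rng):
    """Pick a centroid in proportion to its weight, then sample around it."""
    weights = [c["weight"] for c in centroids]
    c = rng.choices(centroids, weights=weights, k=1)[0]
    # Assumption: independent Gaussian noise per dimension.
    features = [rng.gauss(mu, c["std"]) for mu in c["center"]]
    return features, c["label"]

rng = random.Random(1)
centroids = make_centroids(n_centroids=100, n_dims=100, n_classes=50, rng=rng)
stream = [sample_instance(centroids, rng) for _ in range(1000)]
```

Centroids with a larger weight contribute proportionally more instances, which is how the weight controls the partitioning of the examples.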
RTG
The random tree generator (RTG) in MOA constructs a decision tree by randomly splitting along the attributes and assigning a random class to each leaf. Numeric and nominal attributes are supported, and the tree depth can be predefined. Instances are generated by uniform sampling along each attribute. Traversing the tree with the instance determines the corresponding class label.
This dataset was generated using MOA with the following parameters: 5 million instances, 100 numeric dimensions, 100 nominal dimensions, 25 classes, max tree depth 15.
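The idea behind this generator can be sketched as follows. This is a simplified illustration, not MOA's code: it handles numeric attributes only, always grows a full binary tree to the maximum depth, and all names are hypothetical.

```python
import random

def build_tree(depth, max_depth, n_attrs, n_classes, rng):
    """Recursively build a random tree over numeric attributes in [0, 1)."""
    if depth >= max_depth:
        # Leaf: assign a random class.
        return {"label": rng.randrange(n_classes)}
    return {
        "attr": rng.randrange(n_attrs),   # random split attribute
        "threshold": rng.random(),        # random split point
        "left": build_tree(depth + 1, max_depth, n_attrs, n_classes, rng),
        "right": build_tree(depth + 1, max_depth, n_attrs, n_classes, rng),
    }

def classify(node, x):
    """Traverse the tree with instance x until a leaf is reached."""
    while "label" not in node:
        branch = "left" if x[node["attr"]] < node["threshold"] else "right"
        node = node[branch]
    return node["label"]

def sample_instance(tree, n_attrs, rng):
    """Sample each attribute uniformly and label the instance with the tree."""
    x = [rng.random() for _ in range(n_attrs)]
    return x, classify(tree, x)

rng = random.Random(7)
tree = build_tree(0, max_depth=5, n_attrs=10, n_classes=25, rng=rng)
stream = [sample_instance(tree, 10, rng) for _ in range(1000)]
```

Because the labels come from a fixed tree, the concept is stationary and, in principle, perfectly learnable by a decision tree.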
LED-Drift
This generator yields instances with 24 boolean features, 17 of which are irrelevant. The remaining features correspond to the segments of a seven-segment LED display. The goal is to predict the digit displayed on the LED display, where each feature has a 10% chance of being inverted. Drift is generated by swapping the relevant features with irrelevant ones. We used the LEDDrift generator in MOA (7 drifting dimensions, 10% noise).
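A minimal sketch of this generation scheme is shown below. It is an illustration under assumptions, not MOA's implementation: the segment order (a to g), the `swaps` parameter, and the function name are hypothetical.

```python
import random

# Seven-segment encodings in the order a, b, c, d, e, f, g
# (an assumption; MOA may order the segments differently).
SEGMENTS = {
    0: (1, 1, 1, 1, 1, 1, 0),
    1: (0, 1, 1, 0, 0, 0, 0),
    2: (1, 1, 0, 1, 1, 0, 1),
    3: (1, 1, 1, 1, 0, 0, 1),
    4: (0, 1, 1, 0, 0, 1, 1),
    5: (1, 0, 1, 1, 0, 1, 1),
    6: (1, 0, 1, 1, 1, 1, 1),
    7: (1, 1, 1, 0, 0, 0, 0),
    8: (1, 1, 1, 1, 1, 1, 1),
    9: (1, 1, 1, 1, 0, 1, 1),
}

def led_instance(rng, noise=0.10, n_irrelevant=17, swaps=()):
    """Return (features, digit) with 7 relevant and n_irrelevant random bits."""
    digit = rng.randrange(10)
    features = list(SEGMENTS[digit])
    features += [rng.randint(0, 1) for _ in range(n_irrelevant)]
    # Invert each feature with probability `noise`.
    features = [1 - f if rng.random() < noise else f for f in features]
    # Drift: exchange relevant positions with irrelevant ones.
    for i, j in swaps:
        features[i], features[j] = features[j], features[i]
    return features, digit

rng = random.Random(0)
before_drift = led_instance(rng)
after_drift = led_instance(rng, swaps=[(i, i + 7) for i in range(7)])
```

After the swap, a model that relied on the first seven positions sees only random bits there, which is what makes the drift harmful to a previously learned tree.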
Real-world:
Rialto Bridge Timelapse
Ten of the colorful buildings next to the famous Rialto bridge in Venice are encoded in a normalized 27-dimensional RGB histogram. The images were obtained from time-lapse videos captured by a webcam with a fixed position. The recordings cover 20 consecutive days during May-June 2016. Continuously changing weather and lighting conditions affect the representation, generating natural concept drift.
Airline
The Airline dataset was inspired by the regression dataset from Ikonomovska. The task is to predict whether a given flight will be delayed or not, based on seven attributes encoding various information on the scheduled departure. This dataset is often used to evaluate concept drift classifiers.
Forest Cover Type
This dataset assigns cartographic variables such as elevation, slope, and soil type of 30 x 30 meter cells to different forest cover types. Only forests with minimal human-caused disturbances were used, so the resulting forest cover types are more a result of ecological processes. It is often used as a benchmark for drift algorithms. We used the normalized version, which can also be found [here](http://moa.cms.waikato.ac.nz/datasets/).
Poker Hand
One million randomly drawn poker hands are each represented by five cards, each encoded with its suit and rank. The class is the resulting poker hand itself, such as one pair, full house and so forth. In its original form this dataset has no drift, since the poker hand definitions do not change and the instances are randomly generated. However, we used the version presented in PAW, in which virtual drift is introduced by sorting the instances by rank and suit. Duplicate hands were also removed. We used the normalized version, which can also be found here.
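The preprocessing described above (removing duplicates, then sorting) could look roughly like the sketch below. The exact sort key used for the published version is not stated here, so the card-wise (rank, suit) ordering, the hand representation, and the function name are all assumptions.

```python
def introduce_virtual_drift(hands):
    """Drop duplicate hands, then sort the rest by rank and suit.

    Each hand is a tuple of five (suit, rank) pairs.
    """
    # dict.fromkeys removes duplicates while keeping first occurrences.
    unique = list(dict.fromkeys(hands))
    # Assumed ordering: compare hands card by card on (rank, suit).
    return sorted(unique, key=lambda hand: [(rank, suit) for suit, rank in hand])

hands = [
    ((1, 10), (2, 11), (3, 12), (4, 13), (1, 1)),
    ((1, 2), (2, 2), (3, 5), (4, 9), (1, 13)),
    ((1, 10), (2, 11), (3, 12), (4, 13), (1, 1)),  # duplicate of the first hand
]
ordered = introduce_virtual_drift(hands)
```

Sorting changes the order in which feature values arrive without changing how hands map to classes, which is why the resulting drift is virtual rather than real.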
MNIST-8M
Loosli et al. used pseudo-random deformations and translations to extend the well-known MNIST database to eight million instances. The ten handwritten digits are encoded in 782 binary features.
HIGGS
This dataset consists of eleven million simulated particle collisions. The goal of this binary classification problem is to distinguish between a signal process producing Higgs bosons and a background process. The data consist of recorded low-level kinematic features as well as some derived high-level indicators.