Overview

This repository contains the source code and raw experimental results of our ICDE 2020 paper "Active Model Selection for Positive Unlabeled Time Series Classification". It is organized into the following folders.

Code: all of our source code. See "How to use the code" below.

Data: a sample UCR dataset (GunPointMaleVersusFemale), as well as the ECG data (201_A_pyramid and 205_A_pyramid) used in our case study on arrhythmia detection. See "On the datasets" below for details.

Results: all of our raw results. There are three types of files in this folder:

Seeds: seed files used by the code, including the initial PL sets used to run the candidate ST-1NN models, and sampling orders for model selection methods using the random sampling (Rand) strategy.

How to use the code

Please take the following steps to obtain PUTSC model selection results.

  1. Use GPU-DTW.cu to calculate the DTW distances.
  2. Use ST1NN.cpp to run the candidate ST-1NN models.
  3. Use modelSelection.cpp to run the model selection methods.

Input parameters of GPU-DTW.cu:

Input parameters of ST1NN.cpp:

Input parameters of modelSelection.cpp:

The final output of modelSelection.cpp includes the following:

On the datasets

We used two data sources: the UCR archive and the MIT-BIH Arrhythmia Database (MITDB). Their references and web links can be found in our paper.

As for the UCR datasets, we used the original data without further editing. The complete datasets are available at http://www.cs.ucr.edu/~eamonn/time_series_data_2018/. Note that for datasets with missing values or variable time series lengths, the UCR archive provides officially preprocessed versions, which are what we used in our experiments.

As for the MITDB data, we used lead A of Records 201 and 205. The original data is available at https://physionet.org/content/mitdb/1.0.0/. We preprocessed the data following the practice of the paper below.

J. He, L. Sun, J. Rong, H. Wang, and Y. Zhang, "A pyramid-like model for heartbeat classification from ECG recordings," PLOS ONE, vol. 13, no. 11, pp. 1–19, 2018.

For the data preprocessing source code, see https://github.com/SamHO666/A-Pyramid-like-Model-for-Heartbeat-Classification. For user convenience, we have attached the preprocessed datasets (in UCR format; see 201_A_pyramid_TRAIN.tsv and 205_A_pyramid_TRAIN.tsv in the Data folder). Note that the data has been relabeled such that the VEB (ventricular ectopic beat) class is labeled as "1", while all other data is labeled as "0".
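For reference, below is a minimal sketch of this relabeling step. It assumes a UCR-style .tsv file whose first tab-separated column is the class label, and uses a hypothetical label string "VEB" to identify the VEB beats; the actual preprocessing follows the pyramid-model repository linked above.

```cpp
// relabel_sketch.cpp -- illustrative only; the real preprocessing follows the
// pyramid-model repository linked above. We assume a UCR-style .tsv file whose
// first tab-separated column is the class label, and "VEB" is a placeholder
// for whatever label string the preprocessing pipeline emits.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::cerr << "usage: relabel <input.tsv> <output.tsv>\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    std::ofstream out(argv[2]);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string label, rest;
        std::getline(ss, label, '\t');   // first column: original class label
        std::getline(ss, rest);          // remainder: the time series values
        out << (label == "VEB" ? "1" : "0") << '\t' << rest << '\n';
    }
    return 0;
}
```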

Note that for the UCR datasets, we only used the training sets. This is because in transductive PUTSC there are no testing sets in the conventional sense; here we follow the practice of the paper listed below.

M. G. Castellanos, C. Bergmeir, I. Triguero, Y. Rodríguez, and J. M. Benítez, "On the stopping criteria for k-Nearest Neighbor in positive unlabeled time series classification problems," Inf. Sci., vol. 328, pp. 42–59, 2016.

As for the MITDB data, we used the entire records (a small proportion of the data is discarded in the preprocessing phase), which were not separated into training and testing sets by the original contributors. The files "201_A_pyramid_TRAIN.tsv" and "205_A_pyramid_TRAIN.tsv" in the Data folder were named this way only to facilitate file reading by our code; they do NOT correspond to actual training/testing splits.

Is ST-1NN the state-of-the-art PUTSC paradigm?

This question is a bit ill-posed, because ST-1NN is a paradigm, i.e. a category of models, rather than a single model. It is usually more appropriate to ask whether a specific model is SOTA. However, since we are dealing with PUTSC model selection, and this paper focuses on model selection for ST-1NN, the question becomes relevant. Rephrased: can any existing non-ST-1NN model significantly outperform the best possible performance achievable by existing ST-1NN based models?

Unfortunately, our best answer is that there is insufficient evidence that any non-ST-1NN based model can significantly outperform ST-1NN, because no comprehensive experiments on a large batch of datasets have compared the two. However, we can confidently say that our own PU-Shapelets (PUSh) is significantly weaker than ST-1NN. Concretely, we have compared PUSh with ST-1NN on the 122 UCR datasets used in our paper. For ST-1NN, in each run on each dataset we use the result of the best of the 1271 candidate models on that run (effectively the Optimal method in our paper), which simulates the case where ST-1NN's potential is fully unleashed. For PUSh, due to time constraints and the fact that PUSh is too memory-inefficient to run on certain large datasets, only results on 94 datasets are available for now (see fscores_PUSh_ST-1NN.xlsx).

We use a one-tailed Wilcoxon signed rank test to check whether PUSh is weaker than ST-1NN. Giving PUSh the edge, we set its F1-scores on datasets where the actual result is unavailable to a perfect (and unlikely) 1. Even with this unfair advantage, the p-value is 0.0021 < 0.05, which indicates that PUSh is significantly weaker than ST-1NN even in this best case.
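For readers who want to reproduce this kind of comparison, below is a minimal sketch of a one-tailed Wilcoxon signed rank test (normal approximation, average ranks for ties; tie and continuity corrections omitted). It is illustrative only and not necessarily identical to the implementation behind the reported p-value; the F1-score vectors in main are made-up placeholders, with missing PUSh results imputed as 1 exactly as described above.

```cpp
// wilcoxon_sketch.cpp -- minimal sketch of a one-tailed Wilcoxon signed rank
// test (normal approximation, average ranks for ties). Illustrative only.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// p-value for the alternative "x tends to be smaller than y"
// (here: PUSh weaker than ST-1NN).
double wilcoxon_one_tailed(const std::vector<double>& x,
                           const std::vector<double>& y) {
    std::vector<double> d;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (x[i] != y[i]) d.push_back(x[i] - y[i]);   // drop zero differences
    const std::size_t n = d.size();
    if (n == 0) return 1.0;

    // Rank |d| ascending, assigning average ranks to ties.
    std::vector<std::size_t> idx(n);
    for (std::size_t i = 0; i < n; ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(), [&](std::size_t a, std::size_t b) {
        return std::fabs(d[a]) < std::fabs(d[b]);
    });
    std::vector<double> rank(n);
    for (std::size_t i = 0; i < n;) {
        std::size_t j = i;
        while (j < n && std::fabs(d[idx[j]]) == std::fabs(d[idx[i]])) ++j;
        const double avg = (i + j + 1) / 2.0;          // average of ranks i+1..j
        for (std::size_t t = i; t < j; ++t) rank[idx[t]] = avg;
        i = j;
    }

    // W+ = sum of ranks of positive differences.
    double w_plus = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        if (d[i] > 0) w_plus += rank[i];

    // Normal approximation; small p-value when x is systematically smaller.
    const double mean = n * (n + 1) / 4.0;
    const double sd = std::sqrt(n * (n + 1) * (2.0 * n + 1) / 24.0);
    const double z = (w_plus - mean) / sd;
    return 0.5 * std::erfc(-z / std::sqrt(2.0));       // lower-tail P(Z <= z)
}

int main() {
    // Hypothetical F1-scores; missing PUSh results are imputed with a perfect 1
    // before running the test, as described above.
    std::vector<double> push  = {0.62, 0.71, 1.00 /* imputed */, 0.55};
    std::vector<double> st1nn = {0.80, 0.75, 0.90, 0.70};
    std::cout << "one-tailed p-value: " << wilcoxon_one_tailed(push, st1nn) << '\n';
    return 0;
}
```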

Issues with Critical Difference Diagrams

To plot the critical difference diagrams, we use the Friedman test (alpha = 0.05) to decide whether there are significant differences among the 18 methods. We then use the two-sided Wilcoxon signed rank test (alpha = 0.05) with Holm correction to divide the methods into groups such that methods in the same group have no significant difference. In the diagrams, methods to the right have higher average rankings than those to the left according to the Friedman test, and methods in the same clique (marked by solid bars) have no significant difference according to the signed rank test. However, there are two issues that we have not been able to fully resolve.
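To make the ordering step concrete, here is a minimal sketch of how the Friedman average ranks and test statistic can be computed (higher scores are better, so rank 1 is best; the tie-correction factor for the statistic is omitted for brevity). The scores matrix is a made-up example, not data from our experiments, and this is not necessarily the exact code we used.

```cpp
// friedman_sketch.cpp -- minimal sketch of Friedman average ranks and the
// (uncorrected) Friedman chi-square statistic. Illustrative only.
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    // scores[d][j]: performance of method j on dataset d (hypothetical values).
    std::vector<std::vector<double>> scores = {
        {0.90, 0.85, 0.80},
        {0.70, 0.75, 0.60},
        {0.95, 0.90, 0.92},
    };
    const std::size_t N = scores.size();       // number of datasets
    const std::size_t k = scores[0].size();    // number of methods

    std::vector<double> avg_rank(k, 0.0);
    for (const auto& row : scores) {
        // Rank within each dataset: 1 = best (highest score), average ranks for ties.
        for (std::size_t j = 0; j < k; ++j) {
            double higher = 0.0, equal = 0.0;
            for (std::size_t l = 0; l < k; ++l) {
                if (row[l] > row[j]) higher += 1.0;
                if (row[l] == row[j]) equal += 1.0;   // includes j itself
            }
            avg_rank[j] += higher + (equal + 1.0) / 2.0;
        }
    }
    for (double& r : avg_rank) r /= static_cast<double>(N);

    // Friedman chi-square statistic; compare against the chi-square critical
    // value with k - 1 degrees of freedom at alpha = 0.05.
    double sum_sq = 0.0;
    for (double r : avg_rank) sum_sq += r * r;
    const double chi2 = 12.0 * N / (k * (k + 1.0)) *
                        (sum_sq - k * (k + 1.0) * (k + 1.0) / 4.0);

    for (std::size_t j = 0; j < k; ++j)
        std::cout << "method " << j << " average rank " << avg_rank[j] << '\n';
    std::cout << "Friedman chi-square = " << chi2 << '\n';
    return 0;
}
```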

The first issue concerns Holm correction. Since the signed rank test is a pairwise test, to control the family-wise error rate (FWER), some correction method must be used when conducting multiple such tests. Such corrections typically depend on the number of comparisons (i.e. the number of hypotheses, or the number of p-values) k. For example, consider Bonferroni correction, a more conservative yet simpler correction than Holm. For k comparisons, Bonferroni correction multiplies each of the k p-values by k to obtain the adjusted p-values. These adjusted (rather than the original) p-values are then compared with the pre-set alpha (say 0.05) to decide whether to reject each hypothesis.
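To make the correction step concrete, below is a minimal sketch of Bonferroni and (step-down) Holm adjustment for k p-values, with values capped at 1 and monotonicity enforced. It mirrors the commonly used procedure (e.g. R's p.adjust) rather than the exact code behind our results; the p-values in main are hypothetical.

```cpp
// adjust_sketch.cpp -- minimal sketch of Bonferroni and Holm p-value
// adjustment for k pairwise comparisons. Illustrative only.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

// Bonferroni: multiply every p-value by k (capped at 1).
std::vector<double> bonferroni(std::vector<double> p) {
    const double k = static_cast<double>(p.size());
    for (double& v : p) v = std::min(1.0, v * k);
    return p;
}

// Holm: sort p-values ascending, multiply the i-th smallest (0-based) by
// (k - i), then enforce monotonicity so adjusted values never decrease.
std::vector<double> holm(const std::vector<double>& p) {
    const std::size_t k = p.size();
    std::vector<std::size_t> order(k);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return p[a] < p[b]; });
    std::vector<double> adj(k);
    double running_max = 0.0;
    for (std::size_t i = 0; i < k; ++i) {
        const double v = std::min(1.0, p[order[i]] * static_cast<double>(k - i));
        running_max = std::max(running_max, v);
        adj[order[i]] = running_max;
    }
    return adj;
}

int main() {
    // Hypothetical p-values from a few pairwise signed rank tests.
    std::vector<double> p = {0.001, 0.010, 0.030, 0.040};
    for (double v : holm(p)) std::cout << v << ' ';   // prints 0.004 0.03 0.06 0.06
    std::cout << '\n';
    return 0;
}
```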

In our case, suppose there are m methods to compare (for example, m = 18 for the AL-based methods in our case); thus we need to perform m(m-1)/2 two-sided signed rank tests (153 tests when m = 18). If we use Bonferroni correction, intuitively we should multiply each p-value by m(m-1)/2. A similar procedure applies to Holm correction. However, in the following paper,

A. Benavoli, G. Corani, and F. Mangili, "Should we really use post-hoc tests based on mean-ranks?" JMLR, vol. 17, no. 5, pp. 1–10, 2016.

the authors suggest that we should instead multiply by m(m-1), seemingly because of the two-sided nature of the signed rank tests. Frankly, we find it hard to understand the rationale behind this, and we do not know how to take the two-sided nature into account for Holm correction. We have contacted the authors of the JMLR paper for their advice, and have also posted relevant questions on online forums such as Cross Validated and ResearchGate (https://www.researchgate.net/post/How_to_conduct_Holms_procedure_for_two-sided_Wilcoxon_signed_rank_test_when_conducting_pairwise_comparisons_for_multiple_classifiers), but have so far not been able to obtain a conclusive answer. Therefore, in our paper, we have used the value m(m-1)/2 for all Holm corrections, disregarding this issue. We have chosen to do so because it seems to conform with the R implementation of the procedure. However, we are open to suggestions on potentially better practices.

The second issue concerns a conflict between the Friedman test and the Wilcoxon signed rank test. In the diagram, the methods are displayed in order of their average rankings obtained from the Friedman test, with higher (i.e. smaller) rankings to the right. However, when forming cliques, the signed rank test is used. This presents a potential conflict. Suppose we have m' (3 <= m' <= m) methods, among which are A, B and C, where A ranks higher than B, and B ranks higher than C. In this case, due to the difference between the Friedman and signed rank tests, there is a chance that the signed rank p-value (after adjustment by Holm correction) between A and C is larger than that between A and B. It may even be the case that the signed rank test dictates that A and C should be in the same clique, while A and B should not. There is no way to plot the cliques in the diagram to accurately reflect this.

Our current way of handling this is as follows: as long as A and C are in the same clique, A is not in the same clique as the method ranked just below C, and C is not in the same clique as the method ranked just above A, we add a clique (black bar) from A to C, covering all methods between them. This may not be an ideal solution; however, we doubt that a better one exists. It may be that the critical difference diagram is innately unable to handle this case. Still, since as far as we know it is the best visualization tool for comparing multiple methods on multiple datasets, we currently settle on this solution. We welcome any suggestions on potentially better practices.
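To make our bar-drawing rule explicit, below is a minimal sketch of the clique-forming logic just described. Methods are assumed to be indexed in Friedman average-rank order, and same[i][j] marks pairs whose Holm-adjusted signed rank p-value is at or above alpha (no significant difference); the matrix in main is a made-up example, not our actual results.

```cpp
// clique_bars_sketch.cpp -- minimal sketch of the bar-drawing rule described
// above. Methods 0..m-1 are sorted by Friedman average rank; same[i][j] is
// true when methods i and j are NOT significantly different. Illustrative only.
#include <iostream>
#include <vector>

int main() {
    // Hypothetical 5-method example; same[i][j] must be symmetric.
    std::vector<std::vector<bool>> same = {
        {true,  true,  true,  false, false},
        {true,  true,  true,  true,  false},
        {true,  true,  true,  true,  false},
        {false, true,  true,  true,  true },
        {false, false, false, true,  true },
    };
    const int m = static_cast<int>(same.size());

    // Draw a bar from i to j (i < j) iff i and j are not significantly
    // different and the bar cannot be extended by one method on either side.
    for (int i = 0; i < m; ++i) {
        for (int j = i + 1; j < m; ++j) {
            const bool extend_right = (j + 1 < m) && same[i][j + 1];
            const bool extend_left  = (i - 1 >= 0) && same[i - 1][j];
            if (same[i][j] && !extend_right && !extend_left)
                std::cout << "bar from method " << i << " to method " << j << '\n';
        }
    }
    // For this example matrix, bars are drawn for 0-2, 1-3, and 3-4.
    return 0;
}
```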