Awesome

Selection pipeline for hctsa features

This is a pipeline for selecting small time-series feature sets from the comprehensive feature collection contained in the hctsa toolbox. Features are selected by their classification performance across a collection of time-series classification problems. The pipeline was used to generate the small feature set catch22 - CAnonical Time-series CHaracteristics based on the problems contained in the UEA/UCR time-series classification repository.

For information the pipeline and the catch22 feature set see our preprint:

C.H. Lubba, S.S. Sethi, P. Knaute, S.R. Schultz, B.D. Fulcher, N.S. Jones. catch22: CAnonical Time-series CHaracteristics. Data Mining and Knowledge Discovery (2019).

For information on the full hctsa library of over 7000 features, see the following (open-access) publications:

B.D. Fulcher and N.S. Jones. hctsa: A computational framework for automated time-series phenotyping using massive feature extraction. Cell Systems 5, 527 (2017).
B.D. Fulcher, M.A. Little, N.S. Jones Highly comparative time-series analysis: the empirical structure of time series and their methods. J. Roy. Soc. Interface 10, 83 (2013).

Running the pipeline

Computing the hctsa-matrices

The selection process relies on computed and normalized feature-matrices from the hctsa toolbox.

:wave::wave::wave: Computed data (using v0.97 of hctsa) that we used for our analysis can be downloaded from this figshare repository. :wave::wave::wave:

See hctsa for instructions on how to compute the construct hctsa files from your data, run the features and normalize the matrices (hctsa relies on Matlab). The computed, normalized HCTSA mat files should be placed into a folder called input_data inside the op_importance folder with file names HCSTA_<dataset name>_N.mat.

Running the `op_importance` pipeline

The pipeline can be launched from the op_importance directory as

python Workflow.py <runtype>

Where <runtype> is a string composed of 2-3 parts delimited by an underscore: <classifier>_<normalisation>(_null). Where <classifier> selects the classifier type used among svm, dectree, linear and normalisation is either scaledrobustsigmoid or maxmin. An appended _null in the <runtype>-string means that distributions of classification accuracies for each feature are generated in a permutation-based procedure that shuffles the labels of the classification problems.

First step: compute null accuracy distributions

First, null distributions need to by generated by e.g.,

python Workflow.py dectree_maxmin_null

This can take long as 1000 classification runs are done on each dataset. It it preferable to do this computation on a cluster.

Make sure that compute_features = True is set in the main function of Workflow.py.

Second step: compute valid accuracy distributions

python Workflow.py dectree_maxmin_null

Third step: further analyses

Once the valid and null accuracies have been computed, all following analyses can be run without re-classification by setting compute_features = False.

See below for an example output of the pipeline that plots correlations in performance across datasets of the 500 best features as well as the clusters they end up in.

Example output