MISSION: Ultra Large-Scale Feature Selection using Count-Sketches
An ICML 2018 paper by Amirali Aghazadeh*, Ryan Spring*, Daniel LeJeune, Gautam Dasarathy, Anshumali Shrivastava, and Richard G. Baraniuk.
* These authors contributed equally and are listed alphabetically.
How-To-Run + Code Versions
- All data files are formatted using the VW (Vowpal Wabbit) input format (see the example line below)
- Build the executables by running the provided Makefile (make)
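For reference, each line in the VW input format is a label, a "|" separator, and a list of feature[:value] tokens, where a feature without an explicit value defaults to 1. A purely hypothetical example line, with made-up feature names:

+1 | featA:1.0 featB featC:0.5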
- Mission Logistic Regression
// Hyperparameters
// Size of Top-K Heap
const size_t TOPK = (1 << 14) - 1;
// Size of Count-Sketch Array
const size_t D = (1 << 18) - 1;
// Number of Arrays in Count-Sketch
const size_t N = 3;
// Learning Rate
const float LR = 5e-1;
./mission_logistic train_data test_data
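At a high level, each MISSION update adds a feature's gradient into a count-sketch, reads the feature's weight back as the median of its signed counters, and keeps only the heaviest features in a top-k heap. The following is a minimal, self-contained sketch of the count-sketch piece in C++; the hash functions and names are simplified assumptions, not the repository's actual implementation.

// Minimal count-sketch illustration (assumed simplification, not the repository's code).
#include <algorithm>
#include <cstddef>
#include <cstdint>

const size_t N = 3;                 // number of count-sketch arrays
const size_t D = (1 << 18) - 1;     // size of each array
const size_t TOPK = (1 << 14) - 1;  // number of features kept in the heap

static float sketch[N][D];          // the count-sketch counters

// Simple multiply-shift style hashes; placeholders for the real hash family.
inline size_t bucket(size_t row, uint64_t key) {
    return (key * (0x9E3779B97F4A7C15ULL + 2 * row + 1)) % D;
}
inline float sign(size_t row, uint64_t key) {
    return ((key * (0xC2B2AE3D27D4EB4FULL + 2 * row + 1)) >> 63) ? 1.0f : -1.0f;
}

// Add a (stochastic) gradient update for one feature into every sketch row.
void update(uint64_t key, float grad) {
    for (size_t row = 0; row < N; ++row)
        sketch[row][bucket(row, key)] += sign(row, key) * grad;
}

// Estimate the feature's weight as the median of its N signed counters.
float query(uint64_t key) {
    float est[N];
    for (size_t row = 0; row < N; ++row)
        est[row] = sign(row, key) * sketch[row][bucket(row, key)];
    std::sort(est, est + N);
    return est[N / 2];
}

Training would then call update() with the logistic-loss gradient of every nonzero feature in a sample and retain the TOPK keys with the largest |query(key)| in a min-heap; only those features form the selected feature set.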
- Fine-Grained Mission Softmax Regression
// Hyperparameters
// Size of Top-K Heap
const size_t TOPK = (1 << 20) - 1;
// Number of Classes
const size_t K = 193;
// Size of Count-Sketch Array
const size_t D = (1 << 24) - 1;
// Number of Arrays in Count-Sketch
const size_t N = 3;
// Learning Rate
const float LR = 1e-2;
// Length of String Feature Representation
const size_t LEN = 12;
./fine_mission_softmax train_data test_data
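The LEN constant suggests that each raw feature is a fixed-length string (e.g., a 12-character substring of a DNA read) that must be mapped to an integer key before it reaches the sketch. A hedged illustration of such a mapping, using FNV-1a as a stand-in hash (the repository may hash its string features differently):

// Assumed illustration: hash one fixed-length string feature (e.g., a 12-character
// k-mer) into a 64-bit key. FNV-1a is only a placeholder for the real feature hash.
#include <cstddef>
#include <cstdint>
#include <string>

const size_t LEN = 12;  // length of the string feature representation

uint64_t feature_key(const std::string& token) {
    uint64_t h = 1469598103934665603ULL;  // FNV offset basis
    for (size_t i = 0; i < LEN && i < token.size(); ++i) {
        h ^= static_cast<unsigned char>(token[i]);
        h *= 1099511628211ULL;            // FNV prime
    }
    return h;
}

In the fine-grained variant, the resulting keys feed one top-k heap per class; in the coarse-grained variant below, they feed a single shared heap.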
- Coarse-Grained Mission Softmax Regression
// Hyperparameters
// Size of Top-K Heap
const size_t TOPK = (1 << 22) - 1;
// Number of Classes
const size_t K = 193;
// Size of Count-Sketch Array
const size_t D = (1 << 24) - 1;
// Number of Arrays in Count-Sketch
const size_t N = 3;
// Learning Rate
const float LR = 1e-1;
// Length of String Feature Representation
const size_t LEN = 12;
./coarse_mission_softmax [train_data_part_1 train_data_part_2 ... train_data_part_n] test_data
- Feature Hashing Softmax Regression
// Hyperparameters
// Number of Classes
const size_t K = 193;
// Size of Count-Sketch Array
const size_t D = (1 << 24) - 1;
// Learning Rate
const float LR = 1e-2;
// Length of String Feature Representation
const size_t LEN = 12;
./softmax [train_data_part_1 train_data_part_2 ... train_data_part_n] test_data
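The feature hashing baseline drops both the count-sketch and the top-k heap: every feature key is hashed straight into a fixed-size weight table, and hash collisions are simply tolerated. A minimal sketch under that assumption (the layout and names are illustrative, not the repository's):

// Assumed sketch of the feature-hashing baseline: one weight per (bucket, class) pair,
// no count-sketch and no top-k heap; colliding features share a weight.
#include <cstddef>
#include <cstdint>
#include <vector>

struct HashedSoftmax {
    size_t K, D;           // classes and hashed buckets (repo settings: K = 193, D = (1 << 24) - 1)
    std::vector<float> w;  // D buckets x K per-class weights
    HashedSoftmax(size_t k, size_t d) : K(k), D(d), w(k * d, 0.0f) {}

    size_t bucket(uint64_t key) const { return key % D; }

    // Logit of class c for an example given as already-hashed feature keys.
    float logit(size_t c, const std::vector<uint64_t>& keys) const {
        float z = 0.0f;
        for (uint64_t key : keys) z += w[bucket(key) * K + c];
        return z;
    }

    // SGD step: move each active bucket's class-c weight against the gradient.
    void update(size_t c, const std::vector<uint64_t>& keys, float grad, float lr) {
        for (uint64_t key : keys) w[bucket(key) * K + c] -= lr * grad;
    }
};

If the table is stored densely like this, the repo settings imply on the order of K x D (roughly 3 billion) weights, which is one reason the streaming optimizations below matter; the actual repository layout may differ.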
Optimizations
- Mission streams in the dataset via Memory-Mapped I/O instead of loading everything directly into memory; this is necessary for Tera-Scale datasets.
- AVX SIMD optimization for fast Softmax Regression.
- The code is currently optimized for the Splice-Site and DNA Metagenomics datasets.
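Memory-mapped I/O lets the OS page the training file in and out on demand, so only the pages currently being read occupy RAM even when the file itself is terabytes. A minimal POSIX sketch of the idea (an assumed illustration; the repository's actual loader and parser are not shown):

// Minimal POSIX memory-mapped streaming sketch (assumed illustration, not the repo's loader).
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s data_file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

    // Map the file read-only; pages are faulted in lazily as they are touched.
    char* data = static_cast<char*>(mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Hint that access is sequential so the kernel can read ahead and drop pages
    // behind the cursor -- this is what keeps tera-scale files manageable.
    madvise(data, st.st_size, MADV_SEQUENTIAL);

    size_t lines = 0;
    for (off_t i = 0; i < st.st_size; ++i)
        if (data[i] == '\n') ++lines;  // a real loader would parse VW records here
    std::printf("%zu lines\n", lines);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}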
Mission Softmax Regression
- Fine-Grained Feature Set - Each class maintains a separate feature set, so there is a top-k heap for each class.
- Coarse-Grained Feature Set - All the classes share a common set of features, so there is only one top-k heap. Each feature is measured by its L1 Norm across all classes.
- Data Parallelism - Each worker maintains a separate heap, while aggregating gradients in the same count-sketch.
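Concretely, the coarse-grained variant scores a feature by the L1 norm of its estimated weights across all K classes and keeps one shared heap of the highest-scoring keys, while the fine-grained variant keeps K heaps keyed on each class's own estimate. A hedged sketch of the coarse-grained scoring (query() is a stub standing in for the per-class count-sketch estimate, and de-duplication of keys already in the heap is omitted):

// Assumed illustration of coarse-grained feature ranking: a feature's importance is the
// L1 norm of its estimated weights over all classes; one shared min-heap keeps the top keys.
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

const size_t K = 193;               // number of classes
const size_t TOPK = (1 << 22) - 1;  // size of the shared top-k heap

// Stub standing in for the per-class count-sketch median estimate of the feature's weight.
float query(size_t cls, uint64_t key) { (void)cls; (void)key; return 0.0f; }

// Coarse-grained score: sum of |estimated weight| over all classes.
float l1_score(uint64_t key) {
    float s = 0.0f;
    for (size_t c = 0; c < K; ++c) s += std::fabs(query(c, key));
    return s;
}

// One shared min-heap of (score, key): when full, the weakest feature is evicted.
using Entry = std::pair<float, uint64_t>;
std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

void offer(uint64_t key) {
    heap.emplace(l1_score(key), key);
    if (heap.size() > TOPK) heap.pop();
}

Under data parallelism, each worker would keep its own such heap while all workers add their gradients into the same count-sketch, which is safe because sketch updates are commutative additions.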