StoRM (Stochastic Random Mutator)

A robust hyperparameter tuner for high-dimensional, categorically and/or conditionally parameterized, intractable optimization problems (e.g., neural networks)

Motivations of this tuner

Neural network hyperparameter optimization is an especially challenging task for a few different reasons:

Recent research suggests there is little reproducible evidence that any of today's state-of-the-art techniques significantly beat plain random search with some form of early stopping (https://arxiv.org/pdf/1902.07638.pdf).

How does this tuner attempt to solve these issues?

All of the issues mentioned above make it very difficult, if not impossible, to do any sort of intelligently guided search for NN architecture/training hyperparameters. That is why this tuner opts against attempting to build a surrogate function or gradient-based model of the search space, and instead aims for something simpler and hopefully more robust to the problems we're facing. The user shouldn't expect a magic algorithm that takes the fewest possible steps to reach the global minimum, but they should be able to expect something more efficient than random search.

The StoRM tuner can be thought of intuitively as a hybrid of grid search and random search, where the "distance" between the current best configuration and the next evaluation candidate is probability-based. We randomly mutate the current best configuration along different axes (and sometimes even multiple times along the same axis). The number of mutations made for the next evaluation candidate is based on a user-defined probability.

The default value for randomize_axis_factor is 0.5, which means there is a 50% chance that just one mutation will be made, a 25% chance that two mutations will be made, a 12.5% chance that three mutations will be made, and so on.

This approach aims to address the issues stated above by allowing enough freedom so that we do respect the complexity of the search space, while also probabilistically restricting how different the next evaluation candidate is from the current best, to provide some level of guidance and locality to the search.
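
To make these mutation probabilities concrete, here is a minimal illustrative sketch (not StoRM's actual internals) of how a mutation count with this distribution could be drawn from randomize_axis_factor:

import random

def sample_num_mutations(randomize_axis_factor=0.5):
    # Illustrative only: keep adding mutations while a coin flip succeeds.
    # With a factor of 0.5 this yields 1 mutation 50% of the time,
    # 2 mutations 25% of the time, 3 mutations 12.5% of the time, and so on.
    num_mutations = 1
    while random.random() < randomize_axis_factor:
        num_mutations += 1
    return num_mutations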

Installation

pip install storm-tuner

Usage

Here we define our hyperparameter space by providing our own configuration building method.

NOTE: The configuration building method is an important component of StoRM's functionality. Even though parameters can be accessed elsewhere, for example when the model is trained or during data preprocessing, all parameters must be defined in this method. By executing the model building function prior to any trial execution logic, we draw a configuration based upon which parameters are actually in use, and ensure we do not waste resources testing virtually identical configurations.

Once we have defined our HP space, it will usually make the most sense for the function to return an untrained model. However, one may opt to return more than a model in some circumstances (for example, an optimizer as well), or even to return nothing at all and build the model later. This is entirely up to the user.

All parameters take the form: hp.Param('parameter_name', [value1, value2...], ordered=False). Setting a parameter to ordered=True ensures the tuner can only select adjacent values in a single mutation step. This is an important feature for parameters with ordinal values.

Let's take a look at an example of a configuration building method.

# imports assumed for this example (TensorFlow's bundled Keras)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import SGD

def build_model(hp, *args):
    model = Sequential()

    # we can define train-time params in the build_model function to be used later on in run_trial
    hp.Param('batch_size', [32, 64, 128, 256], ordered=True)
    
    # StoRM works easily with loops as well
    for x in range(hp.Param('num_layers', [1, 2, 3, 4], ordered=True)):
        model.add(Dense(hp.Param('kernel_size_' + str(x), [50, 100, 200], ordered=True)))
    
    # here is a categorical parameter that most tuners do not do well with
    activation_choices = ['tanh', 'softsign', 'selu', 'relu', 'elu', 'softplus']
    model.add(Activation(hp.Param('activation', activation_choices)))
    
    # example of nested parameter
    if hp.Param('dropout', [True, False]):
        model.add(Dropout(hp.Param('dropout_val', [0.1, 0.2, 0.3, 0.4, 0.5], ordered=True)))

    model.add(Dense(1))

    model.compile(loss='mse', optimizer=SGD(momentum=0.9))
    return model

We are required to override the run_trial() method for our own Tuner implementation; this encapsulates the execution of a single trial. All the run_trial method needs to do is assign a score to the trial via self.score_trial(trial, score), using the supplied parameter configuration trial.hyperparameters. How the user generates a score for the supplied configuration is entirely up to them.

The self.build_fn(hp) function called in run_trial is what will supply us with a blank model (as mentioned above).

As we can see, any arguments you provide to the search() entry method can be accessed in your run_trial() method. The same is true for build_model, if any parameters need to be passed in at that scope.

Now let's take a look at an example of a run_trial method.

from storm_tuner import Tuner

class MyTuner(Tuner):

    def run_trial(self, trial, *args):
        # retrieve hyperparameters
        hp = trial.hyperparameters
        
        # retrieve any parameters supplied via main search method
        X_train, y_train, X_test, y_test = args[0], args[1], args[2], args[3]
        
        # create our model/configuration
        model = self.build_fn(hp)
        
        # train model
        history = model.fit(X_train,
                            y_train,
                            epochs=25,
                            validation_data=(X_test, y_test),
                            
                            # here we access a parameter at train time
                            batch_size=hp.values['batch_size'])
                            
        # calculate score
        score = history.history['val_loss'][-1]
        
        # assign score to the trial
        self.score_trial(trial, score)

We initialize our Tuner and provide our training data.

tuner = MyTuner(objective_direction='min', build_fn=build_model)
tuner.search(X_train, y_train, X_test, y_test)

Customizing the tuner

One of the main goals of this library is to save the user from having to tune the tuner itself. An excess of tunable variables can confuse the user, make intuitive documentation more difficult, and even have a substantial negative effect on the tuner's performance if they are not set correctly.

With this tuner there are two main adjustable parameters for customizing your search preferences...


    def __init__(self,
                 build_fn=build_model,
                 objective_direction='min',
                 init_random=10,
                 max_trials=100,
                 randomize_axis_factor=0.5)

init_random: How many initial iterations of purely random search to perform. This is helpful for getting the search to a decent starting configuration, so that we don't waste too much time descending from a suboptimal starting point. Recommended range: between 3 and 10.

randomize_axis_factor: The main exploration/exploitation tradeoff parameter (0 -> 1). A value closer to 1 means that steps will generally have more mutations. A value closer to 0 means steps are more likely to make only a single mutation. A value between 0.5 and 0.75 seems reasonable in most cases and will almost always be good enough.

NOTE: For both of these parameters, the more parameter codependency you expect and/or the more parameters you are optimizing, the higher you may want to set these values.
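
For example, a larger search over a highly codependent space might be configured with something like the following (illustrative values only, using only the constructor arguments shown above):

tuner = MyTuner(build_fn=build_model,
                objective_direction='min',
                init_random=10,
                max_trials=200,
                randomize_axis_factor=0.75)
tuner.search(X_train, y_train, X_test, y_test)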

StoRM's design goals

The StoRM tuner is designed to be as simple as possible. The tuner supplies a parameter configuration and the user assigns this configuration a score. We leave it up to the user to implement whatever scoring methodology fits their goals and use cases.

StoRM is designed to be as generic as possible, and there is nothing specific to neural networks or any particular ML library coded into this project. This freedom allows the user to optimize parameters used at various stages of the experiment: data pre-processing, architecture, training, etc.

Because of the tuner's experiment-agnostic approach, StoRM can be even more advantageous in branches of ML that use NNs for the model yet have an additional set of hyperparameters to optimize, which can make the search space even trickier for traditional approaches to handle; reinforcement learning is one example.

How to get the most out of StoRM

Of course, most of the success of StoRM revolves around the user's ability to parameterize the search space appropriately. StoRM will only work as well as the parameter space it operates on.

A few things to keep in mind when parameterizing your search space...

In most cases the selection of values should be fairly intuitive...

batch size: [32, 64, 128, 256]

momentum: [1 - pow(10, -0.5), 1 - pow(10, -1), 1 - pow(10, -1.5), 1 - pow(10, -2)]

kernel size: [50, 100, 150, 200, 250]

lr: [1e-2, 1e-3, 1e-4]
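
For reference, value lists like these could be declared inside a configuration building method with hp.Param, following the same pattern as the build_model example above (the parameter names here are just illustrative):

hp.Param('batch_size', [32, 64, 128, 256], ordered=True)
hp.Param('momentum', [1 - pow(10, -0.5), 1 - pow(10, -1), 1 - pow(10, -1.5), 1 - pow(10, -2)], ordered=True)
hp.Param('kernel_size', [50, 100, 150, 200, 250], ordered=True)
hp.Param('lr', [1e-2, 1e-3, 1e-4], ordered=True)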

At the end of the day, nothing stops the user from re-parameterizing their search space after narrowing in on promising areas from an initial, broader StoRM run.

StoRM is library-agnostic

Although the examples here use TensorFlow/Keras, as mentioned, StoRM works with any library or algorithm (scikit-learn, PyTorch, etc.). One simply defines the parameters to be optimized in build_fn. The user can decide to return a model right there and utilize StoRM's inline parameterization, or they can opt to use the parameters in run_trial.
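
As a minimal sketch of this flexibility (assuming scikit-learn; the SklearnTuner class and parameter names below are purely illustrative, not part of StoRM's API), build_fn can simply register parameters and run_trial can build and score whatever model it likes:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from storm_tuner import Tuner

def build_fn(hp, *args):
    # only register parameters here; the model is built later in run_trial
    hp.Param('n_estimators', [100, 200, 400], ordered=True)
    hp.Param('max_depth', [4, 8, 16, None])
    hp.Param('max_features', ['sqrt', 'log2', 1.0])

class SklearnTuner(Tuner):

    def run_trial(self, trial, *args):
        # parameter values were already drawn when StoRM executed build_fn (see NOTE above)
        hp = trial.hyperparameters
        X, y = args[0], args[1]
        model = RandomForestRegressor(n_estimators=hp.values['n_estimators'],
                                      max_depth=hp.values['max_depth'],
                                      max_features=hp.values['max_features'])
        # negate the CV score so a better fit becomes a smaller value to minimize
        score = -cross_val_score(model, X, y, scoring='neg_mean_squared_error').mean()
        self.score_trial(trial, score)

# given training arrays X, y
tuner = SklearnTuner(objective_direction='min', build_fn=build_fn)
tuner.search(X, y)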

What types of problems can StoRM be used for?

StoRM should be used for optimization problems where the parameter space can be high dimensional and has many categorical/conditional variables. StoRM should also be used when parameters do not need to be optimized at very fine levels, but rather we need to find good general choices. In short, StoRM will be most effective when there are many codependent decisions to be made.

StoRM will probably not be the best tuner to use if you are optimizing many real-valued parameters that always have an effect on the target function, have low codependencies, and are sensitive to small changes, such that we should offer the full real-valued spectrum rather than just a few bins to choose from. For these types of problems, Bayesian optimization will still be more effective.

Other notes/features

A StoRM Trial, like the one used in the run_trial method above, has a metrics dictionary that allows us to easily store any information pertinent to the trial for review later on.
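
For example, inside run_trial after training (a sketch that assumes the dictionary is exposed as trial.metrics):

# store extra per-trial information for later review
trial.metrics['epochs_trained'] = len(history.history['val_loss'])
trial.metrics['final_train_loss'] = history.history['loss'][-1]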

Benchmarking

Run compare_to_random_search.py from the examples directory to compare performance against random search for yourself.

Here we can see that over ten trials each, StoRM has a clear advantage.

tuned scores mean: 0.0013418583347811364 | stdev: 0.001806810901973602

random scores mean: 0.010490878883283586 | stdev: 0.006145158964894091

Contributing

Benchmarking, examples, bug submissions, and PRs all welcome!