Active Learning Toolbox for MATLAB

This software package provides a toolbox for testing pool-based active-learning algorithms in MATLAB.

Active Learning

Specifically, we consider the following scenario. There is a pool of datapoints X. We may successively select a set of points x in X to observe. Each observation reveals a discrete, integer-valued label y in L for x. This labeling process might be nondeterministic; we might choose the same point x twice and observe different labels each time. In active learning, we typically assume we have a budget B that limits the number of points we may observe.

Our goal is to iteratively build a set of observations

D = (X, Y)

that achieves some objective efficiently. One typical objective is for this training set to let us accurately predict the labels of the unobserved points. Assume we have a probabilistic model

p(y | x, D)

and let U represent the set of unobserved points, that is, the pool with the points already observed in D removed. We might wish to minimize either the 0/1 loss on the unlabeled points

\sum_{x in U} (\hat{y} \neq y),

where \hat{y} = \argmax_y p(y | x, D), or the log loss:

\sum_{x in U} -\log p(y | x, D).
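
As a rough illustration of the two losses above, the following sketch evaluates them on a made-up predictive distribution. The variable names (probabilities, true_labels) and all numbers are assumptions chosen for this example, not part of the toolbox.

```matlab
% Toy example: 4 unobserved points, 3 classes (all values are made up).
% Row i holds the model's predictive distribution p(y | x_i, D).
probabilities = [0.7 0.2 0.1;
                 0.1 0.6 0.3;
                 0.3 0.3 0.4;
                 0.5 0.4 0.1];
true_labels = [1; 2; 1; 2];   % hypothetical ground truth for the points in U

% 0/1 loss: count the points where the most probable label is wrong
[~, predictions] = max(probabilities, [], 2);
zero_one_loss = sum(predictions ~= true_labels);

% log loss: negative log probability assigned to each true label
num_points = size(probabilities, 1);
true_ind = sub2ind(size(probabilities), (1:num_points)', true_labels);
log_loss = -sum(log(probabilities(true_ind)));
```

Here the model mislabels the last two points, so the 0/1 loss is 2, while the log loss also penalizes the low confidence on correctly labeled points.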

We could sample a random set of B points, but by careful consideration of our observation locations, we hope we can do significantly better than this. One common active learning strategy, known as uncertainty sampling, iteratively chooses to make an observation at the point with the largest marginal entropy given the current data:

x* = \argmax_x H(y | x, D),

with the hope that these queries can better map out the boundaries between classes.
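
A minimal sketch of one uncertainty sampling step, assuming the model's predictive probabilities for the candidate points are already available as a matrix. The names probabilities and candidate_ind, and the numbers, are hypothetical.

```matlab
% Predictive distributions for 3 candidate points over 3 classes (made up).
probabilities = [0.90 0.05 0.05;
                 0.40 0.35 0.25;
                 0.60 0.30 0.10];
candidate_ind = [4; 7; 9];   % hypothetical indices of these points in the pool

% marginal entropy H(y | x, D) of each row; treat 0 * log(0) as 0
terms = probabilities .* log(probabilities);
terms(probabilities == 0) = 0;
entropies = -sum(terms, 2);

% query the candidate whose label the model is least sure about
[~, best] = max(entropies);
x_star = candidate_ind(best);
```

The second row is closest to uniform, so it has the largest entropy and its pool index is the one selected.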

Of course, there are countless goals besides minimizing generalization error and numerous other strategies besides the highly myopic uncertainty sampling. Indeed, many active learning scenarios might not involve probability models at all. Providing a highly adaptable and extensible toolbox for conducting arbitrary pool-based active learning experiments is the goal of this project.

Using this Toolbox

The most important function is active_learning, which simulates an active learning experiment using the following procedure:

Given: initially labeled points X,
       corresponding labels Y,
       budget B

for i = 1:B
  % find points available for labeling
  eligible_points = selector(X, Y)

  % decide on point(s) to observe
  x_star = query_strategy(X, Y, eligible_points)

  % observe point(s)
  y_star = label_oracle(x_star)

  % add observation(s) to training set
  X = [X, x_star]
  Y = [Y, y_star]
end
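
The procedure above can be sketched as a small runnable script. This is only an illustration under stated assumptions: it does not call active_learning, the three handles are hypothetical stand-ins with simplified signatures, and the pool and labels are made up.

```matlab
% Toy pool: 6 one-dimensional points with hidden labels (made-up data).
pool_points = [0.1; 0.2; 0.4; 0.6; 0.8; 0.9];
hidden_labels = [1; 1; 1; 2; 2; 2];

% Hypothetical stand-ins for the three user-specified components.
% They work on indices into the pool rather than on the points themselves.
selector = @(train_ind) setdiff((1:numel(pool_points))', train_ind);
query_strategy = @(train_ind, eligible_ind) eligible_ind(1);  % first eligible point
label_oracle = @(query_ind) hidden_labels(query_ind);

train_ind = [1; 6];                       % initially labeled points
observed_labels = hidden_labels(train_ind);
budget = 3;

for i = 1:budget
  eligible_ind = selector(train_ind);                    % points available for labeling
  query_ind = query_strategy(train_ind, eligible_ind);   % point to observe
  new_label = label_oracle(query_ind);                   % observe it
  train_ind = [train_ind; query_ind];                    % grow the training set
  observed_labels = [observed_labels; new_label];
end
```

After the loop, train_ind holds the two initial indices followed by the three queried ones, and observed_labels holds the matching labels.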

The implementation supports user-specified selectors, query strategies, and label oracles.

Each of these is provided as a function handle satisfying the API described below.

This function also supports arbitrary user-specified callbacks called after each round of the experiment. This can be useful, for example, for plotting the progress of the algorithm and/or printing statistics such as test error online.

Selectors

A selector considers the current labeled dataset and indicates which of the unlabeled points should be considered for observation at this time.

Selectors must satisfy the following interface:

test_ind = selector(problem, train_ind, observed_labels)
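
As an illustration of this interface, here is a sketch of a selector that marks every not-yet-observed point as eligible. The struct field name points and the handle name unlabeled_selector are assumptions made for this example; check the toolbox source for the selectors it actually ships.

```matlab
% Hypothetical problem struct; the field name "points" is an assumption.
problem.points = rand(5, 2);

% Selector sketch: every point not yet observed is eligible.
% observed_labels is accepted to match the interface but is not used here.
unlabeled_selector = @(problem, train_ind, observed_labels) ...
    setdiff((1:size(problem.points, 1))', train_ind(:));

% points 2 and 4 are already labeled, so 1, 3, and 5 remain eligible
test_ind = unlabeled_selector(problem, [2; 4], [1; 2]);
```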

Inputs:

Output:

The following general-purpose selectors are provided in this toolbox:

In addition, the following "meta" selectors are provided, which combine or modify the outputs of other selectors:

Query Strategies

Query strategies select which of the points currently eligible for labeling (returned by a selector) should be observed next.

Query strategies must satisfy the following interface:

query_ind = query_strategy(problem, train_ind, observed_labels, test_ind)
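
A sketch of a function matching this interface, assuming nothing beyond the four arguments shown: it picks one eligible point uniformly at random. The handle name random_query and the field name points are hypothetical.

```matlab
% Query strategy sketch: choose one of the eligible points uniformly at random.
% problem, train_ind, and observed_labels are accepted only to match the interface.
random_query = @(problem, train_ind, observed_labels, test_ind) ...
    test_ind(randi(numel(test_ind)));

problem.points = rand(5, 2);   % hypothetical problem struct
query_ind = random_query(problem, [2; 4], [1; 2], [1; 3; 5]);
```

A random baseline like this is a useful point of comparison when judging whether a more elaborate strategy actually helps.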

Inputs:

Output:

The following query strategies are provided in this toolbox:

Label Oracles

Label oracles are functions that, given a set of points chosen to be queried, return a set of corresponding labels. In general, they need not be deterministic, which is especially interesting when points can be queried multiple times.

Label oracles must satisfy the following interface:

label = label_oracle(problem, query_ind)
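
To make the interface concrete, the sketch below shows a deterministic lookup oracle and a noisy binary oracle. All names (true_labels, flip_probability, lookup_oracle, noisy_oracle) are hypothetical, and the ground truth is made up.

```matlab
true_labels = [1; 1; 2; 2; 1];   % made-up ground truth for a 5-point pool
flip_probability = 0.2;          % hypothetical noise level

% deterministic oracle: look the label up
lookup_oracle = @(problem, query_ind) true_labels(query_ind);

% helper: swap labels 1 and 2 wherever flips is true
flipped = @(labels, flips) labels .* ~flips + (3 - labels) .* flips;

% noisy binary oracle: each returned label is flipped independently
noisy_oracle = @(problem, query_ind) ...
    flipped(true_labels(query_ind), rand(numel(query_ind), 1) < flip_probability);

problem = struct();                                  % unused by these sketches
label = lookup_oracle(problem, [3; 1]);              % same answer every time
noisy_label = noisy_oracle(problem, [3; 3; 3]);      % repeated queries may disagree
```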

Inputs:

Output:

The following general-purpose label oracles are provided in this toolbox: