K-means clustering with missing values

Implementation of the K-means clustering algorithm, for a dataset in which data points can have missing values for some coordinates.

We expect the following arguments to the class:

When computing the distance from each data point to the cluster centroids, we use the Euclidean distance. For data points with values for certain dimensions missing, we only use the known dimensions (overlap between known values of data point and centroid). When we recompute the cluster centroids, if all data points for a cluster have unknown values for a certain dimension, we set that dimension to unknown for the centroid as well.
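
The two rules above can be sketched in plain Python as follows. This is a minimal illustration, not the code of this package: the function names `masked_distance` and `recompute_centroid` are hypothetical, missing values are assumed to be represented as `None`, and returning infinity when a point and a centroid share no known dimension is an assumption of the sketch.

```python
import math


def masked_distance(point, centroid):
    """Euclidean distance over the dimensions known in both arguments."""
    shared = [(p, c) for p, c in zip(point, centroid)
              if p is not None and c is not None]
    if not shared:
        # Assumption of this sketch: no overlapping dimension means
        # the point is treated as infinitely far from the centroid.
        return math.inf
    return math.sqrt(sum((p - c) ** 2 for p, c in shared))


def recompute_centroid(members):
    """Per-dimension mean of the known values of a cluster's members.

    A dimension that is unknown for every member stays unknown (None).
    """
    centroid = []
    for values in zip(*members):
        known = [v for v in values if v is not None]
        centroid.append(sum(known) / len(known) if known else None)
    return centroid
```

For example, `masked_distance([1.0, None, 3.0], [4.0, 5.0, 7.0])` only compares the first and third dimensions. The sketch does not rescale the distance by the number of shared dimensions.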

For initialisation, we:

We iterate until no points change cluster anymore.
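
The stopping rule can be sketched as a loop that ends once an assignment pass leaves every point in the same cluster as before. This is an illustrative sketch under several assumptions: `run_until_stable` and `_distance` are hypothetical names, missing values are represented as `None`, the initial centroids are passed in by the caller, and the `max_iterations` cap is a safeguard added for the sketch only.

```python
import math


def _distance(point, centroid):
    # Euclidean distance over dimensions known in both point and centroid.
    shared = [(p, c) for p, c in zip(point, centroid)
              if p is not None and c is not None]
    if not shared:
        return math.inf  # assumption: no overlap counts as infinitely far
    return math.sqrt(sum((p - c) ** 2 for p, c in shared))


def run_until_stable(points, centroids, max_iterations=100):
    """Alternate assignment and centroid updates until nothing changes."""
    centroids = [list(c) for c in centroids]  # do not mutate the input
    assignments = [None] * len(points)
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: _distance(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so we are done
        assignments = new_assignments
        # Update step: per-dimension mean of the known member values.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if not members:
                continue  # empty clusters are not handled in this sketch
            centroids[k] = []
            for values in zip(*members):
                known = [v for v in values if v is not None]
                centroids[k].append(sum(known) / len(known) if known else None)
    return assignments, centroids
```

Comparing the full assignment list between passes is the simplest way to detect that no point changed cluster.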

The cluster assignments can then be retrieved from KMeans.cluster_assignments, or as a matrix from KMeans.clustering_results.

If a cluster becomes empty, we either reassign its centroid randomly ('random'), or move the data point that lies furthest from its current cluster centroid into the empty cluster ('singleton').
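
A rough sketch of the two strategies is given below. It is not the code of this package: `refill_empty_cluster` and `_distance` are hypothetical names, missing values are assumed to be `None`, and 'random' is interpreted here as copying a randomly chosen data point, which is only one possible reading of "randomly".

```python
import math
import random


def _distance(point, centroid):
    # Euclidean distance over dimensions known in both point and centroid.
    shared = [(p, c) for p, c in zip(point, centroid)
              if p is not None and c is not None]
    if not shared:
        return math.inf  # assumption: no overlap counts as infinitely far
    return math.sqrt(sum((p - c) ** 2 for p, c in shared))


def refill_empty_cluster(points, assignments, centroids, empty_k,
                         strategy, rng=random):
    """Return updated copies of (assignments, centroids) for cluster empty_k."""
    assignments = list(assignments)
    centroids = [list(c) for c in centroids]
    if strategy == 'random':
        # Assumption: the new centroid is a copy of a random data point.
        centroids[empty_k] = list(rng.choice(points))
    elif strategy == 'singleton':
        # Find the point furthest from the centroid it is assigned to.
        furthest = max(
            range(len(points)),
            key=lambda i: _distance(points[i], centroids[assignments[i]]),
        )
        # That point becomes the only member of the empty cluster.
        centroids[empty_k] = list(points[furthest])
        assignments[furthest] = empty_k
    else:
        raise ValueError("strategy must be 'random' or 'singleton'")
    return assignments, centroids
```

With 'singleton' the previously empty cluster is guaranteed one member straight away, whereas with 'random' it only receives members in the next assignment pass.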

Usage is as follows: