KMeansClusterer

k-means clustering in Ruby. Uses NArray under the hood for fast calculations.

Jump to the examples directory to see this in action.

Features

Install

gem install kmeans-clusterer

Usage

Simple example:

require 'kmeans-clusterer'

data = [[40.71,-74.01],[34.05,-118.24],[39.29,-76.61],
        [45.52,-122.68],[38.9,-77.04],[36.11,-115.17]]

labels = ['New York', 'Los Angeles', 'Baltimore', 
          'Portland', 'Washington DC', 'Las Vegas']

k = 2 # find 2 clusters in data

kmeans = KMeansClusterer.run k, data, labels: labels, runs: 5

kmeans.clusters.each do |cluster|
  names = cluster.points.map(&:label).join(', ')
  puts "#{cluster.id}. #{names}\t#{cluster.centroid}"
end

# Use existing clusters for prediction with new data:
predicted = kmeans.predict [[41.85,-87.65]] # Chicago
puts "\nClosest cluster to Chicago: #{predicted[0]}"

# Clustering quality score, in the range -1.0..1.0 (1.0 is best)
puts "\nSilhouette score: #{kmeans.silhouette.round(2)}"

Output of simple example:

0. New York, Baltimore, Washington DC [39.63, -75.89]
1. Los Angeles, Portland, Las Vegas [38.56, -118.7]

Closest cluster to Chicago: 0

Silhouette score: 0.91
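
For readers unfamiliar with the silhouette score reported above, here is a minimal plain-Ruby sketch of its standard definition. It is not the gem's implementation, which may differ (for example in its distance measure or its use of NArray). The helper names `euclidean` and `silhouette_score` are hypothetical, and the sketch assumes at least two clusters.

```ruby
# Euclidean distance between two points given as arrays of numbers.
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Mean silhouette over all points.
# assignments[i] is the cluster id of points[i]; at least two clusters are assumed.
def silhouette_score(points, assignments)
  # Map each cluster id to the indices of its member points.
  clusters = points.each_index.group_by { |i| assignments[i] }

  scores = points.each_index.map do |i|
    own = clusters[assignments[i]]
    next 0.0 if own.size == 1 # convention: a singleton cluster scores 0

    # a: mean distance to the other members of the point's own cluster
    a = (own - [i]).sum { |j| euclidean(points[i], points[j]) } / (own.size - 1)

    # b: smallest mean distance to the members of any other cluster
    b = clusters.reject { |id, _| id == assignments[i] }
                .map { |_, members| members.sum { |j| euclidean(points[i], points[j]) } / members.size }
                .min

    (b - a) / [a, b].max
  end

  scores.sum / scores.size
end

# Toy data (not from the example above): two tight, well-separated pairs.
points      = [[0, 0], [0, 1], [10, 0], [10, 1]]
assignments = [0, 0, 1, 1]
puts silhouette_score(points, assignments).round(2) # prints 0.9
```

A score near 1.0 means points sit much closer to their own cluster than to the nearest other cluster, which is why the tight toy clusters score about 0.9.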

Options

The following options can be passed in to KMeansClusterer.run:

| option           | default | description                                                                                                 |
|------------------|---------|-------------------------------------------------------------------------------------------------------------|
| :labels          | nil     | optional array of Ruby objects to collate with data array                                                   |
| :runs            | 10      | number of times to run kmeans                                                                               |
| :log             | false   | print stats after each run                                                                                  |
| :init            | :kmpp   | algorithm for picking initial cluster centroids. Accepts :kmpp, :random, or an array of k centroids         |
| :scale_data      | false   | scales features before clustering using formula (data - mean) / std                                         |
| :float_precision | :double | float precision to use. :double or :single                                                                  |
| :max_iter        | 300     | max iterations per run                                                                                      |
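
To make the `:scale_data` formula concrete, the following plain-Ruby sketch standardizes each feature (column) as (data - mean) / std. It is an illustration only: `standardize` is a hypothetical helper, and the use of the population standard deviation and the handling of constant columns are assumptions rather than documented behavior of the gem.

```ruby
# Standardize each column of a row-major dataset: (value - mean) / std.
def standardize(data)
  columns = data.transpose.map do |col|
    mean = col.sum.to_f / col.size
    # Population standard deviation (divides by n); an assumption of this sketch.
    std = Math.sqrt(col.sum { |v| (v - mean)**2 } / col.size)
    # A constant column has std 0; map it to 0.0 rather than dividing by zero.
    col.map { |v| std.zero? ? 0.0 : (v - mean) / std }
  end
  columns.transpose
end

p standardize([[1, 10], [3, 30]]) # prints [[-1.0, -1.0], [1.0, 1.0]]
```

After scaling, both features contribute on the same scale, so a feature with a large numeric range no longer dominates the distance calculation.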