weka.clusterers
Class SpectralClusterer

weka.clusterers.SpectralClusterer

public class SpectralClusterer

Spectral clusterer class. For more information see:

This implementation assumes that the data instances are points in an Euclidean space. The algorithm is based on the concept of similarity between points instead of distance. Given two points x and y and their Euclidean distance d(x, y), their similarity is defined as exp(- d(x, y)^2 / (2 * sigma^2)), where sigma is a scaling factor (its default value is 1.0).

There is a distance cut factor r, if the distance d(x, y) between two points is greater than r then their similarity is set to 0. This parameter combined with the use of sparse matrices can improve the performances w.r.t. the memory.

To classify a new instance w.r.t. the partitions found this implementation applies a naive min-distance algorithm that assigns the instance to the cluster that contains the nearest point. Since the distance to similarity function is bijective and monotone the nearest point is also the most similar one.

Valid options are:

This implementation relies on the COLT numeric package for Java written by Wolfgang Hoschek. For other information about COLT see its home page at http://nicewww.cern.ch/~hoschek/colt/index.htm.

According to the license, the copyright notice is reported below:
   Written by Wolfgang Hoschek. Check the Colt home page for more info.
   Copyright © 1999 CERN - European Organization for Nuclear Research.
   Permission to use, copy, modify, distribute and sell this software and
   its documentation for any purpose is hereby granted without fee,
   provided that the above copyright notice appear in all copies and that
   both that copyright notice and this permission notice appear in
   supporting documentation. CERN makes no representations about
   the suitability of this software for any purpose. It is provided "as is"
   without expressed or implied warranty.
 

This software is issued under GNU General Public License.

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Author:
Luigi Dragone (luigi@luigidragone.com)
See Also:
COLT

Field Summary
protected  double alpha_star
          The alpha star parameter value
protected  int[] cluster
          The class number of each point in the dataset
protected  int numOfClusters
          The number of different clusters found
protected static java.util.Vector options
           
protected  double r
          The distance cut factor
protected  double sigma
          The sigma scaling factor
protected  boolean useSparseMatrix
          The using sparse matrix flag
protected  cern.colt.matrix.DoubleMatrix2D v
          The points of the dataset
 
Constructor Summary
SpectralClusterer()
          Constructor.
 
Method Summary
 java.lang.String alphaStarTipText()
          Returns the tip text for this property
protected static double asso(cern.colt.matrix.DoubleMatrix2D W, int[] a, int[] b)
          Computes the association degree between two partitions of a graph.
The association degree is defined as the sum of the weights of all the edges between points of the two partitions.
protected static int[][] bestCut(cern.colt.matrix.DoubleMatrix2D W)
          Returns the best cut of a graph w.r.t. the degree of dissimilarity between points of different partitions and the degree of similarity between points of the same partition.
 void buildClusterer(weka.clusterers.Instances data)
          Generates a clusterer by the mean of spectral clustering algorithm.
 int clusterInstance(weka.clusterers.Instance instance)
          Classifies an instance w.r.t. the partitions found.
protected static double distnorm2(cern.colt.matrix.DoubleMatrix1D x, cern.colt.matrix.DoubleMatrix1D y)
          Returns the Euclidean distance between two points.
 double getAlphaStar()
          Returns the current value of the alpha star factor.
 java.lang.String[] getOptions()
          Gets the current settings of the options.
 double getR()
          Returns the current value of the r distance cur parameter.
 double getSigma()
          Returns the current value of the sigma factor.
 boolean getUseSparseMatrix()
          Returns the status of using sparse matrix flag.
 java.lang.String globalInfo()
          Returns a string describing this clusterer
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
protected static int[] merge(int[] a, int[] b)
          Merges two sets of points represented as integer vectors.
protected static double Nasso(cern.colt.matrix.DoubleMatrix2D W, int[] a, int[] b)
          Returns the normalized association degree between two partitions of a graph.
protected static double Nasso(cern.colt.matrix.DoubleMatrix2D W, int[] a, int[] b, int[] v)
          Returns the normalized association degree between two partitions of a graph w.r.t. a given subgraph.
protected static double Ncut(cern.colt.matrix.DoubleMatrix2D W, int[] a, int[] b)
          Returns the normalized dissimilarity degree (or cut) between two partitions of a graph.
protected static double Ncut(cern.colt.matrix.DoubleMatrix2D W, int[] a, int[] b, int[] v)
          Returns the normalized dissimilarity degree (or cut) between two partitions of a graph w.r.t. a given subgraph.
 int numberOfClusters()
          Returns the number of clusters found.
protected static int[][] partition(cern.colt.matrix.DoubleMatrix2D W, double alpha_star)
          Splits recursively the points of the graph while the value of the best cut found is less of a specified limit (the alpha star factor).
 java.lang.String rTipText()
          Returns the tip text for this property.
 void setAlphaStar(double alpha_star)
          Sets the new value of the alpha star factor.
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setR(double r)
          Sets the new value of the r distance cut parameter.
 void setSigma(double sigma)
          Sets the new value of the sigma scaling factor.
 void setUseSparseMatrix(boolean useSparseMatrix)
          Sets the use of sparse representation for similarity matrix.
 java.lang.String sigmaTipText()
          Returns the tip text for this property.
 java.lang.String useSparseMatrixTipText()
          Returns the tip text for this property.
 

Field Detail

v

protected cern.colt.matrix.DoubleMatrix2D v
The points of the dataset

cluster

protected int[] cluster
The class number of each point in the dataset

numOfClusters

protected int numOfClusters
The number of different clusters found

alpha_star

protected double alpha_star
The alpha star parameter value

r

protected double r
The distance cut factor

sigma

protected double sigma
The sigma scaling factor

useSparseMatrix

protected boolean useSparseMatrix
The using sparse matrix flag

options

protected static java.util.Vector options
Constructor Detail

SpectralClusterer

public SpectralClusterer()
Constructor.
Method Detail

distnorm2

protected static double distnorm2(cern.colt.matrix.DoubleMatrix1D x,
                                  cern.colt.matrix.DoubleMatrix1D y)
Returns the Euclidean distance between two points. It is used to compute the similarity degree of these ones.
Parameters:
x - the first point
y - the second point
Returns:
the Euclidean distance between the points

merge

protected static int[] merge(int[] a,
                             int[] b)
Merges two sets of points represented as integer vectors. The sets are not overlapped.
Parameters:
a - the first set of points
b - the second set of points
Returns:
the union of the two sets

asso

protected static double asso(cern.colt.matrix.DoubleMatrix2D W,
                             int[] a,
                             int[] b)
Computes the association degree between two partitions of a graph.
The association degree is defined as the sum of the weights of all the edges between points of the two partitions.
Parameters:
W - the weight matrix of the graph
a - the points of the first partition
b - the points of the second partition
Returns:
the association degree

Nasso

protected static double Nasso(cern.colt.matrix.DoubleMatrix2D W,
                              int[] a,
                              int[] b)
Returns the normalized association degree between two partitions of a graph.
Parameters:
W - the weight matrix of the graph
a - the points of the first partition
b - the points of the second partition
Returns:
the normalized association degree

Nasso

protected static double Nasso(cern.colt.matrix.DoubleMatrix2D W,
                              int[] a,
                              int[] b,
                              int[] v)
Returns the normalized association degree between two partitions of a graph w.r.t. a given subgraph.
Parameters:
W - the weight matrix of the graph
a - the points of the first partition
b - the points of the second partition
v - the points of the subgraph
Returns:
the normalized association degree

Ncut

protected static double Ncut(cern.colt.matrix.DoubleMatrix2D W,
                             int[] a,
                             int[] b)
Returns the normalized dissimilarity degree (or cut) between two partitions of a graph.
Parameters:
W - the weight matrix of the graph
a - the points of the first partition
b - the points of the second partition
Returns:
the normalized cut

Ncut

protected static double Ncut(cern.colt.matrix.DoubleMatrix2D W,
                             int[] a,
                             int[] b,
                             int[] v)
Returns the normalized dissimilarity degree (or cut) between two partitions of a graph w.r.t. a given subgraph.
Parameters:
W - the weight matrix of the graph
a - the points of the first partition
b - the points of the second partition
v - the points of the subgraph.
Returns:
the normalized cut

bestCut

protected static int[][] bestCut(cern.colt.matrix.DoubleMatrix2D W)
Returns the best cut of a graph w.r.t. the degree of dissimilarity between points of different partitions and the degree of similarity between points of the same partition.
Parameters:
W - the weight matrix of the graph
Returns:
an array of two elements, each of these contains the points of a partition

partition

protected static int[][] partition(cern.colt.matrix.DoubleMatrix2D W,
                                   double alpha_star)
Splits recursively the points of the graph while the value of the best cut found is less of a specified limit (the alpha star factor).
Parameters:
W - the weight matrix of the graph
alpha_star - the alpha star factor
Returns:
an array of sets of points (partitions)

numberOfClusters

public int numberOfClusters()
                     throws java.lang.Exception
Returns the number of clusters found.
Returns:
the number of clusters

clusterInstance

public int clusterInstance(weka.clusterers.Instance instance)
                    throws java.lang.Exception
Classifies an instance w.r.t. the partitions found. It applies a naive min-distance algorithm.
Parameters:
instance - the instance to classify
Returns:
the cluster that contains the nearest point to the instance

buildClusterer

public void buildClusterer(weka.clusterers.Instances data)
                    throws java.lang.Exception
Generates a clusterer by the mean of spectral clustering algorithm.
Parameters:
data - set of instances serving as training data
Throws:
java.lang.Exception - if the clusterer has not been generated successfully

globalInfo

public java.lang.String globalInfo()
Returns a string describing this clusterer
Returns:
a description of the evaluator suitable for displaying in the explorer/experimenter gui

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Returns:
an enumeration of all the available options

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options.
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

getOptions

public java.lang.String[] getOptions()
Gets the current settings of the options.
Returns:
an array of strings suitable for passing to setOptions()

setAlphaStar

public void setAlphaStar(double alpha_star)
                  throws java.lang.Exception
Sets the new value of the alpha star factor.
Parameters:
alpah_star - the new value (0 < alpha_star < 1)
Throws:
java.lang.Exception - if alpha_star is not between 0 and 1

getAlphaStar

public double getAlphaStar()
Returns the current value of the alpha star factor.
Returns:
the alpha star factor

alphaStarTipText

public java.lang.String alphaStarTipText()
Returns the tip text for this property
Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setSigma

public void setSigma(double sigma)
              throws java.lang.Exception
Sets the new value of the sigma scaling factor.
Parameters:
sigma - the new value (sigma > 0)
Throws:
java.lang.Exception - if sigma is not strictly greater than 0

getSigma

public double getSigma()
Returns the current value of the sigma factor.
Returns:
the sigma factor

sigmaTipText

public java.lang.String sigmaTipText()
Returns the tip text for this property.
Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setR

public void setR(double r)
          throws java.lang.Exception
Sets the new value of the r distance cut parameter.
Parameters:
r - the new value (r > 0 || r == -1)
Throws:
java.lang.Exception - if r is not -1 and r is not a positive number

getR

public double getR()
Returns the current value of the r distance cur parameter.
Returns:
the r distance cut parameter

rTipText

public java.lang.String rTipText()
Returns the tip text for this property.
Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setUseSparseMatrix

public void setUseSparseMatrix(boolean useSparseMatrix)
Sets the use of sparse representation for similarity matrix.
Parameters:
useSparseMatrix - true for sparse matrix representation

getUseSparseMatrix

public boolean getUseSparseMatrix()
Returns the status of using sparse matrix flag.
Returns:
the status of using sparse matrix flag

useSparseMatrixTipText

public java.lang.String useSparseMatrixTipText()
Returns the tip text for this property.
Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui