Spectral Clusterer for WEKA
- Overview
- Change log
- Spectral Clustering
- Release notes
- System requirements
- Installation instructions
- Build instructions
- Documentation
- References
Overview
This is preliminary, but fully functional, implementation of the Spectral Clustering algorithm for the WEKA framework.
WEKA is an Open Source Knowledge Discovering and Data Mining system developed in Java by the University of Waikato in New Zealand It offers many algorithms and tools commonly available only in commercial systems, but, differently from the last ones, it is open and freely expandable.
The implementation relies on Java linear algebra package called COLT and written by Wolfgang Hoschek at CERN. WEKA is released under the GNU General Public License, while COLT is released under ad-hoc, but GNU GPL compatible, license, then the Spectral Clusterer is consequently released under GNU General Public License. It can be freely used, modified and distributed in conformity with the GNU General Public License.
This software is released under GNU General Public License. It can be freely used, modified and distributed in conformity with the GNU General Public License. The License Agreement is available here.
For any information about the GNU GPL and its implications please refer to GNU site.
The content of this site is provided “AS IS” without warranties of any kind either express or implied. Currently, I am working on to improve the performance (and correct its bug too). Please send me any comment or suggestion and report me any bug you notice. If you plan to employ this component in some DM or KD application, please notice it me (this is not conforming to GNU GPL, but I would know if my work is useful).
Change log
- Upgraded to WEKA 3.6 (1.1) 22 Aug 2010
- First release (1.0) 8 Jan 2002
Spectral Clustering
The proposed implementation is derived from [1] and [2]. In [3] is presented another version of spectral clustering algorithm, and a comparative analysis of various clustering algorithms based on spectral methods is in [4].
The spectral clustering algorithm is based on the concept of similarity between point instead of distance, as other algorithms do. The implemented algorithm is formulated as graph partition problem where the weight of each edge is the similarity between points that correspond to vertex connected by the edge.
The goal of the algorithm is find the minimum weight cuts in the graph, but this problem can be addressed by the means of linear algebra, in particular by the eigenvalue decomposition techniques, from which the term “spectral” derives.
Since the WEKA clustering framework relies on the distance between points, I have introduced a function to compute the similarity from the Euclidean distance:
s(x, y) = exp( - d(x, y) ^ 2 / (2 * sigma ^ 2))
where:
s(x, y)
is the similarity between the points x and y;d(x, y)
is metric distance function (e.g., Euclidean distance);- sigma is an user specified scaling factor.
This a limit of the implementation, because one could need a different similarity function, i.e. to deal with nominal attribute. This goal could be achieved defining accordingly a distance function for the specific problem input space exploiting the WEKA DistanceFunction class.
More information about the algorithm can also be retrieved in source documentation.
Release notes
The component has been developed using the WEKA stable version 3.6 and the COLT version 1.2. It is no compatible with previous versions. It has been built with standard Sun JDK 1.6 compiler.
Since the algorithm relies on the similarity matrix it needs an amount of memory proportionate to square of the number of points. To reduce this amount there an option that switches to sparse matrix representation.
The time complexity of the implemented algorithm depends on the complexity of the eigenvalue decomposition algorithm (about O(n^3)
, where n is the number of rows/columns, or rather the number of items in the dataset), but it is possible to reduce the time complexity, since the algorithm needs certain eigenvectors only (which are corresponding to smallest eigenvalues in magnitude), I think it possible to implement an ad-hoc decomposition function based on Lanczos or sub-space iteration methods for sparse matrices.
A limit is the lack of concise representation of partitions. In this version each cluster is modelled as the set of its own points. To cluster a new instance the implementation search the set that contains the nearest point.
I am not aware of any proposal for modelling the clusters in this case, but I plan to improve the classification method by the introduction of a less naive algorithm (i.e. k-NN). In this case also I need to replace the distance function with a similarity one.
System requirements
In order to use this component, you need:
- A Java 6 compatible Virtual Machine
- The Waikato Environment for Knowledge Analysis release 3.6
- The COLT Numeric Library release 1.2
Installation instructions
The installation is a bit involved, because the WEKA environment lacks of a standard module registration procedure.
To use this module one must follow the subsequent instructions:
- Download and install the WEKA System. It can be downloaded from the University of Waikato website. Install it following its own instructions.
- Download and install the COLT numeric package. It can be downloaded from the COLT project’s website. To install it follow its own instructions.
- Download the Spectral Clusterer from here (The source code, according to GNU GPL, is included in the same file).
- To run WEKA, the Java runtime’s classpath should simply include the following jars:
weka.jar
,colt.jar
, andweka-spectral-clusterer.jar
.
For instance, assuming that these files are in the current directory, the command to issue is:java -cp weka.jar;weka-spectral-clusterer.jar;colt.jar weka.gui.GUIChooser
That’s all folk. Now you can start the WEKA knowledge explorer, the new algorithm is available in the Cluster tab of the Explorer.
Build instructions
The Spectral Clusterer component could be built from the source code available here. It should be enough adding the WEKA and COLT libraries to the compiler’s classpath, in order to compile it. Please refer to the documentation of your IDE for further details.
Documentation
For more information about the component and its options see the Javadoc file. It can be built from the source and it is available for convenience here. See the source comment and the cited references for specific details about the algorithm and its implementation.
References
[1] Shi, J., and J. Malik (1997) "Normalized Cuts and Image Segmentation", in Proc. of IEEE Conf. on Comp. Vision and Pattern Recognition, Puerto Rico.
[2] Kannan, R., S. Vempala, and A. Vetta (2000) "On Clusterings - Good, Bad and Spectral", Tech. Report, CS Dept., Yale University.
[3] Ng, A. Y., M. I. Jordan, and Y. Weiss (2001) "On Spectral Clustering: Analysis and an algorithm", in Advances in Neural Information Processing Systems 14.
[4] Weiss, Y. (1999) "Segmentation using eigenvectors: a unifying view", Tech. Rep., CS. Dept., UC Berkeley.
I think there must be a colon(:) instead of semi colon(;) while adding classpath with java -cp command. For me, it only worked with (:).I am using ubuntu.
The path separator char depends upon the operating system: it is a colon(:) on unix-like systems and a semi-colon(;) on Windows/DOS-derivates.
Hi,
I have followed the instructions, but after running WEKA just like step 4 orders, I still can’t see the Spectral Clusterer under the Cluster tab of WEKA’s explorer.
Please check JDK/JRE, Weka and COLT versions.
Will the current implementation be able to cluster a graph whose adjacency matrix we know but we do-not know the individual coordinates of each point. How should the format of Instance be so as to do such a clustering.
The base spectral clustering algorithm should be able to perform such task, but given the integration specifications of WEKA framework, you have to express you problem in terms of point-to-point distance, so it is not so easy to encode a graph.
A possible work-around, not very elegant indeed, it is to encode your graph into a custom distance function that actually returns the inverse of the similarity between the pair of node-ids that has been provided as arguments. In other words, you should encode your graph into a function.
Another option is to extract the core of the algorithm and to adapt it to run outside the WEKA framework, so you are free to provide a proper similarity matrix.
Hi,
I have two questions:
- Which range of values would you recommend for the parameters alpha and sigma? I am thinking of performing a grid search, to optimise.
- How to you define the number of clusters? I couldn’t find any such parameter in https://www.luigidragone.com/datamining/SpectralClusterer.html
Thanks
The parameter tuning is very specific of the problem and data that you aim to solve, and how are they encoded. For some example values you could check the cited references, where Authors report settings used in their tests.
Regarding the number of clusters, you could use the alpha-star value to control the termination of the recursive partitioning.
Eventually you could setup a more complex experiment, validating the quality of the computed cluster, in a meta-learning framework, in order to find the best-fit configuration for your problem.