Weka Get Cluster Assignments

Class SimpleKMeans

  • All Implemented Interfaces:
    java.io.Serializable, java.lang.Cloneable, Clusterer, NumberOfClustersRequestable, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, Randomizable, RevisionHandler, TechnicalInformationHandler, WeightedInstancesHandler


    public class SimpleKMeans extends RandomizableClusterer implements NumberOfClustersRequestable, WeightedInstancesHandler, TechnicalInformationHandler
    Cluster data using the k-means algorithm. Can use either the Euclidean distance (default) or the Manhattan distance. If the Manhattan distance is used, then centroids are computed as the component-wise median rather than the mean. For more information see:

    D. Arthur, S. Vassilvitskii: k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.

    Valid options are:

    -N <num>  Number of clusters. (default 2)
    -init  Initialization method to use. 0 = random, 1 = k-means++, 2 = canopy, 3 = farthest first. (default = 0)
    -C  Use canopies to reduce the number of distance calculations.
    -max-candidates <num>  Maximum number of candidate canopies to retain in memory at any one time when using canopy clustering. The T2 distance, plus data characteristics, determines how many candidate canopies are formed before periodic and final pruning are performed, which might result in excess memory consumption. This setting avoids large numbers of candidate canopies consuming memory. (default = 100)
    -periodic-pruning <num>  How often to prune low-density canopies when using canopy clustering. (default = every 10,000 training instances)
    -min-density  Minimum canopy density, when using canopy clustering, below which a canopy will be pruned during periodic pruning. (default = 2 instances)
    -t2  The T2 distance to use when using canopy clustering. Values < 0 indicate that a heuristic based on attribute standard deviation should be used to set this. (default = -1.0)
    -t1  The T1 distance to use when using canopy clustering. A value < 0 is taken as a positive multiplier for T2. (default = -1.5)
    -V  Display standard deviations for centroids.
    -M  Don't replace missing values with mean/mode.
    -A <classname and options>  Distance function to use. (default: weka.core.EuclideanDistance)
    -I <num>  Maximum number of iterations.
    -O  Preserve order of instances.
    -fast  Enables faster distance calculations, using cut-off values. Disables the calculation/output of squared errors/distances.
    -num-slots <num>  Number of execution slots. (default 1 - i.e. no parallelism)
    -S <num>  Random number seed. (default 10)
    -output-debug-info  If set, clusterer is run in debug mode and may output additional info to the console.
    -do-not-check-capabilities  If set, clusterer capabilities are not checked before the clusterer is built (use with caution).
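The core loop behind this class can be illustrated with a stripped-down, Weka-free sketch on 1-D data (this is an illustrative re-implementation, not the class's actual code): repeatedly assign each point to its nearest centroid, then recompute each centroid as the mean of its points, until assignments stop changing.

```java
import java.util.Arrays;

public class KMeansSketch {
    // One k-means run on 1-D data: assign each point to the nearest
    // centroid, then recompute each centroid as the mean of its points.
    public static int[] cluster(double[] data, double[] centroids, int maxIter) {
        int[] assign = new int[data.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(data[i] - centroids[c]) < Math.abs(data[i] - centroids[best])) {
                        best = c;
                    }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            // Recompute each centroid as the mean of its assigned points.
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (int i = 0; i < data.length; i++) {
                sum[assign[i]] += data[i];
                count[assign[i]]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
            }
            if (!changed) break;  // converged: no assignment moved
        }
        return assign;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 1.2, 0.8, 9.0, 9.5, 8.7};
        double[] centroids = {0.0, 10.0};  // initial guesses
        System.out.println(Arrays.toString(cluster(data, centroids, 100)));
        // prints [0, 0, 0, 1, 1, 1]
    }
}
```

The real class adds missing-value replacement, nominal-attribute handling, several seeding strategies, and optional canopy-based pruning of distance calculations on top of this loop.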
    Version:
    $Revision: 11444 $
    Author:
    Mark Hall (mhall@cs.waikato.ac.nz), Eibe Frank (eibe@cs.waikato.ac.nz)
    See Also:
    Serialized Form
    • Constructor Summary

      Constructor and Description
    • Method Summary

      • Methods inherited from class weka.clusterers.AbstractClusterer

      • Methods inherited from class java.lang.Object

    • Constructor Detail

      • SimpleKMeans

        public SimpleKMeans()
    • Method Detail

      • globalInfo

        public java.lang.String globalInfo()
        Returns a string describing this clusterer.
        Returns:
        a description of the clusterer suitable for displaying in the explorer/experimenter gui
      • buildClusterer

        public void buildClusterer(Instances data) throws java.lang.Exception
        Generates a clusterer. Has to initialize all fields of the clusterer that are not being set via options.
        Specified by:
        buildClusterer in interface Clusterer
        Specified by:
        buildClusterer in class AbstractClusterer
        Parameters:
        data - set of instances serving as training data
        Throws:
        java.lang.Exception - if the clusterer has not been generated successfully
      • clusterInstance

        public int clusterInstance(Instance instance) throws java.lang.Exception
        Classifies a given instance.
        Specified by:
        clusterInstance in interface Clusterer
        Overrides:
        clusterInstance in class AbstractClusterer
        Parameters:
        instance - the instance to be assigned to a cluster
        Returns:
        the number of the assigned cluster as an integer if the class is enumerated, otherwise the predicted value
        Throws:
        java.lang.Exception - if instance could not be classified successfully
      • numberOfClusters

        public int numberOfClusters() throws java.lang.Exception
        Returns the number of clusters.
        Specified by:
        numberOfClusters in interface Clusterer
        Specified by:
        numberOfClusters in class AbstractClusterer
        Returns:
        the number of clusters generated for a training dataset
        Throws:
        java.lang.Exception - if number of clusters could not be returned successfully
      • listOptions

        public java.util.Enumeration<Option> listOptions()
        Returns an enumeration describing the available options.
        Specified by:
        listOptions in interface OptionHandler
        Overrides:
        listOptions in class RandomizableClusterer
        Returns:
        an enumeration of all the available options.
      • numClustersTipText

        public java.lang.String numClustersTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setNumClusters

        public void setNumClusters(int n) throws java.lang.Exception
        set the number of clusters to generate.
        Specified by:
        setNumClusters in interface NumberOfClustersRequestable
        Parameters:
        n - the number of clusters to generate
        Throws:
        java.lang.Exception - if number of clusters is negative
      • getNumClusters

        public int getNumClusters()
        gets the number of clusters to generate.
        Returns:
        the number of clusters to generate
      • initializationMethodTipText

        public java.lang.String initializationMethodTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setInitializationMethod

        public void setInitializationMethod(SelectedTag method)
        Set the initialization method to use
        Parameters:
        - the initialization method to use
      • getInitializationMethod

        public SelectedTag getInitializationMethod()
        Get the initialization method to use
        Returns:
        the initialization method to use
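The k-means++ seeding named above (initialization method 1) can be sketched on 1-D data (an illustrative sketch, not Weka's code): the first center is picked uniformly at random, and each further center is picked with probability proportional to the squared distance to the nearest center chosen so far.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class KMeansPlusPlus {
    // k-means++ seeding: spreads initial centers out, which tends to give
    // better and more stable clusterings than purely random seeding.
    public static List<Double> seed(double[] data, int k, Random rnd) {
        List<Double> centers = new ArrayList<>();
        centers.add(data[rnd.nextInt(data.length)]);  // first center: uniform
        while (centers.size() < k) {
            double[] d2 = new double[data.length];
            double total = 0;
            for (int i = 0; i < data.length; i++) {
                double best = Double.MAX_VALUE;
                for (double c : centers) {
                    best = Math.min(best, (data[i] - c) * (data[i] - c));
                }
                d2[i] = best;   // squared distance to nearest chosen center
                total += best;
            }
            // Sample an index with probability proportional to d2[i].
            double r = rnd.nextDouble() * total;
            int pick = 0;
            double acc = d2[0];
            while (acc < r && pick < data.length - 1) {
                pick++;
                acc += d2[pick];
            }
            centers.add(data[pick]);
        }
        return centers;
    }
}
```

With data like {0, 0, 0, 10} and k = 2, the two chosen centers always cover both groups: once one group's point is a center, the remaining probability mass lies entirely on the other group.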
      • reduceNumberOfDistanceCalcsViaCanopiesTipText

        public java.lang.String reduceNumberOfDistanceCalcsViaCanopiesTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setReduceNumberOfDistanceCalcsViaCanopies

        public void setReduceNumberOfDistanceCalcsViaCanopies(boolean c)
        Set whether to use canopies to reduce the number of distance computations required
        Parameters:
        - true if canopies are to be used to reduce the number of distance computations
      • getReduceNumberOfDistanceCalcsViaCanopies

        public boolean getReduceNumberOfDistanceCalcsViaCanopies()
        Get whether to use canopies to reduce the number of distance computations required
        Returns:
        true if canopies are to be used to reduce the number of distance computations
      • canopyPeriodicPruningRateTipText

        public java.lang.String canopyPeriodicPruningRateTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setCanopyPeriodicPruningRate

        public void setCanopyPeriodicPruningRate(int p)
        Set how often to prune low-density canopies during training (if using canopy clustering)
        Parameters:
        - how often (every p instances) to prune low density canopies
      • getCanopyPeriodicPruningRate

        public int getCanopyPeriodicPruningRate()
        Get how often to prune low-density canopies during training (if using canopy clustering)
        Returns:
        how often (every p instances) to prune low density canopies
      • canopyMinimumCanopyDensityTipText

        public java.lang.String canopyMinimumCanopyDensityTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setCanopyMinimumCanopyDensity

        public void setCanopyMinimumCanopyDensity(double dens)
        Set the minimum T2-based density below which a canopy will be pruned during periodic pruning.
        Parameters:
        - the minimum canopy density
      • getCanopyMinimumCanopyDensity

        public double getCanopyMinimumCanopyDensity()
        Get the minimum T2-based density below which a canopy will be pruned during periodic pruning.
        Returns:
        the minimum canopy density
      • canopyMaxNumCanopiesToHoldInMemoryTipText

        public java.lang.String canopyMaxNumCanopiesToHoldInMemoryTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setCanopyMaxNumCanopiesToHoldInMemory

        public void setCanopyMaxNumCanopiesToHoldInMemory(int max)
        Set the maximum number of candidate canopies to retain in memory during training. T2 distance and data characteristics determine how many candidate canopies are formed before periodic and final pruning are performed. There may not be enough memory available if T2 is set too low.
        Parameters:
        - the maximum number of candidate canopies to retain in memory during training
      • getCanopyMaxNumCanopiesToHoldInMemory

        public int getCanopyMaxNumCanopiesToHoldInMemory()
        Get the maximum number of candidate canopies to retain in memory during training. T2 distance and data characteristics determine how many candidate canopies are formed before periodic and final pruning are performed. There may not be enough memory available if T2 is set too low.
        Returns:
        the maximum number of candidate canopies to retain in memory during training
      • canopyT2TipText

        public java.lang.String canopyT2TipText()
        Tip text for this property
        Returns:
        the tip text for this property
      • setCanopyT2

        public void setCanopyT2(double t2)
        Set the t2 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calcs
        Parameters:
        - the t2 radius to use
      • getCanopyT2

        public double getCanopyT2()
        Get the t2 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calcs
        Returns:
        the t2 radius to use
      • canopyT1TipText

        public java.lang.String canopyT1TipText()
        Tip text for this property
        Returns:
        the tip text for this property
      • setCanopyT1

        public void setCanopyT1(double t1)
        Set the t1 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calcs
        Parameters:
        - the t1 radius to use
      • getCanopyT1

        public double getCanopyT1()
        Get the t1 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calcs
        Returns:
        the t1 radius to use
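The canopy idea behind the T1/T2 options can be illustrated with a stripped-down sketch (this is not Weka's implementation, which also tracks densities and performs periodic pruning): a point within the tight radius T2 of an existing canopy center is absorbed by it, while anything farther away starts a new canopy; the loose radius T1 (> T2) would then define overlapping canopy membership used to skip distance calculations.

```java
import java.util.ArrayList;
import java.util.List;

public class CanopySketch {
    // Canopy formation on 1-D data with a tight radius t2: a point within
    // t2 of an existing canopy center is absorbed; otherwise it becomes a
    // new canopy center itself.
    public static List<Double> canopyCenters(double[] data, double t2) {
        List<Double> centers = new ArrayList<>();
        for (double x : data) {
            boolean absorbed = false;
            for (double c : centers) {
                if (Math.abs(x - c) <= t2) { absorbed = true; break; }
            }
            if (!absorbed) centers.add(x);
        }
        return centers;
    }
}
```

Because k-means then only needs exact distances to centroids sharing a canopy with an instance, a well-chosen T2 can cut the number of distance computations dramatically.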
      • maxIterationsTipText

        public java.lang.String maxIterationsTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setMaxIterations

        public void setMaxIterations(int n) throws java.lang.Exception
        set the maximum number of iterations to be executed.
        Parameters:
        - the maximum number of iterations
        Throws:
        - if maximum number of iteration is smaller than 1
      • getMaxIterations

        public int getMaxIterations()
        gets the maximum number of iterations to be executed.
        Returns:
        the maximum number of iterations
      • displayStdDevsTipText

        public java.lang.String displayStdDevsTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setDisplayStdDevs

        public void setDisplayStdDevs(boolean stdD)
        Sets whether standard deviations and nominal counts should be displayed in the clustering output.
        Parameters:
        - true if std. devs and counts should be displayed
      • getDisplayStdDevs

        public boolean getDisplayStdDevs()
        Gets whether standard deviations and nominal counts should be displayed in the clustering output.
        Returns:
        true if std. devs and counts should be displayed
      • dontReplaceMissingValuesTipText

        public java.lang.String dontReplaceMissingValuesTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setDontReplaceMissingValues

        public void setDontReplaceMissingValues(boolean r)
        Sets whether missing values are to be replaced.
        Parameters:
        - true if missing values are to be replaced
      • getDontReplaceMissingValues

        public boolean getDontReplaceMissingValues()
        Gets whether missing values are to be replaced.
        Returns:
        true if missing values are to be replaced
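The default replacement behavior controlled by this property can be sketched for a numeric column (an illustrative sketch; Weka's actual filter handles nominal attributes with the mode and respects instance weights): missing entries are filled with the mean of the non-missing ones.

```java
public class MissingValues {
    // Replace NaN entries of a numeric column with the mean of the
    // non-missing entries (the default behavior unless -M is set).
    public static double[] replaceWithMean(double[] col) {
        double sum = 0;
        int n = 0;
        for (double v : col) {
            if (!Double.isNaN(v)) { sum += v; n++; }
        }
        double mean = n > 0 ? sum / n : 0.0;  // all-missing column -> 0
        double[] out = col.clone();
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) out[i] = mean;
        }
        return out;
    }
}
```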
      • distanceFunctionTipText

        public java.lang.String distanceFunctionTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • getDistanceFunction

        public DistanceFunction getDistanceFunction()
        returns the distance function currently in use.
        Returns:
        the distance function
      • setDistanceFunction

        public void setDistanceFunction(DistanceFunction df) throws java.lang.Exception
        sets the distance function to use for instance comparison.
        Parameters:
        - the new distance function to use
        Throws:
        - if instances cannot be processed
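As the class description notes, switching the distance function to Manhattan changes how centroids are computed: the point minimizing the sum of absolute deviations is the component-wise median, not the mean. A small sketch of the median computation (illustrative, not Weka's code):

```java
import java.util.Arrays;

public class MedianCentroid {
    // With the Manhattan distance, the centroid coordinate for each
    // attribute is the median of that attribute's values in the cluster.
    public static double median(double[] values) {
        double[] v = values.clone();
        Arrays.sort(v);
        int n = v.length;
        // Odd count: middle element; even count: average of the two middles.
        return n % 2 == 1 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
    }
}
```

Note how an outlier such as 9 in {1, 2, 9} pulls the mean to 4 but leaves the median at 2, which is why the Manhattan/median combination is more robust to outliers.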
      • preserveInstancesOrderTipText

        public java.lang.String preserveInstancesOrderTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setPreserveInstancesOrder

        public void setPreserveInstancesOrder(boolean r)
        Sets whether order of instances must be preserved.
        Parameters:
        r - true if the order of instances is to be preserved
      • getPreserveInstancesOrder

        public boolean getPreserveInstancesOrder()
        Gets whether order of instances must be preserved.
        Returns:
        true if the order of instances is preserved
      • fastDistanceCalcTipText

        public java.lang.String fastDistanceCalcTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setFastDistanceCalc

        public void setFastDistanceCalc(boolean value)
        Sets whether to use faster distance calculation.
        Parameters:
        - true if faster calculation to be used
      • getFastDistanceCalc

        public boolean getFastDistanceCalc()
        Gets whether to use faster distance calculation.
        Returns:
        true if faster calculation is used
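The cut-off trick behind the -fast option can be sketched as a partial distance computation (a sketch of the idea, not Weka's exact code): while summing squared differences, stop as soon as the running sum exceeds the best distance found so far, since this centroid can no longer be the nearest one. This is also why the squared error is not available in fast mode.

```java
public class CutoffDistance {
    // Partial squared Euclidean distance: bail out early once the running
    // sum exceeds the cutoff, signalling "worse than the best so far".
    public static double squaredDistance(double[] a, double[] b, double cutoff) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
            if (sum > cutoff) return Double.POSITIVE_INFINITY;
        }
        return sum;
    }
}
```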
      • numExecutionSlotsTipText

        public java.lang.String numExecutionSlotsTipText()
        Returns the tip text for this property
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setNumExecutionSlots

        public void setNumExecutionSlots(int slots)
        Set the degree of parallelism to use.
        Parameters:
        slots - the number of execution slots (threads) to use when computing distances between instances and cluster centroids
      • getNumExecutionSlots

        public int getNumExecutionSlots()
        Get the degree of parallelism to use.
        Returns:
        the number of execution slots (threads) to use when computing distances between instances and cluster centroids
      • setOptions

        public void setOptions(java.lang.String[] options) throws java.lang.Exception
        Parses a given list of options. Valid options are those listed in the class description above.
        Specified by:
        setOptions in interface OptionHandler
        Overrides:
        setOptions in class RandomizableClusterer
        Parameters:
        options - the list of options as an array of strings
        Throws:
        java.lang.Exception - if an option is not supported
      • getOptions

        public java.lang.String[] getOptions()
        Gets the current settings of SimpleKMeans.
        Specified by:
        getOptions in interface OptionHandler
        Overrides:
        getOptions in class RandomizableClusterer
        Returns:
        an array of strings suitable for passing to setOptions()
      • toString

        public java.lang.String toString()
        return a string describing this clusterer.
        Overrides:
        toString in class java.lang.Object
        Returns:
        a description of the clusterer as a string
      • getClusterCentroids

        public Instances getClusterCentroids()
        Gets the cluster centroids.
        Returns:
        the cluster centroids
      • getClusterStandardDevs

        public Instances getClusterStandardDevs()
        Gets the standard deviations of the numeric attributes in each cluster.
        Returns:
        the standard deviations of the numeric attributes in each cluster
      • getClusterNominalCounts

        public double[][][] getClusterNominalCounts()
        Returns for each cluster the weighted frequency counts for the values of each nominal attribute.
        Returns:
        the counts
      • getSquaredError

        public double getSquaredError()
        Gets the squared error for all clusters.
        Returns:
        the squared error, NaN if fast distance calculation is used
      • getClusterSizes

        public double[] getClusterSizes()
        Gets the sum of weights for all the instances in each cluster.
        Returns:
        The number of instances in each cluster
      • getAssignments

        public int[] getAssignments() throws java.lang.Exception
        Gets the assignments for each instance.
        Returns:
        Array of indexes of the centroid assigned to each instance
        Throws:
        - if order of instances wasn't preserved or no assignments were made
      • main

        public static void main(java.lang.String[] args)
        Main method for executing this class.
        Parameters:
        - use -h to list all parameters

Using Weka 3 for clustering

Clustering

Get to the Weka Explorer environment and load the training file using the Preprocess mode. Try first with weather.arff. Get to the Cluster mode (by clicking on the Cluster tab) and select a clustering algorithm, for example SimpleKMeans. Then click on Start and you get the clustering result in the output window. The actual clustering for this algorithm is shown as one instance for each cluster representing the cluster centroid.

[Screenshot: SimpleKMeans output in the Weka Explorer, showing the cluster centroids for weather.arff]
Evaluation

The way Weka evaluates the clusterings depends on the cluster mode you select. Four different cluster modes are available (as buttons in the Cluster mode panel):

  1. Use training set (default). After generating the clustering Weka classifies the training instances into clusters according to the cluster representation and computes the percentage of instances falling in each cluster. For example, the above clustering produced by k-means shows 43% (6 instances) in cluster 0 and 57% (8 instances) in cluster 1.
  2. In Supplied test set or Percentage split Weka can evaluate clusterings on separate test data if the cluster representation is probabilistic (e.g. for EM).
  3. Classes to clusters evaluation. In this mode Weka first ignores the class attribute and generates the clustering. Then during the test phase it assigns classes to the clusters, based on the majority value of the class attribute within each cluster. Then it computes the classification error, based on this assignment and also shows the corresponding confusion matrix. An example of this for k-means is shown below.
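The classes-to-clusters error described in mode 3 can be computed with a small stand-alone sketch (illustrative, not Weka's code): give each cluster the majority value of the class attribute among its instances, then count the instances whose actual class differs from their cluster's majority class.

```java
import java.util.HashMap;
import java.util.Map;

public class ClassesToClusters {
    // Classes-to-clusters error rate: per-cluster majority vote on the
    // class attribute, then the fraction of instances outside the majority.
    public static double errorRate(int[] cluster, String[] clazz) {
        Map<Integer, Map<String, Integer>> counts = new HashMap<>();
        for (int i = 0; i < cluster.length; i++) {
            counts.computeIfAbsent(cluster[i], k -> new HashMap<>())
                  .merge(clazz[i], 1, Integer::sum);
        }
        int errors = 0;
        for (Map<String, Integer> dist : counts.values()) {
            int total = 0, max = 0;
            for (int c : dist.values()) { total += c; max = Math.max(max, c); }
            errors += total - max;  // everything outside the majority class
        }
        return (double) errors / cluster.length;
    }
}
```

For instance, a cluster whose instances have classes {no, no, yes} contributes one error, giving an error rate of 1/3 for a three-instance dataset.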

[Screenshot: classes-to-clusters evaluation output for k-means, with the confusion matrix]
EM

The EM clustering scheme generates probabilistic descriptions of the clusters in terms of mean and standard deviation for the numeric attributes, and value counts (incremented by 1 and modified with a small value to avoid zero probabilities) for the nominal ones. In "Classes to clusters" evaluation mode this algorithm also outputs the log-likelihood, assigns classes to the clusters and prints the confusion matrix and the error rate, as shown in the example below. More about EM and other clustering schemes available in Weka can be found in the text, pages 296-297.
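The "incremented by 1" count smoothing mentioned above is Laplace-style smoothing; a minimal sketch of the idea (illustrative, omitting the additional small modifier Weka applies):

```java
public class LaplaceCounts {
    // Laplace-smoothed probabilities for a nominal attribute: add 1 to each
    // raw value count before normalizing, so no value gets probability 0.
    public static double[] smoothedProbs(int[] counts) {
        double total = 0;
        double[] p = new double[counts.length];
        for (int c : counts) total += c + 1;
        for (int i = 0; i < counts.length; i++) p[i] = (counts[i] + 1) / total;
        return p;
    }
}
```

With raw counts {0, 3} the unseen value still gets probability 1/5 rather than 0, which keeps the likelihood computations in EM well-defined.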

[Screenshot: EM output in "Classes to clusters" mode, with the log-likelihood and confusion matrix]

Cobweb

Cobweb generates hierarchical clustering, where clusters are described probabilistically. Below is an example clustering of the weather data (weather.arff). The class attribute (play) is ignored (using the ignore attributes panel) in order to allow later classes to clusters evaluation. Doing this automatically through the "Classes to clusters" option does not make much sense for hierarchical clustering, because of the large number of clusters. Sometimes we need to evaluate particular clusters or levels in the clustering hierarchy. We shall discuss here an approach to this.

Let us first see how Weka represents the Cobweb clusters. Below is a copy of the output window, showing the run time information and the structure of the clustering tree.

[Screenshot: Cobweb run time information and the clustering tree structure]
Here is some comment on the output above:
  • The options shown in the command line specify the Cobweb parameters Acuity and Cutoff (see the text, page 215). They can also be set through the pop-up window that appears by clicking on the area to the left of the Choose button.
  • node N or leaf N represents a subcluster, whose parent cluster is N.
  • The clustering tree structure is shown as a horizontal tree, where subclusters are aligned at the same column. For example, cluster 1 (referred to in node 1) has three subclusters 2 (leaf 2), 3 (leaf 3) and 4 (leaf 4).
  • The root cluster is 0. Each line with node 0 defines a subcluster of the root.
  • The number in square brackets after node N represents the number of instances in the parent cluster N.
  • Clusters with [1] at the end of the line are instances.
  • For example, in the above structure cluster 1 has 8 instances and its subclusters 2, 3 and 4 have 2, 3 and 3 instances correspondingly.
  • To view the clustering tree right click on the last line in the result list window and then select Visualize tree.
To evaluate the Cobweb clustering using the classes to clusters  approach we need to know the class values of the instances, belonging to the clusters. We can get this information from Weka in the following way: After Weka finishes (with the class attribute ignored), right click on the last line in the result list window. Then choose Visualize cluster assignments - you get the Weka cluster visualize window. Here you can view the clusters, for example by putting Instance_number on X and Cluster on Y. Then click on Save and choose a file name (*.arff). Weka saves the cluster assignments in an ARFF file. Below is shown the file corresponding to the above Cobweb clustering.
[ARFF listing: the cluster assignments saved from the cluster visualize window for the Cobweb clustering]
To represent the cluster assignments Weka adds a new attribute Cluster and includes its corresponding values at the end of each data line. Note that all other attributes are shown, including the ignored ones (play, in this case). Also, only the leaf clusters are shown.

Now, to compute the classes to clusters error in, say, cluster 3 we look at the corresponding data rows in the ARFF file and get the distribution of the class variable: {no, no, yes}. This means that the majority class is no and the error is 1/3.

If we want to compute the error not only for leaf clusters, we need to look at the clustering structure (the Visualize tree option helps here) and determine how the leaf clusters are combined in other clusters at higher levels of the hierarchy. For example, at the top level we have two clusters - 1 and 5. We can get the class distribution of 5 directly from the data (because 5 is a leaf) - 3 yes's and 3 no's. While for cluster 1 we need its subclusters - 2, 3 and 4. Summing up the class values we get 6 yes's and 2 no's. Finally, the majority in cluster 1 is yes and in cluster 5 is no (could be yes too) and the error (for the top level partitioning in two clusters) is 5/14.
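The arithmetic above can be checked with a tiny sketch, with the counts taken directly from the example (6 yes / 2 no in cluster 1, a 3/3 tie in cluster 5, 14 instances in total):

```java
public class TopLevelError {
    // Classes-to-clusters error for the top-level two-cluster partitioning:
    // each cluster contributes its minority count as errors.
    public static double error() {
        int errCluster1 = Math.min(6, 2);  // majority yes -> 2 errors
        int errCluster5 = Math.min(3, 3);  // 3/3 tie, majority taken as no -> 3 errors
        return (double) (errCluster1 + errCluster5) / 14;
    }
}
```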

Weka provides another approach to see the instances belonging to each cluster. When you visualize the clustering tree, you can click on a node and then see the visualization of the instances falling into the corresponding cluster (i.e. into the leaves of the subtree). This is a very useful feature; however, if you ignore an attribute (as we did with "play" in the experiments above), it does not show up in the visualization.
 
