The k -means clustering technique: General considerations and implementation in Mathematica

Data clustering techniques are valuable tools for researchers working with large databases of multivariate data. In this tutorial, we present a simple yet powerful one: the k -means clustering technique, through three different algorithms: the Forgy/Lloyd, algorithm, the MacQueen algorithm and the Hartigan & Wong algorithm. We then present an implementation in Mathematica and various examples of the different options available to illustrate the application of the technique

Data clustering techniques are descriptive data analysis techniques that can be applied to multivariate data sets to uncover the structure present in the data.They are particularly useful when classical second order statistics (the sample mean and covariance) cannot be used.Namely, in exploratory data analysis, one of the assumptions that is made is that no prior knowledge about the dataset, and therefore the dataset's distribution, is available.In such a situation, data clustering can be a valuable tool.Data clustering is a form of unsupervised classification, as the clusters are formed by evaluating similarities and dissimilarities of intrinsic characteristics between different cases, and the grouping of cases is based on those emergent similarities and not on an external criterion.Also, these techniques can be useful for datasets of any dimensionality over three, as it is very difficult for humans to compare items of such complexity reliably without a support to aid the comparison.
The technique presented in this tutorial, k-means clustering, belongs to partitioning-based techniques grouping, which are based on the iterative relocation of data points between clusters.It is used to divide either the cases or the variables of a dataset into non-overlapping groups, or clusters, based on the characteristics uncovered.Whether the algorithm is applied to the cases or the variables of the dataset depends on which dimensions of this dataset we want to reduce the dimensionality of.The goal is to produce groups of cases/variables with a high degree of similarity within each group and a low degree of similarity between groups (Hastie, Tibshirani & Friedman, 2001).The k-means clustering technique can also be described as a centroid model as one vector representing the mean is used to describe each cluster.MacQueen (1967), the creator of one of the k-means algorithms presented in this paper, considered the main use of k-means clustering to be more of a way for researchers to gain qualitative and quantitative insight into large multivariate data sets than a way to find a unique and definitive grouping for the data.
K-means clustering is very useful in exploratory data analysis and data mining in any field of research, and as the growth in computer power has been followed by a growth in the occurrence of large data sets.Its ease of implementation, computational efficiency and low memory consumption has kept the k-means clustering very popular, even compared to other clustering techniques.Such other clustering techniques include connectivity models like hierarchical clustering methods (Hastie, Tibshirani & Friedman, 2000).These have the advantage of allowing for an unknown number of clusters to be searched for in the data, but are very costly computationally due to the fact that they are based on the dissimilarity matrix.Also included in cluster analysis methods are distribution models like expectation-maximisation algorithms and density models (Ankerst, Breunig, Kriegel & Sander, 1999).
A secondary goal of k-means clustering is the reduction of the complexity of the data.A good example would be letter grades (Faber, 1994).The numerical grades are clustered into the letters and represented by the average included in each class.
Finally, k-means clustering can also be used as an initialization step for more computationally expensive algorithms like Learning Vector Quantization or Gaussian Mixtures, thus giving an approximate separation of the data as a starting point and reducing the noise present in the dataset (Shannon, 1948).
A good cluster analysis is both efficient and effective, in that it uses as few clusters as possible while still capturing all statistically important clusters.Similarity in cluster analysis is usually taken as meaning ''proximity'', and elements closer to one another in the input space are considered more similar.Different metrics can be used to calculate this similarity, the Euclidian distance being the most common.Here c is the cluster center, x is the case it is compared to, i is the dimension of x (or c) being compared and k is the total number of dimensions.The Squared Euclidian distance , the Manhattan distance or the Maximum distance between attributes of the vectors can also be used.Of greater interest are the Mahalanobis distance , which accounts for the covariance between the vectors, and the Cosine similarity , which is a non-translation invariant version of the correlation.By translation invariant, we mean here that adding a constant to all elements of the vectors will not change the result of the correlation while it will change the result of the cosine similarity. 1 Both the correlation and cosine similarity are invariant to scaling (multiplying all elements by a nonzero constant).As the kmeans algorithms try to minimize the sum of the variances within the clusters, , where ni is the number of cases included in cluster k and , the k-means clustering technique is considered a variance minimization technique.
Mathematically, the k-means technique is an approximation of a normal mixture model with an estimation of the mixtures by maximum likelihood.Mixture models consider cluster membership as a probability for each case, based on the means, covariances, and sampling probabilities of each cluster (Symons, 1981).They represent the data as a mixture of distributions (Gaussian, Poisson, etc.), where each distribution represents a sub-population (or cluster) of the data.The k-means technique is a sub case that assumes that the mixture components (here, clusters) all have spherical covariance matrices and equal sampling probabilities.I also consider the cluster membership for each 1 The correlation is the cosine similarity of mean-centered, deviation-normalized data case a separate parameter to be estimated.

K-means clustering
We present three k-means clustering algorithms: the Forgy/Lloyd algorithm, the MacQueen algorithm and the Hartigan & Wong algorithm.We chose those three algorithms because they are the most widely used k-means clustering techniques and they all have slightly different goals and thus results.To be able to use any of the three, you first need to know how many clusters are present in your data.As this information is often unavailable, multiple trials will be necessary to find the best amount of clusters.As a starting point, it is often useful to standardize the data if the components of the cases are not in the same scale.
There is no absolute best algorithm.The choice of the optimal algorithm depends on the characteristics of the dataset (size, number of variables in the cases).Jain, Duin & Mao (2000) even suggest trying several different clustering algorithms to gain the best understanding possible about the dataset.

Forgy/Lloyd algorithm
The Lloyd algorithm (1957Lloyd algorithm ( , published 1982) ) and the Forgy's algorithm (1965) are both batch (also called offline) centroid models.A centroid is the geometric center of a convex object and can be thought of as a generalisation of the mean.Batch algorithms are algorithms where a transformative step is applied to all cases at once.They are well suited to analyse large data sets, since the incremental k-means algorithms require to store the cluster membership of each case or to do two nearest-cluster computations as each case is processed, which is computationally expensive on large datasets.The difference between the Lloyd algorithm and the Forgy algorithm is that the Lloyd algorithm considers the data distribution discrete while the Forgy algorithm considers the distribution continuous.They have exactly the same procedure apart from that consideration.For a set of cases , where is the data space of d dimensions, the algorithm tries to find a set of k cluster centers that is a solution to the minimization problem: , discrete distribution , continuous distribution where is the probability density function and d is the distance function.Note here that if the probability density function is not known, it has to be deduced from the data available.
The first step of the algorithm is to choose the k initial centroids.It can be done by assigning them based on previous empirical knowledge, if it is available, by using k random observations from the data set, by using the k observations that are the farthest from one another in the data space or just by giving them random values within ܴ ௗ .Once the initial centroids have been chosen, iterations are done on the following two steps.In the first one, each case of the data set is assigned to a cluster based on its distance from the clusters centroids, using one of the metric previously presented.All cases assigned to a centroid are said to be part of the centroid's subspace ‫ܴ(ܥ‬ ௗ ).The second step is to update the value of the centroid using the mean of the cases assigned to the centroid.Those iterations are repeated until the centroids stop changing, within a tolerance criterion decided by the researcher, or until no case changes cluster.
Here is the pseudocode describing the iterations: 1-Choose the number of clusters 2-Choose the metric to use 3-Choose the method to pick initial centroids 4-Assign initial centroids 5-While metric(centroids, cases)>threshold a.For i <= nb cases i. Assign case to closest cluster according to metric b.Recalculate centroids The k-means clustering technique can be seen as partitioning the space into Voronoi cells (Voronoi, 1907).For each two centroids, there is a line that connects them.Perpendicular to this line, there is a line, plane or hyperplane (depending on the dimensionality) that passes through the middle point of the connecting line and divides the space into two separate subspaces.The k-means clustering therefore partitions the space into k subspaces for which ci is the nearest centroid for all included elements of the subspace (Faber, 1994).

MacQueen algorithm
The MacQueen algorithm ( 1967) is an iterative (also called online or incremental) algorithm.The main difference with Forgy/Lloyd's algorithm is that the centroids are recalculated every time a case change subspace and also after each pass through all cases.The centroids are initialized the same way as in the Forgy/Lloyd algorithm and the iterations are as follow.For each case in turn, if the centroid of the subspace it currently belongs to is the nearest, no change is made.If another centroid is the closest, the case is reassigned to the other centroid and the centroids for both the old and new subspaces are recalculated as the mean of the belonging cases.The algorithm is more efficient as it updates centroids more often and usually needs to perform one complete pass through the cases to converge on a solution.
Here is the pseudocode describing the iterations: 1-Choose the number of clusters 2-Choose the metric to use 3-Choose the method to pick initial centroids 4-Assign initial centroids 5-While metric(centroids, cases)>threshold a.For i <= cases i. Assign case i to closest cluster according to metric ii.Recalculate centroids for the two affected clusters b.Recalculate centroids

Hartigan & Wong algorithm
This algorithm searches for the partition of data space with locally optimal within-cluster sum of squares of errors (SSE).It means that it may assign a case to another subspace, even if it currently belong to the subspace of the closest centroid, if doing so minimizes the total within-cluster sum of square (see below).The cluster centers are initialized the same way as in the Forgy/Lloyd algorithm.The cases are then assigned to the centroid nearest them and the centroids are calculated as the mean of the assigned data points.The iterations are as follows.If the centroid has been updated in the last step, for each data point included, the within-cluster sum of squares for each data point if included in another cluster is calculated.If one of the cluster sum of square (SSE2 in the equation below, for all i ≠ 1) is smaller than the current one (SSE1), the case is assigned to this new cluster.
The iterations continue until no case change cluster, meaning until a change would make the clusters more internally variable or more externally similar.
Here is the pseudocode describing the iterations: 1-Choose the number of clusters 2-Choose the metric to use 3-Choose the method to pick initial centroids 4-Assign initial centroids 5-Assign cases to closest centroid 6-Calculate centroids 7-For j <= nb clusters, if centroid j was updated last iteration a. Calculate SSE within cluster b.For i <= nb cases in cluster i. Compute SSE for cluster k != j if case included ii.If SSE cluster k < SSE cluster j, case change cluster

Quality of the solutions found
There are two ways to evaluate a solution found by kmeans clustering.The first one is an internal criterion and is based solely on the dataset it was applied to, and the second one is an external criterion based on a comparison between the solution found and an available known class partition for the dataset.
The Dunn index (Dunn, 1979) is an internal evaluation technique that can roughly be equated to the ratio of the inter-cluster similarity on the intra-cluster similarity: where is the distance between cluster centroids and can be calculated with any of the previously presented metrics and is the measure of inner cluster variation.As we are looking for compact clusters, the solution with the highest Dunn index is considered the best.
As an external evaluator, the Jaccard index (Jaccard, 1901) is often used when a previous reliable classification of the data is available.It computes the similarity between the found solution and the benchmark as a percentage of correct classification.It calculates the size of the intersection (the cases present in the same clusters in both solutions) divided by the size of the union (all the cases from both datasets):

Limitations of the technique
The k-means clustering technique will always converge, but it is liable to find a local minimum solution instead of a global one, and as such may not find the optimal partition.The k-means algorithms are local search heuristics, and are therefore sensitive to the initial centroids chosen (Ayramo & Karkkainen, 2006).To counteract this limitation, it is recommended to do multiple applications of the technique, with different starting points, to obtain a more stable solution through the averaging of the solutions obtained.
Also, to be able to use the technique, the number of clusters present in your data must be decided at the onset, even if such information is not available a priori.Therefore, multiple trials are necessary to find the best amount of clusters.Thirdly, it is possible to create empty clusters with the Forgy/Lloyd algorithm if all cases are moved at once from a centroid subspace.Fourthly, the MacQueen and Hartigan methods are sensitive to the order in which the points are relocated, yielding different solutions depending on the order.
Fifthly, k-means clustering has a bias to create clusters of equal size, even if doing so doesn't best represent the group distributions in the data.It thus works best for clusters that are globular in shape, have equivalent size and have equivalent data densities (Ayramo & Karkkainen, 2006).Even if the dataset contains clusters that are not equiprobable, the k-means technique will tend to produce clusters that are more equiprobable than the population clusters.Corrections for this bias can be done by maximizing the likelihood without the assumption of equal sampling probabilities (Symons, 1981).
Finally, the technique has problems with outliers, as it is based on the mean, a descriptive statistic not robust to outliers.The outliers will tend to skew the centroid position toward them and have a disproportionate importance within the cluster.A solution to this was proposed by Ayramo & Karkkainen (2006).They suggested using the spatial median instead to get a more robust clustering.

Optimisation of the algorithms usage
While the algorithms presented are very efficient, since the technique is often used as a first classifier on large datasets, any optimisation that speeds the convergence of the clustering is useful.Bottou and Bengio (1995) have found that the fastest convergence on a solution is usually obtained by using an online algorithm for the first iteration through the entire dataset and an off-line algorithm subsequently as needed.This comes from the fact that online k-means benefits from the redundancies of the k training set and improve the centroids by going through a few cases (depending on the amount of redundancies) as much as would a full iteration through the offline algorithm (Bengio, 1991).

For very large datasets
For very large datasets that would make the computation of the previous algorithms too computationally expensive, it is possible to choose a random sample from the whole population of cases and apply the algorithm on the sample.If the sample is sufficiently large, the distribution of these initial reference points should reflect the distribution of cases in the entire set.

Fuzzy k-means clustering
In fuzzy k-means clustering (Bezdek, 1981), each case has a set of degree of belonging relative to all clusters.It differs from previously presented k-means clustering where each case belongs only to one cluster at a time.In this algorithm, the centroid of a cluster (ck) is the mean of all cases in the dataset, weighted by their degree of belonging to the cluster (wk).
The degree of belonging is a function of the distance of the case from the centroid, which includes a parameter controlling for the highest weight given to the closest case.It iterates until a user-set criterion is reached.Like the k-means clustering technique, this technique is also sensitive to initial clusters and local minima.It is particularly useful for dataset coming from area of research where partial belonging to classes is supported by theory.

Self-Organising Maps
Self-Organizing Maps (Kohonen, 1982) are an artificial neural network algorithm that aims to extract attributes present in a dataset and transcribe them into an output space of lower dimensionality, while keeping the spatial structure of the data.Doing so clusters similar cases on the map, a process that can be likened to the k-means algorithm clustering centroids.This neural network has two layers, the input layer which is the initial dataset, and an output layer that is the self-organizing map, which is usually bidimensional.There is a connection weight between each variable (or attribute) of a case and the map, thus making the connection weights matrix of the dimensionality of the input multiplied by the dimensionality of the map.It uses a Hebbian competitive learning algorithm.What is obtained at the end is a map where similar elements are contiguous, which also give a two dimensional representation of the data.It is therefore useful if a graphic representation of the data is advantageous to its comprehension.

Gaussian-expectation maximization (GEM)
This algorithm was first explained by Dempster, Laird & Rubin (1977).It uses a linear combination of d-dimensional Gaussian distributions as the cluster centers.It aims to minimize where ‫ݔ‪൫‬‬ หܿ ൯ is the probability of xi (the case), given that it is generated by a Gaussian distribution that has cj as its center, and ‫ܿ(‬ ) is the prior probability of said center.It also computes a soft membership for each center through Bayes rule: .

The Mathematica Notebook
There exists a function in Mathematica, ''FindClusters'', that implements the k-means clustering technique with an alternative algorithm called k-medoids.This algorithm is equivalent to the Forgy/Lloyd algorithm but it uses cases from the datasets as centroids instead of the arithmetical mean.The implementation of the algorithm in Mathematica allows for the use of different metrics.There is also a function in Matlab called "kmeans" that implements the kmeans clustering technique.It uses a batch algorithm in a first phase, then an iterative algorithm in a second phase.Finally, there is no implementation of the k-means technique in SPSS, but an implementation of hierarchical clustering is available.As the goals of this tutorial are to showcase the workings of the k-means clustering technique and to help understand said technique better, we created a Mathematica Notebook where the inner workings of all three algorithms are open to view (available on the TQMP website).
The Notebook has clearly labeled sections.The initial section contains all of the modules used in the Notebook.This is where you can see the inner workings of the algorithms.In the section of the Notebook where user changes are allowed, you find various subsections that explicit the parameters the user needs to input.The first one is used to import the data, which should be in a database format (.txt, .dat,etc.), and should not include the variable names.The second section allows to standardize the dataset variables if need be.The third section put a label on each case to keep track of cases as they are clustered.The next sections allows to choose the number of clusters, the stop criterion on the number of iterations, the tolerance level between the cluster solutions, the metric to be used (between Euclidian distance, Squared Euclidian distance, Manhattan distance, Maximum distance, Mahalanobis distance and Cosine similarity) and the starting centroids.To choose the centroids, random assignation or farthest vectors assignation are available.The following section is the heart of the Notebook.Here you can choose to use the Forgy/Lloyd, MacQueen or Hartigan & Wang algorithm.The algorithms iterate until the user-inputted criterion on the number of iterations or centroid change is reached.For each algorithm, you obtain the number of iterations through the whole dataset needed for the solution to converge, the centroids vectors and the cases belonging to each cluster.The next section implements the Dunn index, which evaluates the internal quality of the solution and outputs the Dunn index.Next is a visualisation of the cases and their centroids for bidimensionnal or tridimensional datasets.The next section calculates the equation of the vector/plan that separates two centroids subspaces.Finally, the last section uses Mathematica's implementation of the ANOVA to allow the user to compare clusters to see for which variables the clusters are significantly different from one another.

Example
1a. Let's take a toy example and use our Mathematica notebook to find a clustering solution.The first thing that we need to do is activate the initialisation cells that contain the modules.We'll use a dataset (at the beginning of the Mathematica Notebook) that has four dimensions and nine cases.Please activate only the dataset needed.As the variables are not on the same scale, we start by standardizing the data, as seen in Table 1.
Now, as we have no prior information on the dataset, we chose to pick three clusters and to choose the cases that are the furthest from one another, as seen in Table 2.
We chose the Lloyd method to find the clusters with a Euclidian distance metric.We then run the main program, which iterates in turn to assign cases to centroids and move the centroids.After 2 iterations, the centroids have attained their final position and the cases have changed clusters to their final position.The solution found has one cluster containing cases 1, 6 and 8, one cluster containing case 4 and one cluster containing the remaining cases (2,3,5 and 7).This is illustrated in Figure 1.
1b.It is often interesting to see which variables contributed to the clustering the most.While not strictly part of the k-means clustering technique, it is a useful step when the variables have meaning assigned to them.To do so, we use the ANOVA with post-hoc test found at the end of the Notebook.We find that cluster 1 contain cases with a higher age than cluster 3 (F(2,6) = 14.11, p < .01)and that cluster 2 is higher than both the other two clusters on the information variable (F(2,6) = 5.77, p < .05).No clusters differ significantly on the verbal expression and performance variables.1c.It is also possible to obtain the equation of the boundary between the clusters' neighborhood.To do so, use the ''Finding the equation'' section of the Notebook.The output obtained is shown in Table 3.
If we read this output, we find that the equation of the vector that separates cluster 1 form cluster 2 is -2.28 -1.37p -2.77i -2.33ve -.79a, where p is performance, i is information, ve is verbal expression and a is age.So any new cases within each of the boundaries could be classified as belonging to the corresponding centroid.
2a. Let's briefly present a visual example of clusters centers moving.To do so, we used a dataset with 2 attributes (named xp1-xp3.dat in the Notebook).The dataset is very asymmetrical and as such would not be an ideal case to apply the algorithm on (since the distribution of the clusters is unlikely to be spherical), but it will show the moving of the clusters effectively.We chose a four clusters solution and used the Forgy/Lloyd algorithm with Euclidian distance again.
We can see in Figure 2 that the densest area of the dataset has two clusters, as the algorithm tries to form equiprobable clusters.The data that is really high on the first variable belong to a cluster while the data that is really high on the second variable belong to a fourth cluster.
2b.The next simulation compares the different metrics for the same algorithm.We used the Forgy/Lloyd algorithm with random starting clusters.We can see in Figure 3 that depending on what we want to prioritize in the data, the different metrics can be used to reach different goals.The Euclidian distance prioritizes the minimisation of all differences within the cluster while the cosine similarity prioritizes the maximization of all the similarities.Finally, the maximum distance prioritizes the minimisation of the distance of the most extreme elements of each case within the cluster.
3a.This example demonstrates the influence of using different algorithms.We used the very well-known iris flower dataset (Fisher, 1936).The dataset (named iris.dat in the Notebook) consists of 150 cases and there are three classes present in the data.Cases are split into equiprobable groups (group 1 is from cases 1 to 50, group 2 is from 51 to 100 and group 3 is from 101 to 150).Each iris is described by three attributes (sepal length, sepal width and petal length).
We chose random starting vectors from the dataset for the simulations centroids and used the same for all three algorithms: .We also used the Euclidian metric to calculate the distance between cases and centroids.Table 4 summarizes the results.
We can see that all algorithms make mistakes in the classification of the irises (as was expected from the characteristics of the data).For each, the greyed out cases are misclassified.We find that the Forgy/Lloyd algorithm is the one making the fewer mistakes, as indicated here by the Dunn and Jaccard indexes, but the graphs of Figure 4 shows that the best visual fit comes from the MacQueen algorithm.Mathematica K-medoid algorithm Mathematica's implementation gives the clustering of the cases, but not the cluster centers or the tag of the cases.It makes it quite confusing to actually keep track of individual cases.Here is the graphical solution obtained, which is very similar to the MacQueen solution (Figure 5).

Figure 1 .
Figure 1.Upper section) Cases assignation to clusters Lower section) Output from the Notebook giving the number of iterations, the final centroids and the cases belonging to each clusters

(
Figure 4. Final classification of different algorithms

The k-means clustering technique: General considerations and implementation in Mathematica Laurence Morissette and Sylvain Chartier
Université d'Ottawa Data clustering techniques are valuable tools for researchers working with large databases of multivariate data.In this tutorial, we present a simple yet powerful one: the k-means clustering technique, through three different algorithms: the Forgy/Lloyd, algorithm, the MacQueen algorithm and the Hartigan & Wong algorithm.We then present an implementation in Mathematica and various examples of the different options available to illustrate the application of the technique.

Table 1 .
Data used left) non standardised right) standardised

Table 3 .
Plans defining the limits of each centroid subspace

Table 4 .
Summary of the results