This lab is dedicated to the k-means clustering method. In words, k-means takes as input a collection of points in \(\mathbb{R}^d\) (a numerical dataset) and a positive integer \(k\). It returns a collection of \(k\) points (the centers) from \(\mathbb{R}^d\). The centers define a Voronoï tessellation/partition/diagram of \(\mathbb{R}^d\). The Voronoï cells define a clustering of the original dataset.
In the next chunk, we generate a Voronoï diagram on \(\mathbb{R}^2\) with \(100\) cells defined from \(100\) random points drawn from the uniform distribution on a square. The stat_voronoi() function comes from the ggvoronoi package.
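A minimal sketch of such a chunk, assuming ggvoronoi is installed (the seed value and plotting options are illustrative choices, not prescribed by the lab):

```r
library(ggplot2)
library(ggvoronoi)  # provides stat_voronoi()

set.seed(42)  # arbitrary seed, for reproducibility
# 100 random points, uniform on the unit square
df <- data.frame(x = runif(100), y = runif(100))

ggplot(df, aes(x, y)) +
  stat_voronoi(geom = "path") +  # draw the boundaries of the 100 Voronoï cells
  geom_point() +
  coord_fixed()
```

Each point is the "center" of exactly one cell: the cell collects the region of the plane closer to that point than to any other.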
If \(c \in \mathcal{C}\) is the closest centroid to \(X \in \mathbb{R}^d\), then
\[\|c - X\|^2\]
is the quantization/reconstruction error suffered when using codebook \(\mathcal{C}\) to approximate \(X\).
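This error is easy to compute directly. The helper below is a hypothetical illustration (quant_error is not a function from the lab): it takes a point and a codebook stored as the rows of a matrix, and returns the squared distance to the closest centroid.

```r
# Hypothetical helper: quantization error of point X under codebook C
# (rows of C are the centroids)
quant_error <- function(X, C) {
  d2 <- apply(C, 1, function(c) sum((c - X)^2))  # squared distance to each centroid
  min(d2)                                        # error w.r.t. the closest one
}

X <- c(1, 2)
C <- rbind(c(0, 0), c(1, 1))
quant_error(X, C)  # closest centroid is (1, 1): error is 0 + 1 = 1
```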
If there are no restrictions on the dimension of the input space, on the number of centroids, or on the sample size, computing an optimal codebook is an \(\mathsf{NP}\)-hard problem.
The kmeans() function
kmeans() is a wrapper for a collection of algorithms that are variants of Lloyd's algorithm
Initialize
Choose \(k\) centroids
Iterations: Two phases
Movement
Assign each sample point to the closest centroid (that is, assign each sample point to its cell in the Voronoï partition defined by the centroids)
Update
For each class in the current Voronoï partition, update the centroid so as to minimize the Within-Cluster Sum of Squared distances: the new centroid is the mean of the points in that class.
There is no guarantee of convergence to a global optimum!
Proceed by trial and error.
Repeat the algorithm and keep the best result.
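The two phases above can be sketched in a few lines of R. This is a bare-bones illustration of Lloyd's algorithm, not the implementation behind kmeans(); it assumes no cluster becomes empty during the iterations and uses a fixed iteration count instead of a convergence test.

```r
# Minimal sketch of Lloyd's algorithm (assumes no empty clusters)
lloyd <- function(X, k, n_iter = 20) {
  X <- as.matrix(X)
  # Initialize: choose k centroids among the sample points
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  for (it in seq_len(n_iter)) {
    # Movement: squared distances from every point to every centroid
    d2 <- sapply(seq_len(k),
                 function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
    cl <- max.col(-d2)  # index of the closest centroid for each point
    # Update: each centroid becomes the mean of its class
    centroids <- do.call(rbind, lapply(seq_len(k), function(j)
      colMeans(X[cl == j, , drop = FALSE])))
  }
  list(centers = centroids, cluster = cl)
}
```

In practice, "repeat and keep the best result" is what the nstart argument of kmeans() does: kmeans(X, centers = k, nstart = 10) runs 10 random initializations and returns the one with the lowest total within-cluster sum of squares.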
Iris data
Question
Run kmeans() on the projection of the Iris dataset (data(iris)) on the Petal plane (the plane spanned by Petal.Length and Petal.Width).
Use broom::augment() and broom::tidy() to prepare two dataframes that will allow you to overlay a scatterplot of the dataset and a Voronoï diagram defined by the centers output by kmeans().
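One possible sketch of an answer, assuming ggvoronoi is available and choosing \(k = 3\) (a natural guess, since iris contains three species; the lab does not prescribe a value):

```r
library(ggplot2)
library(broom)
library(ggvoronoi)  # for stat_voronoi()

data(iris)
petal <- iris[, c("Petal.Length", "Petal.Width")]  # projection on the Petal plane
set.seed(1)
km <- kmeans(petal, centers = 3, nstart = 10)

pts <- augment(km, petal)  # one row per sample, with its assigned .cluster
ctr <- tidy(km)            # one row per center (coordinates keep the data's column names)

ggplot() +
  geom_point(data = pts,
             aes(Petal.Length, Petal.Width, colour = .cluster)) +
  stat_voronoi(data = ctr,
               aes(Petal.Length, Petal.Width), geom = "path") +
  geom_point(data = ctr,
             aes(Petal.Length, Petal.Width), shape = 3, size = 4)
```

augment() supplies the scatterplot layer (points plus cluster labels), while tidy() supplies the centers that define the Voronoï overlay.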