This lab is dedicated to the k-means clustering method. In words, k-means takes as input a collection of points in \(\mathbb{R}^d\) (a numerical dataset) and a positive integer \(k\). It returns a collection of \(k\) points (the centers) from \(\mathbb{R}^d\). The centers define a Voronoï tesselation/partition/diagran of \(\mathbb{R}^d\). The Voronoï cells define a clustering of the original dataset.
In the next chunk, we generate a Voronoï diagram on \(\mathbb{R}^2\) with \(100\) cells defined from \(100\) random points drawn from the uniform distribution on a square. Function stat_voronoi() comes from ggvoronoi
If \(c \in \mathcal{C}\) is the closest centroid to \(X \in \mathbb{R}^p\),
\[\|c - X\|^2\]
is the quantization/reconstruction error suffered when using codebook \(\mathcal{C}\) to approximate \(X\)
If there are no restrictions on the dimension of the input space, on the number of centroids, or on sample size, computing an optimal codebook is a \(\mathsf{NP}\) -hard problem
The kmeans() function
kmeans() is a wrapper for a collection of Algorithms that look like the Lloyd algorithm
Initialize
Choose \(k\) centroids
Iterations: Two phases
Movement
Assign each sample point to the closest centroid (Assign each sample point to a class in the Voronoi partition defined by the centroids)
Update
For each class in the current Voronoi partition, update the centroid so as to minimize the Within Cluster Sum of Squared distances.
No guarantee to converge to a global optimum!
Proceeed by trial and error.
Repeat the algorithm and keep the best result.
Iris data
Question
Run kmeans() on the projection of the Iris dataset (data(iris)) on the Petal plane.
Objects of class kmeans have two attributes names and class. Because of the class attribute, objects of class kmeans are not just lists with named elements.
Summarizing a clustering
Question
Check the structure of objects of class kmeans and use broom::tidy() to get a summary.
df_centers<-select(iris, starts_with("Petal"))|>kmeans(centers =3)|>broom::tidy()df_centers|>gt::gt()|>gt::fmt_number(decimals =2)|>gt::tab_caption("Iris clustering in the Petal plane, kmeans with 3 clusters")
Iris clustering in the Petal plane, kmeans with 3 clusters
Petal.Length
Petal.Width
size
withinss
cluster
4.27
1.34
52.00
13.06
1
1.46
0.25
50.00
2.02
2
5.60
2.04
48.00
16.29
3
How are the rows related to clusters?
What are the coordinates good for?
What does the size column mean?
withinss stands for Within Sum of Squares. How is it computed? Why is it useful?
Visualizing a clustering
Question
Use broom::augment() and broom::tidy() to prepare two dataframes that will allow you to overlay a scatterplot of the dataset and a Voronoï diagram defined by the centers output by kmeans().
plot_km_centroids(broom::augment(km.2.swiss, scale(swiss)),broom::tidy(km.2.swiss), Education, Infant.Mortality)+labs( title="Kmeans over Swiss dataset, k=2")
Solution
Code
plot_km_centroids(augment(km.2.swiss.pca,broom::augment(prcomp(swiss_scaled), data=swiss)),tidy(km.2.swiss.pca),.fittedPC1,.fittedPC2)+labs( title="Kmeans over Swiss dataset, k=2", subtitle="Clustering over the Principal components")
plot_km_centroids(augment(km.4.swiss.pca,broom::augment(prcomp(swiss_scaled), data=swiss)),tidy(km.4.swiss.pca),.fittedPC1,.fittedPC2)+labs( title="Kmeans over Swiss dataset, k=4", subtitle="Clustering over the Principal components")