We work with the USArrests dataset. We want to classify the 50 US states on the basis of their arrest profiles and urbanization rates. We rely on hierarchical, bottom-up (agglomerative) classification.
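As a minimal sketch of this setup, the hierarchy could be built as follows. The worksheet does not state whether the variables are standardized first, so this sketch assumes the raw variables with Euclidean distance, together with the `ward.D2` linkage that the worksheet discusses further down. The object names `d`, `clas` and `dend.1` are illustrative.

```r
# Assumption: raw (unstandardized) variables and Euclidean distance
data(USArrests)

# Pairwise Euclidean distances between the 50 states
d <- dist(USArrests, method = "euclidean")

# Bottom-up (agglomerative) hierarchical clustering with Ward's criterion
clas <- hclust(d, method = "ward.D2")

# Convert to a dendrogram and print only its top levels
dend.1 <- as.dendrogram(clas)
str(dend.1, max.level = 3)
```

If the variables should carry equal weight, `dist(scale(USArrests))` is a common alternative; the heights of the tree then change accordingly.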
     --[dendrogram w/ 2 branches and 50 members at h = 701]
      |--[dendrogram w/ 2 branches and 16 members at h = 141]
      |  |--[dendrogram w/ 2 branches and 10 members at h = 69.3]
      |  |  |--[dendrogram w/ 2 branches and 3 members at h = 30.1] ..
      |  |  `--[dendrogram w/ 2 branches and 7 members at h = 43.4] ..
      |  `--[dendrogram w/ 2 branches and 6 members at h = 82.3]
      |     |--[dendrogram w/ 2 branches and 4 members at h = 33.4] ..
      |     `--[dendrogram w/ 2 branches and 2 members at h = 38.5] ..
      `--[dendrogram w/ 2 branches and 34 members at h = 353]
         |--[dendrogram w/ 2 branches and 14 members at h = 106]
         |  |--[dendrogram w/ 2 branches and 6 members at h = 42.5] ..
         |  `--[dendrogram w/ 2 branches and 8 members at h = 44.8] ..
         `--[dendrogram w/ 2 branches and 20 members at h = 163]
            |--[dendrogram w/ 2 branches and 10 members at h = 38.5] ..
            `--[dendrogram w/ 2 branches and 10 members at h = 66] ..
    etc…
    library(dendextend)  # provides rotate() and color_branches()

    # label(dend.1)
    dend.2 <- as.dendrogram(hcl.1)
    # Order it as closely as possible to the order of the observations:
    dend.2 <- rotate(dend.2, 1:50)
    # Color the branches based on the clusters:
    dend.2 <- color_branches(dend.2, k = 3) #, groupLabels = iris_species)
    # Manually match the labels, as much as possible, to the real classification of the flowers:
    # labels_colors(dend.2) <-
    #   rainbow_hcl(3)[sort_levels_values(
    #     as.numeric(iris[, 5])[order.dendrogram(dend.2)]
    #   )]
Ward's method
The `method = "ward.D2"` option of `hclust` aggregates individuals according to Ward's method, that is, according to a variance criterion.
Question
What distance is used? Describe the variance-based classification method.
The output `clas$height` gives the height of the dendrogram jump at each new iteration. With Ward's method, this height reflects the loss of inter-class variance caused by the merge; with `ward.D2` specifically, it is the squared height that is proportional to that loss.
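To make the role of the heights concrete, here is a small sketch that inspects them. It assumes `clas` was obtained with `hclust` on the raw Euclidean distances, which the worksheet does not state explicitly.

```r
# Assumption: clas comes from hclust() on raw Euclidean distances
clas <- hclust(dist(USArrests), method = "ward.D2")

# One height per merge, i.e. one per iteration of the algorithm
length(clas$height)

# Ward heights never decrease from one merge to the next
is.unsorted(clas$height)   # FALSE means sorted in increasing order

# The last merges (largest jumps) matter most when choosing the number of classes
tail(clas$height, 5)
```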
Question
How many groups are there at step 0? At the last step?
How many iterations are there?
Recall the definition of inter-class variance.
What is the inter-class variance at step 0? At the last step? How does it evolve with the number of groups (or with the number of iterations)?
By comparing the total inertia with the `clas$height` output, find the coefficient of proportionality between the loss of inter-class variance and the jump heights (squared, when `ward.D2` is used).
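The comparison asked for above can be checked numerically with a sketch like the following. The helper `between_ss` and the variable names are hypothetical. Inertias are taken as sums of squares divided by n, and the clustering is assumed to use raw Euclidean distances with `ward.D2`. Under those assumptions, each squared height should equal twice the between-class sum of squares lost at that merge, which is worth verifying rather than taking on trust.

```r
# Assumption: raw Euclidean distances and ward.D2 linkage
clas <- hclust(dist(USArrests), method = "ward.D2")
X <- as.matrix(USArrests)
n <- nrow(X)
g <- colMeans(X)                        # global centroid

# Total sum of squares; the total inertia is total_ss / n
total_ss <- sum(sweep(X, 2, g)^2)

# Between-class sum of squares for the partition into k classes
# (hypothetical helper, not part of the worksheet)
between_ss <- function(k) {
  cl <- cutree(clas, k = k)
  sizes <- as.vector(table(cl))
  centers <- rowsum(X, cl) / sizes      # class centroids, one row per class
  sum(sizes * rowSums(sweep(centers, 2, g)^2))
}
between <- sapply(1:n, between_ss)      # indexed by the number of classes k

# Loss at each merge, in merge order (from n classes down to 1)
loss_ss <- -diff(rev(between))

# Compare with the jump heights
all.equal(loss_ss, clas$height^2 / 2)
all.equal(sum(clas$height^2) / (2 * n), total_ss / n)
```

Dividing `loss_ss` by `n` turns the sums of squares into variances, so the coefficient relating the loss of inter-class variance to the squared heights can be read off the first comparison.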
Choice of the number of classes
Question
Plot the loss of inter-class variance as a function of the number of iterations:
Select the “optimal” number of classes.
Verify that, for the number of classes chosen, each class contains enough individuals (the `cutree` function can be used).
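The steps above can be sketched as follows. The value `k <- 3` is only a placeholder that mirrors the `k = 3` used in the `color_branches` call of this worksheet; it is not a recommendation. The clustering is again assumed to use raw Euclidean distances.

```r
# Assumption: raw Euclidean distances and ward.D2 linkage
clas <- hclust(dist(USArrests), method = "ward.D2")
n_iter <- length(clas$height)

# Squared heights: with ward.D2 these track the loss of inter-class variance
loss <- clas$height^2

# Loss at each iteration; a sharp rise near the end suggests where to stop merging
plot(seq_len(n_iter), loss, type = "b",
     xlab = "Iteration", ylab = "Squared jump height")

# Placeholder number of classes, to be replaced by what the plot suggests
k <- 3
groups <- cutree(clas, k = k)

# Class sizes, to check that no class is too small
table(groups)
```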
These classes can be represented using a dendrogram.
You can also colour the leaves of the tree according to their class. To do this, install and load the package `dendextend`.
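A possible way to draw both representations is sketched below, assuming three classes and raw Euclidean distances (neither is fixed by the worksheet). The base R part uses `rect.hclust`; the `dendextend` part uses `color_branches` and `color_labels` and is skipped when the package is not installed.

```r
# Assumption: raw Euclidean distances, ward.D2 linkage, three classes
clas <- hclust(dist(USArrests), method = "ward.D2")
k <- 3

# Base R: draw the tree and frame the k classes
plot(clas, hang = -1, cex = 0.6, main = "USArrests, Ward (ward.D2)")
# rect.hclust() invisibly returns the members of each framed class
members <- rect.hclust(clas, k = k, border = 2:(k + 1))

# dendextend: color branches and leaf labels by class
# install.packages("dendextend")  # run once if the package is missing
if (requireNamespace("dendextend", quietly = TRUE)) {
  dend <- as.dendrogram(clas)
  dend <- dendextend::color_branches(dend, k = k)
  dend <- dendextend::color_labels(dend, k = k)
  plot(dend)
}
```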
Link with PCA
We will represent the classes obtained in the factorial plane(s) produced by the PCA. This makes it possible to visualize the classes and to describe them in terms of the initial variables.
Question
Represent the coordinates of the individuals of each group in the first factorial plane (with one color per class). The vector generated by `cutree` can be used to form a color vector. Interpret the result.
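One way to produce this plot is sketched below with base R's `prcomp`. The worksheet does not say which PCA function or scaling it uses, so the unstandardized PCA here is an assumption chosen to match a clustering on raw distances; the choice of three classes is likewise a placeholder.

```r
# Assumption: raw Euclidean distances, ward.D2 linkage, three classes
clas <- hclust(dist(USArrests), method = "ward.D2")
groups <- cutree(clas, k = 3)

# PCA on the same (unstandardized) variables as the clustering;
# use scale. = TRUE if the worksheet's PCA is standardized
pca <- prcomp(USArrests, center = TRUE, scale. = FALSE)
coords <- pca$x[, 1:2]               # coordinates on the first factorial plane

# The cutree vector (values 1..k) doubles as a color index
plot(coords, col = groups, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Classes on the first factorial plane")
text(coords, labels = rownames(USArrests), col = groups, pos = 3, cex = 0.6)
legend("topright", legend = paste("class", sort(unique(groups))),
       col = sort(unique(groups)), pch = 19)

# Class means of the original variables help interpret each class
aggregate(USArrests, by = list(class = groups), FUN = mean)
```

Reading the class means alongside the plot shows which variables separate the classes along each axis.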