We work with the `USArrests` dataset. We want to classify the 50 US states on the basis of their arrest profiles and urbanization rates. We rely on hierarchical, bottom-up (agglomerative) classification.
Exploring the results of hierarchical clustering (objects of class `hclust`) is easier after converting them to class `dendrogram`.
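As a minimal sketch of this conversion, the snippet below builds an `hclust` object and turns it into a `dendrogram`. Standardizing the variables with `scale()` and keeping the default linkage are assumptions made for illustration; the worksheet does not prescribe them at this point.

```r
# Assumption: variables are standardized before computing distances.
X <- scale(USArrests)

# Hierarchical clustering with the default linkage of hclust().
clas <- hclust(dist(X))

# Convert the hclust object to class "dendrogram".
dend <- as.dendrogram(clas)

class(clas)
class(dend)
attr(dend, "members")   # number of leaves under the root
```

A `dendrogram` is a nested list, so sub-trees can be extracted with `dend[[1]]` and `dend[[2]]` and plotted separately.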
Question
Ward method
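The `method = "ward.D2"` option of `hclust` aggregates individuals according to Ward's method, that is, according to variance: at each step, the two groups that are merged are those whose fusion causes the smallest increase in within-class variance (equivalently, the smallest loss of inter-class variance).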
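A possible call is sketched below. The object name `clas` follows the name used later in this worksheet; standardizing the data with `scale()` is an assumption.

```r
# Assumption: variables are standardized first.
X <- scale(USArrests)

# Pairwise distances between states, then Ward's criterion.
d <- dist(X)
clas <- hclust(d, method = "ward.D2")

clas$method          # linkage actually used
head(clas$merge)     # first merges (negative indices are single states)
head(clas$height)    # heights of the first merges
```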
Question
Which distance is used? Describe the variance-based classification method.
Question
How many groups are there at step 0? at the last step?
How many iterations are there?
Recall the definition of inter-class variance.
What is the inter-class variance at step 0? At the last step? How does it evolve with the number of groups (or with the number of iterations)?
By comparing the total inertia with the `clas$height` output, find the coefficient of proportionality between the loss of inter-class variance and the height of the jumps.
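One way to set up this comparison is sketched below. It assumes standardized data, weights of 1/n for the individuals, and the `"ward.D2"` linkage, for which the squared heights (rather than the heights themselves) are the natural quantity to compare with the inertia. The names `total_inertia`, `sum_sq_height` and `ratio` are illustrative.

```r
# Assumptions: standardized variables, each individual weighted 1/n.
X <- scale(USArrests)
n <- nrow(X)
clas <- hclust(dist(X), method = "ward.D2")

# Total inertia: mean squared distance of the individuals to the centroid.
centered <- sweep(X, 2, colMeans(X))
total_inertia <- sum(centered^2) / n

# Sum of the squared merge heights over all iterations.
sum_sq_height <- sum(clas$height^2)

# Ratio to inspect when looking for the proportionality coefficient.
ratio <- sum_sq_height / total_inertia
ratio
```

Since all the inter-class variance has been lost once every state is merged into a single group, the total loss equals the total inertia, which is why the two sums are compared.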
Choice of the number of classes
Question
Plot the loss of inter-class variance as a function of the number of iterations:
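A minimal sketch of such a plot follows. It uses the squared merge heights as a stand-in for the loss of inter-class variance, on the assumption that, with `"ward.D2"`, the two are proportional; the constant only rescales the vertical axis.

```r
# Assumption: standardized variables and Ward's criterion.
X <- scale(USArrests)
clas <- hclust(dist(X), method = "ward.D2")

# Squared heights, taken here as proportional to the loss of
# inter-class variance at each iteration.
loss <- clas$height^2

plot(seq_along(loss), loss, type = "b",
     xlab = "Iteration",
     ylab = "Squared merge height")
```

A sharp rise near the last iterations suggests stopping the merging just before that jump.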
Select the “optimal” number of classes.
Verify that, for the number of classes chosen, each class contains enough individuals (the `cutree` function can be used).
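The check could look like the sketch below. The value `k <- 4` is a hypothetical choice used only for illustration, not the answer to the previous question.

```r
# Assumption: standardized variables and Ward's criterion.
X <- scale(USArrests)
clas <- hclust(dist(X), method = "ward.D2")

k <- 4                          # hypothetical number of classes
groups <- cutree(clas, k = k)   # class label of each state

table(groups)                   # number of states per class
```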
These classes can be represented on a dendrogram.
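As an illustration, base R can draw the tree and frame the classes with `rect.hclust`. The number of classes `k <- 4` is again a hypothetical value.

```r
# Assumption: standardized variables and Ward's criterion.
X <- scale(USArrests)
clas <- hclust(dist(X), method = "ward.D2")
k <- 4                                  # hypothetical number of classes

# Draw the tree, then frame each of the k classes.
plot(clas, hang = -1, cex = 0.6,
     main = "Ward clustering of USArrests", xlab = "", sub = "")
boxes <- rect.hclust(clas, k = k, border = 2:(k + 1))
```

`rect.hclust` invisibly returns the members of each framed class, so `boxes` can be inspected to see which states fall in which rectangle.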
You can also colour the leaves of the tree corresponding to a class. To do this, install and load the package `dendextend`.
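A possible way to do this with `dendextend` is sketched below, using its `color_labels` and `color_branches` functions. The number of classes `k <- 4` is hypothetical, and the `requireNamespace` guard is only there so that the snippet does not fail when the package is not installed.

```r
# Assumption: standardized variables and Ward's criterion.
X <- scale(USArrests)
clas <- hclust(dist(X), method = "ward.D2")
dend <- as.dendrogram(clas)
k <- 4                                   # hypothetical number of classes

# install.packages("dendextend")         # run once if the package is missing
if (requireNamespace("dendextend", quietly = TRUE)) {
  # Colour the leaf labels, then the branches, by class.
  dend_col <- dendextend::color_labels(dend, k = k)
  dend_col <- dendextend::color_branches(dend_col, k = k)
  plot(dend_col)
}
```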
Link with PCA
We will represent the classes obtained by the clustering in the factorial plane(s) obtained by PCA. This makes it possible to visualize the classes and to describe them in terms of the original variables.
Question
Represent the coordinates of the individuals of each group in the first factorial plane (with one colour per class). The vector returned by `cutree` can be used to build a colour vector. Interpret the result.
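A minimal sketch of this plot is given below. It assumes a standardized PCA computed with `prcomp` and a hypothetical number of classes `k <- 4`; the integer labels from `cutree` are used directly as colour indices.

```r
# Assumption: standardized variables for both clustering and PCA.
X <- scale(USArrests)
clas <- hclust(dist(X), method = "ward.D2")
k <- 4                                   # hypothetical number of classes
groups <- cutree(clas, k = k)

# PCA on standardized variables; keep the first two components.
pca <- prcomp(USArrests, scale. = TRUE)
scores <- pca$x[, 1:2]

# First factorial plane, one colour per class.
plot(scores, col = groups, pch = 19, xlab = "PC1", ylab = "PC2")
text(scores, labels = rownames(scores), col = groups, pos = 3, cex = 0.6)
legend("topright", legend = paste("Class", seq_len(k)),
       col = seq_len(k), pch = 19)
```

Looking at `pca$rotation[, 1:2]` alongside the plot helps relate the position of each class to the original variables.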