M1 MIDS/MFA/LOGOS | Year 2024
Besides the usual packages (tidyverse, …), we shall require FactoMineR and related packages.
stopifnot(
require(FactoMineR),
require(factoextra),
require(FactoInvestigate)
)
mortality dataset
The goal is to investigate a possible link between age group and cause of death. We work with the dataset mortality from package FactoMineR.
data("mortality", package = "FactoMineR")
#help(mortality)
A data frame with 62 rows (the different causes of death) and 18 columns. Each column corresponds to an age interval (15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, 85-94, 95 and more) in a year. The first 9 columns correspond to data in 1979 and the last 9 columns to data in 2006. Each cell gives the count of deaths for a cause of death in an age interval (in a year).
See also EuroStat:
Read the documentation of the mortality dataset. Is this a sample? An aggregated dataset?
If you consider mortality as an aggregated dataset, can you figure out the organization of the sample mortality was built from?
Before proceeding to Correspondence Analysis (CA), let us tidy up the table and draw some elementary plots.
Reshape mortality so as to obtain a tibble with columns Cause and year, while keeping all columns named after age groups (that is, tidy up the data into a partially long format).
Use rowwise() and sum(c_across()) to compute the total number of deaths per year and Cause in column total. This makes it possible to mimic rowSums() inside a pipeline. Column grand_total is computed using a window function over grouping by Cause.
Build a bar plot to display the importance of causes of death in France in years 1979 and 2006.
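The reshaping and row-wise totals described above can be sketched as follows. This is only a sketch on a made-up toy table: the values are invented, and the column naming "<age group> (<yy>)" is an assumption that should be checked against colnames(mortality) before reusing the pattern.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for mortality (made-up counts): rows are causes,
# columns are assumed to be named "<age group> (<yy>)".
toy <- data.frame(
  `15-24 (79)` = c(10, 2), `25-34 (79)` = c(20, 4),
  `15-24 (06)` = c(5, 1),  `25-34 (06)` = c(8, 3),
  row.names = c("Cause A", "Cause B"), check.names = FALSE
)

toy_long <- toy |>
  rownames_to_column("Cause") |>
  # split each column name into an age group and a two-digit year
  pivot_longer(-Cause,
               names_to = c("age", "year"),
               names_pattern = "^(.*) \\((\\d+)\\)$",
               values_to = "deaths") |>
  mutate(year = if_else(year == "79", 1979L, 2006L)) |>
  # back to one column per age group: partially long format
  pivot_wider(names_from = age, values_from = deaths) |>
  rowwise() |>
  mutate(total = sum(c_across(-c(Cause, year)))) |>  # mimics rowSums()
  ungroup() |>
  group_by(Cause) |>
  mutate(grand_total = sum(total)) |>                # window function per Cause
  ungroup()

toy_long
```

A bar plot of total against Cause, with one facet or fill colour per year, can then be built from toy_long with ggplot2.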
Compute and display the total number of deaths in France in years 1979 and 2006.
Compute the marginal counts for each year (1979, 2006). Compare.
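As an illustration, totals and marginal counts per year can be obtained with base R. The sketch below uses a made-up toy matrix with an assumed "<age group> (<yy>)" column naming; for mortality itself, the documentation quoted earlier suggests selecting columns 1:9 for 1979 and 10:18 for 2006.

```r
# Made-up counts; the column naming is an assumption.
toy <- matrix(c(10, 2, 20, 4, 5, 1, 8, 3), nrow = 2,
              dimnames = list(c("Cause A", "Cause B"),
                              c("15-24 (79)", "25-34 (79)",
                                "15-24 (06)", "25-34 (06)")))

is_79 <- grepl("(79)", colnames(toy), fixed = TRUE)
X79 <- toy[, is_79, drop = FALSE]    # contingency table for 1979
X06 <- toy[, !is_79, drop = FALSE]   # contingency table for 2006

# Total number of deaths per year
totals <- c("1979" = sum(X79), "2006" = sum(X06))

# Row margins (deaths per cause) for each year, then as shares within a year
row_margins <- cbind("1979" = rowSums(X79), "2006" = rowSums(X06))
row_shares  <- prop.table(row_margins, margin = 2)

# Column margins (deaths per age group) for each year
col_margins <- rbind("1979" = colSums(X79), "2006" = colSums(X06))
```

Comparing row_shares across the two columns shows how the relative weight of each cause changed between the two years.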
Start from a 2-way contingency table \(X\) with \(\sum_{i,j} X_{i,j}=N\)
Normalize \(P = \frac{1}{N}X\) (correspondence matrix)
Let \(r\) (resp. \(c\)) be the vector of row (resp. column) sums of \(P\)
Let \(D_r=\text{diag}(r)\) denote the diagonal matrix with row sums of \(P\) as coefficients
Let \(D_c=\text{diag}(c)\) denote the diagonal matrix with column sums of \(P\) as coefficients
The row profiles matrix is \(D_r^{-1} \times P\)
The standardized residuals matrix is \(S = D_r^{-1/2} \times \left(P - r c^\top\right) \times D_c^{-1/2}\)
CA consists in computing the SVD of the standardized residuals matrix \(S = U \times D \times V^\top\)
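From the SVD, we get the principal inertias (the squared singular values, that is the diagonal of \(D^2\)) and the row and column coordinates.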
When calling svd(.), the argument should be \[D_r^{1/2}\times \left(D_r^{-1} \times P \times D_c^{-1}- \mathbf{1}\times \mathbf{1}^\top \right)\times D_c^{1/2}= D_r^{-1/2}\times \left( P - r \times c^\top \right)\times D_c^{-1/2}\] where \(\mathbf{1}\) denotes an all-ones column vector of the appropriate length.
As \[D_r^{-1} \times P \times D_c^{-1} - \mathbf{1}\mathbf{1}^\top = (D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top,\] the factorization \((D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top\) is the extended SVD of \(D_r^{-1} \times P \times D_c^{-1} - \mathbf{1}\mathbf{1}^\top\) with respect to \(D_r\) and \(D_c\).
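The identities above can be checked numerically. The following is a minimal sketch on a made-up 5 x 4 table (the counts are invented for illustration); it builds the argument of svd() in both forms and verifies the extended SVD and its orthonormality with respect to \(D_r\) and \(D_c\).

```r
# Made-up contingency table, for illustration only
X <- matrix(c(30, 20, 10,  5,
              28, 22, 11,  6,
              25, 25, 12,  8,
               5, 10, 30, 40,
               2,  3, 10, 60), nrow = 5, byrow = TRUE)

P  <- X / sum(X)          # correspondence matrix
r  <- rowSums(P)          # row masses
cc <- colSums(P)          # column masses
ones <- matrix(1, nrow(P), ncol(P))   # the matrix 1 1^T

ratio <- diag(1 / r) %*% P %*% diag(1 / cc)   # D_r^{-1} P D_c^{-1}

# Two expressions of the standardized residuals matrix S
S_left  <- diag(sqrt(r)) %*% (ratio - ones) %*% diag(sqrt(cc))
S_right <- diag(1 / sqrt(r)) %*% (P - outer(r, cc)) %*% diag(1 / sqrt(cc))

dec <- svd(S_right)
A <- diag(1 / sqrt(r))  %*% dec$u   # D_r^{-1/2} U
B <- diag(1 / sqrt(cc)) %*% dec$v   # D_c^{-1/2} V

# Extended SVD of D_r^{-1} P D_c^{-1} - 1 1^T
extended <- A %*% diag(dec$d) %*% t(B)

all.equal(S_left, S_right)
all.equal(extended, ratio - ones)
all.equal(t(A) %*% diag(r) %*% A, diag(ncol(A)))    # A is D_r-orthonormal
all.equal(t(B) %*% diag(cc) %*% B, diag(ncol(B)))   # B is D_c-orthonormal
```

The smallest singular value is numerically zero, since \(S\) has rank at most \(\min(I, J) - 1\).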
Perform CA on the two contingency tables.
You may use FactoMineR::CA(). It is also instructive to compute the correspondence analysis in your own way, by preparing the matrix that is handed to svd() and returning a named list containing all relevant information.
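One possible shape for such a home-made function is sketched below. The name ca_by_hand and the names of the list components are hypothetical choices, not part of any package, and the toy table is made up.

```r
# Hypothetical helper: CA through svd() of the standardized residuals
ca_by_hand <- function(X) {
  X  <- as.matrix(X)
  P  <- X / sum(X)
  r  <- rowSums(P)
  cc <- colSums(P)
  E  <- outer(r, cc)                 # r c^T
  S  <- (P - E) / sqrt(E)            # D_r^{-1/2} (P - r c^T) D_c^{-1/2}
  dec <- svd(S)
  k <- seq_len(min(dim(X)) - 1L)     # non-trivial axes
  d <- dec$d[k]
  # principal coordinates: D_r^{-1/2} U D and D_c^{-1/2} V D
  row_coord <- sweep(dec$u[, k, drop = FALSE] / sqrt(r),  2, d, "*")
  col_coord <- sweep(dec$v[, k, drop = FALSE] / sqrt(cc), 2, d, "*")
  rownames(row_coord) <- rownames(X)
  rownames(col_coord) <- colnames(X)
  list(eig = d^2, row_mass = r, col_mass = cc,
       row_coord = row_coord, col_coord = col_coord)
}

# Made-up contingency table, for illustration only
X <- matrix(c(30, 20, 10,  5,
              28, 22, 11,  6,
              25, 25, 12,  8,
               5, 10, 30, 40,
               2,  3, 10, 60), nrow = 5, byrow = TRUE,
            dimnames = list(paste("Cause", LETTERS[1:5]),
                            c("15-34", "35-54", "55-74", "75+")))
res <- ca_by_hand(X)
res$eig
```

The eigenvalues can be compared with the first column of the eig component returned by FactoMineR::CA() on the same table.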
Do the Jedi and Sith build their own light sabers? Jedi do. It’s a key part of the religion to have a kyber crystal close to you, to build the saber through the power of the force creating a blade unique and in tune with them
If you did use FactoMineR::CA(), explain the organization of the result.
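To explore the organization of the result, one can start from the names of the returned list. The sketch below runs FactoMineR::CA() on a made-up toy table (invented counts and hypothetical age-group labels) and only runs if FactoMineR is installed.

```r
if (requireNamespace("FactoMineR", quietly = TRUE)) {
  # Made-up contingency table, for illustration only
  X <- as.data.frame(matrix(c(30, 20, 10,  5,
                              28, 22, 11,  6,
                              25, 25, 12,  8,
                               5, 10, 30, 40,
                               2,  3, 10, 60), nrow = 5, byrow = TRUE,
                            dimnames = list(paste("Cause", LETTERS[1:5]),
                                            c("15-34", "35-54", "55-74", "75+"))))
  res <- FactoMineR::CA(X, graph = FALSE)

  names(res)        # top-level components
  res$eig           # eigenvalues and shares of inertia
  names(res$row)    # row results (coordinates, contributions, cos2, ...)
  names(res$col)    # same organization for columns
}
```

Applying str(res, max.level = 2) gives a more complete picture of the nested lists.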
Draw screeplots. Why are they useful? Comment briefly.
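A screeplot only needs the eigenvalues. As a sketch in base R on a made-up toy table (with a FactoMineR::CA result, the eigenvalues are available in the eig component instead):

```r
# Made-up contingency table, for illustration only
X <- matrix(c(30, 20, 10,  5,
              28, 22, 11,  6,
              25, 25, 12,  8,
               5, 10, 30, 40,
               2,  3, 10, 60), nrow = 5, byrow = TRUE)

P <- X / sum(X)
E <- outer(rowSums(P), colSums(P))
k <- seq_len(min(dim(X)) - 1)
eig <- svd((P - E) / sqrt(E))$d[k]^2     # principal inertias
pct <- 100 * eig / sum(eig)              # share of inertia per axis

barplot(pct, names.arg = seq_along(pct),
        xlab = "Axis", ylab = "% of inertia",
        main = "Scree plot (toy table)")
```

A sharp drop after the first bars suggests that a few axes summarize most of the departure from independence.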
Perform row profiles analysis.
What are the classical plots? How can you build them from the output of FactoMineR::CA?
Build the table of row contributions, together with the table of the quality of representation of the rows (the so-called \(\cos^2\)).
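In the usual CA vocabulary, the contribution of a row to an axis and its \(\cos^2\) on that axis are two different quantities. A minimal sketch computing both from the SVD, on a made-up toy table (in a FactoMineR::CA result, the corresponding tables are stored in res$row$contrib and res$row$cos2):

```r
# Made-up contingency table, for illustration only
X <- matrix(c(30, 20, 10,  5,
              28, 22, 11,  6,
              25, 25, 12,  8,
               5, 10, 30, 40,
               2,  3, 10, 60), nrow = 5, byrow = TRUE,
            dimnames = list(paste("Cause", LETTERS[1:5]),
                            c("15-34", "35-54", "55-74", "75+")))

P  <- X / sum(X)
r  <- rowSums(P)
cc <- colSums(P)
dec <- svd((P - outer(r, cc)) / sqrt(outer(r, cc)))
k   <- seq_len(min(dim(X)) - 1)
eig <- dec$d[k]^2

# Row principal coordinates D_r^{-1/2} U D
F_row <- sweep(dec$u[, k] / sqrt(r), 2, dec$d[k], "*")
rownames(F_row) <- rownames(X)

# Contribution of row i to axis k: r_i F_ik^2 / lambda_k (each column sums to 1)
contrib <- sweep(r * F_row^2, 2, eig, "/")

# cos2 of row i on axis k: F_ik^2 / squared distance to the centroid (each row sums to 1)
cos2 <- F_row^2 / rowSums(F_row^2)

round(contrib, 3)
round(cos2, 3)
```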
Plot the result of row profile analysis using plot.CA from FactoMineR.
Perform column profiles analysis
Build the symmetric plots (biplots) for correspondence analysis of the mortality data.
Mosaic plots provide an alternative way of exploring contingency tables. They are particularly handy when handling 2-way contingency tables.
Draw mosaic plots for the two contingency tables living inside the mortality dataset.
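Base R provides mosaicplot() for this purpose. The sketch below uses a made-up toy table with hypothetical age-group labels. With shade = TRUE, tiles are coloured according to residuals from the independence model, which connects the plot to the standardized residuals used in CA.

```r
# Made-up contingency table, for illustration only
X <- matrix(c(30, 20, 10,  5,
              28, 22, 11,  6,
              25, 25, 12,  8,
               5, 10, 30, 40,
               2,  3, 10, 60), nrow = 5, byrow = TRUE,
            dimnames = list(Cause = paste("Cause", LETTERS[1:5]),
                            Age   = c("15-34", "35-54", "55-74", "75+")))

# Tile areas are proportional to counts; shading flags departures from independence
mosaicplot(as.table(X), shade = TRUE, las = 2,
           main = "Mosaic plot (toy table)")

# Pearson residuals (observed - expected) / sqrt(expected), for reference
pearson_res <- chisq.test(X)$residuals
round(pearson_res, 2)
```

With 62 causes, the labels of mortality become crowded, so it may help to restrict the plot to the most frequent causes.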
Are you able to deliver an interpretation of this Correspondence Analysis?
Build the standardized matrix for row profiles analysis. Compute the pairwise distance matrix using the \(\chi^2\) distances. Should you work with centered row profiles?
Perform hierarchical clustering of row profiles with method/linkage "single". Check the definition of the method. Did you know the underlying algorithm? If yes, in which context did you get acquainted with this algorithm?
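The two steps above (\(\chi^2\) distances between row profiles, then single-linkage clustering) can be sketched in base R as follows, on a made-up toy table. Dividing each column of the row profiles by the square root of the column mass turns the \(\chi^2\) distance into an ordinary Euclidean distance, so dist() can be used.

```r
# Made-up contingency table, for illustration only
X <- matrix(c(30, 20, 10,  5,
              28, 22, 11,  6,
              25, 25, 12,  8,
               5, 10, 30, 40,
               2,  3, 10, 60), nrow = 5, byrow = TRUE,
            dimnames = list(paste("Cause", LETTERS[1:5]),
                            c("15-34", "35-54", "55-74", "75+")))

cc <- colSums(X) / sum(X)          # column masses
profiles <- X / rowSums(X)         # row profiles (each row sums to 1)

# Standardized row profiles: Euclidean distance here = chi-squared distance
scaled <- sweep(profiles, 2, sqrt(cc), "/")
d_chi2 <- dist(scaled)

# Centering subtracts the same vector from every row: distances are unchanged
centered <- sweep(scaled, 2, sqrt(cc), "-")
max(abs(d_chi2 - dist(centered)))

# Single-linkage hierarchical clustering, cut into two classes
hc <- hclust(d_chi2, method = "single")
cl <- cutree(hc, k = 2)
table(cl)
```

Calling plot(hc) displays the dendrogram, which helps when choosing the number of classes.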
Choose the number of classes (provide justification).
Can you explain the size of the different classes in the partition?
Row profiles that do not belong to the majority class are called atypical.
Compute the share of inertia of atypical row profiles.
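The inertia carried by a row is its mass times its squared \(\chi^2\) distance to the average profile, and the row inertias add up to the total inertia. A minimal sketch on a made-up toy table, where the vector atypical is a hypothetical set of flags standing in for the outcome of the clustering step:

```r
# Made-up contingency table, for illustration only
X <- matrix(c(30, 20, 10,  5,
              28, 22, 11,  6,
              25, 25, 12,  8,
               5, 10, 30, 40,
               2,  3, 10, 60), nrow = 5, byrow = TRUE,
            dimnames = list(paste("Cause", LETTERS[1:5]),
                            c("15-34", "35-54", "55-74", "75+")))

P <- X / sum(X)
E <- outer(rowSums(P), colSums(P))      # r c^T

# Inertia of row i: sum_j (p_ij - r_i c_j)^2 / (r_i c_j)
row_inertia <- rowSums((P - E)^2 / E)

# Hypothetical flags: TRUE for rows outside the majority class
atypical <- c(FALSE, FALSE, FALSE, TRUE, TRUE)

share_atypical <- sum(row_inertia[atypical]) / sum(row_inertia)
share_atypical
```

The total sum(row_inertia) equals the Pearson \(\chi^2\) statistic divided by the grand total \(N\).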
Draw a symmetric plot (biplot) outlining the atypical row profiles.
Compute the table of theoretical (expected) counts for deces. Is it possible to carry out a chi-squared test?
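Under independence, the expected count of a cell is the product of its row and column totals divided by the grand total. The sketch below works on a made-up toy table standing in for deces. The threshold of 5 is the usual rule of thumb for the validity of the \(\chi^2\) approximation, not a hard rule.

```r
# Made-up contingency table standing in for deces
X <- matrix(c(30, 20, 10,  5,
              28, 22, 11,  6,
              25, 25, 12,  8,
               5, 10, 30, 40,
               2,  3, 10, 60), nrow = 5, byrow = TRUE,
            dimnames = list(paste("Cause", LETTERS[1:5]),
                            c("15-34", "35-54", "55-74", "75+")))

# Expected counts under independence
expected <- outer(rowSums(X), colSums(X)) / sum(X)

# Same table as the one computed internally by chisq.test()
test <- chisq.test(X)
all.equal(expected, test$expected, check.attributes = FALSE)

# Rule of thumb: how many cells have an expected count below 5?
n_small <- sum(expected < 5)
n_small
```

On a table with many rare causes, many expected counts fall below the threshold, and chisq.test() issues a warning about the approximation.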
Perform a hierarchical clustering of the row profiles into two classes.
Merge the rows of deces corresponding to the same class (you can use the tapply function), and perform a chi-squared test. What is the conclusion?
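Merging rows by class amounts to summing counts within each class, column by column. A minimal sketch with tapply on a made-up toy table standing in for deces, where cl is a hypothetical vector of class labels such as the one returned by cutree():

```r
# Made-up contingency table standing in for deces
X <- matrix(c(30, 20, 10,  5,
              28, 22, 11,  6,
              25, 25, 12,  8,
               5, 10, 30, 40,
               2,  3, 10, 60), nrow = 5, byrow = TRUE,
            dimnames = list(paste("Cause", LETTERS[1:5]),
                            c("15-34", "35-54", "55-74", "75+")))

# Hypothetical class labels, one per row
cl <- c(1, 1, 1, 2, 2)

# Sum the counts of each column within each class
merged <- apply(X, 2, function(counts) tapply(counts, cl, sum))
merged

# Chi-squared test of independence on the merged 2 x 4 table
merged_test <- chisq.test(merged)
merged_test$p.value
```

The base function rowsum(X, cl) gives the same merged table in a single call.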
Why is it more advantageous to group the rows into these two classes rather than into two arbitrary classes, in order to establish the dependence between the two variables?
Represent individuals from the majority class. Do they all seem to you to correspond to an average profile?
Try to explain this phenomenon considering the way in which hierarchical classification uses the Single Linkage method.
The mortality dataset should be taken with a grain of salt. Assigning a single cause to every death is not a trivial task. It is even questionable: if somebody dies from some infection because she could not be cured using an available drug due to another preexisting pathology, who is the culprit?