LAB: Correspondance Analysis

Published

March 21, 2025

M1 MIDS/MFA/LOGOS

Université Paris Cité

Année 2024

Course Homepage

Moodle

Besides the usual packages (tidyverse, …), we shall require FactoMineR and related packages.

Code

stopifnot(
  require(FactoMineR),
  require(factoextra),
  require(FactoInvestigate)
)

Correspondence Analysis

The `mortality` dataset

The goal is to investigate a possible link between age group and Cause of death. We work with dataset mortality from package FactoMineR

Code

data("mortality", package = "FactoMineR")

Code

#help(mortality)

A data frame with 62 rows (the different Causes of death) and 18 columns. Each column corresponds to an age interval (15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, 85-94, 95 and more) in a year. The 9 first columns correspond to data in 1979 and the 9 last columns to data in 2006. In each cell, the counts of deaths for a Cause of death in an age interval (in a year) is given.

Source: Centre d’épidemiologie sur les Causes de décès médicales

Cause	15-24 (06)	25-34 (06)	35-44 (06)	45-54 (06)	55-64 (06)	65-74 (06)	75-84 (06)	85-94 (06)	95 and more (06)
Kidney and urethra disease	3	10	35	106	266	655	2432	2454	545
Addiction to prescription medication	18	77	72	15	4	2	1	0	0
Blood and hematopoietic disorders	20	22	43	94	159	259	642	670	213
Malignant tumour of the larynx, trachea, bronchus and lungs	0	38	681	4059	7285	8026	7678	1869	128
Other ill-defined symptoms and conditions	42	126	346	752	1077	1746	4943	7920	4253
Unknown or unspecified causes	227	476	928	1517	1585	1487	2446	2298	872
Homicides	52	83	65	63	42	27	21	6	0
Viral hepatitis	0	9	76	117	94	143	224	61	6
Malignant ovarian tumour	4	12	63	302	603	863	1054	402	38
Rhumatoid arthritis and osteoarthritis	0	1	0	4	14	99	231	176	54

Elementary statistics and table wrangling

Before proceeding to Correspondence Analysis (CA), let us tidy up the table and draw some elementary plots.

Question

Start by partially pivoting mortality, so as to obtain a tibble with columns Cause, year, while keeping all columns named after age groups (tidy up the data so as to obtain a tibble in partially long format).
Use rowwise() and sum(c_cross()) so as to compute the total number of deaths per year and Cause in column total. This allows to mimic rowSums() inside a pipeline. Column grand_total is computed using a window function over grouping by Cause.

Solution

Code

mortality_long <- mortality  |> 
  pivot_longer(
    cols=-Cause,
    cols_vary="slowest",
    names_to=c(".value", "year"),
    names_pattern="([\\w\\- ]*) \\(([0-9]{2})\\)"
  )  |> 
  mutate(year=ifelse(year=='06', 2006, 1979)) |> 
  rowwise() |> 
  mutate(total_year=sum(c_across(-c(Cause, year)))) |> 
  group_by(Cause) |> 
  mutate(grand_total = sum(total_year)) |> 
  ungroup()

Code

mortality_long |>
 slice_sample(n=10) |>
 gt::gt() |>
 my_gt() |>
 gt::tab_caption("A sample of rows from Mortality table in long form")

A sample of rows from Mortality table in long form
Cause	year	15-24	25-34	35-44	45-54	55-64	65-74	75-84	85-94	95 and more	total_year	grand_total
Events of undetermined intention	1979	297	382	296	329	239	279	221	66	1	2110	2558
Gastroduodenal ulcer	2006	3	1	13	38	48	99	264	244	55	765	2981
Cerebrovascular disease	2006	35	75	311	902	1575	3719	12172	11385	2464	32638	98795
Malignant tumour of the of the colon	1979	10	39	114	556	1130	2898	3541	1257	48	9593	21753
Meningococal disease	2006	8	0	1	1	0	4	2	0	0	16	60
Diabetes	1979	17	43	61	232	585	2064	3044	1018	47	7111	18083
Other tumours	2006	39	45	141	300	564	1010	2192	1713	318	6322	10399
Malignant tumour of the of the colon	2006	4	17	135	538	1421	2572	4641	2486	346	12160	21753
Other ill-defined symptoms and conditions	1979	41	75	116	276	511	1635	5641	7330	1500	17125	38330
Homicides	1979	92	116	115	65	50	33	35	10	0	516	875

Solution

A truly tidy version of the dataset can be obtained from further pivoting.

Code

mortality_tidy <- mortality_long |> 
  pivot_longer(
    cols=-c(year,Cause,total_year, grand_total),
    cols_vary="slowest",
    names_to=c("age_range"),
    values_to=c("#deaths")
) |>
  mutate(age_range = factor(age_range, levels=sort(unique(age_range)),ordered=T))

Code

mortality_tidy |>
  sample_n(5) |>
  gt::gt()

Cause	year	total_year	grand_total	age_range	#deaths
Other ill-defined symptoms and conditions	2006	21205	38330	25-34	126
Malignant ovarian tumour	2006	3341	5661	85-94	402
Road accidents	2006	4648	15255	95 and more	8
Meningococal disease	1979	44	60	85-94	0
Chronic liver disease	2006	7669	23596	35-44	453

Question

Build a bar plot to display the importance of Causes of deaths in France in years 1979 and 2006

Solution

Code

th <- theme_get()
(
mortality_long |> 
  mutate(Cause=fct_reorder(Cause, desc(grand_total))) |>
  mutate(year=as_factor(year)) |>
  ggplot() +
  scale_fill_discrete() +
  aes(x=Cause, 
      y=total_year, 
      fill=year) +
  geom_col(position=position_dodge()) +
  theme(
    legend.position="none",
    axis.text.x=element_blank(), #remove x axis labels
    axis.ticks.x=element_blank(), #remove x axis ticks
  ) +
  labs(
    title = "Causes of death, France, 1979, 2006",
    subtitle= "Raw counts"
  ) +
  xlab(label=NULL)
) |>
  plotly::ggplotly()

Code

oth <- theme_set(th)

Question

Compute and display the total number of deaths in France in years 1979 and 2006.

Solution

Code

mortality_long |> 
  group_by(year) |> 
  summarise(total_deaths = sum(total_year)) |>
  gt::gt() |>
  gt::cols_label(
    year= "Year", 
    total_deaths = "#Deaths") |>
  gt::tab_caption("Mortality in France")

Mortality in France
Year	#Deaths
1979	529974
2006	510921

Question

Compute the marginal counts for each year (1979, 2006). Compare.

Solution

Counts have already been computed above.

Code

mortality_long |> 
  select(Cause, year, total_year, grand_total) |> 
  pivot_wider(
    id_cols=c(Cause, grand_total), 
    names_from = year, 
    values_from = total_year) |> 
  rename(Total=grand_total) |> 
  arrange(desc(Total)) |>
  gt::gt() |>
  my_gt()

Cause	Total	1979	2006
Cerebrovascular disease	98795	66157	32638
Other heart disease	97297	54105	43192
Ischemic cardiomyopathy	88338	49532	38806
Other illnesses relating to circulation	61937	31218	30719
Malignant tumour of the larynx, trachea, bronchus and lungs	50604	20840	29764
Other malignent tumours	48809	23262	25547
Other diseases of the nervous system and sensory organs	38891	12056	26835
Other ill-defined symptoms and conditions	38330	17125	21205
Other digestive conditions	32697	18092	14605
Other respiratory ailments	26339	14197	12142
Unknown or unspecified causes	26192	14356	11836
Chronic liver disease	23596	15927	7669
Other accidents	23353	10921	12432
Malignant tumour of the of the colon	21753	9593	12160
Suicides	20337	9952	10385
Malignant tumour of the breast	20236	8605	11631
Malignant neplasm of the lymphatic and hematopoietic tissues	20000	7589	12411
Diabetes	18083	7111	10972
Falls	17711	12503	5208
Other chronic respiritory illnesses	17331	9680	7651
Other psychological and behavioural disorders	17160	3749	13411
Malignant tumour of the prostate	15514	6577	8937
Road accidents	15255	10607	4648
Pneumonia	14700	5057	9643
Malignant tumour of the stomach	13783	9020	4763
Other endocrinological, metabolic and nutritional conditions	13665	6030	7635
Kidney and urethra disease	13613	7107	6506
Malignant tumour of the of the pancreas	12851	4588	8263
Other infectious diseases and parasites	11466	4045	7421
Malignant tumour of the liver and intrahepatic biliary tract	11452	4001	7451
Other tumours	10399	4077	6322
Malignant tumour of the lip, pharynx and mouth	9722	5606	4116
Malignant tumour of the oesophogus	9267	5430	3837
Malignant tumour of the rectum and anus	9246	4980	4266
Malignant tumour of the bladder	8322	3633	4689
Alcohol abuse and alcohol-related psychosis	6327	3371	2956
Malignant ovarian tumour	5661	2320	3341
Malignant tumour in other parts of the uterus	5263	2940	2323
Malignant tumour of the kidney	5205	2101	3104
Other genito-urinary diseases	4387	2552	1835
Blood and hematopoietic disorders	4299	2177	2122
Other diseases of the osteo-articular system + muscles and connecting tissue	3935	1025	2910
Infections of the skin and sub-cutaneous cellular tissue	3439	1649	1790
Gastroduodenal ulcer	2981	2216	765
Tuberculosis	2797	2070	727
Other external injury and poisoning	2731	1023	1708
Events of undetermined intention	2558	2110	448
Asthma	2488	1495	993
Malignant melanoma	2205	658	1547
Malignant tumour of the cervix	1527	824	703
Accidental poisoning	1510	503	1007
Rhumatoid arthritis and osteoarthritis	1284	705	579
Influenza	1166	1051	115
Viral hepatitis	1059	329	730
Homicides	875	516	359
Other congenital defects and chromosomal abnormalities	592	145	447
Congenital defects of the circulatory system	540	275	265
Meningitis	481	362	119
Addiction to prescription medication	222	33	189
Complications in pregnancy and childbirth	150	91	59
Congenital defects of the nervous system	109	61	48
Meningococal disease	60	44	16

Correspondance Analysis

CA executive summary

Start from a 2-way contingency table $X$ with $\sum_{i,j} X_{i,j}=N$
Normalize $P = \frac{1}{N}X$ (correspondance matrix)
Let $r$ (resp. $c$) be the row (resp. column) wise sums vector
Let $D_r=\text{diag}(r)$ denote the diagonal matrix with row sums of $P$ as coefficients
Let $D_c=\text{diag}(c)$ denote the diagonal matrix with column sums of $P$ as coefficients
The row profiles matrix is $D_r^{-1} \times P$
The standardized residuals matrix is $S = D_r^{-1/2} \times \left(P - r c^\top\right) \times D_c^{-1/2}$

CA consists in computing the SVD of the standardized residuals matrix $S = U \times D \times V^\top$

From the SVD, we get

$D_r^{-1/2} \times U$ standardized coordinates of rows
$D_c^{-1/2} \times V$ standardized coordinates of columns
$D_r^{-1/2} \times U \times D$ principal coordinates of rows
$D_c^{-1/2} \times V \times D$ principal coordinates of columns
Squared singular values: the principal inertia

When calling svd(.), the argument should be \[D_r^{1/2}\times \left(D_r^{-1} \times P \times D_c^{-1}- \mathbf{I}\times \mathbf{I}^\top \right)\times D_c^{1/2}= D_r^{-1/2}\times \left( P - r \times c^\top \right)\times D_c^{-1/2}\]

CA and extended SVD

As \[D_r^{-1} \times P \times D_c^{-1} - \mathbf{I}\mathbf{I}^\top = (D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top\]

$(D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top$ is the extended SVD of \[D_r^{-1} \times P \times D_c^{-1} - \mathbf{I}\mathbf{I}^\top\] with respect to $D_r$ and $D_c$

Question

Perform CA on the two contingency tables.

You may use FactoMineR::CA(). It is interesting to compute the correspondence analysis in your own way, by preparing the matrix that is handled to svd() and returning a named list containing all relevant information.

Do the Jedi and Sith build their own light sabers? Jedi do. It’s a key part of the religion to have a kyber crystal close to you, to build the saber through the power of the force creating a blade unique and in tune with them

Solution

Code

lst_ca <- list()

for (y in c('79', '06')) {
  lst_ca[[y]] <- mortality |> 
    select(ends_with(glue('({y})'))) |> 
    FactoMineR::CA(ncp=8, graph = F)
}

Code

lst <- map(c('79', '06'), 
             \(x) select(mortality, ends_with(glue('({x})'))) |>
             FactoMineR::CA(ncp=8, graph = F)
           )

Question

If you did use FactoMineR::CA(), explain the organization of the result.

Solution

The result of FactoMineR::CA(...) is a named and nested list with five elements:

eig: a matrix/array containing enough information to build a screeplot.
call: a list of 9, containing the call to CA(), an object of type language, telling (in principle) the user how CA() was called. However, this is a quoted expression. Here we need to guess the value of y in the calling environment understand what’s going on.

Code

lst_ca[[1]]$call$call

FactoMineR::CA(X = select(mortality, ends_with(glue("({y})"))), 
    ncp = 8, graph = F)

Element call also contains the table margin distributions marge.col and marge.row. The truncation rank ncp (number of components) can be assigned before computing the SVD (default value is 5). Element $X$ stores the contingency table that was effectively used for computing Correpondence Analysis.

row: Information gathered from SVD to facilitate row profiles analysis.
col: a list structured in the same way as element row. Used for column profiles analysis
svd: a list of 3, just as the resuld of svd() containing the singular values, the left and right singular vectors of matrix $...$

In principle, all relevant information can be gathered from components svd, call.marge.row, and call.marge.col.

Screeplots

Question

Draw screeplots. Why are they useful? Comment briefly.

Solution

Code

ca_79 <- lst_ca[[1]]

ca_79$eig |> 
  as_tibble() |> 
  mutate(across(where(is.numeric), ~ round(.x, digits=2))) |> 
  gt::gt()

eigenvalue	percentage of variance	cumulative percentage of variance
0.29	61.00	61.00
0.14	28.98	89.98
0.03	6.13	96.12
0.01	2.86	98.98
0.00	0.73	99.70
0.00	0.17	99.88
0.00	0.09	99.97
0.00	0.03	100.00

Solution

Screeplot

Code

ca_79$eig |> 
  as_tibble() |> 
  rownames_to_column(var="PC") |> 
  rename(percent=eigenvalue, cumulative=`cumulative percentage of variance`) |> 
  ggplot() + 
  aes(x=PC, y=percent, label=round(cumulative,2)) +
  geom_text(angle=45, vjust=-1, hjust=-0.1) + 
  geom_col(fill=NA, colour="black") +
  ylab("Squared singular values") +
  ylim(c(0, .4)) +
  labs(
    title="Screeplot for CA",
    subtitle = "Mortality 1979: Age Group versus Causes of Death"
  )

Row profiles analysis

Question

Perform row profiles analysis.

What are the classical plots? How can you build them from the output of FactoMiner::CA?

Build the table of row contributions (the so-called $\cos^2$)

Solution

Attribute row of objects of class CA (exported from FactoMineR) is the starting point of any row profiles analysis.

Code

ca_79_row <- ca_79$row

Attribute row is a named list made of $4$ components.

Solution

coord: a matrix with named rows and columns. The number of rows of coord matches the number of rows of the contingency table (here, the number of possible death Causes). The number of columns matches the rank of the truncated SVD that underlies Correspondance Analysis. Here it is $5$ which also the rank of the standardized contingency table.

The row principal coordinates are the principal coordinates of each row profile in terms of the principal component.

The columns of coord are pairwise orthogonal in the inner product space defined by diag(call$marge.row) (which embodies the marginal probabilities of the so-called Causes of deaths)

Code

x <- ca_79$row$coord
r <- ca_79$call$marge.row

A <- round(t(x) %*% diag(r) %*% x, 2)

is_diagonal <- function (A, tol=1e-2){
  norm(diag(diag(A))-A, type='F') <= tol
}

# We expect A to be diagonal
is_diagonal(A)

[1] TRUE

We can recover row$coord from the left singular vectors and the singular values:

Code

with(ca_79,   
  norm(row$coord - with(svd, U %*% diag(vs[1:ca_79$call$ncp])), 'F')
)

[1] 0

Code

prep_rows <- ca_79_row$coord |> 
  as_tibble() |> 
  mutate(name= rownames(ca_79_row$coord)) |> 
  relocate(name) |> 
  mutate(prop=r, inertia=ca_79_row$inertia) 

prep_rows |> 
  mutate(across(where(is.numeric), \(x) round(x,2))) |> 
  gt::gt()

name	Dim 1	Dim 2	Dim 3	Dim 4	Dim 5	Dim 6	Dim 7	Dim 8	prop	inertia
Accidental poisoning	1.53	0.52	0.04	-0.17	0.01	-0.03	-0.01	-0.05	0.00	0.00
Addiction to prescription medication	1.75	0.44	-0.12	0.27	-0.12	-0.24	0.19	0.22	0.00	0.00
Alcohol abuse and alcohol-related psychosis	0.65	-0.74	0.44	-0.08	-0.17	0.09	0.02	-0.02	0.01	0.01
Asthma	-0.01	-0.10	-0.12	-0.05	0.02	0.00	0.05	0.03	0.00	0.00
Blood and hematopoietic disorders	-0.08	0.07	-0.07	-0.05	-0.04	-0.02	-0.01	0.03	0.00	0.00
Cerebrovascular disease	-0.32	0.17	-0.06	-0.04	-0.06	-0.01	0.00	0.00	0.12	0.02
Chronic liver disease	0.49	-0.83	0.30	0.09	-0.08	0.04	-0.01	-0.01	0.03	0.03
Complications in pregnancy and childbirth	3.53	1.80	0.28	-2.09	0.99	-0.77	0.15	0.06	0.00	0.00
Congenital defects of the circulatory system	1.94	0.54	0.01	-0.10	0.11	-0.01	-0.08	0.11	0.00	0.00
Congenital defects of the nervous system	2.36	0.98	-0.23	0.47	-0.07	-0.38	0.23	-0.07	0.00	0.00
Diabetes	-0.21	-0.05	-0.24	-0.07	0.02	0.02	0.03	0.01	0.01	0.00
Events of undetermined intention	1.71	0.35	0.24	-0.46	0.05	-0.03	0.01	-0.01	0.00	0.01
Falls	-0.34	0.49	0.27	0.07	-0.02	-0.05	-0.05	-0.02	0.02	0.01
Gastroduodenal ulcer	-0.11	-0.14	-0.08	-0.03	-0.03	0.01	0.01	-0.01	0.00	0.00
Homicides	2.19	0.62	0.35	-0.72	0.02	0.22	-0.18	0.08	0.00	0.01
Infections of the skin and sub-cutaneous cellular tissue	-0.41	0.40	0.11	0.02	-0.06	-0.03	-0.01	-0.02	0.00	0.00
Influenza	-0.41	0.51	0.28	0.10	0.06	0.02	-0.02	-0.02	0.00	0.00
Ischemic cardiomyopathy	-0.14	-0.18	-0.17	-0.02	0.02	0.02	0.01	-0.01	0.09	0.01
Kidney and urethra disease	-0.33	0.23	0.00	-0.02	-0.04	-0.01	-0.01	0.02	0.01	0.00
Malignant melanoma	0.63	-0.34	0.25	-0.32	0.06	0.07	-0.06	0.02	0.00	0.00
Malignant neplasm of the lymphatic and hematopoietic tissues	0.33	-0.18	-0.10	-0.11	0.07	0.02	0.02	-0.03	0.01	0.00
Malignant ovarian tumour	0.28	-0.59	0.07	0.09	-0.01	-0.03	0.01	-0.01	0.00	0.00
Malignant tumour in other parts of the uterus	0.08	-0.41	0.01	0.06	0.01	0.00	-0.01	0.05	0.01	0.00
Malignant tumour of the bladder	-0.13	-0.25	-0.23	0.01	0.08	0.01	0.00	0.02	0.01	0.00
Malignant tumour of the breast	0.21	-0.49	0.17	0.03	-0.06	0.03	-0.02	0.02	0.02	0.01
Malignant tumour of the cervix	0.42	-0.60	0.29	-0.07	-0.16	0.16	-0.07	0.05	0.00	0.00
Malignant tumour of the kidney	0.07	-0.45	-0.10	0.07	0.11	-0.02	-0.03	0.01	0.00	0.00
Malignant tumour of the larynx, trachea, bronchus and lungs	0.21	-0.69	-0.01	0.14	0.08	-0.04	-0.01	-0.01	0.04	0.02
Malignant tumour of the lip, pharynx and mouth	0.41	-0.82	0.31	0.21	-0.16	-0.07	0.09	0.00	0.01	0.01
Malignant tumour of the liver and intrahepatic biliary tract	0.06	-0.44	-0.13	0.05	0.13	-0.05	-0.02	0.01	0.01	0.00
Malignant tumour of the oesophogus	0.21	-0.69	0.07	0.17	0.01	-0.06	0.02	0.03	0.01	0.01
Malignant tumour of the of the colon	-0.13	-0.18	-0.17	-0.01	0.03	0.00	0.00	0.00	0.02	0.00
Malignant tumour of the of the pancreas	-0.01	-0.37	-0.15	0.04	0.10	0.02	-0.04	0.01	0.01	0.00
Malignant tumour of the prostate	-0.29	-0.01	-0.35	-0.09	0.03	0.05	0.03	-0.01	0.01	0.00
Malignant tumour of the rectum and anus	-0.08	-0.27	-0.16	0.01	0.09	0.00	-0.02	-0.02	0.01	0.00
Malignant tumour of the stomach	-0.12	-0.21	-0.19	-0.02	0.05	0.02	-0.01	0.01	0.02	0.00
Meningitis	0.63	-0.22	0.02	-0.05	0.03	0.08	-0.12	0.04	0.00	0.00
Meningococal disease	1.93	0.72	-0.55	0.77	-0.09	0.08	-0.02	0.05	0.00	0.00
Other accidents	1.18	0.30	0.01	-0.02	-0.02	0.00	0.01	-0.01	0.02	0.03
Other chronic respiritory illnesses	-0.27	0.06	-0.11	-0.01	0.01	0.01	0.01	0.01	0.02	0.00
Other congenital defects and chromosomal abnormalities	1.09	0.19	0.04	-0.29	0.13	-0.27	0.12	0.01	0.00	0.00
Other digestive conditions	-0.13	0.02	-0.01	-0.05	-0.03	0.00	0.00	0.00	0.03	0.00
Other diseases of the nervous system and sensory organs	-0.08	0.12	-0.10	-0.04	-0.04	0.01	0.01	0.01	0.02	0.00
Other diseases of the osteo-articular system + muscles and connecting tissue	-0.03	0.00	-0.06	0.00	0.01	0.02	-0.01	0.00	0.00	0.00
Other endocrinological, metabolic and nutritional conditions	-0.27	0.23	0.12	0.03	-0.02	0.01	-0.01	0.00	0.01	0.00
Other external injury and poisoning	0.23	-0.10	-0.03	-0.06	0.04	0.05	-0.04	-0.03	0.00	0.00
Other genito-urinary diseases	-0.37	0.26	-0.07	-0.06	-0.07	0.00	0.02	0.02	0.00	0.00
Other heart disease	-0.36	0.28	0.07	0.01	-0.02	-0.01	-0.01	0.00	0.10	0.02
Other ill-defined symptoms and conditions	-0.49	0.59	0.47	0.18	0.16	0.05	0.03	0.01	0.03	0.03
Other illnesses relating to circulation	-0.23	0.07	-0.06	-0.02	-0.01	0.00	-0.01	0.00	0.06	0.00
Other infectious diseases and parasites	-0.05	0.01	-0.07	-0.04	0.01	0.00	0.01	0.01	0.01	0.00
Other malignent tumours	0.09	-0.29	-0.04	0.02	0.04	-0.02	-0.02	0.00	0.04	0.00
Other psychological and behavioural disorders	-0.40	0.36	-0.03	-0.07	-0.11	-0.02	0.01	-0.01	0.01	0.00
Other respiratory ailments	-0.26	0.17	0.04	0.02	0.02	-0.01	0.00	-0.01	0.03	0.00
Other tumours	0.19	-0.25	0.01	-0.01	0.03	0.00	-0.03	0.02	0.01	0.00
Pneumonia	-0.38	0.38	0.17	0.04	-0.03	0.00	-0.01	0.00	0.01	0.00
Rhumatoid arthritis and osteoarthritis	-0.23	0.02	-0.22	-0.08	0.00	0.00	0.00	-0.05	0.00	0.00
Road accidents	2.47	1.05	-0.33	0.36	-0.03	0.01	0.00	0.00	0.02	0.15
Suicides	1.39	0.04	0.31	-0.48	0.06	-0.02	0.01	0.00	0.02	0.04
Tuberculosis	0.13	-0.34	0.01	-0.04	0.01	0.02	-0.01	-0.02	0.00	0.00
Unknown or unspecified causes	0.43	0.05	0.13	-0.08	0.01	0.01	0.01	0.00	0.03	0.01
Viral hepatitis	0.49	-0.10	-0.09	0.04	-0.02	-0.02	0.00	0.00	0.00	0.00

Solution

inertia: a numerical vector with length matching the number of rows of coord, contrib and cos2.

Inertia is the way CA measures variation between row profiles. Total inertia is the $\chi^2$ statistic divided by sample size.

Row inertia can be obtained by multiplying the row marginal probability by the squared Euclidean norm of the row in the principal coordinate matrix.

Code

with (ca_79_row,
  sum(abs(r* (rowSums(coord^2)) - inertia))
)

[1] 1.877449e-16

Solution

cos2: Coefficients of matrix cos2 are the share of row inertia from the corresponding cell in coord

Code

with (ca_79_row,
  norm((diag(r/inertia) %*% coord^2) - cos2, type='F')
)

[1] 5.432216e-16

Solution

contrib

Not too surprisingly, coord, contrib, and cos2 share the same row names and column names.

Code

sum(ca_79$call$X)

[1] 529974

Code

sum((rowSums(ca_79$call$X)/sum(ca_79$call$X) - r)^2)

[1] 6.311339e-35

The Row Profiles are the rows of matrix R below

Code

P <- as.matrix(with(ca_79$call, Xtot/N))
coord <- ca_79_row$coord
inertia <- ca_79_row$inertia

r <- ca_79$call$marge.row
c <- colSums(P)

n <- nrow(P)
p <- ncol(P)

R <- diag(r^(-1)) %*% P 

Q <- R - matrix(1, nrow = n, ncol = n) %*% P

Code

M <- diag(r^(-1)) %*% P %*% diag(c^(-1)) - matrix(1, nrow=n, ncol=p)

n * norm(diag(r^(1/2)) %*% M %*% diag(c^(1/2)), type = "F")^2

[1] 29.39279

Solution

We can now display a scatterplot from component coord. This is called a Row Plot.

Code

p_scat <-  ( 
  prep_rows |> 
    ggplot() +
    aes(x=`Dim 1`, y=`Dim 2`, label=name) +
    geom_point() +
    coord_fixed() 
  ) 

p_scat |> plotly::ggplotly()

Solution

With little effort, it is possible to scale the points so as to tell the reader the relative numerical importance of each Cause of death. Coloring/filling the points using inertia also helps: high inertia rows match light-colored points.

Code

ppp <- prep_rows |> 
    ggplot() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_size_area() +
    coord_fixed() +
    scale_fill_viridis_c(aesthetics=c("fill", "color"), 
                         guide="colorbar", 
                         direction = 1) +
    ggtitle(
      "Mortality France 1979: Row plot"
    )
  
ppp |> plotly::ggplotly()

Code

# (ca_79$row)$contrib

Question

Plot the result of row profile analysis using plot.CA from FactoMineR.

Question

Perform column profiles analysis

Code

names(ca_79_row)

[1] "coord"   "contrib" "cos2"    "inertia"

Solution

Code

age_group_names <-  str_match(rownames(ca_79$col$coord), '([\\w \\-]*) \\(79\\)')[,2]

prep_cols <- ca_79$col$coord |> 
  as_tibble() |> 
  mutate(name= age_group_names) |> 
  relocate(name) |> 
  mutate(prop=c, inertia=ca_79$col$inertia)

Code

(
  prep_cols |> 
    ggplot() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_size_area() +
    coord_fixed() +
    scale_fill_viridis_c(aesthetics=c("fill", "color"),direction = 1) +
    ggtitle(
      "Mortality France 1979: Col plot"
    )) |> plotly::ggplotly()

Symmetric plots

Question

Build the symmetric plots (biplots) for correspondence analysis of Mortalitity data

From the shelf

Code

plot.CA(ca_79)

Solution

Code

(
prep_rows |> 
    ggplot() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_size_area() +
    coord_fixed() +
    scale_fill_viridis_c(aesthetics=c("fill", "color"),direction = 1) +
    geom_point(data = prep_cols,
      aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)
      ),
      shape="square",
      alpha=.5,      
    )
) |> plotly::ggplotly()

Warning in geom_point(data = prep_cols, aes(x = `Dim 1`, y = `Dim 2`, label =
name, : Ignoring unknown aesthetics: label

Solution

It is convenient to use distinct color scales for rows and columns.

Code

(
prep_rows |> 
    ggplot() +
    scale_size_area() +
    coord_fixed() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        text=name, 
        size=prop, 
        fill=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_fill_viridis_c(option="D") +
    geom_point(data = prep_cols,
      aes(x=`Dim 1`, 
        y=`Dim 2`, 
        text=name, 
        size=prop, 
        color=log10(inertia)
      ),
      shape="square",
      alpha=.5,      
    ) +
    scale_color_viridis_c(option="F") +
    theme_minimal(
    )  
)  |> plotly::ggplotly()

Warning in geom_point(data = prep_cols, aes(x = `Dim 1`, y = `Dim 2`, text =
name, : Ignoring unknown aesthetics: text

Mosaicplots

Question

Mosaic plots provide an alternative way of exploring contingency tables. They are particularly handy when handling 2-way contingency tables.

Draw mosaic plots for the two contingency tables living inside mortality datasets.

Solution

Code

mortality |> 
  select(ends_with('(06)')) |> 
  chisq.test() |> 
  broom::glance()

Warning in chisq.test(select(mortality, ends_with("(06)"))): Chi-squared
approximation may be incorrect

# A tibble: 1 × 4
  statistic p.value parameter method                    
      <dbl>   <dbl>     <int> <chr>                     
1   229784.       0       488 Pearson's Chi-squared test

Code

mortality |> 
  select(ends_with('(06)')) |> 
  as.matrix() |> 
  as.table() |> 
  mosaicplot(color = T)

Question

Are you able to deliver an interpretation of this Correspondence Analysis?

Hierarchical clusetring of row profiles

Question

Build the standardized matrix for row profiles analysis. Compute the pairwise distance matrix using the $\chi^2$ distances. Should you work centered row profiles?

We use the weighted $\ell_2$ distances defined by the product of the two marginal distributions. The squared distance between the conditional probabilities defined by rows $a$ and $a'$ is \[\sum_{b} \frac{\left( N_{a,b}/N_{a,.} - N_{a',b}/N_{a',.}\right)^2}{N_{.,b}/N}\]

The $\ell_2$ distance between the rows of the principal coordinates matrix row$coord coincides since they are all centered and normalized with respect to $(N_{.,b}/N)$.

Code

dist_Causes_79 <- ca_79$row$coord[,1:8] |> 
  dist()

Code

hc_79 <- hclust(dist_Causes_79, method = "single")

Code

stopifnot(
  require(ggdendro),
  require(dendextend),
  require(sloop)
)

The instance of hclust is transformed into a an object of class dendro. Class dendro is equipped with a variety of functions/methods for analyzing, visualizing, and exploiting the result of hclust().

Code

dendro_79 <- dendro_data(hc_79)

Code

class(dendro_79)

[1] "dendro"

Code

(
dendro_79 |> 
  ggdendrogram(
    leaf_labels = T, 
    rotate = T) + 
  ggdendro::theme_dendro() +
  scale_y_reverse()
  ) |> plotly::ggplotly()

Scale for y is already present.
Adding another scale for y, which will replace the existing scale.

Question

Perform hierarchical clustering of row profiles with method/linkage "single". Check the definition of the method. Did you know the underlying algorithm? If yes, in which context did you get acquainted with this algorithm?

Question

Choose the number of classes (provide justification).

Question

Can you explain the size of the different classes in the partition?

Atypical row profiles

Question

Row profiles that do not belong to the majority class are called atypical.

Compute the share of inertia of atypical row profiles.
Draw a symmetric plot (biplot) outlining the atypical row profiles.

Investigating independence/association

Question

Calculate the theoretical population table for deces. Do you possible to carry out a chi-squared test?
Perform a hierarchical classification of the line profiles into two classes.
Merge the rows of deces corresponding to the same class (you can use the the tapply function), and perform a chi-square test. chi-square test. What’s the conclusion?
Why is it more advantageous to carry out this grouping into two classes compared to arbitrarily grouping two classes, in order to prove the dependence between these two variables?

About the “average profile”

Question

Represent individuals from the majority class. Do they all seem to you to correspond to an average profile?
Try to explain this phenomenon considering the way in which hierarchical classification uses the Single Linkage method.

Caveat

The mortality dataset should be taken with grain of salt. Assigning a single Cause to every death is not a trivial task. It is even questionable: if somebody dies from some infection beCause she could not be cured using an available drug due to another preexisting pathology, who is the culprit?

Correspondence Analysis

The mortality dataset

Elementary statistics and table wrangling

Correspondance Analysis

Screeplots

Screeplot

Row profiles analysis

Symmetric plots

From the shelf

Mosaicplots

Hierarchical clusetring of row profiles

Atypical row profiles

Investigating independence/association

About the “average profile”

The `mortality` dataset