LAB: Correspondance Analysis

Published

March 21, 2025

M1 MIDS/MFA/LOGOS

Université Paris Cité

Année 2024

Course Homepage

Moodle

Besides the usual packages (tidyverse, …), we shall require FactoMineR and related packages.

Code
stopifnot(
  require(FactoMineR),
  require(factoextra),
  require(FactoInvestigate)
)

Correspondence Analysis

The mortality dataset

The goal is to investigate a possible link between age group and Cause of death. We work with dataset mortality from package FactoMineR

Code
data("mortality", package = "FactoMineR")
Code
#help(mortality)

A data frame with 62 rows (the different Causes of death) and 18 columns. Each column corresponds to an age interval (15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, 85-94, 95 and more) in a year. The 9 first columns correspond to data in 1979 and the 9 last columns to data in 2006. In each cell, the counts of deaths for a Cause of death in an age interval (in a year) is given.

Source
Centre d’épidemiologie sur les Causes de décès médicales

See also EuroStat:

Question

Read the documentation of the mortality dataset. Is this a sample? an aggregated dataset?

If you consider mortality as an agregated dataset, can you figure out the organization of the sample mortality was built from?

Solution

The mortality dataset is an aggregated dataset. It has been built from two samples. Each sample was built from the collection of death certificates from one calendar year in France (years 1999 and 2006). From each death certificate, two categorical pieces of information were extracted: age group of the deceased and a Cause of death. Each sample was then grouped by age group and Cause of death and counts were computed. This defines a two-ways contingency table in long form. The contingency table in wide form is obtained by pivoting: pick column names from column age group and values from counts. Column Cause of depth provide row names.

The final form of the dataset is obtained by concatenating the two contingency tables along the second axis.

Code
mortality <- mortality |> 
    mutate(Cause = rownames(mortality)) |>
    mutate(Cause = factor(Cause)) |>
    relocate(Cause)
Code
my_gt <- function(gt_tbl){
  gt_tbl |>
  tab_style(
    style = list(
      "font-variant: small-caps;"
    ),
    locations = cells_body(columns = Cause)
  ) |>
  gt::cols_align(
    align="left",
    columns=Cause
  ) 

}
Code
mortality |>
    select(Cause, ends_with('(06)')) |> 
    sample_n(10) |>
    gt::gt() |>
    my_gt()
Cause 15-24 (06) 25-34 (06) 35-44 (06) 45-54 (06) 55-64 (06) 65-74 (06) 75-84 (06) 85-94 (06) 95 and more (06)
Kidney and urethra disease 3 10 35 106 266 655 2432 2454 545
Addiction to prescription medication 18 77 72 15 4 2 1 0 0
Blood and hematopoietic disorders 20 22 43 94 159 259 642 670 213
Malignant tumour of the larynx, trachea, bronchus and lungs 0 38 681 4059 7285 8026 7678 1869 128
Other ill-defined symptoms and conditions 42 126 346 752 1077 1746 4943 7920 4253
Unknown or unspecified causes 227 476 928 1517 1585 1487 2446 2298 872
Homicides 52 83 65 63 42 27 21 6 0
Viral hepatitis 0 9 76 117 94 143 224 61 6
Malignant ovarian tumour 4 12 63 302 603 863 1054 402 38
Rhumatoid arthritis and osteoarthritis 0 1 0 4 14 99 231 176 54

Elementary statistics and table wrangling

Before proceeding to Correspondence Analysis (CA), let us tidy up the table and draw some elementary plots.

Question
  • Start by partially pivoting mortality, so as to obtain a tibble with columns Cause, year, while keeping all columns named after age groups (tidy up the data so as to obtain a tibble in partially long format).
  • Use rowwise() and sum(c_cross()) so as to compute the total number of deaths per year and Cause in column total. This allows to mimic rowSums() inside a pipeline. Column grand_total is computed using a window function over grouping by Cause.
Solution
Code
mortality_long <- mortality  |> 
  pivot_longer(
    cols=-Cause,
    cols_vary="slowest",
    names_to=c(".value", "year"),
    names_pattern="([\\w\\- ]*) \\(([0-9]{2})\\)"
  )  |> 
  mutate(year=ifelse(year=='06', 2006, 1979)) |> 
  rowwise() |> 
  mutate(total_year=sum(c_across(-c(Cause, year)))) |> 
  group_by(Cause) |> 
  mutate(grand_total = sum(total_year)) |> 
  ungroup()
Code
mortality_long |>
 slice_sample(n=10) |>
 gt::gt() |>
 my_gt() |>
 gt::tab_caption("A sample of rows from Mortality table in long form")
A sample of rows from Mortality table in long form
Cause year 15-24 25-34 35-44 45-54 55-64 65-74 75-84 85-94 95 and more total_year grand_total
Events of undetermined intention 1979 297 382 296 329 239 279 221 66 1 2110 2558
Gastroduodenal ulcer 2006 3 1 13 38 48 99 264 244 55 765 2981
Cerebrovascular disease 2006 35 75 311 902 1575 3719 12172 11385 2464 32638 98795
Malignant tumour of the of the colon 1979 10 39 114 556 1130 2898 3541 1257 48 9593 21753
Meningococal disease 2006 8 0 1 1 0 4 2 0 0 16 60
Diabetes 1979 17 43 61 232 585 2064 3044 1018 47 7111 18083
Other tumours 2006 39 45 141 300 564 1010 2192 1713 318 6322 10399
Malignant tumour of the of the colon 2006 4 17 135 538 1421 2572 4641 2486 346 12160 21753
Other ill-defined symptoms and conditions 1979 41 75 116 276 511 1635 5641 7330 1500 17125 38330
Homicides 1979 92 116 115 65 50 33 35 10 0 516 875
Solution

A truly tidy version of the dataset can be obtained from further pivoting.

Code
mortality_tidy <- mortality_long |> 
  pivot_longer(
    cols=-c(year,Cause,total_year, grand_total),
    cols_vary="slowest",
    names_to=c("age_range"),
    values_to=c("#deaths")
) |>
  mutate(age_range = factor(age_range, levels=sort(unique(age_range)),ordered=T))
Code
mortality_tidy |>
  sample_n(5) |>
  gt::gt()
Cause year total_year grand_total age_range #deaths
Other ill-defined symptoms and conditions 2006 21205 38330 25-34 126
Malignant ovarian tumour 2006 3341 5661 85-94 402
Road accidents 2006 4648 15255 95 and more 8
Meningococal disease 1979 44 60 85-94 0
Chronic liver disease 2006 7669 23596 35-44 453
Question

Build a bar plot to display the importance of Causes of deaths in France in years 1979 and 2006

Solution
Code
th <- theme_get()
(
mortality_long |> 
  mutate(Cause=fct_reorder(Cause, desc(grand_total))) |>
  mutate(year=as_factor(year)) |>
  ggplot() +
  scale_fill_discrete() +
  aes(x=Cause, 
      y=total_year, 
      fill=year) +
  geom_col(position=position_dodge()) +
  theme(
    legend.position="none",
    axis.text.x=element_blank(), #remove x axis labels
    axis.ticks.x=element_blank(), #remove x axis ticks
  ) +
  labs(
    title = "Causes of death, France, 1979, 2006",
    subtitle= "Raw counts"
  ) +
  xlab(label=NULL)
) |>
  plotly::ggplotly()  
Code
oth <- theme_set(th)
Question

Compute and display the total number of deaths in France in years 1979 and 2006.

Solution
Code
mortality_long |> 
  group_by(year) |> 
  summarise(total_deaths = sum(total_year)) |>
  gt::gt() |>
  gt::cols_label(
    year= "Year", 
    total_deaths = "#Deaths") |>
  gt::tab_caption("Mortality in France")
Mortality in France
Year #Deaths
1979 529974
2006 510921
Question

Compute the marginal counts for each year (1979, 2006). Compare.

Solution

Counts have already been computed above.

Code
mortality_long |> 
  select(Cause, year, total_year, grand_total) |> 
  pivot_wider(
    id_cols=c(Cause, grand_total), 
    names_from = year, 
    values_from = total_year) |> 
  rename(Total=grand_total) |> 
  arrange(desc(Total)) |>
  gt::gt() |>
  my_gt()
Cause Total 1979 2006
Cerebrovascular disease 98795 66157 32638
Other heart disease 97297 54105 43192
Ischemic cardiomyopathy 88338 49532 38806
Other illnesses relating to circulation 61937 31218 30719
Malignant tumour of the larynx, trachea, bronchus and lungs 50604 20840 29764
Other malignent tumours 48809 23262 25547
Other diseases of the nervous system and sensory organs 38891 12056 26835
Other ill-defined symptoms and conditions 38330 17125 21205
Other digestive conditions 32697 18092 14605
Other respiratory ailments 26339 14197 12142
Unknown or unspecified causes 26192 14356 11836
Chronic liver disease 23596 15927 7669
Other accidents 23353 10921 12432
Malignant tumour of the of the colon 21753 9593 12160
Suicides 20337 9952 10385
Malignant tumour of the breast 20236 8605 11631
Malignant neplasm of the lymphatic and hematopoietic tissues 20000 7589 12411
Diabetes 18083 7111 10972
Falls 17711 12503 5208
Other chronic respiritory illnesses 17331 9680 7651
Other psychological and behavioural disorders 17160 3749 13411
Malignant tumour of the prostate 15514 6577 8937
Road accidents 15255 10607 4648
Pneumonia 14700 5057 9643
Malignant tumour of the stomach 13783 9020 4763
Other endocrinological, metabolic and nutritional conditions 13665 6030 7635
Kidney and urethra disease 13613 7107 6506
Malignant tumour of the of the pancreas 12851 4588 8263
Other infectious diseases and parasites 11466 4045 7421
Malignant tumour of the liver and intrahepatic biliary tract 11452 4001 7451
Other tumours 10399 4077 6322
Malignant tumour of the lip, pharynx and mouth 9722 5606 4116
Malignant tumour of the oesophogus 9267 5430 3837
Malignant tumour of the rectum and anus 9246 4980 4266
Malignant tumour of the bladder 8322 3633 4689
Alcohol abuse and alcohol-related psychosis 6327 3371 2956
Malignant ovarian tumour 5661 2320 3341
Malignant tumour in other parts of the uterus 5263 2940 2323
Malignant tumour of the kidney 5205 2101 3104
Other genito-urinary diseases 4387 2552 1835
Blood and hematopoietic disorders 4299 2177 2122
Other diseases of the osteo-articular system + muscles and connecting tissue 3935 1025 2910
Infections of the skin and sub-cutaneous cellular tissue 3439 1649 1790
Gastroduodenal ulcer 2981 2216 765
Tuberculosis 2797 2070 727
Other external injury and poisoning 2731 1023 1708
Events of undetermined intention 2558 2110 448
Asthma 2488 1495 993
Malignant melanoma 2205 658 1547
Malignant tumour of the cervix 1527 824 703
Accidental poisoning 1510 503 1007
Rhumatoid arthritis and osteoarthritis 1284 705 579
Influenza 1166 1051 115
Viral hepatitis 1059 329 730
Homicides 875 516 359
Other congenital defects and chromosomal abnormalities 592 145 447
Congenital defects of the circulatory system 540 275 265
Meningitis 481 362 119
Addiction to prescription medication 222 33 189
Complications in pregnancy and childbirth 150 91 59
Congenital defects of the nervous system 109 61 48
Meningococal disease 60 44 16

Correspondance Analysis

CA executive summary
  • Start from a 2-way contingency table \(X\) with \(\sum_{i,j} X_{i,j}=N\)

  • Normalize \(P = \frac{1}{N}X\) (correspondance matrix)

  • Let \(r\) (resp. \(c\)) be the row (resp. column) wise sums vector

  • Let \(D_r=\text{diag}(r)\) denote the diagonal matrix with row sums of \(P\) as coefficients

  • Let \(D_c=\text{diag}(c)\) denote the diagonal matrix with column sums of \(P\) as coefficients

  • The row profiles matrix is \(D_r^{-1} \times P\)

  • The standardized residuals matrix is \(S = D_r^{-1/2} \times \left(P - r c^\top\right) \times D_c^{-1/2}\)

CA consists in computing the SVD of the standardized residuals matrix \(S = U \times D \times V^\top\)

From the SVD, we get

  • \(D_r^{-1/2} \times U\) standardized coordinates of rows
  • \(D_c^{-1/2} \times V\) standardized coordinates of columns
  • \(D_r^{-1/2} \times U \times D\) principal coordinates of rows
  • \(D_c^{-1/2} \times V \times D\) principal coordinates of columns
  • Squared singular values: the principal inertia

When calling svd(.), the argument should be \[D_r^{1/2}\times \left(D_r^{-1} \times P \times D_c^{-1}- \mathbf{I}\times \mathbf{I}^\top \right)\times D_c^{1/2}= D_r^{-1/2}\times \left( P - r \times c^\top \right)\times D_c^{-1/2}\]

CA and extended SVD

As \[D_r^{-1} \times P \times D_c^{-1} - \mathbf{I}\mathbf{I}^\top = (D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top\]

\((D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top\) is the extended SVD of \[D_r^{-1} \times P \times D_c^{-1} - \mathbf{I}\mathbf{I}^\top\] with respect to \(D_r\) and \(D_c\)

Question

Perform CA on the two contingency tables.

You may use FactoMineR::CA(). It is interesting to compute the correspondence analysis in your own way, by preparing the matrix that is handled to svd() and returning a named list containing all relevant information.

Do the Jedi and Sith build their own light sabers? Jedi do. It’s a key part of the religion to have a kyber crystal close to you, to build the saber through the power of the force creating a blade unique and in tune with them

Solution
Code
lst_ca <- list()

for (y in c('79', '06')) {
  lst_ca[[y]] <- mortality |> 
    select(ends_with(glue('({y})'))) |> 
    FactoMineR::CA(ncp=8, graph = F)
}
Code
lst <- map(c('79', '06'), 
             \(x) select(mortality, ends_with(glue('({x})'))) |>
             FactoMineR::CA(ncp=8, graph = F)
           )
Question

If you did use FactoMineR::CA(), explain the organization of the result.

Solution

The result of FactoMineR::CA(...) is a named and nested list with five elements:

eig
a matrix/array containing enough information to build a screeplot.
call
a list of 9, containing the call to CA(), an object of type language, telling (in principle) the user how CA() was called. However, this is a quoted expression. Here we need to guess the value of y in the calling environment understand what’s going on.
Code
lst_ca[[1]]$call$call
FactoMineR::CA(X = select(mortality, ends_with(glue("({y})"))), 
    ncp = 8, graph = F)

Element call also contains the table margin distributions marge.col and marge.row. The truncation rank ncp (number of components) can be assigned before computing the SVD (default value is 5). Element \(X\) stores the contingency table that was effectively used for computing Correpondence Analysis.

row
Information gathered from SVD to facilitate row profiles analysis.
col
a list structured in the same way as element row. Used for column profiles analysis
svd
a list of 3, just as the resuld of svd() containing the singular values, the left and right singular vectors of matrix \(...\)

In principle, all relevant information can be gathered from components svd, call.marge.row, and call.marge.col.

Screeplots

Question

Draw screeplots. Why are they useful? Comment briefly.

Solution
Code
ca_79 <- lst_ca[[1]]

ca_79$eig |> 
  as_tibble() |> 
  mutate(across(where(is.numeric), ~ round(.x, digits=2))) |> 
  gt::gt()
eigenvalue percentage of variance cumulative percentage of variance
0.29 61.00 61.00
0.14 28.98 89.98
0.03 6.13 96.12
0.01 2.86 98.98
0.00 0.73 99.70
0.00 0.17 99.88
0.00 0.09 99.97
0.00 0.03 100.00
Solution

Screeplot

Code
ca_79$eig |> 
  as_tibble() |> 
  rownames_to_column(var="PC") |> 
  rename(percent=eigenvalue, cumulative=`cumulative percentage of variance`) |> 
  ggplot() + 
  aes(x=PC, y=percent, label=round(cumulative,2)) +
  geom_text(angle=45, vjust=-1, hjust=-0.1) + 
  geom_col(fill=NA, colour="black") +
  ylab("Squared singular values") +
  ylim(c(0, .4)) +
  labs(
    title="Screeplot for CA",
    subtitle = "Mortality 1979: Age Group versus Causes of Death"
  )

Row profiles analysis

Question

Perform row profiles analysis.

What are the classical plots? How can you build them from the output of FactoMiner::CA?

Build the table of row contributions (the so-called \(\cos^2\))

Solution

Attribute row of objects of class CA (exported from FactoMineR) is the starting point of any row profiles analysis.

Code
ca_79_row <- ca_79$row

Attribute row is a named list made of \(4\) components.

Solution
coord
a matrix with named rows and columns. The number of rows of coord matches the number of rows of the contingency table (here, the number of possible death Causes). The number of columns matches the rank of the truncated SVD that underlies Correspondance Analysis. Here it is \(5\) which also the rank of the standardized contingency table.

The row principal coordinates are the principal coordinates of each row profile in terms of the principal component.

The columns of coord are pairwise orthogonal in the inner product space defined by diag(call$marge.row) (which embodies the marginal probabilities of the so-called Causes of deaths)

Code
x <- ca_79$row$coord
r <- ca_79$call$marge.row

A <- round(t(x) %*% diag(r) %*% x, 2)

is_diagonal <- function (A, tol=1e-2){
  norm(diag(diag(A))-A, type='F') <= tol
}

# We expect A to be diagonal
is_diagonal(A)
[1] TRUE

We can recover row$coord from the left singular vectors and the singular values:

Code
with(ca_79,   
  norm(row$coord - with(svd, U %*% diag(vs[1:ca_79$call$ncp])), 'F')
)
[1] 0
Code
prep_rows <- ca_79_row$coord |> 
  as_tibble() |> 
  mutate(name= rownames(ca_79_row$coord)) |> 
  relocate(name) |> 
  mutate(prop=r, inertia=ca_79_row$inertia) 

prep_rows |> 
  mutate(across(where(is.numeric), \(x) round(x,2))) |> 
  gt::gt()
name Dim 1 Dim 2 Dim 3 Dim 4 Dim 5 Dim 6 Dim 7 Dim 8 prop inertia
Accidental poisoning 1.53 0.52 0.04 -0.17 0.01 -0.03 -0.01 -0.05 0.00 0.00
Addiction to prescription medication 1.75 0.44 -0.12 0.27 -0.12 -0.24 0.19 0.22 0.00 0.00
Alcohol abuse and alcohol-related psychosis 0.65 -0.74 0.44 -0.08 -0.17 0.09 0.02 -0.02 0.01 0.01
Asthma -0.01 -0.10 -0.12 -0.05 0.02 0.00 0.05 0.03 0.00 0.00
Blood and hematopoietic disorders -0.08 0.07 -0.07 -0.05 -0.04 -0.02 -0.01 0.03 0.00 0.00
Cerebrovascular disease -0.32 0.17 -0.06 -0.04 -0.06 -0.01 0.00 0.00 0.12 0.02
Chronic liver disease 0.49 -0.83 0.30 0.09 -0.08 0.04 -0.01 -0.01 0.03 0.03
Complications in pregnancy and childbirth 3.53 1.80 0.28 -2.09 0.99 -0.77 0.15 0.06 0.00 0.00
Congenital defects of the circulatory system 1.94 0.54 0.01 -0.10 0.11 -0.01 -0.08 0.11 0.00 0.00
Congenital defects of the nervous system 2.36 0.98 -0.23 0.47 -0.07 -0.38 0.23 -0.07 0.00 0.00
Diabetes -0.21 -0.05 -0.24 -0.07 0.02 0.02 0.03 0.01 0.01 0.00
Events of undetermined intention 1.71 0.35 0.24 -0.46 0.05 -0.03 0.01 -0.01 0.00 0.01
Falls -0.34 0.49 0.27 0.07 -0.02 -0.05 -0.05 -0.02 0.02 0.01
Gastroduodenal ulcer -0.11 -0.14 -0.08 -0.03 -0.03 0.01 0.01 -0.01 0.00 0.00
Homicides 2.19 0.62 0.35 -0.72 0.02 0.22 -0.18 0.08 0.00 0.01
Infections of the skin and sub-cutaneous cellular tissue -0.41 0.40 0.11 0.02 -0.06 -0.03 -0.01 -0.02 0.00 0.00
Influenza -0.41 0.51 0.28 0.10 0.06 0.02 -0.02 -0.02 0.00 0.00
Ischemic cardiomyopathy -0.14 -0.18 -0.17 -0.02 0.02 0.02 0.01 -0.01 0.09 0.01
Kidney and urethra disease -0.33 0.23 0.00 -0.02 -0.04 -0.01 -0.01 0.02 0.01 0.00
Malignant melanoma 0.63 -0.34 0.25 -0.32 0.06 0.07 -0.06 0.02 0.00 0.00
Malignant neplasm of the lymphatic and hematopoietic tissues 0.33 -0.18 -0.10 -0.11 0.07 0.02 0.02 -0.03 0.01 0.00
Malignant ovarian tumour 0.28 -0.59 0.07 0.09 -0.01 -0.03 0.01 -0.01 0.00 0.00
Malignant tumour in other parts of the uterus 0.08 -0.41 0.01 0.06 0.01 0.00 -0.01 0.05 0.01 0.00
Malignant tumour of the bladder -0.13 -0.25 -0.23 0.01 0.08 0.01 0.00 0.02 0.01 0.00
Malignant tumour of the breast 0.21 -0.49 0.17 0.03 -0.06 0.03 -0.02 0.02 0.02 0.01
Malignant tumour of the cervix 0.42 -0.60 0.29 -0.07 -0.16 0.16 -0.07 0.05 0.00 0.00
Malignant tumour of the kidney 0.07 -0.45 -0.10 0.07 0.11 -0.02 -0.03 0.01 0.00 0.00
Malignant tumour of the larynx, trachea, bronchus and lungs 0.21 -0.69 -0.01 0.14 0.08 -0.04 -0.01 -0.01 0.04 0.02
Malignant tumour of the lip, pharynx and mouth 0.41 -0.82 0.31 0.21 -0.16 -0.07 0.09 0.00 0.01 0.01
Malignant tumour of the liver and intrahepatic biliary tract 0.06 -0.44 -0.13 0.05 0.13 -0.05 -0.02 0.01 0.01 0.00
Malignant tumour of the oesophogus 0.21 -0.69 0.07 0.17 0.01 -0.06 0.02 0.03 0.01 0.01
Malignant tumour of the of the colon -0.13 -0.18 -0.17 -0.01 0.03 0.00 0.00 0.00 0.02 0.00
Malignant tumour of the of the pancreas -0.01 -0.37 -0.15 0.04 0.10 0.02 -0.04 0.01 0.01 0.00
Malignant tumour of the prostate -0.29 -0.01 -0.35 -0.09 0.03 0.05 0.03 -0.01 0.01 0.00
Malignant tumour of the rectum and anus -0.08 -0.27 -0.16 0.01 0.09 0.00 -0.02 -0.02 0.01 0.00
Malignant tumour of the stomach -0.12 -0.21 -0.19 -0.02 0.05 0.02 -0.01 0.01 0.02 0.00
Meningitis 0.63 -0.22 0.02 -0.05 0.03 0.08 -0.12 0.04 0.00 0.00
Meningococal disease 1.93 0.72 -0.55 0.77 -0.09 0.08 -0.02 0.05 0.00 0.00
Other accidents 1.18 0.30 0.01 -0.02 -0.02 0.00 0.01 -0.01 0.02 0.03
Other chronic respiritory illnesses -0.27 0.06 -0.11 -0.01 0.01 0.01 0.01 0.01 0.02 0.00
Other congenital defects and chromosomal abnormalities 1.09 0.19 0.04 -0.29 0.13 -0.27 0.12 0.01 0.00 0.00
Other digestive conditions -0.13 0.02 -0.01 -0.05 -0.03 0.00 0.00 0.00 0.03 0.00
Other diseases of the nervous system and sensory organs -0.08 0.12 -0.10 -0.04 -0.04 0.01 0.01 0.01 0.02 0.00
Other diseases of the osteo-articular system + muscles and connecting tissue -0.03 0.00 -0.06 0.00 0.01 0.02 -0.01 0.00 0.00 0.00
Other endocrinological, metabolic and nutritional conditions -0.27 0.23 0.12 0.03 -0.02 0.01 -0.01 0.00 0.01 0.00
Other external injury and poisoning 0.23 -0.10 -0.03 -0.06 0.04 0.05 -0.04 -0.03 0.00 0.00
Other genito-urinary diseases -0.37 0.26 -0.07 -0.06 -0.07 0.00 0.02 0.02 0.00 0.00
Other heart disease -0.36 0.28 0.07 0.01 -0.02 -0.01 -0.01 0.00 0.10 0.02
Other ill-defined symptoms and conditions -0.49 0.59 0.47 0.18 0.16 0.05 0.03 0.01 0.03 0.03
Other illnesses relating to circulation -0.23 0.07 -0.06 -0.02 -0.01 0.00 -0.01 0.00 0.06 0.00
Other infectious diseases and parasites -0.05 0.01 -0.07 -0.04 0.01 0.00 0.01 0.01 0.01 0.00
Other malignent tumours 0.09 -0.29 -0.04 0.02 0.04 -0.02 -0.02 0.00 0.04 0.00
Other psychological and behavioural disorders -0.40 0.36 -0.03 -0.07 -0.11 -0.02 0.01 -0.01 0.01 0.00
Other respiratory ailments -0.26 0.17 0.04 0.02 0.02 -0.01 0.00 -0.01 0.03 0.00
Other tumours 0.19 -0.25 0.01 -0.01 0.03 0.00 -0.03 0.02 0.01 0.00
Pneumonia -0.38 0.38 0.17 0.04 -0.03 0.00 -0.01 0.00 0.01 0.00
Rhumatoid arthritis and osteoarthritis -0.23 0.02 -0.22 -0.08 0.00 0.00 0.00 -0.05 0.00 0.00
Road accidents 2.47 1.05 -0.33 0.36 -0.03 0.01 0.00 0.00 0.02 0.15
Suicides 1.39 0.04 0.31 -0.48 0.06 -0.02 0.01 0.00 0.02 0.04
Tuberculosis 0.13 -0.34 0.01 -0.04 0.01 0.02 -0.01 -0.02 0.00 0.00
Unknown or unspecified causes 0.43 0.05 0.13 -0.08 0.01 0.01 0.01 0.00 0.03 0.01
Viral hepatitis 0.49 -0.10 -0.09 0.04 -0.02 -0.02 0.00 0.00 0.00 0.00
Solution
inertia
a numerical vector with length matching the number of rows of coord, contrib and cos2.

Inertia is the way CA measures variation between row profiles. Total inertia is the \(\chi^2\) statistic divided by sample size.

Row inertia can be obtained by multiplying the row marginal probability by the squared Euclidean norm of the row in the principal coordinate matrix.

Code
with (ca_79_row,
  sum(abs(r* (rowSums(coord^2)) - inertia))
)
[1] 1.877449e-16
Solution
cos2
Coefficients of matrix cos2 are the share of row inertia from the corresponding cell in coord
Code
with (ca_79_row,
  norm((diag(r/inertia) %*% coord^2) - cos2, type='F')
)
[1] 5.432216e-16
Solution
contrib

Not too surprisingly, coord, contrib, and cos2 share the same row names and column names.

Code
sum(ca_79$call$X)
[1] 529974
Code
sum((rowSums(ca_79$call$X)/sum(ca_79$call$X) - r)^2)
[1] 6.311339e-35

The Row Profiles are the rows of matrix R below

Code
P <- as.matrix(with(ca_79$call, Xtot/N))
coord <- ca_79_row$coord
inertia <- ca_79_row$inertia

r <- ca_79$call$marge.row
c <- colSums(P)

n <- nrow(P)
p <- ncol(P)

R <- diag(r^(-1)) %*% P 

Q <- R - matrix(1, nrow = n, ncol = n) %*% P
Code
M <- diag(r^(-1)) %*% P %*% diag(c^(-1)) - matrix(1, nrow=n, ncol=p)

n * norm(diag(r^(1/2)) %*% M %*% diag(c^(1/2)), type = "F")^2
[1] 29.39279
Solution

We can now display a scatterplot from component coord. This is called a Row Plot.

Code
p_scat <-  ( 
  prep_rows |> 
    ggplot() +
    aes(x=`Dim 1`, y=`Dim 2`, label=name) +
    geom_point() +
    coord_fixed() 
  ) 

p_scat |> plotly::ggplotly()
Solution

With little effort, it is possible to scale the points so as to tell the reader the relative numerical importance of each Cause of death. Coloring/filling the points using inertia also helps: high inertia rows match light-colored points.

Code
ppp <- prep_rows |> 
    ggplot() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_size_area() +
    coord_fixed() +
    scale_fill_viridis_c(aesthetics=c("fill", "color"), 
                         guide="colorbar", 
                         direction = 1) +
    ggtitle(
      "Mortality France 1979: Row plot"
    )
  
ppp |> plotly::ggplotly()
Code
# (ca_79$row)$contrib
Question

Plot the result of row profile analysis using plot.CA from FactoMineR.

Question

Perform column profiles analysis

Code
names(ca_79_row)
[1] "coord"   "contrib" "cos2"    "inertia"
Solution
Code
age_group_names <-  str_match(rownames(ca_79$col$coord), '([\\w \\-]*) \\(79\\)')[,2]

prep_cols <- ca_79$col$coord |> 
  as_tibble() |> 
  mutate(name= age_group_names) |> 
  relocate(name) |> 
  mutate(prop=c, inertia=ca_79$col$inertia)
Code
(
  prep_cols |> 
    ggplot() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_size_area() +
    coord_fixed() +
    scale_fill_viridis_c(aesthetics=c("fill", "color"),direction = 1) +
    ggtitle(
      "Mortality France 1979: Col plot"
    )) |> plotly::ggplotly()

Symmetric plots

Question

Build the symmetric plots (biplots) for correspondence analysis of Mortalitity data

From the shelf

Code
plot.CA(ca_79)
Solution

Code
(
prep_rows |> 
    ggplot() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_size_area() +
    coord_fixed() +
    scale_fill_viridis_c(aesthetics=c("fill", "color"),direction = 1) +
    geom_point(data = prep_cols,
      aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)
      ),
      shape="square",
      alpha=.5,      
    )
) |> plotly::ggplotly()
Warning in geom_point(data = prep_cols, aes(x = `Dim 1`, y = `Dim 2`, label =
name, : Ignoring unknown aesthetics: label
Solution

It is convenient to use distinct color scales for rows and columns.

Code
(
prep_rows |> 
    ggplot() +
    scale_size_area() +
    coord_fixed() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        text=name, 
        size=prop, 
        fill=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_fill_viridis_c(option="D") +
    geom_point(data = prep_cols,
      aes(x=`Dim 1`, 
        y=`Dim 2`, 
        text=name, 
        size=prop, 
        color=log10(inertia)
      ),
      shape="square",
      alpha=.5,      
    ) +
    scale_color_viridis_c(option="F") +
    theme_minimal(
    )  
)  |> plotly::ggplotly()
Warning in geom_point(data = prep_cols, aes(x = `Dim 1`, y = `Dim 2`, text =
name, : Ignoring unknown aesthetics: text

Mosaicplots

Question

Mosaic plots provide an alternative way of exploring contingency tables. They are particularly handy when handling 2-way contingency tables.

Draw mosaic plots for the two contingency tables living inside mortality datasets.

Solution
Code
mortality |> 
  select(ends_with('(06)')) |> 
  chisq.test() |> 
  broom::glance()
Warning in chisq.test(select(mortality, ends_with("(06)"))): Chi-squared
approximation may be incorrect
# A tibble: 1 × 4
  statistic p.value parameter method                    
      <dbl>   <dbl>     <int> <chr>                     
1   229784.       0       488 Pearson's Chi-squared test
Code
mortality |> 
  select(ends_with('(06)')) |> 
  as.matrix() |> 
  as.table() |> 
  mosaicplot(color = T)

Question

Are you able to deliver an interpretation of this Correspondence Analysis?

Hierarchical clusetring of row profiles

Question

Build the standardized matrix for row profiles analysis. Compute the pairwise distance matrix using the \(\chi^2\) distances. Should you work centered row profiles?

We use the weighted \(\ell_2\) distances defined by the product of the two marginal distributions. The squared distance between the conditional probabilities defined by rows \(a\) and \(a'\) is \[\sum_{b} \frac{\left( N_{a,b}/N_{a,.} - N_{a',b}/N_{a',.}\right)^2}{N_{.,b}/N}\]

The \(\ell_2\) distance between the rows of the principal coordinates matrix row$coord coincides since they are all centered and normalized with respect to \((N_{.,b}/N)\).

Code
dist_Causes_79 <- ca_79$row$coord[,1:8] |> 
  dist()
Code
hc_79 <- hclust(dist_Causes_79, method = "single")
Code
stopifnot(
  require(ggdendro),
  require(dendextend),
  require(sloop)
)

The instance of hclust is transformed into a an object of class dendro. Class dendro is equipped with a variety of functions/methods for analyzing, visualizing, and exploiting the result of hclust().

Code
dendro_79 <- dendro_data(hc_79)
Code
class(dendro_79)
[1] "dendro"
Code
(
dendro_79 |> 
  ggdendrogram(
    leaf_labels = T, 
    rotate = T) + 
  ggdendro::theme_dendro() +
  scale_y_reverse()
  ) |> plotly::ggplotly()
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Question

Perform hierarchical clustering of row profiles with method/linkage "single". Check the definition of the method. Did you know the underlying algorithm? If yes, in which context did you get acquainted with this algorithm?

Question

Choose the number of classes (provide justification).

Question

Can you explain the size of the different classes in the partition?

Atypical row profiles

Question

Row profiles that do not belong to the majority class are called atypical.

  1. Compute the share of inertia of atypical row profiles.

  2. Draw a symmetric plot (biplot) outlining the atypical row profiles.

Investigating independence/association

Question
  1. Calculate the theoretical population table for deces. Do you possible to carry out a chi-squared test?

  2. Perform a hierarchical classification of the line profiles into two classes.

  3. Merge the rows of deces corresponding to the same class (you can use the the tapply function), and perform a chi-square test. chi-square test. What’s the conclusion?

  4. Why is it more advantageous to carry out this grouping into two classes compared to arbitrarily grouping two classes, in order to prove the dependence between these two variables?

About the “average profile”

Question
  1. Represent individuals from the majority class. Do they all seem to you to correspond to an average profile?

  2. Try to explain this phenomenon considering the way in which hierarchical classification uses the Single Linkage method.

Caveat

The mortality dataset should be taken with grain of salt. Assigning a single Cause to every death is not a trivial task. It is even questionable: if somebody dies from some infection beCause she could not be cured using an available drug due to another preexisting pathology, who is the culprit?