Code
stopifnot(
require(FactoMineR),
require(factoextra),
require(FactoInvestigate)
)M1 MIDS/MFA/LOGOS |
| Année 2024 |
Besides the usual packages (tidyverse, …), we shall require FactoMineR and related packages.
stopifnot(
require(FactoMineR),
require(factoextra),
require(FactoInvestigate)
)mortality datasetThe goal is to investigate a possible link between age group and Cause of death. We work with dataset mortality from package FactoMineR
data("mortality", package = "FactoMineR")#help(mortality)A data frame with 62 rows (the different Causes of death) and 18 columns. Each column corresponds to an age interval (15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, 85-94, 95 and more) in a year. The 9 first columns correspond to data in 1979 and the 9 last columns to data in 2006. In each cell, the counts of deaths for a Cause of death in an age interval (in a year) is given.
See also EuroStat:
Read the documentation of the mortality dataset. Is this a sample? an aggregated dataset?
If you consider mortality as an agregated dataset, can you figure out the organization of the sample mortality was built from?
The mortality dataset is an aggregated dataset. It has been built from two samples. Each sample was built from the collection of death certificates from one calendar year in France (years 1999 and 2006). From each death certificate, two categorical pieces of information were extracted: age group of the deceased and a Cause of death. Each sample was then grouped by age group and Cause of death and counts were computed. This defines a two-ways contingency table in long form. The contingency table in wide form is obtained by pivoting: pick column names from column age group and values from counts. Column Cause of depth provide row names.
The final form of the dataset is obtained by concatenating the two contingency tables along the second axis.
mortality <- mortality |>
mutate(Cause = rownames(mortality)) |>
mutate(Cause = factor(Cause)) |>
relocate(Cause)my_gt <- function(gt_tbl){
gt_tbl |>
tab_style(
style = list(
"font-variant: small-caps;"
),
locations = cells_body(columns = Cause)
) |>
gt::cols_align(
align="left",
columns=Cause
)
}mortality |>
select(Cause, ends_with('(06)')) |>
sample_n(10) |>
gt::gt() |>
my_gt()| Cause | 15-24 (06) | 25-34 (06) | 35-44 (06) | 45-54 (06) | 55-64 (06) | 65-74 (06) | 75-84 (06) | 85-94 (06) | 95 and more (06) |
|---|---|---|---|---|---|---|---|---|---|
| Kidney and urethra disease | 3 | 10 | 35 | 106 | 266 | 655 | 2432 | 2454 | 545 |
| Addiction to prescription medication | 18 | 77 | 72 | 15 | 4 | 2 | 1 | 0 | 0 |
| Blood and hematopoietic disorders | 20 | 22 | 43 | 94 | 159 | 259 | 642 | 670 | 213 |
| Malignant tumour of the larynx, trachea, bronchus and lungs | 0 | 38 | 681 | 4059 | 7285 | 8026 | 7678 | 1869 | 128 |
| Other ill-defined symptoms and conditions | 42 | 126 | 346 | 752 | 1077 | 1746 | 4943 | 7920 | 4253 |
| Unknown or unspecified causes | 227 | 476 | 928 | 1517 | 1585 | 1487 | 2446 | 2298 | 872 |
| Homicides | 52 | 83 | 65 | 63 | 42 | 27 | 21 | 6 | 0 |
| Viral hepatitis | 0 | 9 | 76 | 117 | 94 | 143 | 224 | 61 | 6 |
| Malignant ovarian tumour | 4 | 12 | 63 | 302 | 603 | 863 | 1054 | 402 | 38 |
| Rhumatoid arthritis and osteoarthritis | 0 | 1 | 0 | 4 | 14 | 99 | 231 | 176 | 54 |
Before proceeding to Correspondence Analysis (CA), let us tidy up the table and draw some elementary plots.
mortality, so as to obtain a tibble with columns Cause, year, while keeping all columns named after age groups (tidy up the data so as to obtain a tibble in partially long format).rowwise() and sum(c_cross()) so as to compute the total number of deaths per year and Cause in column total. This allows to mimic rowSums() inside a pipeline. Column grand_total is computed using a window function over grouping by Cause.mortality_long <- mortality |>
pivot_longer(
cols=-Cause,
cols_vary="slowest",
names_to=c(".value", "year"),
names_pattern="([\\w\\- ]*) \\(([0-9]{2})\\)"
) |>
mutate(year=ifelse(year=='06', 2006, 1979)) |>
rowwise() |>
mutate(total_year=sum(c_across(-c(Cause, year)))) |>
group_by(Cause) |>
mutate(grand_total = sum(total_year)) |>
ungroup()mortality_long |>
slice_sample(n=10) |>
gt::gt() |>
my_gt() |>
gt::tab_caption("A sample of rows from Mortality table in long form")| Cause | year | 15-24 | 25-34 | 35-44 | 45-54 | 55-64 | 65-74 | 75-84 | 85-94 | 95 and more | total_year | grand_total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Events of undetermined intention | 1979 | 297 | 382 | 296 | 329 | 239 | 279 | 221 | 66 | 1 | 2110 | 2558 |
| Gastroduodenal ulcer | 2006 | 3 | 1 | 13 | 38 | 48 | 99 | 264 | 244 | 55 | 765 | 2981 |
| Cerebrovascular disease | 2006 | 35 | 75 | 311 | 902 | 1575 | 3719 | 12172 | 11385 | 2464 | 32638 | 98795 |
| Malignant tumour of the of the colon | 1979 | 10 | 39 | 114 | 556 | 1130 | 2898 | 3541 | 1257 | 48 | 9593 | 21753 |
| Meningococal disease | 2006 | 8 | 0 | 1 | 1 | 0 | 4 | 2 | 0 | 0 | 16 | 60 |
| Diabetes | 1979 | 17 | 43 | 61 | 232 | 585 | 2064 | 3044 | 1018 | 47 | 7111 | 18083 |
| Other tumours | 2006 | 39 | 45 | 141 | 300 | 564 | 1010 | 2192 | 1713 | 318 | 6322 | 10399 |
| Malignant tumour of the of the colon | 2006 | 4 | 17 | 135 | 538 | 1421 | 2572 | 4641 | 2486 | 346 | 12160 | 21753 |
| Other ill-defined symptoms and conditions | 1979 | 41 | 75 | 116 | 276 | 511 | 1635 | 5641 | 7330 | 1500 | 17125 | 38330 |
| Homicides | 1979 | 92 | 116 | 115 | 65 | 50 | 33 | 35 | 10 | 0 | 516 | 875 |
A truly tidy version of the dataset can be obtained from further pivoting.
mortality_tidy <- mortality_long |>
pivot_longer(
cols=-c(year,Cause,total_year, grand_total),
cols_vary="slowest",
names_to=c("age_range"),
values_to=c("#deaths")
) |>
mutate(age_range = factor(age_range, levels=sort(unique(age_range)),ordered=T))mortality_tidy |>
sample_n(5) |>
gt::gt()| Cause | year | total_year | grand_total | age_range | #deaths |
|---|---|---|---|---|---|
| Other ill-defined symptoms and conditions | 2006 | 21205 | 38330 | 25-34 | 126 |
| Malignant ovarian tumour | 2006 | 3341 | 5661 | 85-94 | 402 |
| Road accidents | 2006 | 4648 | 15255 | 95 and more | 8 |
| Meningococal disease | 1979 | 44 | 60 | 85-94 | 0 |
| Chronic liver disease | 2006 | 7669 | 23596 | 35-44 | 453 |
Build a bar plot to display the importance of Causes of deaths in France in years 1979 and 2006
th <- theme_get()
(
mortality_long |>
mutate(Cause=fct_reorder(Cause, desc(grand_total))) |>
mutate(year=as_factor(year)) |>
ggplot() +
scale_fill_discrete() +
aes(x=Cause,
y=total_year,
fill=year) +
geom_col(position=position_dodge()) +
theme(
legend.position="none",
axis.text.x=element_blank(), #remove x axis labels
axis.ticks.x=element_blank(), #remove x axis ticks
) +
labs(
title = "Causes of death, France, 1979, 2006",
subtitle= "Raw counts"
) +
xlab(label=NULL)
) |>
plotly::ggplotly() oth <- theme_set(th)Compute and display the total number of deaths in France in years 1979 and 2006.
mortality_long |>
group_by(year) |>
summarise(total_deaths = sum(total_year)) |>
gt::gt() |>
gt::cols_label(
year= "Year",
total_deaths = "#Deaths") |>
gt::tab_caption("Mortality in France")| Year | #Deaths |
|---|---|
| 1979 | 529974 |
| 2006 | 510921 |
Compute the marginal counts for each year (1979, 2006). Compare.
Counts have already been computed above.
mortality_long |>
select(Cause, year, total_year, grand_total) |>
pivot_wider(
id_cols=c(Cause, grand_total),
names_from = year,
values_from = total_year) |>
rename(Total=grand_total) |>
arrange(desc(Total)) |>
gt::gt() |>
my_gt()| Cause | Total | 1979 | 2006 |
|---|---|---|---|
| Cerebrovascular disease | 98795 | 66157 | 32638 |
| Other heart disease | 97297 | 54105 | 43192 |
| Ischemic cardiomyopathy | 88338 | 49532 | 38806 |
| Other illnesses relating to circulation | 61937 | 31218 | 30719 |
| Malignant tumour of the larynx, trachea, bronchus and lungs | 50604 | 20840 | 29764 |
| Other malignent tumours | 48809 | 23262 | 25547 |
| Other diseases of the nervous system and sensory organs | 38891 | 12056 | 26835 |
| Other ill-defined symptoms and conditions | 38330 | 17125 | 21205 |
| Other digestive conditions | 32697 | 18092 | 14605 |
| Other respiratory ailments | 26339 | 14197 | 12142 |
| Unknown or unspecified causes | 26192 | 14356 | 11836 |
| Chronic liver disease | 23596 | 15927 | 7669 |
| Other accidents | 23353 | 10921 | 12432 |
| Malignant tumour of the of the colon | 21753 | 9593 | 12160 |
| Suicides | 20337 | 9952 | 10385 |
| Malignant tumour of the breast | 20236 | 8605 | 11631 |
| Malignant neplasm of the lymphatic and hematopoietic tissues | 20000 | 7589 | 12411 |
| Diabetes | 18083 | 7111 | 10972 |
| Falls | 17711 | 12503 | 5208 |
| Other chronic respiritory illnesses | 17331 | 9680 | 7651 |
| Other psychological and behavioural disorders | 17160 | 3749 | 13411 |
| Malignant tumour of the prostate | 15514 | 6577 | 8937 |
| Road accidents | 15255 | 10607 | 4648 |
| Pneumonia | 14700 | 5057 | 9643 |
| Malignant tumour of the stomach | 13783 | 9020 | 4763 |
| Other endocrinological, metabolic and nutritional conditions | 13665 | 6030 | 7635 |
| Kidney and urethra disease | 13613 | 7107 | 6506 |
| Malignant tumour of the of the pancreas | 12851 | 4588 | 8263 |
| Other infectious diseases and parasites | 11466 | 4045 | 7421 |
| Malignant tumour of the liver and intrahepatic biliary tract | 11452 | 4001 | 7451 |
| Other tumours | 10399 | 4077 | 6322 |
| Malignant tumour of the lip, pharynx and mouth | 9722 | 5606 | 4116 |
| Malignant tumour of the oesophogus | 9267 | 5430 | 3837 |
| Malignant tumour of the rectum and anus | 9246 | 4980 | 4266 |
| Malignant tumour of the bladder | 8322 | 3633 | 4689 |
| Alcohol abuse and alcohol-related psychosis | 6327 | 3371 | 2956 |
| Malignant ovarian tumour | 5661 | 2320 | 3341 |
| Malignant tumour in other parts of the uterus | 5263 | 2940 | 2323 |
| Malignant tumour of the kidney | 5205 | 2101 | 3104 |
| Other genito-urinary diseases | 4387 | 2552 | 1835 |
| Blood and hematopoietic disorders | 4299 | 2177 | 2122 |
| Other diseases of the osteo-articular system + muscles and connecting tissue | 3935 | 1025 | 2910 |
| Infections of the skin and sub-cutaneous cellular tissue | 3439 | 1649 | 1790 |
| Gastroduodenal ulcer | 2981 | 2216 | 765 |
| Tuberculosis | 2797 | 2070 | 727 |
| Other external injury and poisoning | 2731 | 1023 | 1708 |
| Events of undetermined intention | 2558 | 2110 | 448 |
| Asthma | 2488 | 1495 | 993 |
| Malignant melanoma | 2205 | 658 | 1547 |
| Malignant tumour of the cervix | 1527 | 824 | 703 |
| Accidental poisoning | 1510 | 503 | 1007 |
| Rhumatoid arthritis and osteoarthritis | 1284 | 705 | 579 |
| Influenza | 1166 | 1051 | 115 |
| Viral hepatitis | 1059 | 329 | 730 |
| Homicides | 875 | 516 | 359 |
| Other congenital defects and chromosomal abnormalities | 592 | 145 | 447 |
| Congenital defects of the circulatory system | 540 | 275 | 265 |
| Meningitis | 481 | 362 | 119 |
| Addiction to prescription medication | 222 | 33 | 189 |
| Complications in pregnancy and childbirth | 150 | 91 | 59 |
| Congenital defects of the nervous system | 109 | 61 | 48 |
| Meningococal disease | 60 | 44 | 16 |
Start from a 2-way contingency table \(X\) with \(\sum_{i,j} X_{i,j}=N\)
Normalize \(P = \frac{1}{N}X\) (correspondance matrix)
Let \(r\) (resp. \(c\)) be the row (resp. column) wise sums vector
Let \(D_r=\text{diag}(r)\) denote the diagonal matrix with row sums of \(P\) as coefficients
Let \(D_c=\text{diag}(c)\) denote the diagonal matrix with column sums of \(P\) as coefficients
The row profiles matrix is \(D_r^{-1} \times P\)
The standardized residuals matrix is \(S = D_r^{-1/2} \times \left(P - r c^\top\right) \times D_c^{-1/2}\)
CA consists in computing the SVD of the standardized residuals matrix \(S = U \times D \times V^\top\)
From the SVD, we get
When calling svd(.), the argument should be \[D_r^{1/2}\times \left(D_r^{-1} \times P \times D_c^{-1}- \mathbf{I}\times \mathbf{I}^\top \right)\times D_c^{1/2}= D_r^{-1/2}\times \left( P - r \times c^\top \right)\times D_c^{-1/2}\]
As \[D_r^{-1} \times P \times D_c^{-1} - \mathbf{I}\mathbf{I}^\top = (D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top\]
\((D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top\) is the extended SVD of \[D_r^{-1} \times P \times D_c^{-1} - \mathbf{I}\mathbf{I}^\top\] with respect to \(D_r\) and \(D_c\)
Perform CA on the two contingency tables.
You may use FactoMineR::CA(). It is interesting to compute the correspondence analysis in your own way, by preparing the matrix that is handled to svd() and returning a named list containing all relevant information.
Do the Jedi and Sith build their own light sabers? Jedi do. It’s a key part of the religion to have a kyber crystal close to you, to build the saber through the power of the force creating a blade unique and in tune with them
lst_ca <- list()
for (y in c('79', '06')) {
lst_ca[[y]] <- mortality |>
select(ends_with(glue('({y})'))) |>
FactoMineR::CA(ncp=8, graph = F)
}lst <- map(c('79', '06'),
\(x) select(mortality, ends_with(glue('({x})'))) |>
FactoMineR::CA(ncp=8, graph = F)
)If you did use FactoMineR::CA(), explain the organization of the result.
The result of FactoMineR::CA(...) is a named and nested list with five elements:
eigcallCA(), an object of type language, telling (in principle) the user how CA() was called. However, this is a quoted expression. Here we need to guess the value of y in the calling environment understand what’s going on.
lst_ca[[1]]$call$callFactoMineR::CA(X = select(mortality, ends_with(glue("({y})"))),
ncp = 8, graph = F)
Element call also contains the table margin distributions marge.col and marge.row. The truncation rank ncp (number of components) can be assigned before computing the SVD (default value is 5). Element \(X\) stores the contingency table that was effectively used for computing Correpondence Analysis.
rowcolrow. Used for column profiles analysis
svdsvd() containing the singular values, the left and right singular vectors of matrix \(...\)
In principle, all relevant information can be gathered from components svd, call.marge.row, and call.marge.col.
Draw screeplots. Why are they useful? Comment briefly.
ca_79 <- lst_ca[[1]]
ca_79$eig |>
as_tibble() |>
mutate(across(where(is.numeric), ~ round(.x, digits=2))) |>
gt::gt()| eigenvalue | percentage of variance | cumulative percentage of variance |
|---|---|---|
| 0.29 | 61.00 | 61.00 |
| 0.14 | 28.98 | 89.98 |
| 0.03 | 6.13 | 96.12 |
| 0.01 | 2.86 | 98.98 |
| 0.00 | 0.73 | 99.70 |
| 0.00 | 0.17 | 99.88 |
| 0.00 | 0.09 | 99.97 |
| 0.00 | 0.03 | 100.00 |
ca_79$eig |>
as_tibble() |>
rownames_to_column(var="PC") |>
rename(percent=eigenvalue, cumulative=`cumulative percentage of variance`) |>
ggplot() +
aes(x=PC, y=percent, label=round(cumulative,2)) +
geom_text(angle=45, vjust=-1, hjust=-0.1) +
geom_col(fill=NA, colour="black") +
ylab("Squared singular values") +
ylim(c(0, .4)) +
labs(
title="Screeplot for CA",
subtitle = "Mortality 1979: Age Group versus Causes of Death"
)Perform row profiles analysis.
What are the classical plots? How can you build them from the output of FactoMiner::CA?
Build the table of row contributions (the so-called \(\cos^2\))
Attribute row of objects of class CA (exported from FactoMineR) is the starting point of any row profiles analysis.
ca_79_row <- ca_79$rowAttribute row is a named list made of \(4\) components.
coordcoord matches the number of rows of the contingency table (here, the number of possible death Causes). The number of columns matches the rank of the truncated SVD that underlies Correspondance Analysis. Here it is \(5\) which also the rank of the standardized contingency table.
The row principal coordinates are the principal coordinates of each row profile in terms of the principal component.
The columns of coord are pairwise orthogonal in the inner product space defined by diag(call$marge.row) (which embodies the marginal probabilities of the so-called Causes of deaths)
x <- ca_79$row$coord
r <- ca_79$call$marge.row
A <- round(t(x) %*% diag(r) %*% x, 2)
is_diagonal <- function (A, tol=1e-2){
norm(diag(diag(A))-A, type='F') <= tol
}
# We expect A to be diagonal
is_diagonal(A)[1] TRUE
We can recover row$coord from the left singular vectors and the singular values:
with(ca_79,
norm(row$coord - with(svd, U %*% diag(vs[1:ca_79$call$ncp])), 'F')
)[1] 0
prep_rows <- ca_79_row$coord |>
as_tibble() |>
mutate(name= rownames(ca_79_row$coord)) |>
relocate(name) |>
mutate(prop=r, inertia=ca_79_row$inertia)
prep_rows |>
mutate(across(where(is.numeric), \(x) round(x,2))) |>
gt::gt()| name | Dim 1 | Dim 2 | Dim 3 | Dim 4 | Dim 5 | Dim 6 | Dim 7 | Dim 8 | prop | inertia |
|---|---|---|---|---|---|---|---|---|---|---|
| Accidental poisoning | 1.53 | 0.52 | 0.04 | -0.17 | 0.01 | -0.03 | -0.01 | -0.05 | 0.00 | 0.00 |
| Addiction to prescription medication | 1.75 | 0.44 | -0.12 | 0.27 | -0.12 | -0.24 | 0.19 | 0.22 | 0.00 | 0.00 |
| Alcohol abuse and alcohol-related psychosis | 0.65 | -0.74 | 0.44 | -0.08 | -0.17 | 0.09 | 0.02 | -0.02 | 0.01 | 0.01 |
| Asthma | -0.01 | -0.10 | -0.12 | -0.05 | 0.02 | 0.00 | 0.05 | 0.03 | 0.00 | 0.00 |
| Blood and hematopoietic disorders | -0.08 | 0.07 | -0.07 | -0.05 | -0.04 | -0.02 | -0.01 | 0.03 | 0.00 | 0.00 |
| Cerebrovascular disease | -0.32 | 0.17 | -0.06 | -0.04 | -0.06 | -0.01 | 0.00 | 0.00 | 0.12 | 0.02 |
| Chronic liver disease | 0.49 | -0.83 | 0.30 | 0.09 | -0.08 | 0.04 | -0.01 | -0.01 | 0.03 | 0.03 |
| Complications in pregnancy and childbirth | 3.53 | 1.80 | 0.28 | -2.09 | 0.99 | -0.77 | 0.15 | 0.06 | 0.00 | 0.00 |
| Congenital defects of the circulatory system | 1.94 | 0.54 | 0.01 | -0.10 | 0.11 | -0.01 | -0.08 | 0.11 | 0.00 | 0.00 |
| Congenital defects of the nervous system | 2.36 | 0.98 | -0.23 | 0.47 | -0.07 | -0.38 | 0.23 | -0.07 | 0.00 | 0.00 |
| Diabetes | -0.21 | -0.05 | -0.24 | -0.07 | 0.02 | 0.02 | 0.03 | 0.01 | 0.01 | 0.00 |
| Events of undetermined intention | 1.71 | 0.35 | 0.24 | -0.46 | 0.05 | -0.03 | 0.01 | -0.01 | 0.00 | 0.01 |
| Falls | -0.34 | 0.49 | 0.27 | 0.07 | -0.02 | -0.05 | -0.05 | -0.02 | 0.02 | 0.01 |
| Gastroduodenal ulcer | -0.11 | -0.14 | -0.08 | -0.03 | -0.03 | 0.01 | 0.01 | -0.01 | 0.00 | 0.00 |
| Homicides | 2.19 | 0.62 | 0.35 | -0.72 | 0.02 | 0.22 | -0.18 | 0.08 | 0.00 | 0.01 |
| Infections of the skin and sub-cutaneous cellular tissue | -0.41 | 0.40 | 0.11 | 0.02 | -0.06 | -0.03 | -0.01 | -0.02 | 0.00 | 0.00 |
| Influenza | -0.41 | 0.51 | 0.28 | 0.10 | 0.06 | 0.02 | -0.02 | -0.02 | 0.00 | 0.00 |
| Ischemic cardiomyopathy | -0.14 | -0.18 | -0.17 | -0.02 | 0.02 | 0.02 | 0.01 | -0.01 | 0.09 | 0.01 |
| Kidney and urethra disease | -0.33 | 0.23 | 0.00 | -0.02 | -0.04 | -0.01 | -0.01 | 0.02 | 0.01 | 0.00 |
| Malignant melanoma | 0.63 | -0.34 | 0.25 | -0.32 | 0.06 | 0.07 | -0.06 | 0.02 | 0.00 | 0.00 |
| Malignant neplasm of the lymphatic and hematopoietic tissues | 0.33 | -0.18 | -0.10 | -0.11 | 0.07 | 0.02 | 0.02 | -0.03 | 0.01 | 0.00 |
| Malignant ovarian tumour | 0.28 | -0.59 | 0.07 | 0.09 | -0.01 | -0.03 | 0.01 | -0.01 | 0.00 | 0.00 |
| Malignant tumour in other parts of the uterus | 0.08 | -0.41 | 0.01 | 0.06 | 0.01 | 0.00 | -0.01 | 0.05 | 0.01 | 0.00 |
| Malignant tumour of the bladder | -0.13 | -0.25 | -0.23 | 0.01 | 0.08 | 0.01 | 0.00 | 0.02 | 0.01 | 0.00 |
| Malignant tumour of the breast | 0.21 | -0.49 | 0.17 | 0.03 | -0.06 | 0.03 | -0.02 | 0.02 | 0.02 | 0.01 |
| Malignant tumour of the cervix | 0.42 | -0.60 | 0.29 | -0.07 | -0.16 | 0.16 | -0.07 | 0.05 | 0.00 | 0.00 |
| Malignant tumour of the kidney | 0.07 | -0.45 | -0.10 | 0.07 | 0.11 | -0.02 | -0.03 | 0.01 | 0.00 | 0.00 |
| Malignant tumour of the larynx, trachea, bronchus and lungs | 0.21 | -0.69 | -0.01 | 0.14 | 0.08 | -0.04 | -0.01 | -0.01 | 0.04 | 0.02 |
| Malignant tumour of the lip, pharynx and mouth | 0.41 | -0.82 | 0.31 | 0.21 | -0.16 | -0.07 | 0.09 | 0.00 | 0.01 | 0.01 |
| Malignant tumour of the liver and intrahepatic biliary tract | 0.06 | -0.44 | -0.13 | 0.05 | 0.13 | -0.05 | -0.02 | 0.01 | 0.01 | 0.00 |
| Malignant tumour of the oesophogus | 0.21 | -0.69 | 0.07 | 0.17 | 0.01 | -0.06 | 0.02 | 0.03 | 0.01 | 0.01 |
| Malignant tumour of the of the colon | -0.13 | -0.18 | -0.17 | -0.01 | 0.03 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 |
| Malignant tumour of the of the pancreas | -0.01 | -0.37 | -0.15 | 0.04 | 0.10 | 0.02 | -0.04 | 0.01 | 0.01 | 0.00 |
| Malignant tumour of the prostate | -0.29 | -0.01 | -0.35 | -0.09 | 0.03 | 0.05 | 0.03 | -0.01 | 0.01 | 0.00 |
| Malignant tumour of the rectum and anus | -0.08 | -0.27 | -0.16 | 0.01 | 0.09 | 0.00 | -0.02 | -0.02 | 0.01 | 0.00 |
| Malignant tumour of the stomach | -0.12 | -0.21 | -0.19 | -0.02 | 0.05 | 0.02 | -0.01 | 0.01 | 0.02 | 0.00 |
| Meningitis | 0.63 | -0.22 | 0.02 | -0.05 | 0.03 | 0.08 | -0.12 | 0.04 | 0.00 | 0.00 |
| Meningococal disease | 1.93 | 0.72 | -0.55 | 0.77 | -0.09 | 0.08 | -0.02 | 0.05 | 0.00 | 0.00 |
| Other accidents | 1.18 | 0.30 | 0.01 | -0.02 | -0.02 | 0.00 | 0.01 | -0.01 | 0.02 | 0.03 |
| Other chronic respiritory illnesses | -0.27 | 0.06 | -0.11 | -0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.00 |
| Other congenital defects and chromosomal abnormalities | 1.09 | 0.19 | 0.04 | -0.29 | 0.13 | -0.27 | 0.12 | 0.01 | 0.00 | 0.00 |
| Other digestive conditions | -0.13 | 0.02 | -0.01 | -0.05 | -0.03 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 |
| Other diseases of the nervous system and sensory organs | -0.08 | 0.12 | -0.10 | -0.04 | -0.04 | 0.01 | 0.01 | 0.01 | 0.02 | 0.00 |
| Other diseases of the osteo-articular system + muscles and connecting tissue | -0.03 | 0.00 | -0.06 | 0.00 | 0.01 | 0.02 | -0.01 | 0.00 | 0.00 | 0.00 |
| Other endocrinological, metabolic and nutritional conditions | -0.27 | 0.23 | 0.12 | 0.03 | -0.02 | 0.01 | -0.01 | 0.00 | 0.01 | 0.00 |
| Other external injury and poisoning | 0.23 | -0.10 | -0.03 | -0.06 | 0.04 | 0.05 | -0.04 | -0.03 | 0.00 | 0.00 |
| Other genito-urinary diseases | -0.37 | 0.26 | -0.07 | -0.06 | -0.07 | 0.00 | 0.02 | 0.02 | 0.00 | 0.00 |
| Other heart disease | -0.36 | 0.28 | 0.07 | 0.01 | -0.02 | -0.01 | -0.01 | 0.00 | 0.10 | 0.02 |
| Other ill-defined symptoms and conditions | -0.49 | 0.59 | 0.47 | 0.18 | 0.16 | 0.05 | 0.03 | 0.01 | 0.03 | 0.03 |
| Other illnesses relating to circulation | -0.23 | 0.07 | -0.06 | -0.02 | -0.01 | 0.00 | -0.01 | 0.00 | 0.06 | 0.00 |
| Other infectious diseases and parasites | -0.05 | 0.01 | -0.07 | -0.04 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.00 |
| Other malignent tumours | 0.09 | -0.29 | -0.04 | 0.02 | 0.04 | -0.02 | -0.02 | 0.00 | 0.04 | 0.00 |
| Other psychological and behavioural disorders | -0.40 | 0.36 | -0.03 | -0.07 | -0.11 | -0.02 | 0.01 | -0.01 | 0.01 | 0.00 |
| Other respiratory ailments | -0.26 | 0.17 | 0.04 | 0.02 | 0.02 | -0.01 | 0.00 | -0.01 | 0.03 | 0.00 |
| Other tumours | 0.19 | -0.25 | 0.01 | -0.01 | 0.03 | 0.00 | -0.03 | 0.02 | 0.01 | 0.00 |
| Pneumonia | -0.38 | 0.38 | 0.17 | 0.04 | -0.03 | 0.00 | -0.01 | 0.00 | 0.01 | 0.00 |
| Rhumatoid arthritis and osteoarthritis | -0.23 | 0.02 | -0.22 | -0.08 | 0.00 | 0.00 | 0.00 | -0.05 | 0.00 | 0.00 |
| Road accidents | 2.47 | 1.05 | -0.33 | 0.36 | -0.03 | 0.01 | 0.00 | 0.00 | 0.02 | 0.15 |
| Suicides | 1.39 | 0.04 | 0.31 | -0.48 | 0.06 | -0.02 | 0.01 | 0.00 | 0.02 | 0.04 |
| Tuberculosis | 0.13 | -0.34 | 0.01 | -0.04 | 0.01 | 0.02 | -0.01 | -0.02 | 0.00 | 0.00 |
| Unknown or unspecified causes | 0.43 | 0.05 | 0.13 | -0.08 | 0.01 | 0.01 | 0.01 | 0.00 | 0.03 | 0.01 |
| Viral hepatitis | 0.49 | -0.10 | -0.09 | 0.04 | -0.02 | -0.02 | 0.00 | 0.00 | 0.00 | 0.00 |
inertiacoord, contrib and cos2.
Inertia is the way CA measures variation between row profiles. Total inertia is the \(\chi^2\) statistic divided by sample size.
Row inertia can be obtained by multiplying the row marginal probability by the squared Euclidean norm of the row in the principal coordinate matrix.
with (ca_79_row,
sum(abs(r* (rowSums(coord^2)) - inertia))
)[1] 1.877449e-16
cos2cos2 are the share of row inertia from the corresponding cell in coord
with (ca_79_row,
norm((diag(r/inertia) %*% coord^2) - cos2, type='F')
)[1] 5.432216e-16
contribNot too surprisingly, coord, contrib, and cos2 share the same row names and column names.
sum(ca_79$call$X)[1] 529974
sum((rowSums(ca_79$call$X)/sum(ca_79$call$X) - r)^2)[1] 6.311339e-35
The Row Profiles are the rows of matrix R below
P <- as.matrix(with(ca_79$call, Xtot/N))
coord <- ca_79_row$coord
inertia <- ca_79_row$inertia
r <- ca_79$call$marge.row
c <- colSums(P)
n <- nrow(P)
p <- ncol(P)
R <- diag(r^(-1)) %*% P
Q <- R - matrix(1, nrow = n, ncol = n) %*% PM <- diag(r^(-1)) %*% P %*% diag(c^(-1)) - matrix(1, nrow=n, ncol=p)
n * norm(diag(r^(1/2)) %*% M %*% diag(c^(1/2)), type = "F")^2[1] 29.39279
We can now display a scatterplot from component coord. This is called a Row Plot.
p_scat <- (
prep_rows |>
ggplot() +
aes(x=`Dim 1`, y=`Dim 2`, label=name) +
geom_point() +
coord_fixed()
)
p_scat |> plotly::ggplotly()With little effort, it is possible to scale the points so as to tell the reader the relative numerical importance of each Cause of death. Coloring/filling the points using inertia also helps: high inertia rows match light-colored points.
ppp <- prep_rows |>
ggplot() +
aes(x=`Dim 1`,
y=`Dim 2`,
label=name,
size=prop,
fill=log10(inertia),
color=log10(inertia)) +
geom_point(alpha=0.75) +
scale_size_area() +
coord_fixed() +
scale_fill_viridis_c(aesthetics=c("fill", "color"),
guide="colorbar",
direction = 1) +
ggtitle(
"Mortality France 1979: Row plot"
)
ppp |> plotly::ggplotly()# (ca_79$row)$contribPlot the result of row profile analysis using plot.CA from FactoMineR.
Perform column profiles analysis
names(ca_79_row)[1] "coord" "contrib" "cos2" "inertia"
age_group_names <- str_match(rownames(ca_79$col$coord), '([\\w \\-]*) \\(79\\)')[,2]
prep_cols <- ca_79$col$coord |>
as_tibble() |>
mutate(name= age_group_names) |>
relocate(name) |>
mutate(prop=c, inertia=ca_79$col$inertia)(
prep_cols |>
ggplot() +
aes(x=`Dim 1`,
y=`Dim 2`,
label=name,
size=prop,
fill=log10(inertia),
color=log10(inertia)) +
geom_point(alpha=0.75) +
scale_size_area() +
coord_fixed() +
scale_fill_viridis_c(aesthetics=c("fill", "color"),direction = 1) +
ggtitle(
"Mortality France 1979: Col plot"
)) |> plotly::ggplotly()Build the symmetric plots (biplots) for correspondence analysis of Mortalitity data
plot.CA(ca_79)(
prep_rows |>
ggplot() +
aes(x=`Dim 1`,
y=`Dim 2`,
label=name,
size=prop,
fill=log10(inertia),
color=log10(inertia)) +
geom_point(alpha=0.75) +
scale_size_area() +
coord_fixed() +
scale_fill_viridis_c(aesthetics=c("fill", "color"),direction = 1) +
geom_point(data = prep_cols,
aes(x=`Dim 1`,
y=`Dim 2`,
label=name,
size=prop,
fill=log10(inertia),
color=log10(inertia)
),
shape="square",
alpha=.5,
)
) |> plotly::ggplotly()Warning in geom_point(data = prep_cols, aes(x = `Dim 1`, y = `Dim 2`, label =
name, : Ignoring unknown aesthetics: label
It is convenient to use distinct color scales for rows and columns.
(
prep_rows |>
ggplot() +
scale_size_area() +
coord_fixed() +
aes(x=`Dim 1`,
y=`Dim 2`,
text=name,
size=prop,
fill=log10(inertia)) +
geom_point(alpha=0.75) +
scale_fill_viridis_c(option="D") +
geom_point(data = prep_cols,
aes(x=`Dim 1`,
y=`Dim 2`,
text=name,
size=prop,
color=log10(inertia)
),
shape="square",
alpha=.5,
) +
scale_color_viridis_c(option="F") +
theme_minimal(
)
) |> plotly::ggplotly()Warning in geom_point(data = prep_cols, aes(x = `Dim 1`, y = `Dim 2`, text =
name, : Ignoring unknown aesthetics: text
Mosaic plots provide an alternative way of exploring contingency tables. They are particularly handy when handling 2-way contingency tables.
Draw mosaic plots for the two contingency tables living inside mortality datasets.
mortality |>
select(ends_with('(06)')) |>
chisq.test() |>
broom::glance()Warning in chisq.test(select(mortality, ends_with("(06)"))): Chi-squared
approximation may be incorrect
# A tibble: 1 × 4
statistic p.value parameter method
<dbl> <dbl> <int> <chr>
1 229784. 0 488 Pearson's Chi-squared test
mortality |>
select(ends_with('(06)')) |>
as.matrix() |>
as.table() |>
mosaicplot(color = T)Are you able to deliver an interpretation of this Correspondence Analysis?
Build the standardized matrix for row profiles analysis. Compute the pairwise distance matrix using the \(\chi^2\) distances. Should you work centered row profiles?
We use the weighted \(\ell_2\) distances defined by the product of the two marginal distributions. The squared distance between the conditional probabilities defined by rows \(a\) and \(a'\) is \[\sum_{b} \frac{\left( N_{a,b}/N_{a,.} - N_{a',b}/N_{a',.}\right)^2}{N_{.,b}/N}\]
The \(\ell_2\) distance between the rows of the principal coordinates matrix row$coord coincides since they are all centered and normalized with respect to \((N_{.,b}/N)\).
dist_Causes_79 <- ca_79$row$coord[,1:8] |>
dist()hc_79 <- hclust(dist_Causes_79, method = "single")stopifnot(
require(ggdendro),
require(dendextend),
require(sloop)
)The instance of hclust is transformed into a an object of class dendro. Class dendro is equipped with a variety of functions/methods for analyzing, visualizing, and exploiting the result of hclust().
dendro_79 <- dendro_data(hc_79)class(dendro_79)[1] "dendro"
(
dendro_79 |>
ggdendrogram(
leaf_labels = T,
rotate = T) +
ggdendro::theme_dendro() +
scale_y_reverse()
) |> plotly::ggplotly()Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Perform hierarchical clustering of row profiles with method/linkage "single". Check the definition of the method. Did you know the underlying algorithm? If yes, in which context did you get acquainted with this algorithm?
Choose the number of classes (provide justification).
Can you explain the size of the different classes in the partition?
Row profiles that do not belong to the majority class are called atypical.
Compute the share of inertia of atypical row profiles.
Draw a symmetric plot (biplot) outlining the atypical row profiles.
Calculate the theoretical population table for deces. Do you possible to carry out a chi-squared test?
Perform a hierarchical classification of the line profiles into two classes.
Merge the rows of deces corresponding to the same class (you can use the the tapply function), and perform a chi-square test. chi-square test. What’s the conclusion?
Why is it more advantageous to carry out this grouping into two classes compared to arbitrarily grouping two classes, in order to prove the dependence between these two variables?
Represent individuals from the majority class. Do they all seem to you to correspond to an average profile?
Try to explain this phenomenon considering the way in which hierarchical classification uses the Single Linkage method.
The mortality dataset should be taken with grain of salt. Assigning a single Cause to every death is not a trivial task. It is even questionable: if somebody dies from some infection beCause she could not be cured using an available drug due to another preexisting pathology, who is the culprit?