Code
stopifnot(
require(FactoMineR),
require(factoextra),
require(FactoInvestigate)
)
M1 MIDS/MFA/LOGOS |
Année 2024 |
Besides the usual packages (tidyverse
, …), we shall require FactoMineR
and related packages.
stopifnot(
require(FactoMineR),
require(factoextra),
require(FactoInvestigate)
)
mortality
datasetThe goal is to investigate a possible link between age group and Cause of death. We work with dataset mortality
from package FactoMineR
data("mortality", package = "FactoMineR")
#help(mortality)
A data frame with 62 rows (the different Causes of death) and 18 columns. Each column corresponds to an age interval (15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, 85-94, 95 and more) in a year. The 9 first columns correspond to data in 1979 and the 9 last columns to data in 2006. In each cell, the counts of deaths for a Cause of death in an age interval (in a year) is given.
See also EuroStat:
Read the documentation of the mortality
dataset. Is this a sample? an aggregated dataset?
If you consider mortality
as an agregated dataset, can you figure out the organization of the sample mortality
was built from?
The mortality
dataset is an aggregated dataset. It has been built from two samples. Each sample was built from the collection of death certificates from one calendar year in France (years 1999 and 2006). From each death certificate, two categorical pieces of information were extracted: age group of the deceased and a Cause of death. Each sample was then grouped by age group and Cause of death and counts were computed. This defines a two-ways contingency table in long form. The contingency table in wide form is obtained by pivoting: pick column names from column age group and values from counts. Column Cause of depth provide row names.
The final form of the dataset is obtained by concatenating the two contingency tables along the second axis.
<- mortality |>
mortality mutate(Cause = rownames(mortality)) |>
mutate(Cause = factor(Cause)) |>
relocate(Cause)
<- function(gt_tbl){
my_gt |>
gt_tbl tab_style(
style = list(
"font-variant: small-caps;"
),locations = cells_body(columns = Cause)
|>
) ::cols_align(
gtalign="left",
columns=Cause
)
}
|>
mortality select(Cause, ends_with('(06)')) |>
sample_n(10) |>
::gt() |>
gtmy_gt()
Cause | 15-24 (06) | 25-34 (06) | 35-44 (06) | 45-54 (06) | 55-64 (06) | 65-74 (06) | 75-84 (06) | 85-94 (06) | 95 and more (06) |
---|---|---|---|---|---|---|---|---|---|
Kidney and urethra disease | 3 | 10 | 35 | 106 | 266 | 655 | 2432 | 2454 | 545 |
Addiction to prescription medication | 18 | 77 | 72 | 15 | 4 | 2 | 1 | 0 | 0 |
Blood and hematopoietic disorders | 20 | 22 | 43 | 94 | 159 | 259 | 642 | 670 | 213 |
Malignant tumour of the larynx, trachea, bronchus and lungs | 0 | 38 | 681 | 4059 | 7285 | 8026 | 7678 | 1869 | 128 |
Other ill-defined symptoms and conditions | 42 | 126 | 346 | 752 | 1077 | 1746 | 4943 | 7920 | 4253 |
Unknown or unspecified causes | 227 | 476 | 928 | 1517 | 1585 | 1487 | 2446 | 2298 | 872 |
Homicides | 52 | 83 | 65 | 63 | 42 | 27 | 21 | 6 | 0 |
Viral hepatitis | 0 | 9 | 76 | 117 | 94 | 143 | 224 | 61 | 6 |
Malignant ovarian tumour | 4 | 12 | 63 | 302 | 603 | 863 | 1054 | 402 | 38 |
Rhumatoid arthritis and osteoarthritis | 0 | 1 | 0 | 4 | 14 | 99 | 231 | 176 | 54 |
Before proceeding to Correspondence Analysis (CA), let us tidy up the table and draw some elementary plots.
mortality
, so as to obtain a tibble with columns Cause
, year
, while keeping all columns named after age groups (tidy up the data so as to obtain a tibble in partially long format).rowwise()
and sum(c_cross())
so as to compute the total number of deaths per year
and Cause
in column total
. This allows to mimic rowSums()
inside a pipeline. Column grand_total
is computed using a window function over grouping by Cause
.<- mortality |>
mortality_long pivot_longer(
cols=-Cause,
cols_vary="slowest",
names_to=c(".value", "year"),
names_pattern="([\\w\\- ]*) \\(([0-9]{2})\\)"
|>
) mutate(year=ifelse(year=='06', 2006, 1979)) |>
rowwise() |>
mutate(total_year=sum(c_across(-c(Cause, year)))) |>
group_by(Cause) |>
mutate(grand_total = sum(total_year)) |>
ungroup()
|>
mortality_long slice_sample(n=10) |>
::gt() |>
gtmy_gt() |>
::tab_caption("A sample of rows from Mortality table in long form") gt
Cause | year | 15-24 | 25-34 | 35-44 | 45-54 | 55-64 | 65-74 | 75-84 | 85-94 | 95 and more | total_year | grand_total |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Events of undetermined intention | 1979 | 297 | 382 | 296 | 329 | 239 | 279 | 221 | 66 | 1 | 2110 | 2558 |
Gastroduodenal ulcer | 2006 | 3 | 1 | 13 | 38 | 48 | 99 | 264 | 244 | 55 | 765 | 2981 |
Cerebrovascular disease | 2006 | 35 | 75 | 311 | 902 | 1575 | 3719 | 12172 | 11385 | 2464 | 32638 | 98795 |
Malignant tumour of the of the colon | 1979 | 10 | 39 | 114 | 556 | 1130 | 2898 | 3541 | 1257 | 48 | 9593 | 21753 |
Meningococal disease | 2006 | 8 | 0 | 1 | 1 | 0 | 4 | 2 | 0 | 0 | 16 | 60 |
Diabetes | 1979 | 17 | 43 | 61 | 232 | 585 | 2064 | 3044 | 1018 | 47 | 7111 | 18083 |
Other tumours | 2006 | 39 | 45 | 141 | 300 | 564 | 1010 | 2192 | 1713 | 318 | 6322 | 10399 |
Malignant tumour of the of the colon | 2006 | 4 | 17 | 135 | 538 | 1421 | 2572 | 4641 | 2486 | 346 | 12160 | 21753 |
Other ill-defined symptoms and conditions | 1979 | 41 | 75 | 116 | 276 | 511 | 1635 | 5641 | 7330 | 1500 | 17125 | 38330 |
Homicides | 1979 | 92 | 116 | 115 | 65 | 50 | 33 | 35 | 10 | 0 | 516 | 875 |
A truly tidy version of the dataset can be obtained from further pivoting.
<- mortality_long |>
mortality_tidy pivot_longer(
cols=-c(year,Cause,total_year, grand_total),
cols_vary="slowest",
names_to=c("age_range"),
values_to=c("#deaths")
|>
) mutate(age_range = factor(age_range, levels=sort(unique(age_range)),ordered=T))
|>
mortality_tidy sample_n(5) |>
::gt() gt
Cause | year | total_year | grand_total | age_range | #deaths |
---|---|---|---|---|---|
Other ill-defined symptoms and conditions | 2006 | 21205 | 38330 | 25-34 | 126 |
Malignant ovarian tumour | 2006 | 3341 | 5661 | 85-94 | 402 |
Road accidents | 2006 | 4648 | 15255 | 95 and more | 8 |
Meningococal disease | 1979 | 44 | 60 | 85-94 | 0 |
Chronic liver disease | 2006 | 7669 | 23596 | 35-44 | 453 |
Build a bar plot to display the importance of Causes of deaths in France in years 1979 and 2006
<- theme_get()
th
(|>
mortality_long mutate(Cause=fct_reorder(Cause, desc(grand_total))) |>
mutate(year=as_factor(year)) |>
ggplot() +
scale_fill_discrete() +
aes(x=Cause,
y=total_year,
fill=year) +
geom_col(position=position_dodge()) +
theme(
legend.position="none",
axis.text.x=element_blank(), #remove x axis labels
axis.ticks.x=element_blank(), #remove x axis ticks
+
) labs(
title = "Causes of death, France, 1979, 2006",
subtitle= "Raw counts"
+
) xlab(label=NULL)
|>
) ::ggplotly() plotly
<- theme_set(th) oth
Compute and display the total number of deaths in France in years 1979 and 2006.
|>
mortality_long group_by(year) |>
summarise(total_deaths = sum(total_year)) |>
::gt() |>
gt::cols_label(
gtyear= "Year",
total_deaths = "#Deaths") |>
::tab_caption("Mortality in France") gt
Year | #Deaths |
---|---|
1979 | 529974 |
2006 | 510921 |
Compute the marginal counts for each year (1979, 2006). Compare.
Counts have already been computed above.
|>
mortality_long select(Cause, year, total_year, grand_total) |>
pivot_wider(
id_cols=c(Cause, grand_total),
names_from = year,
values_from = total_year) |>
rename(Total=grand_total) |>
arrange(desc(Total)) |>
::gt() |>
gtmy_gt()
Cause | Total | 1979 | 2006 |
---|---|---|---|
Cerebrovascular disease | 98795 | 66157 | 32638 |
Other heart disease | 97297 | 54105 | 43192 |
Ischemic cardiomyopathy | 88338 | 49532 | 38806 |
Other illnesses relating to circulation | 61937 | 31218 | 30719 |
Malignant tumour of the larynx, trachea, bronchus and lungs | 50604 | 20840 | 29764 |
Other malignent tumours | 48809 | 23262 | 25547 |
Other diseases of the nervous system and sensory organs | 38891 | 12056 | 26835 |
Other ill-defined symptoms and conditions | 38330 | 17125 | 21205 |
Other digestive conditions | 32697 | 18092 | 14605 |
Other respiratory ailments | 26339 | 14197 | 12142 |
Unknown or unspecified causes | 26192 | 14356 | 11836 |
Chronic liver disease | 23596 | 15927 | 7669 |
Other accidents | 23353 | 10921 | 12432 |
Malignant tumour of the of the colon | 21753 | 9593 | 12160 |
Suicides | 20337 | 9952 | 10385 |
Malignant tumour of the breast | 20236 | 8605 | 11631 |
Malignant neplasm of the lymphatic and hematopoietic tissues | 20000 | 7589 | 12411 |
Diabetes | 18083 | 7111 | 10972 |
Falls | 17711 | 12503 | 5208 |
Other chronic respiritory illnesses | 17331 | 9680 | 7651 |
Other psychological and behavioural disorders | 17160 | 3749 | 13411 |
Malignant tumour of the prostate | 15514 | 6577 | 8937 |
Road accidents | 15255 | 10607 | 4648 |
Pneumonia | 14700 | 5057 | 9643 |
Malignant tumour of the stomach | 13783 | 9020 | 4763 |
Other endocrinological, metabolic and nutritional conditions | 13665 | 6030 | 7635 |
Kidney and urethra disease | 13613 | 7107 | 6506 |
Malignant tumour of the of the pancreas | 12851 | 4588 | 8263 |
Other infectious diseases and parasites | 11466 | 4045 | 7421 |
Malignant tumour of the liver and intrahepatic biliary tract | 11452 | 4001 | 7451 |
Other tumours | 10399 | 4077 | 6322 |
Malignant tumour of the lip, pharynx and mouth | 9722 | 5606 | 4116 |
Malignant tumour of the oesophogus | 9267 | 5430 | 3837 |
Malignant tumour of the rectum and anus | 9246 | 4980 | 4266 |
Malignant tumour of the bladder | 8322 | 3633 | 4689 |
Alcohol abuse and alcohol-related psychosis | 6327 | 3371 | 2956 |
Malignant ovarian tumour | 5661 | 2320 | 3341 |
Malignant tumour in other parts of the uterus | 5263 | 2940 | 2323 |
Malignant tumour of the kidney | 5205 | 2101 | 3104 |
Other genito-urinary diseases | 4387 | 2552 | 1835 |
Blood and hematopoietic disorders | 4299 | 2177 | 2122 |
Other diseases of the osteo-articular system + muscles and connecting tissue | 3935 | 1025 | 2910 |
Infections of the skin and sub-cutaneous cellular tissue | 3439 | 1649 | 1790 |
Gastroduodenal ulcer | 2981 | 2216 | 765 |
Tuberculosis | 2797 | 2070 | 727 |
Other external injury and poisoning | 2731 | 1023 | 1708 |
Events of undetermined intention | 2558 | 2110 | 448 |
Asthma | 2488 | 1495 | 993 |
Malignant melanoma | 2205 | 658 | 1547 |
Malignant tumour of the cervix | 1527 | 824 | 703 |
Accidental poisoning | 1510 | 503 | 1007 |
Rhumatoid arthritis and osteoarthritis | 1284 | 705 | 579 |
Influenza | 1166 | 1051 | 115 |
Viral hepatitis | 1059 | 329 | 730 |
Homicides | 875 | 516 | 359 |
Other congenital defects and chromosomal abnormalities | 592 | 145 | 447 |
Congenital defects of the circulatory system | 540 | 275 | 265 |
Meningitis | 481 | 362 | 119 |
Addiction to prescription medication | 222 | 33 | 189 |
Complications in pregnancy and childbirth | 150 | 91 | 59 |
Congenital defects of the nervous system | 109 | 61 | 48 |
Meningococal disease | 60 | 44 | 16 |
Start from a 2-way contingency table \(X\) with \(\sum_{i,j} X_{i,j}=N\)
Normalize \(P = \frac{1}{N}X\) (correspondance matrix)
Let \(r\) (resp. \(c\)) be the row (resp. column) wise sums vector
Let \(D_r=\text{diag}(r)\) denote the diagonal matrix with row sums of \(P\) as coefficients
Let \(D_c=\text{diag}(c)\) denote the diagonal matrix with column sums of \(P\) as coefficients
The row profiles matrix is \(D_r^{-1} \times P\)
The standardized residuals matrix is \(S = D_r^{-1/2} \times \left(P - r c^\top\right) \times D_c^{-1/2}\)
CA consists in computing the SVD of the standardized residuals matrix \(S = U \times D \times V^\top\)
From the SVD, we get
When calling svd(.)
, the argument should be \[D_r^{1/2}\times \left(D_r^{-1} \times P \times D_c^{-1}- \mathbf{I}\times \mathbf{I}^\top \right)\times D_c^{1/2}= D_r^{-1/2}\times \left( P - r \times c^\top \right)\times D_c^{-1/2}\]
As \[D_r^{-1} \times P \times D_c^{-1} - \mathbf{I}\mathbf{I}^\top = (D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top\]
\((D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top\) is the extended SVD of \[D_r^{-1} \times P \times D_c^{-1} - \mathbf{I}\mathbf{I}^\top\] with respect to \(D_r\) and \(D_c\)
Perform CA on the two contingency tables.
You may use FactoMineR::CA()
. It is interesting to compute the correspondence analysis in your own way, by preparing the matrix that is handled to svd()
and returning a named list containing all relevant information.
Do the Jedi and Sith build their own light sabers? Jedi do. It’s a key part of the religion to have a kyber crystal close to you, to build the saber through the power of the force creating a blade unique and in tune with them
<- list()
lst_ca
for (y in c('79', '06')) {
<- mortality |>
lst_ca[[y]] select(ends_with(glue('({y})'))) |>
::CA(ncp=8, graph = F)
FactoMineR }
<- map(c('79', '06'),
lst select(mortality, ends_with(glue('({x})'))) |>
\(x) ::CA(ncp=8, graph = F)
FactoMineR )
If you did use FactoMineR::CA()
, explain the organization of the result.
The result of FactoMineR::CA(...)
is a named and nested list with five elements:
eig
call
CA()
, an object of type language
, telling (in principle) the user how CA()
was called. However, this is a quoted expression. Here we need to guess the value of y
in the calling environment understand what’s going on.
1]]$call$call lst_ca[[
FactoMineR::CA(X = select(mortality, ends_with(glue("({y})"))),
ncp = 8, graph = F)
Element call
also contains the table margin distributions marge.col
and marge.row
. The truncation rank ncp
(number of components) can be assigned before computing the SVD (default value is 5). Element \(X\) stores the contingency table that was effectively used for computing Correpondence Analysis.
row
col
row
. Used for column profiles analysis
svd
svd()
containing the singular values, the left and right singular vectors of matrix \(...\)
In principle, all relevant information can be gathered from components svd
, call.marge.row
, and call.marge.col
.
Draw screeplots. Why are they useful? Comment briefly.
<- lst_ca[[1]]
ca_79
$eig |>
ca_79as_tibble() |>
mutate(across(where(is.numeric), ~ round(.x, digits=2))) |>
::gt() gt
eigenvalue | percentage of variance | cumulative percentage of variance |
---|---|---|
0.29 | 61.00 | 61.00 |
0.14 | 28.98 | 89.98 |
0.03 | 6.13 | 96.12 |
0.01 | 2.86 | 98.98 |
0.00 | 0.73 | 99.70 |
0.00 | 0.17 | 99.88 |
0.00 | 0.09 | 99.97 |
0.00 | 0.03 | 100.00 |
$eig |>
ca_79as_tibble() |>
rownames_to_column(var="PC") |>
rename(percent=eigenvalue, cumulative=`cumulative percentage of variance`) |>
ggplot() +
aes(x=PC, y=percent, label=round(cumulative,2)) +
geom_text(angle=45, vjust=-1, hjust=-0.1) +
geom_col(fill=NA, colour="black") +
ylab("Squared singular values") +
ylim(c(0, .4)) +
labs(
title="Screeplot for CA",
subtitle = "Mortality 1979: Age Group versus Causes of Death"
)
Perform row profiles analysis.
What are the classical plots? How can you build them from the output of FactoMiner::CA
?
Build the table of row contributions (the so-called \(\cos^2\))
Attribute row
of objects of class CA
(exported from FactoMineR
) is the starting point of any row profiles analysis.
<- ca_79$row ca_79_row
Attribute row
is a named list made of \(4\) components.
coord
coord
matches the number of rows of the contingency table (here, the number of possible death Causes). The number of columns matches the rank of the truncated SVD that underlies Correspondance Analysis. Here it is \(5\) which also the rank of the standardized contingency table.
The row principal coordinates are the principal coordinates of each row profile in terms of the principal component.
The columns of coord
are pairwise orthogonal in the inner product space defined by diag(call$marge.row)
(which embodies the marginal probabilities of the so-called Causes of deaths)
<- ca_79$row$coord
x <- ca_79$call$marge.row
r
<- round(t(x) %*% diag(r) %*% x, 2)
A
<- function (A, tol=1e-2){
is_diagonal norm(diag(diag(A))-A, type='F') <= tol
}
# We expect A to be diagonal
is_diagonal(A)
[1] TRUE
We can recover row$coord
from the left singular vectors and the singular values:
with(ca_79,
norm(row$coord - with(svd, U %*% diag(vs[1:ca_79$call$ncp])), 'F')
)
[1] 0
<- ca_79_row$coord |>
prep_rows as_tibble() |>
mutate(name= rownames(ca_79_row$coord)) |>
relocate(name) |>
mutate(prop=r, inertia=ca_79_row$inertia)
|>
prep_rows mutate(across(where(is.numeric), \(x) round(x,2))) |>
::gt() gt
name | Dim 1 | Dim 2 | Dim 3 | Dim 4 | Dim 5 | Dim 6 | Dim 7 | Dim 8 | prop | inertia |
---|---|---|---|---|---|---|---|---|---|---|
Accidental poisoning | 1.53 | 0.52 | 0.04 | -0.17 | 0.01 | -0.03 | -0.01 | -0.05 | 0.00 | 0.00 |
Addiction to prescription medication | 1.75 | 0.44 | -0.12 | 0.27 | -0.12 | -0.24 | 0.19 | 0.22 | 0.00 | 0.00 |
Alcohol abuse and alcohol-related psychosis | 0.65 | -0.74 | 0.44 | -0.08 | -0.17 | 0.09 | 0.02 | -0.02 | 0.01 | 0.01 |
Asthma | -0.01 | -0.10 | -0.12 | -0.05 | 0.02 | 0.00 | 0.05 | 0.03 | 0.00 | 0.00 |
Blood and hematopoietic disorders | -0.08 | 0.07 | -0.07 | -0.05 | -0.04 | -0.02 | -0.01 | 0.03 | 0.00 | 0.00 |
Cerebrovascular disease | -0.32 | 0.17 | -0.06 | -0.04 | -0.06 | -0.01 | 0.00 | 0.00 | 0.12 | 0.02 |
Chronic liver disease | 0.49 | -0.83 | 0.30 | 0.09 | -0.08 | 0.04 | -0.01 | -0.01 | 0.03 | 0.03 |
Complications in pregnancy and childbirth | 3.53 | 1.80 | 0.28 | -2.09 | 0.99 | -0.77 | 0.15 | 0.06 | 0.00 | 0.00 |
Congenital defects of the circulatory system | 1.94 | 0.54 | 0.01 | -0.10 | 0.11 | -0.01 | -0.08 | 0.11 | 0.00 | 0.00 |
Congenital defects of the nervous system | 2.36 | 0.98 | -0.23 | 0.47 | -0.07 | -0.38 | 0.23 | -0.07 | 0.00 | 0.00 |
Diabetes | -0.21 | -0.05 | -0.24 | -0.07 | 0.02 | 0.02 | 0.03 | 0.01 | 0.01 | 0.00 |
Events of undetermined intention | 1.71 | 0.35 | 0.24 | -0.46 | 0.05 | -0.03 | 0.01 | -0.01 | 0.00 | 0.01 |
Falls | -0.34 | 0.49 | 0.27 | 0.07 | -0.02 | -0.05 | -0.05 | -0.02 | 0.02 | 0.01 |
Gastroduodenal ulcer | -0.11 | -0.14 | -0.08 | -0.03 | -0.03 | 0.01 | 0.01 | -0.01 | 0.00 | 0.00 |
Homicides | 2.19 | 0.62 | 0.35 | -0.72 | 0.02 | 0.22 | -0.18 | 0.08 | 0.00 | 0.01 |
Infections of the skin and sub-cutaneous cellular tissue | -0.41 | 0.40 | 0.11 | 0.02 | -0.06 | -0.03 | -0.01 | -0.02 | 0.00 | 0.00 |
Influenza | -0.41 | 0.51 | 0.28 | 0.10 | 0.06 | 0.02 | -0.02 | -0.02 | 0.00 | 0.00 |
Ischemic cardiomyopathy | -0.14 | -0.18 | -0.17 | -0.02 | 0.02 | 0.02 | 0.01 | -0.01 | 0.09 | 0.01 |
Kidney and urethra disease | -0.33 | 0.23 | 0.00 | -0.02 | -0.04 | -0.01 | -0.01 | 0.02 | 0.01 | 0.00 |
Malignant melanoma | 0.63 | -0.34 | 0.25 | -0.32 | 0.06 | 0.07 | -0.06 | 0.02 | 0.00 | 0.00 |
Malignant neplasm of the lymphatic and hematopoietic tissues | 0.33 | -0.18 | -0.10 | -0.11 | 0.07 | 0.02 | 0.02 | -0.03 | 0.01 | 0.00 |
Malignant ovarian tumour | 0.28 | -0.59 | 0.07 | 0.09 | -0.01 | -0.03 | 0.01 | -0.01 | 0.00 | 0.00 |
Malignant tumour in other parts of the uterus | 0.08 | -0.41 | 0.01 | 0.06 | 0.01 | 0.00 | -0.01 | 0.05 | 0.01 | 0.00 |
Malignant tumour of the bladder | -0.13 | -0.25 | -0.23 | 0.01 | 0.08 | 0.01 | 0.00 | 0.02 | 0.01 | 0.00 |
Malignant tumour of the breast | 0.21 | -0.49 | 0.17 | 0.03 | -0.06 | 0.03 | -0.02 | 0.02 | 0.02 | 0.01 |
Malignant tumour of the cervix | 0.42 | -0.60 | 0.29 | -0.07 | -0.16 | 0.16 | -0.07 | 0.05 | 0.00 | 0.00 |
Malignant tumour of the kidney | 0.07 | -0.45 | -0.10 | 0.07 | 0.11 | -0.02 | -0.03 | 0.01 | 0.00 | 0.00 |
Malignant tumour of the larynx, trachea, bronchus and lungs | 0.21 | -0.69 | -0.01 | 0.14 | 0.08 | -0.04 | -0.01 | -0.01 | 0.04 | 0.02 |
Malignant tumour of the lip, pharynx and mouth | 0.41 | -0.82 | 0.31 | 0.21 | -0.16 | -0.07 | 0.09 | 0.00 | 0.01 | 0.01 |
Malignant tumour of the liver and intrahepatic biliary tract | 0.06 | -0.44 | -0.13 | 0.05 | 0.13 | -0.05 | -0.02 | 0.01 | 0.01 | 0.00 |
Malignant tumour of the oesophogus | 0.21 | -0.69 | 0.07 | 0.17 | 0.01 | -0.06 | 0.02 | 0.03 | 0.01 | 0.01 |
Malignant tumour of the of the colon | -0.13 | -0.18 | -0.17 | -0.01 | 0.03 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 |
Malignant tumour of the of the pancreas | -0.01 | -0.37 | -0.15 | 0.04 | 0.10 | 0.02 | -0.04 | 0.01 | 0.01 | 0.00 |
Malignant tumour of the prostate | -0.29 | -0.01 | -0.35 | -0.09 | 0.03 | 0.05 | 0.03 | -0.01 | 0.01 | 0.00 |
Malignant tumour of the rectum and anus | -0.08 | -0.27 | -0.16 | 0.01 | 0.09 | 0.00 | -0.02 | -0.02 | 0.01 | 0.00 |
Malignant tumour of the stomach | -0.12 | -0.21 | -0.19 | -0.02 | 0.05 | 0.02 | -0.01 | 0.01 | 0.02 | 0.00 |
Meningitis | 0.63 | -0.22 | 0.02 | -0.05 | 0.03 | 0.08 | -0.12 | 0.04 | 0.00 | 0.00 |
Meningococal disease | 1.93 | 0.72 | -0.55 | 0.77 | -0.09 | 0.08 | -0.02 | 0.05 | 0.00 | 0.00 |
Other accidents | 1.18 | 0.30 | 0.01 | -0.02 | -0.02 | 0.00 | 0.01 | -0.01 | 0.02 | 0.03 |
Other chronic respiritory illnesses | -0.27 | 0.06 | -0.11 | -0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.00 |
Other congenital defects and chromosomal abnormalities | 1.09 | 0.19 | 0.04 | -0.29 | 0.13 | -0.27 | 0.12 | 0.01 | 0.00 | 0.00 |
Other digestive conditions | -0.13 | 0.02 | -0.01 | -0.05 | -0.03 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 |
Other diseases of the nervous system and sensory organs | -0.08 | 0.12 | -0.10 | -0.04 | -0.04 | 0.01 | 0.01 | 0.01 | 0.02 | 0.00 |
Other diseases of the osteo-articular system + muscles and connecting tissue | -0.03 | 0.00 | -0.06 | 0.00 | 0.01 | 0.02 | -0.01 | 0.00 | 0.00 | 0.00 |
Other endocrinological, metabolic and nutritional conditions | -0.27 | 0.23 | 0.12 | 0.03 | -0.02 | 0.01 | -0.01 | 0.00 | 0.01 | 0.00 |
Other external injury and poisoning | 0.23 | -0.10 | -0.03 | -0.06 | 0.04 | 0.05 | -0.04 | -0.03 | 0.00 | 0.00 |
Other genito-urinary diseases | -0.37 | 0.26 | -0.07 | -0.06 | -0.07 | 0.00 | 0.02 | 0.02 | 0.00 | 0.00 |
Other heart disease | -0.36 | 0.28 | 0.07 | 0.01 | -0.02 | -0.01 | -0.01 | 0.00 | 0.10 | 0.02 |
Other ill-defined symptoms and conditions | -0.49 | 0.59 | 0.47 | 0.18 | 0.16 | 0.05 | 0.03 | 0.01 | 0.03 | 0.03 |
Other illnesses relating to circulation | -0.23 | 0.07 | -0.06 | -0.02 | -0.01 | 0.00 | -0.01 | 0.00 | 0.06 | 0.00 |
Other infectious diseases and parasites | -0.05 | 0.01 | -0.07 | -0.04 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.00 |
Other malignent tumours | 0.09 | -0.29 | -0.04 | 0.02 | 0.04 | -0.02 | -0.02 | 0.00 | 0.04 | 0.00 |
Other psychological and behavioural disorders | -0.40 | 0.36 | -0.03 | -0.07 | -0.11 | -0.02 | 0.01 | -0.01 | 0.01 | 0.00 |
Other respiratory ailments | -0.26 | 0.17 | 0.04 | 0.02 | 0.02 | -0.01 | 0.00 | -0.01 | 0.03 | 0.00 |
Other tumours | 0.19 | -0.25 | 0.01 | -0.01 | 0.03 | 0.00 | -0.03 | 0.02 | 0.01 | 0.00 |
Pneumonia | -0.38 | 0.38 | 0.17 | 0.04 | -0.03 | 0.00 | -0.01 | 0.00 | 0.01 | 0.00 |
Rhumatoid arthritis and osteoarthritis | -0.23 | 0.02 | -0.22 | -0.08 | 0.00 | 0.00 | 0.00 | -0.05 | 0.00 | 0.00 |
Road accidents | 2.47 | 1.05 | -0.33 | 0.36 | -0.03 | 0.01 | 0.00 | 0.00 | 0.02 | 0.15 |
Suicides | 1.39 | 0.04 | 0.31 | -0.48 | 0.06 | -0.02 | 0.01 | 0.00 | 0.02 | 0.04 |
Tuberculosis | 0.13 | -0.34 | 0.01 | -0.04 | 0.01 | 0.02 | -0.01 | -0.02 | 0.00 | 0.00 |
Unknown or unspecified causes | 0.43 | 0.05 | 0.13 | -0.08 | 0.01 | 0.01 | 0.01 | 0.00 | 0.03 | 0.01 |
Viral hepatitis | 0.49 | -0.10 | -0.09 | 0.04 | -0.02 | -0.02 | 0.00 | 0.00 | 0.00 | 0.00 |
inertia
coord
, contrib
and cos2
.
Inertia is the way CA measures variation between row profiles. Total inertia is the \(\chi^2\) statistic divided by sample size.
Row inertia can be obtained by multiplying the row marginal probability by the squared Euclidean norm of the row in the principal coordinate matrix.
with (ca_79_row,
sum(abs(r* (rowSums(coord^2)) - inertia))
)
[1] 1.877449e-16
cos2
cos2
are the share of row inertia from the corresponding cell in coord
with (ca_79_row,
norm((diag(r/inertia) %*% coord^2) - cos2, type='F')
)
[1] 5.432216e-16
contrib
Not too surprisingly, coord
, contrib
, and cos2
share the same row names and column names.
sum(ca_79$call$X)
[1] 529974
sum((rowSums(ca_79$call$X)/sum(ca_79$call$X) - r)^2)
[1] 6.311339e-35
The Row Profiles are the rows of matrix R
below
<- as.matrix(with(ca_79$call, Xtot/N))
P <- ca_79_row$coord
coord <- ca_79_row$inertia
inertia
<- ca_79$call$marge.row
r <- colSums(P)
c
<- nrow(P)
n <- ncol(P)
p
<- diag(r^(-1)) %*% P
R
<- R - matrix(1, nrow = n, ncol = n) %*% P Q
<- diag(r^(-1)) %*% P %*% diag(c^(-1)) - matrix(1, nrow=n, ncol=p)
M
* norm(diag(r^(1/2)) %*% M %*% diag(c^(1/2)), type = "F")^2 n
[1] 29.39279
We can now display a scatterplot from component coord
. This is called a Row Plot.
<- (
p_scat |>
prep_rows ggplot() +
aes(x=`Dim 1`, y=`Dim 2`, label=name) +
geom_point() +
coord_fixed()
)
|> plotly::ggplotly() p_scat
With little effort, it is possible to scale the points so as to tell the reader the relative numerical importance of each Cause of death. Coloring/filling the points using inertia also helps: high inertia rows match light-colored points.
<- prep_rows |>
ppp ggplot() +
aes(x=`Dim 1`,
y=`Dim 2`,
label=name,
size=prop,
fill=log10(inertia),
color=log10(inertia)) +
geom_point(alpha=0.75) +
scale_size_area() +
coord_fixed() +
scale_fill_viridis_c(aesthetics=c("fill", "color"),
guide="colorbar",
direction = 1) +
ggtitle(
"Mortality France 1979: Row plot"
)
|> plotly::ggplotly() ppp
# (ca_79$row)$contrib
Plot the result of row profile analysis using plot.CA
from FactoMineR
.
Perform column profiles analysis
names(ca_79_row)
[1] "coord" "contrib" "cos2" "inertia"
<- str_match(rownames(ca_79$col$coord), '([\\w \\-]*) \\(79\\)')[,2]
age_group_names
<- ca_79$col$coord |>
prep_cols as_tibble() |>
mutate(name= age_group_names) |>
relocate(name) |>
mutate(prop=c, inertia=ca_79$col$inertia)
(|>
prep_cols ggplot() +
aes(x=`Dim 1`,
y=`Dim 2`,
label=name,
size=prop,
fill=log10(inertia),
color=log10(inertia)) +
geom_point(alpha=0.75) +
scale_size_area() +
coord_fixed() +
scale_fill_viridis_c(aesthetics=c("fill", "color"),direction = 1) +
ggtitle(
"Mortality France 1979: Col plot"
|> plotly::ggplotly() ))
Build the symmetric plots (biplots) for correspondence analysis of Mortalitity data
plot.CA(ca_79)
(|>
prep_rows ggplot() +
aes(x=`Dim 1`,
y=`Dim 2`,
label=name,
size=prop,
fill=log10(inertia),
color=log10(inertia)) +
geom_point(alpha=0.75) +
scale_size_area() +
coord_fixed() +
scale_fill_viridis_c(aesthetics=c("fill", "color"),direction = 1) +
geom_point(data = prep_cols,
aes(x=`Dim 1`,
y=`Dim 2`,
label=name,
size=prop,
fill=log10(inertia),
color=log10(inertia)
),shape="square",
alpha=.5,
)|> plotly::ggplotly() )
Warning in geom_point(data = prep_cols, aes(x = `Dim 1`, y = `Dim 2`, label =
name, : Ignoring unknown aesthetics: label
It is convenient to use distinct color scales for rows and columns.
(|>
prep_rows ggplot() +
scale_size_area() +
coord_fixed() +
aes(x=`Dim 1`,
y=`Dim 2`,
text=name,
size=prop,
fill=log10(inertia)) +
geom_point(alpha=0.75) +
scale_fill_viridis_c(option="D") +
geom_point(data = prep_cols,
aes(x=`Dim 1`,
y=`Dim 2`,
text=name,
size=prop,
color=log10(inertia)
),shape="square",
alpha=.5,
+
) scale_color_viridis_c(option="F") +
theme_minimal(
) |> plotly::ggplotly() )
Warning in geom_point(data = prep_cols, aes(x = `Dim 1`, y = `Dim 2`, text =
name, : Ignoring unknown aesthetics: text
Mosaic plots provide an alternative way of exploring contingency tables. They are particularly handy when handling 2-way contingency tables.
Draw mosaic plots for the two contingency tables living inside mortality
datasets.
|>
mortality select(ends_with('(06)')) |>
chisq.test() |>
::glance() broom
Warning in chisq.test(select(mortality, ends_with("(06)"))): Chi-squared
approximation may be incorrect
# A tibble: 1 × 4
statistic p.value parameter method
<dbl> <dbl> <int> <chr>
1 229784. 0 488 Pearson's Chi-squared test
|>
mortality select(ends_with('(06)')) |>
as.matrix() |>
as.table() |>
mosaicplot(color = T)
Are you able to deliver an interpretation of this Correspondence Analysis?
Build the standardized matrix for row profiles analysis. Compute the pairwise distance matrix using the \(\chi^2\) distances. Should you work centered row profiles?
We use the weighted \(\ell_2\) distances defined by the product of the two marginal distributions. The squared distance between the conditional probabilities defined by rows \(a\) and \(a'\) is \[\sum_{b} \frac{\left( N_{a,b}/N_{a,.} - N_{a',b}/N_{a',.}\right)^2}{N_{.,b}/N}\]
The \(\ell_2\) distance between the rows of the principal coordinates matrix row$coord
coincides since they are all centered and normalized with respect to \((N_{.,b}/N)\).
<- ca_79$row$coord[,1:8] |>
dist_Causes_79 dist()
<- hclust(dist_Causes_79, method = "single") hc_79
stopifnot(
require(ggdendro),
require(dendextend),
require(sloop)
)
The instance of hclust
is transformed into a an object of class dendro
. Class dendro
is equipped with a variety of functions/methods for analyzing, visualizing, and exploiting the result of hclust()
.
<- dendro_data(hc_79) dendro_79
class(dendro_79)
[1] "dendro"
(|>
dendro_79 ggdendrogram(
leaf_labels = T,
rotate = T) +
::theme_dendro() +
ggdendroscale_y_reverse()
|> plotly::ggplotly() )
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Perform hierarchical clustering of row profiles with method/linkage "single"
. Check the definition of the method. Did you know the underlying algorithm? If yes, in which context did you get acquainted with this algorithm?
Choose the number of classes (provide justification).
Can you explain the size of the different classes in the partition?
Row profiles that do not belong to the majority class are called atypical.
Compute the share of inertia of atypical row profiles.
Draw a symmetric plot (biplot) outlining the atypical row profiles.
Calculate the theoretical population table for deces
. Do you possible to carry out a chi-squared test?
Perform a hierarchical classification of the line profiles into two classes.
Merge the rows of deces
corresponding to the same class (you can use the the tapply
function), and perform a chi-square test. chi-square test. What’s the conclusion?
Why is it more advantageous to carry out this grouping into two classes compared to arbitrarily grouping two classes, in order to prove the dependence between these two variables?
Represent individuals from the majority class. Do they all seem to you to correspond to an average profile?
Try to explain this phenomenon considering the way in which hierarchical classification uses the Single Linkage method.
The mortality
dataset should be taken with grain of salt. Assigning a single Cause to every death is not a trivial task. It is even questionable: if somebody dies from some infection beCause she could not be cured using an available drug due to another preexisting pathology, who is the culprit?