PCA and SVD over log-mortality tables

facet screeplots for gender and countries
comment the screeplots
comment the correlation circles
comment the individual plots

In the next chunks we compute the PCA (with standardization and centering) for all Countries and Genders in the database. The dplyr pipeline prepares grouped tibble corresponding to the different countries and genders. The collection is fed to group_map which attempts to compute pca for each group. This output is a list of (key, value) pairs.

Henceforth, we exclude data from Sweden since for several years after \(2000\), mortality risks at some ages between \(5\) and \(10\) are zero. I do not know how reliable this information is. It might just be the outcome of random fluctuations: a modern Swedish generation typically comprises \(110000\) children. Data obtained from larger (and possibly less healthy) European countries suggest that mortality risks between ages \(5\) and \(10\) are less than \(5\times 10^{-5}\). This suggests that the yearly number of girls or boys dying ate age \(x\) between \(5\) and \(10\) in Sweeden is dominated by a Poisson distribution with parameter \(3\). The probability of outcome \(0\) under Poisson distribution with parameter \(3\) is approximately 0.05.

Screeplots

The screeplots for standardized-centered PCAs for countries and genders shows

the inertia projected on the first two principal components accounts for no less than \(90\%\) of total inertia.
for each country the inertia projected on the first two principal components is larger for Females than for Males
The approximation by rank two matrices is worst for the Dutch matrices

But for all countries and genders, most of the inertia is captured by the first two axes. USA and Canada exhibit a larger inertia on the second axis.

This suggests that for all countries, genders, the scaled and centered log-quotient mortality matrices are well-approximated by rank-one matrices. However, recall that we measure approximation error using the Frobenius norm. We weight the different ages (columns) in a uniform way. This does not take into account the impact the different columns on the Life Expectancy.

Screeplots from PCA without standardization tell us in advance how well Lee-Carter approximation will perform for a given population. The larger the share of inertia carried by the first component, the better the Lee-Carter approximation. From the figure above, we can infer that Lee-Carter approximation will be better for Female populations and than for Male populations.

Correlation circles

For a given PCA, the correlation circle tells us how close the different variables are from the plane spanned by the first two components and where the projection of the variables lay on this plan. This can be the basis of some variables clustering procedures.

The correlation circle delivered by ade4 is not readable: - all variables are aligned with the first component as could have been predicted from the screeplot - labels overlapping hides a possible structure

For Canada

For France and Italy, as could have been guessed from screeplots, all centered and standardized variables are highly correlated with the first component. This is even more striking for women than for men.

For the Netherlands, Spain and the US, especially for men, variables associated with old ages are substantially correlated with the second component.

A common pattern emerges across countries and genders. Recall that PCA (and SVD) are uniquely defined up to signs of eigenvectors (singular vectors). The following description is taken from the correlation circle for women in France.

Age groups 60-79 and 80-99 are packed in the same region. The two age groups are almost aligned along parallel curves and ranked according to age along each curve. The two curves look oriented into opposite directions
Age group 40-59 is also packed in a small region and almost ordered along the second component
Age group 20-39 forms a SE-NW oriented cloud, points are ordered according to ages
Age group 0-19 spans the whole plot. Ages 0-7 are ordered around the second component. Ages 8-15 tend to cluster in the same region. Age 18 almost look like an outlier. Ages 16-17 are at an intermediate position between the outlier 18 and the 10-15 group.

Looking at the variables on the plane generated by the second and third principal components is also interesting.

Ages below 20 lie around a SW-NE axis
Ages between 20 and 40 lie on the south-west
Ages between 40 and 60 are clustered and almost aligned on the sout-east
Ages above 60 are almost aligned along a line
Ages above 80 are almost aligned along another line on the north east quarter

Individuals (Years)

If we plot the rows (years) on the plane spanned by the first two principal axes, patterns across genders and countries again look very similar. Axis 1 roughly corresponds to a time axis. If we look at points corresponding to the first 15 years, plots for European countries are very much alike. For Female plots, the four points corresponding to years 1948-1951 follow the same paths in all six European countries; for the US the path is different. For England and Wales, France and Italy, the projections on the planed spanned by the first two principal axes look strikingly similar.

Canonical Correlation Analysis

Build a function that takes as input

a dataframe like life_table_pivot,
a couple of countries, say Spain and Sweeden,
a vector of Year, say 1948:1998
a Gender say Female

returns a matrix called \(Z\) with rows corresponding to Year and columns corresponding to couples (Country, Age).

Lee-Carter model for US mortality

During the last century, in the USA and in western Europe, central death rates at all ages have exhibited a general decreasing trend. This decreasing trend has not always been homogeneous across ages.

The Lee-Carter model has been designed to model and forecast the evolution of the log-central death rates for the United States during the XXth century.

Let \(A_{x,t}\) denote the log central death rate at age \(x\) during year \(t\in T\) for a given population (defined by Gender and Country).

The Lee-Carter model assumes that observed loagrithmic central death rates are sampled according to the following model

\[A_{x,t} \sim_{\text{independent}} a_x + b_x \kappa_t + \epsilon_{x,t}\]

where \((a_x)_x, (b_x)_x\) and \((\kappa_t)_t\) are unknown vectors that satisfy

\[a_x = \frac{1}{|T|}\sum_{t \in T} A_{x,t}\qquad \sum_{t\in T} \kappa_t = 0 \qquad \sum_{x} b_x^2 =1\]

and \(\epsilon_{x,t}\) are i.i.d Gaussian random variables.

US data

Fit a Lee-Carter model on the American data (for Male and Female data) training on years 1933 up to 1995.
Compare the fit provided by the Lee-Carter model with the fit provided by a rank \(2\) truncated SVD
Compare vectors avec \((a_x)_x, (b_x)_x\) and \((\kappa_t)_t\) with appropriate singular vectors.
Use the Lee-Carter model to predict the central death rates for years \(2000\) up to \(2015\)
Plot predictions and observations for years \(2000, 2005, 2010, 2015\)

Application of Lee-Carter model to a European Country

Fit a Lee-Carter model to a European country (Framce) using demography::lca
Comment
Compare with rank-2 truncated SVD
Use the Lee-Carter model to predict the central death rates for years \(2000\) up to \(2015\)
Plot predictions and observations for years \(2000, 2005, 2010, 2015\)

Predictions of life expectancies at different ages

Use Lee-Carter approximation to approximate residual life expectations
Compare with observed residual life expectations

References

Lee, R. D., & Carter, L. R. (1992). Modeling and Forecasting U. S. Mortality. Journal of the American Statistical Association, 87(419), 659–671. https://doi.org/10.2307/2290201
Preston, S. H., Heuveline, P., & Guillot, M. (2001). Measuring and Modeling Population Processes, Oxford: Blackwell Publishers, 2001, xv+ 291 pp.

Appendix

country_iso3 <- c(
  "France"="FRA",
  "Canada"="CAN",
  "Netherlands"="NLD",
  "Sweden"="SWE",
  "Italy"="ITA",
  "Spain"="ESP",
  "Portugal"="PRT",
  "USA"="USA"
)

tb_iso3 <- tibble(
  name=names(country_iso3), iso3=country_iso3
)
ds_path <- '~/Dropbox/HMD/hld-part.parquet'

ds <- open_dataset(ds_path)
tb <- ds |>
  filter(
    Country %in% country_iso3, 
    Sex %in% c(1L, 2L), 
    Region=='0', 
    Residence=='0',
    SocDem=='0',
    Version=='1',
    Year1 >= 1948, 
    Ethnicity=='0', 
    AgeInt==1,
    TypeLT==1) |> 
    select(-c(Region, Ethnicity,Year2,AgeInt, TypeLT, Residence, SocDem, Version)) |> 
  collect()
tb <- tb  |>  
  summarise(across(ends_with('(x)'), mean), .by=c(Country, Sex, Year1, Age)) 
life_table <- tb |>
  mutate(Iso3 = factor(Country)) |>
  mutate(Gender = factor(Sex, labels=c("Male", "Female"))) |>
  mutate(Area = fct_collapse(Iso3, 
    SE = c("PRT", "ESP", "ITA", "FRA"), 
    NE = c("NLD", "SWE", "CAN"), 
    USA = "USA")) |> 
  rename(Year=Year1) |> 
  select(Area, Country, Gender, Year, Age, `q(x)`) |> 
  mutate(`q(x)`=ifelse(`q(x)`==0, 2e-5 , `q(x)`))   # special care for data from Sweden around age 7-8
#
#
# Nest the life tables data frame acording to Area, Country, Gender
lt_nested <- life_table  |>  
  filter(Age < 90)  |>       # no data above age 90 for some years in Spain
  nest(.by=c(Area, Country, Gender))  |>  
  mutate(data_tf=map(data, 
                  \(df)  pivot_wider(df, id_cols=Year, names_from=Age, values_from=`q(x)`)))
#
#
# Wrapper for prcomp for per Country/Gender PCA
prcomp_lt <- function(df, ...){
  df |> 
    tibble::column_to_rownames("Year") |>
    mutate(across(everything(), \(x) log(x, 10))) |> 
    scale() |> 
    prcomp(...)
}
#
#
# Add resultof  prcomp for per Country/Gender PCA as a list column

lt_nested <- lt_nested |> 
  mutate(pca = map(data_tf, \(x) try(prcomp_lt(x, rank=5)))) 
#
#
# Add preparation for screeplots per Country/Gender PCA as a list column

lt_nested <- lt_nested |> 
  mutate(scrp = map(pca, \(x) drop_na(tidy(x, matrix="d"))))  
# 
#
# Add screeplots per Country/Gender PCA as a list column

lt_nested  |>  
  select(Country, Gender, scrp)  |>  
  unnest(cols=c(scrp)) |> 
  ggplot() +
    aes(x=PC, y=percent, fill=Country) +
    geom_col(position="dodge") +
    facet_wrap(~ Gender) +
    labs(
      title="Screeplots of log mortality quotients at Age < 90, Years between 1950-2020"
    )
#
#
# Add preparation for correlation circles per Country/Gender PCA as a list column

prep_co_circle <- function(pco) {
  r <- with(pco,  t(t(rotation) * sdev[1:ncol(rotation)]))
  as_tibble(r) |> 
    rename_with(.fn = \(x) gsub('PC', '', x), .cols=everything()) |>
    mutate(row_id=rownames(r))
}
#
#
#
plot_co_circle <- function(df, tit){
  df |> 
    ggplot() +
    aes(x=`1`, y=`2`, label=row_id) +
    geom_segment(
      aes(xend=0, yend=0, color=cut(
        as.integer(row_id),breaks=c(0, 6, 15, 65, 90), right=F))# ,
    ) +
  scale_color_viridis_d(name="Age") +
  ggrepel::geom_text_repel() +
  coord_fixed() +
  xlim(c(-1.1, 1.1)) + ylim(c(-1.1, 1.1)) +
  ggforce::geom_circle(aes(x0=0, y0=0, r=1), linetype="dashed") +
  theme_minimal() +
  labs(title=tit)
}

#
#  
# Add correlation  circle plots per Country/Gender PCA as a list column

lt_nested <- lt_nested |>  
  mutate(cocirc= map(pca, prep_co_circle))  |> 
  mutate(tit=glue::glue("Cor circle {Country} {Gender}")) |> 
  mutate(p_co_circ=map2(cocirc, tit, plot_co_circle))
#
#
# Correlation circles for Canada
tmp <- lt_nested |>  
  filter(Country=='CAN') |> pull(p_co_circ)

tmp[[1]] + tmp[[2]]
#
#
# Correlation circles for Italy

tmp <- lt_nested |>  
  filter(Country=='ITA') |> pull(p_co_circ)

tmp[[1]] + tmp[[2]]
#
#
# Correlation circles for USA
tmp <- lt_nested |>  
  filter(Country=='USA') |> pull(p_co_circ)

tmp[[1]] + tmp[[2]]
#
#
# Add row plots as a list column
lt_nested  <- lt_nested |>  
  mutate(rowplots=map2(pca, data_tf, \(pco, df2) autoplot(pco, df2))) |> 
  mutate(rowplots=map(rowplots, \(p) p + coord_fixed()  + ggrepel::geom_text_repel(aes(label=rownames)))) |> 
  mutate(rowplots=map2(rowplots, tit, \(x,y) x + labs(title=y)))
#
#
# Row plot for USA
tmp <- lt_nested |>  
  filter(Country=='FRA') |> pull(rowplots)

tmp[[1]] + tmp[[2]]
#
#
# Lee-Carter analysis
res <- demography::lca(
  demography::fr.mort, 
  ages=seq(20, 80), 
  years=seq(1948, 1998), 
  restype="logrates")
#
#
# Lee-Carter analysis plots
plot(res)