LAB: Multiple Correspondence Analysis (MCA)

Published

March 25, 2025

M1 MIDS/MFA/LOGOS

Université Paris Cité

Year 2024

Besides the usual packages (tidyverse, …), we shall require FactoMineR and related packages.

The GSS dataset

We will use data coming from the General Social Survey. The General Social Survey data can be retrieved using the gssr package. If needed, install the gssr package and its companion package gssrdoc.

Code
Loading required package: gssr
Package loaded. To attach the GSS data, type data(gss_all) at the console.
For the codebook, type data(gss_dict).
For the panel data and documentation, type e.g. data(gss_panel08_long) and data(gss_panel_doc).
For help on a specific GSS variable, type ?varname at the console.
Loading required package: gssrdoc

The data we will use are panel data made available by the GSS. In order to explore them, it is very useful to load gss_dict and gss_panel_doc.

Code
data(gss_dict)
data(gss_panel_doc)

Check the online help.

Code
?gss_panel_doc

gss_dict and gss_panel_doc are dataframes that can be queried:

Code
gss_panel_doc |> 
  dplyr::sample_n(5) |> 
  select(id, text)  |> 
  gt::gt()
id text
padeg 20. If finished 9th-12th grade: Did he ever get a high school diploma or a GED certificate?
relhh13 1632. What is (PERSON)'s relationship to head of household?
rellife Please tell me whether you strongly agree, agree, disagree, or strongly disagree with the following statements: 679. I try hard to carry my religious beliefs over into all my other dealings in life.
raclive 129. Are there any (negroes/blacks/African-Americans) living in this neighborhood now?
old3 1621. Please tell me the names of the people who usually live in this household. Let's start with the head of the household. c. How old was [PERSON] on his/her last birthday?

In the panel questionnaire, some questions have race in the field id. Check the online help again.

Code
?race

The answer looks something like this:

Question 24. What race do you consider yourself?

And it contains a brief summary of the answers given through the years.

Year Black Other White iap (NA) Total
2010 311 183 1550 - 2044
2012 301 196 1477 - 1974
2014 386 262 1890 - 2538

(this is just an excerpt)

Code
gss_panel_doc |> 
  filter(str_detect(id, "race")) |>
  slice_sample(n=1, by=text) |> 
  select(id, description, text) |> 
  gt::gt()
id description text
race RACE 24. What race do you consider yourself?
racecen1 RACECEN1 1602. What is your race? Indicate one or more races that you consider yourself to be.
natrace NATRACE 68. We are faced with many problems in this country, none of some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount. h. Improving the conditions of Blacks.
natracey NATRACEY 69. We are faced with many problems in this country, none of some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount. h. Assistance to blacks.
intrace3 INTRACE3 What is your race? Indicate one or more races that you consider yourself to be.

The data set we will use comes from the 2010 panel data.

Code
data("gss_panel10_long")

gss_panel10_long 
# A tibble: 6,132 × 1,200
   firstid  wave oversamp sampcode  sample      form         formwt vpsu  vstrat
   <fct>   <dbl>    <dbl> <dbl+lbl> <dbl+lbl>   <dbl+lbl>     <dbl> <dbl> <dbl+>
 1 1           1        1 501       9 [2000 FP] 2 [ALTERNAT…      1 1     2240  
 2 1           2        1 501       9 [2000 FP] 2 [ALTERNAT…      1 1     2240  
 3 1           3        1 501       9 [2000 FP] 2 [ALTERNAT…      1 1     2240  
 4 2           1        1 501       9 [2000 FP] 1 [STANDARD…      1 1     2240  
 5 2           2        1 501       9 [2000 FP] 1 [STANDARD…      1 1     2240  
 6 2           3        1 501       9 [2000 FP] 1 [STANDARD…      1 1     2240  
 7 3           1        1 501       9 [2000 FP] 1 [STANDARD…      1 2     2240  
 8 3           2        1 501       9 [2000 FP] 1 [STANDARD…      1 2     2240  
 9 3           3        1 501       9 [2000 FP] 1 [STANDARD…      1 2     2240  
10 4           1        1 501       9 [2000 FP] 2 [ALTERNAT…      1 2     2240  
# ℹ 6,122 more rows
# ℹ 1,191 more variables: samptype <dbl+lbl>, wtpan12 <dbl+lbl>,
#   wtpan123 <dbl+lbl>, wtpannr12 <dbl+lbl>, wtpannr123 <dbl+lbl>,
#   id <dbl+lbl>, mar1 <dbl+lbl>, mar2 <dbl+lbl>, mar3 <dbl+lbl>,
#   mar4 <dbl+lbl>, mar5 <dbl+lbl>, mar6 <dbl+lbl>, mar7 <dbl+lbl>,
#   mar8 <dbl+lbl>, mar9 <dbl+lbl>, mar10 <dbl>, mar11 <dbl+lbl>, mar12 <dbl>,
#   mar13 <dbl>, mar14 <dbl>, abany <dbl+lbl>, abdefect <dbl+lbl>, …

The panel started with roughly 2000 respondents, who were interviewed in 2010, 2012 and 2014 (the three waves).

Code
gss_panel10_long  |>  
  filter(wave==3, !is.na(id)) |>
  select(firstid, wave, id, sex)
# A tibble: 1,304 × 4
   firstid  wave id        sex       
   <fct>   <dbl> <dbl+lbl> <dbl+lbl> 
 1 1           3 10001     1 [MALE]  
 2 2           3 10002     2 [FEMALE]
 3 3           3 10003     2 [FEMALE]
 4 6           3 10004     1 [MALE]  
 5 7           3 10005     2 [FEMALE]
 6 9           3 10006     2 [FEMALE]
 7 10          3 10007     2 [FEMALE]
 8 11          3 10008     2 [FEMALE]
 9 12          3 10009     1 [MALE]  
10 13          3 10010     1 [MALE]  
# ℹ 1,294 more rows

Some respondents left the survey. Attrition can be monitored with the next query.

Code
gss_panel10_long |> 
  select(wave, id) |>
  group_by(wave) |>
  # na.rm = TRUE: otherwise NA is counted as one more distinct id
  summarize(observed = n_distinct(id, na.rm = TRUE),
            missing = sum(is.na(id)))
# A tibble: 3 × 3
   wave observed missing
  <dbl>    <int>   <int>
1     1     2044       0
2     2     1551     493
3     3     1304     740

The confidence topic

GSS surveys comprise a huge number of questions. Not every question was put to every respondent: each question was asked of two thirds of the respondents. Some questions concern demographic features (age, sex, level of education, employment, and so on). Answers to these questions can be used to assess whether the panel sample is representative, by comparing them with the most recent census data (the US holds a census every ten years).

A number of questions are related to the confidence topic. Respondents were asked about the level of confidence they put in a wide array of institutions.

Code
gss_panel_doc |> 
  filter(str_detect(text, "confidence")) |>
  slice_sample(n=1, by=text) |> 
  select(id, description, text) |> 
  gt::gt()
id description text
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions.
conbus CONBUS far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? b. Major companies.
conclerg CONCLERG far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? c. Organized religion.
coneduc CONEDUC far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? d. Education.
confed CONFED far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? e. Executive branch of the federal government.
conlabor CONLABOR far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? f. Organized labor.
conpress CONPRESS far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? g. Press.
conmedic CONMEDIC far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? h. Medicine.
contv CONTV far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? i. Television.
conjudge CONJUDGE far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? j. U.S. Supreme Court.
consci CONSCI far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? k. Scientific community.
conlegis CONLEGIS far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? l. Congress
conarmy CONARMY far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? m. Military

For each institution (for example the scientific community, medicine, or the executive branch of the federal government), respondents were asked whether they have a great deal of confidence, only some confidence, or hardly any confidence in it. The same Likert scale with 3 levels was used for all institutions.

Question

From the gss_panel10_long dataset, extract columns corresponding to questions from the confidence topic

Code
panel_doc <- gssrdoc::gss_panel_doc

Table wrangling

Before proceeding to Multiple Correspondence Analysis (MCA), let us select the set of active variables.

Question

Project gss_panel10_long on firstid, wave, id, sex, and columns with names in

  confinan conbus conclerg coneduc confed conlabor conpress conmedic contv conjudge consci conlegis conarmy

Filter so as to keep only wave 1.

Transform all relevant columns into factors.
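The three steps above (projection, filtering on wave 1, conversion to factors) can be sketched in base R as follows. This is only an illustration on a toy stand-in for gss_panel10_long: the data frame toy_panel, its values, and the level labels in conf_levels are assumptions made for the example. With the real labelled columns, a conversion helper such as haven::as_factor() would probably be the more natural choice.

```r
# Toy stand-in for gss_panel10_long (hypothetical values, 3 respondents x 3 waves)
toy_panel <- data.frame(
  firstid = factor(rep(1:3, each = 3)),
  wave    = rep(1:3, times = 3),
  id      = c(1, 10001, 10002, 2, NA, NA, 3, 10003, NA),
  sex     = c(1, 1, 1, 2, 2, 2, 2, 2, 2),
  consci  = c(1, 2, 2, 3, NA, NA, 2, 2, NA),
  conarmy = c(1, 1, 2, 2, NA, NA, 3, 3, NA),
  age     = c(40, 42, 44, 31, NA, NA, 58, 60, NA)
)

confidence_vars <- c("confinan", "conbus", "conclerg", "coneduc", "confed",
                     "conlabor", "conpress", "conmedic", "contv", "conjudge",
                     "consci", "conlegis", "conarmy")

# Projection: identifiers, sex, and the confidence columns that are present
conf_cols <- intersect(confidence_vars, names(toy_panel))
keep <- c("firstid", "wave", "id", "sex", conf_cols)

# Selection: wave 1 only
wave1 <- toy_panel[toy_panel$wave == 1, keep]

# Conversion: recode the 3-level answers as factors
# (labels assumed for the example, check them against the codebook)
conf_levels <- c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY")
wave1[conf_cols] <- lapply(wave1[conf_cols], factor,
                           levels = 1:3, labels = conf_levels)
wave1$sex <- factor(wave1$sex, levels = 1:2, labels = c("MALE", "FEMALE"))

str(wave1)
```

Selecting by an explicit vector of names rather than by a pattern such as "^con" avoids picking up unrelated columns whose names happen to start with the same letters.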

Elementary statistics

Question

Use skimr to summarize your dataset.

Question

There are a lot of missing data in your data set.

How are missing cells related?

Drop rows with missing data in the confidence questions.

What are the dimensions of your data set?
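One possible way to look at how missing cells are related is to tabulate the missingness patterns of the rows before dropping incomplete ones. The sketch below uses base R on a toy table; the data frame toy and its values are made up for the illustration and are not GSS results.

```r
# Toy wave-1 table (hypothetical answers; NA = not asked or not answered)
toy <- data.frame(
  consci   = c("great", "some", NA, "hardly", NA, "some"),
  conmedic = c("great", NA,     NA, "hardly", NA, "some"),
  conarmy  = c("some",  "some", NA, "great",  NA, "hardly"),
  stringsAsFactors = TRUE
)

# Logical matrix: TRUE where a cell is missing
miss <- is.na(toy)

# Share of missing cells per question
colMeans(miss)

# Missingness pattern of each row, coded as a string of 0/1 (1 = missing)
patterns <- apply(miss, 1, function(row) paste(as.integer(row), collapse = ""))
pattern_counts <- table(patterns)
pattern_counts

# Keep rows with no missing confidence answer, and drop unused levels
complete <- droplevels(toy[complete.cases(toy), , drop = FALSE])
dim(complete)
```

If questions are asked by blocks, as the two-thirds design suggests, a small number of patterns should account for most rows.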

Question

Explore possible associations between the answers to the different confidence questions.

How would you test independence between the answers on confidence in science and on confidence in the Army?
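A natural candidate for the independence question is Pearson's chi-square test on the two-way contingency table. The sketch below runs it with base R on a toy table of counts; the counts are invented for the example. With the real data the table would come from something like table(df$consci, df$conarmy), where df is a hypothetical name for the cleaned wave-1 table.

```r
# Toy 3 x 3 contingency table (hypothetical counts, not GSS results)
lev <- c("great", "some", "hardly")
tab <- matrix(c(30, 20, 10,
                20, 40, 20,
                10, 20, 30),
              nrow = 3, byrow = TRUE,
              dimnames = list(consci = lev, conarmy = lev))

# Pearson chi-square test of independence
test <- chisq.test(tab)
test

# Expected counts under independence (the usual rule of thumb asks for >= 5)
test$expected

# Pearson residuals show which cells depart most from independence
round(test$residuals, 2)
```

A 3 x 3 table gives (3 - 1) x (3 - 1) = 4 degrees of freedom. The residuals are the same quantities that a shaded mosaic plot displays.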

The case for using MCA

In order to construct a field of ideological and political attitudes, which will subsequently be used as a reference for locating the empirical typologies of response styles and survey compliance, we apply multiple correspondence analysis (MCA). MCA belongs to the family of techniques used in geometric data analysis (Le Roux and Rouanet 2004). It allows for the extraction of the most important dimensions in a set of categorical variables, and the graphical representation of variable categories and individuals relative to each other in a coordinate system. Distances between categories as well as individuals can be interpreted as a measure of (dis)similarity: If categories often co-appear in individual’s responses, they are located close together in the space produced by MCA. Rare co-appearances, accordingly, result in a larger distance between the respective categories. Furthermore, illustrative variables can passively be projected into the field, a technique that has been termed visual regression analysis (Lebart et al. 1984). Whereas the space is determined by the distances between the categories of active variables, passive variables do not alter the structure of the constructed field, but appear in their average and hence most likely position.

From https://doi.org/10.1007/s11135-016-0458-3

In this lab, we won’t look at the field of ideological and political attitudes, but rather at the field of confidence level in a variety of institutions.

Multiple Correspondence Analysis

MCA executive summary

The input of multiple correspondence analysis is a data frame \(X\) with \(n\) rows and \(p\) categorical columns. Multiple Correspondence Analysis starts by building the indicator matrix. The indicator matrix is built by one-hot encoding of each categorical variable.

  • A categorical variable \(V_j\) (factor) with \(q\) levels is mapped to \(q\) \(\{0,1\}\) -valued variables \(V_{j,r}\) for \(r \leq q\)

  • If levels are indexed by \(\{1, \ldots, q\}\), and the value of the categorical variable \(V_j\) in row \(i\) is \(k \in \{1, \ldots, q\}\), the binary encoding of that cell is \[k \mapsto \underbrace{0,\ldots, 0}_{k-1}, 1, \underbrace{0, \ldots, 0}_{q-k}\]

  • The indicator matrix has as many rows as the data matrix

  • The number of columns of the indicator matrix is the sum of the number of levels of the categorical variables/columns of the data matrix

  • The indicator matrix is a numerical matrix. It is suitable for factorial methods
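The construction of the indicator matrix described in the list above can be sketched in a few lines of base R. The helper one_hot() and the toy data frame are hypothetical names introduced for the example. FactoMineR also provides tab.disjonctif(), which should produce a comparable table.

```r
# Toy data frame with p = 2 categorical columns (hypothetical values)
toy <- data.frame(
  consci  = factor(c("great", "some", "hardly", "some")),
  conarmy = factor(c("great", "great", "some", "some"))
)

# One-hot encode a single factor: one 0/1 column per level
one_hot <- function(f, name) {
  Z <- outer(as.integer(f), seq_len(nlevels(f)), FUN = "==") * 1
  colnames(Z) <- paste(name, levels(f), sep = "_")
  Z
}

# Bind the blocks of all variables side by side
Z <- do.call(cbind, unname(Map(one_hot, toy, names(toy))))
Z

# Sanity checks: each row sums to p, each column sum is a level count
rowSums(Z)
colSums(Z)
```

Here the first factor has 3 levels and the second has 2, so the indicator matrix has 4 rows and 3 + 2 = 5 columns.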

Recall \(X\) is the data matrix with \(n\) rows (individuals) and \(p\) categorical columns (variables)

For \(j \in \{1, \ldots, p\}\), let \(J_j\) denote the number of levels (categories) of variable \(j\)

Let \(q = \sum_{j\leq p} J_j\) be the sum of the number of levels throughout the variables

Let \(Z\) be the indicator matrix with \(n\) rows and \(q\) columns

For \(j\leq p\) and \(k \leq J_j\), let \(\langle j, k\rangle = \sum_{j'<j} J_{j'}+k\)

Let \(N = n \times p = \sum_{i\leq n} \sum_{l \leq q} Z_{i,l}\) be the grand total of the indicator matrix (each row of \(Z\) sums to \(p\)), and let \[P = \frac{1}{N} Z\]

(the correspondence matrix for MCA)

The row wise sums of correspondence matrix \(P\) are all equal to \(1/n=p/N\)

The column wise sum of the correspondence matrix \(P\) for the \(k\)th level of the \(j\)th variable of \(X\) ( \(j \leq p\) ) is \[N_{\langle j,k\rangle}/N = f_{\langle j,k\rangle}/p\]

where \(N_{\langle j,k\rangle}\) is the number of rows that take level \(k\) of the \(j\)th variable and \(f_{\langle j,k\rangle} = N_{\langle j,k\rangle}/n\) stands for the relative frequency of that level

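Let \(r\) and \(c\) denote the vectors of row sums and column sums of \(P\), and let \(D_r\) and \(D_c\) be the corresponding diagonal matrices:

\[D_r = \frac{1}{n}\text{Id}_n\qquad D_c =\text{diag}\left(\frac{f_{\langle j,k\rangle}}{p}\right)_{j \leq p, k\leq J_j}\]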

In MCA, we compute the SVD \(U \times D \times V^\top\) of the standardized residuals matrix:

\[S = D_r^{-1/2}\times \left(P - r\times c^\top\right) \times D_c^{-1/2} = \sqrt{n}\left(P - r\times c^\top\right) \times D_c^{-1/2}\]

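Coefficient \(i, \langle j, k\rangle\) of \(S\) is \[\frac{\mathbb{I}_{i, \langle j, k\rangle}- f_{\langle j,k\rangle}}{\sqrt{n\, p\, f_{\langle j,k\rangle}}}\] where \(\mathbb{I}_{i, \langle j, k\rangle} = Z_{i, \langle j, k\rangle}\) equals 1 if row \(i\) takes level \(k\) of variable \(j\) and 0 otherwise.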

MCA consists in computing the SVD of the standardized residuals matrix \(S = U \times D \times V^\top\)

From the SVD, we get

  • \(D_r^{-1/2} \times U\) standardized coordinates of rows
  • \(D_c^{-1/2} \times V\) standardized coordinates of columns
  • \(D_r^{-1/2} \times U \times D\) principal coordinates of rows
  • \(D_c^{-1/2} \times V \times D\) principal coordinates of columns
  • Squared singular values: the principal inertia

When calling svd(.), the argument should be \[D_r^{1/2}\times \left(D_r^{-1} \times P \times D_c^{-1}- \mathbf{1}\times \mathbf{1}^\top \right)\times D_c^{1/2}= D_r^{-1/2}\times \left( P - r \times c^\top \right)\times D_c^{-1/2}\]

where \(\mathbf{1}\) denotes a column vector of ones of the appropriate dimension.

MCA and extended SVD

As

\[D_r^{-1} \times P \times D_c^{-1} - \mathbf{1}\mathbf{1}^\top = (D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top\]

the product

\[(D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top\]

is the extended SVD of

\[D_r^{-1} \times P \times D_c^{-1} - \mathbf{1}\mathbf{1}^\top\]

with respect to \(D_r\) and \(D_c\).

Question

Perform MCA on the indicator matrix.

You may use FactoMineR::MCA(). It is also instructive to compute the correspondence analysis in your own way, by preparing the matrix that is handed to svd() and returning a named list containing all relevant information.
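A do-it-yourself version could look like the sketch below. It follows the executive summary: indicator matrix, correspondence matrix, standardized residuals, SVD, then standardized and principal coordinates. The function name mca_svd(), the names of the list components, and the toy data are assumptions made for the example. This is not FactoMineR's implementation, and it assumes every factor level is observed at least once.

```r
# Minimal MCA on the indicator matrix, base R only (illustrative sketch)
mca_svd <- function(df) {
  stopifnot(all(vapply(df, is.factor, logical(1))))
  df <- droplevels(df)                      # unobserved levels would have mass 0
  n <- nrow(df); p <- ncol(df)
  # Indicator matrix Z (n x q): one 0/1 column per level
  Z <- do.call(cbind, lapply(names(df), function(v) {
    M <- outer(as.integer(df[[v]]), seq_len(nlevels(df[[v]])), "==") * 1
    colnames(M) <- paste(v, levels(df[[v]]), sep = "_")
    M
  }))
  P  <- Z / (n * p)                         # correspondence matrix
  r  <- rowSums(P)                          # row masses, all equal to 1/n
  cm <- colSums(P)                          # column masses f / p
  S  <- (P - outer(r, cm)) / sqrt(outer(r, cm))  # standardized residuals
  dec  <- svd(S)
  keep <- dec$d > 1e-8                      # drop numerically null dimensions
  d <- dec$d[keep]
  row_std <- dec$u[, keep, drop = FALSE] / sqrt(r)
  col_std <- dec$v[, keep, drop = FALSE] / sqrt(cm)
  rownames(col_std) <- colnames(Z)
  list(Z = Z, row_mass = r, col_mass = cm, sv = d, inertia = d^2,
       row_std = row_std, col_std = col_std,
       row_principal = sweep(row_std, 2, d, "*"),
       col_principal = sweep(col_std, 2, d, "*"))
}

# Toy data: n = 8 respondents, p = 3 questions, q = 9 levels (hypothetical)
toy <- data.frame(
  consci   = factor(c("great", "great", "some", "some", "hardly", "some", "great", "hardly")),
  conmedic = factor(c("great", "some", "some", "some", "hardly", "great", "great", "some")),
  conarmy  = factor(c("some", "great", "some", "hardly", "hardly", "great", "great", "some"))
)

res <- mca_svd(toy)
round(res$inertia, 3)
sum(res$inertia)        # total inertia, equal to q / p - 1 for an indicator matrix
```

On real data, the principal inertias returned here should agree with the eigenvalues reported by FactoMineR::MCA() up to numerical error, and the coordinates up to the sign of each axis.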

Question

If you did use FactoMineR::MCA(), explain the organization of the result.

Screeplots

Question

Draw screeplots. Why are they useful? Comment briefly.
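A scree plot only needs the principal inertias (squared singular values). The base R sketch below computes them on toy data and draws the percentage of inertia per axis; the data and object names are assumptions made for the example. The dashed line marks the average eigenvalue, which is 1/p in this toy example and is a common (heuristic) threshold for deciding how many axes to keep.

```r
# Toy data: n = 8 respondents, p = 3 questions (hypothetical values)
toy <- data.frame(
  consci   = factor(c("great", "great", "some", "some", "hardly", "some", "great", "hardly")),
  conmedic = factor(c("great", "some", "some", "some", "hardly", "great", "great", "some")),
  conarmy  = factor(c("some", "great", "some", "hardly", "hardly", "great", "great", "some"))
)
n <- nrow(toy); p <- ncol(toy)

# Indicator matrix, correspondence matrix, standardized residuals
Z <- do.call(cbind, unname(lapply(toy, function(f)
  outer(as.integer(f), seq_len(nlevels(f)), "==") * 1)))
P <- Z / (n * p)
r <- rowSums(P); cm <- colSums(P)
S <- (P - outer(r, cm)) / sqrt(outer(r, cm))

# Principal inertias: squared singular values, null dimensions dropped
eig <- svd(S)$d^2
eig <- eig[eig > 1e-12]

scree <- data.frame(
  dim        = seq_along(eig),
  eigenvalue = eig,
  percent    = 100 * eig / sum(eig),
  cumulative = 100 * cumsum(eig) / sum(eig)
)
scree

barplot(scree$percent, names.arg = scree$dim,
        xlab = "Dimension", ylab = "% of inertia",
        main = "Scree plot (toy data)")
abline(h = 100 * mean(eig) / sum(eig), lty = 2)   # average eigenvalue
```

In MCA the percentages of inertia per axis are typically low because of the one-hot coding, so the shape of the decay matters more than the raw percentages.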

Individuals

Question

Perform the analysis of individual (row) profiles.

What are the classical plots? How can you build them from the output of FactoMineR::MCA?

Build the table of row contributions and the table of squared cosines (the so-called \(\cos^2\), which measure the quality of representation)
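Both tables can be derived from the principal coordinates of the rows. The sketch below recomputes them from the SVD on toy data, in base R; the object names and data are assumptions made for the example. The \(\cos^2\) of a row on an axis is the share of its squared distance to the origin carried by that axis, and its contribution is the share of the axis inertia due to that row.

```r
# Toy data: n = 8 respondents, p = 3 questions (hypothetical values)
toy <- data.frame(
  consci   = factor(c("great", "great", "some", "some", "hardly", "some", "great", "hardly")),
  conmedic = factor(c("great", "some", "some", "some", "hardly", "great", "great", "some")),
  conarmy  = factor(c("some", "great", "some", "hardly", "hardly", "great", "great", "some"))
)
n <- nrow(toy); p <- ncol(toy)

# Standardized residuals of the indicator matrix and their SVD
Z <- do.call(cbind, unname(lapply(toy, function(f)
  outer(as.integer(f), seq_len(nlevels(f)), "==") * 1)))
P <- Z / (n * p)
r <- rowSums(P); cm <- colSums(P)
dec  <- svd((P - outer(r, cm)) / sqrt(outer(r, cm)))
keep <- dec$d > 1e-8
d    <- dec$d[keep]

# Principal coordinates of rows: D_r^{-1/2} U D
row_coord <- sweep(dec$u[, keep, drop = FALSE] / sqrt(r), 2, d, "*")

# cos2: quality of representation of each row on each axis (rows sum to 1)
cos2 <- row_coord^2 / rowSums(row_coord^2)

# Contributions: mass x squared coordinate / axis inertia (columns sum to 1)
contrib <- sweep(r * row_coord^2, 2, d^2, "/")

round(cos2[, 1:2], 3)
round(contrib[, 1:2], 3)
```

When comparing with FactoMineR, note that its contributions appear to be expressed as percentages, so a factor of 100 may be needed.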

Variables/Categories

Question

Perform column profiles (categories) analysis. You may use factoextra::fviz_mca_var()

Question

What is the v.test component of the var component of an MCA object?

Symmetric plots

Question

Build the symmetric plots (biplots) for multiple correspondence analysis.

Mosaicplots

MCA can be complemented by mosaic plots, double-decker plots, chi-square tests, and correspondence analyses between pairs of variables.

Question

Draw a mosaic plot to visualize the association between confidence levels in Science and confidence levels in Medicine.
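Base R's mosaicplot() is one way to draw such a plot. The sketch below uses a toy table of counts; the counts are invented and are not GSS results. With the real data, the table would come from something like table(df$consci, df$conmedic), where df is a hypothetical name for the cleaned wave-1 table.

```r
# Toy 3 x 3 contingency table (hypothetical counts)
lev <- c("great", "some", "hardly")
tab <- as.table(matrix(c(45, 20,  5,
                         25, 50, 15,
                          5, 15, 20),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(consci = lev, conmedic = lev)))

# Tile areas are proportional to cell counts; with shade = TRUE the tiles
# are coloured according to the Pearson residuals under independence
mosaicplot(tab, shade = TRUE, las = 1,
           main = "Confidence in science vs medicine (toy data)")
```

Under independence, the tiles of each column would be split at the same heights, so misaligned splits and strongly shaded tiles point to association.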

Further references

Barth, Alice and Schmitz, Andreas. 2018. Response quality and ideological dispositions: an integrative approach using geometric and classifying techniques. Quality & Quantity. https://doi.org/10.1007/s11135-016-0458-3

When analyzing survey data, response quality has consequential implications for substantial conclusions. Differences in response quality are usually explained by personality, or socio-demographic or cognitive characteristics. Little, however, is known about how respondents’ political attitudes, values, and opinions impact on quality aspects. This is a striking analytical omission, as potential associations between political values and various forms of response biases and artefacts call into question surveys’ ability to represent ‘public opinion’. In this contribution, response quality is traced back to respondents’ political and ideological dispositions. For this purpose, a relational understanding of response quality is applied that takes into account different aspects of response behaviors, as well as the interrelations between these indicators. Using data from the US General Social Survey (2010–2014), an empirical typology of response quality is created via finite mixture analysis. The resulting classes are then related to positions in the US field of ideological dispositions constructed via multiple correspondence analysis. The analyses reveal that there are (1) different combinations of response patterns and thus different empirical response types, and (2) that these types of response quality systematically vary with regard to the respondents’ political and ideological (dis)positions. Implications of the findings for public opinion surveys are discussed.