LAB: Multiple Correspondence Analysis (MCA)
M1 MIDS/MFA/LOGOS
Year 2024
Besides the usual packages (tidyverse, …), we shall require FactoMineR and related packages.
The GSS dataset
We will use data from the General Social Survey (GSS). The data can be retrieved using the gssr package. If needed, install the gssr package and its companion package gssrdoc.
Loading required package: gssr
Package loaded. To attach the GSS data, type data(gss_all) at the console.
For the codebook, type data(gss_dict).
For the panel data and documentation, type e.g. data(gss_panel08_long) and data(gss_panel_doc).
For help on a specific GSS variable, type ?varname at the console.
Loading required package: gssrdoc
The data we will use are panel data made available by the GSS. To explore them, it is very useful to load gss_dict and gss_panel_doc.
Check the online help.
Code
?gss_panel_doc
gss_dict and gss_panel_doc are data frames that can be queried:
id | text |
---|---|
padeg | 20. If finished 9th-12th grade: Did he ever get a high school diploma or a GED certificate? |
relhh13 | 1632. What is (PERSON)'s relationship to head of household? |
rellife | Please tell me whether you strongly agree, agree, disagree, or strongly disagree with the following statements: 679. I try hard to carry my religious beliefs over into all my other dealings in life. |
raclive | 129. Are there any (negroes/blacks/African-Americans) living in this neighborhood now? |
old3 | 1621. Please tell me the names of the people who usually live in this household. Let's start with the head of the household. c. How old was [PERSON] on his/her last birthday? |
In the panel questionnaire, some questions have race in the field id. Check the online help again.
Code
?race
The answer looks something like this:
Question 24. What race do you consider yourself?
And it contains a brief summary of the answers given through the years.
Year | Black | Other | White | iap (NA) | Total |
---|---|---|---|---|---|
2010 | 311 | 183 | 1550 | - | 2044 |
2012 | 301 | 196 | 1477 | - | 1974 |
2014 | 386 | 262 | 1890 | - | 2538 |
(this is just an excerpt)
Code
id | description | text |
---|---|---|
race | RACE | 24. What race do you consider yourself? |
racecen1 | RACECEN1 | 1602. What is your race? Indicate one or more races that you consider yourself to be. |
natrace | NATRACE | 68. We are faced with many problems in this country, none of some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount. h. Improving the conditions of Blacks. |
natracey | NATRACEY | 69. We are faced with many problems in this country, none of some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount. h. Assistance to blacks. |
intrace3 | INTRACE3 | What is your race? Indicate one or more races that you consider yourself to be. |
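A table like the one above can be obtained by filtering the documentation on the id column. The following is a minimal sketch, assuming the documentation data frame has the columns id, description and text shown above. The helper find_questions() and the toy data frame toy_doc are hypothetical; the toy rows are copied from tables in this lab so that the block runs without gssrdoc.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: keep documentation rows whose id matches a pattern.
find_questions <- function(doc, pattern) {
  doc |>
    filter(str_detect(id, pattern)) |>
    select(id, description, text)
}

# Toy stand-in for gss_panel_doc (same three columns, a few rows only).
toy_doc <- tibble(
  id          = c("race", "natrace", "confinan", "conbus"),
  description = c("RACE", "NATRACE", "CONFINAN", "CONBUS"),
  text        = c("24. What race do you consider yourself?",
                  "h. Improving the conditions of Blacks.",
                  "a. Banks and financial institutions.",
                  "b. Major companies.")
)

race_questions <- find_questions(toy_doc, "race")
race_questions

# With the real documentation (assuming gssrdoc is installed):
# find_questions(gssrdoc::gss_panel_doc, "race")
```

The same helper with the pattern "^con" is one way to list the confidence questions used later in the lab.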
The data set we will use comes from year 2010 panel data.
Code
data("gss_panel10_long")
gss_panel10_long
# A tibble: 6,132 × 1,200
firstid wave oversamp sampcode sample form formwt vpsu vstrat
<fct> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl+>
1 1 1 1 501 9 [2000 FP] 2 [ALTERNAT… 1 1 2240
2 1 2 1 501 9 [2000 FP] 2 [ALTERNAT… 1 1 2240
3 1 3 1 501 9 [2000 FP] 2 [ALTERNAT… 1 1 2240
4 2 1 1 501 9 [2000 FP] 1 [STANDARD… 1 1 2240
5 2 2 1 501 9 [2000 FP] 1 [STANDARD… 1 1 2240
6 2 3 1 501 9 [2000 FP] 1 [STANDARD… 1 1 2240
7 3 1 1 501 9 [2000 FP] 1 [STANDARD… 1 2 2240
8 3 2 1 501 9 [2000 FP] 1 [STANDARD… 1 2 2240
9 3 3 1 501 9 [2000 FP] 1 [STANDARD… 1 2 2240
10 4 1 1 501 9 [2000 FP] 2 [ALTERNAT… 1 2 2240
# ℹ 6,122 more rows
# ℹ 1,191 more variables: samptype <dbl+lbl>, wtpan12 <dbl+lbl>,
# wtpan123 <dbl+lbl>, wtpannr12 <dbl+lbl>, wtpannr123 <dbl+lbl>,
# id <dbl+lbl>, mar1 <dbl+lbl>, mar2 <dbl+lbl>, mar3 <dbl+lbl>,
# mar4 <dbl+lbl>, mar5 <dbl+lbl>, mar6 <dbl+lbl>, mar7 <dbl+lbl>,
# mar8 <dbl+lbl>, mar9 <dbl+lbl>, mar10 <dbl>, mar11 <dbl+lbl>, mar12 <dbl>,
# mar13 <dbl>, mar14 <dbl>, abany <dbl+lbl>, abdefect <dbl+lbl>, …
Initially, the panel was made of roughly 2000 respondents, who were interviewed in 2010, 2012 and 2014 (the three waves).
# A tibble: 1,304 × 4
firstid wave id sex
<fct> <dbl> <dbl+lbl> <dbl+lbl>
1 1 3 10001 1 [MALE]
2 2 3 10002 2 [FEMALE]
3 3 3 10003 2 [FEMALE]
4 6 3 10004 1 [MALE]
5 7 3 10005 2 [FEMALE]
6 9 3 10006 2 [FEMALE]
7 10 3 10007 2 [FEMALE]
8 11 3 10008 2 [FEMALE]
9 12 3 10009 1 [MALE]
10 13 3 10010 1 [MALE]
# ℹ 1,294 more rows
Some respondents left the survey. Attrition can be monitored by counting respondents wave by wave: the extract above contains only 1,304 rows for wave 3.
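Counting respondents per wave could be sketched as follows. The assumption, suggested by the 6,132 = 3 × 2,044 rows of gss_panel10_long, is that a respondent who dropped out keeps a row in the long table but with a missing id for that wave. The helper respondents_by_wave() and the toy panel are hypothetical.

```r
library(dplyr)

# Hypothetical helper: number of respondents actually interviewed in each wave.
# Assumption: a missing id marks a respondent absent from that wave.
respondents_by_wave <- function(panel) {
  panel |>
    filter(!is.na(id)) |>
    count(wave, name = "n_respondents")
}

# Toy panel: 4 respondents; respondent 3 leaves after wave 1,
# respondent 4 leaves after wave 2.
toy_panel <- tibble(
  firstid = factor(rep(1:4, each = 3)),
  wave    = rep(1:3, times = 4),
  id      = c(1, 101, 201,  2, 102, 202,  3, NA, NA,  4, 104, NA)
)

attrition <- respondents_by_wave(toy_panel)
attrition

# With the real data, after data("gss_panel10_long"):
# respondents_by_wave(gss_panel10_long)
```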
The confidence topic
GSS surveys are made of a huge number of questions. Not every question was asked of every respondent: each question was asked of two thirds of the respondents. Some questions relate to demographic features (age, sex, level of education, employment, and so on). Answers to these questions can be used to assess whether the panel sample is representative, by comparing them with the latest census data (there is a census every ten years in the US).
A number of questions are related to the confidence topic. Respondents were asked about the level of confidence they put in a wide array of institutions.
Code
id | description | text |
---|---|---|
confinan | CONFINAN | far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions. |
conbus | CONBUS | far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? b. Major companies. |
conclerg | CONCLERG | far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? c. Organized religion. |
coneduc | CONEDUC | far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? d. Education. |
confed | CONFED | far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? e. Executive branch of the federal government. |
conlabor | CONLABOR | far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? f. Organized labor. |
conpress | CONPRESS | far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? g. Press. |
conmedic | CONMEDIC | far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? h. Medicine. |
contv | CONTV | far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? i. Television. |
conjudge | CONJUDGE | far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? j. U.S. Supreme Court. |
consci | CONSCI | far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? k. Scientific community. |
conlegis | CONLEGIS | far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? l. Congress |
conarmy | CONARMY | far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? m. Military |
For institutions such as science, medicine, or the executive branch of the federal government, respondents were asked whether they have a great deal of confidence, only some confidence, or hardly any confidence in the institution. The same 3-level Likert scale was used for all institutions.
From the gss_panel10_long dataset, extract the columns corresponding to questions from the confidence topic.
Code
panel_doc <- gssrdoc::gss_panel_doc
Table wrangling
Before proceeding to Multiple Correspondence Analysis (MCA), let us select the set of active variables.
Project gss_panel10_long on firstid, wave, id, sex, and the columns named
confinan conbus conclerg coneduc confed conlabor conpress conmedic contv conjudge consci conlegis conarmy
Filter so as to keep only wave 1.
Transform all relevant columns into factors.
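The three steps above (project, filter, convert) might be sketched as follows. This is one possible approach, not the reference solution: prepare_confidence() is a hypothetical helper, and the sketch assumes that the labelled columns (<dbl+lbl>) come from the haven package, so that haven::zap_missing() and haven::as_factor() apply.

```r
library(dplyr)

confidence_vars <- c(
  "confinan", "conbus", "conclerg", "coneduc", "confed", "conlabor",
  "conpress", "conmedic", "contv", "conjudge", "consci", "conlegis", "conarmy"
)

# Hypothetical helper: wave-1 rows, identifiers, sex and confidence items,
# with labelled columns turned into factors.
prepare_confidence <- function(panel, vars = confidence_vars) {
  panel |>
    filter(wave == 1) |>
    select(firstid, wave, id, sex, all_of(vars)) |>
    mutate(across(
      c(sex, all_of(vars)),
      # zap_missing() turns special missing values into plain NA before the
      # value labels become factor levels; droplevels() removes unused levels.
      \(x) droplevels(haven::as_factor(haven::zap_missing(x)))
    ))
}

# With the real data (assuming gssr is installed and attached):
# data("gss_panel10_long")
# confidence_w1 <- prepare_confidence(gss_panel10_long)
```

Dropping unused levels is a design choice: levels that never occur would otherwise create empty columns in the indicator matrix.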
Elementary statistics
Use skimr to summarize your dataset.
There are a lot of missing data in your data set.
How are missing cells related?
Drop rows with missing data in the confidence questions.
What are the dimensions of your data set?
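The questions above on missing data can be explored along these lines. The data frame toy is a hypothetical stand-in for the wave-1 confidence table, with made-up values; the sketch assumes, as in the real table, that the confidence columns are the only ones whose names start with "con".

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the wave-1 confidence table (hypothetical values).
lev <- c("great deal", "only some", "hardly any")
toy <- tibble(
  id      = 1:6,
  consci  = factor(c("great deal", NA, "only some", "hardly any", NA, "only some"),
                   levels = lev),
  conarmy = factor(c("only some", NA, "great deal", NA, NA, "hardly any"),
                   levels = lev)
)

# skimr::skim(toy) would give a per-column summary (assuming skimr is installed).

# Share of missing cells per confidence column.
missing_share <- toy |>
  summarise(across(starts_with("con"), \(x) mean(is.na(x))))

# Missingness patterns: which combinations of missing columns occur, how often.
missing_patterns <- toy |>
  mutate(across(starts_with("con"), is.na)) |>
  count(consci, conarmy, name = "n_rows")
missing_patterns

# Keep rows with complete answers, then report the dimensions.
complete_rows <- toy |> drop_na(starts_with("con"))
dim(complete_rows)
```

If missing cells tend to occur together in the pattern table, the missingness is more likely due to the questionnaire design (questions not asked) than to individual refusals.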
Explore possible associations between the answers to the different confidence questions.
How would you test possible independence between the answers on confidence in science and confidence in the Army?
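One standard answer is Pearson's chi-square test on the two-way contingency table. A minimal sketch on simulated answers (hypothetical data, drawn independently, so the test should usually not reject):

```r
# Toy answers (hypothetical, simulated independently of each other).
set.seed(1)
lev <- c("great deal", "only some", "hardly any")
toy <- data.frame(
  consci  = factor(sample(lev, 300, replace = TRUE), levels = lev),
  conarmy = factor(sample(lev, 300, replace = TRUE), levels = lev)
)

# Two-way contingency table, then Pearson chi-square test of independence.
tab <- table(toy$consci, toy$conarmy, dnn = c("consci", "conarmy"))
test <- chisq.test(tab)

test$statistic            # Pearson chi-square statistic
test$parameter            # degrees of freedom: (3 - 1) * (3 - 1) = 4
test$p.value              # small values speak against independence
round(test$expected, 1)   # expected counts under independence
```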
The case for using MCA
In order to construct a field of ideological and political attitudes, which will subsequently be used as a reference for locating the empirical typologies of response styles and survey compliance, we apply multiple correspondence analysis (MCA). MCA belongs to the family of techniques used in geometric data analysis (Le Roux and Rouanet 2004). It allows for the extraction of the most important dimensions in a set of categorical variables, and the graphical representation of variable categories and individuals relative to each other in a coordinate system. Distances between categories as well as individuals can be interpreted as a measure of (dis)similarity: If categories often co-appear in individual’s responses, they are located close together in the space produced by MCA. Rare co-appearances, accordingly, result in a larger distance between the respective categories. Furthermore, illustrative variables can passively be projected into the field, a technique that has been termed visual regression analysis (Lebart et al. 1984). Whereas the space is determined by the distances between the categories of active variables, passive variables do not alter the structure of the constructed field, but appear in their average and hence most likely position.
From https://doi.org/10.1007/s11135-016-0458-3
In this lab, we won’t look at the field of ideological and political attitudes, but rather at the field of confidence level in a variety of institutions.
Multiple Correspondence Analysis
The input of multiple correspondence analysis is a data frame \(X\) with \(n\) rows and \(p\) categorical columns. Multiple Correspondence Analysis starts by building the indicator matrix. The indicator matrix is built by one-hot encoding of each categorical variable.
A categorical variable \(V_j\) (factor) with \(q\) levels is mapped to \(q\) \(\{0,1\}\)-valued variables \(V_{j,r}\) for \(r \leq q\).
If levels are indexed by \(\{1, \ldots, q\}\) and the value of the categorical variable \(V_j\) in row \(i\) is \(k \in \{1, \ldots, q\}\), the binary encoding of \(k\) is \[k \mapsto \underbrace{0,\ldots, 0}_{k-1}, 1, \underbrace{0, \ldots, 0}_{q-k}\]
The indicator matrix has as many rows as the data matrix
The number of columns of the indicator matrix is the sum of the number of levels of the categorical variables/columns of the data matrix
The indicator matrix is a numerical matrix. It is suitable for factorial methods.
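The one-hot encoding described above can be sketched in base R. The helper indicator_matrix() and the toy data are hypothetical; FactoMineR also provides tab.disjonctif() for the same purpose.

```r
# Hypothetical helper: one-hot encode every column of a data frame.
# Assumption: no missing values; levels that never occur are dropped.
indicator_matrix <- function(df) {
  blocks <- lapply(names(df), function(v) {
    f <- factor(df[[v]])
    # Row i has a 1 in the column of its level and 0 elsewhere.
    block <- outer(as.integer(f), seq_len(nlevels(f)), "==") * 1L
    colnames(block) <- paste(v, levels(f), sep = "_")
    block
  })
  do.call(cbind, blocks)
}

toy <- data.frame(
  consci  = c("great", "some", "hardly", "some"),
  conarmy = c("some", "some", "great", "great")
)

Z <- indicator_matrix(toy)
Z
dim(Z)       # 4 rows, 3 + 2 = 5 columns
rowSums(Z)   # every row sums to the number of variables, here 2
```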
Recall \(X\) is the data matrix with \(n\) rows (individuals) and \(p\) categorical columns (variables)
For \(j \in \{1, \ldots, p\}\), let \(J_j\) denote the number of levels(categories) of variable \(j\)
Let \(q = \sum_{j\leq p} J_j\) be the sum of the number of levels throughout the variables
Let \(Z\) be the indicator matrix with \(n\) rows and \(q\) columns
For \(j\leq p\) and \(k \leq J_j\), let \(\langle j, k\rangle = \sum_{j'<j} J_{j'}+k\)
Let \(N = n \times p = \sum_{i\leq n} \sum_{k \leq q} Z_{i,k}\) and \[P = \frac{1}{N} Z\]
(the correspondence matrix for MCA)
The row wise sums of correspondence matrix \(P\) are all equal to \(1/n=p/N\)
The column wise sum of the correspondence matrix \(P\) for the \(k\)th level of the \(j\)th variable of \(X\) ( \(j \leq p\) ) is \[N_{\langle j,k\rangle}/N = f_{\langle j,k\rangle}/p\]
where \(f_{\langle j,k\rangle}\) stands for the relative frequency of level \(k\) of the \(j\)th variable
\[D_r = \frac{1}{n}\text{Id}_n\qquad D_c =\text{diag}\left(\frac{f_{\langle j,k\rangle}}{p}\right)_{j \leq p, k\leq J_j}\]
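In MCA, we compute the SVD \(U \times D \times V^\top\) of the standardized residuals matrix, where \(r\) and \(c\) denote the vectors of row sums and column sums of \(P\) (the diagonals of \(D_r\) and \(D_c\)):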
\[S = D_r^{-1/2}\times \left(P - r\times c^\top\right) \times D_c^{-1/2} = \sqrt{n}\left(P - r\times c^\top\right) \times D_c^{-1/2}\]
Coefficient \(i, \langle j, k\rangle\) of \(S\) is \[\frac{\mathbb{I}_{i, \langle j, k\rangle}- f_{\langle j,k\rangle}}{\sqrt{n \, p \, f_{\langle j,k\rangle}}}\]
MCA consists in computing the SVD of the standardized residuals matrix \(S = U \times D \times V^\top\)
From the SVD, we get
- \(D_r^{-1/2} \times U\) standardized coordinates of rows
- \(D_c^{-1/2} \times V\) standardized coordinates of columns
- \(D_r^{-1/2} \times U \times D\) principal coordinates of rows
- \(D_c^{-1/2} \times V \times D\) principal coordinates of columns
- Squared singular values: the principal inertia
When calling svd(.), the argument should be \[D_r^{1/2}\times \left(D_r^{-1} \times P \times D_c^{-1}- \mathbf{1}\mathbf{1}^\top \right)\times D_c^{1/2}= D_r^{-1/2}\times \left( P - r \times c^\top \right)\times D_c^{-1/2}\]
where \(\mathbf{1}\) denotes an all-ones column vector (of length \(n\) on the left, of length \(q\) on the right).
As
\[D_r^{-1} \times P \times D_c^{-1} - \mathbf{1}\mathbf{1}^\top = (D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top\]
the right-hand side is the extended SVD of
\[D_r^{-1} \times P \times D_c^{-1} - \mathbf{1}\mathbf{1}^\top\]
with respect to \(D_r\) and \(D_c\)
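The whole computation can be sketched in base R on simulated data. This is an illustration under stated assumptions, not a replacement for FactoMineR::MCA(): mca_by_hand() is a hypothetical helper, the toy data are made up, and every level is assumed to occur at least once. The sketch builds \(S\) entrywise as \((P - r c^\top)_{ik}/\sqrt{r_i c_k}\), which is the same matrix as \(D_r^{-1/2}(P - r c^\top)D_c^{-1/2}\).

```r
# Toy data (hypothetical): 3 categorical variables, 60 simulated individuals.
set.seed(42)
lev <- c("great", "some", "hardly")
toy <- data.frame(
  consci   = factor(sample(lev, 60, replace = TRUE), levels = lev),
  conmedic = factor(sample(lev, 60, replace = TRUE), levels = lev),
  conarmy  = factor(sample(lev, 60, replace = TRUE), levels = lev)
)

mca_by_hand <- function(df) {
  n <- nrow(df)
  p <- ncol(df)
  # Indicator matrix Z (n x q): one block of one-hot columns per variable.
  Z <- do.call(cbind, lapply(names(df), function(v) {
    f <- factor(df[[v]])
    block <- outer(as.integer(f), seq_len(nlevels(f)), "==") * 1
    colnames(block) <- paste(v, levels(f), sep = "_")
    block
  }))
  P  <- Z / (n * p)     # correspondence matrix
  r  <- rowSums(P)      # row masses, all equal to 1/n
  cm <- colSums(P)      # column masses, f_<j,k> / p
  # Standardized residuals D_r^{-1/2} (P - r c^T) D_c^{-1/2}, entry by entry.
  S <- (P - outer(r, cm)) / sqrt(outer(r, cm))
  dec  <- svd(S)
  keep <- dec$d > 1e-8  # drop numerically null dimensions
  d <- dec$d[keep]
  U <- dec$u[, keep, drop = FALSE]
  V <- dec$v[, keep, drop = FALSE]
  list(
    Z = Z, S = S, sv = d, eig = d^2,
    row_std   = U / sqrt(r),                     # D_r^{-1/2} U
    col_std   = V / sqrt(cm),                    # D_c^{-1/2} V
    row_princ = sweep(U / sqrt(r), 2, d, "*"),   # D_r^{-1/2} U D
    col_princ = sweep(V / sqrt(cm), 2, d, "*")   # D_c^{-1/2} V D
  )
}

res <- mca_by_hand(toy)
res$eig        # principal inertias (squared singular values)
sum(res$eig)   # total inertia: q / p - 1 = 9 / 3 - 1 = 2
```

The total inertia \(q/p - 1\) depends only on the numbers of variables and levels, not on the data, which is why raw percentages of inertia tend to look low in MCA.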
Perform MCA on the indicator matrix.
You may use FactoMineR::MCA(). It is also instructive to compute the correspondence analysis yourself, by preparing the matrix that is handed to svd() and returning a named list containing all relevant information.
If you did use FactoMineR::MCA(), explain the organization of the result.
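A minimal sketch of a call to FactoMineR::MCA() and of the main components of its result, on hypothetical simulated data (the real input would be the data frame of confidence factors, without the identifier columns):

```r
library(FactoMineR)

# Toy data (hypothetical): 3 categorical variables, 60 simulated individuals.
set.seed(42)
lev <- c("great", "some", "hardly")
toy <- data.frame(
  consci   = factor(sample(lev, 60, replace = TRUE), levels = lev),
  conmedic = factor(sample(lev, 60, replace = TRUE), levels = lev),
  conarmy  = factor(sample(lev, 60, replace = TRUE), levels = lev)
)

# MCA() takes the data frame of factors and builds the indicator matrix itself.
res_mca <- MCA(toy, ncp = 5, graph = FALSE)

names(res_mca)            # top-level components of the result
head(res_mca$eig)         # eigenvalues, percentage and cumulative percentage
head(res_mca$ind$coord)   # principal coordinates of individuals
res_mca$var$coord         # principal coordinates of categories
res_mca$var$contrib       # contributions of categories to each axis (in %)
res_mca$var$cos2          # quality of representation of categories
```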
Screeplots
Draw screeplots. Why are they useful? Comment briefly.
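A scree plot can be drawn directly from the eig component of the result. The sketch below uses base graphics on hypothetical simulated data; factoextra::fviz_screeplot() is an alternative. The dashed line marks the share of an average axis, a common rule of thumb for deciding how many axes to keep.

```r
library(FactoMineR)

# Toy data (hypothetical): 3 categorical variables, 60 simulated individuals.
set.seed(42)
lev <- c("great", "some", "hardly")
toy <- data.frame(
  consci   = factor(sample(lev, 60, replace = TRUE), levels = lev),
  conmedic = factor(sample(lev, 60, replace = TRUE), levels = lev),
  conarmy  = factor(sample(lev, 60, replace = TRUE), levels = lev)
)
res_mca <- MCA(toy, graph = FALSE)

eig <- res_mca$eig   # column 2 holds the percentage of inertia per axis
barplot(eig[, 2], names.arg = seq_len(nrow(eig)),
        xlab = "Dimension", ylab = "Percentage of inertia")
# Reference line: percentage carried by an "average" axis.
abline(h = 100 / nrow(eig), lty = 2)
```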
Individuals
Perform Individual profiles analysis.
What are the classical plots? How can you build them from the output of FactoMineR::MCA?
Build the table of row contributions and the table of squared cosines (the so-called \(\cos^2\), which measure the quality of representation)
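Both tables are stored in the ind component of the result. A minimal sketch on hypothetical simulated data, with a plot of individuals coloured by \(\cos^2\) drawn with factoextra (assumed to be installed):

```r
library(FactoMineR)
library(factoextra)

# Toy data (hypothetical): 3 categorical variables, 60 simulated individuals.
set.seed(42)
lev <- c("great", "some", "hardly")
toy <- data.frame(
  consci   = factor(sample(lev, 60, replace = TRUE), levels = lev),
  conmedic = factor(sample(lev, 60, replace = TRUE), levels = lev),
  conarmy  = factor(sample(lev, 60, replace = TRUE), levels = lev)
)
res_mca <- MCA(toy, graph = FALSE)

contrib <- res_mca$ind$contrib   # contribution of each row to each axis (in %)
cos2    <- res_mca$ind$cos2      # quality of representation of each row
colSums(contrib)                 # contributions sum to 100 on every axis
head(round(cos2, 3))

# Individuals on the first factorial plane, coloured by cos2.
p_ind <- fviz_mca_ind(res_mca, geom = "point", col.ind = "cos2")
p_ind
```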
Variables/Categories
Perform column profiles (categories) analysis. You may use factoextra::fviz_mca_var().
What is the v.test component of the var component of an MCA object?
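A minimal sketch of the category plot and of the v.test table, on hypothetical simulated data. As a hedged reading of the FactoMineR output: the v.test of a category on an axis is a standardized version of its coordinate, and absolute values above about 2 suggest a coordinate significantly different from 0. With the independent toy data below, few or no categories should pass that threshold.

```r
library(FactoMineR)
library(factoextra)

# Toy data (hypothetical): 3 categorical variables, 60 simulated individuals.
set.seed(42)
lev <- c("great", "some", "hardly")
toy <- data.frame(
  consci   = factor(sample(lev, 60, replace = TRUE), levels = lev),
  conmedic = factor(sample(lev, 60, replace = TRUE), levels = lev),
  conarmy  = factor(sample(lev, 60, replace = TRUE), levels = lev)
)
res_mca <- MCA(toy, graph = FALSE)

# Categories on the first factorial plane.
p_var <- fviz_mca_var(res_mca, repel = TRUE)
p_var

# v.test: one row per category, one column per axis.
vt <- res_mca$var$v.test
round(vt, 2)
# Categories that stand out on the first axis.
rownames(vt)[abs(vt[, 1]) > 2]
```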
Symmetric plots
Build the symmetric plots (biplots) for multiple correspondence analysis.
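With factoextra, a biplot showing individuals and categories on the same factorial plane can be sketched as follows (hypothetical simulated data; label = "var" labels the categories only, to keep the plot readable):

```r
library(FactoMineR)
library(factoextra)

# Toy data (hypothetical): 3 categorical variables, 60 simulated individuals.
set.seed(42)
lev <- c("great", "some", "hardly")
toy <- data.frame(
  consci   = factor(sample(lev, 60, replace = TRUE), levels = lev),
  conmedic = factor(sample(lev, 60, replace = TRUE), levels = lev),
  conarmy  = factor(sample(lev, 60, replace = TRUE), levels = lev)
)
res_mca <- MCA(toy, graph = FALSE)

# Individuals and categories on the first two axes.
p_biplot <- fviz_mca_biplot(res_mca, axes = c(1, 2), label = "var", repel = TRUE)
p_biplot
```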
Mosaicplots
MCA can be complemented by mosaic plots, double decker plots, chi-square tests, and correspondence analyses between pairs of variables.
Draw a mosaic plot to visualize the association between confidence level in Science and confidence level in Medicine.
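A mosaic plot can be drawn with base R from a two-way contingency table. The sketch below uses hypothetical simulated answers; with shade = TRUE, tiles are coloured according to the Pearson residuals of the independence model.

```r
# Toy answers (hypothetical, simulated).
set.seed(7)
lev <- c("great deal", "only some", "hardly any")
toy <- data.frame(
  consci   = factor(sample(lev, 200, replace = TRUE), levels = lev),
  conmedic = factor(sample(lev, 200, replace = TRUE), levels = lev)
)

# Two-way contingency table of the two confidence items.
tab <- table(consci = toy$consci, conmedic = toy$conmedic)

# Tile areas are proportional to cell counts.
mosaicplot(tab, shade = TRUE, las = 1,
           main = "Confidence in science vs. confidence in medicine")
```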
Further references
Barth, Alice and Schmitz, Andreas. 2018. Response quality and ideological dispositions: an integrative approach using geometric and classifying techniques. Quality & Quantity
When analyzing survey data, response quality has consequential implications for substantial conclusions. Differences in response quality are usually explained by personality, or socio-demographic or cognitive characteristics. Little, however, is known about how respondents’ political attitudes, values, and opinions impact on quality aspects. This is a striking analytical omission, as potential associations between political values and various forms of response biases and artefacts call into question surveys’ ability to represent ‘public opinion’. In this contribution, response quality is traced back to respondents’ political and ideological dispositions. For this purpose, a relational understanding of response quality is applied that takes into account different aspects of response behaviors, as well as the interrelations between these indicators. Using data from the US General Social Survey (2010–2014), an empirical typology of response quality is created via finite mixture analysis. The resulting classes are then related to positions in the US field of ideological dispositions constructed via multiple correspondence analysis. The analyses reveal that there are (1) different combinations of response patterns and thus different empirical response types, and (2) that these types of response quality systematically vary with regard to the respondents’ political and ideological (dis)positions. Implications of the findings for public opinion surveys are discussed.