LAB: Multiple Correspondence Analysis (MCA)

Published

March 28, 2025

M1 MIDS/MFA/LOGOS

Université Paris Cité

Année 2024

Course Homepage

Moodle

Besides the usual packages (tidyverse, …), we shall require FactoMineR and related packages.

The GSS dataset

We will use data coming from the General Social Survey (GSS). The data can be retrieved using the gssr package. If needed, install the gssr package and its companion package gssrdoc.

Code
Loading required package: gssr
Package loaded. To attach the GSS data, type data(gss_all) at the console.
For the codebook, type data(gss_dict).
For the panel data and documentation, type e.g. data(gss_panel08_long) and data(gss_panel_doc).
For help on a specific GSS variable, type ?varname at the console.
Loading required package: gssrdoc

The data we will use are panel data made available by the GSS. In order to explore them, it’s very useful to load gss_dict and gss_panel_doc.

Code
data(gss_dict)
data(gss_panel_doc)

Check the online help.

Code
?gss_panel_doc

gss_dict and gss_panel_doc are dataframes that can be queried:

Code
gss_panel_doc |> 
  dplyr::sample_n(5) |> 
  select(id, text)  |> 
  gt::gt()
id text
relhhd8 1622. What is (PERSON)'s relationship to (HEAD OF HOUSEHOLD)?
denom16 115a. If Protestant: What specific denomination is that, if any?
conjudge far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? j. U.S. Supreme Court.
wordh 1612. We would like to know something about how people go about guessing words they do not know. On this card are listed some words--you may know some of them, and you may not know quite a few of them. On each line the first word is in capital letters like BEAST. Then there are five other words. Tell me the number of the word that comes closest to the meaning of the word in capital letters. For example, if the word in capital letters is BEAST, you would say "4" since "animal" come closer to BEAST than any of the other words. If you wish, I will read the words to you. These words are difficult for almost everyone just give me your best guess if you are not sure of the answer.
compwage 1087. Do you believe your fixed annual wages in calendar year 2005 were higher or lower than those of employees with similar experience and job descriptions in other companies in your region? Please answer on a 1 to 5 scale.

In the panel questionnaire, some questions contain race in their id field. Check the online help again.

Code
?race

The answer looks something like this:

Question 24. What race do you consider yourself?

It also contains a brief summary of the answers given through the years.

Year Black Other White iap (NA) Total
2010 311 183 1550 - 2044
2012 301 196 1477 - 1974
2014 386 262 1890 - 2538

(this is just an excerpt)

Code
gss_panel_doc |> 
  filter(str_detect(id, "race")) |>
  slice_sample(n=1, by=text) |> 
  select(id, description, text) |> 
  gt::gt()
id description text
race RACE 24. What race do you consider yourself?
racecen3 RACECEN3 1602. What is your race? Indicate one or more races that you consider yourself to be.
natrace NATRACE 68. We are faced with many problems in this country, none of some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount. h. Improving the conditions of Blacks.
natracey NATRACEY 69. We are faced with many problems in this country, none of some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount. h. Assistance to blacks.
intrace3 INTRACE3 What is your race? Indicate one or more races that you consider yourself to be.

The data set we will use comes from the 2010 panel data.

Code
data("gss_panel10_long")

gss_panel10_long 
# A tibble: 6,132 × 1,200
   firstid  wave oversamp sampcode  sample      form         formwt vpsu  vstrat
   <fct>   <dbl>    <dbl> <dbl+lbl> <dbl+lbl>   <dbl+lbl>     <dbl> <dbl> <dbl+>
 1 1           1        1 501       9 [2000 FP] 2 [ALTERNAT…      1 1     2240  
 2 1           2        1 501       9 [2000 FP] 2 [ALTERNAT…      1 1     2240  
 3 1           3        1 501       9 [2000 FP] 2 [ALTERNAT…      1 1     2240  
 4 2           1        1 501       9 [2000 FP] 1 [STANDARD…      1 1     2240  
 5 2           2        1 501       9 [2000 FP] 1 [STANDARD…      1 1     2240  
 6 2           3        1 501       9 [2000 FP] 1 [STANDARD…      1 1     2240  
 7 3           1        1 501       9 [2000 FP] 1 [STANDARD…      1 2     2240  
 8 3           2        1 501       9 [2000 FP] 1 [STANDARD…      1 2     2240  
 9 3           3        1 501       9 [2000 FP] 1 [STANDARD…      1 2     2240  
10 4           1        1 501       9 [2000 FP] 2 [ALTERNAT…      1 2     2240  
# ℹ 6,122 more rows
# ℹ 1,191 more variables: samptype <dbl+lbl>, wtpan12 <dbl+lbl>,
#   wtpan123 <dbl+lbl>, wtpannr12 <dbl+lbl>, wtpannr123 <dbl+lbl>,
#   id <dbl+lbl>, mar1 <dbl+lbl>, mar2 <dbl+lbl>, mar3 <dbl+lbl>,
#   mar4 <dbl+lbl>, mar5 <dbl+lbl>, mar6 <dbl+lbl>, mar7 <dbl+lbl>,
#   mar8 <dbl+lbl>, mar9 <dbl+lbl>, mar10 <dbl>, mar11 <dbl+lbl>, mar12 <dbl>,
#   mar13 <dbl>, mar14 <dbl>, abany <dbl+lbl>, abdefect <dbl+lbl>, …

At the beginning, the panel comprised roughly 2000 respondents, who were interviewed in 2010, 2012, and 2014 (the three waves).

Code
gss_panel10_long  |>  
  filter(wave==3, !is.na(id)) |>
  select(firstid, wave, id, sex)
# A tibble: 1,304 × 4
   firstid  wave id        sex       
   <fct>   <dbl> <dbl+lbl> <dbl+lbl> 
 1 1           3 10001     1 [MALE]  
 2 2           3 10002     2 [FEMALE]
 3 3           3 10003     2 [FEMALE]
 4 6           3 10004     1 [MALE]  
 5 7           3 10005     2 [FEMALE]
 6 9           3 10006     2 [FEMALE]
 7 10          3 10007     2 [FEMALE]
 8 11          3 10008     2 [FEMALE]
 9 12          3 10009     1 [MALE]  
10 13          3 10010     1 [MALE]  
# ℹ 1,294 more rows

Some respondents left the survey. Attrition can be monitored with the following query.

Code
gss_panel10_long |> 
  select(wave, id) |>
  group_by(wave) |>
  summarize(observed = n_distinct(id),
            missing = sum(is.na(id)))
# A tibble: 3 × 3
   wave observed missing
  <dbl>    <int>   <int>
1     1     2044       0
2     2     1552     493
3     3     1305     740

The confidence topic

GSS surveys consist of a huge number of questions. Not all questions were asked of every respondent; each question was asked of two thirds of the respondents. Some questions concern demographic features (age, sex, level of education, employment, and so on). Answers to these questions can be used to check whether the panel sample is representative: the data can be compared with the latest census data (there is a census every ten years in the US).

A number of questions are related to the confidence topic. Respondents were asked about the level of confidence they put in a wide array of institutions.

Code
gss_panel_doc |> 
  filter(str_detect(text, "confidence")) |>
  slice_sample(n=1, by=text) |> 
  select(id, description, text) |> 
  gt::gt()
id description text
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions.
conbus CONBUS far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? b. Major companies.
conclerg CONCLERG far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? c. Organized religion.
coneduc CONEDUC far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? d. Education.
confed CONFED far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? e. Executive branch of the federal government.
conlabor CONLABOR far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? f. Organized labor.
conpress CONPRESS far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? g. Press.
conmedic CONMEDIC far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? h. Medicine.
contv CONTV far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? i. Television.
conjudge CONJUDGE far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? j. U.S. Supreme Court.
consci CONSCI far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? k. Scientific community.
conlegis CONLEGIS far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? l. Congress
conarmy CONARMY far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? m. Military

For each institution (Science, Medicine, the Executive branch of the federal government, and so on), respondents were asked whether they have a great deal of confidence, only some confidence, or hardly any confidence in it. The same 3-level Likert scale was used for all institutions.

Question

From the gss_panel10_long dataset, extract the columns corresponding to questions from the confidence topic.

Solution

We first gather the id of the questions pertaining to the confidence topic.

Code
confidence_ids <- gss_panel_doc |> 
  filter(str_detect(text, "confidence")) |>
  slice_sample(n=1, by=text) |> 
  pull(id)

Then we project the panel data on the questions pertaining to the confidence topic. We also keep the respondent key, sex, and the relevant wave.

Code
gss_panel10_long |> 
  filter(wave==1) |>
  select(firstid, wave, id, sex, contains(confidence_ids)) |>
  sample_n(10) |>
  gt::gt()
firstid wave id_1: id sex_1: RESPONDENTS SEX confinan_1: CONFID IN BANKS & FINANCIAL INSTITUTIONS conbus_1: CONFIDENCE IN MAJOR COMPANIES conclerg_1: CONFIDENCE IN ORGANIZED RELIGION coneduc_1: CONFIDENCE IN EDUCATION confed_1: CONFID. IN EXEC BRANCH OF FED GOVT conlabor_1: CONFIDENCE IN ORGANIZED LABOR conpress_1: CONFIDENCE IN PRESS conmedic_1: CONFIDENCE IN MEDICINE contv_1: CONFIDENCE IN TELEVISION conjudge_1: CONFID. IN UNITED STATES SUPREME COURT consci_1: CONFIDENCE IN SCIENTIFIC COMMUNITY econsci conlegis_1: CONFIDENCE IN CONGRESS conarmy_1: CONFIDENCE IN MILITARY
1225 1 1225 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
1511 1 1511 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
1153 1 1153 2 3 2 2 1 2 2 2 2 1 2 2 NA 2 1
1585 1 1585 2 3 2 2 2 2 2 2 2 2 2 2 NA 1 3
940 1 940 2 3 2 3 2 3 3 2 1 2 3 2 NA 3 2
110 1 110 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
559 1 559 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
1536 1 1536 1 3 2 2 2 2 2 3 2 3 2 2 NA 3 1
846 1 846 2 3 2 2 3 3 2 3 1 3 2 1 4 3 2
220 1 220 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Let us go back to the documentation.

Code
gss_panel_doc |>
  filter(id %in% confidence_ids) |> 
  select(-ends_with("2"), -ends_with("3")) |>
  gt::gt()
id description text properties_1 marginals_1
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/265"), c("CONFINAN_1", "CONFINAN_1", "CONFINAN_1") c(30.1, 55.2, 14.7, NA, NA, 100), c("395", "725", "193", "671", "16", "2,000"), c("1", "2", "3", "0", "8", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", "Total"), c("CONFINAN_1", "CONFINAN_1", "CONFINAN_1", "CONFINAN_1", "CONFINAN_1", "CONFINAN_1")
conbus CONBUS far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? b. Major companies. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/247"), c("CONBUS_1", "CONBUS_1", "CONBUS_1") c(17.6, 62.3, 20.1, NA, NA, NA, 100), c("228", "806", "260", "671", "32", "3", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONBUS_1", "CONBUS_1", "CONBUS_1", "CONBUS_1", "CONBUS_1", "CONBUS_1", "CONBUS_1")
conclerg CONCLERG far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? c. Organized religion. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/250"), c("CONCLERG_1", "CONCLERG_1", "CONCLERG_1") c(25.1, 52.5, 22.4, NA, NA, NA, 100), c("324", "678", "289", "671", "35", "3", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONCLERG_1", "CONCLERG_1", "CONCLERG_1", "CONCLERG_1", "CONCLERG_1", "CONCLERG_1", "CONCLERG_1")
coneduc CONEDUC far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? d. Education. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/259"), c("CONEDUC_1", "CONEDUC_1", "CONEDUC_1") c(27.1, 56.5, 16.4, NA, NA, 100), c("359", "747", "217", "671", "6", "2,000"), c("1", "2", "3", "0", "8", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", "Total"), c("CONEDUC_1", "CONEDUC_1", "CONEDUC_1", "CONEDUC_1", "CONEDUC_1", "CONEDUC_1")
confed CONFED far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? e. Executive branch of the federal government. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/262"), c("CONFED_1", "CONFED_1", "CONFED_1") c(13.8, 45.4, 40.8, NA, NA, NA, 100), c("179", "590", "531", "671", "26", "3", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONFED_1", "CONFED_1", "CONFED_1", "CONFED_1", "CONFED_1", "CONFED_1", "CONFED_1")
conlabor CONLABOR far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? f. Organized labor. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/289"), c("CONLABOR_1", "CONLABOR_1", "CONLABOR_1") c(12.3, 58.1, 29.6, NA, NA, 100), c("155", "731", "372", "671", "71", "2,000"), c("1", "2", "3", "0", "8", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", "Total"), c("CONLABOR_1", "CONLABOR_1", "CONLABOR_1", "CONLABOR_1", "CONLABOR_1", "CONLABOR_1")
conpress CONPRESS far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? g. Press. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/298"), c("CONPRESS_1", "CONPRESS_1", "CONPRESS_1") c(10, 48.5, 41.5, NA, NA, NA, 100), c("132", "638", "546", "671", "12", "1", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONPRESS_1", "CONPRESS_1", "CONPRESS_1", "CONPRESS_1", "CONPRESS_1", "CONPRESS_1", "CONPRESS_1")
conmedic CONMEDIC far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? h. Medicine. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/295"), c("CONMEDIC_1", "CONMEDIC_1", "CONMEDIC_1") c(40.1, 49.3, 10.6, NA, NA, 100), c("530", "651", "140", "671", "8", "2,000"), c("1", "2", "3", "0", "8", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", "Total"), c("CONMEDIC_1", "CONMEDIC_1", "CONMEDIC_1", "CONMEDIC_1", "CONMEDIC_1", "CONMEDIC_1")
contv CONTV far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? i. Television. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/328"), c("CONTV_1", "CONTV_1", "CONTV_1") c(9.4, 48.2, 42.3, NA, NA, NA, 100), c("124", "634", "556", "671", "14", "1", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONTV_1", "CONTV_1", "CONTV_1", "CONTV_1", "CONTV_1", "CONTV_1", "CONTV_1")
conjudge CONJUDGE far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? j. U.S. Supreme Court. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/286"), c("CONJUDGE_1", "CONJUDGE_1", "CONJUDGE_1") c(31.8, 50.3, 17.8, NA, NA, NA, 100), c("410", "649", "230", "671", "38", "2", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONJUDGE_1", "CONJUDGE_1", "CONJUDGE_1", "CONJUDGE_1", "CONJUDGE_1", "CONJUDGE_1", "CONJUDGE_1")
consci CONSCI far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? k. Scientific community. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/319"), c("CONSCI_1", "CONSCI_1", "CONSCI_1") c(41.6, 51, 7.3, NA, NA, NA, 100), c("527", "646", "93", "671", "62", "1", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONSCI_1", "CONSCI_1", "CONSCI_1", "CONSCI_1", "CONSCI_1", "CONSCI_1", "CONSCI_1")
conlegis CONLEGIS far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? l. Congress c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/292"), c("CONLEGIS_1", "CONLEGIS_1", "CONLEGIS_1") c(10.8, 51.2, 37.9, NA, NA, NA, 100), c("141", "666", "493", "671", "28", "1", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONLEGIS_1", "CONLEGIS_1", "CONLEGIS_1", "CONLEGIS_1", "CONLEGIS_1", "CONLEGIS_1", "CONLEGIS_1")
conarmy CONARMY far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? m. Military c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/244"), c("CONARMY_1", "CONARMY_1", "CONARMY_1") c(46.8, 39.8, 13.4, NA, NA, 100), c("613", "521", "175", "671", "20", "2,000"), c("1", "2", "3", "0", "8", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", "Total"), c("CONARMY_1", "CONARMY_1", "CONARMY_1", "CONARMY_1", "CONARMY_1", "CONARMY_1")

The level labels for the answers to the confidence questions are:

Code
foo <- gss_panel_doc |>
   filter(id %in% confidence_ids) |> 
   select(-ends_with("2"), -ends_with("3")) |> 
   unnest(
    cols=c(properties_1), names_repair = "universal") |> 
   unnest(
    cols=c(marginals_1), names_repair = "universal") |> 
   head(6) 
New names:
New names:
• `id` -> `id...1`
• `id` -> `id...6`
Code
foo |> gt::gt()
id...1 description text property value...5 id...6 percent n value...9 label id...11
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions. Data type numeric CONFINAN_1 30.1 395 1 A GREAT DEAL CONFINAN_1
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions. Data type numeric CONFINAN_1 55.2 725 2 ONLY SOME CONFINAN_1
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions. Data type numeric CONFINAN_1 14.7 193 3 HARDLY ANY CONFINAN_1
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions. Data type numeric CONFINAN_1 NA 671 0 IAP CONFINAN_1
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions. Data type numeric CONFINAN_1 NA 16 8 DK CONFINAN_1
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions. Data type numeric CONFINAN_1 100.0 2,000 NA Total CONFINAN_1
  • IAP (Inapplicable) means the respondent was not asked the question.
  • DK stands for Don’t Know.

If respondents were not given the opportunity to provide a response to a question because it was not included in the survey they were administered, then their responses are coded as “IAP.”

Code
panel_doc <- gssrdoc::gss_panel_doc

Table wrangling

Before proceeding to Multiple Correspondence Analysis (MCA), let us select the set of active variables.

Question

Project gss_panel10_long on firstid, wave, id, sex, and columns with names in

  confinan conbus conclerg coneduc confed conlabor conpress conmedic contv conjudge consci conlegis conarmy

Filter so as to keep only wave 1.

Transform all relevant columns into factors.

Solution
Code
confidence_panel10 <- gss_panel10_long |> 
  filter(wave==1) |>
  select(firstid, wave, id, sex, contains(confidence_ids)) |>
  mutate(across(starts_with("con"), as_factor)) |>
  mutate(sex=as_factor(sex))
Code
confidence_10 <- gss_panel_doc |>
   filter(id %in% confidence_ids) |> 
   select(-ends_with("2"), -ends_with("3")) |> 
   unnest(
    cols=c(properties_1), names_repair = "universal") |> 
   unnest(
    cols=c(marginals_1), names_repair = "universal") |>
   mutate(text=str_squish(text)) |>
   mutate(text= str_match(text, "\\? [\\w]\\.(.*)")) |> 
   mutate(text=text[,2]) |> 
   filter(property == "Missing-data codes") |>
   select(-starts_with("id"), -property, -`value...5`)
New names:
New names:
• `id` -> `id...1`
• `id` -> `id...6`
Code
levels_table <- confidence_10 |> 
  filter(label!='Total' | is.na(label)) |>
  distinct(`value...9`, label)

Elementary statistics

Question

Use skimr to summarize your dataset.

Solution
Code
confidence_panel10 |>
  skimr::skim(where(is.factor)) |>
  select(-factor.ordered, -skim_type)|>
  filter(str_starts(skim_variable,"con")) |>
  gt::gt() |>
  gt::fmt_number(decimals=3) |>
  gt::tab_caption(
    caption = "GSS Panel data 2010 First wave, Topic confidence"
  )
GSS Panel data 2010 First wave, Topic confidence
skim_variable n_missing complete_rate factor.n_unique factor.top_counts
confinan 680.000 0.667 3.000 ONL: 656, HAR: 564, A G: 144, IAP: 0
conbus 705.000 0.655 3.000 ONL: 848, HAR: 317, A G: 174, IAP: 0
conclerg 725.000 0.645 3.000 ONL: 720, HAR: 328, A G: 271, IAP: 0
coneduc 681.000 0.667 3.000 ONL: 793, A G: 367, HAR: 203, IAP: 0
confed 698.000 0.659 3.000 ONL: 620, HAR: 499, A G: 227, IAP: 0
conlabor 744.000 0.636 3.000 ONL: 781, HAR: 373, A G: 146, IAP: 0
conpress 689.000 0.663 3.000 ONL: 621, HAR: 594, A G: 140, IAP: 0
conmedic 681.000 0.667 3.000 ONL: 647, A G: 558, HAR: 158, IAP: 0
contv 681.000 0.667 3.000 ONL: 677, HAR: 521, A G: 165, IAP: 0
conjudge 709.000 0.653 3.000 ONL: 706, A G: 400, HAR: 229, IAP: 0
consci 738.000 0.639 3.000 ONL: 683, A G: 544, HAR: 79, IAP: 0
conlegis 697.000 0.659 3.000 ONL: 635, HAR: 587, A G: 125, IAP: 0
conarmy 688.000 0.663 3.000 A G: 720, ONL: 507, HAR: 129, IAP: 0
Code
confidence_panel10 |>
  filter(is.na(conarmy)) |>
  filter(is.na(conclerg)) |>
  filter(is.na(conpress))
Question

There are a lot of missing data in your data set.

How are missing cells related?

Drop rows with missing data in the confidence questions.

What are the dimensions of your data set?

Solution
Code
confidence_data <- confidence_panel10 |> 
  mutate(across(starts_with("con"), .fns=\(x) !is.na(x), .names="not_{.col}_isna")) |>
  rowwise() |>
  filter(all(c_across(ends_with("isna")))) |> 
  ungroup() |> 
  select(-ends_with("isna"), -firstid, -wave, -econsci) |>
  mutate(across(starts_with("con"), fct_drop), sex=as_factor(sex))

confidence_data |> 
  slice_sample(n=10) |> 
  gt::gt()
id_1: id sex_1: RESPONDENTS SEX confinan_1: CONFID IN BANKS & FINANCIAL INSTITUTIONS conbus_1: CONFIDENCE IN MAJOR COMPANIES conclerg_1: CONFIDENCE IN ORGANIZED RELIGION coneduc_1: CONFIDENCE IN EDUCATION confed_1: CONFID. IN EXEC BRANCH OF FED GOVT conlabor_1: CONFIDENCE IN ORGANIZED LABOR conpress_1: CONFIDENCE IN PRESS conmedic_1: CONFIDENCE IN MEDICINE contv_1: CONFIDENCE IN TELEVISION conjudge_1: CONFID. IN UNITED STATES SUPREME COURT consci_1: CONFIDENCE IN SCIENTIFIC COMMUNITY conlegis_1: CONFIDENCE IN CONGRESS conarmy_1: CONFIDENCE IN MILITARY
1894 FEMALE ONLY SOME ONLY SOME A GREAT DEAL A GREAT DEAL ONLY SOME ONLY SOME HARDLY ANY A GREAT DEAL ONLY SOME A GREAT DEAL A GREAT DEAL ONLY SOME ONLY SOME
1639 FEMALE ONLY SOME ONLY SOME ONLY SOME A GREAT DEAL ONLY SOME ONLY SOME ONLY SOME A GREAT DEAL ONLY SOME ONLY SOME ONLY SOME ONLY SOME A GREAT DEAL
718 MALE HARDLY ANY ONLY SOME HARDLY ANY ONLY SOME HARDLY ANY ONLY SOME HARDLY ANY A GREAT DEAL ONLY SOME HARDLY ANY ONLY SOME HARDLY ANY HARDLY ANY
1101 FEMALE ONLY SOME ONLY SOME HARDLY ANY ONLY SOME A GREAT DEAL HARDLY ANY A GREAT DEAL ONLY SOME A GREAT DEAL A GREAT DEAL ONLY SOME ONLY SOME A GREAT DEAL
1120 FEMALE ONLY SOME ONLY SOME ONLY SOME ONLY SOME ONLY SOME ONLY SOME HARDLY ANY ONLY SOME ONLY SOME ONLY SOME ONLY SOME ONLY SOME ONLY SOME
1417 FEMALE ONLY SOME ONLY SOME HARDLY ANY ONLY SOME HARDLY ANY ONLY SOME HARDLY ANY HARDLY ANY HARDLY ANY ONLY SOME A GREAT DEAL ONLY SOME A GREAT DEAL
1454 MALE HARDLY ANY ONLY SOME HARDLY ANY ONLY SOME HARDLY ANY ONLY SOME HARDLY ANY A GREAT DEAL HARDLY ANY A GREAT DEAL A GREAT DEAL HARDLY ANY A GREAT DEAL
1598 MALE HARDLY ANY ONLY SOME ONLY SOME A GREAT DEAL ONLY SOME ONLY SOME ONLY SOME ONLY SOME ONLY SOME ONLY SOME ONLY SOME HARDLY ANY A GREAT DEAL
62 FEMALE A GREAT DEAL A GREAT DEAL ONLY SOME A GREAT DEAL A GREAT DEAL A GREAT DEAL A GREAT DEAL A GREAT DEAL ONLY SOME A GREAT DEAL A GREAT DEAL A GREAT DEAL A GREAT DEAL
1157 MALE A GREAT DEAL A GREAT DEAL A GREAT DEAL A GREAT DEAL ONLY SOME ONLY SOME HARDLY ANY A GREAT DEAL ONLY SOME A GREAT DEAL A GREAT DEAL A GREAT DEAL A GREAT DEAL
Question

Explore possible associations between the answers to the different confidence questions.

How would you test possible independence between the answers concerning confidence in Science and confidence in the Army?

Solution

TODO
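In the meantime, here is a minimal sketch using chisq.test() on simulated data (the two factors below are made up; in the lab, one would cross the consci and conarmy columns of confidence_data instead).

```r
# Simulated answers on the 3-level confidence Likert scale (made-up data).
set.seed(42)
lev <- c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY")
consci  <- factor(sample(lev, 500, replace = TRUE), levels = lev)
conarmy <- factor(sample(lev, 500, replace = TRUE), levels = lev)

# Cross-tabulate the two factors, then run Pearson's chi-squared test
# of independence on the 3 x 3 contingency table.
tab <- table(consci, conarmy)
chisq.test(tab)
```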

The case for using MCA

In order to construct a field of ideological and political attitudes, which will subsequently be used as a reference for locating the empirical typologies of response styles and survey compliance, we apply multiple correspondence analysis (MCA). MCA belongs to the family of techniques used in geometric data analysis (Le Roux and Rouanet 2004). It allows for the extraction of the most important dimensions in a set of categorical variables, and the graphical representation of variable categories and individuals relative to each other in a coordinate system. Distances between categories as well as individuals can be interpreted as a measure of (dis)similarity: If categories often co-appear in individual’s responses, they are located close together in the space produced by MCA. Rare co-appearances, accordingly, result in a larger distance between the respective categories. Furthermore, illustrative variables can passively be projected into the field, a technique that has been termed visual regression analysis (Lebart et al. 1984). Whereas the space is determined by the distances between the categories of active variables, passive variables do not alter the structure of the constructed field, but appear in their average and hence most likely position.

From https://doi.org/10.1007/s11135-016-0458-3

In this lab, we won’t look at the field of ideological and political attitudes, but rather at the field of confidence level in a variety of institutions.

Multiple Correspondence Analysis

MCA executive summary

The input of multiple correspondence analysis is a data frame \(X\) with \(n\) rows and \(p\) categorical columns. Multiple Correspondence Analysis starts by building the indicator matrix. The indicator matrix is built by one-hot encoding of each categorical variable.

  • A categorical variable \(V_j\) (factor) with \(q\) levels is mapped to \(q\) \(\{0,1\}\)-valued variables \(V_{j,r}\) for \(r \leq q\)

  • If levels are indexed by \(\{1, \ldots, q\}\) and the value of the categorical variable \(V_j\) in row \(i\) is \(k \in \{1, \ldots, q\}\), the binary encoding of this value is \[k \mapsto \underbrace{0,\ldots, 0}_{k-1}, 1, \underbrace{0, \ldots, 0}_{q-k}\]

  • The indicator matrix has as many rows as the data matrix

  • The number of columns of the indicator matrix is the sum of the number of levels of the categorical variables/columns of the data matrix

  • The indicator matrix is a numerical matrix. It is suitable for factorial methods
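The one-hot encoding step can be sketched with base R alone. The toy data frame below is made up for illustration; stats::model.matrix() performs the encoding of each factor, and FactoMineR::tab.disjonctif() does the same job in one call.

```r
# Toy data frame standing in for the categorical columns of X (made-up data).
toy <- data.frame(
  a = factor(c("x", "y", "x", "z")),
  b = factor(c("u", "u", "v", "v"))
)

# One-hot encode each factor with model.matrix(~ 0 + v), then bind the
# blocks column-wise to obtain the indicator matrix Z.
Z <- do.call(cbind, lapply(toy, \(v) stats::model.matrix(~ 0 + v)))

dim(Z)      # 4 rows (individuals), 3 + 2 = 5 columns (one per level)
rowSums(Z)  # every row sums to p = 2, the number of variables
```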

Recall \(X\) is the data matrix with \(n\) rows (individuals) and \(p\) categorical columns (variables)

For \(j \in \{1, \ldots, p\}\), let \(J_j\) denote the number of levels (categories) of variable \(j\)

Let \(q = \sum_{j\leq p} J_j\) be the sum of the number of levels throughout the variables

Let \(Z\) be the indicator matrix with \(n\) rows and \(q\) columns

For \(j\leq p\) and \(k \leq J_j\), let \(\langle j, k\rangle = \sum_{j'<j} J_{j'}+k\)

Let \(N = n \times p = \sum_{i\leq n} \sum_{m \leq q} Z_{i,m}\) and \[P = \frac{1}{N} Z\]

(the correspondence matrix for MCA)

The row-wise sums of the correspondence matrix \(P\) are all equal to \(1/n=p/N\)

The column-wise sum of the correspondence matrix \(P\) for the \(k\)th level of the \(j\)th variable of \(X\) ( \(j \leq p\) ) is \[N_{\langle j,k\rangle}/N = f_{\langle j,k\rangle}/p\]

where \(f_{\langle j,k\rangle}\) stands for the relative frequency of level \(k\) of the \(j\)th variable

\[D_r = \frac{1}{n}\text{Id}_n\qquad D_c =\text{diag}\left(\frac{f_{\langle j,k\rangle}}{p}\right)_{j \leq p, k\leq J_j}\]

In MCA, we compute the SVD \(U \times D \times V^\top\) of the standardized residuals matrix:

\[S = D_r^{-1/2}\times \left(P - r\times c^\top\right) \times D_c^{-1/2} = \sqrt{n}\left(P - r\times c^\top\right) \times D_c^{-1/2}\]

Coefficient \(i, \langle j, k\rangle\) of \(S\) is \[\frac{\mathbb{I}_{i, \langle j, k\rangle}- f_{\langle j,k\rangle}}{\sqrt{n p f_{\langle j,k\rangle}}}\]

MCA consists in computing the SVD of the standardized residuals matrix \(S = U \times D \times V^\top\)

From the SVD, we get

  • \(D_r^{-1/2} \times U\) standardized coordinates of rows
  • \(D_c^{-1/2} \times V\) standardized coordinates of columns
  • \(D_r^{-1/2} \times U \times D\) principal coordinates of rows
  • \(D_c^{-1/2} \times V \times D\) principal coordinates of columns
  • Squared singular values: the principal inertia

When calling svd(.), the argument should be \[D_r^{1/2}\times \left(D_r^{-1} \times P \times D_c^{-1}- \mathbf{1}\times \mathbf{1}^\top \right)\times D_c^{1/2}= D_r^{-1/2}\times \left( P - r \times c^\top \right)\times D_c^{-1/2}\] where \(\mathbf{1}\) denotes the all-ones vector.
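To make the recipe concrete, here is a self-contained base-R sketch of MCA "by hand" on a tiny, made-up indicator matrix (not the GSS data), following the formulas above.

```r
# Toy indicator matrix: n = 4 individuals, p = 2 categorical variables
# with 3 and 2 levels (q = 5 columns)
Z <- matrix(c(1,0,0, 1,0,
              0,1,0, 0,1,
              1,0,0, 0,1,
              0,0,1, 1,0), nrow = 4, byrow = TRUE)
n <- nrow(Z); p <- 2; N <- n * p

P  <- Z / N          # correspondence matrix
rw <- rowSums(P)     # row masses, all equal to 1/n
cw <- colSums(P)     # column masses f_<j,k> / p

# Standardized residuals matrix S = D_r^{-1/2} (P - r c^T) D_c^{-1/2}
S  <- diag(1/sqrt(rw)) %*% (P - rw %*% t(cw)) %*% diag(1/sqrt(cw))
sv <- svd(S)

row_pc <- diag(1/sqrt(rw)) %*% sv$u %*% diag(sv$d)  # principal coords of rows
col_pc <- diag(1/sqrt(cw)) %*% sv$v %*% diag(sv$d)  # principal coords of columns
sv$d^2                                              # principal inertias
```

A useful sanity check: on an indicator matrix the principal inertias always sum to \((q-p)/p\), here \(3/2\), whatever the data.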

MCA and extended SVD

As

\[D_r^{-1} \times P \times D_c^{-1} - \mathbf{1}\mathbf{1}^\top = (D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top\]

\[(D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top\]

is the extended SVD of

\[D_r^{-1} \times P \times D_c^{-1} - \mathbf{1}\mathbf{1}^\top\]

with respect to \(D_r\) and \(D_c\)

Question

Perform MCA on the indicator matrix.

You may use FactoMineR::MCA(). It is interesting to compute the correspondence analysis in your own way, by preparing the matrix that is handed to svd() and returning a named list containing all relevant information.

Solution

We lazily rely on FactoMineR::MCA.

Code
res.mca <- confidence_data |> 
  select(starts_with("con")) |> 
  FactoMineR::MCA(graph=F)

The result is an object of S3 class MCA.


Question

If you did use FactoMineR::MCA(), explain the organization of the result.

Solution

The result of FactoMineR::MCA(...) is a named and nested list with five elements:

eig
a matrix/array containing enough information to build a screeplot.
call
a list of 9, containing the call to MCA(), an object of type language, telling (in principle) the user how MCA() was called. However, this is a quoted expression: we may need to inspect the calling environment to understand what's going on.

Element call also contains the table margin distributions marge.col and marge.row. The truncation rank ncp (number of components) can be assigned before computing the SVD (default value is 5). Element \(X\) stores the table that was effectively used for computing Multiple Correspondence Analysis.

ind
Information gathered from the SVD to facilitate individuals (respondents) analysis.
var
a list structured in the same way as element ind. Used for variables/categories profiles analysis
svd
a list of 3, just as the result of svd(), containing the singular values, the left and right singular vectors of matrix \(...\)
Solution, output of print(res.mca)
Name Description
1 $eig “eigenvalues”
2 $var “results for the variables (categories)”
3 $var$coord “coord. of the categories”
4 $var$cos2 “cos2 for the categories”
5 $var$contrib “contributions of the categories”
6 $var$v.test “v-test for the categories”
7 $ind “results for the individuals”
8 $ind$coord “coord. for the individuals”
9 $ind$cos2 “cos2 for the individuals”
10 $ind$contrib “contributions of the individuals”
11 $quali.sup “results for the supplementary categorical variables”
12 $quali.sup$coord “coord. for the supplementary categories”
13 $quali.sup$cos2 “cos2 for the supplementary categories”
14 $quali.sup$v.test “v-test for the supplementary categories”
15 $call “intermediate results”
16 $call$marge.col “weights of columns”
17 $call$marge.li “weights of rows”

Note that res.mca$svd is not listed in the printed output!

  • eig is computed from the singular values in res.mca$svd

  • var contains material for plotting information about categories and variables on factorial planes

  • ind contains material for plotting information about individuals on factorial planes

In principle, all relevant information can be gathered from components svd, call$marge.li, and call$marge.col.

Screeplots

Question

Draw screeplots. Why are they useful? Comment briefly.

Solution

We first define a tidying function for preparing the screeplot. It is designed along the lines of method tidy for objects of class prcomp. It consists in building a data frame from component eig of the MCA object.

Code
tidy_mca_scree <- function(res_mca){
  res_mca$eig |> 
    as_tibble(rownames="Dim") |>
    rename(
      `% Inertia` = `percentage of variance`,
      `Cumulative % Inertia` = `cumulative percentage of variance`
    ) |> 
    mutate(
      PC = as.integer(str_match(Dim, 'dim ([0-9]*)')[,2])
    )
}

We can now build a column plot.

Code
res.mca |> 
  tidy_mca_scree() |>
  ggplot() +
  aes(x=PC, y=eigenvalue, label=glue("{round(`% Inertia`,2)} %")) +
  geom_col() +
  geom_text(angle=45, vjust=-1, hjust=-.1) 

Note that the share of cumulated inertia doesn't grow very fast with the number of components. The situation differs from what we met when performing Correspondence Analysis or Principal Component Analysis. This tells us that we will have to look beyond the first two axes. This is commonplace with multiple correspondence analysis. See the reference below, where most of the interesting information is found on the 3rd and 4th axes.

Individuals

Question

Perform Individual profiles analysis.

What are the classical plots? How can you build them from the output of FactoMineR::MCA?

Build the tables of row contributions and of squared cosines (the so-called \(\cos^2\))

Solution

Attribute ind of objects of class MCA (exported from FactoMineR) is the starting point of any row profiles analysis.

We design an augment method for objects of class MCA. This mirrors what is done for objects of class prcomp and differs from what we did with objects of class CA. Indeed, MCA on the indicator matrix handles a sample just like PCA, not a contingency table like CA.

Code
augment_mca <- function(res_mca, data=NULL){
  
  result <- res_mca$ind$coord |> 
    as_tibble()  
  
  if (!is.null(data)) {
    result <- result |>
      bind_cols(data)
  }
  
  return(result)
}
Solution

Attribute ind is a named list made of \(4\) components.

coord
a matrix with named rows and columns. The number of rows of coord matches the number of rows of the table fed to MCA (here, the number of respondents in the panel). The number of columns matches the rank of the truncated SVD that underlies Multiple Correspondence Analysis. Here it is the default truncation \(5\).

Matrix coord stores the principal coordinates of each individual along the principal axes.

The columns of coord are pairwise orthogonal in the usual inner product space.

Function augment_mca binds the columns of coord with the columns of the original dataset.

Code
#| label: augment_mca_glimpse
res.mca |> 
  augment_mca(confidence_data) |> 
  glimpse()
Rows: 1,183
Columns: 20
$ `Dim 1`  <dbl> 1.35465733, -0.10861408, -0.33683136, 0.72450001, 0.72525336,…
$ `Dim 2`  <dbl> 0.41190167, -0.90245225, -0.45183706, 0.25425970, 0.10225521,…
$ `Dim 3`  <dbl> 0.161371575, 0.010268886, 0.107542556, -0.009273224, -0.38139…
$ `Dim 4`  <dbl> 0.141129674, 0.310980059, -0.097880877, -0.085996832, 0.36106…
$ `Dim 5`  <dbl> 0.227285260, 0.086562193, 0.016680275, -0.115998402, -0.06681…
$ id       <dbl+lbl>  4,  6,  7,  9, 11, 15, 19, 20, 21, 24, 26, 28, 31, 41, 4…
$ sex      <dbl+lbl> 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, …
$ confinan <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, HARDLY ANY, ONLY SOME, ON…
$ conbus   <fct> ONLY SOME, ONLY SOME, ONLY SOME, ONLY SOME, ONLY SOME, ONLY S…
$ conclerg <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, A GREAT DEAL, A GREAT DEA…
$ coneduc  <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, ONLY SOME, A GREAT DEAL, …
$ confed   <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, A GREAT DEAL, A GREAT DEA…
$ conlabor <fct> A GREAT DEAL, ONLY SOME, HARDLY ANY, ONLY SOME, HARDLY ANY, O…
$ conpress <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, HARDLY ANY, ONLY SOME, HA…
$ conmedic <fct> A GREAT DEAL, ONLY SOME, HARDLY ANY, A GREAT DEAL, ONLY SOME,…
$ contv    <fct> ONLY SOME, ONLY SOME, HARDLY ANY, ONLY SOME, ONLY SOME, ONLY …
$ conjudge <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, A GREAT DEAL, A GREAT DEA…
$ consci   <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, A GREAT DEAL, ONLY SOME, …
$ conlegis <fct> ONLY SOME, ONLY SOME, ONLY SOME, A GREAT DEAL, A GREAT DEAL, …
$ conarmy  <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, A GREAT DEAL, A GREAT DEA…

Note that we can recover ind$coord from the left singular vectors and the singular values.
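To illustrate this claim with plain svd() (on a toy indicator matrix, not on res.mca itself), recall from the formulas above that the principal row coordinates are \(D_r^{-1/2} \times U \times D = \sqrt{n}\, U D\); their Gram matrix is diagonal, which also shows that the columns of coord are pairwise orthogonal.

```r
# Toy indicator matrix (same computation pattern as above, not GSS data)
Z  <- matrix(c(1,0,0, 1,0,
               0,1,0, 0,1,
               1,0,0, 0,1,
               0,0,1, 1,0), nrow = 4, byrow = TRUE)
n  <- nrow(Z); p <- 2; N <- n * p
P  <- Z / N
rw <- rowSums(P); cw <- colSums(P)

# Since D_r = (1/n) Id, D_r^{-1/2} is just sqrt(n) times the identity
S  <- sqrt(n) * (P - rw %*% t(cw)) %*% diag(1/sqrt(cw))
sv <- svd(S)

coord <- sqrt(n) * sv$u %*% diag(sv$d)  # principal row coordinates
crossprod(coord)                        # = n * D^2: a diagonal matrix, so
                                        # the columns of coord are orthogonal
```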

The output of augment_mca is the basis of the individual plots. Individual plots are defined by picking a pair of principal axes. We use the 3rd and the 4th axes, and we map the sex column from the original dataset to the color aesthetic.

Code
res.mca |> 
  augment_mca(confidence_data) |> 
  ggplot() +
  aes(x=`Dim 3`, y=`Dim 4`, color=sex) +
  geom_point(alpha=.2, size=.2) +
  coord_fixed() +
  scale_color_viridis_d()

Solution
cos2
Coefficients of matrix cos2 give, for each individual, the share of its inertia captured by each principal axis: they measure the quality of representation of the individual on that axis.
Code
res.mca |> 
  augment_mca(confidence_data) |> 
  ggplot() +
  aes(x=`Dim 1`, y=`Dim 2`, color=sex) +
  geom_point(alpha=.2, size=.2) +
  coord_fixed() +
  scale_color_viridis_d()

Solution
contrib
Coefficients of matrix contrib give the contribution of each individual to the inertia carried by each principal axis.

Not too surprisingly, coord, contrib, and cos2 share the same row names and column names.

Variables/Categories

Question

Perform column profiles (categories) analysis. You may use factoextra::fviz_mca_var()

Solution

The var component of an MCA object is related to the right singular vectors of the residuals matrix.

Code
p <- res.mca |> 
  factoextra::fviz_mca_var() +
  coord_fixed() 
  
p +
  ggforce::facet_zoom(
    xy=`Dim 1`>=0 & `Dim 2`>=0,
    xlim=c(0,2), ylim=c(0,2))

Following what has been done for objects of class prcomp, we design a tidy_mca_var function that turns the var component into a dataframe. Row names, which correspond to categories, are turned into a column we call levels.

Code
tidy_mca_var <- function(res_mca){
  res_mca$var$coord |> 
  as_tibble(rownames="levels")
}

A value in column levels identifies a variable, that is a topic (science, religion, army, …), together with a confidence level (a value on a Likert scale). We enrich the data frame by extracting the variable (topic) and the value (sentiment).

We are then ready to plot an analog of the correlation circle.

Code
tidied_mca_var <- tidy_mca_var(res.mca) |>   
  mutate(sentiment=str_split(as.character(levels),'_')) |>
  mutate(topic=map_chr(sentiment, \(x)x[1]), sentiment=map_chr(sentiment, \(x)x[2])) 

p <- tidied_mca_var |> 
  ggplot() +
    aes(x=`Dim 1`, y=`Dim 2`, label=levels, color=topic, shape=sentiment) +
    geom_point(alpha=.5, show.legend = F) +
    coord_fixed() +
    scale_color_viridis_d() +
    theme(legend.position="none") +
    labs(
      title = "MCA: Variables categories",
      subtitle = "Color: topic, Shape: confidence level",
      caption = "GSS data, panel 2010, wave 1"
    )

p |>
  plotly::ggplotly() 

Resorting to plotly allows us to mitigate the overplotting effect.

The plot on the first two axes seems easy to interpret. As we move along the first axis from left to right, we see growing confidence.
On the positive part of the second axis, we find extreme confidence levels, either high or low. This suggests that individuals exhibit global response tendencies: either they trust all institutions, or none of them, or they trust all of them a little.

How could we substantiate such a narrative?

Code
(
 p + 
    aes(x=`Dim 3`, y=`Dim 4`, label=levels, color=sentiment) 
) |>   plotly::ggplotly() 
Code
(
  p +
    aes(x=`Dim 4`, y=`Dim 5`, label=levels) 
) |>  plotly::ggplotly() 
Question

What is the v.test component of the var component of an MCA object?

Solution

Here is a chatbot answer:

In Multiple Correspondence Analysis (MCA), the v-test (also known as the test-value or t-test value) is a statistical measure used to assess the significance of the association between a category and a dimension (factor). It helps determine which categories contribute the most to the definition of a given dimension.

Definition and Calculation

The v-test is computed as:

\[v = \frac{p - P}{\sqrt{P(1 - P)/n}}\]

where:

  • \(p\) = proportion of individuals in a given category for the specific dimension,
  • \(P\) = overall proportion of individuals in that category across all dimensions,
  • \(n\) = total number of individuals.

Alternatively, the v-test can also be expressed as:

\[v = \frac{\text{category coordinate}}{\text{standard deviation of the coordinate}}\]

This essentially provides a z-score, which follows a standard normal distribution under the null hypothesis.

Interpretation

  • A high absolute v-test value (typically above 2 or below -2) indicates that the category is strongly associated with the given dimension.
  • Positive v-test: The category is overrepresented in that dimension.
  • Negative v-test: The category is underrepresented in that dimension.
  • If the absolute value of the v-test is below 2, the association is not statistically significant (at a 5% significance level).

Why is it useful?

  • It helps interpret MCA results quantitatively rather than relying only on visual inspection.
  • It aids in identifying key categories that structure the data on a given axis.
  • It provides statistical validation for category contributions to dimensions.


Symmetric plots

Question

Build the symmetric plots (biplots) for multiple correspondence analysis.

From the shelf

We resort to factoextra. In principle, the biplot can be constructed from the coord matrices in the var and the ind components of the MCA object.

Code
p <- res.mca |> 
  factoextra::fviz_mca_biplot(
    label= "var",
    alpha.ind=.5
  )

p

Code
# p |>  plotly::ggplotly()
Solution

Mosaicplots

MCA can be complemented by Mosaicplots, Double Decker plots, Chi-square tests, and Correspondence Analyses between pairs of variables.

Question

Draw a mosaic plot to visualize the association between confidence levels in Science and confidence levels in Medicine.

Solution
Code
confidence_data |> 
  select(consci, conmedic) |>
  table() |> 
  mosaicplot(color=T)

Code
confidence_data |> 
  select(consci, conmedic) |> 
  table() |>
  chisq.test() |> 
  broom::glance() |> 
  gt::gt() |> 
  gt::tab_caption("Chi2 independence test for Confidence in Science and Confidence in Medicine")
Chi2 independence test for Confidence in Science and Confidence in Medicine
statistic p.value parameter method
137.8878 7.994493e-29 4 Pearson's Chi-squared test
Code
confidence_data |> 
  select(consci, conclerg) |> 
  table() |>
  chisq.test() |> 
  broom::glance() |> 
  gt::gt() |> 
  gt::tab_caption("Chi2 independence test for Confidence in Science and Confidence in Clergy")
Chi2 independence test for Confidence in Science and Confidence in Clergy
statistic p.value parameter method
39.27785 6.103953e-08 4 Pearson's Chi-squared test
Code
t_sex_clerg_sci <- confidence_data |> 
  select(sex, conclerg, consci) |> 
  table()
  
t_sex_clerg_sci  |>
  vcd::doubledecker( )

Code
t_sex_clerg_sci |> 
  mosaicplot(color=T, main="Confidence in Clergy and Science")

Further references

Barth, Alice and Schmitz, Andreas. 2018. Response quality and ideological dispositions: an integrative approach using geometric and classifying techniques. Quality & Quantity

When analyzing survey data, response quality has consequential implications for substantial conclusions. Differences in response quality are usually explained by personality, or socio-demographic or cognitive characteristics. Little, however, is known about how respondents’ political attitudes, values, and opinions impact on quality aspects. This is a striking analytical omission, as potential associations between political values and various forms of response biases and artefacts call into question surveys’ ability to represent ‘public opinion’. In this contribution, response quality is traced back to respondents’ political and ideological dispositions. For this purpose, a relational understanding of response quality is applied that takes into account different aspects of response behaviors, as well as the interrelations between these indicators. Using data from the US General Social Survey (2010–2014), an empirical typology of response quality is created via finite mixture analysis. The resulting classes are then related to positions in the US field of ideological dispositions constructed via multiple correspondence analysis. The analyses reveal that there are (1) different combinations of response patterns and thus different empirical response types, and (2) that these types of response quality systematically vary with regard to the respondents’ political and ideological (dis)positions. Implications of the findings for public opinion surveys are discussed.