LAB: Multiple Correspondence Analysis (MCA)

Published

March 28, 2025

M1 MIDS/MFA/LOGOS

Université Paris Cité

Année 2024

Course Homepage

Moodle

Besides the usual packages (tidyverse, …), we shall require FactoMineR and related packages.

The GSS dataset

We will use data coming from the General Social Survey (GSS). The data can be retrieved using the gssr package. If needed, install the gssr package and its companion package gssrdoc.

Code
Loading required package: gssr
Package loaded. To attach the GSS data, type data(gss_all) at the console.
For the codebook, type data(gss_dict).
For the panel data and documentation, type e.g. data(gss_panel08_long) and data(gss_panel_doc).
For help on a specific GSS variable, type ?varname at the console.
Loading required package: gssrdoc

The data we will use are panel data made available by the GSS. In order to explore them, it’s very useful to load gss_dict and gss_panel_doc.

Code
data(gss_dict)
data(gss_panel_doc)

Check the online help.

Code
?gss_panel_doc

gss_dict and gss_panel_doc are dataframes that can be queried:

Code
gss_panel_doc |> 
  dplyr::sample_n(5) |> 
  select(id, text)  |> 
  gt::gt()
id text
relhhd8 1622. What is (PERSON)'s relationship to (HEAD OF HOUSEHOLD)?
denom16 115a. If Protestant: What specific denomination is that, if any?
conjudge far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? j. U.S. Supreme Court.
wordh 1612. We would like to know something about how people go about guessing words they do not know. On this card are listed some words--you may know some of them, and you may not know quite a few of them. On each line the first word is in capital letters like BEAST. Then there are five other words. Tell me the number of the word that comes closest to the meaning of the word in capital letters. For example, if the word in capital letters is BEAST, you would say "4" since "animal" come closer to BEAST than any of the other words. If you wish, I will read the words to you. These words are difficult for almost everyone just give me your best guess if you are not sure of the answer.
compwage 1087. Do you believe your fixed annual wages in calendar year 2005 were higher or lower than those of employees with similar experience and job descriptions in other companies in your region? Please answer on a 1 to 5 scale.

In the panel questionnaire, some questions contain race in their id field. Check the online help again.

Code
?race

The answer looks something like this:

Question 24. What race do you consider yourself?

It also contains a brief summary of the answers given through the years.

Year Black Other White iap (NA) Total
2010 311 183 1550 - 2044
2012 301 196 1477 - 1974
2014 386 262 1890 - 2538

(this is just an excerpt)

Code
gss_panel_doc |> 
  filter(str_detect(id, "race")) |>
  slice_sample(n=1, by=text) |> 
  select(id, description, text) |> 
  gt::gt()
id description text
race RACE 24. What race do you consider yourself?
racecen3 RACECEN3 1602. What is your race? Indicate one or more races that you consider yourself to be.
natrace NATRACE 68. We are faced with many problems in this country, none of some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount. h. Improving the conditions of Blacks.
natracey NATRACEY 69. We are faced with many problems in this country, none of some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount. h. Assistance to blacks.
intrace3 INTRACE3 What is your race? Indicate one or more races that you consider yourself to be.

The data set we will use comes from the 2010 panel data.

Code
data("gss_panel10_long")

gss_panel10_long 
# A tibble: 6,132 × 1,200
   firstid  wave oversamp sampcode  sample      form         formwt vpsu  vstrat
   <fct>   <dbl>    <dbl> <dbl+lbl> <dbl+lbl>   <dbl+lbl>     <dbl> <dbl> <dbl+>
 1 1           1        1 501       9 [2000 FP] 2 [ALTERNAT…      1 1     2240  
 2 1           2        1 501       9 [2000 FP] 2 [ALTERNAT…      1 1     2240  
 3 1           3        1 501       9 [2000 FP] 2 [ALTERNAT…      1 1     2240  
 4 2           1        1 501       9 [2000 FP] 1 [STANDARD…      1 1     2240  
 5 2           2        1 501       9 [2000 FP] 1 [STANDARD…      1 1     2240  
 6 2           3        1 501       9 [2000 FP] 1 [STANDARD…      1 1     2240  
 7 3           1        1 501       9 [2000 FP] 1 [STANDARD…      1 2     2240  
 8 3           2        1 501       9 [2000 FP] 1 [STANDARD…      1 2     2240  
 9 3           3        1 501       9 [2000 FP] 1 [STANDARD…      1 2     2240  
10 4           1        1 501       9 [2000 FP] 2 [ALTERNAT…      1 2     2240  
# ℹ 6,122 more rows
# ℹ 1,191 more variables: samptype <dbl+lbl>, wtpan12 <dbl+lbl>,
#   wtpan123 <dbl+lbl>, wtpannr12 <dbl+lbl>, wtpannr123 <dbl+lbl>,
#   id <dbl+lbl>, mar1 <dbl+lbl>, mar2 <dbl+lbl>, mar3 <dbl+lbl>,
#   mar4 <dbl+lbl>, mar5 <dbl+lbl>, mar6 <dbl+lbl>, mar7 <dbl+lbl>,
#   mar8 <dbl+lbl>, mar9 <dbl+lbl>, mar10 <dbl>, mar11 <dbl+lbl>, mar12 <dbl>,
#   mar13 <dbl>, mar14 <dbl>, abany <dbl+lbl>, abdefect <dbl+lbl>, …

At the beginning, the panel comprised roughly 2000 respondents, who were interviewed in 2010, 2012, and 2014 (the three waves).

Code
gss_panel10_long  |>  
  filter(wave==3, !is.na(id)) |>
  select(firstid, wave, id, sex)
# A tibble: 1,304 × 4
   firstid  wave id        sex       
   <fct>   <dbl> <dbl+lbl> <dbl+lbl> 
 1 1           3 10001     1 [MALE]  
 2 2           3 10002     2 [FEMALE]
 3 3           3 10003     2 [FEMALE]
 4 6           3 10004     1 [MALE]  
 5 7           3 10005     2 [FEMALE]
 6 9           3 10006     2 [FEMALE]
 7 10          3 10007     2 [FEMALE]
 8 11          3 10008     2 [FEMALE]
 9 12          3 10009     1 [MALE]  
10 13          3 10010     1 [MALE]  
# ℹ 1,294 more rows

Some respondents left the survey. Attrition can be monitored with the following query.

Code
gss_panel10_long |> 
  select(wave, id) |>
  group_by(wave) |>
  summarize(observed = n_distinct(id),
            missing = sum(is.na(id)))
# A tibble: 3 × 3
   wave observed missing
  <dbl>    <int>   <int>
1     1     2044       0
2     2     1552     493
3     3     1305     740

The confidence topic

GSS surveys consist of a huge number of questions. Not all questions were asked of every respondent; each question was asked of two thirds of the respondents. Some questions concern demographic features (age, sex, level of education, employment, and so on). Answers to these questions can be used to check whether the panel sample is representative: the data can be compared with the latest census data (there is a census every ten years in the US).

A number of questions are related to the confidence topic. Respondents were asked about the level of confidence they put in a wide array of institutions.

Code
gss_panel_doc |> 
  filter(str_detect(text, "confidence")) |>
  slice_sample(n=1, by=text) |> 
  select(id, description, text) |> 
  gt::gt()
id description text
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions.
conbus CONBUS far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? b. Major companies.
conclerg CONCLERG far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? c. Organized religion.
coneduc CONEDUC far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? d. Education.
confed CONFED far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? e. Executive branch of the federal government.
conlabor CONLABOR far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? f. Organized labor.
conpress CONPRESS far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? g. Press.
conmedic CONMEDIC far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? h. Medicine.
contv CONTV far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? i. Television.
conjudge CONJUDGE far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? j. U.S. Supreme Court.
consci CONSCI far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? k. Scientific community.
conlegis CONLEGIS far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? l. Congress
conarmy CONARMY far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? m. Military

For each institution (Science, Medicine, the Executive branch of the federal government, and so on), respondents were asked whether they have a great deal of confidence, only some confidence, or hardly any confidence in it. The same 3-level Likert scale was used for all institutions.

Question

From the gss_panel10_long dataset, extract the columns corresponding to questions from the confidence topic.

Solution

We first gather the id of the questions pertaining to the confidence topic.

Code
confidence_ids <- gss_panel_doc |> 
  filter(str_detect(text, "confidence")) |>
  slice_sample(n=1, by=text) |> 
  pull(id)

Then we project the panel data on the questions pertaining to the confidence topic. We also keep the respondent key, sex, and the relevant wave.

Code
gss_panel10_long |> 
  filter(wave==1) |>
  select(firstid, wave, id, sex, contains(confidence_ids)) |>
  sample_n(10) |>
  gt::gt()
firstid wave id_1: id sex_1: RESPONDENTS SEX confinan_1: CONFID IN BANKS & FINANCIAL INSTITUTIONS conbus_1: CONFIDENCE IN MAJOR COMPANIES conclerg_1: CONFIDENCE IN ORGANIZED RELIGION coneduc_1: CONFIDENCE IN EDUCATION confed_1: CONFID. IN EXEC BRANCH OF FED GOVT conlabor_1: CONFIDENCE IN ORGANIZED LABOR conpress_1: CONFIDENCE IN PRESS conmedic_1: CONFIDENCE IN MEDICINE contv_1: CONFIDENCE IN TELEVISION conjudge_1: CONFID. IN UNITED STATES SUPREME COURT consci_1: CONFIDENCE IN SCIENTIFIC COMMUNITY econsci conlegis_1: CONFIDENCE IN CONGRESS conarmy_1: CONFIDENCE IN MILITARY
1225 1 1225 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
1511 1 1511 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
1153 1 1153 2 3 2 2 1 2 2 2 2 1 2 2 NA 2 1
1585 1 1585 2 3 2 2 2 2 2 2 2 2 2 2 NA 1 3
940 1 940 2 3 2 3 2 3 3 2 1 2 3 2 NA 3 2
110 1 110 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
559 1 559 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
1536 1 1536 1 3 2 2 2 2 2 3 2 3 2 2 NA 3 1
846 1 846 2 3 2 2 3 3 2 3 1 3 2 1 4 3 2
220 1 220 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Let us go back to the documentation.

Code
gss_panel_doc |>
  filter(id %in% confidence_ids) |> 
  select(-ends_with("2"), -ends_with("3")) |>
  gt::gt()
id description text properties_1 marginals_1
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/265"), c("CONFINAN_1", "CONFINAN_1", "CONFINAN_1") c(30.1, 55.2, 14.7, NA, NA, 100), c("395", "725", "193", "671", "16", "2,000"), c("1", "2", "3", "0", "8", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", "Total"), c("CONFINAN_1", "CONFINAN_1", "CONFINAN_1", "CONFINAN_1", "CONFINAN_1", "CONFINAN_1")
conbus CONBUS far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? b. Major companies. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/247"), c("CONBUS_1", "CONBUS_1", "CONBUS_1") c(17.6, 62.3, 20.1, NA, NA, NA, 100), c("228", "806", "260", "671", "32", "3", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONBUS_1", "CONBUS_1", "CONBUS_1", "CONBUS_1", "CONBUS_1", "CONBUS_1", "CONBUS_1")
conclerg CONCLERG far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? c. Organized religion. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/250"), c("CONCLERG_1", "CONCLERG_1", "CONCLERG_1") c(25.1, 52.5, 22.4, NA, NA, NA, 100), c("324", "678", "289", "671", "35", "3", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONCLERG_1", "CONCLERG_1", "CONCLERG_1", "CONCLERG_1", "CONCLERG_1", "CONCLERG_1", "CONCLERG_1")
coneduc CONEDUC far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? d. Education. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/259"), c("CONEDUC_1", "CONEDUC_1", "CONEDUC_1") c(27.1, 56.5, 16.4, NA, NA, 100), c("359", "747", "217", "671", "6", "2,000"), c("1", "2", "3", "0", "8", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", "Total"), c("CONEDUC_1", "CONEDUC_1", "CONEDUC_1", "CONEDUC_1", "CONEDUC_1", "CONEDUC_1")
confed CONFED far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? e. Executive branch of the federal government. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/262"), c("CONFED_1", "CONFED_1", "CONFED_1") c(13.8, 45.4, 40.8, NA, NA, NA, 100), c("179", "590", "531", "671", "26", "3", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONFED_1", "CONFED_1", "CONFED_1", "CONFED_1", "CONFED_1", "CONFED_1", "CONFED_1")
conlabor CONLABOR far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? f. Organized labor. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/289"), c("CONLABOR_1", "CONLABOR_1", "CONLABOR_1") c(12.3, 58.1, 29.6, NA, NA, 100), c("155", "731", "372", "671", "71", "2,000"), c("1", "2", "3", "0", "8", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", "Total"), c("CONLABOR_1", "CONLABOR_1", "CONLABOR_1", "CONLABOR_1", "CONLABOR_1", "CONLABOR_1")
conpress CONPRESS far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? g. Press. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/298"), c("CONPRESS_1", "CONPRESS_1", "CONPRESS_1") c(10, 48.5, 41.5, NA, NA, NA, 100), c("132", "638", "546", "671", "12", "1", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONPRESS_1", "CONPRESS_1", "CONPRESS_1", "CONPRESS_1", "CONPRESS_1", "CONPRESS_1", "CONPRESS_1")
conmedic CONMEDIC far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? h. Medicine. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/295"), c("CONMEDIC_1", "CONMEDIC_1", "CONMEDIC_1") c(40.1, 49.3, 10.6, NA, NA, 100), c("530", "651", "140", "671", "8", "2,000"), c("1", "2", "3", "0", "8", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", "Total"), c("CONMEDIC_1", "CONMEDIC_1", "CONMEDIC_1", "CONMEDIC_1", "CONMEDIC_1", "CONMEDIC_1")
contv CONTV far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? i. Television. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/328"), c("CONTV_1", "CONTV_1", "CONTV_1") c(9.4, 48.2, 42.3, NA, NA, NA, 100), c("124", "634", "556", "671", "14", "1", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONTV_1", "CONTV_1", "CONTV_1", "CONTV_1", "CONTV_1", "CONTV_1", "CONTV_1")
conjudge CONJUDGE far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? j. U.S. Supreme Court. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/286"), c("CONJUDGE_1", "CONJUDGE_1", "CONJUDGE_1") c(31.8, 50.3, 17.8, NA, NA, NA, 100), c("410", "649", "230", "671", "38", "2", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONJUDGE_1", "CONJUDGE_1", "CONJUDGE_1", "CONJUDGE_1", "CONJUDGE_1", "CONJUDGE_1", "CONJUDGE_1")
consci CONSCI far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? k. Scientific community. c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/319"), c("CONSCI_1", "CONSCI_1", "CONSCI_1") c(41.6, 51, 7.3, NA, NA, NA, 100), c("527", "646", "93", "671", "62", "1", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONSCI_1", "CONSCI_1", "CONSCI_1", "CONSCI_1", "CONSCI_1", "CONSCI_1", "CONSCI_1")
conlegis CONLEGIS far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? l. Congress c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/292"), c("CONLEGIS_1", "CONLEGIS_1", "CONLEGIS_1") c(10.8, 51.2, 37.9, NA, NA, NA, 100), c("141", "666", "493", "671", "28", "1", "2,000"), c("1", "2", "3", "0", "8", "9", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", NA, "Total"), c("CONLEGIS_1", "CONLEGIS_1", "CONLEGIS_1", "CONLEGIS_1", "CONLEGIS_1", "CONLEGIS_1", "CONLEGIS_1")
conarmy CONARMY far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? m. Military c("Data type", "Missing-data codes", "Record/column"), c("numeric", "0,8,9", "1/244"), c("CONARMY_1", "CONARMY_1", "CONARMY_1") c(46.8, 39.8, 13.4, NA, NA, 100), c("613", "521", "175", "671", "20", "2,000"), c("1", "2", "3", "0", "8", NA), c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY", "IAP", "DK", "Total"), c("CONARMY_1", "CONARMY_1", "CONARMY_1", "CONARMY_1", "CONARMY_1", "CONARMY_1")

The level labels for the answers to the confidence questions are:

Code
foo <- gss_panel_doc |>
   filter(id %in% confidence_ids) |> 
   select(-ends_with("2"), -ends_with("3")) |> 
   unnest(
    cols=c(properties_1), names_repair = "universal") |> 
   unnest(
    cols=c(marginals_1), names_repair = "universal") |> 
   head(6) 
New names:
New names:
• `id` -> `id...1`
• `id` -> `id...6`
Code
foo |> gt::gt()
id...1 description text property value...5 id...6 percent n value...9 label id...11
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions. Data type numeric CONFINAN_1 30.1 395 1 A GREAT DEAL CONFINAN_1
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions. Data type numeric CONFINAN_1 55.2 725 2 ONLY SOME CONFINAN_1
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions. Data type numeric CONFINAN_1 14.7 193 3 HARDLY ANY CONFINAN_1
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions. Data type numeric CONFINAN_1 NA 671 0 IAP CONFINAN_1
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions. Data type numeric CONFINAN_1 NA 16 8 DK CONFINAN_1
confinan CONFINAN far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? a. Banks and financial institutions. Data type numeric CONFINAN_1 100.0 2,000 NA Total CONFINAN_1
  • IAP (Inapplicable) means the respondent was not asked the question.
  • DK stands for Don’t Know.

If respondents were not given the opportunity to provide a response to a question because it was not included in the survey they were administered, then their responses are coded as “IAP.”

Code
panel_doc <- gssrdoc::gss_panel_doc

Table wrangling

Before proceeding to Multiple Correspondence Analysis (MCA), let us select the set of active variables.

Question

Project gss_panel10_long on firstid, wave, id, sex, and columns with names in

  confinan conbus conclerg coneduc confed conlabor conpress conmedic contv conjudge consci conlegis conarmy

Filter so as to keep only wave 1.

Transform all relevant columns into factors.

Solution
Code
confidence_panel10 <- gss_panel10_long |> 
  filter(wave==1) |>
  select(firstid, wave, id, sex, contains(confidence_ids)) |>
  mutate(across(starts_with("con"), as_factor)) |>
  mutate(sex=as_factor(sex))
Code
confidence_10 <- gss_panel_doc |>
   filter(id %in% confidence_ids) |> 
   select(-ends_with("2"), -ends_with("3")) |> 
   unnest(
    cols=c(properties_1), names_repair = "universal") |> 
   unnest(
    cols=c(marginals_1), names_repair = "universal") |>
   mutate(text=str_squish(text)) |>
   mutate(text= str_match(text, "\\? [\\w]\\.(.*)")) |> 
   mutate(text=text[,2]) |> 
   filter(property == "Missing-data codes") |>
   select(-starts_with("id"), -property, -`value...5`)
New names:
New names:
• `id` -> `id...1`
• `id` -> `id...6`
Code
levels_table <- confidence_10 |> 
  filter(label!='Total' | is.na(label)) |>
  distinct(`value...9`, label)

Elementary statistics

Question

Use skimr to summarize your dataset.

Solution
Code
confidence_panel10 |>
  skimr::skim(where(is.factor)) |>
  select(-factor.ordered, -skim_type)|>
  filter(str_starts(skim_variable,"con")) |>
  gt::gt() |>
  gt::fmt_number(decimals=3) |>
  gt::tab_caption(
    caption = "GSS Panel data 2010 First wave, Topic confidence"
  )
GSS Panel data 2010 First wave, Topic confidence
skim_variable n_missing complete_rate factor.n_unique factor.top_counts
confinan 680.000 0.667 3.000 ONL: 656, HAR: 564, A G: 144, IAP: 0
conbus 705.000 0.655 3.000 ONL: 848, HAR: 317, A G: 174, IAP: 0
conclerg 725.000 0.645 3.000 ONL: 720, HAR: 328, A G: 271, IAP: 0
coneduc 681.000 0.667 3.000 ONL: 793, A G: 367, HAR: 203, IAP: 0
confed 698.000 0.659 3.000 ONL: 620, HAR: 499, A G: 227, IAP: 0
conlabor 744.000 0.636 3.000 ONL: 781, HAR: 373, A G: 146, IAP: 0
conpress 689.000 0.663 3.000 ONL: 621, HAR: 594, A G: 140, IAP: 0
conmedic 681.000 0.667 3.000 ONL: 647, A G: 558, HAR: 158, IAP: 0
contv 681.000 0.667 3.000 ONL: 677, HAR: 521, A G: 165, IAP: 0
conjudge 709.000 0.653 3.000 ONL: 706, A G: 400, HAR: 229, IAP: 0
consci 738.000 0.639 3.000 ONL: 683, A G: 544, HAR: 79, IAP: 0
conlegis 697.000 0.659 3.000 ONL: 635, HAR: 587, A G: 125, IAP: 0
conarmy 688.000 0.663 3.000 A G: 720, ONL: 507, HAR: 129, IAP: 0
Code
confidence_panel10 |>
  filter(is.na(conarmy)) |>
  filter(is.na(conclerg)) |>
  filter(is.na(conpress))
Question

There are a lot of missing data in your data set.

How are missing cells related?

Drop rows with missing data in the confidence questions.

What are the dimensions of your data set?

Solution
Code
confidence_data <- confidence_panel10 |> 
  mutate(across(starts_with("con"), .fns=\(x) !is.na(x), .names="not_{.col}_isna")) |>
  rowwise() |>
  filter(all(c_across(ends_with("isna")))) |> 
  ungroup() |> 
  select(-ends_with("isna"), -firstid, -wave, -econsci) |>
  mutate(across(starts_with("con"), fct_drop), sex=as_factor(sex))

confidence_data |> 
  slice_sample(n=10) |> 
  gt::gt()
id_1: id sex_1: RESPONDENTS SEX confinan_1: CONFID IN BANKS & FINANCIAL INSTITUTIONS conbus_1: CONFIDENCE IN MAJOR COMPANIES conclerg_1: CONFIDENCE IN ORGANIZED RELIGION coneduc_1: CONFIDENCE IN EDUCATION confed_1: CONFID. IN EXEC BRANCH OF FED GOVT conlabor_1: CONFIDENCE IN ORGANIZED LABOR conpress_1: CONFIDENCE IN PRESS conmedic_1: CONFIDENCE IN MEDICINE contv_1: CONFIDENCE IN TELEVISION conjudge_1: CONFID. IN UNITED STATES SUPREME COURT consci_1: CONFIDENCE IN SCIENTIFIC COMMUNITY conlegis_1: CONFIDENCE IN CONGRESS conarmy_1: CONFIDENCE IN MILITARY
1894 FEMALE ONLY SOME ONLY SOME A GREAT DEAL A GREAT DEAL ONLY SOME ONLY SOME HARDLY ANY A GREAT DEAL ONLY SOME A GREAT DEAL A GREAT DEAL ONLY SOME ONLY SOME
1639 FEMALE ONLY SOME ONLY SOME ONLY SOME A GREAT DEAL ONLY SOME ONLY SOME ONLY SOME A GREAT DEAL ONLY SOME ONLY SOME ONLY SOME ONLY SOME A GREAT DEAL
718 MALE HARDLY ANY ONLY SOME HARDLY ANY ONLY SOME HARDLY ANY ONLY SOME HARDLY ANY A GREAT DEAL ONLY SOME HARDLY ANY ONLY SOME HARDLY ANY HARDLY ANY
1101 FEMALE ONLY SOME ONLY SOME HARDLY ANY ONLY SOME A GREAT DEAL HARDLY ANY A GREAT DEAL ONLY SOME A GREAT DEAL A GREAT DEAL ONLY SOME ONLY SOME A GREAT DEAL
1120 FEMALE ONLY SOME ONLY SOME ONLY SOME ONLY SOME ONLY SOME ONLY SOME HARDLY ANY ONLY SOME ONLY SOME ONLY SOME ONLY SOME ONLY SOME ONLY SOME
1417 FEMALE ONLY SOME ONLY SOME HARDLY ANY ONLY SOME HARDLY ANY ONLY SOME HARDLY ANY HARDLY ANY HARDLY ANY ONLY SOME A GREAT DEAL ONLY SOME A GREAT DEAL
1454 MALE HARDLY ANY ONLY SOME HARDLY ANY ONLY SOME HARDLY ANY ONLY SOME HARDLY ANY A GREAT DEAL HARDLY ANY A GREAT DEAL A GREAT DEAL HARDLY ANY A GREAT DEAL
1598 MALE HARDLY ANY ONLY SOME ONLY SOME A GREAT DEAL ONLY SOME ONLY SOME ONLY SOME ONLY SOME ONLY SOME ONLY SOME ONLY SOME HARDLY ANY A GREAT DEAL
62 FEMALE A GREAT DEAL A GREAT DEAL ONLY SOME A GREAT DEAL A GREAT DEAL A GREAT DEAL A GREAT DEAL A GREAT DEAL ONLY SOME A GREAT DEAL A GREAT DEAL A GREAT DEAL A GREAT DEAL
1157 MALE A GREAT DEAL A GREAT DEAL A GREAT DEAL A GREAT DEAL ONLY SOME ONLY SOME HARDLY ANY A GREAT DEAL ONLY SOME A GREAT DEAL A GREAT DEAL A GREAT DEAL A GREAT DEAL
Question

Explore possible associations between the answers to the different confidence questions.

How would you test possible independence between the answers concerning confidence in Science and confidence in the Army?

Solution

TODO
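In the meantime, here is a minimal sketch using chisq.test() on simulated data (the two factors below are made up; in the lab, one would cross the consci and conarmy columns of confidence_data instead).

```r
# Simulated answers on the 3-level confidence Likert scale (made-up data).
set.seed(42)
lev <- c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY")
consci  <- factor(sample(lev, 500, replace = TRUE), levels = lev)
conarmy <- factor(sample(lev, 500, replace = TRUE), levels = lev)

# Cross-tabulate the two factors, then run Pearson's chi-squared test
# of independence on the 3 x 3 contingency table.
tab <- table(consci, conarmy)
chisq.test(tab)
```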

The case for using MCA

In order to construct a field of ideological and political attitudes, which will subsequently be used as a reference for locating the empirical typologies of response styles and survey compliance, we apply multiple correspondence analysis (MCA). MCA belongs to the family of techniques used in geometric data analysis (Le Roux and Rouanet 2004). It allows for the extraction of the most important dimensions in a set of categorical variables, and the graphical representation of variable categories and individuals relative to each other in a coordinate system. Distances between categories as well as individuals can be interpreted as a measure of (dis)similarity: If categories often co-appear in individual’s responses, they are located close together in the space produced by MCA. Rare co-appearances, accordingly, result in a larger distance between the respective categories. Furthermore, illustrative variables can passively be projected into the field, a technique that has been termed visual regression analysis (Lebart et al. 1984). Whereas the space is determined by the distances between the categories of active variables, passive variables do not alter the structure of the constructed field, but appear in their average and hence most likely position.

From https://doi.org/10.1007/s11135-016-0458-3

In this lab, we won’t look at the field of ideological and political attitudes, but rather at the field of confidence level in a variety of institutions.

Multiple Correspondence Analysis

MCA executive summary

The input of multiple correspondence analysis is a data frame \(X\) with \(n\) rows and \(p\) categorical columns. Multiple Correspondence Analysis starts by building the indicator matrix. The indicator matrix is built by one-hot encoding of each categorical variable.

  • A categorical variable \(V_j\) (factor) with \(q\) levels is mapped to \(q\) \(\{0,1\}\)-valued variables \(V_{j,r}\) for \(r \leq q\)

  • If levels are indexed by \(\{1, \ldots, q\}\) and the value of the categorical variable \(V_j\) in row \(i\) is \(k \in \{1, \ldots, q\}\), the binary encoding of this value is \[k \mapsto \underbrace{0,\ldots, 0}_{k-1}, 1, \underbrace{0, \ldots, 0}_{q-k}\]

  • The indicator matrix has as many rows as the data matrix

  • The number of columns of the indicator matrix is the sum of the number of levels of the categorical variables/columns of the data matrix

  • The indicator matrix is a numerical matrix. It is suitable for factorial methods
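The one-hot encoding step can be sketched with base R alone. The toy data frame below is made up for illustration; stats::model.matrix() performs the encoding of each factor, and FactoMineR::tab.disjonctif() does the same job in one call.

```r
# Toy data frame standing in for the categorical columns of X (made-up data).
toy <- data.frame(
  a = factor(c("x", "y", "x", "z")),
  b = factor(c("u", "u", "v", "v"))
)

# One-hot encode each factor with model.matrix(~ 0 + v), then bind the
# blocks column-wise to obtain the indicator matrix Z.
Z <- do.call(cbind, lapply(toy, \(v) stats::model.matrix(~ 0 + v)))

dim(Z)      # 4 rows (individuals), 3 + 2 = 5 columns (one per level)
rowSums(Z)  # every row sums to p = 2, the number of variables
```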

Recall \(X\) is the data matrix with \(n\) rows (individuals) and \(p\) categorical columns (variables)

For \(j \in \{1, \ldots, p\}\), let \(J_j\) denote the number of levels (categories) of variable \(j\)

Let \(q = \sum_{j\leq p} J_j\) be the sum of the number of levels throughout the variables

Let \(Z\) be the indicator matrix with \(n\) rows and \(q\) columns

For \(j\leq p\) and \(k \leq J_j\), let \(\langle j, k\rangle = \sum_{j'<j} J_{j'}+k\)

Let \(N = n \times p = \sum_{i\leq n} \sum_{m \leq q} Z_{i,m}\) and \[P = \frac{1}{N} Z\]

(the correspondence matrix for MCA)

The row-wise sums of the correspondence matrix \(P\) are all equal to \(1/n=p/N\)

The column-wise sum of the correspondence matrix \(P\) for the \(k\)th level of the \(j\)th variable of \(X\) ( \(j \leq p\) ) is \[N_{\langle j,k\rangle}/N = f_{\langle j,k\rangle}/p\]

where \(f_{\langle j,k\rangle}\) stands for the relative frequency of level \(k\) of the \(j\)th variable

\[D_r = \frac{1}{n}\text{Id}_n\qquad D_c =\text{diag}\left(\frac{f_{\langle j,k\rangle}}{p}\right)_{j \leq p, k\leq J_j}\]

In MCA, we compute the SVD \(U \times D \times V^\top\) of the standardized residuals matrix:

\[S = D_r^{-1/2}\times \left(P - r\times c^\top\right) \times D_c^{-1/2} = \sqrt{n}\left(P - r\times c^\top\right) \times D_c^{-1/2}\]

Coefficient \(i, \langle j, k\rangle\) of \(S\) is \[\frac{\mathbb{I}_{i, \langle j, k\rangle}- f_{\langle j,k\rangle}}{\sqrt{n p f_{\langle j,k\rangle}}}\]

MCA consists in computing the SVD of the standardized residuals matrix \(S = U \times D \times V^\top\)

From the SVD, we get

  • \(D_r^{-1/2} \times U\) standardized coordinates of rows
  • \(D_c^{-1/2} \times V\) standardized coordinates of columns
  • \(D_r^{-1/2} \times U \times D\) principal coordinates of rows
  • \(D_c^{-1/2} \times V \times D\) principal coordinates of columns
  • Squared singular values: the principal inertia

When calling svd(.), the argument should be \[D_r^{1/2}\times \left(D_r^{-1} \times P \times D_c^{-1}- \mathbf{1}\times \mathbf{1}^\top \right)\times D_c^{1/2}= D_r^{-1/2}\times \left( P - r \times c^\top \right)\times D_c^{-1/2}\] where \(\mathbf{1}\) denotes the all-ones vector.
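To make the recipe concrete, here is a self-contained base-R sketch of MCA "by hand" on a tiny, made-up indicator matrix (not the GSS data), following the formulas above.

```r
# Toy indicator matrix: n = 4 individuals, p = 2 categorical variables
# with 3 and 2 levels (q = 5 columns)
Z <- matrix(c(1,0,0, 1,0,
              0,1,0, 0,1,
              1,0,0, 0,1,
              0,0,1, 1,0), nrow = 4, byrow = TRUE)
n <- nrow(Z); p <- 2; N <- n * p

P  <- Z / N          # correspondence matrix
rw <- rowSums(P)     # row masses, all equal to 1/n
cw <- colSums(P)     # column masses f_<j,k> / p

# Standardized residuals matrix S = D_r^{-1/2} (P - r c^T) D_c^{-1/2}
S  <- diag(1/sqrt(rw)) %*% (P - rw %*% t(cw)) %*% diag(1/sqrt(cw))
sv <- svd(S)

row_pc <- diag(1/sqrt(rw)) %*% sv$u %*% diag(sv$d)  # principal coords of rows
col_pc <- diag(1/sqrt(cw)) %*% sv$v %*% diag(sv$d)  # principal coords of columns
sv$d^2                                              # principal inertias
```

A useful sanity check: on an indicator matrix the principal inertias always sum to \((q-p)/p\), here \(3/2\), whatever the data.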

MCA and extended SVD

As

\[D_r^{-1} \times P \times D_c^{-1} - \mathbf{1}\mathbf{1}^\top = (D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top\]

\[(D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^\top\]

is the extended SVD of

\[D_r^{-1} \times P \times D_c^{-1} - \mathbf{1}\mathbf{1}^\top\]

with respect to \(D_r\) and \(D_c\)

Question

Perform MCA on the indicator matrix.

You may use FactoMineR::MCA(). It is interesting to compute the correspondence analysis in your own way, by preparing the matrix that is handed to svd() and returning a named list containing all relevant information.

Solution

We lazily rely on FactoMineR::MCA.

Code
res.mca <- confidence_data |> 
  select(starts_with("con")) |> 
  FactoMineR::MCA(graph=F)

The result is an object of S3 class MCA.


Question

If you did use FactoMineR::MCA(), explain the organization of the result.

Solution

The result of FactoMineR::MCA(...) is a named and nested list with five elements:

eig
a matrix/array containing enough information to build a screeplot.
call
a list of 9, containing the call to MCA(), an object of type language, telling (in principle) the user how MCA() was called. However, this is a quoted expression: we may need to inspect the calling environment to understand what's going on.

Element call also contains the table margin distributions marge.col and marge.row. The truncation rank ncp (number of components) can be assigned before computing the SVD (default value is 5). Element \(X\) stores the table that was effectively used for computing Multiple Correspondence Analysis.

ind
Information gathered from the SVD to facilitate individuals (respondents) analysis.
var
a list structured in the same way as element ind. Used for variables/categories profiles analysis
svd
a list of 3, just as the result of svd(), containing the singular values, the left and right singular vectors of matrix \(...\)
Solution, output of print(res.mca)
Name Description
1 $eig “eigenvalues”
2 $var “results for the variables (categories)”
3 $var$coord “coord. of the categories”
4 $var$cos2 “cos2 for the categories”
5 $var$contrib “contributions of the categories”
6 $var$v.test “v-test for the categories”
7 $ind “results for the individuals”
8 $ind$coord “coord. for the individuals”
9 $ind$cos2 “cos2 for the individuals”
10 $ind$contrib “contributions of the individuals”
11 $quali.sup “results for the supplementary categorical variables”
12 $quali.sup$coord “coord. for the supplementary categories”
13 $quali.sup$cos2 “cos2 for the supplementary categories”
14 $quali.sup$v.test “v-test for the supplementary categories”
15 $call “intermediate results”
16 $call$marge.col “weights of columns”
17 $call$marge.li “weights of rows”

Note that res.mca$svd is not listed in the printed output!

  • eig is computed from the singular values in res.mca$svd

  • var contains material for plotting information about categories and variables on factorial planes

  • ind contains material for plotting information about individuals on factorial planes

In principle, all relevant information can be gathered from components svd, call$marge.li, and call$marge.col.

Screeplots

Question

Draw screeplots. Why are they useful? Comment briefly.

Solution

We first define a tidying function for preparing the screeplot. It is designed along the lines of method tidy for objects of class prcomp. It consists in building a data frame from component eig of the MCA object.

Code
tidy_mca_scree <- function(res_mca){
  res_mca$eig |> 
    as_tibble(rownames="Dim") |>
    rename(
      `% Inertia` = `percentage of variance`,
      `Cumulative % Inertia` = `cumulative percentage of variance`
    ) |> 
    mutate(
      PC = as.integer(str_match(Dim, 'dim ([0-9]*)')[,2])
    )
}

We can now build a column plot.

Code
res.mca |> 
  tidy_mca_scree() |>
  ggplot() +
  aes(x=PC, y=eigenvalue, label=glue("{round(`% Inertia`,2)} %")) +
  geom_col() +
  geom_text(angle=45, vjust=-1, hjust=-.1) 

Note that the share of cumulated inertia doesn't grow very fast with the number of components. The situation differs from what we met when performing Correspondence Analysis or Principal Component Analysis. This tells us that we will have to look beyond the first two axes. This is commonplace with multiple correspondence analysis. See the reference below, where most of the interesting information is found on the 3rd and 4th axes.

Individuals

Question

Perform Individual profiles analysis.

What are the classical plots? How can you build them from the output of FactoMineR::MCA?

Build the tables of row contributions and of squared cosines (the so-called \(\cos^2\))

Solution

Attribute ind of objects of class MCA (exported from FactoMineR) is the starting point of any row profiles analysis.

We design an augment method for objects of class MCA. This mirrors what is done for objects of class prcomp and differs from what we did with objects of class CA. Indeed, MCA on the indicator matrix handles a sample just like PCA, not a contingency table like CA.

Code
augment_mca <- function(res_mca, data=NULL){
  
  result <- res_mca$ind$coord |> 
    as_tibble()  
  
  if (!is.null(data)) {
    result <- result |>
      bind_cols(data)
  }
  
  return(result)
}
Solution

Attribute ind is a named list made of \(4\) components.

coord
a matrix with named rows and columns. The number of rows of coord matches the number of rows of the table fed to MCA (here, the number of respondents in the panel). The number of columns matches the rank of the truncated SVD that underlies Multiple Correspondence Analysis. Here it is the default truncation \(5\).

Matrix coord stores the principal coordinates of each individual along the principal axes.

The columns of coord are pairwise orthogonal in the usual inner product space.

Function augment_mca binds the columns of coord with the columns of the original dataset.

Code
#| label: augment_mca_glimpse
res.mca |> 
  augment_mca(confidence_data) |> 
  glimpse()
Rows: 1,183
Columns: 20
$ `Dim 1`  <dbl> 1.35465733, -0.10861408, -0.33683136, 0.72450001, 0.72525336,…
$ `Dim 2`  <dbl> 0.41190167, -0.90245225, -0.45183706, 0.25425970, 0.10225521,…
$ `Dim 3`  <dbl> 0.161371575, 0.010268886, 0.107542556, -0.009273224, -0.38139…
$ `Dim 4`  <dbl> 0.141129674, 0.310980059, -0.097880877, -0.085996832, 0.36106…
$ `Dim 5`  <dbl> 0.227285260, 0.086562193, 0.016680275, -0.115998402, -0.06681…
$ id       <dbl+lbl>  4,  6,  7,  9, 11, 15, 19, 20, 21, 24, 26, 28, 31, 41, 4…
$ sex      <dbl+lbl> 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, …
$ confinan <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, HARDLY ANY, ONLY SOME, ON…
$ conbus   <fct> ONLY SOME, ONLY SOME, ONLY SOME, ONLY SOME, ONLY SOME, ONLY S…
$ conclerg <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, A GREAT DEAL, A GREAT DEA…
$ coneduc  <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, ONLY SOME, A GREAT DEAL, …
$ confed   <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, A GREAT DEAL, A GREAT DEA…
$ conlabor <fct> A GREAT DEAL, ONLY SOME, HARDLY ANY, ONLY SOME, HARDLY ANY, O…
$ conpress <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, HARDLY ANY, ONLY SOME, HA…
$ conmedic <fct> A GREAT DEAL, ONLY SOME, HARDLY ANY, A GREAT DEAL, ONLY SOME,…
$ contv    <fct> ONLY SOME, ONLY SOME, HARDLY ANY, ONLY SOME, ONLY SOME, ONLY …
$ conjudge <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, A GREAT DEAL, A GREAT DEA…
$ consci   <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, A GREAT DEAL, ONLY SOME, …
$ conlegis <fct> ONLY SOME, ONLY SOME, ONLY SOME, A GREAT DEAL, A GREAT DEAL, …
$ conarmy  <fct> A GREAT DEAL, ONLY SOME, ONLY SOME, A GREAT DEAL, A GREAT DEA…

Note that we can recover ind$coord from the left singular vectors and the singular values.
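To illustrate this claim with plain svd() (on a toy indicator matrix, not on res.mca itself), recall from the formulas above that the principal row coordinates are \(D_r^{-1/2} \times U \times D = \sqrt{n}\, U D\); their Gram matrix is diagonal, which also shows that the columns of coord are pairwise orthogonal.

```r
# Toy indicator matrix (same computation pattern as above, not GSS data)
Z  <- matrix(c(1,0,0, 1,0,
               0,1,0, 0,1,
               1,0,0, 0,1,
               0,0,1, 1,0), nrow = 4, byrow = TRUE)
n  <- nrow(Z); p <- 2; N <- n * p
P  <- Z / N
rw <- rowSums(P); cw <- colSums(P)

# Since D_r = (1/n) Id, D_r^{-1/2} is just sqrt(n) times the identity
S  <- sqrt(n) * (P - rw %*% t(cw)) %*% diag(1/sqrt(cw))
sv <- svd(S)

coord <- sqrt(n) * sv$u %*% diag(sv$d)  # principal row coordinates
crossprod(coord)                        # = n * D^2: a diagonal matrix, so
                                        # the columns of coord are orthogonal
```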

The output of augment_mca is the basis of the individual plots. Individual plots are defined by picking a pair of principal axes. We use the 3rd and the 4th axes, and we map the sex column from the original dataset to the color aesthetic.

Code
res.mca |> 
  augment_mca(confidence_data) |> 
  ggplot() +
  aes(x=`Dim 3`, y=`Dim 4`, color=sex) +
  geom_point(alpha=.2, size=.2) +
  coord_fixed() +
  scale_color_viridis_d()

Solution
cos2
Coefficients of matrix cos2 give, for each individual, the share of its inertia captured by each principal axis: they measure the quality of representation of the individual on that axis.
Code
res.mca |> 
  augment_mca(confidence_data) |> 
  ggplot() +
  aes(x=`Dim 1`, y=`Dim 2`, color=sex) +
  geom_point(alpha=.2, size=.2) +
  coord_fixed() +
  scale_color_viridis_d()

Solution
contrib
Coefficients of matrix contrib give the contribution of each individual to the inertia carried by each principal axis.

Not too surprisingly, coord, contrib, and cos2 share the same row names and column names.

Variables/Categories

Question

Perform column profiles (categories) analysis. You may use factoextra::fviz_mca_var()

Solution

The var component of an MCA object is related to the right singular vectors of the residuals matrix.

Code
p <- res.mca |> 
  factoextra::fviz_mca_var() +
  coord_fixed() 
  
p +
  ggforce::facet_zoom(
    xy=`Dim 1`>=0 & `Dim 2`>=0,
    xlim=c(0,2), ylim=c(0,2))

Following what has been done for objects of class prcomp, we design a tidy_mca_var function that turns the var component into a dataframe. Row names, which correspond to categories, are turned into a column we call levels.

Code
tidy_mca_var <- function(res_mca){
  res_mca$var$coord |> 
  as_tibble(rownames="levels")
}

A value in column levels identifies a variable, that is a topic (science, religion, army, …), together with a confidence level (a value on a Likert scale). We enrich the data frame by extracting the variable (topic) and the value (sentiment).

We are then ready to plot an analog of the correlation circle.

Code
tidied_mca_var <- tidy_mca_var(res.mca) |>   
  mutate(sentiment=str_split(as.character(levels),'_')) |>
  mutate(topic=map_chr(sentiment, \(x)x[1]), sentiment=map_chr(sentiment, \(x)x[2])) 

p <- tidied_mca_var |> 
  ggplot() +
    aes(x=`Dim 1`, y=`Dim 2`, label=levels, color=topic, shape=sentiment) +
    geom_point(alpha=.5, show.legend = F) +
    coord_fixed() +
    scale_color_viridis_d() +
    theme(legend.position="none") +
    labs(
      title = "MCA: Variables categories",
      subtitle = "Color: topic, Shape: confidence level",
      caption = "GSS data, panel 2010, wave 1"
    )

p |>
  plotly::ggplotly() 

Resorting to plotly allows us to mitigate the overplotting effect.

The plot on the first two axes seems easy to interpret. As we move along the first axis from left to right, we see growing confidence.
On the positive part of the second axis, we find extreme confidence levels, either high or low. This suggests that individuals exhibit global response tendencies: either they trust all institutions, or none of them, or they trust all of them a little.

How could we substantiate such a narrative?

Code
(
 p + 
    aes(x=`Dim 3`, y=`Dim 4`, label=levels, color=sentiment) 
) |>   plotly::ggplotly() 
Code
(
  p +
    aes(x=`Dim 4`, y=`Dim 5`, label=levels) 
) |>  plotly::ggplotly() 
Question

What is the v.test component of the var component of an MCA object?

Solution

Here is a chatbot answer:

In Multiple Correspondence Analysis (MCA), the v-test (also known as the test-value or t-test value) is a statistical measure used to assess the significance of the association between a category and a dimension (factor). It helps determine which categories contribute the most to the definition of a given dimension.

Definition and Calculation

The v-test is computed as:

\[v = \frac{p - P}{\sqrt{P(1 - P)/n}}\]

where:

  • \(p\) = proportion of individuals in a given category for the specific dimension,
  • \(P\) = overall proportion of individuals in that category across all dimensions,
  • \(n\) = total number of individuals.

Alternatively, the v-test can also be expressed as:

\[v = \frac{\text{category coordinate}}{\text{standard deviation of the coordinate}}\]

This essentially provides a z-score, which follows a standard normal distribution under the null hypothesis.

Interpretation

  • A high absolute v-test value (typically above 2 or below -2) indicates that the category is strongly associated with the given dimension.
  • Positive v-test: The category is overrepresented in that dimension.
  • Negative v-test: The category is underrepresented in that dimension.
  • If the absolute value of the v-test is below 2, the association is not statistically significant (at a 5% significance level).

Why is it useful?

  • It helps interpret MCA results quantitatively rather than relying only on visual inspection.
  • It aids in identifying key categories that structure the data on a given axis.
  • It provides statistical validation for category contributions to dimensions.


Symmetric plots

Question

Build the symmetric plots (biplots) for multiple correspondence analysis.

From the shelf

We resort to factoextra. In principle, the biplot can be constructed from the coord matrices in the var and the ind components of the MCA object.

Code
p <- res.mca |> 
  factoextra::fviz_mca_biplot(
    label= "var",
    alpha.ind=.5
  )

p

Code
# p |>  plotly::ggplotly()
Solution

Mosaicplots

MCA can be complemented by Mosaicplots, Double Decker plots, Chi-square tests, and Correspondence Analyses between pairs of variables.

Question

Draw a mosaic plot to visualize the association between confidence levels in Science and confidence levels in Medicine.

Solution
Code
confidence_data |> 
  select(consci, conmedic) |>
  table() |> 
  mosaicplot(color=T)

Code
confidence_data |> 
  select(consci, conmedic) |> 
  table() |>
  chisq.test() |> 
  broom::glance() |> 
  gt::gt() |> 
  gt::tab_caption("Chi2 independence test for Confidence in Science and Confidence in Medicine")
Chi2 independence test for Confidence in Science and Confidence in Medicine
statistic p.value parameter method
137.8878 7.994493e-29 4 Pearson's Chi-squared test
Code
confidence_data |> 
  select(consci, conclerg) |> 
  table() |>
  chisq.test() |> 
  broom::glance() |> 
  gt::gt() |> 
  gt::tab_caption("Chi2 independence test for Confidence in Science and Confidence in Clergy")
Chi2 independence test for Confidence in Science and Confidence in Clergy
statistic p.value parameter method
39.27785 6.103953e-08 4 Pearson's Chi-squared test
Code
t_sex_clerg_sci <- confidence_data |> 
  select(sex, conclerg, consci) |> 
  table()
  
t_sex_clerg_sci  |>
  vcd::doubledecker( )

Code
t_sex_clerg_sci |> 
  mosaicplot(color=T, main="Confidence in Clergy and Science")

Further references

Barth, Alice and Schmitz, Andreas. 2018. Response quality and ideological dispositions: an integrative approach using geometric and classifying techniques. Quality & Quantity

When analyzing survey data, response quality has consequential implications for substantial conclusions. Differences in response quality are usually explained by personality, or socio-demographic or cognitive characteristics. Little, however, is known about how respondents’ political attitudes, values, and opinions impact on quality aspects. This is a striking analytical omission, as potential associations between political values and various forms of response biases and artefacts call into question surveys’ ability to represent ‘public opinion’. In this contribution, response quality is traced back to respondents’ political and ideological dispositions. For this purpose, a relational understanding of response quality is applied that takes into account different aspects of response behaviors, as well as the interrelations between these indicators. Using data from the US General Social Survey (2010–2014), an empirical typology of response quality is created via finite mixture analysis. The resulting classes are then related to positions in the US field of ideological dispositions constructed via multiple correspondence analysis. The analyses reveal that there are (1) different combinations of response patterns and thus different empirical response types, and (2) that these types of response quality systematically vary with regard to the respondents’ political and ideological (dis)positions. Implications of the findings for public opinion surveys are discussed.