| title | rounds | year |
|---|---|---|
| European P | 1 | 2004 |
| Local | 2 | 2008 |
| President | 2 | 2012 |
| Parliament | 2 | 2017 |
| Local | 2 | 2020 |
Homework II: SVD Methods and Election Data
- Due: May 18, 2026
- Work in pairs
- Deliver your work through a github repository
- Present your work (15 minutes) on 2026-06-03
This homework is about:
- Using matrix factorization methods in data analysis
- Investigating voting patterns in Paris (or elsewhere)
I. Voting Data
Voting data in Paris per polling station can be obtained from a variety of websites.
- An example
- Another example
- Yet another one
- Opendatasoft API
- https://opendata.paris.fr
- https://data.smartidf.services/pages/data/
- https://data.opendatasoft.com
Many datasets are available in several formats. When possible, use Parquet. Parquet files can be read with the arrow package.
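As a minimal sketch of a Parquet round trip, assuming the arrow package is installed: the toy table and its column names (`station_id`, `registered`) are made up for illustration and do not come from any real dataset.

```r
library(arrow)

# Toy table standing in for a downloaded dataset (column names are hypothetical)
votes <- data.frame(
  station_id = c("75101-01", "75101-02"),
  registered = c(1200L, 950L)
)

path <- tempfile(fileext = ".parquet")
write_parquet(votes, path)        # write the table to a Parquet file
votes_back <- read_parquet(path)  # read it back as a data frame
```

Parquet keeps column types (here, integers stay integers), which avoids the type guessing that comes with CSV files.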
Data concerning polling stations can also be gathered from various sources.
Your first task will be to design an extraction pipeline to obtain the voting data you will analyse. You will gather data corresponding to different types of elections (Municipales, Régionales, Législatives, Européennes, Présidentielles) that have taken place since 2000.
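One possible first building block for such a pipeline is a small helper that builds export URLs. This sketch assumes the portal follows the URL layout of the Opendatasoft Explore API (v2.1); the helper name `ods_export_url` and the dataset identifier `some-election-dataset` are placeholders, so check the actual identifiers on the portal you use.

```r
# Build an export URL for an Opendatasoft portal.
# Assumption: the portal exposes
#   /api/explore/v2.1/catalog/datasets/<dataset_id>/exports/<format>
ods_export_url <- function(domain, dataset_id, format = "parquet") {
  sprintf(
    "https://%s/api/explore/v2.1/catalog/datasets/%s/exports/%s",
    domain, dataset_id, format
  )
}

# "some-election-dataset" is a placeholder, not a real dataset identifier
url <- ods_export_url("opendata.paris.fr", "some-election-dataset")

# Download once, then read the local copy
# (uncomment to run; requires network access and the arrow package)
# dir.create("data-raw", showWarnings = FALSE)
# dest <- file.path("data-raw", "some-election-dataset.parquet")
# download.file(url, dest, mode = "wb")
# votes <- arrow::read_parquet(dest)
```

Keeping the raw downloads in a dedicated directory makes the pipeline reproducible: the cleaning steps can be rerun without hitting the API again.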
II. Data cleaning
Some data cleaning may be necessary, for example:
- Some parties changed their names during the last 25 years. Defining a mapping can facilitate the comparison of results from different elections.
- Some parties were founded during the last 25 years.
- Check that the names of bulletins nuls, bulletins blancs, … are consistent across the different datasets.
Design a cleaning pipeline. Save the cleaned data.
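The mapping idea can be sketched in base R as follows. The two renamings in `party_map` are real, but the mapping is only illustrative and far from complete; the function names and the column-name variants are hypothetical and have to be adapted to what the raw files actually contain.

```r
# Illustrative mapping from labels found in raw files to a harmonized label.
# The real mapping has to be built from the data at hand.
party_map <- c(
  "FN"  = "RN",  # the Front National became the Rassemblement National in 2018
  "UMP" = "LR"   # the UMP became LR in 2015
)

harmonize_party <- function(labels, map = party_map) {
  known <- labels %in% names(map)
  labels[known] <- unname(map[labels[known]])
  labels  # labels without an entry in the map are left unchanged
}

# Harmonize column names for blank and spoiled ballots
# (the variants listed here are hypothetical)
harmonize_ballot_cols <- function(col_names) {
  col_names <- tolower(trimws(col_names))
  col_names[col_names %in% c("blancs", "nb_blancs", "bulletins blancs")] <- "bulletins_blancs"
  col_names[col_names %in% c("nuls", "nb_nuls", "bulletins nuls")] <- "bulletins_nuls"
  col_names
}

harmonize_party(c("FN", "PS", "UMP"))
harmonize_ballot_cols(c("Nb_Blancs", " Bulletins nuls", "Inscrits"))
```

Storing the mapping as data (a named vector here, or a small table saved next to the cleaned files) keeps the cleaning decisions visible and easy to revise.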
III. Applying Matrix Factorization Methods (SVD)
For one election round, the outcome is summarized by a tibble where rows (individuals) are polling stations and columns (variables) are the numbers of votes obtained by the different candidates/parties.
Perform PCA on different elections. Visualize and describe the results (note that this is data analysis, not political science).
Choropleths are welcome.
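A minimal sketch of the PCA step on simulated data, using base R only. The counts are synthetic, and working with vote shares rather than raw counts is a modelling choice made here, not a requirement of the assignment. The sketch also checks the link between `prcomp()` and the SVD of the centered matrix.

```r
set.seed(1)

# Simulated vote counts: 50 polling stations (rows) x 4 lists (columns).
# The numbers are synthetic and only illustrate the mechanics.
counts <- matrix(
  rpois(50 * 4, lambda = rep(c(300, 200, 150, 50), each = 50)),
  nrow = 50,
  dimnames = list(paste0("station_", 1:50), paste0("list_", LETTERS[1:4]))
)

# Turn counts into shares so that station size does not dominate the analysis
shares <- counts / rowSums(counts)

# PCA through prcomp (centered, unscaled columns)
pca <- prcomp(shares, center = TRUE, scale. = FALSE)

# The same quantities from the SVD of the centered matrix
centered <- scale(shares, center = TRUE, scale = FALSE)
sv <- svd(centered)

# Standard deviations of the components: singular values / sqrt(n - 1)
sdev_from_svd <- sv$d / sqrt(nrow(shares) - 1)

# Share of variance carried by each component
explained <- pca$sdev^2 / sum(pca$sdev^2)
```

Because shares sum to one in every row, the centered matrix has rank at most 3 here, so the last component carries essentially no variance.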
Perform CCA to compare different elections.
Feel free to combine different methods.
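Reading CCA as canonical correlation analysis, a comparison of two elections can be sketched with `stats::cancor()`. The data below are synthetic, `make_shares()` is a hypothetical helper used only to simulate correlated elections, and dropping one column per election is one way (among others) to deal with shares that sum to one.

```r
set.seed(2)
n <- 60

# Synthetic shares for two elections in the same 60 polling stations.
# A common latent factor makes the two elections correlated.
latent <- rnorm(n)
make_shares <- function(loadings) {
  k <- length(loadings)
  scores <- exp(outer(latent, loadings) + matrix(rnorm(n * k, sd = 0.3), n))
  scores / rowSums(scores)
}
election_1 <- make_shares(c(1, -1, 0.5, 0))  # 4 lists
election_2 <- make_shares(c(0.8, -0.6, 0))   # 3 lists

# Shares sum to one in each row, so one column per election is redundant:
# drop the last column before running CCA
X <- election_1[, -ncol(election_1)]
Y <- election_2[, -ncol(election_2)]

cca <- cancor(X, Y)  # both blocks are centered by default
cca$cor              # canonical correlations, in decreasing order
```

The number of canonical correlations equals the smaller of the two column counts, here 2.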
IV. Clustering
Perform clustering of polling stations
Compare the results of different clustering methods (kmeans with different initializations, hierarchical clustering with different distances and different merging methods).
Compare the resulting clusterings.
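The comparison can be sketched on synthetic shares with base R. The two station profiles, the helper `noisy()`, and the choice of the Rand index as agreement measure are illustrative assumptions; other distances, linkage rules, and agreement measures are equally legitimate.

```r
set.seed(3)

# Synthetic vote shares: two groups of polling stations with different profiles
noisy <- function(profile, n) {
  m <- matrix(rep(profile, each = n), nrow = n) +
    matrix(rnorm(n * length(profile), sd = 0.03), n)
  m / rowSums(m)
}
shares <- rbind(noisy(c(0.5, 0.3, 0.2), 40), noisy(c(0.2, 0.3, 0.5), 40))

# k-means with a single random start versus several restarts
km_single <- kmeans(shares, centers = 2, nstart = 1)
km_multi  <- kmeans(shares, centers = 2, nstart = 20)

# Hierarchical clustering with two distances and two linkage rules
hc_ward     <- hclust(dist(shares, method = "euclidean"), method = "ward.D2")
hc_complete <- hclust(dist(shares, method = "manhattan"), method = "complete")
cl_ward     <- cutree(hc_ward, k = 2)
cl_complete <- cutree(hc_complete, k = 2)

# Cross-tabulate two partitions: a table that is diagonal up to relabeling
# means that the two methods agree
table(kmeans = km_multi$cluster, ward = cl_ward)

# Rand index: proportion of station pairs on which two partitions agree
rand_index <- function(a, b) {
  agree <- outer(a, a, "==") == outer(b, b, "==")
  mean(agree[upper.tri(agree)])
}
rand_index(km_multi$cluster, cl_ward)
rand_index(cl_ward, cl_complete)
```

The Rand index does not depend on how clusters are numbered, which matters because k-means and `cutree()` assign labels arbitrarily.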
V. Regression
Consider the first round of the 2024 parliamentary elections as the response variables, and the elections held between 2017 and the 2024 European elections as explanatory variables.
Fit linear regressions. Discuss the results.
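A minimal sketch of a regression with several response variables, on synthetic data. All column names (`pres_2022_a`, `euro_2024_a`, `leg_2024_a`, `leg_2024_b`) and coefficients are invented for illustration; in the homework, the columns come from the cleaned election tables.

```r
set.seed(4)
n <- 100

# Synthetic shares from two earlier elections (explanatory variables)
d <- data.frame(
  pres_2022_a = runif(n, 0.1, 0.5),
  euro_2024_a = runif(n, 0.1, 0.5)
)

# Synthetic responses: first-round shares of two lists in 2024
d$leg_2024_a <- 0.05 + 0.6 * d$pres_2022_a + 0.3 * d$euro_2024_a + rnorm(n, sd = 0.02)
d$leg_2024_b <- 0.40 - 0.5 * d$pres_2022_a + rnorm(n, sd = 0.02)

# A matrix on the left-hand side fits one linear model per response column
fit <- lm(cbind(leg_2024_a, leg_2024_b) ~ pres_2022_a + euro_2024_a, data = d)

coef(fit)     # one column of coefficients per response
summary(fit)  # one summary table per response
```

With real data, explanatory shares from the same election sum to one, so including all of them together with an intercept makes the design matrix rank deficient; dropping one list per election is a simple workaround.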
References
- Advanced R Programming
- Packages
- Programming with/for ggplot2
- Programming with dplyr
- tidyeval helpers
- Cheatsheets
| Criterion | Points | Details |
|---|---|---|
| Documentation/Report | 45% | English/French |
| Presentation | 40% | |
| Data gathering/cleaning pipelines | 15% | |