Hmw I : Tables and visualization

Published

January 13, 2025

Important
  • Due : February 7, 2025
  • Work in pairs
  • Deliver your work as a qmd file through a github repository
  • Use the quarto package for reproducible research
  • The report should be rendered at least in HTML format, and possibly also in PDF format

Objectives

This homework is dedicated to table wrangling and visualization.

OECD offers detailed statistics at different territorial levels.

Using data gathered from OECD, build dataframes with a gapminder-like schema (territorial units/timestamps as keys, statistics for other columns). Using ggplot2 and plotly, design plots and animations like the [Rossling address to BBC]().

Your data and graphical pipelines should be designed so that they can be at least partially reused to perform gapminder-like animations with other kind of data gathered from OECD. For example one, could replace life expectancy at birth by life expectancy at 60, infant mortality, or any other health index. One could replace GDP per capita with social expenditure per capita, or something like that.

Your deliverable shall consist in a qmd file that can be rendered in HTML format.

You shall describe the downloaded data.

Plots shall be endowed with titles, legends and captions,

Data pipelines and graphical pipelines shall be given in an appendix.

Data

Data have to be downloaded from https://data.oecd.org (the old site https://stats.oecd.org which could be drilled using package OECD is shutting down).

Use OECD Data explorer to find the data you need. Follow the guidleines given in the OECD Data Explorer API.

Once you have decided which data you need, you shall build an extraction pipeline using httr and rvest to download the data or tools from OECD package.

Table wrangling should be performed using tidyverse tools.

Your extraction pipeline shall be reproducible and shall be given in an appendix.

Keep the downloaded data in a separate subdirectory. Your working directory (working tree) should look like something like that:

.
├── .git/
├── DATA/
│   ├── OECD.foo.csv
│   ├── OECD.foobar.csv
│   ├── OECD.bar.csv
|   :
├── _extensions/
├── _outdir/
├── _metadata.yml
├── _quarto.yml
├── my_report.qmd
├── :
└── README.md

Report organization

The first part (introduction) of the report shall be dedicated to the description of the data you have downloaded. You shall motivate your choice and non-trivial aspects of the data (for example if you were discussing GDP per capita against Life expectancy, you should remind the reader about the definition of Life expectancy and GDP). You shall also give a hint about why you intend to plot some variables against others.

The second part (results) shall be dedicated to plots and animations. Commenting a plot is not paraphrasing. It consists in adding informations and explanations that are not already in and around the plot (this includes the plot itself, title, subtitle, caption, and guides). It also consists in questions and issues that the plot raises. For example, in the Gapminder presentation, the apparent connection between life expectancy and GDP per capita deserves to be discussed (is it stationary? is it homogeneous throughout continents ? …). Refrain from overplaying your hand: yours plots are not likely to provide causal explanations. Comment the data, all the data, and nothing but the data.

The third part is the appendix. The first two parts should be text and plots only. The third part should be code only.

The appendix shall be dedicated to the description of the data wrangling pipeline. You shall give the code.

You shall also give the code of the graphical pipelines in the appendix.

You shall avoid copy-paste coding. Don’t Repeat Yourself. The tidyverse is your friend. knitr provide the tools to organize the Quarto file so that you can write your code once and use it many times, once for data wrangling and plotting (without echoing), then for listing and explanation.

Grading criteria

Criterion Points Details
Narrative, spelling and syntax 25% English/French
Plots correction 20% choice of aesthetics, geom, scale
Plot style 15% Titles, legends, labels, breaks …
Table wrangling 15% ETL, SQL like manipulations
Computing Statistics 5% Aggregations, LR, PCA, CA, …
DRY compliance 20% DRY principle at Wikipedia