Hmw I : Tables and visualization

Published

March 10, 2025

Important

Due : April 12, 2025
Work in pairs
Deliver your work as a qmd file through a GitHub repository
Use the Quarto package for reproducible research
The report should be rendered at least in HTML format, and possibly also in PDF format

Objectives

This homework is dedicated to table wrangling, text processing and visualization.

Extract/Load/Transform the Balzac corpus made of the 92 novels, short stories, and essays forming La comédie humaine.
Annotate the corpus. In natural language processing (NLP), annotation consists of carrying out what is known as grammatical analysis. Some excellent annotation tools are available, for example spaCy. We recommend you use these tools.
Organize and persist the result of the annotation process at your convenience. Document and motivate your choice.
Perform Stylometric Analysis and visualize the results using either plotly or altair.
To perform stylometric analysis, first have a look at Data Humanities with R. The R package stylo is also a great source of inspiration. Computational stylistics provides a general perspective on the subject.
Using R and Python together is easy thanks to rpy2 and/or reticulate.
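
The choice of storage for the annotations is left to you. As one possible illustration only, the sketch below persists token-level annotations in a SQLite database using the Python standard library; Parquet files read with Pandas or Polars would be an equally defensible choice. The table layout, the column names and the helper names `save_annotations` and `pos_counts` are assumptions made for this example, not requirements of the homework.

```python
import sqlite3
from typing import Iterable

# Hypothetical schema: one row per token, keyed by novel and token position.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tokens (
    novel    TEXT    NOT NULL,
    position INTEGER NOT NULL,
    text     TEXT    NOT NULL,
    lemma    TEXT    NOT NULL,
    pos      TEXT    NOT NULL,
    PRIMARY KEY (novel, position)
)
"""


def save_annotations(
    conn: sqlite3.Connection,
    novel: str,
    tokens: Iterable[tuple[str, str, str]],
) -> int:
    """Store (text, lemma, pos) triples for one novel; return the number of rows."""
    conn.execute(SCHEMA)
    rows = [
        (novel, i, text, lemma, pos)
        for i, (text, lemma, pos) in enumerate(tokens)
    ]
    with conn:  # commits on success, rolls back on error
        # Replace earlier rows so that re-running the pipeline is idempotent.
        conn.execute("DELETE FROM tokens WHERE novel = ?", (novel,))
        conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)


def pos_counts(conn: sqlite3.Connection, novel: str) -> dict[str, int]:
    """Count tokens per part-of-speech tag for one novel."""
    cur = conn.execute(
        "SELECT pos, COUNT(*) FROM tokens WHERE novel = ? GROUP BY pos",
        (novel,),
    )
    return dict(cur.fetchall())
```

Keying rows by novel and position keeps the per-novel annotation results independent of each other, so a single novel can be re-annotated without touching the rest of the corpus.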

Caution

Your deliverable shall consist of a qmd file that can be rendered in HTML format.

The deliverable shall be completed with a short oral presentation (10 min + 5 min for questions).

You shall describe the downloaded data.

You shall describe and motivate the way you organized the annotation process (if performed in a multithreaded/multiprocessing way).

Plots shall be endowed with titles, legends and captions.

Data pipelines and graphical pipelines shall be given in an appendix.

Data

La comédie humaine is a collection of 92 novels, short stories, and essays (romans, nouvelles et essais) written by Honoré de Balzac between 1829 and his death in 1850. The first complete edition is called the Furne edition.

A sample of elements from La Comédie Humaine
Title	Year	Genre
La Bourse	1832	roman
Le Chef-d'œuvre inconnu	1831	roman
Les Secrets de la princesse de Cadignan	1839	roman
Adieu	1830	roman
La Femme abandonnée	1833	roman

Data can be downloaded/scraped from different sources:

https://github.com/dh-trier/balzac/tree/master
https://www.gutenberg.org/ (17 volumes of La Comédie humaine)
many other places

Table wrangling should be performed using Pandas/Polars DataFrames or Dask DataFrames (not Vaex).

Your extraction (ELT) pipeline shall be reproducible and shall be given in an appendix.
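
As an illustration of one small, reproducible extraction step, the sketch below strips the licence header and footer from a Project Gutenberg plain-text file. It assumes that the downloaded files carry marker lines starting with `*** START OF THE PROJECT GUTENBERG EBOOK` and `*** END OF THE PROJECT GUTENBERG EBOOK` (or the older `THIS` variant); check this against the files you actually download. The function name is hypothetical, and splitting a volume into individual novels is not handled here.

```python
import re

# Assumed marker lines; verify them on the files you downloaded.
_START = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)
_END = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the start and end markers.

    If a marker is missing, the corresponding side of the text is kept whole.
    """
    start = _START.search(raw)
    end = _END.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end and end.start() >= begin else len(raw)
    return raw[begin:finish].strip()
```

Keeping this step as a pure function of the raw text makes it easy to rerun the whole pipeline from the downloaded files, without delivering the text files themselves.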

You are not supposed to deliver the text files as a zipped archive.

Annotate with spaCy

Annotation shall be done on a per novel basis. It can be performed in a parallel (and distributed) way.
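
One way to organize per-novel annotation in parallel is to submit one task per novel to an executor. The sketch below is a minimal illustration under stated assumptions: the spaCy model name `fr_core_news_sm`, the token record layout and the function names are choices made for this example, and other designs (for instance spaCy's own batching, or Dask) are just as acceptable.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from functools import lru_cache
from typing import Callable

Token = dict[str, str]


@lru_cache(maxsize=1)
def _load_model(name: str = "fr_core_news_sm"):
    """Load the spaCy pipeline once per worker (the model name is an assumption)."""
    import spacy  # imported lazily so the rest of the sketch runs without spaCy

    return spacy.load(name)


def spacy_annotate(text: str) -> list[Token]:
    """Annotate one novel with spaCy and return one record per token."""
    # Very long novels may require raising nlp.max_length or splitting the text.
    doc = _load_model()(text)
    return [{"text": t.text, "lemma": t.lemma_, "pos": t.pos_} for t in doc]


def annotate_corpus(
    novels: dict[str, str],
    annotate: Callable[[str], list[Token]] = spacy_annotate,
    executor_cls: type[Executor] = ProcessPoolExecutor,
    max_workers: int = 2,
) -> dict[str, list[Token]]:
    """Annotate each novel independently, one task per novel."""
    titles = list(novels)
    with executor_cls(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with titles.
        results = pool.map(annotate, (novels[t] for t in titles))
        return dict(zip(titles, results))
```

Passing the annotation function and the executor class as parameters keeps the orchestration testable with a cheap stand-in annotator and a thread pool, while a process pool remains the default for the CPU-bound spaCy work.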

Graphical pipelines should be reproducible and shall be given in an appendix.

Keep the downloaded data in a separate subdirectory. Your working directory (working tree) should look something like this:

.
├── .git/
├── DATA/
├──
|   :
├── _extensions/
├── _outdir/
├── _metadata.yml
├── _quarto.yml
├── our_report.qmd
├── :
└── README.md

Report organization

The first part (introduction) of the report shall be dedicated to the description of the data to be extracted and to the extraction pipeline.

The second part of the report shall be dedicated to the description of the load/transform pipeline.

The third part of the report shall be dedicated to the description of the annotation pipeline.

The fourth part of the report shall be dedicated to the stylometric analysis: which questions did you pick (and why?), plots, summary tables and comments. Refrain from overplaying your hand: your plots are not likely to provide a new literary interpretation of Balzac's work. Comment on the data, all the data, and nothing but the data.
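
As an example of a stylometric summary that stays close to the data, the sketch below computes Burrows's Delta between texts: relative frequencies of the most frequent words are z-scored across the corpus, and the distance between two texts is the mean absolute difference of their z-scores. This is only one possible measure, not a prescribed method. The function names are hypothetical, the input is assumed to be a mapping from title to a list of lowercased tokens, and the population standard deviation is used here although the sample standard deviation is a common alternative.

```python
from collections import Counter
from statistics import mean, pstdev


def most_frequent_words(corpus: dict[str, list[str]], n: int) -> list[str]:
    """Return the n most frequent words over the pooled corpus."""
    pooled: Counter = Counter()
    for tokens in corpus.values():
        pooled.update(tokens)
    return [word for word, _ in pooled.most_common(n)]


def relative_frequencies(tokens: list[str], vocabulary: list[str]) -> list[float]:
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocabulary]


def burrows_delta(
    corpus: dict[str, list[str]], n: int = 100
) -> dict[tuple[str, str], float]:
    """Burrows's Delta for every pair of texts (assumes non-empty texts)."""
    vocabulary = most_frequent_words(corpus, n)
    titles = list(corpus)
    freqs = {t: relative_frequencies(corpus[t], vocabulary) for t in titles}

    # z-score each word's relative frequency across the texts.
    z: dict[str, list[float]] = {t: [] for t in titles}
    for j in range(len(vocabulary)):
        column = [freqs[t][j] for t in titles]
        mu, sigma = mean(column), pstdev(column)
        for t in titles:
            z[t].append((freqs[t][j] - mu) / sigma if sigma > 0 else 0.0)

    # Delta is the mean absolute difference of z-scores.
    return {
        (a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
        for i, a in enumerate(titles)
        for b in titles[i + 1:]
    }
```

The resulting pairwise distances can be loaded into a DataFrame and shown as a heatmap or a dendrogram with plotly or altair.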

The fifth part is the appendix. The first four parts should be mostly text and plots. The fifth part should be code only.

The appendix shall be dedicated to the details of the pipelines. You shall give the code.

You shall also give the code of the graphical pipelines in the appendix.

You shall avoid copy-paste coding. Don’t Repeat Yourself. knitr provides the tools to organize the Quarto file so that you can write your code once and use it many times: once for data wrangling and plotting (without echoing), then for listing and explanation.

References

Data sources

Project Gutenberg: you can find La Comédie humaine using a simple search. All volumes can be downloaded as text files from there.

Note

Project Gutenberg is not the only possible data source.

Grading criteria

Criterion	Points	Details
Narrative, spelling and syntax	20%	English/French
Plot correctness	15%	choice of `aesthetics`, `geom`, `scale` …
Plot style	10%	Titles, legends, labels, breaks …
ETL	20%	ETL, SQL like manipulations
Annotation	10%	Annotations
Computing Statistics	5%	…
DRY compliance	20%	DRY principle at Wikipedia

Appendix

For this homework, you are likely to use the jupyter engine instead of the knitr engine. To organize your report so that selected code chunks appear in the appendix, have a look at the contents shortcode in the quarto documentation.