A sample of elements from La Comédie Humaine

Titre | Année | Genre |
---|---|---|
La Bourse | 1832 | roman |
Le Chef-d'œuvre inconnu | 1831 | roman |
Les Secrets de la princesse de Cadignan | 1839 | roman |
Adieu | 1830 | roman |
La Femme abandonnée | 1833 | roman |
Homework I: Tables and visualization
- Due: April 12, 2025
- Work in pairs
- Deliver your work as a `qmd` file through a GitHub repository
- Use the `quarto` package for reproducible research
- The report should be rendered at least in HTML format, and possibly also in PDF format
Objectives
This homework is dedicated to table wrangling, text processing and visualization.
- Extract/Load/Transform the Balzac corpus made of the 92 novels, short stories, and essays forming La comédie humaine.
- Annotate the corpus. In natural language processing (NLP), annotation consists of carrying out what is known as grammatical analysis. Some excellent annotation tools are available, for example `spacy`. We recommend you use these tools.
- Organize and persist the result of the annotation process at your convenience. Document and motivate your choice.
- Perform stylometric analysis and visualize the results using either `plotly` or `altair`.
- To perform stylometric analysis, first have a look at Data Humanities with R. The `R` package `stylo` is also a great source of inspiration. Computational stylistics provides a general perspective on the subject.
- Playing simultaneously with `R` and `Python` is easy thanks to `rpy2` and/or `reticulate`.
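As a rough illustration of the kind of computation stylometric analysis involves, the sketch below computes a Burrows-style Delta between texts: relative frequencies of the most frequent words, z-scores across the corpus, then the mean absolute difference of z-scores for each pair of texts. This is only one possible distance, not a required method. The function names, the `n_words` default and the use of the population standard deviation are assumptions made for the example.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one tokenized text."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocabulary]


def burrows_delta(corpus, n_words=100):
    """Pairwise Burrows-style Delta.

    corpus: dict mapping a title to its list of (lowercased) tokens.
    Returns a dict mapping (title_a, title_b) to a distance.
    """
    # Vocabulary: the n_words most frequent words over the whole corpus
    overall = Counter()
    for tokens in corpus.values():
        overall.update(tokens)
    vocabulary = [word for word, _ in overall.most_common(n_words)]

    titles = list(corpus)
    freqs = {t: relative_frequencies(corpus[t], vocabulary) for t in titles}

    # z-score each word's frequency across texts (assumption: population sd)
    zscores = {t: [] for t in titles}
    for j in range(len(vocabulary)):
        column = [freqs[t][j] for t in titles]
        mu, sigma = mean(column), pstdev(column)
        for t in titles:
            z = (freqs[t][j] - mu) / sigma if sigma > 0 else 0.0
            zscores[t].append(z)

    # Delta: mean absolute difference of z-scores for each pair of texts
    deltas = {}
    for i, a in enumerate(titles):
        for b in titles[i + 1:]:
            deltas[(a, b)] = mean(
                abs(x - y) for x, y in zip(zscores[a], zscores[b])
            )
    return deltas
```

In practice the token lists would come from your annotation step (for instance lemmas rather than raw tokens), and the resulting distances can feed a heatmap or a clustering plot.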
Your deliverable shall consist of a `qmd` file that can be rendered in HTML format.
The deliverable shall be completed with a short oral presentation (10 min + 5 min of questions).
You shall describe the downloaded data.
You shall describe and motivate the way you organized the annotation process (if performed in a multithreaded/multiprocessing way).
Plots shall be endowed with titles, legends and captions.
Data pipelines and graphical pipelines shall be given in an appendix.
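To make the requirement on titles, legends and captions concrete, here is a minimal sketch that builds a Plotly figure specification as a plain dictionary from the five-title sample shown at the top of this page. The names `SAMPLE`, `works_per_year_figure` and `show` are hypothetical, and the `#|` lines assume the code sits in a Quarto code cell, where `fig-cap` supplies the caption.

```python
#| label: fig-sample-years
#| fig-cap: "Number of works per year in the five-title sample."
from collections import Counter

# (title, year, genre) triples copied from the sample table
SAMPLE = [
    ("La Bourse", 1832, "roman"),
    ("Le Chef-d'œuvre inconnu", 1831, "roman"),
    ("Les Secrets de la princesse de Cadignan", 1839, "roman"),
    ("Adieu", 1830, "roman"),
    ("La Femme abandonnée", 1833, "roman"),
]


def works_per_year_figure(works):
    """Return a Plotly figure spec (dict) with title, axis labels and legend."""
    genres = sorted({genre for _, _, genre in works})
    traces = []
    for genre in genres:
        counts = Counter(year for _, year, g in works if g == genre)
        years = sorted(counts)
        traces.append(
            {
                "type": "bar",
                "name": genre,  # legend entry
                "x": years,
                "y": [counts[y] for y in years],
            }
        )
    layout = {
        "title": {"text": "Works per year (sample of La Comédie humaine)"},
        "xaxis": {"title": {"text": "Year"}},
        "yaxis": {"title": {"text": "Number of works"}},
        "legend": {"title": {"text": "Genre"}},
        "showlegend": True,
        "barmode": "stack",
    }
    return {"data": traces, "layout": layout}


def show(figure):
    """Render the spec; plotly is imported lazily, only when displaying."""
    import plotly.io as pio

    pio.show(figure)
```

Keeping the specification separate from the rendering call makes it easy to reuse the same function for the report body and for the appendix listing.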
Data
La comédie humaine is a collection of 92 novels, short stories, and essays (romans, nouvelles et essais) written by Honoré de Balzac between 1829 and his death in 1850. The first complete edition is called the Furne edition.
Data can be downloaded/scraped from different sources:
- https://github.com/dh-trier/balzac/tree/master
- https://www.gutenberg.org/ (17 volumes of La Comédie humaine)
- many other places
Table wrangling should be performed using `Pandas`/`Polars` DataFrames or `Dask` DataFrames (not `Vaex`).
Your extraction (ELT) pipeline shall be reproducible and shall be given in an appendix.
You are not supposed to deliver the text files as a zipped archive.
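As an illustration of the load step, the sketch below walks a data directory, reads every plain-text file and builds one record per file, which can then be turned into a DataFrame. It assumes the texts were already downloaded as UTF-8 `.txt` files under `DATA/`; the function names and the record fields are hypothetical choices for the example.

```python
from pathlib import Path


def collect_texts(data_dir):
    """Return one record (dict) per .txt file found under data_dir."""
    records = []
    for path in sorted(Path(data_dir).rglob("*.txt")):
        text = path.read_text(encoding="utf-8")
        records.append(
            {
                "file": path.name,
                "n_chars": len(text),
                "n_words": len(text.split()),  # crude whitespace count
                "text": text,
            }
        )
    return records


def to_dataframe(records):
    """Turn the records into a pandas DataFrame (pandas imported lazily)."""
    import pandas as pd

    return pd.DataFrame.from_records(records)


# Example use (assumes pandas is installed and DATA/ holds the files):
# corpus = to_dataframe(collect_texts("DATA"))
```

Keeping file discovery separate from DataFrame construction lets you swap in Polars or Dask without touching the reading logic.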
Annotate with spaCy
Annotation shall be done on a per novel basis. It can be performed in a parallel (and distributed) way.
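A minimal sketch of per-novel annotation follows, assuming a French spaCy model such as `fr_core_news_sm` is installed. The model name, the `n_process` value, the disabled `ner` component and the function names are illustrative assumptions, not requirements. `nlp.pipe` with `as_tuples=True` keeps each title attached to its document, and `n_process` spreads novels over several worker processes.

```python
def doc_to_rows(title, doc):
    """Flatten one annotated document into one row (dict) per token."""
    return [
        {
            "title": title,
            "position": i,
            "token": tok.text,
            "lemma": tok.lemma_,
            "pos": tok.pos_,
        }
        for i, tok in enumerate(doc)
    ]


def annotate_corpus(novels, model="fr_core_news_sm", n_process=2):
    """Annotate a list of (title, text) pairs, one spaCy Doc per novel."""
    import spacy  # imported lazily: only needed when annotation is run

    nlp = spacy.load(model, disable=["ner"])
    # A whole novel may exceed spaCy's default maximum text length
    longest = max((len(text) for _, text in novels), default=0)
    nlp.max_length = max(nlp.max_length, longest + 1)

    # nlp.pipe expects (text, context) pairs when as_tuples=True
    pairs = ((text, title) for title, text in novels)
    rows = []
    for doc, title in nlp.pipe(pairs, as_tuples=True, n_process=n_process):
        rows.extend(doc_to_rows(title, doc))
    return rows
```

On platforms that start worker processes by spawning, a call with `n_process` greater than 1 generally needs to run under an `if __name__ == "__main__":` guard. The returned rows can be loaded into a DataFrame and persisted in the format of your choice.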
Graphical pipelines should be reproducible and shall be given in an appendix.
Keep the downloaded data in a separate subdirectory. Your working directory (working tree) should look something like this:
.
├── .git/
├── DATA/
├──
| :
├── _extensions/
├── _outdir/
├── _metadata.yml
├── _quarto.yml
├── our_report.qmd
├── :
└── README.md
Report organization
The first part (introduction) of the report shall be dedicated to the description of the data to be extracted and to the extraction pipeline.
The second part of the report shall be dedicated to the description of the load/transform pipeline.
The third part of the report shall be dedicated to the description of the annotation pipeline.
The fourth part of the report shall be dedicated to the stylometric analysis: which questions you picked (and why), plots, summary tables and comments. Refrain from overplaying your hand: your plots are unlikely to provide a new literary interpretation of Balzac's work. Comment on the data, all the data, and nothing but the data.
The fifth part is the appendix. The first four parts should be mostly text and plots. The fifth part should be code only.
The appendix shall be dedicated to the details of the pipelines. You shall give the code.
You shall also give the code of the graphical pipelines in the appendix.
You shall avoid copy-paste coding. Don’t Repeat Yourself. `knitr` provides the tools to organize the Quarto file so that you can write your code once and use it many times: once for data wrangling and plotting (without echoing), then for listing and explanation.
References
Data sources
Project Gutenberg: you can find La Comédie Humaine using a simple search. All volumes can be downloaded as text files from there.
Project Gutenberg is not the only possible data source.
Grading criteria
Criterion | Points | Details |
---|---|---|
Narrative, spelling and syntax | 20% | English/French |
Plot correctness | 15% | choice of `aesthetics`, `geom`, `scale` … |
Plot style | 10% | Titles, legends, labels, breaks … |
ETL | 20% | ETL, SQL-like manipulations |
Annotation | 10% | Annotations |
Computing Statistics | 5% | … |
DRY compliance | 20% | DRY principle at Wikipedia |
Appendix
For this homework, you are likely to use the `jupyter` engine instead of the `knitr` engine. To organize your report so that selected code chunks appear in the appendix, have a look at the `contents` shortcode in the Quarto documentation.