Homework: Spark, File Formats, and NLP
- Due : May 18, 2026
- Work in pairs
- Deliver your work as a `qmd` file through a GitHub repository
- Use the `quarto` package for reproducible research
- Use `pyspark` or `sparklyr`
- Use `spark-nlp` for text annotation
- The report should be rendered at least in HTML format, and possibly also in PDF format
Objectives
This homework is an opportunity to use `pyspark`/`sparklyr`/`spark-nlp`.
- Extract/Load/Transform a corpus comprising novels from the Romantic era (Balzac, Dumas, Stendhal, Sand, Sue, …, the Brontë sisters, Scott) as well as from authors considered Realists (Flaubert, Zola, Dickens, …). This task requires substantial data-acquisition work.
- Annotate the corpus with `spark-nlp`
- Perform stylometric analysis and visualize the results using either `plotly` or `altair`
- Design a way to store results using `parquet`/`orc` files. Motivate your solution.
If time allows, compare annotations from Spark NLP with annotations from spaCy (usability, agreement).
In the stylometric analysis, you should at least:
- Compute Flesch-Kincaid and Kandel-Moles readability indices and design a visualization
- Compute readability indices over sliding windows of different sizes. How stable are the readability indices?
- Display Zipf plots for the different documents
- Segment the different texts into dialog and narration parts.
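As a starting point for the readability computation, here is a minimal pure-Python sketch of the Flesch-Kincaid grade-level formula. The naive vowel-group syllable counter and the function names are assumptions, not a prescribed implementation; a serious version would use a hyphenation dictionary, and the scoring function can then be applied per document (or per window) through a Spark UDF.

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: one syllable per group of consecutive vowels.
    groups = re.findall(r"[aeiouyàâäéèêëîïôöùûü]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    # FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-zÀ-ÿ']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n_sent = max(1, len(sentences))
    n_words = max(1, len(words))
    return 0.39 * (n_words / n_sent) + 11.8 * (syllables / n_words) - 15.59
```

Sliding-window stability can then be assessed by calling `flesch_kincaid_grade` on overlapping word slices of varying lengths.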
The goal of the project is not stylometric analysis per se, but performing a series of text-processing jobs with Spark. Your workflow should work at least in local mode, and if possible in standalone mode. If you are able to work on a cluster using platforms such as Google Colab, it will be appreciated.
Tune your Spark session so as to minimize shuffles and to use multi-core architectures as much as possible.
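A local-mode session along these lines is one possible starting point; the specific values below are assumptions to be tuned for your machine, not recommended settings.

```python
from pyspark.sql import SparkSession

# Illustrative local-mode configuration (values are assumptions to tune).
spark = (SparkSession.builder
         .master("local[*]")                            # use all available cores
         .appName("stylometry")
         .config("spark.sql.shuffle.partitions", "8")   # default 200 is excessive locally
         .config("spark.driver.memory", "4g")
         .config("spark.sql.adaptive.enabled", "true")  # AQE can coalesce shuffle partitions
         .getOrCreate())
```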
Your deliverable shall consist of a qmd file that can be rendered in HTML format.
You shall describe the downloaded data.
Plots shall be endowed with titles, legends and captions.
Data, NLP pipelines and graphical pipelines shall be given in an appendix.
Data
Data can be downloaded/scraped from different sources:
- https://github.com/dh-trier/balzac/tree/master (not complete)
- https://www.gutenberg.org/ (17 volumes of La Comédie humaine)
- Find the data for the other authors by yourself.
Table wrangling should be performed using Spark DataFrames.
Your extraction (ELT) pipeline shall be reproducible and shall be given and motivated in an appendix.
You are not supposed to deliver the text files as a zipped archive.
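One way to sketch the loading step, assuming an existing `spark` session and plain-text files under `DATA/` (the column names are illustrative): the `wholetext` option of `DataFrameReader.text` reads each file as a single row, which keeps one novel per row.

```python
from pyspark.sql import functions as F

# Read each .txt file as one row; wholetext=True keeps one novel per row.
corpus = (spark.read.text("DATA/*.txt", wholetext=True)
               .withColumn("path", F.input_file_name())
               .withColumnRenamed("value", "text"))
```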
Annotate with spark-nlp
You will have to use Spark 3.5.x, not Spark 4.x.
Annotation shall be done on a per novel basis. It should be performed in a parallel (and distributed) way.
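A minimal annotation pipeline could look like the sketch below, assuming a DataFrame `novels_df` with one novel per row in a `text` column (that name and the annotator choice are assumptions; richer pipelines add part-of-speech tagging, lemmatization, etc.). Since Spark NLP annotators are Spark ML stages, the transform is distributed over the rows.

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer
from pyspark.ml import Pipeline

spark = sparknlp.start()

# One row per novel in novels_df (columns: title, text).
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentences = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokens = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

pipeline = Pipeline(stages=[document, sentences, tokens])
annotated = pipeline.fit(novels_df).transform(novels_df)
```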
Graphical pipelines should be reproducible and shall be given in an appendix.
Keep the downloaded data in a separate subdirectory. Your working directory (working tree) should look something like this:
```
.
├── .git/
├── DATA/
│   └── :
├── _extensions/
├── _outdir/
├── _metadata.yml
├── _quarto.yml
├── our_report.qmd
├── our_presentation.qmd
├── :
└── README.md
```
Report organization
The first part (introduction) of the report shall be dedicated to the description of the data to be extracted and to the extraction pipeline.
The second part of the report shall be dedicated to the description of load/transform pipeline.
The third part of the report shall be dedicated to the description of the annotation pipelines.
The fourth part of the report shall be dedicated to the stylometric analysis: the questions you picked up, plots, summary tables and comments. Refrain from overplaying your hand: your plots are not likely to provide a new literary interpretation of 19th-century literature. Comment the data, all the data, and nothing but the data.
The fifth and most important part deals with your experience with Spark and big data file formats.
- Use Spark UI and possibly other profiling tools to investigate the ways Spark processes the jobs you submit.
- Identify critical shuffles
- Try to mitigate the impact of critical shuffles
- Detail the mitigating strategies
- Compare their impact
- Assess the impact of caching/persisting/checkpointing on your workflow
- Compare the impact of using csv/parquet/orc/json for storage (on local file systems at least, hdfs if you can)
The sixth part is the appendix, dedicated to the details of the pipelines. The first four parts should be mostly text and plots; the fifth part should be text, tables and diagrams; the sixth part should be code only. You shall give the code of the data, NLP and graphical pipelines in the appendix. See the report template with the code in the appendix using the Jupyter engine.
You shall avoid copy-paste coding. Don’t Repeat Yourself. Quarto with Jupyter engine provides the tools to organize the Quarto file so that you can write your code once and use it many times, once for data wrangling and plotting (without echoing), then for listing and explanation.
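For the storage-format comparison in part five, a small benchmarking helper along these lines may be useful (assuming an existing `spark` session and an annotated DataFrame such as `tokens_df`; the helper name and output path are illustrative). Note that `csv` requires a flat schema of simple types, so nested annotation columns must be flattened first.

```python
import time

def timed_roundtrip(df, base_path, fmt):
    # Write the DataFrame in the given format, read it back,
    # and return elapsed seconds for each step.
    path = f"{base_path}/{fmt}"
    t0 = time.perf_counter()
    df.write.mode("overwrite").format(fmt).save(path)
    t1 = time.perf_counter()
    n = spark.read.format(fmt).load(path).count()  # force a full read
    t2 = time.perf_counter()
    return {"format": fmt, "write_s": t1 - t0, "read_s": t2 - t1, "rows": n}

# results = [timed_roundtrip(tokens_df, "_outdir/bench", f)
#            for f in ("csv", "json", "parquet", "orc")]
```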
References
Data sources
- Project Gutenberg: you can find *La Comédie humaine* (Balzac) using a simple search. All volumes can be downloaded as text files from there.
- For other authors, it is up to you.
Grading criteria
| Criterion | Points | Details |
|---|---|---|
| Spark experience | \(30\%\) | |
| Narrative, spelling and syntax | \(10\%\) | English/French |
| Plot correctness | \(10\%\) | choice of aesthetics, geoms, scales … |
| Plot style | \(5\%\) | Titles, legends, labels, breaks … |
| ETL | \(15\%\) | ETL, SQL like manipulations |
| Annotation | \(10\%\) | Annotations |
| Computing Statistics | \(10\%\) | … |
| DRY compliance | \(10\%\) | DRY principle (see Wikipedia) |