Exploratory Data Analysis

Class Notes

Author

Stéphane Boucheron

Published

January 4, 2024

Foreword

These notes are for the course Exploratory Data Analysis. The audience is expected to have a dual Mathematics/Computer Science background.

Here is a foreword suggested by ChatGPT:

Welcome to the fascinating world of Exploratory Data Analysis! In these class notes, you’re about to embark on a journey that goes beyond the numbers, charts, and graphs. EDA is not just about crunching data; it’s about unraveling stories hidden within, discovering patterns, and posing questions that lead to deeper insights. So, buckle up, engage your curiosity, and let’s dive into the art and science of exploring the rich tapestry of data. Get ready to uncover the tales that numbers whisper and unleash the power of discovery!

There is nothing wrong with this AI-generated address, but it should be taken with a grain of salt. Let us start with “EDA is not just about crunching data”. Data explorers cannot assume that the data just sit there, waiting to be crunched and unravelled. More often than not, data have to be found in data warehouses, data lakes, or even data swamps. Data have to be fetched, and possibly transformed and reshaped, before they can be crunched. We will spend time on data tidying.

Data crunching requires different kinds of skills. The data we work on are usually tabular. Data scientists have to be able to handle tabular data swiftly: they need to query, update, and sometimes create tables in a way that is reminiscent of relational database engineering. We will spend time on table wrangling.

Crunching data is not just about counting occurrences of something in a sample. Crafting statistical summaries has a long history that relies heavily on the modeling tools offered by probability theory. We will not be shy about resorting to this mathematical background. This may go beyond describing samples using empirical distributions. In particular, we will explore concepts developed in the social sciences, such as inequality indices and residual life expectancy. We will also build on tools from matrix analysis (least-squares linear regression, singular value decomposition) to understand how dependencies can be assessed from empirical data.

Last, “EDA is not just about crunching data; it’s about unraveling stories hidden within, discovering patterns, and posing questions that lead to deeper insights”; that is, EDA is also about storytelling. Communicating results to a wider audience is best done through graphs (possibly animated), charts, and dashboards. We will spend a lot of time on visualization and on the elaboration of paper/web reports and, if time allows, on the creation of interactive dashboards.

In order to explore all those aspects of EDA, we will rely on R, a programming language and an environment geared towards statistical computing.

Documents will be created in Quarto Markdown format, like these notes. See for example Quarto for Scientists by Nicholas Tierney.