Python Data Science Stack

Technologies Big Data Master MIDS/MFA/LOGOIS

Équipe de Statistique

LPSM Université Paris-Cité

2025-01-17

What is `Python`?

first released in 1991
designed by Guido van Rossum (BDFL until 2018)
multi-purpose
easy to read
easy to learn
object-oriented
strongly and dynamically typed
cross-platform
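
As an illustration of "strongly and dynamically typed", here is a minimal sketch using only built-in types. The variable names are chosen for the example and carry no special meaning.

```python
# Dynamic typing: a name is not tied to a type and can be rebound freely.
value = 42
first_type = type(value).__name__    # type of the object currently bound
value = "forty-two"
second_type = type(value).__name__   # the same name now refers to a str

# Strong typing: unrelated types are never coerced implicitly.
try:
    result = "1" + 1                 # str + int is refused
except TypeError as exc:
    result = None
    error_message = str(exc)

# Conversions have to be written explicitly.
explicit_sum = int("1") + 1
```

The type lives with the object rather than with the name, and mixing incompatible types raises an error instead of producing a silent conversion.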

Features of `Python`

High-level data types (tuples, dict, list, set, etc.)
Standard libraries with batteries included
- String services,
- Regular expressions
- Datetime
- …
Libraries for scientific computing
Easy and efficient I/O, many file formats
OS, threading, multiprocessing
Networking, email, HTML, web servers, scraping
Can be extended with C/C++ and easily accelerated (cython, numba, pypy)
Tons of external libraries
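
A small sketch of the "batteries included" idea, relying only on the standard library modules `re` and `datetime`. The input string and the weekly spacing are made up for the example.

```python
import re
from datetime import date, timedelta

# Hypothetical input text containing ISO-formatted dates.
text = "Course sessions: 2025-01-17, 2025-01-24"

# Regular expression with three capture groups: year, month, day.
pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# findall returns one tuple of strings per match; convert them to date objects.
sessions = [date(int(y), int(m), int(d)) for y, m, d in pattern.findall(text)]

# Date arithmetic: subtracting two dates gives a timedelta.
gap = sessions[1] - sessions[0]

# Assumption for the example: sessions repeat every 7 days.
next_session = sessions[-1] + timedelta(days=7)
next_session_iso = next_session.isoformat()
```

No external package is needed for this kind of string and date handling.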

Trends

The `stackoverflow` 2023 survey

`Python` popularity growth

Why `Python` for data science?

Besides these features, Python has:

large communities for data science, analytics, etc.
many, well-established, well-documented libraries
huge demand from the industry

The `Python` Data Science Stack: Maths / Science

Numpy

numpy is all about multi-dimensional arrays and matrices
high-level computation such as
- linear algebra: numpy.linalg
- random number generation: numpy.random
Fast but not optimized for multi-threaded architectures
Not for distributed multi-machine settings
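
The two submodules named above can be combined in a few lines. This is a minimal sketch: the matrix size and the seed are arbitrary choices made for the example.

```python
import numpy as np

# Random number generation with the Generator API (seed chosen arbitrarily).
rng = np.random.default_rng(seed=0)

# Build a symmetric positive definite matrix so the linear system is well posed.
M = rng.normal(size=(3, 3))
A = M @ M.T + 3 * np.eye(3)
b = rng.normal(size=3)

# Linear algebra: solve A x = b without forming the inverse explicitly.
x = np.linalg.solve(A, b)
residual = np.linalg.norm(A @ x - b)

# Vectorized computation with broadcasting: subtract each column's mean.
centered = A - A.mean(axis=0)
```

Operations apply to whole arrays at once, so no explicit Python loop is needed.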

Scipy

scipy extends numpy with extra modules:
- optimization,
- integration,
- FFT, signal and image processing
- …
Sparse matrix formats in scipy.sparse
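
A short sketch of two of these modules, `scipy.sparse` and `scipy.optimize`. The tridiagonal matrix and the quadratic objective are toy examples chosen for illustration.

```python
import numpy as np
from scipy import optimize, sparse

# Sparse tridiagonal matrix stored in CSR format: only non-zero entries are kept.
n = 5
T = sparse.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
stored_entries = T.nnz            # number of explicitly stored values

# Sparse matrix times dense vector.
y = T @ np.ones(n)

# One-dimensional optimization of a toy objective with minimum at t = 2.
res = optimize.minimize_scalar(lambda t: (t - 2.0) ** 2)
```

The sparse matrix stores 13 values instead of the 25 a dense 5 x 5 array would hold, and the gap widens quickly as `n` grows.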

The `Python` Data Science Stack: Data processing

Pandas

pandas builds upon numpy to provide a high-performance, easy-to-use DataFrame object, with high-level data processing
Easy I/O with most data formats: csv, json, hdf5, feather, parquet, etc.
SQL semantics: select, filter, join, groupby, agg, where, etc.
Very large general-purpose library for data processing, not distributed, medium scale data only
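
The SQL-like operations listed above map onto DataFrame methods roughly as follows. The two tables and their column names are invented for the example.

```python
import pandas as pd

# Hypothetical toy tables.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "amount": [10.0, 5.0, 7.0, 3.0],
})
stores = pd.DataFrame({"store": ["A", "B"], "city": ["Paris", "Lyon"]})

# WHERE: keep rows satisfying a boolean condition.
large = sales[sales["amount"] > 4]

# JOIN: attach the city of each store.
joined = large.merge(stores, on="store", how="left")

# GROUP BY + aggregate: total amount per city.
totals = joined.groupby("city", as_index=False).agg(total=("amount", "sum"))
```

Each step returns a new DataFrame, so operations chain naturally, much like clauses in a SQL query.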

Dask

dask is roughly a distributed and parallel pandas
Mostly the same API as pandas!
Task scheduling, lazy evaluation, distributed dataframes
Still young and far behind spark, but can be useful
Easier than spark, full Python (no JVM)

Links

Dask homepage

Pyspark

pyspark is the Python API to Spark, a big data processing framework
We will use it a lot in this course
Spark's native API is Scala: pyspark can be slower (much slower if you are not careful)

`SQLAlchemy`

Object-Relational Mapper (ORM)
ODBC

Pyarrow

The universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics

Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.

The `Python` Data Science Stack: Data Visualization

Matplotlib

matplotlib provides versatile 2D plotting capabilities
- scientific computing
- data visualization
Large and customizable library
The historical one, somewhat low-level when plotting things related to data
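
A minimal plotting sketch using the object-oriented interface. The non-interactive `Agg` backend and the in-memory PNG buffer are choices made here so that the example runs without a display; in a notebook or a script you would typically call `plt.show()` or save to a file instead.

```python
import io
import math

import matplotlib
matplotlib.use("Agg")             # non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Sample the sine function on [0, 6.2].
xs = [i * 0.1 for i in range(63)]
ys = [math.sin(x) for x in xs]

# Figure and axes are explicit objects; every element is set by hand.
fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(xs, ys, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple line plot")
ax.legend()

# Render to PNG in memory rather than on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
```

Labels, title and legend all have to be set explicitly, which is what makes the library flexible but somewhat low-level for data-oriented plots.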

Links

Matplotlib Homepage

Plotly

An interactive visualization library for web browsers, built on the JavaScript graphics library d3.js
Clean and simple Python interface; can be used in a Jupyter notebook
Interactions enabled by default (zoom, etc.) and fast rendering
Very good looking plots with good default parameters

Links

Plotly homepage

Altair

Vega-Altair: Declarative Visualization in Python

Vega-Altair is a declarative visualization library for Python. Its simple, friendly and consistent API, built on top of the powerful Vega-Lite grammar, empowers you to spend less time writing code and more time exploring your data.

The `Python` Data Science Stack: Dashboards

Dash

Links

Dash homepage

Shiny

Links

Shiny homepage

`Python` Data Science Stack: environments

Pure Python interfaces

Ways to use all these tools

Write a script script.py and run it with python from the command line: python script.py
Use the ipython interactive shell

Interfaces : Jupyter

Use jupyter: a web application that lets you create and run documents called notebooks (with the .ipynb extension)
Notebooks can contain code, equations, visualizations, text, etc. (literate programming)
Each notebook has a kernel running a Python, R, Julia, … process
A problem: an .ipynb file is a JSON document, which leads to poor code diffs and makes versioning with git awkward
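
To see why notebook diffs are noisy, it helps to look at what a notebook file contains. The dictionary below is a hand-written, simplified imitation of the notebook JSON layout, built only with the standard library `json` module; it is an illustration and is not guaranteed to pass the official notebook schema validation.

```python
import json

# Simplified, hand-written notebook structure (assumption: illustrative only).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            # Outputs are stored in the file next to the code that produced them.
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
            "source": ["print(1 + 1)"],
        },
    ],
}

# What would be written to disk as a .ipynb file.
serialized = json.dumps(notebook, indent=1)

# Reading it back: the actual code is only a small part of the document.
loaded = json.loads(serialized)
code_sources = [
    "".join(cell["source"])
    for cell in loaded["cells"]
    if cell["cell_type"] == "code"
]
```

Execution counts and outputs change every time a cell is re-run, so two versions of the same notebook can differ in many lines even when the code itself is unchanged.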

`Quarto`

Interfaces/IDE : VS Code (and other editors)

Python and R

`Reticulate`

Reticulate embeds a Python session within your R session, enabling seamless, high-performance interoperability. If you are an R developer that uses Python for some of your work or a member of data science team that uses both languages, reticulate can dramatically streamline your workflow!

Links

Reticulate homepage

`rpy2`

Python has several well-written packages for statistics and data science, but CRAN, R’s central repository, contains thousands of packages implementing sophisticated statistical algorithms that have been field-tested over many years. Thanks to the rpy2 package, Pythonistas can take advantage of the great work already done by the R community. rpy2 provides an interface that allows you to run R in Python processes. Users can move between languages and use the best of both programming languages.

rpy2 homepage

But also…

Many libraries for statistics, machine learning and deep learning

Statistics

statsmodels

Machine learning

Deep learning

Getting faster

numba, cython, cupy

And …

Python APIs for most databases and clouds
Processing and plotting tools for Geospatial data
Image processing
Web development, web scraping

among many many many other things…

Thank you!
