Technologies Big Data Master MIDS/MFA/LOGOS
2025-01-17
Python?
Built-in data structures (tuples, dict, list, set, etc.)
Interfaces with C/C++ and is easily accelerated (cython, numba, pypy)
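As a small illustration of the built-in containers named above (tuples, dict, list, set), here is a minimal sketch. The variable names and values are made up for the example and are not from the course material.

```python
# Built-in containers: no import is needed.
point = (3, 4)                 # tuple: immutable, fixed-size record
scores = [12, 7, 12, 9]        # list: ordered, mutable sequence
distinct = set(scores)         # set: unique elements, fast membership tests

counts = {}                    # dict: key -> value mapping
for s in scores:
    counts[s] = counts.get(s, 0) + 1   # count occurrences of each score

x, y = point                           # tuple unpacking
squares = [s * s for s in scores]      # list comprehension
```

The same four types cover most everyday needs before any third-party library is required.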
stackoverflow 2023 survey
Figure: Python popularity growth
Python for data science?
Besides these features, Python has:
Python Data Science Stack: Maths / Science
numpy is all about multi-dimensional arrays and matrices
numpy.linalg
numpy.random
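A minimal sketch of how numpy arrays, numpy.random and numpy.linalg fit together. The matrix, the seed and the sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # numpy.random: reproducible generator
A = rng.normal(size=(3, 3))           # 3x3 array of standard normal draws
A = A @ A.T + 3 * np.eye(3)           # make it symmetric positive definite
b = np.ones(3)

x = np.linalg.solve(A, b)             # numpy.linalg: solve the system A x = b
residual = np.linalg.norm(A @ x - b)  # should be close to zero
eigenvalues = np.linalg.eigvalsh(A)   # eigenvalues of a symmetric matrix
```

Operations such as `@` and `np.linalg.solve` act on whole arrays at once, which is what makes numpy code both short and fast.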

scipy extends numpy with extra modules:
scipy.sparse
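To give an idea of what scipy.sparse adds on top of numpy, the sketch below builds a tridiagonal matrix that stores only its non-zero entries. The size `n` and the choice of matrix are assumptions made for the example.

```python
import numpy as np
from scipy import sparse

n = 1000
# Tridiagonal matrix (-1, 2, -1) in compressed sparse row (CSR) format.
T = sparse.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")

v = np.ones(n)
w = T @ v          # sparse matrix times dense vector gives a dense vector
stored = T.nnz     # number of stored entries, far fewer than n * n
```

A dense array of the same shape would hold one million values, while the sparse version stores only the three diagonals.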
Python Data Science Stack: Data processing

pandas builds upon numpy to provide a high-performance, easy-to-use DataFrame object, with high-level data processing
File formats: csv, json, hdf5, feather, parquet, etc.
SQL semantics: select, filter, join, groupby, agg, where, etc.
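The SQL-like operations listed above can be sketched on two tiny DataFrames. The tables, column names and values are invented for the example.

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "C"],
    "amount": [10.0, 20.0, 5.0, 15.0, 30.0],
})
stores = pd.DataFrame({
    "store": ["A", "B", "C"],
    "city": ["Paris", "Lyon", "Paris"],
})

big = sales[sales["amount"] >= 10]                    # filter (SQL WHERE)
joined = big.merge(stores, on="store", how="inner")   # join
selected = joined[["city", "amount"]]                 # select columns
per_city = (
    selected.groupby("city", as_index=False)          # groupby
    .agg(total=("amount", "sum"), n=("amount", "count"))  # agg
)
```

Each step returns a new DataFrame, so the operations chain in roughly the same order as the clauses of a SQL query.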

dask is roughly a distributed and parallel pandas
… pandas!
… spark, but can be useful
… spark, full Python (no JVM)


pyspark is the python API to spark, a big data processing framework
spark is written in scala: pyspark can be slower (much slower if you are not careful)
SQLAlchemy



The universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.
Python Data Science Stack: Data Visualization
matplotlib provides versatile 2D plotting capabilities
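A minimal sketch of a 2D plot with matplotlib. The curves, the labels and the in-memory PNG output are choices made for the example; the "Agg" backend is selected only so the code runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")              # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), linestyle="--", label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("Two curves on one set of axes")
ax.legend()

buffer = io.BytesIO()              # write the figure to memory instead of a file
fig.savefig(buffer, format="png", dpi=100)
plt.close(fig)
```

In a notebook the figure is displayed inline; passing a file name to `fig.savefig` instead of a buffer writes it to disk.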
javascript graphic library d3.js
python interface, can be used in a jupyter notebook
Vega-Altair: Declarative Visualization in Python
Vega-Altair is a declarative visualization library for Python. Its simple, friendly and consistent API, built on top of the powerful Vega-Lite grammar, empowers you to spend less time writing code and more time exploring your data.
Python Data Science Stack: Dashboards
Python Data Science Stack: environments


Ways to use all these tools
Write a script script.py and run it with python directly from a CLI: python script.py
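A sketch of what such a script.py could contain. The function `describe` is a hypothetical example, not part of the course material.

```python
"""script.py: a minimal script meant to be run as `python script.py`."""
import platform


def describe():
    """Return a short message with the version of the running interpreter."""
    return f"Running Python {platform.python_version()}"


# This block runs only when the file is executed as a script,
# not when it is imported as a module from another file.
if __name__ == "__main__":
    print(describe())
```

The `__name__` guard lets the same file be imported for testing or run from the command line.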
Use the ipython interactive shell



jupyter: a web application that lets you create and run documents, called notebooks (with the .ipynb extension)
Each notebook has a kernel running a python, R, Julia, … thread
An ipynb file is a json document. This leads to poor code diffs, which is a problem for git versioning
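To see why notebook files diff poorly, the sketch below builds a simplified, hand-written notebook-like JSON document. It mimics the general layout of an ipynb file but is an assumption for illustration and is not validated against the official notebook schema.

```python
import copy
import json

# Simplified notebook: one code cell stored together with its output.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        }
    ],
}

# Re-running the cell changes the execution counter but not the code.
rerun = copy.deepcopy(notebook)
rerun["cells"][0]["execution_count"] = 2

text_before = json.dumps(notebook, indent=1)
text_after = json.dumps(rerun, indent=1)


def code_of(nb):
    """Extract only the source code of the code cells."""
    return ["".join(c["source"]) for c in nb["cells"] if c["cell_type"] == "code"]


same_code = code_of(notebook) == code_of(rerun)   # the code is unchanged
file_changed = text_before != text_after          # yet the file differs
```

Because outputs and execution counters are saved next to the code, git reports changes even when no code was edited.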
Quarto
Reticulate
Reticulate embeds a Python session within your R session, enabling seamless, high-performance interoperability. If you are an R developer that uses Python for some of your work or a member of data science team that uses both languages, reticulate can dramatically streamline your workflow!
Py2R
Python has several well-written packages for statistics and data science, but CRAN, R’s central repository, contains thousands of packages implementing sophisticated statistical algorithms that have been field-tested over many years. Thanks to the rpy2 package, Pythonistas can take advantage of the great work already done by the R community. rpy2 provides an interface that allows you to run R in Python processes. Users can move between languages and use the best of both programming languages.
Many libraries for statistics, machine learning and deep learning
numba, cython, cupy
Python APIs for most databases and clouds
Processing and plotting tools for Geospatial data
Image processing
Web development, web scraping
among many many many other things…
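As one illustration of the database item in the list above, the standard library ships the sqlite3 module. The table name, columns and rows below are invented for the example, and other databases are reached through their own driver packages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")    # temporary in-memory database
conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",   # placeholders avoid SQL injection
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)

rows = conn.execute(
    "SELECT sensor, AVG(value) FROM measurements "
    "GROUP BY sensor ORDER BY sensor"
).fetchall()
conn.close()
```

The query results come back as a list of plain Python tuples, ready to be passed to the other tools of the stack.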
IFEBY030 – Technos Big Data – M1 MIDS/MFA/LOGOS – UParis Cité