Technologies Big Data Master MIDS/MFA/LOGOS
2025-01-17
Why Python?
… (tuples, dict, list, set, etc.)
… C/C++ and easily accelerated (cython, numba, pypy)
Python: stackoverflow 2023 survey
Python popularity growth
Why Python for data science?
Besides these features, Python has:
Python Data Science Stack: Maths / Science
numpy is all about multi-dimensional arrays and matrices
numpy.linalg
numpy.random
scipy extends numpy with extra modules:
scipy.sparse
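As a minimal sketch of the modules named above (the matrix, the seed and the sample size are arbitrary values chosen for illustration), the following shows a numpy array, a linear solve with numpy.linalg, random sampling with numpy.random, and a sparse matrix from scipy.sparse:

```python
import numpy as np
from scipy import sparse

# A 2x2 matrix and a right-hand side, stored as numpy arrays
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])

# numpy.linalg: solve the linear system A x = b
x = np.linalg.solve(A, b)

# numpy.random: draw samples from a seeded generator (seed chosen arbitrarily)
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=1000)

# scipy.sparse: store only the non-zero entries of a mostly-zero matrix
S = sparse.csr_matrix(np.eye(4))

print(x)              # solution of the linear system
print(samples.shape)  # shape of the sample array
print(S.nnz)          # number of stored non-zero entries
```

The sparse matrix keeps only the diagonal entries in memory, which is what makes scipy.sparse useful for large matrices that are mostly zeros.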
Python Data Science Stack: Data processing
pandas builds upon numpy to provide a high-performance, easy-to-use DataFrame object, with high-level data processing
… csv, json, hdf5, feather, parquet, etc.
… SQL semantics: select, filter, join, groupby, agg, where, etc.
dask is roughly a distributed and parallel pandas
… pandas!
… spark, but can be useful
… spark, full Python (no JVM)
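The SQL-like operations that pandas offers (select, filter, join, groupby, agg) can be sketched as follows. The two small tables, their column names and the threshold are hypothetical, invented only for this illustration:

```python
import pandas as pd

# Hypothetical toy tables, invented for this illustration
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "C"],
    "amount": [10.0, 20.0, 5.0, 15.0, 30.0],
})
stores = pd.DataFrame({
    "store": ["A", "B", "C"],
    "city": ["Paris", "Lyon", "Paris"],
})

# filter (SQL WHERE): keep rows with an amount of at least 10
large = sales[sales["amount"] >= 10.0]

# join: attach the city of each store
joined = large.merge(stores, on="store", how="inner")

# select + groupby + agg: total and mean amount per city
summary = (
    joined[["city", "amount"]]
    .groupby("city")
    .agg(total=("amount", "sum"), mean=("amount", "mean"))
)
print(summary)
```

Each step returns a new DataFrame, so the operations chain in roughly the same order as the clauses of the equivalent SQL query.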
pyspark is the python API to spark, a big data processing framework
spark is written in scala: pyspark can be slower (much slower if you are not careful)
SQLAlchemy
The universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.
Python Data Science Stack: Data Visualization
matplotlib provides versatile 2D plotting capabilities
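A minimal sketch of a 2D line plot with matplotlib, assuming a script run without a display (hence the non-interactive Agg backend). The curves, labels and figure size are arbitrary choices made for illustration:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 200)

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), linestyle="--", label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("A basic 2D line plot")
ax.legend()

# Render the figure as PNG bytes in memory; pass a filename instead to write a file
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
png_bytes = buffer.getvalue()
```

In a jupyter notebook the backend line is usually unnecessary, since the figure is displayed inline.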
… javascript graphic library d3.js
… python interface, can be used in a jupyter notebook
Vega-Altair: Declarative Visualization in Python
Vega-Altair is a declarative visualization library for Python. Its simple, friendly and consistent API, built on top of the powerful Vega-Lite grammar, empowers you to spend less time writing code and more time exploring your data.
Python Data Science Stack: Dashboards
Python Data Science Stack: environments
Ways to use all these tools
Write a script script.py and use python directly in a CLI: python script.py
Use the ipython interactive shell
jupyter: a web application that lets you create and run documents, called notebooks (with .ipynb extension)
A notebook has a kernel running a python, R, Julia, … thread
An ipynb file is a json document. This leads to poor code diffs, a problem for git versioning
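To see why notebooks diff poorly, here is a rough sketch of what an ipynb file contains. The dictionary below is a hand-written, simplified notebook following the nbformat 4 layout (real notebooks carry more metadata), and it is handled with the standard json module only:

```python
import json

# Hand-written, simplified notebook with a single code cell (nbformat 4 layout)
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {"name": "stdout", "output_type": "stream", "text": ["2\n"]}
            ],
            "source": ["print(1 + 1)"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

# An .ipynb file is this structure serialized as json text
serialized = json.dumps(notebook, indent=1)

# Reading it back: the actual code is only a small part of the document
reloaded = json.loads(serialized)
code_cells = [c for c in reloaded["cells"] if c["cell_type"] == "code"]
source = "".join(code_cells[0]["source"])

print(source)
print(len(serialized.splitlines()), "lines of json for one line of code")
```

Execution counters and stored outputs change every time a cell is re-run, so git reports differences even when the code itself is unchanged.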
Quarto
Reticulate
Reticulate embeds a Python session within your R session, enabling seamless, high-performance interoperability. If you are an R developer that uses Python for some of your work or a member of data science team that uses both languages, reticulate can dramatically streamline your workflow!
Py2R
Python has several well-written packages for statistics and data science, but CRAN, R’s central repository, contains thousands of packages implementing sophisticated statistical algorithms that have been field-tested over many years. Thanks to the rpy2 package, Pythonistas can take advantage of the great work already done by the R community.
rpy2 provides an interface that allows you to run R in Python processes. Users can move between languages and use the best of both programming languages.
Many libraries for statistics, machine learning and deep learning
… numba, cython, cupy
Python APIs for most databases and clouds
Processing and plotting tools for Geospatial data
Image processing
Web development, web scraping
among many many many other things…
IFEBY030 – Technos Big Data – M1 MIDS/MFA/LOGOS – UParis Cité