Technologies Big Data Master MIDS/MFA/LOGOIS
2025-01-17
- Overview: Dask’s place in the universe.
- Delayed: the single-function way to parallelize general Python code.
- Lazy: some of the principles behind lazy execution, for the interested.
- Bag: the first high-level collection: a generalized iterator for use with a functional programming style and to clean messy data (scaling up and out Python lists).
- Array: blocked numpy-like functionality, with a collection of numpy arrays spread across your cluster.
- Dataframe: parallelized operations on many pandas dataframes spread across your cluster.
- Distributed: Dask’s scheduler for clusters, with details of how to view the UI.
- Advanced Distributed: further details on distributed computing, including how to debug.
- Dataframe Storage: efficient ways to read and write dataframes to disk.
- Machine Learning: applying Dask to machine-learning problems.
Type | Typical size | Features | Tool |
---|---|---|---|
Small data | a few gigabytes | fits in RAM | pandas |
Medium data | less than 2 terabytes | does not fit in RAM, fits on hard drive | Dask |
Large data | petabytes | does not fit on hard drive | Spark |
Dask provides multi-core and distributed parallel execution on larger-than-memory datasets.

Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and pandas, but can operate in parallel on datasets that don’t fit into memory.

Dask provides dynamic task schedulers that execute task graphs in parallel. These execution engines power the high-level collections, but can also power custom, user-defined workloads. These schedulers are low-latency and work hard to run computations in a small memory footprint.
Delayed is the single-function way to parallelize general Python code.
LocalCluster

Dask can set itself up easily in your Python session if you create a LocalCluster object, which sets everything up for you. Alternatively, you can skip this part, and Dask will operate within a thread pool contained entirely within your local process.
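As a minimal sketch (the worker and thread counts below are illustrative choices, not recommendations), creating a LocalCluster and connecting a Client might look like this:

```python
# Start a local cluster of worker processes and connect a Client to it.
# Once a Client exists, subsequent Dask computations use this cluster.
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)
    print(client.dashboard_link)   # URL of the diagnostic dashboard (UI)
    client.close()
    cluster.close()
```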
The result of the evaluation of sum_data() depends not only on its argument, hence on the Delayed e, but also on the side effects of add_data(), that is, on the Delayed b and d. Note that not only was the DAG wrong, but the result obtained above was also not the intended result.
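A hedged illustration of the same point: dask.delayed assumes tasks are pure, so results should flow through return values rather than side effects. (The function names below are illustrative, not the ones from the example above.)

```python
import dask

@dask.delayed
def double(x):      # pure: the result depends only on the argument
    return 2 * x

@dask.delayed
def add(x, y):      # dependencies are explicit edges in the task graph
    return x + y

total = add(double(1), double(2))   # graph: two doubles feeding one add
print(total.compute())              # → 6
```

Because nothing is mutated in place, the DAG captures every dependency and the scheduler is free to run the two `double` tasks in parallel.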
By default, Dask Delayed uses the threaded scheduler in order to avoid data transfer costs. Consider using the multiprocessing scheduler or the dask.distributed scheduler, on a local machine or on a cluster, if your code does not release the GIL well (computations that are dominated by pure Python code, or computations wrapping external code and holding onto it).
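As a sketch, the scheduler can be chosen per call to compute(); "threads" is the default for Delayed:

```python
import dask

@dask.delayed
def square(x):
    return x * x

total = dask.delayed(sum)([square(i) for i in range(4)])

# Threaded scheduler (the default for dask.delayed):
print(total.compute(scheduler="threads"))       # → 14

# For GIL-bound pure-Python work, the multiprocessing scheduler can help,
# at the cost of inter-process data transfer:
#   total.compute(scheduler="processes")
```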
Sometimes you want to create and destroy work during execution, launch tasks from other tasks, etc. For this, see the Futures interface.
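A minimal sketch of the Futures interface, assuming an in-process client for simplicity:

```python
from dask.distributed import Client

def inc(x):
    return x + 1

if __name__ == "__main__":
    client = Client(processes=False)   # in-process client: no extra setup
    a = client.submit(inc, 1)          # returns a Future immediately
    b = client.submit(inc, a)          # new tasks can be launched from futures
    print(b.result())                  # → 3
    client.close()
```

Unlike Delayed, futures start executing as soon as they are submitted, which is what makes launching tasks from other tasks possible.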
Dask Dataframes parallelize the popular pandas library, providing:
- Larger-than-memory execution for single machines, allowing you to process data that is larger than your available RAM
- Parallel execution for faster processing
- Distributed computation for terabyte-sized datasets
Dask Dataframes are similar in this regard to Apache Spark, but use the familiar pandas API and memory model. One Dask dataframe is simply a collection of pandas dataframes on different computers.
Dask DataFrame helps you process large tabular data by parallelizing pandas, either on your laptop for larger-than-memory computing, or on a distributed cluster of computers.
Just pandas: Dask DataFrames are a collection of many pandas DataFrames.
The API is the same. The execution is the same.
Large scale: Works on 100 GiB on a laptop, or 100 TiB on a cluster.
Easy to use: Pure Python, easy to set up and debug.
[Figure: a column of four squares collectively labeled as a Dask DataFrame, with a single constituent square labeled as a pandas DataFrame.]
Dask DataFrames coordinate many pandas DataFrames/Series arranged along the index. A Dask DataFrame is partitioned row-wise, grouping rows by index value for efficiency. These pandas objects may live on disk or on other machines.
```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

index = pd.date_range("2021-09-01", periods=2400, freq="1H")
df = pd.DataFrame({"a": np.arange(2400),
                   "b": list("abcaddbe" * 300)},
                  index=index)
ddf = dd.from_pandas(df, npartitions=20)
ddf.head()
```
We define a Dask DataFrame object with the following components:
- A Dask graph with a special set of keys designating partitions, such as (‘x’, 0), (‘x’, 1), …
- A name to identify which keys in the Dask graph refer to this DataFrame, such as ‘x’
- An empty Pandas object containing appropriate metadata (e.g. column names, dtypes, etc.)
- A sequence of partition boundaries along the index called divisions
After you have generated a task graph, it is the scheduler’s job to execute it (see Scheduling).
By default, for the majority of Dask APIs, when you call compute() on a Dask object, Dask uses the thread pool on your computer (a.k.a. the threaded scheduler) to run computations in parallel. This is true for Dask Array, Dask DataFrame, and Dask Delayed. The exception is Dask Bag, which uses the multiprocessing scheduler by default.
If you want more control, use the distributed scheduler instead. Despite having "distributed" in its name, the distributed scheduler works well on both single and multiple machines. Think of it as the "advanced scheduler".
Dask schedulers come with diagnostics to help you understand the performance characteristics of your computations.
By using these diagnostics and with some thought, we can often identify the slow parts of troublesome computations.
The single-machine and distributed schedulers come with different diagnostic tools. These tools are deeply integrated into each scheduler, so a tool designed for one will not transfer over to the other.
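For the single-machine schedulers, for example, a progress bar diagnostic is available (the distributed scheduler uses its web dashboard instead):

```python
import dask
from dask.diagnostics import ProgressBar

@dask.delayed
def work(i):
    return i * i

total = dask.delayed(sum)([work(i) for i in range(10)])

with ProgressBar():          # prints a live progress bar while computing
    result = total.compute()
print(result)                # → 285
```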
* [Docs](https://dask.org/)
* [Examples](https://examples.dask.org/)
* [Code](https://github.com/dask/dask/)
* [Blog](https://blog.dask.org/)
* [Scipy 2020]()
* [`dask`](http://stackoverflow.com/questions/tagged/dask) tag on Stack Overflow, for usage questions
* [GitHub issues](https://github.com/dask/dask/issues/new) for bug reports and feature requests
* [Gitter chat](https://gitter.im/dask/dask) for general, non-bug, discussion
IFEBY030 – Technos Big Data – M1 MIDS/MFA/LOGOS – UParis Cité