Technologies Big Data Master MIDS/MFA/LOGOS
2025-01-17
Dask in pictures
Overview - Dask's place in the universe.
Delayed
- The single-function way to parallelize general Python code.
DataFrame
- Parallelized operations on many pandas DataFrames spread across your cluster.
Type | Typical size | Features | Tool |
---|---|---|---|
Small data | A few gigabytes | Fits in RAM | Pandas |
Medium data | Less than 2 terabytes | Does not fit in RAM, fits on hard drive | Dask |
Large data | Petabytes | Does not fit on hard drive | Spark |
Dask provides multi-core and distributed parallel execution on larger-than-memory datasets.

Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy arrays, Python lists, and pandas DataFrames, but can operate in parallel on datasets that do not fit into memory (see the sketch below).

Dask provides dynamic task schedulers that execute task graphs in parallel. These schedulers/execution engines power the high-level collections, but can also power custom, user-defined workloads. These schedulers are low-latency and work hard to run computations within a small memory footprint.
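As a quick illustration of the high-level collections (a minimal sketch; the array shape and chunk sizes are arbitrary), a Dask array behaves like a NumPy array but is split into chunks and evaluated lazily:

```python
import dask.array as da

# 10_000 x 10_000 array split into 100 chunks of 1_000 x 1_000
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
y = (x + x.T).mean(axis=0)   # lazy: this only builds a task graph
y.compute()                  # the graph is executed chunk by chunk, in parallel
```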
Dask adoption metrics
The single-function way to parallelize general python code
LocalCluster
Dask can set itself up easily in your Python session if you create a LocalCluster object, which sets everything up for you.
Alternatively, you can skip this part, and Dask will operate within a thread pool contained entirely within your local process.
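A minimal sketch of the explicit setup (the worker counts are illustrative, not prescribed by the course):

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=2)  # illustrative sizing
client = Client(cluster)       # from now on, computations go to this cluster
print(client.dashboard_link)   # diagnostic dashboard, e.g. http://127.0.0.1:8787/status
```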
```python
# Plain (not yet delayed) helper functions used by the sequential loop below
def inc(x):
    return x + 1
def double(x):
    return x * 2
def add(x, y):
    return x + y

data = [1, 2, 3, 4, 5]
output = []
for x in data:
    a = inc(x)
    b = double(x)
    c = add(a, b)
    output.append(c)
total = sum(output)
total
```
Here inc increments x by 1 and double multiplies x by 2, so for each x, c == (x + 1) + 2*x == 3*x + 1, and the sequential computation returns

50
```python
import dask

output = []
for x in data:
    a = dask.delayed(inc)(x)
    b = dask.delayed(double)(x)
    c = dask.delayed(add)(a, b)
    output.append(c)
total = dask.delayed(sum)(output)
total
```
Each call to inc, double, add, and sum() is wrapped using dask.delayed(), so nothing is computed yet: total is a lazy Delayed object.

Delayed('sum-0e2e3c70-0734-43ab-b5b0-b48d4bb5d8eb')

Computing it returns the same result as before:

50
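Before executing, the task graph behind total can be drawn (a minimal sketch; rendering requires the optional graphviz dependency):

```python
total.visualize()   # draws the DAG of inc/double/add tasks feeding the final sum
```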
```python
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def double(x):
    return x * 2

@dask.delayed
def add(x, y):
    return x + y

data = [1, 2, 3, 4, 5]
output = []
for x in data:
    a = inc(x)       # these calls are now lazy
    b = double(x)
    c = add(a, b)
    output.append(c)
total = dask.delayed(sum)(output)
total
```

```python
total.compute()
```

50
6
12
The result of the evaluation of sum_data() depends not only on its argument (the Delayed e), but also on the side effects of add_data(), that is, on the Delayed b and d.

Note that not only was the DAG wrong, but the result obtained above was not the intended result.
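To make the pitfall concrete, here is a hedged sketch in the spirit of the example above; only the names add_data and sum_data come from the text, their bodies and the values are assumptions:

```python
import dask

store = []                      # shared mutable state: invisible to the task graph

@dask.delayed
def add_data(x):
    store.append(x)             # side effect
    return x

@dask.delayed
def sum_data(x):
    return sum(store) + x       # depends on which add_data calls have already run

b = add_data(1)
d = add_data(2)
e = add_data(3)
result = sum_data(e)
# The graph of `result` only records the dependency on `e`; whether b and d have
# executed (and mutated `store`) is not tracked, so the DAG is wrong and the
# value returned by result.compute() need not be the intended one.
result.compute()
```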
By default, Dask Delayed uses the threaded scheduler in order to avoid data transfer costs.

Consider using the multiprocessing scheduler or the dask.distributed scheduler, on a local machine or on a cluster, if your code does not release the GIL well (computations that are dominated by pure Python code, or computations wrapping external code and holding onto it).
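A minimal sketch of switching schedulers, reusing the total graph built above:

```python
import dask

total.compute(scheduler="threads")     # default for Delayed: shared-memory thread pool
total.compute(scheduler="processes")   # multiprocessing scheduler: sidesteps the GIL
dask.config.set(scheduler="processes") # or change the default for the whole session
```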
import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask.array as da
import dask.bag as db
Dask DataFrames parallelize the popular pandas library, providing:

- Larger-than-memory execution on a single machine, allowing you to process data that is larger than your available RAM
- Parallel execution for faster processing
- Distributed computation for terabyte-sized datasets

Dask DataFrames are similar to Apache Spark, but use the familiar pandas API and memory model. One Dask DataFrame is simply a coordinated collection of pandas DataFrames on different computers.
Dask DataFrame helps you process large tabular data by parallelizing pandas, either on your laptop for larger-than-memory computing, or on a distributed cluster of computers.

- Just pandas: Dask DataFrames are a collection of many pandas DataFrames. The API is the same; the execution is the same.
- Large scale: works on 100 GiB on a laptop, or 100 TiB on a cluster.
- Easy to use: pure Python, easy to set up and debug.
Dask DataFrames coordinate many pandas DataFrames/Series arranged along the index. A Dask DataFrame is partitioned row-wise, grouping rows by index value for efficiency. These pandas objects may live on disk or on other machines.
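The preview, divisions, and partition structures shown below match the small example from the Dask documentation; here is a hedged reconstruction of the code that produces them (the number of partitions is inferred from the divisions listed further down):

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

# 100 days of hourly rows, indexed by timestamp
index = pd.date_range("2021-09-01", periods=2400, freq="1h")
df = pd.DataFrame({"a": np.arange(2400),
                   "b": list("abcaddbe" * 300)},
                  index=index)

# 20 row-wise partitions of 5 days each
ddf = dd.from_pandas(df, npartitions=20)

df.head()        # the small pandas preview below
ddf.divisions    # the 21 partition boundaries listed after it
```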
|  | a | b |
|---|---|---|
| 2021-09-01 00:00:00 | 0 | a |
| 2021-09-01 01:00:00 | 1 | b |
| 2021-09-01 02:00:00 | 2 | c |
| 2021-09-01 03:00:00 | 3 | a |
| 2021-09-01 04:00:00 | 4 | d |
(Timestamp('2021-09-01 00:00:00'),
Timestamp('2021-09-06 00:00:00'),
Timestamp('2021-09-11 00:00:00'),
Timestamp('2021-09-16 00:00:00'),
Timestamp('2021-09-21 00:00:00'),
Timestamp('2021-09-26 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-06 00:00:00'),
Timestamp('2021-10-11 00:00:00'),
Timestamp('2021-10-16 00:00:00'),
Timestamp('2021-10-21 00:00:00'),
Timestamp('2021-10-26 00:00:00'),
Timestamp('2021-10-31 00:00:00'),
Timestamp('2021-11-05 00:00:00'),
Timestamp('2021-11-10 00:00:00'),
Timestamp('2021-11-15 00:00:00'),
Timestamp('2021-11-20 00:00:00'),
Timestamp('2021-11-25 00:00:00'),
Timestamp('2021-11-30 00:00:00'),
Timestamp('2021-12-05 00:00:00'),
Timestamp('2021-12-09 23:00:00'))
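The two lazy structures that follow presumably come from partition indexing and label-based slicing; a hedged guess at the corresponding calls:

```python
ddf.partitions[1]                        # a single 5-day partition (2021-09-06 to 2021-09-11)
ddf.loc["2021-10-01":"2021-10-09 5:00"]  # an index slice that spans two partitions
```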
|  | a | b |
|---|---|---|
| npartitions=1 |  |  |
| 2021-09-06 | int64 | string |
| 2021-09-11 | ... | ... |
|  | a | b |
|---|---|---|
| npartitions=2 |  |  |
| 2021-10-01 00:00:00.000000000 | int64 | string |
| 2021-10-06 00:00:00.000000000 | ... | ... |
| 2021-10-09 05:00:59.999999999 | ... | ... |
Dask DataFrames coordinate many Pandas DataFrames/Series arranged along an index.
We define a Dask DataFrame object with the following components:
- A Dask graph with a special set of keys designating partitions, such as (‘x’, 0), (‘x’, 1), …
- A name to identify which keys in the Dask graph refer to this DataFrame, such as ‘x’
- An empty Pandas object containing appropriate metadata (e.g. column names, dtypes, etc.)
- A sequence of partition boundaries along the index called divisions
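These components can be inspected directly on the ddf built above (a sketch; names starting with an underscore are internal and may change across Dask versions):

```python
ddf.divisions           # tuple of partition boundaries along the index
ddf._meta               # empty pandas DataFrame carrying column names and dtypes
ddf.__dask_graph__()    # the underlying task graph, keyed by (name, partition index)
```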
parquet
```python
import os

fname = 'fhvhv_tripdata_2022-11.parquet'
dpath = '../../../../Downloads/'
globpath = 'fhvhv_tripdata_20*-*.parquet'

!ls -l ../../../../Downloads/fhvhv_tripdata_20*-*.parquet
```
```python
%%time
data = dd.read_parquet(
    os.path.join(dpath, globpath),
    categories=['PULocationID', 'DOLocationID'],
    engine='auto',
)
```
After you have generated a task graph, it is the scheduler's job to execute it (see Scheduling).

By default, for the majority of Dask APIs, when you call compute() on a Dask object, Dask uses the thread pool on your computer (a.k.a. the threaded scheduler) to run computations in parallel. This is true for Dask Array, Dask DataFrame, and Dask Delayed. The exception is Dask Bag, which uses the multiprocessing scheduler by default.

If you want more control, use the distributed scheduler instead. Despite having "distributed" in its name, the distributed scheduler works well on both single and multiple machines. Think of it as the "advanced scheduler".
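A minimal sketch: once a distributed Client exists in the session, it becomes the default scheduler for subsequent compute() calls (the column name is taken from the lazy expression shown later in these notes):

```python
from dask.distributed import Client

client = Client()    # starts a LocalCluster under the hood and registers it as default
data['tip_amount'].mean().compute()   # now runs on the distributed scheduler
```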
Dask schedulers come with diagnostics to help you understand the performance characteristics of your computations. By using these diagnostics, and with some thought, we can often identify the slow parts of troublesome computations.

The single-machine and distributed schedulers come with different diagnostic tools. These tools are deeply integrated into each scheduler, so a tool designed for one will not transfer over to the other.
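A sketch of the two diagnostic routes, reusing total and client from above: the ProgressBar context manager only instruments the single-machine schedulers, while the distributed scheduler exposes a live web dashboard instead.

```python
from dask.diagnostics import ProgressBar

with ProgressBar():                       # single-machine diagnostics
    total.compute(scheduler="threads")    # force a local scheduler so the bar applies

print(client.dashboard_link)              # distributed diagnostics: the web dashboard
```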
For usage questions, see the dask tag on Stack Overflow.

|  | title | text | date |
|---|---|---|---|
| npartitions=1 |  |  |  |
|  | int64 | int64 | int64 |
|  | ... | ... | ... |
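The lazy Dask Series below, together with the expression printed with it, is consistent with computing the mean tip_amount per passenger_count on the parquet data loaded earlier; the exact call is not shown in these notes, so the following is only a guess:

```python
data.groupby('passenger_count')['tip_amount'].mean()
```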
Dask Series Structure:
npartitions=1
float64
...
Dask Name: getitem, 4 expressions
Expr=((ReadParquetFSSpec(117185e)[['passenger_count', 'tip_amount']]).mean(observed=False, chunk_kwargs={'numeric_only': False}, aggregate_kwargs={'numeric_only': False}, _slice='tip_amount'))['tip_amount']
| Client | Client-16edf3bb-108c-11f0-83c2-ac91a1bd3e89 |
|---|---|
| Connection method | Cluster object (distributed.LocalCluster) |
| Dashboard | http://127.0.0.1:8787/status |
| Status | running, using processes |
| Scheduler | Scheduler-718a693b-252a-46c1-8c52-f3bf40392248 (tcp://127.0.0.1:42119) |
| Workers | 5 (4 threads and 6.19 GiB of memory each) |
| Total threads | 20 |
| Total memory | 30.96 GiB |

Each worker is supervised by a nanny process and uses a local scratch directory under /tmp/dask-scratch-space/.
IFEBY030 – Technos Big Data – M1 MIDS/MFA/LOGOS – UParis Cité