Dask

Technologies Big Data Master MIDS/MFA/LOGOIS

Équipe de Statistique

LPSM Université Paris-Cité

2025-01-17

Dask: Big picture

Bird's-eye Big Picture

Dask in pictures

  • Overview - dask’s place in the universe.

  • Delayed - the single-function way to parallelize general python code.

  • Dataframe - parallelized operations on many pandas dataframes spread across your cluster

Flavours of (big) data

Type          Typical size           Features                                    Tool
Small data    A few gigabytes        Fits in RAM                                 Pandas
Medium data   Less than 2 terabytes  Does not fit in RAM, fits on a hard drive   Dask
Large data    Petabytes              Does not fit on a hard drive                Spark

Dask provides multi-core and distributed parallel execution on larger-than-memory datasets

Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and Pandas but can operate in parallel on datasets that do not fit into memory

Dask provides dynamic task schedulers that execute task graphs in parallel.

These schedulers/execution engines power the high-level collections but can also power custom, user-defined workloads

These schedulers are low-latency and work hard to run computations in a small memory footprint
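As a minimal sketch of these collections and schedulers in action (an illustration, not taken from the slides), a Dask Array and a Dask Bag can be built from ordinary data and evaluated lazily in parallel:

import numpy as np
import dask.array as da
import dask.bag as db

# A Dask Array mimics a NumPy array, split into chunks processed in parallel
x = da.from_array(np.arange(1_000_000), chunks=100_000)
print(x.sum().compute())                          # same value as the NumPy sum

# A Dask Bag mimics a list of records
b = db.from_sequence(range(10), npartitions=2)
print(b.map(lambda i: i ** 2).sum().compute())    # 285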

Sources

Dask Tutorial

Dask FAQ

Delayed

Delayed (in a nutshell)

The single-function way to parallelize general python code

Imports

import dask

dask.config.set(scheduler='threads')
dask.config.set({'dataframe.query-planning': True})
<dask.config.set at 0x768f0d1a4d40>
import dask.dataframe as dd
import dask.bag as db
from dask import delayed
import dask.threaded

from dask.distributed import Client
from dask.diagnostics import ProgressBar
from dask.diagnostics import Profiler, ResourceProfiler, CacheProfiler

LocalCluster

Dask can set itself up easily in your Python session if you create a LocalCluster object, which sets everything up for you.

# from dask.distributed import LocalCluster

# cluster = LocalCluster()
# client = cluster.get_client()

# Normal Dask work …

Alternatively, you can skip this part, and Dask will operate within a thread pool contained entirely within your local process.
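A quick sketch of this default behaviour (an illustration, not part of the original slides): without any cluster object, compute() runs in the local thread pool, and the pool size can be capped per call with num_workers:

import dask.array as da

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# No LocalCluster, no Client: this runs in the process-local thread pool
x.mean().compute()

# Same computation, restricted to 4 worker threads for this call
x.mean().compute(scheduler="threads", num_workers=4)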

Delaying Python tasks

A job (I)

def inc(x):
  return x + 1

def double(x):
  return x * 2

def add(x, y):
  return x + y

A job (II): piecing elements together

data = [1, 2, 3, 4, 5]

output = []

for x in data:
  a = inc(x)
  b = double(x)
  c = add(a, b)
  output.append(c)
  
total = sum(output)
  
total 
1. Increment x
2. Multiply x by 2
3. c == (x + 1) + 2*x == 3*x + 1

50

Delaying existing functions

output = []

for x in data:
  a = dask.delayed(inc)(x)
  b = dask.delayed(double)(x) 
  c = dask.delayed(add)(a, b) 
  output.append(c)
  
total = dask.delayed(sum)(output)
  
total
1. Decorating inc using dask.delayed()
2. Decorating sum()

Delayed('sum-0e2e3c70-0734-43ab-b5b0-b48d4bb5d8eb')

total.compute()

1. Collecting the results

50

Another way of using decorators

@dask.delayed
def inc(x):
  return x + 1

@dask.delayed
def double(x):
  return x * 2

@dask.delayed
def add(x, y):
  return x + y

data = [1, 2, 3, 4, 5]

output = []
for x in data:
  a = inc(x)
  b = double(x)
  c = add(a, b)
  output.append(c)
  
total = dask.delayed(sum)(output)
total
total.compute()
1. Decorating the definition
2. Reusing the Python code
3. Collecting results

50

Visualizing the task graph

total.visualize()
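visualize() forwards keyword arguments to graphviz, which helps with larger graphs; a small sketch (assuming graphviz is installed), using the documented filename and rankdir options:

# Left-to-right layout, written to an SVG file instead of the default PNG
total.visualize(filename="task-graph.svg", rankdir="LR")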

Tweaking the task graph

Another job

DATA = []

@dask.delayed
def inc(x):
  return x + 1

@dask.delayed
def add_data(x):
  DATA.append(x)

@dask.delayed
def sum_data(x):
  return sum(DATA) + x

a = inc(1)
b = add_data(a)
c = inc(3)
d = add_data(c)
e = inc(5)
f = sum_data(e)
f.compute()
6

A flawed task graph

f.visualize()

Fixing

from dask.graph_manipulation import bind

g = bind(sum_data, [b, d])(e)

g.compute()
12

The result of evaluating sum_data() depends not only on its argument, hence on the Delayed e, but also on the side effects of add_data(), that is, on the Delayed objects b and d.

Note that not only was the DAG wrong: the result computed above (6) was not the intended one (12).

g.visualize()

By default, Dask Delayed uses the threaded scheduler in order to avoid data transfer costs

Consider using the multiprocessing scheduler or the dask.distributed scheduler (on a local machine or on a cluster) if your code does not release the GIL well: computations dominated by pure Python code, or computations that wrap external code and hold on to the GIL.
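As a sketch of switching schedulers (reusing the delayed total built earlier; an illustration rather than a recommendation):

# Per-call choice of scheduler
total.compute(scheduler="threads")     # shared memory, low overhead, limited by the GIL
total.compute(scheduler="processes")   # sidesteps the GIL, but serializes inputs/outputs

# Or set a default for the whole session
dask.config.set(scheduler="processes")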

High level collections

Importing the usual suspects

import numpy as np
import pandas as pd

import dask.dataframe as dd
import dask.array as da
import dask.bag as db
1. Standard dataframes in Python
2. Parallelized and distributed dataframes in Python

Bird's-eye view

Dataframes

Dask Dataframes parallelize the popular pandas library, providing:

  • Larger-than-memory execution for single machines, allowing you to process data that is larger than your available RAM
  • Parallel execution for faster processing
  • Distributed computation for terabyte-sized datasets

Dask Dataframes are similar to Apache Spark, but use the familiar pandas API and memory model

One Dask dataframe is simply a coordinated collection of pandas dataframes on different computers

Dask DataFrame helps you process large tabular data by parallelizing Pandas, either on your laptop for larger-than-memory computing, or on a distributed cluster of computers.

[Figure: a Dask DataFrame as a column of squares, each square a constituent pandas DataFrame]

Just pandas: Dask DataFrames are a collection of many pandas DataFrames.

The API is the same. The execution is the same.

Large scale: Works on 100 GiB on a laptop, or 100 TiB on a cluster.

Easy to use: Pure Python, easy to set up and debug.

Dask DataFrames coordinate many pandas DataFrames/Series arranged along the index. A Dask DataFrame is partitioned row-wise, grouping rows by index value for efficiency. These pandas objects may live on disk or on other machines.

Creating a dask dataframe

index = pd.date_range("2021-09-01", 
                      periods=2400, 
                      freq="1H")

df = pd.DataFrame({
  "a": np.arange(2400), 
  "b": list("abcaddbe" * 300)}, 
  index=index)
  
ddf = dd.from_pandas(df, npartitions=20)

ddf.head()
1. In Dask, proper partitioning is a key performance issue
2. The dataframe API is (almost) the same as in Pandas!

                     a  b
2021-09-01 00:00:00  0  a
2021-09-01 01:00:00  1  b
2021-09-01 02:00:00  2  c
2021-09-01 03:00:00  3  a
2021-09-01 04:00:00  4  d
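Since partitioning is the key performance lever (annotation 1 above), it can be adjusted after creation; a sketch using repartition(), which accepts a target number of partitions or, for a datetime index, a frequency:

# Fewer, larger partitions
ddf10 = ddf.repartition(npartitions=10)

# One partition per calendar week, using the datetime index
ddf_weekly = ddf.repartition(freq="7D")

ddf10.npartitions, ddf_weekly.npartitions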

Inside the dataframe

A sketch of the interplay between index and partitioning

ddf.divisions
(Timestamp('2021-09-01 00:00:00'),
 Timestamp('2021-09-06 00:00:00'),
 Timestamp('2021-09-11 00:00:00'),
 Timestamp('2021-09-16 00:00:00'),
 Timestamp('2021-09-21 00:00:00'),
 Timestamp('2021-09-26 00:00:00'),
 Timestamp('2021-10-01 00:00:00'),
 Timestamp('2021-10-06 00:00:00'),
 Timestamp('2021-10-11 00:00:00'),
 Timestamp('2021-10-16 00:00:00'),
 Timestamp('2021-10-21 00:00:00'),
 Timestamp('2021-10-26 00:00:00'),
 Timestamp('2021-10-31 00:00:00'),
 Timestamp('2021-11-05 00:00:00'),
 Timestamp('2021-11-10 00:00:00'),
 Timestamp('2021-11-15 00:00:00'),
 Timestamp('2021-11-20 00:00:00'),
 Timestamp('2021-11-25 00:00:00'),
 Timestamp('2021-11-30 00:00:00'),
 Timestamp('2021-12-05 00:00:00'),
 Timestamp('2021-12-09 23:00:00'))

A dataframe has a task graph

ddf.visualize()


What’s in a partition?

ddf.partitions[1]
1. This selects the second partition of the dataframe (partitions are numbered from 0)

Dask DataFrame Structure:
                   a       b
npartitions=1
2021-09-06     int64  string
2021-09-11       ...     ...
Dask Name: partitions, 2 expressions
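Each partition is an ordinary pandas object, so partition-wise work can be expressed with map_partitions; a short sketch showing both facts on the ddf built above:

# Materializing one partition yields a plain pandas DataFrame
part = ddf.partitions[1].compute()
type(part)                            # pandas.core.frame.DataFrame

# Apply a function to every underlying pandas DataFrame
ddf.map_partitions(len).compute()     # number of rows in each partition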

Slicing

ddf["2021-10-01":"2021-10-09 5:00"]
1. Like slicing NumPy arrays or pandas DataFrames.

Dask DataFrame Structure:
                                    a       b
npartitions=2
2021-10-01 00:00:00.000000000   int64  string
2021-10-06 00:00:00.000000000     ...     ...
2021-10-09 05:00:59.999999999     ...     ...
Dask Name: loc, 2 expressions

Dask dataframes (cont’d)

Dask DataFrames coordinate many Pandas DataFrames/Series arranged along an index.

We define a Dask DataFrame object with the following components:

  • A Dask graph with a special set of keys designating partitions, such as (‘x’, 0), (‘x’, 1), …
  • A name to identify which keys in the Dask graph refer to this DataFrame, such as ‘x’
  • An empty Pandas object containing appropriate metadata (e.g. column names, dtypes, etc.)
  • A sequence of partition boundaries along the index called divisions
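These components can be inspected directly on the ddf built earlier; a quick sketch:

ddf._meta              # empty pandas DataFrame carrying column names and dtypes
ddf.divisions          # partition boundaries along the index
ddf.npartitions        # number of underlying pandas DataFrames
ddf.__dask_graph__()   # the task graph, with keys such as (name, partition number)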

Methods

( 
  ddf.a
    .mean()
)
<dask_expr.expr.Scalar: expr=df['a'].mean(), dtype=float64>
( 
  ddf.a
    .mean()
    .compute()
)
np.float64(1199.5)
(
  ddf
    .b
    .unique()
)
Dask Series Structure:
npartitions=20
    string
       ...
     ...  
       ...
       ...
Dask Name: unique, 3 expressions
Expr=Unique(frame=df['b'])
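As with mean() above, nothing is evaluated until compute() is called; a short follow-up materializing the distinct labels (column b was built from the letters a–e):

ddf.b.unique().compute()   # triggers the actual computation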

Reading and writing from parquet

fname = 'fhvhv_tripdata_2022-11.parquet'
dpath = '../../../../Downloads/'

globpath = 'fhvhv_tripdata_20*-*.parquet'

!ls -l ../../../../Downloads/fhvhv_tripdata_20*-*.parquet
import os

os.path.expanduser('~' + '/Documents')
'/home/boucheron/Documents'
%%time 

data = dd.read_parquet(
  os.path.join(dpath, globpath),
  categories= ['PULocationID',
               'DOLocationID'], 
  engine='auto'
)
type(data)
df = data  # dd.read_parquet already returns a Dask DataFrame, no conversion needed
df.info()
df._meta.dtypes

df.npartitions
df.head()
type(df)
df._meta.dtypes
df._meta_nonempty
df.info()
df.divisions
df.describe(include="all")

Partitioning and saving to parquet

import pyarrow as pa

schm = pa.Schema.from_pandas(df._meta)

schm
df.PULocationID.unique().compute()
df.to_parquet( 
  'fhvhv_tripdata_2022-11',
  partition_on= ['PULocationID'],
  engine='pyarrow', 
  schema = schm
  )
df.info(memory_usage=True)
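Partitioning on a column pays off when reading back: read_parquet can prune both columns and partitions. A sketch reading the dataset written above (the zone id 132 is an arbitrary example; adjust the column names to the actual schema):

subset = dd.read_parquet(
  'fhvhv_tripdata_2022-11',
  columns=['DOLocationID'],
  filters=[('PULocationID', '==', 132)],   # directories not matching the filter are skipped
)
subset.head()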

Schedulers

After you have generated a task graph, it is the scheduler’s job to execute it (see Scheduling).

By default, for the majority of Dask APIs, when you call compute() on a Dask object, Dask uses the thread pool on your computer (a.k.a. the threaded scheduler) to run computations in parallel. This is true for Dask Array, Dask DataFrame, and Dask Delayed. The exception is Dask Bag, which uses the multiprocessing scheduler by default.

If you want more control, use the distributed scheduler instead. Despite having “distributed” in its name, the distributed scheduler works well on both single and multiple machines. Think of it as the “advanced scheduler”.

Performance

Dask schedulers come with diagnostics to help you understand the performance characteristics of your computations

By using these diagnostics and with some thought, we can often identify the slow parts of troublesome computations

The single-machine and distributed schedulers come with different diagnostic tools

These tools are deeply integrated into each scheduler, so a tool designed for one will not transfer over to the other
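For the single-machine schedulers, the diagnostics imported at the top of these slides can be used as context managers around any compute() call; a minimal sketch (bokeh is needed for the visualize step):

from dask.diagnostics import ProgressBar, Profiler, ResourceProfiler

# Progress bar for a single-machine computation
with ProgressBar():
  ddf.a.mean().compute()

# Task-level and resource-level profiling
with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
  ddf.a.mean().compute()

prof.visualize()   # interactive timeline of the executed tasks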

Dask query optimization

Demo

Visualize task graphs

Single threaded scheduler and a normal Python profiler

Diagnostics for the single-machine scheduler

Diagnostics for the distributed scheduler and dashboard

Scale up/Scale out

References

Reference

Ask for help

  • dask tag on Stack Overflow, for usage questions
  • github issues for bug reports and feature requests
  • gitter chat for general, non-bug, discussion

Books

Blogs

Loading a Parquet file

dpath = '/home/boucheron/Dropbox/MMD-2021/DATA/ny_corpus_prq/'

globpath = '*/*.parquet'

data = dd.read_parquet(
  os.path.join(dpath, globpath),
  engine='auto'
)
data.info
<bound method DataFrame.info of Dask DataFrame Structure:
                 title   topic    text             date
npartitions=77                                         
                string  string  string  category[known]
                   ...     ...     ...              ...
...                ...     ...     ...              ...
                   ...     ...     ...              ...
                   ...     ...     ...              ...
Dask Name: read_parquet, 1 expression
Expr=ReadParquetFSSpec(92994fd)>

( 
  data
    .groupby("topic")
    .count()
)
Dask DataFrame Structure:
               title   text   date
npartitions=1
               int64  int64  int64
                 ...    ...    ...
Dask Name: count, 2 expressions

ddf = dd.read_parquet(
    "s3://dask-data/nyc-taxi/nyc-2015.parquet/part.*.parquet",
    columns=[
      "passenger_count", 
      "tip_amount"],
    storage_options={"anon": True},
)
result = (
  ddf
    .groupby("passenger_count")
    .tip_amount
    .mean()
#    .compute()
)

result
Dask Series Structure:
npartitions=1
    float64
        ...
Dask Name: getitem, 4 expressions
Expr=((ReadParquetFSSpec(117185e)[['passenger_count', 'tip_amount']]).mean(observed=False, chunk_kwargs={'numeric_only': False}, aggregate_kwargs={'numeric_only': False}, _slice='tip_amount'))['tip_amount']
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
client

Client: Client-16edf3bb-108c-11f0-83c2-ac91a1bd3e89
Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status