Dask

Technologies Big Data Master MIDS/MFA/LOGOIS

Équipe de Statistique

LPSM Université Paris-Cité

2025-01-17

Dask: Big picture

Bird's-eye Big Picture

Dask in pictures

  • Overview - dask’s place in the universe.

  • Delayed - the single-function way to parallelize general python code.

  • Dataframe - parallelized operations on many pandas dataframes spread across your cluster

Flavours of (big) data

Type          Typical size           Features                                    Tool
Small data    A few gigabytes        Fits in RAM                                 Pandas
Medium data   Less than 2 terabytes  Does not fit in RAM, fits on a hard drive   Dask
Large data    Petabytes              Does not fit on a hard drive                Spark

Dask provides multi-core and distributed parallel execution on larger-than-memory datasets

Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and Pandas but can operate in parallel on datasets that do not fit into memory

Dask provides dynamic task schedulers that execute task graphs in parallel.

These schedulers/execution engines power the high-level collections but can also power custom, user-defined workloads

These schedulers are low-latency and work hard to run computations in a small memory footprint
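As a minimal sketch of these collections and schedulers in action (an illustration, not taken from the slides), a Dask Array and a Dask Bag can be built from ordinary data and evaluated lazily in parallel:

import numpy as np
import dask.array as da
import dask.bag as db

# A Dask Array mimics a NumPy array, split into chunks processed in parallel
x = da.from_array(np.arange(1_000_000), chunks=100_000)
print(x.sum().compute())                          # same value as the NumPy sum

# A Dask Bag mimics a list of records
b = db.from_sequence(range(10), npartitions=2)
print(b.map(lambda i: i ** 2).sum().compute())    # 285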

Sources

Dask Tutorial

Dask FAQ

Delayed

Delayed (in a nutshell)

The single-function way to parallelize general python code

Imports

import dask

dask.config.set(scheduler='threads')
dask.config.set({'dataframe.query-planning': True})
<dask.config.set at 0x768f0d1a4d40>
import dask.dataframe as dd
import dask.bag as db
from dask import delayed
import dask.threaded

from dask.distributed import Client
from dask.diagnostics import ProgressBar
from dask.diagnostics import Profiler, ResourceProfiler, CacheProfiler

LocalCluster

Dask can set itself up easily in your Python session if you create a LocalCluster object, which sets everything up for you.

# from dask.distributed import LocalCluster

# cluster = LocalCluster()
# client = cluster.get_client()

# Normal Dask work …

Alternatively, you can skip this part, and Dask will operate within a thread pool contained entirely within your local process.
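A quick sketch of this default behaviour (an illustration, not part of the original slides): without any cluster object, compute() runs in the local thread pool, and the pool size can be capped per call with num_workers:

import dask.array as da

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# No LocalCluster, no Client: this runs in the process-local thread pool
x.mean().compute()

# Same computation, restricted to 4 worker threads for this call
x.mean().compute(scheduler="threads", num_workers=4)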

Delaying Python tasks

A job (I)

def inc(x):
  return x + 1

def double(x):
  return x * 2

def add(x, y):
  return x + y

A job (II): piecing elements together

data = [1, 2, 3, 4, 5]

output = []

for x in data:
  a = inc(x)
  b = double(x)
  c = add(a, b)
  output.append(c)
  
total = sum(output)
  
total 
1. Increment x
2. Multiply x by 2
3. c == (x + 1) + 2*x == 3*x + 1

50

Delaying existing functions

output = []

for x in data:
  a = dask.delayed(inc)(x)
  b = dask.delayed(double)(x) 
  c = dask.delayed(add)(a, b) 
  output.append(c)
  
total = dask.delayed(sum)(output)
  
total
1. Decorating inc using dask.delayed()
2. Decorating sum()

Delayed('sum-0e2e3c70-0734-43ab-b5b0-b48d4bb5d8eb')

total.compute()

1. Collecting the results

50

Another way of using decorators

@dask.delayed
def inc(x):
  return x + 1

@dask.delayed
def double(x):
  return x * 2

@dask.delayed
def add(x, y):
  return x + y

data = [1, 2, 3, 4, 5]

output = []
for x in data:
  a = inc(x)
  b = double(x)
  c = add(a, b)
  output.append(c)
  
total = dask.delayed(sum)(output)
total
total.compute()
1. Decorating the definition
2. Reusing the Python code
3. Collecting results

50

Visualizing the task graph

total.visualize()
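visualize() forwards keyword arguments to graphviz, which helps with larger graphs; a small sketch (assuming graphviz is installed), using the documented filename and rankdir options:

# Left-to-right layout, written to an SVG file instead of the default PNG
total.visualize(filename="task-graph.svg", rankdir="LR")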

Tweaking the task graph

Another job

DATA = []

@dask.delayed
def inc(x):
  return x + 1

@dask.delayed
def add_data(x):
  DATA.append(x)

@dask.delayed
def sum_data(x):
  return sum(DATA) + x

a = inc(1)
b = add_data(a)
c = inc(3)
d = add_data(c)
e = inc(5)
f = sum_data(e)
f.compute()
6

A flawed task graph

f.visualize()

Fixing

from dask.graph_manipulation import bind

g = bind(sum_data, [b, d])(e)

g.compute()
12

The result of evaluating sum_data() depends not only on its argument, hence on the Delayed e, but also on the side effects of add_data(), that is, on the Delayed objects b and d.

Note that not only was the DAG wrong: the result computed above (6) was not the intended one (12).

g.visualize()

By default, Dask Delayed uses the threaded scheduler in order to avoid data transfer costs

Consider using the multiprocessing scheduler or the dask.distributed scheduler (on a local machine or on a cluster) if your code does not release the GIL well: computations dominated by pure Python code, or computations that wrap external code and hold on to the GIL.
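As a sketch of switching schedulers (reusing the delayed total built earlier; an illustration rather than a recommendation):

# Per-call choice of scheduler
total.compute(scheduler="threads")     # shared memory, low overhead, limited by the GIL
total.compute(scheduler="processes")   # sidesteps the GIL, but serializes inputs/outputs

# Or set a default for the whole session
dask.config.set(scheduler="processes")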

High level collections

Importing the usual suspects

import numpy as np
import pandas as pd

import dask.dataframe as dd
import dask.array as da
import dask.bag as db
1. Standard dataframes in Python
2. Parallelized and distributed dataframes in Python

Bird's-eye view

Dataframes

Dask Dataframes parallelize the popular pandas library, providing:

  • Larger-than-memory execution for single machines, allowing you to process data that is larger than your available RAM
  • Parallel execution for faster processing
  • Distributed computation for terabyte-sized datasets

Dask Dataframes are similar to Apache Spark, but use the familiar pandas API and memory model

One Dask dataframe is simply a coordinated collection of pandas dataframes on different computers

Dask DataFrame helps you process large tabular data by parallelizing Pandas, either on your laptop for larger-than-memory computing, or on a distributed cluster of computers.

[Figure: a Dask DataFrame as a column of squares, each square a constituent pandas DataFrame]

Just pandas: Dask DataFrames are a collection of many pandas DataFrames.

The API is the same. The execution is the same.

Large scale: Works on 100 GiB on a laptop, or 100 TiB on a cluster.

Easy to use: Pure Python, easy to set up and debug.

Dask DataFrames coordinate many pandas DataFrames/Series arranged along the index. A Dask DataFrame is partitioned row-wise, grouping rows by index value for efficiency. These pandas objects may live on disk or on other machines.

Creating a dask dataframe

index = pd.date_range("2021-09-01", 
                      periods=2400, 
                      freq="1H")

df = pd.DataFrame({
  "a": np.arange(2400), 
  "b": list("abcaddbe" * 300)}, 
  index=index)
  
ddf = dd.from_pandas(df, npartitions=20)

ddf.head()
1. In Dask, proper partitioning is a key performance issue
2. The dataframe API is (almost) the same as in Pandas!

                     a  b
2021-09-01 00:00:00  0  a
2021-09-01 01:00:00  1  b
2021-09-01 02:00:00  2  c
2021-09-01 03:00:00  3  a
2021-09-01 04:00:00  4  d
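Since partitioning is the key performance lever (annotation 1 above), it can be adjusted after creation; a sketch using repartition(), which accepts a target number of partitions or, for a datetime index, a frequency:

# Fewer, larger partitions
ddf10 = ddf.repartition(npartitions=10)

# One partition per calendar week, using the datetime index
ddf_weekly = ddf.repartition(freq="7D")

ddf10.npartitions, ddf_weekly.npartitions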

Inside the dataframe

A sketch of the interplay between index and partitioning

ddf.divisions
(Timestamp('2021-09-01 00:00:00'),
 Timestamp('2021-09-06 00:00:00'),
 Timestamp('2021-09-11 00:00:00'),
 Timestamp('2021-09-16 00:00:00'),
 Timestamp('2021-09-21 00:00:00'),
 Timestamp('2021-09-26 00:00:00'),
 Timestamp('2021-10-01 00:00:00'),
 Timestamp('2021-10-06 00:00:00'),
 Timestamp('2021-10-11 00:00:00'),
 Timestamp('2021-10-16 00:00:00'),
 Timestamp('2021-10-21 00:00:00'),
 Timestamp('2021-10-26 00:00:00'),
 Timestamp('2021-10-31 00:00:00'),
 Timestamp('2021-11-05 00:00:00'),
 Timestamp('2021-11-10 00:00:00'),
 Timestamp('2021-11-15 00:00:00'),
 Timestamp('2021-11-20 00:00:00'),
 Timestamp('2021-11-25 00:00:00'),
 Timestamp('2021-11-30 00:00:00'),
 Timestamp('2021-12-05 00:00:00'),
 Timestamp('2021-12-09 23:00:00'))

A dataframe has a task graph

ddf.visualize()


What’s in a partition?

ddf.partitions[1]
1. This selects the second partition of the dataframe (partitions are numbered from 0)

Dask DataFrame Structure:
                   a       b
npartitions=1
2021-09-06     int64  string
2021-09-11       ...     ...
Dask Name: partitions, 2 expressions
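Each partition is an ordinary pandas object, so partition-wise work can be expressed with map_partitions; a short sketch showing both facts on the ddf built above:

# Materializing one partition yields a plain pandas DataFrame
part = ddf.partitions[1].compute()
type(part)                            # pandas.core.frame.DataFrame

# Apply a function to every underlying pandas DataFrame
ddf.map_partitions(len).compute()     # number of rows in each partition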

Slicing

ddf["2021-10-01":"2021-10-09 5:00"]
1. Like slicing NumPy arrays or pandas DataFrames.

Dask DataFrame Structure:
                                    a       b
npartitions=2
2021-10-01 00:00:00.000000000   int64  string
2021-10-06 00:00:00.000000000     ...     ...
2021-10-09 05:00:59.999999999     ...     ...
Dask Name: loc, 2 expressions

Dask dataframes (cont’d)

Dask DataFrames coordinate many Pandas DataFrames/Series arranged along an index.

We define a Dask DataFrame object with the following components:

  • A Dask graph with a special set of keys designating partitions, such as (‘x’, 0), (‘x’, 1), …
  • A name to identify which keys in the Dask graph refer to this DataFrame, such as ‘x’
  • An empty Pandas object containing appropriate metadata (e.g. column names, dtypes, etc.)
  • A sequence of partition boundaries along the index called divisions
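These components can be inspected directly on the ddf built earlier; a quick sketch:

ddf._meta              # empty pandas DataFrame carrying column names and dtypes
ddf.divisions          # partition boundaries along the index
ddf.npartitions        # number of underlying pandas DataFrames
ddf.__dask_graph__()   # the task graph, with keys such as (name, partition number)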

Methods

( 
  ddf.a
    .mean()
)
<dask_expr.expr.Scalar: expr=df['a'].mean(), dtype=float64>
( 
  ddf.a
    .mean()
    .compute()
)
np.float64(1199.5)
(
  ddf
    .b
    .unique()
)
Dask Series Structure:
npartitions=20
    string
       ...
     ...  
       ...
       ...
Dask Name: unique, 3 expressions
Expr=Unique(frame=df['b'])
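As with mean() above, nothing is evaluated until compute() is called; a short follow-up materializing the distinct labels (column b was built from the letters a–e):

ddf.b.unique().compute()   # triggers the actual computation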

Reading and writing from parquet

fname = 'fhvhv_tripdata_2022-11.parquet'
dpath = '../../../../Downloads/'

globpath = 'fhvhv_tripdata_20*-*.parquet'

!ls -l ../../../../Downloads/fhvhv_tripdata_20*-*.parquet
import os

os.path.expanduser('~' + '/Documents')
'/home/boucheron/Documents'
%%time 

data = dd.read_parquet(
  os.path.join(dpath, globpath),
  categories= ['PULocationID',
               'DOLocationID'], 
  engine='auto'
)
type(data)
df = data  # dd.read_parquet already returns a Dask DataFrame, no conversion needed
df.info()
df._meta.dtypes

df.npartitions
df.head()
type(df)
df._meta.dtypes
df._meta_nonempty
df.info()
df.divisions
df.describe(include="all")

Partitioning and saving to parquet

import pyarrow as pa

schm = pa.Schema.from_pandas(df._meta)

schm
df.PULocationID.unique().compute()
df.to_parquet( 
  'fhvhv_tripdata_2022-11',
  partition_on= ['PULocationID'],
  engine='pyarrow', 
  schema = schm
  )
df.info(memory_usage=True)
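Partitioning on a column pays off when reading back: read_parquet can prune both columns and partitions. A sketch reading the dataset written above (the zone id 132 is an arbitrary example; adjust the column names to the actual schema):

subset = dd.read_parquet(
  'fhvhv_tripdata_2022-11',
  columns=['DOLocationID'],
  filters=[('PULocationID', '==', 132)],   # directories not matching the filter are skipped
)
subset.head()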

Schedulers

After you have generated a task graph, it is the scheduler’s job to execute it (see Scheduling).

By default, for the majority of Dask APIs, when you call compute() on a Dask object, Dask uses the thread pool on your computer (a.k.a. the threaded scheduler) to run computations in parallel. This is true for Dask Array, Dask DataFrame, and Dask Delayed. The exception is Dask Bag, which uses the multiprocessing scheduler by default.

If you want more control, use the distributed scheduler instead. Despite having “distributed” in its name, the distributed scheduler works well on both single and multiple machines. Think of it as the “advanced scheduler”.

Performance

Dask schedulers come with diagnostics to help you understand the performance characteristics of your computations

By using these diagnostics and with some thought, we can often identify the slow parts of troublesome computations

The single-machine and distributed schedulers come with different diagnostic tools

These tools are deeply integrated into each scheduler, so a tool designed for one will not transfer over to the other
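For the single-machine schedulers, the diagnostics imported at the top of these slides can be used as context managers around any compute() call; a minimal sketch (bokeh is needed for the visualize step):

from dask.diagnostics import ProgressBar, Profiler, ResourceProfiler

# Progress bar for a single-machine computation
with ProgressBar():
  ddf.a.mean().compute()

# Task-level and resource-level profiling
with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
  ddf.a.mean().compute()

prof.visualize()   # interactive timeline of the executed tasks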

Dask query optimization

Demo

Visualize task graphs

Single threaded scheduler and a normal Python profiler

Diagnostics for the single-machine scheduler

Diagnostics for the distributed scheduler and dashboard

Scale up/Scale out

References

Reference

Ask for help

  • dask tag on Stack Overflow, for usage questions
  • github issues for bug reports and feature requests
  • gitter chat for general, non-bug, discussion

Books

Blogs

Loading a Parquet file

dpath = '/home/boucheron/Dropbox/MMD-2021/DATA/ny_corpus_prq/'

globpath = '*/*.parquet'

data = dd.read_parquet(
  os.path.join(dpath, globpath),
  engine='auto'
)
data.info
<bound method DataFrame.info of Dask DataFrame Structure:
                 title   topic    text             date
npartitions=77                                         
                string  string  string  category[known]
                   ...     ...     ...              ...
...                ...     ...     ...              ...
                   ...     ...     ...              ...
                   ...     ...     ...              ...
Dask Name: read_parquet, 1 expression
Expr=ReadParquetFSSpec(92994fd)>

( 
  data
    .groupby("topic")
    .count()
)
Dask DataFrame Structure:
               title   text   date
npartitions=1
               int64  int64  int64
                 ...    ...    ...
Dask Name: count, 2 expressions

ddf = dd.read_parquet(
    "s3://dask-data/nyc-taxi/nyc-2015.parquet/part.*.parquet",
    columns=[
      "passenger_count", 
      "tip_amount"],
    storage_options={"anon": True},
)
result = (
  ddf
    .groupby("passenger_count")
    .tip_amount
    .mean()
#    .compute()
)

result
Dask Series Structure:
npartitions=1
    float64
        ...
Dask Name: getitem, 4 expressions
Expr=((ReadParquetFSSpec(117185e)[['passenger_count', 'tip_amount']]).mean(observed=False, chunk_kwargs={'numeric_only': False}, aggregate_kwargs={'numeric_only': False}, _slice='tip_amount'))['tip_amount']
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
client

Client: Client-16edf3bb-108c-11f0-83c2-ac91a1bd3e89
Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status