Analyse de Données Master MIDS
2021-12-21
Spreadsheets (Excel)
Relational tables
Dataframes in data science frameworks
R: data.frame, tibble, …
Python: pandas.DataFrame
Spark: DataFrame
Dask: DataFrame
In Data Science, each framework comes with its own flavor(s) of table(s)
Tables from relational databases serve as inspiration
Dataframes have long shaped the daily life of statisticians and data scientists
The purpose of this session is to:
describe dataframes from an end-user viewpoint (we leave aside implementations)
present the tools (packages) we will use:
Metapackage tidyverse provides tools to create, query, and tidy dataframes, as well as tools to load data from various sources and save them to persistent storage
nycflights13 provides the dataframes we play with
gt for tailoring table displays
The flights table
Rows: 6
Columns: 19
$ year <int> 2013,…
$ month <int> 1, 1,…
$ day <int> 1, 1,…
$ dep_time <int> 517, …
$ sched_dep_time <int> 515, …
$ dep_delay <dbl> 2, 4,…
$ arr_time <int> 830, …
$ sched_arr_time <int> 819, …
$ arr_delay <dbl> 11, 2…
$ carrier <chr> "UA",…
$ flight <int> 1545,…
$ tailnum <chr> "N142…
$ origin <chr> "EWR"…
$ dest <chr> "IAH"…
$ air_time <dbl> 227, …
$ distance <dbl> 1400,…
$ hour <dbl> 5, 5,…
$ minute <dbl> 15, 2…
$ time_hour <dttm> 2013…
A table is a list of columns
Each column has a name and a class/type
Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 517, 533, 542, 544, 554, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, …
$ arr_time <int> 830, 850, 923, 1004, 812,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12,…
$ carrier <chr> "UA", "UA", "AA", "B6", "…
$ flight <int> 1545, 1714, 1141, 725, 46…
$ tailnum <chr> "N14228", "N24211", "N619…
$ origin <chr> "EWR", "LGA", "JFK", "JFK…
$ dest <chr> "IAH", "IAH", "MIA", "BQN…
$ air_time <dbl> 227, 227, 160, 183, 116, …
$ distance <dbl> 1400, 1416, 1089, 1576, 7…
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6…
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0,…
$ time_hour <dttm> 2013-01-01 05:00:00, 201…
flights has 19 columns
Each column is a (vector) of items with the same type/class
flights has 336,776 rows

class | columns |
---|---|
integer | ‘year’ ‘month’ ‘day’ ‘dep_time’ ‘sched_dep_time’ ‘arr_time’ ‘sched_arr_time’ ‘flight’ |
numeric | ‘dep_delay’ ‘arr_delay’ ‘air_time’ ‘distance’ ‘hour’ ‘minute’ |
character | ‘carrier’ ‘tailnum’ ‘origin’ ‘dest’ |
POSIXct | ‘time_hour’ |
POSIXt | ‘time_hour’ |
A column, as a vector, may belong to one of several classes
Other classes: factor for categorical variables
Columns dest, origin, and carrier could be coerced to factors
Should columns dest and origin be coerced to factors sharing the same levels? (see the sketch below)
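A minimal sketch (an assumption, not code from the slides): coerce carrier to a factor and give origin and dest a shared set of levels so that both columns use the same airport coding.

library(dplyr)
library(nycflights13)

# Shared levels: the union of all airport codes seen in origin and dest
airport_levels <- sort(union(flights$origin, flights$dest))

flights_fct <- flights |>
  mutate(
    origin  = factor(origin, levels = airport_levels),
    dest    = factor(dest, levels = airport_levels),
    carrier = factor(carrier)
  )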
nycflights13
cols(
year = col_integer(),
month = col_integer(),
day = col_integer(),
dep_time = col_integer(),
sched_dep_time = col_integer(),
dep_delay = col_double(),
arr_time = col_integer(),
sched_arr_time = col_integer(),
arr_delay = col_double(),
carrier = col_character(),
flight = col_integer(),
tailnum = col_character(),
origin = col_character(),
dest = col_character(),
air_time = col_double(),
distance = col_double(),
hour = col_double(),
minute = col_double(),
time_hour = col_datetime(format = "")
)
\(\approx\) table schema in relational databases
Column specifications are useful when loading dataframes from structured text files like .csv files
.csv files do not contain typing information
File loaders from package readr can be given hints about column classes using column specifications (a sketch follows)
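A hedged sketch of such a hint; the file name flights.csv is hypothetical, and only a few columns are specified (readr guesses the rest).

library(readr)

flights_raw <- read_csv(
  "flights.csv",                 # hypothetical file name
  col_types = cols(
    year      = col_integer(),
    carrier   = col_character(),
    time_hour = col_datetime(format = "")
  )
)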
dplyr
SQL stands for Structured Query Language
A query language developed during the 1970s at IBM, building on E. Codd's relational model
Geared towards exploitation of collections of relational tables
Less powerful but simpler to use than a programming language
dplyr is a principled, R-friendly implementation of SQL ideas (and other things)
At the core of SQL lies the idea of a table calculus called relational algebra
Convention: \(R\) is a table with columns \(A_1, \ldots, A_k\)
Projection (picking columns)
\(\pi(R, A_1, A_3)\)
Selection/Filtering (picking rows)
\(\sigma(R, {\text{condition}})\)
Join (multi-table operation)
\(\bowtie(R,S, {\text{condition}})\)
Any operation produces a table
The schema of the derived table depends on the operation (but does not depend on the content/value of the operands)
Table calculus relies on a small set of basic operations \(\pi, \sigma, \bowtie\)
Each operation has one or two table operands and produces a table
There is more to SQL than relational algebra
\(\pi(R, {A_1, A_3})\)
A projection \(\pi(\cdot, {A_1, A_3})\) is defined by a set of column names, say \(A_1, A_3\)
If \(R\) has columns with given names, the result is a table with names \(A_1, A_3\) and one row per row of \(R\)
A projection is parametrized by a list of column names
dplyr
Base R provides tools to perform relational algebra operations
But:
Base R does not provide a consistent API
The lack of a consistent API makes operation chaining tricky
dplyr verbs
Five basic verbs:
Pick observations/rows by their values (filter()) ≈ σ(…)
Pick variables by their names (select()) ≈ π(…)
Reorder the rows (arrange())
Create new variables with functions of existing variables (mutate())
Collapse many values down to a single summary (summarise())
And group_by() changes the scope of each function from operating on the entire dataset to operating on it group by group
All verbs work similarly:
The first argument is a data frame (table).
The subsequent arguments describe what to do with the data frame, using the variable/column names (without quotes)
The result is a new data frame (table)
dplyr::select() as a projection operator (π)
\(\pi(R, \underbrace{A_1, \ldots, A_3}_{\text{column names}})\) corresponds to select(R, A1, ..., A3)
or, equivalently, R |> select(A1, ..., A3)
|> is the pipe operator
x |> f(y, z) is translated to f(x, y, z) and then evaluated
dplyr::select()
Function select has a variable number of arguments
Function select allows picking columns by name (and much more)
Note that in the current environment, there are no objects called A1, A3
The consistent API allows us to use the pipe operator
Caution
select() functions also exist in other packages (e.g. MASS), so dplyr::select() may be needed to avoid a namespace clash
A1 | A2 | A3 | D |
---|---|---|---|
2 | q | 2021-10-21 | r |
4 | e | 2021-10-28 | q |
6 | a | 2021-11-04 | o |
8 | j | 2021-11-11 | g |
10 | d | 2021-11-18 | d |
E | F | G | D |
---|---|---|---|
3 | y | 2021-10-21 | o |
4 | e | 2021-10-22 | c |
6 | n | 2021-10-23 | i |
9 | t | 2021-10-24 | d |
10 | r | 2021-10-25 | e |
Projecting flights on origin and dest
# A tibble: 6 × 2
origin dest
<chr> <chr>
1 EWR IAH
2 LGA IAH
3 JFK MIA
4 JFK BQN
5 LGA ATL
6 EWR ORD
The piped form is a more readable equivalent of the nested call select(flights, origin, dest); a sketch follows
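A minimal sketch comparing the two forms (assuming dplyr and nycflights13 are attached); both return the same projection.

head(select(flights, origin, dest))         # nested call
flights |> select(origin, dest) |> head()   # piped equivalent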
A selection/filtering operation is defined by a condition that can be evaluated on the rows of any table with a suitable schema
\(\sigma(R, \text{condition})\) returns a table with the same schema as \(R\)
The resulting table contains the rows/tuples of \(R\) that satisfy \(\text{condition}\)
\(\sigma(R, \text{FALSE})\) returns an empty table with the same schema as \(R\)
Filtering flights based on origin and dest, then projecting on dest, time_hour, carrier
# A tibble: 6 × 3
dest time_hour carrier
<chr> <dttm> <chr>
1 LAX 2013-01-01 06:00:00 UA
2 ATL 2013-01-01 06:00:00 DL
3 LAX 2013-01-01 07:00:00 VX
4 LAX 2013-01-01 07:00:00 B6
5 LAX 2013-01-01 07:00:00 AA
6 ATL 2013-01-01 08:00:00 DL
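A sketch of such a pipeline; the exact origin/dest conditions used on the slide are not shown, so the values below are assumptions.

flights |>
  filter(origin == "JFK", dest %in% c("LAX", "ATL")) |>   # selection σ
  select(dest, time_hour, carrier) |>                     # projection π
  head()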
In SQL parlance (the WHERE clause):
filter(R, condition_1, condition_2) is meant to return the rows of R that satisfy condition_1 and condition_2
filter(R, condition_1 & condition_2) is an equivalent formulation
filter(R, condition_1 | condition_2) is meant to return the rows of R that satisfy condition_1 or condition_2 (possibly both)
filter(R, xor(condition_1, condition_2)) is meant to return the rows of R that satisfy either condition_1 or condition_2 (just one of them)
filter(R, ! condition_1) is meant to return the rows of R that do not satisfy condition_1
Numerical column dep_time contains many NA's (missing values)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 907 1401 1349 1744 2400 8255
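The summary above is consistent with summary(flights$dep_time); a small sketch for counting the missing values:

summary(flights$dep_time)
flights |> summarise(n_missing = sum(is.na(dep_time)))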
Missing values (NA and variants) should be handled with care
[1] NA
[1] TRUE
R uses three-valued logic
Generate complete truth tables for and, or, xor
x & y | TRUE | FALSE | NA |
---|---|---|---|
TRUE | TRUE | FALSE | NA |
FALSE | FALSE | FALSE | FALSE |
NA | NA | FALSE | NA |

x OR y | TRUE | FALSE | NA |
---|---|---|---|
TRUE | TRUE | TRUE | TRUE |
FALSE | TRUE | FALSE | NA |
NA | TRUE | NA | NA |

xor(x, y) | TRUE | FALSE | NA |
---|---|---|---|
TRUE | FALSE | TRUE | NA |
FALSE | TRUE | FALSE | NA |
NA | NA | NA | NA |
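A sketch of how such truth tables can be generated with outer():

v <- c(TRUE, FALSE, NA)
outer(v, v, `&`)    # AND
outer(v, v, `|`)    # OR
outer(v, v, xor)    # XOR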
slice(): choosing rows based on location
In base R, dataframe cells can be addressed by indices
flights[5000:5010, seq(1, 19, by=5)] returns rows 5000:5010 and columns 1, 6, 11, 16 from dataframe flights
This can be done in a (more verbose) dplyr way using slice() and select() (see the sketch below)
Combined with aggregation (group_by()), variants of slice_*() may be used to perform windowing operations
Useful variant slice_sample()
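A sketch of the slice()/select() equivalent mentioned above:

flights |>
  slice(5000:5010) |>
  select(1, 6, 11, 16)           # column positions seq(1, 19, by = 5)

flights |> slice_sample(n = 5)   # a random sample of 5 rows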
\(\bowtie(R,S, {\text{condition}})\)
stands for
join rows/tuples of \(R\) and rows/tuples of \(S\) that satisfy \(\text{condition}\)
nycflights13 tables
The nycflights13 package offers five related tables:
flights
weather (hourly weather conditions at different locations)
airports (airport full names, location, …)
planes (model, manufacturer, year, …)
airlines (full names)
This is an instance of a Star Schema
Fact tables record measurements for a specific event
Fact tables generally consist of numeric values, and foreign keys to dimensional data where descriptive information is kept
Dimension tables record information about entities involved in events recorded in fact tables
Rows: 26,115
Columns: 15
$ origin <chr> "EWR", "EWR", "EWR", "EWR", "…
$ year <int> 2013, 2013, 2013, 2013, 2013,…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ hour <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10…
$ temp <dbl> 39.02, 39.02, 39.02, 39.92, 3…
$ dewp <dbl> 26.06, 26.96, 28.04, 28.04, 2…
$ humid <dbl> 59.37, 61.63, 64.43, 62.21, 6…
$ wind_dir <dbl> 270, 250, 240, 250, 260, 240,…
$ wind_speed <dbl> 10.35702, 8.05546, 11.50780, …
$ wind_gust <dbl> NA, NA, NA, NA, NA, NA, NA, N…
$ precip <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ pressure <dbl> 1012.0, 1012.3, 1012.5, 1012.…
$ visib <dbl> 10, 10, 10, 10, 10, 10, 10, 1…
$ time_hour <dttm> 2013-01-01 01:00:00, 2013-01…
Joining flights and weather
We want to complement the information in flights using data from weather
Motivation: we would like to relate delays (arr_delay) and weather conditions
Can we explain (justify) delays using weather data?
Can we predict delays using weather data?
For each flight (row in flights):
year, month, day, hour (computed from time_hour) indicate the approximate time of departure
origin indicates the airport where the plane takes off
Each row of weather contains the corresponding hourly information
For each row of flights, we look for rows of weather with matching values in year, month, day, hour and origin
This is a NATURAL INNER JOIN between the tables
inner_join: natural join
Rows: 335,220
Columns: 14
$ year <int> 2013, 2013, 2013, 2013, 2…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ sched_dep_time <int> 515, 529, 540, 545, 600, …
$ arr_time <int> 830, 850, 923, 1004, 812,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12,…
$ flight <int> 1545, 1714, 1141, 725, 46…
$ origin <chr> "EWR", "LGA", "JFK", "JFK…
$ air_time <dbl> 227, 227, 160, 183, 116, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6…
$ time_hour <dttm> 2013-01-01 05:00:00, 201…
$ dewp <dbl> 28.04, 24.98, 26.96, 26.9…
$ wind_dir <dbl> 260, 250, 260, 260, 260, …
$ wind_gust <dbl> NA, 21.86482, NA, NA, 23.…
$ pressure <dbl> 1011.9, 1011.4, 1012.1, 1…
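A sketch of the natural join; without a by argument, inner_join() joins on all shared columns. The 14 columns displayed above suggest an additional column selection that is not shown here.

flights |>
  inner_join(weather) |>   # natural join on all common columns
  glimpse()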
The schema of the result is the union of the schemas of the operands
A tuple from flights matches a tuple from weather if the two tuples have the same values in the common columns:
[1] "year" "month" "day" "dep_time"
[5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
[9] "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance"
[17] "hour" "minute" "time_hour" "temp"
[21] "dewp" "humid" "wind_dir" "wind_speed"
[25] "wind_gust" "precip" "pressure" "visib"
Default behavior of inner_join: join on all columns shared by \(R\) and \(S\). Common columns have the same name in both schemas and are expected to have the same class
Manual definition: in many settings, we want to overrule the default behavior and specify manually which column from \(R\) should match which column from \(S\)
Joining flights and weather: recall that columns year, month, day, hour can be computed from time_hour
The by argument of inner_join:
by=c("A1", "A3", "A7")
row r
from R
and s
from S
match if r.A1 == s.A1
, r.A3 == s.A3
, r.A7 == s.A7
by=c("A1"="B", "A3"="C", "A7"="D")
row r
from R
and s
from S
match if r.A1 == s.B
, r.A3 == s.C
, r.A7 == s.D
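A sketch with explicit keys for flights and weather; origin and time_hour jointly identify a weather record.

flights |>
  inner_join(weather, by = c("origin", "time_hour"))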
suffix: if there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them
keep: should the join keys from both x and y be preserved in the output?
na_matches: should NA and NaN values match one another?
Different flavors of join can be used to join one table to columns from another, matching values with the rows that they correspond to
Each join retains a different combination of values from the tables
left_join(x, y, by = NULL, suffix = c(".x", ".y"), ...): join matching values from y to x; retain all rows of x, padding missing values from y with NA (a sketch follows)
semi_join …
anti_join …
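A small left_join sketch using the airlines dimension table (the column selection is an assumption, for display only):

flights |>
  left_join(airlines, by = "carrier") |>   # add full carrier names
  select(carrier, name, flight, origin, dest) |>
  head()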
inner_join
A1 | A2 | A3 | D |
---|---|---|---|
2 | q | 2021-10-21 | r |
4 | e | 2021-10-28 | q |
6 | a | 2021-11-04 | o |
8 | j | 2021-11-11 | g |
10 | d | 2021-11-18 | d |
E | F | G | D |
---|---|---|---|
3 | y | 2021-10-21 | o |
4 | e | 2021-10-22 | c |
6 | n | 2021-10-23 | i |
9 | t | 2021-10-24 | d |
10 | r | 2021-10-25 | e |
E | F | G | D.x | A2 | A3 | D.y |
---|---|---|---|---|---|---|
4 | e | 2021-10-22 | c | e | 2021-10-28 | q |
6 | n | 2021-10-23 | i | a | 2021-11-04 | o |
10 | r | 2021-10-25 | e | d | 2021-11-18 | d |
left_join
A1 | A2 | A3 | D |
---|---|---|---|
2 | q | 2021-10-21 | r |
4 | e | 2021-10-28 | q |
6 | a | 2021-11-04 | o |
8 | j | 2021-11-11 | g |
10 | d | 2021-11-18 | d |
E | F | G | D |
---|---|---|---|
3 | y | 2021-10-21 | o |
4 | e | 2021-10-22 | c |
6 | n | 2021-10-23 | i |
9 | t | 2021-10-24 | d |
10 | r | 2021-10-25 | e |
E | F | G | D.x | A2 | A3 | D.y |
---|---|---|---|---|---|---|
3 | y | 2021-10-21 | o | NA | NA | NA |
4 | e | 2021-10-22 | c | e | 2021-10-28 | q |
6 | n | 2021-10-23 | i | a | 2021-11-04 | o |
9 | t | 2021-10-24 | d | NA | NA | NA |
10 | r | 2021-10-25 | e | d | 2021-11-18 | d |
semi_join
anti_join
A1 | A2 | A3 | D |
---|---|---|---|
2 | q | 2021-10-21 | r |
4 | e | 2021-10-28 | q |
6 | a | 2021-11-04 | o |
8 | j | 2021-11-11 | g |
10 | d | 2021-11-18 | d |
E | F | G | D |
---|---|---|---|
3 | y | 2021-10-21 | o |
4 | e | 2021-10-22 | c |
6 | n | 2021-10-23 | i |
9 | t | 2021-10-24 | d |
10 | r | 2021-10-25 | e |
semi_join result:

E | F | G | D |
---|---|---|---|
4 | e | 2021-10-22 | c |
6 | n | 2021-10-23 | i |
10 | r | 2021-10-25 | e |
anti_join result:

E | F | G | D |
---|---|---|---|
3 | y | 2021-10-21 | o |
9 | t | 2021-10-24 | d |
In relational databases, joins are not restricted to natural joins
\[U \leftarrow R \bowtie_{\theta} S\]
reads as
\[\begin{array}{rl} T & \leftarrow R \times S\\ U & \leftarrow \sigma(T, \theta)\end{array}\]
where
\(R \times S\) is the cartesian product of \(R\) and \(S\)
\(\theta\) is a boolean expression that can be evaluated on any tuple of \(R \times S\)
Note: we can implement \(\theta\)/conditional joins by pipelining a cross product and a filtering
Caution: cross products are costly: \(R \times S\) has \(|R| \times |S|\) rows
Note: RDBMS use query planning, optimization, and indexing to circumvent the cross product bottleneck (when possible)
Tip: when we need to perform a \(\theta\)-join, it often pays to reformulate it as a natural (equi-)join, as done below for flights and weather
A conditional join between flights and weather
The natural join between flights and weather we implemented can be regarded as an ad hoc conditional join between normalized versions of weather and flights
Tables flights and weather are redundant: year, month, day, hour can be computed from time_hour
Assume flights and weather are trimmed so as to become non-redundant
The conditional join is then based on truncations of the time_hour variables
SELECT
*
FROM
flights AS f, weather AS w
WHERE
date_trunc('hour', f.time_hour) = date_trunc('hour', w.time_hour)
The redundancy in flights and weather allows us to transform a tricky conditional join into a simple natural join
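A hedged dplyr sketch of this transformation: truncate the timestamps on both sides (mirroring date_trunc in the SQL above; time_hour is already hourly, so floor_date() is kept only for illustration) and then perform an equi-join.

library(lubridate)

flights |>
  mutate(hour_slot = floor_date(time_hour, "hour")) |>
  inner_join(
    weather |> mutate(hour_slot = floor_date(time_hour, "hour")),
    by = c("origin", "hour_slot"),
    suffix = c(".f", ".w")
  )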
Creation of new columns may happen:
on the fly
when altering (enriching) the schema of a table
In databases, creation of new columns may be the result of a query or the result of altering a table schema with ALTER TABLE ADD COLUMN ...
In the tidyverse, we use the verbs mutate (from dplyr) or add_column (from tibble) to add columns to the input table
Arguments of mutate:
.data: the input data frame
new_col = expression:
new_col is the name of a new column
expression is evaluated using the columns of .data (vectorized over the rows), or it is a vector of length 1 (recycled)
.keep = "all" is the default behavior: it retains all columns from .data
breaks_delay <- with(flights,
  c(min(arr_delay, na.rm = TRUE),
    0, 30,
    max(arr_delay, na.rm = TRUE))
)
level_delay <- c("None",
                 "Moderate",
                 "Large")
flights |>
  mutate(large_delay = cut(
    arr_delay,
    breaks = breaks_delay,
    labels = level_delay,
    ordered_result = TRUE)) |>
  select(large_delay, arr_delay) |>
  sample_n(5)
# A tibble: 5 × 2
large_delay arr_delay
<ord> <dbl>
1 Large 219
2 Moderate 18
3 None -19
4 None -16
5 None -1
# A tibble: 5 × 4
large_delay arr_delay origin dest
<ord> <dbl> <fct> <fct>
1 None -44 LGA CVG
2 None -15 EWR DAY
3 Large 136 EWR DEN
4 None -9 EWR TPA
5 Moderate 14 LGA TPA
A (tidy) dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative)
Values are organised in two ways
Every value belongs to a variable and an observation
A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units
An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes
The principles of tidy data are tied to those of relational databases and Codd’s relational algebra
Codd’s Principles
dplyr functions expect and return tidy tables
In a tidy table
Each variable is a column
Each observation is a row
Every cell is a single value
In order to tell whether a table is tidy, we need to know what the population under investigation is, what the observations/individuals are, and which measurements are performed on each individual, …
Column headers are values, not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table.
A single observational unit is stored in multiple tables.
…
tidyr::...
pivot_wider and pivot_longer
separate and unite (see the sketch below)
Handling missing values with complete, fill, …
…
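A toy sketch of separate() and unite() (the data is made up, not from nycflights13):

library(tidyr)
library(tibble)

dates <- tibble(date = c("2013-01-01", "2013-01-02"))

dates |>
  separate(date, into = c("year", "month", "day"), sep = "-") |>
  unite("date", year, month, day, sep = "-")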
pivot_wider(
  data,
  id_cols = NULL,
  names_from = name,
  names_prefix = "",
  values_from = value,
  ...
)
(some optional arguments are omitted)
When reporting, we often use pivot_wider (explicitly or implicitly) to make results more readable, possibly to conform to a tradition
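The long table below is consistent with a per-carrier count; a hedged sketch of a query that could produce it (the column name count is read off the output):

flights |>
  group_by(carrier) |>
  summarise(count = n()) |>
  arrange(desc(count))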
# A tibble: 16 × 2
carrier count
<chr> <int>
1 UA 58665
2 B6 54635
3 EV 54173
4 DL 48110
5 AA 32729
6 MQ 26397
7 US 20536
8 9E 18460
9 WN 12275
10 VX 5162
11 FL 3260
12 AS 714
13 F9 685
14 YV 601
15 HA 342
16 OO 32
flights |>
  group_by(origin, wday(time_hour, abbr = T, label = T)) |>
  summarise(count = n(), .groups = "drop") |>
  rename(day_of_week = `wday(time_hour, abbr = T, label = T)`) |>
  pivot_wider(
    id_cols = "origin",
    names_from = "day_of_week",
    values_from = "count") |>
  kable(caption = "Departures per day")
origin | dim. | lun. | mar. | mer. | jeu. | ven. | sam. |
---|---|---|---|---|---|---|---|
EWR | 16425 | 18329 | 18243 | 18180 | 18169 | 18142 | 13347 |
JFK | 15966 | 16104 | 16017 | 15841 | 16087 | 16176 | 15088 |
LGA | 13966 | 16257 | 16162 | 16039 | 15963 | 15990 | 10285 |
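pivot_longer() undoes the widening; a sketch, where departures_wide is a hypothetical name for the wide table above:

library(tidyr)

departures_wide |>
  pivot_longer(
    cols = -origin,
    names_to = "day_of_week",
    values_to = "count"
  )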
slider: TODO
|>, %>% and other pipes
All dplyr functions take a table as the first argument
Rather than forcing the user to either save intermediate objects or nest function calls, dplyr supports piping: the %>% operator from magrittr (re-exported by dplyr) or, since R 4.1, the native |> operator
x |> f(y) turns into f(x, y)
The result from one step is piped into the next step
Use |> to rewrite multiple operations so that they read left-to-right/top-to-bottom
%>%
%>% is not tied to dplyr
%>% can be used with packages from the tidyverse
%>% can be used outside the tidyverse, that is, with functions which take a table (or something else) as a second, third or keyword argument
Use the pronoun . to denote the LHS of the pipe expression (a sketch follows)
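A classic sketch of the pronoun: the LHS is placed in a non-first argument (here the data argument of lm()).

library(magrittr)

flights %>% lm(arr_delay ~ dep_delay, data = .)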
|> (R version ≥ 4.1)
As of version 4.1 (2021), base R offers a pipe operator denoted by |>
magrittr offers several variants of %>%:
%T>%
%<>%
%$%
Base R has a pipe() function to manipulate connections (files, URLs, …)
Table calculus with dplyr
MA7BY020 – Analyse de Données – M1 MIDS – UParis Cité