Tables manipulation with dplyr

Analyse de Données Master MIDS

2021-12-21

Tables

Tables (examples)

  • Speadsheets (Excel)

  • Relational tables

  • Dataframes in datascience frameworks

    • : data.frame, tibble, …
    • : pandas.dataframe
    • spark: dataframe
    • Dask: dataframe
    • and many others

Tables (Why ?)

In Data Science, each framework comes with its own flavor(s) of table(s)

Tables from relational databases serve as inspiration

In legacy dataframes shape the life of statisticians and data scientists

The purpose of this session is

  • describe dataframes from an end-user viewpoint (we leave aside implementations)

  • presenting tools for

    • accessing information within dataframes (querying)
    • summarizing information (aggregation queries)
    • cleaning/cleaning dataframes (tidying)

Loading tables and packages

require("tidyverse")      # All we need is there
require("nycflights13")    # for flight data
require("gt")
require("kableExtra")
# 
data(flights)

About loaded packages

  • Metapackage tidyverse provides tools to create, query, tidy dataframes as well as tools to load data from various sources and save them in persistent storage

  • nycflights13 provides the dataframes we play with

  • gt for tayloring table displays

The flights table

head(flights) |>
  glimpse(width = 30) 
Rows: 6
Columns: 19
$ year           <int> 2013,…
$ month          <int> 1, 1,…
$ day            <int> 1, 1,…
$ dep_time       <int> 517, …
$ sched_dep_time <int> 515, …
$ dep_delay      <dbl> 2, 4,…
$ arr_time       <int> 830, …
$ sched_arr_time <int> 819, …
$ arr_delay      <dbl> 11, 2…
$ carrier        <chr> "UA",…
$ flight         <int> 1545,…
$ tailnum        <chr> "N142…
$ origin         <chr> "EWR"…
$ dest           <chr> "IAH"…
$ air_time       <dbl> 227, …
$ distance       <dbl> 1400,…
$ hour           <dbl> 5, 5,…
$ minute         <dbl> 15, 2…
$ time_hour      <dttm> 2013…
  • A dataframe is a two-ways (two-dimensional) table

  • head(df) displays the first 6 rows of its first argument

  • The vectors making a dataframe may have different types/classes (a dataframe is not a matrix)

  • Compare str(), glimpse(), head()

Table schema

A table is a list of columns

Each column has

  • name and
  • type (class in
glimpse(flights,   #<<
        width=50)
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, …
$ arr_time       <int> 830, 850, 923, 1004, 812,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12,…
$ carrier        <chr> "UA", "UA", "AA", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 46…
$ tailnum        <chr> "N14228", "N24211", "N619…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN…
$ air_time       <dbl> 227, 227, 160, 183, 116, …
$ distance       <dbl> 1400, 1416, 1089, 1576, 7…
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0,…
$ time_hour      <dttm> 2013-01-01 05:00:00, 201…

  • flights has 19 columns
  • Each column is a sequence (vector) of items with the same type/class
  • All columns have the same length
  • flights has 336776 rows
  • In parlance, a row is (often) called a tuple
  • In parlance, a column is (often) called a variable

Column types

class columns
integer ‘year’ ‘month’ ‘day’ ‘dep_time’ ‘sched_dep_time’ ‘arr_time’ ‘sched_arr_time’ ‘flight’
numeric ‘dep_delay’ ‘arr_delay’ ‘air_time’ ‘distance’ ‘hour’ ‘minute’
character ‘carrier’ ‘tailnum’ ‘origin’ ‘dest’
POSIXct ‘time_hour’
POSIXt ‘time_hour’

A column, as a vector, may be belong to different classes

Other classes: factor for categorical variables

Columns dest, origin carrier could be coerced as factors

Should columns dest and origin be coerced to the same factor?

nycflights13

Columns specification

as.col_spec(flights)
cols(
  year = col_integer(),
  month = col_integer(),
  day = col_integer(),
  dep_time = col_integer(),
  sched_dep_time = col_integer(),
  dep_delay = col_double(),
  arr_time = col_integer(),
  sched_arr_time = col_integer(),
  arr_delay = col_double(),
  carrier = col_character(),
  flight = col_integer(),
  tailnum = col_character(),
  origin = col_character(),
  dest = col_character(),
  air_time = col_double(),
  distance = col_double(),
  hour = col_double(),
  minute = col_double(),
  time_hour = col_datetime(format = "")
)

\(\approx\) table schema in relational databases

Column specifications are useful when loading dataframes from structured text files like .csv files

.csv files do not contain typing information

File loaders from package readr can be tipped about column classes using column specifications

SQL and Relational algebra with dplyr

  • SQL stands for structured/simple Query Language

  • A query language elaborated during the 1970’s at IBM by E. Codd

  • Geared towards exploitation of collections of relational tables

  • Less powerful but simpler to use than a programming language

  • dplyr is a principled -friendly implementation of SQL ideas (and other things)

At the core of SQL lies the idea of a table calculus called relational algebra

Relational algebra (basics)

Convention: \(R\) is a table with columns \(A_1, \ldots, A_k\)

Projection (picking columns)

\(\pi(R, A_1, A_3)\)

Selection/Filtering (picking rows)

\(\sigma(R, {\text{condition}})\)

Join (mulitple tables operation)

\(\bowtie(R,S, {\text{condition}})\)

Any operation produces a table

The schema of the derived table depends on the operation (but does not depend on the content/value of the operands)

Table calculus relies on a small set of basic operations \(\pi, \sigma, \bowtie\)

Each operation has one or two table operands and produce a table

There is more to SQL than relational algebra

Projection \(\pi\)

\(\pi(R, {A_1, A_3})\)

A projection \(\pi(\cdot, {A_1, A_3})\) is defined by a set of column names, say \(A_1, A_3\)

If \(R\) has columns with given names, the result is a table with names \(A_1, A_3\) and one row per row of \(R\)

A projection is parametrized by a list of column names

Package dplyr

Base provides tools to perform relational algebra operations

But:

  • Base does not provide a consistent API

  • The lack of a consistent API makes operation chaining tricky

dplyr verbs

Five basic verbs:

  • Pick observations/rows by their values (filter()) σ(…)

  • Pick variables by their names (select()) π(…)

  • Reorder the rows (arrange())

  • Create new variables with functions of existing variables (mutate())

  • Collapse many values down to a single summary (summarise())

And

  • group_by() changes the scope of each function from operating on the entire dataset to operating on it group-by-group

tidyverse

All verbs work similarly:

The first argument is a data frame (table).

The subsequent arguments describe what to do with the data frame, using the variable/column names (without quotes)

The result is a new data frame (table)

dplyr::select() as a projection operator (π)

\(\pi(R, \underbrace{A_1, \ldots, A_3}_{\text{column names}})\)

select(R, A1, A3) #<<

or, equivalently

R |> select(A1, A3) #<<

|> is the pipe operator

x |> f(y, z) is translated to f(x, y, z) and then evaluated

dplyr::select()

  • Function select has a variable number of arguments

  • Function select has a variable number of arguments

  • Function select allows to pick column by names (and much more)

  • Note that in the current environment, there are no objects called A1, A3

  • The consistent API allows to use the pipe operator

Caution

There is also a select() function in base R

Toy tables

spam <- set.seed(42)

R <-  tibble(A1=seq(2, 10, 2),
             A2=sample(letters, 5),
             A3=seq(from=date("2021-10-21"),
                    to=date("2021-11-20"),
                    by=7),
             D=sample(letters, 5))

S <- tibble(E=c(3,4,6,9, 10),
            F=sample(letters, 5),
            G=seq(from=date("2021-10-21"),
                   to=date("2021-10-21")+4, by=1),
            D=sample(letters,5)
          )
R
A1 A2 A3 D
2 q 2021-10-21 r
4 e 2021-10-28 q
6 a 2021-11-04 o
8 j 2021-11-11 g
10 d 2021-11-18 d
S
E F G D
3 y 2021-10-21 o
4 e 2021-10-22 c
6 n 2021-10-23 i
9 t 2021-10-24 d
10 r 2021-10-25 e

Projecting toy tables

R |> 
  dplyr::select(A2,D) |> 
  knitr::kable(caption="Projecting R")
Projecting R
A2 D
q r
e q
a o
j g
d d
R |> 
  dplyr::select(- where(is.character)) |> 
  knitr::kable(caption="Projecting R, all but character columns")
Projecting R, all but character columns
A1 A3
2 2021-10-21
4 2021-10-28
6 2021-11-04
8 2021-11-11
10 2021-11-18

Projecting flights on origin and dest

flights |>
  select(origin, dest) |>  #<<
  head()
# A tibble: 6 × 2
  origin dest 
  <chr>  <chr>
1 EWR    IAH  
2 LGA    IAH  
3 JFK    MIA  
4 JFK    BQN  
5 LGA    ATL  
6 EWR    ORD  

A more readable equivalent of

head(select(flights,
            origin,
            dest),
     10)
SELECT 
  origin, dest
FROM 
  flights
LIMIT 10;

\(\sigma(R, \text{condition})\)

  • A selection/filtering operation is defined by a condition that can be checked on the rows of tables with convenient schema

  • \(\sigma(R, \text{condition})\) returns a table with the same schema as \(R\)

  • The resulting table contains the rows/tuples of \(R\) that satisfy \(\text{condition}\)

  • \(\sigma(R, \text{FALSE})\) returns an empty table with the same schema as \(R\)

Chaining filtering and projecting

start <- date("2021-10-27")
end <- start + 21

R |>
#  filter(A2 > "n") |>  #<<
  filter(between(A3, start, end)) |>
  select(A1, A3) #<<
# A tibble: 3 × 2
     A1 A3        
  <dbl> <date>    
1     4 2021-10-28
2     6 2021-11-04
3     8 2021-11-11

Selecting flights based on origin and dest

and then projecting on dest, time_hour, carrier

flights |>
  filter(dest %in% c('ATL', 'LAX'), #<<
         origin == 'JFK') |>
  select(dest, time_hour, carrier) |> #<<
  head()
# A tibble: 6 × 3
  dest  time_hour           carrier
  <chr> <dttm>              <chr>  
1 LAX   2013-01-01 06:00:00 UA     
2 ATL   2013-01-01 06:00:00 DL     
3 LAX   2013-01-01 07:00:00 VX     
4 LAX   2013-01-01 07:00:00 B6     
5 LAX   2013-01-01 07:00:00 AA     
6 ATL   2013-01-01 08:00:00 DL     

In SQL ( parlance:

SELECT 
  dest, time_hour, carrier
FROM 
  flights
WHERE 
  dest IN ('ATL', 'LAX') AND
  origin = 'JFK'
LIMIT 6

Logical operations

  • filter(R, condition_1, condition_2) is meant to return the rows of R that satisfy condition_1 and condition_2

  • filter(R, condition_1 & condition_2) is an equivalent formulation

  • filter(R, condition_1 | condition_2) is meant to return the rows of R that satisfy condition_1 or condition_2 (possibly both)

  • filter(R, xor(condition_1,condition_2)) is meant to return the rows of R that satisfy either condition_1 or condition_2 (just one of them)

  • filter(R, ! condition_1) is meant to return the rows of R that do not satisfy condition_1

Missing values!

Numerical column dep_time contains many NA's (missing values)

# flights |> pull(dep_time) |> summary()
summary(flights$dep_time)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      1     907    1401    1349    1744    2400    8255 

Missing values (NA and variants) should be handled with care

NA & TRUE

[1] NA

NA | TRUE

[1] TRUE

Truth tables for three-valued logic

uses three-valued logic

Generate complete truth tables for and, or, xor

v <- c(TRUE, FALSE, NA) # truth values

list_tt <- map(c(`&`, `|`, xor),  #<<
               ~ outer(v, v, .x)) #<<

for (i in seq_along(list_tt)){
  colnames(list_tt[[i]]) <- v
  rownames(list_tt[[i]]) <- v
}

names(list_tt) <- c('& AND',
                    '| OR',
                    'XOR')

Truth tables

TRUE TRUE FALSE NA
FALSE FALSE FALSE FALSE
NA NA FALSE NA


TRUE TRUE TRUE TRUE
FALSE TRUE FALSE NA
NA TRUE NA NA


TRUE FALSE TRUE NA
FALSE TRUE FALSE NA
NA NA NA NA

slice(): choosing rows based on location

In base dataframe cells can be addressed by indices

flights[5000:5010,seq(1, 19, by=5)] returns rows 5000:5010 and columns 1, 6, 11 from dataframe flights

This can be done in a (verbose) dplyr way using slice() and select()

flights |>
  slice(5001:5005) |>  #<<
  select(seq(1, 19, by=5))

combined with aggregation (group_by()) variants of slice_ may be used to perform windowing operations

Useful variant slice_sample()

Joins : multi-table queries

Note

\(\bowtie(R,S, {\text{condition}})\)

stands for

join rows/tuples of \(R\) and rows/tuples of \(S\) that satisfy \(\text{condition}\)

nycflights tables

The nycflights13 package offers five related tables:

  • Fact tables:
    • flights
    • weather (hourly weather conditions at different locations)
  • Dimension tables:
    • airports (airports full names, location, …)
    • planes (model, manufacturer, year, …)
    • airlines (full names)

This is an instance of a Star Schema

About Star schemas

  • Fact tables record measurements for a specific event

  • Fact tables generally consist of numeric values, and foreign keys to dimensional data where descriptive information is kept

  • Dimension tables record informations about entities involved in events recorded in Fact tables

weather conditions

weather |>
  glimpse(width = 50)
Rows: 26,115
Columns: 15
$ origin     <chr> "EWR", "EWR", "EWR", "EWR", "…
$ year       <int> 2013, 2013, 2013, 2013, 2013,…
$ month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ day        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ hour       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10…
$ temp       <dbl> 39.02, 39.02, 39.02, 39.92, 3…
$ dewp       <dbl> 26.06, 26.96, 28.04, 28.04, 2…
$ humid      <dbl> 59.37, 61.63, 64.43, 62.21, 6…
$ wind_dir   <dbl> 270, 250, 240, 250, 260, 240,…
$ wind_speed <dbl> 10.35702, 8.05546, 11.50780, …
$ wind_gust  <dbl> NA, NA, NA, NA, NA, NA, NA, N…
$ precip     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ pressure   <dbl> 1012.0, 1012.3, 1012.5, 1012.…
$ visib      <dbl> 10, 10, 10, 10, 10, 10, 10, 1…
$ time_hour  <dttm> 2013-01-01 01:00:00, 2013-01…

Connecting flights and weather

We want to complement information in flights using data weather

Motivation: we would like to relate delays (arr_delay) and weather conditions

  • can we explain (justify) delays using weather data?

  • can we predict delays using weather data?

For each flight (row in flights)

  • year, month, day, hour (computed from time_hour) indicate the approaximate time of departure

  • origin indicates the airport where the plane takes off

Each row of weather contains corresponding information

for each row of flights we look for rows of weather with matching values in year, month, day, hour and origin

NATURAL INNER JOIN between the tables

inner_join: natural join

f_w <- flights |>
  inner_join(weather) #<<

f_w |> 
  select(seq(1, 
             ncol(f_w),
             by=2)) |> 
  glimpse(width=50)
Rows: 335,220
Columns: 14
$ year           <int> 2013, 2013, 2013, 2013, 2…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ sched_dep_time <int> 515, 529, 540, 545, 600, …
$ arr_time       <int> 830, 850, 923, 1004, 812,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12,…
$ flight         <int> 1545, 1714, 1141, 725, 46…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK…
$ air_time       <dbl> 227, 227, 160, 183, 116, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6…
$ time_hour      <dttm> 2013-01-01 05:00:00, 201…
$ dewp           <dbl> 28.04, 24.98, 26.96, 26.9…
$ wind_dir       <dbl> 260, 250, 260, 260, 260, …
$ wind_gust      <dbl> NA, 21.86482, NA, NA, 23.…
$ pressure       <dbl> 1011.9, 1011.4, 1012.1, 1…

Join schema

The schema of the result is the union of the schemas of the operands

A tuple from flights matches a tuple from weather if the tuple have the same values in the common columns:

 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"      "temp"          
[21] "dewp"           "humid"          "wind_dir"       "wind_speed"    
[25] "wind_gust"      "precip"         "pressure"       "visib"         

Which columns are used when joining tables \(R\) and \(S\)?

  • default behavior of inner_join: all columns shared by \(R\) and \(S\). Common columns have the same name in both schema. They are expected to have the same class

  • manual definition: in many settings, we want to overrule the default behavior. We specify manually which column from \(R\) should match which column from \(S\)

Natural join of flights and weather:

common_names <- base::intersect(names(weather),
                                names(flights))

setequal(
  inner_join(flights, weather),
  inner_join(flights,
             weather,
             by=common_names)
)
[1] TRUE

Are you surprised by the next chunk?

dtu  <- inner_join(flights,
           weather,
           by=c("year", "month", "day", "origin", "hour"))

dtv <- inner_join(flights,
           weather,
           by=c("origin", "time_hour"))

# setequal(dtu, dtv)

Recall that columns year, month day hour can be computed from time_hour

flights |>
  filter(year!=year(time_hour) |
         month!=month(time_hour) |
         day!=day(time_hour) |
         hour!=hour(time_hour)) |>
  nrow()
[1] 0

The two results do not have the same schema!

setdiff(colnames(dtv), colnames(dtu))
[1] "year.x"    "month.x"   "day.x"     "hour.x"    "time_hour" "year.y"   
[7] "month.y"   "day.y"     "hour.y"   
setdiff(colnames(dtu), colnames(dtv))
[1] "year"        "month"       "day"         "hour"        "time_hour.x"
[6] "time_hour.y"

Fixing

dtu  <- inner_join(flights,
           weather,
           by=c("year", "month", "day", "origin", "hour"),
           suffix= c("", ".y")) |>  #<<
           select(-ends_with(".y"))  #<<

dtv <- inner_join(flights,
           weather,
           by=c("origin", "time_hour"),
           suffix= c("", ".y")) |>  #<<
           select(-ends_with(".y"))  #<<

setequal(dtu, dtv)
[1] TRUE

About inner_join

inner_join(
  x, y,
  by = NULL,      #<<
  copy = FALSE,
  suffix = c(".x", ".y"), #<<
  ...,
  keep = FALSE,  #<<
  na_matches = "na")  #<<
  • by:

    • by=c("A1", "A3", "A7") row r from R and s from S match if r.A1 == s.A1, r.A3 == s.A3, r.A7 == s.A7
    • by=c("A1"="B", "A3"="C", "A7"="D") row r from R and s from S match if r.A1 == s.B, r.A3 == s.C, r.A7 == s.D
  • suffix: If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them.

  • keep: Should the join keys from both x and y be preserved in the output?

  • na_matches: Should NA and NaN values match one another?

Join flavors

Different flavors of join can be used to join one table to columns from another, matching values with the rows that they correspond to

Each join retains a different combination of values from the tables

  • left_join(x, y, by = NULL, suffix = c(".x", ".y"), ...) Join matching values from y to x. Retain all rows of x padding missing values from y by NA

  • semi_join

  • anti_join

Toy examples : inner_join

R
A1 A2 A3 D
2 q 2021-10-21 r
4 e 2021-10-28 q
6 a 2021-11-04 o
8 j 2021-11-11 g
10 d 2021-11-18 d
S
E F G D
3 y 2021-10-21 o
4 e 2021-10-22 c
6 n 2021-10-23 i
9 t 2021-10-24 d
10 r 2021-10-25 e
inner_join(S, R, by=c(“E”=“A1”))
E F G D.x A2 A3 D.y
4 e 2021-10-22 c e 2021-10-28 q
6 n 2021-10-23 i a 2021-11-04 o
10 r 2021-10-25 e d 2021-11-18 d

Toy examples : left_join

R
A1 A2 A3 D
2 q 2021-10-21 r
4 e 2021-10-28 q
6 a 2021-11-04 o
8 j 2021-11-11 g
10 d 2021-11-18 d
S
E F G D
3 y 2021-10-21 o
4 e 2021-10-22 c
6 n 2021-10-23 i
9 t 2021-10-24 d
10 r 2021-10-25 e
left_join(S, R, by=c(“E”=“A1”))
E F G D.x A2 A3 D.y
3 y 2021-10-21 o NA NA NA
4 e 2021-10-22 c e 2021-10-28 q
6 n 2021-10-23 i a 2021-11-04 o
9 t 2021-10-24 d NA NA NA
10 r 2021-10-25 e d 2021-11-18 d

Toy examples : semi_join anti_join

R
A1 A2 A3 D
2 q 2021-10-21 r
4 e 2021-10-28 q
6 a 2021-11-04 o
8 j 2021-11-11 g
10 d 2021-11-18 d
S
E F G D
3 y 2021-10-21 o
4 e 2021-10-22 c
6 n 2021-10-23 i
9 t 2021-10-24 d
10 r 2021-10-25 e
semi_join(S, R, by=c(“E”=“A1”))
E F G D
4 e 2021-10-22 c
6 n 2021-10-23 i
10 r 2021-10-25 e



anti_join(S, R, by=c(“E”=“A1”))
E F G D
3 y 2021-10-21 o
9 t 2021-10-24 d

Conditional/ \(\theta\) -join

In relational databases, joins are not restricted to natural joins

\[U \leftarrow R \bowtie_{\theta} S\]

reads as

\[\begin{array}{rl} T & \leftarrow R \times S\\ U & \leftarrow \sigma(T, \theta)\end{array}\]

where

  • \(R \times S\) is the cartesian product of \(R\) and \(S\)

  • \(\theta\) is a boolean expression that can be evaluated on any tuple of \(R \times S\)

Do we need conditional/ \(\theta\) -joins?

Note

: We can implement \(\theta\)/conditional-joins by pipelining a cross product and a filtering

Caution

: Cross products are costly:

  • \(\#\text{rows}(R \times S) = \#\text{rows}(R) \times \#\text{rows}(S)\)
  • \(\#\text{cols}(R \times S) = \#\text{cols}(R) + \#\text{cols}(S)\)

Do we need conditional/ \(\theta\) -joins?

Note

: RDBMS use query planning and optimization, indexing to circumvent the cross product bottleneck (when possible)

Tip

: if we need to perform a \(\theta\)-join

  • outsource it to a RDBMS, or
  • design an ad hoc pipeline

A conditional join between flights and weather

  • The natural join between flights and weather we implemented can be regarded as an ad hoc conditional join between normalized versions of weather and flights

  • Table flights and weather are redundant: year, month, day, hour can be computed from time_hour

  • Assume flights and weather are trimmed so as to become irredundant

  • The conditional join is then based on truncations of variables time_hour

SELECT 
  *
FROM 
  flights AS f, weather AS w
WHERE 
  date_trunc('hour', f.time_hour) = date_trunc('hour', w.time_hour)
  • Adding redundant columns to flights and weather allows us to transform a tricky conditional join into a simple natural join

Creating new columns

Creation of new columns may happen

  • on the fly

  • when altering (enriching) the schema of a table

In databases, creation of new columns may be the result of a query or be the result of altering a table schema with ALTER TABLE ADD COLUMN ...

In tidyverse() we use verbs mutate or add_column to add columns to the input table

mutate

mutate(   #<<
  .data,
  new_col= expression, #<<
  ...,   #<<
  .keep = c("all", "used", "unused", "none"),
  .before = NULL,
  .after = NULL
)

.data: the input data frame

new_col= expression:

  • new_col is the name of a new column

  • expression is evaluated on each row of .data or it is a vector of length 1

  • all is the default behavior, retains all columns from .data

Creating a categorical column to spot large delays

breaks_delay <- with(flights,
  c(min(arr_delay, na.rm=TRUE),
    0, 30,
    max(arr_delay, na.rm=TRUE))
)

level_delay <- c("None",
                 "Moderate",
                 "Large")

flights |>
  mutate(large_delay = cut(
    arr_delay,  #<<
    breaks=breaks_delay, #<<
    labels=level_delay,  #<<
    ordered_result=TRUE)) |>   #<<
  select(large_delay, arr_delay) |>
  sample_n(5)
# A tibble: 5 × 2
  large_delay arr_delay
  <ord>           <dbl>
1 Large             219
2 Moderate           18
3 None              -19
4 None              -16
5 None               -1

flights |>
  mutate(foo = if_else(arr_time > sched_arr_time,        #<<
                              arr_time - sched_arr_time,
                              0L,
                              missing = NA_integer_)) |>
  group_by( (foo >0) & abs(foo - arr_delay)  > 100) |>
  summarise(N=n())
# A tibble: 3 × 2
  `(foo > 0) & abs(foo - arr_delay) > 100`      N
  <lgl>                                     <int>
1 FALSE                                    322281
2 TRUE                                       5157
3 NA                                         9338

Changing the class of a column

flights |>
  mutate(large_delay = cut(arr_delay,  #<<
    breaks=breaks_delay,
    labels=level_delay,
    ordered_result=TRUE),
    origin = as.factor(origin), #<<
    dest = as.factor(dest)    #<<
  ) |>
  select(large_delay,
    arr_delay,
    origin,
    dest) |>
  sample_n(5)
# A tibble: 5 × 4
  large_delay arr_delay origin dest 
  <ord>           <dbl> <fct>  <fct>
1 None              -44 LGA    CVG  
2 None              -15 EWR    DAY  
3 Large             136 EWR    DEN  
4 None               -9 EWR    TPA  
5 Moderate           14 LGA    TPA  

Tidy tables

Tidying tables is part of data cleaning

A (tidy) dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative)

Values are organised in two ways

Every value belongs to a variable and an observation

A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units

An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes

The principles of tidy data are tied to those of relational databases and Codd’s relational algebra

Codd’s Principles

  1. Information is represented logically in tables
  2. Data must be logically accessible by table, primary key, and column.
  3. Null values must be uniformly treated as “missing information,” not as empty strings, blanks, or zeros.
  4. Metadata (data about the database) must be stored in the database just as regular data is
  5. A single language must be able to define data, views, integrity constraints, authorization, transactions, and data manipulation
  6. Views must show the updates of their base tables and vice versa
  7. A single operation must be available to do each of the following operations: retrieve data, insert data, update data, or delete data
  8. Batch and end-user operations are logically separate from physical storage and access methods
  9. Batch and end-user operations can change the database schema without having to recreate it or the applications built upon it
  10. Integrity constraints must be available and stored in the metadata, not in an application program
  11. The data manipulation language of the relational system should not care where or how the physical data is distributed and should not require alteration if the physical data is centralized or distributed
  12. Any row processing done in the system must obey the same integrity rules and constraints that set-processing operations do

dplyr functions expect and return tidy tables

In a tidy table

  • Each variable is a column

  • Each observation is a row

  • Every cell is a single value

In order to tell whether a table is tidy, we need to know what is the population under investigation, what are the observations/individuals, which measures are performed on each individual, …

Untidy data

Column headers are values, not variable names.

Multiple variables are stored in one column.

Variables are stored in both rows and columns.

Multiple types of observational units are stored in the same table.

A single observational unit is stored in multiple tables.

Functions from tidyr::...

  • pivot_wider and pivot_longer

  • separate and unite

  • Handling missing values with complete, fill, …

tidyr website

Pivot longer

pivot_longer() is commonly needed to tidy wild-caught datasets as they often optimise for ease of data entry or ease of comparison rather than ease of analysis.

messy |> pivot_longer(
  cols=c(-row),  #<<
  names_to = "name",
  values_to = "value",
)  |> kable()

Pivot wider

pivot_wider(  #<<
  data,
  id_cols = NULL, #<<
  names_from = name, #<<
  names_prefix = "",
  values_from = value, #<<
  ...
)

some optional arguments are missing

When reporting, we often use pivot_wider (explicitely or implicitely) to make results more readable, possibly to conform to a tradition

Aggregations

How many flights per carrier?

flights |>
  group_by(carrier) |>  #<<
  summarise(count=n()) |>  #<<
  arrange(desc(count))
# A tibble: 16 × 2
   carrier count
   <chr>   <int>
 1 UA      58665
 2 B6      54635
 3 EV      54173
 4 DL      48110
 5 AA      32729
 6 MQ      26397
 7 US      20536
 8 9E      18460
 9 WN      12275
10 VX       5162
11 FL       3260
12 AS        714
13 F9        685
14 YV        601
15 HA        342
16 OO         32
--| eval: false
SELECT 
  carrier, COUNT(*) AS n
FROM 
  flights
GROUP BY 
  carrier
ORDER BY 
  n DESCENDING

How many flights per day of week per departure airport?

flights |>
  group_by(origin,  wday(time_hour, abbr=T, label=T)) |>  #<<
  summarise(count=n(), .groups="drop") |>       #<<
  rename(day_of_week=`wday(time_hour, abbr = T, label = T)`) |>
  pivot_wider(  #<<
    id_cols="origin",   #<<
    names_from="day_of_week", #<<
    values_from="count") |>  #<<
  kable(caption="Departures per day")
Departures per day
origin dim. lun. mar. mer. jeu. ven. sam.
EWR 16425 18329 18243 18180 18169 18142 13347
JFK 15966 16104 16017 15841 16087 16176 15088
LGA 13966 16257 16162 16039 15963 15990 10285

Window queries

Window queries

TODO

Partitionning and Ordering

TODO

Sliding windows and package slider

TODO

Pipelines/chaining operations

|>, %>% and other pipes

  • All dplyr functions take a table as the first argument

  • Rather than forcing the user to either save intermediate objects or nest functions, dplyr provides the |> operator from magrittr

  • x |> f(y) turns into f(x, y)

  • The result from one step is piped into the next step

  • Use |> to rewrite multiple operations that you can read left-to-right/top-to-bottom

g(f(x, y), z)

x |>
  f(y) |>
  g(z)

Magrittr %>%

  • %>% is not tied to dplyr
  • %>% can be used with packages from tidyverse
  • %>% can be used outside tidyverse that is with functions which take a table (or something else) as a second, third or keyword argument

Use pronoun . to denote the LHS of the pipe expression

Second argument of g has the same type as the result of f

g(z, f(x, y))

x %>%
  f(y) %>%
  g(z, .)   #<<

x %>% f(y) is a shorthand for x %>% f(., y)

Standard pipe |> (version > 4.)

As of version 4.1 (2021), base offers a pipe operator denoted by |>

x |> f(y) turns into f(x, y)

g(f(x, y), z)

x |>
  f(y) |>
  g(z)

the standard pipe |> has no pronoun/placeholder to denote the LHS of the pipe expression

The roundabout consists in using another new construct \(x)

g(z, w)

x |>
  (\(x) g(z, w=x))()
"une" |>
  (\(x) str_c("ceci n'est pas", x, sep=" "))() |>
  str_c("pipe", sep=" ") |>
  cat()
ceci n'est pas une pipe

Other pipes

Magrittr offers several variants of |>

  • Tee operator %T>%
  • Assignement pipe %<>%
  • Exposition operator %$%

See pipes for beginners

Base has a pipe() function to manipulate connections (Files, URLs, …)

References

The End

Table calculus with dplyr