Technologies Big Data Master MIDS/MFA/LOGOIS
2025-01-17
Spark SQL
is a library included in Spark
since version 1.3
Spark Dataframes
was introduced with version
It provides an easier interface to process tabular data
Instead of RDD
s, we deal with DataFrame
s
Since Spark
1.6, there is also the concept of Dataset
s, but only for Scala
and Java
SparkContext
and SparkSession
Before Spark 2
, there was only SparkContext
and SQLContext
All core functionality was accessed with SparkContext
All SQL
functionality needed the SQLContext
, which can be created from an SparkContext
With Spark 2
came the SparkSession
class
SparkSession
is the .stress[global entry-point] for everything Spark
-related
SparkContext
and SparkSession
Before Spark 2
DataFrame
The main entity of Spark SQL
is the DataFrame
A DataFrame is actually an RDD
of Row
s with a schema
A schema gives the names of the columns and their types
Row
is a class representing a row of the DataFrame
.
It can be used almost as a python
list
, with its size equal to the number of columns in the schema.
DataFrame
DataFrame
You can access the underlying RDD
object using .rdd
We can use the method createDataFrame
from the SparkSession instance
Can be used to create a Spark
DataFrame
from:
pandas.DataFrame
objectFull documentation can be found in the [API docs]
A DataFrame
always contains a schema
The schema defines the column names and types
In all previous examples, the schema was inferred
The schema of a DataFrame
is represented by the class types.StructType
[API doc]
When creating a DataFrame
, the schema can be either inferred or defined by the user
Spark SQL
StringType
IntegerType
LongType
FloatType
DoubleType
BooleanType
DateType
TimestampType
...
The full list of types can be found in [API doc]
Data is usually read from external sources (move the algorithms, not the data)
Spark SQL
provides connectors to read from many different sources:
Text files (CSV
, JSON
)
Distributed tabular files (Parquet
, ORC
)
In-memory data sources (Apache Arrow
)
General relational Databases (via JDBC
)
Third-party connectors to connect to many other databases
And you can create your own connector for Spark
(in Scala
)
Spark supports different file systems to look at the data:
Local files: file://path/to/file
or just path/to/file
HDFS
(Hadoop Distributed FileSystem): hdfs://path/to/file
Amazon S3
: s3://path/to/file
CSV
fileCSV
fileMain options
Some important options of the CSV
reader are listed here:
Option | Description |
---|---|
sep |
The separator character |
header |
If “true”, the first line contains the column names |
inferSchema |
If “true”, the column types will be guessed from the contents |
dateFormat |
A string representing the format of the date columns |
The full list of options can be found in the API Docs
## JSON file
df = spark.read.json("/path/to/file.json")
df = spark.read.format("json").load("/path/to/file.json")
We can use JDBC
drivers (Java) to read from relational databases
Examples of databases: Oracle
, PostgreSQL
, MySQL
, etc.
The java
driver file must be uploaded to the cluster before trying to access
This operation can be very heavy. When available, specific connectors should be used
Specific connectors are often provided by third-party libraries
or
df = spark.read.jdbc(
url="jdbc:postgresql:dbserver",
table="schema.tablename"
properties={
"user": "username",
"password": "p4ssw0rd"
}
)
HiveQL
Hive
(Hadoop
InteractiVE)
Apache
projectHiveQL
: SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
Spark SQL
is designed to be compatible with ANSI SQL queries
Spark SQL
allows SQL
-like queries to be evaluated on Spark DataFrame
s (and on many other tables)
Spark DataFrames
have to be tagged as temporary views
Spark SQL
Queries can be submitted using spark.sql()
Method sql
for class SparkSession
provides access to SQLContext
SQL
queries form an expresive feature, it’s not the best way to code a complex logic
The Spark dataframe API offers a developper-friendly API for implementing
GROUP BY
Compare the Spark Dataframe
API with:
dplyr
, dtplyr
, dbplyr
in R
Tidyverse
Pandas
Chaining and/or piping enable modular query construction
Operation | Description |
---|---|
select |
Chooses columns from the table \(\pi\) |
selectExpr |
Chooses columns and expressions from table \(\pi\) |
where |
Filters rows based on a boolean rule \(\sigma\) |
limit |
Limits the number of rows LIMIT ...
|
orderBy |
Sorts the DataFrame based on one or more columns ORDER BY ...
|
alias |
Changes the name of a column AS ...
|
cast |
Changes the type of a column |
withColumn |
Adds a new column |
SELECT
SELECT
(continued)The argument of select()
is *cols
where cols
can be built from column names (strings), column expressions like df.age + 10
, lists
selectExpr
WHERE
LIMIT
ORDER BY
ALIAS
(name change)CAST
(type change)The full list of operations that can be applied to a DataFrame
can be found in the [DataFrame doc]
The list of operations on columns can be found in the [Column docs]
Often, we need to make many transformations using one or more functions
Spark SQL
has a package called functions
with many functions available for that
Some of those functions are only for aggregations
Examples: avg
, sum
, etc. We will cover them later
Some others are for column transformation or operations
Examples:
substr
, concat
, … (string and regex manipulation)datediff
, … (timestamp and duration)floor
, … (numerics)The full list is, once again, in the [API docs]
To use these functions, we first need to import them:
Note: the “as fn
” part is important to avoid confusion with native Python
functions such as “sum”
from pyspark.sql import functions as fn
columns = ["brand", "cost"]
df = spark.createDataFrame([
("garnier", 3.49),
("elseve", 2.71)
], columns)
round_cost = fn.round(df.cost, 1)
floor_cost = fn.floor(df.cost)
ceil_cost = fn.ceil(df.cost)
df.withColumn('round', round_cost)\
.withColumn('floor', floor_cost)\
.withColumn('ceil', ceil_cost)\
.show()
from pyspark.sql import functions as fn
columns = ["first_name", "last_name"]
df = spark.createDataFrame([
("John", "Doe"),
("Mary", "Jane")
],
columns
)
last_name_initial = fn.substring(df.last_name, 0, 1)
name = fn.concat_ws(" ", df.first_name, last_name_initial)
df.withColumn("name", name).show()
from datetime import date
from pyspark.sql import functions as fn
df = spark.createDataFrame([
(date(2015, 1, 1), date(2015, 1, 15)),
(date(2015, 2, 21), date(2015, 3, 8)),
], ["start_date", "end_date"]
)
days_between = fn.datediff(df.end_date, df.start_date)
start_month = fn.month(df.start_date)
df.withColumn('days_between', days_between)\
.withColumn('start_month', start_month)\
.show()
In the functions
package is a special function called when
This function is used to create a new column which value depends on the value of other columns
otherwise
is used to match “the rest”
Combination between conditions can be done using "&"
for “and” and "|"
for “or”
df = spark.createDataFrame([
("John", 21, "male"),
("Jane", 25, "female"),
("Albert", 46, "male"),
("Brad", 49, "super-hero")
], ["name", "age", "gender"])
supervisor = fn.when(df.gender == 'male', 'Mr. Smith')\
.when(df.gender == 'female', 'Miss Jones')\
.otherwise('NA')
df.withColumn("supervisor", supervisor).show()
Compare functions defined in pyspark.sql.functions
with functions specified in ANSI SQL and defined in popular RDBMs
Section on Functions and Operators
In RDBMs functions serve many purposes
When you need a transformation that is not available in the functions
package, you can create a User Defined Function (UDF)
Warning: the performance of this can be very very low
So, it should be used only when you are sure the operation cannot be done with available functions
To create an UDF, use functions.udf
, passing a lambda or a named functions
It is similar to the map
operation of RDDs
from pyspark.sql import functions as fn
from pyspark.sql.types import StringType
df = spark.createDataFrame([(1, 3), (4, 2)], ["first", "second"])
def my_func(col_1, col_2):
if (col_1 > col_2):
return "{} is bigger than {}".format(col_1, col_2)
else:
return "{} is bigger than {}".format(col_2, col_1)
my_udf = fn.udf(my_func, StringType())
df.withColumn("udf", my_udf(df['first'], df['second'])).show()
Spark SQL
supports joins between two DataFrame
As in ANSI SQL
, a join rule must be defined
The rule can either be a set of join keys (equi-join), or a conditional rule (\(\theta\)-join)
Join with conditional rules (\(\theta\)-joins) in Spark
can be very heavy
Several types of joins are available, default is inner
Syntax for \(\texttt{left_df} \bowtie_{\texttt{cols}} \texttt{right_df}\) is simple:
cols
contains a column name or a list of column namesjoin_type
is the type of joinfrom datetime import date
products = spark.createDataFrame([
('1', 'mouse', 'microsoft', 39.99),
('2', 'keyboard', 'logitech', 59.99),
], ['prod_id', 'prod_cat', 'prod_brand', 'prod_value'])
purchases = spark.createDataFrame([
(date(2017, 11, 1), 2, '1'),
(date(2017, 11, 2), 1, '1'),
(date(2017, 11, 5), 1, '2'),
], ['date', 'quantity', 'prod_id'])
# The default join type is the "INNER" join
purchases.join(products, 'prod_id').show()
Spark removes the duplicated column in the DataFrame
it outputs after a join operation
When joining using columns with nulls, Spark
just skips those
>>> df1.show() >>> df2.show()
+----+-----+ +----+-----+
| id| name| | id| dept|
+----+-----+ +----+-----+
| 123|name1| |null|sales|
| 456|name3| | 223|Legal|
|null|name2| | 456| IT|
+----+-----+ +----+-----+
>>> df1.join(df2, "id").show
+---+-----+-----+
| id| name| dept|
+---+-----+-----+
|123|name1|sales|
|456|name3| IT|
+---+-----+-----+
SQL Join Type | In Spark (synonyms) | Description |
---|---|---|
INNER |
"inner" |
Data from left and right matching both ways (intersection) |
FULL OUTER |
"outer" , "full" , "fullouter"
|
All rows from left and right with extra data if present (union) |
LEFT OUTER |
"leftouter" , "left"
|
Rows from left with extra data from right if present |
RIGHT OUTER |
"rightouter" , "right"
|
Rows from right with extra data from left if present |
LEFT SEMI |
"leftsemi" |
Data from left with a match with right |
LEFT ANTI |
"leftanti" |
Data from left with no match with right |
CROSS |
"cross" |
Cartesian product of left and right (never used) |
>>> inner = df_left.join(df_right, "id", "inner")
df_left df_right
+---+-----+ +---+-----+
| id|value| | id|value|
+---+-----+ +---+-----+
| 1| A1| | 3| A3|
| 2| A2| | 4| A4_1|
| 3| A3| | 4| A4|
| 4| A4| | 5| A5|
+---+-----+ | 6| A6|
+---+-----+
inner
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 3| A3| A3|
| 4| A4| A4|
| 4| A4| A4_1|
+---+-----+-----+
>>> outer = df_left.join(df_right, "id", "outer")
df_left df_right
+---+-----+ +---+-----+
| id|value| | id|value|
+---+-----+ +---+-----+
| 1| A1| | 3| A3|
| 2| A2| | 4| A4_1|
| 3| A3| | 4| A4|
| 4| A4| | 5| A5|
+---+-----+ | 6| A6|
+---+-----+
outer
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 1| A1| null|
| 2| A2| null|
| 3| A3| A3|
| 4| A4| A4|
| 4| A4| A4_1|
| 5| null| A5|
| 6| null| A6|
+---+-----+-----+
>>> left = df_left.join(df_right, "id", "left")
df_left df_right
+---+-----+ +---+-----+
| id|value| | id|value|
+---+-----+ +---+-----+
| 1| A1| | 3| A3|
| 2| A2| | 4| A4_1|
| 3| A3| | 4| A4|
| 4| A4| | 5| A5|
+---+-----+ | 6| A6|
+---+-----+
left
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 1| A1| null|
| 2| A2| null|
| 3| A3| A3|
| 4| A4| A4|
| 4| A4| A4_1|
+---+-----+-----+
>>> right = df_left.join(df_right, "id", "right")
df_left df_right
+---+-----+ +---+-----+
| id|value| | id|value|
+---+-----+ +---+-----+
| 1| A1| | 3| A3|
| 2| A2| | 4| A4_1|
| 3| A3| | 4| A4|
| 4| A4| | 5| A5|
+---+-----+ | 6| A6|
+---+-----+
right
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 3| A3| A3|
| 4| A4| A4|
| 4| A4| A4_1|
| 5| null| A5|
| 6| null| A6|
+---+-----+-----+
>>> left_semi = df_left.join(df_right, "id", "leftsemi")
df_left df_right
+---+-----+ +---+-----+
| id|value| | id|value|
+---+-----+ +---+-----+
| 1| A1| | 3| A3|
| 2| A2| | 4| A4_1|
| 3| A3| | 4| A4|
| 4| A4| | 5| A5|
+---+-----+ | 6| A6|
+---+-----+
left_semi
+---+-----+
| id|value|
+---+-----+
| 3| A3|
| 4| A4|
+---+-----+
>>> left_anti = df_left.join(df_right, "id", "leftanti")
df_left df_right
+---+-----+ +---+-----+
| id|value| | id|value|
+---+-----+ +---+-----+
| 1| A1| | 3| A3|
| 2| A2| | 4| A4_1|
| 3| A3| | 4| A4|
| 4| A4| | 5| A5|
+---+-----+ | 6| A6|
+---+-----+
left_anti
+---+-----+
| id|value|
+---+-----+
| 1| A1|
| 2| A2|
+---+-----+
Node-to-node communication strategy
Per node computation strategy
Spark approaches cluster communication in two different ways during joins.
It either incurs a shuffle join, which results in an all-to-all communication or a broadcast join.
The core foundation of our simplified view of joins is that in Spark you will have either a big table or a small table.
When you join a big table to another big table, you end up with a shuffle join
When you join a big table to another big table, you end up with a shuffle join
When you join a big table to a small table, you end up with a broadcast join
Maybe the most used operations in SQL
and Spark SQL
Similar to SQL
, we use "group by"
to perform aggregations
We usually can call the aggregation function just after groupBy
Namely, we use groupBy().agg()
Many aggregation functions in pyspark.sql.functions
Some examples:
Numerical: fn.avg
, fn.sum
, fn.min
, fn.max
, etc.
General: fn.first
, fn.last
, fn.count
, fn.countDistinct
, etc.
from pyspark.sql import functions as fn
products = spark.createDataFrame([
('1', 'mouse', 'microsoft', 39.99),
('2', 'mouse', 'microsoft', 59.99),
('3', 'keyboard', 'microsoft', 59.99),
('4', 'keyboard', 'logitech', 59.99),
('5', 'mouse', 'logitech', 29.99),
], ['prod_id', 'prod_cat', 'prod_brand', 'prod_value'])
products.groupBy('prod_cat').avg('prod_value').show()
# Or
products.groupBy('prod_cat').agg(fn.avg('prod_value')).show()
A very, very powerful feature
They allow to solve complex problems
ANSI SQL2003 allows for a window_clause
in aggregate function calls, the addition of which makes those functions into window functions
A good article about this feature is [here]
See also :
https://www.postgresql.org/docs/current/tutorial-window.html
It’s similar to aggregations, but the number of rows doesn’t change
Instead, new columns are created, and the aggregated values are duplicated for values of the same “group”
There are
min
, max
, avg
, sum
andlag
, lead
, rank
from pyspark.sql import Window
from pyspark.sql import functions as fn
# First, we create the Window definition
window = Window.partitionBy('prod_brand')
# Then, we can use "over" to aggregate on this window
avg = fn.avg('prod_value').over(window)
# Finally, we can it as a classical column
products.withColumn('avg_brand_value', fn.round(avg, 2)).show()
from pyspark.sql import Window
from pyspark.sql import functions as fn
# Multiple windows can be defined
window1 = Window.partitionBy('prod_brand')
window2 = Window.partitionBy('prod_cat')
avg_brand = fn.avg('prod_value').over(window1)
avg_cat = fn.avg('prod_value').over(window2)
products \
.withColumn('avg_by_brand', fn.round(avg_brand, 2)) \
.withColumn('avg_by_cat', fn.round(avg_cat, 2)) \
.show()
lag
and lead
are special functions used over an ordered window
They are used to take the “previous” and “next” value within the window
Very useful in datasets with a date column for instance
purchases = spark.createDataFrame([
(date(2017, 11, 1), 'mouse'),
(date(2017, 11, 2), 'mouse'),
(date(2017, 11, 4), 'keyboard'),
(date(2017, 11, 6), 'keyboard'),
(date(2017, 11, 9), 'keyboard'),
(date(2017, 11, 12), 'mouse'),
(date(2017, 11, 18), 'keyboard')
], ['date', 'prod_cat'])
purchases.show()
Another set of useful “special” functions
Also used on ordered windows
They create a rank, or an order of the items within the window
window = Window.partitionBy('category')\
.orderBy(contestants.points.desc())
rank = fn.rank().over(window)
dense_rank = fn.dense_rank().over(window)
row_number = fn.row_number().over(window)
contestants\
.withColumn('rank', rank)\
.withColumn('dense_rank', dense_rank)\
.withColumn('row_number', row_number)\
.orderBy('category', fn.col('points').desc())\
.show()
Very similar to reading. Output formats are the same: csv
, json
, parquet
, orc
, jdbc
, etc. Note that write
is an action
Instead of df.read.{source}
use df.write.{target}
Main option is mode
with possible values:
"append"
: append contents of this DataFrame
to existing data."overwrite"
: overwrite existing data"error"
: throw an exception if data already exists"ignore"
: silently ignore this operation if data already exists.Example
A lot happens under the hood when executing an action on a DataFrame
. The query goes through the following exectution stages:
IFEBY030 – Technos Big Data – M1 MIDS/MFA/LOGOS – UParis Cité