Technologies Big Data Master MIDS/MFA/LOGOS
2025-01-17
Spark
Spark is a computing framework that deals with many complex issues: fault tolerance, slow machines, big datasets, etc.
It follows this guideline:
Here is an operation, run it on all the data.
Note
Jobs are divided into tasks that are executed by the workers
Note
A job in Spark represents a complete computation triggered by an action in the application code.
When you invoke an action (such as collect(), saveAsTextFile(), etc.) on a Spark RDD, DataFrame, or Dataset, it triggers the execution of one or more jobs.
Each job consists of one or more stages, where each stage represents a set of tasks that can be executed in parallel.
Stages within a job are delimited by shuffle boundaries: transformations with no shuffle dependency between them are pipelined into the same stage, and each stage can execute its tasks independently.
A task is the smallest unit of work in Spark and represents the execution of a computation on a single partition of data.
Tasks are created for each partition of the RDD, DataFrame, or Dataset involved in the computation.
Spark’s execution engine assigns tasks to individual executor nodes in the cluster for parallel execution.
Tasks are executed within the context of a specific stage, and each task typically operates on a subset of the data distributed across the cluster.
The number of tasks within a stage depends on the number of partitions of the input data and the degree of parallelism configured for the Spark application.
In summary, a job represents the entire computation triggered by an action; it is composed of one or more stages, each of which is divided into smaller units of work called tasks.
Tasks operate on individual partitions of the data in parallel to achieve efficient and scalable distributed computation in Spark.
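A minimal sketch (the file path and the functions are illustrative assumptions): transformations only build up the lineage, and the action collect() is what triggers a job, which is split into stages and tasks.
rdd = sc.textFile("data.txt")                     # no job yet: transformations are lazy
words = rdd.flatMap(lambda line: line.split())    # still no job
long_words = words.filter(lambda w: len(w) > 3)   # still no job
result = long_words.collect()                     # action: triggers a job (stages -> tasks)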
An API allows a user to interact with the software
Spark is implemented in Scala and runs on the JVM (Java Virtual Machine)
Multiple Application Programming Interfaces (APIs):
Scala (JVM)
Java (JVM)
Python
R
This course primarily uses the Python API, which is easier to learn than Scala and Java
Tip
About the R API: see Mastering Spark with R
See https://en.wikipedia.org/wiki/API for more on this acronym
In the Python language, look at the notion of interface and the corresponding chapter Interfaces, Protocols and ABCs in Fluent Python
For R there are in fact two APIs, or two packages that offer a Spark API
See Mastering Spark with R by Javier Luraschi, Kevin Kuo, Edgar Ruiz
When you interact with Spark through its API, you send instructions to the Driver
The driver turns the chain of operations, e.g. map(f) - map(g) - filter(h) - reduce(l), into an execution plan, and consecutive maps can be fused into a single map(f o g)
SparkContext versus SparkSession
SparkContext and SparkSession serve different purposes
SparkContext was the main entry point for Spark applications in the first versions of Apache Spark.
SparkContext represented the connection to a Spark cluster, allowing the application to interact with the cluster manager.
SparkContext was responsible for coordinating and managing the execution of jobs and tasks.
SparkContext provided APIs for creating RDDs (Resilient Distributed Datasets), which were the primary abstraction in Spark for representing distributed data.
Your Python session interacts with the driver through a SparkContext object
In the Spark interactive shell, an object of class SparkContext is automatically created in the session and named sc
In a jupyter notebook, create a SparkContext object using:
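A minimal sketch (the application name and master URL are assumptions):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("my-app").setMaster("local[*]")   # hypothetical settings
sc = SparkContext(conf=conf)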
In Spark 2.0 and later versions, SparkContext is still available but is no longer the primary entry point. Instead, SparkSession is preferred.
SparkSession was introduced in Spark 2.0 as a higher-level abstraction that encapsulates SparkContext, SQLContext, and HiveContext.
SparkSession provides a unified entry point for Spark functionality, integrating the Structured APIs (SQL, DataFrame, Dataset) and the traditional RDD-based APIs.
SparkSession is designed to make it easier to work with structured data (like data stored in tables or files with a schema) using Spark’s DataFrame and Dataset APIs.
SparkSession also provides built-in support for reading data from various sources (like Parquet, JSON, JDBC, etc.) into DataFrames and writing DataFrames back to different formats.
Additionally, SparkSession simplifies the configuration of Spark properties and provides a Spark SQL CLI and a Spark Shell with SQL and DataFrame support.
Note
SparkSession internally creates and manages a SparkContext, so when you create a SparkSession, you don’t need to create a SparkContext separately.
SparkContext is lower-level and primarily focused on managing the execution of Spark jobs and interacting with the cluster
SparkSession provides a higher-level, more user-friendly interface for working with structured data and integrates various Spark functionalities, including SQL, DataFrame, and Dataset APIs.
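A minimal sketch of creating a SparkSession (the application name and master URL are assumptions):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("my-app")          # hypothetical application name
         .master("local[*]")         # hypothetical master URL
         .getOrCreate())
sc = spark.sparkContext              # the underlying SparkContext, managed by the session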
Spark programs are written in terms of operations on RDDs
RDD stands for Resilient Distributed Dataset
An immutable distributed collection of objects spread across the cluster's disks or memory
RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes
Parallel transformations and actions can be applied to RDDs
RDDs are automatically rebuilt on machine failure
From an iterable object (e.g. a Python list, etc.): use sc.parallelize
From a text file: use sc.textFile
See the sketch below, where lines is the resulting RDD and sc the Spark context
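A minimal sketch of both constructions (the list values and the file path are illustrative):
data = sc.parallelize([1, 2, 3, 4])    # RDD built from a Python list
lines = sc.textFile("data/logs.txt")   # RDD of lines read from a text file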
Remarks
parallelize is not really used in practice: real data is read from files in formats such as json, csv, xml, parquet, orc, etc.
For iterators, look again at Fluent Python, chapter 17, Iterators, Generators, and Classic Coroutines
Two families of operations can be performed on RDDs: transformations (which are lazy) and actions (which trigger computation)
What is lazy evaluation ?
When a transformation is called on an RDD, nothing is computed immediately: Spark only records the operation, and the computation is deferred until an action is called
The most important transformation is map
transformation | description |
---|---|
map(f) | apply a function f to each element of the RDD |
Here is an example (see below): you must call collect (an action), otherwise nothing happens, since map is lazy-evaluated
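A minimal sketch of such an example (the input values are illustrative):
>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> squares = rdd.map(lambda x: x * x)   # lazy: nothing is computed yet
>>> squares.collect()                    # the action triggers the computation
[1, 4, 9, 16]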
In Python, there are three options for passing functions into Spark: lambda expressions (anonymous functions), top-level functions (defined with def), and locally defined functions
About passing functions to map:
The function is serialized with pickle and Spark sends the entire pickled function to worker nodes
Warning
If the function is an object method: the whole object gets serialized, because the method holds a reference to the object (self) and references to attributes of the object
pickle
Converting an object from its in-memory structure to a binary or text-oriented format for storage or transmission, in a way that allows the future reconstruction of a clone of the object on the same system or on a different one.
The pickle module supports serialization of arbitrary Python objects to a binary format
from Fluent Python by Ramalho
Then we have flatMap
transformation | description |
---|---|
flatMap(f) | apply f to each element of the RDD, then flatten the results |
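A minimal sketch comparing map and flatMap (illustrative input):
>>> lines = sc.parallelize(["hello world", "hi"])
>>> lines.map(lambda line: line.split()).collect()       # nested lists
[['hello', 'world'], ['hi']]
>>> lines.flatMap(lambda line: line.split()).collect()   # flattened
['hello', 'world', 'hi']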
filter allows filtering an RDD
transformation | description |
---|---|
filter(f) | Return an RDD consisting of only the elements that pass the condition f passed to filter() |
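For instance (illustrative input):
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.filter(lambda x: x % 2 == 0).collect()   # keep only even numbers
[2, 4]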
About distinct and sample
transformation | description |
---|---|
distinct() | Remove duplicates |
sample(withReplacement, fraction, [seed]) | Sample an RDD, with or without replacement |
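A minimal sketch (illustrative input; the order of the results is not guaranteed):
rdd = sc.parallelize([1, 1, 2, 3, 3, 3])
uniques = rdd.distinct().collect()        # [1, 2, 3] (order not guaranteed)
subset = rdd.sample(withReplacement=False, fraction=0.5, seed=42).collect()   # a random subset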
We have also pseudo-set-theoretical operations
transformation | description |
---|---|
union(otherRdd) | Return the union with otherRdd |
intersection(otherRdd) | Return the intersection with otherRdd |
subtract(otherRdd) | Return each value in self that is not contained in otherRdd |
Note
union() will contain duplicates (fixed with distinct())
intersection() removes all duplicates (including duplicates from a single RDD)
the performance of intersection() is much worse than union() since it requires a shuffle to identify common elements
subtract also requires a shuffle
The shuffle is Spark's mechanism for redistributing data so as to modify the partitioning
To organize the data for the shuffle, Spark generates sets of tasks: map tasks to organize the data, and reduce tasks to aggregate it
Another “pseudo set” operation
transformation | description |
---|---|
cartesian(otherRdd) | Return the Cartesian product of this RDD and another one |
collect brings the RDD back to the driver
action | description |
---|---|
collect() | Return all elements from the RDD |
Note
It’s important to count!
action | description |
---|---|
count() | Return the number of elements in the RDD |
countByValue() | Return the count of each unique value in the RDD as a dictionary of {value: count} pairs |
How to get some (but not all) values in an RDD ?
action | description |
---|---|
take(n) | Return n elements from the RDD (deterministic) |
top(n) | Return the first n elements from the RDD (descending order) |
takeOrdered(num, key=None) | Get the n elements from an RDD ordered in ascending order or as specified by the optional key function |
Note
take(n) returns n elements from the RDD and attempts to minimize the number of partitions it accesses
collect and take may return the elements in an order you don't expect
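A minimal sketch (illustrative input):
>>> rdd = sc.parallelize([5, 3, 8, 1, 9])
>>> rdd.take(3)                            # first elements encountered
[5, 3, 8]
>>> rdd.top(2)                             # largest elements, descending order
[9, 8]
>>> rdd.takeOrdered(2)                     # smallest elements, ascending order
[1, 3]
>>> rdd.takeOrdered(2, key=lambda x: -x)   # same as top(2) here
[9, 8]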
action | description |
---|---|
reduce(f) | Reduce the elements of this RDD using the specified commutative and associative binary operator f |
fold(zeroValue, op) | Same as reduce() but with the provided zero value |
op(x, y) is allowed to modify x and return it as its result value to avoid object allocation; however, it should not modify y
reduce applies some operation to pairs of elements until there is just one left; it throws an exception for empty collections
fold has an initial zero value: it is defined for empty collections
Warning
With fold, solutions can depend on the number of partitions
>>> rdd = sc.parallelize([1, 2, 4], 2) # RDD with 2 partitions
>>> rdd.fold(2.5, lambda a, b: a + b)
14.5
The zero value is applied once per partition and once more when merging the partition results: (2.5 + 1 + 2) + (2.5 + 4) + 2.5 = 14.5
action | description |
---|---|
aggregate(zero, seqOp, combOp) | Similar to reduce() but used to return a different type |
Aggregates the elements of each partition, and then the results for all the partitions, given aggregation functions and zero value.
seqOp(acc, val): function to combine the elements of a partition from the RDD (val) with an accumulator (acc)
combOp(acc1, acc2): function that merges the accumulators of two partitions
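For instance, the mean of an RDD can be computed with aggregate, using a (sum, count) pair as accumulator (a minimal sketch, illustrative input):
rdd = sc.parallelize([1, 2, 3, 4])
seqOp = lambda acc, val: (acc[0] + val, acc[1] + 1)                  # add one element to (sum, count)
combOp = lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])   # merge two accumulators
total, count = rdd.aggregate((0, 0), seqOp, combOp)
mean = total / count                                                 # 2.5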
The foreach action
action | description |
---|---|
foreach(f) | Apply a function f to each element of a RDD |
Performs an action on all of the elements in the RDD without returning any result to the driver.
Example : insert records into a database with f
The foreach() action lets us perform computations on each element in the RDD without bringing it back locally
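A minimal sketch (the side effect here is a simple print, standing in for e.g. a database insert; on a cluster, the output goes to the executors' logs, not to the driver):
def save(record):
    print(record)      # stand-in for a real side effect (database insert, etc.)

rdd.foreach(save)      # runs on the executors, returns nothing to the driver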
Spark RDDs are lazily evaluated
Each time an action is called on an RDD, this RDD and all its dependencies are recomputed
If you plan to reuse an RDD multiple times, you should use persistence
Note
Lazy evaluation allows Spark to reduce the number of passes over the data it has to make by grouping operations together
How to use persistence ?
method | description |
---|---|
cache() | Persist the RDD in memory |
persist(storageLevel) | Persist the RDD according to storageLevel |
These methods must be called before the action, and do not trigger the computation
cache() uses a default storage level, while persist(storageLevel) makes the storageLevel explicit
What is the difference between cache() and persist() with useMemory ?
Options for persistence
argument | description |
---|---|
useDisk | Allow caching to use disk if True |
useMemory | Allow caching to use memory if True |
useOffHeap | Store data outside of the JVM heap if True. Useful if using some in-memory storage system (such as Tachyon) |
deserialized | Cache data without serialization if True |
replication | Number of replications of the cached data |
replication: if you cache data that is slow to recompute, you can use replication; if a machine fails, the data will not have to be recomputed
deserialized: PySpark only supports serialized caching (using pickle)
useOffHeap: stores data in an external in-memory system such as Tachyon, outside of the JVM heap (Spark is Scala running on the JVM)
You can use these constants:
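# arguments are StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)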
DISK_ONLY = StorageLevel(True, False, False, False, 1)
DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
MEMORY_AND_DISK = StorageLevel(True, True, False, True, 1)
MEMORY_AND_DISK_2 = StorageLevel(True, True, False, True, 2)
MEMORY_AND_DISK_SER = StorageLevel(True, True, False, False, 1)
MEMORY_AND_DISK_SER_2 = StorageLevel(True, True, False, False, 2)
MEMORY_ONLY = StorageLevel(False, True, False, True, 1)
MEMORY_ONLY_2 = StorageLevel(False, True, False, True, 2)
MEMORY_ONLY_SER = StorageLevel(False, True, False, False, 1)
MEMORY_ONLY_SER_2 = StorageLevel(False, True, False, False, 2)
OFF_HEAP = StorageLevel(False, False, True, False, 1)
and simply call for instance
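A minimal sketch of such a call (here rdd stands for an already-built RDD):
from pyspark import StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK)   # mark the RDD for caching (memory, spilling to disk)
rdd.count()                                 # the first action materializes the cache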
What if you attempt to cache too much data to fit in memory ?
Spark will automatically evict old partitions using a Least Recently Used (LRU) cache policy:
For the memory-only storage levels, it will recompute these partitions the next time they are accessed
For the memory-and-disk ones, it will write them out to disk
Call unpersist() on RDDs to manually remove them from the cache
Warning
When passing functions, you can inadvertently serialize the object containing the function.
If you pass a function that is the method of an object, or that references fields of an object,
then Spark sends the entire object to worker nodes, which can be much larger than the bit of information you need
Caution
This can cause your program to fail, if your class contains objects that Python can’t pickle
Passing a function with field references (don’t do this ! )
class SearchFunctions(object):
def __init__(self, query):
self.query = query
def isMatch(self, s):
return self.query in s
def getMatchesFunctionReference(self, rdd):
# Problem: references all of "self" in "self.isMatch"
return rdd.filter(self.isMatch)
def getMatchesMemberReference(self, rdd):
# Problem: references all of "self" in "self.query"
return rdd.filter(lambda x: self.query in x)
Tip
Instead, just extract the fields you need from your object into a local variable and pass that in
Python function passing without field references
class WordFunctions(object):
...
def getMatchesNoReference(self, rdd):
# Safe: extract only the field we need into a local variable
query = self.query
return rdd.filter(lambda x: query in x)
Much better to do this instead
It’s roughly an RDD where each element is a tuple with two elements: a key and a value
Organizing (key, value) pairs into an RDD is very convenient: such an RDD is called a PairRDD
How do we create a PairRDD ? By calling map with a function returning a tuple with two elements
All elements of a PairRDD must be tuples with two elements (the key and the value)
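A minimal sketch of creating a PairRDD (illustrative input; the key is the first word of each line):
lines = sc.parallelize(["spark is fast", "spark is lazy"])
pairs = lines.map(lambda line: (line.split()[0], line))   # (key, value) = (first word, whole line)
pairs.collect()   # [('spark', 'spark is fast'), ('spark', 'spark is lazy')]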
PairRDD
transformation | description |
---|---|
keys() | Return an RDD containing the keys |
values() | Return an RDD containing the values |
sortByKey() | Return an RDD sorted by the key |
mapValues(f) | Apply a function f to each value of a pair RDD without changing the key |
flatMapValues(f) | Pass each value in the key-value pair RDD through a flatMap function f without changing the keys |
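A minimal sketch of some of these transformations (illustrative input):
>>> pairs = sc.parallelize([("b", 2), ("a", 1), ("a", 3)])
>>> pairs.keys().collect()
['b', 'a', 'a']
>>> pairs.sortByKey().collect()
[('a', 1), ('a', 3), ('b', 2)]
>>> pairs.mapValues(lambda v: v * 10).collect()
[('b', 20), ('a', 10), ('a', 30)]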
PairRDD (keyed)
transformation | description |
---|---|
groupByKey() | Group values with the same key |
reduceByKey(f) | Merge the values for each key using an associative reduce function f |
foldByKey(zeroValue, f) | Same as reduceByKey() but with a provided zero value |
combineByKey(createCombiner, mergeValue, mergeCombiners, [partitioner]) | Generic function to combine the elements for each key using a custom set of aggregation functions |
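For instance, the classical word count uses reduceByKey (a minimal sketch, illustrative input; the order of the output is not guaranteed):
words = sc.parallelize(["spark", "is", "fast", "spark"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.collect()   # [('spark', 2), ('is', 1), ('fast', 1)]  (order not guaranteed)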
combineByKey transforms an RDD[(K, V)] into another RDD of type RDD[(K, C)] for a combined type C that can be different from V
The user must define
createCombiner: which turns a V into a C
mergeValue: to merge a V into a C
mergeCombiners: to combine two C's into a single one
See the sketch below.
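A minimal sketch: the per-key average with combineByKey, where the combined type C is a (sum, count) pair (illustrative input):
pairs = sc.parallelize([("a", 1), ("a", 3), ("b", 5)])
sum_counts = pairs.combineByKey(
    lambda v: (v, 1),                                 # createCombiner: V -> C
    lambda c, v: (c[0] + v, c[1] + 1),                # mergeValue: (C, V) -> C
    lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]),    # mergeCombiners: (C, C) -> C
)
sum_counts.mapValues(lambda c: c[0] / c[1]).collect()   # [('a', 2.0), ('b', 5.0)]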
PairRDD
transformation | description |
---|---|
subtractByKey(other) | Remove elements with a key present in the other RDD |
join(other) | Inner join with the other RDD |
rightOuterJoin(other) | Right join with the other RDD |
leftOuterJoin(other) | Left join with the other RDD |
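A minimal sketch (illustrative inputs):
rdd1 = sc.parallelize([("a", 1), ("b", 2)])
rdd2 = sc.parallelize([("a", "x"), ("c", "y")])
rdd1.join(rdd2).collect()            # [('a', (1, 'x'))]
rdd1.leftOuterJoin(rdd2).collect()   # [('a', (1, 'x')), ('b', (2, None))]
rdd1.subtractByKey(rdd2).collect()   # [('b', 2)]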
Join operations are mainly used through the high-level API: DataFrame objects and the spark.sql API
We will use them a lot with the high-level API (DataFrame from spark.sql)
PairRDD
action | description |
---|---|
countByKey() | Count the number of elements for each key |
lookup(key) | Return all the values associated with the provided key |
collectAsMap() | Return the key-value pairs in this RDD to the master as a Python dictionary |
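A minimal sketch (illustrative input):
pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
pairs.countByKey()     # {'a': 2, 'b': 1} (a defaultdict)
pairs.lookup("a")      # [1, 2]
pairs.collectAsMap()   # {'a': 2, 'b': 3}  (only one value kept per duplicated key)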
Some operations on PairRDDs, such as join, require scanning the data more than once
In Spark, you can choose which keys will appear on the same node, but there is no explicit control of which worker node each key goes to
In practice, you can specify the number of partitions with partitionBy
You can also use a custom partition function hash such that hash(key) returns a hash value
from urllib.parse import urlparse

def hash_domain(url):
    # Return a hash associated with the domain of a website
    return hash(urlparse(url).netloc)

rdd.partitionBy(20, hash_domain)  # Create 20 partitions, keyed by domain
To have finer control on partitioning, you must use the Scala API.
IFEBY030 – Technos Big Data – M1 MIDS/MFA/LOGOS – UParis Cité