Spark tips

Technologies Big Data Master MIDS/MFA/LOGOIS

Équipe de Statistique

LPSM Université Paris-Cité

2025-01-17

Spark tips

Tip 1. Use DataFrames instead of RDDs

  • Instead of using the RDD API
rdd = sc.textFile("/path/to/file.txt")
  • Use the DataFrame API
df = spark.read.textFile("/path/to/file.txt")
  • The DataFrame API uses the * Catalyst* optimizer to improve the execution plan of your Spark Job

  • The low-level RDD API does not

  • Most of the recent Spark advances are towards an improvement of the SQL

Tip 2. Avoid using regular expressions

  • Java Regex is great to parse data in an expected structure

  • But, unfortunately, it is generally a slow process when processing millions of rows

  • Increasing a little bit the parsing of rows increases a lot the entire job

  • If possible, avoid using Regex’s and try to load your data in a more structured format

Tip 3. Joins: largest dataset on the left

  • When joining two datasets where one is smaller than the other, you must put the largest on the left
joinedDF = largeDF.join(smallDF, on="id")
  • The data specified on the left is static on the executors while the data on the right is transfered between the executors

  • Something like

joinedDF = smallDF.join(largeDF, on="id")

can be much longer or even fail if largeDF is large

Tip 4. Joins: use broadcast joining

  • Often, we need to join a huge dataframe with a small one

  • Use broadcast joins for joining small datasets to larger ones

from pyspark.sql.functions import broadcast

joinedDF = largeDF.join(broadcast(smallDF), on="id")
  • Usually leads to much faster joins since is allows to avoid shuffles

Tip 5. Use caching when repeating queries

  • If you are constantly using the same DataFrame on multiple queries, you can use caching or persistence:
df = (
  spark
    .read
    .textFile("/path/to/file.txt")
    .cache()
)
  • But avoid overusing this. Depending on caching strategy (in-memory then swap to disk), cache can end up being slower than reading

  • Storage space used for caching means less space for processing

  • Caching can cost more than reading the DataFrame (e.g. only few columns are useful, predictate pushdown)

Tip 6. COMPUTE STATISTICS of tables

  • Before querying a table, it can be helpful to compute the statistics of those tables so that Catalyst can find a better plan to process it:
query = "ANALYZE TABLE db.table COMPUTE STATISTICS"
spark.sql(query)
  • However, Spark does not always get everything it needs just from the above broad COMPUTE STATISTICS call

Tip 6. COMPUTE STATISTICS of tables

  • Also helps to check specific columns so that Catalyst can better check those columns

  • It’s recommended to COMPUTE STATISTICS for any columns involved in filtering and joining :

query = "ANALYZE TABLE db.table COMPUTE STATISTICS"
            " FOR COLUMNS joinColumn, filterColumn"

spark.sql(query)

Tip 7. Shuffles: know your data

  • Shuffle is the transportation of data between workers across a Spark cluster’s network

  • It’s central for operations where a reorganization of data is required, referred to as wide dependencies (wide vs narrow dependencies)

  • This kind of operation usually is the bottleneck of your Spark application

  • To use Spark well, you need to know what you shuffle, and for this it’s essential that you know your data

Tip 8. Shuffles: beware of skews

  • Skew is an imbalance in the distribution of your data

  • If you fail to account for how your data is distributed, you may find that Spark naively places an overwhelming majority of rows on one executor, and a fraction on all the rest

  • This is skew, and it will kill your application

  • Whether by causing out of memory errors, network timeouts, or exponentially long running processes that will never terminate

Tip 9. Partitions: change the default

  • It’s absolutely essential to model the number of partitions around the kind of stuff you’re solving

  • The default value for spark.sql.shuffle.partitions is 200. It controls the number of partitions used by shuffles (= number of partitions in the resulting DataFrame of RDD).

  • Number of shuffle partitions does not change with different data size. For small data, 200 is overkill, for large data, it does not effectively use the all resources.

  • Rule of thumb: set this configuration to the number of cores you have available across all your executors

(
  spark
    .conf
    .set("spark.sql.shuffle.partitions", 42)
)

Tip 10. Partitions: well-distributed columns

  • A powerful way to control Spark shuffles is to partition your data intelligently

  • Partitioning on the right column (or set of columns) helps to balance the amount of data mapped across the cluster network in order to perform actions

  • Partitioning on a unique ID is generally a good strategy, but don’t partition on sparsely filled columns (with many NAs) or columns that over-represent particular values

Tip 11. Joins again: highly flammable

  • Joins are shuffle offenders. Dangers of SQL joining are amplified by the scale enabled by Spark

  • Even joining medium sized data can cause an explosion if there are repeated join values on both sides of your join

  • Million rows datasets with “pseudo unique” keys can explode into a billions rows join!

  • Join columns with null values usually means massive skew and an explosive join

  • A solution is to pre-fill empty cells to arbitrary balanced values (e.g. uniform random values) before running a join

Tip 12. Is your data real yet?

  • Don’t forget that operations in Spark are divided between transformations and actions. Transformations are lazy operations allowing Spark to optimize your query

  • Transformations set up a DataFrame for changes (adding a column, joining it to another, etc.) but will not execute these until an action is performed.

  • This can result in surprising results: imagine that you create an id column using monotonically_increasing_id, and then join on that column. If you do not place an action in between, your values have not been materialized. The result will be non-deterministic!

Tip 13. Checkpointing is your friend

  • Checkpointing means saving data to disk and reloading it back in, which is redundant anywhere else besides Spark.

  • It triggers an action on any waiting transformations, and truncates the Spark query plan for the checkpointed data.

  • This action shows up in your Spark UI, indicating where you are in your job.

  • It can help to conserve resources, since it can release memory that would otherwise be cached for downstream access.

  • Checkpointed data is also a valuable source for data-debugging.

Tip 14. Check your runtime with monitoring

  • Spark UI is your friend, and so are other monitoring tools that let you know how your run is going in real-time.

  • The Spark UI contains information on the job level, the stage level, and the executor level. You can see if the volume of data going to each partition or each executor makes sense, if some part of your job is taking too much time.

  • Such a monitoring tool allowing to view your total memory and CPU usage across executors is essential for resource planning and “autopsies” on failed jobs.

Tip 15. CSV reading is brittle

  • Naively reading CSVs in Spark can result in silent escape-character errors
df = spark.read.csv("quote-happy.csv")
  • Your DataFrame seems happy: no runtime exceptions, and you can execute operations on the DataFrame

  • But after careful debugging, you realize that at some point in the data, everything has shifted over one or several columns!

  • To be safe, you can include escape and quote options in your reads. Even better: use Parquet instead of CSV files!

Tip 16. Parquet is your friend

  • Read/Write operations are order of magnitude more efficient with Parquet than with uncompressed CSV files

  • Parquet is “columnar”: reads only the columns required for a sql query and skip over those that are not requested.

  • And also predicate pushdown operations on filtering operations: run queries only on relevant subsets of the values.

  • Switching from CSV to Parquet is the first thing you can do to improve performance.

  • If you are generating Parquet files from another format (using PyArrow, Pandas, etc.) be conscious that creating a single parquet file gives up a major benefit of the format: you need to partition it!

Tip 17. Problems with UDFs

UDF = User Defined Function = something very convenient

>>> from pyspark.sql import functions as F, types as T

>>> data = [{'a': 1, 'b': 0}, {'a': 10, 'b': 3}]
>>> df = spark.createDataFrame(data)

>>> def calculate_a_b_ratio(a, b):
>>>     if b > 0:
>>>         return a / b
>>>     return 0.

>>> udf_ratio_calculation = F.udf(calculate_a_b_ratio, T.FloatType())

>>> df = df.withColumn('a_b_ratio_float', udf_ratio_calculation('a', 'b'))
>>> df.show()
+---+---+---------------+
|  a|  b|a_b_ratio_float|
+---+---+---------------+
|  1|  0|            0.0|
| 10|  3|      3.3333333|
+---+---+---------------+

Tip 17. Problems with UDFs

UDF are Excruciatingly slow with pyspark and spark won’t complain if the return type is incorrect and just return nulls

>>> udf_ratio_calculation = F.udf(calculate_a_b_ratio, T.DecimalType())
>>> df = df.withColumn('a_b_ratio_dec', udf_ratio_calculation('a', 'b'))
>>> df.show()
+---+---+---------------+-------------+
|  a|  b|a_b_ratio_float|a_b_ratio_dec|
+---+---+---------------+-------------+
|  1|  0|            0.0|         null|
| 10|  3|      3.3333333|         null|
+---+---+---------------+-------------+
>>> udf_ratio_calculation = F.udf(calculate_a_b_ratio, T.BooleanType())
>>> df = df.withColumn('a_b_ratio_bool', udf_ratio_calculation('a', 'b'))
>>> df.show()
+---+---+---------------+-------------+--------------+
|  a|  b|a_b_ratio_float|a_b_ratio_dec|a_b_ratio_bool|
+---+---+---------------+-------------+--------------+
|  1|  0|            0.0|         null|          null|
| 10|  3|      3.3333333|         null|          null|
+---+---+---------------+-------------+--------------+

Tip 18. Use all of the resources

  • Spark driver memory and executor memory are set by default to 1 Go.

  • It is in general very useful to take a look at the many configuration parameters and their defaults:

https://spark.apache.org/docs/latest/configuration.html

  • Many things there that can influence your spark application

  • When running locally, adjust spark.driver.memory to something that’s reasonable for your system, e.g. "8g"

  • When running on a cluster, you might also want to tweak the spark.executor.memory (though it depends on your cluster and its configuration).

Tip 18. Use all of the resources

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.executor.memory', '16g')
conf.set('spark.driver.memory', '8g')

spark_session = SparkSession.builder \
        .config(conf=conf) \
        .appName('Name') \
        .getOrCreate()

Interpret error messages

Interpret error messages

  • Error messages don’t mean what they say

  • Takes quite a while to understand that Spark complains about one thing, when the problem is somewhere else

  • "Connection reset by peer" often means that you have skews and one particular worker has run out of memory

  • "java.net.SocketTimeoutException: Write timed out" can mean that the number of partitions too high, so that the filesystem is too slow to handle the number of simultaneous writes attempted by Spark

Interpret error messages

  • "Total size of serialized results[...] is bigger than spark.driver.maxResultSize" can mean that the number of partitions is too high and results can’t fit onto a particular worker

  • "Column a is not a member of table b": you have a sql join error. Try your job locally on a small sample to avoid reverse engineering of such errors

  • Sometimes you get a true "out of memory" error. You can increase the size of individual workers, but before you do that, ask yourself, is the data well distributed ?

Interpret error messages

  • "ClassNotFoundException": usually when you are trying to connect your application to an external a database. Here is an example

Interpret error messages

  • Means that Spark cannot find the necessary jar driver to connect to the database

  • Need to provide the correct jars to your application using the spark configuration or as a command line argument

from pyspark import SparkConf
from pyspark.sql import SparkSession

jars = "/full/path/to/postgres.jar,/full/path/to/other/jar"
conf = SparkConf()
conf.set("spark.jars", jars)

spark = (
  SparkSession
    .builder
    .config(conf=conf)
    .appName('test')
    .getOrCreate()
)

or

spark-submit --jars /full/path/to/postgres.jar,/full/path/to/other/jar ...

Interpret error messages

  • All the jars must be accessible to all nodes and not local to the driver.

  • This error might also mean a Spark version mismatch between the cluster components

  • Make sure there is no space between the commas in the list of jars.

Interpret error messages

Trying to connect to a database: "java.sql.SQLException: No suitable driver"

Interpret error messages

Error happens while trying to save to a database: "java.lang.NullPointerException"

Interpret error messages

This errors usually mean that we forgot to set the driver, "org.postgresql.Driver" for Postgres:

df = spark.read.format('jdbc').options(
    url= 'db_url',
    driver='org.postgresql.Driver',  # <-- here
    dbtable='table_name',
    user='user',
    password='password'
).load()

and also make sure that the drivers’ jars are set.

Interpret error messages

Horrible error : 'NoneType' object has no attribute '_jvm'

…mainly comes from two mistakes

Interpret error messages

  1. You are using pyspark functions without having an active spark session
from pyspark.sql import SparkSession, functions as fn

class A(object):
    def __init__(self):
        self.calculations = fn.col('a') / fn.col('b')
...
# Instantiating A without an active spark session 
# will give you this error
a = A()

Interpret error messages

  1. You are using pyspark functions within a UDF:
# Create a dataframe
data = [{'a': 1, 'b': 0}, {'a': 10, 'b': 3}]
df = spark.createDataFrame(data)

# Define a simple function that returns a / b
def calculate_a_b_max(a, b):
    return F.max([a, b])

# and a udf for this function - notice the return datatype
udf_max_calculation = F.udf(calculate_a_b_ratio, T.FloatType())

df = df.withColumn('a_b_max', udf_max_calculation('a', 'b'))

df.show()

We CANNOT use pyspark functions inside a udf: a UDF operates on a row per row basis while pyspark functions on a column basis.

Thank you !