Spark NLP

Technologies Big Data Master MIDS/MFA/LOGOIS

Équipe de Statistique

LPSM Université Paris-Cité

2025-01-17

Spark-NLP in perspective

Spark NLP

Spark NLP is an example of an application built on top of the Apache Spark ecosystem

Spark NLP relies on Spark SQL and Spark DataFrames (the high-level APIs), as well as on Spark MLlib.

Spark NLP borrows ideas from existing NLP software and adapts well-known techniques to Spark's principles

NLP covers many applications of machine learning:

  • Automatic translation (see deepl.com)
  • Topic modeling (text clustering)
  • Sentiment Analysis
  • LLMs

NLP Libraries

Two flavors of NLP libraries

spaCy and Spark?

A Databricks notebook discusses possible interactions between spaCy and Spark on a use case:

  • Get the tweets (the texts) into a Spark dataframe using spark.sql()
  • Convert the Spark dataframe to a numpy array
  • Stream all tweets in batches using nlp.pipe()
  • Go through the processed tweets and copy everything we need into a large array object
  • Convert back the large array object into a Spark dataframe
  • Save the dataframe as a table, so we can query the whole thing with SQL again

There is no hint at parallelizing spaCy’s annotation process: all the spaCy work happens on the driver
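
A minimal sketch of that workflow, assuming a running SparkSession spark, a hypothetical tweets table, and the en_core_web_sm spaCy model; note that the texts are collected to the driver before annotation:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")                      # assumed spaCy model

# 1. Get the texts into a Spark dataframe with spark.sql()
tweets_df = spark.sql("SELECT text FROM tweets")

# 2. Bring them back to the driver as a numpy array
texts = np.array([row.text for row in tweets_df.collect()])

# 3.-4. Stream the texts through nlp.pipe() and copy what we need
processed = [
    (doc.text, [tok.lemma_ for tok in doc])
    for doc in nlp.pipe(texts.tolist(), batch_size=256)
]

# 5. Convert the result back into a Spark dataframe
result_df = spark.createDataFrame(processed, ["text", "lemmas"])

# 6. Save it as a table so it can be queried with SQL again
result_df.write.mode("overwrite").saveAsTable("tweets_processed")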

spaCy v2 (current v3.7)

spaCy v2 now fully supports the Pickle protocol, making it easy to use spaCy with Apache Spark.

spaCy v2 documentation
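
Since the nlp object pickles, it can in principle be shipped to the executors, for instance through a broadcast variable and mapPartitions. A hedged sketch, assuming a running SparkSession spark, a dataframe df with a text column, and the en_core_web_sm model:

import spacy

nlp = spacy.load("en_core_web_sm")
bc_nlp = spark.sparkContext.broadcast(nlp)      # relies on nlp being picklable

def lemmatize_partition(rows):
    nlp_local = bc_nlp.value                    # deserialized once per executor
    for row in rows:
        doc = nlp_local(row.text)
        yield (row.text, [tok.lemma_ for tok in doc])

lemmas_df = (
    df.rdd
      .mapPartitions(lemmatize_partition)
      .toDF(["text", "lemmas"])
)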

A short example (from John Snow Labs)

  • Initializing a sparknlp session
  • Building a toy NLP pipeline for detecting dates in a text

Imports sparknlp and others

# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

# Import the Spark classes used below
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

Initiate Spark session

Assuming standalone mode on a laptop; the master runs on localhost.

spark = (
    SparkSession.builder
        .appName("Spark NLP")
        # .master("spark://localhost:7077")
        .config("spark.driver.memory", "16G")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.kryoserializer.buffer.max", "2000M")
        .config("spark.driver.maxResultSize", "0")
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3")
        .getOrCreate()
)
sparknlp.version()

spark

Toy (big) data

fr_articles = [
  ("Le dimanche 11 juillet 2021, Chiellini a utilisé le mot Kiricocho lorsque Saka s'est approché du ballon pour le penalty.",),
  ("La prochaine Coupe du monde aura lieu en novembre 2022.",),
  ("À Noël 800, Charlemagne se fit couronner empereur à Rome.",),
  ("Le Marathon de Paris a lieu le premier dimanche d'avril 2024",)
]
articles_cols = ["text"]

df = spark.createDataFrame(
  data=fr_articles, 
  schema=articles_cols)

df.printSchema()

Pipelines

document_assembler = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("document")

The document column contains the text to be annotated, as well as possible metadata.

It is the starting point of any annotation process.

Spark NLP relies on Spark SQL for storing and moving data.

date_matcher = DateMatcher() \
            .setInputCols(['document']) \
            .setOutputCol("date") \
            .setOutputFormat("MM/dd/yyyy") \
            .setSourceLanguage("fr")

  • Spark NLP adopts an original way of storing annotations
  • Spark NLP creates columns for annotations
  • Spark NLP stores annotations in Spark dataframes
  • Annotators are
    • Transformers
    • Estimators
    • Models
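
A minimal sketch of how these pieces fit together: a Spark ML Pipeline is itself an Estimator, and its fit() method returns a PipelineModel, i.e. a Transformer. This anticipates the pipelines built below and reuses the objects defined above.

from pyspark.ml import Pipeline

# DocumentAssembler and DateMatcher are Transformers: they expose .transform()
pipeline = Pipeline().setStages([document_assembler, date_matcher])

# Fitting the (Estimator) pipeline yields a PipelineModel, itself a Transformer
model = pipeline.fit(df)
model.transform(df).select("date").show(10, False)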

Transformation/Action

assembled = ( 
  document_assembler.transform(df)
)
(
 date_matcher
  .transform(assembled)
  .select('date')
  .show(10, False)
)

More

fr_articles.append(("Nous nous sommes rencontrés le 13/05/2018 puis le 18/05/2020.",))

fr_articles.append(("Nous nous sommes rencontrés il y a 2 jours et il m'a dit qu'il nous rendrait visite la semaine prochaine.",))
df = spark.createDataFrame(
  data=fr_articles, 
  schema=articles_cols)

df.printSchema()
df.show()
assembled = ( 
  document_assembler.transform(df)
)
(
 date_matcher
  .transform(assembled)
  .select('date')
  .show(10, False)
)

Another annotator

MultiDateMatcher extracts every date occurring in the document, whereas DateMatcher returns at most one match per document.

date_matcher_bis = MultiDateMatcher() \
            .setInputCols(['document']) \
            .setOutputCol("date") \
            .setOutputFormat("MM/dd/yyyy") \
            .setSourceLanguage("fr")
(
  date_matcher_bis
    .transform(assembled)
    .select("date")
    .show(10, False)
)

Spark NLP Design

Spark SQL and DataFrames

Spark MLlib, Transformers and Estimators

Spark NLP Pipelines

Getting a corpus: ETL

import re
from datetime import date
from pathlib import Path

from pyspark.sql.types import (
    StructType, StructField, StringType, DateType)

# Header lines look like: URL: http://www.nytimes.com/YYYY/MM/DD/<topic path>/<title>
pattern = 'URL: http://www.nytimes.com/(?P<zedate>[0-9]{4}/[0-9]{2}/[0-9]{2})/.*'
title = 'URL: http://www.nytimes.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/(.*)'
reg_date = re.compile(pattern)
reg_title = re.compile(title)

nypath = Path('../data/nytimes_news_articles.txt')
corpus_list = list()
with open(nypath, encoding='UTF-8') as fd:
    doc, document = None, None
    while l := fd.readline():        
        if m := reg_date.match(l):
            if doc is not None:
                corpus_list.append((*document, doc))
                doc, document = None, None
            ymd = date(*[int(n) for n in m.groups()[0].split('/')])
            title = (
                reg_title.match(l)
                  .groups()[0]
                  .split('/')
            )
            document =  (ymd, title[-1], '/'.join(title[:-1]))
            doc = ''
        else: doc = doc + l
    else:
        if doc is not None:
            corpus_list.append((*document, doc))
df_texts = spark.createDataFrame(corpus_list,
                      schema= StructType([
    StructField('date', DateType(), False),
    StructField('title', StringType(), False),
    StructField('topic', StringType(), False),
    StructField('text', StringType(), True)
]))
df_texts.printSchema()
df_texts.count()

Saving

Locally

df_texts.write.parquet('../data/ny_corpus_pq')
spam = spark.read.parquet('../data/ny_corpus_pq')

spam.printSchema()
spam.rdd.getNumPartitions()

corpus_assembled = ( 
  document_assembler.transform(df_texts)
)
corpus_assembled.printSchema()
(
  date_matcher_bis
    .transform(corpus_assembled)
    .select("title", "date")
    .show(10, False)
)

Warning

Extracted dates should be taken with a grain of salt

Public pipelines

from sparknlp.pretrained import PretrainedPipeline
explain_document_pipeline = PretrainedPipeline("explain_document_ml")
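
A hedged usage sketch: a pretrained pipeline can annotate a single string (annotate) or a Spark dataframe with a text column (transform); the example string is made up.

# Annotate a plain string: returns a dict of output columns
result = explain_document_pipeline.annotate(
    "Spark NLP ships ready-made, pretrained pipelines."
)
print(result.keys())   # e.g. document, sentence, token, spell, lemmas, stems, pos

# Or apply it to a whole dataframe such as df_texts
annotated_df = explain_document_pipeline.transform(df_texts)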

Chaining annotators

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
regexTokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")
finisher = Finisher() \
    .setInputCols(["token"]) \
    .setIncludeMetadata(True)
pipeline = Pipeline().setStages([
    document_assembler,
    sentenceDetector,
    regexTokenizer,
    finisher
])

Fitting and transforming

spam = ( 
  pipeline.fit(df_texts)
    .transform(df_texts)
    .select("finished_token")
    .collect()
)

A customized pipeline

stemmer = (
  Stemmer()
    .setInputCols(['token'])
    .setOutputCol('stem')
)
lemmatizer = (
  LemmatizerModel.pretrained()
    .setInputCols(['token'])
    .setOutputCol('lemma')
)

Warning

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
Download done! Loading the resource.
[OK!]
posTagger = PerceptronModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

Warning

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
Download done! Loading the resource.
[OK!]
finisher = (
  Finisher()
    .setInputCols([
      'token', 
#      'stem', 
#      'lemma', 
      'pos'])
    .setIncludeMetadata(False)
    .setOutputAsArray(True)
)

pipeline = (
  Pipeline()
    .setStages([
      document_assembler,
      sentenceDetector,
      regexTokenizer,
      posTagger, 
      finisher
    ])
)
spam = ( 
  pipeline.fit(df_texts)
    .transform(df_texts)
    .selectExpr("*")
    .collect()
)

Spark NLP and feature engineering

Topic modelling

TF-IDF

Latent Dirichlet Allocation
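
A hedged sketch of this feature-engineering step: the token arrays produced by the Finisher (column finished_token) can feed Spark ML's CountVectorizer and IDF for TF-IDF, and LDA for topic modelling. The numeric parameters below are illustrative assumptions.

from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml.clustering import LDA

tokens_df = pipeline.fit(df_texts).transform(df_texts)   # finished_token: array<string>

# Term frequencies
cv_model = CountVectorizer(inputCol="finished_token", outputCol="tf",
                           vocabSize=20_000, minDF=5).fit(tokens_df)
tf_df = cv_model.transform(tokens_df)

# Inverse document frequencies -> TF-IDF
tfidf_df = IDF(inputCol="tf", outputCol="tfidf").fit(tf_df).transform(tf_df)

# LDA works on the raw term counts
lda_model = LDA(featuresCol="tf", k=10, maxIter=20).fit(tf_df)
lda_model.describeTopics(5).show(truncate=False)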

Distributed computations

Execution modes

  • standalone
  • client
  • cluster

Spark NLP and composite types in Spark Dataframes


>>> result = documentAssembler.transform(data)
>>> result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
>>> result.select("document").printSchema()
root
|-- document: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- annotatorType: string (nullable = true)
|    |    |-- begin: integer (nullable = false)
|    |    |-- end: integer (nullable = false)
|    |    |-- result: string (nullable = true)
|    |    |-- metadata: map (nullable = true)
|    |    |    |-- key: string
|    |    |    |-- value: string (valueContainsNull = true)
|    |    |-- embeddings: array (nullable = true)
|    |    |    |-- element: float (containsNull = false)

Column document is of type ArrayType(). Its element type is StructType(); each element contains subfields of primitive types, but also a metadata field of type MapType() and an embeddings field of type ArrayType().
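
A hedged sketch of how such nested columns can be queried with Spark SQL functions, following the schema above (result is the dataframe from the snippet):

from pyspark.sql import functions as F

(result
   .select(F.explode("document").alias("ann"))        # one row per annotation struct
   .select("ann.begin", "ann.end", "ann.result",
           F.col("ann.metadata").getItem("sentence").alias("sentence_id"))
   .show(truncate=False))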