Spark NLP

Technologies Big Data Master MIDS/MFA/LOGOIS

Équipe de Statistique

LPSM Université Paris-Cité

2025-01-17

Spark-NLP in perspective

Spark NLP

Spark NLP is an example of an application built on top of the Apache Spark ecosystem

Spark NLP relies on Spark SQL and Spark DataFrames (the high-level APIs), as well as on Spark MLlib.

Spark NLP borrows ideas from existing NLP software and adapts well-known techniques to Spark's principles

NLP covers many applications of machine learning:

  • Automatic translation (see deepl.com)
  • Topic modeling (text clustering)
  • Sentiment Analysis
  • LLMs

NLP Libraries

Two flavors of NLP libraries

spaCy and Spark?

A Databricks notebook discusses possible interactions between spaCy and Spark on a use case:

  • Get the tweets (the texts) into a Spark dataframe using spark.sql()
  • Convert the Spark dataframe to a numpy array
  • Stream all tweets in batches using nlp.pipe()
  • Go through the processed tweets and copy everything we need into a large array object
  • Convert back the large array object into a Spark dataframe
  • Save the dataframe as a table, so we can query the whole thing with SQL again

There is no hint at parallelizing spaCy’s annotation process: all the spaCy work happens on the driver
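
A minimal sketch of that workflow, assuming a running SparkSession spark, a hypothetical tweets table, and the en_core_web_sm spaCy model; note that the texts are collected to the driver before annotation:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")                      # assumed spaCy model

# 1. Get the texts into a Spark dataframe with spark.sql()
tweets_df = spark.sql("SELECT text FROM tweets")

# 2. Bring them back to the driver as a numpy array
texts = np.array([row.text for row in tweets_df.collect()])

# 3.-4. Stream the texts through nlp.pipe() and copy what we need
processed = [
    (doc.text, [tok.lemma_ for tok in doc])
    for doc in nlp.pipe(texts.tolist(), batch_size=256)
]

# 5. Convert the result back into a Spark dataframe
result_df = spark.createDataFrame(processed, ["text", "lemmas"])

# 6. Save it as a table so it can be queried with SQL again
result_df.write.mode("overwrite").saveAsTable("tweets_processed")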

spaCy v2 (current v3.7)

spaCy v2 now fully supports the Pickle protocol, making it easy to use spaCy with Apache Spark.

spaCy v2 documentation
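
Since the nlp object pickles, it can in principle be shipped to the executors, for instance through a broadcast variable and mapPartitions. A hedged sketch, assuming a running SparkSession spark, a dataframe df with a text column, and the en_core_web_sm model:

import spacy

nlp = spacy.load("en_core_web_sm")
bc_nlp = spark.sparkContext.broadcast(nlp)      # relies on nlp being picklable

def lemmatize_partition(rows):
    nlp_local = bc_nlp.value                    # deserialized once per executor
    for row in rows:
        doc = nlp_local(row.text)
        yield (row.text, [tok.lemma_ for tok in doc])

lemmas_df = (
    df.rdd
      .mapPartitions(lemmatize_partition)
      .toDF(["text", "lemmas"])
)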

A short example (from John Snow Labs)

  • Initializing a sparknlp session
  • Building a toy NLP pipeline for detecting dates in a text

Imports sparknlp and others

# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

# Import the Spark classes used below
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

Initiate Spark session

Assuming standalone mode on a laptop; the master runs on localhost.

spark = (
    SparkSession.builder
        .appName("Spark NLP")
        # .master("spark://localhost:7077")
        .config("spark.driver.memory", "16G")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.kryoserializer.buffer.max", "2000M")
        .config("spark.driver.maxResultSize", "0")
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3")
        .getOrCreate()
)
sparknlp.version()

spark

Toy (big) data

fr_articles = [
  ("Le dimanche 11 juillet 2021, Chiellini a utilisé le mot Kiricocho lorsque Saka s'est approché du ballon pour le penalty.",),
  ("La prochaine Coupe du monde aura lieu en novembre 2022.",),
  ("À Noël 800, Charlemagne se fit couronner empereur à Rome.",),
  ("Le Marathon de Paris a lieu le premier dimanche d'avril 2024",)
]
articles_cols = ["text"]

df = spark.createDataFrame(
  data=fr_articles, 
  schema=articles_cols)

df.printSchema()

Pipelines

document_assembler = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("document")

The document column contains the text to be annotated, as well as possible metadata.

It is the starting point of any annotation process.

Spark NLP relies on Spark SQL for storing and moving data.

date_matcher = DateMatcher() \
            .setInputCols(['document']) \
            .setOutputCol("date") \
            .setOutputFormat("MM/dd/yyyy") \
            .setSourceLanguage("fr")

  • Spark NLP adopts an original way of storing annotations
  • Spark NLP creates columns for annotations
  • Spark NLP stores annotations in Spark dataframes
  • Annotators are
    • Transformers
    • Estimators
    • Models
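
A minimal sketch of how these pieces fit together: a Spark ML Pipeline is itself an Estimator, and its fit() method returns a PipelineModel, i.e. a Transformer. This anticipates the pipelines built below and reuses the objects defined above.

from pyspark.ml import Pipeline

# DocumentAssembler and DateMatcher are Transformers: they expose .transform()
pipeline = Pipeline().setStages([document_assembler, date_matcher])

# Fitting the (Estimator) pipeline yields a PipelineModel, itself a Transformer
model = pipeline.fit(df)
model.transform(df).select("date").show(10, False)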

Transformation/Action

assembled = ( 
  document_assembler.transform(df)
)
(
 date_matcher
  .transform(assembled)
  .select('date')
  .show(10, False)
)

More

fr_articles.append(("Nous nous sommes rencontrés le 13/05/2018 puis le 18/05/2020.",))

fr_articles.append(("Nous nous sommes rencontrés il y a 2 jours et il m'a dit qu'il nous rendrait visite la semaine prochaine.",))
df = spark.createDataFrame(
  data=fr_articles, 
  schema=articles_cols)

df.printSchema()
df.show()
assembled = ( 
  document_assembler.transform(df)
)
(
 date_matcher
  .transform(assembled)
  .select('date')
  .show(10, False)
)

Another annotator

MultiDateMatcher extracts every date occurring in the document, whereas DateMatcher returns at most one match per document.

date_matcher_bis = MultiDateMatcher() \
            .setInputCols(['document']) \
            .setOutputCol("date") \
            .setOutputFormat("MM/dd/yyyy") \
            .setSourceLanguage("fr")
(
  date_matcher_bis
    .transform(assembled)
    .select("date")
    .show(10, False)
)

Spark NLP Design

Spark SQL and DataFrames

Spark MLlib, Transformers and Estimators

Spark NLP Pipelines

Getting a corpus: ETL

import re
from datetime import date
from pathlib import Path

from pyspark.sql.types import (
    StructType, StructField, StringType, DateType)

# Header lines look like: URL: http://www.nytimes.com/YYYY/MM/DD/<topic path>/<title>
pattern = 'URL: http://www.nytimes.com/(?P<zedate>[0-9]{4}/[0-9]{2}/[0-9]{2})/.*'
title = 'URL: http://www.nytimes.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/(.*)'
reg_date = re.compile(pattern)
reg_title = re.compile(title)

nypath = Path('../data/nytimes_news_articles.txt')
corpus_list = list()
with open(nypath, encoding='UTF-8') as fd:
    doc, document = None, None
    while l := fd.readline():        
        if m := reg_date.match(l):
            if doc is not None:
                corpus_list.append((*document, doc))
                doc, document = None, None
            ymd = date(*[int(n) for n in m.groups()[0].split('/')])
            title = (
                reg_title.match(l)
                  .groups()[0]
                  .split('/')
            )
            document =  (ymd, title[-1], '/'.join(title[:-1]))
            doc = ''
        else: doc = doc + l
    else:
        if doc is not None:
            corpus_list.append((*document, doc))
df_texts = spark.createDataFrame(corpus_list,
                      schema= StructType([
    StructField('date', DateType(), False),
    StructField('title', StringType(), False),
    StructField('topic', StringType(), False),
    StructField('text', StringType(), True)
]))
df_texts.printSchema()
df_texts.count()

Saving

Locally

df_texts.write.parquet('../data/ny_corpus_pq')
spam = spark.read.parquet('../data/ny_corpus_pq')

spam.printSchema()
spam.rdd.getNumPartitions()

corpus_assembled = ( 
  document_assembler.transform(df_texts)
)
corpus_assembled.printSchema()
(
  date_matcher_bis
    .transform(corpus_assembled)
    .select("title", "date")
    .show(10, False)
)

Warning

Extracted dates should be taken with a grain of salt

Public pipelines

from sparknlp.pretrained import PretrainedPipeline
explain_document_pipeline = PretrainedPipeline("explain_document_ml")
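
A hedged usage sketch: a pretrained pipeline can annotate a single string (annotate) or a Spark dataframe with a text column (transform); the example string is made up.

# Annotate a plain string: returns a dict of output columns
result = explain_document_pipeline.annotate(
    "Spark NLP ships ready-made, pretrained pipelines."
)
print(result.keys())   # e.g. document, sentence, token, spell, lemmas, stems, pos

# Or apply it to a whole dataframe such as df_texts
annotated_df = explain_document_pipeline.transform(df_texts)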

Chaining annotators

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
regexTokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")
finisher = Finisher() \
    .setInputCols(["token"]) \
    .setIncludeMetadata(True)
pipeline = Pipeline().setStages([
    document_assembler,
    sentenceDetector,
    regexTokenizer,
    finisher
])

Fitting and transforming

spam = ( 
  pipeline.fit(df_texts)
    .transform(df_texts)
    .select("finished_token")
    .collect()
)

A customized pipeline

stemmer = (
  Stemmer()
    .setInputCols(['token'])
    .setOutputCol('stem')
)
lemmatizer = (
  LemmatizerModel.pretrained()
    .setInputCols(['token'])
    .setOutputCol('lemma')
)

Warning

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
Download done! Loading the resource.
[OK!]
posTagger = PerceptronModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

Warning

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
Download done! Loading the resource.
[OK!]
finisher = (
  Finisher()
    .setInputCols([
      'token', 
#      'stem', 
#      'lemma', 
      'pos'])
    .setIncludeMetadata(False)
    .setOutputAsArray(True)
)

pipeline = (
  Pipeline()
    .setStages([
      document_assembler,
      sentenceDetector,
      regexTokenizer,
      posTagger, 
      finisher
    ])
)
spam = ( 
  pipeline.fit(df_texts)
    .transform(df_texts)
    .selectExpr("*")
    .collect()
)

Spark NLP and feature engineering

Topic modelling

TF-IDF

Latent Dirichlet Allocation
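
A hedged sketch of this feature-engineering step: the token arrays produced by the Finisher (column finished_token) can feed Spark ML's CountVectorizer and IDF for TF-IDF, and LDA for topic modelling. The numeric parameters below are illustrative assumptions.

from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml.clustering import LDA

tokens_df = pipeline.fit(df_texts).transform(df_texts)   # finished_token: array<string>

# Term frequencies
cv_model = CountVectorizer(inputCol="finished_token", outputCol="tf",
                           vocabSize=20_000, minDF=5).fit(tokens_df)
tf_df = cv_model.transform(tokens_df)

# Inverse document frequencies -> TF-IDF
tfidf_df = IDF(inputCol="tf", outputCol="tfidf").fit(tf_df).transform(tf_df)

# LDA works on the raw term counts
lda_model = LDA(featuresCol="tf", k=10, maxIter=20).fit(tf_df)
lda_model.describeTopics(5).show(truncate=False)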

Distributed computations

Execution modes

  • standalone
  • client
  • cluster

Spark NLP and composite types in Spark Dataframes


>>> result = documentAssembler.transform(data)
>>> result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
>>> result.select("document").printSchema()
root
|-- document: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- annotatorType: string (nullable = true)
|    |    |-- begin: integer (nullable = false)
|    |    |-- end: integer (nullable = false)
|    |    |-- result: string (nullable = true)
|    |    |-- metadata: map (nullable = true)
|    |    |    |-- key: string
|    |    |    |-- value: string (valueContainsNull = true)
|    |    |-- embeddings: array (nullable = true)
|    |    |    |-- element: float (containsNull = false)

Column document is of type ArrayType(). Its element type is StructType(); each element contains subfields of primitive types, but also a metadata field of type MapType() and an embeddings field of type ArrayType().
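
A hedged sketch of how such nested columns can be queried with Spark SQL functions, following the schema above (result is the dataframe from the snippet):

from pyspark.sql import functions as F

(result
   .select(F.explode("document").alias("ann"))        # one row per annotation struct
   .select("ann.begin", "ann.end", "ann.result",
           F.col("ann.metadata").getItem("sentence").alias("sentence_id"))
   .show(truncate=False))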