Technologies Big Data Master MIDS/MFA/LOGOS
2026-02-17
Spark NLP is an example of an application built within the Apache Spark ecosystem.
Spark NLP relies on the Spark SQL library and Spark DataFrames (high-level APIs), and also on Spark MLlib.
Spark NLP borrows ideas from existing NLP software and adapts known techniques to Spark principles.
…
NLP deals with many applications of machine learning
Functionality libraries: nltk.org
Annotation libraries: spaCy’s site
A Databricks notebook discusses possible interactions between spaCy and Spark on a use case:
spark.sql()
numpy array
nlp.pipe()
No hint at parallelizing spaCy’s annotation process.
spaCy v2 now fully supports the Pickle protocol, making it easy to use spaCy with Apache Spark.
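Why Pickle support matters here: PySpark moves Python objects to worker processes by serializing them, so a model has to survive a pickle round trip before it can be applied partition by partition. The sketch below uses only the standard library. The names `toy_model`, `annotate` and `annotate_partition` are hypothetical stand-ins; they do not reproduce spaCy's or Spark's API.

```python
import pickle

# Hypothetical stand-in for a loaded model: plain data that can be pickled.
toy_model = {"stopwords": {"the", "a"}}


def annotate(model, text):
    """Toy annotation: lowercase tokens minus stopwords."""
    return [w for w in text.lower().split() if w not in model["stopwords"]]


def annotate_partition(payload, texts):
    """What a worker would do: rebuild the model once, reuse it for each text."""
    model = pickle.loads(payload)
    return [annotate(model, t) for t in texts]


payload = pickle.dumps(toy_model)  # serialized once, on the driver side
partitions = [["The cat sat"], ["A dog barked", "The end"]]
results = [annotate_partition(payload, part) for part in partitions]
print(results)
```

Deserializing once per partition, rather than once per text, is the usual motivation for working at partition level when the model is expensive to rebuild.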
sparknlp session
Assuming standalone mode on a laptop. The master runs on localhost.
from pyspark.sql import SparkSession

# Local session on all cores, with the Spark NLP package pulled from Maven
spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:6.3.2") \
    .getOrCreate()
…
SparkSession - in-memory
fr_articles = [
    ("Le dimanche 11 juillet 2021, Chiellini a utilisé le mot Kiricocho lorsque Saka s'est approché du ballon pour le penalty.",),
    ("La prochaine Coupe du monde aura lieu en novembre 2022.",),
    ("À Noël 800, Charlemagne se fit couronner empereur à Rome.",),
    ("Le Marathon de Paris a lieu le premier dimanche d'avril 2024",)
]
Column document contains the ‘text’ to be annotated as well as some possible metadata.
It is the starting point of any annotation process.
Spark NLP relies on Spark SQL for storing and moving data.
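The code that produced the date column shown next is not part of this text. A minimal sketch of how such a column might be obtained, assuming the `DocumentAssembler` and `DateMatcher` annotators and an `MM/dd/yyyy` output format (both are assumptions, as is the function name `build_date_pipeline`):

```python
def build_date_pipeline():
    """Return an unfitted pipeline: raw text -> document -> date annotations."""
    # Imports are local so that the sketch can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import DateMatcher
    from sparknlp.base import DocumentAssembler

    # Wraps the raw 'text' column into a 'document' annotation column.
    document_assembler = (
        DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")
    )
    # Looks for date expressions in each document (assumed configuration).
    date_matcher = (
        DateMatcher()
        .setInputCols(["document"])
        .setOutputCol("date")
        .setOutputFormat("MM/dd/yyyy")
    )
    return Pipeline(stages=[document_assembler, date_matcher])


# Usage, assuming `spark` and `fr_articles` are defined as above:
# df = spark.createDataFrame(fr_articles, ["text"])
# build_date_pipeline().fit(df).transform(df).select("date").show(truncate=False)
```

Both stages read and write ordinary DataFrame columns, which is how the annotation process rests on Spark SQL.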
+-------------------------------------------------+
|date |
+-------------------------------------------------+
|[{date, 10, 21, 07/11/2021, {sentence -> 0}, []}]|
|[{date, 41, 53, 11/01/2022, {sentence -> 0}, []}]|
|[] |
|[{date, 3, 60, 01/01/2024, {sentence -> 0}, []}] |
+-------------------------------------------------+
root
|-- text: string (nullable = true)
+--------------------+
| text|
+--------------------+
|Le dimanche 11 ju...|
|La prochaine Coup...|
|À Noël 800, Charl...|
|Le Marathon de Pa...|
|Nous nous sommes ...|
|Nous nous sommes ...|
+--------------------+
+-------------------------------------------------+
|date |
+-------------------------------------------------+
|[{date, 10, 21, 07/11/2021, {sentence -> 0}, []}]|
|[{date, 41, 53, 11/01/2022, {sentence -> 0}, []}]|
|[] |
|[{date, 3, 60, 01/01/2024, {sentence -> 0}, []}] |
|[{date, 31, 40, 05/13/2018, {sentence -> 0}, []}]|
|[{date, 84, 92, 02/24/2026, {sentence -> 0}, []}]|
+-------------------------------------------------+
+--------------------------------------------------------------------------------------------------+
|date |
+--------------------------------------------------------------------------------------------------+
|[{date, 11, 22, 07/11/2021, {sentence -> 0}, []}] |
|[{date, 41, 53, 11/01/2022, {sentence -> 0}, []}] |
|[] |
|[{date, 53, 62, 04/01/2024, {sentence -> 0}, []}] |
|[{date, 31, 40, 05/13/2018, {sentence -> 0}, []}, {date, 50, 59, 05/18/2020, {sentence -> 0}, []}]|
|[{date, 28, 37, 02/15/2026, {sentence -> 0}, []}, {date, 80, 88, 02/24/2026, {sentence -> 0}, []}]|
+--------------------------------------------------------------------------------------------------+
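Each annotation in the output above carries a begin offset, an end offset and a result string. A small sketch for turning the result strings into Python dates, assuming they follow the `MM/dd/yyyy` pattern (an inference from values such as 05/13/2018; `parse_result` is a hypothetical helper):

```python
from datetime import date, datetime


def parse_result(result: str) -> date:
    """Parse a date annotation result, assumed to be formatted as MM/dd/yyyy."""
    return datetime.strptime(result, "%m/%d/%Y").date()


# (begin, end, result) triples copied from one row of the output above
annotations = [(31, 40, "05/13/2018"), (50, 59, "05/18/2020")]
parsed = [parse_result(result) for _, _, result in annotations]
print(parsed[0].isoformat())  # 2018-05-13
```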
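from datetime import date

# reg_date, reg_title, nypath and corpus_list are defined elsewhere
with open(nypath, encoding='UTF-8') as fd:
    doc, document = None, None
    while line := fd.readline():
        if m := reg_date.match(line):
            # A header line starts a new article: flush the previous one
            if doc is not None:
                corpus_list.append((*document, doc))
                doc, document = None, None
            ymd = date(*[int(n) for n in m.groups()[0].split('/')])
            title = (
                reg_title.match(line)
                .groups()[0]
                .split('/')
            )
            # (date, title, topic): the last path component is the title
            document = (ymd, title[-1], '/'.join(title[:-1]))
            doc = ''
        elif doc is not None:
            # Body line: append it to the current article
            doc = doc + line
    else:
        # End of file: flush the last article
        if doc is not None:
            corpus_list.append((*document, doc))
root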
|-- date: date (nullable = false)
|-- title: string (nullable = false)
|-- topic: string (nullable = false)
|-- text: string (nullable = true)
8888
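The parsing loop above can be tried on an in-memory sample. In the sketch below, the header format (a `URL:` line whose path contains `yyyy/mm/dd/topic/.../title`), the two regular expressions and the function name `parse_corpus` are assumptions made for illustration; the original `reg_date` and `reg_title` are not shown in this text.

```python
import io
import re
from datetime import date

# Hypothetical header format, e.g.
#   URL: http://example.org/2016/06/30/sports/baseball/first-title.html
REG_DATE = re.compile(r"URL: \S*?/(\d{4}/\d{2}/\d{2})/")
REG_TITLE = re.compile(r"URL: \S*?/\d{4}/\d{2}/\d{2}/(\S+)")


def parse_corpus(fd):
    """Return a list of (date, title, topic, text) tuples, one per article."""
    corpus, meta, body = [], None, None
    for line in fd:
        if m := REG_DATE.match(line):
            if body is not None:          # flush the previous article
                corpus.append((*meta, body))
            ymd = date(*(int(n) for n in m.group(1).split("/")))
            parts = REG_TITLE.match(line).group(1).split("/")
            meta = (ymd, parts[-1], "/".join(parts[:-1]))
            body = ""
        elif body is not None:            # body line of the current article
            body += line
    if body is not None:                  # flush the last article
        corpus.append((*meta, body))
    return corpus


sample = io.StringIO(
    "URL: http://example.org/2016/06/30/sports/baseball/first-title.html\n"
    "First body line.\n"
    "URL: http://example.org/2016/07/01/nyregion/second-title.html\n"
    "Second body line.\n"
)
corpus = parse_corpus(sample)
print(len(corpus))  # 2
```

The resulting tuples have the same field order as the schema above (date, title, topic, text).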
Locally
root
|-- date: date (nullable = true)
|-- title: string (nullable = true)
|-- topic: string (nullable = true)
|-- text: string (nullable = true)
20
root
|-- date: date (nullable = false)
|-- title: string (nullable = false)
|-- topic: string (nullable = false)
|-- text: string (nullable = true)
|-- document: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
+-----------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|title |date |
+-----------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|washington-nationals-max-scherzer-baffles-mets-completing-a-sweep.html |[{date, 478, 1690, 01/04/2026, {sentence -> 0}, []}] |
|mayor-de-blasios-counsel-to-leave-next-month-to-lead-police-review-board.html|[{date, 1, 670, 05/01/2014, {sentence -> 0}, []}, {date, 82, 91, 03/17/2026, {sentence -> 0}, []}] |
|three-men-charged-in-killing-of-cuomo-administration-lawyer.html |[{date, 81, 547, 08/08/2026, {sentence -> 0}, []}, {date, 42, 50, 02/17/2025, {sentence -> 0}, []}, {date, 1261, 1268, 02/16/2026, {sentence -> 0}, []}] |
|tekserve-precursor-to-the-apple-store-to-close-after-29-years.html |[{date, 1699, 1710, 02/17/2011, {sentence -> 0}, []}, {date, 257, 1549, 07/23/1987, {sentence -> 0}, []}] |
|once-at-michael-phelpss-feet-and-still-chasing-them.html |[{date, 1101, 1111, 02/17/2026, {sentence -> 0}, []}, {date, 204, 3658, 08/14/2004, {sentence -> 0}, []}, {date, 2985, 2993, 02/17/2025, {sentence -> 0}, []}]|
|missy-franklin-breaks-through-in-trials-and-earns-a-return-to-olympics.html |[{date, 369, 1462, 12/21/2012, {sentence -> 0}, []}] |
|lionsgate-is-said-to-be-near-deal-to-buy-starz.html |[{date, 290, 2569, 12/28/2026, {sentence -> 0}, []}, {date, 1802, 1810, 02/17/2025, {sentence -> 0}, []}] |
|pool-rules-no-running-no-eating-or-drinking-no-men.html |[{date, 274, 1351, 12/20/1922, {sentence -> 0}, []}] |
|knicks-look-to-young-blood-and-free-agency-to-patch-porous-roster.html |[{date, 473, 3111, 05/30/2014, {sentence -> 0}, []}, {date, 1090, 1098, 02/10/2026, {sentence -> 0}, []}] |
|latest-sign-of-change-in-harlem-its-congressional-candidate.html |[{date, 389, 1270, 12/01/2026, {sentence -> 0}, []}] |
+-----------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows
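Warning
Extracted dates should be taken with a grain of salt: several matches span more than a thousand characters, and some results (such as 02/17/2026) coincide with the date of this document, which suggests that relative expressions were resolved against the run date.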
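One simple way to act on this warning is to flag annotations whose character span is too long to be a plausible date expression. The sketch below is only an illustration: the threshold `MAX_SPAN` and the helper `is_suspicious` are hypothetical, and the sample rows are copied from the table above.

```python
MAX_SPAN = 40  # hypothetical upper bound on the length of a date expression


def is_suspicious(begin: int, end: int) -> bool:
    """Flag a match whose span (end offset inclusive) exceeds MAX_SPAN characters."""
    return end - begin + 1 > MAX_SPAN


# (begin, end, result) triples copied from the table above
rows = [
    (478, 1690, "01/04/2026"),
    (82, 91, "03/17/2026"),
    (1699, 1710, "02/17/2011"),
]
flagged = [row for row in rows if is_suspicious(row[0], row[1])]
print(flagged)  # [(478, 1690, '01/04/2026')]
```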
explain_document_ml download started this may take some time.
Approximate size to download 9 MB
Download done! Loading the resource.
[OK!]
lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
Download done! Loading the resource.
[OK!]
pos_anc download started this may take some time.
Approximate size to download 3.9 MB
Download done! Loading the resource.
[OK!]
Warning
>>> result = documentAssembler.transform(data)
>>> result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
>>> result.select("document").printSchema()
root
|-- document: array (nullable = True)
| |-- element: struct (containsNull = True)
| | |-- annotatorType: string (nullable = True)
| | |-- begin: integer (nullable = False)
| | |-- end: integer (nullable = False)
| | |-- result: string (nullable = True)
| | |-- metadata: map (nullable = True)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = True)
| | |-- embeddings: array (nullable = True)
| | | |-- element: float (containsNull = False)
Column document is of type ArrayType(). Its element type is StructType(): the struct contains subfields of primitive type, but also a field of type MapType() (metadata) and a field of type ArrayType() (embeddings).
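To make the nesting concrete, the schema can be mirrored with plain Python types. This is a sketch only: the `Annotation` dataclass and `assemble_document` are hypothetical illustrations of the structure, not Spark NLP's own classes.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One element of the array: the fields follow the printed schema."""
    annotatorType: str
    begin: int
    end: int
    result: str
    metadata: dict = field(default_factory=dict)    # map of string -> string
    embeddings: list = field(default_factory=list)  # array of floats


def assemble_document(text: str) -> list:
    """Wrap raw text in a single 'document' annotation (end offset is inclusive)."""
    return [Annotation("document", 0, len(text) - 1, text, {"sentence": "0"})]


doc = assemble_document("Spark NLP is an open-source text processing library.")
print(doc[0].begin, doc[0].end)  # 0 51
```

The offsets agree with the example row above: the text has 52 characters, so the inclusive end offset is 51.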
IFEBY030 – Technos Big Data – M1 MIDS/MFA/LOGOS – UParis Cité