Technologies Big Data Master MIDS/MFA/LOGOS
2025-01-17
Spark NLP provides an example of an application in the Apache Spark ecosystem.
Spark NLP relies on Spark SQL and Spark DataFrames (the high-level APIs) as well as on Spark MLlib.
Spark NLP borrows ideas from existing NLP software and adapts the known techniques to the Spark principles.
…
NLP deals with many applications of machine learning.
Functionality libraries: NLTK (nltk.org)
Annotation libraries: spaCy (spaCy's site)
A Databricks notebook discusses possible interactions between spaCy and Spark on a use case: spark.sql(), numpy arrays, nlp.pipe(). It gives no hint at parallelizing spaCy's annotation process.
spaCy v2 now fully supports the Pickle protocol, making it easy to use spaCy with Apache Spark.
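As an illustration, a pickled spaCy pipeline can be broadcast to the executors and applied inside a UDF. A minimal sketch, assuming an existing SparkSession spark, a DataFrame df with a text column, and spaCy's small English model installed (all three are assumptions, not part of the original notebook):

import spacy
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Load the pipeline once on the driver; broadcasting it relies on
# spaCy's pickle support to ship one copy to each executor.
nlp = spacy.load("en_core_web_sm")
bc_nlp = spark.sparkContext.broadcast(nlp)

@udf(ArrayType(StringType()))
def tokenize(text):
    # Runs on the executors against the broadcast pipeline.
    return [tok.text for tok in bc_nlp.value(text)]

df_tokens = df.withColumn("tokens", tokenize("text"))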
Starting a sparknlp session. Assuming standalone mode on a laptop: the master runs on localhost.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Spark NLP")
    # .master("spark://localhost:7077")  # uncomment to target a standalone cluster
    .config("spark.driver.memory", "16G")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryoserializer.buffer.max", "2000M")
    .config("spark.driver.maxResultSize", "0")
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3")
    .getOrCreate()
)
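The Kryo serializer and the large buffer mirror the configuration recommended in the Spark NLP documentation; the commented .master(...) line would attach the session to an explicit standalone cluster instead of running in local mode.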
fr_articles = [
("Le dimanche 11 juillet 2021, Chiellini a utilisé le mot Kiricocho lorsque Saka s'est approché du ballon pour le penalty.",),
("La prochaine Coupe du monde aura lieu en novembre 2022.",),
("À Noël 800, Charlemagne se fit couronner empereur à Rome.",),
("Le Marathon de Paris a lieu le premier dimanche d'avril 2024",)
]
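These rows can be loaded into a single-column DataFrame, the expected input of the annotation pipeline. A minimal sketch (the column name text and the DataFrame name df are conventional choices, not from the original):

df = spark.createDataFrame(fr_articles).toDF("text")
df.show(truncate=40)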
Column document contains the text to be annotated as well as some possible metadata. It is the starting point of any annotation process.
Spark NLP relies on Spark SQL for storing and moving data.
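A minimal sketch of how the document column is typically produced, with Spark NLP's DocumentAssembler (the entry point of a Spark NLP pipeline), as used in the transcript further below:

from sparknlp.base import DocumentAssembler

documentAssembler = (
    DocumentAssembler()
    .setInputCol("text")       # column holding the raw text
    .setOutputCol("document")  # column receiving the annotations
)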
from datetime import date

# nypath, reg_date, reg_title and corpus_list are defined earlier;
# reg_date matches the date header of an article, reg_title its title line.
# The file is assumed to start with a date header.
with open(nypath, encoding='UTF-8') as fd:
    doc, document = None, None
    while l := fd.readline():
        if m := reg_date.match(l):
            # A new date header starts a new article: flush the previous one.
            if doc is not None:
                corpus_list.append((*document, doc))
                doc, document = None, None
            ymd = date(*[int(n) for n in m.groups()[0].split('/')])
            title = (
                reg_title.match(l)
                .groups()[0]
                .split('/')
            )
            document = (ymd, title[-1], '/'.join(title[:-1]))
            doc = ''
        else:
            doc = doc + l
    # End of file: flush the last article.
    if doc is not None:
        corpus_list.append((*document, doc))
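The accumulated tuples can then be handed over to Spark SQL. A sketch, with column names of my own choosing for the four fields built above:

corpus_df = spark.createDataFrame(
    corpus_list,
    ["date", "title", "path", "text"],
)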
Warning
Extracted dates should be taken with a grain of salt.
Warning
pos_anc download started this may take some time.
Approximate size to download 3.9 MB
Download done! Loading the resource.
[OK!]
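A log like this one is printed when a pretrained model is fetched from the Spark NLP repository; for instance, the pos_anc English part-of-speech tagger can be loaded as follows (a sketch of the pretrained API; the variable name is mine):

from sparknlp.annotator import PerceptronModel

pos_tagger = (
    PerceptronModel.pretrained("pos_anc")   # triggers the download logged above
    .setInputCols(["document", "token"])
    .setOutputCol("pos")
)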
>>> result = documentAssembler.transform(data)
>>> result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
>>> result.select("document").printSchema()
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
Column document is of type ArrayType(). The element type of the document column is a StructType(): each element contains subfields of primitive types, but also a field of type map (MapType()) and a field of type array (ArrayType()), the embeddings.
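These nested fields can be queried with ordinary Spark SQL functions. A minimal sketch flattening the annotations of the document column:

from pyspark.sql.functions import col, explode

(result
 .select(explode(col("document")).alias("ann"))
 .select("ann.begin", "ann.end", "ann.result", "ann.metadata")
 .show(truncate=False))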