Big data technologies

Big Data Technologies, Master MIDS/MFA/LOGOIS

Équipe de Statistique

LPSM Université Paris-Cité

2025-01-17

Course logistics

Who are we?

Course logistics

  • 24 hours = 2 hours \(\times\) 12 weeks: lectures + hands-on sessions

  • Agenda

About the hands-on

  • Hands-on sessions and homework use Jupyter/Quarto notebooks

  • Using a Docker image built for the course

  • Hands-on sessions must be carried out on your own laptop

Course logistics

  • Use the tools listed at https://s-v-b.github.io/IFEBY310/tools

Course evaluation

  • Evaluation based on homework assignments and a final project

  • Find a friend: all work done by pairs of students

  • All your work goes in your private git repository and nowhere else: no emails!

  • All your homework will be submitted as Quarto files

Docker

Why Docker? What is it?

  • Don’t mess up your Python environment and configuration files
  • Everything is embedded in a container (lighter-weight than a virtual machine)
  • A container is an instance of an image
  • Same image = same environment for everybody
  • Same image = no {version, dependencies, install} problems
  • It is an enterprise standard, used everywhere now!


And that’s it for logistics!

Big data

Big data

  • Moore’s Law: computing power doubled every two years between 1975 and 2012

  • Nowadays, doubling takes closer to two and a half years

  • Rapid growth of datasets: Internet activity, social networks, genomics, physics, sensor networks, IoT, …

  • Data size trends: data volume doubles roughly every year, according to an IDC executive summary

  • Data deluge: Today, data is growing faster than computing power

Question

  • How do we catch up to process the data deluge and learn from it?

Orders of magnitude

bit

A bit is a value of either 1 or 0 (on or off)

byte (B)

A byte is made of 8 bits

  • 1 character, e.g., “a”, is one byte

Kilobyte (KB)

A kilobyte is \(1024 =2^{10}\) bytes

  • 2 or 3 paragraphs of ASCII text

Some more comparisons

Megabyte (MB)

A megabyte is \(1\,048\,576 = 2^{20}\) B or \(1\,024\) KB

  • 873 pages of plain text
  • 4 books (200 pages or 240 000 characters)

Gigabyte (GB)

A gigabyte is \(1\,073\,741\,824 = 2^{30}\) B, \(1\,024\) MB or \(1\,048\,576\) KB

  • 894 784 pages of plain text (1 200 characters)
  • 4 473 books (200 pages or 240 000 characters)
  • 640 web pages (with 1.6 MB average file size)
  • 341 digital pictures (with 3 MB average file size)
  • 256 MP3 audio files (with 4 MB average file size)
  • 1.5 CDs (650 MB each)

Even more

Terabyte (TB)

A terabyte is \(1\,099\,511\,627\,776 = 2^{40}\) B, \(1\,024\) GB or \(1\,048\,576\) MB.

  • 916 259 689 pages of plain text (1 200 characters)
  • 4 581 298 books (200 pages or 240 000 characters)
  • 655 360 web pages (with 1.6 MB average file size)
  • 349 525 digital pictures (with 3 MB average file size)
  • 262 144 MP3 audio files (with 4 MB average file size)
  • 1 613 CDs (650 MB each)
  • 233 DVDs (4.38 GB each)
  • 40 Blu-ray discs (25 GB each)

The deluge

Petabyte (PB)

A petabyte is 1 024 TB, 1 048 576 GB or 1 073 741 824 MB

\[1\,125\,899\,906\,842\,624 = 2^{50} \quad\text{bytes}\]

  • 938 249 922 368 pages of plain text (1 200 characters)
  • 4 691 249 611 books (200 pages or 240 000 characters)
  • 671 088 640 web pages (with 1.6 MB average file size)
  • 357 913 941 digital pictures (with 3 MB average file size)
  • 268 435 456 MP3 audio files (with 4 MB average file size)
  • 1 651 910 CDs (650 MB each)
  • 239 400 DVDs (4.38 GB each)
  • 41 943 Blu-ray discs (25 GB each)

Exabyte, etc.

  • 1 EB = 1 exabyte = 1 024 PB
  • 1 ZB = 1 zettabyte = 1 024 EB
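To keep these binary units straight, here is a tiny Python helper (a sketch of ours, not part of the course material) that converts a raw byte count into the largest convenient unit:

```python
# Binary (base-2) storage units: each step is a factor of 2**10 = 1024.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def humanize(n_bytes: float) -> str:
    """Express a byte count in the largest unit that keeps the value above 1."""
    value = float(n_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024

print(humanize(2**50))       # 1.00 PB
print(humanize(3 * 10**12))  # a "3 TB" drive is about 2.73 TB in binary units
```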

Some figures

Every single second:

  • At least 8,000 tweets sent

  • 900+ photos posted on Instagram

  • Thousands of Skype calls made

  • Over 70,000 Google searches performed

  • Around 80,000 YouTube videos viewed

  • Over 2 million emails sent

Some figures

There are:

  • 5 billion web pages as of mid-2019 (indexed web)

and we expected:

  • 4.8 ZB of annual IP traffic in 2022

Note that

  • 1 ZB \(\approx\) 36 000 years of HD video
  • Netflix’s entire catalog is \(\approx\) 3.5 years of HD video

Some figures

More figures:

  • Facebook daily logs: 60 TB

  • 1000 Genomes Project: 200 TB

  • Google web index: 10+ PB

  • Cost of 1 TB of storage: ~$35

  • Time to read 1 TB from disk: ~3 hours at 100 MB/s
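The last bullet is just arithmetic; here is a quick back-of-the-envelope check (the 100 MB/s figure is the one above, the 100-machine scenario is a made-up example):

```python
TB = 10**12  # one (decimal) terabyte, in bytes

def scan_hours(size_bytes: float, throughput_mb_per_s: float) -> float:
    """Hours needed to read `size_bytes` sequentially at `throughput_mb_per_s` MB/s."""
    return size_bytes / (throughput_mb_per_s * 10**6) / 3600

print(f"{scan_hours(TB, 100):.1f} h")        # ~2.8 h on a single disk at 100 MB/s
print(f"{scan_hours(TB / 100, 100):.3f} h")  # ~0.028 h (under 2 minutes) if 100 disks each scan 1/100 of the data
```

This is the whole argument for distributing data across many machines: the only way to scan faster is to scan in parallel.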

Latency numbers

| Operation                           | Latency (ns)   | Latency (µs) | Latency (ms) | Comparison                 |
|-------------------------------------|----------------|--------------|--------------|----------------------------|
| L1 cache reference                  | 0.5 ns         |              |              |                            |
| L2 cache reference                  | 7 ns           |              |              | 14x L1 cache               |
| Main memory reference               | 100 ns         |              |              | 20x L2, 200x L1            |
| Compress 1 KB with Zippy/Snappy     | 3,000 ns       | 3 µs         |              |                            |
| Send 1 KB over 1 Gbps network       | 10,000 ns      | 10 µs        |              |                            |
| Read 4 KB randomly from SSD*        | 150,000 ns     | 150 µs       |              | ~1 GB/s SSD                |
| Read 1 MB sequentially from memory  | 250,000 ns     | 250 µs       |              |                            |
| Round trip within same datacenter   | 500,000 ns     | 500 µs       |              |                            |
| Read 1 MB sequentially from SSD*    | 1,000,000 ns   | 1,000 µs     | 1 ms         | ~1 GB/s SSD, 4x memory     |
| Disk seek                           | 10,000,000 ns  | 10,000 µs    | 10 ms        | 20x datacenter round trip  |
| Read 1 MB sequentially from disk    | 20,000,000 ns  | 20,000 µs    | 20 ms        | 80x memory, 20x SSD        |
| Send packet US -> Europe -> US      | 150,000,000 ns | 150,000 µs   | 150 ms       | 600x memory                |

traceroute to mathscinet.ams.org (104.238.176.204), 64 hops max
  1   192.168.10.1  3,149ms  1,532ms  1,216ms 
  2   192.168.0.254  1,623ms  1,397ms  1,309ms 
  3   78.196.1.254  2,571ms  2,120ms  2,371ms 
  4   78.255.140.126  2,813ms  2,621ms  2,200ms 
  5   78.254.243.86  2,626ms  2,528ms  2,517ms 
  6   78.254.253.42  2,517ms  4,129ms  2,671ms 
  7   78.254.242.54  2,535ms  2,258ms  2,350ms 
  8   *  *  * 
  9   195.66.224.191  12,231ms  11,718ms  12,486ms 
 10   *  *  * 
 11   63.218.14.58  26,213ms  19,264ms  18,949ms 
 12   63.218.231.106  29,135ms  22,078ms  17,954ms

Latency numbers

  • Reading 1 MB from disk ≈ 80 × reading 1 MB from memory

  • Sending a packet from the US to Europe and back ≈ 1 500 000 × a main memory reference
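These ratios can be read directly off the table above; a two-line check in Python (variable names are ours):

```python
# All values in nanoseconds, taken from the latency table above.
read_1mb_memory = 250_000
read_1mb_disk = 20_000_000
main_memory_ref = 100
packet_us_eu_us = 150_000_000

print(read_1mb_disk / read_1mb_memory)    # 80.0 -> a disk read costs ~80 memory reads
print(packet_us_eu_us / main_memory_ref)  # 1500000.0 -> ~1.5 million memory references
```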

General tendency

Generally true, though not always:

  • memory operations : fastest

  • disk operations : slow

  • network operations : slowest

Latency numbers: https://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html

Latency numbers for mortals

Multiply all durations by a billion \(10^9\)

| Operation                           | Latency (scaled) | Human duration                                       |
|-------------------------------------|------------------|------------------------------------------------------|
| L1 cache reference                  | 0.5 s            | One heart beat (0.5 s)                               |
| L2 cache reference                  | 7 s              | Long yawn                                            |
| Main memory reference               | 100 s            | Brushing your teeth                                  |
| Send 2 KB over 1 Gbps network       | 5.5 hr           | From lunch to end of work day                        |
| SSD random read                     | 1.7 days         | A normal weekend                                     |
| Read 1 MB sequentially from memory  | 2.9 days         | A long weekend                                       |
| Round trip within same datacenter   | 5.8 days         | A medium vacation                                    |
| Read 1 MB sequentially from SSD     | 11.6 days        | Waiting for almost 2 weeks for a delivery            |
| Disk seek                           | 16.5 weeks       | A semester in university                             |
| Read 1 MB sequentially from disk    | 7.8 months       | Almost producing a new human being                   |
| Send packet US -> Europe -> US      | 4.8 years        | Average time it takes to complete a bachelor’s degree |
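The scaled column is literally the previous table with every latency multiplied by \(10^9\), i.e. nanoseconds reinterpreted as seconds. A minimal sketch of the conversion (ours):

```python
from datetime import timedelta

def human_scale(latency_ns: float) -> timedelta:
    """Multiply by 1e9: one nanosecond of real latency becomes one second of 'human' time."""
    return timedelta(seconds=latency_ns)

print(human_scale(100))         # main memory reference: 0:01:40, i.e. 100 s
print(human_scale(20_000_000))  # 1 MB sequentially from disk: ~231 days, the "7.8 months" row
```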

Challenges

Challenges with big datasets

  • Large datasets don’t fit on a single hard drive

  • One large (and expensive) machine can’t process or store all the data

  • For computations, how do we stream data from disk through the different layers of memory?

  • Concurrent access to the data: a single disk cannot be read in parallel

Solutions

  • Combine several machines containing hard drives and processors on a network

  • Use commodity hardware: cheap, common architecture, i.e. processor + RAM + disk

  • Scalability = more machines on the network

  • Partition the data across the machines
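The last bullet can be illustrated in a few lines: hash partitioning sends each record to a deterministic shard, so every machine holds roughly the same amount of data. A toy sketch (the function and the 4-machine setting are ours):

```python
from collections import defaultdict
from typing import Iterable

def partition(records: Iterable[str], n_machines: int) -> dict[int, list[str]]:
    """Route each record to one of `n_machines` shards by hashing its key."""
    shards: dict[int, list[str]] = defaultdict(list)
    for rec in records:
        shards[hash(rec) % n_machines].append(rec)  # a given record always lands on the same machine
    return shards

shards = partition([f"user-{i}" for i in range(1_000)], n_machines=4)
print({machine: len(recs) for machine, recs in sorted(shards.items())})  # ~250 records per machine
```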

Challenges

Dealing with distributed computations adds software complexity

Scheduling
How to split the work across machines? Must exploit and optimize data locality since moving data is very expensive
Reliability
How to handle failures? Commodity (cheap) hardware fails more often: at Google, 1% to 5% of hard drives and 0.2% of DIMMs fail per year
Uneven performance of machines
Some nodes are slower than others
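The reliability point is worth quantifying: at cluster scale, rare per-device failures become everyday events. A back-of-the-envelope sketch (the 10,000-node cluster is a made-up size; the rates are the ones quoted above):

```python
n_nodes = 10_000                          # hypothetical cluster, one disk and a few DIMMs per node
hd_rate_low, hd_rate_high = 0.01, 0.05    # 1% to 5% of disks fail per year
dimm_rate = 0.002                         # 0.2% of DIMMs fail per year

print(n_nodes * hd_rate_low, n_nodes * hd_rate_high)  # 100.0 500.0 dead disks per year
print(n_nodes * dimm_rate)                            # 20.0 DIMM failures per year
```

With hundreds of failures per year, fault tolerance has to be built into the software rather than handled by hand.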

Problems sketched in

Next Generation Databases describes the challenges faced by the database industry between 1995 and 2015, that is, during the onset of the data deluge

Solutions

  • Schedule, manage and coordinate threads and resources using appropriate software

  • Use locks to limit concurrent access to resources (a toy sketch follows this list)

  • Replicate data for faster reading and reliability
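As a single-machine analogue of the locking idea (a distributed system would use distributed locks or a coordination service, but the principle is the same), here is a minimal Python sketch with threads:

```python
import threading

counter = 0                 # shared resource
lock = threading.Lock()

def add_many(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:          # only one thread at a time may touch the shared counter
            counter += 1

threads = [threading.Thread(target=add_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)              # 400000: no update is lost
```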

Is it HPC?

  • High Performance Computing (HPC)

  • Parallel computing

Note

  • For HPC, scaling up means using a bigger machine
  • Huge performance increase for medium scale problems
  • Very expensive, specialized machines, lots of processors and memory

No

and

Google committed to a number of key tenets when designing its data center architecture. Most significantly —and at the time, uniquely— Google committed to massively parallelizing and distributing processing across very large numbers of commodity servers.

Google also adopted a “Jedis build their own lightsabers” attitude: very little third party —and virtually no commercial— software would be found in the Google architecture.

Build was considered better than buy at Google.

From Next Generation Databases

The Big Data universe

Many technologies combining software and cloud computing

The Big Data universe (still expanding)

Often used with/for Machine Learning (or AI)

Tools

  • Software such as HadoopMR (Hadoop MapReduce) and, more recently, Spark and Dask copes with these challenges
  • They are distributed computation engines: software that eases the development of distributed algorithms

They run on clusters (several machines on a network), managed by a resource manager such as YARN, Mesos, or Kubernetes.

A resource manager ensures that the tasks running on the cluster do not try to use the same resources all at once
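In Spark, this plumbing is mostly hidden behind the entry point: you request a session and name the cluster manager through a master URL. A minimal PySpark sketch (here local[*] simulates a cluster with your laptop's cores; on a real cluster the master would point to the resource manager, e.g. yarn or a k8s:// URL):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hello-cluster-manager")
    .master("local[*]")        # stand-in for a real resource manager URL
    .getOrCreate()
)

print(spark.sparkContext.defaultParallelism)  # how many tasks the scheduler will run in parallel
spark.stop()
```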

Apache Spark

Apache Spark

The course will focus mainly on Spark for big data processing

https://spark.apache.org

The predecessor of Spark is Hadoop

See Chapter 2 of Next Generation Databases

Guy Harrison

Hadoop

  • Hadoop has a simple API and good fault tolerance (tolerance to nodes failing midway through a processing job)

  • The cost is lots of data shuffling across the network

  • Intermediate results are written to disk and shipped over the network, which we know is very expensive in time

It is made of three components: the HDFS distributed file system, a resource manager (YARN, since Hadoop 2.0), and the MapReduce processing framework.

The Hadoop 1.0 architecture is powerful and easy to understand, but it is limited to MapReduce workloads and it provides limited flexibility with regard to scheduling and resource allocation.

In the Hadoop 2.0 architecture, YARN (Yet Another Resource Negotiator or, recursively, YARN Application Resource Negotiator) improves scalability and flexibility by splitting the roles of the Task Tracker into two processes.

A Resource Manager controls access to the cluster’s resources (memory, CPU, etc.) while the Application Manager (one per job) controls task execution.

Guy Harrison. Next Generation Database

MapReduce’s wordcount example
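The classic example fits in a few lines of pure Python: a map phase emits (word, 1) pairs, a shuffle groups the pairs by key, and a reduce phase sums the counts. This is only a single-process sketch (the function names are ours); a real Hadoop job distributes each phase across machines and reads its input from HDFS.

```python
from collections import defaultdict
from typing import Iterable

def map_phase(line: str) -> Iterable[tuple[str, int]]:
    for word in line.lower().split():
        yield word, 1                      # emit (key, value) pairs

def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)          # group values by key (done over the network in Hadoop)
    return groups

def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = (pair for line in lines for pair in map_phase(line))
print(reduce_phase(shuffle(pairs)))        # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}
```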

Spark

Advantages of Spark over HadoopMR?

  • In-memory storage: use RAM for fast iterative computations
  • Lower overhead for starting jobs
  • Simple and expressive with Scala, Python, R, Java APIs
  • Higher level libraries with SparkSQL, SparkStreaming, etc.

Disadvantages of Spark over HadoopMR?

  • Spark requires servers with more CPU and more memory
  • But still much cheaper than HPC

Spark is much faster than Hadoop

  • Hadoop uses disk and network
  • Spark tries to use memory as much as possible for operations while minimizing network use

Spark versus Hadoop

|                          | HadoopMR    | Spark                          |
|--------------------------|-------------|--------------------------------|
| Storage                  | Disk        | In-memory or disk              |
| Operations               | Map, reduce | Map, reduce, join, sample, …   |
| Execution model          | Batch       | Batch, interactive, streaming  |
| Programming environments | Java        | Scala, Java, Python, R         |

Spark and Hadoop comparison

For logistic regression training (a simple classification algorithm that requires several passes over a dataset)
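Spark wins on this workload because of caching: the training data is read once, kept in memory, and reused at every pass, whereas HadoopMR re-reads it from disk at each iteration. A hedged PySpark sketch of the pattern (the tiny inlined dataset and the plain least-squares gradient are stand-ins for a real logistic-regression job):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Toy (features, label) pairs; a real job would load them from HDFS, S3, ...
points = sc.parallelize([([1.0, 0.5], 1), ([0.2, 0.1], 0), ([0.9, 0.7], 1)] * 1000).cache()
n = points.count()                  # first action: materializes the in-memory cache

w = [0.0, 0.0]
for _ in range(10):                 # every pass reuses the cached copy of `points`
    grad = points.map(
        lambda p: [(sum(wi * xi for wi, xi in zip(w, p[0])) - p[1]) * xi for xi in p[0]]
    ).reduce(lambda a, b: [ga + gb for ga, gb in zip(a, b)])
    w = [wi - 0.1 * gi / n for wi, gi in zip(w, grad)]

print(w)                            # weights after 10 passes over the cached data
spark.stop()
```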


The Spark stack

The Spark stack

Spark can run “everywhere”

  • https://mesos.apache.org: Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with API’s for resource management and scheduling across entire datacenter and cloud environments.

  • https://kubernetes.io Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications.

Agenda, tools and references

Very tentative agenda for the course

Weeks 1, 2 and 3
The Python data-science stack for medium-scale problems

Weeks 4 and 5
Introduction to Spark and its low-level API

Weeks 6, 7 and 8
Spark’s high-level API: spark.sql. Data from different formats and sources

Week 9
Run a job on a cluster with spark-submit, monitoring, mistakes and debugging

Weeks 10, 11, 12
Introduction to Spark applications and Spark Streaming

Main tools for the course (tentative…)

Infrastructure

Python stack

Data Visualization

Main tools for the course (tentative…)

Big data processing

Data storage / formats / querying

Learning resources

Learning Resources

Above all

Data centers

Data centers

Wonder what a datacenter looks like?

Data centers

Wonder what a datacenter looks like?

Data centers

Wonder what a datacenter looks like?

Thank you !