Data Science in the Wild

Lecture 11: In-memory Parallel Processing in Spark

Eran Toch

Data Science in the Wild, Spring 2019

The Scale of Big Data

Agenda

1. Spark
2. Spark DataFrames
3. Spark SQL
4. Machine Learning on Spark
5. ML Pipelines

Spark

Technological Architecture

[Diagram: layered architecture. Storage: Hadoop Distributed File System (HDFS). Processing: MapReduce / YARN. On top: In Memory Data Flow, Data Warehouse, NoSQL, Scripting (Pig).]
