Advanced Analytics with SQL and MLlib

Michael Armbrust @michaelarmbrust

Slides available here: spark.

What is Apache Spark?

Fast and general cluster computing system interoperable with Hadoop

Improves efficiency through:

• In-memory computing primitives
• General computation graphs

Up to 100× faster (2-10× on disk)

Improves usability through:

• Rich APIs in Scala, Java, Python
• Interactive shell

2-5× less code

A Unified Stack

[Diagram: libraries layered on top of Spark]

• Spark SQL
• Spark Streaming (real-time)
• GraphX (graph)
• MLlib (machine learning)
• ...

Why a New Programming Model?

MapReduce greatly simplified big data analysis

But once started, users wanted more:

• More complex, multi-pass analytics (e.g. ML, graph)
• More interactive ad-hoc queries
• More real-time stream processing

All 3 need faster data sharing in parallel apps

Data Sharing in MapReduce

[Diagram, iterative job: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → . . .]

[Diagram, ad-hoc queries: Input → HDFS read → query 1, query 2, query 3, . . . → result 1, result 2, result 3, . . .]

Slow due to replication, serialization, and disk IO
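
The cost pattern on this slide can be illustrated with a small sketch. This is not Spark or Hadoop code: it is a toy in plain Python, where `pickle` files in a temporary directory stand in for HDFS, and the names `step`, `run_via_disk`, `run_in_memory`, and `compare` are made up for illustration. It models only the serialization and disk round trip between passes, not replication.

```python
import os
import pickle
import tempfile


def step(values):
    """One pass of a toy iterative job: halve each value and add one."""
    return [0.5 * v + 1.0 for v in values]


def run_via_disk(data, iterations, workdir):
    """MapReduce-style sharing: each pass reads its input from storage
    and writes its output back before the next pass can start."""
    path = os.path.join(workdir, "iter_0.pkl")
    with open(path, "wb") as f:
        pickle.dump(data, f)
    bytes_moved = 0  # bytes read and written between passes
    for i in range(1, iterations + 1):
        with open(path, "rb") as f:  # stand-in for "HDFS read"
            current = pickle.load(f)
        bytes_moved += os.path.getsize(path)
        path = os.path.join(workdir, f"iter_{i}.pkl")
        with open(path, "wb") as f:  # stand-in for "HDFS write"
            pickle.dump(step(current), f)
        bytes_moved += os.path.getsize(path)
    with open(path, "rb") as f:
        return pickle.load(f), bytes_moved


def run_in_memory(data, iterations):
    """In-memory sharing: each pass hands its output directly to the next."""
    current = data
    for _ in range(iterations):
        current = step(current)
    return current


def compare(n=1000, iterations=3):
    """Run both variants; return (results_match, bytes_moved_via_disk)."""
    data = [float(i) for i in range(n)]
    with tempfile.TemporaryDirectory() as workdir:
        disk_result, bytes_moved = run_via_disk(data, iterations, workdir)
    memory_result = run_in_memory(data, iterations)
    return disk_result == memory_result, bytes_moved
```

Both variants compute the same result; the disk variant additionally serializes, writes, and re-reads the whole dataset on every pass, which is the overhead the slide attributes to MapReduce-style data sharing.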
