Advanced Analytics with SQL and MLlib

Michael Armbrust

@michaelarmbrust

Slides available here: spark.

What is Apache Spark?

Fast and general cluster computing system

interoperable with Hadoop

Improves efficiency through:

• In-memory computing primitives

• General computation graphs

Up to 100× faster

(2-10× on disk)

Improves usability through:

• Rich APIs in Scala, Java, Python

• Interactive shell

2-5× less code
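The two efficiency points above (in-memory computing primitives and general computation graphs) can be illustrated with a toy sketch in plain Python. This is not Spark's API: the `LazyDataset` class and its `map`, `filter`, `cache`, and `collect` methods are hypothetical names chosen for illustration. Transformations only record a step in a graph, and a cached node is computed once and then reused from memory by later queries.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated, cacheable dataset (not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that produces a list
        self._use_cache = False    # whether to keep the result in memory
        self._cached = None        # materialized result, once computed
        self.evaluations = 0       # how many times this node actually ran

    def map(self, fn):
        # Record a new node in the graph; nothing is computed yet.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        # Record a new node in the graph; nothing is computed yet.
        return LazyDataset(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        # Mark this node so its result is kept in memory after the first run.
        self._use_cache = True
        return self

    def collect(self):
        # Run the recorded computation, or return the in-memory copy.
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


source = LazyDataset(lambda: list(range(10)))
squares = source.map(lambda x: x * x).cache()

# Two separate queries share the cached intermediate result.
evens = squares.filter(lambda x: x % 2 == 0).collect()
large = squares.filter(lambda x: x > 20).collect()
```

In this sketch `squares` is evaluated only once even though two queries read it, which is the kind of reuse that in-memory primitives are meant to make cheap.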

A Unified Stack

Spark, with libraries built on top:

• Spark SQL

• Spark Streaming (real-time)

• GraphX (graph)

• MLlib (machine learning)

• …

Why a New Programming Model?

MapReduce greatly simplified big data analysis

But once started, users wanted more:

• More complex, multi-pass analytics (e.g. ML, graph)

• More interactive ad-hoc queries

• More real-time stream processing

All 3 need faster data sharing in parallel apps

Data Sharing in MapReduce

[Figure, iterative job: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → . . .]

[Figure, interactive queries: Input → HDFS read → query 1 → result 1; query 2 → result 2; query 3 → result 3; . . .]

Slow due to replication, serialization, and disk IO
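The cost described on this slide can be sketched in plain Python, with local files standing in for HDFS. Everything here is an assumption made for illustration: the `step` function, the `run_via_disk` and `run_in_memory` helpers, and the use of `pickle` for serialization are hypothetical and do not come from the talk. The sketch counts only the serialization and disk round trips between iterations; replication is not modeled.

```python
import os
import pickle
import tempfile


def step(values):
    """One pass of a hypothetical iterative job: move each value halfway to 10."""
    return [v + (10 - v) / 2 for v in values]


def run_via_disk(values, iterations, workdir):
    """Hand each iteration's output to the next one through a file."""
    path = os.path.join(workdir, "iter_0.pkl")
    with open(path, "wb") as f:          # stage the input (setup, not counted)
        pickle.dump(values, f)

    io_ops = 0
    for i in range(iterations):
        with open(path, "rb") as f:      # read + deserialize previous output
            current = pickle.load(f)
        io_ops += 1
        path = os.path.join(workdir, f"iter_{i + 1}.pkl")
        with open(path, "wb") as f:      # serialize + write this output
            pickle.dump(step(current), f)
        io_ops += 1

    with open(path, "rb") as f:          # fetch the final result
        return pickle.load(f), io_ops


def run_in_memory(values, iterations):
    """Keep the working set in memory between iterations."""
    current = values
    for _ in range(iterations):
        current = step(current)
    return current, 0                    # no intermediate disk round trips


data = [0.0, 4.0, 8.0]
with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_io = run_via_disk(data, 3, workdir)
mem_result, mem_io = run_in_memory(data, 3)
```

Both variants compute the same result, but the disk-based one pays one read and one write per iteration, and that count grows with the number of passes.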
