Advanced Analytics with SQL and MLlib

Michael Armbrust

@michaelarmbrust

Slides available here: spark.

What is Apache Spark?

Fast and general cluster computing system, interoperable with Hadoop

Improves efficiency through:

• In-memory computing primitives

• General computation graphs

Up to 100× faster (2-10× on disk)

Improves usability through:

• Rich APIs in Scala, Java, Python

• Interactive shell

2-5× less code

A Unified Stack

Spark SQL

Spark Streaming (real-time)

GraphX (graph)

MLlib (machine learning)

…

Spark (core)

Why a New Programming Model?

MapReduce greatly simplified big data analysis

But once started, users wanted more:

• More complex, multi-pass analytics (e.g. ML, graph)

• More interactive ad-hoc queries

• More real-time stream processing

All 3 need faster data sharing in parallel apps

Data Sharing in MapReduce

[Diagram, iterative job: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → . . .]

[Diagram, interactive queries: Input → HDFS read → query 1 → result 1, query 2 → result 2, query 3 → result 3, . . .]

Slow due to replication, serialization, and disk IO
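
The cost named above can be sketched in plain Python. This is a toy illustration under stated assumptions, not Spark or Hadoop code: local pickle files stand in for HDFS, replication is not modelled, and `step`, `run_via_disk`, and `run_in_memory` are hypothetical names introduced here. It counts storage round trips for a multi-pass job that shares data through files versus one that keeps the working set in memory.

```python
import os
import pickle
import tempfile


def step(values):
    """One pass of a toy iterative computation: halve every value."""
    return [v / 2 for v in values]


def run_via_disk(values, iterations, workdir):
    """Share data between passes through files (assumed stand-in for HDFS)."""
    io_ops = 0
    path = os.path.join(workdir, "iter_0.pkl")
    with open(path, "wb") as f:
        pickle.dump(values, f)  # pre-existing input, not counted as job IO
    for i in range(1, iterations + 1):
        with open(path, "rb") as f:  # read and deserialize the previous output
            current = pickle.load(f)
        io_ops += 1
        current = step(current)
        path = os.path.join(workdir, f"iter_{i}.pkl")
        with open(path, "wb") as f:  # serialize and write this pass's output
            pickle.dump(current, f)
        io_ops += 1
    with open(path, "rb") as f:  # the consumer reads the final output
        result = pickle.load(f)
    io_ops += 1
    return result, io_ops


def run_in_memory(values, iterations, workdir):
    """Read the input once, then keep the working set in memory across passes."""
    path = os.path.join(workdir, "input.pkl")
    with open(path, "wb") as f:
        pickle.dump(values, f)  # pre-existing input, not counted as job IO
    with open(path, "rb") as f:  # single read of the input
        current = pickle.load(f)
    io_ops = 1
    for _ in range(iterations):
        current = step(current)  # no serialization or disk access between passes
    return current, io_ops


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(run_via_disk([8.0, 4.0], 3, workdir))
        print(run_in_memory([8.0, 4.0], 3, workdir))
```

With N passes, the file-based version performs 2N + 1 read or write operations, while the in-memory version performs one; both return the same result.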
