Advanced Analytics with SQL and MLlib

Michael Armbrust @michaelarmbrust

Slides available here

spark.

What is Apache Spark?

Fast and general cluster computing system interoperable with Hadoop

Improves efficiency through:

- In-memory computing primitives
- General computation graphs

Up to 100× faster (2-10× on disk)

Improves usability through:

- Rich APIs in Scala, Java, Python
- Interactive shell

2-5× less code

A Unified Stack

- Spark SQL
- Spark Streaming (real-time)
- GraphX (graph)
- MLlib (machine learning)
- ...

Spark (core)

Why a New Programming Model?

MapReduce greatly simplified big data analysis

But once started, users wanted more:

- More complex, multi-pass analytics (e.g. ML, graph)
- More interactive ad-hoc queries
- More real-time stream processing

All 3 need faster data sharing in parallel apps

Data Sharing in MapReduce

[Diagram: iterative job] Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → . . .

[Diagram: interactive queries] Input → HDFS read → query 1 → result 1; query 2 → result 2; query 3 → result 3; . . .

Slow due to replication, serialization, and disk IO
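
The iterative pattern above can be imitated on a single machine. The sketch below is a toy illustration only, not Spark or Hadoop code: local pickle files stand in for HDFS, replication is not modeled, and the names `step`, `run_with_disk_sharing`, `run_with_memory_sharing`, and `demo` are hypothetical. It counts how many storage reads and writes each style of data sharing needs for the same multi-pass job.

```python
import os
import pickle
import tempfile


def step(values):
    """One hypothetical pass of an iterative job: halve every value."""
    return [v / 2 for v in values]


def run_with_disk_sharing(input_path, iterations, workdir):
    """MapReduce-style sharing (assumes iterations >= 1).

    Every pass deserializes its input from storage and serializes its
    output back, so data moves between passes only through files.
    """
    reads = writes = 0
    path = input_path
    for i in range(iterations):
        with open(path, "rb") as f:  # stands in for "HDFS read"
            values = pickle.load(f)
        reads += 1
        values = step(values)
        path = os.path.join(workdir, f"iter_{i + 1}.pkl")
        with open(path, "wb") as f:  # stands in for "HDFS write"
            pickle.dump(values, f)
        writes += 1
    return values, reads, writes


def run_with_memory_sharing(input_path, iterations):
    """In-memory sharing: read the input once, then keep it in memory."""
    with open(input_path, "rb") as f:
        values = pickle.load(f)
    reads, writes = 1, 0
    for _ in range(iterations):
        values = step(values)  # no serialization or disk I/O between passes
    return values, reads, writes


def demo(iterations=3):
    """Run both variants on the same small input and return their results."""
    with tempfile.TemporaryDirectory() as workdir:
        input_path = os.path.join(workdir, "input.pkl")
        with open(input_path, "wb") as f:
            pickle.dump([8.0, 16.0, 24.0], f)
        disk = run_with_disk_sharing(input_path, iterations, workdir)
        memory = run_with_memory_sharing(input_path, iterations)
    return disk, memory


if __name__ == "__main__":
    disk, memory = demo()
    print("disk sharing:  ", disk)
    print("memory sharing:", memory)
```

Both variants compute the same values, but the disk-sharing variant performs one read and one write per pass, while the in-memory variant reads the input once and writes nothing between passes.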
