Advanced Analytics with SQL and MLLib
Michael Armbrust
@michaelarmbrust
Slides available here: spark.
What is Apache Spark?
Fast and general cluster computing system interoperable with Hadoop
Improves efficiency through:
- In-memory computing primitives
- General computation graphs
Up to 100× faster (2-10× on disk)
Improves usability through:
- Rich APIs in Scala, Java, Python
- Interactive shell
2-5× less code
A Unified Stack
Components built on top of Spark:
- Spark SQL
- Spark Streaming (real-time)
- GraphX (graph)
- MLlib (machine learning)
- ...
Why a New Programming Model?
MapReduce greatly simplified big data analysis
But once started, users wanted more:
- More complex, multi-pass analytics (e.g. ML, graph)
- More interactive ad-hoc queries
- More real-time stream processing
All 3 need faster data sharing in parallel apps
Data Sharing in MapReduce
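[Figure: an iterative job performs an HDFS read and an HDFS write around every iteration (iter. 1, iter. 2, . . .); interactive queries (query 1, query 2, query 3, . . .) each perform their own HDFS read of the input to produce result 1, result 2, result 3, . . .]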
Slow due to replication, serialization, and disk IO
What We'd Like
[Figure: the input goes through one-time processing into distributed memory; iterations (iter. 1, iter. 2, . . .) and queries (query 1, query 2, query 3, . . .) then share the data in memory]
10-100× faster than network and disk
Spark Model
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets (RDDs)
- Collections of objects that can be stored in memory or disk across a cluster
- Built via parallel transformations (map, filter, ...)
- Automatically rebuilt on failure
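The three properties above can be illustrated with a toy, single-process sketch in plain Python. This is not Spark's implementation: the `ToyRDD` class, its `evict` method, and the `load` function are hypothetical names invented here. The sketch only shows that a dataset defined by its lineage (a parent plus a transformation) can be computed lazily, kept in memory, and recomputed if the in-memory copy is lost.

```python
# Toy illustration of the RDD idea in one process; not Spark's implementation.
class ToyRDD:
    """A dataset described by its lineage: a parent plus a transformation."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # callable returning rows (base dataset only)
        self._parent = parent        # upstream ToyRDD, if any
        self._transform = transform  # function: iterable -> iterable
        self._want_cache = False     # set by cache()
        self._cached = None          # materialized rows once computed and cached

    def map(self, fn):
        # Records the transformation; nothing is computed yet.
        return ToyRDD(parent=self, transform=lambda rows: (fn(r) for r in rows))

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda rows: (r for r in rows if pred(r)))

    def cache(self):
        self._want_cache = True
        return self

    def evict(self):
        """Simulate losing the in-memory copy, e.g. after a worker failure."""
        self._cached = None

    def _compute(self):
        if self._cached is not None:
            return iter(self._cached)
        if self._parent is None:
            rows = iter(self._source())
        else:
            rows = self._transform(self._parent._compute())
        if self._want_cache:
            self._cached = list(rows)
            return iter(self._cached)
        return rows

    def collect(self):
        return list(self._compute())


reads = []  # one entry per actual read of the base data


def load():
    reads.append(1)
    return range(10)


base = ToyRDD(source=load)
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
reads_before_action = len(reads)  # still 0: transformations are lazy

first = evens_squared.collect()   # computes from the source and caches
second = evens_squared.collect()  # served from the cached copy
evens_squared.evict()             # pretend the cached copy was lost
third = evens_squared.collect()   # rebuilt by replaying the lineage
```

In this sketch the base data is read only when an action (`collect`) runs, and a second time only after the cached copy is evicted. In Spark the same bookkeeping is done per partition across a cluster.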
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda x: x.startswith("ERROR"))
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

messages.filter(lambda x: "foo" in x).count()
messages.filter(lambda x: "bar" in x).count()
. . .

[Figure: lines is the base RDD; errors and messages are transformed RDDs. The driver sends tasks to the workers and receives results; each worker reads its block of lines (Block 1, . . .) and keeps its part of messages in a cache (Cache 1, . . .)]
Result: full-text search of Wikipedia; scaled to 1 TB data in 5-7 sec
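To make the data flow of the log-mining example concrete without a cluster, here is a plain-Python analogue over a small in-memory list. The sample log lines and their tab-separated layout (level, timestamp, message) are assumptions made up for illustration; the slide only shows that the message is the third tab-separated field. Materializing `messages` as a list stands in for `cache()`: it is computed once and reused by both queries.

```python
# Hypothetical sample data: tab-separated level, timestamp, message.
sample_lines = [
    "ERROR\t2014-01-01T00:00:01\tfoo failed to start",
    "INFO\t2014-01-01T00:00:02\tfoo started",
    "ERROR\t2014-01-01T00:00:03\tbar timed out",
    "ERROR\t2014-01-01T00:00:04\tfoo and bar disagree",
]

# filter: keep only the lines that start with "ERROR"
errors = [x for x in sample_lines if x.startswith("ERROR")]

# map: keep the third tab-separated field (index 2), the message text.
# Building the list once plays the role of messages.cache().
messages = [x.split("\t")[2] for x in errors]

# Two interactive queries that reuse the same in-memory messages.
foo_count = sum(1 for m in messages if "foo" in m)
bar_count = sum(1 for m in messages if "bar" in m)
```

In Spark the same filter, map, and count steps run in parallel over the partitions of the dataset, and only the first query pays the cost of reading the log from storage.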