Apache Spark Notes

SparkNotes


Spark essentials
    Spark Components
    Launch Spark Application
General operation
    Read files
    Transformations
    Actions
    Different Types of RDD
    Data Frame
Stand-alone Application
PairRDD
    Transformations & Actions in RDD
    Partition
DataFrames in Spark
    Create DataFrames from existing RDD
    Loading/saving from/to data source
    Transformations & Actions
    User defined function (UDF)
    Repartition DataFrame
Monitor Spark Applications
Debug and tune Spark applications
Spark Streaming
    Spark Streaming Architecture
    Streaming Programming Key Concept
    Window Operations
    Fault tolerance
GraphX
    Regular, Directed, and Property Graphs
    Create Property Graph
    Perform operations on graph
Spark MLlib

08-25-2016 - 10:26 PM


Spark essentials


Advantages of Apache Spark:

- Compatible with Hadoop
- Ease of development
- Fast
- Multiple language support
- Unified stack: Batch, Streaming, Interactive Analytics

Transformation vs. Action:

A transformation returns an RDD. Since RDDs are immutable, the transformation never modifies its source; it returns a new RDD. An action returns a value.

Spark Components

- Spark Core: the computational engine responsible for task scheduling, memory management, fault recovery, and interacting with storage systems. It contains the basic functionality of Spark, including the APIs used to define and manipulate RDDs.
- Spark SQL: used for working with structured data. You can query this data via SQL or HiveQL. Spark SQL supports many types of data sources, such as structured Hive tables and complex JSON data.
- Spark Streaming: enables processing of live streams of data and real-time analytics.
- MLlib: a machine learning library that provides multiple types of machine learning algorithms, such as classification, regression, and clustering.
- GraphX: a library for manipulating graphs and performing graph-parallel computations.

Launch Spark Application

Local Mode

- Driver and worker are in the same JVM
- RDDs and variables are in the same memory space
- No central master
- Execution is started by the user

Standalone/Yarn Cluster Mode

- The driver is launched from a worker process inside the cluster
- Asynchronous: the submitting client does not wait for the job to finish


Standalone/Yarn Client Mode

- The driver is launched in the client process that submitted the job
- Synchronous: the client must wait for the job to finish

Mesos replaces the Spark Master as cluster manager and provides two modes:

1. Fine-grained mode: each Spark task runs as a separate Mesos task; useful for sharing; incurs start-up overhead
2. Coarse-grained mode: launches only one long-running task; no sharing; no start-up overhead


General operation

Spark provides two kinds of operations: transformations and actions. Transformations are lazily evaluated; nothing is computed until an action is called.
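The lazy-evaluation idea can be illustrated without a Spark installation. The sketch below is a plain-Python model, not the Spark API: the built-in map stands in for a transformation and sum stands in for an action. The names double, calls, and data are made up for illustration.

```python
# Plain-Python model of lazy evaluation (NOT the Spark API).
calls = []  # records every time the mapped function actually runs


def double(x):
    calls.append(x)
    return 2 * x


data = [1, 2, 3]

# "Transformation": builds a lazy pipeline; nothing is computed yet.
doubled = map(double, data)
calls_before_action = len(calls)

# "Action": forces evaluation and returns a value.
total = sum(doubled)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, total)
```

The function is not invoked when the pipeline is defined, only when the final value is requested. Spark applies the same principle to RDD lineage, at cluster scale.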

Read files

Source                                   Method
Text files with one record per line      sc.textFile()
SequenceFiles                            sc.sequenceFile[K, V]
Other Hadoop InputFormats                sc.hadoopRDD
(filename, content) pairs                sc.wholeTextFiles
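The difference between line-oriented and whole-file reading can be sketched in plain Python. This is only a model of the record shapes, not Spark code: read_lines and read_whole_files are hypothetical helpers, and the sketch uses bare file names as keys, whereas Spark itself would use full paths.

```python
import os
import tempfile


def read_lines(directory):
    """One record per line across all files (models the shape of sc.textFile)."""
    records = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as f:
            records.extend(line.rstrip("\n") for line in f)
    return records


def read_whole_files(directory):
    """One (filename, content) pair per file (models sc.wholeTextFiles)."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as f:
            pairs.append((name, f.read()))  # assumption: bare name as key
    return pairs


# Build two small sample files in a temporary directory.
with tempfile.TemporaryDirectory() as d:
    for name, text in [("a.txt", "x\ny\n"), ("b.txt", "z\n")]:
        with open(os.path.join(d, name), "w") as f:
            f.write(text)
    lines = read_lines(d)
    pairs = read_whole_files(d)

print(lines)
print(pairs)
```

Line-oriented reading suits log-style data where each line is a record; whole-file reading suits many small files that must each be processed as a unit.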

Transformations


Function       Details
map            Returns a new RDD by applying func to each element of the source
filter         Returns a new RDD consisting of the source elements for which func returns true
groupByKey     On a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs
reduceByKey    On a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function
flatMap        Like map, but func returns a sequence instead of a single item
distinct       Returns a new dataset containing the distinct elements of the source
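The semantics of these transformations can be sketched on ordinary Python lists. This is a single-machine model, not the Spark API: group_by_key and reduce_by_key are hypothetical helpers written for this example, and a real RDD would be partitioned and evaluated lazily.

```python
from functools import reduce
from itertools import chain


def group_by_key(pairs):
    """(K, V) pairs -> {K: [V, ...]}, modelling groupByKey."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped


def reduce_by_key(func, pairs):
    """(K, V) pairs -> {K: V}, values per key combined with func."""
    return {k: reduce(func, vs) for k, vs in group_by_key(pairs).items()}


lines = ["a b", "b c a"]

# flatMap: each line yields a sequence of words, flattened into one list.
words = list(chain.from_iterable(line.split() for line in lines))

# map: one output element per input element.
pairs = [(w, 1) for w in words]

# filter: keep only elements for which the predicate is true.
not_c = [w for w in words if w != "c"]

# distinct: unique elements (sorted here only for a stable display order).
unique = sorted(set(words))

grouped = group_by_key(pairs)
counts = reduce_by_key(lambda x, y: x + y, pairs)

print(words)
print(not_c)
print(unique)
print(grouped)
print(counts)
```

In this model groupByKey keeps every value per key, while reduceByKey collapses them to a single value per key.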

Actions
