Apache Spark Notes
SparkNotes
Spark essentials
Spark Components
Launch Spark Application
General operation
Read files
Transformations
Actions
Different Types of RDD
Data Frame
Stand-alone Application
PairRDD
Transformations & Actions in RDD
Partition
DataFrames in Spark
Create DataFrames from existing RDD
Loading/saving from/to data source
Transformations & Actions
User defined function (UDF)
Repartition DataFrame
Monitor Spark Applications
Debug and tune Spark applications
Spark Streaming
Spark Streaming Architecture
Streaming Programming Key Concept
Window Operations
Fault tolerance
GraphX
Regular, Directed, and Property Graphs
Create Property Graph
Perform operations on graph
Spark MLlib
08-25-2016 - 10:26 PM
Spark essentials
Advantages of Apache Spark:
Compatible with Hadoop
Ease of development
Fast
Multiple language support
Unified stack: Batch, Streaming, Interactive Analytics
Transformation vs. Action:
A transformation returns an RDD. Because RDDs are immutable, a transformation never modifies its input; it returns a new RDD. An action returns a value.
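The distinction can be illustrated with a minimal plain-Python sketch. This is a model of the semantics, not Spark code: a tuple stands in for an immutable RDD, and the comments name the PySpark calls (`rdd.map`, `rdd.reduce`) that would play the same roles.

```python
# Plain-Python model of transformation vs. action; not the Spark API.
source = (1, 2, 3, 4)  # a tuple stands in for an immutable RDD

# "Transformation": produces a new collection and leaves the source untouched.
# The PySpark counterpart would be rdd.map(lambda x: x * 2).
doubled = tuple(x * 2 for x in source)

# "Action": reduces the collection to a plain value.
# The PySpark counterpart would be rdd.reduce(lambda a, b: a + b).
total = sum(doubled)

print(source)   # (1, 2, 3, 4)
print(doubled)  # (2, 4, 6, 8)
print(total)    # 20
```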
Spark Components
Spark Core is the computational engine responsible for task scheduling, memory management, fault recovery, and interaction with storage systems. It contains the basic functionality of Spark, including the APIs used to define and manipulate RDDs.
Spark SQL is used for working with structured data. You can query this data via SQL or HiveQL. Spark SQL supports many types of data sources, such as structured Hive tables and complex JSON data.
Spark Streaming enables processing of live data streams and real-time analytics.
MLlib is a machine learning library that provides multiple types of machine learning algorithms, such as classification, regression, and clustering.
GraphX is a library for manipulating graphs and performing graph-parallel computations.
Launch Spark Application
Local Mode
Driver and worker run in the same JVM
RDDs and variables live in the same memory space
No central master
Execution is started by the user
Standalone/Yarn Cluster Mode
The driver is launched from a worker process inside the cluster
Asynchronous: the submitting client does not wait for the job to finish
Standalone/Yarn Client Mode
The driver is launched in the client process that submitted the job
Synchronous: the client has to wait for the job to finish
Mesos replaces the Spark Master as the cluster manager and provides two modes:
1. Fine-grained mode: each Spark task runs as a separate Mesos task; useful for sharing; start-up overhead
2. Coarse-grained mode: launches only one long-running task; no sharing; no start-up overhead
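As a rough sketch, the launch modes above correspond to `spark-submit` options along the following lines. The helper `submit_command`, the application file `app.py`, and the host names are hypothetical placeholders; `--master`, `--deploy-mode`, `--conf`, and the `spark.mesos.coarse` property are the real option and property names.

```python
def submit_command(master, app, deploy_mode=None, conf=None):
    """Build a spark-submit argument list (hypothetical helper for illustration)."""
    command = ["spark-submit", "--master", master]
    if deploy_mode is not None:
        command += ["--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        command += ["--conf", f"{key}={value}"]
    return command + [app]


# Local mode: driver and worker share one JVM on the submitting machine.
local = submit_command("local[*]", "app.py")

# YARN cluster mode: the driver runs inside the cluster.
yarn_cluster = submit_command("yarn", "app.py", deploy_mode="cluster")

# Standalone client mode: the driver runs in the submitting process.
standalone_client = submit_command(
    "spark://master-host:7077", "app.py", deploy_mode="client"
)

# Mesos as cluster manager, coarse-grained mode.
mesos_coarse = submit_command(
    "mesos://mesos-host:5050", "app.py", conf={"spark.mesos.coarse": "true"}
)

for command in (local, yarn_cluster, standalone_client, mesos_coarse):
    print(" ".join(command))
```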
General operation
Spark provides two kinds of operations: transformations and actions. Transformations are lazily evaluated: nothing is computed until an action is called.
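Lazy evaluation can be modelled in a few lines of plain Python. The class `LazyDataset` below is a hypothetical stand-in for an RDD, not part of Spark: its `map` and `filter` only record the requested step, and the work happens when the action `collect` is called.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: records steps, runs them on an action."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps  # pending transformations, in order

    def map(self, func):
        # Transformation: return a NEW dataset with one more pending step.
        return LazyDataset(self._data, self._steps + (("map", func),))

    def filter(self, pred):
        # Transformation: also only recorded, not executed.
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        items = iter(self._data)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return list(items)


calls = []


def traced_square(x):
    calls.append(x)  # record every real invocation
    return x * x


squares = LazyDataset([1, 2, 3, 4]).map(traced_square).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: nothing has run yet
result = squares.collect()        # the action triggers the computation

print(calls_before_action)  # 0
print(result)               # [9, 16]
print(calls)                # [1, 2, 3, 4]
```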
Read files
Text files with one record per line > sc.textFile()
SequenceFiles > sc.sequenceFile[K,V]
Other Hadoop InputFormats > sc.hadoopRDD
(filename, content) pairs > sc.wholeTextFiles
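The difference between line-oriented and whole-file reading can be sketched in plain Python. The functions `text_file` and `whole_text_files` are hypothetical models of what `sc.textFile` and `sc.wholeTextFiles` return for a directory; they are not the Spark calls themselves. One simplification: Spark reports the full file path as the key, while this model uses the bare file name.

```python
import os
import tempfile


def text_file(directory):
    """Model of sc.textFile(directory): one record per line, across all files."""
    records = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as handle:
            records.extend(handle.read().splitlines())
    return records


def whole_text_files(directory):
    """Model of sc.wholeTextFiles(directory): one (filename, content) pair per file."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as handle:
            pairs.append((name, handle.read()))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    # Two small sample files, created only for this illustration.
    for name, content in [("a.txt", "x\ny\n"), ("b.txt", "z\n")]:
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write(content)
    lines = text_file(tmp)
    pairs = whole_text_files(tmp)

print(lines)  # ['x', 'y', 'z']
print(pairs)  # [('a.txt', 'x\ny\n'), ('b.txt', 'z\n')]
```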
Transformations
Function      Details
map           Returns a new RDD by applying func to each element of the source
filter        Returns a new RDD consisting of the elements of the source for which func returns true
groupByKey    On a dataset of (K, V) pairs, returns a dataset of (K, Iterable) pairs
reduceByKey   On a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function
flatMap       Like map, but func returns a sequence instead of a single item
distinct      Returns a new dataset containing the distinct elements of the source
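The effect of each transformation can be shown on a small in-memory list. This is a plain-Python sketch of the results, not Spark code; the comments name the RDD method each line models. Spark does not guarantee element order after `distinct`, so the model sorts the result only to keep the output stable.

```python
from collections import defaultdict
from functools import reduce

lines = ["a b", "b c", "a"]

mapped = [line.upper() for line in lines]             # map
filtered = [line for line in lines if " " in line]    # filter
flat = [w for line in lines for w in line.split()]    # flatMap
unique = sorted(set(flat))                            # distinct

pairs = [(w, 1) for w in flat]                        # (K, V) pairs

grouped = defaultdict(list)                           # groupByKey
for key, value in pairs:
    grouped[key].append(value)

# reduceByKey: aggregate the values of each key with a reduce function.
reduced = {k: reduce(lambda a, b: a + b, vs) for k, vs in grouped.items()}

print(mapped)         # ['A B', 'B C', 'A']
print(filtered)       # ['a b', 'b c']
print(flat)           # ['a', 'b', 'b', 'c', 'a']
print(unique)         # ['a', 'b', 'c']
print(dict(grouped))  # {'a': [1, 1], 'b': [1, 1], 'c': [1]}
print(reduced)        # {'a': 2, 'b': 2, 'c': 1}
```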
Actions