CS5412 / Lecture 25: Apache Spark and RDDs (Kishore Pusukuri, Spring 2019)
Recap
MapReduce
- For easily writing applications that process vast amounts of data in parallel on large clusters, in a reliable, fault-tolerant manner
- Takes care of scheduling tasks, monitoring them, and re-executing failed tasks
HDFS & MapReduce: run on the same set of nodes, so the compute nodes and the storage nodes are the same (keeping computation close to the data), which gives very high throughput
YARN & MapReduce: a single master resource manager, one slave node manager per node, and an AppMaster per application
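To make the recap concrete, here is a minimal sketch of the MapReduce word-count pattern in plain Python. It only simulates the three phases (map, shuffle, reduce) on one machine; function names and the sample documents are illustrative assumptions, not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(docs):
    # Mapper: emit a (word, 1) pair for every word in every document
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the cat sat", "the cat ran"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"the": 2, "cat": 2, "sat": 1, "ran": 1}
```

In real Hadoop, the mappers and reducers run as separate tasks across the cluster, and the framework handles the shuffle, scheduling, and re-execution of failed tasks.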
Today's Topics
- Motivation
- Spark Basics
- Spark Programming
History of Hadoop and Spark
Apache Spark
The Spark ecosystem is a layered stack, from top to bottom:

- Processing: Spark Streaming, Spark SQL, Spark ML, and other applications, all built on Spark Core (Standalone Scheduler)
- Resource manager: Spark can connect to several types of cluster managers (either Spark's own standalone cluster manager, Mesos, or Yet Another Resource Negotiator (YARN))
- Data storage: Hadoop Distributed File System (HDFS), Hadoop NoSQL database (HBase), S3, Cassandra, and other storage systems
- Data ingestion systems: e.g., Apache Kafka, Flume, etc.

The lower layers (YARN, HDFS, HBase) come from Hadoop; the upper layers are Spark.
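To illustrate the cluster-manager choice above, here is a hedged sketch of submitting the same application under each manager with `spark-submit`. The host names, application file, and resource figures are illustrative assumptions.

```shell
# Spark's own standalone cluster manager
spark-submit --master spark://master-host:7077 my_app.py

# Mesos
spark-submit --master mesos://mesos-host:5050 my_app.py

# YARN, in cluster deploy mode (resource flags are illustrative)
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 4 --executor-memory 2g my_app.py
```

The application code stays the same in all three cases; only the `--master` URL tells Spark which resource manager to negotiate with.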
Apache Hadoop Lacks a Unified Vision

- Sparse modules
- Diversity of APIs
- Higher operational costs
Spark Ecosystem: A Unified Pipeline
Note: Spark is not designed for real-time IoT. The streaming layer is used for continuous input streams such as financial data from stock markets, where events occur steadily and must be processed as they arrive. But there is no support for direct I/O with sensors/actuators, so Spark would not be suitable for such IoT use cases.
Key ideas
In Hadoop, each developer tends to invent his or her own style of work.
With Spark, there is a serious effort to standardize around the idea that people are writing parallel code that often runs for many "cycles" or "iterations" in which a lot of reuse of information occurs.

Spark centers on Resilient Distributed Datasets (RDDs), which capture the information being reused.
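The reuse idea can be sketched in plain Python with a toy stand-in for an RDD: transformations are lazy, and caching lets several derived datasets reuse one expensive computation. The class name `MiniRDD` and the `expensive_load` counter are illustrative assumptions, not Spark's API.

```python
calls = {"n": 0}  # counts how many times the expensive load actually runs

class MiniRDD:
    """Toy stand-in for an RDD: lazy transformations, optional caching."""
    def __init__(self, compute):
        self._compute = compute   # thunk that produces the data on demand
        self._cached = False
        self._data = None

    def map(self, f):
        # Lazy: nothing runs until collect() is called
        return MiniRDD(lambda: [f(x) for x in self.collect()])

    def cache(self):
        self._cached = True
        return self

    def collect(self):
        if self._cached and self._data is not None:
            return self._data     # reuse the materialized result
        data = self._compute()
        if self._cached:
            self._data = data
        return data

def expensive_load():
    calls["n"] += 1               # pretend this is a costly cluster-wide scan
    return [1, 2, 3]

base = MiniRDD(expensive_load).cache()
doubled = base.map(lambda x: 2 * x)
tripled = base.map(lambda x: 3 * x)
doubled.collect()
tripled.collect()
# the expensive load ran only once; both derived datasets reused the cached base
```

In real Spark, `rdd.cache()` (or `persist()`) plays this role across iterations of a job, which is why iterative workloads benefit so much compared to re-reading from HDFS each cycle.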