Basic Spark Programming and Performance Diagnosis


Jinliang Wei, 15-719 Spring 2017

Recitation

Today's Agenda

• PySpark shell and submitting jobs
• Basic Spark programming
• Word Count
• How does Spark execute your program?
• Spark monitoring web UI
• What is shuffle and how does it work?
• Spark programming caveats
• Generally good practices
• Important configuration parameters
• Basic performance diagnosis

PySpark shell and submitting jobs

Launch A Spark + HDFS Cluster on EC2

• First, set the environment variables:
  • AWS_SECRET_ACCESS_KEY
  • AWS_ACCESS_KEY_ID

• Get spark-ec2-setup
• Launch a cluster with 4 slave nodes:

./spark-ec2 -k <key-pair-name> -i <key-file> \
  -t m4.xlarge -s 4 -a ami-6d15ec7b \
  --ebs-vol-size=200 --ebs-vol-num=1 \
  --ebs-vol-type=gp2 \
  --spot-price=<bid-price> \
  launch SparkCluster

• Log in as root
• Replace launch with destroy to terminate the cluster

Your Standalone Spark Cluster

[Diagram: a standalone Spark cluster with one Master node and two Worker nodes (Worker1, Worker2)]

• The Spark master is the cluster manager (analogous to YARN/Mesos).

• Workers are sometimes referred to as slaves.

• When your application is submitted, worker nodes run executors, which are processes that run computations and store data for your application.

• By default, an executor uses all cores on a worker node.

• This is configurable via spark.executor.cores (normally left at the default unless there are too many cores per node).
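
For reference, a minimal sketch of setting this property from a PySpark application; the app name and core count here are illustrative assumptions, not values from the recitation:

# Hypothetical example: cap each executor at 4 cores via SparkConf.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("CoreLimitDemo")            # illustrative app name
        .set("spark.executor.cores", "4"))      # cores used by each executor
sc = SparkContext(conf=conf)

The same property can also be passed on the command line, e.g. spark-submit --conf spark.executor.cores=4.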

Standalone Spark Master Web UI

http://[master-node-public-ip]:8080

For an overview of the cluster and state of each worker.

PySpark Shell

• Spark is installed under /root/spark
• Launch the PySpark shell: /root/spark/bin/pyspark
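
The shell starts with a SparkContext already bound to the variable sc, which the examples below rely on. A standalone script run with spark-submit has to create its own context; a minimal sketch (the file name and app name are illustrative assumptions):

# my_job.py -- hypothetical script; run with /root/spark/bin/spark-submit my_job.py
from pyspark import SparkContext

sc = SparkContext(appName="MyJob")      # the shell creates this for you as sc
print(sc.parallelize(range(10)).sum())  # 0 + 1 + ... + 9 = 45
sc.stop()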

Simple math using PySpark Shell

• Define a list of numbers: a = [1, 3, 7, 4, 2]

• Create an RDD from that list:

rdd_a = sc.parallelize(a)

• Double each element:

rdd_b = rdd_a.map(lambda x: x * 2)

• Sum the elements up:

c = rdd_b.reduce(lambda x, y: x + y)
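
Putting the steps together (runnable as-is in the PySpark shell, where sc is predefined); note that map is a lazy transformation, so no work happens until the reduce action runs:

a = [1, 3, 7, 4, 2]
rdd_a = sc.parallelize(a)             # distribute the list as an RDD
rdd_b = rdd_a.map(lambda x: x * 2)    # lazy: nothing runs yet
c = rdd_b.reduce(lambda x, y: x + y)  # action: triggers the job
print(c)                              # 34 == 2 * (1 + 3 + 7 + 4 + 2)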
