Basic Spark Programming and Performance Diagnosis

Jinliang Wei, 15-719 Spring 2017


Today's Agenda

• PySpark shell and submitting jobs
• Basic Spark programming
• Word Count
• How does Spark execute your program?
• Spark monitoring web UI
• What is shuffle and how does it work?
• Spark programming caveats
• Generally good practices
• Important configuration parameters
• Basic performance diagnosis

PySpark shell and submitting jobs

Launch A Spark + HDFS Cluster on EC2

• First, set the environment variables:
    • AWS_SECRET_ACCESS_KEY
    • AWS_ACCESS_KEY_ID
• Get spark-ec2-setup
• Launch a cluster with 4 slave nodes:

    ./spark-ec2 -k <key-pair-name> -i <identity-file> \
        -t m4.xlarge -s 4 -a ami-6d15ec7b \
        --ebs-vol-size=200 --ebs-vol-num=1 \
        --ebs-vol-type=gp2 \
        --spot-price=<price> \
        launch SparkCluster

• Log in as root
• Replace launch with destroy to terminate the cluster
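
A launch that fails because of missing credentials can be caught early. The following is only a minimal sketch, assuming the two variable names listed above are all that needs checking; the helper name missing_aws_vars is made up for illustration and is not part of spark-ec2.

```python
import os

# Variable names taken from the slide above.
REQUIRED_VARS = ("AWS_SECRET_ACCESS_KEY", "AWS_ACCESS_KEY_ID")


def missing_aws_vars(env=None):
    """Return the required variable names that are unset or empty in env.

    Hypothetical helper for illustration; defaults to the process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_vars()
    if missing:
        print("Set these before running spark-ec2:", ", ".join(missing))
    else:
        print("AWS credentials found in the environment.")
```

Running this before the launch command gives a quick hint about which variable still needs to be exported.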


Your Standalone Spark Cluster




• The Spark master is the cluster manager (analogous to YARN/Mesos).
• Workers are sometimes referred to as slaves.
• When your application is submitted, worker nodes run executors, which are processes that run computations and store data for your application.
• By default, an executor uses all cores on a worker node.
    • Configurable via spark.executor.cores (normally left as default unless there are too many cores per node)
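
One way the executor core setting might be passed at submission time is through the --conf option of spark-submit. The sketch below only builds the argument list and does not run anything. The function name spark_submit_args, the script name my_app.py, and the host master-host are placeholders assumed for illustration.

```python
def spark_submit_args(app_path, master_url, executor_cores=None):
    """Build a spark-submit argument list (hypothetical helper).

    When executor_cores is None, no override is added, so the standalone
    default (all cores on the worker) stays in effect.
    """
    args = ["spark-submit", "--master", master_url]
    if executor_cores is not None:
        # Cap the number of cores each executor may use.
        args += ["--conf", f"spark.executor.cores={executor_cores}"]
    args.append(app_path)
    return args


# my_app.py and master-host are placeholders, not names from the slides.
print(" ".join(spark_submit_args("my_app.py", "spark://master-host:7077", executor_cores=2)))
```

Leaving executor_cores unset mirrors the advice above to keep the default unless a node has too many cores.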

Standalone Spark Master Web UI


For an overview of the cluster and state of each worker.

PySpark Shell

• Spark is installed under


• Launch the PySpark shell:
    /root/spark/bin/pyspark

Simple math using PySpark Shell

• Define a list of numbers:
    a = [1, 3, 7, 4, 2]
• Create an RDD from that list:
    rdd_a = sc.parallelize(a)
• Double each element:
    rdd_b = rdd_a.map(lambda x: x * 2)
• Sum the elements up:
    c = rdd_b.reduce(lambda x, y: x + y)
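
To check the expected result without a cluster, the same steps can be mirrored in plain Python. This is a single-machine stand-in, not Spark code: the built-in map and functools.reduce take the place of the RDD operations, and the variable b is introduced here only for illustration.

```python
from functools import reduce

a = [1, 3, 7, 4, 2]

# Stand-in for the "double each element" step, evaluated eagerly here.
b = list(map(lambda x: x * 2, a))

# Stand-in for the "sum the elements up" step.
c = reduce(lambda x, y: x + y, b)

print(b)  # [2, 6, 14, 8, 4]
print(c)  # 34
```

In the PySpark shell, c should likewise come out as 34. One difference is that Spark's map is a lazy transformation, so no work happens until the reduce action is called.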

