Basic Spark Programming and Performance Diagnosis
Jinliang Wei
15-719 Spring 2017 Recitation
Today's Agenda
• PySpark shell and submitting jobs
• Basic Spark programming
  – Word Count
• How does Spark execute your program?
  – Spark monitoring web UI
  – What is shuffle and how does it work?
• Spark programming caveats
  – Generally good practices
  – Important configuration parameters
• Basic performance diagnosis
PySpark Shell and Submitting Jobs
Launch a Spark + HDFS Cluster on EC2
• First, set environment variables:
  – AWS_SECRET_ACCESS_KEY
  – AWS_ACCESS_KEY_ID
• Get spark-ec2-setup
• Launch a cluster with 4 slave nodes:
  ./spark-ec2 -k -i \
    -t m4.xlarge -s 4 -a ami-6d15ec7b \
    --ebs-vol-size=200 --ebs-vol-num=1 \
    --ebs-vol-type=gp2 \
    --spot-price= \
    launch SparkCluster
• Login as root
• Replace launch with destroy to terminate the cluster
Your Standalone Spark Cluster

[Cluster diagram: Master node with Worker1 and Worker2]

• The Spark master is the cluster manager (analogous to YARN/Mesos).
• Workers are sometimes referred to as slaves.
• When your application is submitted, worker nodes run executors, which are processes that run computations and store data for your application.
• By default, an executor uses all cores on a worker node.
  – Configurable via spark.executor.cores (normally left as the default unless there are too many cores per node); a configuration sketch follows below.
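spark.executor.cores is normally set through Spark configuration when the application is created. As a minimal sketch (not from the slides; the master URL, app name, and core count below are illustrative placeholders):

  # sketch: capping cores per executor at application startup
  from pyspark import SparkConf, SparkContext

  conf = (SparkConf()
          .setAppName("CoreLimitedApp")              # placeholder app name
          .setMaster("spark://master-node:7077")     # standalone master URL (placeholder)
          .set("spark.executor.cores", "2"))         # cap cores per executor

  sc = SparkContext(conf=conf)
  print(sc.getConf().get("spark.executor.cores"))    # verify the setting took effect
  sc.stop()

The same property can also be supplied on the command line (e.g., via spark-submit's --conf flag) instead of being hard-coded in the application.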
Standalone Spark Master Web UI

http://[master-node-public-ip]:8080

For an overview of the cluster and the state of each worker.
PySpark Shell
• Spark is installed under /root/spark
• Launch the PySpark shell:
  /root/spark/bin/pyspark
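The shell creates a SparkContext for you and exposes it as sc. As an illustrative sanity check (output depends on your cluster), you can inspect it at the shell prompt:

  >>> sc.version              # Spark version running on the cluster
  >>> sc.master               # e.g. a spark://... URL for the standalone master
  >>> sc.defaultParallelism   # default number of partitions / available cores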
Simple Math Using the PySpark Shell
• Define a list of numbers:
  a = [1, 3, 7, 4, 2]
• Create an RDD from that list:
  rdd_a = sc.parallelize(a)
• Double each element:
  rdd_b = rdd_a.map(lambda x: x * 2)
• Sum the elements up:
  c = rdd_b.reduce(lambda x, y: x + y)
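The same computation can also be packaged as a standalone application and submitted to the cluster rather than typed into the shell. A minimal sketch (the file name simple_math.py is a placeholder), which could be run with /root/spark/bin/spark-submit simple_math.py:

  # simple_math.py -- standalone version of the shell example above
  from pyspark import SparkContext

  if __name__ == "__main__":
      sc = SparkContext(appName="SimpleMath")    # app name is a placeholder
      a = [1, 3, 7, 4, 2]                        # input list
      rdd_a = sc.parallelize(a)                  # distribute the list as an RDD
      rdd_b = rdd_a.map(lambda x: x * 2)         # double each element
      c = rdd_b.reduce(lambda x, y: x + y)       # sum the elements
      print(c)                                   # 34
      sc.stop()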