EECS E6893 Big Data Analytics Spark 101
Yvonne Lee, yl4573@columbia.edu
9/17/21
Agenda
- Functional programming in Python
  - Lambda
- Crash course in Spark (PySpark)
  - RDD
  - Useful RDD operations: actions, transformations
- Example: word count
Functional programming in Python
Lambda expression
- Creates small, one-time, anonymous function objects in Python
- Syntax: lambda arguments: expression
  - Any number of arguments; a single expression
- Can be used together with map, filter, and reduce
- Example:

    add = lambda x, y: x + y
    # equivalent to:
    # def add(x, y):
    #     return x + y

    type(add)  # <class 'function'>
    add(2, 3)  # 5
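A minimal sketch of lambdas used with map, filter, and reduce (the data here is an illustrative assumption, not from the slides):

    from functools import reduce  # in Python 3, reduce lives in functools

    nums = [1, 2, 3, 4, 5]

    squares = list(map(lambda x: x * x, nums))        # [1, 4, 9, 16, 25]
    evens = list(filter(lambda x: x % 2 == 0, nums))  # [2, 4]
    total = reduce(lambda x, y: x + y, nums)          # 15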
Crash course in Spark
Resilient Distributed Datasets (RDD)
- An abstraction: a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel
- Spark is RDD-centric:
  - RDDs are immutable
  - RDDs can be cached in memory
  - RDDs are computed lazily
  - RDDs know who their parents are
  - RDDs automatically recover from failures
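A minimal PySpark sketch of creating and using an RDD (the app name and data are illustrative assumptions, not from the slides):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-demo")  # hypothetical app name

    # Partition a local Python list across the cluster as an RDD
    rdd = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

    # Nothing is computed yet: transformations are lazy
    doubled = rdd.map(lambda x: x * 2)

    # An action triggers the actual computation
    print(doubled.collect())  # [2, 4, 6, 8, 10]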
Useful RDD Actions
- take(n): return the first n elements in the RDD as an array.
- collect(): return all elements of the RDD as an array. Use with caution.
- count(): return the number of elements in the RDD as an int.
- saveAsTextFile('path/to/dir'): save the RDD to files in a directory. Will create the directory if it doesn't exist and will fail if it does.
- foreach(func): execute the function against every element in the RDD, but don't keep any results.
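A short sketch exercising these actions (the data and output path are illustrative assumptions):

    rdd = sc.parallelize(["a", "b", "c", "d"])

    rdd.take(2)    # ['a', 'b']
    rdd.collect()  # ['a', 'b', 'c', 'd'] -- pulls everything to the driver
    rdd.count()    # 4

    # Writes one file per partition; fails if the directory already exists
    rdd.saveAsTextFile("output/letters")  # hypothetical path

    # Runs on the executors; results are not returned to the driver
    rdd.foreach(lambda x: print(x))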
Useful RDD transformations
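A minimal sketch of some commonly used transformations (map, filter, flatMap, reduceByKey); the choice of transformations and the data are our assumptions, not from the slides:

    nums = sc.parallelize([1, 2, 3, 4])

    nums.map(lambda x: x * 10).collect()         # [10, 20, 30, 40]
    nums.filter(lambda x: x % 2 == 0).collect()  # [2, 4]
    nums.flatMap(lambda x: [x, x]).collect()     # [1, 1, 2, 2, 3, 3, 4, 4]

    kv = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
    kv.reduceByKey(lambda x, y: x + y).collect() # [('a', 3), ('b', 1)]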
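Example: word count
The word count example from the agenda, as a hedged sketch (the input path is a hypothetical placeholder):

    lines = sc.textFile("input.txt")  # hypothetical input path

    counts = (lines
              .flatMap(lambda line: line.split())  # split each line into words
              .map(lambda word: (word, 1))         # pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b))    # sum the counts per word

    counts.take(5)  # first five (word, count) pairs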