Apache Spark - Europa

Apache Spark

Lorenzo Di Gaetano

THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

Eurostat

What is Apache Spark?

• A general-purpose framework for big data processing

• It interfaces with many distributed storage systems, such as HDFS (Hadoop Distributed File System), Amazon S3, Apache Cassandra and many others

• Up to 100 times faster than Hadoop MapReduce for in-memory computation

Multilanguage API

• You can write applications in various languages

  • Java
  • Python
  • Scala
  • R

• In the context of this course we will consider Python

Built-in Libraries

Third party libraries

• Many third-party libraries are available

• We used spark-csv in our examples

• We will see later how to use an external jar in our application

Running Spark

• Once Spark is correctly installed, you can use it in two ways:

• spark-submit: the CLI command used to launch Python Spark applications

• pyspark: the command used to launch an interactive Python shell
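
As an illustration of the spark-submit route, the sketch below only assembles the command line that would launch an application; it does not run Spark itself. The script name my_app.py, the helper spark_submit_command and the local[2] master are assumptions made for this example and do not come from the course material.

```python
import shlex


def spark_submit_command(script, master="local[2]", app_args=()):
    """Return the argument list for launching `script` with spark-submit.

    `master` is passed through spark-submit's --master option; the default
    "local[2]" (two local worker threads) is only an illustrative choice.
    """
    return ["spark-submit", "--master", master, script, *app_args]


# Hypothetical application file, used here only to show the command shape.
command = spark_submit_command("my_app.py")

# shlex.join quotes arguments so the line can be pasted into a shell.
print(shlex.join(command))
```

The resulting list could be handed to subprocess.run on a machine where spark-submit is on the PATH; for interactive exploration, typing pyspark in a terminal is enough.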

PySpark up and running!

SparkContext

• Every Spark application starts from the SparkContext

• It is the entry point to the Spark API

• With spark-submit, you have to create a SparkContext object manually in your code

• In the pyspark interactive shell, a SparkContext is automatically available in a variable called sc
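
The difference between the two cases can be sketched as follows. This is a minimal example, not the course's own code: the function names, the application name "MyApp" and the local[2] master are assumptions chosen for illustration.

```python
def create_context(app_name="MyApp", master="local[2]"):
    """Create the SparkContext that a spark-submit script must build itself."""
    # Imported inside the function so that count_even below can be loaded
    # and tried out even where pyspark is not installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setMaster(master)
    return SparkContext(conf=conf)


def count_even(sc, numbers):
    """Distribute `numbers` as an RDD and count the even values."""
    return sc.parallelize(numbers).filter(lambda n: n % 2 == 0).count()


def main():
    """Body of a script meant to be launched with spark-submit."""
    sc = create_context()
    try:
        print(count_even(sc, range(10)))
    finally:
        # Release the resources held by the context when the job is done.
        sc.stop()
```

In a script launched with spark-submit, main() would be called under the usual `if __name__ == "__main__":` guard. In the pyspark shell no context has to be created: count_even(sc, range(10)) can be called directly with the predefined sc.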
