CSE481 - Colab 1

10/26/2020

[CLASS] CSE481 - Spark Demo In Class.ipynb - Colaboratory


Spark Tutorial

In this tutorial you will learn how to use Apache Spark in local mode on a Colab environment. Credits to Tiziano Piccardi for his Spark Tutorial used in the Applied Data Analysis class at EPFL. Adapted from Stanford's CS246.

Setup

Let's setup Spark on your Colab environment. Run the cell below!

!pip install pyspark
!pip install -U -q PyDrive
!apt update
!apt install openjdk-8-jdk-headless -qq

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"




Collecting pyspark
  Downloading || 204.2MB 26kB/s
Collecting py4j==0.10.9
  Downloading || 204kB 46.7MB/s
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.1-py2.py3-none-any.whl size=204612243 sha256=11dc840c033e07b4adeaf1f
  Stored in directory: /root/.cache/pip/wheels/5e/bd/07/031766ca628adec8435bb40f0bd83bb676ce65ff4007f8e73f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.1
Get:1 bionic InRelease [15.9 kB]
Hit:2 bionic InRelease
Get:3 bionic-updates InRelease [88.7 kB]
Hit:4 bionic InRelease
Get:5 bionic-cran40/ InRelease [3,626 B]
Get:6 bionic-backports InRelease [74.6 kB]
Get:7 bionic-security InRelease [88.7 kB]
Ign:8 InRelease
Get:9 bionic/main Sources [1,685 kB]
Ign:10 InRelease
Get:11 Release [697 B]
Get:12 Release [564 B]
Get:13 Release.gpg [836 B]
Get:14 Release.gpg [833 B]
Get:15 bionic/main amd64 Packages [863 kB]
Get:16 bionic-cran40/ Packages [39.3 kB]
Get:17 bionic-updates/main amd64 Packages [2,165 kB]
Get:18 bionic-updates/universe amd64 Packages [2,115 kB]
Get:19 bionic-updates/restricted amd64 Packages [239 kB]
Get:20 bionic-updates/multiverse amd64 Packages [45.9 kB]
Ign:21 Packages
Get:21 Packages [370 kB]
Get:22 Packages [57.0 kB]
Get:23 bionic-security/universe amd64 Packages [1,352 kB]
Get:24 bionic-security/multiverse amd64 Packages [15.4 kB]
Get:25 bionic-security/main amd64 Packages [1,745 kB]
Get:26 bionic-security/restricted amd64 Packages [211 kB]
Fetched 11.2 MB in 3s (4,118 kB/s)
Reading package lists... Done
Building dependency tree




Reading state information... Done
57 packages can be upgraded. Run 'apt list --upgradable' to see them.
The following additional packages will be installed:
  openjdk-8-jre-headless
Suggested packages:
  openjdk-8-demo openjdk-8-source libnss-mdns fonts-dejavu-extra fonts-ipafont-gothic fonts-ipafont-mincho fonts-wqy-microhei fonts-wqy-zenhei fonts-indic
The following NEW packages will be installed:
  openjdk-8-jdk-headless openjdk-8-jre-headless
0 upgraded, 2 newly installed, 0 to remove and 57 not upgraded.
Need to get 35.8 MB of archives.
After this operation, 140 MB of additional disk space will be used.
Selecting previously unselected package openjdk-8-jre-headless:amd64.
(Reading database ... 144611 files and directories currently installed.)
Preparing to unpack .../openjdk-8-jre-headless_8u265-b01-0ubuntu2~18.04_amd64.deb ...
Unpacking openjdk-8-jre-headless:amd64 (8u265-b01-0ubuntu2~18.04) ...
Selecting previously unselected package openjdk-8-jdk-headless:amd64.
Preparing to unpack .../openjdk-8-jdk-headless_8u265-b01-0ubuntu2~18.04_amd64.deb ...
Unpacking openjdk-8-jdk-headless:amd64 (8u265-b01-0ubuntu2~18.04) ...
Setting up openjdk-8-jre-headless:amd64 (8u265-b01-0ubuntu2~18.04) ...
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/orbd to provide /usr/bin/orbd (orbd) in auto mode
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/servertool to provide /usr/bin/servertool (server

update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/tnameserv to provide /usr/bin/tnameserv (tnameser
Setting up openjdk-8-jdk-headless:amd64 (8u265-b01-0ubuntu2~18.04) ...
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/idlj to provide /usr/bin/idlj (idlj) in auto mode
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/wsimport to provide /usr/bin/wsimport (wsimport) in a
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/jsadebugd to provide /usr/bin/jsadebugd (jsadebugd) i
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/native2ascii to provide /usr/bin/native2ascii (native
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/javah to provide /usr/bin/javah (javah) in auto mode
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/clhsdb to provide /usr/bin/clhsdb (clhsdb) in auto mo
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/jhat to provide /usr/bin/jhat (jhat) in auto mode
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/extcheck to provide /usr/bin/extcheck (extcheck) in a
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/hsdb to provide /usr/bin/hsdb (hsdb) in auto mode
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/schemagen to provide /usr/bin/schemagen (schemagen) i
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/xjc to provide /usr/bin/xjc (xjc) in auto mode
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/wsgen to provide /usr/bin/wsgen (wsgen) in auto mode

Now we authenticate a Google Drive client to download the file we will be processing in our Spark job. Make sure to follow the interactive instructions.

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

id='1L6pCQkldvdBoaEhRFzL0VnrggEFvqON4'
downloaded = drive.CreateFile({'id': id})




downloaded.GetContentFile('Bombing_Operations.json.gz')

id='14dyBmcTBA32uXPxDbqr0bFDIzGxMTWwl'
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('Aircraft_Glossary.json.gz')

If you executed the cells above, you should be able to see the files Bombing_Operations.json.gz and Aircraft_Glossary.json.gz under the "Files" tab on the left panel.
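You can also confirm the downloads from Python itself. This is a small standard-library sketch (the helper `missing_files` is not part of the original notebook; the filenames are the ones downloaded above):

```python
import os

def missing_files(names):
    """Return the subset of names that are not present in the current directory."""
    return [n for n in names if not os.path.exists(n)]

# The two archives downloaded in the cells above; an empty list means both arrived
print(missing_files(["Bombing_Operations.json.gz", "Aircraft_Glossary.json.gz"]))
```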

# Let's import the libraries we will need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import pyspark
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

Let's initialize the Spark context.

# create the session
conf = SparkConf().set("spark.ui.port", "4050")

# create the context
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

You can easily check the current version and get the link of the web interface. In the Spark UI, you can monitor the progress of your job and debug the performance bottlenecks (if your Colab is running with a local runtime).

spark




SparkSession - in-memory

SparkContext

Spark UI

Version  v3.0.1
Master   local[*]
AppName  pyspark-shell


If you are running this Colab on the Google hosted runtime, the cell below will create a ngrok tunnel which will allow you to still check the Spark UI.

!wget
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('./ngrok http 4050 &')
!curl | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url']);"
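The final command queries ngrok's local API (the URL was stripped from this copy of the notebook) and pulls the first tunnel's public URL out of the returned JSON. The same extraction as a standalone sketch, with a hypothetical response of the shape that one-liner expects (`first_public_url` and the sample string are illustrations, not part of the notebook):

```python
import json

def first_public_url(api_json: str) -> str:
    """Extract tunnels[0].public_url from an ngrok /api/tunnels-style response."""
    return json.loads(api_json)["tunnels"][0]["public_url"]

# Hypothetical response body with the fields the one-liner above reads
sample = '{"tunnels": [{"public_url": "https://abcd1234.ngrok.io"}]}'
print(first_public_url(sample))  # -> https://abcd1234.ngrok.io
```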




--2020-10-26 21:53:52--
Resolving bin.equinox.io (bin.equinox.io)... 34.195.187.253, 34.198.20.103, 52.73.16.193, ...
Connecting to bin.equinox.io (bin.equinox.io)|34.195.187.253|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13773305 (13M) [application/octet-stream]
Saving to: `ngrok-stable-linux-amd64.zip.2'
ngrok-stable-linux-amd64 100%[===================>]  13.13M  13.8MB/s  in 1.0s
2020-10-26 21:53:54 (13.8 MB/s) - `ngrok-stable-linux-amd64.zip.2' saved [13773305/13773305]
Archive:  ngrok-stable-linux-amd64.zip
replace ngrok? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ngrok
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   966  100   966    0     0  69000      0 --:--:-- --:--:-- --:--:-- 69000

Vietnam War

Pres. Johnson: What do you think about this Vietnam thing? I'd like to hear you talk a little bit.

Sen. Russell: Well, frankly, Mr. President, it's the damn worse mess that I ever saw, and I don't like to brag and I never have been right many times in my life, but I knew that we were going to get into this sort of mess when we went in there.

May 27, 1964



The Vietnam War, also known as the Second Indochina War, and in Vietnam as the Resistance War Against America or simply the American War, was a conflict that occurred in Vietnam, Laos, and Cambodia from 1 November 1955 to the fall of Saigon on 30 April 1975. It was the second of the Indochina Wars and was officially fought between North Vietnam and the government of South Vietnam.

The dataset describes all the air force operations during the Vietnam War.

Bombing_Operations Get the dataset here




AirCraft: Aircraft model (example: EC-47)
ContryFlyingMission: Country
MissionDate: Date of the mission
OperationSupported: Supported War operation (example: Operation Rolling Thunder)
PeriodOfDay: Day or night
TakeoffLocation: Take off airport
TimeOnTarget
WeaponType
WeaponsLoadedWeight

Aircraft_Glossary Get the dataset here

AirCraft: Aircraft model (example: EC-47)
AirCraftName
AirCraftType

Dataset Information:

THOR is a painstakingly cultivated database of historic aerial bombings from World War I through Vietnam. THOR has already proven useful in finding unexploded ordnance in Southeast Asia and improving Air Force combat tactics:

Load the datasets:

Bombing_Operations = spark.read.json("Bombing_Operations.json.gz")
Aircraft_Glossary = spark.read.json("Aircraft_Glossary.json.gz")
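spark.read.json consumes these files as JSON Lines: one JSON object per line, with gzip decompressed transparently. To inspect what such a record looks like without starting Spark, here is a small standard-library sketch (the helper `peek_json_gz` is an illustration, not part of the notebook; run it on the files only after the download cells):

```python
import gzip
import json
from itertools import islice

def peek_json_gz(path, n=2):
    """Return the first n records of a gzipped JSON Lines file,
    the format spark.read.json reads here."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, n)]

# e.g. peek_json_gz("Bombing_Operations.json.gz")  # after the download cells
```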

Check the schema:

Bombing_Operations.printSchema()




root
 |-- AirCraft: string (nullable = true)
 |-- ContryFlyingMission: string (nullable = true)
 |-- MissionDate: string (nullable = true)
 |-- OperationSupported: string (nullable = true)
 |-- PeriodOfDay: string (nullable = true)
 |-- TakeoffLocation: string (nullable = true)
 |-- TargetCountry: string (nullable = true)
 |-- TimeOnTarget: double (nullable = true)
 |-- WeaponType: string (nullable = true)
 |-- WeaponsLoadedWeight: long (nullable = true)
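Note that MissionDate is inferred as a string, not a date type; the sample rows later in the notebook use ISO format such as '1971-06-05'. A minimal sketch of turning it into a real date in plain Python, assuming that format holds throughout (the helper `parse_mission_date` is an illustration, not part of the notebook):

```python
from datetime import date

def parse_mission_date(s: str) -> date:
    """Parse a MissionDate string such as '1971-06-05' (assumed ISO format)."""
    return date.fromisoformat(s)

print(parse_mission_date("1971-06-05"))  # -> 1971-06-05
```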

Aircraft_Glossary.printSchema()

root
 |-- AirCraft: string (nullable = true)
 |-- AirCraftName: string (nullable = true)
 |-- AirCraftType: string (nullable = true)

Get a sample with take() :

Bombing_Operations.take(3)

[Row(AirCraft='EC-47', ContryFlyingMission='UNITED STATES OF AMERICA', MissionDate='1971-06-05', OperationSupported=No
 Row(AirCraft='EC-47', ContryFlyingMission='UNITED STATES OF AMERICA', MissionDate='1972-12-26', OperationSupported=No
 Row(AirCraft='RF-4', ContryFlyingMission='UNITED STATES OF AMERICA', MissionDate='1973-07-28', OperationSupported=Non

Get a formatted sample with show() :

Aircraft_Glossary.show()



