
EECS E6893 Big Data Analytics

HW1: Clustering, Classification, and Spark MLlib

Hritik Jain, hj2533@columbia.edu

11/06/2020


Agenda

● Spark Dataframe
● Spark SQL
● Spark MLlib
● HW1
    ○ Iterative K-means clustering
    ○ Logistic Regression


Spark Dataframe

● An abstraction: an immutable, distributed collection of data, like an RDD
● Data is organized into named columns, like a table in a database
● Can be created from an RDD, a Hive table, or other data sources
● Easy conversion to and from a pandas DataFrame

3

Spark Dataframe: read from csv file


Spark Dataframe: common operations

