Introduction to Big Data with Apache Spark

UC BERKELEY

This Lecture

Structured Data and Relational Databases

The Structured Query Language (SQL)

SQL and pySpark Joins

Review: Key Data Management Concepts

• A data model is a collection of concepts for describing data

• A schema is a description of a particular collection of data, using a given data model
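
To make the distinction concrete, here is a minimal sketch that assumes the relational data model and uses Python's built-in sqlite3 module. The table name, column names, and sample row are hypothetical and do not come from the lecture.

```python
import sqlite3

# Data model: relational (data is described as tables of rows with typed columns).
# Schema: one particular description written in that model (hypothetical table).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL, gpa REAL)"
)
conn.execute("INSERT INTO students VALUES (1, 'Ada', 3.9)")

# The schema can be inspected separately from the data it describes.
# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
schema = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(students)")]
rows = conn.execute("SELECT * FROM students").fetchall()
```

Here `schema` holds the column names and types, while `rows` holds the data itself. The same data model could describe many other schemas.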

Whither Structured Data?

• Conventional Wisdom:

  - Only 20% of data is structured.

• Decreasing due to:

  - Consumer applications

  - Enterprise search

  - Media applications



The Structure Spectrum

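Structured (schema-first) -> Semi-Structured (schema-later) -> Unstructured (schema-never)

Examples along the spectrum, left to right: Relational Database, Formatted Messages, Documents, XML, JSON, Tagged Text/Media, Plain Text, Media

[Figure: the structure spectrum, with a "This lecture" marker]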
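
The three points on the spectrum can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `messages` table, the JSON fields, and the sample text are hypothetical, and the sqlite3 and json modules stand in for whatever systems the lecture has in mind.

```python
import json
import sqlite3

# Schema-first (structured): the table definition must exist before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (sender TEXT, body TEXT)")
conn.execute("INSERT INTO messages VALUES ('alice', 'hello')")
structured = conn.execute("SELECT sender, body FROM messages").fetchone()

# Schema-later (semi-structured): a JSON document carries its own field names,
# and the reader discovers the structure after the data already exists.
doc = json.loads('{"sender": "alice", "body": "hello", "tags": ["greeting"]}')
discovered_fields = sorted(doc.keys())

# Schema-never (unstructured): plain text declares no fields at all;
# any structure is imposed by the program that reads it.
text = "alice says hello"
words = text.split()
```

In the first case the database enforces the structure, in the second the structure travels with each document, and in the third it exists only in the reading program's interpretation.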
