Introduction to Big Data with Apache Spark

UC BERKELEY

This Lecture

The Structure Spectrum

Files: Formats and Performance

Tabular Data: Examples, Challenges, pySpark DataFrames

Log Files

Review: The Big Picture

Extract

Transform

Load
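The three stages above can be sketched end to end with the Python standard library alone. This is a minimal illustration, not the lecture's own code: the sample CSV text, the `temps` table, and the function names `extract`, `transform`, and `load` are all hypothetical, and an in-memory SQLite database stands in for whatever target system is actually used.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would be a file or a log stream.
RAW_CSV = """name,temp_f
berkeley,68
oakland,72
"""

def extract(text):
    """Extract: parse raw CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Transform: capitalize city names and convert Fahrenheit to Celsius."""
    return [
        (r["name"].title(), round((float(r["temp_f"]) - 32) * 5 / 9, 1))
        for r in records
    ]

def load(rows, conn):
    """Load: write the cleaned rows into a target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS temps (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO temps VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # assumed target: an in-memory database
load(transform(extract(RAW_CSV)), conn)
result = conn.execute("SELECT city, temp_c FROM temps ORDER BY city").fetchall()
```

Keeping each stage as its own function makes it possible to test or replace one stage without touching the other two.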

Key Data Management Concepts

• A data model is a collection of concepts for describing data

• A schema is a description of a particular collection of data, using a given data model
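One way to make the distinction concrete: take the relational model as the data model (data is a table of rows, every row having the same named, typed columns) and write one schema in it. The sketch below is an assumption-laden illustration; `STUDENT_SCHEMA`, its columns, and the `conforms` helper are hypothetical names, not part of the lecture.

```python
# Data model (assumed here: relational): a table of rows, where every
# row has the same named, typed columns.
# Schema: one particular description written using that model.
STUDENT_SCHEMA = [("name", str), ("year", int), ("gpa", float)]

def conforms(row, schema):
    """Return True if the row has one value of the right type per column."""
    return len(row) == len(schema) and all(
        isinstance(value, col_type)
        for value, (_, col_type) in zip(row, schema)
    )

ok_row = ("Ada", 2, 3.9)       # matches the schema
bad_row = ("Bob", "two", 3.1)  # 'year' is a string, so it does not conform
```

The same data model could describe many other collections; only the schema would change.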

The Structure Spectrum

Structured (schema-first): Relational Database, Formatted Messages

Semi-Structured (schema-later): Documents, XML, Tagged Text/Media

Unstructured (schema-never): Plain Text, Media
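The "schema-later" and "schema-never" ends of the spectrum can be contrasted in a few lines. This is a rough sketch under stated assumptions: the JSON records and the sample sentence are made up for illustration, and JSON stands in for any self-describing semi-structured format.

```python
import json

# Hypothetical semi-structured records: each one carries its own field names.
lines = [
    '{"user": "ann", "action": "login"}',
    '{"user": "bob", "action": "upload", "bytes": 512}',
]
records = [json.loads(line) for line in lines]

# Schema-later: derive the set of fields after the data already exists.
inferred_fields = sorted({key for rec in records for key in rec})

# Schema-never: plain text has no fields, so the reader can only impose
# a structure of its own, here whitespace-separated tokens.
text = "ann logged in and bob uploaded a file"
tokens = text.split()
```

In a schema-first system, by contrast, the second record's extra `bytes` field would have to be declared before any data could be stored.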
