Data Science in the Wild

Lecture 3: ETL - Extract, Transform, Load

Eran Toch

Data Science in the Wild, Spring 2019

The Data Science Model

Ask question

Data Engineering

World's Data

Experiment Learn Analyze

Visualize Understand

Write

Report

Operationalize

System

ETL Pipeline

Extract

Transform & Clean

Load

Sources

DW
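
The three stages above can be sketched in a few lines of Python. This is only an illustrative sketch, not the lecture's own code: the sample CSV, the `people` table, and the function names `extract`, `transform`, `load`, and `run_pipeline` are all hypothetical, and an in-memory SQLite database stands in for the DW. Dropping rows with a missing value is just one possible cleaning policy.

```python
import csv
import io
import sqlite3

# Hypothetical source data: an in-memory CSV standing in for a real source system.
RAW_CSV = """id,name,age
1, Alice ,34
2,Bob,
3,carol,29
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform & clean: trim whitespace, normalize names, drop rows with no age."""
    cleaned = []
    for row in rows:
        age = row["age"].strip()
        if not age:
            continue  # simplest missing-data policy (an assumption): drop the row
        cleaned.append((int(row["id"]), row["name"].strip().title(), int(age)))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people "
        "(id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
    )
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Chain the three stages: Sources -> Extract -> Transform & Clean -> Load -> DW."""
    load(transform(extract(text)), conn)


conn = sqlite3.connect(":memory:")  # in-memory SQLite stands in for the DW
run_pipeline(RAW_CSV, conn)
rows = conn.execute("SELECT id, name, age FROM people ORDER BY id").fetchall()
print(rows)
```

Keeping each stage as a separate function makes it easier to swap one source or one cleaning rule without touching the rest of the pipeline.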

ETL: Practical Considerations

• Typically, ETL takes 80% of the development time in a DW project (Vassiliadis et al.).

• ETL is particularly difficult to generalize beyond one data science task
  • Why?

Agenda

1. ETL Processes
2. Pandas
3. Cleaning datasets
4. Handling missing data
5. Handling outliers
6. Understanding data sources
