Magpie
Python at Speed and Scale using Cloud Backends
Alekh Jindal, K. Venkatesh Emani, Maureen Daum, Olga Poppe, Brandon Haynes, Anna Pavlenko, Ayushi Gupta, Karthik Ramachandra, Carlo Curino, Andreas Mueller, Wentao Wu, Hiren Patel
Gray Systems Lab, Microsoft Azure Data · Microsoft Research · University of Washington
The Python and The Cloud
Python:
- De-facto for ad-hoc analysis
- Pandas dataframes highly popular
- Performance is a challenge
Cloud:
- Hyper-scale performance
- Several SQL processing backends
- Enterprise data already on the cloud
The current landscape ... is a fragmented jungle!
[Figure: the fragmented landscape. Higher-level abstractions span dataframe APIs (Vaex, Dask, CUDA/RAPIDS, Ray, PySpark), Ibis, SQL with user-defined functions, SQL with built-in functions, and SQL extensions. The data layer spans NumPy arrays, distributed dataframes, and relational tables. Backends include native Python, Dask, NVIDIA RAPIDS, Ray, PySpark, Azure Synapse Analytics, Apache Spark, PostgreSQL, Microsoft SCOPE, Apache MADlib, Google BigQuery, and Microsoft SQL Server. One side extends the Python ecosystem; the other extends SQL databases.]
Magpie
[Figure: the Magpie stack. A Pythonic environment exposes a unified dataframe API on top of the Magpie middleware, which comprises the PyFroid compiler, cross optimization, a common data layer, and polyengines & mappers. Backends include native Python, database engines, Azure Synapse Analytics, Microsoft SCOPE, Apache Spark, Apache MADlib, Google BigQuery, PostgreSQL, and SQL Server.]
- Familiar Python surface
- Ongoing standardization
- Batching Pandas into large query expressions
- Backend selection using past workloads (see the sketch after this list)
- Caching commonly seen dataframes
- Multi-backend environments and libraries
- Cloud backends
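The middleware's backend-selection and caching decisions are learned from past workloads; the toy sketch below only illustrates the idea. The Backend class, the cost constants, and choose_backend are hypothetical and are not Magpie's actual implementation.

# Illustrative sketch only: a toy cost-based backend selector.
# All names and numbers here are assumptions, not Magpie internals.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    startup_cost_s: float    # assumed fixed overhead per query
    rows_per_second: float   # assumed sustained throughput

BACKENDS = [
    Backend("pandas",   startup_cost_s=0.0, rows_per_second=2e5),
    Backend("postgres", startup_cost_s=0.5, rows_per_second=2e6),
    Backend("sql_dw",   startup_cost_s=5.0, rows_per_second=5e7),
]

def choose_backend(estimated_rows: int) -> Backend:
    """Pick the backend with the lowest estimated cost for this input size."""
    return min(BACKENDS,
               key=lambda b: b.startup_cost_s + estimated_rows / b.rows_per_second)

print(choose_backend(10_000).name)       # small inputs stay on local pandas
print(choose_backend(10_000_000).name)   # large inputs go to the warehouse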
Batching Pandas
[Figure: Pandas calls enter through the Ibis API and are lazily translated into an Ibis expression. Blue parts already exist in Ibis; green parts are our contributions.]
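As an illustration of the lazy-translation step (hypothetical class and method names, not the actual PyFroid compiler), a wrapper can record pandas-style calls instead of executing them, so the whole program becomes one batched expression:

# Illustrative sketch of lazy translation. Each call records an operation
# rather than executing it; the accumulated log would be turned into a single
# query expression (e.g. an Ibis expression) at the end.
class LazyFrame:
    def __init__(self, table, ops=None):
        self.table = table
        self.ops = ops or []

    def filter(self, predicate):
        return LazyFrame(self.table, self.ops + [("filter", predicate)])

    def assign(self, **cols):
        return LazyFrame(self.table, self.ops + [("assign", cols)])

    def groupby_sum(self, key, value):
        return LazyFrame(self.table, self.ops + [("groupby_sum", (key, value))])

    def to_expression(self):
        # A real implementation would emit an Ibis expression here.
        return self.ops

df = (LazyFrame("nyctaxi")
      .filter("fare_amount > 0")
      .assign(day="EXTRACT(WEEKDAY FROM pickup_datetime)")
      .groupby_sum("day", "passenger_count"))
print(df.to_expression())   # the whole pipeline, available as one expression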
The number of taxi trips per weekday over the NYC Taxi dataset
import pyfroid.pandas as pd                          # vs. import pandas as pd
df = pd.read_sql('nyctaxi', con)                     # fetch data
df = df[df.fare_amount > 0]                          # filter bad rows
df['day'] = df.pickup_datetime.dt.dayofweek          # add features
df = df.groupby(['day'])['passenger_count'].sum()    # aggregation
print(df)                                            # use dataframe
Pandas Dataframe Program
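For comparison, a minimal sketch of roughly the same query written directly against the Ibis API; exact method names (e.g. day_of_week.index) vary across Ibis versions, and the connection details are placeholders:

# Sketch only: the NYC Taxi query expressed directly in Ibis.
import ibis

con = ibis.postgres.connect(database="taxi")   # placeholder connection details
t = con.table("nyctaxi")

expr = t.filter(t.fare_amount > 0)                                  # filter bad rows
expr = expr.mutate(day=expr.pickup_datetime.day_of_week.index())    # add features
expr = expr.group_by("day").aggregate(                              # aggregation
    passenger_count=expr.passenger_count.sum())

print(expr.compile())   # backend-specific SQL for the whole pipeline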
[Figure: the intermediate representation. A SELECTION over 'nyctaxi' keeps rows where fare_amount > 0 and derives week_day by extracting the weekday from the pickup_datetime timestamp; an AGGREGATION then groups by week_day and computes SUM(passenger_count).]
Intermediate Representation
Pushing Data Science down
[Figure: Pandas calls enter through the Ibis API, are lazily translated into an Ibis expression, and are compiled to SQL, Spark, Pandas, and other targets for backends such as Postgres, MySQL, SQL DW, and SCOPE. Blue parts already exist in Ibis; green parts are our contributions.]
The same intermediate representation as above is compiled into the following T-SQL statement:
SELECT DATEPART(WEEKDAY, pickup_datetime) AS day, SUM(passenger_count)
FROM nyctaxi
WHERE fare_amount > 0
GROUP BY DATEPART(WEEKDAY, pickup_datetime)
T-SQL Statement
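Only the per-weekday aggregate needs to travel back to the Python session. A minimal sketch of running the generated T-SQL and reading the small result into pandas; con is the same placeholder connection used in the example program:

# Sketch: execute the generated T-SQL and fetch only the aggregate
# (at most 7 rows) back into a local pandas DataFrame.
import pandas as pd

sql = """
SELECT DATEPART(WEEKDAY, pickup_datetime) AS day,
       SUM(passenger_count) AS passenger_count
FROM nyctaxi
WHERE fare_amount > 0
GROUP BY DATEPART(WEEKDAY, pickup_datetime)
"""
result = pd.read_sql(sql, con)   # con: placeholder database connection
print(result)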
Cloud backends
[Figure: SQL DW execution plan: Get, Filter, Project, Group-by Aggregates, Shuffle (Cost: 100%), Group-by Aggregates, SELECT (Cost: 0%).]
SQL DW Execution Plan
Pushing Data Science down
[Figure: the same pipeline and intermediate representation as above, this time lowered to the SCOPE backend.]
df = script.extract(path, schema)
        .select("fare_amount > 0")
        .groupby("day")
        .project("pickup_datetime.DayOfWeek.ToString() AS day", "passenger_count")
SCOPE Script
[Figure: the corresponding SCOPE execution plan on the cloud backend.]
Impact: speed-up using SQL DW
[Charts: speedup using SQL DW. With growing input size (10k to 10m rows), speedups range from 1.4x to 20x. With growing query complexity (1 to 4 joins), speedups range from 7x to 380x.]