Magpie
Python at Speed and Scale using Cloud Backends
Alekh Jindal, K. Venkatesh Emani, Maureen Daum, Olga Poppe, Brandon Haynes, Anna Pavlenko, Ayushi Gupta, Karthik Ramachandra, Carlo Curino, Andreas Mueller, Wentao Wu, Hiren Patel
Gray Systems Lab, Microsoft Azure Data · Microsoft Research · University of Washington
The Python and The Cloud
Python:
- De-facto for ad-hoc analysis
- Pandas dataframes highly popular
- Performance is a challenge
Cloud:
- Hyper-scale performance
- Several SQL processing backends
- Enterprise data already on the cloud
The current landscape ... is a fragmented jungle!
[Figure: the fragmented landscape. Higher-level abstractions span dataframe APIs (Vaex, Dask, CUDA/RAPIDS, Ray, PySpark), Ibis, SQL with user-defined functions, SQL with built-in functions, and SQL extensions. The data layer spans NumPy arrays, distributed dataframes, and relational tables. Backends include native Python, Dask, NVIDIA RAPIDS, Ray, PySpark, Azure Synapse Analytics, Apache Spark, PostgreSQL, Microsoft SCOPE, Apache MADlib, Google BigQuery, and Microsoft SQL Server. One side extends the Python ecosystem; the other extends SQL databases.]
Magpie
[Figure: the Magpie stack. A Pythonic environment exposes a unified dataframe API on top of the Magpie middleware, which comprises the PyFroid compiler, cross optimization, a common data layer, and polyengines & mappers. Backends include native Python, database engines, Azure Synapse Analytics, Microsoft SCOPE, Apache Spark, Apache MADlib, Google BigQuery, PostgreSQL, and SQL Server.]
- Familiar Python surface
- Ongoing standardization
- Batching Pandas into large query expressions
- Backend selection using past workloads (see the sketch after this list)
- Caching commonly seen dataframes
- Multi-backend environments and libraries
- Cloud backends
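The middleware's backend-selection and caching decisions are learned from past workloads; the toy sketch below only illustrates the idea. The Backend class, the cost constants, and choose_backend are hypothetical and are not Magpie's actual implementation.

# Illustrative sketch only: a toy cost-based backend selector.
# All names and numbers here are assumptions, not Magpie internals.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    startup_cost_s: float    # assumed fixed overhead per query
    rows_per_second: float   # assumed sustained throughput

BACKENDS = [
    Backend("pandas",   startup_cost_s=0.0, rows_per_second=2e5),
    Backend("postgres", startup_cost_s=0.5, rows_per_second=2e6),
    Backend("sql_dw",   startup_cost_s=5.0, rows_per_second=5e7),
]

def choose_backend(estimated_rows: int) -> Backend:
    """Pick the backend with the lowest estimated cost for this input size."""
    return min(BACKENDS,
               key=lambda b: b.startup_cost_s + estimated_rows / b.rows_per_second)

print(choose_backend(10_000).name)       # small inputs stay on local pandas
print(choose_backend(10_000_000).name)   # large inputs go to the warehouse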
Batching Pandas
[Figure: Pandas calls enter through the Ibis API and are lazily translated into an Ibis expression. Blue parts already exist in Ibis; green parts are our contributions.]
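As an illustration of the lazy-translation step (hypothetical class and method names, not the actual PyFroid compiler), a wrapper can record pandas-style calls instead of executing them, so the whole program becomes one batched expression:

# Illustrative sketch of lazy translation. Each call records an operation
# rather than executing it; the accumulated log would be turned into a single
# query expression (e.g. an Ibis expression) at the end.
class LazyFrame:
    def __init__(self, table, ops=None):
        self.table = table
        self.ops = ops or []

    def filter(self, predicate):
        return LazyFrame(self.table, self.ops + [("filter", predicate)])

    def assign(self, **cols):
        return LazyFrame(self.table, self.ops + [("assign", cols)])

    def groupby_sum(self, key, value):
        return LazyFrame(self.table, self.ops + [("groupby_sum", (key, value))])

    def to_expression(self):
        # A real implementation would emit an Ibis expression here.
        return self.ops

df = (LazyFrame("nyctaxi")
      .filter("fare_amount > 0")
      .assign(day="EXTRACT(WEEKDAY FROM pickup_datetime)")
      .groupby_sum("day", "passenger_count"))
print(df.to_expression())   # the whole pipeline, available as one expression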
The number of taxi trips per weekday over the NYC Taxi dataset
import pyfroid.pandas as pd                          # vs. import pandas as pd
df = pd.read_sql('nyctaxi', con)                     # fetch data
df = df[df.fare_amount > 0]                          # filter bad rows
df['day'] = df.pickup_datetime.dt.dayofweek          # add features
df = df.groupby(['day'])['passenger_count'].sum()    # aggregation
print(df)                                            # use dataframe
Pandas Dataframe Program
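For comparison, a minimal sketch of roughly the same query written directly against the Ibis API; exact method names (e.g. day_of_week.index) vary across Ibis versions, and the connection details are placeholders:

# Sketch only: the NYC Taxi query expressed directly in Ibis.
import ibis

con = ibis.postgres.connect(database="taxi")   # placeholder connection details
t = con.table("nyctaxi")

expr = t.filter(t.fare_amount > 0)                                  # filter bad rows
expr = expr.mutate(day=expr.pickup_datetime.day_of_week.index())    # add features
expr = expr.group_by("day").aggregate(                              # aggregation
    passenger_count=expr.passenger_count.sum())

print(expr.compile())   # backend-specific SQL for the whole pipeline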
[Figure: the intermediate representation. A SELECTION over 'nyctaxi' keeps rows where fare_amount > 0 and derives week_day by extracting the weekday from the pickup_datetime timestamp; an AGGREGATION then groups by week_day and computes SUM(passenger_count).]
Intermediate Representation
Pushing Data Science down
[Figure: Pandas calls enter through the Ibis API, are lazily translated into an Ibis expression, and are compiled to SQL, Spark, Pandas, and other targets for backends such as Postgres, MySQL, SQL DW, and SCOPE. Blue parts already exist in Ibis; green parts are our contributions.]
The same intermediate representation as above is compiled into the following T-SQL statement:
SELECT DATEPART(WEEKDAY, pickup_datetime) AS day, SUM(passenger_count)
FROM nyctaxi
WHERE fare_amount > 0
GROUP BY DATEPART(WEEKDAY, pickup_datetime)
T-SQL Statement
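Only the per-weekday aggregate needs to travel back to the Python session. A minimal sketch of running the generated T-SQL and reading the small result into pandas; con is the same placeholder connection used in the example program:

# Sketch: execute the generated T-SQL and fetch only the aggregate
# (at most 7 rows) back into a local pandas DataFrame.
import pandas as pd

sql = """
SELECT DATEPART(WEEKDAY, pickup_datetime) AS day,
       SUM(passenger_count) AS passenger_count
FROM nyctaxi
WHERE fare_amount > 0
GROUP BY DATEPART(WEEKDAY, pickup_datetime)
"""
result = pd.read_sql(sql, con)   # con: placeholder database connection
print(result)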
Cloud backends
[Figure: SQL DW execution plan: Get, Filter, Project, Group-by Aggregates, Shuffle (Cost: 100%), Group-by Aggregates, SELECT (Cost: 0%).]
SQL DW Execution Plan
Pushing Data Science down
[Figure: the same pipeline and intermediate representation as above, this time lowered to the SCOPE backend.]
df = script.extract(path, schema)
        .select("fare_amount > 0")
        .groupby("day")
        .project("pickup_datetime.DayOfWeek.ToString() AS day", "passenger_count")
SCOPE Script
[Figure: the corresponding SCOPE execution plan on the cloud backend.]
Impact: speed-up using SQL DW
[Charts: speedup using SQL DW. With growing input size (10k to 10m rows), speedups range from 1.4x to 20x. With growing query complexity (1 to 4 joins), speedups range from 7x to 380x.]