PySpark in Apache Spark 3.3 and Beyond - Databricks

Hyukjin Kwon

Software Engineer, Databricks

Xinrong Meng

Software Engineer, Databricks

Who are you?

Hyukjin Kwon

• @HyukjinKwon on GitHub
• Tech lead, PySpark team @ Databricks
• Top 2 contributor in Apache Spark
• PySpark, SparkR, Spark SQL, etc.

Xinrong Meng

• @xinrong-databricks on GitHub
• PySpark team @ Databricks
• Major contributor in PySpark

Project Zen

• Be Pythonic
• Better and easier use of PySpark
• Better interoperability with other Python libraries

Pandas API on Spark

pandas provides data structures for in-memory analytics ... using pandas to analyze datasets that are larger than memory datasets somewhat tricky. ... it's worth considering not using pandas. pandas isn't the right tool for all situations. ...

Pandas API on Spark

>>> from pandas import read_csv
>>> from pyspark.pandas import read_csv
>>> df = read_csv("data.csv")

Drop-in replacement

• Pandas API on Upcoming Apache Spark™ 3.2
• SPIP: Support pandas API layer on PySpark

What is this talk about?

What's new?

• Pandas API on Spark
  • Faster default index
  • Better API coverage

• New Functionalities
  • datetime.timedelta support
  • PyArrow batch interface
  • Python standard string formatter in sql

• Productivity
  • Better autocompletion
  • Python/Pandas UDF profiler
  • Error classification

What's next?

• Usability
  • Spark Connect project
  • Py4J improvement
  • Native NumPy support
  • Better docstrings

• Performance
  • Source-native index in Pandas API on Spark
  • Optimized createDataFrame with Arrow

• Feature parity
  • Observable API for Structured Streaming
  • Latest pandas API in Pandas API on Spark

What's new?

Pandas API on Spark

Faster default index

>>> import pandas as pd
>>> pd.DataFrame({"col": ["a", "b", "c"]})
  col
0   a
1   b
2   c

pandas' default (range) index

Sequence increasing one by one, challenging in distributed computation
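
Why a one-by-one sequence is hard to distribute can be illustrated with a small pure-Python sketch. It is not Spark's implementation: partitions are simulated as plain lists, and the helper name `assign_sequential_index` is hypothetical. The point is that no partition knows where its numbering starts until the sizes of all earlier partitions have been collected.

```python
from itertools import accumulate


def assign_sequential_index(partitions):
    """Attach a global 0, 1, 2, ... index to rows spread over partitions.

    Hypothetical helper for illustration only; partitions are plain lists.
    """
    # Pass 1: every partition reports its row count. In a real cluster this
    # is an extra job over the data before any index can be assigned.
    counts = [len(part) for part in partitions]
    # A partition's starting offset is the total size of all earlier partitions.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Pass 2: with its offset known, each partition numbers its rows on its own.
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
indexed = assign_sequential_index(partitions)
for part in indexed:
    print(part)
```

The two passes are the cost: the counting step forces coordination across all partitions, which a single-machine pandas range index never needs.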
