Extending Machine Learning Algorithms Databricks with ...

Extending Machine Learning Algorithms with PySpark

Karen Feng, Kiavash Kianfar Databricks

Agenda

Discuss using PySpark (especially Pandas UDFs) to perform machine learning at unprecedented scale

Learn about an application for a genomics use case (GloWGR)

Design decisions

1. Problem: Genomic data are growing too quickly for existing tools Solution: Use big data tools (Spark)

Design decisions

1. Problem: Genomic data are growing too quickly for existing tools Solution: Use big data tools (Spark)

2. Problem: Bioinformaticians are not familiar with the native languages used by big data tools (Scala) Solution: Provide clients for high-level languages (Python)

Design decisions

1. Problem: Genomic data are growing too quickly for existing tools Solution: Use big data tools (Spark)

2. Problem: Bioinformaticians are not familiar with the native languages used by big data tools (Scala) Solution: Provide clients for high-level languages (Python)

3. Problem: Performant, maintainable machine learning algorithms are difficult to write natively in big data tools (Spark SQL expressions) Solution: Write algorithms in high-level languages and link them to big data tools (PySpark)

Problem 1

Genomic data are growing too fast for existing tools

Genomic data are growing at an exponential pace

Genomic data are growing at an exponential pace

Biobank datasets are growing in scale

? Next-generation sequencing

? Genotyping arrays (1Mb) ? Whole exome sequence (39Mb) ? Whole genome sequence (3200Mb)

? 1,000s of samples 100,000s of samples

? 10s of traits 1000s of traits

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download