Extending Machine Learning Algorithms with PySpark
Karen Feng, Kiavash Kianfar (Databricks)
Agenda
Discuss using PySpark (especially Pandas UDFs) to perform machine learning at unprecedented scale
Learn about an application for a genomics use case (GloWGR)
Design decisions
1. Problem: Genomic data are growing too quickly for existing tools
   Solution: Use big data tools (Spark)
2. Problem: Bioinformaticians are not familiar with the native languages used by big data tools (Scala)
   Solution: Provide clients for high-level languages (Python)
3. Problem: Performant, maintainable machine learning algorithms are difficult to write natively in big data tools (Spark SQL expressions)
   Solution: Write algorithms in high-level languages and link them to big data tools (PySpark)
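Design decision 3 can be illustrated with a minimal sketch. It is not the GloWGR implementation: the column names (`block`, `x`, `y`), the helper names, and the per-block least-squares fit are all hypothetical and chosen only for illustration. The algorithm is written as an ordinary pandas/NumPy function, then linked to Spark with `groupBy(...).applyInPandas(...)`, the grouped-map Pandas function API available in Spark 3.0 and later.

```python
import numpy as np
import pandas as pd

# Hypothetical output schema (DDL string) for the illustrative columns below.
OUTPUT_SCHEMA = "block string, slope double, intercept double"


def fit_block(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit y = slope * x + intercept by least squares for one group of rows."""
    x = pdf["x"].to_numpy(dtype=float)
    y = pdf["y"].to_numpy(dtype=float)
    design = np.column_stack([x, np.ones(len(pdf))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return pd.DataFrame(
        {
            "block": [str(pdf["block"].iloc[0])],
            "slope": [float(coef[0])],
            "intercept": [float(coef[1])],
        }
    )


def fit_all_blocks(sdf):
    """Run fit_block once per block on a Spark DataFrame (assumes Spark 3.0+)."""
    return sdf.groupBy("block").applyInPandas(fit_block, schema=OUTPUT_SCHEMA)


# Local check without a cluster: the same function runs on plain pandas.
sample = pd.DataFrame(
    {"block": ["a"] * 3, "x": [0.0, 1.0, 2.0], "y": [1.0, 3.0, 5.0]}
)
result = fit_block(sample)
```

Because `fit_block` only sees a pandas DataFrame, it can be unit-tested locally and maintained like any other Python code, while Spark handles how the groups are distributed.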
Problem 1
Genomic data are growing too fast for existing tools
Genomic data are growing at an exponential pace
Biobank datasets are growing in scale
- Next-generation sequencing
  - Genotyping arrays (1Mb)
  - Whole exome sequence (39Mb)
  - Whole genome sequence (3200Mb)
- 1,000s of samples to 100,000s of samples
- 10s of traits to 1,000s of traits