CS 649 Intro to Big Data: Tools and Methods, Spring Semester, 2022

Doc 15, Pandas Alternatives, Feb 24, 2022

Copyright ©, All rights reserved. 2022 SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA. OpenContent (http:// openpub/) license defines the copyright on this document.

How to Scale Python/NumPy/Pandas Code?

Memory

How to Scale Python/NumPy/Pandas Code?

Python & Parallelism

Modern hardware
    Multi-core
    Multi-processor

Python (CPython)
    Only one thread executes Python bytecode at a time!
    GIL - Global Interpreter Lock
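
A minimal sketch of what the GIL means in practice, assuming the standard CPython interpreter. The function `count_primes`, the helper `timed_map`, and the workload size are illustrative choices, not from the lecture. The same CPU-bound work is run sequentially and then on a thread pool; the threaded run returns the same results but, for pure-Python work like this, typically finishes in about the same time. Timings depend on the machine, so none are shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound, pure-Python work: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def timed_map(executor_cls, limits, workers=4):
    """Run count_primes over `limits` with the given executor; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(count_primes, limits))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    limits = [20_000] * 4  # four equal chunks of work (illustrative size)

    start = time.perf_counter()
    sequential = [count_primes(limit) for limit in limits]
    sequential_seconds = time.perf_counter() - start

    threaded, threaded_seconds = timed_map(ThreadPoolExecutor, limits)

    assert threaded == sequential  # threads are correct, just not parallel here
    print(f"sequential: {sequential_seconds:.2f}s")
    print(f"4 threads:  {threaded_seconds:.2f}s")
```

Passing `concurrent.futures.ProcessPoolExecutor` to `timed_map` instead (from a script with the `__main__` guard shown) gives each worker its own interpreter and its own GIL, which is the usual way to use several cores from pure Python.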

Protecting Memory from Read-Write conflicts

Protecting mutable memory from read-write conflicts

Fine-grain locks
    Java - synchronized methods create a lock on a single object
    Complicated to implement

Global lock
    Python
    Only one thread runs at a time
    Pandas' and other libraries' C code is not thread-safe
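
The fine-grain approach can be sketched in Python with one `threading.Lock` per object, which is roughly what a Java synchronized method does. The class `LockedCounter` and the helper `run_workers` are hypothetical names used only for illustration. Even under the GIL, `self.value += 1` is a read-modify-write that is not guaranteed to be atomic, so the lock is what protects the update.

```python
import threading


class LockedCounter:
    """A counter whose read-modify-write update is guarded by its own lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()  # one lock per object: fine-grained

    def increment(self):
        # Only one thread at a time may run the update below.
        with self._lock:
            self.value += 1


def run_workers(counter, n_threads=4, n_increments=10_000):
    """Have several threads increment the same counter, then return its value."""
    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(run_workers(LockedCounter()))  # 4 threads x 10,000 increments = 40000
```

Every shared mutable object needs this treatment under fine-grain locking, which is the "complicated to implement" cost; a single global lock avoids that work at the price of running one thread at a time.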

Panda Alternatives

Vaex
Dask
Turi Create
Datatable (H2O)

PyPy
Julia

Dataframe DB

Vaex

Complete rewrite of Dataframe
Uses structured files to produce memory map of data

Data stored on disk
Retrieved when needed
Can handle 100 GB data set on laptop
Very fast
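
The memory-map idea can be sketched with the Python standard library alone. This is not Vaex's API or its file format; `write_column`, `column_mean`, and the raw float64 file layout are assumptions made for illustration. The file is mapped rather than read, so only the pages holding the requested rows are pulled from disk.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Store one numeric column on disk as raw native float64 values."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def column_mean(path, start, stop):
    """Mean of rows [start, stop), read through a memory map of the file."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # View the mapped bytes as float64 without copying the whole file;
        # the OS pages in only the parts of the file that are touched.
        with memoryview(mm) as raw, raw.cast("d") as column, column[start:stop] as rows:
            return sum(rows) / len(rows)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "column.f64")
        write_column(path, range(1_000_000))  # about 8 MB on disk
        print(column_mean(path, 10, 20))      # mean of rows 10..19 -> 14.5
```

The same pattern scales to files larger than RAM because the process never holds more than the touched pages, which is the property the slide attributes to Vaex.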

[Figure: Basic Statistics - Calculations per second]


