DSC 201: Data Analysis & Visualization

Data Aggregation & Time Series

Dr. David Koop

D. Koop, DSC 201, Fall 2018

Split-Apply-Combine

• Coined by H. Wickham, 2011
• Similar to the Map (split+apply) Reduce (combine) paradigm
• The Pattern:
  1. Split the data by some grouping variable
  2. Apply some function to each group independently
  3. Combine the data into some output dataset
• The apply step is usually one of:
  - Aggregate
  - Transform
  - Filter
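The three steps of the pattern can be sketched by hand in pandas. This is a minimal illustration, not code from the slides: the data frame `df` and its column names `key` and `value` are made up for the example.

```python
import pandas as pd

# Hypothetical toy data, not from the slides.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1, 2, 3, 4, 5],
})

# 1. Split: one sub-frame per distinct value of the grouping variable.
pieces = {key: group for key, group in df.groupby("key")}

# 2. Apply: run a function on each group independently (here, a sum).
applied = {key: group["value"].sum() for key, group in pieces.items()}

# 3. Combine: assemble the per-group results into one output object.
result = pd.Series(applied, name="value")

# The same computation written as a single groupby call.
shortcut = df.groupby("key")["value"].sum()
```

In practice the single `groupby` call is the usual form; the explicit version only makes the three steps visible.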

[T. Brandt]

Split-Apply-Combine

producing a new value. Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what's being done to the data. See Figure 9-1 for a mockup of a simple group aggregation.

Figure 9-1. Illustration of a group aggregation

[W. McKinney, Python for Data Analysis]

Splitting by Variables

name     age  sex
John     13   Male
Mary     15   Female
Alice    14   Female
Peter    13   Male
Roger    14   Male
Phyllis  13   Female

.(sex)

name     age  sex
John     13   Male
Peter    13   Male
Roger    14   Male

name     age  sex
Mary     15   Female
Alice    14   Female
Phyllis  13   Female

.(age)

name     age  sex
John     13   Male
Peter    13   Male
Phyllis  13   Female

name     age  sex
Alice    14   Female
Roger    14   Male

name     age  sex
Mary     15   Female

Figure 4: Two examples of splitting up a data frame by variables. If the data frame was split up by both sex and age, there would only be one subset with more than one row: 13-year-old males.

[H. Wickham, 2011]
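The same splits can be reproduced in pandas. The figure itself uses the `.(sex)` notation from Wickham's paper; the translation below to `groupby` is a sketch, and the variable names are chosen for this example only.

```python
import pandas as pd

# The six-row data frame from Figure 4.
people = pd.DataFrame({
    "name": ["John", "Mary", "Alice", "Peter", "Roger", "Phyllis"],
    "age": [13, 15, 14, 13, 14, 13],
    "sex": ["Male", "Female", "Female", "Male", "Male", "Female"],
})

# Split by one variable: get_group returns the sub-frame for one key.
by_sex = people.groupby("sex")
males = by_sex.get_group("Male")

# Iterating over a GroupBy yields (key, sub-frame) pairs.
names_by_age = {age: group["name"].tolist() for age, group in people.groupby("age")}

# Split by both variables and keep the subsets with more than one row.
multi_row = [key for key, group in people.groupby(["sex", "age"]) if len(group) > 1]
```

Splitting by both variables leaves a single multi-row subset, the 13-year-old males, as the figure caption states.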

Apply+Combine: Counting

.(sex)

sex     value
Male    3
Female  3

.(age)

age  value
13   3
14   2
15   1

.(sex, age)

sex     age  value
Male    13   2
Male    14   1
Female  13   1
Female  14   1
Female  15   1

Figure 7: Illustrating the output from using ddply() on the example from Figure 4 with nrow(). Splitting variables shown above each example. Note how the extra labeling columns are added so that you can identify to which subset the results apply.

[H. Wickham, 2011]
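A rough pandas counterpart of these counts uses `size()` on a `GroupBy`. This is a sketch rather than the slide's own code (the figure was produced with `ddply()` in R), and the column name `value` is set explicitly to match the figure.

```python
import pandas as pd

# The six-row data frame from Figure 4.
people = pd.DataFrame({
    "name": ["John", "Mary", "Alice", "Peter", "Roger", "Phyllis"],
    "age": [13, 15, 14, 13, 14, 13],
    "sex": ["Male", "Female", "Female", "Male", "Male", "Female"],
})

# size() counts the rows in each group.
by_sex = people.groupby("sex").size()
by_age = people.groupby("age").size()

# reset_index turns the group keys into labeling columns next to the counts.
by_sex_age = people.groupby(["sex", "age"]).size().reset_index(name="value")
```

pandas sorts the group keys by default, so the rows come out with Female before Male rather than in the order drawn in the figure.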

In Pandas

• groupby method creates a GroupBy object
• groupby doesn't actually compute anything until there is an apply/aggregate step or we wish to examine the groups
• Choose keys (columns) to group by
• size(): size of the groups
• Aggregation Operations:
  - count()
  - mean()
  - sum()
• Can write own function for aggregation and pass it to agg function

    def peak_to_peak(arr):
        # Range of the values within one group
        return arr.max() - arr.min()

    grouped.agg(peak_to_peak)
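The slide does not show how `grouped` is built. A minimal runnable version, assuming a made-up data frame with a grouping column `key` and a numeric column `x`, might look like this:

```python
import pandas as pd

# Hypothetical toy data; the slides do not define `grouped`.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b"],
    "x": [1.0, 4.0, 2.0, 8.0, 5.0],
})

grouped = df.groupby("key")   # a GroupBy object; no result computed yet

sizes = grouped.size()        # rows per group
means = grouped["x"].mean()   # built-in aggregation

def peak_to_peak(arr):
    # Range of the values within one group.
    return arr.max() - arr.min()

# A custom function passed to agg is called once per group.
ranges = grouped["x"].agg(peak_to_peak)
```

Selecting the column with `grouped["x"]` before aggregating keeps the custom function away from non-numeric columns.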

Assignment 5

• Aggregation, Time Series, and Visualization
• Compare Hurricane Joaquin and Hurricane Maria

Types of GroupBy

• Aggregation: agg
  - n:1, n group values become one value
  - Examples: mean, min, median
• Apply: apply
  - n:m, n group values become m values
  - Most general (could do aggregation or transform with apply)
  - Example: top 5 in each group
  - Filter
• Transform: transform
  - n:n, n group values become n values
  - Cannot mutate the input
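The output shapes of the operations above can be compared on one small example. The data and column names are invented for illustration, and "top 2" stands in for the slide's "top 5" because the toy groups are small.

```python
import pandas as pd

# Hypothetical toy data, not from the slides.
df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b"],
    "x": [3, 1, 2, 10, 20],
})
g = df.groupby("key")["x"]

# Aggregation (n:1): each group collapses to one value.
agg_out = g.agg("mean")

# Transform (n:n): one output per input row, aligned with the original index.
centered = g.transform(lambda s: s - s.mean())

# Apply (n:m): each group may return any number of values, here its top 2.
top2 = g.apply(lambda s: s.nlargest(2))

# Filter: keep or drop whole groups based on a per-group test.
big_groups = df.groupby("key").filter(lambda frame: len(frame) >= 3)
```

Here `agg_out` has one row per group, `centered` has as many rows as `df`, `top2` has two rows per group, and `big_groups` keeps only the rows of group `a`.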
