Pandas data manipulation - Bentley University

Handout 11

CS602 ?Data-Driven Development with ?Spring'21

Page 1 of 6

Handout 11 Pandas data manipulation

COMBINING DATA: MERGE AND JOIN

Already saw some examples of adding columns, rows: functions concat and append Merge function provides SQL-style `join' capabilities, based on equality of column or index values.



DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

Similar to SQL join; implemented via merge:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Returns: a data frame with data from left and right merged according to the params below.

If on parameter, specifying which columns to join in, not passed and left_index and right_index are False, the intersection of the columns in the DataFrames will be inferred to be the join keys.

left, right: DataFrames to be merged. By default they will be merged by the equality of values in the commonly named column (intersection of columns).

how:

One of 'left', 'right', 'outer', 'inner'.

on:

Column or index names to join on. Must be found in both the left and right

DataFrame objects.

left_on/right_on: Columns or index levels from the left/right DataFrame to use as keys. Can either be column names, index level names.

left_index/

right_index: If True, use the index (row labels) from the left DataFrame as its join key(s).

sort:

Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting to False will improve performance substantially in many cases.

suffixes: A tuple of string suffixes to apply to overlapping columns. Defaults to ('_x','_y').

Notice: By default items are in sorted order by the join column. The result uses a new index.

- 1 -

Handout 11

CS602 ?Data-Driven Development with ?Spring'21

Page 2 of 6

Example 1

>>> df1 emp group

0 Jane Acc 1 Anne Eng 2 Bob Eng 3 Sam HR

>>>df2 emp

0 Lisa 1 Bob 2 Jake 3 Sam

start 2004 2008 2012 2014

how ? control inclusion of non-matching rows

>>> pd.merge (df1,df2) # how='inner'

>>> pd.merge(df1,df2, how='outer')

emp group start

emp group start

0 Bob Eng 2008

0 Jane Acc

NaN

1 Sam HR 2014

1 Anne Eng

NaN

2 Bob Eng 2008.0

3 Sam HR 2014.0

4 Lisa NaN 2004.0

5 Jake NaN 2012.0

>>> pd.merge(df1,df2, how='left')

>>> pd.merge(df1,df2, how='right')

emp group start

0 Jane Acc

NaN

1 Anne Eng

NaN

2 Bob Eng 2008.0

3 Sam HR 2014.0

emp group start 0 Bob Eng 2008 1 Sam HR 2014 2 Lisa NaN 2004 3 Jake NaN 2012

Example 2

>>> df3 #employee group

emp group start 0 Jane Acc 2011 1 Anne Eng 2009 2 Bob Eng 2017 3 Sam HR 2017

>>> df4 #employee start

employee

0

Bob

1

Jake

2

Sam

start 2008 2012 2014

>>> df5 #group manager Out[25]:

group emp 0 Acc Lisa 1 Eng Tom 2 HR David

left_on/right_on, suffixes ? if join column titles do not match, and if frames contain the same columns that need to be renamed in the result.

>>> pd.merge(df3, df4, left_on='emp', right_on = 'employee')

emp group start_x employee start_y

0 Bob Eng

2017

Bob

2008

1 Sam HR

2017

Sam

2014

>>> pd.merge(df3, df4, left_on='emp',right_on='employee',suffixes = ('_group','_comp'))

emp group start_group employee start_comp

0 Bob Eng

2017

Bob

2008

1 Sam HR

2017

Sam

2014

- 2 -

Handout 11

CS602 ?Data-Driven Development with ?Spring'21

Page 3 of 6

left_index/right_index Merging by the index values (can also merge by index with column)

>>> pd.merge(df3, df4, left_index=True, right_index = True)

emp group start_x employee start_y

0 Jane Acc

2011

Bob

2008

1 Anne Eng

2009

Jake

2012

2 Bob Eng

2017

Sam

2014

on ? which columns to use for joining

>>> pd.merge(df3, df5)

>>> pd.merge(df3, df5, on = 'group')

Empty DataFrame Columns: [emp, group, start] Index: []

emp_x group 0 Jane Acc 1 Anne Eng 2 Bob Eng 3 Sam HR

start 2011 2009 2017 2017

emp_y Lisa Tom Tom

David

Practice problems:

1. Given the above definitions, what is the result of pd.merge (df3, df4) ? 2. Both df3 and df5 contain data on employees and their group names. Create a data frame that

combines all employees and their group info from df3, df5. The result should contain the following:

emp group 0 Jane Acc 1 Anne Eng 2 Bob Eng 3 Sam HR 4 Lisa Acc 5 Tom Eng 6 David HR

GROUPBY

Pandas groupby() function mimics the SQL group by clause, in creating groupings of data. Following the grouping (a.k.a split), items in groups can be used to apply a function on them, and then combine the results in a new data frame.



DataFrame.groupby(by=None, axis=0)

by: label, or list of labels, function or mapping

Used to determine the groups for the groupby. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted a (single) key If by is a function, it's called on each value of the object's index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups.

- 3 -

Handout 11

CS602 ?Data-Driven Development with ?Spring'21

Page 4 of 6

Example :

>>> staff = pd.DataFrame({'emp': ['Vic', 'Anne', 'Bob', 'Sam', 'Tom', 'Bob'], 'dept': ['Acc', 'Eng', 'Eng', 'HR', 'Acc', 'HR'], 'start':[2016, 2017, 2001, 2017, 2011, 2005 ], 'vac': [11, 12, 34, 12, 16, 5 ]}).sort_values('dept')

>>> staff

emp dept start vac 0 Vic Acc 2016 11 4 Tom Acc 2011 16 1 Anne Eng 2017 12 2 Bob Eng 2001 34 3 Sam HR 2017 12 5 Bob HR 2005 5

>>> for g in staff.groupby(by = 'dept'): # Iterate over groups: >>> print(g, type(g))

('Acc', emp dept start vac

0 Vic Acc 2016 11

4 Tom Acc 2011 16)

('Eng',

emp dept start vac

1 Anne Eng 2017 12

2 Bob Eng 2001 34)

('HR', emp dept start vac

3 Sam HR 2017 12

5 Bob HR 2005 5)

>>> dg = staff.groupby(by = 'dept')

>>> dg.min()

emp start vac

dept

Acc Tom 2011 11

Eng Anne 2001 12

HR

Bob 2005 5

# Apply a group-function per each group

>>> dg.median() #group function produces a DATAFRAME indexed by the values in grouping column

start vac dept Acc 2013.5 13.5 Eng 2009.0 23.0 HR 2011.0 8.5

>>> dg.median()['vac']

dept

Acc 13.5

Eng 23.0

HR

8.5

Practice problems:

1. Suggest three data analysis scenarios on the movies data, requiring a groupby(). 2. Implement the analyses you suggested using pandas methods.

- 4 -

Handout 11

CS602 ?Data-Driven Development with ?Spring'21

Page 5 of 6

WORKING WITH MISSING DATA Real world data is rarely complete and noiseless/homogenous (i.e. error-free).

Missing data can be referred to as: None ? `Pythonic no-value' constant NaN- Not a Number, a floating point value, numpy.nan NA, null ? are not within Python/numpy/pandas vocab

? Numerical functions and operations, e.g. sum, max, etc. will perform fine on NaN, but generate an error with None. Non-numerical functions will fail, generating an error.

None == np.nan #Out[53]: False type(None) #Out[54]: NoneType type(np.nan) #Out[55]: float np.nan + 5 # Out[58]: nan None + 5 # TypeError: unsupported operand type(s) for +:'NoneType' and 'int'

pd.isnull(v) pd.isna(v)

pd.notnull(v) pd.notna(v)

Check if value is None or np.nan, return a boolean The opposite of the above functions.

? Checks for missing values:

# checking if value is missing pd.isnull(None) #Out[62]: True pd.isnull(np.nan) #Out[63]: True pd.isnull(null) #NameError: name 'null' is not defined pd.isnull(7) #Out[65]: False pd.isnull('') #Out[66]: False

pd.isna (None) #Out[70]: True pd.isna (np.nan) #Out[71]: True

? Removing or filling-in missing values in a data frame

dataframe.dropna(axis=0, how='any', thresh=None, inplace=False)

dataframe.fillna( value = None, method=None, axis= None, inplace=False)

Returns a DataFrame with rows/cols containing missing values as specified by parameters, dropped.

Returns a DataFrame with rows/cols containing missing values filled with value as specified by parameters.

- 5 -

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download