Pandas data manipulation - Bentley University

Handout 11

CS602 ?Data-Driven Development with ?Spring'23

Page 1 of 7

Handout 11 Pandas data manipulation

COMBINING DATA: MERGE AND JOIN

Already saw some examples of adding columns, rows: functions concat and append Merge function provides SQL-style `join' capabilities, based on equality of column or index values.



DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

Similar to SQL join; implemented via merge:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Returns: a data frame with data from left and right merged according to the params below.

If on parameter, specifying which columns to join in, not passed and left_index and right_index are False, the intersection of the columns in the DataFrames will be inferred to be the join keys.

left, right: DataFrames to be merged. By default they will be merged by the equality of values in the commonly named column (intersection of columns).

how:

One of 'left', 'right', 'outer', 'inner'.

on:

Column or index names to join on. Must be found in both the left and right

DataFrame objects.

left_on/right_on: Columns or index levels from the left/right DataFrame to use as keys. Can either be column names, index level names.

left_index/

right_index: If True, use the index (row labels) from the left DataFrame as its join key(s).

sort:

Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting to False will improve performance substantially in many cases.

suffixes: A tuple of string suffixes to apply to overlapping columns. Defaults to ('_x','_y').

Notice: By default items are in sorted order by the join column. The result uses a new index.

- 1 -

Handout 11

CS602 ?Data-Driven Development with ?Spring'23

Page 2 of 7

Example 1

>>> df1 emp group

0 Jane Acc 1 Anne Eng 2 Bob Eng 3 Sam HR

>>>df2 emp

0 Lisa 1 Bob 2 Jake 3 Sam

start 2004 2008 2012 2014

how ? control inclusion of non-matching rows

>>> pd.merge (df1,df2) # how='inner'

>>> pd.merge(df1,df2, how='outer')

emp group start

emp group start

0 Bob Eng 2008

0 Jane Acc

NaN

1 Sam HR 2014

1 Anne Eng

NaN

2 Bob Eng 2008.0

3 Sam HR 2014.0

4 Lisa NaN 2004.0

5 Jake NaN 2012.0

>>> pd.merge(df1,df2, how='left')

>>> pd.merge(df1,df2, how='right')

emp group start

0 Jane Acc

NaN

1 Anne Eng

NaN

2 Bob Eng 2008.0

3 Sam HR 2014.0

emp group start 0 Bob Eng 2008 1 Sam HR 2014 2 Lisa NaN 2004 3 Jake NaN 2012

Example 2

>>> df3 #employee group

emp group start 0 Jane Acc 2011 1 Anne Eng 2009 2 Bob Eng 2017 3 Sam HR 2017

>>> df4 #employee start

employee

0

Bob

1

Jake

2

Sam

start 2008 2012 2014

>>> df5 #group manager Out[25]:

group emp 0 Acc Lisa 1 Eng Tom 2 HR David

left_on/right_on, suffixes ? if join column titles do not match, and if frames contain the same columns that need to be renamed in the result.

>>> pd.merge(df3, df4, left_on='emp', right_on = 'employee')

emp group start_x employee start_y

0 Bob Eng

2017

Bob

2008

1 Sam HR

2017

Sam

2014

>>> pd.merge(df3, df4, left_on='emp',right_on='employee',suffixes = ('_group','_comp'))

emp group start_group employee start_comp

0 Bob Eng

2017

Bob

2008

1 Sam HR

2017

Sam

2014

- 2 -

Handout 11

CS602 ?Data-Driven Development with ?Spring'23

Page 3 of 7

left_index/right_index Merging by the index values (can also merge by index with column)

>>> pd.merge(df3, df4, left_index=True, right_index = True)

emp group start_x employee start_y

0 Jane Acc

2011

Bob

2008

1 Anne Eng

2009

Jake

2012

2 Bob Eng

2017

Sam

2014

on ? which columns to use for joining

>>> pd.merge(df3, df5)

>>> pd.merge(df3, df5, on = 'group')

Empty DataFrame Columns: [emp, group, start] Index: []

emp_x group 0 Jane Acc 1 Anne Eng 2 Bob Eng 3 Sam HR

start 2011 2009 2017 2017

emp_y Lisa Tom Tom

David

Practice problems:

1. Given the above definitions, what is the result of pd.merge (df3, df4) ? 2. Both df3 and df5 contain data on employees and their group names. Create a data frame that

combines all employees and their group info from df3, df5. The result should contain the following:

emp group 0 Jane Acc 1 Anne Eng 2 Bob Eng 3 Sam HR 4 Lisa Acc 5 Tom Eng 6 David HR

GROUPBY

Pandas groupby() function mimics the SQL group by clause, in creating groupings of data. Following the grouping (a.k.a split), items in groups can be used to apply a function on them, and then combine the results in a new data frame.



DataFrame.groupby(by=None, axis=0)

by: label, or list of labels, function or mapping

Used to determine the groups for the groupby. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted a (single) key If by is a function, it's called on each value of the object's index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups.

- 3 -

Handout 11

CS602 ?Data-Driven Development with ?Spring'23

Page 4 of 7

Example :

>>> staff = pd.DataFrame({'emp': ['Vic', 'Anne', 'Bob', 'Sam', 'Tom', 'Bob'], 'dept': ['Acc', 'Eng', 'Eng', 'HR', 'Acc', 'HR'], 'start':[2016, 2017, 2001, 2017, 2011, 2005 ], 'vac': [11, 12, 34, 12, 16, 5 ]}).sort_values('dept')

>>> staff

emp dept start vac 0 Vic Acc 2016 11 4 Tom Acc 2011 16 1 Anne Eng 2017 12 2 Bob Eng 2001 34 3 Sam HR 2017 12 5 Bob HR 2005 5

>>> for g in staff.groupby(by = 'dept'): # Iterate over groups: >>> print(g, type(g))

('Acc', emp dept start vac

0 Vic Acc 2016 11

4 Tom Acc 2011 16)

('Eng',

emp dept start vac

1 Anne Eng 2017 12

2 Bob Eng 2001 34)

('HR', emp dept start vac

3 Sam HR 2017 12

5 Bob HR 2005 5)

>>> dg = staff.groupby(by = 'dept')

>>> dg.min()

emp start vac

dept

Acc Tom 2011 11

Eng Anne 2001 12

HR

Bob 2005 5

# Apply a group-function per each group

>>> dg.median() #group function produces a DATAFRAME indexed by the values in grouping column

start vac dept Acc 2013.5 13.5 Eng 2009.0 23.0 HR 2011.0 8.5

>>> dg.median()['vac']

dept

Acc 13.5

Eng 23.0

HR

8.5

Practice problems:

1. Suggest three data analysis scenarios on the movies data, requiring a groupby(). 2. Implement the analyses you suggested using pandas methods.

- 4 -

Handout 11

CS602 ?Data-Driven Development with ?Spring'23

Page 5 of 7

WORKING WITH MISSING DATA Real world data is rarely complete and noiseless/homogenous (i.e. error-free).

Missing data can be referred to as: None ? `Pythonic no-value' constant NaN- Not a Number, a floating point value, numpy.nan NA, null ? are not within Python/numpy/pandas vocab

? Numerical functions and operations, e.g. sum, max, etc. will perform fine on NaN, but generate an error with None. Non-numerical functions will fail, generating an error.

None == np.nan #Out[53]: False type(None) #Out[54]: NoneType type(np.nan) #Out[55]: float np.nan + 5 # Out[58]: nan None + 5 # TypeError: unsupported operand type(s) for +:'NoneType' and 'int'

pd.isnull(v) pd.isna(v)

pd.notnull(v) pd.notna(v)

Check if value is None or np.nan, return a boolean The opposite of the above functions.

? Checks for missing values:

# checking if value is missing pd.isnull(None) #Out[62]: True pd.isnull(np.nan) #Out[63]: True pd.isnull(null) #NameError: name 'null' is not defined pd.isnull(7) #Out[65]: False pd.isnull('') #Out[66]: False

pd.isna (None) #Out[70]: True pd.isna (np.nan) #Out[71]: True

? Removing or filling-in missing values in a data frame

dataframe.dropna(axis=0, how='any', thresh=None, inplace=False)

dataframe.fillna( value = None, method=None, axis= None, inplace=False)

Returns a DataFrame with rows/cols containing missing values as specified by parameters, dropped.

Returns a DataFrame with rows/cols containing missing values filled with value as specified by parameters.

- 5 -

Handout 11

CS602 ?Data-Driven Development with ?Spring'23

Page 6 of 7

>>> mv1 = pd.DataFrame({'emp': ['Jane', 'Anne', None, 'Sam'],

'group': [None, 'Eng', 'Eng', 'HR'],

'year' : [np.nan, np.nan, 2013, 2015]})

>>> mv1

emp group year

0 Jane None

NaN

1 Anne Eng

NaN

2 None Eng 2013.0

3 Sam HR 2015.0

>>> mv1.dropna(axis = 0) # drop if row has any missing values

emp group year 3 Sam HR 2015.0

>>> mv1.dropna(axis = 0, thresh = 2)# must have at least 2 missing values to drop

emp group year

1 Anne Eng

NaN

2 None Eng 2013.0

3 Sam HR 2015.0

>>> mv1.dropna(axis = 0, how = 'all') # drop if all values in a row are missing

emp group year

0 Jane None

NaN

1 Anne Eng

NaN

2 None Eng 2013.0

3 Sam HR 2015.0

Practice problems:

1. What's the result of mv1.dropna(axis = 1) ?

2. Does mv1 change as a result?

3. (a) Define function P1 that accepts two parameters: ? a data frame of format similar to the format of the IMDB data, and ? a string containing a column name, e.g. `Year'.

The function should return a tuple with two values - the average value of Rating, per each value of the specified parameter column (e.g. average rating per Year). - a list of all of the distinct values in the specified column of imdb, without repetition and sorted from lowest to highest,

Show how to call the function on the IMDB data file.

(b) Repeat asking the user to enter a column name, until user enters a valid column name from the IMDB.csv. Then, call function P1 defined above for the imdb dataframe and the specified column and print out the list that it returns (as the second return value)

4. (a) Define function P2 that accepts two parameters: ? a data frame of format similar to the format of the purchase data, and

- 6 -

Handout 11

CS602 ?Data-Driven Development with ?Spring'23

Page 7 of 7

? a data frame of format similar to the format of the product data.

The function should create a new data frame, which contains only those PRODUCT_IDs and corresponding DESCRIPTION values, which were purchased by both Male and Female users (see the GENDER column). Hint: use a set operation.

Return the created data frame.

Show how to call the function on the purchase-s and pdata-s csv data files.

(b) Read purchase-s and pdata-s csv data files into data frames and calls function P2 defined above, saving the result in variable named DF.

Then, repeat asking the user for a keyword while a non-empty string is entered, and - print those rows from the DF, which contain the keyword in their DESCRIPTION

- 7 -

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download