Handout 11: Pandas data manipulation
CS602 Data-Driven Development with Python, Spring'23
Bentley University
COMBINING DATA: MERGE AND JOIN
We have already seen some ways of adding columns and rows: the concat function and the append method. The merge function provides SQL-style 'join' capabilities, combining rows based on equality of column or index values.
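For a quick refresher, here is a minimal sketch of stacking rows with concat; the frames a and b are made up for illustration:

```python
import pandas as pd

# Two small frames with the same columns (made-up data)
a = pd.DataFrame({'emp': ['Jane'], 'group': ['Acc']})
b = pd.DataFrame({'emp': ['Bob'], 'group': ['Eng']})

# Stack the rows; ignore_index=True builds a fresh 0..n-1 index
stacked = pd.concat([a, b], ignore_index=True)
print(stacked)
```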
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
Similar to an SQL join; implemented via merge:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Returns: a data frame with the data from left and right merged according to the parameters below.
If the on parameter (specifying which columns to join on) is not passed, and left_index and right_index are False, the intersection of the column names in the two DataFrames is inferred to be the set of join keys.
left, right: DataFrames to be merged. By default they are merged on equality of values in the commonly named columns (the intersection of the two column sets).
how: One of 'left', 'right', 'outer', 'inner'.
on: Column or index level names to join on. Must be found in both the left and right DataFrame objects.
left_on/right_on: Columns or index levels from the left/right DataFrame to use as keys. Can be either column names or index level names.
left_index/right_index: If True, use the index (row labels) from the left/right DataFrame as its join key(s).
sort: Sort the result DataFrame by the join keys in lexicographical order. Defaults to False; with the default, the order of the join keys depends on the join type (the how parameter).
suffixes: A tuple of string suffixes to apply to overlapping column names. Defaults to ('_x', '_y').
Notice: the result uses a new default integer index; the row order depends on the how and sort parameters.
Example 1
>>> df1
    emp group
0  Jane   Acc
1  Anne   Eng
2   Bob   Eng
3   Sam    HR
>>> df2
    emp  start
0  Lisa   2004
1   Bob   2008
2  Jake   2012
3   Sam   2014
how: controls inclusion of non-matching rows
>>> pd.merge(df1, df2)  # how='inner'
   emp group  start
0  Bob   Eng   2008
1  Sam    HR   2014

>>> pd.merge(df1, df2, how='outer')
    emp group   start
0  Jane   Acc     NaN
1  Anne   Eng     NaN
2   Bob   Eng  2008.0
3   Sam    HR  2014.0
4  Lisa   NaN  2004.0
5  Jake   NaN  2012.0

>>> pd.merge(df1, df2, how='left')
    emp group   start
0  Jane   Acc     NaN
1  Anne   Eng     NaN
2   Bob   Eng  2008.0
3   Sam    HR  2014.0

>>> pd.merge(df1, df2, how='right')
    emp group  start
0   Bob   Eng   2008
1   Sam    HR   2014
2  Lisa   NaN   2004
3  Jake   NaN   2012
Example 2
>>> df3  # employee group
    emp group  start
0  Jane   Acc   2011
1  Anne   Eng   2009
2   Bob   Eng   2017
3   Sam    HR   2017
>>> df4  # employee start
  employee  start
0      Bob   2008
1     Jake   2012
2      Sam   2014
>>> df5  # group manager
  group    emp
0   Acc   Lisa
1   Eng    Tom
2    HR  David
left_on/right_on, suffixes: used when the join column titles do not match, and when both frames contain identically named columns that need to be renamed in the result.
>>> pd.merge(df3, df4, left_on='emp', right_on='employee')
   emp group  start_x employee  start_y
0  Bob   Eng     2017      Bob     2008
1  Sam    HR     2017      Sam     2014
>>> pd.merge(df3, df4, left_on='emp', right_on='employee', suffixes=('_group', '_comp'))
   emp group  start_group employee  start_comp
0  Bob   Eng         2017      Bob        2008
1  Sam    HR         2017      Sam        2014
left_index/right_index: merging by the index values (an index can also be merged against a column)
>>> pd.merge(df3, df4, left_index=True, right_index=True)
    emp group  start_x employee  start_y
0  Jane   Acc     2011      Bob     2008
1  Anne   Eng     2009     Jake     2012
2   Bob   Eng     2017      Sam     2014
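As noted above, an index can also be matched against a column. A sketch re-creating df3 and df4, where df3 is first re-indexed by emp (the set_index step is added here for illustration):

```python
import pandas as pd

df3 = pd.DataFrame({'emp': ['Jane', 'Anne', 'Bob', 'Sam'],
                    'group': ['Acc', 'Eng', 'Eng', 'HR'],
                    'start': [2011, 2009, 2017, 2017]}).set_index('emp')
df4 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Sam'],
                    'start': [2008, 2012, 2014]})

# Match df3's index (employee names) against df4's 'employee' column;
# the shared 'start' column gets the default _x/_y suffixes
m = pd.merge(df3, df4, left_index=True, right_on='employee')
print(m)
```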
on: which columns to use for joining
>>> pd.merge(df3, df5)  # joins on both common columns, emp and group
Empty DataFrame
Columns: [emp, group, start]
Index: []

>>> pd.merge(df3, df5, on='group')
  emp_x group  start  emp_y
0  Jane   Acc   2011   Lisa
1  Anne   Eng   2009    Tom
2   Bob   Eng   2017    Tom
3   Sam    HR   2017  David
Practice problems:
1. Given the above definitions, what is the result of pd.merge(df3, df4)?
2. Both df3 and df5 contain data on employees and their group names. Create a data frame that combines all employees and their group info from df3 and df5. The result should contain the following:

    emp group
0  Jane   Acc
1  Anne   Eng
2   Bob   Eng
3   Sam    HR
4  Lisa   Acc
5   Tom   Eng
6 David    HR
GROUPBY
The pandas groupby() function mimics the SQL GROUP BY clause in creating groupings of data. Following the grouping (a.k.a. split), a function can be applied to the items in each group, and the results are then combined into a new data frame.
DataFrame.groupby(by=None, axis=0)
by: label or list of labels, function, or mapping.
Used to determine the groups for the groupby. A label or list of labels may be passed to group by the corresponding columns in self. Notice that a tuple is interpreted as a (single) key. If by is a function, it is called on each value of the object's index. If a dict or Series is passed, the dict or Series VALUES are used to determine the groups.
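The non-label forms of by can be sketched as follows (the frame and the dict here are made up for illustration): a function is applied to each index value, while a dict maps index values to group keys:

```python
import pandas as pd

s = pd.DataFrame({'vac': [11, 12, 34, 12]},
                 index=['Vic', 'Anne', 'Bob', 'Sam'])

# by = function: called on each index value (here, first letter of the name)
by_letter = s.groupby(lambda name: name[0]).sum()

# by = dict: its VALUES determine the groups
dept = {'Vic': 'Acc', 'Anne': 'Eng', 'Bob': 'Eng', 'Sam': 'HR'}
by_dept = s.groupby(dept).sum()
print(by_letter)
print(by_dept)
```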
Example:
>>> staff = pd.DataFrame({'emp': ['Vic', 'Anne', 'Bob', 'Sam', 'Tom', 'Bob'],
...                       'dept': ['Acc', 'Eng', 'Eng', 'HR', 'Acc', 'HR'],
...                       'start': [2016, 2017, 2001, 2017, 2011, 2005],
...                       'vac': [11, 12, 34, 12, 16, 5]}).sort_values('dept')
>>> staff
    emp dept  start  vac
0   Vic  Acc   2016   11
4   Tom  Acc   2011   16
1  Anne  Eng   2017   12
2   Bob  Eng   2001   34
3   Sam   HR   2017   12
5   Bob   HR   2005    5
>>> for g in staff.groupby(by = 'dept'):  # iterate over groups
...     print(g, type(g))
('Acc',   emp dept  start  vac
0  Vic  Acc   2016   11
4  Tom  Acc   2011   16) <class 'tuple'>
('Eng',    emp dept  start  vac
1  Anne  Eng   2017   12
2   Bob  Eng   2001   34) <class 'tuple'>
('HR',   emp dept  start  vac
3  Sam   HR   2017   12
5  Bob   HR   2005    5) <class 'tuple'>
>>> dg = staff.groupby(by = 'dept')
>>> dg.min()
       emp  start  vac
dept
Acc    Tom   2011   11
Eng   Anne   2001   12
HR     Bob   2005    5
# Apply a group function per each group
>>> dg.median()  # produces a DATAFRAME indexed by the values in the grouping column
       start   vac
dept
Acc   2013.5  13.5
Eng   2009.0  23.0
HR    2011.0   8.5
>>> dg.median()['vac']
dept
Acc    13.5
Eng    23.0
HR      8.5
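Several group functions can also be applied at once with agg; a sketch re-creating the staff frame from the example above:

```python
import pandas as pd

staff = pd.DataFrame({'emp': ['Vic', 'Anne', 'Bob', 'Sam', 'Tom', 'Bob'],
                      'dept': ['Acc', 'Eng', 'Eng', 'HR', 'Acc', 'HR'],
                      'start': [2016, 2017, 2001, 2017, 2011, 2005],
                      'vac': [11, 12, 34, 12, 16, 5]})

# Apply several aggregates to the 'vac' column per department at once
summary = staff.groupby('dept')['vac'].agg(['min', 'max', 'mean'])
print(summary)
```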
Practice problems:
1. Suggest three data analysis scenarios on the movies data, requiring a groupby().
2. Implement the analyses you suggested using pandas methods.
WORKING WITH MISSING DATA
Real-world data is rarely complete and noiseless/homogeneous (i.e. error-free).
Missing data can be referred to as:
- None: the 'Pythonic no-value' constant
- NaN: Not a Number, a floating-point value, numpy.nan
- NA, null: not part of the Python/numpy/pandas vocabulary
- Numerical functions and operations (e.g. sum, max) work with NaN (propagating or skipping it) but raise an error with None. Non-numerical functions fail on missing values, raising an error.
None == np.nan  # Out[53]: False
type(None)      # Out[54]: NoneType
type(np.nan)    # Out[55]: float
np.nan + 5      # Out[58]: nan
None + 5        # TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
pd.isnull(v), pd.isna(v): check whether the value is None or np.nan; return a boolean.
pd.notnull(v), pd.notna(v): the opposite of the above functions.
- Checking for missing values:

pd.isnull(None)    # Out[62]: True
pd.isnull(np.nan)  # Out[63]: True
pd.isnull(null)    # NameError: name 'null' is not defined
pd.isnull(7)       # Out[65]: False
pd.isnull('')      # Out[66]: False
pd.isna(None)      # Out[70]: True
pd.isna(np.nan)    # Out[71]: True
- Removing or filling in missing values in a data frame:

dataframe.dropna(axis=0, how='any', thresh=None, inplace=False)
Returns a DataFrame with the rows/columns that contain missing values (as specified by the parameters) dropped.

dataframe.fillna(value=None, method=None, axis=None, inplace=False)
Returns a DataFrame with missing values filled in with value, as specified by the parameters.
>>> mv1 = pd.DataFrame({'emp': ['Jane', 'Anne', None, 'Sam'],
...                     'group': [None, 'Eng', 'Eng', 'HR'],
...                     'year': [np.nan, np.nan, 2013, 2015]})
>>> mv1
    emp group    year
0  Jane  None     NaN
1  Anne   Eng     NaN
2  None   Eng  2013.0
3   Sam    HR  2015.0
>>> mv1.dropna(axis = 0)  # drop a row if it has any missing values
    emp group    year
3   Sam    HR  2015.0
>>> mv1.dropna(axis = 0, thresh = 2)  # keep rows with at least 2 non-missing values
    emp group    year
1  Anne   Eng     NaN
2  None   Eng  2013.0
3   Sam    HR  2015.0
>>> mv1.dropna(axis = 0, how = 'all')  # drop a row only if all its values are missing
    emp group    year
0  Jane  None     NaN
1  Anne   Eng     NaN
2  None   Eng  2013.0
3   Sam    HR  2015.0
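fillna, described above, replaces missing values instead of dropping rows. A sketch using the same mv1; the fill values ('?' and 0) are made up for illustration:

```python
import pandas as pd
import numpy as np

mv1 = pd.DataFrame({'emp': ['Jane', 'Anne', None, 'Sam'],
                    'group': [None, 'Eng', 'Eng', 'HR'],
                    'year': [np.nan, np.nan, 2013, 2015]})

# A dict supplies a per-column fill value
filled = mv1.fillna({'emp': '?', 'group': '?', 'year': 0})
print(filled)
```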
Practice problems:
1. What is the result of mv1.dropna(axis=1)?
2. Does mv1 change as a result?
3. (a) Define a function P1 that accepts two parameters:
   - a data frame with a format similar to that of the IMDB data, and
   - a string containing a column name, e.g. 'Year'.
   The function should return a tuple with two values:
   - the average value of Rating per each value of the specified column (e.g. average rating per Year), and
   - a list of all the distinct values in the specified column of imdb, without repetition, sorted from lowest to highest.
   Show how to call the function on the IMDB data file.
   (b) Repeatedly ask the user to enter a column name until a valid column name from IMDB.csv is entered. Then call function P1 on the imdb dataframe and the specified column, and print out the list that it returns (the second return value).
4. (a) Define a function P2 that accepts two parameters:
   - a data frame with a format similar to that of the purchase data, and
   - a data frame with a format similar to that of the product data.
   The function should create a new data frame containing only those PRODUCT_IDs and corresponding DESCRIPTION values that were purchased by both Male and Female users (see the GENDER column). Hint: use a set operation. Return the created data frame.
   Show how to call the function on the purchase-s and pdata-s csv data files.
   (b) Read the purchase-s and pdata-s csv data files into data frames and call function P2 defined above, saving the result in a variable named DF.
   Then repeatedly ask the user for a keyword while a non-empty string is entered, and print those rows from DF which contain the keyword in their DESCRIPTION.