Pandas data manipulation - Bentley University
Handout 11
CS602 ?Data-Driven Development with ?Spring'21
Page 1 of 6
Handout 11 Pandas data manipulation
COMBINING DATA: MERGE AND JOIN
Already saw some examples of adding columns, rows: functions concat and append Merge function provides SQL-style `join' capabilities, based on equality of column or index values.
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
Similar to SQL join; implemented via merge:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Returns: a data frame with data from left and right merged according to the params below.
If on parameter, specifying which columns to join in, not passed and left_index and right_index are False, the intersection of the columns in the DataFrames will be inferred to be the join keys.
left, right: DataFrames to be merged. By default they will be merged by the equality of values in the commonly named column (intersection of columns).
how:
One of 'left', 'right', 'outer', 'inner'.
on:
Column or index names to join on. Must be found in both the left and right
DataFrame objects.
left_on/right_on: Columns or index levels from the left/right DataFrame to use as keys. Can either be column names, index level names.
left_index/
right_index: If True, use the index (row labels) from the left DataFrame as its join key(s).
sort:
Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting to False will improve performance substantially in many cases.
suffixes: A tuple of string suffixes to apply to overlapping columns. Defaults to ('_x','_y').
Notice: By default items are in sorted order by the join column. The result uses a new index.
- 1 -
Handout 11
CS602 ?Data-Driven Development with ?Spring'21
Page 2 of 6
Example 1
>>> df1 emp group
0 Jane Acc 1 Anne Eng 2 Bob Eng 3 Sam HR
>>>df2 emp
0 Lisa 1 Bob 2 Jake 3 Sam
start 2004 2008 2012 2014
how ? control inclusion of non-matching rows
>>> pd.merge (df1,df2) # how='inner'
>>> pd.merge(df1,df2, how='outer')
emp group start
emp group start
0 Bob Eng 2008
0 Jane Acc
NaN
1 Sam HR 2014
1 Anne Eng
NaN
2 Bob Eng 2008.0
3 Sam HR 2014.0
4 Lisa NaN 2004.0
5 Jake NaN 2012.0
>>> pd.merge(df1,df2, how='left')
>>> pd.merge(df1,df2, how='right')
emp group start
0 Jane Acc
NaN
1 Anne Eng
NaN
2 Bob Eng 2008.0
3 Sam HR 2014.0
emp group start 0 Bob Eng 2008 1 Sam HR 2014 2 Lisa NaN 2004 3 Jake NaN 2012
Example 2
>>> df3 #employee group
emp group start 0 Jane Acc 2011 1 Anne Eng 2009 2 Bob Eng 2017 3 Sam HR 2017
>>> df4 #employee start
employee
0
Bob
1
Jake
2
Sam
start 2008 2012 2014
>>> df5 #group manager Out[25]:
group emp 0 Acc Lisa 1 Eng Tom 2 HR David
left_on/right_on, suffixes ? if join column titles do not match, and if frames contain the same columns that need to be renamed in the result.
>>> pd.merge(df3, df4, left_on='emp', right_on = 'employee')
emp group start_x employee start_y
0 Bob Eng
2017
Bob
2008
1 Sam HR
2017
Sam
2014
>>> pd.merge(df3, df4, left_on='emp',right_on='employee',suffixes = ('_group','_comp'))
emp group start_group employee start_comp
0 Bob Eng
2017
Bob
2008
1 Sam HR
2017
Sam
2014
- 2 -
Handout 11
CS602 ?Data-Driven Development with ?Spring'21
Page 3 of 6
left_index/right_index Merging by the index values (can also merge by index with column)
>>> pd.merge(df3, df4, left_index=True, right_index = True)
emp group start_x employee start_y
0 Jane Acc
2011
Bob
2008
1 Anne Eng
2009
Jake
2012
2 Bob Eng
2017
Sam
2014
on ? which columns to use for joining
>>> pd.merge(df3, df5)
>>> pd.merge(df3, df5, on = 'group')
Empty DataFrame Columns: [emp, group, start] Index: []
emp_x group 0 Jane Acc 1 Anne Eng 2 Bob Eng 3 Sam HR
start 2011 2009 2017 2017
emp_y Lisa Tom Tom
David
Practice problems:
1. Given the above definitions, what is the result of pd.merge (df3, df4) ? 2. Both df3 and df5 contain data on employees and their group names. Create a data frame that
combines all employees and their group info from df3, df5. The result should contain the following:
emp group 0 Jane Acc 1 Anne Eng 2 Bob Eng 3 Sam HR 4 Lisa Acc 5 Tom Eng 6 David HR
GROUPBY
Pandas groupby() function mimics the SQL group by clause, in creating groupings of data. Following the grouping (a.k.a split), items in groups can be used to apply a function on them, and then combine the results in a new data frame.
DataFrame.groupby(by=None, axis=0)
by: label, or list of labels, function or mapping
Used to determine the groups for the groupby. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted a (single) key If by is a function, it's called on each value of the object's index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups.
- 3 -
Handout 11
CS602 ?Data-Driven Development with ?Spring'21
Page 4 of 6
Example :
>>> staff = pd.DataFrame({'emp': ['Vic', 'Anne', 'Bob', 'Sam', 'Tom', 'Bob'], 'dept': ['Acc', 'Eng', 'Eng', 'HR', 'Acc', 'HR'], 'start':[2016, 2017, 2001, 2017, 2011, 2005 ], 'vac': [11, 12, 34, 12, 16, 5 ]}).sort_values('dept')
>>> staff
emp dept start vac 0 Vic Acc 2016 11 4 Tom Acc 2011 16 1 Anne Eng 2017 12 2 Bob Eng 2001 34 3 Sam HR 2017 12 5 Bob HR 2005 5
>>> for g in staff.groupby(by = 'dept'): # Iterate over groups: >>> print(g, type(g))
('Acc', emp dept start vac
0 Vic Acc 2016 11
4 Tom Acc 2011 16)
('Eng',
emp dept start vac
1 Anne Eng 2017 12
2 Bob Eng 2001 34)
('HR', emp dept start vac
3 Sam HR 2017 12
5 Bob HR 2005 5)
>>> dg = staff.groupby(by = 'dept')
>>> dg.min()
emp start vac
dept
Acc Tom 2011 11
Eng Anne 2001 12
HR
Bob 2005 5
# Apply a group-function per each group
>>> dg.median() #group function produces a DATAFRAME indexed by the values in grouping column
start vac dept Acc 2013.5 13.5 Eng 2009.0 23.0 HR 2011.0 8.5
>>> dg.median()['vac']
dept
Acc 13.5
Eng 23.0
HR
8.5
Practice problems:
1. Suggest three data analysis scenarios on the movies data, requiring a groupby(). 2. Implement the analyses you suggested using pandas methods.
- 4 -
Handout 11
CS602 ?Data-Driven Development with ?Spring'21
Page 5 of 6
WORKING WITH MISSING DATA Real world data is rarely complete and noiseless/homogenous (i.e. error-free).
Missing data can be referred to as: None ? `Pythonic no-value' constant NaN- Not a Number, a floating point value, numpy.nan NA, null ? are not within Python/numpy/pandas vocab
? Numerical functions and operations, e.g. sum, max, etc. will perform fine on NaN, but generate an error with None. Non-numerical functions will fail, generating an error.
None == np.nan #Out[53]: False type(None) #Out[54]: NoneType type(np.nan) #Out[55]: float np.nan + 5 # Out[58]: nan None + 5 # TypeError: unsupported operand type(s) for +:'NoneType' and 'int'
pd.isnull(v) pd.isna(v)
pd.notnull(v) pd.notna(v)
Check if value is None or np.nan, return a boolean The opposite of the above functions.
? Checks for missing values:
# checking if value is missing pd.isnull(None) #Out[62]: True pd.isnull(np.nan) #Out[63]: True pd.isnull(null) #NameError: name 'null' is not defined pd.isnull(7) #Out[65]: False pd.isnull('') #Out[66]: False
pd.isna (None) #Out[70]: True pd.isna (np.nan) #Out[71]: True
? Removing or filling-in missing values in a data frame
dataframe.dropna(axis=0, how='any', thresh=None, inplace=False)
dataframe.fillna( value = None, method=None, axis= None, inplace=False)
Returns a DataFrame with rows/cols containing missing values as specified by parameters, dropped.
Returns a DataFrame with rows/cols containing missing values filled with value as specified by parameters.
- 5 -
Handout 11
CS602 ?Data-Driven Development with ?Spring'21
Page 6 of 6
>>> mv1 = pd.DataFrame({'emp': ['Jane', 'Anne', None, 'Sam'],
'group': [None, 'Eng', 'Eng', 'HR'],
'year' : [np.nan, np.nan, 2013, 2015]})
>>> mv1
emp group year
0 Jane None
NaN
1 Anne Eng
NaN
2 None Eng 2013.0
3 Sam HR 2015.0
>>> mv1.dropna(axis = 0) # drop if row has any missing values
emp group year 3 Sam HR 2015.0
>>> mv1.dropna(axis = 0, thresh = 2)# must have at least 2 missing values to drop
emp group year
1 Anne Eng
NaN
2 None Eng 2013.0
3 Sam HR 2015.0
>>> mv1.dropna(axis = 0, how = 'all') # drop if all values in a row are missing
emp group year
0 Jane None
NaN
1 Anne Eng
NaN
2 None Eng 2013.0
3 Sam HR 2015.0
Practice problems:
1. What's the result of mv1.dropna(axis = 1) ?
2. Does mv1 change as a result?
- 6 -
................
................
In order to avoid copyright disputes, this page is only a partial summary.
To fulfill the demand for quickly locating and searching documents.
It is intelligent file search solution for home and business.
Related download
- pandas dataframe notes university of idaho
- reading and writing data with pandas
- technical analysis library in python documentation
- python pandas quick guide university of utah
- python programming pandas
- chapter 1 data handling using pandas i pandas
- with pandas f m a vectorized m a f operations cheat sheet
- cheat sheet pandas python datacamp
- pandas data manipulation bentley university
- data analysis with pandas
Related searches
- pandas read csv data types
- pandas make empty data frame
- pandas for data analysis pdf
- python pandas build data frame
- pandas data type
- data types in pandas dataframe
- pandas column data type
- data type pandas column
- pandas change dataframe data type
- pandas inner join data frames
- python pandas data types
- pandas check data type for a column