Handout 10 - Bentley University
CS602 - Data-Driven Development with Python, Fall '22

Handout 10: Introduction to Pandas data structures
The pandas library provides efficient operations on multidimensional arrays of heterogeneous data with attached row and column labels. It is built on top of NumPy.
Import pandas, conventionally under the abbreviated name pd:

import pandas as pd
Main data structures are Series and DataFrame.
SERIES
A one-dimensional array of indexed data. The index is not necessarily numeric and may contain duplicate values.
>>> s = pd.Series([7, 12, 3, 45])
>>> s          # shows the index on the left and the element on the right
0     7
1    12
2     3
3    45
dtype: int64

>>> type(s)
pandas.core.series.Series

>>> s.values   # the underlying NumPy array
array([ 7, 12,  3, 45], dtype=int64)

>>> type(s.values)
numpy.ndarray

>>> s.index
RangeIndex(start=0, stop=4, step=1)
# Can assign a non-numeric index:
>>> si = pd.Series([210, 31, 40, 15], index=['Porcupine', 'Chipmunk', 'Squirrel', 'Goat'])
>>> si
Porcupine    210
Chipmunk      31
Squirrel      40
Goat          15
dtype: int64

>>> si['Goat']
15
>>> si[0]      # can also use the numeric position!
210
>>> d = {'d': 45, 'b': 33, 'c': 12, 'a': 34}
>>> d
{'d': 45, 'b': 33, 'c': 12, 'a': 34}

>>> sd = pd.Series(d)   # Series created from a dictionary
>>> sd
d    45
b    33
c    12
a    34
dtype: int64
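As noted above, a Series index may contain duplicate values; selecting a duplicated label then returns every matching element, while a unique label returns a scalar. A minimal sketch (the labels here are illustrative):

```python
import pandas as pd

# index labels need not be unique
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

print(dup['a'])   # duplicated label: returns a Series with both 'a' rows
print(dup['b'])   # unique label: returns the scalar value
```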
DATAFRAME
A two-dimensional array with labeled rows (index) and columns. Each column is of type Series.
Example: the file tiny.csv holds the following data:

Title                     Rating  TotalVotes  Genre1     Genre2
12 Years a Slave (2013)   8.1     496,092     Biography  Drama
127 Hours (2010)          7.6     297,075     Adventure  Biography
50/50 (2011)              7.7     283,935     Comedy     Drama
About Time (2013)         7.8     225,412     Comedy     Drama

Add this code to see data frames fully:

pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.width', 1000)

If having issues with encoding, also add this:

import sys
print(sys.stdout.encoding)
sys.stdout.reconfigure(encoding='utf-8')
print(sys.stdout.encoding)
CREATING AND SAVING A DATAFRAME

There are a number of ways to create and initialize new DataFrames, for example from a file, Python dictionaries, lists or tuples, existing Series or DataFrames, or NumPy data arrays.

In the descriptions below, pd refers to the abbreviated name of the pandas package and df refers to a DataFrame object.

pd.read_csv(filepath, and many other params)
    Load delimited data from a file, URL, or file-like object. Uses comma as the default delimiter.

df.to_csv(filepath, index, and other params)
    Write the data from the df DataFrame out to a comma-separated file. index=True by default, which includes the index column; usually set to False.

df.copy()
    Return a duplicate of df.

pd.DataFrame(data, index, columns, dtype, copy)
    Create a new DataFrame from data, with the specified index, columns, and dtype.
>>> import os
>>> df = pd.read_csv(os.path.join('data', 'tiny.csv'))
>>> print("shape", df.shape)
>>> print(df)
shape (4, 5)
                     Title  Rating TotalVotes     Genre1     Genre2
0  12 Years a Slave (2013)     8.1    496,092  Biography      Drama
1         127 Hours (2010)     7.6    297,075  Adventure  Biography
2             50/50 (2011)     7.7    283,935     Comedy      Drama
3        About Time (2013)     7.8    225,412     Comedy      Drama

>>> df2 = df.copy()
>>> df2.to_csv(os.path.join('data', 'tiny-copy.csv'), index=False)

>>> df.columns
Index(['Title', 'Rating', 'TotalVotes', 'Genre1', 'Genre2'], dtype='object')
>>> df.index
RangeIndex(start=0, stop=4, step=1)
>>> df3 = pd.DataFrame(columns=["courseID", "enrolled", "instructor"])
>>> print(df3)
Empty DataFrame
Columns: [courseID, enrolled, instructor]
Index: []

>>> df3.courseID = ['CS399', 'CS602']
>>> df3.enrolled = [22, 17]
>>> print(df3)
  courseID  enrolled instructor
0    CS399        22        NaN
1    CS602        17        NaN
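A DataFrame can also be initialized directly from a dictionary mapping column names to lists, avoiding the column-by-column assignment above. A short sketch (the course data here mirrors the example):

```python
import pandas as pd

# each dictionary key becomes a column; the lists must have equal length
df3 = pd.DataFrame({'courseID': ['CS399', 'CS602'],
                    'enrolled': [22, 17]})

print(df3.shape)           # (2, 2)
print(list(df3.columns))   # ['courseID', 'enrolled']
```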
INDEXING, SLICING AND SELECTION, LOC AND ILOC
>>> df.Title   # same as df["Title"]
0    12 Years a Slave (2013)
1           127 Hours (2010)
2               50/50 (2011)
3          About Time (2013)
Name: Title, dtype: object

>>> df["Title"][2]
'50/50 (2011)'
iloc[rowsInt, colsInt]    - integer-based indexing
loc[rowLabels, colLabels] - label-based indexing
The following three expressions each produce the same result:

>>> df[["Rating", "TotalVotes"]][1:3]
>>> df.iloc[1:3, 1:3]
>>> df.loc[1:2, ["Rating", "TotalVotes"]]
   Rating TotalVotes
1     7.6    297,075
2     7.7    283,935
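Note why the slice bounds above differ: iloc slices are end-exclusive (like Python list slicing), while loc slices are end-inclusive because they operate on labels. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]})

print(df.iloc[1:3])   # positions 1 and 2 only: the end position is excluded
print(df.loc[1:3])    # labels 1, 2 AND 3: the end label is included
```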
ADDING, REMOVING AND RENAMING ROWS/COLUMNS
** Note - function descriptions below are not complete

df[newcolname] = val
    Adds a new column titled newcolname, filled with val.

df.reindex(index=None, columns=None, fill_value=NaN)
    Returns a reindexed DataFrame, conforming df to the new index and placing NA/NaN in locations having no value in the previous index. Can be used to add new rows. A new object is produced; the original frame is unchanged.

df.drop(index=None, columns=None, inplace=False)
    Remove rows or columns by specifying index or column names. index, columns: single label or list-like. A new object is produced and the original data frame is unchanged, unless the inplace parameter is True.

df.rename(index=index-dict, columns=column-dict, inplace=True)
    The index/columns param is a dictionary of form {old: new}. The function replaces the old values in the index/columns with the new. inplace controls whether the original DataFrame is changed.
>>> df2 = df.reindex(index=df.index.union([4, 5, 6]))   # add new empty rows
>>> df2
                     Title  Rating TotalVotes     Genre1     Genre2
0  12 Years a Slave (2013)     8.1    496,092  Biography      Drama
1         127 Hours (2010)     7.6    297,075  Adventure  Biography
2             50/50 (2011)     7.7    283,935     Comedy      Drama
3        About Time (2013)     7.8    225,412     Comedy      Drama
4                      NaN     NaN        NaN        NaN        NaN
5                      NaN     NaN        NaN        NaN        NaN
6                      NaN     NaN        NaN        NaN        NaN
>>> df2.drop(index=[4, 5, 6], inplace=True)
>>> df2
                     Title  Rating TotalVotes     Genre1     Genre2
0  12 Years a Slave (2013)     8.1    496,092  Biography      Drama
1         127 Hours (2010)     7.6    297,075  Adventure  Biography
2             50/50 (2011)     7.7    283,935     Comedy      Drama
3        About Time (2013)     7.8    225,412     Comedy      Drama
>>> df2.drop(columns=['Genre2'], inplace=True)
>>> df2
                     Title  Rating TotalVotes     Genre1
0  12 Years a Slave (2013)     8.1    496,092  Biography
1         127 Hours (2010)     7.6    297,075  Adventure
2             50/50 (2011)     7.7    283,935     Comedy
3        About Time (2013)     7.8    225,412     Comedy
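The rename function described above maps old labels to new ones via a {old: new} dictionary. A small sketch (the frame here is a fragment of the example data; the new column name is illustrative):

```python
import pandas as pd

df2 = pd.DataFrame({'Title': ['50/50 (2011)'], 'Rating': [7.7]})

# columns dict has the form {old: new}; a copy is returned unless inplace=True
df2.rename(columns={'Rating': 'IMDBRating'}, inplace=True)
print(list(df2.columns))   # ['Title', 'IMDBRating']
```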
Example 2: suppose we have the following content in weather.xlsx:

Day  Description        High Temp  Low Temp  Precip  Humidity
1    PM Showers         50         51        50      77
2    PM Thundershowers  52         61        90      89
3    Mostly Sunny       45         60        0       55
>>> exdf = pd.read_excel("weather.xlsx")   # also have pd.read_csv
>>> exdf
   Day        Description  High Temp  Low Temp  Precip  Humidity
0    1         PM Showers         50        51      50        77
1    2  PM Thundershowers         52        61      90        89
2    3       Mostly Sunny         45        60       0        55
>>> exdf.values
array([[1, 'PM Showers ', 50, 51, 50, 77],
       [2, 'PM Thundershowers', 52, 61, 90, 89],
       [3, 'Mostly Sunny', 45, 60, 0, 55]], dtype=object)

>>> type(exdf.values)
numpy.ndarray
>>> exdf.values.shape
(3, 6)

>>> exdf['Precip']
0    50
1    90
2     0
Name: Precip, dtype: int64
>>> exdf['Precip'][0]
50
Practice problem
1. Create the exdf table from the xlsx file as a pandas DataFrame.
2. Change the exdf column titles to all lower case.
3. Change the index (row labels) to include the rest of the week, preserving the existing data.
4. Update the Wind column with values 45.6, 12.5, 3.
5. Rearrange the columns to list the low temp, then high temp, then the other fields.
6. Remove column 'Wind'.
7. Add a column with the Average Temp, computed as the average of the High and Low temp values.
8. Create a NumPy table with low and high temperature values; compute the min, max and median values for each of the two parameters.
FILTERING BY CONDITION, SORTING - like the SQL WHERE clause; and is &, or is |, not is ~
>>> df[df.Rating > 8]
                     Title  Rating TotalVotes     Genre1 Genre2
0  12 Years a Slave (2013)     8.1    496,092  Biography  Drama

~   not
&   and
|   or
>>> df[(df.Rating > 8) | (df.Genre1 == 'Comedy')]   # note: ()'s are required!!!
                     Title  Rating TotalVotes     Genre1 Genre2
0  12 Years a Slave (2013)     8.1    496,092  Biography  Drama
2             50/50 (2011)     7.7    283,935     Comedy  Drama
3        About Time (2013)     7.8    225,412     Comedy  Drama
df.sort_values(by=columnlst, ascending=True, inplace=False)
    Returns df sorted by the specified column(s). A new object is produced and the original data frame is unchanged, unless the inplace parameter is True.
df.sort_index(ascending=True, inplace=False)
    Returns df sorted by index. A new object is produced and the original data frame is unchanged, unless the inplace parameter is True.
>>> print(df.sort_values('Rating'))
                     Title  Rating TotalVotes     Genre1     Genre2
1         127 Hours (2010)     7.6    297,075  Adventure  Biography
2             50/50 (2011)     7.7    283,935     Comedy      Drama
3        About Time (2013)     7.8    225,412     Comedy      Drama
0  12 Years a Slave (2013)     8.1    496,092  Biography      Drama
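sort_values also accepts a list of columns, and ascending may be a matching list of booleans, giving a secondary sort within groups. A minimal sketch (values taken from the example data):

```python
import pandas as pd

df = pd.DataFrame({'Genre1': ['Comedy', 'Comedy', 'Adventure'],
                   'Rating': [7.7, 7.8, 7.6]})

# sort by genre, then by rating descending within each genre
out = df.sort_values(by=['Genre1', 'Rating'], ascending=[True, False])
print(out.Rating.tolist())   # [7.6, 7.8, 7.7]
```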
VECTORIZED OPERATIONS, UFUNCS, ALIGNMENT

The vectorized operations from NumPy are applicable to Series and DataFrame, e.g.

>>> import numpy as np
>>> df2 = pd.DataFrame(np.arange(20, 35).reshape(5, 3), index=[3, 4, 5, 6, 7], columns=list('xyz'))
>>> df2
    x   y   z
3  20  21  22
4  23  24  25
5  26  27  28
6  29  30  31
7  32  33  34

>>> df2.loc[:, 'x'] * 2 + 5
3    45
4    51
5    57
6    63
7    69
Name: x, dtype: int32
Note that in operations involving two structures:
- operations are applied to elements with matching index values (alignment)
- the resulting structure contains the union of all indices
>>> s1 = pd.Series(range(10), [chr(code) for code in range(ord('a'), ord('a') + 10)])
>>> s2 = pd.Series([20 for i in range(5)], [chr(code) for code in range(ord('c'), ord('c') + 5)])
>>> s3 = s1 + s2
>>> s3
a     NaN
b     NaN
c    22.0
d    23.0
e    24.0
f    25.0
g    26.0
h     NaN
i     NaN
j     NaN
dtype: float64
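When the NaN values produced by non-overlapping labels are unwanted, the method form of the operator (s1.add(s2)) accepts a fill_value that stands in for the missing side before the operation. A sketch:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([10, 20], index=['b', 'c'])

# labels missing on one side are treated as 0 instead of producing NaN
s3 = s1.add(s2, fill_value=0)
print(s3)   # a: 1.0, b: 12.0, c: 20.0
```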
WORKING WITH STRINGS
- String methods are accessed via the Series's str attribute and generally have names matching the analogous built-in string methods, i.e. str.upper() instead of upper.
- Slicing using [::] syntax is done with the method str.slice(); works on strings AND lists.
- Indexing is done with str.get(); works on strings AND lists.
- These methods exclude missing/NA values automatically.
Method name            Description
cat                    Concatenate strings element-wise with optional delimiter
contains               Return boolean array if each string contains pattern/regex
count                  Count occurrences of pattern
endswith, startswith   Equivalent to x.endswith(pattern) or x.startswith(pattern) for each element
findall                Compute list of all occurrences of pattern/regex for each string
get                    Index into each element (retrieve i-th element)
join                   Join strings in each element of the Series with passed separator
len                    Compute length of each string
lower, upper           Convert cases; equivalent to x.lower() or x.upper() for each element
match                  Use re.match with the passed regular expression on each element, returning matched groups as list
pad                    Add whitespace to left, right, or both sides of strings
center                 Equivalent to pad(side='both')
repeat                 Duplicate values; for example s.str.repeat(3) is equivalent to x * 3 for each string
replace                Replace occurrences of pattern/regex with some other string
slice                  Slice each string in the Series
split                  Split strings on delimiter or regular expression
strip, rstrip, lstrip  Trim whitespace, including newlines; equivalent to x.strip() (and rstrip, lstrip, respectively) for each element
df["Year"] = df.Title.str.slice(-5, -1)
df.Title = df.Title.str.slice(0, -6)
print(df)
                Title  Rating TotalVotes     Genre1     Genre2  Year
0    12 Years a Slave     8.1    496,092  Biography      Drama  2013
1           127 Hours     7.6    297,075  Adventure  Biography  2010
2               50/50     7.7    283,935     Comedy      Drama  2011
3          About Time     7.8    225,412     Comedy      Drama  2013

>>> print(df.Title.str.split().str.get(0))
0       12
1      127
2    50/50
3    About
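These str methods combine naturally with boolean filtering; for example, str.contains produces a mask that selects matching rows. A sketch using titles from the example data:

```python
import pandas as pd

df = pd.DataFrame({'Title': ['12 Years a Slave', '127 Hours', 'About Time']})

# boolean mask: which titles contain the substring 'Time'
mask = df.Title.str.contains('Time')
print(df[mask].Title.tolist())   # ['About Time']
```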
AGGREGATION
Recall that DataFrame.values produces a NumPy table. Hence, all aggregation methods and statistical functions that work for NumPy can be applied to DataFrame.values. The following Pandas aggregate functions are applicable to Series and DataFrames.
Function  Description
count     Number of non-NA observations
sum       Sum of values
mean      Mean of values
mad       Mean absolute deviation
median    Arithmetic median of values
min       Minimum
max       Maximum
mode      Mode
abs       Absolute value
prod      Product of values
std       Bessel-corrected sample standard deviation
var       Unbiased variance
quantile  Sample quantile (value at %)
>>> print(df.min())
Title         12 Years a Slave (2013)
Rating                            7.6
TotalVotes                    225,412
Genre1                      Adventure
Genre2                      Biography
Rating1                         760.0
dtype: object

>>> print(df.Rating.mean())
7.8
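Note that TotalVotes is read as strings because of the thousands separators, so numeric aggregates cannot be applied to it directly; the commas can be stripped with str.replace and the result converted with astype. A sketch using the values from the example data:

```python
import pandas as pd

df = pd.DataFrame({'TotalVotes': ['496,092', '297,075', '283,935', '225,412']})

# remove the thousands separators, then convert the strings to integers
votes = df.TotalVotes.str.replace(',', '').astype(int)
print(votes.max())   # 496092
```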
Practice problem: based on the data contained in IMDB.csv and IMDB-dataDict.txt, compose Python code using the pandas package for the following tasks.
1. Create a DataFrame containing data from the first columns up to and including Runtime.
2. Find how many movies are in the data set.
3. Select data for those movies that run for less than 2 hours.
4. Find the most expensive and least expensive film, and the mean and standard deviation of the budget.
5. Find the 10 top rated movies.
6. Find the range of years for the movies listed in the data set.
7. List 10 other questions to answer based on this data and create solutions for these queries.