Handout 10 - Bentley University
Handout 10
CS602 ? Data-Driven Development with ?Fall'22
Page 1 of 9
Handout 10 Introduction to Pandas data structures
Pandas library provides efficient operations on multidimensional arrays of heterogeneous data with attached row and column labels. Built on top of NumPy.
Described in
import pandas import pandas as pd
Main data structures are Series and DataFrame.
SERIES One-dimensional array of indexed data. Indexing is not necessarily numeric; index may contain duplicate values.
>>> s = pd.Series ([ 7, 12, 3, 45]) >>> s # will show index on the left and element on the right
0
7
1 12
2
3
3 45
dtype: int64
>>> type(s) pandas.core.series.Series
>>> s.values # numpy array array([ 7, 12, 3, 45], dtype=int64)
>>> type(s.values) numpy.ndarray
>>> s.index RangeIndex(start=0, stop=4, step=1)
# Can assign a non-numeric index:
>>> si = pd.Series([210, 31, 40, 15], index = ['Porcupine','Chipmunk','Squirrel','Goat'])
>>> si
Porcupine 210
Chipmunk
31
Squirrel
40
Goat
15
dtype: int64
>>> si['Goat'] 15
>>> si[0] # can also use the numeric index! 210
>>> d = {'d': 45, 'b': 33, 'c':12, 'a': 34} >>> d
{'d': 45, 'b': 33, 'c': 12, 'a': 34}
>>> sd = pd.Series(d) # Series created from a
dictionary:
>>> sd d 45 b 33 c 12 a 34
DATAFRAME
A two-dimensional array with labeled rows and columns. Each column is of type Series. A two-dimensional array with labeled rows (index) and columns. Each column is of type Series.
- 1 -
Handout 10
CS602 ? Data-Driven Development with ?Fall'22
Page 2 of 9
Title 12 Years a Slave (2013) 127 Hours (2010) 50/50 (2011) About Time (2013)
Rating 8.1 7.6 7.7 7.8
TotalVotes 496,092 297,075 283,935 225,412
Genre1 Biography Adventure Comedy Comedy
Add this code to see data frames fully:
pd.set_option('display.max_columns', 1000) pd.set_option('display.max_rows', 1000) pd.set_option('display.width', 1000)
If having issues with encoding, also add this
import sys print(sys.stdout.encoding) sys.stdout.reconfigure(encoding='utf-8') print(sys.stdout.encoding)
Genre2 Drama Biography Drama Drama
CREATING AND SAVING A DATAFRAME
There are a number of ways to create and initialize new DataFrames, for example from
? a file, Python dictionaries, lists or tuples, existing Series or DataFrame, NumPy data arrays
In the descriptions below ? pd refers to the abbreviated name of the pandas package, df refers to a data
frame object.
pd.read_csv (filepath, and Load delimited data from a file, URL, or file-like object. Use
many other params)
comma as default delimiter
df.to_csv(filepath, index, Write the data from df dataframe out to a comma-separated
and other params)
file. index = True default, includes index col; usually set to
False.
df.copy()
Return a duplicate of df
pd.DataFrame(data, index, Create a new data frame from data, with specified index,
columns, dtype, copy)
columns, dtype
>>> df = pd.read_csv(os.path.join('data', 'tiny.csv')) >>> print ("shape", df.shape) >>> print (df)
shape (4, 5)
Title
0 12 Years a Slave (2013)
1
127 Hours (2010)
2
50/50 (2011)
3
About Time (2013)
Rating TotalVotes 8.1 496,092 7.6 297,075 7.7 283,935 7.8 225,412
Genre1 Biography Adventure
Comedy Comedy
Genre2 Drama
Biography Drama Drama
>>> df2 = df.copy() >>> df2.to_csv(os.path.join('data', 'tiny-copy.csv'), index = False)
>>> df.columns Index(['Title', 'Rating', 'TotalVotes', 'Genre1', 'Genre2'], dtype='object') >>> df.index
- 2 -
Handout 10
CS602 ? Data-Driven Development with ?Fall'22
Page 3 of 9
RangeIndex(start=0, stop=4, step=1)
>>> df3 = pd.DataFrame(columns = ["courseID", "enrolled", "instructor"]]) >>> print(df3) Empty DataFrame Columns: [courseID, enrolled, instructor] Index: []
>>> df3.courseID = ['CS399', 'CS602']
>>> df3.enrolled = [22, 17]
>>> print(df3)
courseID enrolled instructor
0 CS399
22
NaN
1 CS602
17
NaN
INDEXING, SLICING AND SELECTION, LOC AND ILOC
>>> df.Title #same as df["Title"]
0 12 Years a Slave (2013)
1
127 Hours (2010)
2
50/50 (2011)
3
About Time (2013)
Name: Title, dtype: object
>>> df["Title"][2] '50/50 (2011)'
iloc [rowsInt, colsInt] - integer based indexing loc [rowLabels, colLabels]- label based indexing
>>> df[["Rating", "TotalVotes"]][1:3] >>> df.iloc[ 1:3, 1:3] >>> df.loc[ 1:2, ["Rating", "TotalVotes"]]
Rating TotalVotes
1
7.6 297,075
2
7.7 283,935
Rating TotalVotes
1
7.6 297,075
2
7.7 283,935
Rating TotalVotes
1
7.6 297,075
2
7.7 283,935
ADDING, REMOVING AND RENAMING ROWS/COLUMNS ** Note - function descriptions below are not complete
df[newcolname] = val
df.reindex( index=None, columns=None,
fill_value=NaN )
df.drop(index=None, columns=None,
Adds a new column with newcolname as a title, filled with val
Can be used to add new rows.
Returns a reindexed DataFrame. Conform DataFrame to new index , placing NA/NaN in locations having no value in the previous index. A new object is produced, the original frame is unchanged
Remove rows or columns by specifying index or column names. index, columns : single label or list-
- 3 -
Handout 10
CS602 ? Data-Driven Development with ?Fall'22
Page 4 of 9
inplace = False)
like. A new object is produced, the original data frame is unchanged, unless inplace parameter is True.
df.rename( index = index-dict, columns = column-dict, inplace = True )
index/columns param is a dictionary of form {old:new}. Function replaces the old values in the index/columns with the new. inplace controls whether the original dataframe is changed.
>>> df2 = df.reindex(index = df.index.union([4,5,6])) #add new empty rows
>>> df2
Title Rating TotalVotes
Genre1
Genre2
0 12 Years a Slave (2013)
8.1 496,092 Biography
Drama
1
127 Hours (2010)
7.6 297,075 Adventure Biography
2
50/50 (2011)
7.7 283,935
Comedy
Drama
3
About Time (2013)
7.8 225,412
Comedy
Drama
4
NaN
NaN
NaN
NaN
NaN
5
NaN
NaN
NaN
NaN
NaN
6
NaN
NaN
NaN
NaN
NaN
>>> df2.drop(index = [4,5,6], inplace =True)
>>> df2
Title Rating TotalVotes
0 12 Years a Slave (2013)
8.1 496,092
1
127 Hours (2010)
7.6 297,075
2
50/50 (2011)
7.7 283,935
3
About Time (2013)
7.8 225,412
Genre1 Biography Adventure
Comedy Comedy
Genre2 Drama
Biography Drama Drama
>>> df2.drop(columns = ['Genre2'], inplace = True)
>>> df2
Title Rating TotalVotes
Genre1
0 12 Years a Slave (2013)
8.1 496,092 Biography
1
127 Hours (2010)
7.6 297,075 Adventure
2
50/50 (2011)
7.7 283,935
Comedy
3
About Time (2013)
7.8 225,412
Comedy
Example 2 : suppose have the following content in weather.xlsx
Day
Description
High Temp Low Temp Precip Humidity
1
PM Showers
50
51
50
77
2
PM Thundershowers
52
61
90
89
3
Mostly Sunny
45
60
0
55
>>> exdf = pd.read_excel("weather.xlsx") # also have pd.read_csv
>>> exdf
Day
Description High Temp Low Temp Precip Humidity
0 1
PM Showers
50
51
50
77
1 2 PM Thundershowers
52
61
90
89
2 3
Mostly Sunny
45
60
0
55
>>> exdf.values array([[1, 'PM Showers ', 50, 51, 50, 77],
- 4 -
Handout 10
CS602 ? Data-Driven Development with ?Fall'22
Page 5 of 9
[2, 'PM Thundershowers', 52, 61, 90, 89], [3, 'Mostly Sunny', 45, 60, 0, 55]], dtype=object)
>>> type(exdf.values) numpy.ndarray
>>> exdf.values.shape (3, 6)
>>> exdf['Precip']
0 50
1 90
2
0
Name: Precip, dtype: int64
>>> exdf['Precip'][0] 50
Practice problem 1. Create the efdx table from the xlsx file as a pandas Dataframe. 2. Change the exdf column titles to all lower case 3. Change the index (row labels) to include the rest of the week, preserving the existing data. 4. Update the Wind column with values 45.6 , 12.5, 3 5. Rearrange the columns to list the low temp, then high temp, then other fields 6. Remove column `Wind' 7. Add a column with the Average Temp, computed as the average of High and Low temp values 8. Create a NumPy table with low and high temperature values; compute min, max and median values in for each of the two parameters.
FILTERING BY CONDITION, SORTING - like the SQL Where clause; and is &. or is |, not is ~
>>>df[df.Rating < 8]
Title Rating TotalVotes
Genre1 Genre2
0 12 Years a Slave (2013)
8.1 496,092 Biography Drama
~ not & and | or
>>>df[(df.Rating > 8) | (df.Genre1 == 'Comedy') ] #note ()'s are required!!!
Title Rating TotalVotes
Genre1 Genre2
0 12 Years a Slave (2013)
8.1 496,092 Biography Drama
2
50/50 (2011)
7.7 283,935
Comedy Drama
3
About Time (2013)
7.8 225,412
Comedy Drama
df.sort_values(by= columnlst, ascending = True)
Returns df sorted by specified column . A new object is produced, the original data frame is unchanged, unless inplace parameter is True.
- 5 -
................
................
In order to avoid copyright disputes, this page is only a partial summary.
To fulfill the demand for quickly locating and searching documents.
It is intelligent file search solution for home and business.
Related download
- handout 10 bentley university
- worksheet data handling using pandas
- lecture 14 advanced pandas
- python programming pandas
- 1 pandas 1 introduction
- sample question paper set a b c term i subject
- pandas xlsxwriter charts documentation
- data wrangling tidy data pandas
- pandas groupby in action assumption university
- assumption university