Handout 10 - Bentley University

Handout 10

CS602 ? Data-Driven Development with ?Spring'23

Page 1 of 9

Handout 10 Introduction to Pandas data structures

Pandas library provides efficient operations on multidimensional arrays of heterogeneous data with attached row and column labels. Built on top of NumPy.

Described in

import pandas import pandas as pd

Main data structures are Series and DataFrame.

SERIES One-dimensional array of indexed data. Indexing is not necessarily numeric; index may contain duplicate values.

>>> s = pd.Series ([ 7, 12, 3, 45]) >>> s # will show index on the left and element on the right

0

7

1 12

2

3

3 45

dtype: int64

>>> type(s) pandas.core.series.Series

>>> s.values # numpy array array([ 7, 12, 3, 45], dtype=int64)

>>> type(s.values) numpy.ndarray

>>> s.index RangeIndex(start=0, stop=4, step=1)

# Can assign a non-numeric index:

>>> si = pd.Series([210, 31, 40, 15], index = ['Porcupine','Chipmunk','Squirrel','Goat'])

>>> si

Porcupine 210

Chipmunk

31

Squirrel

40

Goat

15

dtype: int64

>>> si['Goat'] 15

>>> si[0] # can also use the numeric index! 210

>>> d = {'d': 45, 'b': 33, 'c':12, 'a': 34} >>> d

{'d': 45, 'b': 33, 'c': 12, 'a': 34}

>>> sd = pd.Series(d) # Series created from a

dictionary:

>>> sd d 45 b 33 c 12 a 34

DATAFRAME

A two-dimensional array with labeled rows and columns. Each column is of type Series. A two-dimensional array with labeled rows (index) and columns. Each column is of type Series.

- 1 -

Handout 10

CS602 ? Data-Driven Development with ?Spring'23

Page 2 of 9

Title 12 Years a Slave (2013) 127 Hours (2010) 50/50 (2011) About Time (2013)

Rating 8.1 7.6 7.7 7.8

TotalVotes 496,092 297,075 283,935 225,412

Genre1 Biography Adventure Comedy Comedy

Add this code to see data frames fully:

pd.set_option('display.max_columns', 1000) pd.set_option('display.max_rows', 1000) pd.set_option('display.width', 1000)

If having issues with encoding, also add this

import sys print(sys.stdout.encoding) sys.stdout.reconfigure(encoding='utf-8') print(sys.stdout.encoding)

Genre2 Drama Biography Drama Drama

CREATING AND SAVING A DATAFRAME

There are a number of ways to create and initialize new DataFrames, for example from

? a file, Python dictionaries, lists or tuples, existing Series or DataFrame, NumPy data arrays

In the descriptions below ? pd refers to the abbreviated name of the pandas package, df refers to a data

frame object.

pd.read_csv (filepath, and Load delimited data from a file, URL, or file-like object. Use

many other params)

comma as default delimiter

df.to_csv(filepath, index, Write the data from df dataframe out to a comma-separated

and other params)

file. index = True default, includes index col; usually set to

False.

df.copy()

Return a duplicate of df

pd.DataFrame(data, index, Create a new data frame from data, with specified index,

columns, dtype, copy)

columns, dtype

>>> df = pd.read_csv(os.path.join('data', 'tiny.csv')) >>> print ("shape", df.shape) >>> print (df)

shape (4, 5)

Title

0 12 Years a Slave (2013)

1

127 Hours (2010)

2

50/50 (2011)

3

About Time (2013)

Rating TotalVotes 8.1 496,092 7.6 297,075 7.7 283,935 7.8 225,412

Genre1 Biography Adventure

Comedy Comedy

Genre2 Drama

Biography Drama Drama

>>> df2 = df.copy() >>> df2.to_csv(os.path.join('data', 'tiny-copy.csv'), index = False)

>>> df.columns Index(['Title', 'Rating', 'TotalVotes', 'Genre1', 'Genre2'], dtype='object')

- 2 -

Handout 10

CS602 ? Data-Driven Development with ?Spring'23

Page 3 of 9

>>> df.index RangeIndex(start=0, stop=4, step=1)

>>> df3 = pd.DataFrame(columns = ["courseID", "enrolled", "instructor"]]) >>> print(df3) Empty DataFrame Columns: [courseID, enrolled, instructor] Index: []

>>> df3.courseID = ['CS399', 'CS602']

>>> df3.enrolled = [22, 17]

>>> print(df3)

courseID enrolled instructor

0 CS399

22

NaN

1 CS602

17

NaN

INDEXING, SLICING AND SELECTION, LOC AND ILOC

>>> df.Title #same as df["Title"]

0 12 Years a Slave (2013)

1

127 Hours (2010)

2

50/50 (2011)

3

About Time (2013)

Name: Title, dtype: object

>>> df["Title"][2] '50/50 (2011)'

iloc [rowsInt, colsInt] - integer based indexing loc [rowLabels, colLabels]- label based indexing

>>> df[["Rating", "TotalVotes"]][1:3] >>> df.iloc[ 1:3, 1:3] >>> df.loc[ 1:2, ["Rating", "TotalVotes"]]

Rating TotalVotes

1

7.6 297,075

2

7.7 283,935

Rating TotalVotes

1

7.6 297,075

2

7.7 283,935

Rating TotalVotes

1

7.6 297,075

2

7.7 283,935

ADDING, REMOVING AND RENAMING ROWS/COLUMNS ** Note - function descriptions below are not complete

df[newcolname] = val df.reindex( index=None,

columns=None, fill_value=NaN )

df.drop(index=None,

Adds a new column with newcolname as a title, filled with val

Can be used to add new rows.

Returns a reindexed DataFrame. Conform DataFrame to new index , placing NA/NaN in locations having no value in the previous index. A new object is produced, the original frame is unchanged

Remove rows or columns by specifying index or

- 3 -

Handout 10

CS602 ? Data-Driven Development with ?Spring'23

Page 4 of 9

columns=None, inplace = False)

column names. index, columns : single label or listlike. A new object is produced, the original data frame is unchanged, unless inplace parameter is True.

df.rename( index = index-dict, columns = column-dict, inplace = True )

index/columns param is a dictionary of form {old:new}. Function replaces the old values in the index/columns with the new. inplace controls whether the original dataframe is changed.

>>> df2 = df.reindex(index = df.index.union([4,5,6])) #add new empty rows

>>> df2

Title Rating TotalVotes

Genre1

Genre2

0 12 Years a Slave (2013)

8.1 496,092 Biography

Drama

1

127 Hours (2010)

7.6 297,075 Adventure Biography

2

50/50 (2011)

7.7 283,935

Comedy

Drama

3

About Time (2013)

7.8 225,412

Comedy

Drama

4

NaN

NaN

NaN

NaN

NaN

5

NaN

NaN

NaN

NaN

NaN

6

NaN

NaN

NaN

NaN

NaN

>>> df2.drop(index = [4,5,6], inplace =True)

>>> df2

Title Rating TotalVotes

0 12 Years a Slave (2013)

8.1 496,092

1

127 Hours (2010)

7.6 297,075

2

50/50 (2011)

7.7 283,935

3

About Time (2013)

7.8 225,412

Genre1 Biography Adventure

Comedy Comedy

Genre2 Drama

Biography Drama Drama

>>> df2.drop(columns = ['Genre2'], inplace = True)

>>> df2

Title Rating TotalVotes

Genre1

0 12 Years a Slave (2013)

8.1 496,092 Biography

1

127 Hours (2010)

7.6 297,075 Adventure

2

50/50 (2011)

7.7 283,935

Comedy

3

About Time (2013)

7.8 225,412

Comedy

Example 2 : suppose have the following content in weather.xlsx

Day

Description

High Temp Low Temp Precip Humidity

1

PM Showers

50

51

50

77

2

PM Thundershowers

52

61

90

89

3

Mostly Sunny

45

60

0

55

>>> exdf = pd.read_excel("weather.xlsx") # also have pd.read_csv

>>> exdf

Day

Description High Temp Low Temp Precip Humidity

0 1

PM Showers

50

51

50

77

1 2 PM Thundershowers

52

61

90

89

2 3

Mostly Sunny

45

60

0

55

>>> exdf.values

- 4 -

Handout 10

CS602 ? Data-Driven Development with ?Spring'23

Page 5 of 9

array([[1, 'PM Showers ', 50, 51, 50, 77], [2, 'PM Thundershowers', 52, 61, 90, 89], [3, 'Mostly Sunny', 45, 60, 0, 55]], dtype=object)

>>> type(exdf.values) numpy.ndarray

>>> exdf.values.shape (3, 6)

>>> exdf['Precip']

0 50

1 90

2

0

Name: Precip, dtype: int64

>>> exdf['Precip'][0] 50

Practice problem 1. Create the efdx table from the xlsx file as a pandas Dataframe. 2. Change the exdf column titles to all lower case 3. Change the index (row labels) to include the rest of the week, preserving the existing data. 4. Update the Wind column with values 45.6 , 12.5, 3 5. Rearrange the columns to list the low temp, then high temp, then other fields 6. Remove column `Wind' 7. Add a column with the Average Temp, computed as the average of High and Low temp values 8. Create a NumPy table with low and high temperature values; compute min, max and median values in for each of the two parameters.

FILTERING BY CONDITION, SORTING - like the SQL Where clause; and is &. or is |, not is ~

>>>df[df.Rating < 8]

Title Rating TotalVotes

Genre1 Genre2

0 12 Years a Slave (2013)

8.1 496,092 Biography Drama

~ not & and | or

>>>df[(df.Rating > 8) | (df.Genre1 == 'Comedy') ] #note ()'s are required!!!

Title Rating TotalVotes

Genre1 Genre2

0 12 Years a Slave (2013)

8.1 496,092 Biography Drama

2

50/50 (2011)

7.7 283,935

Comedy Drama

3

About Time (2013)

7.8 225,412

Comedy Drama

df.sort_values(by= columnlst, ascending = True)

Returns df sorted by specified column . A new object is produced, the original data frame is unchanged, unless inplace parameter is True.

- 5 -

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download