Handout 10: Introduction to Pandas Data Structures
CS602 Data-Driven Development, Fall '22, Bentley University

The pandas library provides efficient operations on multidimensional arrays of heterogeneous data with attached row and column labels. It is built on top of NumPy.

The package is imported as:

import pandas
import pandas as pd     # conventional abbreviation, used below

The main data structures are Series and DataFrame.

SERIES

A one-dimensional array of indexed data. The index is not necessarily numeric and may contain duplicate values.

>>> s = pd.Series([7, 12, 3, 45])
>>> s                      # shows the index on the left and the elements on the right
0     7
1    12
2     3
3    45
dtype: int64

>>> type(s)
pandas.core.series.Series

>>> s.values               # the underlying NumPy array
array([ 7, 12,  3, 45], dtype=int64)

>>> type(s.values)
numpy.ndarray

>>> s.index
RangeIndex(start=0, stop=4, step=1)

# Can assign a non-numeric index:
>>> si = pd.Series([210, 31, 40, 15], index=['Porcupine', 'Chipmunk', 'Squirrel', 'Goat'])
>>> si
Porcupine    210
Chipmunk      31
Squirrel      40
Goat          15
dtype: int64

>>> si['Goat']
15

>>> si[0]                  # can also use the numeric (positional) index!
210

>>> d = {'d': 45, 'b': 33, 'c': 12, 'a': 34}
>>> d
{'d': 45, 'b': 33, 'c': 12, 'a': 34}

>>> sd = pd.Series(d)      # Series created from a dictionary
>>> sd
d    45
b    33
c    12
a    34
dtype: int64
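The note above that an index may contain duplicate labels is not illustrated in the handout; a minimal sketch (the series dup is hypothetical):

>>> dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
>>> dup['a']               # selecting a duplicated label returns all matching entries
a    1
a    2
dtype: int64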

DATAFRAME

A two-dimensional array with labeled rows (index) and columns. Each column is a Series.


Sample movie data used in the examples below (contents of tiny.csv):

Title                     Rating  TotalVotes  Genre1     Genre2
12 Years a Slave (2013)   8.1     496,092     Biography  Drama
127 Hours (2010)          7.6     297,075     Adventure  Biography
50/50 (2011)              7.7     283,935     Comedy     Drama
About Time (2013)         7.8     225,412     Comedy     Drama

Add this code to display data frames fully:

pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.width', 1000)

If you run into encoding issues, also add:

import sys
print(sys.stdout.encoding)
sys.stdout.reconfigure(encoding='utf-8')
print(sys.stdout.encoding)

CREATING AND SAVING A DATAFRAME

There are a number of ways to create and initialize a new DataFrame, for example from:
- a file
- Python dictionaries, lists, or tuples
- an existing Series or DataFrame
- NumPy data arrays

In the descriptions below, pd refers to the abbreviated name of the pandas package and df refers to a DataFrame object.

pd.read_csv(filepath, and many other params)
    Load delimited data from a file, URL, or file-like object. Comma is the default delimiter.

df.to_csv(filepath, index, and other params)
    Write the data from the df dataframe out to a comma-separated file. index=True by default and includes the index column; usually set to False.

df.copy()
    Return a duplicate of df.

pd.DataFrame(data, index, columns, dtype, copy)
    Create a new data frame from data, with the specified index, columns, and dtype.

>>> import os
>>> df = pd.read_csv(os.path.join('data', 'tiny.csv'))
>>> print("shape", df.shape)
shape (4, 5)
>>> print(df)
                     Title  Rating TotalVotes     Genre1     Genre2
0  12 Years a Slave (2013)     8.1    496,092  Biography      Drama
1         127 Hours (2010)     7.6    297,075  Adventure  Biography
2             50/50 (2011)     7.7    283,935     Comedy      Drama
3        About Time (2013)     7.8    225,412     Comedy      Drama

>>> df2 = df.copy()
>>> df2.to_csv(os.path.join('data', 'tiny-copy.csv'), index=False)

>>> df.columns
Index(['Title', 'Rating', 'TotalVotes', 'Genre1', 'Genre2'], dtype='object')

>>> df.index
RangeIndex(start=0, stop=4, step=1)

>>> df3 = pd.DataFrame(columns=["courseID", "enrolled", "instructor"])
>>> print(df3)
Empty DataFrame
Columns: [courseID, enrolled, instructor]
Index: []

>>> df3.courseID = ['CS399', 'CS602']
>>> df3.enrolled = [22, 17]
>>> print(df3)
  courseID  enrolled instructor
0    CS399        22        NaN
1    CS602        17        NaN
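Creation from Python dictionaries and NumPy arrays is listed in the table above but not demonstrated; a minimal sketch (the variable names and values here are illustrative only):

>>> import numpy as np
>>> # dictionary keys become column labels; the list values fill the columns
>>> courses = pd.DataFrame({'courseID': ['CS399', 'CS602'], 'enrolled': [22, 17]})
>>> # a NumPy array with explicit row and column labels
>>> grid = pd.DataFrame(np.arange(6).reshape(2, 3), index=['r1', 'r2'], columns=['a', 'b', 'c'])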

INDEXING, SLICING AND SELECTION, LOC AND ILOC

>>> df.Title               # same as df["Title"]
0    12 Years a Slave (2013)
1           127 Hours (2010)
2               50/50 (2011)
3          About Time (2013)
Name: Title, dtype: object

>>> df["Title"][2]
'50/50 (2011)'

df.iloc[rowsInt, colsInt]       integer-based (positional) indexing
df.loc[rowLabels, colLabels]    label-based indexing

>>> df[["Rating", "TotalVotes"]][1:3] >>> df.iloc[ 1:3, 1:3] >>> df.loc[ 1:2, ["Rating", "TotalVotes"]]

Rating TotalVotes

1

7.6 297,075

2

7.7 283,935

Rating TotalVotes

1

7.6 297,075

2

7.7 283,935

Rating TotalVotes

1

7.6 297,075

2

7.7 283,935
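The difference between loc and iloc is easiest to see with a non-numeric index; a minimal sketch, reusing the si Series defined earlier:

>>> si.loc['Chipmunk']              # by label
31
>>> si.iloc[1]                      # by position
31
>>> si.loc['Chipmunk':'Squirrel']   # label slices include both endpoints
Chipmunk    31
Squirrel    40
dtype: int64
>>> si.iloc[1:3]                    # positional slices exclude the stop; same two rows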

ADDING, REMOVING AND RENAMING ROWS/COLUMNS
** Note - the function descriptions below are not complete

df[newcolname] = val
    Adds a new column with newcolname as its title, filled with val.

df.reindex(index=None, columns=None, fill_value=NaN)
    Returns a reindexed DataFrame: conforms the DataFrame to the new index, placing NA/NaN in locations that had no value in the previous index. Can be used to add new rows. A new object is produced; the original frame is unchanged.

df.drop(index=None, columns=None, inplace=False)
    Removes rows or columns by specifying index or column names (index, columns: single label or list-like). A new object is produced and the original data frame is unchanged, unless the inplace parameter is True.

df.rename(index=index-dict, columns=column-dict, inplace=True)
    The index/columns parameter is a dictionary of the form {old: new}. The function replaces the old values in the index/columns with the new. inplace controls whether the original dataframe is changed.

>>> df2 = df.reindex(index=df.index.union([4,5,6]))    # add new empty rows
>>> df2
                     Title  Rating TotalVotes     Genre1     Genre2
0  12 Years a Slave (2013)     8.1    496,092  Biography      Drama
1         127 Hours (2010)     7.6    297,075  Adventure  Biography
2             50/50 (2011)     7.7    283,935     Comedy      Drama
3        About Time (2013)     7.8    225,412     Comedy      Drama
4                      NaN     NaN        NaN        NaN        NaN
5                      NaN     NaN        NaN        NaN        NaN
6                      NaN     NaN        NaN        NaN        NaN

>>> df2.drop(index=[4,5,6], inplace=True)
>>> df2
                     Title  Rating TotalVotes     Genre1     Genre2
0  12 Years a Slave (2013)     8.1    496,092  Biography      Drama
1         127 Hours (2010)     7.6    297,075  Adventure  Biography
2             50/50 (2011)     7.7    283,935     Comedy      Drama
3        About Time (2013)     7.8    225,412     Comedy      Drama

>>> df2.drop(columns=['Genre2'], inplace=True)
>>> df2
                     Title  Rating TotalVotes     Genre1
0  12 Years a Slave (2013)     8.1    496,092  Biography
1         127 Hours (2010)     7.6    297,075  Adventure
2             50/50 (2011)     7.7    283,935     Comedy
3        About Time (2013)     7.8    225,412     Comedy
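rename is described in the table above but not demonstrated; a minimal sketch continuing with the df2 frame (the new names are illustrative only):

>>> df_renamed = df2.rename(columns={'Genre1': 'Genre'}, index={0: 'first'})
>>> df_renamed.columns      # a new frame is returned because inplace defaults to False
Index(['Title', 'Rating', 'TotalVotes', 'Genre'], dtype='object')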

Example 2: suppose we have the following content in weather.xlsx:

Day  Description        High Temp  Low Temp  Precip  Humidity
1    PM Showers         50         51        50      77
2    PM Thundershowers  52         61        90      89
3    Mostly Sunny       45         60        0       55

>>> exdf = pd.read_excel("weather.xlsx")    # pandas also provides pd.read_csv
>>> exdf
   Day        Description  High Temp  Low Temp  Precip  Humidity
0    1         PM Showers         50        51      50        77
1    2  PM Thundershowers         52        61      90        89
2    3       Mostly Sunny         45        60       0        55

>>> exdf.values
array([[1, 'PM Showers ', 50, 51, 50, 77],
       [2, 'PM Thundershowers', 52, 61, 90, 89],
       [3, 'Mostly Sunny', 45, 60, 0, 55]], dtype=object)

>>> type(exdf.values)
numpy.ndarray

>>> exdf.values.shape
(3, 6)

>>> exdf['Precip']
0    50
1    90
2     0
Name: Precip, dtype: int64

>>> exdf['Precip'][0]
50

Practice problem
1. Create the exdf table from the xlsx file as a pandas DataFrame.
2. Change the exdf column titles to all lower case.
3. Change the index (row labels) to include the rest of the week, preserving the existing data.
4. Update the Wind column with values 45.6, 12.5, 3.
5. Rearrange the columns to list the low temp, then the high temp, then the other fields.
6. Remove column 'Wind'.
7. Add a column with the Average Temp, computed as the average of the High and Low temp values.
8. Create a NumPy table with the low and high temperature values; compute the min, max, and median values for each of the two parameters.

FILTERING BY CONDITION, SORTING

Filtering works like the SQL WHERE clause: and is &, or is |, not is ~.

>>> df[df.Rating > 8]      # the single movie rated above 8
                     Title  Rating TotalVotes     Genre1 Genre2
0  12 Years a Slave (2013)     8.1    496,092  Biography  Drama

Boolean operators: ~ (not), & (and), | (or)

>>> df[(df.Rating > 8) | (df.Genre1 == 'Comedy')]    # note: the ()'s are required!!!
                     Title  Rating TotalVotes     Genre1 Genre2
0  12 Years a Slave (2013)     8.1    496,092  Biography  Drama
2             50/50 (2011)     7.7    283,935     Comedy  Drama
3        About Time (2013)     7.8    225,412     Comedy  Drama
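The ~ operator is listed above but not demonstrated; a minimal sketch:

>>> df[~(df.Genre1 == 'Comedy')]    # movies whose first genre is not Comedy
                     Title  Rating TotalVotes     Genre1     Genre2
0  12 Years a Slave (2013)     8.1    496,092  Biography      Drama
1         127 Hours (2010)     7.6    297,075  Adventure  Biography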

df.sort_values(by=columnlst, ascending=True, inplace=False)
    Returns df sorted by the specified column(s). A new object is produced and the original data frame is unchanged, unless the inplace parameter is True.

df.sort_index(ascending=True, inplace=False)
    Returns df sorted by index. A new object is produced and the original data frame is unchanged, unless the inplace parameter is True.

>>> print(df.sort_values('Rating'))
                     Title  Rating TotalVotes     Genre1     Genre2
1         127 Hours (2010)     7.6    297,075  Adventure  Biography
2             50/50 (2011)     7.7    283,935     Comedy      Drama
3        About Time (2013)     7.8    225,412     Comedy      Drama
0  12 Years a Slave (2013)     8.1    496,092  Biography      Drama
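sort_index is described above but not shown; a minimal sketch that sorts by rating in descending order and then restores the original row order:

>>> by_rating = df.sort_values('Rating', ascending=False)   # highest rating first
>>> print(by_rating.sort_index())                           # back to index order 0, 1, 2, 3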

VECTORIZED OPERATIONS, UFUNCS, ALIGNMENT

The vectorized operations from NumPy are applicable to Series and DataFrame, e.g.

>>> import numpy as np    # needed for np.arange below
>>> df2 = pd.DataFrame(np.arange(20,35).reshape(5,3), index=[3,4,5,6,7], columns=list('xyz'))
>>> df2
    x   y   z
3  20  21  22
4  23  24  25
5  26  27  28
6  29  30  31
7  32  33  34

>>> df2.loc[:,'x']*2 + 5
3    45
4    51
5    57
6    63
7    69
Name: x, dtype: int32

Note that in operations involving two structures:

- operations are applied to elements with matching index values (alignment)
- the resulting structure contains the union of all indices

>>> s1 = pd.Series(range(10), [chr(code) for code in range(ord('a'), ord('a')+10)])
>>> s2 = pd.Series([20 for i in range(5)], [chr(code) for code in range(ord('c'), ord('c')+5)])
>>> s3 = s1 + s2
>>> s3
a     NaN
b     NaN
c    22.0
d    23.0
e    24.0
f    25.0
g    26.0
h     NaN
i     NaN
j     NaN
dtype: float64
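To avoid the NaN values produced for unmatched labels, the arithmetic methods (add, sub, mul, div) accept a fill_value parameter; a minimal sketch using s1 and s2 from above:

>>> s4 = s1.add(s2, fill_value=0)    # treat missing positions as 0 before adding
>>> print(s4['a'], s4['c'])
0.0 22.0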

WORKING WITH STRINGS

- String methods are accessed via the Series' str attribute and generally have names matching the analogous built-in string methods, e.g. str.upper() instead of upper().
- Slicing with the [::] syntax is done with the method str.slice(); works on strings AND lists.
- Indexing is done with str.get(); works on strings AND lists.
- These methods exclude missing/NA values automatically.



Method name            Description
cat                    Concatenate strings element-wise with optional delimiter
contains               Return a boolean array indicating whether each string contains the pattern/regex
count                  Count occurrences of pattern
endswith, startswith   Equivalent to x.endswith(pattern) or x.startswith(pattern) for each element
findall                Compute list of all occurrences of pattern/regex for each string
get                    Index into each element (retrieve the i-th element)
join                   Join strings in each element of the Series with the passed separator
len                    Compute the length of each string
lower, upper           Convert cases; equivalent to x.lower() or x.upper() for each element
match                  Use re.match with the passed regular expression on each element, returning matched groups as a list
pad                    Add whitespace to left, right, or both sides of strings
center                 Equivalent to pad(side='both')
repeat                 Duplicate values; for example, s.str.repeat(3) is equivalent to x * 3 for each string
replace                Replace occurrences of pattern/regex with some other string
slice                  Slice each string in the Series
split                  Split strings on delimiter or regular expression
strip, rstrip, lstrip  Trim whitespace, including newlines; equivalent to x.strip() (and rstrip, lstrip, respectively) for each element

df["Year"] = df.Title.str.slice(-5, -1)

df.Title = df.Title.str.slice(0, -6 )

print(df)

Title Rating TotalVotes

0 12 Years a Slave

8.1 496,092

1

127 Hours

7.6 297,075

2

50/50

7.7 283,935

3

About Time

7.8 225,412

Genre1 Biography Adventure

Comedy Comedy

Genre2 Drama

Biography Drama Drama

Year 2013 2010 2011 2013

>>> print(df.Title.str.split().str.get(0))

0

12

1

127

2 50/50

3 About
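contains (listed in the table above) combines naturally with boolean filtering; a minimal sketch using the modified df:

>>> df[df.Title.str.contains('50')]    # rows whose title contains the substring '50'
   Title  Rating TotalVotes  Genre1 Genre2  Year
2  50/50     7.7    283,935  Comedy  Drama  2011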


AGGREGATION

Recall that DataFrame.values produces a NumPy table. Hence, all aggregation methods and statistical functions that work for NumPy can be applied to DataFrame.values. The following Pandas aggregate functions are applicable to Series and DataFrames.

Function   Description
count      Number of non-NA observations
sum        Sum of values
mean       Mean of values
mad        Mean absolute deviation
median     Arithmetic median of values
min        Minimum
max        Maximum
mode       Mode
abs        Absolute value
prod       Product of values
std        Bessel-corrected sample standard deviation
var        Unbiased variance
quantile   Sample quantile (value at %)

>>> print(df.min())
Title         12 Years a Slave (2013)
Rating                            7.6
TotalVotes                    225,412
Genre1                      Adventure
Genre2                      Biography
Rating1                         760.0
dtype: object

>>> print(df.Rating.mean())
7.8
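As noted above, NumPy functions can be applied to the array returned by .values; a minimal sketch restricted to the numeric Rating column (mixed-dtype frames give a dtype=object array):

>>> import numpy as np
>>> ratings = df[['Rating']].values                  # a (4, 1) NumPy array of floats
>>> print(np.min(ratings), np.max(ratings), np.median(ratings))
7.6 8.1 7.75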

Practice problem: based on the data contained in IMDB.csv and IMDB-dataDict.txt, compose Python code using the pandas package for the following tasks.

1. Create a DataFrame containing data from the first columns up to and including Runtime.
2. Find how many movies are in the data set.
3. Select data for those movies that run for less than 2 hours.
4. Find the most expensive and least expensive film, and the mean and standard deviation of the budget.
5. Find the 10 top-rated movies.
6. Find the range of years for the movies listed in the data set.
7. List 10 other questions to answer based on this data and create solutions for these queries.
