Analysis of unstructured data .wroc.pl

27.10.2017

3_pandas

Analysis of unstructured data

Lecture 3 - Introduction to pandas module (continued)

Janusz Szwabiński

Overview:

- Iteration over data structures
- Sorting
- Working with text data
- Working with missing data
- Grouping of data
- Merge, join and concatenate
- Time series
- Visualization

References:

homepage of the Pandas project: ()

In [1]: %matplotlib inline
        import numpy as np
        import pandas as pd

Iteration over data structures

- the behavior of basic iteration over pandas objects depends on their type
- a Series is regarded as array-like: iteration produces its values
- a DataFrame (and a Panel) follows the dict-like convention of iterating over the "keys" of the object
- in short, basic iteration (for i in object:) produces:
    - Series: values
    - DataFrame: column labels
    - Panel: item labels

In [2]: df = pd.DataFrame({'col1': np.random.randn(3), 'col2': np.random.randn(3)},
                          index=['a', 'b', 'c'])


In [3]: df
Out[3]:
       col1      col2
a -0.320593 -0.205749
b -1.001097  0.730810
c -1.466919  0.784842


In [4]: for col in df:
            print(col)

col1
col2
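The difference between the two conventions can be seen side by side. The following is a minimal sketch with made-up data (the names s, series_items and frame_items are illustrative, not from the lecture):

```python
import pandas as pd

# Hypothetical example data, chosen only for illustration.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]},
                  index=['a', 'b', 'c'])

# Iterating over a Series yields its values (not its index labels).
series_items = [v for v in s]

# Iterating over a DataFrame yields its column labels (not its rows).
frame_items = [c for c in df]

print(series_items)
print(frame_items)
```

To loop over the index labels of a Series instead, iterate over s.index explicitly.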

Pandas also offers some additional methods that support iteration:

- iteritems() - iterate over (key, value) pairs (the same method is available as items(); iteritems() was removed in pandas 2.0)
- iterrows() - iterate over the rows of a DataFrame as (index, Series) pairs
- itertuples() - iterate over the rows as named tuples of the values (much faster than iterrows())

In [5]: df
Out[5]:
       col1      col2
a -0.320593 -0.205749
b -1.001097  0.730810
c -1.466919  0.784842


In [6]: # iteritems() was removed in pandas 2.0; use df.items() there
        for key, val in df.iteritems():
            print("Key: ", key)
            print("Value:")
            print(val)
            print('-'*10)

Key:  col1
Value:
a   -0.320593
b   -1.001097
c   -1.466919
Name: col1, dtype: float64
----------
Key:  col2
Value:
a   -0.205749
b    0.730810
c    0.784842
Name: col2, dtype: float64
----------


In [7]: for ind, ser in df.iterrows():
            # each row becomes a Series whose name is the row's index label
            print("Index: ", ind)
            print("Series:")
            print(ser)
            print('-'*10)

Index:  a
Series:
col1   -0.320593
col2   -0.205749
Name: a, dtype: float64
----------
Index:  b
Series:
col1   -1.001097
col2    0.730810
Name: b, dtype: float64
----------
Index:  c
Series:
col1   -1.466919
col2    0.784842
Name: c, dtype: float64
----------


In [8]: for tup in df.itertuples():
            # each row becomes a named tuple
            print("Value:")
            print(tup)
            print('-'*10)

Value:
Pandas(Index='a', col1=-0.32059302248966487, col2=-0.20574850372838482)
----------
Value:
Pandas(Index='b', col1=-1.0010973282656557, col2=0.73081019284989368)
----------
Value:
Pandas(Index='c', col1=-1.4669194859822687, col2=0.78484197850196502)
----------
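Because itertuples() returns named tuples, the fields can be read by attribute name. The snippet below is a small sketch with made-up values (the dictionary row_sums is a hypothetical helper, not part of the lecture):

```python
import pandas as pd

# Hypothetical data with the same layout as the lecture's DataFrame.
df = pd.DataFrame({'col1': [1.0, 2.0, 3.0], 'col2': [4.0, 5.0, 6.0]},
                  index=['a', 'b', 'c'])

row_sums = {}
for tup in df.itertuples():
    # tup.Index holds the row label; columns are available as attributes.
    row_sums[tup.Index] = tup.col1 + tup.col2

print(row_sums)
```

Attribute access works as long as the column labels are valid Python identifiers; otherwise the fields can still be read by position.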

Warning #1

- iterating through pandas objects is generally slow
- in many cases it is not needed and can be avoided with one of the following approaches:
    - look for a vectorized solution
    - when you have a function that cannot work on the full DataFrame/Series at once, it is better to use apply() instead of iterating over the values
    - if you need to do iterative manipulations on the values but performance is important, consider writing the inner loop using e.g. Cython or Numba
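The first two alternatives can be compared with an explicit loop on a toy problem. This is a minimal sketch assuming a small made-up DataFrame and a simple element-wise product as the task; all three variants compute the same result:

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
df = pd.DataFrame({'col1': [1.0, 2.0, 3.0], 'col2': [4.0, 5.0, 6.0]})

# 1. Explicit iteration over rows (slow for large frames).
loop_result = []
for _, row in df.iterrows():
    loop_result.append(row['col1'] * row['col2'])

# 2. apply() with axis=1 calls the function once per row.
apply_result = df.apply(lambda row: row['col1'] * row['col2'], axis=1)

# 3. Vectorized solution: operate on whole columns at once.
vec_result = df['col1'] * df['col2']

print(loop_result)
print(apply_result.tolist())
print(vec_result.tolist())
```

On frames of realistic size the vectorized form is usually the fastest of the three, since the loop over elements runs in compiled code rather than in the Python interpreter.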

Warning #2

- do not modify something you are iterating over
- usually the iterator returns a copy and not a view, so writing to it will have no effect

In [9]: df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

In [10]: for index, row in df.iterrows():
             row['a'] = 10


In [11]: df
Out[11]:
   a  b
0  1  a
1  2  b
2  3  c
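If the values really have to change, the assignment should go through the DataFrame itself rather than through the row handed out by the iterator. The following sketch shows two possible ways, assuming the same small DataFrame as above (the variable unchanged is introduced here only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

# Writing to the row returned by iterrows() modifies a copy only.
for index, row in df.iterrows():
    row['a'] = 10
unchanged = df['a'].tolist()   # the original column is untouched

# Option 1: vectorized assignment to the whole column.
df['a'] = 10

# Option 2: label-based assignment through df.loc, one row at a time.
for index in df.index:
    df.loc[index, 'a'] = df.loc[index, 'a'] + 1

print(unchanged)
print(df['a'].tolist())
```

The vectorized assignment is preferable whenever it is applicable; the df.loc loop is shown only for cases where a per-row update cannot be avoided.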


Sorting

- there are two kinds of sorting: by index and by value
- since version 0.17.0 of pandas, all sorting methods return a new object by default and do not operate in place
- this behavior can be changed by passing the flag inplace=True
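The difference between the default behavior and inplace=True can be sketched as follows, using a small made-up DataFrame (all variable names are illustrative):

```python
import pandas as pd

# Hypothetical data with a deliberately unsorted index.
df = pd.DataFrame({'col1': [3, 1, 2]}, index=['c', 'a', 'b'])

# Both calls return new objects and leave df as it was.
sorted_by_index = df.sort_index()
sorted_by_value = df.sort_values(by='col1')
original_order = df.index.tolist()

# With inplace=True the DataFrame itself is reordered
# and the method returns None.
df.sort_index(inplace=True)

print(original_order)
print(sorted_by_index.index.tolist())
print(sorted_by_value['col1'].tolist())
print(df.index.tolist())
```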

Sorting by index

In [12]: df = pd.DataFrame({'col1': np.random.randn(3), 'col2': np.random.randn(3)},
                           index=['a', 'b', 'c'])

In [13]: df
Out[13]:
       col1      col2
a -0.617963 -0.239719
b  1.297180  0.406090
c -1.641579  0.737969

In [14]: unsorted_df = df.reindex(index=['c', 'a', 'b'], columns=['col2', 'col1'])
