DSC 201: Data Analysis & Visualization

DSC 201: Data Analysis & Visualization

Data Cleaning

Dr. David Koop

D. Koop, DSC 201, Fall 2017

File Handling in Python

? Use open and f.close():

- f = open('huck-finn.txt', 'r') for line in f: if 'Huckleberry' in line: print(line.strip())

? Open flags indicate whether the file is being read ('r') or written ('w')

? Using a with statement, close can be done automatically:

- with open('huck-finn.txt', 'w') as f: for line in lines: if 'Huckleberry' in line: f.write(line)

? Closing the file is important when writing files!

D. Koop, DSC 201, Fall 2017

2

Comma-separated values (CSV) Format

? Comma is a field separator, newlines denote records

- a,b,c,d,message 1,2,3,4,hello 5,6,7,8,world 9,10,11,12,foo

? May have a header (a,b,c,d,message), but not required

? No type information: we do not know what the columns are (numbers, strings, floating point, etc.)

- Default: just keep everything as a string

- Type inference: Figure out what type to make each column based on what they look like

? What about commas in a value? double quotes

? Can use other delimiters (|, , )

D. Koop, DSC 201, Fall 2017

3

object. Table 6-1 summarizes some of them, though read_csv and read_table are likely the ones you'll use the most.

RTabelea6-d1.iPnargsingDfuancttiaonsinin pPanadans das

Function

Description

read_csv

Load delimited data from a file, URL, or file-like object; use comma as default delimiter

read_table

Load delimited data from a file, URL, or file-like object; use tab ('\t') as default delimiter

read_fwf

Read data in fixed-width column format (i.e., no delimiters)

read_clipboard Version of read_table that reads data from the clipboard; useful for converting tables from web pages

read_excel

Read tabular data from an Excel XLS or XLSX file

read_hdf

Read HDF5 files written by pandas

read_html

Read all tables found in the given HTML document

read_json

Read data from a JSON (JavaScript Object Notation) string representation

read_msgpack Read pandas data encoded using the MessagePack binary format

Fruenacdti_opnickle DReascdriapntiaornbitrary object stored in Python pickle format

read_sas

Read a SAS dataset stored in one of the SAS system's custom storage formats

read_sql

Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame

read_stata

Read a dataset from Stata file format

167

read_feather Read the Feather binary file format

I'll give an overview of the mechanics of these func[tWio.nMs,cKwihnnicehy,aPreytmhoenanfotrtDoactoanAvnearltysis] D. tKeoxotpd,aDtSaCin2t0o1a, FDaallt2a0F1r7ame. The optional arguments for these functions may fall into 4

Types of arguments for readers

? Indexing: choose a column to index the data, get column names from file or user

? Type inference and data conversion: automatic or user-defined ? Datetime parsing: can combine information from multiple columns ? Iterating: deal with very large files ? Unclean Data: skip rows (e.g. comments) or deal with formatted

numbers (e.g. 1,000,345)

D. Koop, DSC 201, Fall 2017

5

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download