DSC 201: Data Analysis & Visualization
[Pages:36]DSC 201: Data Analysis & Visualization
Data Cleaning
Dr. David Koop
D. Koop, DSC 201, Fall 2017
File Handling in Python
? Use open and f.close():
- f = open('huck-finn.txt', 'r') for line in f: if 'Huckleberry' in line: print(line.strip())
? Open flags indicate whether the file is being read ('r') or written ('w')
? Using a with statement, close can be done automatically:
- with open('huck-finn.txt', 'w') as f: for line in lines: if 'Huckleberry' in line: f.write(line)
? Closing the file is important when writing files!
D. Koop, DSC 201, Fall 2017
2
Comma-separated values (CSV) Format
? Comma is a field separator, newlines denote records
- a,b,c,d,message 1,2,3,4,hello 5,6,7,8,world 9,10,11,12,foo
? May have a header (a,b,c,d,message), but not required
? No type information: we do not know what the columns are (numbers, strings, floating point, etc.)
- Default: just keep everything as a string
- Type inference: Figure out what type to make each column based on what they look like
? What about commas in a value? double quotes
? Can use other delimiters (|, , )
D. Koop, DSC 201, Fall 2017
3
object. Table 6-1 summarizes some of them, though read_csv and read_table are likely the ones you'll use the most.
RTabelea6-d1.iPnargsingDfuancttiaonsinin pPanadans das
Function
Description
read_csv
Load delimited data from a file, URL, or file-like object; use comma as default delimiter
read_table
Load delimited data from a file, URL, or file-like object; use tab ('\t') as default delimiter
read_fwf
Read data in fixed-width column format (i.e., no delimiters)
read_clipboard Version of read_table that reads data from the clipboard; useful for converting tables from web pages
read_excel
Read tabular data from an Excel XLS or XLSX file
read_hdf
Read HDF5 files written by pandas
read_html
Read all tables found in the given HTML document
read_json
Read data from a JSON (JavaScript Object Notation) string representation
read_msgpack Read pandas data encoded using the MessagePack binary format
Fruenacdti_opnickle DReascdriapntiaornbitrary object stored in Python pickle format
read_sas
Read a SAS dataset stored in one of the SAS system's custom storage formats
read_sql
Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame
read_stata
Read a dataset from Stata file format
167
read_feather Read the Feather binary file format
I'll give an overview of the mechanics of these func[tWio.nMs,cKwihnnicehy,aPreytmhoenanfotrtDoactoanAvnearltysis] D. tKeoxotpd,aDtSaCin2t0o1a, FDaallt2a0F1r7ame. The optional arguments for these functions may fall into 4
Types of arguments for readers
? Indexing: choose a column to index the data, get column names from file or user
? Type inference and data conversion: automatic or user-defined ? Datetime parsing: can combine information from multiple columns ? Iterating: deal with very large files ? Unclean Data: skip rows (e.g. comments) or deal with formatted
numbers (e.g. 1,000,345)
D. Koop, DSC 201, Fall 2017
5
Reading and Writing CSV data with pandas
? Reading: - Basic: df = pd.read_csv(fname) - Use a different delimiter:
? df = pd.read_csv(fname, sep='\t')
- Skip the first few rows:
? df = pd.read_csv(fname, skiprows=3)
? Writing: - Basic: df.to_csv() - Change delimiter with sep kwarg:
? df.to_csv('example.dsv', sep='|')
- Change missing value representation
? df.to_csv('example.dsv', na_rep='NULL')
D. Koop, DSC 201, Fall 2017
6
JavaScript Object Notation (JSON)
? A format for web data
? Looks very similar to python dictionaries and lists
? Example:
- {"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"}, {"name": "Katie", "age": 33, "pet": "Cisco"}] }
? Only contains literals (no variables) but allows null
? Values: strings, arrays, dictionaries, numbers, booleans, or null
- Dictionary keys must be strings
- Quotation marks help differentiate string or numeric values
D. Koop, DSC 201, Fall 2017
7
eXtensible Markup Language (XML)
? Older, self-describing format with nesting
? Each field has tags
? Example:
- 373889 Metro-North Railroad Escalator Avail. 2011 12 Service Indicators M 97.00
? Top element is the root
D. Koop, DSC 201, Fall 2017
8
................
................
In order to avoid copyright disputes, this page is only a partial summary.
To fulfill the demand for quickly locating and searching documents.
It is intelligent file search solution for home and business.
Related searches
- data analysis questions examples
- data analysis research paper example
- data analysis method
- data analysis methods examples
- data analysis methods in research
- types of data analysis methods
- data analysis in research methodology
- data analysis in research pdf
- examples of data analysis paper
- data analysis techniques for research
- data analysis and interpretation pdf
- data analysis tools