VRIJE UNIVERSITEIT AMSTERDAM
Multi-Hypothesis Parsing of Tabular Data in Comma-Separated Values (CSV) Files
by Till Roman Döhmen
A thesis submitted in partial fulfillment for the degree of MSc Artificial Intelligence
in the Faculty of Science
Vrije Universiteit Amsterdam
Supervisor: Prof.dr. P.A. Boncz
Second Supervisor: Dr. H.F. Mühleisen
Second Reader: Dr. Rinke Hoekstra
September 2016
Abstract
Tabular data on the web comes in many formats and shapes. Preparing such data for analysis and integration requires manual steps that go beyond simply parsing it: configuring the parser correctly, removing meaningless rows, casting data types, and reshaping the table structure. The goal of this thesis is to develop a robust and modular system that automatically transforms messy CSV data sources into a tidy tabular data structure. The highly diverse corpus of CSV files from the UK open data hub serves as the basis for evaluating the system.
Preface
Going back from working life to studying and research has never felt like a step back but like an important step forward. I deeply appreciate the new experiences I have had and the new insights I have gained over the past two years. This thesis can be regarded not only as the product of seven months of research but also as the product of an entire course of study and practical working experience. I especially want to thank my supervisors Peter Boncz and Hannes Mühleisen, who made it possible for me to write my thesis at such a great place as the CWI and who have always been there to discuss ideas and to guide me in the right direction. Furthermore, I would like to thank Rinke Hoekstra, Benno Willemsem, Sara Magliacane, Alexander Boer, Davide Ceolin and Hadley Wickham for their useful input and discussions. Thank you Mark, Robin, Abe and the other table tennis enthusiasts at CWI - I enjoyed the daily matches a lot! I would also like to thank my roommates in Uilenstede, my friends and my family, who supported me during my studies and graciously respected my busy schedule. I am especially grateful to my parents, Claudia and Roman, who inspired and supported me in so many ways.
Contents
Abstract
Preface
1 Introduction
1.1 Open Government Data
1.2 Outline
2 Background
2.1 Comma-Separated Values
2.1.1 RFC 4180
2.1.2 Dialects
2.1.3 Parsers
2.1.4 Validation Tools
2.2 Data Frames
2.3 Tidy Data
2.4 Open Data Portal data.gov.uk
3 CSV Parsing Issues
3.1 Samples from data.gov.uk
3.2 Categorization of Issues
4 Problem Statement and Research Questions
5 Related Work
5.1 Data Wrangling Tools
5.2 Database Normalization
5.3 Semantic Enrichment of CSV
5.4 Automatic Table Normalization
5.4.1 Applications
5.4.2 Input Data Formats
5.4.3 Background Knowledge
5.4.4 Table Models
5.4.5 Analysis Methods
5.4.6 Evaluation and Limitations