VRIJE UNIVERSITEIT AMSTERDAM

Multi-Hypothesis Parsing of Tabular Data in Comma-Separated Values (CSV) Files

by Till Roman Döhmen

A thesis submitted in partial fulfillment for the degree of MSc Artificial Intelligence

in the Faculty of Science
Vrije Universiteit Amsterdam

Supervisor: Prof.dr. P.A. Boncz
Second Supervisor: Dr. H.F. Mühleisen
Second Reader: Dr. Rinke Hoekstra

September 2016

VRIJE UNIVERSITEIT AMSTERDAM

Abstract

Faculty of Science Vrije Universiteit Amsterdam

MSc Artificial Intelligence

by Till Roman Döhmen

Tabular data on the web comes in various formats and shapes. Preparing such data for analysis and integration requires manual steps that go beyond simply parsing it. These steps include configuring the parser correctly, removing meaningless rows, casting data types, and reshaping the table structure. The goal of this thesis is to develop a robust and modular system that automatically transforms messy CSV data sources into a tidy tabular data structure. The highly diverse corpus of CSV files from the UK open data hub serves as the basis for evaluating the system.

Preface

Going back from working life to studying and research has never felt like a step back but like an important step forward. I deeply appreciate the new experiences I have had and the new insights I have gained over the past two years. This thesis can be regarded not only as the product of seven months of research but also as the product of an entire course of study and practical working experience. I especially want to thank my supervisors Peter Boncz and Hannes Mühleisen, who made it possible for me to write my thesis at such a great place as the CWI and who have always been there to discuss ideas and to guide me in the right direction. Furthermore, I would like to thank Rinke Hoekstra, Benno Willemsem, Sara Magliacane, Alexander Boer, Davide Ceolin and Hadley Wickham for their useful input and discussions. Thank you Mark, Robin, Abe and the other table tennis enthusiasts at CWI - I enjoyed the daily matches a lot! I would also like to thank my roommates in Uilenstede, my friends and my family, who supported me during my studies and graciously respected my busy schedule. I am especially grateful to my parents, Claudia and Roman, who inspired and supported me in so many ways.

Contents

Abstract

Preface

Contents

1 Introduction
1.1 Open Government Data
1.2 Outline

2 Background
2.1 Comma-Separated Values
2.1.1 RFC 4180
2.1.2 Dialects
2.1.3 Parsers
2.1.4 Validation Tools
2.2 Data Frames
2.3 Tidy Data
2.4 Open Data Portal .uk

3 CSV Parsing Issues
3.1 Samples from .uk
3.2 Categorization of Issues

4 Problem Statement and Research Questions

5 Related Work
5.1 Data Wrangling Tools
5.2 Database Normalization
5.3 Semantic Enrichment of CSV
5.4 Automatic Table Normalization
5.4.1 Applications
5.4.2 Input Data Formats
5.4.3 Background Knowledge
5.4.4 Table Models
5.4.5 Analysis Methods
5.4.6 Evaluation and Limitations
