Data Formats HDF5 and Parquet files - UH

COSC 6339 Big Data Analytics

Data Formats: HDF5 and Parquet files

Edgar Gabriel Fall 2018

File Formats - Motivation

- Use case: analysis of all flights in the US between 2004 and 2008 using Apache Spark

File Format        csv        json        Hadoop sequence file    parquet
File Size          3.4 GB     12 GB       3.7 GB                  0.55 GB
Processing Time    525 sec    2245 sec    1745 sec                100 sec

Scientific data libraries

- Handle data on a higher level
- Provide additional information typically not available in flat data files (metadata)
  - Size and type of data structure
  - Data format
  - Name
  - Units
- Two widely used libraries available
  - NetCDF
  - HDF-5

HDF-5

- Hierarchical Data Format (HDF), developed since 1988 at NCSA (University of Illinois)
- Has gone through a long history of changes; the most recent version, HDF-5, has been available since 1999
- HDF-5 supports
  - Very large files
  - Parallel I/O interface
  - Fortran, C, Java, Python bindings

HDF-5 dataset

- Multi-dimensional array of basic data elements
- A dataset consists of
  - Header + data
- Header consists of
  - Name
  - Datatype: basic (e.g. H5T_NATIVE_FLOAT) or compound datatypes
  - Dataspace: defines the size and shape of a multidimensional array. Dimensions can be fixed or unlimited.
  - Storage layout: defines how multidimensional arrays are stored in the file. Can be contiguous or chunked.
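The header fields listed above can be inspected from Python. The following is a minimal sketch using h5py, not part of the original slides; the file name, dataset name, shape and chunk size are illustrative assumptions, and the file is kept in memory so nothing is written to disk.

```python
import numpy as np
import h5py

# In-memory HDF5 file: the "core" driver with backing_store=False
# keeps everything in RAM (file name is a hypothetical placeholder).
with h5py.File("header_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset(
        "temperature",           # name (illustrative)
        shape=(3, 8, 4),         # dataspace: current size and shape
        maxshape=(None, 8, 4),   # None marks an unlimited dimension
        dtype="f4",              # basic datatype: 32-bit float
        chunks=(1, 8, 4),        # storage layout: chunked
    )
    dset[...] = np.zeros((3, 8, 4), dtype="f4")

    # Collect the header information described in the list above.
    header = {
        "name": dset.name,
        "dtype": str(dset.dtype),
        "shape": dset.shape,
        "maxshape": dset.maxshape,
        "chunks": dset.chunks,
    }

    # Because the first dimension is unlimited, the dataset can grow.
    dset.resize((5, 8, 4))
    grown_shape = dset.shape

print(header)
print(grown_shape)
```

A dataset created without `chunks` and `maxshape` would use the contiguous layout and a fixed dataspace instead.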

Example of an HDF-5 file

HDF5 "tempseries.h5" {
GROUP "/" {
  GROUP "tempseries" {
    DATASET "height" {
      DATATYPE { "H5T_STD_I32BE" }
      DATASPACE { ARRAY (4) (4) }
      DATA { 0, 50, 100, 150 }
      ATTRIBUTES "units" {
        DATATYPE { "undefined string" }
        DATASPACE { ARRAY (0) (0) }
        DATA { unable to print }
      }
    }
    DATASET "temperature" {
      DATATYPE { "H5T_IEEE_F32BE" }
      DATASPACE { ARRAY (3, 8, 4) (H5S_UNLIMITED, 8, 4) }
      DATA {...}

Storage layout: contiguous vs. chunked

contiguous

 1  2  3  4  5  6  7  8
 9 10 11 12 13 14 15 16
17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32
33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56
57 58 59 60 61 62 63 64

chunked

 1  2  3  4 | 17 18 19 20
 5  6  7  8 | 21 22 23 24
 9 10 11 12 | 25 26 27 28
13 14 15 16 | 29 30 31 32
------------+------------
33 34 35 36 | 49 50 51 52
37 38 39 40 | 53 54 55 56
41 42 43 44 | 57 58 59 60
45 46 47 48 | 61 62 63 64
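The two numberings can be reproduced with a little index arithmetic. The sketch below is an illustration rather than HDF-5 code: it assumes an 8 x 8 array, 4 x 4 chunks, row-major order inside each chunk, and chunks written one after another in row-major order, which is what the numbers in the figure suggest. The function names are hypothetical.

```python
def contiguous_position(row, col, ncols=8):
    """1-based position in the file for row-major contiguous storage."""
    return row * ncols + col + 1


def chunked_position(row, col, ncols=8, chunk=4):
    """1-based position in the file when the array is stored in chunk x chunk tiles."""
    chunks_per_row = ncols // chunk
    # Which chunk the element falls into (chunks are stored one after another).
    chunk_index = (row // chunk) * chunks_per_row + (col // chunk)
    # Position of the element inside its chunk (row-major within the chunk).
    within_chunk = (row % chunk) * chunk + (col % chunk)
    return chunk_index * chunk * chunk + within_chunk + 1


contiguous_grid = [[contiguous_position(r, c) for c in range(8)] for r in range(8)]
chunked_grid = [[chunked_position(r, c) for c in range(8)] for r in range(8)]

for grid_row in chunked_grid:
    print(" ".join(f"{value:2d}" for value in grid_row))
```

The first printed row is `1 2 3 4 17 18 19 20`: the element in row 0, column 4 is the first element of the second chunk, so it sits at position 17 in the file.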

Advantages and disadvantages of chunking

- Accessing rows and columns requires the same number of accesses
- Data can be extended in all dimensions
- Efficient storage of sparse arrays
- Can improve caching
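The first point can be checked by counting accesses for one full row and one full column of an 8 x 8 array. This is a simplified model, not a measurement: it assumes that a contiguous layout needs one read per unbroken run of elements in the file, and that a chunked layout reads every touched 4 x 4 chunk as a whole. The helper names are hypothetical.

```python
def contiguous_runs(cells, ncols=8):
    """Number of separate reads needed in row-major contiguous storage."""
    offsets = sorted(r * ncols + c for r, c in cells)
    if not offsets:
        return 0
    runs = 1
    for previous, current in zip(offsets, offsets[1:]):
        if current != previous + 1:   # gap in the file: a new read starts
            runs += 1
    return runs


def chunks_touched(cells, chunk=4):
    """Number of distinct chunk x chunk tiles that contain the requested cells."""
    return len({(r // chunk, c // chunk) for r, c in cells})


row_cells = [(2, c) for c in range(8)]   # one full row
col_cells = [(r, 2) for r in range(8)]   # one full column

print("contiguous:", contiguous_runs(row_cells), "read(s) per row,",
      contiguous_runs(col_cells), "per column")
print("chunked:   ", chunks_touched(row_cells), "chunk(s) per row,",
      chunks_touched(col_cells), "per column")
```

Under these assumptions the contiguous layout needs 1 read for a row but 8 for a column, while the chunked layout touches 2 chunks in both cases.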

HDF-5 API

- HDF-5 naming convention
  - All API functions start with H5
  - The next character identifies the category of functions
    - H5F: functions handling files
    - H5G: functions handling groups
    - H5D: functions handling datasets
    - H5S: functions handling dataspaces
    - H5A: functions handling attributes
- An HDF-5 group is a collection of datasets
  - Comparable to a directory in a UNIX-like file system
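The directory analogy is easiest to see from Python. The sketch below uses h5py rather than the C API and loosely mirrors the structure of the tempseries example; the comments name the API category each step roughly corresponds to. The file name and the unit string "m" are assumptions made for illustration, and the file is kept in memory.

```python
import numpy as np
import h5py

with h5py.File("groups_demo.h5", "w", driver="core", backing_store=False) as f:  # files (H5F)
    grp = f.create_group("tempseries")                                          # groups (H5G)

    # Datasets (H5D); ">i4" and ">f4" are big-endian 32-bit integer and float.
    height = grp.create_dataset("height", data=np.array([0, 50, 100, 150], dtype=">i4"))
    grp.create_dataset("temperature", shape=(3, 8, 4), maxshape=(None, 8, 4), dtype=">f4")

    # Attributes (H5A); the unit "m" is a hypothetical value.
    height.attrs["units"] = "m"

    # Groups behave like directories: members are listed and addressed by path.
    top_level = list(f.keys())
    members = sorted(grp.keys())
    height_path = height.name
    units = f["/tempseries/height"].attrs["units"]
    heights = f["tempseries/height"][:].tolist()

print(top_level, members, height_path, units, heights)
```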

h5py

- Python interface to the HDF5 binary data format
- Uses NumPy and Python abstractions such as dictionary and NumPy array syntax

Reading and Writing an HDF-5 file using h5py

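import numpy as np
import h5py

# Write a 100 x 20 array of random values to a new HDF5 file
MyData = np.random.random(size=(100, 20))
h5f = h5py.File('data.h5', 'w')
h5f.create_dataset('dataset_1', data=MyData)
h5f.close()

# Reopen the file read-only and load the whole dataset into a NumPy array
h5f = h5py.File('data.h5', 'r')
MyData = h5f['dataset_1'][:]
h5f.close()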
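Reading with `[:]` loads the entire dataset. As a hedged variation on the example, the sketch below uses a `with` block so the file is closed automatically and reads only parts of the dataset with NumPy slicing. The file name and the deterministic test data are assumptions, and the file is kept in memory so the read happens in the same session as the write.

```python
import numpy as np
import h5py

# Deterministic data so that the slices are easy to check (illustrative values).
data = np.arange(2000, dtype="f8").reshape(100, 20)

with h5py.File("slice_demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("dataset_1", data=data)

    names = list(f.keys())       # dictionary-style listing of the file contents
    dset = f["dataset_1"]        # dictionary-style lookup; no data is read yet

    first_rows = dset[:5, :]     # NumPy-style slice: only the first 5 rows
    one_column = dset[:, 3]      # a single column across all rows

print(names, first_rows.shape, one_column.shape)
```

Slices are returned as ordinary NumPy arrays, so they remain usable after the file has been closed.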
