Data Formats HDF5 and Parquet files - UH
COSC 6339 Big Data Analytics
Data Formats - HDF5 and Parquet files
Edgar Gabriel Fall 2018
File Formats - Motivation
- Use case: analysis of all flights in the US between 2004 and 2008 using Apache Spark

File Format        csv        json        Hadoop sequence file    parquet
File Size          3.4 GB     12 GB       3.7 GB                  0.55 GB
Processing Time    525 sec    2245 sec    1745 sec                100 sec
Scientific data libraries
- Handle data on a higher level
- Provide additional information typically not available in flat data files (metadata)
  - Size and type of data structure
  - Data format
  - Name
  - Units
- Two widely used libraries available
  - NetCDF
  - HDF-5
HDF-5
- Hierarchical Data Format (HDF), developed since 1988 at NCSA (University of Illinois)
- Has gone through a long history of changes; the recent version, HDF-5, has been available since 1999
- HDF-5 supports
  - Very large files
  - Parallel I/O interface
  - Fortran, C, Java, Python bindings
HDF-5 dataset
- Multi-dimensional array of basic data elements
- A dataset consists of
  - Header + data
- Header consists of
  - Name
  - Datatype: basic (e.g. H5T_NATIVE_FLOAT) or compound datatypes
  - Dataspace: defines the size and shape of a multidimensional array. Dimensions can be fixed or unlimited.
  - Storage layout: defines how multidimensional arrays are stored in the file. Can be contiguous or chunked.
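The header fields listed above can be inspected from Python. The following is a minimal sketch using h5py, assuming an in-memory file (core driver without a backing store) so nothing is written to disk; the file name, dataset name, sizes and chunk shape are hypothetical values chosen for illustration.

```python
import h5py
import numpy as np

# In-memory file: the "core" driver with backing_store=False never touches disk.
with h5py.File("header_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset(
        "temperature",           # name (hypothetical)
        shape=(3, 8, 4),         # dataspace: current size
        maxshape=(None, 8, 4),   # dataspace: None marks an unlimited dimension
        dtype=np.float32,        # datatype: a basic type
        chunks=(1, 8, 4),        # storage layout: chunked
    )
    # Collect the header information while the file is still open.
    header = {
        "name": dset.name,
        "dtype": dset.dtype,
        "shape": dset.shape,
        "maxshape": dset.maxshape,
        "chunks": dset.chunks,
    }
    # The unlimited first dimension can grow after creation.
    dset.resize(5, axis=0)
    grown_shape = dset.shape

for key, value in header.items():
    print(key, "=", value)
print("shape after resize =", grown_shape)
```

Only the first dimension is declared unlimited here, so it is the only one that `resize` may enlarge.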
Example of an HDF-5 file
HDF5 "tempseries.h5" {
GROUP "/" {
  GROUP "tempseries" {
    DATASET "height" {
      DATATYPE { "H5T_STD_I32BE" }
      DATASPACE { ARRAY (4) (4) }
      DATA { 0, 50, 100, 150 }
      ATTRIBUTES "units" {
        DATATYPE { "undefined string" }
        DATASPACE { ARRAY (0) (0) }
        DATA { unable to print }
      }
    }
    DATASET "temperature" {
      DATATYPE { "H5T_IEEE_F32BE" }
      DATASPACE { ARRAY (3, 8, 4) (H5S_UNLIMITED, 8, 4) }
      DATA {...}
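A file with the same structure as this listing could be created roughly as follows. This is a sketch using h5py with an in-memory file, not the way the original file was produced. The value of the "units" attribute is not readable in the listing, so the string "m" below is purely an assumption, and the temperature data is left at its default fill value.

```python
import h5py
import numpy as np

with h5py.File("tempseries.h5", "w", driver="core", backing_store=False) as f:
    grp = f.create_group("tempseries")

    # 32-bit big-endian integers, matching H5T_STD_I32BE in the listing.
    height = grp.create_dataset(
        "height", data=np.array([0, 50, 100, 150]), dtype=">i4"
    )
    height.attrs["units"] = "m"   # ASSUMPTION: the listing does not show the value

    # 32-bit big-endian floats; the first dimension is unlimited (H5S_UNLIMITED).
    temperature = grp.create_dataset(
        "temperature", shape=(3, 8, 4), maxshape=(None, 8, 4), dtype=">f4"
    )

    # Walk the hierarchy and record every group and dataset name.
    names = []
    f.visit(names.append)

    height_values = height[:].tolist()
    height_dtype = height.dtype
    temperature_maxshape = temperature.maxshape

print(sorted(names))
print(height_values, height_dtype, temperature_maxshape)
```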
Storage layout: contiguous vs. chunked
contiguous

   1  2  3  4  5  6  7  8
   9 10 11 12 13 14 15 16
  17 18 19 20 21 22 23 24
  25 26 27 28 29 30 31 32
  33 34 35 36 37 38 39 40
  41 42 43 44 45 46 47 48
  49 50 51 52 53 54 55 56
  57 58 59 60 61 62 63 64

chunked

   1  2  3  4 | 17 18 19 20
   5  6  7  8 | 21 22 23 24
   9 10 11 12 | 25 26 27 28
  13 14 15 16 | 29 30 31 32
  ------------+------------
  33 34 35 36 | 49 50 51 52
  37 38 39 40 | 53 54 55 56
  41 42 43 44 | 57 58 59 60
  45 46 47 48 | 61 62 63 64
Advantages and disadvantages of chunking
- Accessing rows and columns requires the same number of accesses
- Data can be extended into all dimensions
- Efficient storage of sparse arrays
- Can improve caching
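The claim about rows and columns can be checked on the 8 x 8 example with 4 x 4 chunks. The sketch below uses plain NumPy, not HDF-5 itself, and assumes each number in the grids is the element's position in the file. For the contiguous layout it counts separate contiguous file regions; for the chunked layout it counts the chunks that must be read. The helper names are hypothetical.

```python
import numpy as np

N, C = 8, 4  # N x N array stored in C x C chunks, as in the example grids

def contiguous_layout(n=N):
    """File position (1-based) of every element in row-major order."""
    return np.arange(1, n * n + 1).reshape(n, n)

def chunked_layout(n=N, c=C):
    """File position (1-based) of every element when stored chunk by chunk."""
    grid = np.empty((n, n), dtype=int)
    pos = 1
    for ci in range(0, n, c):        # chunk rows, top to bottom
        for cj in range(0, n, c):    # chunk columns, left to right
            grid[ci:ci + c, cj:cj + c] = np.arange(pos, pos + c * c).reshape(c, c)
            pos += c * c
    return grid

def contiguous_runs(positions):
    """Number of separate contiguous file regions covering the positions."""
    p = np.sort(np.asarray(positions))
    return 1 + int(np.count_nonzero(np.diff(p) > 1))

def chunks_touched(positions, c=C):
    """Number of distinct chunks that contain the positions."""
    return len(np.unique((np.asarray(positions) - 1) // (c * c)))

contig = contiguous_layout()
chunked = chunked_layout()

row_runs = contiguous_runs(contig[0, :])     # one row, contiguous layout
col_runs = contiguous_runs(contig[:, 0])     # one column, contiguous layout
row_chunks = chunks_touched(chunked[0, :])   # one row, chunked layout
col_chunks = chunks_touched(chunked[:, 0])   # one column, chunked layout

print("contiguous: row =", row_runs, "accesses, column =", col_runs, "accesses")
print("chunked:    row =", row_chunks, "chunks,   column =", col_chunks, "chunks")
```

In the contiguous layout a row is a single region while a column needs one access per element; with chunks, both touch the same number of chunks.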
HDF-5 API
- HDF-5 naming convention
  - All API functions start with H5
  - The next character identifies the category of functions
    - H5F: functions handling files
    - H5G: functions handling groups
    - H5D: functions handling datasets
    - H5S: functions handling dataspaces
    - H5A: functions handling attributes
- An HDF-5 group is a collection of datasets
  - Comparable to a directory in a UNIX-like file system
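The directory analogy can be illustrated with h5py, the Python binding covered later in these slides. This is a sketch with hypothetical group and dataset names and an in-memory file; the comments indicate which category of the C API each step roughly corresponds to.

```python
import h5py
import numpy as np

with h5py.File("groups_demo.h5", "w", driver="core", backing_store=False) as f:  # files (H5F)
    # Groups nest like directories; missing intermediate groups are created.
    run = f.create_group("experiment/run1")                                # groups (H5G)
    dset = run.create_dataset("values", data=np.arange(6).reshape(2, 3))   # datasets, dataspaces (H5D, H5S)
    dset.attrs["units"] = "K"                                              # attributes (H5A); value is hypothetical

    top_level = list(f.keys())                   # like listing the root directory
    in_run = list(f["experiment/run1"].keys())   # like listing a subdirectory
    full_path = dset.name                        # absolute path of the dataset
    via_path = f["/experiment/run1/values"][...] # access by path, read all data

print(top_level, in_run, full_path)
print(via_path)
```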
h5py
- Python interface to the HDF5 binary data format
- Uses NumPy and Python abstractions such as dictionary and NumPy array syntax
Reading and Writing an HDF-5 file using h5py
    import numpy as np
    import h5py

    # Write a 100 x 20 array of random values to a new file
    MyData = np.random.random(size=(100, 20))
    h5f = h5py.File('data.h5', 'w')
    h5f.create_dataset('dataset_1', data=MyData)
    h5f.close()

    # Read the whole dataset back into a NumPy array
    h5f = h5py.File('data.h5', 'r')
    MyData = h5f['dataset_1'][:]
    h5f.close()
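The dictionary and NumPy array syntax also allows reading only part of a dataset. The following variant is a sketch, assuming a temporary directory for the file and using `with` blocks so the file is closed automatically; the variable names are illustrative.

```python
import os
import tempfile

import h5py
import numpy as np

# Write the file into a temporary directory (an assumption made for this
# sketch so that no existing data.h5 is overwritten).
path = os.path.join(tempfile.mkdtemp(), "data.h5")
my_data = np.random.random(size=(100, 20))

with h5py.File(path, "w") as h5f:
    h5f.create_dataset("dataset_1", data=my_data)

with h5py.File(path, "r") as h5f:
    names = list(h5f.keys())            # dictionary-style listing
    has_dataset = "dataset_1" in h5f    # dictionary-style membership test
    dset = h5f["dataset_1"]             # no data is read yet
    shape = dset.shape
    first_rows = dset[:5, :]            # NumPy-style slice: first five rows only
    one_column = dset[:, 3]             # a single column

print(names, has_dataset, shape)
print(first_rows.shape, one_column.shape)
```

Slicing the dataset object reads only the selected region from the file, whereas `[:]` loads the entire array into memory.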