Data Structures for Time Series from Ocean Moorings:

A Proposal for NVODS Aggregation Servers

1. Introduction

Data from moored oceanographic instrumentation has been made available for access over the Internet by a small number of providers using OPeNDAP (DODS) servers. This data usually consists of time series of such quantities as current velocity, water temperature and salinity, bottom pressure, and travel time from inverted echo sounders, and can include surface meteorological data. Future developments in instrumentation may allow some of the chemical and biological properties of the ocean to be measured by unattended moorings deployed for periods of the order of months. Examples might be oxygen, nutrients, chlorophyll-a, and transmissivity. One of the goals of NVODS is for the user to search multiple OPeNDAP sites in order to retrieve data specific to a given problem or analysis. This means that the user need not be aware of the locations of the data or the specifics of its organization and formatting.

In keeping with the NVODS philosophy, individual providers select data structures and variable and metadata naming conventions according to their local requirements and make them available over the web. The OPeNDAP protocols map these structures to NVODS data structures (sequences, grids, nested structures), but keep the variable and metadata names of the original data files. The majority of ocean time series providers use netcdf files to store their data and use DODS netcdf servers to make the files available. Netcdf is designed to store arrays (termed Grids in NVODS), variables, and their associated attributes (e.g. units, long_name, etc.) in self-describing structures. It can accommodate multiple data arrays, but they usually share the same dimensions. Netcdf does not easily accommodate sequences of complex data structures. For example, the elements of a netcdf array are of one type (float, double, integer, etc.) and, if data, would have the same units. A programming analogy is that netcdf is closer to a language that primarily manipulates arrays, like Fortran, than to one that uses structures extensively, such as C. Because netcdf is extensively used for storing time series data, there are a number of efforts to standardize names for dimensions and attributes (e.g. COARDS; the NetCDF Climate and Forecast (CF) Metadata Conventions) and to suggest required names and units for metadata associated with ocean datasets (e.g. MBARI, the Coriolis Project). Netcdf files are the basis of, or can be input into, some widely used analysis packages (e.g. PMEL’s EPIC, MATLAB).

Despite these standardization efforts, a user attempting to use time series datasets from multiple NVODS sites for input into an analysis would have two main difficulties:

1) Finding the locations (URLs of the files) of the data would be time consuming and difficult because of the lack of easily searchable databases of site contents (metadata).

2) Dimension, variable, attribute, and data structure conventions would not be consistent across sites, even when filtered through OPeNDAP. At present, the user needs to adapt his or her API for each dataset retrieved.

These two points are related because consistent names and databases of metadata will make searching for datasets across multiple sites easier, as well as simplifying their use.

This report is concerned mostly with proposing a structure for aggregated time series datasets, obtained by an aggregation server from individual NVODS sites, which would be returned to the user. Searching for, and requesting, datasets are closely related activities, and it is suggested that constructing a metadata database for locating datasets, consistent with the aggregation data structure and metadata returned to the user, is an important consideration. Such a metadata database would probably be implemented as relational database tables because of their widespread use for organizing large amounts of complex information and the availability of software to perform searches over the web. A data structure is independent of its naming conventions; however, because standard names for attributes, as well as determining which attributes should be present in a dataset, are important to the user, this report makes some suggestions on these topics.

The author of this report initially developed a small number of strawman proposals, which were submitted to an ad hoc group (see Appendix A) drawn mainly from people directly involved in distributing ocean data to users. A two-day workshop was held in Charleston, SC (January 13 and 14, 2004), hosted by NOAA’s Coastal Services Center, at which the proposals and related issues were discussed and modified. This report reflects the conclusions of the workshop and has been reviewed by workshop members. The author, however, is responsible for the final content of the report.

The goals of the workshop, established by the participants, were:

1) The data model will be used by an aggregation server to format output for use by a technical person or scientist to produce products.

2) The data model will express all the relevant information regarding a time series.

3) The aggregation server will convert provided data to the data model. Valid time series data values will not be modified except for scaling.

4) Try to define entities and terminology for single and multiple time series.

Thus, the primary focus is on the user’s requirements, which can be different from those of the data provider. The latter is often concerned with issues such as instrument performance, QA/QC, calibration equations, details of mooring designs, data management of real-time measurements, etc. These may be of peripheral concern to the analyst-user, who would often like to determine easily whether a retrieved time series can be input into a complex multi-time-series analysis method. Thus, knowing that a time series is “clean” (i.e. equally spaced, no gaps, faulty values removed and interpolated, etc.) is often more useful than the details of the processing required to get from the original data extracted from the moored instrument to the “clean” series. It is expected that the user can always return to the provider’s site if questions arise on, for example, the data processing or instrument calibration.

If the data structure proposal is implemented in ocean time series aggregation servers, the ideal would be that they become the first choice for searching and retrieval of this type of data. Standards imposed by aggregation data structures would then have a chance to become established and preferred, which would simplify life for the user. Except for specialist use, the original providers become “hidden” from the user, but they gain in that their data may be more widely used with little change to their present practices.

This report is organized as follows: The basic considerations that were used in constructing the proposals are given in Chapter 2. The characteristics of ocean time series data, restrictions on aggregated data, the present state of OPeNDAP providers, and naming conventions are discussed. In Chapter 3, the proposal is given and discussed, including suggestions for metadata that should be present in the aggregation data structure. Chapter 4 summarizes the report and makes recommendations.

2. Background

2.1. Ocean Mooring Based Measurements

The primary focus of this report is the organization of datasets that result from deployments of moored instruments in the ocean. In this context, a mooring is considered to be any platform that has a nominally fixed location (latitude, longitude) in the ocean. Moorings can be configured with sub-surface or surface flotation (buoys), or be a fixed platform on the bottom. Surface buoys may be equipped with meteorological instrumentation (for wind velocity, air temperature, barometric pressure, etc.). Examples of a conventional sub-surface taut-line deepwater mooring, and of a set of three shelf moorings (bottom tripod, thermistor chain with surface flotation, and a subsurface mooring) designed to make measurements of the water column at one site, are given in Figures 1 and 2, respectively. As indicated in these figures, moorings can carry a variety of instruments, including instruments that acoustically profile through all or part of the water column. Examples of the latter include the Acoustic Doppler Current Profiler (ADCP) and the Inverted Echo Sounder (IES).

A mooring site may be occupied for periods of several weeks to many years. During this time a mooring may be retrieved and redeployed (i.e. serviced) many times, and it is often the case that a redeployment is not at exactly the same position as the previous deployment(s). This may cause the nominal depths of the instruments to change slightly. Instrumentation may also change between deployments for various practical reasons, such as replacement of a failed sensor. Servicing of the moorings creates gaps in the measurements at a site of anywhere between a few hours and a few days. Even though a mooring is anchored to the bottom, the upper flotation can be displaced horizontally (watch circle) and, for sub-surface moorings, vertically (draw down) by current flows and winds. Therefore, the location of an instrument (latitude, longitude, depth) can change with time, though deviations from the nominal location are expected to be small. In some cases these deviations are important to the analyst; however, it is more usual to consider the nominal position (particularly depth) as an adequate measurement of the instrument’s location. For example, a user will often sacrifice a few meters of uncertainty in the depth of an instrument, caused by changes in water depth between deployments, in return for a longer time series to input into an analysis. Institutions handle the identification of multiple deployments of moorings at a site in different ways. Some assign separate IDs for each deployment of the mooring (and instruments). Others assign IDs to the site and instrument locations and indicate the deployment through other attributes of the time series. The latter makes the concatenation of time series from instruments over a number of deployments more straightforward, in that new IDs do not need to be created for the concatenated data.

Moored instruments are generally designed to operate unattended for long periods (~ months to years), and record their data internally. Some moorings have the capability of transmitting the data from the instruments to shore in real-time using satellite, VHF radio or telecommunications networks. The latter are the basis of the Coastal Ocean Observing Programs around the country. The majority of instruments record data at equal time increments (e.g. at 15 minute intervals). However, a few use variable or adaptive time sampling schemes. For example, a wave-tide gauge may measure bottom pressure averaged over a few minutes, every 30 minutes, and every three hours collect a burst of rapidly sampled 1 Hz data. The former would be used for tidal analysis of sea level, and the latter for surface waves. In an adaptive sampling scheme, the instrument may only measure a parameter when certain thresholds are exceeded. For example, in a sediment transport experiment, wave bottom pressure and currents may be only rapidly sampled when the significant wave height exceeds a given value. Therefore, data model structures must be able to accommodate data taken at varying as well as constant time intervals.

Profiling instruments, of which the most common are ADCPs, measure parameters (e.g. current velocity) at multiple depths remote from the device. Thus, the depth of the instrument is not the depth of the measurements, but the parameters in each depth bin are recorded at the same time intervals. ADCPs can be directed up or down, and thus data from bins at increasing distance from the head can be at decreasing or increasing depths, respectively. This can affect the arrangement of the 2-D depth-time array in the time series files of ADCP data, and again, different institutions have different conventions. Some arrange by bin number irrespective of the direction of the head; others arrange by the depths of the measurements, where the positive direction may be up or down.
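As an illustration of the bookkeeping this imposes on an aggregation server, the following minimal sketch (in Python/numpy; the function name, sizes, and depths are invented for this report) reorders a (bin, time) ADCP array so that rows always run from the shallowest to the deepest bin:

import numpy as np

# Sketch: reorder a (bin, time) ADCP array so that row 0 is always the
# shallowest measurement, regardless of head direction or provider
# convention. bin_depths holds the measurement depth of each bin.

def order_by_depth(data, bin_depths):
    order = np.argsort(bin_depths)  # indices of bins, shallowest first
    return data[order, :], np.asarray(bin_depths)[order]

data = np.random.rand(3, 100)      # invented (bin, time) values
bin_depths = [45.0, 15.0, 30.0]    # a deepest-bin-first provider, say
data, bin_depths = order_by_depth(data, bin_depths)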

Most moored instruments measure a limited number of parameters. Examples include temperature, conductivity, current velocity, pressure, meteorological variables, and acoustic variables. In some cases derived variables are reported because they are more useful. For example, a moored CTD instrument measures temperature and conductivity, but usually the temperature and conductivity will be used to give salinity and sigma-t. The travel times from an IES may be converted to dynamic height and temperature/salinity profiles through the use of a complex model (GEM technology). Time series data can also undergo various degrees of processing, from the form given by the instrument to gap-filled, filtered, or averaged versions. It is not uncommon for sites to supply several versions of the data from a single instrument. For example, a time series may be given as the original 10-minute data and in hour-averaged form, usually in separate files.

Because multiple devices are deployed on a mooring, it cannot be assumed that all instruments are recording at the same time intervals or are synchronized in time. The instruments may not have been started at the same instant, and it is possible that one instrument may sample at 20-minute intervals and another at 15- or 30-minute intervals. The latter may arise because of limitations in the storage or battery capacity of a particular instrument.

2.2. DODS Moored Data Sites

At present, the majority of sites that offer time series data through DODS servers use netcdf files as their underlying storage format. Netcdf also seems to be the preferred format for new initiatives involving moored data, such as the Coriolis Project based at IFREMER. Use of netcdf does not necessarily imply that this is the working format of the group supplying the data. In some cases, netcdf is used as an exchange format because of its common use and the development of some standards for dimensions, variable names, and attributes. The netcdf DODS server software is also straightforward to implement. With the recent availability of the relational database DODS server, the use of relational tables to store data and/or metadata may become another approach to making data available through OPeNDAP protocols. We are aware of one case where this is being done (SCWRP), but this site is not accessible by the general public. Therefore, the fundamental storage unit for time series at most sites is the array (grid in DODS nomenclature) rather than nested structures of sequences. As part of this project, a survey was made of the characteristics of moored time series data from sites listed in the DODS catalog. The results are presented in Table 1. This table is not meant to be a comprehensive survey of all DODS moored data sites, but rather an indication of the variety of content, conventions adopted, and formats used by data providers, even though netcdf files are used by all for data storage.

The first and second columns identify the institution providing the data and the file format used by the DODS servers. Note that three of these sites are associated with real-time Ocean Observing Systems (OOS), and the rest serve archived data. Files are usually organized by instrument, in that the data in a particular file originated from a single instrument’s data logger. There may be more than one type of data (e.g. current velocity and temperature from a current meter), but the time (and depth, if an ADCP) dimensions are equal. Meteorological buoys usually keep data from all their sensors in a single file, since the sensors usually share a common data logger. The Coriolis Project proposes to place all instruments on a mooring in a single file organized by depth. Thus, variables are arranged into 2-D (depth, time) arrays. This works best if the instruments on the mooring are all similar (the Coriolis Project is primarily concerned with moored CTD-type instruments) and data sampling across the instruments is synchronized.

Table 1: Characteristics of Moored Data Served by Institutions using NVODS (December, 2003)

|Institution |File Format |Content Organized by |Time Format |Variable Names |CF Standard Names |Units |Scalars |Vectors |ADCPs |Met Buoys |
|USGS-WHOI |Netcdf |Instrument |EPIC time,time2 |EPIC codes |No |COARDS |Yes |E,N,w |Depth array (deepest bin 1st) |No |
|MBARI |Netcdf |Instrument |COARDS (seconds since 1970-01-01 00:00:00) |Some COARDS |No |COARDS |Yes |E,N |Depth array |— |
|WHOI |Netcdf |Instrument |EPIC time,time2 |COARDS |No |COARDS |Yes |E,N,w |Depth array |No |
|TAMU |Netcdf (translation of ASCII files) |Instrument |yy,mm,dd,hr,min,sec arrays |Not standard |No |COARDS |Yes |Spd,Dir & E,N |No |No |
|SeaCOOS |Netcdf |Instrument/Buoy |COARDS (sec since 1995-01-01 00:00:00); date_time string arrays referenced to COARDS time also provided |Some COARDS |Some |COARDS |No |No |No |Yes/NDBC |
|NC-COOS |Netcdf |Instrument |COARDS (days since 0000-1-1 00:00:00, i.e. MATLAB datenum); yy,mm,dd,hr,min,sec arrays referenced to COARDS time also provided |Some COARDS |Some |COARDS |Yes |E,N,w |Depth (z) array |Yes |
|GoMOOS |Netcdf |Instrument |COARDS (days since -4713-01-01 00:00:00 [conversion of EPIC time 0]); mm,dd,hr,min,sec arrays referenced to COARDS time also provided |COARDS |Yes |COARDS |Yes |E,N & Spd,Dir |Depth array |Yes |
|SAIC |Netcdf (translation of database files) |Instrument (max 2 variables/file) |COARDS (minutes since yyyy-mm-dd hh:mn:ss, where yyyy-mm-dd hh:mn:ss is the start time of the series) |COARDS |No |COARDS |Yes |E,N & rotated orthogonal axes |Depth array (shallowest bin 1st) |NDBC & C-MAN |
|NEFSC-NOAA |Netcdf |Instrument |EPIC time,time2 |No |No |COARDS |Yes |E,N |No |No |
|IFREMER-OTS (proposed) |Netcdf |Mooring |COARDS (days since 1950-01-01 00:00:00) |No; EPIC codes |No |COARDS |Yes |No |No |No |

This organization into (depth, time) arrays is probably too restrictive if more general time series were to be combined into a single file. It could result in non-uniform spacing of data points in the common time dimension and, thus, an excessive number of missing data flags.

There are basically two conventions for specifying the time of the data. The older is the PMEL/EPIC specification for time as two integers. The first (time) is the astronomer’s true Julian Day referenced to 00 hours GMT, and the second (time2) is the number of milliseconds from midnight (00 hours GMT). It is noted that using two variables to specify time has some limitations in the DODS Grid format for arrays. Grid maps the array indices to the dimension values if the variable has the same name as the dimension, as specified by COARDS. Thus, salinity(time) maps time(time) but not time2(time), so not all the time information is included in the mapping. The netcdf COARDS time convention uses a single variable, but the reference (t=0) date and the units are arbitrary. Almost everybody conforms to the Unidata UDUNITS specifications for dimension and variable units. The reference date conforms to FGDC specifications for date strings and can contain time zone information; if no time zone is given, GMT is assumed. Thus, a number of different reference dates and time units are used across the COARDS sites. Some sites pick a fixed reference date[1], while others use the start of the time series, which changes from file to file. TAMU uses separate arrays (variables) for the year, month, day, hour, minute, and second, and this kind of specification is provided as secondary information at some sites. Time conversions from one system to another will be an important function for aggregation servers.
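As a sketch of the conversion arithmetic involved (in Python; the reference dates are taken from Table 1, but the function names and example value are invented for this report), translating a COARDS time from one reference date and unit to another is simple date arithmetic:

import datetime as dt

# Sketch: convert a COARDS-style time between reference dates and units.
# GMT is assumed throughout. A production server would parse the UDUNITS
# units string (e.g. "days since 1950-01-01 00:00:00") rather than
# hard-coding epochs, and would need a calendar library for epochs such
# as the year -4713, which Python's datetime cannot represent.

def to_datetime(value, epoch, seconds_per_unit):
    return epoch + dt.timedelta(seconds=value * seconds_per_unit)

def from_datetime(t, epoch, seconds_per_unit):
    return (t - epoch).total_seconds() / seconds_per_unit

epoch_1950 = dt.datetime(1950, 1, 1)  # "days since 1950-01-01 00:00:00"
epoch_1970 = dt.datetime(1970, 1, 1)  # "seconds since 1970-01-01 00:00:00"

t = to_datetime(18263.0, epoch_1950, 86400.0)  # a 1950-based day count
print(from_datetime(t, epoch_1970, 1.0))       # equivalent 1970-based seconds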

The columns under “Variable Conventions” attempt to survey which netcdf conventions sites implement. Variable and dimension names are usually unique to a site’s files. However, dimension (time, depth, latitude, longitude) and equivalent independent variable names often conform to COARDS recommendations, and netcdf standard attributes such as long_name, short_name, units, and _FillValue are nearly always supplied for dependent variables. COARDS has been extended by the Climate and Forecast (CF) metadata conventions, which are designed to standardize netcdf files for the atmosphere, surface, and ocean, with model-generated data particularly in mind. CF does not place any restriction on variable names, but rather standardizes and/or recommends use of attributes of the variables. It has some useful recommendations for the specification of attributes for coordinate variables and data flags. However, perhaps the most useful recommendation is that all variables be identified by a standard_name attribute, which has a precise (string) value. An example is “sea_water_temperature”. Note that standard_name does not override long_name, which is still used to describe the variable.

There are some limitations to the standard names as defined by CF, and one view, expressed at the workshop, was that they were not very ocean-measurement friendly. For example, both “sea_water_temperature” and “sea_surface_temperature” are standard names, whereas oceanographers usually consider “sea_surface_temperature” to be “sea_water_temperature” at depth 0, and a file with temperatures at a number of depths, including the surface, would not make a distinction. Similarly, the standard names defined at present make no provision for current (or wind) vector components that are not on east and north orthogonal axes. Ocean chemistry measurements are also not accommodated. However, these limitations could probably be fixed by input to CF from the ocean-measurement community. Since the CF metadata standards are relatively recent and evolving, most existing sites have not implemented these conventions.

In the PMEL/EPIC data analysis system, variables are assigned integer codes (e.g. depth has an epic_code of 3). A few sites provide these in their netcdf files, and this could be another way of precisely identifying the meaning of variable names, even though the codes are not intuitive to the casual user.

Under the “Units” column, COARDS means that the site specifies units using the conventions established by the Unidata UDUNITS software package; as Table 1 shows, all sites do. The units attribute should be specified for all dimensional variables, as required by COARDS and CF.

The types of moored data provided by DODS sites are scalars (e.g. temperature, pressure, etc.), 2- and 3-D vectors (currents), profiles (ADCP currents), and meteorological measurements (winds, air temperature, humidity, etc.). Table 1 shows the data types available by site. Current vectors may be in component form (east and north, or axes rotated to the direction of the isobaths), as speed and direction, or both. It is noted that direction usually has a different meaning for currents than for winds, and thus vector components may be less ambiguous. Some sites provide the vertical velocity component (w) for ADCP data. All of the ADCP data are provided as (depth, time) arrays, and these are the main occurrences of arrays with more than the single time dimension. Similar to time, the depth coordinate variable has more than one convention. Under COARDS/CF, the (required) attribute positive defines the direction of the depth axis, where 0 is the ocean surface (or, more strictly, the datum for the bathymetry). Most oceanographic providers use the value “down” for positive, and this is often implied if the attribute is not given. Therefore, with this convention, the height of a wind sensor on a meteorological buoy is negative. In another convention, the instrument depth is given as a positive height above the seabed (e.g. for bottom tripod systems). An aggregation server must be able to take these coordinate differences into account.
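As an illustration, a minimal sketch (in Python; the function names and values are invented for this report) of reducing these depth conventions to a common positive-down coordinate measured from the surface:

# Sketch: normalize the depth conventions described above to a common
# positive-down coordinate. water_depth is the water depth at the mooring.

def depth_from_height_above_bottom(height_above_bottom, water_depth):
    # bottom-tripod convention: sensor height above the seabed
    return water_depth - height_above_bottom

def depth_from_height_above_surface(height_above_surface):
    # met-buoy convention: height above the surface becomes negative depth
    return -height_above_surface

print(depth_from_height_above_bottom(1.5, 80.0))  # tripod sensor: 78.5 m
print(depth_from_height_above_surface(4.0))       # anemometer: -4.0 m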

Where meteorological data are provided, some are from the institution’s own buoys, but other institutions provide data obtained from government sources (e.g. from NDBC buoys and C-MAN stations). In a similar manner, but not noted in the table, a few sites also provide sea-level data from NOS tide gauges. The availability of government data from non-government sites points to particular time series (either in part or complete) being available from more than one source. Even though the origins of the data are the same, the time series data from multiple sites may differ in subtle ways. For example, an NDBC wind record may have been gap-filled and filtered at one site but not at another. This indicates that processing/QC flags and source information could be important parameters of the selection mechanism for aggregation of datasets. This is discussed further in Section 3.2.

2.3. User Requirements

The development of aggregation servers, along with search and retrieval software, is directed at the user. There will be different categories of users, ranging from those requiring simple displays of data on a map to those requiring inputs into complex analyses. The point of view of this document is that if the data structures for aggregation servers can fulfill the needs of the working scientist, then the needs of other users can also be accommodated. This does not mean that user interfaces will not differ for different user groups, but that the information returned as part of the aggregation data structure should be complete enough for such multiple interface developments, provided this information is designed to meet the needs of the most sophisticated user (see goals 1 and 2, Chapter 1).

Beyond simple statistics, the majority of time series analysis techniques require that the data be equally spaced in time and have no gaps or missing values. Examples are filtering, spectra, correlations, and principal component analysis (EOFs). The no-missing-value requirement may be relaxed for some methods, such as the least-squares fits used in tidal analysis. The data structures, however, should be able to accommodate all types of time series, but it would be useful to the analyst if there were attributes of the time series that indicated its state (e.g. that the series is equally spaced and has no gaps). This could eliminate the need for many processing steps, such as applying time checks, searching for missing values, and interpolating gaps.

Finding and selecting time series datasets are also crucial parts of use of the NVODS system. The user will need to perform sophisticated complex searches of metadata in order to focus an analysis on appropriate time series. This topic is not strictly within the purview of this report. However, by considering the types of query that an aggregation server may be required to service, and the possible methods that may be used to locate data, some of the requirements for multiple time series data structures are clarified. In particular, a close relationship of the data structure to relational database tables, which could be used to store metadata, could be advantageous to implementations of aggregation servers.

An example of a query that a search engine / aggregation server would need to service is:

Find all the current records that are below 1000-m depth, have durations longer than 6 months, were located in the Gulf of Mexico, and overlap the period January 1, 1990 to December 31, 1993.

Note that this query would potentially retrieve datasets from many different moorings, and therefore, an aggregation data structure should not be restricted to records from a single mooring. This query may be further restricted by requirements on data organization. For example:

Restrict returns to equally spaced data with time steps less than or equal to 1 hour, and that use East and North coordinate axes.

This could reduce the number of individual time series selected; however, there may be duplicates if the data were stored in more than one file on the provider site (i.e. different versions or processing levels) or if more than one provider site has copies of the dataset. Therefore, it is likely that the returns would require ordering criteria. For example:

Order by increasing depth, global data quality, source institution, and instrument manufacturer.

These types of queries could be satisfied if all the metadata of the time series datasets of the providers were catalogued in relational tables. The queries above could be formulated as SQL statements of some complexity, depending on how the tables are organized and the information included in the table columns. Such an arrangement would also allow the user to refine the search by interacting only with the relational tables, rather than making time-consuming queries of individual provider sites. However, if data returns were further restricted by the values in the time series, searching metadata tables alone would not be sufficient. For example:

Restrict returned data to values where the current speed is greater than 60 cm/s.

If such data-value criteria were allowed, then the aggregation server might have to further filter the time series values before sending the data to the user. The returned data would no longer be equally spaced, because possibly only a few (or no) data values in each selected time series would satisfy the criteria.

A possible organization of such a search and retrieval system is sketched in Figure 3. The search engine has been separated from the aggregation server because they are logically distinct operations. The idea is that the search engine would periodically poll the individual provider sites and populate a metadata database that describes each time series on the provider site along with its location. It is expected that this database would be in the form of relational tables. Thus, after initialization, the provider sites would only provide metadata when a site is updated. Providing the metadata to the database will be a complex task, because each individual site will have different conventions, and translation tables will be needed to interpret attributes of the local time series in terms of the relational table column names. However, this need only be done once, as long as the local conventions of the time series files remain unchanged. Potentially, many millions of time series could be catalogued this way. The advantages of using a relational database to organize the metadata for time series from multiple sites are as follows:

1) Searches are local and efficient, and can use standard SQL and already-developed query software. Iterative queries to refine the selection only involve interaction between the user and the database. Searching metadata tables should be far more efficient than searching through the contents of millions of files scattered over a few hundred provider sites.

2) The relational table structures impose discipline on the metadata and can enforce rules.

3) The metadata of requested records could be directly supplied to the Aggregation engine, thus, bypassing the individual sites. The individual sites could then just supply the time series arrays to the aggregation server. The overhead of translating the time series metadata, every time a time series record is delivered, would be saved.

4) A lot of authoritative support information, such as instrument descriptions or filter characteristics, could be stored in the database under well-known keys (IDs). This could encourage a central repository for this type of information, which the provider sites could import and incorporate in their files.

5) Any work generating user-friendly interfaces would only have to be done for the query engine sites and not for the individual provider sites.

Therefore, the concept is that locations of selected time series, along with appropriate standardized metadata, are passed to the aggregation server, which then assembles the metadata and retrieves the time series from the provider sites. The aggregation server applies any needed or requested conversions and scaling, and uses the multiple time series data structure to return data to the user.

3. Data Structure Proposal

3.1. Restrictions

A number of restrictions are proposed for aggregate time series structures. Some have practical consequences in that they will simplify the data stream, and others are more related to the workshop’s philosophy for aggregation structures. They are:

1) Whether a single or multiple time series datasets are returned to the user, the resulting data stream will be a single entity (i.e. a DODS file).

This adds complexity to the data structure, but the alternative is to return each time series dataset as a separate file. Such a set of files may be large, placing more of a burden on the user to sort through the results. If the results are in one compact structure, it should be easier to select or discard individual series. On the other hand, user APIs may need to be modified to accept more complex data streams.

2) If a data structure contains more than one time series variable (e.g. temperature and salinity), the measurements must be co-located in space and time.

The original proposal was that only one type of variable (scalar or vector) would be returned per request, the argument being that if the user wanted another variable, a second request could be made. However, the workshop thought that this was too restrictive, and that the user should be able to request more than one variable at a time. If the restriction is made that all the variables requested are from the same locations (i.e. instruments; there may be many instruments) and have the same time sampling, then a logical structure containing more than one variable can be constructed. Thus, if the input file contained velocity components and temperature from the same current meter, and all the time series arrays had the same dimension size, then they could be retrieved at the same time. If the temperature array was sampled at 20-minute intervals and the current vector components at 10-minute intervals (say), then one of the arrays (scalar or vector) would be excluded from the results. Which one would depend on how the query was constructed.

3) Except for scaling required to make units consistent, the aggregation server would not alter any time series values received from provider sites.

This implies that the aggregation server performs no time or depth averaging or interpolation. This is essentially a data integrity constraint. Therefore, using a fixed (depth, time) array, organized by mooring, for the aggregation structure is too restrictive given the nature of moored time series measurements that might have to be accommodated (see Section 2.1).

4) The aggregation data structure should be adaptable to other types of oceanographic data (e.g. cast (CTD) or Lagrangian float data).

Thus, any adopted data structure should be flexible and, with only small adjustments, be usable for a wide range of space- and time-variable data.

3.2. Aggregation Data Structure Using Arrays

The netcdf model will be used to describe array structures for the reasons given above, even though the DODS grid structure is independent of the file format used for inputs. Thus, the aggregation data structure will have the usual sections of dimensions, global attributes, variables, and variable attributes. Though not specified below, variables have the usual DODS base data types of byte, int32, float64, and string. Note that arrays of strings in netcdf have to be dimensioned with the maximum length of the string, because strings are treated as arrays of single characters (e.g. array_of_strings(number_of_strings, max_length_of_strings)). DODS also indexes arrays with a base of 0 (as in C). The basic idea of the aggregation data structure is that time series data of the same type (e.g. temperature) are packed into a single one-dimensional array, T.

If the number of time series is M, and the number of data values in series i is ni, where i = 0 … M−1, then the packed array is:

T[ 0 … n0−1, n0 … n0+n1−1, n0+n1 … n0+n1+n2−1, … , (n0+ … +nM−2) … (n0+ … +nM−1)−1 ]    (1)

so that series j occupies the consecutive block of nj elements beginning at the sum of the lengths of the preceding series. To extract a single time series from this array, a start_index and a number_of_points are required. For series j:

start_index = n0 + n1 + … + nj−1, the sum of the lengths of the preceding series (zero for j = 0); and stop_index = start_index + nj − 1.    (2)

Note that the array T[…] has a single base data type (e.g. float64), and all of its elements have the same units. Therefore, the aggregation server may need to convert some of the input time series (e.g. from Fahrenheit to Celsius) and provide uniform _FillValues.
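A minimal sketch of equations (1) and (2) (in Python/numpy; the series lengths and data values are invented for this report) shows how a user API would unpack series j from the aggregated array:

import numpy as np

# Sketch of equations (1) and (2): unpack one series from the packed
# 1-D aggregation array T. npts holds the lengths n_i of the M series.

npts = np.array([4, 3, 5])                                 # M = 3 series
start_index = np.concatenate(([0], np.cumsum(npts[:-1])))  # equation (2)

T = np.arange(float(npts.sum()))  # stand-in for the packed data values

def series(j):
    # time series j of the aggregation, per equation (2)
    return T[start_index[j] : start_index[j] + npts[j]]

print(series(1))  # elements n0 ... n0+n1-1 of T, i.e. [4. 5. 6.]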

As a practical matter, the aggregation server will need to use a standard set of variable names (e.g. T for temperature). The COARDS/CF standards place no restrictions on variable names, but rather define their attributes. Variable names differ across provider institutions for the same physical quantity; however, the standard_name CF attribute could provide an identification mechanism, as discussed above. The variable names adopted by the aggregation server should probably be already in fairly general use. The Coriolis Project has proposed a standardized variable naming system for seawater temperature (TEMP), salinity (PSAL), etc.

A basic aggregation data structure is given in CDL format {1}[2].

dimensions: // {2}

series = M ; //number of time series periods corresponding to separate instruments.

time = UNLIMITED ;

// Global Attributes {3}

// information on aggregation data processing, standard COARDS attributes.

:title = “NVODS Aggregation Server for Time Series” ;

:references = ; //URLs of documentation

:source = “ocean observations from moorings” ;

variables: //Reduction in rank possible if arrays are constant {4}

// Series Locations and Descriptions {5}

series(series) ; //time series IDs (constructed by the aggregation server?).

series:coordinates = “depth latitude longitude” ; //CF coordinates attribute {6}

latitude(series) ; //mooring latitude {7}

longitude(series) ; //mooring longitude

depth(series) ; //measurement depth

depth:positive = “down” ; //CF/COARDS vertical

//coordinate attribute.

mooring_ID(series) ; //defined by provider

water_depth(series) ; //water depth of mooring

instrument_ID(series) ; //community definition {8}

instrument_depth(series) ; //optional: used if any of the

//instrument depths differ from the measurement depths (e.g. if an ADCP).

site_URL(series) ; //where time series originated

// Time series parameters {9}

start_time(series) = “yyyy-mm-dd hh:mm:ss” ; //FGDC/COARDS date

//corresponds to start_index.

time_step(series) ; //Only defined for equally spaced data

time_step:units = “minutes”; //defined by user {10}

npts(series) ; //number of data points nj in series j.

start_index(series) ; //as defined by equation (2) above.

// Time series processing flags {11}

equally_spaced(series) = “y or n” ; //May be replaced by a set of community

no_fill_values(series) = “y or n” ; // defined integer flags.

filters_applied(series) ; //community definitions.

// Aggregated variables (names assigned and data appropriately scaled by the server)

T(time) ; //temperature T[…] defined by (1)

T:units = “degrees_C” ; //defined by user

T:_FillValue = -9999 ; //defined by user

T:standard_name = “sea_water_temperature” ; //CF required attribute

// Examples of additional co-located aggregated dependent variables

U(time) ; //East-component of current U[…]

U:units = “cm/s” ; //defined by user

U:standard_name = “eastward_sea_water_velocity” ; //CF attribute

V(time) ; //North component of current V[…]

V:units = “cm/s” ; //defined by user

V:standard_name = “northward_sea_water_velocity” ; //CF attribute

// Independent variables (form defined by user) {12}

time(time) ; //time of all data points in T[…], etc.

time:units = “minutes since 1950-01-01 00:00:00” ; //CF/COARDS

// Data QA/QC flags (optional) {13}

T_qc(time) ; //Temperature QC flags

T_qc:flag_values = 0, 1 ; //community defined (CF attribute)

T_qc:flag_meanings = “good_data corrected_or_interpolated_data” ;

Notes:

1) Data types (e.g. int, float64, etc.) have been left out of this description. Values given to attributes are only for illustration.

2) This section will require character array dimensions for any “arrays_of_strings” in the file. All arrays with the time dimension have exactly the same number of points.

3) Information is supplied in the Global Attributes that apply to the whole of the aggregated data. The examples are a minimal list.

4) Where the elements of an array have constant values (e.g. latitude and longitude, if the file contains data from a single mooring), the arrays can be reduced in rank to become scalars. For conciseness, CF/COARDS attributes such as standard_name, long_name, _FillValue, units, valid_range, etc. have been omitted for most variables. Where these attributes are present, they are included for emphasis because they are considered essential.

5) This is again a minimal list, here of the variables describing the location (x, y, z) and instrument characteristics of each time series in the aggregation.

6) CF recommended attribute to associate a single spatial dimension (the time series number) with the named independent position coordinates.

7) The spatial coordinates (latitude, longitude, and depth) are here considered time independent. If any of these coordinates were time dependent, as discussed in Section 2.1, then they would be defined in the Aggregated independent variables section, and nominal values used in this section. Thus, if depth is time dependent, then it is replaced here by:

nominal_depth(series) ;

and

depth(time) ;

is added to the Aggregated Independent Variables.

8) “Community definition” is meant to suggest that if there were a sanctioned list of metadata associated with a unique ID for a variable, then a potentially large amount of information can be referenced compactly and unambiguously. For example, if instrument characteristics were cataloged with standard metadata, including precision and accuracy, then a single ID could reference all this information that would be helpful for the user’s evaluation of each time series. A suitable attribute would indicate how the ID’s should be used, e.g.

instrument_ID:conventions = “Oceanographic Instrument List v-1.0”

Such conventions must be a community effort that involves instrument manufacturers.

9) “User defined” implies that he or she requested the units and sometimes the format of the variables that are to be returned by the aggregation server.

10) If a time series has equally spaced data points, then the time of any data point can be found knowing the start_time of the series, the time_step, and the data point index (relative to start_index for aggregated series). If all series in the aggregation have equally spaced points, then an independent time variable is not strictly needed; however, it probably should still be provided, because it may be expected by some processing APIs. A small sketch of this reconstruction follows these notes.

11) A set of conventions, developed by the community, would be useful to indicate the processing level of each time series. This is a separate issue from defining data value flags. Defining whether a series is equally spaced and has no missing values would be very useful to the user. Similarly, some standard way of indicating whether filters or time averaging have been applied helps determine how the data can be used.

12) The user sets characteristics of the independent variables. Thus, for time, it might be requested that it be in the form of year, month, day, hour, minute and second, returned either as a string or as separate integer variables. If COARDS times are requested, the user sets the reference date and the units.

13) Most institutions have some kind of integer flags defined to indicate the data quality of measurements. A very simple scheme of two flags (0 and 1) is illustrated here. A more elaborate scheme, using flags 0–9, is proposed for use by the Coriolis project. Again, community sanctioned definitions would be useful. Many providers may use only a subset, and the aggregation server may need to translate meanings. The recommended CF attributes flag_values and flag_meanings can be used to define flags. Note that QC flags are not always available and in some cases would be meaningless (e.g. for filtered time series).
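A minimal sketch of the reconstruction described in note 10 (in Python; the start time and time step are invented for this report):

from datetime import datetime, timedelta

# Sketch of note 10: the time of data point k (counted from start_index)
# of an equally spaced series, from start_time(series) and time_step(series).

start_time = datetime(1995, 6, 1)  # start_time of the series
time_step = 15.0                   # time_step in minutes

def point_time(k):
    return start_time + timedelta(minutes=k * time_step)

print(point_time(4))  # 1995-06-01 01:00:00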

The proposed data structure has the following virtues:

1) The array storage is compact, in that there is a minimum of the wasted space that would result from using 2- or 3-D arrays to contain the multiple time series. DODS structures were examined as a possible storage mechanism, but arrays of structures must have the same dimensions, and this would be violated by multiple time series with differing numbers of points.

2) If series = M = 1, the structure reduces to something very similar to many netcdf time series files for point-measurement instruments. Instruments (e.g. ADCPs) that generate 2-D (depth, time) arrays are not as elegantly accommodated. However, reconstructing a 2-D from a 1-D array using start_index and npts is a straightforward operation if npts is constant (e.g. the FORTRAN 90 reshape function or equivalent; see the sketch following this list).

3) By placing the metadata in arrays indexed by the time series number (series), the organization and use of these data are simplified, and the data structure avoids a proliferation of variable names. The latter would occur if each time series were given a separate structure. Populating these arrays from queries of relational tables of time series metadata should be reasonably straightforward.

4) The data structure can easily be adapted to different data types. For example, making the variables latitude, longitude, and depth aggregated time arrays could accommodate Lagrangian data from multiple RAFOS floats. Making depth the aggregating dimension could accommodate CTD cast data from hydrographic cruises. Thus, in the above structure:

dimensions:

depth = UNLIMITED; //Change time to depth

variables:

series(series) ;

series:coordinates = “time latitude longitude” ; //Change coordinate

//dependence

time(series) ; //add time of cast

//Aggregated variables

T(depth) ; //Temperature array for multiple casts

S(depth) ; //Salinity array for multiple casts

depth(depth) ; //depths of T,S measurements

Other metadata variables would probably need to be defined for this case.
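As noted in item 2 of the list above, when npts is constant across the depth bins of an ADCP, rebuilding the 2-D (depth, time) array from the packed 1-D grid is a single reshape. A minimal sketch (in Python/numpy; the sizes are invented for this report):

import numpy as np

# Sketch for item 2: rebuild a (depth, time) ADCP array from the packed
# 1-D array when all nbins series have the same length npts.

nbins, npts = 20, 1000
packed = np.arange(float(nbins * npts))  # stand-in for the packed 1-D grid

adcp = packed.reshape(nbins, npts)       # row i is depth bin i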

3.3. Relational Tables

Relational databases are used by many institutions to manage large quantities of data. For example, WOCE uses a relational database to catalog and search its data inventories. The use of relational databases to catalog time series metadata was discussed in Section 2.3 from the point of view of searching and retrieval of data. The structure of the tables would serve to supply information for the variables in the aggregation data structure discussed above. Therefore, every time series array that is accessible by an aggregation server would have an entry in the metadata tables. The design of the tables is beyond the scope of this document, but some ideas on the connections between the tables and the aggregation data structure are presented here for future reference.

time_series //table name

id //a unique identifier of form: provider_URL/filename/variable_name

standard_name //identifies the variable and measurement type

long_name //description of variable

ancillary_variables //identifies names of QC flag arrays (e.g. T_qc) in the same file

_FillValue //missing value indicator

units //measurement units

start_time

stop_time

npts //number of data points

time_step //not defined if not equally spaced

data_quality //flags for equally spaced and no missing values

filter_code //Codes (Foreign Key) for processing levels: link to a support table

depths //Measurement depths of the array (for ADCPs)

location_id //Foreign Key to identify entry in location table

location //table name

location_id //unique Key

platform_id //Foreign Key to identify platform type and position (platform table)

instrument_id //Foreign Key to identify instrument characteristics (instrument table)

serial //Instrument serial number

instrument_depth //depth of instrument

platform //table name

platform_id //unique Key

latitude

longitude

water_depth

code //Code describing type of mooring (e.g. subsurface, bottom tripod, etc.)

owner //Information on the institution using or deploying the mooring.

instrument //table name

instrument_id //unique Key

description //e.g. current meter, ADCP, etc. (standard codes)

sensors //e.g. velocity components, temperature (standard codes)

accuracy //accuracy of sensors

precision //precision of sensors

manufacturer

comments

In this set of tables, the id in the time_series table is used to locate the array of measurement data inside a file on a provider’s DODS site. The use of codes to identify mooring types and instrument sensors implies that support tables describing these will be needed. These subsidiary tables are not given here, so as to keep the table structures reasonably simple. There is a one-to-many relationship between the keys: many time series ids will have the same location_id, and many location_ids will have the same platform_id. Even this fairly straightforward table structure, which corresponds quite closely to the aggregation data structure variables and attributes, will require complex translation mechanisms for the contents of each provider’s site in order to populate this database.
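To connect these tables back to the searches of Section 2.3, the example query given there might be expressed roughly as follows (a sketch only: the column types, the joins, the bounding box used for the Gulf of Mexico, and the use of sqlite as a stand-in engine are all assumptions for illustration):

import sqlite3

# Sketch: the Section 2.3 example query over the proposed tables. Table
# and column names follow the text above; everything else is illustrative.

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_series (id TEXT, standard_name TEXT, start_time TEXT,
                          stop_time TEXT, location_id INTEGER);
CREATE TABLE location (location_id INTEGER, platform_id INTEGER,
                       instrument_depth REAL);
CREATE TABLE platform (platform_id INTEGER, latitude REAL, longitude REAL);
""")

query = """
SELECT ts.id
FROM time_series AS ts
JOIN location AS loc ON loc.location_id = ts.location_id
JOIN platform AS p ON p.platform_id = loc.platform_id
WHERE ts.standard_name IN ('eastward_sea_water_velocity',
                           'northward_sea_water_velocity')
  AND loc.instrument_depth > 1000.0
  AND julianday(ts.stop_time) - julianday(ts.start_time) > 182.0
  AND ts.start_time <= '1993-12-31' AND ts.stop_time >= '1990-01-01'
  AND p.latitude BETWEEN 18.0 AND 31.0   -- crude Gulf of Mexico box
  AND p.longitude BETWEEN -98.0 AND -80.0
"""
for (series_id,) in con.execute(query):
    print(series_id)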

4. Summary and Recommendations

A flexible data structure has been proposed for aggregated moored time series data. It is designed to accommodate multiple time series arrays of varying characteristics in a single structure (file) that employs packed 1-D arrays or grids. This document outlines the reasons behind the design of this data structure and should be used as guidance for more detailed design studies. In order to make the structure reasonably compatible with existing NVODS moored data provider sites, netcdf conventions have been adopted, and the structure is designed to use simple arrays or grids that are consistent with netcdf files. Adapting user APIs that already use time series in netcdf or DODS grid format should be straightforward, requiring only the extraction of segments of 1-D arrays.

Community conventions established by COARDS/CF for netcdf attributes should be adopted for the metadata included in the aggregation data structure. Similarly, community initiatives for marine metadata and standard variable names provide guidance for the variables and their attributes that may be included in the data structure. However, only metadata that are directly useful to the end user-analyst should be required. Thus, metadata for the location of a time series are required, but details of instrument calibration may not be. The variables and attributes included in the data structure design (Section 3.2) are considered a minimal required list.

Specific recommendations for implementing the data structure are:

1) Use community established conventions wherever possible.

2) Encourage CF to adopt more standard_names that relate to ocean measurements.

3) Encourage the development of new conventions for describing entities in the data structure. This document has identified the following topics that would benefit from community wide standards:

▪ Standardized instrument descriptions, including sensor accuracy and precision.

▪ Standardized measurement QC flags.

▪ Standardized descriptions of the levels of processing for time series. This includes specifying equal/non-equal time spacing, existence of gaps, and filter codes and descriptions.

4) Implement relational database tables that parallel the structure, for use in cataloging, searching, and standardizing variable names for time series metadata.

Appendix A

Workshop Attendees

-----------------------

[1] The year –4713 corresponds to true Julian day zero.

[2] See notes below
