The issue with netCDF data types

TL;DR

If you want to be on the safe side, use only SHORT, INT, FLOAT, DOUBLE, STRING or arrays of CHAR as data types. In particular, don’t use single-byte integers, unsigned integers of any size, or 64-bit integers.

The netCDF data format, which is used in many places throughout our community, has evolved over time. During this evolution, the set of basic data types representable within netCDF has been expanded. Currently we are at netCDF version 4 with HDF5 as a storage backend, which offers a lot more flexibility than previous versions of netCDF, including many more data types. One could assume that this is all fine and we can use these data types freely, but here we are unlucky: there is more to keep in mind.

In order to paint the full picture, we’ll take a little detour and think about what netCDF actually is, or rather what we think (or probably should think) about when talking about netCDF. As we’ll see, this cannot be answered in a straightforward manner, so it is fair to ask whether we should care at all. Finally, we’ll have a look at an experiment which shows plenty of ways in which different data types may fail around netCDF, and which types seem to be fine.

What is netCDF?

NetCDF is more than just the “netCDF4 on HDF5” storage format. This is not only a bold statement, but is actually asked and discussed by the netCDF developers themselves (What is netCDF?, Jan 2020). This chapter is not exactly about the same topic as the referenced issue, but it has similar roots. The relevant part for this discussion is that there is at least netCDF, the data model, and netCDF, the persistence format. It is also recognized that there are multiple persistence formats (netCDF3, netCDF4/HDF5, zarr) for the same data model, and there are translation layers such as OPeNDAP which can be used as a transport mechanism between the computer on which data is stored and the computer on which it is analyzed. In the following, we look at the relevance of the netCDF data model and persistence formats for the EUREC4A datasets.

The data model(s)

There are actually a few different ideas of data models around, which share some general ideas. Maybe the most illustrative way to show this general idea is an extended version of xarray’s logo, but the underlying ideas are of course applicable independently of xarray or even Python.

dataset diagram

Fig. 2 Dataset diagram (From the xarray documentation)

A dataset is a collection of variables and dimensions. Variables are multi-dimensional arrays and they can share dimensions (e.g. to express that temperature and precipitation are defined at the same locations, the three axes of the corresponding arrays should be identified by the same dimension labels, say x, y and z). Some variables may be promoted to coordinates, which does not require storing them differently, but it changes the interpretation of the data. In the example above, latitude and longitude would likely be identified as coordinates. One would typically use coordinate values to index into data or to label the ticks on a plot axis, while a normal variable would usually be used to define a line or a color within a plot. There may be variables which should be coordinates only for a specific use case and vice versa. It is useful to distinguish between variables and coordinates within the metadata of a dataset, but the representation of the actual data may be the same for both. In addition to coordinates, there are more pieces of valuable additional information (think of units, coordinate reference systems, author information etc.). Thus we usually want to attach those extra bits to datasets and variables, which is done by the use of attributes.
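This structure can be sketched directly with xarray. The variable names, shapes and attributes below are made up for illustration; only the general layout (shared dimensions, coordinates, attributes) matters:

```python
import numpy as np
import xarray as xr

# Hypothetical example data: temperature and precipitation share the
# dimensions x, y and t; latitude and longitude are promoted to coordinates.
ds = xr.Dataset(
    data_vars={
        "temperature": (("x", "y", "t"), np.zeros((2, 3, 4)), {"units": "K"}),
        "precipitation": (("x", "y", "t"), np.zeros((2, 3, 4)), {"units": "mm"}),
    },
    coords={
        "latitude": (("x", "y"), np.zeros((2, 3))),
        "longitude": (("x", "y"), np.zeros((2, 3))),
    },
    attrs={"author": "example"},
)
print(ds)
```

Because both variables reference the same dimension names, xarray (and any netCDF reader) knows that they are defined on the same grid.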

It turns out that many datasets in the EUREC4A context can be represented very well in this model. I assume that in many cases when we say that a dataset is available in netCDF, what we really care about is that the dataset is organized along this general structure and thus can be accessed as if it were netCDF.

The storage and transport formats

netCDF

In order to put the high-level idea from above into a data structure which can be stored on disk, netCDF defines two low-level data models: the Classic data model came first and the Enhanced data model evolved from it. Both models care about variables, dimensions and attributes as introduced above.

The classic model knows about CHAR, BYTE, SHORT, INT, FLOAT and DOUBLE data types which are all signed. There may be at most one unlimited (i.e. resizeable) dimension per variable.

The enhanced model is based on the classic model and adds INT64 and STRING to the mix as well as unsigned variants of all integral types. Further additions of the enhanced model are groups (which are like a dataset within a dataset) and user defined types. There may be any number of unlimited dimensions per variable.

Those two data models are closely tied to the netCDF3 and netCDF4 storage formats.

  • netCDF3 is based on the classic data model and defines a storage format which is custom to netCDF.

  • netCDF4 is based on the enhanced data model and uses HDF5 as its backend storage format. HDF5 enables further enhancements like internal data compression.

Both storage formats store all data in a single file; chunking of the dataset is handled internally. If the dataset is to be accessed from a remote location, usually the entire file must be fetched before accessing the data.

In Jan 2019, HTTP byte-range support for partial reads from remote datasets was added to netCDF-c, but as stated in the pull request, it is not meant for production use:

Note that this is not intended as a true production capability because, as is known, this kind of access can be quite slow. In addition, the byte-range IO drivers do not currently do any sort of optimization or caching.

Thus, if a dataset is to be accessed efficiently from a remote location, something else must be added, which is what OPeNDAP is for.

OPeNDAP

OPeNDAP is a network protocol for accessing scientific data. In particular, it uses HTTP as a transport mechanism and defines a way to request subsets of a dataset which is organized along the data model described above. OPeNDAP is thus the go-to method if an existing netCDF dataset should be made available remotely. In the context of EUREC4A, most datacenters provide access to uploaded datasets via OPeNDAP.

OPeNDAP is specified in two versions, namely version 2 and version 4. However, DAP4 has been a draft since 2004, and the only widely supported version of OPeNDAP is version 2.

The data model of OPeNDAP v2 is slightly different from the data model in netCDF. Regarding data types, OPeNDAP v2 defines Byte, Int16, UInt16, Int32, UInt32, Float32, Float64, String, URL.

As opposed to the BYTE of netCDF Classic, the Byte in OPeNDAP is unsigned. This makes the byte types of netCDF Classic and OPeNDAP incompatible, but as noted in the summary, there exists a hack which tries to circumvent that. One could assume that the UBYTE of netCDF Enhanced would map onto this Byte, but there are issues there as well, as confirmed in the experiment.
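The incompatibility is easy to illustrate with numpy: the same 8 bits mean different values depending on which convention a reader assumes.

```python
import numpy as np

# The same single byte, interpreted with the two different conventions:
raw = np.array([200], dtype=np.uint8)  # OPeNDAP-style unsigned Byte
as_signed = raw.view(np.int8)          # netCDF-Classic-style signed BYTE

print(raw[0], as_signed[0])  # 200 vs. -56
```

Any value above 127 stored on one side thus arrives as a negative number on the other unless some extra fix-up is applied.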

Furthermore, there is no CHAR, so whenever a netCDF dataset containing text as a sequence of CHAR is encountered by an OPeNDAP server, the text will be converted to an OPeNDAP String. A consequence is that CHARs and STRINGs cannot be distinguished when transferred over OPeNDAP and thus may be converted into each other after one cycle through netCDF → OPeNDAP → netCDF. This behaviour is not (yet) studied in this text.
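The two ways of storing the same text can be sketched with numpy (a plain illustration, not the netCDF wire format): a CHAR variable is an array of single bytes, a STRING variable is one variable-length element. Since both carry identical text, a server has no way to tell which one the original file used.

```python
import numpy as np

# CHAR-style: one byte per array element (this is how netCDF stores text
# as an array of CHAR).
as_chars = np.frombuffer(b"EUREC4A", dtype="S1")

# STRING-style: a single variable-length string element.
as_string = np.array(["EUREC4A"], dtype=object)

print(b"".join(as_chars).decode())  # EUREC4A
print(as_string[0])                 # EUREC4A
```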

zarr

to do

zarr should be discussed in this place.

Why should we care?

Each of the formats has distinct advantages and disadvantages:

  • netCDF is a compact format: the whole dataset can be stored in a single file which can be transferred from one place to another, and everything is contained. This is great for storing data and for sending around smaller datasets, but retrieving only a subset of a dataset from a remote server is not possible. Libraries which can open netCDF files usually do not handle HTTP, so the dataset needs to be stored in a local directory to use it.

  • OPeNDAP is a network protocol based on HTTP which has been made exactly for the purpose of requesting subsets of a dataset from a remote server. As it is only a network protocol, it cannot be used to store data, so every dataset which should be provided via OPeNDAP must be stored in another format (like netCDF) or computed on the fly. Support for reading OPeNDAP is built into netCDF-c.

  • zarr is a storage format which is spread out into several files in a defined directory structure. This makes it a little harder to pass around, but this chunked structure makes remote access exceptionally simple and fast. Chelle Gentemann has prepared an impressive demonstration on this topic.

So, depending on the use case, different formats are optimal and none of them supports everything:

| format  | for storage | sending around | for remote access         | widely supported              |
|---------|-------------|----------------|---------------------------|-------------------------------|
| netCDF  | ✅          | ✅             | ❌, works via OPeNDAP     | ✅                            |
| OPeNDAP | ❌          | ❌             | ✅, can share netCDF      | ✅                            |
| zarr    | ✅          | moderate       | ✅, with high performance | ❌, netCDF-c is working on it |

Thus, if we want to have datasets which can be used locally as well as remotely (some might call it “in the cloud”), we should take care that our datasets are convertible between those formats so that we can adapt to the specific use case.

An experiment

Currently, the main use case for the Aeris data server is to publish netCDF files and to retrieve them, either via direct download or via OPeNDAP. This experiment is designed to test under which circumstances data which has been created as netCDF can be retrieved correctly via OPeNDAP. In order to minimize additional problems due to a mix of incompatible libraries, all dataset decoding is done using the same netCDF-c library (independent of whether the dataset is accessed directly or via OPeNDAP).

The individual test cases exercise differences between data types, the use of negative numbers, values interpreted as numeric data, and values interpreted as flags. For each data type, there is a drop-down menu which shows the individual sub-cases. If any one of the sub-cases fails, the data type is marked as erroneous.

In each case the test follows the same steps:

  • The original dataset is dumped via ncdump

  • The OPeNDAP dataset attribute structure (.das) is retrieved via curl to look at the raw response of the server

  • The dataset as received via OPeNDAP is dumped via ncdump

  • A little script checks for significant differences between the original and the OPeNDAP dataset
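The last step can be sketched as follows. This is a hypothetical stand-in for the checking script (the function name, the dictionary layout and the example values are made up), but it shows the kind of comparison involved: values and attributes of every variable must survive the round trip unchanged.

```python
import numpy as np

def significant_differences(original, via_opendap):
    """Compare two datasets given as {name: (values, attrs)} mappings."""
    problems = []
    for name, (values, attrs) in original.items():
        if name not in via_opendap:
            problems.append(f"{name}: missing after OPeNDAP")
            continue
        dap_values, dap_attrs = via_opendap[name]
        if not np.array_equal(np.asarray(values), np.asarray(dap_values)):
            problems.append(f"{name}: values differ")
        if attrs != dap_attrs:
            problems.append(f"{name}: attributes differ")
    return problems

# A healthy round trip reports nothing; a broken one (wrong value plus a
# spurious _FillValue attribute, as seen in the experiment) is flagged.
original = {"flags": (np.array([1, 2, 3], dtype=np.int16), {"units": "1"})}
broken = {"flags": (np.array([1, 2, -3], dtype=np.int16),
                    {"units": "1", "_FillValue": -1})}
print(significant_differences(original, original))  # []
print(significant_differences(original, broken))
```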

Results

Summary

The result of this experiment is underwhelming but shows a clear picture. A large set of data types cannot be used reliably in a combination of netCDF and OPeNDAP. The failure modes, however, differ significantly between the various types:

  • 64-bit integers simply cannot be encoded in XDR, the binary encoding of OPeNDAP. Thus the server fails and the user receives an error.

  • unsigned short and unsigned int could be represented by OPeNDAP in principle; however, the server introduces spurious _FillValues which are additionally of the wrong (signed) integral type, such that the netCDF client library refuses to read the data. This is most likely a bug in the specific server.

  • unsigned bytes could in principle be represented by OPeNDAP, but the values received by the client are wrong. This might be an attempt by the client to somehow handle the erroneously delivered valid_range. The big problem here is that this can result in wrong data without any error message.

  • signed bytes can sometimes work even though they are not representable in OPeNDAP. This is due to a hack introduced into netCDF-c, based on an additional _Unsigned attribute which is created automatically by the server. The hack, however, only applies to the data values and not to the attributes. As a consequence, the data type of attributes may change depending on the values stored in them. In particular, signed bytes seem to work only if all values are positive. The behaviour is really weird, so one should probably not count on it.
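What the client-side hack amounts to can be sketched with numpy. This is an assumption about its net effect, not the actual netCDF-c code: data arriving as signed bytes is reinterpreted as unsigned when an _Unsigned = "true" attribute is present, while attribute values are left as they arrived.

```python
import numpy as np

# A value of 200 on the server arrives on the wire as the signed byte -56;
# the server also tags the variable with _Unsigned = "true".
received_data = np.array([-56], dtype=np.int8)
received_attrs = {"_Unsigned": "true", "valid_max": np.int8(-56)}

# The client-side hack: reinterpret the *data* as unsigned...
if received_attrs.get("_Unsigned") == "true":
    data = received_data.view(np.uint8)
else:
    data = received_data

print(data[0])                      # 200 -- the data is recovered
print(received_attrs["valid_max"])  # -56 -- the attribute is not fixed up
```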

As a consequence, the only numeric data types which should be used in any dataset are SHORT, INT, FLOAT and DOUBLE. STRING and CHAR (when used as text) seem to be ok, but they have not yet been investigated in this setting.

Did we learn something?

Well, it depends. In principle, these results are already present, but well hidden, inside the NetCDF Users Guide. The Best Practices state:

  • Do not use char type variables for numeric data, use byte type variables instead.

[…]

NetCDF-3 does not have unsigned integer primitive types.

  • To be completely safe with unknown readers, widen the data type, or use floating point.

This excludes the use of CHAR for numeric applications and effectively also BYTE, UBYTE.
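Widening is straightforward in practice; a sketch of what "widen the data type, or use floating point" looks like with numpy (the example values are made up):

```python
import numpy as np

# Widening problematic types into the safe set (SHORT, INT, FLOAT, DOUBLE):
flags = np.array([0, 1, 255], dtype=np.uint8)
flags_wide = flags.astype(np.int16)      # every uint8 value fits in a SHORT

counts = np.array([4_000_000_000], dtype=np.uint32)
counts_wide = counts.astype(np.float64)  # too big for a signed INT, but exact
                                         # as DOUBLE (exact up to 2**53)

print(flags_wide.dtype, counts_wide[0])
```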

And in the data model section we can find:

For maximum interoperability with existing code, new data should be created with the Classic Model.

This excludes the use of 64 bit integers and unsigned types.

Thus, the take-home message of this article is not new; everything could have been found in the NetCDF Users Guide. However, at least I was not fully aware of all of these implications and only found the references into the Users Guide while preparing this article. I think that if we want to provide datasets which are usable by everyone, the insights of this article are important to share.